Can you guess the language?

1. Can you guess the language?

Often in real applications not all documents carrying sentiment will be in English. We might want to detect what language is used or build specific features related to language.

2. Language of a string in Python

In Python, there are a few libraries that can detect the language of a string. In this course, we will use langdetect because it is one of the better-performing packages, but you can follow the same structure with another package. We first import the detect_langs function from the langdetect package. Now imagine we have a string called foreign, which is a sentence in another language. Our goal is to identify its language. We apply the detect_langs function to our string. This function returns a list. Each item of the list pairs a language with a number saying how likely it is that the string is in that language. In this case, we observe only one item in the list, namely Spanish. That's because the function is fairly certain the language is Spanish. In other cases we might get longer lists, where the most likely candidate languages appear first, followed by less likely ones.

3. Language of a column

In real applications, we usually work not with a single string but with many strings, often contained in a column of a dataset. A common problem is to detect the language of each string and capture the most likely language in a new column. How do we do that? We again start by importing the detect_langs function from the langdetect package. Then we import our familiar dataset with product reviews.

4. Building a feature for the language

The steps we follow next are quite similar to our approach when capturing the length of a review. First, we create an empty list called languages. We then iterate over the rows of our dataset using a for loop. In the first line of the loop, len() applied to the dataset returns the number of rows, and range() turns that number into the sequence of row indices we iterate over. In the second line of the loop, we apply the detect_langs function to the review column of the dataset, which is the second column in our case, selecting one row at a time. We want to store each detection result as an item in a list, so we append the result of detect_langs to the list languages. When we print languages, we see that it is a list of lists, where each inner list contains the detected language of the respective row and how likely that language is. In some cases, the inner lists contain more than one item.

5. Building a feature for the language

We have one more step before we create our language feature. We saw that languages is a list of lists. We want to extract the first element of each list within languages, since the first item is always the most likely language. One fast way to do that is with a list comprehension. Let's break down the command in steps. For example, let's take the first element of languages, convert it to a string, and split it on the colon. After that, we extract the first element of the resulting split, which gives '[es'. Finally, since there is a left bracket before the language code, we select everything from the second character onwards, resulting in 'es' for Spanish.
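The three steps can be traced on a single value. In this sketch, first is an illustrative string standing in for what str(languages[0]) might look like for a Spanish review; the probability shown is invented for the example.

```python
# Illustrative stand-in for str(languages[0]); the number is made up.
first = "[es:0.9999960326]"

# Step 1: split on the colon, separating the language part from the probability.
parts = first.split(":")

# Step 2: keep the first piece, which still carries the opening bracket: '[es'.
with_bracket = parts[0]

# Step 3: drop the leading bracket by slicing from the second character onwards.
language_code = with_bracket[1:]

print(language_code)  # es
```

With real langdetect output, each candidate object also exposes the language code directly through its lang attribute, which avoids string handling altogether; the string-based steps here simply mirror the narration.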

6. Building a feature for the language

To write the list comprehension, we put these steps together by iterating over each item in our list of lists. Lastly, we assign the cleaned list to a new feature, called language.
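Put together, the comprehension and the final assignment could be sketched as follows. To keep the example deterministic, languages holds illustrative strings in place of real detect_langs results, and the reviews DataFrame with its column names is a hypothetical stand-in for the course dataset.

```python
import pandas as pd

# Hypothetical stand-in for the reviews dataset.
reviews = pd.DataFrame({
    "score": [1, 0],
    "review": [
        "Este libro ha sido muy interesante y lo recomiendo a todos.",
        "This product stopped working after only two weeks of use.",
    ],
})

# Illustrative stand-ins for the per-row detection results (probabilities are made up).
# With real detect_langs output, str() turns each inner list into text of this shape.
languages = ["[es:0.9999960326]", "[en:0.9999971140]"]

# For every item: convert to string, split on the colon, keep the first piece,
# and drop the leading bracket.
cleaned = [str(lang).split(":")[0][1:] for lang in languages]

# Store the cleaned codes as a new column.
reviews["language"] = cleaned
print(reviews)
```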

7. Let's practice!

I know this is a lot of code, but the exercises will help you digest it.