
Language detection of product reviews

You will practice language detection on a small dataset called non_english_reviews, a sample of non-English reviews drawn from Amazon product reviews.

You will iterate over the rows of the dataset, detect the language of each review, and append the result to an initially empty list. The list then needs to be cleaned so that it contains only the language code of each review, such as 'en' for English, instead of the raw output, which pairs each code with a probability (for example en:0.9987654) and is wrapped in a list. Remember that the language detection function may return more than one candidate language; the first item in the returned list is the most likely one. Finally, you will assign the cleaned list to a new column.
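The cleaning step described above can be sketched without running the detector at all. The strings below are made-up stand-ins for the text form of detection results, not real output, and first_language_code is a hypothetical helper name used only for illustration.

```python
# Illustrative stand-ins for the text form of detection results.
# These values are invented for the example, not real detector output.
raw_results = [
    "[es:0.9999961]",
    "[fr:0.8571412, it:0.1428571]",
    "[de:0.9999978]",
]


def first_language_code(result_text):
    """Return the language code of the first (most likely) candidate."""
    # Everything before the first ':' is '[' followed by the language code.
    head = result_text.split(":")[0]
    # Drop the leading '[' left over from the list's string form.
    return head[1:]


codes = [first_language_code(text) for text in raw_results]
print(codes)  # ['es', 'fr', 'de']
```

Splitting on the first ':' is enough because the most likely candidate comes first, so any later candidates are discarded along with the probability. As far as the langdetect package goes, each item returned by detect_langs should also expose a lang attribute, so reading it from the first item is a possible alternative to string cleaning.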

The logic is the same as in the slides and the previous exercise, but instead of applying the function to a list, you apply it to the rows of a dataset.

This exercise is part of the course

Sentiment Analysis in Python


Exercise instructions

  • Iterate over the rows of the non_english_reviews dataset.
  • Inside the loop, detect the language of the review text in the second column of the dataset.
  • Clean each result by splitting its string form on ':' inside the list comprehension.
  • Finally, assign the cleaned list to a new column.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

from langdetect import detect_langs
languages = []

# Loop over the rows of the dataset and append each detection result
for row in ____(____(non_english_reviews)):
    languages.append(____(non_english_reviews.iloc[row, 1]))

# Clean the list: keep the text before the first ':' and drop the leading '['
languages = [str(lang).____(':')[0][1:] for lang in languages]

# Assign the list to a new feature
non_english_reviews['language'] = ____

print(non_english_reviews.head())