Language detection of product reviews
You will practice language detection on a small dataset called non_english_reviews. It is a sample of non-English reviews from the Amazon product reviews.
You will iterate over the rows of the dataset, detecting the language of each row and appending the result to an empty list. The list then needs to be cleaned so that it contains only the language of each review, such as 'en' for English, instead of the regular output en:0.9987654. Remember that the language detection function might detect more than one language, and the first item in the returned list is the most likely candidate. Finally, you will assign the list to a new column.
The logic is the same as in the slides and the previous exercise, but instead of applying the function to a list, you work with a dataset.
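To see why the cleaning step works, it may help to look at the string form of a detection result. The sketch below assumes that converting the returned list to a string gives text such as [es:0.9999967]. The sample strings and the helper name clean_language are made up for illustration and are not actual output from the dataset.

```python
# Hypothetical stringified detection results; real probabilities will differ.
raw_results = [
    "[es:0.9999967]",
    "[fr:0.8571401, it:0.1428573]",
]


def clean_language(raw):
    """Return the most likely language code from a stringified result."""
    # Splitting on ':' leaves '[es' as the first piece;
    # slicing with [1:] then drops the leading '['.
    return raw.split(":")[0][1:]


languages = [clean_language(raw) for raw in raw_results]
print(languages)
```

Because the most likely candidate comes first, taking the first piece of the split is enough even when several languages were detected.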
This exercise is part of the course
Sentiment Analysis in Python
Exercise instructions
- Iterate over the rows of the non_english_reviews dataset.
- Inside the loop, detect the language of the second column of the dataset.
- Clean the string by splitting on a : inside the list comprehension expression.
- Finally, assign the cleaned list to a new column.
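The four steps can be traced end to end without the real dataset or detector. The following is only a sketch under stated assumptions: Candidate and fake_detect_langs are hypothetical stand-ins that mimic a detector returning candidates printed as language:probability, and the reviews dictionary is a toy column-oriented substitute for the dataset, with the review text as its second column.

```python
class Candidate:
    """Hypothetical stand-in for one detection result."""

    def __init__(self, lang, prob):
        self.lang = lang
        self.prob = prob

    def __repr__(self):
        # Printed as 'language:probability', e.g. 'es:0.99'.
        return f"{self.lang}:{self.prob}"


def fake_detect_langs(text):
    """Toy detector (NOT a real library call): guesses from one keyword."""
    if "bueno" in text:
        return [Candidate("es", 0.99)]
    # Several candidates, most likely first.
    return [Candidate("fr", 0.9), Candidate("it", 0.1)]


# Toy stand-in for the dataset; the review text is the second column.
reviews = {
    "score": [5, 4],
    "review": ["Muy bueno, lo recomiendo", "Très bon produit"],
}

# Steps 1 and 2: loop over row positions and detect the language of each review.
languages = []
for row in range(len(reviews["review"])):
    languages.append(fake_detect_langs(reviews["review"][row]))

# Step 3: keep only the first language code from each stringified result.
languages = [str(lang).split(":")[0][1:] for lang in languages]

# Step 4: store the cleaned list as a new column.
reviews["language"] = languages
print(reviews["language"])
```

The loop works with row positions rather than the rows themselves, which mirrors the positional lookup used in the sample code.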
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from langdetect import detect_langs
languages = []
# Loop over the rows of the dataset and append
for row in ____(____(non_english_reviews)):
    languages.append(____(non_english_reviews.iloc[row, 1]))
# Clean the list by splitting
languages = [str(lang).____(':')[0][1:] for lang in languages]
# Assign the list to a new feature
non_english_reviews['language'] = ____
print(non_english_reviews.head())