Frequency analysis of product reviews
You now have access to a larger dataset of TechZone product reviews. Just like before, you've preprocessed and transformed the reviews into a BoW representation X. Your task now is to analyze the word frequencies and identify the most common terms in the dataset.
To help with the analysis, a helper function called get_top_ten() is provided. It takes in a list of words and their corresponding counts, and returns the 10 most frequent words and their counts.
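The exercise does not show the body of get_top_ten(), so the following is only a minimal sketch of how such a helper might behave, based on the description above. The name get_top_ten_sketch and the tie-breaking behavior are assumptions made for illustration, not the course's actual implementation.

```python
def get_top_ten_sketch(words, counts):
    """Hypothetical stand-in for the provided get_top_ten() helper."""
    # Pair each word with its count, then sort by count, largest first.
    # sorted() is stable, so tied words keep their original relative order.
    pairs = sorted(zip(words, counts), key=lambda pair: pair[1], reverse=True)[:10]
    top_words = [word for word, _ in pairs]
    top_counts = [int(count) for _, count in pairs]
    return top_words, top_counts


# Toy input, invented for this example only.
example_words = ["the", "phone", "battery", "great"]
example_counts = [7, 3, 5, 1]
top_words, top_counts = get_top_ten_sketch(example_words, example_counts)
print(top_words, top_counts)
```

With fewer than ten words, the sketch simply returns all of them in descending order of count.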
This exercise is part of the course Natural Language Processing (NLP) in Python.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
def preprocess(text):
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in string.punctuation]
    return " ".join(tokens)
cleaned_reviews = [preprocess(review) for review in product_reviews]
X = vectorizer.fit_transform(cleaned_reviews)
# Get word counts
word_counts = np.____(X.____, axis=0)
# Get words
words = vectorizer.____
top_words_with_stopwords, top_counts_with_stopwords = get_top_ten(words, word_counts)
print(top_words_with_stopwords, top_counts_with_stopwords)
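To see what the axis=0 sum over a BoW matrix is meant to compute, here is a small plain-Python illustration. It assumes each row of the matrix is one review and each column is one vocabulary word. The vocabulary and counts below are made up for the example and do not come from the TechZone dataset.

```python
# Toy document-term matrix: one row per review, one column per vocabulary word.
vocabulary = ["battery", "great", "phone", "the"]
bow_rows = [
    [1, 0, 1, 2],  # review 1
    [0, 1, 1, 1],  # review 2
    [2, 1, 0, 1],  # review 3
]

# Summing down each column (the axis=0 direction) gives the number of times
# each word occurs across the whole set of reviews.
column_totals = [sum(column) for column in zip(*bow_rows)]

for word, total in zip(vocabulary, column_totals):
    print(word, total)
```

The position of each total matches the position of its word in the vocabulary, which is why the word list and the count array can be passed to get_top_ten() side by side.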