Frequency analysis of product reviews
You now have access to a larger dataset of TechZone product reviews. As before, the reviews have been preprocessed and transformed into a bag-of-words (BoW) representation X. Your task now is to analyze the word frequencies and identify the most common terms in the dataset.
To help with the analysis, a helper function called get_top_ten() is provided. It takes in a list of words and their corresponding counts, and returns the 10 most frequent words and their counts.
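The exercise does not show the body of get_top_ten(), so the following is only a hypothetical stand-in that matches the description above: it pairs each word with its count, sorts by count in descending order, and returns the first ten words and counts as two lists. The implementation details (sorting with sorted(), returning plain Python lists) are assumptions for illustration.

```python
def get_top_ten(words, word_counts):
    """Hypothetical stand-in for the provided helper (an assumption, not the course's code)."""
    # Pair each word with its count and sort by count, highest first.
    pairs = sorted(zip(words, word_counts), key=lambda pair: pair[1], reverse=True)
    top = pairs[:10]
    top_words = [word for word, _ in top]
    top_counts = [int(count) for _, count in top]
    return top_words, top_counts


# Made-up words and counts, only to show the call shape.
top_words, top_counts = get_top_ten(["the", "is", "great", "battery"], [4, 3, 2, 2])
print(top_words, top_counts)
```

Because the helper returns two values, the call site can unpack them directly into a list of words and a list of counts.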
This exercise is part of the course
Natural Language Processing (NLP) in Python
Interactive hands-on exercise
Try solving this exercise by completing the sample code.
def preprocess(text):
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in string.punctuation]
    return " ".join(tokens)
cleaned_reviews = [preprocess(review) for review in product_reviews]
X = vectorizer.fit_transform(cleaned_reviews)
# Get word counts
word_counts = np.____(X.____, axis=0)
# Get words
words = vectorizer.____
top_words_with_stopwords, top_counts_with_stopwords = get_top_ten(words, word_counts)
print(top_words_with_stopwords, top_counts_with_stopwords)