BOW with n-grams and vocabulary size
In this exercise, you will practice building a bag-of-words (BOW) representation once more, using the reviews
dataset of Amazon product reviews. Your main task is to limit the size of the vocabulary and to specify the length of the token sequences (n-grams).
This exercise is part of the course
Sentiment Analysis in Python
Exercise instructions
- Import the vectorizer from sklearn.
- Build the vectorizer and make sure to specify the following parameters: the size of the vocabulary should be limited to 1000, include only bigrams, and ignore terms that appear in more than 500 documents.
- Fit the vectorizer to the review column.
- Create a DataFrame from the BOW representation.
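To see how the three settings in the instructions interact, here is a minimal sketch on a made-up four-sentence corpus rather than the reviews dataset. The corpus and the small limits (3 features, a document cutoff of 2) are assumptions chosen so the effect is visible by eye; the parameter names are those of scikit-learn's CountVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the reviews dataset
docs = [
    "the battery life is great",
    "the battery life is poor",
    "great battery life overall",
    "the screen is great",
]

# ngram_range=(2, 2): use bigrams only
# max_df=2: ignore bigrams that appear in more than 2 documents
# max_features=3: keep only the 3 most frequent of the remaining bigrams
vect = CountVectorizer(ngram_range=(2, 2), max_features=3, max_df=2)
vect.fit(docs)

vocab = sorted(vect.vocabulary_)
print(vocab)
```

In this toy corpus the bigram "battery life" occurs in three documents, so the max_df cutoff removes it before the max_features limit is applied to what is left.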
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the vectorizer
from sklearn.____.____ import ____
# Build the vectorizer, specify max features and fit
vect = ____(____=1000, ____=(2, 2), ____=500)
vect.____(reviews.review)
# Transform the review
X_review = vect.transform(reviews.review)
# Create a DataFrame from the bow representation
X_df = pd.DataFrame(X_review.____, columns=____._____)
print(X_df.head())
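The last step, turning the sparse BOW matrix into a DataFrame, can be sketched on toy data as follows. The toy_reviews frame and its contents are hypothetical stand-ins for the reviews dataset. The sketch also assumes that the installed scikit-learn may be either a recent release, where the feature names come from get_feature_names_out(), or an older one that only offers get_feature_names().

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the reviews DataFrame
toy_reviews = pd.DataFrame(
    {"review": ["good sound quality", "poor sound quality", "good value"]}
)

# Bigram-only vectorizer, fitted and applied in one step
vect = CountVectorizer(ngram_range=(2, 2))
X = vect.fit_transform(toy_reviews["review"])

# Newer scikit-learn releases expose get_feature_names_out();
# fall back to get_feature_names() on older releases
if hasattr(vect, "get_feature_names_out"):
    names = list(vect.get_feature_names_out())
else:
    names = list(vect.get_feature_names())

# X is a sparse matrix; toarray() makes it dense so pandas can label the columns
bow_df = pd.DataFrame(X.toarray(), columns=names)
print(bow_df)
```

Each row corresponds to one review and each column to one bigram. Converting to a dense array is fine for small vocabularies like this one, but it can use a lot of memory when the vocabulary and the number of documents are both large.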