Specify token sequence length with BOW
We saw in the video that by specifying different token sequence lengths - what we called n-grams - we can better capture context, which can be very important for sentiment.
In this exercise, you will work with a sample of Amazon product reviews. Your task is to build a BOW vocabulary from the review column while specifying the sequence length of the tokens.
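To make the idea of n-grams concrete, here is a minimal sketch in plain Python. The helper `ngrams` and the sample sentence are hypothetical, introduced only for illustration, and the whitespace tokenization is a simplification rather than what `CountVectorizer` does internally.

```python
def ngrams(text, n):
    """Return the word n-grams of `text` (hypothetical helper, whitespace tokenization)."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review = "not good at all"  # made-up example sentence
print(ngrams(review, 1))  # unigrams: ['not', 'good', 'at', 'all']
print(ngrams(review, 2))  # bigrams: ['not good', 'good at', 'at all']
```

The unigram "good" on its own looks positive, while the bigram "not good" keeps the negation. That is the context that longer token sequences can capture.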
This exercise is part of the course
Sentiment Analysis in Python
Exercise instructions
- Build the vectorizer, specifying the token sequence length to be uni- and bigrams.
- Fit the vectorizer.
- Transform the review column using the fitted vectorizer.
- When creating the DataFrame, make sure to specify the column names correctly.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Build the vectorizer, specify the token sequence length, and fit
vect = ____(____=(___,___))
vect.____(reviews.review)
# Transform the review column
X_review = vect.____(reviews.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.____)
print(X_df.head())