Step 2: Building a vectorizer
In this exercise, you will build a TF-IDF transformation of the review column in the reviews dataset. You will specify the n-gram range, the stop words, the token pattern, and the vocabulary size.
This is the last step before we train a classifier to predict the sentiment of a review.
Make sure you specify the maximum number of features properly, as a very large vocabulary size could disconnect your session.
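To illustrate why the cap matters, here is a minimal sketch on a hypothetical three-sentence corpus. The docs list is made up for illustration and is not the course's reviews data. Each vocabulary term becomes one column, so the sketch compares the vocabulary size with and without max_features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the reviews.review column
docs = [
    "the battery life is great and the screen is great",
    "terrible battery and a terrible screen",
    "great value for the price",
]

# Without a cap, every unigram and bigram found in the corpus becomes a column
full = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

# With max_features, only the most frequent terms across the corpus are kept
capped = TfidfVectorizer(ngram_range=(1, 2), max_features=5).fit(docs)

n_full = len(full.vocabulary_)
n_capped = len(capped.vocabulary_)
print("uncapped vocabulary size:", n_full)
print("capped vocabulary size:", n_capped)
```

On a real review corpus the uncapped uni- and bi-gram vocabulary can be far larger, which is why the exercise fixes the limit at 200.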
This exercise is part of the course Sentiment Analysis in Python.
Exercise instructions
- Import the TF-IDF vectorizer and the default list of English stop words.
- Build the TF-IDF vectorizer, specifying, in this order, the following arguments: the default list of English stop words as the stop words; uni- and bi-grams as the n-grams; a maximum of 200 features; and the specified pattern, so that only words are captured.
- Create a DataFrame from the matrix produced by the TF-IDF vectorizer.
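The steps above can be sketched end to end on a small stand-in dataset. This is only an illustration under stated assumptions: the toy_reviews DataFrame is hypothetical, the stop words are wrapped in list() because some scikit-learn releases only accept a list there, and the column names are read from the fitted vocabulary_ mapping so the sketch does not depend on which feature-name method a given release provides.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Hypothetical stand-in for the course's reviews dataset
toy_reviews = pd.DataFrame({
    "review": [
        "Great phone, the battery lasts 2 days",
        "Terrible phone, the battery died in 3 hours",
        "Great battery and a great screen",
    ]
})

# Build and fit the vectorizer with the four arguments named in the instructions
vect = TfidfVectorizer(
    stop_words=list(ENGLISH_STOP_WORDS),   # default English stop words, as a list
    ngram_range=(1, 2),                    # uni- and bi-grams
    max_features=200,                      # cap on the vocabulary size
    token_pattern=r'\b[^\d\W][^\d\W]+\b',  # tokens of two or more non-digit word characters
).fit(toy_reviews.review)

# Sparse TF-IDF matrix: one row per review, one column per term
X = vect.transform(toy_reviews.review)

# vocabulary_ maps each term to its column index; sort terms by that index
columns = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

# Dense DataFrame with one named column per term
toy_transformed = pd.DataFrame(X.toarray(), columns=columns)
print(toy_transformed.head())
```

The sketch uses its own variable names and data, so it does not stand in for the exercise's expected answer.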
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the TfidfVectorizer and default list of English stop words
from sklearn.feature_extraction.text import ____, ____
# Build the vectorizer
vect = ____(____=____, ____=(1, 2), ____=200, ____=r'\b[^\d\W][^\d\W]+\b').fit(reviews.review)
# Create sparse matrix from the vectorizer
X = vect.transform(reviews.review)
# Create a DataFrame
reviews_transformed = pd.DataFrame(X.____, columns=vect.____)
print('Top 5 rows of the DataFrame: \n', reviews_transformed.head())
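To see what the token pattern in the sample code keeps, the regular expression can be tried on its own with Python's re module. This is a small sketch with a made-up sentence. The vectorizer also lowercases text and removes stop words, which this snippet does not do.

```python
import re

# Same pattern as the token_pattern argument in the sample code:
# two or more consecutive word characters that are not digits, bounded by word boundaries
pattern = r'\b[^\d\W][^\d\W]+\b'

# Hypothetical sentence mixing words, numbers, and alphanumeric tokens
text = "i bought 2 phones in 2019, model x10 was gr8"

tokens = re.findall(pattern, text)
print(tokens)
```

Single-letter words, pure numbers, and tokens that contain digits (such as x10 or gr8) are all dropped, so only plain words of at least two characters reach the vocabulary.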