
Step 2: Building a vectorizer

In this exercise, you will build a TF-IDF transformation of the review column in the reviews dataset. You will specify the n-gram range, the stop words, the token pattern, and the size of the vocabulary.

This is the last step before we train a classifier to predict the sentiment of a review.

Make sure you set the maximum number of features properly: a very large vocabulary produces a very wide matrix, which could disconnect your session.
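To see how the maximum number of features limits the vocabulary, here is a minimal sketch on a made-up three-review corpus. The `toy_reviews` list is a hypothetical stand-in for `reviews.review`, not the course data, and the cap of 2 is chosen only to make the effect visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for reviews.review
toy_reviews = [
    "great phone great battery",
    "terrible battery terrible screen",
    "great screen",
]

# Without a cap, the vocabulary holds every distinct token in the corpus
full = TfidfVectorizer().fit(toy_reviews)

# With max_features=2, only the 2 terms with the highest
# term frequency across the corpus are kept
capped = TfidfVectorizer(max_features=2).fit(toy_reviews)

full_size = len(full.vocabulary_)
capped_size = len(capped.vocabulary_)
print(full_size, capped_size)
```

Each vocabulary entry becomes one column of the transformed matrix, so the cap directly bounds the width of the DataFrame you build later.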

This exercise is part of the course

Sentiment Analysis in Python


Exercise instructions

  • Import the Tfidf vectorizer and the default list of English stop words.
  • Build the Tfidf vectorizer, specifying the following arguments in this order: use the default list of English stop words as stop words; use uni- and bi-grams as n-grams; set the maximum number of features to 200; capture only words using the specified pattern.
  • Create a DataFrame using the Tfidf vectorizer.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import the TfidfVectorizer and default list of English stop words
from sklearn.feature_extraction.text import ____, ____

# Build the vectorizer
vect = ____(____=____, ____=(1, 2), ____=200, ____=r'\b[^\d\W][^\d\W]+\b').fit(reviews.review)
# Create sparse matrix from the vectorizer
X = vect.transform(reviews.review)

# Create a DataFrame
reviews_transformed = pd.DataFrame(X.____, columns=vect.____)
print('Top 5 rows of the DataFrame: \n', reviews_transformed.head())
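If you are unsure what the token pattern in the template captures, you can try it with Python's built-in `re` module. This is only an illustrative sketch: the `sample` sentence is invented, and it is lowercased by hand here because the vectorizer lowercases text by default.

```python
import re

# The token pattern used in the exercise template
token_pattern = r'\b[^\d\W][^\d\W]+\b'

# Hypothetical sample sentence, not taken from the reviews dataset
sample = "I bought 2 of these in 2019, a gr8 deal!"

# Find every substring that matches the pattern
tokens = re.findall(token_pattern, sample.lower())
print(tokens)  # ['bought', 'of', 'these', 'in', 'deal']
```

The pattern keeps runs of at least two word characters that contain no digits, so single letters ("i", "a"), pure numbers ("2", "2019"), and mixed tokens such as "gr8" are all dropped.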