Size of vocabulary of movies reviews
In this exercise, you will practice different ways to limit the size of the vocabulary using a sample of the movie reviews dataset. The first column is the review, which is of type object, and the second column is the label, which is 0 for a negative review and 1 for a positive one.
The three methods that you will use will transform the text column to new numeric columns, capturing the count of a word or a phrase in each review. Each method will ultimately result in building a different number of new features.
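As a rough illustration of how different settings lead to different numbers of features, the sketch below fits CountVectorizer on a tiny made-up corpus. It assumes the three methods are the max_features, max_df and min_df parameters, which the text above does not name; the corpus and the parameter values are invented for the example and are not taken from the course.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration (not the course's movies dataset)
reviews = [
    "great movie great cast",
    "boring movie weak plot",
    "great movie and great acting",
    "boring movie",
]

def vocab_size(**kwargs):
    """Fit a CountVectorizer with the given options and return its vocabulary size."""
    vect = CountVectorizer(**kwargs)
    vect.fit(reviews)
    return len(vect.vocabulary_)

# No limit: every distinct token of two or more characters becomes a feature
full_size = vocab_size()

# Keep only the 3 most frequent terms across the corpus
top_k_size = vocab_size(max_features=3)

# Ignore terms that appear in more than 3 documents
max_df_size = vocab_size(max_df=3)

# Ignore terms that appear in fewer than 2 documents
min_df_size = vocab_size(min_df=2)

print(full_size, top_k_size, max_df_size, min_df_size)
```

An integer passed to max_df or min_df is read as an absolute document count, while a float is read as a proportion of documents.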
This exercise is part of the course Sentiment Analysis in Python.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Build the vectorizer, specify size of vocabulary and fit
vect = CountVectorizer(____=____)
vect.fit(movies.review)
# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())