Size of vocabulary of movies reviews

In this exercise, you will practice different ways to limit the size of the vocabulary using a sample of the movies reviews dataset. The first column is the review, which is of type object and the second column is the label, which is 0 for a negative review and 1 for a positive one.

The three methods that you will use will transform the text column to new numeric columns, capturing the count of a word or a phrase in each review. Each method will ultimately result in building a different number of new features.

Diese Übung ist Teil des Kurses

Sentiment Analysis in Python

Kurs anzeigen

Interaktive Übung

Versuche dich an dieser Übung, indem du diesen Beispielcode vervollständigst.

from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify size of vocabulary and fit
vect = CountVectorizer(____=____)
vect.fit(movies.review)

# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())

Code bearbeiten und ausführen