BoW vectors for movie reviews
In this exercise, you have been given two pandas Series, X_train and X_test, which consist of movie reviews. They represent the training and test review data, respectively. Your task is to preprocess the reviews and generate BoW vectors for these two sets using CountVectorizer.
Once we have generated the BoW vector matrices X_train_bow and X_test_bow, we will be in a very good position to apply a machine learning model to them and conduct sentiment analysis.
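To make the idea of a BoW matrix concrete, here is a minimal sketch on a toy corpus. The two sentences and the names corpus, bow and vocab are assumptions made up for illustration; they are not part of the exercise data. Each row of the matrix is one document and each column counts one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not the exercise's movie reviews)
corpus = ["The movie was good", "The movie was bad, really bad"]

# Default settings: text is lowercased, no stop words are removed
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

# Vocabulary words ordered by their column index in the matrix
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
# ['bad', 'good', 'movie', 'really', 'the', 'was']

# One row per document, one column per vocabulary word
print(bow.toarray())
# [[0 1 1 0 1 1]
#  [2 0 1 1 1 1]]
```

The second row has a 2 in the first column because "bad" appears twice in the second sentence.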
This exercise is part of the course Feature Engineering for NLP in Python.
Exercise instructions
- Import CountVectorizer from the sklearn library.
- Instantiate a CountVectorizer object named vectorizer. Ensure that all words are converted to lowercase and english stopwords are removed.
- Using X_train, fit vectorizer and then use it to transform X_train to generate the set of BoW vectors X_train_bow.
- Transform X_test using vectorizer to generate the set of BoW vectors X_test_bow.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import CountVectorizer
from sklearn.feature_extraction.text import ____
# Create a CountVectorizer object
vectorizer = ____(lowercase=____, stop_words=____)
# Fit and transform X_train
X_train_bow = vectorizer.____(____)
# Transform X_test
X_test_bow = vectorizer.____(____)
# Print shape of X_train_bow and X_test_bow
print(X_train_bow.shape)
print(X_test_bow.shape)
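
One possible way to fill in the blanks is sketched below. The exercise supplies X_train and X_test itself, so the two small Series here are hypothetical stand-ins that only make the snippet runnable on its own. The vectorizer is fitted on the training data only and then reused on the test data, which is why both matrices end up with the same number of columns.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the exercise's X_train and X_test
X_train = pd.Series([
    "A wonderful, moving film",
    "Terrible plot and terrible acting",
    "The acting was wonderful",
])
X_test = pd.Series(["A terrible film with a brilliant cast"])

# Lowercase all words and drop English stop words
vectorizer = CountVectorizer(lowercase=True, stop_words='english')

# Learn the vocabulary from X_train and build its BoW matrix
X_train_bow = vectorizer.fit_transform(X_train)

# Reuse the fitted vocabulary for X_test (no refitting)
X_test_bow = vectorizer.transform(X_test)

# Row counts differ, column counts match
print(X_train_bow.shape)
print(X_test_bow.shape)
```

Words that appear only in the test reviews, such as "brilliant" in this toy data, are not in the fitted vocabulary and are ignored by transform.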