Analyzing dimensionality and preprocessing
In this exercise, you have been provided with lem_corpus, which contains the pre-processed versions of the movie taglines from the previous exercise. In other words, the taglines have been lowercased and lemmatized, and stopwords have been removed.
Your job is to generate the bag of words representation bow_lem_matrix for these lemmatized taglines and compare its shape with that of bow_matrix obtained in the previous exercise. The first five lemmatized taglines in lem_corpus have been printed to the console for you to examine.
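To see why the two shapes are expected to differ, here is a minimal sketch on a hypothetical three-tagline corpus (not the course data). The names raw_corpus and toy_lem_corpus are made up for illustration, and the lemmatized versions are written by hand as an assumption of what the preprocessing would produce; the course's actual pipeline may give different tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical raw taglines (illustrative only, not the course dataset)
raw_corpus = [
    "The lions are roaring in the jungle",
    "A lion roared at the hunters",
    "Hunters were hunting the lions",
]

# Hand-written stand-ins for the lowercased, lemmatized,
# stopword-free versions of the taglines above (an assumption)
toy_lem_corpus = [
    "lion roar jungle",
    "lion roar hunter",
    "hunter hunt lion",
]

# Each matrix has one row per tagline and one column per distinct token
raw_shape = CountVectorizer().fit_transform(raw_corpus).shape
lem_shape = CountVectorizer().fit_transform(toy_lem_corpus).shape

print(raw_shape)  # (3, 12)
print(lem_shape)  # (3, 5)
```

The row count stays the same because the number of taglines is unchanged. The column count drops because inflected forms such as "lions" and "lion" collapse into one lemma and stopwords such as "the" no longer get a column of their own.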
This exercise is part of the course
Feature Engineering for NLP in Python
Exercise instructions
- Import the CountVectorizer class from sklearn.
- Instantiate a CountVectorizer object. Name it vectorizer.
- Using fit_transform(), generate bow_lem_matrix for lem_corpus.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import CountVectorizer
from sklearn.feature_extraction.text import ____
# Create CountVectorizer object
____ = ____
# Generate matrix of word vectors
bow_lem_matrix = ____.____(lem_corpus)
# Print the shape of bow_lem_matrix
print(bow_lem_matrix.shape)
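
One way the blanks might be filled in is sketched below. The course's lem_corpus is preloaded in the exercise environment and is not reproduced on this page, so the three-item list here is a hypothetical stand-in; the shape printed on the real data will be different.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the preloaded lem_corpus (not the course data)
lem_corpus = ["good man bad place", "man woman love", "love war"]

# Create CountVectorizer object with default settings
vectorizer = CountVectorizer()

# Learn the vocabulary and build the document-term matrix in one step
bow_lem_matrix = vectorizer.fit_transform(lem_corpus)

# Rows are taglines, columns are distinct tokens in the vocabulary
print(bow_lem_matrix.shape)  # (3, 7)

# The learned vocabulary, sorted alphabetically for inspection
print(sorted(vectorizer.vocabulary_))
```

The second element of the shape is the vocabulary size, which is the number to compare against the column count of bow_matrix from the previous exercise.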