Analyzing dimensionality and preprocessing

In this exercise, you are given lem_corpus, which contains the preprocessed versions of the movie taglines from the previous exercise: the taglines have been lowercased and lemmatized, and stopwords have been removed.

Your job is to generate the bag-of-words representation bow_lem_matrix for these lemmatized taglines and compare its shape with that of bow_matrix, obtained in the previous exercise. The first five lemmatized taglines in lem_corpus have been printed to the console for you to examine.
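To see why the two shapes can differ, here is a minimal pure-Python sketch. It assumes a simple whitespace tokenizer, which is not the same as CountVectorizer's default tokenizer, and both toy corpora and the helper bow_shape are hypothetical, written by hand for illustration rather than taken from the course data. The number of rows is the number of documents; the number of columns is the vocabulary size.

```python
# Hypothetical taglines, not taken from the course dataset.
raw_corpus = [
    "The heroes are returning",
    "A hero returns to the city",
]

# Hand-written stand-ins for the same taglines after lowercasing,
# lemmatization, and stopword removal.
lem_corpus_demo = [
    "hero return",
    "hero return city",
]


def bow_shape(corpus):
    """Return (number of documents, number of distinct lowercase tokens)."""
    vocabulary = {token for doc in corpus for token in doc.lower().split()}
    return len(corpus), len(vocabulary)


raw_shape = bow_shape(raw_corpus)
lem_shape = bow_shape(lem_corpus_demo)

# Same number of rows, fewer columns after preprocessing.
print(raw_shape, lem_shape)
```

In this toy case, inflected forms such as "heroes" and "hero" collapse into one column and stopwords disappear, so the preprocessed corpus needs fewer dimensions for the same number of documents.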

This exercise is part of the course

Feature Engineering for NLP in Python

Instructions

  • Import the CountVectorizer class from sklearn.feature_extraction.text.
  • Instantiate a CountVectorizer object. Name it vectorizer.
  • Using fit_transform(), generate bow_lem_matrix for lem_corpus.

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Import CountVectorizer
from sklearn.feature_extraction.text import ____

# Create CountVectorizer object
____ = ____

# Generate matrix of word vectors
bow_lem_matrix = ____.____(lem_corpus)

# Print the shape of bow_lem_matrix
print(bow_lem_matrix.shape)
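
One way the blanks in the template might be filled in is sketched below. The three-document lem_corpus here is a hypothetical stand-in so that the snippet runs on its own; in the exercise, lem_corpus is already provided and the resulting shape will differ.

```python
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for lem_corpus; the exercise supplies the real one.
lem_corpus = ["hero return", "hero return city", "love conquer fear"]

# Create CountVectorizer object with default settings
vectorizer = CountVectorizer()

# Learn the vocabulary and build the document-term matrix in one step
bow_lem_matrix = vectorizer.fit_transform(lem_corpus)

# Shape is (number of documents, vocabulary size)
print(bow_lem_matrix.shape)
```

fit_transform() returns a sparse matrix, so its shape attribute can be read without converting it to a dense array.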