Exercise

Preprocessing text features

Here, you'll add a similar preprocessing step to a pipeline, only this time you'll work with the text column from the sample data.

To preprocess the text, you'll turn to CountVectorizer() to generate a bag-of-words representation of the data, as in Chapter 2. Using the default arguments, add a (step, transform) tuple to the steps list in your pipeline.
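
As a quick refresher on what a bag-of-words representation looks like, here is a minimal sketch that runs CountVectorizer() with its default arguments. The three short documents in docs are made up for illustration and are not part of sample_df.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents (an assumption; not taken from sample_df).
docs = ["red apple", "green apple", "red red wine"]

# Default arguments: lowercase the text and keep tokens of 2+ characters.
vec = CountVectorizer()
bow = vec.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each token to its column index; sort tokens by that index.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
counts = bow.toarray()

print(vocab)
print(counts)
```

Each row of counts is one document and each column is one token from vocab, so the entry for "red" in the third row records that the word appears twice there.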

Make sure you select only the text column when splitting the data into training and test sets.

As usual, your sample_df is ready and waiting in the workspace.

Instructions
  • Import CountVectorizer from sklearn.feature_extraction.text.
  • Create training and test sets by selecting the correct subset of sample_df: 'text'.
  • Add the CountVectorizer step (with the name 'vec') to the correct position in the pipeline.
  • Fit the pipeline to the training data and compute its accuracy.
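
The steps above could be sketched roughly as follows. This is not the official solution: the small sample_df built here is a hypothetical stand-in for the one in the workspace, the 'label' column name is assumed, and the 'clf' step with LogisticRegression is only one plausible choice of final estimator, since the exercise text does not say which classifier the pipeline ends with.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the workspace's sample_df (assumed columns).
sample_df = pd.DataFrame({
    "text": [
        "great fun movie", "loved this film", "wonderful acting",
        "great story and cast", "fun and wonderful", "loved the cast",
        "boring slow movie", "hated this film", "terrible acting",
        "slow story and dull", "dull and terrible", "hated the plot",
    ],
    "label": ["pos"] * 6 + ["neg"] * 6,
})

# Select only the text column, as a 1-D Series: CountVectorizer expects
# an iterable of strings, not a two-dimensional DataFrame.
X_train, X_test, y_train, y_test = train_test_split(
    sample_df["text"],
    sample_df["label"],
    random_state=22,
    stratify=sample_df["label"],
)

# The vectorizer comes before the classifier, so raw text is turned into
# token counts before the model sees it.
pl = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression()),  # assumed final step
])

# Fit on the training text, then score on the held-out text.
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
print("Accuracy on sample data - just text data:", accuracy)
```

Because the vectorizer lives inside the pipeline, its vocabulary is learned from the training split only and then reused unchanged when pl.score() transforms the test split.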