Get startedGet started for free

CountVectorizer for text classification

It's time to begin building your text classifier! The data has been loaded into a DataFrame called df. Explore it in the IPython Shell to investigate what columns you can use. The .head() method is particularly informative.

In this exercise, you'll use pandas alongside scikit-learn to create a sparse text vectorizer you can use to train and test a simple supervised model. To begin, you'll set up a CountVectorizer and investigate some of its features.

This exercise is part of the course

Introduction to Natural Language Processing in Python

View Course

Exercise instructions

  • Import CountVectorizer from sklearn.feature_extraction.text and train_test_split from sklearn.model_selection.
  • Create a Series y to use for the labels by assigning the .label attribute of df to y.
  • Using df["text"] (features) and y (labels), create training and test sets using train_test_split(). Use a test_size of 0.33 and a random_state of 53.
  • Create a CountVectorizer object called count_vectorizer. Ensure you specify the keyword argument stop_words="english" so that stop words are removed.
  • Fit and transform the training data X_train using the .fit_transform() method of your CountVectorizer object. Do the same with the test data X_test, except using the .transform() method.
  • Print the first 10 features of the count_vectorizer using its .get_feature_names() method.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import the necessary modules
____
____

# Print the head of df
print(df.head())

# Create a series to store the labels: y
y = ____

# Create training and test sets
X_train, X_test, y_train, y_test = ____

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = ____

# Transform the training data using only the 'text' column values: count_train 
count_train = ____

# Transform the test data using only the 'text' column values: count_test 
count_test = ____

# Print the first 10 features of the count_vectorizer
print(____[:10])
Edit and Run Code