CountVectorizer for text classification
It's time to begin building your text classifier! The data has been loaded into a DataFrame called df. Explore it in the IPython Shell to investigate what columns you can use. The .head() method is particularly informative.
In this exercise, you'll use pandas alongside scikit-learn to create a sparse text vectorizer you can use to train and test a simple supervised model. To begin, you'll set up a CountVectorizer and investigate some of its features.
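To make "sparse text vectorizer" concrete, here is a minimal sketch on a toy corpus. The two sentences are hypothetical and are not taken from the exercise's df; the sketch only shows that CountVectorizer turns each document into a row of token counts and stores the result as a sparse matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the exercise's df
docs = [
    "The cat sat on the mat",
    "The dog chased the cat",
]

# stop_words="english" drops common words such as "the"
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each kept token to its column index;
# sorting by that index lists the tokens in column order
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)

# Dense view for inspection only; keep real data sparse
print(counts.toarray())
```

Each column corresponds to one vocabulary token, and each cell holds how many times that token occurs in the document.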
This exercise is part of the course Introduction to Natural Language Processing in Python
Exercise instructions
- Import CountVectorizer from sklearn.feature_extraction.text and train_test_split from sklearn.model_selection.
- Create a Series y to use for the labels by assigning the .label attribute of df to y.
- Using df["text"] (features) and y (labels), create training and test sets using train_test_split(). Use a test_size of 0.33 and a random_state of 53.
- Create a CountVectorizer object called count_vectorizer. Ensure you specify the keyword argument stop_words="english" so that stop words are removed.
- Fit and transform the training data X_train using the .fit_transform() method of your CountVectorizer object. Do the same with the test data X_test, except using the .transform() method.
- Print the first 10 features of the count_vectorizer using its .get_feature_names() method. In scikit-learn 1.0 and later, use .get_feature_names_out() instead; .get_feature_names() was removed in version 1.2.
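The instructions call .fit_transform() on the training text but only .transform() on the test text. The sketch below, which uses hypothetical stand-in sentences rather than the exercise data, illustrates one reason for this: the vocabulary is learned from the training text alone, so both matrices end up with the same columns, and words that appear only in the test text are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the training and test text
train_docs = ["stocks rallied today", "voters rallied downtown"]
test_docs = ["stocks fell as voters waited"]

cv = CountVectorizer(stop_words="english")
train_counts = cv.fit_transform(train_docs)  # learns the vocabulary from training text
test_counts = cv.transform(test_docs)        # reuses that vocabulary, learns nothing new

# Both matrices share the same columns, so a model trained on one can score the other
print(train_counts.shape, test_counts.shape)

# "fell" occurs only in the test text, so it has no column
print("fell" in cv.vocabulary_)
```

Calling .fit_transform() on the test text instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.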
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the necessary modules
____
____
# Print the head of df
print(df.head())
# Create a series to store the labels: y
y = ____
# Create training and test sets
X_train, X_test, y_train, y_test = ____
# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = ____
# Transform the training data using only the 'text' column values: count_train
count_train = ____
# Transform the test data using only the 'text' column values: count_test
count_test = ____
# Print the first 10 features of the count_vectorizer
print(____[:10])
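
One possible way to fill in the blanks above is sketched here. In the exercise, df is already loaded; the small DataFrame below is a hypothetical stand-in (its texts and label values are invented) so that the sketch runs on its own. The feature-name lookup checks which method the installed scikit-learn provides, since .get_feature_names() is not available in newer releases.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the preloaded df; texts and labels are invented
df = pd.DataFrame({
    "text": [
        "Senate passes budget bill after long debate",
        "Aliens endorse candidate in secret meeting",
        "Local council approves new library funding",
        "Miracle pill cures every known disease overnight",
        "Court rules on election district boundaries",
        "Celebrity clone spotted running for governor",
    ],
    "label": ["A", "B", "A", "B", "A", "B"],
})

# Create a series to store the labels: y
y = df.label

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], y, test_size=0.33, random_state=53
)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words="english")

# Learn the vocabulary from the training text and count its tokens
count_train = count_vectorizer.fit_transform(X_train)

# Count tokens in the test text using the training vocabulary
count_test = count_vectorizer.transform(X_test)

# Use whichever feature-name method this scikit-learn version offers
if hasattr(count_vectorizer, "get_feature_names_out"):
    features = count_vectorizer.get_feature_names_out()
else:
    features = count_vectorizer.get_feature_names()

# Print the first 10 features of the count_vectorizer
print(features[:10])
```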