CountVectorizer for text classification
It's time to begin building your text classifier! The data has been loaded into a DataFrame called df. Explore it in the IPython Shell to investigate what columns you can use. The .head() method is particularly informative.
In this exercise, you'll use pandas alongside scikit-learn to create a text vectorizer whose sparse output you can use to train and test a simple supervised model. To begin, you'll set up a CountVectorizer and investigate some of its features.
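As a rough illustration of what a CountVectorizer produces, the sketch below fits one on a tiny made-up corpus (the three sentences are hypothetical and not part of the course data). It shows that the result is a sparse document-term matrix and that English stop words such as "the" and "on" are dropped from the vocabulary.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus, used only for illustration
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the cat and the dog",
]

# stop_words="english" removes common words such as "the", "on" and "and"
vectorizer = CountVectorizer(stop_words="english")

# fit_transform learns the vocabulary and returns a sparse count matrix
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each remaining token to its column index
vocabulary = sorted(vectorizer.vocabulary_)

print(issparse(counts), counts.shape)  # one row per document, one column per token
print(vocabulary)
print(counts.toarray())                # dense view, only sensible for tiny examples
```

Each row corresponds to a document and each column to a vocabulary token, so the matrix can be passed directly to most scikit-learn classifiers.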
This exercise is part of the course
Introduction to Natural Language Processing in Python
Exercise instructions
- Import CountVectorizer from sklearn.feature_extraction.text and train_test_split from sklearn.model_selection.
- Create a Series y to use for the labels by assigning the .label attribute of df to y.
- Using df["text"] (features) and y (labels), create training and test sets using train_test_split(). Use a test_size of 0.33 and a random_state of 53.
- Create a CountVectorizer object called count_vectorizer. Ensure you specify the keyword argument stop_words="english" so that stop words are removed.
- Fit and transform the training data X_train using the .fit_transform() method of your CountVectorizer object. Do the same with the test data X_test, except using the .transform() method.
- Print the first 10 features of the count_vectorizer using its .get_feature_names() method. (scikit-learn 1.0 and later provide .get_feature_names_out() instead, and .get_feature_names() was removed in version 1.2.)
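The steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the course's df is not available here, so a small hypothetical DataFrame stands in for it, and only the column names "text" and "label" are taken from the exercise. The sentences and label values are invented.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the course's df (contents are made up)
df = pd.DataFrame({
    "text": [
        "Senate passes the budget bill",
        "Aliens endorse the senate budget",
        "Court rules on the election dispute",
        "Celebrity clone wins the election",
        "Markets rise after the jobs report",
        "Secret report says markets are fake",
    ],
    "label": ["A", "B", "A", "B", "A", "B"],
})

# Labels as a Series
y = df.label

# Split the raw text and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], y, test_size=0.33, random_state=53
)

# Learn the vocabulary from the training text only
count_vectorizer = CountVectorizer(stop_words="english")
count_train = count_vectorizer.fit_transform(X_train)

# Reuse that vocabulary for the test text; words unseen in training are ignored
count_test = count_vectorizer.transform(X_test)

# get_feature_names_out() exists in scikit-learn >= 1.0;
# fall back to get_feature_names() on older versions
if hasattr(count_vectorizer, "get_feature_names_out"):
    feature_names = list(count_vectorizer.get_feature_names_out())
else:
    feature_names = list(count_vectorizer.get_feature_names())

print(feature_names[:10])
print(count_train.shape, count_test.shape)
```

Calling .transform() rather than .fit_transform() on the test data keeps both matrices in the same column space, which a model trained on count_train needs in order to score count_test.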
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the necessary modules
____
____
# Print the head of df
print(df.head())
# Create a series to store the labels: y
y = ____
# Create training and test sets
X_train, X_test, y_train, y_test = ____
# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = ____
# Transform the training data using only the 'text' column values: count_train
count_train = ____
# Transform the test data using only the 'text' column values: count_test
count_test = ____
# Print the first 10 features of the count_vectorizer
print(____[:10])