Text classification using tf/idf vectors
Now that you've encoded the volunteer dataset's title column into tf/idf vectors, you'll use those vectors to predict the category_desc column.
This exercise is part of the course Preprocessing for Machine Learning in Python.
Exercise instructions
- Split the text_tfidf vector and y target variable into training and test sets, setting the stratify parameter equal to y, since the class distribution is uneven. Notice that we have to run the .toarray() method on the tf/idf vector in order to get it into the proper format for scikit-learn.
- Fit the X_train and y_train data to the Naive Bayes model, nb.
- Print out the test set accuracy.
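To see why stratify matters when classes are uneven, here is a minimal sketch on hypothetical toy data (not the volunteer dataset). The feature values, the labels "a" and "b", and the test_size of 0.2 are all assumptions chosen for illustration. With stratification, both splits should keep the original 3:1 class ratio.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with a 3:1 class imbalance.
X = [[i] for i in range(20)]
y = ["a"] * 15 + ["b"] * 5

# stratify=y asks scikit-learn to preserve the class proportions of y
# in both the training and the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```

Without stratify, a random split of a small, imbalanced dataset can leave the rare class under-represented or missing from the test set, which makes the accuracy estimate less reliable.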
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Split the dataset according to the class distribution of category_desc
y = volunteer["category_desc"]
X_train, X_test, y_train, y_test = ____(____.toarray(), ____, ____=____, random_state=42)
# Fit the model to the training data
nb.____(____, ____)
# Print out the model's accuracy
print(nb.____(____, ____))
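For reference, the same split, fit, and score workflow can be sketched end to end on made-up data. Everything below is an assumption for illustration: the titles and category names are invented, and nb is assumed to be a GaussianNB instance, since the exercise does not show how nb was created. GaussianNB expects a dense array, which is one reason a sparse tf/idf matrix would need .toarray() first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-ins for the title and category_desc columns.
titles = [
    "Tutor children in reading", "Help students with math homework",
    "Teach adult literacy class", "Mentor high school students",
    "Read books to kindergarten class", "Coach students for exams",
    "Plant trees in the park", "Clean up the river bank",
    "Restore community garden beds", "Remove litter from the beach",
    "Build trails in the forest", "Water trees along the street",
]
labels = ["Education"] * 6 + ["Environment"] * 6

# Encode the titles as a sparse tf/idf matrix.
text_tfidf = TfidfVectorizer().fit_transform(titles)

# Assumed model: the exercise only calls it nb.
nb = GaussianNB()

# Convert to a dense array, then split while preserving class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    text_tfidf.toarray(), labels, stratify=labels, random_state=42
)

# Fit on the training split and report mean accuracy on the test split.
nb.fit(X_train, y_train)
accuracy = nb.score(X_test, y_test)
print(accuracy)
```

The score method of a scikit-learn classifier returns the mean accuracy on the data passed to it, so passing the test split gives the test set accuracy.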