TfidfVectorizer for text classification
Similar to the sparse CountVectorizer created in the previous exercise, you'll work on creating tf-idf vectors for your documents. You'll set up a TfidfVectorizer and investigate some of its features.
In this exercise, you'll use pandas and sklearn along with the same X_train, y_train and X_test, y_test DataFrames and Series you created in the last exercise.
This exercise is part of the course Introduction to Natural Language Processing in Python.
Exercise instructions
- Import TfidfVectorizer from sklearn.feature_extraction.text.
- Create a TfidfVectorizer object called tfidf_vectorizer. When doing so, specify the keyword arguments stop_words="english" and max_df=0.7.
- Fit and transform the training data.
- Transform the test data.
- Print the first 10 features of tfidf_vectorizer.
- Print the first 5 vectors of the tfidf training data using slicing on the .A (or array) attribute of tfidf_train.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import TfidfVectorizer
____
# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = ____
# Fit and transform the training data: tfidf_train
tfidf_train = ____
# Transform the test data: tfidf_test
tfidf_test = ____
# Print the first 10 features
print(____[:10])
# Print the first 5 vectors of the tfidf training data
print(____[:5])