TfidfVectorizer for text classification
As with the CountVectorizer from the previous exercise, which produced sparse count vectors, you'll now create tf-idf vectors for your documents. You'll set up a TfidfVectorizer and investigate some of its features.
In this exercise, you'll use pandas and sklearn along with the same X_train, y_train and X_test, y_test DataFrames and Series you created in the last exercise.
This exercise is part of the course
Introduction to Natural Language Processing in Python
Exercise instructions
- Import TfidfVectorizer from sklearn.feature_extraction.text.
- Create a TfidfVectorizer object called tfidf_vectorizer. When doing so, specify the keyword arguments stop_words="english" and max_df=0.7.
- Fit and transform the training data.
- Transform the test data.
- Print the first 10 features of tfidf_vectorizer.
- Print the first 5 vectors of the tfidf training data using slicing on the .A (or array) attribute of tfidf_train.
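To see what the max_df=0.7 argument from the instructions does, here is a minimal sketch on a made-up four-document corpus (the documents and variable names are illustrative assumptions, not the course data). A term that appears in more than 70% of the documents is dropped from the vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "python" appears in every document
docs = [
    "python parsing text",
    "python training models",
    "python plotting charts",
    "python cleaning data",
]

# Default settings keep every term, however common it is
default_vectorizer = TfidfVectorizer()
default_vectorizer.fit(docs)

# max_df=0.7 ignores terms found in more than 70% of the documents
filtered_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
filtered_vectorizer.fit(docs)

# vocabulary_ maps each retained term to its column index
print("python" in default_vectorizer.vocabulary_)
print("python" in filtered_vectorizer.vocabulary_)
```

Here "python" occurs in all four documents, which exceeds the 70% threshold, so the filtered vectorizer should leave it out while keeping the rarer terms.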
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import TfidfVectorizer
____
# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = ____
# Fit and transform the training data: tfidf_train
tfidf_train = ____
# Transform the test data: tfidf_test
tfidf_test = ____
# Print the first 10 features
print(____[:10])
# Print the first 5 vectors of the tfidf training data
print(____[:5])