A tf-idf word-frequency array
In this exercise, you'll create a tf-idf word-frequency array for a toy collection of documents. For this, use the TfidfVectorizer from sklearn. It transforms a list of documents into a word-frequency array, which it outputs as a csr_matrix (a sparse matrix that stores only the nonzero entries). Like other sklearn objects, it has fit() and transform() methods.
You are given a list, documents, of toy documents about pets.
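To make the fit() and transform() description concrete, here is a minimal sketch. The docs list is a hypothetical stand-in, not the exercise's own data, and the variable names are illustrative. It suggests that calling fit() followed by transform() should give the same sparse result as the combined fit_transform() call.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents (an assumption; the exercise supplies its own list)
docs = ["cats say meow", "dogs say woof", "dogs chase cats"]

# Two-step route: learn the vocabulary and idf weights, then encode the documents
vectorizer = TfidfVectorizer()
vectorizer.fit(docs)
separate = vectorizer.transform(docs)

# One-step route: fit and transform in a single call
combined = TfidfVectorizer().fit_transform(docs)

print(issparse(separate), separate.format)  # sparse output in CSR format
print(separate.shape)                       # (number of documents, number of distinct words)
print(np.allclose(separate.toarray(), combined.toarray()))
```

The two-step form is mainly useful when the vocabulary is learned on one set of documents and then applied to another.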
This exercise is part of the course
Unsupervised Learning in Python
Exercise instructions
- Import TfidfVectorizer from sklearn.feature_extraction.text.
- Create a TfidfVectorizer instance called tfidf.
- Apply the .fit_transform() method of tfidf to documents and assign the result to csr_mat. This is a word-frequency array in csr_matrix format.
- Inspect csr_mat by calling its .toarray() method and printing the result. This has been done for you.
- The columns of the array correspond to words. Get the list of words by calling the .get_feature_names_out() method of tfidf, and assign the result to words.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import TfidfVectorizer
from ____ import ____
# Create a TfidfVectorizer: tfidf
tfidf = ____
# Apply fit_transform to documents: csr_mat
csr_mat = ____
# Print result of toarray() method
print(csr_mat.toarray())
# Get the words: words
words = ____
# Print words
print(words)