TF-IDF of movie plots
Let us use the plots of randomly selected movies to perform document clustering on. Before performing clustering on documents, they need to be cleaned of any unwanted noise (such as special characters and stop words) and converted into a sparse matrix through TF-IDF of the documents.
Use the TfidfVectorizer
class to perform the TF-IDF of movie plots stored in the list plots
. The remove_noise()
function is available to use as a tokenizer
in the TfidfVectorizer
class. The .fit_transform()
method fits the data into the TfidfVectorizer
objects and then generates the TF-IDF sparse matrix.
Note: It takes a few seconds to run the .fit_transform()
method.
Diese Übung ist Teil des Kurses
Cluster Analysis in Python
Anleitung zur Übung
- Import
TfidfVectorizer
class fromsklearn
. - Initialize the
TfidfVectorizer
class with minimum and maximum frequencies of 0.1 and 0.75, and 50 maximum features. - Use the
fit_transform()
method on the initializedTfidfVectorizer
class with the list plots.
Interaktive Übung
Versuche dich an dieser Übung, indem du diesen Beispielcode vervollständigst.
# Import TfidfVectorizer class from sklearn
from sklearn.feature_extraction.text import ____
# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(____)
# Use the .fit_transform() method on the list plots
tfidf_matrix = ____