1. Learn
  2. /
  3. Courses
  4. /
  5. Cluster Analysis in Python

Exercise

TF-IDF of movie plots

Let us use the plots of randomly selected movies to perform document clustering on. Before performing clustering on documents, they need to be cleaned of any unwanted noise (such as special characters and stop words) and converted into a sparse matrix through TF-IDF of the documents.

Use the TfidfVectorizer class to perform the TF-IDF of movie plots stored in the list plots. The remove_noise() function is available to use as a tokenizer in the TfidfVectorizer class. The .fit_transform() method fits the data into the TfidfVectorizer objects and then generates the TF-IDF sparse matrix.

Note: It takes a few seconds to run the .fit_transform() method.

Instructions

100 XP
  • Import TfidfVectorizer class from sklearn.
  • Initialize the TfidfVectorizer class with minimum and maximum frequencies of 0.1 and 0.75, and 50 maximum features.
  • Use the fit_transform() method on the initialized TfidfVectorizer class with the list plots.