Instantiate the TF-IDF model

By default, TF-IDF generates a column for every distinct word found across all of your documents (movie summaries in our case). This produces a very wide, hard-to-interpret dataset, because it contains both very common words that appear in every document and words so rare that they provide no value in finding similarities between items.

In this exercise, you will work with the df_plots DataFrame. It contains movies' names in the Title column and their plots in the Plot column.

Using this DataFrame, you will generate the default TF-IDF scores and check whether low-value columns are present.

You will then rerun the TF-IDF calculations, this time limiting the number of columns with the min_df and max_df arguments, and see whether the result improves.

This exercise is part of the course

Building Recommendation Engines in Python

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate the vectorizer object to the vectorizer variable
vectorizer = ____()

# Fit and transform the plot column
vectorized_data = vectorizer.____(df_plots['Plot'])

# Look at the features generated
print(____.____())