Instantiate the TF-IDF model
By default, TF-IDF generates a column for every distinct word that appears anywhere in your documents (movie summaries in our case). This creates a huge and unintuitive dataset, because it contains both very common words that appear in nearly every document and words so rare that they provide no value in finding similarities between items.
In this exercise, you will work with the df_plots DataFrame. It contains movies' names in the Title column and their plots in the Plot column.
Using this DataFrame, you will generate the default TF-IDF scores and see if non-valuable columns are present.
You will go on to rerun the TF-IDF calculations, this time limiting the number of columns using the min_df and max_df arguments, and hopefully see the improvement.
This exercise is part of the course
Building Recommendation Engines in Python
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from sklearn.feature_extraction.text import TfidfVectorizer
# Instantiate the vectorizer and assign it to the vectorizer variable
vectorizer = ____()
# Fit and transform the plot column
vectorized_data = vectorizer.____(df_plots['Plot'])
# Look at the features generated
print(____.____())