Exercise

Instantiate the TF-IDF model

By default, TF-IDF generates a column for every word that appears anywhere in your documents (movie summaries in our case). The result is a very large, hard-to-interpret dataset: it contains both very common words that appear in nearly every document and words so rare that they provide no value in finding similarities between items.

In this exercise, you will work with the df_plots DataFrame. It contains movies' names in the Title column and their plots in the Plot column.

Using this DataFrame, you will generate the default TF-IDF scores and check whether such low-value columns are present.

You will then rerun the TF-IDF calculations, this time limiting the number of columns with the min_df and max_df arguments, and check whether the resulting features improve.

Instructions 1/2
  • Create a TfidfVectorizer and call it vectorizer.
  • Use vectorizer to fit and transform the data in the Plot column of df_plots and assign the output to vectorized_data.
  • Inspect the features that have been generated by the transformation.