Exercise

Size of vocabulary of movie reviews

In this exercise, you will practice different ways to limit the size of the vocabulary using a sample of the movie reviews dataset. The first column is the review, which is of type object, and the second column is the label, which is 0 for a negative review and 1 for a positive one.

The three methods you will use transform the text column into new numeric columns, each capturing the count of a word or phrase in each review. Each method ultimately builds a different number of new features.
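
To make the idea of "count columns" concrete, here is a minimal sketch using only the Python standard library. The two reviews are hypothetical toy strings, not rows from the movies dataset, and splitting on whitespace is a simplifying assumption about tokenization.

```python
from collections import Counter

# Hypothetical toy reviews, not taken from the movies dataset.
reviews = [
    "a great great movie",
    "a boring movie",
]

# Vocabulary: every distinct word across all reviews, sorted for a stable column order.
vocabulary = sorted({word for review in reviews for word in review.split()})

# One row per review, one column per vocabulary term, holding that term's count.
counts = [
    [Counter(review.split())[term] for term in vocabulary]
    for review in reviews
]

print(vocabulary)  # ['a', 'boring', 'great', 'movie']
print(counts)      # [[1, 0, 2, 1], [1, 1, 0, 1]]
```

Every distinct term becomes a column, so the number of features grows with the vocabulary. That is why limiting the vocabulary changes how many features are built.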

Instructions 1/3
  • 1

    Using the movies dataset, limit the size of the vocabulary to 100.

  • 2

    Using the movies dataset, limit the vocabulary to terms that occur in no more than 200 documents.

  • 3

    Using the movies dataset, limit the vocabulary by ignoring terms that occur in fewer than 50 documents.