Exercise

Filtering by word frequency

The small size of our corpus poses a problem: some terms occur only once and are not useful for inferring topics. In this exercise, your task is to remove words whose corpus-wide frequency is less than 10. This requires grouping by word and then summing the per-document frequencies.

Unnesting tokens and removing stopwords using anti_join() has already been done for you.
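For reference, a minimal sketch of that preprocessing, assuming the raw data is a data frame named awards with id and text columns (both names are placeholders, not fixed by the exercise):

    library(dplyr)
    library(tidytext)

    # Hypothetical input: one row per award, with id and text columns
    tidy_awards <- awards %>%
      unnest_tokens(word, text) %>%          # one word per row
      anti_join(stop_words, by = "word")     # drop common stop words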

Instructions
  • Count word occurrences within each document (award).
  • Group the data using word as the grouping variable.
  • Filter with a nested call to sum(n), keeping words whose corpus-wide frequency is 10 or higher.
  • Ungroup the data and create a document-term matrix (see the sketch after this list).
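
Putting these steps together, a sketch of the full pipeline, assuming the tidy_awards data frame and id column from the sketch above; cast_dtm() from tidytext builds the document-term matrix:

    library(dplyr)
    library(tidytext)

    dtm <- tidy_awards %>%
      count(id, word) %>%        # per-document word counts in column n
      group_by(word) %>%         # group across documents by word
      filter(sum(n) >= 10) %>%   # keep words seen at least 10 times corpus-wide
      ungroup() %>%
      cast_dtm(id, word, n)      # document-term matrix for topic modeling

Because filter() evaluates per group after group_by(word), sum(n) totals each word's count over every document, so a word is kept or dropped across the whole corpus at once.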