
Word frequency analysis

Congratulations! You've just joined PyBooks, a company developing a book recommendation system. PyBooks wants to find patterns and trends in text to improve its recommendations.

To begin, you'll want to understand the frequency of words in a given text and remove any rare words.

Note that typical real-world datasets will be larger than this example.

This exercise is part of the course "Deep Learning for Text with PyTorch".

Exercise instructions

  • Import get_tokenizer from torchtext.data.utils and FreqDist from nltk.probability.
  • Initialize the tokenizer for English and tokenize the given text.
  • Calculate the frequency distribution of the tokens and remove rare words using list comprehension.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import the necessary functions
from torchtext.data.utils import ____
from nltk.probability import ____

text = "In the city of Dataville, a data analyst named Alex explores hidden insights within vast data. With determination, Alex uncovers patterns, cleanses the data, and unlocks innovation. Join this adventure to unleash the power of data-driven decisions."

# Initialize the tokenizer and tokenize the text
tokenizer = ____("basic_english")
tokens = tokenizer(____)

threshold = 1
# Remove rare words and print common tokens
freq_dist = ____(____)
common_tokens = [token for token in tokens if ____[token] > ____]
print(common_tokens)
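If you want to check the filtering logic outside the exercise environment, the same idea can be sketched with only the standard library: collections.Counter stands in for nltk's FreqDist, and a small regex tokenizer (an assumption, simplified from torchtext's "basic_english" behavior of lowercasing and splitting off punctuation) stands in for get_tokenizer.

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Simplified stand-in for torchtext's "basic_english" tokenizer:
    # lowercase the text and keep only alphabetic word runs
    return re.findall(r"[a-z]+", text.lower())

text = ("In the city of Dataville, a data analyst named Alex explores hidden "
        "insights within vast data. With determination, Alex uncovers patterns, "
        "cleanses the data, and unlocks innovation. Join this adventure to "
        "unleash the power of data-driven decisions.")

tokens = simple_tokenize(text)

# Counter plays the role of FreqDist: it maps each token to its count
freq_dist = Counter(tokens)

# Keep only tokens that occur more than `threshold` times
threshold = 1
common_tokens = [token for token in tokens if freq_dist[token] > threshold]
print(common_tokens)
```

With this text, frequent filler words such as "the" and the repeated "data" survive the filter, while one-off words like "city" are removed, which is exactly the behavior the exercise asks you to implement with FreqDist.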