Word frequency analysis
Congratulations! You've just joined PyBooks, a company developing a book recommendation system that wants to find patterns and trends in text to improve its recommendations.
To begin, you'll want to understand the frequency of words in a given text and remove any rare words.
Note that typical real-world datasets will be larger than this example.
This exercise is part of the course Deep Learning for Text with PyTorch.
Exercise instructions
- Import get_tokenizer from torchtext and FreqDist from the nltk library.
- Initialize the tokenizer for English and tokenize the given text.
- Calculate the frequency distribution of the tokens and remove rare words using a list comprehension.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the necessary functions
from torchtext.data.utils import ____
from nltk.probability import ____
text = "In the city of Dataville, a data analyst named Alex explores hidden insights within vast data. With determination, Alex uncovers patterns, cleanses the data, and unlocks innovation. Join this adventure to unleash the power of data-driven decisions."
# Initialize the tokenizer and tokenize the text
tokenizer = ____("basic_english")
tokens = tokenizer(____)
threshold = 1

# Calculate the frequency distribution of the tokens
freq_dist = ____(____)

# Remove rare words and print common tokens
common_tokens = [token for token in tokens if ____[token] > ____]
print(common_tokens)
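If you want to check the filtering logic without torchtext or nltk installed, the same steps can be sketched with the standard library alone: a small regex tokenizer stands in for the "basic_english" tokenizer, and collections.Counter stands in for FreqDist (both map each token to its count). The tokenizer and its name here are illustrative stand-ins, not the course's API:

```python
import re
from collections import Counter

text = ("In the city of Dataville, a data analyst named Alex explores hidden "
        "insights within vast data. With determination, Alex uncovers patterns, "
        "cleanses the data, and unlocks innovation. Join this adventure to "
        "unleash the power of data-driven decisions.")

# Rough stand-in for torchtext's "basic_english" tokenizer:
# lowercase the text and pull out runs of word characters.
def simple_tokenizer(s):
    return re.findall(r"[a-z0-9']+", s.lower())

tokens = simple_tokenizer(text)

# Counter plays the role of nltk's FreqDist: token -> frequency.
freq_dist = Counter(tokens)

# Keep only tokens whose frequency exceeds the threshold,
# dropping rare words that appear once.
threshold = 1
common_tokens = [token for token in tokens if freq_dist[token] > threshold]
print(common_tokens)
```

Note that the comprehension iterates over the original token stream, so common tokens keep their order and their duplicates; deduplicate with `set(common_tokens)` if you only need the vocabulary.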