Building a Counter with bag-of-words

In this exercise, you'll build your first (in this course) bag-of-words counter using a Wikipedia article, which has been pre-loaded as article. Try doing the bag-of-words without looking at the full article text, and guessing what the topic is! If you'd like to peek at the title at the end, we've included it as article_title. Note that this article text has had very little preprocessing from the raw Wikipedia database entry.

word_tokenize has been imported for you.
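Before you start, here is a minimal sketch of the same idea on a made-up sentence. The sentence and variable names are just for illustration, and it assumes nltk and its punkt tokenizer data are installed; the exercise below follows the same pattern on the pre-loaded article instead.

# Minimal bag-of-words sketch on a toy sentence (illustrative only)
from collections import Counter
from nltk.tokenize import word_tokenize

sample_text = "The cat saw the dog. The dog saw the cat."
sample_tokens = word_tokenize(sample_text)          # split into word and punctuation tokens
sample_lower = [t.lower() for t in sample_tokens]   # lowercase so "The" and "the" count as one token
print(Counter(sample_lower).most_common(3))
# e.g. [('the', 4), ('cat', 2), ('saw', 2)] -- ties among count-2 tokens may appear in any order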
This exercise is part of the course Introduction to Natural Language Processing in Python.
Exercise instructions
- Import Counter from collections.
- Use word_tokenize() to split the article into tokens.
- Use a list comprehension with t as the iterator variable to convert all the tokens into lowercase. The .lower() method converts text into lowercase.
- Create a bag-of-words counter called bow_simple by using Counter() with lower_tokens as the argument.
- Use the .most_common() method of bow_simple to print the 10 most common tokens.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import Counter
____
# Tokenize the article: tokens
tokens = ____
# Convert the tokens into lowercase: lower_tokens
lower_tokens = [____ for ____ in ____]
# Create a Counter with the lowercase tokens: bow_simple
bow_simple = ____
# Print the 10 most common tokens
print(____)
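For reference, here is one way the completed scaffold might look, assuming article is the pre-loaded string described above. word_tokenize is already imported in the exercise environment; its import is repeated here only so the snippet is self-contained.

# Import Counter
from collections import Counter
from nltk.tokenize import word_tokenize  # already imported for you in the exercise

# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))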