
Word counts with bag-of-words

1. Word counts with bag-of-words

Welcome to chapter two! We'll begin by using word counts with a bag-of-words approach.

2. Bag-of-words

Bag of words is a simple, basic method for finding topics in a text. You first create tokens using tokenization, and then count up all the tokens you have. The theory is that the more frequent a word or token is, the more central or important it might be to the text. Bag of words can be a great way to determine the significant words in a text based on the number of times they are used.
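As a rough sketch of the idea, the counting step can be written in a few lines of plain Python. The sample text and the regex-based tokenizer are illustrative assumptions here, not the course's example or the NLTK tokenizer from chapter one.

```python
import re

# Illustrative text (an assumption, not taken from the course slides).
text = "The cat is in the box. The cat likes the box."

# Stand-in tokenizer: grab runs of word characters, which also drops punctuation.
tokens = re.findall(r"\w+", text)

# Count how many times each token appears.
bag = {}
for token in tokens:
    bag[token] = bag.get(token, 0) + 1

print(bag)
```

The resulting dictionary maps each token to its frequency, which is all a basic bag of words is.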

3. Bag-of-words example

Here we see an example series of sentences, mainly about a cat and a box. If we just use a simple bag of words model with tokenization, like we learned in chapter one, and remove the punctuation, we can see the example result. Box, cat, The, and the are some of the most important words because they are the most frequent. Notice that the word "the" appears twice in the bag of words, once capitalized and once lowercase. If we added a preprocessing step to handle this issue, we could lowercase all of the words in the text so that each word appears only once in the bag.
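To make the casing issue concrete, here is a minimal sketch that counts the same text with and without lowercasing. The sentences, the `count_tokens` helper, and the regex tokenizer are all hypothetical stand-ins chosen for illustration.

```python
import re

# Illustrative sentences about a cat and a box (an assumption, not the slide's exact text).
text = "The cat is in the box. The cat likes the box. The box is over the cat."

def count_tokens(tokens):
    """Return a dict mapping each token to its number of occurrences."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts

# Without preprocessing, "The" and "the" are separate entries.
raw_counts = count_tokens(re.findall(r"\w+", text))

# Lowercasing first merges them into a single entry.
lower_counts = count_tokens(re.findall(r"\w+", text.lower()))

print(raw_counts["The"], raw_counts["the"])
print(lower_counts["the"])
```

After lowercasing, the two separate counts for "The" and "the" collapse into one larger count, so the bag has one fewer entry.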

4. Bag-of-words in Python

We can use the NLP fundamentals we already know, such as tokenization with NLTK, to create a list of tokens. We will use a new class called Counter, which we import from the standard library module collections. The list of tokens generated using word_tokenize can be passed as the initialization argument for the Counter class. The result is a Counter object, which has a similar structure to a dictionary and allows us to see each token and its frequency. Counter objects also have a method called `most_common`, which takes an integer argument, such as 2, and returns the top 2 tokens in terms of frequency. The return value is a list of tuples. For each tuple, the first element holds the token and the second element holds the frequency. Note: other than ordering by token frequency, the `most_common` method does not sort the tokens it returns, nor does it tell us whether there are more tokens with that same frequency.
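The steps described here might look roughly like the following sketch. The sample text is illustrative, and the `tokenize` wrapper with its regex fallback is an assumption added so the example still runs if NLTK or its tokenizer data is not installed; it is not part of the course material.

```python
import re
from collections import Counter

def tokenize(text):
    """Tokenize with NLTK's word_tokenize; fall back to a regex if NLTK or its data is missing."""
    try:
        from nltk.tokenize import word_tokenize
        return word_tokenize(text)
    except (ImportError, LookupError):
        # Fallback (an assumption for portability): words and single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", text)

# Illustrative text (not the course's exact example).
text = "The cat is in the box. The cat likes the box. The box is over the cat."

# Pass the list of tokens to Counter to get token frequencies.
counter = Counter(tokenize(text))

# Look up a token like a dictionary key.
print(counter["box"])

# The two most frequent tokens, as a list of (token, frequency) tuples.
print(counter.most_common(2))
```

In this text several tokens share the highest count, yet `most_common(2)` returns only two of them, which illustrates the caveat above: the result alone does not reveal that other tokens are tied at the same frequency.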

5. Let's practice!

Now you know a bit about bag of words and can get started building your own using Python.