Building a bag of words model

1. Building a bag of words model

In this chapter, we will cover vectorization which is, as you may recall, the process of converting text into vectors.

2. Recap of data format for ML algorithms

Recall that for any ML algorithm to run properly, data fed into it must be in tabular form and all the training features must be numerical. This is clearly not the case for textual data. In this lesson, we will learn a technique called bag of words that converts text documents into vectors.

3. Bag of words model

The bag of words model is a procedure of extracting word tokens from a text document (henceforth, we will refer to this as just document), computing the frequency of these word tokens and constructing a word vector based on these frequencies and the vocabulary of the entire corpus of documents. This is best explained with the help of an example.

4. Bag of words model example

Consider a corpus of three documents. The lion is the king of the jungle. Lions have an average lifespan of 15 years. And, the lion is an endangered species.

5. Bag of words model example

We now extract the unique word tokens that occur in this corpus of documents. This will be the vocabulary of our model. In this example, the following 15 word tokens will constitute our vocabulary. Since there are 15 words in our vocabulary, our word vectors will have 15 dimensions and each dimension's value will correspond to the frequency of the word token corresponding to that dimension. For instance, the second dimension will correspond to the number of times the second word in the vocabulary, an, occurs in the document. Let's now convert our documents into word vectors using this bag of words model. The lion is the king of the jungle is converted to the following vector. Similarly, the other two sentences have the following word vector representations.

6. Text preprocessing

As we were constructing this model, you may have noticed how text preprocessing would have been extremely useful in creating arguably better models. We would usually want Lions and lion to mean the same thing and therefore, counted as the same thing. The same applies to 'the' with different cases. We would also want to remove punctuations and stopwords as they are extremely common and don't really contribute much to the character of the document. Performing text preprocessing usually leads to smaller vocabularies, which is a good thing. While working with vectorization, it is routine to form word vectors running into thousands of dimensions and keeping this to a minimum helps improve performance.

7. Bag of words model using sklearn

To construct the bag of words model in Python, we will use the scikit-learn library. We will use the corpus from before, consisting of the three sentences on lions. Let's ignore text preprocessing for now.

8. Bag of words model using sklearn

We import the CountVectorizer class from sklearn.feature_extraction.text. This is the class that will help us build our bag of words model. Next, we instantiate a CountVectorizer object vectorizer. We finally create our matrix of word vectors by passing corpus to the fit_transform method of vectorizer. This is stored in bow_matrix. This bow_matrix is a sparse matrix and we can print out its 2D array form using bow matrix dot toarray(). This gives us the following output. Notice how this is different from the word vectors we generated. This is because CountVectorizer automatically lowercases words and ignores single character tokens such as 'a'. Also, it doesn't necessarily index the vocabulary in alphabetical order. We will learn how to map the vocabulary to the indices in the exercises. We can now use this bow_matrix as our training features in ML models.

9. Let's practice!

We've covered a lot of theory in this lesson. Let us practice this in the exercises.

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.