Bag-of-words

1. Bag-of-words

Welcome to the next chapter of this course! We proceed on our journey by embarking on the first step in performing a sentiment analysis task: transforming our text data to numeric form. Why do we need to do that? A machine learning model cannot work with the text data directly, but rather with numeric features we create from the data.

2. What is a bag-of-words (BOW) ?

We start with a basic and crude, but often quite useful method, called bag-of-words (BOW). A bag-of-words approach describes the occurrence, or frequency, of words within a document, or a collection of documents (called corpus). It basically comes down to building a vocabulary of all the words occurring in the document and keeping track of their frequencies.

3. Amazon product reviews

Before we continue with the discussion of BOW, we will introduce the data we will use throughout the chapter, namely reviews of Amazon products. The dataset consists of two columns: the first contains the score, which is 1 if positive and 0 if negative; The second column contains the actual review of the product.

4. Sentiment analysis with BOW: Example

Let’s see how BOW would work applied to an example review. Imagine you have the following string: "This is the best book ever. I loved the book and highly recommend it." The goal of a BOW approach would be to build the following dictionary-like output: 'This', occurs once in our string, so it has a count of 1, 'is' occurs once, 'the' occurs two times and so on. One thing to note is that we lose the word order and grammar rules, that’s why this approach is called a ‘bag’ of words, resembling dropping a bunch of items in a bag and losing any sense of their order. This sounds straightforward but sometimes deciding how to build the vocabulary can be complex. We discuss some of the trade-offs we need to consider in later chapters.

5. BOW end result

When we transform the text column with a BOW, the end result looks something like the table that we see: where the column is the word (also called token), and the row represents how many times we have encountered it in the respective review.

6. CountVectorizer function

How do we execute a BOW process in Python? The simplest way to do this is by using the CountVectorizer from the text library in the sklearn.feature_extraction submodule. In Python, we import the CountVectorizer() from sklearn.feature_extraction.text. In the CountVectorizer function, for the moment we leave the default functional options, except for the max_features argument, which only considers the features with highest term frequency, i.e. it will pick the 1000 most frequent words across the corpus of reviews. We need to do that sometimes for memory’s sake. We use the `fit()` method from the CountVectorizer, calling fit() on our text column. To create a BOW representation, we call the transform() method, applied again to our text column.

7. CountVectorizer output

The result is a sparse matrix. A sparse matrix only stores entities that are non-zero, where the rows correspond to the number of rows in the dataset, and the columns to the BOW vocabulary.

8. Transforming the vectorizer

To look at the actual contents of a sparse matrix, we need to perform an additional step to transform it back to a 'dense' NumPy array, using the .toarray() method. We can build a pandas DataFrame from the array, where the columns' names are obtained from the `.get_feature_names()` method of the vectorizer. This returns a list where every entry corresponds to one feature.

9. Let's practice!

That was our introduction to BOW! Let's apply what we've learned in the exercises.

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.