
Bag-of-Words representation

1. Bag-of-Words representation

Before analyzing text with models, we need to convert it into numbers.

2. NLP workflow recap

We'll do this with feature extraction, the next step in our NLP workflow.

3. Bag-of-Words (BoW)

One foundational technique is Bag-of-Words, or BoW, where we represent text by counting how often each word appears. It's like throwing words into a bag and counting them: we ignore grammar and word order and focus only on frequency.

4. BoW example

Suppose we have a dataset of two sentences: "I love NLP" and "I love machine learning". To represent them using BoW,

5. BoW example

we first build a vocabulary of all the unique words across both sentences. In this case, that's five words: "I", "love", "NLP", "machine", and "learning".

6. BoW example

Next, for each sentence, we count how many times each word from the vocabulary appears. This gives us a list of numbers, called a vector, where each number represents the count of a specific word in the sentence. For the first sentence, the vector is [1, 1, 1, 0, 0]; for the second, it is [1, 1, 0, 1, 1].
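To make the counting concrete, here is a minimal sketch in plain Python; the sentence list and variable names are illustrative, not from the course code:

    from collections import Counter

    sentences = ["I love NLP", "I love machine learning"]

    # Build the vocabulary: every unique word, in first-seen order
    vocabulary = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocabulary:
                vocabulary.append(word)
    print(vocabulary)  # ['I', 'love', 'NLP', 'machine', 'learning']

    # Count how often each vocabulary word appears in each sentence
    for sentence in sentences:
        counts = Counter(sentence.split())
        print([counts[word] for word in vocabulary])
    # [1, 1, 1, 0, 0]
    # [1, 1, 0, 1, 1]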

7. BoW with code

Suppose we have a list of movie reviews stored as strings, and we want to represent them using a BoW model. We'll first clean the text using a function called preprocess() that applies lowercasing, tokenization, and punctuation removal, then joins the clean tokens with spaces to return a sentence. This function will be used throughout the course for basic text cleaning. We apply it to each review using a list comprehension, giving us a list of cleaned_reviews that are lowercased and punctuation-free.
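The course's actual helper isn't shown here, so the following is one plausible sketch of preprocess(); the review strings are made-up example data, and a real version might use a dedicated tokenizer such as NLTK's instead of a whitespace split:

    import string

    def preprocess(text):
        # Lowercase, tokenize on whitespace, and strip punctuation
        # (a simplified stand-in for the course's preprocess() helper)
        tokens = text.lower().split()
        tokens = [token.strip(string.punctuation) for token in tokens]
        # Drop tokens that were pure punctuation, then rejoin into a sentence
        return " ".join(token for token in tokens if token)

    reviews = ["The movie was great!", "It was long, but I loved it."]  # illustrative data
    cleaned_reviews = [preprocess(review) for review in reviews]
    print(cleaned_reviews)  # ['the movie was great', 'it was long but i loved it']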

8. BoW with code

To apply BoW, we use scikit-learn, also known as sklearn, a popular Python library for machine learning and data processing. First, we import CountVectorizer from sklearn.feature_extraction.text. Next, we initialize our vectorizer and call vectorizer.fit() to build the vocabulary from the cleaned_reviews. The vocabulary can be accessed with vectorizer.get_feature_names_out().
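In code, that step looks like this, continuing with the hypothetical cleaned_reviews from above:

    from sklearn.feature_extraction.text import CountVectorizer

    # Build the vocabulary from the cleaned reviews
    vectorizer = CountVectorizer()
    vectorizer.fit(cleaned_reviews)

    # Inspect the learned vocabulary
    print(vectorizer.get_feature_names_out())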

9. BoW output

To transform the cleaned_reviews, we use vectorizer.transform(). Alternatively, we can use vectorizer.fit_transform() to build the vocabulary and transform the data in one step. This gives us a sparse matrix called X. A sparse matrix is a table mostly filled with zeros, because most words from the vocabulary don't appear in every review. This format is memory-efficient and common in text data.
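Either route from the transcript might look like this in code:

    # Transform the cleaned reviews into a sparse count matrix
    X = vectorizer.transform(cleaned_reviews)

    # Or build the vocabulary and transform in a single step
    X = vectorizer.fit_transform(cleaned_reviews)
    print(X.shape)  # (number of reviews, vocabulary size)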

10. BoW output

X.toarray() converts the result to a regular NumPy array, where each row represents a review and each column represents a word from the vocabulary. The order of the columns matches the order of words returned by vectorizer.get_feature_names_out().
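For example:

    # Dense view: one row per review, one column per vocabulary word
    print(X.toarray())
    # Column order matches this word order
    print(vectorizer.get_feature_names_out())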

11. Word frequencies

To get a sense of the most frequent words in the dataset, we first sum the word counts across all reviews using np.sum() with axis=0. Then, we retrieve the word labels using vectorizer.get_feature_names_out(). We can visualize these counts with a bar plot using plt.bar(), passing in both words and word_counts and specifying a suitable title and labels for both axes. The most frequent words include stop words like "was", "the", and "it". This is expected, since we haven't removed stop words during preprocessing. If we're only interested in word frequencies, removing stop words helps. But for goals like capturing context or sentiment, we need a more advanced approach, which we'll explore in the next video.
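Putting those steps together; the title and axis labels here are placeholders, not the course's exact wording:

    import numpy as np
    import matplotlib.pyplot as plt

    # Total count of each word across all reviews
    word_counts = np.sum(X.toarray(), axis=0)
    words = vectorizer.get_feature_names_out()

    plt.bar(words, word_counts)
    plt.title("Word frequencies across reviews")
    plt.xlabel("Word")
    plt.ylabel("Count")
    plt.show()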

12. Let's practice!

Time for some practice!
