Building a BoW Naive Bayes classifier

1. Building a BoW Naive Bayes classifier

In this lesson, we will walk through a machine learning problem that uses the feature engineering techniques we've learned to arrive at a desired result.

2. Spam filtering

Let's take a look at the spam filtering problem. We're given a dataset of messages that have been labelled as spam or ham. Here, you can see a typical spam and ham message. Our task is to train an ML model that can predict the label given a particular text.

3. Steps

There are 3 steps involved. The first is to preprocess the text. Next, we proceed to build the bag-of-words model. Finally, we conduct predictive modeling using the generated BoW vectors. Note that although we use the term 'modeling' in the context of both BoW and machine learning, they mean two different things.

4. Text preprocessing using CountVectorizer

We've already learned how to conduct text preprocessing using spaCy. However, it is also possible to do this using CountVectorizer, which takes a number of arguments to perform preprocessing. The lowercase argument, when set to True, converts words to lowercase. The strip_accents argument can convert accented characters according to Unicode or ASCII mapping. Passing in a stop_words argument will lead to CountVectorizer ignoring stopwords. You can pass in a custom list or the string 'english' to use scikit-learn's list of English stopwords. You can specify tokenization using a regular expression as the value of the token_pattern argument. Tokenization can also be specified using a tokenizer argument. Here, you can pass a function that takes a string as an argument and returns a list of tokens. This way, CountVectorizer allows usage of spaCy's tokenization techniques. CountVectorizer cannot perform certain steps such as lemmatization automatically. This is where spaCy is useful. Although it performs tokenization and preprocessing, CountVectorizer's main job is to convert a corpus into a matrix of numerical vectors.
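To make these arguments concrete, here is a minimal sketch on two made-up sentences. The sentences and the whitespace_tokenizer helper are illustrative assumptions, not part of the lesson; the helper simply stands in for any function that takes a string and returns a list of tokens, such as one built on spaCy.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example sentences (not from the lesson's dataset).
docs = ["The café served GREAT coffee", "Coffee at the cafe was great"]

# Built-in preprocessing: lowercase, strip accents via ASCII mapping,
# and drop scikit-learn's English stopwords.
vectorizer = CountVectorizer(lowercase=True, strip_accents="ascii", stop_words="english")
bow = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_  # maps each token to its column index

# Hypothetical custom tokenizer: any callable taking a string and
# returning a list of tokens can be passed here.
def whitespace_tokenizer(text):
    return text.split()

# token_pattern is not used when a tokenizer is supplied, so it is set to None.
custom = CountVectorizer(tokenizer=whitespace_tokenizer, token_pattern=None, lowercase=False)
custom.fit(docs)
custom_vocab = custom.vocabulary_
```

In the first vectorizer, the accented and unaccented spellings of the word end up as one token, while the second keeps the all-capital and lowercase forms apart because lowercasing is switched off.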

5. Building the BoW model

As usual, we import CountVectorizer from scikit-learn. We then instantiate a CountVectorizer object called vectorizer. We perform accent stripping using ASCII mapping and remove English stopwords. We also set the lowercase argument to False. This is because spam messages tend to abuse all-capital words, and we might want to preserve this information for the ML step. The dataset has already been loaded into the dataframe df. We split this dataset into training and test sets using scikit-learn's train_test_split function.
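The setup just described might look roughly like the sketch below. The lesson's df is not shown in this transcript, so a small toy dataframe stands in for it; the column names message and label, the example rows, and the test_size and random_state values are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the lesson's df; column names and rows are assumptions.
df = pd.DataFrame({
    "message": [
        "WINNER!! Claim your FREE prize now",
        "Are we still meeting for lunch today",
        "URGENT! Your account has been selected",
        "Can you send me the report tonight",
        "FREE entry to win cash, text NOW",
        "Running late, be there in ten minutes",
        "Congratulations, you have WON a holiday",
        "Thanks for dinner, it was lovely",
    ],
    "label": ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"],
})

# Strip accents via ASCII mapping, remove English stopwords, keep original case.
vectorizer = CountVectorizer(strip_accents="ascii", stop_words="english", lowercase=False)

# Hold out a quarter of the messages for testing (split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"], test_size=0.25, random_state=42
)
```

One side effect worth knowing: scikit-learn's English stopword list is lowercase, so with lowercase set to False a capitalized token such as "The" is not matched against it and stays in the vocabulary.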

6. Building the BoW model

We now fit the vectorizer on the training set and transform it into its bag-of-words representation. We can perform both these steps together using the fit_transform method. Next, we transform the test set into its BoW representation. Note that we do not fit the vectorizer with the test data. It is possible that there are some words in the test data that are not in the vocabulary of the vectorizer. In such cases, CountVectorizer simply ignores these words.
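A small sketch of this fit-then-transform pattern, using made-up messages rather than the lesson's data, shows what happens to unseen words:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages (assumptions, not the lesson's dataset).
train_messages = ["WIN a FREE prize now", "see you at lunch"]
test_messages = ["FREE lottery tickets"]

vectorizer = CountVectorizer(strip_accents="ascii", stop_words="english", lowercase=False)

# Learn the vocabulary from the training set and vectorize it in one step.
X_train_bow = vectorizer.fit_transform(train_messages)

# Only transform the test set: the vocabulary stays fixed.
X_test_bow = vectorizer.transform(test_messages)

# Both matrices share the same columns, one per training-vocabulary word.
n_train_columns = X_train_bow.shape[1]
n_test_columns = X_test_bow.shape[1]

# Words never seen during fitting contribute nothing to the test vector.
test_token_count = X_test_bow.sum()
```

Here only the word shared with the training vocabulary is counted in the test vector; the two unseen words are dropped without raising an error.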

7. Training the Naive Bayes classifier

We're now in a good position to train an ML model. We will use the Multinomial Naive Bayes classifier for this task. We import the MultinomialNB class from scikit-learn and create an object named clf. We then fit clf on the training BoW vectors and their corresponding labels. We can now test the performance of our model. We compute the accuracy of the model on the test set using clf.score. In this case, our model registered an accuracy of 76% on the test set.
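Putting the training step together, a minimal end-to-end sketch could look like this. The messages and labels are made up for illustration, so the accuracy it produces says nothing about the figure quoted for the lesson's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training and test messages (assumptions, not the lesson's data).
train_texts = [
    "WINNER!! Claim your FREE prize now",
    "FREE entry to win cash, text NOW",
    "URGENT! You have WON a holiday",
    "Are we still meeting for lunch today",
    "Can you send me the report tonight",
    "Thanks for dinner, it was lovely",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["Claim your FREE cash prize", "See you at dinner tonight"]
test_labels = ["spam", "ham"]

# Same vectorizer settings as described in the lesson.
vectorizer = CountVectorizer(strip_accents="ascii", stop_words="english", lowercase=False)
X_train_bow = vectorizer.fit_transform(train_texts)
X_test_bow = vectorizer.transform(test_texts)

# Fit Multinomial Naive Bayes on the training BoW vectors and labels.
clf = MultinomialNB()
clf.fit(X_train_bow, train_labels)

# score returns the mean accuracy on the given test vectors and labels.
accuracy = clf.score(X_test_bow, test_labels)
predictions = clf.predict(X_test_bow)
```

Multinomial Naive Bayes is a natural fit here because it models word counts, which is exactly what the BoW vectors contain.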

8. Let's practice!

We've covered a lot of ground in building a spam filter in this lesson. In the exercises, we will follow similar steps to perform sentiment analysis on movie reviews. Let's practice!
