Applying Naive Bayes to other problems

1. Applying Naive Bayes to other problems

Smartphone destination suggestions are a very specific application of Naive Bayes. But the algorithm can also be used for many other types of problems. Naive Bayes tends to work well on problems where the information from multiple attributes needs to be considered simultaneously and evaluated as a whole. The process is somewhat analogous to how a medical doctor might evaluate symptoms and test results to make a final diagnosis. Historically, Naive Bayes has also been frequently used for classifying text data, like identifying whether or not an email is spam. In this video, I'll present some of the challenges you may encounter when applying the algorithm to other classification tasks.

2. How Naive Bayes uses data

Recall that Naive Bayes makes predictions by computing conditional probabilities of events and outcomes. Beginning with a tabular dataset, it builds frequency tables that count the number of times each event overlaps with the outcome of interest. The probabilities are then multiplied, naively, in a chain of all the events; "naively" here means the events are treated as if they were independent of one another given the outcome. A consequence of this approach is that each of the predictors used in Naive Bayes typically comprises a set of categories. Numeric data, like age or time of day, are difficult for Naive Bayes to use as-is without knowing more about how the values are distributed. Similarly, unstructured text data does not arrive divided into categories. Thus, it is generally necessary to prepare these types of data before using them with Naive Bayes.
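
To make the frequency-table idea concrete, here is a minimal base R sketch. The `locations` data frame, its column names, and its values are invented purely for illustration; they are not the course's dataset. The sketch counts how often each event overlaps with each outcome, converts the counts to conditional probabilities, and multiplies them in a chain for a weekday morning.

```r
# Hypothetical toy data (an assumption for illustration): every column is categorical
locations <- data.frame(
  daytype   = c("weekday", "weekday", "weekday", "weekend",
                "weekend", "weekday", "weekend", "weekday"),
  timeofday = c("morning", "afternoon", "evening", "morning",
                "evening", "morning", "afternoon", "evening"),
  location  = c("office", "office", "home", "home",
                "home", "office", "home", "home"),
  stringsAsFactors = TRUE
)

# Frequency tables: rows are outcomes, columns are events
daytype_counts <- table(locations$location, locations$daytype)
time_counts    <- table(locations$location, locations$timeofday)

# Row-wise proportions give P(event | outcome)
p_daytype <- prop.table(daytype_counts, margin = 1)
p_time    <- prop.table(time_counts, margin = 1)

# Prior probability of each outcome, P(outcome)
prior <- prop.table(table(locations$location))

# The "naive" chain: multiply the prior by each conditional probability,
# treating the events as independent given the outcome
score <- prior * p_daytype[, "weekday"] * p_time[, "morning"]

# Rescale so the scores for all outcomes sum to 1
posterior <- score / sum(score)
print(posterior)
```

With this toy data, a weekday morning comes out as far more likely to be spent at the office than at home. Note that every predictor had to be a category for `table()` to count overlaps, which is why numeric and text data need preparation first.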

3. Binning numeric data for Naive Bayes

A technique called binning is a simple method for creating categories from numeric data. The idea is to divide a range of numbers into a series of sets called "bins." For instance, you might divide a set of numbers into bins based on percentiles by creating a category for the bottom 25%, the next 25%, and so on. Perhaps a better approach is to group ranges of values into meaningful bins. For instance, you might group times into categories like afternoon and evening, and temperature readings into categories like hot, warm, and cold. You can use R's data preparation functions to recode data this way.
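
Both binning styles can be sketched with base R's `cut()` function. The `ages` and `hours` vectors, the bin labels, and the hour cutoffs below are assumptions chosen for illustration; sensible cutoffs depend on your own data.

```r
# Hypothetical numeric readings (assumptions for illustration)
ages  <- c(19, 23, 31, 35, 42, 47, 58, 64)
hours <- c(2, 7, 9, 13, 16, 19, 22)   # hour of day, 0-23

# Percentile-based bins: use the quartiles as breakpoints
age_breaks <- quantile(ages, probs = seq(0, 1, by = 0.25))
age_bin <- cut(ages, breaks = age_breaks, include.lowest = TRUE,
               labels = c("q1", "q2", "q3", "q4"))

# Meaningful bins: map hours to named parts of the day.
# right = FALSE makes each interval closed on the left, e.g. [12, 17)
time_bin <- cut(hours, breaks = c(0, 6, 12, 17, 24), right = FALSE,
                labels = c("night", "morning", "afternoon", "evening"))

# Each result is a factor, which Naive Bayes can treat as a set of categories
print(table(age_bin))
print(data.frame(hours, time_bin))
```

Quartile bins always hold roughly equal numbers of observations, while the hand-picked time bins carry meaning a reader can interpret directly. Which is preferable depends on the problem.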

4. Preparing text data for Naive Bayes

Text documents are considered unstructured data because they do not conform to the typical table or spreadsheet format of most datasets. A common process for adding structure to text data uses a model called bag-of-words. The bag-of-words model does not consider word order, grammar, or semantics. It simply creates an event for each word that appears in a particular collection of text documents. For example, the bag-of-words for this document on Naive Bayes would include events for words like "naive," "bayes," and "understanding." In spreadsheet form, this results in a wide table where the rows are documents and the columns are words that may appear in the documents. Each spreadsheet cell indicates whether or not the word appeared in that document. When the Naive Bayes algorithm is applied to the bag-of-words, it can estimate the probability of the outcome given the evidence provided by the words in the text. For instance, a document with the words "viagra" and "prescription" is more likely to be spam than a document with the words "naive" and "bayes." Naive Bayes models trained on bag-of-words data can be very effective text classifiers.
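
The wide documents-by-words table can be sketched in a few lines of base R. This is a simplified stand-in rather than how the tm package does it, and the four example documents and their names are invented for illustration.

```r
# Hypothetical mini-corpus (an assumption for illustration)
docs <- c(
  spam1 = "Cheap prescription viagra available now",
  spam2 = "Viagra prescription without a doctor",
  ham1  = "Naive Bayes is a simple classifier",
  ham2  = "Understanding Naive Bayes for text"
)

# Lowercase, drop anything that is not a letter or space, split on spaces.
# Word order and grammar are discarded from here on.
tokens <- strsplit(gsub("[^a-z ]", "", tolower(docs)), " +")

# The vocabulary: one event (column) per distinct word in the collection
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per word;
# each cell records whether the word appeared in that document
bow <- t(vapply(tokens, function(words) vocab %in% words,
                logical(length(vocab))))
rownames(bow) <- names(docs)
colnames(bow) <- vocab

print(bow[, c("bayes", "naive", "prescription", "viagra")])
```

Each column of `bow` is now a yes/no category, so it can be counted against the spam or non-spam outcome just like any other categorical predictor.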

5. Let's practice!

You can learn more about this in DataCamp's course on text mining, which will teach you how to apply R's tm package to build datasets you can use with Naive Bayes.
