
Latent Dirichlet allocation

1. Latent Dirichlet allocation

We've seen how word counts and visualizations suggest something about content. We've also used sentiment dictionaries to learn about the emotional valence of a document. We'll now move beyond word counts to uncover the underlying topics in a collection of documents. We will be using a standard topic model known as latent Dirichlet allocation.

2. Unsupervised learning

The latent Dirichlet allocation topic model, or LDA, searches for patterns of words occurring together within and across a collection of documents, also known as a corpus. Imagine again our bag-of-words. LDA finds topics in a corpus by creating a separate bag for each document and dumping the words out to look for patterns in which words appear together -- not just in one bag, but consistently across all the document-specific bags in our corpus. Note that LDA isn't trying to explain or predict a dependent variable, as in a regression. Rather, it's looking for patterns within a group of explanatory variables. This kind of pattern search is known as unsupervised learning.
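The idea above can be sketched in a few lines of Python. This is a minimal, illustrative example using scikit-learn's `LatentDirichletAllocation` on a made-up four-review corpus standing in for the product review data; the review texts and the choice of two topics are assumptions for the sketch, not the course's actual data.

```python
# Minimal sketch: fitting a two-topic LDA model with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A tiny, made-up corpus standing in for the product reviews.
corpus = [
    "the vacuum picks up pet hair on hardwood floors",
    "vacuum runs every day and the floors stay clean",
    "emptying the bin after every cleaning takes time",
    "cleaning the bin is the worst part of the vacuum",
]

# LDA works on discrete word counts, so first build a document-term matrix.
counts = CountVectorizer().fit_transform(corpus)

# n_components is the number of topics we ask the model to find.
lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(counts)

# Each document gets a probability distribution over the two topics.
doc_topics = lda.transform(counts)
print(doc_topics.shape)  # (4, 2): four documents, two topic shares each
```

Note that there is no outcome variable anywhere in this code -- the model is fit on the word counts alone, which is exactly what makes it unsupervised.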

3. Word probabilities

Each topic is a list of all the words in the corpus, often referred to as a dictionary, with a probability of each word appearing in that topic. Words that frequently appear together will have high probabilities in one or more topics. For example, here we have two topics for our product review data. The first topic is a collection of words that appear to be connected with the performance of a robotic vacuum, with words like hair, vacuum, floors, house, and day occurring together with high probability. The second topic is a collection of words that appear to be connected with possible frustrations of a robotic vacuum, with words like clean, cleaning, bin, and time occurring together with high probability.

4. Clustering vs. topic modeling

Because topic modeling is a type of unsupervised learning, it naturally draws comparisons with other unsupervised techniques. In particular, topic modeling is often compared to clustering, which is common in market segmentation and a host of other applications. It's worth taking a moment to distinguish these two techniques. Common clustering techniques like k-means and hierarchical clustering are based on the distance between objects, which is a continuous measure, and each object being clustered is assigned to a single cluster. Topic models like LDA are based on word counts, which is a discrete measure, and each object (in this case, a document within a corpus) is a mixture or partial member of every topic.
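The hard-assignment versus mixture distinction is easy to see side by side. A sketch on the same assumed toy document-term matrix: k-means returns one cluster label per document, while LDA returns a vector of topic shares per document.

```python
# Sketch contrasting k-means' hard cluster labels with LDA's soft
# topic mixtures, fit on the same toy document-term matrix.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

# Made-up mini-corpus standing in for the product reviews.
corpus = [
    "the vacuum picks up pet hair on hardwood floors",
    "vacuum runs every day and the floors stay clean",
    "emptying the bin after every cleaning takes time",
    "cleaning the bin is the worst part of the vacuum",
]
counts = CountVectorizer().fit_transform(corpus)

# k-means: each document lands in exactly one cluster (a hard label).
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(counts)
print(labels)

# LDA: each document is a mixture; its topic shares sum to 1.
mix = LatentDirichletAllocation(n_components=2, random_state=42).fit_transform(counts)
print(mix)
```

The output shapes tell the story: `labels` is one integer per document, while each row of `mix` spreads a document's membership across every topic.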

5. Let's practice!

Let's build some more intuition for what topics are by exploring the output of a latent Dirichlet allocation model.
