
LDA in practice

1. LDA in practice

Welcome back! So we can place words into topics, and even describe documents as a collection of these topics using LDA, but how do we use these results?

2. Finalizing LDA results

After your LDA preparation is completed, that is, once you have collected, cleaned, and prepared your data, you need to select the number of topics. LDA will create the topics for you, but it won't tell you how many to create. We will cover a metric for this process called perplexity, which we can use when selecting a larger number of topics. However, we will focus on selecting the number of topics based on your situation.

3. Perplexity

Perplexity is a measure of how well a probability model fits new data. Lower is better, and perplexity is often used to compare models, so we will use perplexity to select how many topics should be in our LDA model. Our first step is to create a train/test split of the data, just as we did for classification modeling. We must assess perplexity on the testing dataset to make sure our topics also extend to new data.

4. Perplexity in R

Next, we create LDA models. For each K, we train a model, calculate the perplexity score using the perplexity function from the topicmodels package, and save the perplexity score to a vector so we can compare each K. We can plot these values, with the number of topics as our X and the perplexity value as the Y.
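The steps above can be sketched as follows. This is illustrative only: it assumes you already have a DocumentTermMatrix called `dtm`, and names like `train_dtm` and `perplexities` are placeholders, not from the course.

```r
# A minimal sketch of the perplexity comparison, assuming `dtm` exists.
library(topicmodels)

set.seed(42)
train_idx <- sample(nrow(dtm), size = 0.75 * nrow(dtm))
train_dtm <- dtm[train_idx, ]   # training split
test_dtm  <- dtm[-train_idx, ]  # held-out split for scoring

ks <- 2:20                       # candidate numbers of topics
perplexities <- numeric(length(ks))

for (i in seq_along(ks)) {
  lda_model <- LDA(train_dtm, k = ks[i], control = list(seed = 42))
  # Score on held-out data so the topics must generalize, not just fit
  perplexities[i] <- perplexity(lda_model, newdata = test_dtm)
}

# Number of topics on X, perplexity on Y; look for where the curve flattens
plot(ks, perplexities, type = "b",
     xlab = "Number of topics (K)", ylab = "Perplexity")
```

Fixing the seed in both the split and each LDA fit keeps the comparison across K values reproducible.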

5. Perplexity again!

Notice that between 12 and 15 topics, our perplexity score improves very little with the addition of more topics. We gain no real value by using more than 15 topics. Based on this plot, we should use a model with 10 to 15 topics.

6. Practical selection

LDA is often more about practical use than it is about selecting the optimal number of topics based on perplexity. A collection of articles might comprise 15-20 topics, but describing 20 topics to an audience, or even a boss, might not be feasible. Similarly, visuals with only 4-5 topics are much easier to comprehend than graphics with 100 topics. A good rule of thumb is to go with a smaller number of topics where each topic is represented by a large number of documents. If time allows, and exploring a large number of smaller topics is possible, then topic models with more than 10 topics can be used.

7. Using results

When using a small number of topics, it is common to have a subject matter expert review the words of the topics and some of the articles most aligned with each topic to provide a theme for each topic. Let's look at how to extract this information.

8. Review output

To start, review the top words for each topic. Let's assume we created an LDA model on news articles. The beta matrix contains the beta values for each word after performing LDA. If we explore the words for topic 1, the words used look like they are describing athletes. To confirm this, we can also review the actual articles themselves. In practice, we would have reviewers skim the top articles and finalize a theme for topic 1. We would repeat this for each topic. Hopefully this clarifies why analyzing 30 or 40 topics would be difficult.
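The review step above can be sketched like this. It assumes a fitted model named `lda_model` (a placeholder) and uses the terms function from topicmodels plus the tidy beta output from tidytext:

```r
# Illustrative sketch: `lda_model` is an assumed LDA fit from topicmodels.
library(topicmodels)
library(tidytext)
library(dplyr)

# Quick look: the 10 highest-probability words for every topic
terms(lda_model, 10)

# The beta matrix in tidy form: one row per (topic, term) with its beta value.
# Here we inspect the top words for topic 1 to look for a theme.
tidy(lda_model, matrix = "beta") %>%
  filter(topic == 1) %>%
  arrange(desc(beta)) %>%
  head(10)
```

If the top words for topic 1 look like they describe athletes, reviewers would then skim the articles with the highest topic-1 weights to confirm and name the theme.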

9. Summarize output

Before we end this lesson, I want to provide you all with a couple of ways to summarize your output. First, a quick way to count how many times each topic was the highest weighted topic. We group by, arrange, slice, group_by again, and then tally the totals. Here, topic 1 was the top topic for 1,326 documents.
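That group-by, arrange, slice, group-by, tally chain can be sketched as below, assuming a fitted model `lda_model` (a placeholder name) and the tidy gamma output, where gamma is each document's weight on each topic:

```r
# Sketch: count how many times each topic was a document's top topic.
# `lda_model` is an assumed LDA fit from topicmodels.
library(tidytext)
library(dplyr)

tidy(lda_model, matrix = "gamma") %>%
  group_by(document) %>%
  arrange(desc(gamma)) %>%
  slice(1) %>%            # keep only each document's highest-weighted topic
  ungroup() %>%
  group_by(topic) %>%
  tally()                 # one count per topic
```

The result has one row per topic with an `n` column, so a value like 1,326 for topic 1 means topic 1 was the top topic for 1,326 documents.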

10. Summarize output again

And finally, a quick way to view how strong a topic was when it was the top topic. We follow a very similar process, but this time we summarize with the mean gamma value. Topic 1 had the highest average weight when it was the top topic.
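As a sketch, the only change from the counting version is the final step: we summarize with the mean gamma instead of tallying. Again, `lda_model` is an assumed placeholder:

```r
# Sketch: average strength (mean gamma) of each topic when it was the top topic.
library(tidytext)
library(dplyr)

tidy(lda_model, matrix = "gamma") %>%
  group_by(document) %>%
  arrange(desc(gamma)) %>%
  slice(1) %>%                    # each document's top topic and its gamma
  ungroup() %>%
  group_by(topic) %>%
  summarize(avg_gamma = mean(gamma))  # mean weight per topic
```

A topic with a high average gamma here dominates the documents it leads, which is useful context when presenting the topic counts.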

11. LDA practice

Let's practice applying LDA models.