
Introduction to topic modeling

1. Introduction to topic modeling

Hello again - we have finally arrived at one of the most interesting topics in text analysis - topic modeling!

2. Topic modeling

The motivation for topic modeling is pretty straightforward. A collection of texts is likely to be made up of a collection of topics. For example, let's consider a set of articles written about sports. Generally, these articles consist of several recurring themes, which are familiar to even passive sports fans. But what if you were given a collection of articles on the weather in Zambia? If you knew nothing about this subject and had no expert to talk to, you'd have to read all of the articles to find the underlying topics.

3. Latent Dirichlet allocation

Algorithms can identify topics within a collection of text, and one of the most common algorithms for topic modeling is latent Dirichlet allocation (LDA). LDA rests on two basic principles: each document is made up of a small mixture of topics, and the presence of a word in a document can be attributed to one of those topics. For example, a sports story on a player being traded might be 70% team news and 30% player gossip, and words such as trade, pitcher, move, new, angry, change, and money might be attributed to one topic or the other.

4. Preparing for LDA

Performing LDA is straightforward, but it can take some time to understand the results. In order to perform LDA, we need a document-term matrix with term-frequency weights. We start with our standard preparation steps and then cast the word counts to a document-term matrix. Notice the weighting used here: LDA requires term-frequency weighting.
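A minimal sketch of this preparation, assuming a data frame named `chapters` with `chapter` and `text` columns (the names are illustrative, not from the original slides):

```r
library(dplyr)
library(tidytext)

# Assumed input: a data frame `chapters` with one row per chapter,
# containing columns `chapter` and `text`.
word_counts <- chapters %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(chapter, word)

# Cast the counts to a document-term matrix with term-frequency weights,
# which is what LDA requires.
dtm <- word_counts %>%
  cast_dtm(chapter, word, n, weighting = tm::weightTf)
```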

5. LDA

Once we have a document-term matrix, we call the LDA function on the matrix, specify the number of topics by setting the parameter k, and designate a seed so our results are reproducible. We use the "Gibbs" implementation of topic modeling instead of the default implementation.
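A hedged sketch of that call, assuming the document-term matrix from the previous step is stored in `dtm`; the number of topics and seed value are illustrative choices:

```r
library(topicmodels)

# Fit the LDA model: k sets the number of topics, method = "Gibbs" selects
# the Gibbs sampling implementation, and the seed makes results reproducible.
lda_out <- LDA(
  dtm,
  k = 3,                       # number of topics (illustrative choice)
  method = "Gibbs",
  control = list(seed = 1111)  # seed value is arbitrary
)
```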

6. LDA results

After performing LDA, we can extract a tibble containing each term and its corresponding beta value for each topic. Without getting into the weeds, beta is a per-topic word distribution: it describes how strongly a term is associated with each topic. Words more related to a single topic will have a higher beta value for that topic. Also note that, because each topic's beta values sum to one, the sum of all the values equals the total number of topics.
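One way to pull that tibble out, assuming the fitted model is stored in `lda_out` as in the sketch above:

```r
library(tidytext)
library(dplyr)

# Tidy the model into one row per (topic, term) pair with its beta value.
lda_topics <- tidy(lda_out, matrix = "beta")

# Each topic's betas sum to 1, so the grand total equals the number of topics.
lda_topics %>%
  group_by(topic) %>%
  summarize(total_beta = sum(beta))
```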

7. Top words per topic

Let's look at the top words by topic. For topic 1, we see words like napoleon, animal, and windmill. For topic 2, we see similar words. This might indicate that we need to remove some of the non-entity words, such as animal, and rerun our analysis.
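A sketch of how you might extract and plot the top terms per topic, using the tidy `lda_topics` output from the previous step:

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Keep the ten highest-beta terms in each topic.
top_terms <- lda_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup()

# Plot the top terms, one facet per topic.
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  scale_y_reordered()
```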

8. Top words continued

Here, we have rerun LDA after removing animal and farm. I wouldn't suggest removing too many words, but since the book is called "Animal Farm", it made sense to do so. Based on this visualization, which was taken from the tidytext manual, topic 1 appears to be more about pig, daisy, and boxer, while topic 2 is focused on snowball, napoleon, and dogs.
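A sketch of that removal step, reusing the assumed `chapters` data frame from earlier; the custom stop list contains only the two words mentioned above:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical custom stop list with the two book-specific words.
custom_stop_words <- tibble(word = c("animal", "farm"))

word_counts <- chapters %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(chapter, word)

# Re-cast and re-fit exactly as before.
dtm <- cast_dtm(word_counts, chapter, word, n, weighting = tm::weightTf)
lda_out <- LDA(dtm, k = 3, method = "Gibbs", control = list(seed = 1111))
```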

9. Labeling documents as topics

Now that we know which words correspond with which topics, we can use the words of each chapter to assign topics to the chapters. To extract the topic assignments, we use the tidy function once more, only this time we specify that we want the gamma matrix, since gamma represents how much of a chapter is made up of each topic. Chapter 1 consists mostly of topic 3, with a little of topics 1 and 2 as well.
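A sketch of extracting and inspecting gamma, again assuming the fitted model is stored in `lda_out`:

```r
library(tidytext)
library(dplyr)

# Tidy the model into one row per (document, topic) pair with its gamma value.
lda_chapters <- tidy(lda_out, matrix = "gamma")

# Gamma is the share of a chapter attributed to each topic; the values for a
# single chapter sum to 1, so the largest gamma gives the chapter's label.
lda_chapters %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()
```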

10. LDA practice!

Let's look at a few examples.