1. Running topic models
With a sense of what topic models do and an understanding of how to structure data as the DTM input they require, let's start running some topic models!
2. Using LDA()
To run a topic model, we will be using the topicmodels package in conjunction with the tidyverse and tidytext. With a DTM as an input, running a topic model is straightforward: we use the LDA() function. There are four arguments for this function that we'll need to specify.
The first argument is the DTM input. The second argument is k, or the number of topics we want the model to produce. We'll see later how we can decide on what we should set k to. The third argument is method, the estimation method. The default, variational expectation-maximization (VEM), is a quick approximation. However, if we prefer a slower but more thorough method, we can set method to "Gibbs" to use the Gibbs sampler. Finally, the fourth argument, control, allows us to specify the simulation seed, which we need to pass as a list, a data structure in R that's beyond the scope of this course. Because model estimation is probabilistic, specifying the simulation seed helps us recover consistent topics on repeated model runs.
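As a minimal sketch of the four arguments just described, the call might look like the following. The toy corpus, the column names id and text, and the seed value 42 are assumptions made up for illustration; only LDA() and its k, method, and control arguments come from the discussion above.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical toy corpus, used only so the example runs on its own
docs <- tibble(
  id = 1:4,
  text = c("cats and dogs are pets",
           "dogs chase cats",
           "stocks and bonds are investments",
           "investors buy stocks and bonds")
)

# Build the DTM input: one row per document, one column per word
dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(id, word) %>%
  cast_dtm(id, word, n)

lda_out <- LDA(
  dtm,                       # 1. the DTM input
  k = 2,                     # 2. number of topics
  method = "Gibbs",          # 3. estimation method (the default is "VEM")
  control = list(seed = 42)  # 4. simulation seed, wrapped in a list
)
```

Rerunning the same call with the same seed should reproduce the same topics, which is the point of fixing the seed.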
3. LDA() output
After running the model, which might take just a few moments or as much as a few hours, depending on how much text we're throwing at it, we can see that the output is, much like the DTM, an R object encoded specifically for the topicmodels package.
4. Using glimpse()
However, we can use the glimpse() function to see what is included in this encoded object. There are a number of items stored within lda_out. For example, we can see k, the number of topics we specified, as well as beta, which holds the word probabilities (stored on a log scale) that define the topics.
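A short sketch of inspecting the fitted object, assuming a model fit on a made-up toy corpus (the corpus, column names, and seed are illustrative, not from the course). The @ accessor is the standard way to read a slot from this kind of R object.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical toy corpus, used only so the example runs on its own
docs <- tibble(
  id = 1:4,
  text = c("cats and dogs are pets", "dogs chase cats",
           "stocks and bonds are investments", "investors buy stocks and bonds")
)
dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(id, word) %>%
  cast_dtm(id, word, n)
lda_out <- LDA(dtm, k = 2, method = "Gibbs", control = list(seed = 42))

# Overview of everything stored in the fitted model object
glimpse(lda_out)

# Individual pieces are slots, accessed with @
lda_out@k          # number of topics requested
dim(lda_out@beta)  # one row per topic, one column per term
```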
5. Using tidy()
We would like to evaluate the model output in a way that is consistent with our suite of tidy tools. In particular, the most important output from a topic model is the set of topics themselves: for each topic, the dictionary of all words in the corpus sorted according to the probability that each word occurs as part of that topic. Once again, we can use a tidytext function, in this case tidy(), to take the matrix of topic probabilities and put it into a form that is easily visualized using ggplot2. Using tidy() requires us to specify which piece of lda_out we want to tidy through the matrix argument; in this case, that's matrix = "beta".
If cast_dtm() allows us to navigate out of tidy data formats to run the topic model, tidy() allows us to take the model output and navigate back into a tidy data format.
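The round trip described above can be sketched as follows: cast_dtm() takes us out of tidy data, and tidy() with matrix = "beta" brings the model output back. The toy corpus, the names lda_topics and top_terms, and the choice of three words per topic are assumptions for illustration.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical toy corpus, used only so the example runs on its own
docs <- tibble(
  id = 1:4,
  text = c("cats and dogs are pets", "dogs chase cats",
           "stocks and bonds are investments", "investors buy stocks and bonds")
)
dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(id, word) %>%
  cast_dtm(id, word, n)
lda_out <- LDA(dtm, k = 2, method = "Gibbs", control = list(seed = 42))

# Back into tidy format: one row per topic-term pair
lda_topics <- tidy(lda_out, matrix = "beta")

# The three most probable words in each topic
top_terms <- lda_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 3) %>%
  ungroup() %>%
  arrange(topic, desc(beta))

top_terms
```

Because lda_topics is an ordinary data frame with topic, term, and beta columns, it can be passed straight to ggplot2 or any other tidy tool.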
6. Let's practice!
How do we know whether two topics are enough? How do we interpret the topics? We'll discuss these topics, no pun intended, next. For now, it's your turn to run your first topic model!