
Interpreting topics

1. Interpreting topics

Interpreting a topic model is something of an art form. Much like clustering and other unsupervised learning techniques, we get a description of what each topic is composed of but no other direction as to what the topics mean. Similarly, while we might use a variety of diagnostics to try to decide on the number of topics to estimate, the decision ultimately remains with the analyst. The key is to find topics that are distinct from one another and don't simply repeat.

2. Two topics

Here we run LDA with k = 2 topics and pipe the model output directly into tidy() to extract the word probabilities. We then follow the standard steps of selecting the words with the highest probabilities within each topic and reordering the terms by topic, producing word_probs.
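The pipeline described here can be sketched in R roughly as follows. The document-term matrix name dtm, the seed, the number of top terms (10), and the term2 column are assumptions for illustration; the exact names and values aren't shown in this excerpt.

```r
library(topicmodels)  # LDA()
library(tidytext)     # tidy()
library(dplyr)
library(forcats)

# Fit an LDA model with k = 2 topics on an assumed document-term matrix `dtm`,
# then tidy it to get per-topic word probabilities (beta).
word_probs <- LDA(dtm, k = 2, control = list(seed = 42)) %>%
  tidy(matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%              # keep the highest-probability words per topic
  ungroup() %>%
  mutate(term2 = fct_reorder(term, beta))  # reorder terms for plotting
```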

3. Two topics

Finally, we plot word_probs, treating topic as a factor to add some color, removing the redundant bar plot legend, and faceting by topic to easily see what makes the topics different. To interpret these topics, we need to consider what the words occurring with high probability within each topic suggest. Again, this is a subjective decision. The first topic is a collection of words that appear to be connected with the performance of a robotic vacuum, with words like hair, vacuum, floors, house, and day occurring together with high probability. The second topic is a collection of words that appear to be connected with possible frustrations with the functionality of a robotic vacuum, with words like clean, cleaning, bin, and time occurring together with high probability. We could name topic one “Performance” and topic two “Functionality”.
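A minimal sketch of the plot described above, assuming the word_probs object from the previous step with a reordered term2 column:

```r
library(ggplot2)

# Treat topic as a factor for fill color, drop the redundant legend,
# and facet by topic (free scales) so each panel shows its own terms.
ggplot(word_probs, aes(x = term2, y = beta, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
```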

4. Three topics

Now let's repeat this same process but with k = 3 topics. We follow the same steps of modeling, tidying, filtering, and reordering.

5. Three topics

And, finally, plotting. Note that adding a topic has the potential to drastically change the previous solution. The first topic looks like our previous second topic. The second topic looks like our previous first topic. The third topic is something new altogether, with words like time, floor, run, and stuck suggesting possible frustrations more clearly. We could name topic one “Functionality,” topic two “Performance,” and topic three “Frustrations.”

6. Four topics

To finish our illustration, let's repeat this process one more time but with k = 4 topics. The same steps were followed to produce this plot. Again, we see a change across topics. The first topic now seems to be about a specific use case: rugs. The second topic appears to be about another use case: floors. The third topic looks a lot like the previous first topic, with a focus on functionality. The fourth topic looks a lot like the previous second topic, with a focus on performance. We could name these topics “Rugs,” “Floors,” “Functionality,” and “Performance.”

7. The art of model selection

As we added topics, we saw more granularity, with each new topic adding something specific. Once we arrive at the point where adding a new topic appears to simply duplicate an existing topic, we know we've gone too far. Finally, we name the topics based on what the words with high probability appear to be indicating.
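One way to support this judgment call is to refit the model across several values of k and inspect the top terms side by side, stopping once a new topic merely duplicates an existing one. This is a sketch under the same assumptions as before (a document-term matrix named dtm; the seed and the choice of 5 top terms are illustrative):

```r
library(topicmodels)
library(tidytext)
library(dplyr)
library(purrr)

# Fit LDA for k = 2, 3, 4 and collect the top terms of each solution,
# tagged with the k that produced them, for side-by-side comparison.
top_terms <- map_dfr(2:4, function(k) {
  LDA(dtm, k = k, control = list(seed = 42)) %>%
    tidy(matrix = "beta") %>%
    group_by(topic) %>%
    slice_max(beta, n = 5) %>%
    ungroup() %>%
    mutate(k = k)
})
```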

8. Let's practice!

Let's practice the art of selecting k and naming topics!