1. Topic modeling of tweets
Imagine being able to extract the most important topics discussed across thousands of tweets in a few seconds.
2. Lesson Overview
In this lesson, we will cover the fundamentals of topic modeling and learn what a document term matrix, or DTM, is.
We will learn how to create a DTM from a text corpus and how to build a topic model from the DTM.
3. Topic and Document
A topic is a collection of dominant keywords that together represent a common theme.
The keywords "travel", "vacation", and "hotel" are representative of the topic "tourism".
4. Topic and Document
A document is a term used to describe one text record. For example, a tweet on tourism is a document.
5. Topic modeling
We now know what a topic and a document are. So what is topic modeling?
Topic modeling is the task of automatically discovering topics from a vast amount of text.
It is used to quickly extract core discussion topics from large datasets like tweets.
Topic modeling of tweets about a brand helps summarize a large volume of information into distinct topics quickly.
6. How LDA works
We will be using the Latent Dirichlet Allocation algorithm for topic modeling.
The LDA model is a mathematical model that simultaneously estimates:
7. How LDA works
the mixture of words associated with a topic and
8. How LDA works
the mixture of topics that describes each document.
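To make the two mixtures concrete, here is a minimal sketch that fits LDA on four made-up tweets and prints both sets of estimates. The tweets, the choice of 2 topics, and the seed are assumptions for illustration only. The sketch relies on the DocumentTermMatrix() and LDA() functions covered later in this lesson, plus posterior() from topicmodels.

```r
library(tm)
library(topicmodels)

# Hypothetical toy tweets, not the lesson's data
tweets <- c("travel hotel vacation beach",
            "hotel booking travel deals",
            "obesity diet exercise health",
            "diet health obesity risk")

# Build the document term matrix and fit a 2-topic model
dtm <- DocumentTermMatrix(VCorpus(VectorSource(tweets)))
lda_model <- LDA(dtm, k = 2, control = list(seed = 1234))

# posterior() returns both sets of estimated mixtures
mixtures <- posterior(lda_model)

# Mixture of words for each topic: one row per topic, rows sum to 1
print(round(mixtures$terms, 2))

# Mixture of topics for each document: one row per document, rows sum to 1
print(round(mixtures$topics, 2))
```

Each row is a probability distribution, which is what "mixture" means here.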
9. Document term matrix (DTM)
The first step in topic modeling is to create a document term matrix, or DTM, of the text corpus.
The DTM is a matrix representation of a corpus.
Its rows are the documents, and its columns are the words, or terms, that occur in those documents. Each cell holds the number of times a term occurs in a document.
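As an illustration of this layout, the following sketch builds a tiny DTM by hand in base R. The three documents are made up, and the variable names (docs, tokens, vocab, toy_dtm) are hypothetical. This is not how the tm package does it internally, only a way to see the rows and columns.

```r
# Three toy "tweets" (hypothetical data, not from the lesson's corpus)
docs <- c(doc1 = "travel hotel vacation",
          doc2 = "hotel booking hotel",
          doc3 = "diet exercise")

# Split each document into lowercase words
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary: every distinct term, sorted alphabetically
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document
counts <- vapply(tokens,
                 function(words) as.integer(table(factor(words, levels = vocab))),
                 integer(length(vocab)))

# Transpose so that documents are rows and terms are columns
toy_dtm <- t(counts)
dimnames(toy_dtm) <- list(names(docs), vocab)

print(toy_dtm)
```

The word "hotel" appears twice in doc2, so that cell holds 2. Most other cells are 0.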
10. Create a document term matrix
We create a DTM for the corpus on obesity using the DocumentTermMatrix() function from the tm package.
This function takes the tweet corpus as input.
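A minimal sketch of this step is shown here. The three tweets and the names tweets, twt_corpus, and dtm are assumptions standing in for the lesson's obesity corpus.

```r
library(tm)

# Hypothetical stand-in for the lesson's tweet corpus on obesity
tweets <- c("obesity raises the risk of diabetes",
            "exercise helps fight obesity",
            "childhood obesity rates keep rising")

# Wrap the character vector in a tm corpus object
twt_corpus <- VCorpus(VectorSource(tweets))

# Build the DTM: one row per tweet, one column per term
dtm <- DocumentTermMatrix(twt_corpus)

# Number of documents (rows) and terms (columns)
print(dim(dtm))
```

With its default settings, DocumentTermMatrix() drops words shorter than three characters, so a word such as "of" does not get a column.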
11. Create a document term matrix
Let's examine the first few rows in the DTM using inspect().
12. Create a document term matrix
We see that the DTM has 1000 documents and 5079 terms.
The non-sparse (non-zero) entries are an extremely small fraction of all the cells in the matrix, which translates to a sparsity of almost 100%.
A sparsity this close to 100% indicates that most terms occur in only a few documents, which is why the matrix shown below contains so many zeros.
The output lays out a few documents and a few of their terms as the rows and columns of the matrix.
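The sketch below calls inspect() on a small DTM and then recomputes the sparsity by hand as the share of zero cells. The tweets are made up, so the numbers will differ from the 1000 documents and 5079 terms in the lesson.

```r
library(tm)

# Hypothetical stand-in for the lesson's tweet corpus
tweets <- c("obesity raises the risk of diabetes",
            "exercise helps fight obesity",
            "childhood obesity rates keep rising")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(tweets)))

# Print a summary (dimensions, sparsity) and a sample of the matrix
inspect(dtm[1:3, 1:5])

# Sparsity is the share of cells in the matrix that are zero
dtm_mat <- as.matrix(dtm)
sparsity <- sum(dtm_mat == 0) / length(dtm_mat)
print(sparsity)
```

Converting to a dense matrix with as.matrix() is fine for a toy example, but it can use a lot of memory on a corpus with thousands of documents and terms.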
13. Preparing the DTM
Before the DTM is passed to the LDA function, it must be filtered to keep only the rows whose sum is greater than 0, because LDA cannot handle documents that contain no terms.
First, calculate the sum of word counts in each row using the apply() function.
This function takes three arguments:
the DTM; the margin 1, so that the function is applied over rows; and sum, the function that adds up the word counts in each row.
Next, we subset the DTM by selecting rows that have row totals greater than zero and save the output in a new DTM.
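These two steps can be sketched as follows. The tweets and the names rowTotals and dtm_new are assumptions. The second tweet is deliberately made of very short words, which tm drops by default, so its row ends up empty.

```r
library(tm)

# Hypothetical tweets; the second one contains only words shorter than
# three characters, which tm drops by default, so its row sums to zero
tweets <- c("obesity raises the risk of diabetes",
            "so it is",
            "exercise helps fight obesity")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(tweets)))

# Sum the word counts in each row (margin 1 means "over rows")
rowTotals <- apply(dtm, 1, sum)
print(rowTotals)

# Keep only the documents that contain at least one term
dtm_new <- dtm[rowTotals > 0, ]
print(dim(dtm_new))
```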
14. Build the topic model
It is now time to create the topic model using the LDA() function from the topicmodels package.
The function takes as input the new DTM created in the last step and the number of topics, k, which we set to 5.
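Here is a minimal sketch of the model-building step. The toy tweets are made up, and because the corpus is so small the sketch uses k = 2 rather than the 5 topics used in the lesson. The seed is an added assumption that makes the result reproducible.

```r
library(tm)
library(topicmodels)

# Hypothetical toy tweets, not the lesson's data
tweets <- c("obesity raises the risk of diabetes",
            "exercise helps fight obesity",
            "childhood obesity rates keep rising",
            "healthy diet and exercise reduce risk")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(tweets)))

# Drop documents with no terms, as LDA() requires
dtm_new <- dtm[apply(dtm, 1, sum) > 0, ]

# Fit the topic model; k is the number of topics to extract
lda_model <- LDA(dtm_new, k = 2, control = list(seed = 1234))
print(lda_model)
```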
15. Build the topic model
We have now extracted 5 topics from the tweet corpus on "Obesity".
Let us look at the top 10 terms under each topic to understand what these topics are about.
The terms() function takes the following inputs: the topic model and the number of terms.
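A sketch of this call on a toy model is shown below. The tweets, the 2 topics, and the request for 3 terms are assumptions chosen to suit a tiny corpus. With the lesson's model, the second argument would be 10.

```r
library(tm)
library(topicmodels)

# Hypothetical toy tweets, not the lesson's data
tweets <- c("obesity raises the risk of diabetes",
            "exercise helps fight obesity",
            "childhood obesity rates keep rising",
            "healthy diet and exercise reduce risk")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(tweets)))
lda_model <- LDA(dtm, k = 2, control = list(seed = 1234))

# Top 3 terms per topic: one column per topic, one row per rank
top_terms <- terms(lda_model, 3)
print(top_terms)
```

The result is a character matrix, so a single topic's terms can be pulled out by column, for example top_terms[, 1].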
16. View top 10 terms in the topic model
We can see that each topic covers different aspects and side effects of obesity.
An obesity management program could build its theme around one of these core topics.
17. Let's practice!
You have learned how to build a topic model from tweets. Let's practice.