1. Flagging fraud based on topics
In this video, you'll learn more about how to use your LDA model results for fraud detection.
2. Using your LDA model results for fraud detection
If you don't have labels, you can start by checking the frequency of suspicious words within topics, and whether the topics seem to describe the fraudulent behavior.
For the Enron email data, a suspicious topic would be one where employees discuss stock bonuses, selling stock, the Enron stock price, and perhaps accounting issues or weak financials. Defining suspicious topics does require some prior knowledge about the fraudulent behavior. If a fraudulent topic stands out, you can flag all instances that have a high probability for that topic.
If you have previous cases of fraud, you could run a topic model on the fraud text only, as well as on the non-fraud text, and check whether the results are similar, i.e., whether the topics occur with the same frequency in fraud versus non-fraud.
Lastly, you can check whether known fraud cases have a higher probability score for certain topics. If so, you can run the topic model on new data and directly flag the instances that score high on those topics.
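To make this concrete, here is a minimal sketch of that flagging step, assuming a trained gensim LdaModel called ldamodel, the matching bag-of-words corpus, and a hypothetical suspicious topic index and probability threshold:

```python
# Minimal sketch: flag documents that load heavily on a suspicious topic.
# `ldamodel` and `corpus` are assumed to exist already; the topic index (3)
# and the 0.5 threshold are illustrative choices, not fixed rules.
SUSPICIOUS_TOPIC = 3
THRESHOLD = 0.5

flags = []
for bow in corpus:
    # get_document_topics returns (topic_id, probability) pairs per document
    topic_probs = dict(ldamodel.get_document_topics(bow))
    flags.append(topic_probs.get(SUSPICIOUS_TOPIC, 0.0) >= THRESHOLD)
```

If you do have known fraud cases, you would tune the threshold against them rather than picking it by hand.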
3. To understand topics, you need to visualize
Interpretation of the abstract topics can sometimes be difficult, so you need good visualization tools to dive deeper and try to understand what the underlying topics mean.
For gensim there is a visualization tool called pyLDAvis that does an excellent job. Be mindful, though: this tool only works inside Jupyter notebooks. Once you have created your model, you can create a detailed visualization in just two lines of code: you input the model, the corpus, and the dictionary into pyLDAvis, and then simply display the results.
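The pattern looks roughly like this; note that in pyLDAvis releases from 3.4 onward the gensim helper module is called pyLDAvis.gensim_models (it was pyLDAvis.gensim in older versions), and ldamodel, corpus, and dictionary stand in for your own trained model, bag-of-words corpus, and gensim Dictionary:

```python
import pyLDAvis.gensim_models  # pyLDAvis.gensim in releases before 3.4

# Prepare the interactive visualization from the trained LDA model,
# the bag-of-words corpus, and the gensim dictionary
lda_display = pyLDAvis.gensim_models.prepare(ldamodel, corpus, dictionary)
pyLDAvis.display(lda_display)
```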
4. Inspecting how topics differ
The display looks like this and is interactive.
So how do you interpret this output? Each bubble on the left-hand side represents a topic. The larger the bubble, the more prevalent that topic is. You can click on each bubble to get the details of that topic in the right-hand panel, where the words shown are the most important keywords that form the selected topic.
A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart.
A model with too many topics will typically have many overlapping or small bubbles clustered in one region. In our case, there is a slight overlap between topics two and three, which may point to one topic too many.
5. Assign topics to your original data
One of the practical applications of topic modeling is determining what topic a given text is about.
To do that, you need to find the topic number with the highest percentage contribution in that text.
This is, in fact, not that straightforward. However, the function get_topic_details, shown here, nicely aggregates this information into a presentable table.
Going into the details of this function is beyond the scope of this course, but you will get a chance to work with it in the exercises.
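For reference, a function with that behavior could be sketched roughly as follows; this is an illustrative reimplementation under the assumption of a trained gensim LdaModel and its bag-of-words corpus, not the course's exact code:

```python
import pandas as pd

def get_topic_details(ldamodel, corpus):
    """Return the dominant topic and its probability score for each document."""
    rows = []
    for bow in corpus:
        # (topic_id, probability) pairs for this document, highest first
        doc_topics = sorted(ldamodel.get_document_topics(bow),
                            key=lambda pair: pair[1], reverse=True)
        dominant_topic, score = doc_topics[0]
        rows.append([int(dominant_topic), round(score, 4)])
    return pd.DataFrame(rows, columns=['Dominant_Topic', '% Score'])
```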
6. Assign topics to your original data
The function can be applied as follows: I take the original text data and combine it with the output of the get_topic_details function, as sketched below.
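A minimal sketch of that combination, where text_clean is a hypothetical name for the preprocessed documents and get_topic_details is the function sketched above:

```python
import pandas as pd

# `text_clean` holds the preprocessed documents (hypothetical name);
# get_topic_details is the helper sketched in the previous section
contents = pd.DataFrame({'Original text': text_clean})
topic_details = pd.concat([get_topic_details(ldamodel, corpus), contents], axis=1)
topic_details.head()
```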
Each row of the resulting table contains the dominant topic number, the probability score associated with that topic, and the original text data.
7. Let's practice!
Let's practice!