Topic modeling on fraud

1. Topic modeling on fraud

Let's now dive a bit deeper into topic modeling for fraud detection.

2. Topic modeling: discover hidden patterns in text data

Topic modeling can be a powerful tool when searching for fraud in text data. It allows you to discover the abstract topics that occur in a collection of documents. Intuitively, given that a document is about a particular topic, you would expect particular words to appear in it more or less frequently. Topic modeling therefore tells us, in a very efficient way, what a text is about, based on the words it contains. Conceptually, it is similar to clustering, as it groups words belonging to the same topic together. If you have text data from known fraud cases, topic modeling lets you check what the most common topics in those cases are, and use them as a benchmark for unknown cases. Without known labels, you can inspect which topics seem to point to fraudulent behavior and are interesting to investigate further.

3. Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation, or LDA, is a commonly used example of a topic model. It creates a "topics per text item" model and a "words per topic" model, both modeled as Dirichlet distributions. Implementing LDA is straightforward. First, you need to clean your data as described in the last video; this is the bulk of the work. Then, you create a dictionary that records how often each word appears across all of the text, and also a corpus that contains, for each text line in your data, the counts of the words that appear in it.
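
To make those two structures concrete before we get to the actual gensim implementation on the next slides, here is a minimal sketch in plain Python; the example words are made up for illustration:

```python
from collections import Counter

# two tiny, already-cleaned "documents"
docs = [["invoice", "payment", "transfer"],
        ["payment", "urgent", "transfer", "transfer"]]

# dictionary: how often each word appears across all of the text
dictionary = Counter(word for doc in docs for word in doc)
print(dictionary)  # Counter({'transfer': 3, 'payment': 2, 'invoice': 1, 'urgent': 1})

# corpus: for each document, the count of each word that appears in it
corpus = [Counter(doc) for doc in docs]
print(corpus[1])   # Counter({'transfer': 2, 'payment': 1, 'urgent': 1})
```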

4. Latent Dirichlet Allocation (LDA)

The results you get from this model are twofold. First, you see how each word in your data is associated with each topic. Second, you can also see how each text item in your data associates with the topics, in the form of probabilities. You can see this in the image here on the right. This image comes from a blog post on DataCamp about LDA, which I encourage you to read if you want to learn about it in more detail.

5. Bag of words: dictionary and corpus

Let's talk about the implementation of an LDA model. You start by importing the corpora module from gensim. I use the Dictionary class in corpora to create a dictionary from our text data, in this case, from the cleaned emails. The dictionary maps each word to the number of times it appears. You then filter out words that appear in fewer than 5 emails and keep only the 50,000 most frequent words, as a way of cleaning out the outlier noise in the text data. Last, you create a corpus that tells you, for each email, which words it contains and how many times those words appear. You can use the doc2bow function for this. doc2bow stands for "document to bag of words": it converts our text data into a bag-of-words format, so each row in our data is now a list of words with their associated word counts.
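
A sketch of these steps might look as follows; `cleaned_emails` is a stand-in for the preprocessed email text from the last video, so treat the exact variable name as an assumption:

```python
from gensim import corpora

# create the dictionary from the cleaned emails:
# each unique word gets an id, and its frequency is tracked
dictionary = corpora.Dictionary(cleaned_emails)

# drop words that appear in fewer than 5 emails;
# keep only the 50,000 most frequent words
dictionary.filter_extremes(no_below=5, keep_n=50000)

# corpus: for each email, a bag of words as (word id, word count) tuples
corpus = [dictionary.doc2bow(email) for email in cleaned_emails]
```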

6. Latent Dirichlet Allocation (LDA) with gensim

After cleaning the text data, and creating the dictionary and corpus, you are now ready to run your LDA model. I use gensim again for this. You need to pass the corpus and dictionary into the LDA model. As with K-means, you need to pick the number of topics beforehand, even if you're not sure yet what the topics are. The fitted LDA model now contains the associated words for each topic, and the topic scores per email. You can obtain the top words from the three topics with the print_topics function. As you can see, after running the model, I print the three topics and the four top keywords associated with each topic, for a first interpretation of the results.
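
Sketched in code, using the dictionary and corpus from the previous slide; the number of training passes is an assumption on my part, not something fixed by the course:

```python
import gensim

# fit an LDA model with three topics on the bag-of-words corpus
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=3, id2word=dictionary, passes=5)

# print the three topics with their four top keywords
print(ldamodel.print_topics(num_topics=3, num_words=4))

# topic scores (probabilities) for the first email in the corpus
print(ldamodel.get_document_topics(corpus[0]))
```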

7. Let's practice!

Let's practice!