1. Text mining to detect fraud
In this video, you'll learn more about the basics of text mining.
2. Cleaning your text data
Whenever you work with text data, be it for word search, topic modeling, sentiment analysis, or text style analysis, you need to do some rigorous text cleaning before you can work with the data. Here are four steps you must take before processing the data further.
First, you always need to split the text into sentences and the sentences into words; this is called tokenization. You also transform everything into lowercase and remove punctuation.
Second, you need to remove all stopwords, as they add noise to your data. Luckily, there are stopword lists available to help you do that.
Third, you need to lemmatize words. This means, for example, changing words from third person into first person and changing verbs in past and future tenses into the present tense. This allows you to combine all words that point to the same thing.
Lastly, all verbs need to be stemmed, such that they are reduced to their root form. For example, walking and walked are reduced to just their stem, walk.
3. Go from this...
When you take these four steps, it allows you to go from this type of data
4. To this...
to this nice, clean, structured list of words per dataframe row.
5. Data preprocessing part 1
Let's look at how to clean your data in practice. Following the four steps, you begin by tokenizing the text data into words. Tokenizers divide strings into lists of substrings. The standard nltk word tokenizer can be used to find the words and punctuation in a string: it splits the words on, for example, whitespace, and separates the punctuation out. You then use rstrip to remove the whitespace from the end of the strings, and finally you make sure all text is lowercase by replacing each letter with its lowercase counterpart, using regular expressions.
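As an illustration, here is a minimal sketch of this tokenization step. It assumes the email text lives in a DataFrame column called clean_content; that column name and the sample sentence are just assumptions for the example.

```python
import re
import nltk
import pandas as pd

nltk.download('punkt')  # tokenizer models, only needed once

# Toy DataFrame standing in for the real data; "clean_content" is an assumed column name
df = pd.DataFrame({'clean_content': ["  Please review the attached Report! "]})

# Split every row into word and punctuation tokens
tokenized = df['clean_content'].apply(nltk.word_tokenize)

# Strip surrounding whitespace from each token and lowercase it via a regex
tokenized = tokenized.apply(
    lambda words: [re.sub(r'[A-Z]', lambda m: m.group().lower(), w.rstrip())
                   for w in words])

print(tokenized[0])  # ['please', 'review', 'the', 'attached', 'report', '!']
```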
You then clean the text further by removing stopwords and punctuation. Starting from the tokenized text, you get rid of all punctuation. Nltk has a stopword list for the English language that you can use. Since every row consists of a list of strings, you need to write small loops that select the words you want to keep. I use join here to glue the words I keep back together with a space between them.
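A hedged sketch of this filtering step, using one illustrative tokenized row rather than the full dataframe, could look like this.

```python
import string
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # English stopword list, only needed once

stop = set(stopwords.words('english'))
punctuation = set(string.punctuation)

# One tokenized, lowercased row from the previous step (illustrative values)
row = ['please', 'review', 'the', 'attached', 'report', '!']

# Keep only tokens that are neither stopwords nor punctuation,
# then join the survivors back into a single space-separated string
cleaned = " ".join(word for word in row
                   if word not in stop and word not in punctuation)
print(cleaned)  # "please review attached report"
```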
6. Data preprocessing part 2
The next step is to lemmatize the words; this can easily be done with the nltk WordNetLemmatizer. Again, I loop over the words and make sure they are joined together with a space between them. Stemming your verbs is equally simple: you can use the nltk PorterStemmer for this.
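Here is a minimal sketch of these last two steps, again on one illustrative row of already cleaned words.

```python
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download('wordnet')  # lemma dictionary, only needed once

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# An illustrative row of cleaned, lowercased words
row = ['reviewing', 'attached', 'reports', 'walked', 'walking']

# Lemmatize every word (e.g. 'reports' -> 'report'), then stem the result
# (e.g. 'walking' and 'walked' -> 'walk'), and join back with spaces
lemmas = [lemmatizer.lemmatize(word) for word in row]
stems = [stemmer.stem(word) for word in lemmas]
print(" ".join(stems))  # "review attach report walk walk"
```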
After this work, your text data should be nice and clean: a list of cleaned words for each row in your dataframe.
7. Let's practice!
Let's try and clean the Enron email dataset.