
Different frequency criteria

1. Different frequency criteria

2. Term weights

So far we have only used the default term frequency, or TF. This is just a simple count of words when making a TDM or DTM. You may recall that leaving the word "chardonnay" in the word cloud made it difficult to gain any insight. This was due to the high frequency of chardonnay in our corpus, which was totally expected by the way. To combat this when constructing a TDM, you can pass in a control parameter to change the term weighting. One popular method is TF-IDF, or term frequency-inverse document frequency. Essentially, words are still counted, but they are penalized if they appear in many documents. The idea is that words appearing frequently across all documents have little informational value.
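To make the intuition concrete, here is a minimal sketch of the TF-IDF idea with made-up counts (the weightTfIdf function in tm also normalizes by document length, so the exact values differ):

    # Hypothetical counts: a term that appears in nearly every document
    n_docs   <- 1000                  # total documents in the corpus
    doc_freq <- 990                   # documents containing the term
    tf       <- 1                     # raw count of the term in one tweet
    tf * log2(n_docs / doc_freq)      # close to zero: common terms are penalized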

3. Term weights

In this example we create two TDMs. The first is the standard one we have used throughout the course. Examining rows 505 through 510 for tweets 5 through 10, you see that coffee shows a 1 across the board. Now let's look at the same TDM but with TF-IDF weighting. When calling TermDocumentMatrix, you pass a control parameter that sets the weighting to weightTfIdf. Looking at the same section of the TDM, you can see that the coffee values are greatly diminished. Switching to TF-IDF lowers the impact of coffee in this corpus, which probably makes sense given that we know it appears in every tweet.
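A minimal sketch of the two calls, assuming the cleaned corpus is stored in a variable named clean_corp (the name is illustrative):

    library(tm)

    # Standard TDM with default term-frequency weighting
    tdm <- TermDocumentMatrix(clean_corp)

    # Same TDM, but weighted with TF-IDF via the control parameter
    tdm_tfidf <- TermDocumentMatrix(
      clean_corp,
      control = list(weighting = weightTfIdf)
    )

    # Compare the same slice of each matrix: rows 505-510, tweets 5-10
    as.matrix(tdm)[505:510, 5:10]
    as.matrix(tdm_tfidf)[505:510, 5:10]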

4. Retaining document metadata

The last item is how to retain metadata from your documents. If you remember all the way back to chapter one, we threw out a lot of information from our data frame coffee_tweets by selecting only the text column. If you want to keep the meta-information when making a corpus, you use the DataframeSource function. With DataframeSource, the first column must be called doc_id and the second text. Every other column is then treated as meta-information. In this example, the first two columns are declared as doc_id and text, and the data frame is passed into VCorpus using DataframeSource. We also apply the simple clean_corpus function before examining the results. The new corpus acts as a list with nested elements. You can use double brackets to select a document and single brackets to select its content or metadata. However, it's easier to use the content function and the meta function to extract the list elements corresponding to the text content or to metadata like id, author, date, and the default "en" for English.
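A sketch of that workflow, assuming coffee_tweets already has doc_id and text as its first two columns and that clean_corpus is the custom cleaning function built earlier in the course:

    library(tm)

    # Build a corpus that retains the extra columns as metadata
    df_source <- DataframeSource(coffee_tweets)
    df_corpus <- VCorpus(df_source)

    # Apply the course's preprocessing wrapper
    df_corpus <- clean_corpus(df_corpus)

    # Extract the text of the first document
    content(df_corpus[[1]])

    # Extract its metadata: id, author, date, language, and so on
    meta(df_corpus[[1]])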

5. Let's practice!

The final push! You are so close! In chapter 4, you'll apply all your new text mining skills to a real case study. You're going to get a limited number of employer reviews from Amazon and Google to see if you can extract any useful insights.
