1. Steps 4 & 5: Feature extraction & analysis
2. Feature extraction
Step 4 is where you extract features from the text. This can take the form of sentiment scoring or, as in this case, organizing the cleaned corpus text into a bigram TDM. To do so, you use RWeka to build a tokenizer function, as in chapter three, and pass it as a control when constructing the TermDocumentMatrix. In this example, tokenizer is applied to the positive Amazon employee reviews to make amzn_p_tdm. In this case study, making the bigram TDM is the only major feature extraction you perform on your corpora.
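As a rough sketch of this step, the snippet below builds a bigram TDM with a tokenizer passed as a control. The three toy reviews and the corpus name amzn_pros_corp are assumptions made up for illustration; tokenizer and amzn_p_tdm follow the names used in the narration. It assumes the tm and RWeka packages (and a working Java installation for RWeka) are available.

```r
library(tm)
library(RWeka)

# Hypothetical stand-in for the cleaned positive Amazon review corpus.
# These three "reviews" are invented for illustration only.
amzn_pros_corp <- VCorpus(VectorSource(c(
  "good pay great benefits",
  "smart people good pay",
  "great benefits smart people"
)))

# Bigram tokenizer: min = 2 and max = 2 keep only two-word tokens
tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# Pass the tokenizer as a control when building the TDM
amzn_p_tdm <- TermDocumentMatrix(
  amzn_pros_corp,
  control = list(tokenize = tokenizer)
)

# Each row is a bigram, each column is a document
amzn_p_tdm
```

The same call could be repeated for each of the other corpora, changing only the corpus passed in.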
3. Get term frequencies
The fifth step is to analyze the extracted text features. An initial analysis is simply exploring your data to make sure it is in the intended form. This is one of the basic exercises from the chapter: you convert the TDM to a matrix, calculate rowSums, sort the bigrams in decreasing order, and then review the top five tokens (elements 1 to 5). So not only have you extracted the bigrams from the text, but you are also pulling out the most frequent bigrams to inform your analysis.
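The frequency check described here might look something like the following. The toy corpus is invented, and the intermediate names amzn_pros_corp, amzn_p_tdm_m, amzn_p_freq, and term_frequency are assumptions chosen for this sketch rather than names confirmed by the course.

```r
library(tm)
library(RWeka)

# Hypothetical toy corpus standing in for the cleaned positive reviews
amzn_pros_corp <- VCorpus(VectorSource(c(
  "good pay great benefits",
  "smart people good pay",
  "great benefits smart people",
  "good pay fast paced"
)))
tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
amzn_p_tdm <- TermDocumentMatrix(
  amzn_pros_corp,
  control = list(tokenize = tokenizer)
)

# Convert the TDM to a regular matrix (fine for small data;
# large TDMs can use a lot of memory when densified)
amzn_p_tdm_m <- as.matrix(amzn_p_tdm)

# Total count of each bigram across all documents
amzn_p_freq <- rowSums(amzn_p_tdm_m)

# Sort in decreasing order and review the first five entries
term_frequency <- sort(amzn_p_freq, decreasing = TRUE)
term_frequency[1:5]
```

If the top entries look like sensible two-word phrases, the TDM is probably in the intended form.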
4. Create visuals with plotrix
Across steps 4 and 5, you will do a lot of feature extraction, because you will end up with four bigram TDMs representing your four corpora. For the analysis, you now have a foundation for extracting word associations and making some text-based plots. One of the visuals you are going to make is shown here. It's a pyramid plot from the plotrix package. In this exercise, you will find the common_words shared by the negative Amazon and Google bigrams. Next, you calculate the absolute difference between the two frequency vectors. Then you add the difference to the common words using cbind and order all rows by the difference column with decreasing set to TRUE. Finally, you create a top15_df data frame and pass it to pyramid.plot with some aesthetics. This is just one of the interesting plots you will make during this case study, which should help lead you to a conclusion.
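The pyramid plot steps can be sketched roughly as below. The matrix all_tdm_m and every count in it are invented for illustration and are not results from the case study; common_words and top15_df follow the narration, while the plot titles, gap, and unit values are arbitrary choices for this example. The sketch takes at most 15 rows so that it also runs on this small toy input.

```r
library(plotrix)

# Hypothetical bigram counts for the negative reviews (made-up numbers).
# In the case study this would come from a TDM, not be typed by hand.
all_tdm_m <- matrix(
  c(12,  3,
     8, 15,
     0,  4,
     6,  6,
    20,  9,
     5,  0,
     7, 11),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("long hours", "work life", "free food", "upper management",
      "no work", "peak season", "life balance"),
    c("Amazon", "Google")
  )
)

# Keep only bigrams that appear in both companies' reviews
keep <- all_tdm_m[, 1] > 0 & all_tdm_m[, 2] > 0
common_words <- all_tdm_m[keep, , drop = FALSE]

# Absolute difference between the two frequency vectors
difference <- abs(common_words[, 1] - common_words[, 2])

# Attach the difference and order rows by it, largest first
common_words <- cbind(common_words, difference)
common_words <- common_words[
  order(common_words[, "difference"], decreasing = TRUE), , drop = FALSE
]

# Take the top 15 rows (or fewer if the toy data is smaller)
top_n <- min(15, nrow(common_words))
top15_df <- data.frame(
  x = common_words[1:top_n, 1],
  y = common_words[1:top_n, 2],
  labels = rownames(common_words)[1:top_n]
)

# Draw the pyramid plot with Amazon on the left and Google on the right
pyramid.plot(
  top15_df$x, top15_df$y,
  labels = top15_df$labels,
  top.labels = c("Amazon", "Bigram", "Google"),
  main = "Bigrams in common",
  gap = 8,
  unit = "count"
)
```

Ordering by the absolute difference puts the bigrams that most distinguish the two companies at one end of the plot, which is usually where the interesting contrasts are.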
5. Let's practice!