Step 4: Feature Extraction & Step 5: Time for analysis... almost there!

1. Revising the comparison cloud

Your data is now organized and, no matter how you cut it, term document matrix or tibble, you know the rental review polarity skews positive. This often happens because there are social pressures when writing a review. For example, people may dislike an aspect of a service but still throw in some positive language because they feel bad for the call center rep or wait staff. This section explores the aspects of good rentals while cutting through the polarity “grade inflation”.

2. Author effort

In the first exercise, you will explore the relationship between polarity and author effort. In your experience, I bet you write longer reviews when you are raving about something, like this DataCamp course, and conversely are briefer when writing a ho-hum review of a Python course. It turns out this happens with lots of people: the stronger your feelings, the more you may write, and more language means more opportunities for polarized word usage. Recall that the reviews tibble has a column called ID. Since the data is organized as a tibble with one row per word, counting the instances of each ID will tally the total number of words for that review. You will use this count in a scatter plot that shows the relationship between polarity and author effort.
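The counting idea can be sketched outside of the course's own code. This is a minimal Python illustration, not the exercise solution: the rows, the ids, and the polarity scores are all made up, and it assumes the tidy layout has one word per row.

```python
from collections import Counter

# Hypothetical tidy data: one (review id, word) pair per row.
tidy_rows = [
    (1, "great"), (1, "location"), (1, "and"), (1, "wonderful"), (1, "host"),
    (2, "fine"), (2, "stay"),
]

# Counting how often each id appears gives the number of words in that review.
words_per_review = Counter(review_id for review_id, _ in tidy_rows)

# Hypothetical polarity scores keyed by the same ids.
polarity = {1: 0.9, 2: 0.1}

# Each (effort, polarity) pair is one point in the scatter plot.
points = [(words_per_review[i], polarity[i]) for i in sorted(polarity)]
```

Plotting `points` with effort on one axis and polarity on the other gives the scatter plot described above.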

3. Comparisons

Next up, you’ll run the simple comparison cloud you created in chapter 3. To begin with, you will apply it to the reviews split by polarity as it was originally calculated. Remember that in a comparison cloud, the size of a word reflects its frequency. Further, a comparison cloud plots only the terms that distinguish the groups, not the ones they share.
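To get a feel for why some terms stand out in a comparison, here is a simplified Python sketch. It is an illustration of the idea only, not the plotting function's actual implementation: the two word lists are invented, and the `rates` helper and the `lean` score are hypothetical names introduced for this example.

```python
from collections import Counter

# Hypothetical words pulled from positive and negative reviews.
positive_words = "great location great host clean room".split()
negative_words = "dirty dirty room noisy location".split()

def rates(words):
    """Return each term's share of all words in one group."""
    counts = Counter(words)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

pos_rate = rates(positive_words)
neg_rate = rates(negative_words)

# A term used at a similar rate in both groups scores near zero;
# a term that belongs mostly to one group scores far from zero.
vocabulary = set(pos_rate) | set(neg_rate)
lean = {t: pos_rate.get(t, 0.0) - neg_rate.get(t, 0.0) for t in vocabulary}

top_positive = max(lean, key=lean.get)
top_negative = min(lean, key=lean.get)
```

In this toy data, shared terms such as "room" barely register, while terms concentrated in one group dominate.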

4. Revising the comparison cloud

In the last exercise of this section, you will redo the comparison cloud, but you will first apply the scale function to the polarity scores from qdap. If you aren’t familiar with it, the scale function standardizes your data. Using scale’s defaults, the column’s mean is subtracted from every data point, and each result is then divided by the standard deviation of all points. If you remember your stats days, this is a z-score. In this case, it has the effect of bringing the polarity mean down from its inflated level to zero. Since the scaled value is now used to subset the corpus, some reviews will move to the other subsection, and the comparison cloud will be different as a result.
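The arithmetic behind the z-score can be sketched in a few lines. This is a Python illustration with made-up polarity scores, not the course's code. It assumes the sample standard deviation is the one wanted, and it assumes reviews are split into positive and negative at zero, which the text above does not spell out.

```python
from statistics import mean, stdev

# Hypothetical polarity scores: every review looks positive ("grade inflation").
polarity = [0.25, 0.5, 0.75, 1.0, 1.5]

center = mean(polarity)
spread = stdev(polarity)  # sample standard deviation (n - 1 denominator)

# z-score: subtract the mean, then divide by the standard deviation.
scaled = [(p - center) / spread for p in polarity]

# Assumed split at zero, before and after scaling.
positive_raw = [p for p in polarity if p > 0]
positive_scaled = [p for p, z in zip(polarity, scaled) if z > 0]
```

Before scaling, all five made-up reviews land in the positive group. After scaling, only the reviews above the average score stay there, which is why the comparison cloud changes.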

5. Always more analysis can be done!

Keep in mind this is a short case study, and a lot more interesting sentiment analysis can be performed on the data. Pay close attention to the large words in the visuals so you can answer questions later. I personally don’t like word clouds much, but in only a few short exercises you will be able to see what makes a good rental review. There is no denying the benefit of using word frequency along with polarity as a starting point for a sentiment analysis.

6. Let's practice!
