Improving sentiment analysis

1. Improving sentiment analysis

With the sentiment dictionaries appended to our tidy text data, the sky is the limit in terms of how we want to explore and visualize the relationship between sentiment and other features of our data. In this video, we'll walk through a few functions that will be especially helpful for improving our sentiment analyses.

2. Count sentiment by rating

In our tidy_review data, we have a star rating for each review. After using an inner_join() to append the bing sentiment dictionary, here we count() how often each sentiment appears with each star rating. But what about the difference between the number of positive and negative words for each star rating? Because the data is tidy, sentiment and the count n are both columns, so each star rating from 1 to 5 has two rows: one for the positive count and one for the negative count. With the two counts sitting in different rows, computing the difference between positive and negative sentiment is awkward.
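As a rough sketch of that counting step, here is a small runnable example. The tidy_review rows and the bing_like lexicon below are made-up stand-ins, not the course data; in practice the lexicon would come from get_sentiments("bing") in the tidytext package, which has the same word and sentiment columns.

```r
library(dplyr)

# Hypothetical stand-in for tidy_review: one row per word per review
tidy_review <- tibble(
  stars = c(5, 5, 5, 1, 1, 3, 3),
  word  = c("great", "love", "vacuum", "broken", "terrible", "great", "bad")
)

# Tiny stand-in lexicon (assumption); the real one would be get_sentiments("bing")
bing_like <- tibble(
  word      = c("great", "love", "broken", "terrible", "bad"),
  sentiment = c("positive", "positive", "negative", "negative", "negative")
)

# Keep only words found in the lexicon, then count each sentiment per star rating
sentiment_counts <- tidy_review %>%
  inner_join(bing_like, by = "word") %>%
  count(stars, sentiment)

sentiment_counts
```

Notice that the word "vacuum" drops out, because inner_join() keeps only words that appear in the lexicon.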

3. Using pivot_wider()

There is a tidyverse function built specifically for this problem. The verb is pivot_wider(), and it comes from the tidyr package. To use pivot_wider(), first, we use the names_from argument to specify the column whose values should become the names of new columns, and second, we use the values_from argument to specify which column's values should fill those new columns. It sounds more confusing than it is. Here we can see that we have used pivot_wider() on the old sentiment column to create two columns: positive and negative. The values for these columns come from the count column n, which we passed to values_from. Note that since we have applied pivot_wider() to the sentiment column, there is now a single row for every star rating.
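To make the reshaping concrete, here is a minimal sketch on a hand-made count table. The numbers are invented for illustration, and values_fill = 0 is an extra we add (not mentioned in the video) so that a rating missing one sentiment gets a zero rather than NA.

```r
library(dplyr)
library(tidyr)

# Made-up counts in tidy (long) form: two rows per star rating
sentiment_counts <- tibble(
  stars     = c(1, 1, 5, 5),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(40L, 10L, 5L, 60L)
)

# Spread the sentiment column into one column per sentiment, filled from n
wide_counts <- sentiment_counts %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)

wide_counts
```

The result has one row per star rating, with negative and positive as separate columns.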

4. Computing overall sentiment

After calling pivot_wider() and turning our old sentiment column into multiple columns such that we have a single row for each star rating, computing the difference between the positive and negative counts is now a matter of using mutate(). We'll call this new column overall_sentiment, with positive values indicating more positive words than negative words, and vice versa, for reviews with the given star rating. At first glance, it looks like what we might expect, with the 5-star rating reviews having the most positive overall sentiment.
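The mutate() step might look something like the sketch below. The wide table and its counts are invented for illustration; only the column names positive, negative, stars, and overall_sentiment follow the video.

```r
library(dplyr)

# Made-up wide table: one row per star rating (illustrative numbers only)
wide_counts <- tibble(
  stars    = 1:5,
  positive = c(10L, 12L, 20L, 30L, 60L),
  negative = c(40L, 25L, 22L, 28L, 5L)
)

# Positive result: more positive than negative words; negative result: the reverse
overall <- wide_counts %>%
  mutate(overall_sentiment = positive - negative)

overall
```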

5. Visualize sentiment by rating

Let's visualize this new overall_sentiment. Because the order of the bars in the plot follows the structure of the data, let's first reorder the stars by overall_sentiment so the plot will be easy to read from most to least positive.
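One way to sketch this reordering is with base R's reorder(), which turns stars into a factor whose levels are sorted by another variable. That choice is our assumption, since the transcript does not name the function used; fct_reorder() from the forcats package is a common tidyverse alternative. The numbers are made up.

```r
library(dplyr)

# Made-up overall sentiment per star rating (illustrative only)
overall <- tibble(
  stars             = 1:5,
  overall_sentiment = c(-12, -30, -5, 1, 55)
)

# reorder() returns a factor; its levels run from lowest to highest overall_sentiment
ordered_stars <- overall %>%
  mutate(stars = reorder(stars, overall_sentiment))

levels(ordered_stars$stars)
```

A plot built from ordered_stars would then arrange the star ratings by their overall sentiment instead of in numeric order.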

6. Visualize sentiment by rating

To visualize this, it would be nice to have a different color for each star rating. However, to get distinct colors, the column mapped to the fill aesthetic needs to be a factor; a numeric column would be treated as continuous and drawn with a color gradient instead. To treat the stars column as a factor, we wrap it in a new function, as.factor(). We also add a subtitle to our labs() layer to make it clear that these are reviews for robotic vacuums.
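Here is a minimal sketch of such a plot. The data, the choice of geom_col(), the hidden legend, and the title wording are all assumptions for illustration; what follows the video is wrapping stars in as.factor() for the fill aesthetic and adding a subtitle in labs().

```r
library(ggplot2)

# Made-up overall sentiment per star rating (illustrative only)
overall <- data.frame(
  stars             = 1:5,
  overall_sentiment = c(-30, -13, -2, 2, 55)
)

p <- ggplot(
  overall,
  # as.factor() makes fill discrete, so each rating gets its own color
  aes(x = stars, y = overall_sentiment, fill = as.factor(stars))
) +
  geom_col(show.legend = FALSE) +
  labs(
    title    = "Overall Sentiment by Stars",   # hypothetical title
    subtitle = "Reviews for Robotic Vacuums",
    x        = "Stars",
    y        = "Overall Sentiment"
  )

p
```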

7. Visualize sentiment by rating

Now we can clearly see that not only is the 5-star rating overwhelmingly positive, but a rating of 1 to 4 isn't really positive at all, with a 4-star rating providing neutral overall sentiment at best. For marketing applications, this certainly highlights the disparity between a 5-star rating and any other rating.

8. Let's practice!

The pivot_wider() verb is one of many incredibly helpful, almost magical, functions in the tidyr package used for cleaning up and reshaping data frames. Let's practice using pivot_wider() to improve our sentiment analysis!
