
Appending dictionaries

1. Appending dictionaries

With four sentiment dictionaries available through the tidytext package, we have a lot of material to work with. However, the dictionaries won't do us much good on their own. We first need to append the sentiment dictionary to our tokenized and cleaned text data.

2. Using inner_join()

To append the sentiment dictionary, we need another of the join functions from dplyr. Remember that joins, as the name suggests, combine two data frames based on one or more matching columns. We've already used anti_join() to remove stop words from our data, and we'll now use inner_join(). An inner_join() keeps only the rows that have a match in both the left and the right data frame, and joins their columns. Let's illustrate this with our tidy_review data. If we pipe tidy_review into an inner_join() with the "loughran" sentiment dictionary, retrieved using get_sentiments(), then tidy_review is the left data frame and the "loughran" dictionary is the right data frame, and we can see that a new column with the sentiment tag for each word has been appended. Note that the number of rows has been drastically reduced, since only words with a match in the sentiment dictionary are retained. To be clear, we only know the sentiment of the words in our data that are also in the dictionary we're using; thus our sentiment analysis is conditioned on the dictionary we use.
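As a minimal sketch of this join, the code below uses small made-up data so that it runs without downloading anything. Both tidy_review and toy_lexicon here are hypothetical stand-ins: the course data is much larger, and in practice the right-hand side would come from get_sentiments("loughran"), which returns the same two columns (word and sentiment).

```r
library(dplyr)

# Hypothetical stand-in for the course's tidy_review: one row per word.
tidy_review <- tibble(
  id   = c(1, 1, 1, 2, 2, 2),
  word = c("easy", "clean", "floor", "trouble", "battery", "easy")
)

# Hypothetical mini-lexicon with the same columns (word, sentiment)
# as get_sentiments("loughran"); the real lexicon is far larger.
toy_lexicon <- tibble(
  word      = c("easy", "trouble", "doubt"),
  sentiment = c("positive", "negative", "uncertainty")
)

# Keep only the rows whose word appears in both data frames and
# append the sentiment column from the lexicon.
sentiment_review <- tidy_review %>%
  inner_join(toy_lexicon, by = "word")

sentiment_review
```

Only the three rows with a word in the toy lexicon survive the join, which is the row reduction described above.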

3. Counting sentiment

Now that we have appended a sentiment dictionary to our tidy data, we can summarize the emotional content of the document in whatever way we would like. For example, we can use count() to tally how often each sentiment appears. Here we see that the reviews are, in aggregate, equally positive and negative using this sentiment dictionary.
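A short sketch of counting by sentiment, assuming a joined data frame with a sentiment column. The sentiment_review tibble below is made-up toy data, so its counts say nothing about the actual reviews.

```r
library(dplyr)

# Hypothetical joined data: one row per word that matched the lexicon.
sentiment_review <- tibble(
  word      = c("easy", "trouble", "easy", "great", "problem", "doubt"),
  sentiment = c("positive", "negative", "positive",
                "positive", "negative", "uncertainty")
)

# One row per sentiment, with the number of matching words in column n.
sentiment_counts <- sentiment_review %>%
  count(sentiment)

sentiment_counts
```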

4. Counting sentiment

We can also count() by both word and sentiment to find which words are used most often for each sentiment. Here we can see that "easy", which has a positive sentiment, is used most often, while "trouble" is the most-used negative word.
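Counting by two columns can be sketched the same way. The data below is again hypothetical, and sort = TRUE is one way to put the most frequent pairs first; piping into arrange(desc(n)) would give the same order.

```r
library(dplyr)

# Hypothetical joined data with repeated words.
sentiment_review <- tibble(
  word      = c("easy", "easy", "easy", "trouble", "trouble", "great"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "positive")
)

# One row per (word, sentiment) pair, most frequent first.
word_counts <- sentiment_review %>%
  count(word, sentiment, sort = TRUE)

word_counts
```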

5. Visualizing sentiment

Suppose we want to visualize the most common words for only positive and negative sentiments. Because we are matching against more than one value, we filter with the %in% operator in place of == and use the c() function to combine the sentiments we want to keep. We can read this filter() as: filter sentiment_review and keep any row that has sentiment equal to positive OR equal to negative. We can then apply what we've learned about word counts, group_by(), slice_max(), ungroup(), and fct_reorder() to find the top ten positive and negative words in order.
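The pipeline described here might look like the following sketch. The data is hypothetical, the column name word2 is an illustrative choice, and the sketch keeps the top two words per sentiment rather than ten because the toy data is so small.

```r
library(dplyr)
library(forcats)

# Hypothetical joined data: three sentiments, several repeated words.
sentiment_review <- tibble(
  word = c(rep("easy", 3), rep("great", 2), "love",
           rep("trouble", 3), rep("problem", 2), "hard", "doubt"),
  sentiment = c(rep("positive", 6), rep("negative", 6), "uncertainty")
)

word_counts <- sentiment_review %>%
  # Keep rows whose sentiment is positive OR negative.
  filter(sentiment %in% c("positive", "negative")) %>%
  count(word, sentiment) %>%
  group_by(sentiment) %>%
  # Top two words per sentiment here; the course keeps the top ten.
  slice_max(n, n = 2) %>%
  ungroup() %>%
  # Reorder the words by their count so they plot in order.
  mutate(word2 = fct_reorder(word, n))

word_counts
```

Note that slice_max() keeps ties by default, so a group can return more rows than requested when counts are equal at the cutoff.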

6. Visualizing sentiment

Now we visualize these top ten positive and negative words using facet_wrap(). Instead of setting scales equal to "free_y" in facet_wrap(), we free both axes with "free". And instead of using ggtitle(), we again use the labs() layer to change the title and axis titles at the same time.
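A sketch of such a plot, under several assumptions: word_counts is a small hypothetical summary, the geom_col() and coord_flip() layers are one common way to draw word-count bars rather than something stated in the transcript, and the title and axis labels are placeholders.

```r
library(ggplot2)
library(forcats)

# Hypothetical summary: top words per sentiment with their counts.
word_counts <- data.frame(
  word      = c("easy", "great", "trouble", "problem"),
  sentiment = c("positive", "positive", "negative", "negative"),
  n         = c(3, 2, 3, 2)
)
# Order the words by count so the bars appear in order.
word_counts$word2 <- fct_reorder(word_counts$word, word_counts$n)

sentiment_plot <- ggplot(word_counts, aes(x = word2, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  # One panel per sentiment; "free" lets both axes vary by panel.
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip() +
  # labs() sets the title and both axis titles in one layer.
  labs(title = "Sentiment Word Counts", x = "Words", y = "Count")

sentiment_plot
```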

7. Visualizing sentiment

Here we can see that the negative words are used to describe difficulties cleaning while the positive words describe reactions to the performance.

8. Let's practice!

Once again, the combination of the tidyverse and tidytext packages makes what could be incredibly complicated quite trivial. Let's practice!