1. Plotting word counts
With the tidyverse and tidytext packages at our disposal, we have a set of powerful tools for analyzing text. We've already used a number of dplyr functions in combination with unnest_tokens() to compute word counts. We'll now cover the basics of visualizing text with ggplot2.
2. Starting with tidy text
Let's look again at the tidied robotic vacuum product review data. Before tokenizing and removing stop words, we've used the mutate() verb along with a new function called row_number(). Like n() or desc(), row_number() is a helper function that, you guessed it, creates a number for each row. We use this to create a new ID column for each of the product reviews. Then we tokenize the reviews using unnest_tokens() and remove the standard dictionary of stop words using anti_join().
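Here's a minimal sketch of that tidying pipeline. The review_data tibble below is a hypothetical stand-in for the robotic vacuum reviews, with made-up products and text, so the column names and values are assumptions rather than the actual course data.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the product review data (made-up values)
review_data <- tibble(
  product = c("Vacuum A", "Vacuum A", "Vacuum B"),
  stars   = c(5, 2, 4),
  review  = c(
    "It is what it is",
    "The battery died quickly",
    "Great suction and the battery lasts"
  )
)

tidy_review <- review_data %>%
  mutate(id = row_number()) %>%        # add an ID for each review
  unnest_tokens(word, review) %>%      # one row per word
  anti_join(stop_words, by = "word")   # drop the standard stop words

tidy_review
```

Because the ID is created before tokenizing, every word keeps the ID of the review it came from.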
3. Starting with tidy text
As we print this tidied data, the new ID column makes it clear that after tokenizing, each row or observation is a single word, so each review has as many rows as it has non-stop-words. For example, the first review was full of stop words, and the second review has only two non-stop-words.
This format, with each observation in its own row and each variable of interest (in this case, the id, date, product, stars, and words) in its own column, is known as a tidy data frame. When we talk about the tidyverse, tidy data is one of the principles shared across its packages. It's important enough that the tidyverse is even named for it!
4. Visualizing counts with geom_col()
With our text tidied, our initial summary is again as quick as calling count(), followed by arrange() and desc() to make it readable. Even better, let's visualize these counts with a bar plot! Here we use ggplot(), where the first argument is our count data and the second argument uses the aes(), or aesthetic, helper function to map columns in our data to elements of the plot. Here it is natural to assign word to the x-axis and n (the count) to the y-axis. Finally, we add the geom_col() layer to produce a bar plot. As an aside, the “col” in geom_col() stands for column, where a column plot is another way to refer to a bar plot.
However, what we get is a mess. There are a number of things wrong with this bar plot. Let's address them in turn.
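The count-and-plot step from this slide might look like the sketch below. The tidy_review tibble here is a small hypothetical example with made-up words, not the real review data, so the resulting plot is far less crowded than the one described above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical tidy text: one row per non-stop-word (made-up values)
tidy_review <- tibble(
  id   = c(1, 1, 2, 2, 3, 3, 3),
  word = c("battery", "suction", "battery", "clean",
           "battery", "clean", "floor")
)

# Count each word and sort from most to least frequent
word_counts <- tidy_review %>%
  count(word) %>%
  arrange(desc(n))

# Map word to the x-axis and the count n to the y-axis
ggplot(word_counts, aes(x = word, y = n)) +
  geom_col()
```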
5. filter() before visualizing
The first problem is we are trying to plot way too many words at once. What we typically care about are the words with the largest counts. Right after using count() on the tidy text, we can filter based on the count n. Here we keep only those words used more than 300 times in the product reviews. This cutoff will depend on the data.
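As a rough sketch, the filtering step could be written like this. The data is again hypothetical, and min_count is an illustrative name set to a tiny cutoff that suits the toy example. On the real reviews the cutoff described above is 300.

```r
library(dplyr)

# Hypothetical tidy text (made-up words and frequencies)
tidy_review <- tibble(
  word = c(rep("battery", 4), rep("clean", 3), rep("floor", 2),
           "noisy", "pet")
)

# Illustrative cutoff for the toy data; choose it based on your own data
min_count <- 2

word_counts <- tidy_review %>%
  count(word) %>%
  filter(n > min_count) %>%   # keep only the frequently used words
  arrange(desc(n))

word_counts
```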
6. filter() before visualizing
After filtering, we still have the most frequently used words at the top of the list, but the number of rows has gone from over 9,000 to just 25. We can fit that in a bar plot.
7. Using coord_flip()
The second problem was that the words overlapped and were hard to read on the x-axis. After geom_col(), we can add coord_flip(). This flips the coordinates of our plot so it's easier to read our filtered set of word counts. For good measure, we use ggtitle() to explain what it is we're plotting.
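Putting those layers together might look like the following sketch. The word_counts tibble is a hypothetical, already filtered set of counts, and the title text is just an example.

```r
library(dplyr)
library(ggplot2)

# Hypothetical filtered word counts (made-up values)
word_counts <- tibble(
  word = c("battery", "clean", "floor"),
  n    = c(4, 3, 2)
)

p <- ggplot(word_counts, aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +                    # put the words on the vertical axis
  ggtitle("Review Word Counts")     # say what is being plotted

p
```

Note that aes() still maps word to x and n to y. coord_flip() only changes how the axes are drawn.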
8. Let's practice!
You might be wondering why our use of arrange() and desc() didn't transfer to the bar plot. We'll look at that next. For now, let's practice!