The babynames data

1. The babynames data

So far in this course, we've been using the counties dataset, which contains US census data at a county level. In this chapter, we're going to analyze a new dataset, one representing the names of babies born in the United States each year.

2. The babynames data

This will allow us to both practice what we've already learned on a new dataset, and learn a few new dplyr tools for exploring data. We'll also learn a bit about the ggplot2 package, which is the Tidyverse visualization package. In the babynames table, each observation, or row, represents one combination of a year and a name, like 1880 and John. The third variable, number, represents the number of babies born in the United States with that name. For example, in 1880 there were 102 babies born named Aaron.

3. Frequency of a name

We can learn a lot from this table using some of the verbs we learned earlier. For example, suppose we want to find the frequency of the name Amy in each year. We could filter for the rows where name equals Amy. Unless your first name is very rare in the United States, you can probably find your own name in the table as well. We can work with this output as a table, but it's probably more useful to turn it into a line graph, so we can see how the frequency varies over time.
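As a minimal sketch of that filter step: the small table below is a made-up stand-in for the course's babynames data (same three columns, invented counts), so it runs on its own.

```r
library(dplyr)

# Toy stand-in for the babynames table; the counts are invented for illustration
babynames <- tibble(
  year   = c(1880, 1880, 1980, 1980),
  name   = c("Amy", "John", "Amy", "John"),
  number = c(100, 9000, 20000, 35000)
)

# Keep only the rows where the name is Amy
babynames_filtered <- babynames %>%
  filter(name == "Amy")

print(babynames_filtered)
```

Swapping "Amy" for your own name would give the same kind of table, one row per year in which that name appears.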

4. Amy plot

We'll visualize the data using the ggplot2 package, which we can load with library(ggplot2). There are three parts to a ggplot2 plot: the data, the aesthetics, and the layers. The data, in this case, is the filtered version of babynames, babynames_filtered. We call the ggplot() function to begin the plot, passing the data as the first argument and the aesthetics as the second. The aesthetics are specified using the aes() function, and inside it, we specify what we would like to display on each axis: year on the x-axis and number on the y-axis. To tell ggplot that we want a line plot, we must add a layer,

5. Amy plot

which we do by adding a plus sign after the ggplot() call and calling the geom_line() function, which draws the data as a line plot.
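Putting the three parts together might look like the sketch below. The babynames_filtered table here is a toy version with invented counts, built inline so the example is self-contained; in the course it would come from filtering the real data.

```r
library(dplyr)
library(ggplot2)

# Toy version of the filtered table; the counts are invented for illustration
babynames_filtered <- tibble(
  year   = c(1880, 1930, 1980, 2010),
  name   = "Amy",
  number = c(100, 2000, 20000, 3000)
)

# Data first, aesthetics second, then a line layer added with +
amy_plot <- ggplot(babynames_filtered, aes(x = year, y = number)) +
  geom_line()

print(amy_plot)
```

Storing the plot in a variable (amy_plot is just a name chosen for this sketch) is optional; calling ggplot() and geom_line() directly at the console draws the same figure.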

6. Amy plot

Based on this, we can see that many babies born in the 1970s and 1980s were named Amy, but relatively few are today. Notice how nicely dplyr and ggplot2 worked together to draw these insights.

7. Filter for multiple names

Besides filtering for just one name, you could filter for multiple names using the percent-in-percent operator, written %in%. You could search for the names Amy and Christopher by writing name, then %in%, then a vector created with c() that contains Amy and Christopher. In the exercises, you'll see how this can be used to make a graph of multiple names.
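Here is a small sketch of that multi-name filter. The table is again a made-up stand-in for babynames, with invented counts, and babynames_multiple is simply a name chosen for the result.

```r
library(dplyr)

# Toy stand-in for the babynames table; the counts are invented for illustration
babynames <- tibble(
  year   = c(1980, 1980, 1980, 2010, 2010, 2010),
  name   = c("Amy", "Christopher", "John", "Amy", "Christopher", "John"),
  number = c(20000, 49000, 35000, 3000, 14000, 11000)
)

# Keep rows whose name is any of the values in the vector
babynames_multiple <- babynames %>%
  filter(name %in% c("Amy", "Christopher"))

print(babynames_multiple)
```

Unlike name == "Amy", which tests against a single value, %in% keeps a row when its name matches any element of the vector.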

8. When was each name most common?

As one more use of dplyr, recall that you can use the slice_max verb, along with group_by, to find the year in which each name was most common. For instance, this tells us that the year when the most babies were named Garfield was 1880. In the exercises, you'll see that you could use a similar approach to find the most common name in every year.
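The grouped approach could be sketched as follows. The table is a toy stand-in with invented counts, arranged so that Garfield peaks in 1880 as described above; peak_years is a name chosen for this sketch.

```r
library(dplyr)

# Toy stand-in for the babynames table; the counts are invented for illustration
babynames <- tibble(
  year   = c(1880, 1980, 1880, 1980),
  name   = c("Garfield", "Garfield", "Amy", "Amy"),
  number = c(120, 10, 100, 20000)
)

# Within each name, keep the single row with the largest number
peak_years <- babynames %>%
  group_by(name) %>%
  slice_max(number, n = 1) %>%
  ungroup()

print(peak_years)
```

Grouping by year instead of name would, by the same logic, keep the most common name within each year.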

9. Let's practice!

Let's practice!