Grouped mutates

1. Grouped mutates

Previously, we learned about the babynames data, and how we can use it to answer questions about specific names or years.

2. babynames graph

This graph shows the frequency of three different baby names over time. The y-axis represents the number of people born in each year with the name Matthew, Steven, or Thomas. But remember that a different total number of babies is born in each year, so the raw count isn't exactly what we're interested in. Rather, we want to know what percentage of people born in that year have that name. To calculate that, we'll need the total number of people born in each year, and to get it, you'll learn how to do a grouped mutate.

3. Review: group_by() and summarize()

Remember how group_by and summarize work. We could first tell dplyr to group by the year column, then summarize to calculate the sum of the number column. This returns a table with one row for every year, which is already pretty useful by itself. But we want the total number of people born in each year alongside the original data. For that, change the summarize to a mutate.
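
As a rough sketch of the group_by and summarize step, here is how it might look on a tiny stand-in table. The name babynames_toy and its counts are made up for illustration and are not the real babynames data; only the column names year, name, and number follow the lesson.

```r
library(dplyr)

# Toy stand-in for the babynames data; the counts are invented for illustration.
babynames_toy <- tibble(
  year   = c(1880, 1880, 1880, 1885, 1885, 1885),
  name   = c("Matthew", "Steven", "Thomas", "Matthew", "Steven", "Thomas"),
  number = c(100, 50, 350, 200, 100, 700)
)

# Collapse to one row per year, holding the total count for that year.
year_totals <- babynames_toy %>%
  group_by(year) %>%
  summarize(year_total = sum(number))

year_totals
```

The result has one row per year, so the individual names are gone. That is why this form alone isn't enough to compute a per-name share.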

4. Combining group_by() and mutate()

We call this a grouped mutate. Just like group_by and summarize work well together, group_by and mutate are a great pair. The group_by tells dplyr that we only want to add up within each year. Then, the mutate creates a new column called year_total, with the total number of people born in that year in this dataset. Notice from the header that the table is still grouped by year, which could affect other verbs we want to use in the future. In particular, it can make other mutates or filters slower to run, especially if there are a lot of groups in the table.
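
A minimal sketch of a grouped mutate, again on a made-up table (babynames_toy and its counts are assumptions for illustration, not the real dataset):

```r
library(dplyr)

# Toy stand-in for the babynames data; the counts are invented for illustration.
babynames_toy <- tibble(
  year   = c(1880, 1880, 1880, 1885, 1885, 1885),
  name   = c("Matthew", "Steven", "Thomas", "Matthew", "Steven", "Thomas"),
  number = c(100, 50, 350, 200, 100, 700)
)

# Keep every original row and add the per-year total next to it.
with_totals <- babynames_toy %>%
  group_by(year) %>%
  mutate(year_total = sum(number))

with_totals

# The result is still grouped by year.
group_vars(with_totals)
```

Unlike summarize, mutate keeps all six rows; each row simply carries the total for its own year.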

5. ungroup()

Therefore, it's good practice to use one more dplyr verb, ungroup(), since we're done with grouped calculations.
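
Continuing with the same kind of made-up table (babynames_toy is a hypothetical stand-in, not the real data), ungrouping after the grouped mutate could be sketched like this:

```r
library(dplyr)

# Toy stand-in for the babynames data; the counts are invented for illustration.
babynames_toy <- tibble(
  year   = c(1880, 1880, 1880, 1885, 1885, 1885),
  name   = c("Matthew", "Steven", "Thomas", "Matthew", "Steven", "Thomas"),
  number = c(100, 50, 350, 200, 100, 700)
)

# Grouped mutate, then drop the grouping once the per-year work is done.
ungrouped_totals <- babynames_toy %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup()

# No grouping variables remain; the year_total column is unchanged.
group_vars(ungrouped_totals)
```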

6. Add the fraction column

Now that we have the total in each year, we can calculate the fraction of people born in each year that have each name. This is number divided by year_total. So we do one more mutate to add that column. We can now see the fraction of babies in 1880 that received each name. This could make your graphs more interpretable, since they'll show the fraction of babies born in each year that have a particular name rather than a raw count.
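
Putting the steps together, a sketch of the full pipeline on a made-up table might look like the following. The table babynames_toy and its counts are invented for illustration; the column names number, year_total, and fraction follow the narration.

```r
library(dplyr)

# Toy stand-in for the babynames data; the counts are invented for illustration.
babynames_toy <- tibble(
  year   = c(1880, 1880, 1880, 1885, 1885, 1885),
  name   = c("Matthew", "Steven", "Thomas", "Matthew", "Steven", "Thomas"),
  number = c(100, 50, 350, 200, 100, 700)
)

babynames_fraction <- babynames_toy %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%    # total births per year
  ungroup() %>%
  mutate(fraction = number / year_total)  # share of that year's births

babynames_fraction
```

In this toy data every count doubles between the two years, yet each name's fraction stays the same. A graph of raw counts would suggest growing popularity, while a graph of the fraction would be flat.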

7. Comparing visualizations

Remember how you graphed the number of babies born in each year. If you graphed this fraction instead, you'd have seen a very different visualization! This is because the dataset includes relatively few babies from the 1800s and early 1900s. Thanks to our grouped mutate, we have a better visualization of the relative popularity of the name in each year.

8. Let's practice!

Let's practice!