The group_by, summarize, and ungroup verbs

1. The group_by, summarize, and ungroup verbs

We've learned how to aggregate data using the count() verb, but count() is a special case of a more general set of verbs: group_by and summarize.
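To make the "special case" idea concrete, here is a minimal sketch. The counties table below is a made-up toy stand-in, not the real dataset, and the column names are assumptions. It suggests that counting rows per state with count() gives the same result as grouping by state and summarizing with n().

```r
library(dplyr)

# Hypothetical toy stand-in for the counties table; values are made up.
counties <- tibble(
  state = c("Alabama", "Alabama", "Texas", "Texas"),
  population = c(1000, 500, 3000, 1500)
)

# count() tallies the rows for each state...
with_count <- counties %>%
  count(state)

# ...which matches grouping by state and summarizing with n().
with_summarize <- counties %>%
  group_by(state) %>%
  summarize(n = n())

print(with_count)
print(with_summarize)
```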

2. Summarize

The summarize verb takes many observations and turns them into one observation. If we wanted to find the total population of the United States, we could use the summarize() verb. We provide a new variable name for this total, total_population, and set it equal to the sum of the population column.
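As a rough sketch of that call, assuming a toy counties table with a population column (the values here are invented for illustration):

```r
library(dplyr)

# Hypothetical toy stand-in for the counties table; values are made up.
counties <- tibble(
  state = c("Alabama", "Alabama", "Texas", "Texas"),
  population = c(1000, 500, 3000, 1500)
)

# Collapse all rows into a single row holding the total.
total <- counties %>%
  summarize(total_population = sum(population))

print(total)
```

The result is a table with one row and one column, total_population.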

3. Aggregate and summarize

We can define multiple variables in a summarize call, and aggregate each in different ways. For example, we could find the total population and the average unemployment rate, using the mean() function in this case.
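A minimal sketch of defining two summaries in one call follows. The toy table and the unemployment column name are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy stand-in for the counties table; values are made up.
counties <- tibble(
  state = c("Alabama", "Alabama", "Texas", "Texas"),
  population = c(1000, 500, 3000, 1500),
  unemployment = c(6, 8, 4, 5)
)

# Each named argument becomes one column of the one-row result.
overall <- counties %>%
  summarize(
    total_population = sum(population),
    average_unemployment = mean(unemployment)
  )

print(overall)
```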

4. Summary functions

There are other summary functions we can use in summarize(), such as median(), min() for minimum, max() for maximum, and n() for the size of the group. We could combine these to find summaries like the highest average income of any county, the median percentage that drives to work, or the average income level.
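Here is one way these functions might be combined. The income and drive column names are assumptions about how the table is laid out, and the numbers are invented.

```r
library(dplyr)

# Hypothetical toy stand-in for the counties table; values are made up.
counties <- tibble(
  state = c("Alabama", "Alabama", "Texas", "Texas"),
  income = c(40000, 30000, 60000, 50000),  # assumed: average income per county
  drive = c(80, 85, 75, 70)                # assumed: percent who drive to work
)

stats <- counties %>%
  summarize(
    max_income = max(income),      # highest county income
    min_income = min(income),      # lowest county income
    median_drive = median(drive),  # median percent driving to work
    mean_income = mean(income),    # average income across counties
    n_counties = n()               # number of rows in the (single) group
  )

print(stats)
```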

5. Aggregate within groups

Summarizing the entire table is useful, but we often want to aggregate within groups, such as finding the total population or the average unemployment rate within each state. We can achieve this by piping first into group_by(), choosing the variable to group on, state, and then piping from that into summarize(). The result is the total population and average unemployment for each state.
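Sketched on a toy table (invented values, assumed column names), the group_by() then summarize() pipeline could look like this:

```r
library(dplyr)

# Hypothetical toy stand-in for the counties table; values are made up.
counties <- tibble(
  state = c("Alabama", "Alabama", "Texas", "Texas"),
  population = c(1000, 500, 3000, 1500),
  unemployment = c(6, 8, 4, 5)
)

# One output row per state instead of one row overall.
by_state <- counties %>%
  group_by(state) %>%
  summarize(
    total_population = sum(population),
    average_unemployment = mean(unemployment)
  )

print(by_state)
```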

6. Sorting summaries

It's useful to add an additional step that sorts the results, so that we can focus on the most notable examples. We could sort the results by average unemployment in descending order by nesting the desc() function inside arrange(). This tells us that Mississippi is the state with the highest unemployment.
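The sorting step can be sketched as follows. The table is a made-up toy example, so the ordering it produces says nothing about the real data.

```r
library(dplyr)

# Hypothetical toy stand-in for the counties table; values are made up.
counties <- tibble(
  state = c("Alabama", "Alabama", "Texas", "Texas"),
  population = c(1000, 500, 3000, 1500),
  unemployment = c(4, 6, 7, 9)
)

sorted <- counties %>%
  group_by(state) %>%
  summarize(
    total_population = sum(population),
    average_unemployment = mean(unemployment)
  ) %>%
  # desc() flips the sort so the highest value comes first.
  arrange(desc(average_unemployment))

print(sorted)
```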

7. Metro column

Finally, we can group by multiple columns at the same time. The dataset also includes a metro column, which describes whether the county is in a metro area (that is, a city) or is nonmetro.

8. Grouping on multiple columns

We can group by both columns by passing both column names to group_by. This will result in one row for each combination of state and metro. Instead of 50 observations in the output, we have 97, since a few states don't have any counties that aren't metro areas. For instance, here we see that the total population in Alabama metro areas is 3.6 million, and the population in nonmetro areas is 1.2 million. Notice that the result is still grouped by state: you can see "Groups: state" at the top of the table. When you use summarize on a table that has multiple groups, only the last group gets "peeled off". This is useful when you want to continue doing additional summaries or aggregations.
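A small sketch of grouping on two columns, again on an invented toy table. In this toy data Texas has no nonmetro rows, so the output has three rows rather than four, and group_vars() is used to check which grouping remains after summarize().

```r
library(dplyr)

# Hypothetical toy stand-in for the counties table; values are made up.
counties <- tibble(
  state = c("Alabama", "Alabama", "Texas", "Texas"),
  metro = c("Metro", "Nonmetro", "Metro", "Metro"),
  population = c(1000, 500, 3000, 1500)
)

by_state_metro <- counties %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population))

print(by_state_metro)

# Only the last grouping variable (metro) is peeled off; state remains.
print(group_vars(by_state_metro))
```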

9. Ungroup

If you don't want to keep state as a group, you can add another dplyr verb: ungroup().
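For illustration, here is how ungroup() might be added to the end of the pipeline, using the same kind of made-up toy table:

```r
library(dplyr)

# Hypothetical toy stand-in for the counties table; values are made up.
counties <- tibble(
  state = c("Alabama", "Alabama", "Texas", "Texas"),
  metro = c("Metro", "Nonmetro", "Metro", "Metro"),
  population = c(1000, 500, 3000, 1500)
)

ungrouped <- counties %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  ungroup()  # drop the remaining state grouping

print(ungrouped)
print(is_grouped_df(ungrouped))
```

After ungroup(), later verbs such as summarize() operate on the whole table again rather than within each state.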

10. Let's practice!

You've learned three new verbs: group_by, summarize, and ungroup. Let's practice!
