The count verb

1. The count verb

So far in this course, we've learned to select variables from a dataset, to filter and sort observations, and to create new variables with the mutate verb. But until now, we've been working at the same level as the initial data, where every observation corresponds to one US county. In this chapter, you'll learn to aggregate data, that is, to take many observations and summarize them into one. This is a common strategy for making datasets manageable and interpretable.

2. Count

One way we can aggregate data is to count it: to find out the number of observations. The dplyr verb for this is count(). Called on the counties table with no other arguments, it returns a table with one row and one column, called n. This tells us there are 3,138 observations in the table, that is, 3,138 counties in the dataset.
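As a minimal sketch of this step, assuming dplyr is installed: the table counties_demo below is a hypothetical toy stand-in with made-up values, not the course's counties dataset, so the count it gives is 6 rather than 3,138.

```r
library(dplyr)

# Hypothetical toy table with made-up values (NOT the course's counties data)
counties_demo <- tibble(
  state      = c("Alabama", "Alabama", "Alaska", "Texas", "Texas", "Texas"),
  county     = c("A1", "A2", "B1", "C1", "C2", "C3"),
  population = c(100, 200, 50, 300, 400, 500)
)

# count() with no variables returns one row and one column, n
total <- counties_demo %>%
  count()

total
```

The single value in n is simply the number of rows in the table.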

3. Count variable

Counting the total number of observations can be useful, but the verb is most valuable when we give it a specific variable to count. For example, we could count the number of counties in each state by passing the state variable to count(). Notice that the resulting table has 50 observations: one for each of the 50 states. We've aggregated more than three thousand observations into a more manageable number. The second column, n, tells us that there are 67 counties in Alabama, 28 in Alaska, and so on.
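Counting by a variable might look like the sketch below. It again assumes dplyr is installed and uses counties_demo, a hypothetical toy table with made-up values, so the counts per state are illustrative only.

```r
library(dplyr)

# Hypothetical toy table with made-up values (NOT the course's counties data)
counties_demo <- tibble(
  state      = c("Alabama", "Alabama", "Alaska", "Texas", "Texas", "Texas"),
  county     = c("A1", "A2", "B1", "C1", "C2", "C3"),
  population = c(100, 200, 50, 300, 400, 500)
)

# One row per distinct state; n holds the number of rows (counties) in each
by_state <- counties_demo %>%
  count(state)

by_state
```

The result has one row for each distinct value of state, with the states in alphabetical order and the row counts in n.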

4. Count and sort

When we explore datasets, we're often interested in sorting the counted data to find the most common observations. The count verb takes an additional argument, sort, that allows us to do just that. We can add sort = TRUE after the variable, and the resulting rows will be sorted from the most common observations to the least. This tells us that Texas is the state with the most counties, followed by Georgia and Virginia.
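A small sketch of the sorted count, under the same assumptions as before: dplyr is installed, and counties_demo is a hypothetical toy table with made-up values.

```r
library(dplyr)

# Hypothetical toy table with made-up values (NOT the course's counties data)
counties_demo <- tibble(
  state      = c("Alabama", "Alabama", "Alaska", "Texas", "Texas", "Texas"),
  county     = c("A1", "A2", "B1", "C1", "C2", "C3"),
  population = c(100, 200, 50, 300, 400, 500)
)

# sort = TRUE orders the rows by n, largest first
by_state_sorted <- counties_demo %>%
  count(state, sort = TRUE)

by_state_sorted
```

In this toy table, Texas has the most rows, so it appears first.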

5. Count population

Finally, when we're adding up counties, we may want to weigh each of them differently. For example, recall that our counties dataset has a column called population. What if instead of finding the number of counties in each state, we wanted to know the total number of people in each state?

6. Add weight

We can add the argument wt, which stands for "weight", and set it to the population column: wt = population. This means that instead of counting rows, the n column adds up the population values within each state. In the result, instead of seeing the number of counties in each state, we see the total population. Here we can see that California is the US state with the highest population, followed by Texas.
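The weighted count can be sketched as follows, assuming dplyr is installed. As in the earlier sketches, counties_demo is a hypothetical toy table and its population values are made up, so the totals are illustrative only.

```r
library(dplyr)

# Hypothetical toy table with made-up values (NOT the course's counties data)
counties_demo <- tibble(
  state      = c("Alabama", "Alabama", "Alaska", "Texas", "Texas", "Texas"),
  county     = c("A1", "A2", "B1", "C1", "C2", "C3"),
  population = c(100, 200, 50, 300, 400, 500)
)

# wt = population makes n the sum of population per state,
# rather than the number of rows per state
state_population <- counties_demo %>%
  count(state, wt = population, sort = TRUE)

state_population
```

The column is still called n, but it now holds a per-state sum of population instead of a row count.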

7. Let's practice!

Counting is a very useful type of aggregation when you're starting to analyze a dataset, and you'll get more practice with it in the exercises. Let's practice!
