Grouping and summarizing

1. Grouping and summarizing

In your last exercises, you cleaned up the raw data to create a processed set of votes,

2. Processed votes

which looked like this. Now we can start trying to pull real insights out of the data. There are far too many observations in this dataset to extract anything interpretable by looking through it manually, so we’ll need to choose a way to summarize it that's interesting to us. Here I’ll propose a simple metric we’ll be using a lot in this course:

3. Using “% of Yes votes” as a summary

“percentage of yes votes.” If a country votes yes on most resolutions, we might infer that it tends to agree with the international consensus, while if it mostly votes no, we could assume that it tends to go against it.

4. dplyr verb: summarize

To calculate this, you’ll use another dplyr verb: summarize. Summarize takes many rows and turns them into one, calculating overall metrics such as an average or a total.

5. dplyr verbs: summarize

For example, we can pipe the votes_processed data into a summarize operation, telling it to create a new variable called total. n() is a special function that, inside a summarize, means “the number of rows.” The result is a one-row data frame telling us the total number of rows: 353 thousand.
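As a minimal sketch of that step, here is the summarize call run on a tiny stand-in for votes_processed. The five rows, the country labels, and the non-yes vote codes are made up for illustration; only the column name vote and the meaning of 1 as “yes” come from the lesson, so the total here is 5 rather than 353 thousand.

```r
library(dplyr)

# Hypothetical stand-in for votes_processed: five made-up rows.
# Only vote == 1 meaning "yes" is taken from the lesson; the other
# codes and the country labels are placeholders.
votes_processed <- data.frame(
  year    = c(1947, 1947, 1947, 1949, 1949),
  country = c("A", "B", "C", "A", "B"),
  vote    = c(1, 1, 3, 1, 2)
)

# Collapse all rows into a single row holding the row count.
total_summary <- votes_processed %>%
  summarize(total = n())

print(total_summary)
```

The result has exactly one row and one column, total, no matter how many rows went in.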

6. dplyr verbs: summarize

We can add another variable to this summary with our “percentage yes” variable. Since 1 is “yes” in our dataset, we want the percentage of the rows where the vote variable is equal to 1. The way to calculate this in R is mean(vote == 1). (If you’d like to know why: this first compares each vote to 1 to get TRUE or FALSE, and mean() then treats each TRUE as 1 and each FALSE as 0, so the average is the fraction of yes votes.) By calculating this, we see that about 79.9 percent of United Nations votes in history were “yes” votes. This single overall number doesn’t tell us much on its own. We may want to know whether this percentage has changed over time.
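To make the mean(vote == 1) trick concrete, the sketch below breaks it into its two steps on a made-up vector of five votes, then uses it inside summarize. The data and the column name percent_yes are assumptions chosen for illustration, not the course's actual values.

```r
library(dplyr)

# Hypothetical votes: 1 means "yes"; the other codes stand in for non-yes votes.
votes_processed <- data.frame(vote = c(1, 1, 3, 1, 2))

# Step 1: the comparison yields one logical value per row.
is_yes <- votes_processed$vote == 1

# Step 2: logicals coerce to 1 (TRUE) and 0 (FALSE), so their mean
# is the fraction of rows that are "yes".
fraction_yes <- mean(is_yes)

# The same calculation as a second variable inside summarize.
# percent_yes is an assumed column name.
overall <- votes_processed %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

print(overall)
```

Note that the value is a fraction between 0 and 1 (three yes votes out of five here), so multiply by 100 if you want to display it as a percentage.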

7. dplyr verb: group_by

So we introduce another verb: group_by. When applied before a summarize operation, it tells the summarize to create one row for each subgroup instead of one row overall.

8. dplyr verbs: group_by

For example, here we perform the same summary, but first group by year before summarizing. Now instead of getting one row overall, we get one row for each year: we see that 56.9% of votes in 1947 were yes, but only 43.8% in 1949. In later lessons you’ll use this to visualize the trend in the percentage over time. Summarizing by subgroups is a powerful way to turn large datasets into smaller ones that you can interpret. In your exercises, you’ll try grouping by country instead of year, which shows you which countries are more prone to voting “yes” or “no”.
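A grouped version of the summary might look like the sketch below. It again uses a small made-up data frame, so the per-year values are not the 56.9% and 43.8% quoted above; the country labels and the percent_yes column name are assumptions for illustration.

```r
library(dplyr)

# Hypothetical stand-in for votes_processed: three votes in 1947, two in 1949.
votes_processed <- data.frame(
  year    = c(1947, 1947, 1947, 1949, 1949),
  country = c("A", "B", "C", "A", "B"),
  vote    = c(1, 1, 3, 1, 2)
)

# Grouping first makes summarize return one row per year
# instead of one row for the whole dataset.
by_year <- votes_processed %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

print(by_year)
```

Swapping group_by(year) for group_by(country) gives one row per country instead, which is the variation you'll try in the exercises.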

9. Let's practice!