1. The United Nations Voting Dataset
Hi, I'm Dave Robinson and I'll be your instructor for this course. I'm a data scientist and I really enjoy using R to dive into a dataset and discover interesting things. In this course, we're going to be using some of my favorite R packages, such as dplyr and ggplot2, to explore and draw conclusions from a real-world dataset. If you've used these packages before, this will be a great opportunity to practice using them in an analysis.
2. UN Voting Dataset
Let's introduce the dataset, which contains the historical voting data from the General Assembly of the United Nations. In the General Assembly, every member nation gets a vote, which makes this a great opportunity to explore the history of international relations.
In our data analysis vocabulary, rows of a dataset are called "observations" and columns are called "variables". In this dataset, each observation represents one combination of a roll call vote and a country.
3. UN Voting Dataset
The first variable, rcid, is the "roll call ID", which identifies one round of voting, such as a vote to approve a UN resolution.
4. UN Voting Dataset
The session variable represents the year-long session in the UN's history in which the vote was cast. Note that to keep the dataset at a reasonable size, only sessions from alternating years are included.
5. UN Voting Dataset
The vote column represents that country's choice.
6. UN Voting Dataset
For example, 1 means a yes vote, and 9 means the country was not a member of the United Nations.
7. UN Voting Dataset
The ccode column is a country code that uniquely specifies the country.
8. Votes in dplyr
To work with this in R, we’d start by loading the dplyr package, which offers tools for manipulating data. Then we can view the votes dataset by simply typing “votes” into the R prompt. Here you can see each of the columns of the table, as well as the table’s size: about 508 thousand rows.
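As a minimal sketch of those two steps, here is what loading dplyr and printing a table looks like. The small table below is a made-up stand-in with the same four columns, not the real votes data, so its values and size are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical stand-in for the course's votes data (values are made up)
votes <- tibble(
  rcid    = c(46, 46, 46, 47, 47),
  session = c(2, 2, 2, 2, 2),
  vote    = c(1, 1, 9, 3, 8),
  ccode   = c(2, 20, 40, 2, 20)
)

# Typing the name prints the column names, their types, and the table's dimensions
votes
```

With the real dataset, the same one-word command would report the full row count in its header.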
As with almost any dataset you’ll run into, you’ll need to clean this data before you can start analyzing it. Let’s review one of the most important tools for performing multiple sequential steps on data: the pipe operator.
9. The pipe operator
The pipe, typed as “percent greater than percent” (%>%), tells R to pass the object on its left in as the first argument of the function on its right.
10. The pipe operator
This lets us perform multiple operations in a series. While it may seem complicated if you haven’t used it much before, you’ll quickly get comfortable with it.
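To make that concrete, here is a small sketch using a plain numeric vector rather than the votes data. The names x, nested, and piped are only illustrative.

```r
library(dplyr)  # dplyr makes the %>% pipe available

x <- c(1, 4, 9)

# These two lines are equivalent: the pipe passes x in as the first argument
sqrt(x)
x %>% sqrt()

# Several steps in a series: take square roots, then add them up
nested <- sum(sqrt(x))             # read inside-out
piped  <- x %>% sqrt() %>% sum()   # read left to right
```

Both versions compute the same value; the piped one simply lists the steps in the order they happen.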
11. dplyr verbs
The operations we’ll usually be composing are dplyr’s “verbs”: functions that perform a single, simple action on a dataset. Recall that the “filter” verb subsets observations from a dataset, to remove rows that aren’t interesting to us.
12. dplyr verbs
The “mutate” verb adds a variable or changes an existing variable.
Here’s an example of each.
13. Original data
In our original dataset, the vote column has five possible values: 1 for yes, 2 for abstain, 3 for no, 8 meaning the country wasn’t present, and 9 meaning the country was not a member. We only care about the first three values: yes, abstain, and no.
14. dplyr verbs: filter
To remove the others, we pipe the dataset into the filter function. Within that filter we describe a condition: vote <= 3. The resulting data frame is smaller - it only kept the observations where our condition was met.
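Here is a rough sketch of that filter step. The table is a tiny made-up example with one row per possible vote value, not the real data, and the name votes_filtered is just illustrative.

```r
library(dplyr)

# Hypothetical toy data: one row for each possible vote code
votes <- tibble(
  rcid  = c(46, 46, 46, 46, 46),
  vote  = c(1, 2, 3, 8, 9),
  ccode = c(2, 20, 40, 41, 42)
)

# Keep only yes (1), abstain (2), and no (3) votes
votes_filtered <- votes %>%
  filter(vote <= 3)

votes_filtered
```

The rows coded 8 and 9 are dropped, so the result has three rows instead of five.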
15. dplyr verbs: mutate
You’ll also be using the mutate function. The session variable is hard to interpret, but if you know the first session of the United Nations was held in 1946, you can use it to get the year each vote was cast, which is much more interpretable. To do this you could pipe the data into the “mutate” function, where you can define your new “year” column as 1945 + the session. Notice the new “year” column with the result. In your exercises, you’ll also clean up the country column to include full country names instead of IDs.
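A minimal sketch of that mutate step might look like this. The sessions below are made-up example values and votes_with_year is an illustrative name.

```r
library(dplyr)

# Hypothetical toy data: three votes from different sessions
votes <- tibble(
  rcid    = c(1, 50, 900),
  session = c(1, 3, 25),
  vote    = c(1, 3, 2)
)

# Session 1 was held in 1946, so the year is 1945 plus the session number
votes_with_year <- votes %>%
  mutate(year = session + 1945)

votes_with_year
```

Session 1 maps to 1946, session 3 to 1948, and session 25 to 1970, while the original columns are left unchanged.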
16. Chaining operations in data cleaning
The pipe operator lets you chain these simple actions together in a sequence. You’ll get into the habit of piping many small, simple operations together to perform a richer analysis.
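Putting the two verbs together, a chained cleaning step could be sketched as follows. As before, the table is made-up toy data and votes_processed is just a name chosen for this example.

```r
library(dplyr)

# Hypothetical toy data mixing valid votes with absent (8) and non-member (9) codes
votes <- tibble(
  rcid    = c(46, 46, 47, 47),
  session = c(2, 2, 4, 4),
  vote    = c(1, 9, 3, 8),
  ccode   = c(2, 20, 2, 20)
)

# Each line is one simple step; the pipe feeds each result into the next verb
votes_processed <- votes %>%
  filter(vote <= 3) %>%
  mutate(year = session + 1945)

votes_processed
```

Reading top to bottom gives the order of operations: start with the data, drop the rows you don't need, then add the year column.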
17. Let's practice!