1. Tidy data
Consider this graph of UN voting trends over time. Like other graphs you've made, it maps "year" to the x-axis, "percentage yes" to the y-axis, and "country" to color. This graph, however, is faceted across the six topics, using one sub-graph for each topic.
2. United Kingdom
For instance, one single point on this graph represents the votes of the United Kingdom on the topic of colonialism in 2001.
8. Tidy data: topic is a variable
This useful kind of analysis is possible only with a particular structure of data: one where each observation, or row, represents a single combination of country, year, and topic. This allows every observation in the data to map to one point in your plot. Notice that this data includes a variable called "topic", which specifies for each observation whether it relates to colonialism, nuclear weapons, and so on. We call this arrangement "tidy".
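As a rough sketch of how a faceted graph like the one described above could be built, the code below uses ggplot2 on a small hand-made tidy data frame. The name by_country_year_topic, the column names, and all of the values are assumptions made up for illustration; they are not real UN voting results.

```r
library(ggplot2)

# Toy tidy data: one row per country-year-topic combination.
# All values are invented for illustration only.
by_country_year_topic <- data.frame(
  country = "United Kingdom",
  year = c(1999, 2001, 1999, 2001),
  topic = c("Colonialism", "Colonialism", "Nuclear weapons", "Nuclear weapons"),
  percent_yes = c(0.50, 0.60, 0.40, 0.45)
)

# Each row becomes exactly one point; facet_wrap() draws one sub-graph per topic.
p <- ggplot(by_country_year_topic,
            aes(x = year, y = percent_yes, color = country)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ topic)

# Call print(p) to draw the plot.
```

Because "topic" is an ordinary column here, it can be handed to facet_wrap() just like "year" and "country" are handed to aes().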
12. Topic is spread across six columns
In the votes_joined dataset you used in the previous exercises, you don't have a single topic variable, but rather one column for each of the six topics containing a zero or a one. This means there's no easy way to use dplyr to summarize by topic, or to visualize the results for six topics on the same graph.
In order to do that, we need to bring topic into a single variable.
13. Use gather() to bring columns into two
This can be done with the gather function in the tidyr package. gather is a reshaping operation that takes any number of columns and collects them into two: key, with the original column names, and value, with the contents of those columns. Notice that this typically increases the number of rows in the data.
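To make the reshaping concrete, here is a minimal sketch of gather on a tiny made-up table. The data frame wide and its columns id, a, and b are hypothetical names chosen only for this example.

```r
library(tidyr)

# Hypothetical wide table: one row per id, one column per measurement.
wide <- data.frame(id = c(1, 2), a = c(10, 20), b = c(30, 40))

# Collect columns a and b into two new columns:
# "key" holds the old column names, "value" holds their contents.
long <- gather(wide, key, value, a:b)

# The 2 original rows become 4 rows (2 rows x 2 gathered columns).
nrow(long)
```

In tidyr 1.0.0 and later, gather still works but is superseded by pivot_longer, which performs the same kind of reshaping.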
16. Use gather() to bring columns into two variables
You can apply the gather function on the votes_joined data to collect topic into one variable. First, you specify that you want to gather the "me" through "ec" columns: those are the six topic columns in the joined dataset. You then specify the names of the key and value columns: use "topic" to store the key, which then contains the column names, and "has_topic" for the value, which is either 0 or 1. This achieves your goal of constructing a "topic" variable with six possible values. Notice that there are now 6 rows for each original row, one for each topic.
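The call described above might look roughly like the following. Since the real votes_joined data isn't available here, this sketch builds a tiny stand-in with invented values; the rcid, country, and vote columns and the four middle topic column names (nu, di, hr, co) are assumptions about how the dataset is laid out.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for votes_joined (values invented for illustration).
# Assumed layout: one row per country-vote, six 0/1 topic columns me..ec.
votes_joined <- data.frame(
  rcid = c(1, 1, 2),
  country = c("United Kingdom", "France", "United Kingdom"),
  vote = c(1, 1, 3),
  me = c(1, 1, 0), nu = c(0, 0, 1), di = c(0, 0, 1),
  hr = c(0, 0, 0), co = c(1, 1, 0), ec = c(0, 0, 0)
)

# Gather the six topic columns into a "topic" key and a "has_topic" value.
votes_gathered <- votes_joined %>%
  gather(topic, has_topic, me:ec)

# 3 original rows x 6 topic columns = 18 rows.
nrow(votes_gathered)
```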
In this case, you don't actually care about rows where "has_topic" is zero. For example, such a row in the "me" topic is effectively saying that a roll call vote was not related to the Palestinian conflict.
17. Use gather() to bring columns into one variable
Thus, you should add one more step where you filter for all the cases where has_topic is 1. After that, every remaining row's topic column names a topic that the vote is actually associated with. Note that votes with multiple topics may appear multiple times in the dataset.
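Here is a small sketch of that extra filtering step, again on invented stand-in data rather than the real votes_joined. The result name votes_tidied and the columns other than the "me" through "ec" range are assumptions for this example.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for votes_joined (values invented for illustration).
# Vote 1 is tagged with two topics (me and co); vote 2 with one (nu).
votes_joined <- data.frame(
  rcid = c(1, 2),
  country = "United Kingdom",
  vote = c(1, 3),
  me = c(1, 0), nu = c(0, 1), di = c(0, 0),
  hr = c(0, 0), co = c(1, 0), ec = c(0, 0)
)

# Gather the topic columns, then keep only the topics each vote actually has.
votes_tidied <- votes_joined %>%
  gather(topic, has_topic, me:ec) %>%
  filter(has_topic == 1)

# Vote 1 appears twice because it has two topics.
votes_tidied
```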
By constructing a country-vote-topic dataset, you've now made it possible to group and summarize the data by topic, or to compare all six in the same visualization.
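As one possible illustration of summarizing by topic, the sketch below starts from a hand-made country-vote-topic table. The values are invented, and treating a vote value of 1 as "yes" is an assumption about how the votes are coded.

```r
library(dplyr)

# Toy country-vote-topic data (values invented for illustration).
# Assumption: vote == 1 means "yes".
votes_tidied <- data.frame(
  rcid = c(1, 1, 2, 3),
  country = "United Kingdom",
  topic = c("me", "co", "nu", "co"),
  vote = c(1, 1, 3, 1)
)

# With topic as a single variable, grouping by it is one step.
by_topic <- votes_tidied %>%
  group_by(topic) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

by_topic
```

The same grouped table could also feed a faceted plot, since topic is now a column rather than six separate ones.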
18. Let's practice!
Many analyses will require this kind of manipulation and restructuring of your data using tidyr and other tools.