1. The bind_rows verb
You'll end the course by learning one more dplyr verb that combines two tables together, in a slightly different way than the six join verbs do.
2. Comparing tables
So far when we've combined tables, we've combined them based on values that match between them. For instance, you've joined questions and answers by matching the id of each question to the question id of each answer.
But notice that these two tables also have a similar structure, with three variables in common: id, creation date, and score. In some situations, instead of joining one next to the other, we may want to stack one on top of the other.
3. Binding rows
We do this with the bind_rows verb. When we use it to combine the two tables, we end up with a combination of questions and answers in one table, which we could call "posts", since it describes them both.
We originally had about 294,000 questions and 380,000 answers, and now we have about 675,000 posts. Notice that in the first ten observations, question id is always NA, because those rows were originally questions, and only the answers table had that column.
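As a rough illustration of the stacking just described, here is a minimal sketch. The two small tables are made-up stand-ins for questions and answers (the real tables are far larger), and the column names creation_date and question_id are assumed spellings of the variables mentioned above.

```r
library(dplyr)

# Toy stand-ins for the course tables; every value here is invented.
questions <- data.frame(
  id = c(1L, 2L),
  creation_date = as.Date(c("2014-03-01", "2015-06-10")),
  score = c(3L, 1L)
)
answers <- data.frame(
  id = c(10L, 11L, 12L),
  creation_date = as.Date(c("2014-03-02", "2015-06-11", "2015-06-12")),
  question_id = c(1L, 2L, 2L),
  score = c(5L, 0L, 2L)
)

# Stack one table on top of the other. Columns are matched by name,
# and question_id is filled with NA for rows that came from questions.
posts <- bind_rows(questions, answers)
posts
```

The row count of the result is the sum of the two inputs, and the question rows come first because questions was passed first.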
It's often useful in this posts table to keep track of which observations are questions and which are answers. A common approach when combining tables with bind_rows is to use mutate to add an extra column to each table that identifies where its rows came from, before you bind them.
4. Using bind_rows
Here's an example. We use mutate to add a column called type to the questions table, and then to the answers table, before binding the two.
Notice that the outcome now has a type column, with either question or answer. The two tables have been combined so that you can work with them using the same dplyr verbs, but the observations can still be distinguished.
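The pattern above might look something like the following sketch. The toy tables and the names questions_with_type, answers_with_type, and posts_with_type are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-ins for the course tables; every value here is invented.
questions <- data.frame(id = c(1L, 2L), score = c(3L, 1L))
answers <- data.frame(
  id = c(10L, 11L, 12L),
  question_id = c(1L, 2L, 2L),
  score = c(5L, 0L, 2L)
)

# Label each table before stacking so the rows stay distinguishable.
questions_with_type <- questions %>% mutate(type = "question")
answers_with_type <- answers %>% mutate(type = "answer")

posts_with_type <- bind_rows(questions_with_type, answers_with_type)
posts_with_type

# An alternative worth knowing: bind_rows() can build the label itself
# from the argument names via its .id argument.
posts_via_id <- bind_rows(question = questions, answer = answers, .id = "type")
```

Both approaches give a type column holding "question" or "answer". The mutate version is the one described in this lesson.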
5. Aggregating
What could you do with this combined table? Well, you could do some aggregations. If you wanted to find the average score of questions and of answers, you could do that with group_by and summarize.
We see that the average answer is higher rated than the average question. But the aggregations can get a lot more complicated than that. For instance, we could look at question and answer activity over time.
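For the average-score aggregation mentioned above, a minimal sketch could look like this. The scores are invented, so the averages only illustrate the mechanics and are not the course's results. The output column name average_score is an assumption.

```r
library(dplyr)

# Toy combined table with a type column; every value here is invented.
posts_with_type <- data.frame(
  type = c("question", "question", "answer", "answer", "answer"),
  score = c(3, 1, 5, 0, 2)
)

# One row per type, holding the mean score of that type.
average_scores <- posts_with_type %>%
  group_by(type) %>%
  summarize(average_score = mean(score))
average_scores
```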
6. Creating date variable
You haven't yet used the creation date variable in questions or in answers. To use it, you might want to calculate a variable for year, rather than just date. You can do this with the year() function from the lubridate package, which takes a date and turns it into the relevant year.
Notice that the date 2014-03-01 becomes just the number 2014.
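Here is a small sketch of that conversion, first on a single date and then as a new column. The toy table and the column name creation_date are assumptions made for illustration.

```r
library(dplyr)
library(lubridate)

# year() extracts the year from a single date.
example_year <- year(as.Date("2014-03-01"))
example_year

# Toy table; every value here is invented.
posts <- data.frame(
  type = c("question", "answer", "answer"),
  creation_date = as.Date(c("2014-03-01", "2015-06-11", "2015-06-12"))
)

# Add a year column derived from the creation date.
posts_with_year <- posts %>%
  mutate(year = year(creation_date))
posts_with_year
```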
7. Counting date variable
You can follow this up with a count, to find the number of posts of each type each year.
After running this count, notice that we now have the number of R questions and the number of R answers that were posted within each year.
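Chaining the year calculation into a count could look roughly like this. The data is made up, and the name questions_answers_year is a hypothetical label for the result.

```r
library(dplyr)
library(lubridate)

# Toy combined table; every value here is invented.
posts_with_type <- data.frame(
  type = c("question", "question", "answer", "answer", "answer"),
  creation_date = as.Date(c(
    "2014-03-01", "2015-06-10",
    "2014-03-02", "2015-06-11", "2015-06-12"
  ))
)

# One row per combination of year and type, with the count in column n.
questions_answers_year <- posts_with_type %>%
  mutate(year = year(creation_date)) %>%
  count(year, type)
questions_answers_year
```

Each row of the result is one year and type pair, which is the long layout that plotting functions typically expect.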
8. Plotting date variable
This is especially useful for a visualization, since the counted table is ready to use in ggplot2.
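One way such a plot could be drawn is sketched below. The counts are invented, and the choice of a line plot with one colored line per type is an assumption about the visualization, not a description of the course's slide.

```r
library(ggplot2)

# Toy yearly counts in the shape produced by count(year, type);
# every value here is invented.
questions_answers_year <- data.frame(
  year = c(2014, 2014, 2015, 2015, 2016, 2016),
  type = c("answer", "question", "answer", "question", "answer", "question"),
  n = c(120, 100, 150, 110, 170, 130)
)

# One line per type, showing the number of posts over time.
posts_plot <- ggplot(questions_answers_year, aes(x = year, y = n, color = type)) +
  geom_line()
posts_plot
```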
9. The posts plot
In the exercises, you'll see other ways that this method of binding two tables, one on top of the other, can be useful in various analyses.
10. Let's practice!
Let's get to it!