Joining 2 datasets
There are multiple students who have answered both questionnaires in our two datasets. Unfortunately we do not have a single identification variable to identify these students. However, we can use a bunch of background questions together for identification.
Combining two data sets is easy if the data have a mutual identifier column or if a combination of mutual columns can be used as identifiers.
Here we'll use the inner_join() function from the dplyr library to combine the data (remember the dplyr cheatsheet!). This means that we'll only keep the students who answered the questionnaire in both the math and Portuguese classes.
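To make the idea of identifying rows by a combination of columns concrete, here is a minimal sketch on made-up data. The data frames math_toy and por_toy and the vector id_cols are hypothetical stand-ins, not the course's math and por data; only three background columns are used to keep the example small.

```r
library(dplyr)

# Hypothetical toy data, NOT the course's math/por data frames
math_toy <- data.frame(
  school = c("GP", "GP", "MS"),
  sex    = c("F", "M", "F"),
  age    = c(15, 16, 17),
  G3     = c(10, 12, 14)
)
por_toy <- data.frame(
  school = c("GP", "MS", "MS"),
  sex    = c("F", "F", "M"),
  age    = c(15, 17, 18),
  G3     = c(11, 13, 9)
)

# Columns that together act as the identifier (assumed for illustration)
id_cols <- c("school", "sex", "age")

# Keep only rows whose identifier combination appears in both data frames
joined <- inner_join(math_toy, por_toy, by = id_cols)
print(joined)
```

Only the rows whose school, sex and age all match in both toy data frames survive the join. The G3 column exists in both inputs but is not an identifier, so dplyr keeps both copies and tells them apart with its default suffixes.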
This exercise is part of the course Helsinki Open Data Science.
Exercise instructions
- Access the dplyr library and create the object join_by.
- Adjust the code: define the argument by in the inner_join() function to join the math and por data frames. Use the columns defined in join_by.
- Print out the column names of the joined data set.
- Adjust the code again: add the argument suffix to inner_join() and give it a vector of two strings: ".math" and ".por".
- Join the datasets again and print out the new column names.
- Use the glimpse() function (from dplyr) to look at the joined data. Which data types are present?
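The effect of the suffix argument and of glimpse() can be sketched on made-up data. The data frames math_toy and por_toy are hypothetical and much smaller than the course data; the call pattern is the same one the instructions describe.

```r
library(dplyr)

# Hypothetical toy data, NOT the course's math/por data frames
math_toy <- data.frame(school = c("GP", "MS"), age = c(15, 17), G3 = c(10, 14))
por_toy  <- data.frame(school = c("GP", "MS"), age = c(15, 18), G3 = c(11, 13))

# Join on the identifier columns; non-identifier columns that share a name
# get the given suffixes instead of the default ".x" and ".y"
joined <- inner_join(math_toy, por_toy,
                     by = c("school", "age"),
                     suffix = c(".math", ".por"))

# Inspect the resulting column names
colnames(joined)

# Compact overview of the columns, their types and first values
glimpse(joined)
```

With the suffixes in place, the name of each duplicated column shows which questionnaire it came from, which is easier to read than the default suffixes.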
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# math and por are available
# access the dplyr library
library(dplyr)
# common columns to use as identifiers
join_by <- c("school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet")
# join the two datasets by the selected identifiers
math_por <- inner_join(math, por, by = "change me!")
# see the new column names
# glimpse at the data