Joining 2 datasets

Several students answered both questionnaires in our two datasets. Unfortunately, there is no single identification variable for these students. However, we can use a set of background questions together to identify them.

Combining two datasets is easy if they share an identifier column or if a combination of shared columns can be used as an identifier.

Here we'll use the inner_join() function from the dplyr library to combine the data (remember the dplyr cheatsheet!). This means that we'll only keep the students who answered the questionnaire in both the math and Portuguese classes.
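
To make the idea concrete, here is a minimal sketch of an inner join on a combination of columns. The data frames a and b, their values, and the score_a and score_b columns are made up for illustration and are not part of the course data; only three of the background columns are used as keys.

```r
library(dplyr)

# Toy data (hypothetical): there is no single ID column, but
# school + sex + age together are assumed to identify a student
a <- data.frame(school  = c("GP", "GP", "MS"),
                sex     = c("F", "M", "F"),
                age     = c(15, 16, 17),
                score_a = c(10, 12, 14))

b <- data.frame(school  = c("GP", "MS", "MS"),
                sex     = c("F", "F", "M"),
                age     = c(15, 17, 18),
                score_b = c(11, 13, 9))

# Keep only rows whose key combination appears in both data frames
both <- inner_join(a, b, by = c("school", "sex", "age"))
both
```

Only the two students whose school, sex and age match in both toy data frames remain; rows present in just one of them are dropped.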

This exercise is part of the course

Helsinki Open Data Science


Exercise instructions

Access the dplyr library and create the object join_by.
Adjust the code: define the argument by in the inner_join() function to join the math and por data frames. Use the columns defined in join_by.
Print out the column names of the joined data set.
Adjust the code again: add the argument suffix to inner_join() and give it a vector of two strings: ".math" and ".por".
Join the datasets again and print out the new column names.
Use the glimpse() function (from dplyr) to look at the joined data. Which data types are present?
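
The effect of the suffix argument can be sketched on a small example. The data frames m and p, the keys vector, and the values below are hypothetical stand-ins, not the real math and por data; the point is only to show how a non-key column that exists in both inputs (here G3) gets renamed.

```r
library(dplyr)

# Hypothetical stand-ins for two questionnaires that share a G3 column
m <- data.frame(school = c("GP", "MS"), age = c(15, 17), G3 = c(10, 14))
p <- data.frame(school = c("GP", "MS"), age = c(15, 17), G3 = c(11, 13))

# Columns used as identifiers (illustrative subset)
keys <- c("school", "age")

# Without suffix, dplyr falls back to its defaults ".x" and ".y"
default_names <- colnames(inner_join(m, p, by = keys))

# With suffix, duplicated non-key columns are labelled by source
joined <- inner_join(m, p, by = keys, suffix = c(".math", ".por"))
colnames(joined)

# Compact overview of each column and its data type
glimpse(joined)
```

The key columns appear once in the result, while G3 appears twice, as G3.math and G3.por. Descriptive suffixes make it easier to tell later which questionnaire a value came from.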

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# math and por are available

# access the dplyr library
library(dplyr)

# common columns to use as identifiers
join_by <- c("school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet")

# join the two datasets by the selected identifiers
math_por <- inner_join(math, por, by = "change me!")

# see the new column names


# glimpse at the data
