Joining two datasets

Several students answered both questionnaires, so they appear in both of our datasets. Unfortunately, there is no single identification variable for these students. However, we can use a set of background questions together to identify them.

Combining two datasets is straightforward if they share an identifier column, or if a combination of shared columns can serve as an identifier.

Here we'll use the inner_join() function from the dplyr library to combine the data (remember the dplyr cheatsheet!). An inner join keeps only the rows that have a match in both data frames, so we'll only keep the students who answered the questionnaire in both the math and Portuguese classes.
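As a minimal sketch of the idea, the example below joins two small made-up data frames on a combination of background columns. The names math_toy, por_toy and id_cols, and all the values, are hypothetical and only stand in for the real math and por data.

```r
library(dplyr)

# Hypothetical toy data: no single ID column, only background variables
math_toy <- data.frame(
  school = c("GP", "GP", "MS"),
  sex    = c("F", "M", "F"),
  age    = c(15, 16, 17),
  G3     = c(10, 12, 14)
)
por_toy <- data.frame(
  school = c("GP", "MS", "MS"),
  sex    = c("F", "F", "M"),
  age    = c(15, 17, 18),
  G3     = c(11, 13, 9)
)

# Columns that together act as the identifier (an assumption for this toy data)
id_cols <- c("school", "sex", "age")

# Keep only rows whose identifier combination appears in both data frames
both <- inner_join(math_toy, por_toy, by = id_cols)
print(both)
```

Only the two identifier combinations present in both toy data frames survive the join. Note that a combination of background variables is not guaranteed to be unique per student, so this kind of matching should be treated as approximate.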

This exercise is part of the course Helsinki Open Data Science.

Exercise instructions

Access the dplyr library and create the object join_by.
Adjust the code: define the argument by in the inner_join() function to join the math and por data frames. Use the columns defined in join_by.
Print out the column names of the joined data set.
Adjust the code again: add the argument suffix to inner_join() and give it a vector of two strings: ".math" and ".por".
Join the datasets again and print out the new column names.
Use the glimpse() function (from dplyr) to look at the joined data. Which data types are present?
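The effect of the suffix argument can be illustrated with a small sketch. The data frames a and b and their id column are hypothetical and are not part of the exercise data; they only show how columns that exist in both inputs, but are not used as join keys, get renamed.

```r
library(dplyr)

# Hypothetical toy data with a shared key and two overlapping non-key columns
a <- data.frame(id = c(1, 2, 3), G3 = c(10, 12, 14), paid = c("yes", "no", "no"))
b <- data.frame(id = c(2, 3, 4), G3 = c(11, 13, 9), paid = c("no", "yes", "no"))

# Without suffix, dplyr disambiguates overlapping columns with ".x" and ".y"
joined_default <- inner_join(a, b, by = "id")
print(colnames(joined_default))

# With suffix, the overlapping columns show which data frame they came from
joined_named <- inner_join(a, b, by = "id", suffix = c(".math", ".por"))
print(colnames(joined_named))

# glimpse() lists every column with its type and the first few values
glimpse(joined_named)
```

Descriptive suffixes make it easier to tell later which questionnaire a duplicated column belongs to.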

Hands-on interactive exercise

Try this exercise by completing the sample code.

# math and por are available

# access the dplyr library
library(dplyr)

# common columns to use as identifiers
join_by <- c("school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet")

# join the two datasets by the selected identifiers
math_por <- inner_join(math, por, by = "change me!")

# see the new column names


# glimpse at the data
