Join together for fun
1. Join together for fun
Welcome back. In this chapter, we'll look at methods of combining data from different sources. Data rarely only comes from a single source, so being able to link together two or more sources is a valuable programming skill. We'll first refresh our knowledge of some dplyr joins.2. dplyr join diagrams
A left join keeps all of the original rows in the left tibble and marks the values without a match in the right tibble as missing. An inner join looks for matches in the right tibble corresponding to all entries in the key column in the left tibble. The rows without a match based on key values to the inner join have been faded. Lastly, an anti join chooses rows in the left tibble where the key is not found in the right tibble. This is typically used for diagnosing data errors with left and inner joins. Let's see these in action with world_bank_data and imf_data.3. Some IMF data for Uruguay
Let's focus on just a few columns from the IMF data and filter rows after 2010 for the country of Uruguay. The resulting tibble from this pipeline gives us some Uruguayan characteristics from the IMF.4. Some World Bank data for Uruguay
Similarly, here is a subset for Uruguay from the World Bank data. Note here that the columns of iso, country, and year are in both tibbles.5. Left join on Uruguayan tibbles
The left_join function in dplyr takes two required arguments. The left tibble comes before the pipe, and the right tibble is the other argument. Recall that if we have columns with matching names as we do here, dplyr will automatically try to join on all of those columns as we see here. Since the uruguay_wb tibble did not have values for the years 2011, 2012, 2017, 2018, and 2019, we see that the rural population values are missing. The left join did bring in the matching values for the years 2013 to 2016, though.6. Inner join on Uruguayan tibbles
The code for an inner_join is very similar. Let's also be explicit here and specify the by argument to list which columns we'd like to use as keys. Note the need to include quotation marks around the column names here, which is different to usual tidyverse syntax. As we saw with the left join, the years of 2013 to 2016 appear in both tibbles. This inner join will, thus, only return those rows that have matching key entries for the years of 2013 to 2016.7. Anti join on Uruguayan tibbles
The anti join can be thought of as the opposite of an inner join. For this example, it will return only the rows in the IMF data that do not have matching key entries in the World Bank data. 2011, 2012, 2017, 2018, and 2019 are those years. The syntax again remains consistent here, and we have again included the by argument values for clarity.8. Let's practice!
Try out some joins on a few exercises as a refresher.Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.