Exercise

Checking for Bad Joins

Joins can go bad silently if we are not careful, meaning they will not error out but instead return mangled data with more or less data than you'd intended. Let's take a look at a couple ways that joining incorrectly can change your data set for the worse.

In this example we will look at what happens if you join two dataframes together when the join keys are not the same precision and compare the record counts between the correct join and the incorrect one.

Instructions

100 XP
  • Create a join between df_orig, the dataframe before its precision was corrected, and walk_df that matches on longitude and latitude in the respective dataframes.
  • Count the number of missing values with where() isNull() on df['walkscore'] and correct_join['walkscore']. You should notice that there are many missing values because our datatypes and precision do not match.
  • Create a join between df and walk_df that only matches on longitude
  • Count the number of records with count(): few_keys_df and correct_join_df. You should notice that there are many more values as we have not constrained our matching correctly.