Checking for Bad Joins
Joins can go bad silently if we are not careful: instead of raising an error, they return mangled results with more or fewer rows than you intended. Let's take a look at a couple of ways that joining incorrectly can change your data set for the worse.
In this exercise we will look at what happens when you join two dataframes whose join keys are not the same precision, and compare the record counts between the correct join and the incorrect one.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Create a join between df_orig, the dataframe before its precision was corrected, and walk_df, matching on longitude and latitude in the respective dataframes.
- Count the number of missing values with where() and isNull() on df['walkscore'] and correct_join_df['walkscore']. You should notice that there are many missing values, because our datatypes and precision do not match.
- Create a join between df and walk_df that matches only on longitude.
- Count the number of records with count() for few_keys_df and correct_join_df. You should notice that there are many more records, because we have not constrained our matching correctly.
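The first failure mode above can be reproduced locally. The following is a small sketch in pandas with hypothetical toy coordinates (the course itself uses PySpark, but the left-join semantics are the same): keys stored at different precisions never match, so every looked-up value comes back missing.

```python
import pandas as pd

# Hypothetical listings data: coordinates stored at full precision
df_orig = pd.DataFrame({
    "longitude": [-93.4873546, -93.4998776],
    "latitude": [44.9622321, 44.9531655],
    "price": [250000, 310000],
})

# Hypothetical walk-score data: coordinates rounded to 5 decimal places
walk_df = pd.DataFrame({
    "longitude": [-93.48735, -93.49988],
    "latitude": [44.96223, 44.95317],
    "walkscore": [51, 67],
})

# Joining on mismatched precision silently matches nothing
wrong_prec = df_orig.merge(walk_df, on=["longitude", "latitude"], how="left")
print(wrong_prec["walkscore"].isnull().sum())  # 2 -> every walkscore is missing

# Rounding the keys to the same precision first makes the join succeed
df = df_orig.round({"longitude": 5, "latitude": 5})
correct_join = df.merge(walk_df, on=["longitude", "latitude"], how="left")
print(correct_join["walkscore"].isnull().sum())  # 0
```

The join "works" either way, which is exactly why this bug is easy to miss: only the missing-value count reveals that nothing actually matched.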
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Join on mismatched keys precision
wrong_prec_cond = [____ == ____, ____ == ____]
wrong_prec_df = ____.join(____, on=wrong_prec_cond, how='left')
# Compare bad join to the correct one
print(wrong_prec_df.____(____.____()).count())
print(correct_join_df.____(____.____()).count())
# Create a join on too few keys
few_keys_cond = [____ == ____]
few_keys_df = ____.join(____, on=few_keys_cond, how='left')
# Compare bad join to the correct one
print("Record Count of the Too Few Keys Join Example: " + str(____.____()))
print("Record Count of the Correct Join Example: " + str(____.____()))
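The second failure mode, joining on too few keys, can also be seen with a self-contained sketch. This uses pandas and hypothetical toy data rather than the exercise's Spark dataframes, but the behavior mirrors Spark's: when several rows share the single join key, each left row pairs with every matching right row, silently multiplying the record count.

```python
import pandas as pd

# Hypothetical listings: two properties sharing a longitude
df = pd.DataFrame({
    "longitude": [-93.48735, -93.48735],
    "latitude": [44.96223, 44.95317],
    "price": [250000, 310000],
})
walk_df = pd.DataFrame({
    "longitude": [-93.48735, -93.48735],
    "latitude": [44.96223, 44.95317],
    "walkscore": [51, 67],
})

# Matching on longitude alone: each listing pairs with every
# same-longitude walkscore row, doubling the record count
few_keys_df = df.merge(walk_df, on=["longitude"], how="left")
print("Record Count of the Too Few Keys Join Example: " + str(len(few_keys_df)))   # 4

# Matching on both keys keeps exactly one row per listing
correct_join_df = df.merge(walk_df, on=["longitude", "latitude"], how="left")
print("Record Count of the Correct Join Example: " + str(len(correct_join_df)))    # 2
```

Comparing the two counts, as the exercise asks, is the quickest sanity check after any join: if the row count grew when you expected a one-to-one match, your join keys are under-specified.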