Checking for Bad Joins
Joins can go bad silently if we are not careful: instead of raising an error, they return mangled results with more or fewer rows than you intended. Let's take a look at a couple of ways that joining incorrectly can change your data set for the worse.
In this exercise we will look at what happens when you join two dataframes whose join keys are not the same precision, and compare the record counts between the correct join and the incorrect one.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Create a join between df_orig, the dataframe before its precision was corrected, and walk_df, matching on longitude and latitude in the respective dataframes.
- Count the number of missing values with where() and isNull() on df['walkscore'] and correct_join_df['walkscore']. You should notice that there are many missing values, because our datatypes and precision do not match.
- Create a join between df and walk_df that matches only on longitude.
- Count the number of records with count() for few_keys_df and correct_join_df. You should notice that there are many more records, because we have not constrained our matching correctly.
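The first failure mode above can be reproduced locally. The following is a small sketch in pandas with hypothetical toy coordinates (the course itself uses PySpark, but the left-join semantics are the same): keys stored at different precisions never match, so every looked-up value comes back missing.

```python
import pandas as pd

# Hypothetical listings data: coordinates stored at full precision
df_orig = pd.DataFrame({
    "longitude": [-93.4873546, -93.4998776],
    "latitude": [44.9622321, 44.9531655],
    "price": [250000, 310000],
})

# Hypothetical walk-score data: coordinates rounded to 5 decimal places
walk_df = pd.DataFrame({
    "longitude": [-93.48735, -93.49988],
    "latitude": [44.96223, 44.95317],
    "walkscore": [51, 67],
})

# Joining on mismatched precision silently matches nothing
wrong_prec = df_orig.merge(walk_df, on=["longitude", "latitude"], how="left")
print(wrong_prec["walkscore"].isnull().sum())  # 2 -> every walkscore is missing

# Rounding the keys to the same precision first makes the join succeed
df = df_orig.round({"longitude": 5, "latitude": 5})
correct_join = df.merge(walk_df, on=["longitude", "latitude"], how="left")
print(correct_join["walkscore"].isnull().sum())  # 0
```

The join "works" either way, which is exactly why this bug is easy to miss: only the missing-value count reveals that nothing actually matched.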
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Join on mismatched keys precision
wrong_prec_cond = [____ == ____, ____ == ____]
wrong_prec_df = ____.join(____, on=wrong_prec_cond, how='left')
# Compare bad join to the correct one
print(wrong_prec_df.____(____.____()).count())
print(correct_join_df.____(____.____()).count())
# Create a join on too few keys
few_keys_cond = [____ == ____]
few_keys_df = ____.join(____, on=few_keys_cond, how='left')
# Compare bad join to the correct one
print("Record Count of the Too Few Keys Join Example: " + str(____.____()))
print("Record Count of the Correct Join Example: " + str(____.____()))
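The second failure mode, joining on too few keys, can also be seen with a self-contained sketch. This uses pandas and hypothetical toy data rather than the exercise's Spark dataframes, but the behavior mirrors Spark's: when several rows share the single join key, each left row pairs with every matching right row, silently multiplying the record count.

```python
import pandas as pd

# Hypothetical listings: two properties sharing a longitude
df = pd.DataFrame({
    "longitude": [-93.48735, -93.48735],
    "latitude": [44.96223, 44.95317],
    "price": [250000, 310000],
})
walk_df = pd.DataFrame({
    "longitude": [-93.48735, -93.48735],
    "latitude": [44.96223, 44.95317],
    "walkscore": [51, 67],
})

# Matching on longitude alone: each listing pairs with every
# same-longitude walkscore row, doubling the record count
few_keys_df = df.merge(walk_df, on=["longitude"], how="left")
print("Record Count of the Too Few Keys Join Example: " + str(len(few_keys_df)))   # 4

# Matching on both keys keeps exactly one row per listing
correct_join_df = df.merge(walk_df, on=["longitude", "latitude"], how="left")
print("Record Count of the Correct Join Example: " + str(len(correct_join_df)))    # 2
```

Comparing the two counts, as the exercise asks, is the quickest sanity check after any join: if the row count grew when you expected a one-to-one match, your join keys are under-specified.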