A Dangerous Join
In this exercise, we will be joining on latitude and longitude to bring in another dataset that measures how walk-friendly a neighborhood is. We'll need to make sure our joining columns have the same data type and the same precision (number of digits after the decimal), or our join won't work!
Below you will find that df['latitude'] and df['longitude'] are at a higher precision than walk_df['latitude'] and walk_df['longitude']; we'll need to round them to the same precision so the join will work correctly.
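To see why precision matters, here is a minimal plain-Python sketch (with hypothetical coordinate values): an equality check on raw values at different precisions fails, but succeeds once both sides are rounded to 5 decimal places — the same idea the join condition relies on.

```python
# Hypothetical coordinates referring to the same point at different precisions
lat_df = 44.9750448    # higher-precision latitude, like df['latitude']
lat_walk = 44.97504    # lower-precision latitude, like walk_df['latitude']

# An exact equality match (what a join condition performs) fails on raw values
print(lat_df == lat_walk)                       # False

# Rounding both sides to 5 decimal places makes the match succeed
print(round(lat_df, 5) == round(lat_walk, 5))   # True
```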
This exercise is part of the course Feature Engineering with PySpark.
Exercise instructions
- Convert walk_df['latitude'] and walk_df['longitude'] to type double by using cast('double') on the column and replacing the column in place with withColumn().
- Round the columns in place with withColumn() and round('latitude', 5) and round('longitude', 5).
- Create the join condition of walk_df['latitude'] matching df['latitude'] and walk_df['longitude'] matching df['longitude'].
- Join df and walk_df together with join(), using the condition above and the left join type. Save the joined dataframe as join_df.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Cast data types
walk_df = ____.withColumn('longitude', ____.cast('double'))
walk_df = ____.withColumn(____, ____.cast('double'))
# Round precision
df = df.withColumn('longitude', round(____, 5))
df = df.withColumn(____, round(____, 5))
# Create join condition
condition = [____ == ____, ____ == ____]
# Join the dataframes together
join_df = ____.join(____, on=____, how=____)
# Count non-null records from new field
print(join_df.where(~join_df['walkscore'].isNull()).count())