Joining On Time Components
Often times you will use date components to join in other sets of information. However, in this example, we need to use data that would have been available to those considering buying a house. This means we will need to use the previous year's reporting data for our analysis.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Extract the year from
LISTDATE
usingyear()
and put it into a new column calledlist_year
withwithColumn()
- Create another new column called
report_year
by subtracting 1 from thelist_year
- Create a join condition that matches
df['CITY']
withprice_df['City']
anddf['report_year']
withprice_df['Year']
- Perform a left join between
df
andprice_df
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from pyspark.sql.functions import year
# Initialize dataframes
df = real_estate_df
price_df = median_prices_df
# Create year column
df = df.____(____, ____(____))
# Adjust year to match
df = df.withColumn(____, (df[____] - 1))
# Create join condition
condition = [df[____] == price_df[____], df[____] == price_df[____]]
# Join the dataframes together
df = ____.join(____, on=condition, how=____)
# Inspect that new columns are available
df[['MedianHomeValue']].show()