Get startedGet started for free

Joining On Time Components

Often times you will use date components to join in other sets of information. However, in this example, we need to use data that would have been available to those considering buying a house. This means we will need to use the previous year's reporting data for our analysis.

This exercise is part of the course

Feature Engineering with PySpark

View Course

Exercise instructions

  • Extract the year from LISTDATE using year() and put it into a new column called list_year with withColumn()
  • Create another new column called report_year by subtracting 1 from the list_year
  • Create a join condition that matches df['CITY'] with price_df['City'] and df['report_year'] with price_df['Year']
  • Perform a left join between df and price_df

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

from pyspark.sql.functions import year

# Initialize dataframes
df = real_estate_df
price_df = median_prices_df

# Create year column
df = df.____(____, ____(____))

# Adjust year to match
df = df.withColumn(____, (df[____] - 1))

# Create join condition
condition = [df[____] == price_df[____], df[____] == price_df[____]]

# Join the dataframes together
df = ____.join(____, on=condition, how=____)
# Inspect that new columns are available
df[['MedianHomeValue']].show()
Edit and Run Code