Joining On Time Components
You will often use date components to join in other sets of information. In this example, however, we can only use data that would have been available to someone considering buying a house at the time of the listing. This means we need to join each listing to the previous year's reporting data for our analysis.
This exercise is part of the course
Feature Engineering with PySpark
Instructions
- Extract the year from LISTDATE using year() and put it into a new column called list_year with withColumn()
- Create another new column called report_year by subtracting 1 from the list_year
- Create a join condition that matches df['CITY'] with price_df['City'] and df['report_year'] with price_df['Year']
- Perform a left join between df and price_df
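The logic behind these steps can be illustrated without Spark. The following is a minimal plain-Python sketch, not the exercise solution: the toy listings, city codes, and median values are made up for illustration, and a dictionary lookup stands in for the left join. It shows how a listing is matched to the prior year's report, and how a listing with no matching report is kept with a missing value.

```python
from datetime import date

# Hypothetical toy data; city codes and values are invented for illustration.
listings = [
    {"CITY": "LELM", "LISTDATE": date(2017, 7, 15)},
    {"CITY": "OAKD", "LISTDATE": date(2018, 3, 2)},
]

# Median home values keyed by (city, report year).
median_prices = {
    ("LELM", 2016): 401000,
    ("OAKD", 2016): 198000,
}

joined = []
for row in listings:
    list_year = row["LISTDATE"].year   # year the home was listed
    report_year = list_year - 1        # latest report a buyer could have seen
    # Left-join behaviour: every listing is kept; a missing match yields None.
    value = median_prices.get((row["CITY"], report_year))
    joined.append(
        {**row, "list_year": list_year, "report_year": report_year,
         "MedianHomeValue": value}
    )

for row in joined:
    print(row["CITY"], row["report_year"], row["MedianHomeValue"])
```

The second listing has no 2017 report in the toy lookup, so it survives the join with a MedianHomeValue of None, which mirrors the null a Spark left join would produce for an unmatched row.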
Hands-on interactive exercise
Try this exercise by completing this sample code.
from pyspark.sql.functions import year
# Initialize dataframes
df = real_estate_df
price_df = median_prices_df
# Create year column
df = df.____(____, ____(____))
# Adjust year to match
df = df.withColumn(____, (df[____] - 1))
# Create join condition
condition = [df[____] == price_df[____], df[____] == price_df[____]]
# Join the dataframes together
df = ____.join(____, on=condition, how=____)
# Inspect that new columns are available
df[['MedianHomeValue']].show()