Joining On Time Components
Often times you will use date components to join in other sets of information. However, in this example, we need to use data that would have been available to those considering buying a house. This means we will need to use the previous year's reporting data for our analysis.
Questo esercizio fa parte del corso
Feature Engineering with PySpark
Istruzioni dell'esercizio
- Extract the year from
LISTDATEusingyear()and put it into a new column calledlist_yearwithwithColumn() - Create another new column called
report_yearby subtracting 1 from thelist_year - Create a join condition that matches
df['CITY']withprice_df['City']anddf['report_year']withprice_df['Year'] - Perform a left join between
dfandprice_df
Esercizio pratico interattivo
Prova a risolvere questo esercizio completando il codice di esempio.
from pyspark.sql.functions import year
# Initialize dataframes
df = real_estate_df
price_df = median_prices_df
# Create year column
df = df.____(____, ____(____))
# Adjust year to match
df = df.withColumn(____, (df[____] - 1))
# Create join condition
condition = [df[____] == price_df[____], df[____] == price_df[____]]
# Join the dataframes together
df = ____.join(____, on=condition, how=____)
# Inspect that new columns are available
df[['MedianHomeValue']].show()