Joining On Time Components
Date components are often used as join keys to bring in other sets of information. In this example, however, we may only use data that would have been available to someone considering buying a house at the time it was listed. This means each listing must be matched to the previous year's reporting data rather than to the data for its own listing year.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Extract the year from LISTDATE using year() and put it into a new column called list_year with withColumn()
- Create another new column called report_year by subtracting 1 from list_year
- Create a join condition that matches df['CITY'] with price_df['City'] and df['report_year'] with price_df['Year']
- Perform a left join between df and price_df
Hands-on interactive exercise
Try this exercise by completing the sample code.
from pyspark.sql.functions import year
# Initialize dataframes
df = real_estate_df
price_df = median_prices_df
# Create year column
df = df.____(____, ____(____))
# Adjust year to match
df = df.withColumn(____, (df[____] - 1))
# Create join condition
condition = [df[____] == price_df[____], df[____] == price_df[____]]
# Join the dataframes together
df = ____.join(____, on=condition, how=____)
# Inspect that new columns are available
df[['MedianHomeValue']].show()