CommencerCommencer gratuitement

Joining On Time Components

Often times you will use date components to join in other sets of information. However, in this example, we need to use data that would have been available to those considering buying a house. This means we will need to use the previous year's reporting data for our analysis.

Cet exercice fait partie du cours

Feature Engineering with PySpark

Afficher le cours

Instructions

  • Extract the year from LISTDATE using year() and put it into a new column called list_year with withColumn()
  • Create another new column called report_year by subtracting 1 from the list_year
  • Create a join condition that matches df['CITY'] with price_df['City'] and df['report_year'] with price_df['Year']
  • Perform a left join between df and price_df

Exercice interactif pratique

Essayez cet exercice en complétant cet exemple de code.

from pyspark.sql.functions import year

# Initialize dataframes
df = real_estate_df
price_df = median_prices_df

# Create year column
df = df.____(____, ____(____))

# Adjust year to match
df = df.withColumn(____, (df[____] - 1))

# Create join condition
condition = [df[____] == price_df[____], df[____] == price_df[____]]

# Join the dataframes together
df = ____.join(____, on=condition, how=____)
# Inspect that new columns are available
df[['MedianHomeValue']].show()
Modifier et exécuter le code