
Joining On Time Components

You will often use date components to join in other sets of information. In this example, however, we can only use data that would have been available to someone considering buying a house at the time of listing. A full year's report is not published until that year is over, so we need to join each listing to the previous year's reporting data for our analysis.
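To make the reasoning concrete, here is a minimal plain-Python sketch of the lookup. The city name, the dollar figures, and the helper value_known_at_listing are all made up for illustration and are not part of the exercise data.

```python
from datetime import date

# Hypothetical median home values keyed by (city, report year).
# The city and the numbers are invented for this sketch.
median_home_value = {
    ("Oakdale", 2016): 240000,
    ("Oakdale", 2017): 250000,
    ("Oakdale", 2018): 262000,
}


def value_known_at_listing(city, list_date):
    """Return the median value from the report year before the listing year."""
    report_year = list_date.year - 1  # the last full year a buyer could have seen
    return median_home_value.get((city, report_year))


# A home listed in March 2018 is matched with the 2017 report, not the 2018 one.
known = value_known_at_listing("Oakdale", date(2018, 3, 15))
print(known)
```

Matching on the listing year itself would leak information from the future into the features, which is what the one-year shift avoids.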

This exercise is part of the course

Feature Engineering with PySpark


Exercise instructions

  • Extract the year from LISTDATE using year() and put it into a new column called list_year with withColumn()
  • Create another new column called report_year by subtracting 1 from list_year
  • Create a join condition that matches df['CITY'] with price_df['City'] and df['report_year'] with price_df['Year']
  • Perform a left join between df and price_df

Hands-on interactive exercise

Try this exercise by completing the sample code.

from pyspark.sql.functions import year

# Initialize dataframes
df = real_estate_df
price_df = median_prices_df

# Create year column
df = df.____(____, ____(____))

# Adjust year to match
df = df.withColumn(____, (df[____] - 1))

# Create join condition
condition = [df[____] == price_df[____], df[____] == price_df[____]]

# Join the dataframes together
df = ____.join(____, on=condition, how=____)

# Inspect that new columns are available
df[['MedianHomeValue']].show()