MulaiMulai sekarang secara gratis

Joining On Time Components

Often times you will use date components to join in other sets of information. However, in this example, we need to use data that would have been available to those considering buying a house. This means we will need to use the previous year's reporting data for our analysis.

Latihan ini adalah bagian dari kursus

Feature Engineering with PySpark

Lihat Kursus

Petunjuk latihan

  • Extract the year from LISTDATE using year() and put it into a new column called list_year with withColumn()
  • Create another new column called report_year by subtracting 1 from the list_year
  • Create a join condition that matches df['CITY'] with price_df['City'] and df['report_year'] with price_df['Year']
  • Perform a left join between df and price_df

Latihan interaktif praktis

Cobalah latihan ini dengan menyelesaikan kode contoh berikut.

from pyspark.sql.functions import year

# Initialize dataframes
df = real_estate_df
price_df = median_prices_df

# Create year column
df = df.____(____, ____(____))

# Adjust year to match
df = df.withColumn(____, (df[____] - 1))

# Create join condition
condition = [df[____] == price_df[____], df[____] == price_df[____]]

# Join the dataframes together
df = ____.join(____, on=condition, how=____)
# Inspect that new columns are available
df[['MedianHomeValue']].show()
Edit dan Jalankan Kode