Pivot & Join
Being able to explode and pivot a compound field is great, but you are left with a DataFrame of only those pivoted values. To be really valuable, it needs to be rejoined to the original dataset! After joining the datasets we will have a lot of NULL values in the newly created columns. Since we know the context in which they were created, we can safely fill them in with zero: either a record has the attribute or it doesn't.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Pivot the values of ex_garage_list by grouping by the record id NO with groupBy(); use the provided code to aggregate constant_val, ignoring nulls and taking the first value.
- Left join piv_df to df using NO as the join condition.
- Create the list of columns to zero fill, zfill_cols, by using the columns attribute on piv_df.
- Zero fill the pivoted DataFrame's columns, zfill_cols, by using fillna() with a subset.
Hands-on interactive exercise
Try this exercise by completing this sample code.
from pyspark.sql.functions import coalesce, first
# Pivot
piv_df = ex_df.____(____).____(____).agg(coalesce(first(____)))
# Join the dataframes together and fill null
joined_df = ____.join(____, on='NO', how=____)
# Columns to zero fill
zfill_cols = ____.____
# Zero fill the pivoted values
zfilled_df = joined_df.____(0, subset=____)