Scaling your scalers
In the previous exercise, we minmax scaled a single variable. Suppose you have a LOT of variables to scale; you don't want to write a separate block of code for each one. Let's expand on the previous exercise and turn it into a function.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Define a function called min_max_scaler that takes parameters df, a dataframe, and cols_to_scale, the list of columns to scale.
- Use a for loop to iterate through each column in the list and minmax scale them.
- Return the dataframe df with the new columns added.
- Apply the function min_max_scaler() on df and the list of columns cols_to_scale.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
def ____(____, ____):
    # Takes a dataframe and list of columns to minmax scale. Returns a dataframe.
    for col in ____:
        # Define min and max values and collect them
        max_days = df.agg({col: 'max'}).collect()[0][0]
        min_days = df.agg({____: 'min'}).collect()[0][0]
        new_column_name = 'scaled_' + col
        # Create a new column based off the scaled data
        df = df.withColumn(____,
                           (df[____] - min_days) / (max_days - min_days))
    return ____

df = min_max_scaler(____, ____)
# Show that our data is now between 0 and 1
df[['DAYSONMARKET', 'scaled_DAYSONMARKET']].show()
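To see the arithmetic the loop performs without starting a Spark session, here is a minimal plain-Python sketch of the same idea. It is not the exercise solution and does not use PySpark: the function name min_max_scale_rows, the rows list, the LISTPRICE column, and all the sample values are hypothetical stand-ins chosen for illustration. Each column is scaled with the same formula as the code above, (value - min) / (max - min), and the result is stored under a 'scaled_' prefix.

```python
# Plain-Python stand-in for the Spark logic (illustrative only).
# `min_max_scale_rows`, `rows`, and the LISTPRICE column are hypothetical.
def min_max_scale_rows(rows, cols_to_scale):
    """Return copies of the row dicts with a 'scaled_<col>' entry per column."""
    scaled = [dict(row) for row in rows]  # copy so the input rows stay untouched
    for col in cols_to_scale:
        values = [row[col] for row in rows]
        min_val, max_val = min(values), max(values)
        for row in scaled:
            # Same formula as the exercise: (x - min) / (max - min)
            row['scaled_' + col] = (row[col] - min_val) / (max_val - min_val)
    return scaled

# Made-up sample data
rows = [
    {'DAYSONMARKET': 10, 'LISTPRICE': 100000},
    {'DAYSONMARKET': 30, 'LISTPRICE': 250000},
    {'DAYSONMARKET': 50, 'LISTPRICE': 400000},
]

result = min_max_scale_rows(rows, ['DAYSONMARKET', 'LISTPRICE'])
for row in result:
    print(row['DAYSONMARKET'], row['scaled_DAYSONMARKET'])
```

The smallest value in each column maps to 0 and the largest to 1, which is what the final show() call in the exercise is meant to confirm. In this sketch, a column whose values are all identical would make max - min equal to zero and raise a ZeroDivisionError, so constant columns would need separate handling.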