Custom Percentage Scaling
In the slides we showed how to scale the data between 0 and 1. Sometimes you may wish to scale things differently for modeling or display purposes.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Calculate the max and min of DAYSONMARKET and put them into variables max_days and min_days. Don't forget to use collect() on agg().
- Using withColumn(), create a new column called 'percentage_scaled_days' based on DAYSONMARKET. percentage_scaled_days should be a column of integers ranging from 0 to 100; use round() to get integers.
- Print the max() and min() for the new column percentage_scaled_days.
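The arithmetic behind these steps can be sketched in plain Python, without Spark. This is only an illustration of min-max scaling multiplied by 100: the function name percentage_scale and the sample values are made up for this sketch and are not part of the course code. The exercise itself applies the same formula to a DataFrame column.

```python
def percentage_scale(values):
    """Min-max scale a list of numbers, then express each as an integer percentage."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A zero range would mean dividing by zero, so refuse to scale.
        raise ValueError("all values are equal, so the range is zero")
    # Multiply by 100 before rounding; rounding the 0-to-1 ratio first
    # would collapse every value to either 0 or 100.
    return [round((v - lo) / (hi - lo) * 100) for v in values]


# Hypothetical days-on-market values, chosen only for illustration.
days_on_market = [10, 25, 40, 110]
scaled = percentage_scale(days_on_market)
print(scaled)
print(min(scaled), max(scaled))
```

Whatever the input, the smallest value maps to 0 and the largest to 100, which is what printing the max and min of the new column is meant to confirm. Python's built-in round() rounds exact halves to the nearest even integer, which may differ from how Spark's round() treats ties.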
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Define max and min values and collect them
max_days = df.____({____: ____}).____()[0][0]
min_days = df.____({____: ____}).____()[0][0]
# Create a new column based off the scaled data
df = df.____(____,
             # Multiply by the scale factor before rounding so values span the full range
             ____((df[____] - min_days) / (max_days - min_days) * ____))
# Calc max and min for new column
print(df.____({____: ____}).____())
print(df.____({____: ____}).____())