Scaling your scalers

In the previous exercise, we minmax scaled a single variable. If you have many variables to scale, you don't want to repeat the same lines of code for each one. Let's expand on the previous exercise and turn it into a function.

This exercise is part of the course

Feature Engineering with PySpark

Exercise instructions

  • Define a function called min_max_scaler that takes two parameters: df, a dataframe, and cols_to_scale, the list of columns to scale.
  • Use a for loop to iterate through each column in the list and minmax scale them.
  • Return the dataframe df with the new columns added.
  • Apply the function min_max_scaler() on df and the list of columns cols_to_scale.
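
To make the arithmetic behind these steps concrete, here is a minimal sketch of the same loop in plain Python rather than PySpark, so it runs without a Spark session. It is an illustration, not the exercise solution: the name min_max_scale_columns, the dict-of-lists table, the sample values, the LISTPRICE column, and the choice to map a constant column to 0.0 are all assumptions made for this example.

```python
def min_max_scale_columns(table, cols_to_scale):
    """Return a copy of `table` with a 'scaled_<col>' entry for each requested column.

    `table` is a hypothetical stand-in for a dataframe: a dict mapping
    column names to lists of numbers.
    """
    result = dict(table)
    for col in cols_to_scale:
        values = table[col]
        lo, hi = min(values), max(values)
        span = hi - lo
        # Assumption: a constant column (span == 0) is mapped to 0.0
        # instead of dividing by zero.
        result['scaled_' + col] = [
            (v - lo) / span if span else 0.0 for v in values
        ]
    return result


# Hypothetical sample data; only DAYSONMARKET appears in the exercise itself.
data = {
    'DAYSONMARKET': [10, 20, 40],
    'LISTPRICE': [100000, 150000, 300000],
}
scaled = min_max_scale_columns(data, ['DAYSONMARKET', 'LISTPRICE'])
print(scaled['scaled_DAYSONMARKET'])
print(scaled['scaled_LISTPRICE'])
```

Each scaled value is (value - min) / (max - min), so the smallest entry in a column becomes 0 and the largest becomes 1, which is the same formula the PySpark template applies column by column.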

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

def ____(____, ____):
  # Takes a dataframe and list of columns to minmax scale. Returns a dataframe.
  for col in ____:
    # Define min and max values and collect them
    max_days = df.agg({col: 'max'}).collect()[0][0]
    min_days = df.agg({____: 'min'}).collect()[0][0]
    new_column_name = 'scaled_' + col
    # Create a new column based off the scaled data
    df = df.withColumn(____, 
                      (df[____] - min_days) / (max_days - min_days))
  return ____
  
df = min_max_scaler(____, ____)
# Show that our data is now between 0 and 1
df[['DAYSONMARKET', 'scaled_DAYSONMARKET']].show()