Scaling your scalers

In the previous exercise, we minmax scaled a single variable. If you have a lot of variables to scale, you don't want to write the same lines of code for each one. Let's expand on the previous exercise and turn it into a function.

This exercise is part of the course

Feature Engineering with PySpark

Exercise instructions

Define a function called min_max_scaler that takes two parameters: df, a dataframe, and cols_to_scale, the list of columns to scale.
Use a for loop to iterate through each column in the list and minmax scale it.
Return the dataframe df with the new columns added.
Apply the function min_max_scaler() to df and the list of columns cols_to_scale.
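
As a rough illustration of the arithmetic the loop performs, here is a minimal plain-Python sketch with no Spark involved. Each value x in a column becomes (x - min) / (max - min), and the result is stored under a new name prefixed with scaled_. The helper names min_max_scale and scale_columns, the sample values, and the LISTPRICE column are hypothetical and not part of the course dataset. Mapping a constant column to zeros is an assumption of this sketch, not something the exercise specifies.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption of this sketch: a constant column maps to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def scale_columns(table, cols_to_scale):
    """Return a copy of table with a 'scaled_<col>' entry added per column."""
    result = dict(table)
    for col in cols_to_scale:
        result['scaled_' + col] = min_max_scale(table[col])
    return result


# Hypothetical sample data, not taken from the course dataset.
sample = {'DAYSONMARKET': [10, 20, 30], 'LISTPRICE': [100, 150, 300]}
scaled = scale_columns(sample, ['DAYSONMARKET', 'LISTPRICE'])
print(scaled['scaled_DAYSONMARKET'])  # [0.0, 0.5, 1.0]
```

The original columns are left untouched and the scaled copies sit beside them, which mirrors how the exercise adds scaled_ columns to the dataframe rather than overwriting the originals.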

Hands-on interactive exercise

Try this exercise by completing this sample code.

def ____(____, ____):
  # Takes a dataframe and list of columns to minmax scale. Returns a dataframe.
  for col in ____:
    # Define min and max values and collect them
    max_days = df.agg({col: 'max'}).collect()[0][0]
    min_days = df.agg({____: 'min'}).collect()[0][0]
    new_column_name = 'scaled_' + col
    # Create a new column based off the scaled data
    df = df.withColumn(____, 
                      (df[____] - min_days) / (max_days - min_days))
  return ____
  
df = min_max_scaler(____, ____)
# Show that our data is now between 0 and 1
df[['DAYSONMARKET', 'scaled_DAYSONMARKET']].show()