One Hot Encoding

In the United States where you live determines which schools your kids can attend. Therefore it's understandable that many people care deeply about which school districts their future home will be in. While the school districts are numbered in SCHOOLDISTRICTNUMBER they are really categorical. Meaning that summing or averaging these values has no apparent meaning. Therefore in this example we will convert SCHOOLDISTRICTNUMBER from a categorial variable into a numeric vector to use in our machine learning model later.

Este ejercicio forma parte del curso

Feature Engineering with PySpark

Ver curso

Instrucciones del ejercicio

Instantiate a StringIndexer transformer called string_indexer with SCHOOLDISTRICTNUMBER as the input and School_Index as the output.
Apply the transformer string_indexer to df with fit() and transform(). Store the transformed dataframe in indexed_df.
Create a OneHotEncoder transformer called encoder using School_Index as the input and School_Vec as the output.
Apply the transformation to indexed_df using transform(). Inspect the iterative steps of the transformation with the supplied code.

Ejercicio interactivo práctico

Prueba este ejercicio y completa el código de muestra.

from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Map strings to numbers with string indexer
string_indexer = ____(inputCol=____, outputCol=____)
indexed_df = ____.____(df).____(df)

# Onehot encode indexed values
encoder = ____(inputCol=____, outputCol=____)
encoded_df = ____.____(indexed_df)

# Inspect the transformation steps
encoded_df[['SCHOOLDISTRICTNUMBER', 'School_Index', 'School_Vec']].show(truncate=100)

Editar y ejecutar código