One Hot Encoding

In the United States where you live determines which schools your kids can attend. Therefore it's understandable that many people care deeply about which school districts their future home will be in. While the school districts are numbered in SCHOOLDISTRICTNUMBER they are really categorical. Meaning that summing or averaging these values has no apparent meaning. Therefore in this example we will convert SCHOOLDISTRICTNUMBER from a categorial variable into a numeric vector to use in our machine learning model later.

Cet exercice fait partie du cours

Feature Engineering with PySpark

Afficher le cours

Instructions

Instantiate a StringIndexer transformer called string_indexer with SCHOOLDISTRICTNUMBER as the input and School_Index as the output.
Apply the transformer string_indexer to df with fit() and transform(). Store the transformed dataframe in indexed_df.
Create a OneHotEncoder transformer called encoder using School_Index as the input and School_Vec as the output.
Apply the transformation to indexed_df using transform(). Inspect the iterative steps of the transformation with the supplied code.

Exercice interactif pratique

Essayez cet exercice en complétant cet exemple de code.

from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Map strings to numbers with string indexer
string_indexer = ____(inputCol=____, outputCol=____)
indexed_df = ____.____(df).____(df)

# Onehot encode indexed values
encoder = ____(inputCol=____, outputCol=____)
encoded_df = ____.____(indexed_df)

# Inspect the transformation steps
encoded_df[['SCHOOLDISTRICTNUMBER', 'School_Index', 'School_Vec']].show(truncate=100)

Modifier et exécuter le code