One Hot Encoding

In the United States where you live determines which schools your kids can attend. Therefore it's understandable that many people care deeply about which school districts their future home will be in. While the school districts are numbered in SCHOOLDISTRICTNUMBER they are really categorical. Meaning that summing or averaging these values has no apparent meaning. Therefore in this example we will convert SCHOOLDISTRICTNUMBER from a categorial variable into a numeric vector to use in our machine learning model later.

This exercise is part of the course

Feature Engineering with PySpark

View Course

Exercise instructions

Instantiate a StringIndexer transformer called string_indexer with SCHOOLDISTRICTNUMBER as the input and School_Index as the output.
Apply the transformer string_indexer to df with fit() and transform(). Store the transformed dataframe in indexed_df.
Create a OneHotEncoder transformer called encoder using School_Index as the input and School_Vec as the output.
Apply the transformation to indexed_df using transform(). Inspect the iterative steps of the transformation with the supplied code.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Map strings to numbers with string indexer
string_indexer = ____(inputCol=____, outputCol=____)
indexed_df = ____.____(df).____(df)

# Onehot encode indexed values
encoder = ____(inputCol=____, outputCol=____)
encoded_df = ____.____(indexed_df)

# Inspect the transformation steps
encoded_df[['SCHOOLDISTRICTNUMBER', 'School_Index', 'School_Vec']].show(truncate=100)

Edit and Run Code