One Hot Encoding
In the United States, where you live determines which schools your kids can attend, so it's understandable that many people care deeply about which school district their future home will be in. Although the school districts are numbered in SCHOOLDISTRICTNUMBER, they are really categorical: summing or averaging these values has no meaning. In this example we will therefore convert SCHOOLDISTRICTNUMBER from a categorical variable into a numeric vector to use in our machine learning model later.
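To make the idea concrete before touching Spark, here is a minimal plain-Python sketch of the two steps involved: mapping each district label to an index, then turning that index into a 0/1 vector. The district values and the helper names string_index and one_hot are made up for illustration. Ordering labels by descending frequency mirrors StringIndexer's default, while the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter


def string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, num_categories):
    """Return a 0/1 list with a single 1 at position `index`."""
    vec = [0] * num_categories
    vec[index] = 1
    return vec


# Made-up district numbers; they are labels, not quantities.
districts = ["834", "622", "834", "833", "622", "834"]

mapping = string_index(districts)
vectors = [one_hot(mapping[d], len(mapping)) for d in districts]

for district, vec in zip(districts, vectors):
    print(district, mapping[district], vec)
```

Each row ends up with exactly one 1, so no district is treated as "larger" than another. Note that Spark's OneHotEncoder stores its result as a sparse vector and, by default, drops the last category, so its output will look different from these dense lists.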
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Instantiate a StringIndexer transformer called string_indexer with SCHOOLDISTRICTNUMBER as the input and School_Index as the output.
- Apply the transformer string_indexer to df with fit() and transform(). Store the transformed dataframe in indexed_df.
- Create a OneHotEncoder transformer called encoder using School_Index as the input and School_Vec as the output.
- Apply the transformation to indexed_df using transform(). Inspect the iterative steps of the transformation with the supplied code.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from pyspark.ml.feature import OneHotEncoder, StringIndexer
# Map strings to numbers with string indexer
string_indexer = ____(inputCol=____, outputCol=____)
indexed_df = ____.____(df).____(df)
# Onehot encode indexed values
encoder = ____(inputCol=____, outputCol=____)
encoded_df = ____.____(indexed_df)
# Inspect the transformation steps
encoded_df[['SCHOOLDISTRICTNUMBER', 'School_Index', 'School_Vec']].show(truncate=100)