One Hot Encoding
In the United States where you live determines which schools your kids can attend. Therefore it's understandable that many people care deeply about which school districts their future home will be in. While the school districts are numbered in SCHOOLDISTRICTNUMBER they are really categorical. Meaning that summing or averaging these values has no apparent meaning. Therefore in this example we will convert SCHOOLDISTRICTNUMBER from a categorial variable into a numeric vector to use in our machine learning model later.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Instantiate a
StringIndexertransformer calledstring_indexerwithSCHOOLDISTRICTNUMBERas the input andSchool_Indexas the output. - Apply the transformer
string_indexertodfwithfit()andtransform(). Store the transformed dataframe inindexed_df. - Create a
OneHotEncodertransformer calledencoderusingSchool_Indexas the input andSchool_Vecas the output. - Apply the transformation to
indexed_dfusingtransform(). Inspect the iterative steps of the transformation with the supplied code.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from pyspark.ml.feature import OneHotEncoder, StringIndexer
# Map strings to numbers with string indexer
string_indexer = ____(inputCol=____, outputCol=____)
indexed_df = ____.____(df).____(df)
# Onehot encode indexed values
encoder = ____(inputCol=____, outputCol=____)
encoded_df = ____.____(indexed_df)
# Inspect the transformation steps
encoded_df[['SCHOOLDISTRICTNUMBER', 'School_Index', 'School_Vec']].show(truncate=100)