One Hot Encoding
In the United States where you live determines which schools your kids can attend. Therefore it's understandable that many people care deeply about which school districts their future home will be in. While the school districts are numbered in SCHOOLDISTRICTNUMBER they are really categorical. Meaning that summing or averaging these values has no apparent meaning. Therefore in this example we will convert SCHOOLDISTRICTNUMBER from a categorial variable into a numeric vector to use in our machine learning model later.
Cet exercice fait partie du cours
Feature Engineering with PySpark
Instructions
- Instantiate a StringIndexertransformer calledstring_indexerwithSCHOOLDISTRICTNUMBERas the input andSchool_Indexas the output.
- Apply the transformer string_indexertodfwithfit()andtransform(). Store the transformed dataframe inindexed_df.
- Create a OneHotEncodertransformer calledencoderusingSchool_Indexas the input andSchool_Vecas the output.
- Apply the transformation to indexed_dfusingtransform(). Inspect the iterative steps of the transformation with the supplied code.
Exercice interactif pratique
Essayez cet exercice en complétant cet exemple de code.
from pyspark.ml.feature import OneHotEncoder, StringIndexer
# Map strings to numbers with string indexer
string_indexer = ____(inputCol=____, outputCol=____)
indexed_df = ____.____(df).____(df)
# Onehot encode indexed values
encoder = ____(inputCol=____, outputCol=____)
encoded_df = ____.____(indexed_df)
# Inspect the transformation steps
encoded_df[['SCHOOLDISTRICTNUMBER', 'School_Index', 'School_Vec']].show(truncate=100)