One Hot Encoding
In the United States, where you live determines which schools your kids can attend. It is therefore understandable that many people care deeply about which school district their future home will be in. While the school districts are numbered in SCHOOLDISTRICTNUMBER, they are really categorical, meaning that summing or averaging these values has no useful interpretation. In this example we will therefore convert SCHOOLDISTRICTNUMBER from a categorical variable into a numeric vector to use in our machine learning model later.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Instantiate a StringIndexer transformer called string_indexer with SCHOOLDISTRICTNUMBER as the input and School_Index as the output.
- Apply the transformer string_indexer to df with fit() and transform(). Store the transformed dataframe in indexed_df.
- Create a OneHotEncoder transformer called encoder using School_Index as the input and School_Vec as the output.
- Apply the transformation to indexed_df using transform(). Inspect the iterative steps of the transformation with the supplied code.
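To make the two steps concrete, here is a minimal plain-Python sketch (no Spark required) of what indexing and one-hot encoding do to a column of district labels. The district labels and the helper names fit_string_index and one_hot are made up for illustration. The sketch assumes the StringIndexer default of ordering labels by descending frequency and the OneHotEncoder default of dropping the last category, so treat it as an approximation of the behavior rather than the library's implementation.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first.

    Ties are broken alphabetically; that tie rule is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Return a dense 0/1 vector for a single category index.

    With drop_last=True the last category is encoded as all zeros,
    so the vector has num_categories - 1 entries.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec


# Hypothetical district labels, not taken from the course dataset
districts = ["622", "834", "622", "833", "834", "622"]

mapping = fit_string_index(districts)                  # analogous to fit()
indexed = [mapping[d] for d in districts]              # analogous to transform()
encoded = [one_hot(i, len(mapping)) for i in indexed]  # one vector per row

for district, index, vector in zip(districts, indexed, encoded):
    print(district, index, vector)
```

The most frequent label gets index 0.0, and the least frequent one ends up as the all-zeros vector because its slot is dropped. No ordering or arithmetic between districts is implied by the resulting vectors, which is the point of the encoding.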
Hands-on interactive exercise
Try this exercise by completing this sample code.
from pyspark.ml.feature import OneHotEncoder, StringIndexer
# Map strings to numbers with string indexer
string_indexer = ____(inputCol=____, outputCol=____)
indexed_df = ____.____(df).____(df)
# One-hot encode the indexed values
encoder = ____(inputCol=____, outputCol=____)
encoded_df = ____.____(indexed_df)
# Inspect the transformation steps
encoded_df[['SCHOOLDISTRICTNUMBER', 'School_Index', 'School_Vec']].show(truncate=100)
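When the result is displayed, the School_Vec column is typically printed in sparse form, (size,[indices],[values]), rather than as a full list of zeros and ones. The short sketch below shows how such a triple expands to a dense vector. The helper name sparse_to_dense and the example values are hypothetical and do not come from the course data.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense


# A row printed as (4,[1],[1.0]) has length 4 with a single 1.0 at position 1
row_a = sparse_to_dense(4, [1], [1.0])
print(row_a)

# A row printed as (4,[],[]) is all zeros, which is how a dropped
# last category would appear
row_b = sparse_to_dense(4, [], [])
print(row_b)
```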