Strings and factors
As you know, Spark requires numeric data for modeling. So far this hasn't been an issue; even boolean columns can easily be converted to integers without any trouble. But you'll also be using the airline and the plane's destination as features in your model. These are coded as strings and there isn't any obvious way to convert them to a numeric data type.
Fortunately, PySpark has functions for handling this built into the pyspark.ml.feature submodule. You can create what are called 'one-hot vectors' to represent the carrier and the destination of each flight. A one-hot vector is a way of representing a categorical feature where every observation has a vector in which all elements are zero except for at most one element, which has a value of one (1).
Each element in the vector corresponds to a level of the feature, so it's possible to tell what the right level is by seeing which element of the vector is equal to one (1).
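To make the idea concrete, here is a minimal plain-Python sketch of a one-hot vector. It does not use Spark at all; the helper `one_hot` and the carrier codes in `levels` are made up for illustration.

```python
def one_hot(level, levels):
    """Return a list with a 1 at the position of `level` and 0 everywhere else."""
    return [1 if level == candidate else 0 for candidate in levels]


# Hypothetical levels of a categorical "carrier" feature
levels = ["AS", "DL", "UA"]

# Encode one observation
vector = one_hot("DL", levels)

# Recover the level by finding which element is equal to one
recovered = levels[vector.index(1)]

# A level that is not in the list produces a vector of all zeros,
# which is why the definition says "at most one" element is one
unknown = one_hot("ZZ", levels)
```

Here `vector` has one slot per level, and the position of the single 1 tells you which carrier the observation belongs to.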
The first step to encoding your categorical feature is to create a StringIndexer. Members of this class are Estimators that take a DataFrame with a column of strings and map each unique string to a number. Then, the Estimator returns a Transformer that takes a DataFrame, attaches the mapping to it as metadata, and returns a new DataFrame with a numeric column corresponding to the string column.
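The Estimator and Transformer split can be sketched in plain Python. This is a toy stand-in, not Spark's implementation: the classes `ToyStringIndexer` and `ToyIndexerModel` are hypothetical, they work on lists rather than DataFrames, and the ordering rule (most frequent string first, ties broken alphabetically) is an assumption chosen for the example.

```python
from collections import Counter


class ToyIndexerModel:
    """Plays the role of the Transformer: it holds a fitted string-to-number mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def transform(self, values):
        # Replace each string with the number it was assigned during fitting
        return [self.mapping[value] for value in values]


class ToyStringIndexer:
    """Plays the role of the Estimator: fit() learns the mapping and returns a model."""

    def fit(self, values):
        counts = Counter(values)
        # Assumed ordering: most frequent string first, ties broken alphabetically
        ordered = sorted(counts, key=lambda s: (-counts[s], s))
        return ToyIndexerModel({s: float(i) for i, s in enumerate(ordered)})


# Hypothetical column of carrier codes
carriers = ["AS", "VX", "AS", "DL", "AS", "DL"]

model = ToyStringIndexer().fit(carriers)   # Estimator -> Transformer
indexed = model.transform(carriers)        # strings -> numbers
```

The point of the split is that the mapping is learned once in `fit` and can then be reused by `transform` on any data with the same strings.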
The second step is to encode this numeric column as a one-hot vector using a OneHotEncoder. This works exactly the same way as the StringIndexer by creating an Estimator and then a Transformer. The end result is a column that encodes your categorical feature as a vector that's suitable for machine learning routines!
This may seem complicated, but don't worry! All you have to remember is that you need to create a StringIndexer and a OneHotEncoder, and the Pipeline will take care of the rest.
Why do you have to encode a categorical feature as a one-hot vector?
This exercise is part of the course Foundations of PySpark.