Encoding flight origin
The org column in the flights data is a categorical variable giving the airport from which a flight departs.
- ORD — O'Hare International Airport (Chicago)
- SFO — San Francisco International Airport
- JFK — John F Kennedy International Airport (New York)
- LGA — La Guardia Airport (New York)
- SMF — Sacramento
- SJC — San Jose
- OGG — Kahului (Hawaii)
This is only a small subset of airports. Nevertheless, because org is a categorical variable, it must be one-hot encoded before it can be used in a regression model; otherwise the model would treat the index values as an ordered numeric quantity.
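To make the encoding concrete, here is a minimal plain-Python sketch of turning a category index into a dummy vector. It is not the PySpark API: the function name one_hot and its drop_last parameter are illustrative assumptions. Setting drop_last=True mirrors the default dropLast behaviour of Spark's OneHotEncoder, where the final category is represented by a vector of all zeros.

```python
def one_hot(index, n_categories, drop_last=True):
    """Return a dense one-hot vector for a category index (illustrative helper)."""
    if not 0 <= index < n_categories:
        raise ValueError("index out of range")
    # With drop_last, the final category needs no slot of its own:
    # it is identified by a vector of all zeros.
    size = n_categories - 1 if drop_last else n_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector
```

With three categories, one_hot(0, 3) gives [1.0, 0.0], while the last category, one_hot(2, 3), gives [0.0, 0.0].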
The data are in a variable called flights. You have already used a string indexer to create a column of indexed values corresponding to the strings in org.
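As a reminder of what string indexing does, the sketch below assigns indices in plain Python rather than through PySpark. It assumes the default ordering of Spark's StringIndexer, in which the most frequent string receives index 0. The helper name index_labels and the toy list sample_org are hypothetical and not part of the exercise.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to an integer index (illustrative helper)."""
    counts = Counter(labels)
    # Most frequent label first; ties are broken alphabetically,
    # which is a choice made for this sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

# Toy data, not taken from the flights dataset
sample_org = ["ORD", "SFO", "ORD", "JFK", "SFO", "ORD"]
org_to_idx = index_labels(sample_org)
```

Here ORD occurs most often and so maps to 0, followed by SFO at 1 and JFK at 2.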
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Import the one-hot encoder class.
- Create a one-hot encoder instance, naming the input column org_idx and the output column org_dummy.
- Apply the one-hot encoder to the flights data.
- Generate a summary of the mapping from categorical values to binary encoded dummy variables. Include only unique values and order by org_idx.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the one hot encoder class
from pyspark.ml.____ import ____
# Create an instance of the one hot encoder
onehot = ____(inputCols=[____], outputCols=[____])
# Apply the one hot encoder to the flights data
onehot = onehot.____(____)
flights_onehot = onehot.____(____)
# Check the results
flights_onehot.____('org', 'org_idx', 'org_dummy').____().____('org_idx').show()
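The final step keeps three columns, retains only unique rows, and orders them by org_idx. A plain-Python analogue of that logic might look like the sketch below. It uses hypothetical toy rows rather than the real flights data or the DataFrame API, and the dummy values are made up for illustration.

```python
# Each toy row is (org, org_idx, org_dummy); values are made up for illustration.
rows = [
    ("SFO", 1, (0.0, 1.0)),
    ("ORD", 0, (1.0, 0.0)),
    ("JFK", 2, (0.0, 0.0)),
    ("ORD", 0, (1.0, 0.0)),
    ("SFO", 1, (0.0, 1.0)),
]

# Drop duplicate rows, then order by the index column.
summary = sorted(set(rows), key=lambda row: row[1])

for org, org_idx, org_dummy in summary:
    print(org, org_idx, org_dummy)
```

Each airport then appears exactly once, which makes it easy to read off which dummy vector corresponds to which index.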