Categorical columns
In the flights data there are two columns, carrier and org, which hold categorical data. You need to transform those columns into indexed numerical values.
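To make "indexed numerical values" concrete, here is a minimal pure-Python sketch of the idea, with no Spark involved. It assumes the ordering that StringIndexer is documented to use by default: the most frequent label gets index 0.0, and ties are broken alphabetically. The helper names fit_index and apply_index and the sample carrier codes are hypothetical, chosen only for illustration. They are not part of PySpark or of the flights data.

```python
from collections import Counter


def fit_index(values):
    """Learn a label -> index mapping, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def apply_index(values, mapping):
    """Replace each label with its learned numeric index."""
    return [mapping[v] for v in values]


# Hypothetical carrier codes, not taken from the flights data.
carriers = ["UA", "AA", "UA", "OO", "UA", "AA"]

carrier_mapping = fit_index(carriers)                  # the "prepare" step
carrier_idx = apply_index(carriers, carrier_mapping)   # the "create column" step

print(carrier_mapping)
print(carrier_idx)
```

The two helpers mirror the two stages in the exercise: one pass over the data to identify the categories, then a second pass to write out the numeric index for each row.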
This exercise is part of the course
Machine Learning with PySpark
Exercise instructions
- Import the appropriate class and create an indexer object to transform the carrier column from a string to a numeric index.
- Prepare the indexer object on the flight data.
- Use the prepared indexer to create the numeric index column.
- Repeat the process for the org column.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from pyspark.ml.feature import ____
# Create an indexer
indexer = ____(inputCol=____, outputCol='carrier_idx')
# Indexer identifies categories in the data
indexer_model = indexer.____(flights)
# Indexer creates a new column with numeric index values
flights_indexed = ____.____(____)
# Repeat the process for the other categorical feature
flights_indexed = ____(inputCol=____, outputCol='org_idx').____(____).____(____)
flights_indexed.show(5)