Naively Handling Missing and Categorical Values
Random Forest Regression is robust enough to let us skip many of the more time-consuming and tedious data preparation steps. While some implementations of Random Forest handle missing and categorical values automatically, PySpark's does not. The math remains the same, however, so we can get away with some naive value replacements.
For missing values, since our data is strictly positive, we will assign -1. The random forest will split on this value and handle it differently than the rest of the values in the same feature.
For categorical values, we can simply map the text values to numbers, and again the random forest will handle them appropriately by splitting on them. In this example, we will dust off pipelines from Introduction to PySpark to write our code more concisely. Please note that the exercise starts by displaying the dtypes of the columns in the dataframe; compare them to the results at the end of this exercise.
NOTE: Pipeline and StringIndexer are already imported for you. The list categorical_cols is also available.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Replace the values in WALKSCORE and BIKESCORE with -1 using fillna() and the subset parameter.
- Create a list of StringIndexers by using list comprehension to iterate over each column in categorical_cols.
- Apply fit() and transform() to the pipeline indexer_pipeline.
- Drop the categorical_cols using drop() since they are no longer needed. Inspect the resulting data types using dtypes.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Replace missing values
df = df.____(____, ____=[____, ____])
# Create list of StringIndexers using list comprehension
indexers = [____(inputCol=____, outputCol=____+"_IDX")\
.setHandleInvalid("keep") for ____ in ____]
# Create pipeline of indexers
indexer_pipeline = Pipeline(stages=indexers)
# Fit and Transform the pipeline to the original data
df_indexed = ____.____(df).____(df)
# Clean up redundant columns
df_indexed = df_indexed.____(*____)
# Inspect data transformations
print(df_indexed.dtypes)
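One possible completion of the blanks above, sketched against a toy DataFrame and an invented categorical_cols list (in the exercise, df and categorical_cols are provided for you, so only the four commented steps apply):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# Stand-ins for the exercise's df and categorical_cols (assumptions)
spark = SparkSession.builder.appName("solution_sketch").getOrCreate()
categorical_cols = ["CITY", "SCHOOL_DISTRICT"]
df = spark.createDataFrame(
    [(91, None, "Minneapolis", "A"), (None, 55, "St. Paul", "B")],
    ["WALKSCORE", "BIKESCORE", "CITY", "SCHOOL_DISTRICT"],
)

# Replace missing values
df = df.fillna(-1, subset=["WALKSCORE", "BIKESCORE"])

# Create list of StringIndexers using list comprehension
indexers = [StringIndexer(inputCol=col, outputCol=col + "_IDX")
            .setHandleInvalid("keep") for col in categorical_cols]

# Create pipeline of indexers
indexer_pipeline = Pipeline(stages=indexers)

# Fit and Transform the pipeline to the original data
df_indexed = indexer_pipeline.fit(df).transform(df)

# Clean up redundant columns
df_indexed = df_indexed.drop(*categorical_cols)

# Inspect data transformations
print(df_indexed.dtypes)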