Naively Handling Missing and Categorical Values
Random Forest Regression is robust enough to let us skip many of the more time-consuming and tedious data preparation steps. Some implementations of Random Forest handle missing and categorical values automatically, but PySpark's does not. The underlying math is the same, however, so we can get away with some naive value replacements.
For missing values, since our data is strictly positive, we will assign -1. The random forest can split on this sentinel value and treat it differently from the rest of the values in the same feature.
For categorical values, we can simply map the text values to numbers, and again the random forest will handle them appropriately by splitting on them. In this example, we will dust off pipelines from Introduction to PySpark to write our code more concisely. Please note that the exercise will start by displaying the dtypes of the columns in the dataframe; compare them to the results at the end of this exercise.
NOTE: Pipeline and StringIndexer are already imported for you. The list categorical_cols is also available.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Replace the values in WALKSCORE and BIKESCORE with -1 using fillna() and the subset parameter.
- Create a list of StringIndexers by using list comprehension to iterate over each column in categorical_cols.
- Apply fit() and transform() to the pipeline indexer_pipeline.
- Drop the categorical_cols using drop() since they are no longer needed. Inspect the resulting data types using dtypes.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Replace missing values
df = df.____(____, ____=[____, ____])
# Create list of StringIndexers using list comprehension
indexers = [____(inputCol=____, outputCol=____+"_IDX")\
.setHandleInvalid("keep") for ____ in ____]
# Create pipeline of indexers
indexer_pipeline = Pipeline(stages=indexers)
# Fit and Transform the pipeline to the original data
df_indexed = ____.____(df).____(df)
# Clean up redundant columns
df_indexed = df_indexed.____(*____)
# Inspect data transformations
print(df_indexed.dtypes)