Naively Handling Missing and Categorical Values
Random Forest Regression is robust enough to let us skip many of the more time-consuming and tedious data preparation steps. Some implementations of Random Forest handle missing and categorical values automatically, but PySpark's does not. The underlying math is the same, however, so we can get away with some naive value replacements.
For missing values, since our data is strictly positive, we will assign -1. The random forest can split on this sentinel value and treat it differently from the rest of the values in the same feature.
For categorical values, we can simply map the text values to numbers, and again the random forest will handle them appropriately by splitting on them. In this example, we will dust off pipelines from Introduction to PySpark to write our code more concisely. Please note that the exercise will start by displaying the dtypes of the columns in the dataframe; compare them to the results at the end of this exercise.
NOTE: Pipeline and StringIndexer are already imported for you. The list categorical_cols is also available.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Replace the values in WALKSCORE and BIKESCORE with -1 using fillna() and the subset parameter.
- Create a list of StringIndexers by using list comprehension to iterate over each column in categorical_cols.
- Apply fit() and transform() to the pipeline indexer_pipeline.
- Drop the categorical_cols using drop() since they are no longer needed. Inspect the resulting data types using dtypes.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Replace missing values
df = df.____(____, ____=[____, ____])
# Create list of StringIndexers using list comprehension
indexers = [____(inputCol=____, outputCol=____+"_IDX")\
.setHandleInvalid("keep") for ____ in ____]
# Create pipeline of indexers
indexer_pipeline = Pipeline(stages=indexers)
# Fit and Transform the pipeline to the original data
df_indexed = ____.____(df).____(df)
# Clean up redundant columns
df_indexed = df_indexed.____(*____)
# Inspect data transformations
print(df_indexed.dtypes)