Building a Regression Model

One of the great things about PySpark ML module is that most algorithms can be tried and tested without changing much code. Random Forest Regression is a fairly simple ensemble model, using bagging to fit. Another tree based ensemble model is Gradient Boosted Trees which uses a different approach called boosting to fit. In this exercise let's train a GBTRegressor.

Import GBTRegressor from pyspark.ml.regression which you will notice is the same module as RandomForestRegressor.
Instantiate GBTRegressor with featuresCol set to the vector column of our features named, features, labelCol set to our dependent variable, SALESCLOSEPRICE and the random seed to 42
Train the model by calling fit() on gbt with the imported training data, train_df.

Exploratory Data Analysis

Wrangling with Spark Functions

Feature Engineering

Building a Model

Exercise

Building a Regression Model

Instructions