Random Forest: modeling

Like gradient boosted trees, random forests are another form of ensemble model. That is, they use lots of simpler models (decision trees, again) and combine them to make a single better model. Rather than running the same model iteratively, random forests run lots of separate models in parallel, each on a randomly chosen subset of the data, with a randomly chosen subset of features. Then the final decision tree makes predictions by aggregating the results from the individual models.

sparklyr's random forest function is called ml_random_forest(). Its usage is exactly the same as ml_gradient_boosted_trees() (see the first exercise of this chapter for a reminder on syntax).

A Spark connection has been created for you as spark_conn. A tibble attached to the combined and filtered track metadata/timbre data stored in Spark has been pre-defined as track_data_to_model_tbl.

Repeat your year prediction analysis, using a random forest model this time.
- Get the timbre columns from track_data_to_model_tbl and assign the result to feature_colnames.
- Create the formula for the model using reformulate().
- Run the random forest model and assign the result to random_forest_model.

Light My Fire: Starting To Use Spark With dplyr Syntax

Tools of the Trade: Advanced dplyr Usage

Going Native: Use The Native Interface to Manipulate Spark DataFrames

Case Study: Learning to be a Machine: Running Machine Learning Models on Spark

Exercise

Random Forest: modeling

Instructions