Random Forest: modeling
Like gradient boosted trees, random forests are a form of ensemble model. That is, they combine lots of simpler models (decision trees, again) into a single better model. Rather than fitting trees one after another, random forests fit lots of separate trees in parallel, each on a randomly chosen subset of the data, with a randomly chosen subset of features. The forest then makes its predictions by aggregating the results from the individual trees.
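The idea can be sketched locally in plain R, without Spark. This is only an illustration under stated assumptions, not how sparklyr or Spark implement it: it uses the rpart package for the individual trees, the built-in mtcars data, and an arbitrary choice of 25 trees with 3 features each. It also picks one feature subset per tree, as described above, whereas production implementations typically re-sample features at each split.

```r
library(rpart)  # decision trees; a recommended package shipped with most R installs

set.seed(42)
features <- setdiff(colnames(mtcars), "mpg")  # predict mpg from the other columns

n_trees <- 25  # arbitrary choice for this sketch
trees <- lapply(seq_len(n_trees), function(i) {
  rows <- sample(nrow(mtcars), replace = TRUE)  # random subset of the data (bootstrap)
  cols <- sample(features, size = 3)            # random subset of the features
  rpart(reformulate(cols, response = "mpg"), data = mtcars[rows, ])
})

# Aggregate: for regression, average the predictions of the individual trees
per_tree <- sapply(trees, predict, newdata = mtcars)  # one column per tree
predictions <- rowMeans(per_tree)
head(predictions)
```

Averaging is the usual aggregation for a numeric response such as a year; for classification, a majority vote across trees plays the same role.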
sparklyr's random forest function is called ml_random_forest(). Its usage is exactly the same as ml_gradient_boosted_trees() (see the first exercise of this chapter for a reminder on syntax).
This exercise is part of the course Introduction to Spark with sparklyr in R.
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the combined and filtered track metadata/timbre data stored in Spark has been pre-defined as track_data_to_model_tbl.
- Repeat your year prediction analysis, using a random forest model this time.
  - Get the timbre columns from track_data_to_model_tbl and assign the result to feature_colnames.
  - Create the formula for the model using reformulate().
  - Run the random forest model and assign the result to random_forest_model.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_data_to_model_tbl has been pre-defined
track_data_to_model_tbl
# Get the timbre columns
feature_colnames <- ___
# Create the formula for the model
year_formula <- ___
# Run the random forest model
random_forest_model <- ___
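One possible way to fill in the blanks is sketched below. It is not an official solution. It assumes the exercise environment described above (sparklyr loaded, spark_conn open, track_data_to_model_tbl defined), that the timbre columns all contain the string "timbre" in their names, and that the response column is named year. The last two are assumptions based on the exercise wording, so check them against colnames(track_data_to_model_tbl) first.

```r
library(sparklyr)

# Get the timbre columns (assumes their names contain "timbre")
feature_colnames <- grep(
  "timbre", colnames(track_data_to_model_tbl),
  fixed = TRUE, value = TRUE
)

# Create the formula for the model (assumes the response column is "year")
year_formula <- reformulate(feature_colnames, response = "year")

# Run the random forest model
random_forest_model <- ml_random_forest(track_data_to_model_tbl, year_formula)
```

reformulate() builds a formula of the form year ~ feature1 + feature2 + ... from the character vector, which avoids typing the feature names by hand.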