Get startedGet started for free

Random Forest: modeling

Like gradient boosted trees, random forests are another form of ensemble model. That is, they use lots of simpler models (decision trees, again) and combine them to make a single better model. Rather than running the same model iteratively, random forests run lots of separate models in parallel, each on a randomly chosen subset of the data, with a randomly chosen subset of features. Then the final decision tree makes predictions by aggregating the results from the individual models.

sparklyr's random forest function is called ml_random_forest(). Its usage is exactly the same as ml_gradient_boosted_trees() (see the first exercise of this chapter for a reminder on syntax).

This exercise is part of the course

Introduction to Spark with sparklyr in R

View Course

Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the combined and filtered track metadata/timbre data stored in Spark has been pre-defined as track_data_to_model_tbl.

  • Repeat your year prediction analysis, using a random forest model this time.
    • Get the timbre columns from track_data_to_model_tbl and assign the result to feature_colnames.
    • Create the formula for the model using reformulate().
    • Run the random forest model and assign the result to random_forest_model.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# track_data_to_model_tbl has been pre-defined
track_data_to_model_tbl

# Get the timbre columns
feature_colnames <- ___

# Create the formula for the model
year_formula <- ___

# Run the random forest model
random_forest_model <- ___
Edit and Run Code