1. Learn
  2. /
  3. Courses
  4. /
  5. Introduction to Spark with sparklyr in R

Exercise

Random Forest: modeling

Like gradient boosted trees, random forests are another form of ensemble model. That is, they use lots of simpler models (decision trees, again) and combine them to make a single better model. Rather than running the same model iteratively, random forests run lots of separate models in parallel, each on a randomly chosen subset of the data, with a randomly chosen subset of features. Then the final decision tree makes predictions by aggregating the results from the individual models.

sparklyr's random forest function is called ml_random_forest(). Its usage is exactly the same as ml_gradient_boosted_trees() (see the first exercise of this chapter for a reminder on syntax).

Instructions

100 XP

A Spark connection has been created for you as spark_conn. A tibble attached to the combined and filtered track metadata/timbre data stored in Spark has been pre-defined as track_data_to_model_tbl.

  • Repeat your year prediction analysis, using a random forest model this time.
    • Get the timbre columns from track_data_to_model_tbl and assign the result to feature_colnames.
    • Run the random forest model and assign the result to random_forest_model.