The test-train split
In a disciplined machine learning workflow, it is crucial to withhold a portion of your data (the testing data) from any decision-making process. This allows you to independently assess your model's performance once it is finalized. The remaining data, the training data, is used to build and select the best model.
In this exercise, you will use the rsample package to perform the initial train-test split of your gapminder data.
Note: Since this is a random split of the data, it is good practice to set a seed before splitting.
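As a quick illustration of why the seed matters, here is a minimal sketch (using the built-in mtcars data frame rather than the exercise's gapminder data): setting the same seed before each call to initial_split() reproduces the identical split.
library(rsample)

set.seed(42)
split_a <- initial_split(mtcars, prop = 0.75)

set.seed(42)
split_b <- initial_split(mtcars, prop = 0.75)

# Same seed, same randomly selected rows
identical(training(split_a), training(split_b))  # TRUE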
This exercise is part of the course "Machine Learning in the Tidyverse".
Exercise instructions
- Split your data into 75% training and 25% testing using the initial_split() function and assign it to gap_split.
- Extract the training data frame from gap_split using the training() function.
- Extract the testing data frame from gap_split using the testing() function.
- Ensure that the dimensions of your new data frames are what you expected by using the dim() function on training_data and testing_data.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load the rsample package for the splitting functions
library(rsample)

# Set a seed so the random split is reproducible
set.seed(42)

# Prepare the initial split object
gap_split <- initial_split(___, prop = ___)
# Extract the training data frame
training_data <- ___
# Extract the testing data frame
testing_data <- ___
# Calculate the dimensions of both training_data and testing_data
dim(___)
dim(___)
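For reference, a possible completed version is sketched below. It assumes the gapminder data frame (for example, from the gapminder package) is already loaded; the exact data used in the course may differ.
library(rsample)
library(gapminder)

set.seed(42)

# 75% of rows go to training, the remaining 25% to testing
gap_split <- initial_split(gapminder, prop = 0.75)

# Extract the two data frames from the split object
training_data <- training(gap_split)
testing_data  <- testing(gap_split)

# Check that the row counts reflect the 75/25 split
dim(training_data)
dim(testing_data)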