Randomly order the data frame
One way to take a train/test split of a dataset is to order the dataset randomly, then divide it into the two sets. This ensures that the training set and test set are both random samples, and that any biases in the original ordering of the dataset (e.g. if it had originally been ordered by price or size) are not carried over into the samples you use to train and test your models. You can think of this like shuffling a brand-new deck of playing cards before dealing hands.
First, you set a random seed so that your work is reproducible and you get the same random split each time you run your script:
set.seed(42)
Next, you use the sample() function to shuffle the row indices of the diamonds dataset. You can later use these indices to reorder the dataset.
rows <- sample(nrow(diamonds))
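To see what sample() does here, note that when called with a single integer n it returns a random permutation of the integers 1 through n. A small standalone sketch (not part of the exercise itself):

```r
set.seed(42)

# Called with a single integer n, sample(n) returns a random
# permutation of the integers 1 through n
idx <- sample(5)

# Every index appears exactly once, so reordering a data frame
# with idx neither drops nor duplicates any rows
stopifnot(length(idx) == 5)
stopifnot(all(sort(idx) == 1:5))
```

Because the result is a permutation rather than a sample with replacement, indexing a data frame with it simply rearranges the rows.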
Finally, you can use this random vector to reorder the diamonds dataset:
diamonds <- diamonds[rows, ]
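The three steps can be put together as follows. This is a sketch assuming the diamonds data frame from ggplot2 is available; the 80/20 split at the end previews the next exercise in the course:

```r
library(ggplot2)  # provides the diamonds data frame

# Step 1: set a seed so the shuffle is reproducible
set.seed(42)

# Step 2: shuffle the row indices
rows <- sample(nrow(diamonds))

# Step 3: reorder the data frame with the shuffled indices
shuffled_diamonds <- diamonds[rows, ]

# Shuffling permutes the rows but keeps the data intact
stopifnot(nrow(shuffled_diamonds) == nrow(diamonds))
stopifnot(isTRUE(all.equal(sum(shuffled_diamonds$price), sum(diamonds$price))))

# An 80/20 train/test split then takes the first 80% of the
# shuffled rows for training and the remaining 20% for testing
split <- round(nrow(shuffled_diamonds) * 0.80)
train <- shuffled_diamonds[1:split, ]
test  <- shuffled_diamonds[(split + 1):nrow(shuffled_diamonds), ]
stopifnot(nrow(train) + nrow(test) == nrow(diamonds))
```

Because the rows were shuffled first, taking consecutive blocks of rows still yields two random samples.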
This is part of the course “Machine Learning with caret in R”.
Exercise instructions
- Set the random seed to 42.
- Make a vector of row indices called rows.
- Randomly reorder the diamonds data frame, assigning the result to shuffled_diamonds.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Set seed
set.seed(42)

# Shuffle row indices: rows
rows <- sample(nrow(diamonds))

# Randomly order data
shuffled_diamonds <- diamonds[rows, ]
This course teaches the big ideas in machine learning, such as how to build and evaluate predictive models. In the first chapter, you'll fit regression models with train() and evaluate their out-of-sample performance using cross-validation and root-mean-square error (RMSE).
- Exercise 1: Welcome to the course
- Exercise 2: In-sample RMSE for linear regression
- Exercise 3: In-sample RMSE for linear regression on diamonds
- Exercise 4: Out-of-sample error measures
- Exercise 5: Out-of-sample RMSE for linear regression
- Exercise 6: Randomly order the data frame
- Exercise 7: Try an 80/20 split
- Exercise 8: Predict on test set
- Exercise 9: Calculate test set RMSE by hand
- Exercise 10: Comparing out-of-sample RMSE to in-sample RMSE
- Exercise 11: Cross-validation
- Exercise 12: Advantage of cross-validation
- Exercise 13: 10-fold cross-validation
- Exercise 14: 5-fold cross-validation
- Exercise 15: 5 x 5-fold cross-validation
- Exercise 16: Making predictions on new data