
Randomly order the data frame

One way to take a train/test split of a dataset is to order the dataset randomly, then divide it into two sets. This ensures that the training set and test set are both random samples, and that any biases in the original ordering of the dataset (e.g. if it had been ordered by price or size) are not carried over into the samples you use for training and testing your models. You can think of this like shuffling a brand-new deck of playing cards before dealing hands.

First, you set a random seed so that your work is reproducible and you get the same random split each time you run your script:

set.seed(42)

Next, you use the sample() function to shuffle the row indices of the diamonds dataset. You can later use these indices to reorder the dataset.

rows <- sample(nrow(diamonds))
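As a quick illustration (not part of the exercise): when sample() is given a single integer n, it returns a random permutation of the integers 1 through n, so sample(nrow(diamonds)) is every row index in a shuffled order.

sample(5)   # a random permutation of 1:5, e.g. 2 4 1 5 3 (exact order depends on the seed)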

Finally, you can use this vector of shuffled row indices to reorder the diamonds dataset:

diamonds <- diamonds[rows, ]
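The point of shuffling is the split that follows. As a hedged sketch of that next step (the variable names split, train, and test are illustrative; the split itself is the subject of a later exercise), the first 80% of the shuffled rows can serve as the training set and the remaining 20% as the test set:

# Determine the row at which an 80/20 split falls
split <- round(nrow(diamonds) * 0.80)

# First 80% of the shuffled rows for training, the rest for testing
train <- diamonds[1:split, ]
test  <- diamonds[(split + 1):nrow(diamonds), ]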


Exercise instructions

  • Set the random seed to 42.
  • Make a vector of row indices called rows.
  • Randomly reorder the diamonds data frame, assigning to shuffled_diamonds.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Set seed


# Shuffle row indices: rows


# Randomly order data
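If you get stuck, here is one possible completion of the sample code above, following the instructions (any equivalent code that sets the seed, builds rows, and produces shuffled_diamonds is fine):

# Set seed
set.seed(42)

# Shuffle row indices: rows
rows <- sample(nrow(diamonds))

# Randomly order data
shuffled_diamonds <- diamonds[rows, ]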

This exercise is part of the course

Machine Learning with caret in R


This course teaches the big ideas in machine learning like how to build and evaluate predictive models.

In the first chapter of this course, you'll fit regression models with train() and evaluate their out-of-sample performance using cross-validation and root-mean-square error (RMSE).
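As a rough sketch of what that looks like (assumed code, not the course's exact solution; the formula price ~ . and the choice of 10 folds are illustrative): train() takes a model formula, a modelling method such as "lm", and a trainControl() object describing the resampling scheme, and the fitted object reports the cross-validated RMSE when printed.

# Minimal sketch: a linear model on diamonds with 10-fold cross-validation
library(caret)

model <- train(
  price ~ .,
  data = diamonds,
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)

print(model)  # the printed summary includes the cross-validated RMSE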

