Randomly order the data frame

One way to make a train/test split of a dataset is to order the dataset randomly, then divide it into two sets. This ensures that the training set and test set are both random samples and that any bias in the original ordering (e.g. if the dataset had been sorted by price or size) is not carried into the samples you use for training and testing your models. You can think of this as shuffling a brand-new deck of playing cards before dealing hands.

First, you set a random seed so that your work is reproducible and you get the same random split each time you run your script:

set.seed(42)
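
To see why the seed matters, here is a minimal sketch that is not part of the original exercise. It resets the seed before each draw and checks that both draws give the same permutation. The names first_draw and second_draw are illustrative.

```r
# Reset the seed before each draw so both calls start from the same state
set.seed(42)
first_draw <- sample(10)

set.seed(42)
second_draw <- sample(10)

# The two permutations match because the seed was the same both times
same_draw <- identical(first_draw, second_draw)
print(same_draw)
```

Without the second set.seed(42) call, the second draw would continue from wherever the random number generator left off and would generally differ from the first.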

Next, you use the sample() function to shuffle the row indices of the diamonds dataset. You can later use these indices to reorder the dataset.

rows <- sample(nrow(diamonds))
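
When sample() is given a single positive integer n, it returns the integers 1 to n in random order, which is what makes the result usable as a set of row indices. The sketch below checks this on a small hypothetical row count (n stands in for nrow(diamonds), so the snippet runs without loading the dataset).

```r
# Hypothetical row count standing in for nrow(diamonds)
n <- 10

set.seed(42)
rows <- sample(n)

# Every index from 1 to n appears exactly once, only the order is random
is_permutation <- identical(sort(rows), seq_len(n))
print(rows)
print(is_permutation)
```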

Finally, you can use this random vector to reorder the diamonds dataset:

diamonds <- diamonds[rows, ]
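
Putting the three steps together, the following sketch shuffles a data frame and then divides it into two sets. It makes two assumptions that are not in the exercise: the built-in mtcars data frame stands in for diamonds so the code runs without extra packages, and the 80/20 proportion is only an example. The names shuffled, split_at, train and test are illustrative.

```r
# mtcars ships with base R and stands in for diamonds here (an assumption)
set.seed(42)
rows <- sample(nrow(mtcars))
shuffled <- mtcars[rows, ]

# Assumed 80/20 split; the proportion is illustrative, not from the exercise
split_at <- round(nrow(shuffled) * 0.80)
train <- shuffled[1:split_at, ]
test <- shuffled[(split_at + 1):nrow(shuffled), ]

# Every row lands in exactly one of the two sets
print(nrow(train) + nrow(test) == nrow(mtcars))
```

Because the rows were shuffled first, taking the leading rows for training and the remaining rows for testing gives two random samples rather than two slices of the original ordering.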

This exercise is part of the course

Machine Learning with caret in R


Exercise instructions

Set the random seed to 42.
Make a vector of row indices called rows.
Randomly reorder the diamonds data frame, assigning to shuffled_diamonds.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Set seed


# Shuffle row indices: rows


# Randomly order data
