Divide and conquer: train and test sets

When we use a statistical method to make predictions, we need data for testing how well those predictions fit. Splitting the original data into train and test sets lets us check how well our model works.

The model is trained on the train set, and predictions are then made on the test set, which the model has not seen during training. Because the true classes (labels) of the test data are known, you can calculate how well the model performed in prediction.
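As a minimal sketch of what "calculating how well the model performed" can look like, the snippet below compares held-back true classes with a set of predictions. Both vectors are hypothetical values made up for illustration (they do not come from the Boston data or from any fitted model), and `predicted` and `accuracy` are names chosen here, not part of the exercise.

```r
# Hypothetical true classes that were held back from the test set
class_levels <- c("low", "med", "high")
correct_classes <- factor(c("low", "high", "low", "med", "high"),
                          levels = class_levels)

# Hypothetical predictions for the same five test rows
predicted <- factor(c("low", "high", "med", "med", "high"),
                    levels = class_levels)

# Cross-tabulate true classes against predictions
confusion <- table(correct = correct_classes, predicted = predicted)
print(confusion)

# Share of test rows where the prediction matches the true class
accuracy <- mean(predicted == correct_classes)
print(accuracy)
```

The diagonal of the table counts correct predictions; everything off the diagonal is a misclassification.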

Time to split our data!

This exercise is part of the course

Helsinki Open Data Science

Exercise instructions

Use the function nrow() on boston_scaled to get the number of rows in the dataset. Save the number of rows in n.
Execute the code to randomly choose 80% of the rows and save the row numbers in ind.
Create the train set by selecting the rows whose numbers are saved in ind.
Create the test set by excluding the rows that are used in the train set.
Take the crime classes from the test set and save them as correct_classes.
Execute the code to remove the crime variable from the test set.
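The same sequence of steps can be sketched on R's built-in iris data. This is an illustration under stated assumptions, not the exercise solution: iris stands in for boston_scaled, Species stands in for crime, set.seed() and floor() are added here for reproducibility and an explicit integer sample size, and base R subsetting replaces dplyr::select() so the snippet needs no extra packages.

```r
# Assumption: iris stands in for boston_scaled, Species for crime
set.seed(123)                       # added only to make the split reproducible

# number of rows in the dataset
n <- nrow(iris)

# choose randomly 80% of the rows
ind <- sample(n, size = floor(n * 0.8))

# train set: the sampled rows
train <- iris[ind, ]

# test set: every row that is not in the train set
test <- iris[-ind, ]

# save the correct classes from the test data
correct_classes <- test$Species

# remove the class variable from the test data
test <- test[, names(test) != "Species"]

# check the sizes of the two sets
print(dim(train))
print(dim(test))
```

Negative indexing with -ind drops exactly the sampled rows, so no row can end up in both sets.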

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# boston_scaled is available

# number of rows in the Boston dataset 
n <- "change me!"

# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- "change me!"

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
