Divide and conquer: train and test sets
When we use a statistical method to predict something, we need data to test how well the predictions fit. Splitting the original data into train and test sets lets us check how well the model works on data it has not seen.
The model is trained on the train set and then used to predict on the test set, which plays the role of new data. Because the true classes (labels) of the test data are known, you can compare them with the predictions and calculate how well the model performed.
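To make the idea of checking predictions against known labels concrete, here is a minimal sketch in R. The data frame toy, its columns, and the majority-class "model" are all hypothetical stand-ins chosen for illustration; they are not part of the course material, and a real analysis would fit an actual classifier on the train set.

```r
# Hypothetical toy data: one numeric predictor and a two-level class label.
set.seed(123)  # fixed seed so the random split is reproducible
toy <- data.frame(
  x = rnorm(100),
  class = factor(sample(c("low", "high"), 100, replace = TRUE))
)

# Hold out 20% of the rows as a test set
n <- nrow(toy)
ind <- sample(n, size = floor(n * 0.8))
train <- toy[ind, ]
test <- toy[-ind, ]

# Stand-in "model" (an assumption for illustration only):
# always predict the most common class seen in the train set
majority <- names(which.max(table(train$class)))
predicted <- factor(rep(majority, nrow(test)), levels = levels(toy$class))

# Compare the predictions with the true labels of the test set
confusion <- table(correct = test$class, predicted = predicted)
accuracy <- mean(predicted == test$class)
print(confusion)
print(accuracy)
```

The cross-tabulation shows where predictions and true labels agree, and the accuracy is simply the share of test rows predicted correctly.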
Time to split our data!
This exercise is part of the course
Helsinki Open Data Science
Exercise instructions
- Use the function nrow() on boston_scaled to get the number of rows in the dataset. Save the number of rows in n.
- Execute the code to randomly choose 80% of the rows and save the row numbers to ind.
- Create the train set by selecting the row numbers that are saved in ind.
- Create the test set by excluding the rows that are used in the train set.
- Take the crime classes from the test set and save them as correct_classes.
- Execute the code to remove crime from the test set.
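The two indexing ideas the steps rely on, sampling row numbers and then excluding them with a negative index, can be tried on a small vector first. This is only an illustrative sketch; the vector rows and its size are made up for the example.

```r
set.seed(1)  # fixed seed so the sample is reproducible
rows <- 1:10

# sample() draws without replacement by default, so the 8 positions are distinct
ind <- sample(length(rows), size = 8)

kept <- rows[ind]       # positive index: keep the sampled positions
left_out <- rows[-ind]  # negative index: keep everything except those positions

print(sort(kept))
print(left_out)
```

Because the positions in ind are distinct, kept and left_out never overlap and together cover every element of rows.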
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# boston_scaled is available
# number of rows in the Boston dataset
n <- "change me!"
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- "change me!"
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
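One possible way to fill in the placeholders is sketched below. Since boston_scaled is only available inside the exercise environment, this sketch builds a hypothetical stand-in with the same number of rows as the Boston data (506), two scaled numeric columns, and a categorical crime column; the random values and the class labels are assumptions for illustration. It also wraps the sample size in floor() and drops crime with base R indexing instead of dplyr::select(), so that it runs without extra packages.

```r
# Hypothetical stand-in for boston_scaled (random values, illustration only)
set.seed(2024)
crime_levels <- c("low", "med_low", "med_high", "high")  # assumed class labels
boston_scaled <- data.frame(
  rm = as.numeric(scale(rnorm(506))),
  lstat = as.numeric(scale(rnorm(506))),
  crime = factor(sample(crime_levels, 506, replace = TRUE), levels = crime_levels)
)

# number of rows in the dataset
n <- nrow(boston_scaled)

# choose randomly 80% of the rows (floor() makes the size a whole number)
ind <- sample(n, size = floor(n * 0.8))

# create train set
train <- boston_scaled[ind, ]

# create test set from the rows that are not in the train set
test <- boston_scaled[-ind, ]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
# (base R alternative to dplyr::select(test, -crime))
test <- test[, names(test) != "crime"]

print(dim(train))
print(dim(test))
print(table(correct_classes))
```

Saving correct_classes before dropping crime matters: once the column is removed from test, the true labels are only available through that saved vector.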