Generating a random test/train split
For the next several exercises you will use the mpg data from the package ggplot2. The data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency from highway fuel efficiency.
In this exercise, you will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data). One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif().
If you have a dataset dframe with \(N\) rows, and you want a random subset containing approximately \(100 * X\)% of those rows (where \(X\) is between 0 and 1), then:
- Generate a vector of uniform random numbers: gp = runif(N).
- dframe[gp < X,] will be about the right size.
- dframe[gp >= X,] will be the complement.
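The recipe above can be sketched on a small made-up data frame. This is only an illustration: the contents of dframe, the value of X, and the set.seed() call (added so the result is repeatable) are assumptions, not part of the exercise.

```r
# Assumption: a fixed seed, used only so the sketch gives the same split each run
set.seed(42)

# Hypothetical data frame standing in for dframe
dframe <- data.frame(id = seq_len(1000), value = rnorm(1000))

N <- nrow(dframe)   # number of rows
X <- 0.75           # desired fraction for the subset (assumed value)

# One uniform random number in [0, 1] per row
gp <- runif(N)

# Rows with gp < X form the subset; the remaining rows form the complement
subset_rows     <- dframe[gp < X, ]
complement_rows <- dframe[gp >= X, ]

# The subset fraction should be close to X, but not exactly X
nrow(subset_rows) / N

# Every row lands in exactly one of the two pieces
nrow(subset_rows) + nrow(complement_rows) == N
```

Because each row is assigned independently, the subset size varies from run to run; it only matches \(X \cdot N\) approximately.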
This exercise is part of the course Supervised Learning in R: Regression.
Exercise instructions
- Use the function nrow to get the number of rows in the data frame mpg. Assign this count to the variable N and print it.
- Calculate about how many rows 75% of N should be. Assign it to the variable target and print it.
- Use runif() to generate a vector of N uniform random numbers, called gp.
- Use gp to split mpg into mpg_train and mpg_test (with mpg_train containing approximately 75% of the data).
- Use nrow() to check the size of mpg_train and mpg_test. Are they about the right size?
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# mpg is available
summary(mpg)
dim(mpg)
# Use nrow to get the number of rows in mpg (N) and print it
(N <- ___)
# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- ___)
# Create the vector of N uniform random variables: gp
gp <- ___
# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- ___
mpg_test <- ___
# Use nrow() to examine mpg_train and mpg_test
___
___
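
One possible way to fill in the blanks is sketched below. It is not necessarily the course's reference solution. It assumes the ggplot2 package is installed so that mpg can be loaded, and the set.seed() call is an addition made only so the split is repeatable.

```r
# Assumption: ggplot2 is installed; loading it makes the mpg data available
library(ggplot2)

# Assumption: a fixed seed, used only so the sketch gives the same split each run
set.seed(1)

# Number of rows in mpg
(N <- nrow(mpg))

# About how many rows 75% of N should be
(target <- round(0.75 * N))

# Vector of N uniform random numbers
gp <- runif(N)

# Rows with gp < 0.75 go to the training set, the rest to the test set
mpg_train <- mpg[gp < 0.75, ]
mpg_test  <- mpg[gp >= 0.75, ]

# Check the sizes; mpg_train should be near target, not necessarily equal to it
nrow(mpg_train)
nrow(mpg_test)
```

The two sets always add up to N rows, but the training set size fluctuates around target because each row is assigned at random.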