Generating a random test/train split
For the next several exercises you will use the mpg data from the package ggplot2. The data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency from highway fuel efficiency.
In this exercise, you will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data). One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif().
If you have a dataset dframe with \(N\) rows, and you want a random subset containing approximately \(100 \cdot X\)% of those rows (where \(X\) is between 0 and 1), then:
- Generate a vector of uniform random numbers: gp = runif(N).
- dframe[gp < X,] will be about the right size.
- dframe[gp >= X,] will be the complement.
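The steps above can be sketched on a small made-up data frame. This is only an illustration: the contents of dframe, the fraction X = 0.75, the seed, and the names train and test are assumptions chosen for the example, not part of the exercise.

```r
# Illustrative data frame; its contents are made up for this sketch
set.seed(42)                       # assumed seed, only for reproducibility
dframe <- data.frame(id = 1:200, y = rnorm(200))

N <- nrow(dframe)                  # number of rows in the data
X <- 0.75                          # assumed target fraction for the first subset

gp <- runif(N)                     # one uniform random number between 0 and 1 per row

train <- dframe[gp < X, ]          # roughly 100 * X percent of the rows
test  <- dframe[gp >= X, ]         # the complement: all remaining rows

nrow(train)
nrow(test)
```

Each row's gp value falls on exactly one side of X, so the two subsets never overlap and together cover every row. Only their sizes vary from run to run.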
This exercise is part of the course
Supervised Learning in R: Regression
Exercise instructions
- Use the function nrow() to get the number of rows in the data frame mpg. Assign this count to the variable N and print it.
- Calculate about how many rows 75% of N should be. Assign it to the variable target and print it.
- Use runif() to generate a vector of N uniform random numbers, called gp.
- Use gp to split mpg into mpg_train and mpg_test (with mpg_train containing approximately 75% of the data).
- Use nrow() to check the size of mpg_train and mpg_test. Are they about the right size?
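To get a feel for what "about the right size" means, it may help to repeat the random split many times and look at how much the training-set size varies. The sketch below is an illustration under assumed values (N = 200, X = 0.75, 1000 repetitions, a fixed seed), none of which come from the exercise. Each row lands in the training set independently with probability X, so the size follows a binomial distribution with mean N * X and standard deviation sqrt(N * X * (1 - X)).

```r
set.seed(1)                        # assumed seed, only for reproducibility
N <- 200                           # assumed number of rows
X <- 0.75                          # assumed training fraction
target <- round(N * X)             # the size we are aiming for

# Training-set sizes across 1000 repeated random splits
sizes <- replicate(1000, sum(runif(N) < X))

mean(sizes)                        # should be close to target
sd(sizes)                          # empirical spread of the sizes
sqrt(N * X * (1 - X))              # theoretical binomial standard deviation
```

A split whose training set is within a few standard deviations of target is behaving as expected; an exact match is the exception rather than the rule.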
Hands-on interactive exercise
Try this exercise by completing this sample code.
# mpg is available
summary(mpg)
dim(mpg)
# Use nrow to get the number of rows in mpg (N) and print it
(N <- ___)
# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- ___)
# Create the vector of N uniform random variables: gp
gp <- ___
# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- ___
mpg_test <- ___
# Use nrow() to examine mpg_train and mpg_test
___
___