Exercise

Generating a random test/train split

For the next several exercises you will use the mpg data from the ggplot2 package. The data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency (cty) from highway fuel efficiency (hwy).

In this exercise, you will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data). One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif().

If you have a dataset dframe with \(N\) rows, and you want a random subset containing approximately \(100 \cdot X\)% of those rows (where \(X\) is between 0 and 1), then:

  1. Generate a vector of uniform random numbers: gp = runif(N).
  2. dframe[gp < X, ] will contain approximately \(X \cdot N\) rows.
  3. dframe[gp >= X, ] will be the complement, containing the remaining rows.
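
The three steps above can be sketched as follows. This is only an illustration: the built-in mtcars data frame stands in for dframe, and the names subset_a and subset_b, the value X = 0.75, and the call to set.seed() are assumptions added here, not part of the exercise.

```r
# Illustrative sketch: mtcars (ships with base R) stands in for dframe.
set.seed(42)                    # assumption: fixed seed so the split is reproducible

dframe <- mtcars
N <- nrow(dframe)               # number of rows in the data frame
X <- 0.75                       # desired fraction for the first subset

gp <- runif(N)                  # step 1: one uniform random number per row

subset_a <- dframe[gp < X, ]    # step 2: roughly X * N rows
subset_b <- dframe[gp >= X, ]   # step 3: the complement

# Every row lands in exactly one subset, so the sizes always sum to N.
nrow(subset_a) + nrow(subset_b) == N
```

Because each row is assigned independently, the size of the first subset varies from run to run; only its expected value is \(X \cdot N\).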

Instructions

  • Use the function nrow to get the number of rows in the data frame mpg. Assign this count to the variable N and print it.
  • Calculate about how many rows 75% of N should be. Assign it to the variable target and print it.
  • Use runif() to generate a vector of N uniform random numbers, called gp.
  • Use gp to split mpg into mpg_train and mpg_test (with mpg_train containing approximately 75% of the data).
  • Use nrow() to check the size of mpg_train and mpg_test. Are they about the right size?
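
One possible way to work through these instructions is sketched below. It assumes the ggplot2 package is installed, and the call to set.seed() and the use of round() for target are choices made here for illustration rather than requirements of the exercise.

```r
library(ggplot2)                  # provides the mpg data frame

set.seed(123)                     # assumption: added only so the split is reproducible

N <- nrow(mpg)                    # number of rows in mpg
print(N)

target <- round(0.75 * N)         # approximate size of a 75% training set
print(target)

gp <- runif(N)                    # N uniform random numbers between 0 and 1

mpg_train <- mpg[gp < 0.75, ]     # about 75% of the rows
mpg_test  <- mpg[gp >= 0.75, ]    # the remaining rows, about 25%

nrow(mpg_train)                   # compare with target
nrow(mpg_test)                    # compare with N - target
```

The sizes will rarely match target exactly. Each row enters mpg_train independently with probability 0.75, so the training set size follows a binomial distribution with mean \(0.75 N\) and standard deviation \(\sqrt{N \cdot 0.75 \cdot 0.25}\). For a data frame of a couple of hundred rows, that is roughly six or seven rows, so a difference of that order is expected.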