Exercise

# Split the data

These examples use a subset of the Student Performance Dataset from the UCI Machine Learning Repository.

The goal of this exercise is to predict a student's final Mathematics grade based on the following variables: `sex`, `age`, `address`, `studytime` (weekly study time), `schoolsup` (extra educational support), `famsup` (family educational support), `paid` (extra paid classes within the course subject) and `absences`.

The response is `final_grade` (numeric: from 0 to 20, output target).

After initial exploration, split the data into training, validation, and test sets. In this chapter, we will introduce the idea of a validation set, which can be used to select a "best" model from a set of competing models.
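As a rough illustration of how a validation set can pick among competing models, the sketch below fits two candidate models on training rows and compares their error on validation rows. Everything here is an assumption made for illustration: the simulated `toy` data, the use of `lm()`, the helper `rmse()`, and the split sizes are not part of the exercise.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

# Simulated toy data (hypothetical, not the student dataset)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n)

# Shuffle the row indices, then carve out training and validation rows;
# the remaining rows stay untouched as a test set
idx <- sample(n)
train <- toy[idx[1:210], ]
valid <- toy[idx[211:255], ]

# Two competing models, both fit on the training rows only
candidates <- list(
  small = lm(y ~ x1, data = train),
  large = lm(y ~ x1 + x2, data = train)
)

# Hypothetical helper: root mean squared error of a model on a data frame
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

# Score every candidate on the validation rows and keep the lowest error
valid_rmse <- sapply(candidates, rmse, data = valid)
best <- names(which.min(valid_rmse))
```

The test rows are never consulted while choosing `best`, so they remain available for an unbiased final estimate of the selected model's performance.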

In Chapter 1, we demonstrated a simple way to split the data into two pieces using the `sample()` function. In this exercise, we will take a slightly different approach to splitting the data that allows us to split the data into more than two parts (here, we want three: train, validation, test). We still use the `sample()` function, but instead of sampling the indices themselves, we will assign each row to either the training, validation or test sets according to a probability distribution.

The dataset `grade` is already in your workspace.
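The idea of assigning rows by a probability distribution can be sketched on a small stand-in data frame. The `toy` data frame, the seed value, and the 70/15/15 weights below are assumptions chosen for illustration; the exercise itself works on `grade`.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

# Hypothetical stand-in data frame with 1000 rows
toy <- data.frame(id = 1:1000)

# Draw one label per row: 1 = train, 2 = validation, 3 = test.
# `prob` gives the chance of each label; `replace = TRUE` is required
# because we draw many more labels than there are distinct values.
assignment <- sample(1:3, size = nrow(toy),
                     prob = c(0.7, 0.15, 0.15), replace = TRUE)

# The observed shares are close to, but not exactly, 70/15/15
prop.table(table(assignment))
```

Because each row is labelled independently, the resulting set sizes vary slightly from run to run unless the seed is fixed.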

Instructions


- Take a look at the data using the `str()` function.
- Set a seed (for reproducibility) and then sample `n_train` rows to define the set of training set indices.
  - Draw a sample of size `nrow(grade)` from the numbers 1 to 3 (with replacement). You want approximately 70% of the sample to be 1 and the remaining 30% to be equally split between 2 and 3.
- Subset `grade` using the sample you just drew so that indices with the value 1 are in `grade_train`, indices with the value 2 are in `grade_valid`, and indices with the value 3 are in `grade_test`.
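One way the steps above might look in R is sketched below. It is not the official solution: the seed value is arbitrary, the name `assignment` is a hypothetical choice, and the simulated `grade` stand-in exists only so the code runs outside the exercise workspace (it contains placeholder values, not the real data).

```r
# Stand-in so the sketch runs outside the exercise workspace.
# These are simulated placeholder values, NOT the real student data.
if (!exists("grade")) {
  set.seed(42)
  n <- 400
  grade <- data.frame(
    final_grade = sample(0:20, n, replace = TRUE),
    age         = sample(15:22, n, replace = TRUE),
    studytime   = sample(1:4, n, replace = TRUE),
    absences    = sample(0:30, n, replace = TRUE)
  )
}

# Look at the structure of the data
str(grade)

# Set a seed for reproducibility (the value 1 is arbitrary)
set.seed(1)

# Label every row 1 (train), 2 (validation) or 3 (test),
# with probabilities 70%, 15% and 15%
assignment <- sample(1:3, size = nrow(grade),
                     prob = c(0.7, 0.15, 0.15), replace = TRUE)

# Subset the data frame by label
grade_train <- grade[assignment == 1, ]
grade_valid <- grade[assignment == 2, ]
grade_test  <- grade[assignment == 3, ]
```

Every row receives exactly one label, so the three subsets are disjoint and together cover all of `grade`.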