
Train/test split

To objectively assess a machine learning model, you need to test it on an independent set of data. You can't use the same data that you used to train the model: the model has already seen those data, so it will naturally perform (relatively) well on them!

You will split the data into two components:

  • training data (used to train the model) and
  • testing data (used to test the model).

Note: From here on you'll be working with a smaller subset of the flights data, so that the exercises run more quickly.

This exercise is part of the course

Machine Learning with PySpark

Exercise instructions

  • Randomly split the flights data into two sets in an 80:20 ratio. For repeatability, set a random number seed of 43 for the split.
  • Check that the training data has roughly 80% of the records from the original data.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Split into training and testing sets in an 80:20 ratio
flights_train, flights_test = flights.____(____, ____)

# Check that training set has around 80% of records
training_ratio = flights_train.____() / ____.____()
print(training_ratio)