Train/test split
To objectively assess a machine learning model, you need to test it on an independent set of data. You can't reuse the data that you used to train the model: of course the model will perform (relatively) well on data it has already seen!
You will split the data into two components:
- training data (used to train the model) and
- testing data (used to test the model).
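To make the idea concrete, here is a minimal plain-Python sketch of a seeded random split. It is only an illustration, not how any particular library implements splitting: the helper name random_split and the toy records list are hypothetical, and the sketch assumes each record is assigned to one of the two sets independently at random.

```python
import random


def random_split(records, train_fraction=0.8, seed=43):
    """Assign each record independently to the training or testing set."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, test = [], []
    for record in records:
        # Each record lands in the training set with probability train_fraction
        if rng.random() < train_fraction:
            train.append(record)
        else:
            test.append(record)
    return train, test


records = list(range(10_000))  # toy stand-in for real data
train, test = random_split(records)

# Share of records that ended up in the training set
training_ratio = len(train) / len(records)
print(training_ratio)
```

Because every record is assigned at random, the resulting sizes are close to 80:20 but not exactly so. Rerunning with the same seed reproduces the same split.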
Note: From here on you'll be working with a smaller subset of the flights data, so that the exercises run more quickly.
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Randomly split the flights data into two sets with 80:20 proportions. For repeatability, set a random number seed of 43 for the split.
- Check that the training data has roughly 80% of the records from the original data.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Split into training and testing sets in an 80:20 ratio
flights_train, flights_test = flights.____(____, ____)
# Check that training set has around 80% of records
training_ratio = flights_train.____() / ____.____()
print(training_ratio)
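One possible way to approach the template is sketched below. It assumes that flights is a Spark DataFrame and relies on the DataFrame methods randomSplit() and count(). The wrapper function split_and_check is a hypothetical helper introduced only for illustration; it is not part of the course code.

```python
def split_and_check(df, train_fraction=0.8, seed=43):
    """Split a Spark DataFrame and report the share of rows in the training set."""
    # randomSplit() takes a list of weights and an optional seed for repeatability
    train, test = df.randomSplit([train_fraction, 1 - train_fraction], seed=seed)
    # count() returns the number of rows in a DataFrame
    training_ratio = train.count() / df.count()
    return train, test, training_ratio


# Usage, assuming `flights` is an existing Spark DataFrame:
# flights_train, flights_test, training_ratio = split_and_check(flights)
# print(training_ratio)
```

The weights describe proportions rather than exact sizes, so the printed ratio should be near 0.8 without matching it exactly.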