Aan de slagGa gratis aan de slag

Train/test split

To objectively assess a Machine Learning model you need to be able to test it on an independent set of data. You can't use the same data that you used to train the model: of course the model will perform (relatively) well on those data!

You will split the data into two components:

  • training data (used to train the model) and
  • testing data (used to test the model).

Note: From here on you'll be working with a smaller subset of the flights data, which just makes the exercises run more quickly.

Deze oefening maakt deel uit van de cursus

Machine Learning with PySpark

Cursus bekijken

Oefeninstructies

  • Randomly split the flights data into two sets with 80:20 proportions. For repeatability set a random number seed of 43 for the split.
  • Check that the training data has roughly 80% of the records from the original data.

Praktische interactieve oefening

Probeer deze oefening eens door deze voorbeeldcode in te vullen.

# Split into training and testing sets in a 80:20 ratio
flights_train, flights_test = flights.____(____, ____)

# Check that training set has around 80% of records
training_ratio = flights_train.____() / ____.____()
print(training_ratio)
Code bewerken en uitvoeren