
Evaluate the Decision Tree

You can assess the quality of your model by evaluating how well it performs on the testing data. Because the model was not trained on these data, this gives an objective measure of its performance.
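For context, here is a minimal sketch of how the prediction DataFrame used below might have been produced (assuming the flights data already has assembled features and label columns; the 80/20 split ratio and seed are illustrative, not prescribed by the course):

from pyspark.ml.classification import DecisionTreeClassifier

# Hold out a testing set that the model never sees during training
flights_train, flights_test = flights.randomSplit([0.8, 0.2], seed=17)

# Fit a decision tree on the training data only
tree = DecisionTreeClassifier()
tree_model = tree.fit(flights_train)

# Generate predictions on the held-out testing data
prediction = tree_model.transform(flights_test)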

A confusion matrix gives a useful breakdown of predictions versus known values. It has four cells, which represent the counts of:

  • True Negatives (TN) — model predicts negative outcome & known outcome is negative
  • True Positives (TP) — model predicts positive outcome & known outcome is positive
  • False Negatives (FN) — model predicts negative outcome but known outcome is positive
  • False Positives (FP) — model predicts positive outcome but known outcome is negative

These counts (TN, TP, FN and FP) should sum to the number of records in the testing data, which is only a subset of the full flights data. You can check this against flights_test.count().

Note: These predictions are made on the testing data, so the counts are smaller than they would have been for predictions on the training data.
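Once the four counts have been computed (see the code below), a quick sanity check is to confirm that they account for every record in the testing set:

# The four cells of the confusion matrix should cover all testing records
assert TN + TP + FN + FP == flights_test.count()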

This exercise is part of the course Machine Learning with PySpark.


Exercise instructions

  • Create a confusion matrix by counting the combinations of label and prediction. Display the result.
  • Count the number of True Negatives, True Positives, False Negatives and False Positives.
  • Calculate the accuracy.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create a confusion matrix
prediction.groupBy(____, 'prediction').____().____()

# Calculate the elements of the confusion matrix
TN = prediction.filter('prediction = 0 AND label = prediction').count()
TP = prediction.____('____ AND ____').____()
FN = prediction.____('____ AND ____').____()
FP = prediction.____('____ AND ____').____()

# Accuracy measures the proportion of correct predictions
accuracy = ____
print(accuracy)
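
For reference, one possible completion of the exercise (assuming prediction is the DataFrame of model output, with label and prediction columns as above):

# Create a confusion matrix by counting label/prediction combinations
prediction.groupBy('label', 'prediction').count().show()

# Calculate the elements of the confusion matrix
TN = prediction.filter('prediction = 0 AND label = prediction').count()
TP = prediction.filter('prediction = 1 AND label = prediction').count()
FN = prediction.filter('prediction = 0 AND label != prediction').count()
FP = prediction.filter('prediction = 1 AND label != prediction').count()

# Accuracy is the proportion of correct predictions
accuracy = (TN + TP) / (TN + TP + FN + FP)
print(accuracy)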