1. Evaluating a model graphically
In the first part of this chapter, you will learn some ways to evaluate a model's prediction performance. We will show these techniques on linear models, but they also apply to other regression approaches. In this lesson, you will review how to evaluate a model graphically.
2. Plotting Ground Truth vs. Predictions
You've already seen the plot of ground truth vs. predictions. Predictions are on the x-axis, and actual outcomes are on the y-axis. The line x equals y represents the line of perfect prediction: if the model predicted perfectly, all the points would lie along this line.
You should look for the points to be evenly above and below the x equals y line, and ideally, close to it, as on the left. This means that the errors are not systematic: they are not correlated with the actual outcome.
When a model doesn't fit well, there can be regions where the points are entirely above or below the line. This demonstrates systematic error: errors that are correlated with the value of the outcome. This can indicate that you don't yet have all the important variables in your model, or that you need an algorithm that can find more complex relationships in the data.
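As a sketch of what this plot looks like in code, here is one way to draw it with ggplot2. The data frame houses and its columns price and pred are hypothetical placeholders, not the lesson's actual data.

```r
# A minimal sketch of a ground truth vs. predictions plot, assuming a
# hypothetical data frame `houses` with columns `price` (the outcome) and
# `pred` (the model's predictions).
library(ggplot2)

ggplot(houses, aes(x = pred, y = price)) +
  geom_point() +                       # one point per house
  geom_abline(color = "darkblue") +    # the line x = y: perfect prediction
  ggtitle("Price vs. predicted price")
```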
3. The Residual Plot
The residual plot shows the residuals, that is, the differences between the outcomes and the predictions, plotted against the predictions.
In a model with no systematic errors, the errors will be evenly distributed between positive and negative, and have about the same magnitude above and below. When there are systematic errors, there will be clusters of all positive or all negative residuals.
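As a sketch, here is one way to draw a residual plot, again assuming the hypothetical houses data frame with price and pred columns.

```r
# A minimal sketch of a residual plot, using the same hypothetical
# `houses` data frame. The residual is the outcome minus the prediction.
library(ggplot2)

houses$residual <- houses$price - houses$pred

ggplot(houses, aes(x = pred, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "darkgray") +  # zero-error reference line
  ggtitle("Residuals vs. predictions")
```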
4. The Gain Curve
The gain curve plot is useful when sorting the instances is more important than predicting the exact outcome value. In this example, suppose we have a model that predicts home prices, but we are more interested in identifying higher value homes than in predicting their exact selling price.
The x-axis shows the cumulative fraction of houses, sorted by the model's predictions from highest value to lowest. The y-axis shows the cumulative fraction of total dollars accumulated.
The diagonal line represents the gain curve you would get if the houses were sorted randomly. The green curve is what we call the "wizard curve": the curve a perfect model would trace out. The blue curve is what our model traces out.
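To make the two axes concrete, here is a sketch of how the model's gain curve could be computed by hand, still assuming the hypothetical houses data frame; dplyr is used here only for convenience.

```r
# A sketch of the gain curve calculation: sort houses by predicted value
# (highest first), then accumulate the fraction of houses and the fraction
# of total dollars as we move down the sorted list.
library(dplyr)

gain <- houses %>%
  arrange(desc(pred)) %>%
  mutate(frac_houses  = row_number() / n(),           # x-axis value
         frac_dollars = cumsum(price) / sum(price))   # y-axis value

# Sorting by the actual price instead of `pred` would trace out the wizard curve.
```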
5. Reading the Gain Curve
In this example, the model curve and the wizard curve line up almost perfectly for the first 30% of the data. This means the model correctly identified the top 30% highest value houses, and sorted them by price correctly. After that, the model does not sort quite as well, as shown by the gap between the curves.
Gain curve plots are also useful for models that predict probability, as we will see later.
We will use the GainCurvePlot function from the WVPlots package, which takes as input a data frame, the names of the prediction and outcome columns, and a title.
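For example, a call to GainCurvePlot might look like the following; the data frame and column names are placeholders, not the lesson's actual data.

```r
# A sketch of calling WVPlots::GainCurvePlot on the hypothetical `houses`
# data frame: the prediction column name, then the outcome column name,
# then a plot title.
library(WVPlots)

GainCurvePlot(houses, "pred", "price", "Home price model: gain curve")
```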
6. Let's practice!
Now that you have seen different ways to visualize model performance, let's practice.