1. The Limits of Prediction
Now that you have started to build models, interpret model parameters, and use models to make predictions, we need to discuss errors that inevitably arise when making predictions.
In this lesson, you'll learn about prediction problems to watch out for, and start to quantify related errors.
Previously, you saw errors associated with randomness in residuals, but the most common source of large errors is trying to make predictions with a model far outside its "domain of validity".
In this lesson, we'll discuss the concept of "domain of validity", and see examples of common abuses of two kinds of prediction, interpolation and extrapolation, along with the errors that can arise.
2. Interpolation
Here is a plot of data sampled once per month. It looks like it might contain a linear trend, with some random noise. A linear model seems appropriate.
But we have to be mindful of the "step size": that is, how far apart the sampled points are along the x-axis.
If the step size of the data used to build the model is too large, you may be missing something.
Let's compare this monthly data to daily data...
3. Interpolation
If we sample the data every day, we reveal a rich collection of features. Some even look like oscillations.
This data is not simply a linear trend with a bit of random noise.
4. Interpolation
Let's see what happens when we fit the model to the undersampled data...
5. Interpolation
This fit looks pretty good. The residuals, by eye, don't look so bad. But imagine if we interpolate, making predictions at points in between the monthly samples...
6. Interpolation
If we fit the model to the monthly data and then try to interpolate to the daily data, we see the result is rather bad.
In this case, it might be better to fit the linear model just to a limited range or "domain" of times, from 2014-March through 2014-August.
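To make the undersampling problem concrete, here is a minimal sketch using NumPy. The data is synthetic and is an assumption for illustration only, not the dataset shown in the plots: a linear trend plus a 30-day oscillation plus a little noise. Sampling every 30th day always lands on the same phase of the oscillation, so the "monthly" data looks like a clean line.

```python
import numpy as np

# Synthetic daily data (an assumption, not the course dataset):
# a linear trend, a 30-day oscillation, and a little random noise.
rng = np.random.default_rng(0)
days = np.arange(360)
daily = (0.1 * days
         + 5.0 * np.sin(2 * np.pi * days / 30)
         + rng.normal(0, 0.5, days.size))

# "Monthly" sampling: keep every 30th day. Each sample lands on the
# same phase of the oscillation, so the oscillation is hidden.
month_days = days[::30]
monthly = daily[::30]

# Fit a straight line to the undersampled data only.
slope, intercept = np.polyfit(month_days, monthly, deg=1)

def rmse(t, y):
    """Root-mean-square residual of the fitted line against (t, y)."""
    return np.sqrt(np.mean((y - (slope * t + intercept)) ** 2))

rmse_monthly = rmse(month_days, monthly)  # error at the points we fit
rmse_daily = rmse(days, daily)            # error when interpolating

print(f"RMSE on monthly samples: {rmse_monthly:.2f}")
print(f"RMSE on daily data:      {rmse_daily:.2f}")
```

The fit looks fine when judged only against the monthly samples, but the error is several times larger once the same line is asked to predict the days in between.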
7. Domain of Validity
Linear models often originate from some assumption that, in a certain limited range of values of the independent variable, the dependent variable changes linearly.
Recall the earlier discussion about the concept of a Taylor Series. In some limited range of values, we get lucky and all the terms above n=1 are very small compared to the a0 and a1 terms, and we can safely ignore the nonlinear features in the data.
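As a reminder, the idea can be written out as follows. The expansion point x_0 is notation assumed here for illustration; the coefficients a_n are the ones referred to above.

```latex
y(x) = \sum_{n=0}^{\infty} a_n (x - x_0)^n
     = a_0 + a_1 (x - x_0) + a_2 (x - x_0)^2 + \dots

% The linear model keeps only the first two terms:
y(x) \approx a_0 + a_1 (x - x_0)
\quad \text{when} \quad
\left| a_n (x - x_0)^n \right| \ll \left| a_1 (x - x_0) \right|
\ \text{for all } n \ge 2 .
```

Because the neglected terms scale with higher powers of (x - x_0), they grow faster than the linear term as x moves away from x_0, which is why the approximation only holds in a limited range.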
When we fit to a smaller range of values (in this example, a few months in 2014), that range is sometimes called the "domain of validity" of the model.
Using the model outside this range is a form of extrapolation. Sometimes you can go a bit outside the range, but the further you go, the more likely your model will produce predictions very far away from reality.
8. Extrapolating Too Far
The biggest problem with applying linear models to extrapolations is that people tend to want to apply the model too far outside the domain of validity.
In this example, our hiking trail looks very linear from 0 to 10, shown as black points.
If we only had data for that domain, and fit a model, shown as the red line, it works great.
But if we try to apply that model to x-values outside this domain, points shown in blue, the model residuals are HUGE.
In these cases, it's usually very unwise to extrapolate too far unless you have other experience or domain knowledge to guide you.
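Here is a rough sketch of the same effect in code. The trail() function is hypothetical and is not the course data: it rises linearly up to x = 10 and then steepens, which the model has no way of knowing if it only ever sees the first stretch.

```python
import numpy as np

# Hypothetical trail profile (not the course data): altitude rises
# linearly up to x = 10, then steepens quadratically beyond it.
def trail(x):
    return 2.0 * x + 0.5 * np.clip(x - 10.0, 0.0, None) ** 2

rng = np.random.default_rng(1)
x_in = np.linspace(0, 10, 21)    # domain used to build the model
y_in = trail(x_in) + rng.normal(0, 0.2, x_in.size)

# Fit a straight line using only the data from 0 to 10.
slope, intercept = np.polyfit(x_in, y_in, deg=1)

x_out = np.linspace(11, 20, 10)  # points outside that domain
resid_in = y_in - (slope * x_in + intercept)
resid_out = trail(x_out) - (slope * x_out + intercept)

max_in = np.max(np.abs(resid_in))
max_out = np.max(np.abs(resid_out))
print(f"largest residual for x in [0, 10]:  {max_in:.2f}")
print(f"largest residual for x in [11, 20]: {max_out:.2f}")
```

In this sketch the residual at x = 20 is far larger than the residual at x = 11, so the further you extrapolate, the worse the prediction gets.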
9. Let's practice!
Now that we've seen these concepts illustrated visually, let's use some of the same Python tools from earlier in the course to see some examples of errors in interpolation and extrapolation.