Relative error
In this exercise, you will compare relative error to absolute error. For the purposes of modeling, we will define relative error as
$$ rel = \frac{(y - pred)}{y} $$
that is, the error is relative to the true outcome. You will measure the overall relative error of a model using root mean squared relative error:
$$ rmse_{rel} = \sqrt{\overline{rel^2}} $$
where \(\overline{rel^2}\) is the mean of \(rel^2\).
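To make these definitions concrete, here is a minimal R sketch written as two small helper functions; the names rel_err and rmse_rel are illustrative only and are not provided by the exercise environment.
# A minimal sketch of the definitions above; rel_err and rmse_rel are
# illustrative helpers, not functions supplied by the exercise.
rel_err <- function(y, pred) {
  (y - pred) / y                    # error relative to the true outcome
}

rmse_rel <- function(y, pred) {
  sqrt(mean(rel_err(y, pred)^2))    # root mean squared relative error
}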
The example (toy) dataset fdata has been pre-loaded. It includes the columns:
- y: the true output to be predicted by some model; imagine it is the amount of money a customer will spend on a visit to your store.
- pred: the predictions of a model that predicts y.
- label: categorical; whether y comes from a population that makes small purchases, or large ones.
You want to know which model does "better": the one predicting the small purchases, or the one predicting large ones.
Exercise instructions
- Fill in the blanks to examine the data. Notice that large purchases tend to be about 100 times larger than small ones.
- Fill in the blanks to create error columns:
  - Define residual as y - pred.
  - Define relative error as residual / y.
- Fill in the blanks to calculate and compare RMSE and relative RMSE.
  - How do the absolute errors compare? The relative errors?
- Examine the plot of predictions versus outcome.
  - In your opinion, which model does "better"?
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# fdata is available
summary(fdata)
# Examine the data: generate the summaries for the groups large and small:
fdata %>%
  group_by(label) %>%      # group by small/large purchases
  summarize(min = ___,     # min of y
            mean = ___,    # mean of y
            max = ___)     # max of y
# Fill in the blanks to add error columns
fdata2 <- fdata %>%
  group_by(label) %>%      # group by label
  mutate(residual = ___,   # Residual
         relerr = ___)     # Relative error
# Compare the rmse and rmse.rel of the large and small groups:
fdata2 %>%
  group_by(label) %>%
  summarize(rmse = ___,      # RMSE
            rmse.rel = ___)  # Root mean squared relative error
# Plot the predictions for both groups of purchases
ggplot(fdata2, aes(x = pred, y = y, color = label)) +
  geom_point() +
  geom_abline() +
  facet_wrap(~ label, ncol = 1, scales = "free") +
  ggtitle("Outcome vs prediction")
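One plausible way to complete the blanks, assuming the standard dplyr summary functions implied by the comments; this is a sketch of a solution, not necessarily the course's official one. The rmse and rmse.rel calculations mirror the formulas defined at the top of the exercise.
library(dplyr)
library(ggplot2)

# Summaries of y for the small and large purchase groups
fdata %>%
  group_by(label) %>%
  summarize(min = min(y),
            mean = mean(y),
            max = max(y))

# Add residual and relative-error columns
fdata2 <- fdata %>%
  group_by(label) %>%
  mutate(residual = y - pred,
         relerr = residual / y)

# Compare RMSE and root mean squared relative error by group
fdata2 %>%
  group_by(label) %>%
  summarize(rmse = sqrt(mean(residual^2)),
            rmse.rel = sqrt(mean(relerr^2)))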
This exercise is part of the course
Supervised Learning in R: Regression
In this course you will learn how to predict future events using linear regression, generalized additive models, random forests, and xgboost.
Before moving on to more sophisticated regression techniques, we will look at some other modeling issues: modeling with categorical inputs, interactions between variables, and when you might consider transforming inputs and outputs before modeling. While more sophisticated regression techniques manage some of these issues automatically, it's important to be aware of them, in order to understand which methods best handle various issues -- and which issues you must still manage yourself.