1. Welcome to the course
Welcome to the machine learning toolbox course. I'm Max Kuhn, statistician and author of the caret package, which I've been working on for over a decade.
2. Supervised Learning
Today caret is one of the most widely used packages in R for supervised learning (also known as predictive modeling).
Supervised learning is machine learning where you have a "target variable": something specific you want to predict.
A classic example of supervised learning is predicting which species an iris is, based on its physical measurements. Another example would be predicting which customers in your business will "churn" or cancel their service.
In both of these cases, we have something specific we want to predict on new data: species and churn.
3. Supervised Learning
There are two main kinds of predictive models: classification and regression.
Classification models predict qualitative variables, for example the species of a flower or whether a customer will churn. Regression models predict quantitative variables, for example the price of a diamond.
Once we have a model, we use a "metric" to evaluate how well the model works. A metric is quantifiable and gives us an objective measure of how well the model predicts on new data.
For regression problems, we will focus on "root mean squared error" or RMSE as our metric of choice.
This is closely related to what linear regression minimizes: the lm() function in R finds the coefficients that minimize the sum of squared errors, and minimizing that sum also minimizes RMSE. It's a good, general-purpose error metric, and the most common one for regression models.
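As a quick sketch, RMSE is easy to compute directly in base R (the helper name rmse here is my own, not part of any package):

```r
# RMSE: the square root of the average squared prediction error
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Perfect predictions give an RMSE of zero
rmse(c(1, 2, 3), c(1, 2, 3))  # 0
```

Like the errors it summarizes, RMSE is in the same units as the target variable, which makes it easy to interpret.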
4. Evaluating Model Performance
Unfortunately, it's common practice to calculate RMSE on the same data used to fit the model. This typically leads to overly optimistic estimates of model performance, because it rewards models that have overfit the training data.
A better approach is to use out-of-sample estimates of model performance.
This is the approach caret takes, because it simulates what happens in the real world and helps us avoid overfitting.
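As a preview of what this looks like in caret (a minimal sketch, using mtcars and an assumed formula mpg ~ hp; the course covers train() in detail later), cross-validation produces an out-of-sample RMSE estimate:

```r
library(caret)

# 5-fold cross-validation: each fold is held out in turn, so error
# is estimated on rows the model never saw during fitting
model <- train(
  mpg ~ hp, data = mtcars,
  method = "lm",
  trControl = trainControl(method = "cv", number = 5)
)

model$results$RMSE  # cross-validated (out-of-sample) RMSE estimate
```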
5. In-sample error
However, it's useful to start off by looking at in-sample error, so we can contrast it later with out-of-sample error on the same dataset.
First, we load the mtcars dataset and fit a model to the first 20 rows.
Next, we make in-sample predictions, using the predict function on our model.
Finally, we calculate RMSE on our training data, and get pretty good results.
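The three steps above might look like this in base R (the predictor hp is my assumption for illustration; the slide's actual formula may differ):

```r
# Step 1: fit a linear model to the first 20 rows of mtcars
data(mtcars)
train_data <- mtcars[1:20, ]
model <- lm(mpg ~ hp, data = train_data)

# Step 2: make in-sample predictions on those same 20 rows
predicted <- predict(model, train_data)

# Step 3: calculate RMSE on the training data
# (an in-sample, and therefore optimistic, estimate)
error <- predicted - train_data$mpg
sqrt(mean(error^2))
```

Because the same rows are used for fitting and evaluation, this RMSE understates the error we should expect on new data.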
6. Let's practice!
Let's practice calculating RMSE on some other datasets.