1. Recap of machine learning basics
Good job! You now have a solid understanding of the differences between model parameters and hyperparameters.
But before we jump into actively tuning the hyperparameters, we will briefly recap the basics of machine learning in R.
We will use the `caret` package first, because it automatically performs basic hyperparameter tuning with every training run.
2. Machine learning with caret - splitting data
But before we look into this automatic hyperparameter tuning in caret, we need to prepare our data for training.
First, we will divide our data into training and test sets.
`caret` makes this step easy with the `createDataPartition()` function. It takes a vector of class labels as input and performs stratified partitioning of the data; this is important because we want a roughly equal ratio of classes in our training and test sets. With the argument `p`, we tell the function what proportion of the data should go into the training set, here 70%.
The index that will be created can then be used for subsetting the original dataset.
How much of the data you keep for training can itself be part of the optimization process. There are no strict rules on how to split the data, but you want to make sure that you have enough data to train on and that you have a representative test set. With a small dataset such as this one, 70% is a common choice, but you will also often see 80 or 90% training data.
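The splitting step can be sketched as follows. Since the lesson's dataset is not shown here, I use the built-in `iris` data with `Species` as the class label as a stand-in; swap in your own data frame and class column.

```r
# A minimal sketch of stratified splitting with caret.
# `iris` and `Species` are stand-ins for the lesson's dataset and class label.
library(caret)

set.seed(42)  # make the partition reproducible

# createDataPartition() returns row indices for a stratified 70% sample
index <- createDataPartition(iris$Species, p = 0.7, list = FALSE)

train_data <- iris[index, ]   # 70% for training
test_data  <- iris[-index, ]  # remaining 30% for testing
```

Because the sampling is stratified on the class label, each class appears in the training set in roughly the same proportion as in the full dataset.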
3. Train a machine learning model with caret
Here, I will not go into additional steps of the machine learning workflow, like feature engineering, preprocessing, normalization, balancing classes, etc. Just keep in mind that in a real-world scenario, you would at least want to think about incorporating these steps into your workflow.
Our validation scheme is defined with the `trainControl()` function: we will use repeated cross-validation, here 3-fold cross-validation repeated 5 times. This scheme is then given as an argument to the `train()` function.
In `caret` we can train machine learning models with a large number of different algorithms; we define this with the argument `method` in the `train()` function. Here, we will train a Random Forest model, which is abbreviated `rf`.
`train()` also wants to know which data and which features to use. Our data is the training set that we created before. The features are given with a formula: the class or response variable (here **diagnosis**) is written before the tilde and the features after it. For the features we write a dot here, which indicates that we want to use all remaining columns as features in our model.
In addition, I want to know how long my model took to train. For this, I am using the `tictoc` package, which reports the runtime between the `tic()` and `toc()` calls.
As we can see, our model took about 1.4 seconds to train.
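Putting these pieces together, the training call looks roughly like this. Again I use the built-in `iris` data (with `Species` as the response) as a stand-in for the lesson's dataset; `method = "rf"` additionally requires the `randomForest` package to be installed.

```r
# A sketch of the training setup described above; `iris`/`Species` stand in
# for the lesson's training data and `diagnosis` response.
library(caret)
library(tictoc)

set.seed(42)

# 3-fold cross-validation, repeated 5 times
fit_control <- trainControl(method = "repeatedcv", number = 3, repeats = 5)

tic()
rf_model <- train(Species ~ .,            # response before the tilde, dot = all other columns
                  data      = iris,
                  method    = "rf",       # Random Forest
                  trControl = fit_control)
toc()                                     # prints the elapsed training time
```

Note that `train()` only ever sees the training data; the test set is kept aside untouched.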
4. Automatic hyperparameter tuning in caret
Here is the random forest model we just trained.
In the output we can already see hyperparameter tuning in action: caret performs it automatically, trying different values of the hyperparameter `mtry`. You will learn more about that in the next lesson!
What's important to note here is that caret compares different hyperparameters on the training and validation data only.
Do NOT be tempted to measure your model performance on the test data during hyperparameter tuning as that would give you an overly optimistic and biased performance evaluation!
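Assuming a fitted model object like the `rf_model` trained above, the results of caret's automatic tuning can be inspected directly:

```r
# Inspect caret's automatic tuning results (assumes `rf_model` from above)
print(rf_model)    # cross-validated performance for each mtry value tried

rf_model$bestTune  # the hyperparameter value(s) caret selected
rf_model$results   # resampling results for every hyperparameter setting
```

All of these numbers come from the cross-validation folds inside the training set; the test set plays no role in them.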
5. Let's start modeling!
Now, it's your turn to start modeling!