1. Model training
Welcome to the video on model training. Once we have the business case and have decided to go ahead, we can start the model training process. We will walk through a visual example showing how a model is trained. Here, we will focus on supervised learning models.
2. Modeling dataset
Great, let's start with an example dataset. Here, we have a few columns, including the target variable on the right.
3. Full dataset
Imagine this is the full dataset on all of your customers, or transactions or other units of interest.
4. Splitting data for training
First, we randomly sample a part of that dataset for model training.
I'll stop here for a bit. Model training means using the input features and the target variable to teach the model to detect patterns, so that it can predict the target variable on future data.
5. Test
Then we sample a smaller portion of the data to measure model performance on an unseen dataset.
Now, let's pause here for a bit. The main goal of doing this is to make sure our trained model learns patterns that can be used to predict values on new data. To achieve this, we have to validate model performance on a dataset that has not been used in the model training process. This way we get a more realistic performance metric. Let's look at an example.
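The random split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `train_test_split`, the 80/20 proportion, the seed, and the toy `dataset` are all made up for this example and are not taken from the course.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Randomly split rows into a larger training part and a smaller test part."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)          # copy, so the original order is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    test = shuffled[:n_test]       # smaller part, kept aside and never used for training
    train = shuffled[n_test:]      # larger part, used to train the model
    return train, test

# Hypothetical dataset: 100 rows of (feature, target) pairs, invented for this sketch
dataset = [(x, 2.0 * x + 1.0) for x in range(100)]

train_rows, test_rows = train_test_split(dataset, test_fraction=0.2)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Every row ends up in exactly one of the two parts, so the test rows stay unseen during training.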
6. Overfitting and underfitting
Let's use a chart to illustrate why it is important to set aside some data, unseen during model training, for performance measurement. Here we have a number of data points showing how product revenue is affected by advertising.
7. Underfitting
Here, we see a naive model that assumes the pattern is linear. We can clearly see that the actual pattern is a non-linear curve. In this case we say the model underfits the data, as its assumption is too simple.
8. Overfitting 1
This is an example of the other extreme - here the model perfectly memorizes every single point in the training dataset. We call this overfitting, as the model just memorizes the training data rather than learning the general pattern.
9. Overfitting 2
The red dots are unseen data - you can see that the overfitted model does not predict unseen data well. The model is too complex and has memorized patterns in the training data.
10. Right model fit 1
Now this seems like a good approximation - it does not memorize the patterns, nor is it too simple like a straight line.
11. Right model fit 2
And, with the same new unseen data - the red dots - we can see that the model would predict them pretty well.
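The three fits from the charts can be imitated with a small, fully made-up example. Everything here is an assumption for illustration: the `curve` function, the alternating +3/-3 noise on the training points, and the unseen points placed exactly on the curve (noise left out to keep the arithmetic easy to follow). A straight line plays the underfitting model, "copy the single nearest training point" plays the memorizing model, and "average the 4 nearest training points" plays the balanced model. These are stand-ins, not the models used to draw the slides.

```python
def curve(x):
    # Hypothetical non-linear revenue-vs-advertising pattern, invented for this sketch
    return x * (20 - x) / 10

# Training points: the curve plus alternating +3 / -3 noise
train = [(float(i), curve(i) + (3 if i % 2 == 0 else -3)) for i in range(20)]
# Unseen points: new x values that lie on the underlying curve
unseen = [(i + 0.25, curve(i + 0.25)) for i in range(19)]

def fit_line(points):
    """Least-squares straight line: too simple for a curved pattern."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_nearest(points, k):
    """Predict the average target of the k nearest training points.
    With k=1 the model simply memorizes every training point."""
    def predict(x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(model, points):
    """Mean squared error of a model on a list of (x, y) points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

models = {
    "underfit": fit_line(train),        # straight line
    "overfit": fit_nearest(train, 1),   # memorizes the training data
    "balanced": fit_nearest(train, 4),  # smooths over the noise
}
errors = {name: (mse(m, train), mse(m, unseen)) for name, m in models.items()}
for name, (train_err, unseen_err) in errors.items():
    print(f"{name}: training error {train_err:.2f}, unseen error {unseen_err:.2f}")
```

In this toy setup the memorizing model has a training error of exactly zero, yet its error on the unseen points is far larger than that of the balanced model, while the straight line is poor on both.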
12. Model training
Now the process of model training starts with the training dataset. The machine learning model uses statistical or other algorithms to learn the patterns that link the features we have collected on our observations to the target variable Y.
13. Assess model performance on test
Then we measure the model performance on the unseen test dataset.
Let's pause - here we're using a new term - unseen dataset. This means that the model didn't use (or see) this data when it was being trained on the training dataset. This is important as we plan to use the machine learning model rules on future unseen data to make predictions.
Always remember to clarify with your machine learning team whether they have followed this process, and whether the results you are presented with were measured on the test dataset rather than the training dataset. This is very important, as test performance gives you a more realistic estimate of actual model accuracy, while training numbers can be too optimistic.
14. Try a few models
Finally, this process can be done a few times to find the model that fits the data best without overfitting or underfitting it.
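Trying a few models and keeping the best one can be sketched as a simple loop. As before, this is a toy illustration under assumptions: the `curve` function, the noisy training points, the held-out points, and the candidate list are all invented here. Each candidate averages the `k` nearest training points; `k=1` memorizes the data and `k=20` averages everything into one flat prediction.

```python
def curve(x):
    # Hypothetical non-linear pattern, invented for this sketch
    return x * (20 - x) / 10

# Training points with alternating +3 / -3 noise, and held-out points on the curve
train = [(float(i), curve(i) + (3 if i % 2 == 0 else -3)) for i in range(20)]
held_out = [(i + 0.25, curve(i + 0.25)) for i in range(19)]

def fit_nearest(points, k):
    """Predict the average target of the k nearest training points."""
    def predict(x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(model, points):
    """Mean squared error of a model on a list of (x, y) points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

# Candidate models, from very flexible (k=1) to very rigid (k=20)
candidates = [1, 2, 4, 8, 20]
scores = {k: mse(fit_nearest(train, k), held_out) for k in candidates}

# Keep the candidate with the lowest error on data it was not trained on
best_k = min(scores, key=scores.get)
for k in candidates:
    print(f"k={k}: held-out error {scores[k]:.2f}")
print("chosen model: k =", best_k)
```

In this example neither extreme wins: both the memorizing candidate and the flat one score worse on held-out data than a candidate in between. In practice, teams often make this choice on a separate validation split so that the test dataset stays untouched for the final performance number.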
15. Let's practice!
Great work! Now, let's dive into several exercises that test your knowledge of model training.