1. Informed Methods: Bayesian Statistics
Now we will explore an important statistical concept and its application to informed hyperparameter tuning: Bayesian methods.
2. Bayes Introduction
Bayes rule is a famous statistical tool that has been around for 250 years.
So how can this be relevant to machine learning today? Well, Bayes rule is actually a method where we can iteratively use new evidence to update our beliefs about an outcome.
Intuitively this makes sense: when using informed search, we want to learn from evidence to make our hyperparameter tuning better.
3. Bayes Rule
Here is the Bayes Rule formula.
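For reference, the formula reads: P(A|B) = P(B|A) * P(A) / P(B).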
The left hand side is the probability of A (which we care about), given that B has occurred, where B is some new (relevant) evidence.
This is known as the 'posterior'.
The right hand side is how we calculate this.
P(A) is known as the 'prior'. It is the initial hypothesis about the event we care about. See how it is different to P(A|B)? The latter is the probability given new evidence.
4. Bayes Rule
P(B) is the 'marginal likelihood' and it is the probability of observing this new evidence.
P(B|A) is the 'likelihood' which is the probability of observing the evidence, given the event we care about.
This all may be quite confusing, but let's use a common example of a medical diagnosis to demonstrate.
5. Bayes in Medicine
Let's take a medical example to illustrate the Bayesian process.
Let's say 5% of people in the general population have a certain disease. This is our P(D).
10% of people are genetically predisposed to this condition. That is, because of their genetics, they are more likely to get this condition. This is our P(Pre).
We also know that 20% of people with the disease are predisposed which is our P(Pre|D).
6. Bayes in Medicine
So what is the probability that any given person has the disease?
If we know nothing about a person, then the probability of them having the disease is just the prior.
However, what if we add some new evidence that this person is predisposed?
We can update our beliefs by subbing into Bayes formula.
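Substituting our numbers: P(D|Pre) = P(Pre|D) * P(D) / P(Pre) = (0.20 * 0.05) / 0.10 = 0.10.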
Now we can see that the probability of having the disease, given that someone is predisposed, is 10%, or double the original prior.
7. Bayes in Hyperparameter Tuning
We can apply this logic to hyperparameter tuning using the following process.
Pick a hyperparameter combination.
Build a model.
Get new _evidence_ (the score of the model).
Update our beliefs and choose better hyperparameters the next round, continuing until we are happy with the result (a conceptual sketch of this loop follows below).
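As a rough sketch of that loop (the helpers suggest_next_params and build_and_score_model here are hypothetical placeholders, not a real library API):

```python
# Conceptual sketch of informed (Bayesian-style) tuning, not a real library loop
history = []  # (hyperparameters, score) pairs observed so far

for i in range(50):  # the number of tuning iterations is an arbitrary choice here
    params = suggest_next_params(history)   # pick a combination using our beliefs so far
    score = build_and_score_model(params)   # build a model; its score is our new evidence
    history.append((params, score))         # update our beliefs for the next round
```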
Bayesian hyperparameter tuning is fairly new but quite popular for larger and more complex hyperparameter tuning tasks.
8. Bayesian Hyperparameter Tuning with Hyperopt
A useful package for Bayesian hyperparameter tuning is Hyperopt.
To undertake Bayesian hyperparameter tuning with this package, we first need to set the domain, which is our grid, with a bit of a twist.
Then, we set the optimization algorithm (we will use the default TPE).
Finally, we need to set the objective function to minimize: we will use 1 minus accuracy, because this package minimizes rather than maximizes its objective.
9. Hyperopt: Set the Domain (grid)
There are many options for how to lay out the grid in Bayesian optimization including those mentioned on the slide.
Rather than point values on a grid, Hyperopt expects each hyperparameter to be described by a probability distribution over its possible values.
To keep it simple, we will use a uniform distribution. There are many more distributions available if you check the documentation.
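For example, here are a few of hyperopt's distribution expressions (the labels and ranges are illustrative choices, and this is not a full list):

```python
from hyperopt import hp

hp.choice('criterion', ['gini', 'entropy'])    # choose from a discrete set of options
hp.uniform('learning_rate', 0.01, 0.9)         # uniformly distributed between two bounds
hp.quniform('max_depth', 2, 10, 2)             # uniform, but quantized (binned) in steps of 2
hp.normal('subsample', 0.75, 0.1)              # normally distributed around a mean
```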
10. The Domain
Let's see how to set up our grid or domain in hyperopt.
The code demonstrates this using a simple uniform distribution between the min and max values supplied.
quniform means uniform but quantized (or binned) by the specified third number.
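As a minimal sketch, a domain for tuning a GBM's learning rate, depth, and leaf size might look like this (the hyperparameter names and ranges are illustrative choices, not taken from the slide):

```python
from hyperopt import hp

space = {
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.9),       # uniform between min and max
    'max_depth': hp.quniform('max_depth', 2, 10, 2),               # quantized in steps of 2
    'min_samples_leaf': hp.quniform('min_samples_leaf', 2, 8, 2),  # also quantized
}
```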
11. The objective function
We need to define an objective function to run the algorithm.
It needs to take in the parameters to test and use those to create an estimator (for us a GBM).
The estimator is cross-validated and the average score over the folds is recorded. We set the loss to one minus that score, since Hyperopt will work to minimize whatever we return, and we don't want to minimize accuracy!
I have also written a small function to write out the results at each iteration for analysis later.
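A sketch of such an objective function, assuming a training set X_train, y_train is already in scope and using a hypothetical write_results helper for the per-iteration logging:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(params):
    # quniform returns floats, so cast the integer-valued hyperparameters
    params = {'max_depth': int(params['max_depth']),
              'min_samples_leaf': int(params['min_samples_leaf']),
              'learning_rate': params['learning_rate']}
    gbm = GradientBoostingClassifier(n_estimators=100, **params)
    # mean cross-validated accuracy is our evidence for this combination
    mean_score = cross_val_score(gbm, X_train, y_train, scoring='accuracy', cv=5).mean()
    loss = 1 - mean_score  # Hyperopt minimizes, so flip accuracy into a loss
    # write_results('results.csv', params, loss)  # hypothetical per-iteration logger
    return loss
```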
12. Run the algorithm
Now we simply need to call the algorithm.
We give it the objective function we created, the sample space we set up, how many iterations and an optional random seed to ensure consistency.
The TPE algorithm we chose is the standard one and the best currently implemented in Hyperopt.
This function will only return the best hyperparameters, which is why we logged our results at each iteration.
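A sketch of that call, reusing the objective and space defined above (max_evals and the seed are arbitrary choices here; depending on your hyperopt version the random state may need to be a NumPy RandomState instead of a Generator):

```python
import numpy as np
from hyperopt import fmin, tpe

best = fmin(fn=objective,          # the objective function we created
            space=space,           # the sample space we set up
            algo=tpe.suggest,      # the default TPE algorithm
            max_evals=20,          # how many iterations to run
            rstate=np.random.default_rng(42))  # optional seed for consistency

print(best)  # only the best hyperparameters are returned
```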
13. Let's practice!
Let's practice Bayesian hyperparameter tuning in Python!