Automatic machine learning with H2O
1. Automatic machine learning with H2O
The final and most convenient function for advanced model tuning is H2O's automatic machine learning functionality.
2. Automatic Machine Learning (AutoML)
Automatic Machine Learning goes one step beyond regular hyperparameter tuning: instead of tuning only one model type or algorithm, AutoML performs tuning for a number of different algorithms as well as their hyperparameters. AutoML makes finding the best (or almost best) model extremely fast and easy because everything is combined in a single function that only needs a dataset, the target (in the case of classification), and a time or model-number limit that tells it how long to train models.
3. AutoML in H2O
During a default classification run, AutoML trains a number of different algorithms in this specific order:
- 1 generalized linear model
- 1 distributed random forest
- 1 extremely randomized trees model
- 3 XGBoost models
- 5 gradient boosting machines
- a neural net, a random grid of XGBoost models, a random grid of GBMs, a random grid of neural nets, and
- 2 stacked ensembles
One of the two ensembles is calculated from all models, the other only from the best models of each family of algorithms. In case you want to exclude algorithms, you can pass a list of them to your AutoML run, as shown in the sketch below. In some cases, it is recommended to exclude tree-based algorithms.
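Excluding algorithms could look like this; a minimal sketch, assuming a running H2O cluster and a hypothetical training frame train with response column "target":

```r
library(h2o)
h2o.init()

# Exclude tree-based algorithms from the AutoML run via exclude_algos
# (train and "target" are hypothetical placeholders)
automl_model <- h2o.automl(y = "target",
                           training_frame = train,
                           max_runtime_secs = 60,
                           exclude_algos = c("DRF", "GBM", "XGBoost"))
```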
4. Hyperparameter tuning in H2O's AutoML
For all algorithms where multiple models are run, AutoML automatically tests a range of hyperparameters for different arguments. These include, for example, hyperparameters for gradient boosting models and for neural nets; XGBoost hyperparameters are similar to those of GBM. You can find out more about each hyperparameter in the help for the original h2o model functions; for gradient boosting, that would be h2o.gbm. Random forest and extremely randomized trees are not grid-searched because only one model of each is trained.
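Looking up the tunable arguments and inspecting the hyperparameters a trained model ended up with could look like this; a minimal sketch, assuming gbm_model is a hypothetical trained H2O GBM:

```r
# Open the help page listing all h2o.gbm hyperparameters
?h2o.gbm

# Inspect the non-default parameters of a trained model
# (gbm_model is a hypothetical placeholder)
gbm_model@parameters
```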
5. Using AutoML with H2O
Here you see the h2o.automl function in action (a sketch follows below). h2o.automl uses the same arguments as regular h2o algorithms: x, y, training_frame, validation_frame, etc. As before with random search, AutoML needs stopping criteria. These can be either the maximum run time (here, 60 seconds), which defines the time spent on grid searches, or the maximum number of models. We could also give both, and AutoML will stop when it reaches either criterion. Note that training the stacked ensembles counts toward neither the maximum run time nor the maximum number of models; they will always be calculated at the end. With sort_metric, we define how the models should be sorted in the final output, called the leaderboard. The best model according to this metric, here logloss, will be at position 1 in the leaderboard. You can choose
- area under the curve (default for binary classification)
- mean per class error (default for multinomial classification)
- mean residual deviance (default for regression)
and more. As always, the help functions will give you additional details.
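Such a run could look like this; a minimal sketch, assuming a running H2O cluster plus hypothetical training and validation frames train and valid with response column "target":

```r
library(h2o)
h2o.init()

# All columns except the response serve as predictors
x <- setdiff(colnames(train), "target")

# Run AutoML for at most 60 seconds and sort the leaderboard by logloss
automl_model <- h2o.automl(x = x,
                           y = "target",
                           training_frame = train,
                           validation_frame = valid,
                           max_runtime_secs = 60,
                           sort_metric = "logloss",
                           seed = 42)
```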
6. Viewing the AutoML leaderboard
This is what the leaderboard from the previous AutoML run looks like. You can extract it by calling automl_model@leaderboard, as sketched below. The leaderboard contains all models that were trained, as well as the model id and performance metrics, like
- mean_per_class_error
- logloss
- root mean squared error
- mean squared error
The second column of the leaderboard shows the metric that was used for ranking, in our case mean per class error, as specified in the automl function. If you don't specify a leaderboard dataset, metrics will be calculated on 5-fold cross-validation results. The H2O AutoML documentation covers the leaderboard in full detail.
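Extracting and inspecting the leaderboard could look like this; a minimal sketch, assuming the automl_model object trained above:

```r
# Extract the leaderboard from the AutoML object
lb <- automl_model@leaderboard

# Show the top-ranked models
head(lb)

# Convert to a regular R data frame for further inspection
lb_df <- as.data.frame(lb)
```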
7. Extracting models from AutoML leaderboard
You can extract individual models from AutoML (usually, the best model) via their model_id, which you find in the leaderboard; a sketch follows below. Models you extracted can again be treated just as you would treat any other h2o model, e.g. for predictions.
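Extracting the top model by its id and predicting with it could look like this; a minimal sketch, assuming the automl_model from above and a hypothetical test frame test:

```r
# The first row of the leaderboard holds the best model's id
best_id <- as.data.frame(automl_model@leaderboard$model_id)[1, 1]

# Extract that model from the AutoML run
best_model <- h2o.getModel(best_id)

# Use it like any other h2o model, e.g. for predictions
# (test is a hypothetical placeholder)
pred <- h2o.predict(best_model, newdata = test)
```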
8. Get ready for your last round of exercises!
Get ready for your last round of exercises!