
Machine learning with H2O

1. Machine learning with H2O

In the previous chapters, you learned what hyperparameters are and how to tune them with `caret` and `mlr`. There are many other popular machine learning packages, and the one I want to present in this chapter is H2O.

2. What is H2O?

H2O is an open-source machine learning platform that you can use from R via the `h2o` package. What makes H2O special compared to `caret` and `mlr` is that it is designed for scalability: the implementations of H2O's machine learning algorithms can be trained on distributed clusters. That's why you need to initiate an H2O cluster with the `h2o.init()` function; if you are not working on remote clusters, like Spark or Hadoop, this will start a local cluster on your machine. Another very useful feature of H2O is AutoML, for automatic model comparison and hyperparameter tuning.
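A minimal sketch of starting a local cluster, assuming the `h2o` package (and a Java runtime, which H2O requires) is installed:

```r
library(h2o)

# Start (or connect to) a local H2O cluster;
# nthreads = -1 uses all available cores
h2o.init(nthreads = -1)

# When you are done, the cluster can be shut down with:
# h2o.shutdown(prompt = FALSE)
```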

3. New dataset: seeds data

In this chapter, we will be working with a dataset of measurements of geometrical properties of wheat seed kernels. These measurements are area, perimeter, compactness, kernel length and width, asymmetry and kernel groove. We also know the seed type, which identifies three different varieties of wheat. The dataset contains 50 instances for each of the three seed varieties, denoted 1, 2 and 3.

4. Preparing the data for modeling with H2O

Before we can start, we need to pass our data to the H2O instance. If you load data from a file, you can read it directly into an h2o frame, but often we want to preprocess the data with other R packages first. In that case, we use the `as.h2o()` function to convert an R object to an h2o frame. Next, it is good practice to define the names of the features (`x`) and the target variable (`y`); these names correspond to the column names of our dataset. In this dataset, the target seed type has been encoded numerically as 1, 2 and 3. Because we want to use the seed type for classification, we need to convert the target into a factor.
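The steps above can be sketched as follows; `seeds_df` and the column name `seed_type` are assumptions here, standing in for whatever data frame and target column you are working with:

```r
library(h2o)
h2o.init()

# seeds_df: an R data.frame with the seven measurement columns
# and a numeric seed_type column (1, 2, 3) -- hypothetical names
seeds_hf <- as.h2o(seeds_df)

# Define the target and feature column names
y <- "seed_type"
x <- setdiff(colnames(seeds_hf), y)

# Convert the numeric target into a factor for classification
seeds_hf[, y] <- as.factor(seeds_hf[, y])
```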

5. Training, validation and test sets

h2o also contains a function for splitting data into training, validation and test sets. The `h2o.splitFrame()` function takes the h2o frame to split and the ratios that determine how many instances go into each subset (here, we want 70% of the data in the training set, 15% in the validation set and the remaining 15% in the test set). We can use the `summary()` function on our response variable to compare the class ratios across the subsets. `summary()` for h2o frames takes an additional `exact_quantiles` argument, which, if set to `TRUE`, computes exact quantiles; by default, it uses approximations.
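A sketch of the split described above, continuing from the `seeds_hf` frame assumed earlier (the seed value is arbitrary, for reproducibility):

```r
# Split into 70% training, 15% validation and 15% test;
# the last ratio is implied by the remainder
splits <- h2o.splitFrame(seeds_hf,
                         ratios = c(0.7, 0.15),
                         seed = 42)
train <- splits[[1]]
valid <- splits[[2]]
test  <- splits[[3]]

# Compare the class ratios in the different subsets
summary(train$seed_type, exact_quantiles = TRUE)
summary(valid$seed_type, exact_quantiles = TRUE)
summary(test$seed_type,  exact_quantiles = TRUE)
```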

6. Model training with H2O

h2o contains a number of different algorithms that can be trained in a distributed fashion:

- Gradient Boosted models
- Generalized Linear models
- Random Forest models
- Neural Networks

These functions can take many arguments and hyperparameters. You can find them all by looking at the help for each function.
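Each of these algorithm families has its own training function in the `h2o` package, and the R help pages list all of their arguments:

```r
# Training functions for the algorithm families above:
# h2o.gbm()          - Gradient Boosted models
# h2o.glm()          - Generalized Linear models
# h2o.randomForest() - Random Forest models
# h2o.deeplearning() - Neural Networks

# See all arguments and hyperparameters, e.g. for gradient boosting:
?h2o.gbm
```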

7. Model training with H2O

In this gradient boosting example, I tell h2o which column in my data is the target `y` and which features `x` I want to include in the model, and I pass in the training and validation data. Below is the beginning of the model output, with a summary of the final hyperparameters.
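A minimal sketch of such a call, assuming the `x`, `y`, `train` and `valid` objects defined earlier:

```r
# Train a gradient boosting model on the seeds data
gbm_model <- h2o.gbm(x = x,
                     y = y,
                     training_frame = train,
                     validation_frame = valid)

# Print the model output, including the final hyperparameters
summary(gbm_model)
```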

8. Evaluate model performance with H2O

h2o also includes functions for evaluating model performance. The `h2o.performance()` function calculates a set of metrics on a new h2o frame; here, we use the test data. We can use additional functions to extract components from this model metrics object, like the confusion matrix and log loss. If we want to use our model to generate predictions, we use the `h2o.predict()` function.
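These evaluation steps can be sketched as follows, continuing with the `gbm_model` and `test` objects assumed above:

```r
# Calculate performance metrics on the held-out test set
perf <- h2o.performance(gbm_model, newdata = test)

# Extract individual components from the metrics object
h2o.confusionMatrix(perf)
h2o.logloss(perf)

# Generate predictions for new data
pred <- h2o.predict(gbm_model, newdata = test)
head(pred)
```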

9. Let's practice!

Let's practice!
