Regression models
1. Regression models
Welcome to another lesson on model validation. This course discusses two types of predictive models: models built for continuous variables, called regression models, and models built for categorical variables, called classification models. This lesson focuses purely on regression models, and more specifically, on random forest regression models using scikit-learn.

2. Random forests in scikit-learn
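Both model types can be constructed in the same way in scikit-learn; a minimal sketch (the parameter values here are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Both estimators accept the same core parameters and are built the same way
reg = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=1111)
clf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

print(type(reg).__name__, type(clf).__name__)
```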
Although this is not a machine learning course, it is important to understand the basic principles of the models we will be running and discussing. For that reason, we will stick with random forest models throughout this course, and only run random forest regression or random forest classification models. Both models have similar parameters and are called in the same way when using scikit-learn.

3. Decision trees
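To make the tree-following idea concrete, here is a plain-Python sketch of a decision tree like the one described below; the tree's structure and leaf values are illustrative, arranged so that Bob's path ends at the $4,000 leaf:

```python
def predict_debt(left_handed, age, likes_onions):
    # One decision tree as nested splits; each leaf holds the average debt
    # of the training observations that landed there.
    # (Structure and leaf values are illustrative, not the course's tree.)
    if left_handed:                  # categorical split: "Are you left-handed?"
        if age < 21:                 # continuous split: "What is your age?"
            return 4000 if likes_onions else 3000
        return 5500
    return 2000

# Bob: left-handed, 18 years old, likes onions -> the $4,000 leaf
print(predict_debt(left_handed=True, age=18, likes_onions=True))  # 4000
```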
To understand random forest models, we should first review decision trees. Decision trees look at various ways to split data until only a few, or even a single, observation remains. The splits may be categorical ("Are you left-handed?") or continuous ("What is your age?"). A new observation follows the tree based on its own data values until it reaches an end node, called a leaf. In the given example, Bob, who is left-handed, 18 years old, and likes onions, would be predicted to be in $4,000 of debt if we followed this decision tree. The value in the end node is the average of all people in the training data who ended up in that leaf.

4. Averaging decision trees
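The averaging step can be sketched directly; the five individual tree predictions here are illustrative, chosen so their mean matches the $4,200 from the example:

```python
import numpy as np

# Each of the five trees makes its own prediction for Bob; the forest's
# final prediction is their mean. (Individual values are illustrative.)
tree_predictions = [4000, 4500, 3900, 4300, 4300]
forest_prediction = np.mean(tree_predictions)
print(forest_prediction)  # 4200.0
```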
Random forest regression models generate many different decision trees and use the mean prediction of those trees as the final value for a new observation. Here we created five decision trees; their average prediction for Bob was $4,200 of debt.

5. Random forest parameters
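Both ways of setting the parameters covered in this slide can be sketched as follows (the specific values are illustrative):

```python
from sklearn.ensemble import RandomForestRegressor

# Most common: set parameters when the model is initiated
rfr = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=1111)

# Alternative: assign new values to the model's attributes afterward,
# which is handy when looping over candidate parameter sets
rfr.n_estimators = 100
rfr.max_depth = 10
print(rfr.n_estimators, rfr.max_depth)  # 100 10
```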
Although these algorithms have a lot of parameters, we will focus on only three. n_estimators is the number of trees to create for the random forest. max_depth is the maximum depth of these trees, or how many times the data can be split. It can also be described as the maximum length from the beginning of a tree to the tree's end nodes. These two parameters alone can have a big impact on model accuracy. Lastly, random_state allows us to create reproducible models. I will always use 1,111 as my random state. If you ever see a different number, I promise I did not code that example! There are two ways to set these parameters. They can be set when RandomForestRegressor() is initiated, which is the most common way of setting model parameters. They can also be set later, by assigning a new value to a model's attribute. The second method can be helpful when testing out different sets of parameters.

6. Feature importance
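The importance lookup described below can be sketched with a toy DataFrame (the columns and data here are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data (illustrative): left-handedness drives debt, eye_color is noise
rng = np.random.RandomState(1111)
X = pd.DataFrame({
    "left_handed": rng.randint(0, 2, 200),
    "eye_color": rng.randint(0, 3, 200),
})
y = 4000 * X["left_handed"] + rng.normal(0, 100, 200)

rfr = RandomForestRegressor(n_estimators=50, random_state=1111)
rfr.fit(X, y)

# Match each importance score to its column name
for score, col in zip(rfr.feature_importances_, X.columns):
    print(f"{col}: {score:.2f}")
```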
After a model is created, we can assess how important the different features (or columns) of the data were to the model by using the .feature_importances_ attribute. If the data is a pandas DataFrame, X, we can access the column names and print the importance scores quite easily. The larger the number, the more important that column was to the model. In our example, we loop through the values from .feature_importances_ and match each score to its column in X. The output tells us that eye_color is not very useful in our model, but the fact that someone is left-handed is highly important.

7. Let's begin
Let's create a random forest regression model and look at its output.
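A minimal end-to-end sketch, using made-up training data in place of the course's dataset:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up training data standing in for the course's dataset
rng = np.random.RandomState(1111)
X = pd.DataFrame({
    "left_handed": rng.randint(0, 2, 100),
    "age": rng.randint(18, 65, 100),
})
y = 2000 * X["left_handed"] + 50 * X["age"] + rng.normal(0, 200, 100)

# Create and fit the random forest regression model
rfr = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=1111)
rfr.fit(X, y)

# Look at its output: predictions for the first five observations
print(rfr.predict(X.head()))
```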