Explainability in tree-based models
1. Explainability in tree-based models
Having explored how to explain linear models, let's now delve into tree-based models. These include decision trees and random forests, both valued for their distinctive decision-making structures.
2. Decision tree

At the heart of any tree-based model is the decision tree, the fundamental building block from which other models are developed. Decision trees can be used for both regression and classification tasks. Recall that they make predictions using a tree-like structure of decisions, each based on a specific feature. This structure offers clear insight into the decision-making process, making decision trees inherently explainable.
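To make this concrete, here is a minimal sketch, not from the course itself, that trains a small tree on illustrative data and prints its decision rules using scikit-learn's export_text; the feature names are placeholders:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data: two numeric features, binary target
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

# A shallow tree keeps the printed rules easy to read
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Print the learned if/else rules, one line per decision node
print(export_text(tree, feature_names=["feature_1", "feature_2"]))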
3. Random forest

However, a random forest, which combines many decision trees and can likewise be used for both regression and classification, complicates direct interpretability: it aggregates the decisions of potentially hundreds or thousands of trees. While we can examine individual trees within a random forest to understand their decision paths, this approach becomes impractical with large numbers of trees. To address this, we rely on feature importance scores, which measure how much each feature contributes to reducing prediction uncertainty across the trees. This differs from linear models, where importance is derived directly from the learned coefficients.
4. Admissions dataset

Let's see how to do this in code. We'll use the admissions dataset to predict whether a student will be admitted to graduate school based on several features, using both a decision tree classifier and a random forest classifier. Assume that our training data is divided into features stored in X_train and labels stored in y_train.
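As a sketch of that setup, assuming a hypothetical admissions.csv file with an admit column (neither name is specified in the course), the split might look like this:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names for the admissions data
admissions = pd.read_csv("admissions.csv")
X = admissions.drop(columns="admit")  # features such as GPA and test scores
y = admissions["admit"]               # label: admitted (1) or not (0)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)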
5. Model training

We start by importing the DecisionTreeClassifier and RandomForestClassifier from sklearn.tree and sklearn.ensemble, respectively. We initialize both classifiers and train them on the training data using the .fit() method. After training, we assess which features are most influential by checking the .feature_importances_ attribute on both models, which returns an array containing one importance score per feature. These scores quantify how much each feature contributes to the model's decision-making process.
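A minimal sketch of those steps, assuming the X_train and y_train from the previous step (the variable names dt and rf are our own):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Initialize both classifiers with a fixed seed for reproducibility
dt = DecisionTreeClassifier(random_state=42)
rf = RandomForestClassifier(random_state=42)

# Train each model on the admissions training data
dt.fit(X_train, y_train)
rf.fit(X_train, y_train)

# One importance score per feature; each array sums to 1
print(dt.feature_importances_)
print(rf.feature_importances_)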
6. Feature importance

We visualize these importances in a horizontal bar plot with the matplotlib library, passing the feature names from X_train.columns and their corresponding importances to the plt.barh function. Analyzing the results, we find that cumulative GPA and test scores consistently rank as the most influential features. This prioritization is expected and validates our models, as these metrics are commonly regarded as strong indicators of academic ability and performance, and consequently they weigh heavily in admission decisions.
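A sketch of that plot, reusing the hypothetical rf model from the previous sketch (the decision tree version is identical apart from the model):

import matplotlib.pyplot as plt

# One horizontal bar per feature, using the random forest's scores
plt.barh(X_train.columns, rf.feature_importances_)
plt.xlabel("Feature importance")
plt.title("Random forest feature importances")
plt.tight_layout()
plt.show()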
7. Let's practice!

Time for some practice!