
Decision tree for classification

1. Decision-Tree for Classification

Hi! My name is Elie Kawerk, I'm a Data Scientist and I'll be your instructor. In this course, you'll be learning about tree-based models for classification and regression.

2. Course Overview

In chapter 1, you'll be introduced to a set of supervised learning models known as Classification and Regression Trees, or CART. In chapter 2, you'll understand the notions of the bias-variance trade-off and model ensembling. Chapter 3 introduces you to Bagging and Random Forests. Chapter 4 deals with boosting, specifically with AdaBoost and Gradient Boosting. Finally, in chapter 5, you'll understand how to get the most out of your models through hyperparameter tuning.

3. Classification-tree

Given a labeled dataset, a classification tree learns a sequence of if-else questions about individual features in order to infer the labels. In contrast to linear models, trees are able to capture non-linear relationships between features and labels. In addition, trees don't require the features to be on the same scale, so preprocessing steps such as standardization are not needed.
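To see the point about scaling in practice, here is a minimal sketch, not taken from the course slides. It assumes scikit-learn's bundled copy of the Wisconsin Breast Cancer data (load_breast_cancer) and uses hypothetical variable names. It fits one tree on the raw features and one on standardized copies, then measures how often the two agree on the test set. Because a tree only compares each feature to a threshold, the two trees should agree on all, or nearly all, test instances.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled copy of the dataset stands in for the course data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Tree fitted on the raw, unscaled features
raw_tree = DecisionTreeClassifier(max_depth=2, random_state=1)
raw_tree.fit(X_train, y_train)

# Tree fitted on standardized copies of the same features
scaler = StandardScaler().fit(X_train)
scaled_tree = DecisionTreeClassifier(max_depth=2, random_state=1)
scaled_tree.fit(scaler.transform(X_train), y_train)

# Fraction of test instances on which the two trees predict the same label
raw_pred = raw_tree.predict(X_test)
scaled_pred = scaled_tree.predict(scaler.transform(X_test))
agreement = (raw_pred == scaled_pred).mean()
print(f"Fraction of identical test predictions: {agreement:.3f}")
```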

4. Breast Cancer Dataset in 2D

To understand trees more concretely, we'll try to predict whether a tumor is malignant or benign in the Wisconsin Breast Cancer dataset using only 2 features. The figure here shows a scatterplot of two cancerous cell features, with malignant tumors in blue and benign tumors in red.

5. Decision-tree Diagram

When a classification tree is trained on this dataset, the tree learns a sequence of if-else questions, with each question involving one feature and one split-point. Take a look at the tree diagram here. At the top, the tree asks whether the concave-points mean of an instance is <= 0.051. If it is, the instance traverses the True branch; otherwise, it traverses the False branch. Similarly, the instance keeps traversing the internal branches until it reaches an end node, known as a leaf. The label of the instance is then predicted to be that of the prevailing class at that leaf. The maximum number of branches separating the top of the tree from a leaf is known as the maximum depth, which is equal to 2 here.
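The if-else structure described above can be printed as text. The following is a rough sketch rather than the code behind the diagram: it assumes scikit-learn's bundled copy of the dataset, where the relevant column is named "mean concave points", and it assumes "mean radius" as the second feature, since the transcript does not name it. The thresholds it prints may differ from those in the diagram, because the data and split used there are not shown.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
names = list(data.feature_names)

# Assumption: the second feature is "mean radius"; the transcript names only the first
feature_pair = ["mean radius", "mean concave points"]
cols = [names.index(name) for name in feature_pair]
X, y = data.data[:, cols], data.target

# A tree limited to two levels of questions, as in the diagram
dt = DecisionTreeClassifier(max_depth=2, random_state=1)
dt.fit(X, y)

# Each printed line is one question (feature <= split-point) or one leaf
rules = export_text(dt, feature_names=feature_pair, decimals=3)
print(rules)
```

In scikit-learn's copy of the data, class 0 is malignant and class 1 is benign, so that is how the leaf labels in the printed rules should be read.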

6. Classification-tree in scikit-learn

Now that you know what a classification tree is, let's fit one with scikit-learn. First, import DecisionTreeClassifier from sklearn.tree as shown in line 1. Also, import the functions train_test_split() from sklearn.model_selection and accuracy_score() from sklearn.metrics. In order to obtain an unbiased estimate of a model's performance, you must evaluate it on an unseen test set. To do so, first split the data into 80% train and 20% test using train_test_split(). Set the parameter stratify to y so that the train and test sets have the same proportion of class labels as the unsplit dataset. You can now use DecisionTreeClassifier() to instantiate a tree classifier, dt, with a maximum depth of 2 by setting the parameter max_depth to 2. Note that the parameter random_state is set to 1 for reproducibility.
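To check what stratify=y does, you can compare class proportions before and after the split. This is a small sketch under assumptions: it uses scikit-learn's bundled copy of the dataset, passes random_state=1 to the split as well (the transcript only mentions it for the classifier), and class_share is a hypothetical helper introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 80% train / 20% test, keeping the class proportions of y in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

def class_share(labels):
    """Hypothetical helper: fraction of instances belonging to each class."""
    return np.bincount(labels) / len(labels)

full_share = class_share(y)
train_share = class_share(y_train)
test_share = class_share(y_test)

print("full :", full_share.round(3))
print("train:", train_share.round(3))
print("test :", test_share.round(3))
```

The three printed rows should be nearly identical; without stratify, the test proportions can drift further from those of the full dataset.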

7. Classification-tree in scikit-learn

Then, call the fit method on dt and pass X_train and y_train. To predict the labels of the test set, call the predict method on dt. Finally, print the test-set accuracy using accuracy_score(). To understand the tree's predictions more concretely, let's see how it classifies instances in the feature space.
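Put together, the steps from the last two slides might look like the sketch below. The slide code itself is not part of this transcript, so this is a reconstruction under assumptions: the data comes from scikit-learn's bundled load_breast_cancer with all of its features, and random_state=1 is also passed to the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled copy of the Wisconsin Breast Cancer data
X, y = load_breast_cancer(return_X_y=True)

# Split into 80% train and 20% test, stratified on the labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Instantiate a tree with a maximum depth of 2
dt = DecisionTreeClassifier(max_depth=2, random_state=1)

# Fit on the training set, then predict the test-set labels
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

# Evaluate on the unseen test set
acc = accuracy_score(y_test, y_pred)
print(f"Test set accuracy: {acc:.2f}")
```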

8. Decision Regions

A classification model divides the feature space into regions in which all instances are assigned the same class label. These regions are known as decision regions. Decision regions are separated by surfaces called decision boundaries. The figure here shows the decision regions of a linear classifier. Note how the boundary is a straight line.

9. Decision Regions: CART vs. Linear Model

In contrast, as shown here on the right, a classification tree produces rectangular decision regions in the feature space. This happens because each split made by the tree involves only one feature, so every piece of the boundary runs parallel to one of the feature axes.
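The difference between the two kinds of boundary can also be read off the fitted models without plotting. The sketch below is illustrative only: it assumes scikit-learn's bundled dataset, assumes "mean radius" and "mean concave points" as the two features, and uses LogisticRegression as a stand-in linear classifier, since the transcript does not say which linear model produced the figure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
names = list(data.feature_names)

# Assumption: these two columns play the role of the two features in the figures
cols = [names.index("mean radius"), names.index("mean concave points")]
X, y = data.data[:, cols], data.target

dt = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)
logreg = LogisticRegression(max_iter=1000).fit(X, y)

# Internal nodes store the index of the single feature they test;
# leaves are marked with a negative feature index
is_split = dt.tree_.feature >= 0
split_features = dt.tree_.feature[is_split]
split_thresholds = dt.tree_.threshold[is_split]
for feat, thr in zip(split_features, split_thresholds):
    print(f"tree split: x{feat} <= {thr:.3f}")

# The linear boundary is a single line that combines both features
w = logreg.coef_[0]
b = logreg.intercept_[0]
print(f"linear boundary: {w[0]:.3f}*x0 + {w[1]:.3f}*x1 + {b:.3f} = 0")
```

Each tree split is a threshold on one coordinate, which is what makes the regions rectangular, whereas the linear model's boundary is one weighted combination of both coordinates.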

10. Let's practice!

Now let's practice!
