Decision trees
Your task in this exercise is to make a simple decision tree using scikit-learn's DecisionTreeClassifier on the breast cancer dataset that comes pre-loaded with scikit-learn.

This dataset contains numeric measurements of various dimensions of individual tumors (such as perimeter and texture) from breast biopsies and a single outcome value (the tumor is either malignant or benign).

We've preloaded the dataset of samples (measurements) into X and the target values per tumor into y. Now, you have to split the complete dataset into training and testing sets, and then train a DecisionTreeClassifier. You'll specify a parameter called max_depth. Many other parameters can be modified within this model, and you can check all of them out in the scikit-learn documentation for DecisionTreeClassifier.
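For reference, here is a minimal sketch of how X and y could be loaded from scikit-learn's built-in copy of the dataset. The exercise environment already does this for you, so treat the exact loading call shown here as an assumption, not part of the exercise code.

# Load the breast cancer dataset into a feature matrix X and a target vector y
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
print(X.shape)  # (569, 30): 569 tumors, 30 numeric measurements each
print(y.shape)  # (569,): one label per tumor (0 = malignant, 1 = benign)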
This exercise is part of the course Extreme Gradient Boosting with XGBoost.
Exercise instructions
- Import train_test_split from sklearn.model_selection and DecisionTreeClassifier from sklearn.tree.
- Create training and test sets such that 20% of the data is used for testing. Use a random_state of 123.
- Instantiate a DecisionTreeClassifier called dt_clf_4 with a max_depth of 4. This parameter specifies the maximum number of successive split points you can have before reaching a leaf node (see the short sketch after these instructions).
- Fit the classifier to the training set and predict the labels of the test set.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the necessary modules
____
____
# Create the training and test sets
X_train, X_test, y_train, y_test = ____(____, ____, test_size=____, random_state=____)
# Instantiate the classifier: dt_clf_4
dt_clf_4 = ____
# Fit the classifier to the training set
____
# Predict the labels of the test set: y_pred_4
y_pred_4 = ____
# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)
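One possible way to fill in the blanks is sketched below. It assumes X and y are already available in the exercise environment as described above; the import of numpy is added here only to keep the sketch self-contained.

# Import the necessary modules
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Create the training and test sets (20% of the data held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the classifier: dt_clf_4
dt_clf_4 = DecisionTreeClassifier(max_depth=4)

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4 == y_test)) / y_test.shape[0]
print("accuracy:", accuracy)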