
How to grow your tree

1. How to grow your tree

Welcome back!

2. Diabetes dataset

You now know how to create the skeleton, or specification, of a decision tree and how to train it on real data to create a model. In this section, we'll use the diabetes dataset from the last exercise. It has an outcome column that indicates whether a patient has diabetes, and some numeric predictors like blood pressure, BMI, and age.
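A quick way to check this structure yourself (a minimal sketch, assuming the diabetes tibble from the last exercise is already loaded in your session):

library(dplyr)

# print the column names, types, and the first few values of each column
glimpse(diabetes)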

3. Using the whole dataset

So far, you used the whole dataset to fit a decision tree model. Now, how can you test the performance of your model on new data? You would need to collect more data.

4. Data split

A common solution for this problem is to split the data into two groups. One part of the data, the training set, is used to estimate parameters and to compare or tune models. The test set is held in reserve until the end of the project, where it serves as an unbiased source for measuring final model performance. It is critical that you don't touch the test set before this point; otherwise, the test data becomes part of the model development process.

5. Splitting methods

There are different ways to create these data partitions. You could take data from the end, from the center, or draw a random sample. Just select a part of the dataset, say 80% of all samples, for the training set and use the rest for the test set, as sketched below. We will introduce more exciting and more robust splitting methods in chapter 2.
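Here is a minimal sketch of such a simple random 80/20 split in base R (the diabetes object is assumed from the previous slides; the seed is arbitrary and only makes the split reproducible):

set.seed(42)

# randomly pick 80% of the row indices for the training set
train_indices <- sample(nrow(diabetes), size = floor(0.8 * nrow(diabetes)))

diabetes_train <- diabetes[train_indices, ]   # 80% of the rows
diabetes_test  <- diabetes[-train_indices, ]  # the remaining 20%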

6. The initial_split() function

The initial_split() function from the rsample package comes in handy because it does all the work for you. The function takes the original data, diabetes, and a proportion argument, say 0.9, and randomly assigns samples (rows of the tibble) to the training or test set. If no proportion is given, it assigns 75% of the samples to the training set and 25% to the test set. The result is a split object which contains analysis and assessment sets, which are just different names for training and test sets. Here we see that 76 samples are in the test or assessment set, which is about 10% of the 768 samples in total.
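In code, that looks like the following sketch (the seed value is arbitrary and only makes the random assignment reproducible):

library(rsample)

set.seed(123)

# hold out 10% of the rows for testing
diabetes_split <- initial_split(diabetes, prop = 0.9)

# printing the split object shows the training, testing, and total
# row counts, here 692, 76, and 768
diabetes_split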

7. Functions training() and testing()

After the initial split, the training() and testing() functions return the actual datasets. Calling the training() function on our split object diabetes_split gives us our training set. Calling the testing() function on our split object diabetes_split gives us our test set. You can compare the number of rows using the nrow() function to validate that the training set indeed has about 90% as many rows as the full dataset.
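For example, continuing with the diabetes_split object from the previous slide:

diabetes_train <- training(diabetes_split)
diabetes_test  <- testing(diabetes_split)

# about 0.9: the training set holds roughly 90% of the rows
nrow(diabetes_train) / nrow(diabetes)

# 76 rows in the test set
nrow(diabetes_test)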

8. Avoid class imbalances

Ideally, your training and test sets have the same distribution of the outcome variable. Otherwise, if your data split is very unlucky, you can end up with a training set containing no diabetes patients at all, which would result in a useless model. The following code exhibits this problem. First, count the 'yes' and 'no' outcomes in the training set using the table() function. Then, compute the proportion of 'yes' outcomes among all outcomes in the training set: it's about 15%. Do the same for the test set and find that it contains approximately 63% diabetes patients. This is a real problem if you have a rare-disease dataset and end up with no positive outcomes in your training set.
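The checks from this slide, written out (assuming the outcome column is coded as 'yes'/'no', as in the slides):

# counts of each outcome level in the training set
table(diabetes_train$outcome)

# proportion of 'yes' outcomes in the training set (about 15% in this unlucky split)
sum(diabetes_train$outcome == "yes") / nrow(diabetes_train)

# the same check for the test set (about 63% here)
sum(diabetes_test$outcome == "yes") / nrow(diabetes_test)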

9. Solution - enforce similar distributions

A remedy for this is the strata argument, which is set to the target variable "outcome" here. This ensures a random split with a similar outcome distribution in both the training and test sets.
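A sketch of the stratified split (in current rsample versions, strata can be given as an unquoted column name):

set.seed(123)

# stratify the random split by the outcome column
diabetes_split <- initial_split(diabetes, prop = 0.9, strata = outcome)

# both sets now contain a similar share of 'yes' outcomes
sum(training(diabetes_split)$outcome == "yes") / nrow(training(diabetes_split))
sum(testing(diabetes_split)$outcome == "yes") / nrow(testing(diabetes_split))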

10. Let's split!

Enough said, now it's your turn to split the data!
