1. Welcome to the course!
Hi and welcome to the course!
My name is Sandro and I am a Data Scientist. I will introduce you to some exciting classification and regression methods using decision trees and ensemble models.
What exactly are we going to cover?
2. Course overview
In Chapter 1, you'll be introduced to a set of supervised learning models known as classification trees.
In Chapter 2, you'll build decision trees for regression problems and understand the concepts of cross-validation and bias-variance trade-off.
Chapter 3 introduces you to hyperparameter tuning, bagging, and random forests.
Finally, Chapter 4 deals with boosting as a powerful ensemble method.
Along the way, you'll get to know many useful tools and methods for machine learning.
3. Decision trees are flowcharts
Consider this flowchart, which shows a way of classifying living animals.
A set of questions like "Can this thing live in water?" or "Does it have feathers?" allow you to narrow down the options until you arrive at a decision.
This type of flowchart describes how a computer or an algorithm could go about solving a classification problem. The same schema is also found in human decision-making, like holiday planning or deciding where to meet up with your friends.
4. Advantages of tree-based models
One of the biggest advantages of decision trees is that they are easy to explain. Anyone able to read a flowchart can already understand a decision tree.
In contrast to linear models, trees are able to capture non-linear relationships.
Furthermore, trees do not need normalization or standardization of numeric features. Trees can also handle categorical features without the need to create dummy binary indicator variables.
Missing values are not a problem, and trees are robust to outliers.
Last but not least, it's relatively fast to train a decision tree, so tree methods can handle big datasets.
5. Disadvantages of tree-based models
Unfortunately, large and deep trees are hard to interpret.
One of the major problems with trees is that they have high variance.
If not tuned properly, trees easily grow overly complex and fail to generalize to new data, a problem known as overfitting.
6. The tidymodels package
There are many great machine learning R packages out there. Throughout the course, we will use the tidymodels package, which orchestrates many of these for you.
Among these are parsnip for modeling, rsample for resampling, and yardstick for measuring model performance.
7. The tidymodels package
To make use of the package, simply type library(tidymodels) in your console.
It takes care of loading all the other useful packages.
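For example, at the start of an R session:

    # Loads tidymodels and attaches its core packages,
    # including parsnip, rsample, and yardstick
    library(tidymodels)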
8. Create a decision tree
To create a tree model, you first need to create the specification for your later model.
This serves as a functional design or skeleton.
First, pick a model class, and since we are in a tree-based course, we use decision_tree().
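As a minimal sketch, the specification begins with just the model class:

    # Create a bare decision tree model specification
    tree_spec <- decision_tree()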
9. Create a decision tree
The set_engine() function adds an engine to power or implement the model.
We use rpart, which is an R package for "recursive partitioning".
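Building on the sketch above:

    # Specify rpart as the computational engine
    tree_spec <- decision_tree() %>%
      set_engine("rpart")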
10. Create a decision tree
Then, set the mode, like classification or regression.
This sets the class of problems the model will solve.
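Putting all three steps together:

    # A complete specification: model class, engine, and mode
    tree_spec <- decision_tree() %>%
      set_engine("rpart") %>%
      set_mode("classification")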
11. From a model specification to a real model
A model specification, which you can save, for example, as tree_spec, is only a skeleton. You need to bring it to life using data. We call this process model training or model fitting.
Simply call the fit() function on your specification, supplying the arguments 'formula' and 'data'.
You can read the formula as "outcome is modeled as a function of age and bmi".
We used a diabetes dataset in this example.
The output informs you about the training time and the number of samples used for training.
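A minimal sketch of this step, assuming the diabetes data lives in a data frame called diabetes with columns outcome, age, and bmi (these names are illustrative and depend on your dataset):

    # Train the tree: outcome is modeled as a function of age and bmi
    tree_model <- tree_spec %>%
      fit(outcome ~ age + bmi, data = diabetes)

    # Printing the fitted model shows the training time
    # and the number of samples used for training
    tree_model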
12. Let's build a model!
Time for you to practice!