1. ML modeling steps
Great work! Now, we will learn about the main machine learning steps!
2. Supervised learning steps
As you've learned previously, supervised learning predicts a certain outcome. We typically follow five steps when building a supervised learning model. First, we split the data into training and testing sets. This is important because we want to "train" the model on one set of data and then measure its performance on the unseen, or testing, dataset to make sure it generalizes well to new data. Then, we initialize the model. After that, we fit the model on the training data; here, we also say that we are "training" the model. Once that is done, we predict the outcome using the trained model on the unseen, or testing, data. Finally, we measure the model's performance on the testing data. Let's see how this looks in code!
3. Supervised learning with code
We will build a supervised learning model using a simple model called a decision tree, which is essentially a set of if-else rules that determine the predicted outcome. First, let's load the libraries we will need. We import the tree module from scikit-learn, then the train_test_split function to split the data into training and testing datasets, and finally the accuracy_score function to calculate the model's performance.
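As a rough sketch, the imports described here might look like the following, assuming a standard scikit-learn installation:

```python
# Decision tree model, data splitting, and performance measurement
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
```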
4. Supervised learning steps with code
Now, let's build our first model! First, we split the data into training and testing sets. We use the previously loaded train_test_split function, which requires the original X and y datasets, and a test_size parameter value between 0 and 1. This defines the fraction of the data to be reserved for the testing dataset. Then, we initialize the decision tree and store it as mytree. The next step is to fit the model on the training data. Here, the decision tree learns the if-else rules that maximize the model's accuracy. Once that is done, we predict the values on the testing dataset. Finally, we calculate the accuracy score, which is the percentage of correctly predicted outcome values.
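A minimal sketch of these five steps is shown below. The names X and y are placeholders for an existing feature matrix and target vector, and the test_size value of 0.25 is just an illustrative choice:

```python
# Step 1: split the data into training and testing sets (25% reserved for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Step 2: initialize the decision tree and store it as mytree
mytree = tree.DecisionTreeClassifier()

# Step 3: fit (train) the model on the training data
mytree.fit(X_train, y_train)

# Step 4: predict the outcome on the unseen testing data
predictions = mytree.predict(X_test)

# Step 5: measure performance as the share of correctly predicted values
accuracy_score(y_test, predictions)
```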
5. Unsupervised learning steps
In contrast, unsupervised learning aims to group the observations into mutually exclusive clusters. Here, we have fewer steps as there's no target variable and no need to measure model accuracy. First, we initialize the model. Then, we fit the model on the data. Once this is done, we assign the cluster values to the original dataset. Finally, we explore the differences between clusters.
6. Unsupervised learning with code
We will now build an unsupervised learning model using a very popular model called K-means clustering. First, let's load the libraries we'll need. We import the KMeans class from the scikit-learn library and the pandas library.
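As a rough sketch, these imports might look like this:

```python
# K-means clustering model and pandas for data handling
from sklearn.cluster import KMeans
import pandas as pd
```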
7. Unsupervised learning with code
Then, we initialize the KMeans instance with one mandatory parameter - the number of clusters. In chapter four we'll go through several approaches to finding the optimal number here, but ultimately it's a test-and-learn process. The final decision is made in step 4, when exploring the characteristics of each cluster and choosing the solution that is the most interpretable. Here, we initialize the instance with 3 clusters. In the second step, we fit the model on the data. Next, we assign the cluster labels to the original dataset as a separate column. Finally, we explore the clusters by aggregating the original data for each cluster. In this case, we calculate the average of each variable for each cluster.
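A minimal sketch of these four steps follows. The DataFrame name data and the column name cluster are placeholders for illustration:

```python
# Step 1: initialize the KMeans instance with the chosen number of clusters
kmeans = KMeans(n_clusters=3)

# Step 2: fit the model on the data (assumed to be a numeric DataFrame)
kmeans.fit(data)

# Step 3: assign the cluster labels to the original dataset as a separate column
data['cluster'] = kmeans.labels_

# Step 4: explore the clusters by averaging each variable within each cluster
data.groupby('cluster').mean()
```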
8. Let's go build some models!
Impressive! Now let's go and try to implement some of these models!