1. Logistic regression on sonar
Classification models differ from regression models
2. Classification models
in that you're trying to predict a categorical target. For example, predicting whether or not a loan will default.
This is still a form of supervised learning, like with regression problems. As before, we can use a train/test split to explore how well our model generalizes to new data.
In this chapter, we'll be working with the 'sonar' dataset, a classic statistics dataset which contains some characteristics of a sonar signal for objects that are either rocks or mines. The goal here is to train a classifier that can reliably distinguish rocks from mines.
3. Example: Sonar data
Let's load the sonar dataset and take a look at it. Note that the target is either "R" for rock and "M" for mine and most of the predictors are numbers measuring some aspect of a sonar signal.
4. Splitting the data
Analyzing sonar and radar signals was one of the original applications of machine learning. As with the diamonds and Boston housing datasets in the previous chapter, we'll start by splitting the dataset randomly into training and test sets.
This time; however, we'll do a 60/40 split, instead of 80/20. The sonar dataset is very small, so a 40% split gives us a more reliable test set. It would be even better to use multiple 80/20 splits and average the results of each 20% split.
We'll discuss this idea in more detail later.
5. Splitting the data
First, we randomly order the dataset. This is important to avoid bias in our train/test split and make sure we get a representative sample of the whole dataset.
Next, we identify a row that's about 60% of the way through the dataset. This row will be the last observation in our training set.
Finally, we check that our training set is 60% of the entire dataset.
6. Let's practice!
Let's practice making train/test splits.