1. Training and test sets
One of the key steps for preprocessing that you should be familiar with is splitting the data into training and test sets.
2. Why split?
We split our dataset into training and test for a few main reasons.
First, it reduces the risk of overfitting, which, recall, arises when the model fits the training data too closely and therefore performs poorly when predicting on unseen data.
Second, if we train a model on our entire dataset, we have no way to test and validate it, since the model will essentially know the data by heart. Holding out a test set preserves some data the model hasn't seen yet, so we can evaluate its performance on unseen data.
3. Splitting up your dataset
The train_test_split function from sklearn's model_selection module randomly shuffles and then splits the features and labels, stored in X and y, into training and test sets. X_train and X_test are the training and test features, and y_train and y_test are the training and test labels. It's good practice to specify the random_state argument, so we can reproduce the exact same split if needed.
By default, the function will split 75% of the data into the training set and 25% into the test set, but we can adjust the proportion of the data assigned to the test set with the test_size argument.
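As a minimal sketch of the call described above, using made-up data (the array values and the 0.3 test size are just illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (made-up data)
y = np.array([0, 1] * 5)           # 10 labels

# random_state makes the shuffle reproducible;
# test_size overrides the default 25% test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

With 10 samples and test_size=0.3, three samples end up in the test set and seven in the training set.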
In many scenarios, the default splitting parameters work well. However, if our labels have an uneven distribution, where one label is much more common than another, a situation called class imbalance, the training and test sets might not be representative samples of the dataset, which could bias the model we're trying to train.
For example, in the training and test sets shown here, the training set contains only samples labeled n, while the test set contains a y label.
4. Stratified sampling
A good technique for sampling more accurately when you have imbalanced classes is stratified sampling, which is a way of sampling that takes into account the distribution of classes in the dataset.
Let's say we have a dataset with 100 samples: 80 of class 1 and 20 of class 2. We want both our training set and our test set to reflect this distribution, so in each, 80% of the samples should be class 1 and 20% class 2. With a 75/25 split, that means 60 class 1 samples and 15 class 2 samples in the training set of 75, and 20 class 1 samples and 5 class 2 samples in the test set of 25. This matches the distribution of classes in the original dataset.
5. Stratified sampling
There's a nice way to do this using the train_test_split function. The function has a stratify parameter, and to stratify according to class labels, pass the dataset labels, y, to that argument.
The dataset contains 100 labels, 80 of which are class 1 and 20 of which are class 2. Running train_test_split and stratifying on the class labels,
6. Stratified sampling
creates training and test labels with the same distribution of classes.
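As a sketch of the stratified split on a dataset like the one described, with 80 class 1 labels and 20 class 2 labels (the placeholder features and random_state value are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 80 + [2] * 20)  # 80 class-1 labels, 20 class-2 labels
X = np.zeros((100, 1))             # placeholder features

# stratify=y makes both splits preserve the 80/20 class ratio;
# the split size is left at the 75/25 default
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Training set: 60 class-1 and 15 class-2 samples;
# test set: 20 class-1 and 5 class-2 samples
print(np.bincount(y_train)[1:])
print(np.bincount(y_test)[1:])
```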
7. Let's practice!
Now it's your turn to do some stratified sampling!