Exercise

# Modeling without normalizing

Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the `wine` dataset. One of the columns, `Proline`, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.
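To see why `Proline` is the problem column, you can compare per-column variances directly. A minimal sketch, using scikit-learn's built-in wine dataset as a stand-in for the exercise's `wine` subset (where the column is named `proline`):

```python
# Inspect per-column variance to see why proline dominates the dataset.
# Note: scikit-learn's built-in wine data is an assumption standing in
# for the exercise's `wine` DataFrame.
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True).frame.drop(columns="target")
print(wine.var().sort_values(ascending=False).head())
```

The variance of `proline` sits orders of magnitude above every other column, which is exactly the situation where distance-based models like k-nearest neighbors suffer without normalization.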

The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (`knn`) as well as the `X` and `y` sets you need to fit and score on.

Instructions


- Split up the `X` and `y` sets into training and test sets using `train_test_split()`.
- Use the `knn` model's `fit()` method on the `X_train` data and `y_train` labels to fit the model to the data.
- Print out the `knn` model's `score()` on the `X_test` data and `y_test` labels to evaluate the model.
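The steps above can be sketched as follows. The exercise provides `knn`, `X`, and `y` for you; here we construct equivalents from scikit-learn's built-in wine dataset (an assumption, so the snippet runs on its own):

```python
# Hypothetical setup: the exercise already supplies knn, X, and y.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
knn = KNeighborsClassifier()

# Split the data, fit on the training set, and score on the held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```

Because `Proline`'s huge variance dominates the Euclidean distances k-nearest neighbors relies on, expect the accuracy here to be noticeably lower than what you'd get after normalizing.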