
Modeling without normalizing

Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first.

Here we have a subset of the wine dataset. One of the columns, Proline, has an extremely high variance compared to the other columns. This is where a technique like log normalization, which you'll learn about in the next section, would come in handy.
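As a quick preview of why log normalization helps, the sketch below applies a natural log transform to some hypothetical high-variance values standing in for a Proline-like column (the numbers here are made up for illustration, not taken from the actual dataset):

```python
import numpy as np

# Hypothetical high-variance values, standing in for a Proline-like column
proline = np.array([1065.0, 1050.0, 735.0, 1480.0, 290.0])

# Log normalization: the natural log compresses the scale of large values
log_proline = np.log(proline)

print(proline.var())      # variance on the original scale is large
print(log_proline.var())  # variance after the log transform is far smaller
```

The transformed column has a variance orders of magnitude smaller, so it no longer dominates distance-based models like k-nearest neighbors.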

The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (knn) as well as the X and y sets you need to fit and score on.

This exercise is part of the course

Preprocessing for Machine Learning in Python


Exercise instructions

  • Split up the X and y sets into training and test sets, ensuring that class labels are equally distributed in both sets.
  • Fit the knn model to the training features and labels.
  • Print the test set accuracy of the knn model using the .score() method.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = ____(____, ____, stratify=____, random_state=42)

knn = KNeighborsClassifier()

# Fit the knn model to the training data
knn.____(____, ____)

# Score the model on the test data
print(knn.____(____))
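For reference, a completed version of the exercise might look like the sketch below. Since the preloaded `X` and `y` aren't available outside the exercise environment, `load_wine` from scikit-learn stands in for them here; the full dataset (rather than the exercise's subset) may give a slightly different accuracy:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the wine dataset (stands in for the preloaded X and y)
X, y = load_wine(return_X_y=True)

# Split into training and test sets, stratifying on y so the class
# labels are equally distributed in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

knn = KNeighborsClassifier()

# Fit the knn model to the training features and labels
knn.fit(X_train, y_train)

# Print the test set accuracy
print(knn.score(X_test, y_test))
```

Because the features are unscaled, the high-variance Proline column dominates the distance computations, and the accuracy you see here should improve once you apply the normalization techniques from the next section.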