scikit-learn refresher
1. Welcome to the course!
Welcome to the course on logistic regression and support vector machines with Python! In this first chapter, we'll cover the syntax for using these classifiers in scikit-learn. In Chapter 2, we'll go into a more conceptual study of loss functions. This will form the basis for going deeper into logistic regression and support vector machines (or SVMs) in Chapters 3 and 4.

2. Assumed knowledge
In this course we'll assume you've taken the prerequisite courses or have a similar level of knowledge. In this video we'll briefly review the standard syntax of the popular machine learning package scikit-learn, which was covered in the prerequisite course on supervised learning. We'll continue to use scikit-learn extensively in this course. To remind you, supervised learning refers to learning a relationship from examples of input-output pairs, usually called X and y.

3. Fitting and predicting
There are a few typical steps of supervised learning. First, let's load the newsgroups data from scikit-learn's repository of built-in datasets. We can inspect the shape of X and y and see that we have about 11,000 training examples, each with about 130,000 features. In this case the features are derived from the words appearing in each news article, and the y-values are the article topics, which is what we're trying to predict.

4. Fitting and predicting (cont.)
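The loading step just described can be sketched as follows. Since the newsgroups data requires a download, this sketch substitutes scikit-learn's built-in digits dataset as a lightweight stand-in; the loading syntax is the same, but the shapes printed here are for digits, not the roughly 11,000-by-130,000 newsgroups matrix from the video.

```python
from sklearn.datasets import load_digits  # lightweight stand-in for the newsgroups data

# Load a built-in dataset; X holds one row of features per example,
# and y holds the label we want to predict for each example
X, y = load_digits(return_X_y=True)

print(X.shape)  # (1797, 64): 1,797 examples, 64 features each
print(y.shape)  # (1797,): one label per example
```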
Next, we can import the k nearest neighbors classifier, or KNN for short. We instantiate the classifier, and store it in the variable knn. This is the step where we specify model hyperparameters, like the number of neighbors for KNN. Next, we can fit the model using the "fit" method. This is standard syntax across all of scikit-learn. Then, we can make predictions on any data set, including the original training set X. The variable y_pred now contains one entry per row of X with the prediction from the trained classifier.

5. Model evaluation
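The instantiate/fit/predict pattern above can be sketched like this (again using the digits dataset as a stand-in for the newsgroups data; n_neighbors=5 is just an illustrative choice, and is also scikit-learn's default):

```python
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Instantiate the classifier; hyperparameters like n_neighbors are set here
knn = KNeighborsClassifier(n_neighbors=5)

# Standard scikit-learn syntax: fit the model, then predict
knn.fit(X, y)
y_pred = knn.predict(X)  # one prediction per row of X
print(y_pred.shape)
```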
Let's evaluate our KNN classifier. We can use the "score" method to compute the score on the training data, and it's tempting to be satisfied by almost 100% accuracy. But this number isn't particularly meaningful, since we want to know how the model generalizes to unseen data. This ability to generalize is often measured with a validation set. scikit-learn provides a convenient function to split up our data, train_test_split. X_train and y_train now contain the training set, and X_test and y_test now contain the test or validation set, which by default contains 25% of the examples. Before we compute the score on the test set, we need to make sure our model is trained on the training set only; otherwise it would have access to the test set, defeating its purpose. So we refit on the training set, and compute the test score. We see that we have a much lower testing accuracy. Whether or not 66% accuracy is considered "good" depends on the situation, but the training accuracy was definitely a poor representation of the model's ability to classify new data.

6. Let's practice!
Let's practice using KNN with scikit-learn.
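As a warm-up, the full workflow from this video can be sketched as below. This again uses the digits dataset as a stand-in for the newsgroups data, so the accuracies printed will differ from the video's numbers (digits is much easier than newsgroup classification, so the train/test gap is smaller than the roughly 100%-vs-66% gap seen there):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# Training accuracy is optimistic: the model has already seen these labels
train_acc = knn.fit(X, y).score(X, y)

# Hold out a validation set (25% of the examples by default)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Refit on the training set only, then score on unseen data
knn.fit(X_train, y_train)
test_acc = knn.score(X_test, y_test)
print(train_acc, test_acc)
```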