
Overview of machine learning models

1. Overview of machine learning models

In this lesson, we'll cover the basic concepts of machine learning models for classification and take a quick look at using some common algorithms.

2. Logistic regression

A classification algorithm is a function that separates observations into either a positive or a negative class. To do this, it takes in training data and makes predictions on testing data. There are many different algorithms that we will cover in the course. To start, we'll cover logistic regression. Here is an example picture, where the red represents negative examples and the blue represents positive examples. The solid blue line is known as a decision boundary. Logistic regression finds this boundary by finding the best fit between a dependent variable (the target variable) and various independent variables (the features).
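The idea can be sketched in a few lines of sklearn. This is a toy illustration, not the course dataset: the two well-separated clusters of points are invented so that the learned coefficients and intercept, which define the linear decision boundary w1*x1 + w2*x2 + b = 0, cleanly separate the classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two made-up clusters: negatives around (0, 0), positives around (2, 2)
X_neg = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
X_pos = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
X = np.vstack([X_neg, X_pos])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression()
clf.fit(X, y)

# The learned weights and intercept define the decision boundary
print(clf.coef_, clf.intercept_)
print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))  # one point from each cluster
```

Points on one side of the boundary are predicted as class 0, points on the other side as class 1.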

3. Training the model

We can create a model as follows, and note the use of parentheses. Each classifier in sklearn takes in different parameters and has a fit method. The fit method takes two arguments that come from the training data: X_train, a matrix of features, and y_train, a vector of targets. It is important to keep the training data separate from the testing data, since the classifier is only supposed to make predictions on data it has not seen before. This avoids the model "seeing answers beforehand": if the model is good, it should generalize outside of the training set.
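The steps above can be sketched as follows. The feature matrix and click labels here are synthetic stand-ins for the course data, invented purely so the snippet runs on its own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))            # three made-up features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic 0/1 click labels

# Hold out test data so the model is evaluated on unseen examples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression()   # note the parentheses: this creates an instance
clf.fit(X_train, y_train)    # fit takes the training features and targets
```

The classifier never sees X_test or y_test during fitting, which is what lets the test set measure generalization.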

4. Testing the model

After our model has been trained, we can use the predict method to generate predictions (y_pred) on testing data using our testing features (X_test). An example of y_pred is shown here: the labels are either 0 or 1 to represent not click or click. Additionally, using sklearn we can get the probability scores themselves via the predict_proba method. As seen in the example output, these scores give a probability for the label being 0 and 1, respectively. These probability scores are the model's estimate of the likelihood that a particular ad was clicked by a particular user. Overall, we want the model to accurately predict which users will click which ads. Then, for each user, we can show the ads for which they have the highest predicted probability of clicking.
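The prediction step looks like this in sklearn. A small synthetic dataset and classifier are recreated here so the snippet is self-contained; in the lesson you would reuse the clf and X_test from the training step:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)  # synthetic click labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

y_pred = clf.predict(X_test)       # hard 0/1 labels: not click / click
proba = clf.predict_proba(X_test)  # one row per example: [P(y=0), P(y=1)]

print(y_pred[:5])
print(proba[:5])
```

Each row of predict_proba sums to 1, and predict simply picks the label whose probability is at least 0.5.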

5. Evaluating the model

Next comes evaluation. In evaluating our model, we have a choice of many different metrics. For now we can start off by looking at accuracy: the percentage of test observations whose labels (click or not click) we correctly predict. We will implement a version of accuracy by hand, but more often we will use sklearn's accuracy_score function, which takes the actual test labels (y_test) first, followed by the predicted test labels (y_pred). A good classifier should have a high accuracy, but this is not the only metric we should use, because sometimes the datasets themselves are imbalanced, meaning the data contains far more of one target type than the other. For example, if there were mostly non-targets (zeros) in the dataset, then a classifier could achieve a high accuracy by automatically predicting a zero for each testing observation. This is very relevant to ads CTR prediction, since, as we saw, the positive rate in the training set is under 20%. We will go over interpretations of a low CTR in the next lesson, as well as other metrics besides accuracy.
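Both versions of accuracy, and the imbalance caveat, can be sketched with a small invented pair of label vectors (these are not the course's actual y_test and y_pred):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_test = np.array([0, 0, 1, 0, 1, 0, 0, 1])  # made-up true labels
y_pred = np.array([0, 0, 1, 0, 0, 0, 0, 1])  # made-up predictions

# By hand: the fraction of predictions that match the true labels
acc_manual = (y_test == y_pred).mean()
acc_sklearn = accuracy_score(y_test, y_pred)  # actual labels first, then predicted
print(acc_manual, acc_sklearn)  # both 0.875 (7 of 8 correct)

# Imbalance caveat: a "classifier" that always predicts the majority
# class (0) still scores fairly well, despite never identifying a click
acc_all_zeros = accuracy_score(y_test, np.zeros_like(y_test))
print(acc_all_zeros)  # 0.625
```

This is why accuracy alone can be misleading on imbalanced data such as ad clicks, where most observations are non-clicks.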

6. Let's practice!

Now that we've done a high level overview of machine learning models, let's dive into some examples using logistic regression!