
Tuning models

1. Tuning models

Now that you've done a deeper dive into the importance of achieving high evaluation metrics, let's discover some ways to tune models better to achieve those outcomes.

2. Regularization

Regularization is the process of addressing overfitting, which is when a model has great training performance but poor testing performance because it followed the training data too closely. Regularization works by penalizing the magnitude of a model's coefficients in order to discourage overly complex models. For example, in the following graph there are two models (blue line and green line) fit to a set of red data points. The coefficients of the model shown by the blue line are much larger than those of the green line, leading to overfitting. Regularization is important because it can lead to better out-of-sample performance, increasing the evaluation metrics discussed in the last chapter, and hence the ROI on ad spend.

3. Examples of regularization

Now let's discuss some specific examples of regularization. We'll look at two types of models covered before: logistic regression and decision trees. For logistic regression, the sklearn parameter 'C' is the inverse of the regularization strength on the coefficients, so the smaller the value, the stronger the regularization, aka the larger the penalty. The larger the penalty, the less complex the model. So C = 0.05 will give a less complex model than C = 0.5, which in turn is less complex than C = 1. For decision trees, the main parameter is max_depth, which limits how many layers deep the tree can go. The higher the max depth, the more complex the resulting tree can be. So max_depth = 3 will give a less complex model than max_depth = 5, which in turn is less complex than max_depth = 10.
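The two complexity knobs above can be sketched in code. This is a minimal illustration on a synthetic dataset (the data, C values, and depths are chosen here for demonstration, not taken from the course): a smaller C shrinks the logistic regression coefficients, and a smaller max_depth caps the tree's depth.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Small synthetic binary-classification dataset for illustration
rng = np.random.RandomState(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Smaller C = stronger penalty = smaller coefficient magnitudes
strong = LogisticRegression(C=0.05, max_iter=1000).fit(X, y)
weak = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)
print(np.abs(strong.coef_).sum() < np.abs(weak.coef_).sum())  # stronger penalty shrinks coefficients

# Lower max_depth caps how complex the tree can grow
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X, y)
print(shallow.get_depth() <= deep.get_depth())
```

Comparing the summed absolute coefficients is a quick way to see the shrinkage effect directly, rather than inferring it from performance alone.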

4. Cross validation

Cross validation is a technique to estimate model performance that is independent of the way the data is split. Say, for example, that your training data happens to contain only non-clicks, or only clicks. Then your training data will lead to a model that severely over- or under-performs! To address this, cross validation makes sure all available data is both trained and tested on. It works as follows: you create k "folds", and for each of the k folds, you use that fold as a testing set and the other k-1 folds as a training set. The picture shows an example where k = 5. When done training over all k folds, you will have an estimate of your model's performance. Note that there is still a separate evaluation on the test data - the k folds are all contained within the training data.
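The fold mechanics described above can be made concrete with a small sketch (the 10-sample array here is invented for illustration): with k = 5, each pass holds out one fifth of the data for testing and trains on the rest, and every observation appears in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
kf = KFold(n_splits=5, shuffle=True, random_state=1)

fold_sizes = []
all_test = []
for train_idx, test_idx in kf.split(X):
    # each iteration: 1 fold for testing, the other k-1 folds for training
    fold_sizes.append((len(train_idx), len(test_idx)))
    all_test.extend(test_idx)

print(fold_sizes)          # 8 train / 2 test on every pass
print(sorted(all_test))    # every sample index is tested exactly once
```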

5. Examples of cross validation

To conduct k-fold cross validation, sklearn has a module called KFold which takes in two parameters: n_splits, which represents the number of splits, or k, and random_state, which controls how the splits are randomly selected (fixing it makes the splits reproducible). Then, you can use the cross_val_score function, which takes in the following arguments: first is the classifier of interest, second and third are the features and target of the training data respectively, the fourth is the cross validation settings, which is the KFold object from before, and the last is the scoring metric. For scoring, precision is given by the string 'precision_weighted', recall by the string 'recall_weighted', AUC of the ROC curve by the string 'roc_auc', etc.
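Putting the pieces together, here is a minimal sketch of KFold with cross_val_score. The dataset, C value, and fold count are assumptions for the example, not values from the course:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic training data standing in for the ad-click data
X, y = make_classification(n_samples=200, random_state=0)

clf = LogisticRegression(C=0.5, max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# One score per fold; other scoring strings include
# 'precision_weighted' and 'recall_weighted'
scores = cross_val_score(clf, X, y, cv=kf, scoring='roc_auc')
print(len(scores), scores.mean())
```

Averaging the five per-fold scores gives the cross-validated estimate of performance described above.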

6. Let's practice!

Now that you've learned and seen some examples about regularization and cross validation, let's jump right into tuning some models!