

1. Did we really predict the sentiment well?

In the previous video, we used all of the available data to build a logistic regression model and assess its accuracy. However, we want to make sure our machine learning model generalizes and performs well on unseen data. How can we do that?

2. Train/test split

To get an idea of how well a model will perform on unseen data, we randomly split the dataset into two parts: one used for training (building the model) and one for testing (evaluating the model's performance). In some cases, when we want to tune the parameters of our algorithm, we might use three sets: training, validation, and testing, but this is out of scope for this course. The training set is usually around 70 to 80% of the whole dataset, and the rest is used for testing.

3. Train/test in Python

In Python, we can perform a random train-test split using the train_test_split function from the sklearn.model_selection module. It accepts arrays, lists, or DataFrames. Its output is the train and test feature matrices (X_train, X_test) and the train and test label vectors (y_train, y_test). The first arguments we pass are the feature matrix X and the label vector y. We can specify the proportion of the data that goes to testing with test_size; here, it is equal to 0.2. Another parameter is random_state, the seed used for the random split; it ensures that every time you perform the train-test split on the same data, you get the same instances in each set. We can also specify the stratify argument: if we want the train and test sets to have similar proportions of both classes, we set stratify equal to y.
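
A minimal sketch of this call, assuming X and y already hold the feature matrix and label vector (the variable names here are illustrative):

```python
from sklearn.model_selection import train_test_split

# Split X and y into train and test portions
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% of the data goes to the test set
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y         # keep class proportions similar in both sets
)
```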

4. Logistic regression with train/test split

Let's revisit our logistic regression example, this time after a train-test split. We create the LogisticRegression object and fit it on the training set. We can calculate the accuracy on the training data by calling score on the logistic regression with X_train and y_train as arguments. We can also calculate the accuracy of the model on the test set, using X_test and y_test. It is slightly lower than the accuracy on the training data, which is usually the case.
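
A sketch of these steps, reusing the split from above (the object name log_reg is our own choice):

```python
from sklearn.linear_model import LogisticRegression

# Build the model and fit it on the training data only
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Accuracy on the data the model was trained on
print(log_reg.score(X_train, y_train))

# Accuracy on the held-out test data, usually slightly lower
print(log_reg.score(X_test, y_test))
```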

5. Accuracy score with train/test split

You may recall that another way to calculate the accuracy is to use the accuracy_score function from sklearn.metrics. After we have built the logistic regression model, we call predict on it with X_test as an argument. In the last step, we call accuracy_score with the true and predicted labels. The value is identical to the accuracy produced by the score function.
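
Continuing the sketch with the fitted log_reg from above:

```python
from sklearn.metrics import accuracy_score

# Predict labels for the test set
y_pred = log_reg.predict(X_test)

# Compare true and predicted labels; this matches log_reg.score(X_test, y_test)
print(accuracy_score(y_test, y_pred))
```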

6. Confusion matrix

Accuracy is a useful measure of a model's performance, but it's not always the most informative one. We can instead use something called a confusion matrix. It shows, for each class, the number of predicted versus true values, as displayed in the table. A confusion matrix lets us see how many observations of each class we have predicted correctly and where the model makes its mistakes. For more detail on when we would want to optimize for the different cells, refer to other DataCamp courses.
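
As a toy illustration of the layout (the labels below are made up): scikit-learn's confusion_matrix puts the true classes in the rows and the predicted classes in the columns, so for binary labels the cells are true negatives, false positives, false negatives, and true positives.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, just to show the cell layout
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

# Rows = true classes, columns = predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_true_toy, y_pred_toy))
# [[1 1]
#  [1 2]]
```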

7. Confusion matrix in Python

In Python, we import confusion_matrix from the sklearn.metrics module. After we have built our logistic regression and predicted the test set labels, we call confusion_matrix with the true and predicted labels as arguments. We then divide the matrix by the length of the test label vector in order to express the cells of the matrix as proportions.
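
A sketch of these steps, continuing from the fitted log_reg above:

```python
from sklearn.metrics import confusion_matrix

# Predicted labels for the test set
y_pred = log_reg.predict(X_test)

# Raw counts per cell
cm = confusion_matrix(y_test, y_pred)

# Divide by the number of test observations to turn counts into proportions
print(cm / len(y_test))
```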

8. Let's practice!

Now let's solve some exercises!