
1. Let's predict the sentiment!

In this final chapter, we will use a supervised learning model to predict the sentiment.

2. Classification problems

Imagine we are working with product reviews. A supervised learning task will try to classify any new review as either positive or negative based on already labeled reviews. This is what we call a classification problem. In the case of the product and movie reviews, we have two classes, positive and negative: this is a binary classification problem. The airline sentiment Twitter data has three categories of sentiment: positive, neutral, and negative. This is a multi-class classification problem.

3. Linear and logistic regressions

One algorithm commonly applied in classification tasks is logistic regression. You might be familiar with linear regression, where we fit a straight line to approximate a relationship, shown in the graph on the left. With logistic regression, instead of fitting a line, we fit an S-shaped curve called a sigmoid function. A property of this function is that, for any value of x, y will be between 0 and 1.
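The bounded range of the sigmoid is easy to verify numerically. A minimal sketch (the sigmoid helper below is our own illustration, not code from the course):

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function: maps any real x into (0, 1)."""
    return 1 / (1 + np.exp(-x))

# No matter how extreme x is, the output stays between 0 and 1
print(sigmoid(-10))  # very close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # very close to 1
```

This bounded output is what lets us interpret the curve's value as a probability.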

4. Logistic function

When performing linear regression, we are predicting a numeric outcome (say, the sale price of a house). With logistic regression, we estimate the probability that the outcome (sentiment) belongs to a particular category (positive or negative) given the review. Since we are estimating a probability and want an output between 0 and 1, we model the X values using the sigmoid/logistic function, as shown on the graph. For more details on logistic regression, refer to other courses on DataCamp.

5. Logistic regression in Python

In Python, we import LogisticRegression from the sklearn.linear_model module. Keep in mind that the sklearn API works only with numeric variables. It also requires either a DataFrame or an array as arguments and cannot handle missing data. Therefore, all transformation of the data needs to be completed beforehand. We call LogisticRegression and create a logistic classifier object. We fit it by specifying the X matrix, which is a NumPy array of our features or a pandas DataFrame, and the vector of targets y.
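A minimal sketch of these steps, using a tiny made-up feature matrix in place of the course's vectorized reviews (the data below is purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical numeric features: each row is a review, each column
# an already-transformed feature (e.g. word counts); no missing values
X = np.array([[2, 0], [1, 1], [0, 3], [0, 2]])
y = np.array([1, 1, 0, 0])  # 1 = positive, 0 = negative

# Create the classifier object and fit it on features X and targets y
log_reg = LogisticRegression()
log_reg.fit(X, y)

# Predict the class of a new, unseen review
print(log_reg.predict([[2, 1]]))
```

The fitted object can then be used to predict labels or class probabilities for new reviews.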

6. Measuring model performance

How do we know if the model is any good? We look at the discrepancy between the predicted label and the true label for each instance (observation) in our dataset. One common metric to use is the accuracy score. Though not appropriate in all contexts, it is still useful. Accuracy gives us the fraction of predictions that our model got right. The higher and closer it is to 1, the better. One way we can calculate the accuracy score of a logistic regression model is by calling the score method on the logistic regression object. It takes as arguments the X matrix and the y vector.
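The score method can be sketched as follows, again on a hypothetical toy dataset rather than the course's review data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for vectorized reviews
X = np.array([[2, 0], [1, 1], [0, 3], [0, 2]])
y = np.array([1, 1, 0, 0])

log_reg = LogisticRegression().fit(X, y)

# For a classifier, score returns the mean accuracy on (X, y):
# the fraction of instances whose predicted label matches the true one
accuracy = log_reg.score(X, y)
print(accuracy)
```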

7. Using accuracy score

Alternatively, we can use the accuracy_score function from sklearn.metrics. The accuracy_score function exists alongside the score method because different models have different default score metrics: accuracy_score always returns the accuracy, whereas the score method might return another metric when used to evaluate other models. Here, we need to explicitly calculate the predictions of the model by calling predict on the matrix of features. The accuracy_score function takes as arguments the vector of true labels and the predicted labels. We see that, in the case of logistic regression, both score and accuracy_score return a value of 0.9009.
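A sketch of this alternative route, on the same hypothetical toy data as above (so the resulting accuracy is not the 0.9009 obtained on the course's dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical toy data standing in for vectorized reviews
X = np.array([[2, 0], [1, 1], [0, 3], [0, 2]])
y = np.array([1, 1, 0, 0])
log_reg = LogisticRegression().fit(X, y)

# Explicitly compute predictions, then compare them to the true labels
y_pred = log_reg.predict(X)
print(accuracy_score(y, y_pred))

# For a classifier, this matches calling the score method directly
print(log_reg.score(X, y))
```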

8. Let's practice!

Can we trust such high accuracy? We should be careful in making strong conclusions just yet. In the next video, we will see how to check how robust the model performance is but before that, let's solve some exercises!
