Step 3: Building a classifier
This is the last step of the sentiment analysis prediction task. So far, you have explored the dataset, enriched it with features related to the sentiment, and created numeric vectors from it.
You will use the dataset that you built in the previous steps. It contains a feature for the length of each review and 200 features created with the Tfidf vectorizer.
Your task is to train a logistic regression to predict the sentiment. The data has been imported for you and is called reviews_transformed. The target is called score and is binary: 1 when the product review is positive and 0 otherwise.
Train a logistic regression model and evaluate its performance on the test data. How well does the model do?
All the required packages have been imported for you.
This exercise is part of the course Sentiment Analysis in Python
Exercise instructions
- Perform the train/test split, allocating 20% of the data to testing and setting the random seed to 456.
- Train a logistic regression model.
- Predict the class.
- Print out the accuracy score and the confusion matrix on the test set.
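To make the last instruction concrete, here is a small sketch of how an accuracy score and a confusion matrix relate to each other. The labels below are hypothetical and made up purely for illustration; they are not taken from the reviews data. In scikit-learn's confusion_matrix, rows correspond to the true classes and columns to the predicted classes, so dividing the matrix by the number of observations gives the share of the data in each cell.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (assumption, for illustration only)
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# Share of predictions that match the true label
acc = accuracy_score(y_true, y_pred)

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred)

# Dividing by the number of observations turns counts into proportions
cm_share = cm / len(y_true)

print('Accuracy: ', acc)
print(cm_share)
```

The diagonal of the proportion matrix holds the correctly classified shares, so its sum equals the accuracy. The off-diagonal cells show which kind of mistake (false positive or false negative) the model makes more often.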
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Define X and y
y = reviews_transformed.score
X = reviews_transformed.drop('score', axis=1)
# Train/test split
X_train, X_test, y_train, y_test = ____(____, ____, ____=0.2, ____=456)
# Train a logistic regression
log_reg = ____.____(____, ____)
# Predict the labels
y_predicted = log_reg.____(____)
# Print accuracy score and confusion matrix on test set
print('Accuracy on the test set: ', ____(____, ____))
print(____(____, ____)/len(y_test))
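One possible way to fill in the blanks is sketched below. Because reviews_transformed is only available inside the exercise environment, this sketch builds a synthetic stand-in with make_classification; the stand-in data, the feature_0 to feature_19 column names, and the max_iter=1000 setting are assumptions added so that the example runs on its own. The scores it prints say nothing about the real reviews data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for reviews_transformed (assumption: the real data
# is only provided inside the exercise environment)
features, labels = make_classification(n_samples=500, n_features=20, random_state=0)
reviews_transformed = pd.DataFrame(
    features, columns=[f"feature_{i}" for i in range(20)]
)
reviews_transformed["score"] = labels

# Define X and y
y = reviews_transformed.score
X = reviews_transformed.drop('score', axis=1)

# Train/test split: 20% for testing, random seed 456
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=456
)

# Train a logistic regression (max_iter raised as a precaution; an assumption)
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict the labels
y_predicted = log_reg.predict(X_test)

# Accuracy and confusion matrix (as proportions) on the test set
accuracy = accuracy_score(y_test, y_predicted)
cm_share = confusion_matrix(y_test, y_predicted) / len(y_test)
print('Accuracy on the test set: ', accuracy)
print(cm_share)
```

Note the argument order in the metric functions: both accuracy_score and confusion_matrix take the true labels first and the predicted labels second. Swapping them leaves the accuracy unchanged but transposes the confusion matrix.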