Sentiment analysis with GBM
Let's now use scikit-learn's GradientBoostingClassifier on the reviews dataset to predict the sentiment of a review given its text.
We will not pass the raw text as input for the model. The following pre-processing has been done for you:
- Remove reviews with missing values.
- Select data from the top 5 apps.
- Select a random subsample of 500 reviews.
- Remove "stop words" from the reviews.
- Transform the reviews into a matrix, in which each feature represents the frequency of a word in a review.
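The last two pre-processing steps can be sketched with scikit-learn's CountVectorizer. This is a minimal illustration, not necessarily how the course prepared its data: the two example reviews are made up, and using the built-in English stop-word list is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical reviews, invented for illustration (not from the course dataset)
reviews = [
    "The app is great and easy to use",
    "This app is terrible, the worst app",
]

# Assumption: drop scikit-learn's built-in English stop words while tokenizing
vectorizer = CountVectorizer(stop_words="english")

# Sparse matrix: one row per review, one column per remaining word
X = vectorizer.fit_transform(reviews)

# vocabulary_ maps each word to its column index
app_col = vectorizer.vocabulary_["app"]
print(X.shape)
print(X[1, app_col])  # how many times "app" occurs in the second review
```

Each row of the matrix is a review and each column is a word, so a cell holds the number of times that word appears in that review.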
Do you want a deeper understanding of text mining? Then check out the course Introduction to Natural Language Processing in Python!
This exercise is part of the course
Ensemble Methods in Python
Exercise instructions
- Build a GradientBoostingClassifier with 100 estimators and a learning rate of 0.1.
- Calculate the predictions on the test set.
- Compute the accuracy to evaluate the model.
- Calculate and print the confusion matrix.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Build and fit a Gradient Boosting classifier
clf_gbm = ____(____, ____, random_state=500)
clf_gbm.fit(X_train, y_train)
# Calculate the predictions on the test set
pred = ____
# Evaluate the performance based on the accuracy
acc = ____
print('Accuracy: {:.3f}'.format(acc))
# Get and show the Confusion Matrix
cm = ____
print(cm)
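For reference, the same workflow can be run end to end outside the exercise environment. The sketch below is one possible way to carry out the steps, under stated assumptions: the course's X_train, y_train, X_test, and y_test are not available here, so a synthetic dataset from make_classification and a 75/25 stratified split stand in for the review features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: synthetic binary data stands in for the word-frequency matrix
X, y = make_classification(n_samples=500, n_features=20, random_state=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=500
)

# Build and fit a Gradient Boosting classifier
clf_gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=500
)
clf_gbm.fit(X_train, y_train)

# Calculate the predictions on the test set
pred = clf_gbm.predict(X_test)

# Evaluate the performance based on the accuracy
acc = accuracy_score(y_test, pred)
print('Accuracy: {:.3f}'.format(acc))

# Get and show the confusion matrix (rows: true class, columns: predicted class)
cm = confusion_matrix(y_test, pred)
print(cm)
```

In scikit-learn's confusion matrix, the diagonal counts the correct predictions, so the accuracy equals the sum of the diagonal divided by the total number of test samples.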