Exercise

Sentiment analysis with GBM

Let's now use scikit-learn's GradientBoostingClassifier on the reviews dataset to predict the sentiment of a review given its text.

We will not pass the raw text to the model directly. The following preprocessing has already been done for you:

  1. Remove reviews with missing values.
  2. Select data from the top 5 apps.
  3. Select a random subsample of 500 reviews.
  4. Remove "stop words" from the reviews.
  5. Transform the reviews into a matrix in which each row is a review and each feature (column) holds the frequency of one word in that review.
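
The preprocessing steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used to prepare the exercise data: the toy DataFrame, its column names ("App", "Review", "Sentiment"), and the small sample sizes are hypothetical, and CountVectorizer with its built-in English stop-word list is one possible way to handle steps 4 and 5.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data; the real reviews dataset and its column names may differ.
reviews = pd.DataFrame({
    "App": ["A", "A", "B", "B", "C", "C", "D"],
    "Review": ["I love this app", "This is terrible", None,
               "Great and useful", "Not good at all", "Love it", "Bad"],
    "Sentiment": [1, 0, 1, 1, 0, 1, 0],
})

# 1. Remove reviews with missing values.
reviews = reviews.dropna()

# 2. Keep only the most-reviewed apps (top 5 in the exercise, top 2 in this toy example).
top_apps = reviews["App"].value_counts().nlargest(2).index
reviews = reviews[reviews["App"].isin(top_apps)]

# 3. Take a random subsample (500 reviews in the exercise, at most 4 here).
reviews = reviews.sample(n=min(4, len(reviews)), random_state=1)

# 4. and 5. Drop English stop words and build a word-frequency matrix:
# one row per review, one column per remaining word.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews["Review"])
y = reviews["Sentiment"].to_numpy()

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

The resulting X is a sparse matrix, which scikit-learn estimators such as GradientBoostingClassifier accept directly.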

Want a deeper understanding of text mining? Check out the course Introduction to Natural Language Processing in Python!

Instructions

100 XP
  • Build a GradientBoostingClassifier with 100 estimators and a learning rate of 0.1.
  • Calculate the predictions on the test set.
  • Compute the accuracy to evaluate the model.
  • Calculate and print the confusion matrix.
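
A minimal sketch of these four steps is shown below. Because the prepared review matrix is not reproduced here, the example uses synthetic data from make_classification as a stand-in; the variable names (X_train, X_test, y_train, y_test, clf_gbm), the train/test split, and the random_state values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the word-frequency matrix and sentiment labels (assumption).
X, y = make_classification(n_samples=500, n_features=20, random_state=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=500
)

# Build a gradient boosting classifier with 100 estimators and a learning rate of 0.1.
clf_gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=500
)
clf_gbm.fit(X_train, y_train)

# Calculate the predictions on the test set.
pred = clf_gbm.predict(X_test)

# Compute the accuracy to evaluate the model.
acc = accuracy_score(y_test, pred)
print("Accuracy: {:.3f}".format(acc))

# Calculate and print the confusion matrix (rows: true labels, columns: predictions).
cm = confusion_matrix(y_test, pred)
print(cm)
```

Setting random_state is optional; it only makes the run reproducible, since gradient boosting in scikit-learn involves some randomness.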