Airline sentiment with stop words
You are given a dataset, called tweets, which contains customers' reviews and sentiments about airlines. It consists of two columns: airline_sentiment and text, where the sentiment can be positive, negative or neutral, and text is the text of the tweet.
In this exercise, you will create a bag-of-words (BOW) representation that accounts for stop words. Remember that stop words carry little information, so you might want to remove them. Doing so results in a smaller vocabulary and, in turn, fewer features. Keep in mind that you can extend the default list of stop words with ones that are specific to your context.
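As a small illustration of extending the default list, the sketch below assumes scikit-learn's ENGLISH_STOP_WORDS, which is a frozenset, so union() returns a new set and leaves the default untouched. The three extra entries are the ones used in this exercise. Note that with the vectorizer's default tokenizer, punctuation such as '@' is not produced as a token in the first place, so that entry is most likely harmless rather than necessary.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# The default list is a frozenset: union() builds a new set
# instead of modifying the default in place.
my_stop_words = ENGLISH_STOP_WORDS.union(['airline', 'airlines', '@'])

print(type(ENGLISH_STOP_WORDS).__name__)
print('airline' in ENGLISH_STOP_WORDS, 'airline' in my_stop_words)
print(len(my_stop_words) - len(ENGLISH_STOP_WORDS), 'entries added')
```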
This exercise is part of the course Sentiment Analysis in Python.
Exercise instructions
- Import the default list of English stop words.
- Update the default list of stop words with the given list ['airline', 'airlines', '@'] to create my_stop_words.
- Specify the stop words argument in the vectorizer.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import pandas and the stop words
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ____
# Define the stop words
my_stop_words = ____.____(['airline', 'airlines', '@'])
# Build and fit the vectorizer (newer scikit-learn versions expect a list here)
vect = CountVectorizer(____=list(my_stop_words))
vect.fit(tweets.text)
# Create the bow representation
X_review = vect.transform(tweets.text)
# Create the data frame (get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn)
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())