Multiple text columns
In this exercise, you will continue working with the airline Twitter data. A dataset named tweets has been imported for you.
In some situations, a dataset contains more than one text column, and you might want to create a numeric representation for each of them. Here, besides the text column, which contains the body of the tweet, there is a second text column called negative_reason. It contains the reason the customer left a negative review.
Your task is to build bag-of-words (BOW) representations for both columns and specify the required stop words.
This exercise is part of the course Sentiment Analysis in Python.
Instructions
- Import the vectorizer package and the default list of English stop words.
- Update the default list of English stop words and create the my_stop_words set.
- Set the stop words argument of the first vectorizer to the updated set, and that of the second vectorizer to the default set of English stop words.
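The second step can be sketched on its own, assuming scikit-learn's `ENGLISH_STOP_WORDS` is the default list the instructions refer to. Because that object is a frozenset, `union` returns a new set and leaves the default untouched. The name `extra_words` is introduced here only for illustration.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Extra tokens taken from the exercise scaffold
extra_words = ['airline', 'airlines', '@', 'am', 'pm']

# union() builds a new frozenset; ENGLISH_STOP_WORDS itself is not modified
my_stop_words = ENGLISH_STOP_WORDS.union(extra_words)

print(type(my_stop_words).__name__)
print('airline' in ENGLISH_STOP_WORDS, 'airline' in my_stop_words)
```

Note that with the vectorizer's default token pattern, which keeps only runs of two or more word characters, a lone '@' never becomes a token, so listing it as a stop word is harmless but has no effect.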
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Import the vectorizer and default English stop words list
____
# Define the stop words
my_stop_words = ____._____(['airline', 'airlines', '@', 'am', 'pm'])
# Build and fit the vectorizers
vect1 = CountVectorizer(____=list(my_stop_words))
vect2 = CountVectorizer(____=____)
vect1.fit(tweets.text)
vect2.fit(tweets.negative_reason)
# Print the last 15 features from the first vectorizer, and all from the second
print(vect1.get_feature_names_out()[-15:])
print(vect2.get_feature_names_out())