Sentiment analysis

1. Sentiment analysis

Excellent! You've been able to detect the presence of words in tweets and plot their relative prevalence across time. This is a small step in the direction of understanding meaning in text. In this lesson, we're going to focus on a method for deriving meaning from text called sentiment analysis.

2. Understanding sentiment analysis

Sentiment analysis is a natural language processing method that determines whether a word, sentence, paragraph, or document is positive or negative. The basic idea is that we count the positive and negative words in a document as a proportion of all the words in that document. Each document then gets a positive score and a negative score. Sentiment analysis can be useful in gauging reactions to a company, product, politician, or policy.
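To make the counting idea concrete, here is a minimal sketch in plain Python. The two word lists are tiny hypothetical lexicons invented for illustration, and the function name `simple_sentiment` is an assumption; real tools use much larger lexicons and extra rules.

```python
# Hypothetical mini-lexicons, for illustration only.
POSITIVE_WORDS = {"good", "great", "love", "excellent", "happy"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "awful", "sad"}


def simple_sentiment(text):
    """Return the share of positive and negative words in `text`."""
    # Split on whitespace, strip surrounding punctuation, and lowercase.
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return {"pos": 0.0, "neg": 0.0}
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    # Each score is a proportion of all words in the document.
    return {"pos": pos / len(words), "neg": neg / len(words)}


scores = simple_sentiment("I love this phone, but the battery is terrible!")
print(scores)
```

The sentence has nine words, one positive and one negative, so both scores come out equal. This is exactly the kind of case where a plain count struggles and a richer tool helps.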

3. Sentiment analysis tools

We'll use the VADER SentimentIntensityAnalyzer included with the Natural Language Toolkit, or `nltk`. VADER handles short text documents like tweets very well because it measures sentiment not only from particular words, but also from emoji and different types of capitalization. For instance, there's a qualitative difference between 'nice' in lowercase letters and 'NICE' in all caps.

4. Implementing sentiment analysis

To use VADER, we first import `SentimentIntensityAnalyzer` from `nltk`. Next, we instantiate a `SentimentIntensityAnalyzer`. Lastly, we generate sentiment scores by passing the analyzer's `polarity_scores` method to the Series method `apply`.
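Those three steps might look like the sketch below. The DataFrame `tweets`, its `text` column, and the helper names `make_vader_analyzer` and `score_tweets` are assumptions made for illustration. The `nltk` import sits inside the helper only so that the scoring step can be read and tried separately from the one-time lexicon download.

```python
import pandas as pd


def make_vader_analyzer():
    """Import VADER from nltk and instantiate the analyzer."""
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    # The analyzer needs the 'vader_lexicon' resource; fetch it if needed.
    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer()


def score_tweets(texts, analyzer):
    """Apply the analyzer's polarity_scores method to each tweet in a Series."""
    return texts.apply(analyzer.polarity_scores)


# Hypothetical example data; the column name 'text' is an assumption.
tweets = pd.DataFrame({"text": ["What a NICE day!", "This update is awful."]})

# With nltk installed and the lexicon available:
# sid = make_vader_analyzer()
# tweets["sentiment_scores"] = score_tweets(tweets["text"], sid)
```

Each element of the resulting Series is a dictionary of scores, one dictionary per tweet.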

5. Interpreting sentiment scores

A critical part of any type of natural language processing involves reading the text and assessing whether the method makes sense compared to a human reading. If we're attempting to replicate meaning with computational methods, then we have to make sure that meaning has face validity. Face validity means that the metric matches the concept we're trying to measure. In this case, we want to be sure the sentiment score matches our idea of what it means for a tweet to be positive or negative.

6. Interpreting sentiment scores

Here, we have two examples: a positive tweet and a negative tweet. Each sentiment score from the VADER analyzer provides four values: negative, neutral, positive, and compound. Positive and negative are self-explanatory, while neutral measures words that don't contribute to the sentiment. Compound, however, is a combination of the positive and the negative; it's an overall assessment which ranges between -1 and +1. Below 0 is negative, and above 0 is positive. The first tweet presented here reads to human eyes as positive, and its compound score is rather high, about 0.9. The second tweet reads as negative, but its compound score is only slightly below zero, around -0.07.
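As a rough sketch, the sign rule for the compound score can be written as a small function. The name `label_compound` is hypothetical, and treating a score of exactly 0 as "neutral" is an assumption, since the rule above only covers values below and above 0.

```python
def label_compound(compound):
    """Map a compound score in [-1, 1] to a coarse label by its sign."""
    if compound > 0:
        return "positive"
    if compound < 0:
        return "negative"
    return "neutral"  # assumption: exactly 0 counts as neutral


# The two approximate compound scores quoted for the example tweets.
examples = {"first tweet": 0.9, "second tweet": -0.07}
labels = {name: label_compound(score) for name, score in examples.items()}
print(labels)
```

The label captures only the sign, so a score of -0.07 gets the same label as a score of -0.9. Reading the tweets yourself is still the way to judge whether the strength of the score makes sense.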

7. Generating sentiment averages

We generate sentiment averages over time in the same way we generate average prevalence measures. We'll extract only the 'compound' field from the sentiment scores. Next, we'll separate sentiments for each company with our `check_word_in_tweet` function. Lastly, we'll generate an average value for our time window of one minute.
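Here is a minimal sketch of those three steps with pandas. The data, timestamps, and scores are invented for illustration, and `mentions` is a simplified stand-in for `check_word_in_tweet`, whose actual definition is not shown in this lesson. The sketch also assumes the DataFrame is indexed by tweet time.

```python
import pandas as pd

# Hypothetical tweets with invented compound scores, indexed by time.
tweets = pd.DataFrame(
    {
        "text": [
            "Google is great",
            "facebook is down again",
            "I like Google Maps",
            "Facebook ads everywhere",
        ],
        "sentiment_scores": [
            {"compound": 0.6},
            {"compound": -0.4},
            {"compound": 0.2},
            {"compound": 0.0},
        ],
    },
    index=pd.to_datetime(
        [
            "2018-01-01 00:00:10",
            "2018-01-01 00:00:40",
            "2018-01-01 00:00:50",
            "2018-01-01 00:01:30",
        ]
    ),
)


def mentions(word, texts):
    """Simplified stand-in for check_word_in_tweet: case-insensitive match."""
    return texts.str.contains(word, case=False, regex=False)


# Step 1: keep only the 'compound' field of each score dictionary.
compound = tweets["sentiment_scores"].apply(lambda scores: scores["compound"])

# Step 2: separate the scores by company. Step 3: average per minute.
google_mean = compound[mentions("google", tweets["text"])].resample("1min").mean()
facebook_mean = compound[mentions("facebook", tweets["text"])].resample("1min").mean()
print(google_mean)
print(facebook_mean)
```

Both Google tweets fall in the same minute, so they are averaged into one value, while the Facebook tweets land in two separate one-minute windows.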

8. Plotting sentiment scores

We can plot sentiment scores in the same way we plotted the prevalence of keywords. We'll set the x-axis to time and the y-axis to our sentiment score. This plot indicates that sentiment towards Google is slightly higher than sentiment towards Facebook over time. This is despite Facebook being mentioned more, which we saw in the last lesson. This underlines the importance of extracting meaning from text, not just performing keyword counts.
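A plot along these lines could be produced with matplotlib, as sketched below. The two Series hold invented per-minute averages used only so the example runs on its own; they are not the values from the lesson's plot.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-minute mean compound scores for each company.
minutes = pd.date_range("2018-01-01 00:00", periods=3, freq="1min")
google_mean = pd.Series([0.30, 0.25, 0.35], index=minutes)
facebook_mean = pd.Series([0.10, 0.20, 0.15], index=minutes)

fig, ax = plt.subplots()
# Time goes on the x-axis, the average sentiment score on the y-axis.
ax.plot(google_mean.index, google_mean.values, label="Google")
ax.plot(facebook_mean.index, facebook_mean.values, label="Facebook")
ax.set_xlabel("Time")
ax.set_ylabel("Mean compound sentiment")
ax.set_title("Sentiment over time")
ax.legend()
```

In an interactive session, calling `plt.show()` afterwards displays the figure.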

9. Let's practice!

Now that you know what sentiment analysis is, let's revisit the data science hashtag dataset and analyze the sentiment of those tweets.
