1. Explore qdap's polarity & built-in lexicon
In this chapter you applied qdap’s polarity() function a few times to get a quick positive or negative assessment. It turns out this function is more complex than you may think.
2. polarity()
First, there is a built-in subjectivity lexicon. This standard lexicon comes from two researchers at the University of Illinois at Chicago. It contains almost 7,000 words marked as positive or negative. In this chapter you will learn how to change the lexicon, but let's first examine how it works without adding words.
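If you want to peek at that lexicon yourself, here is a minimal sketch assuming the key.pol polarity frame that ships with qdapDictionaries, which is what polarity() uses by default:

# Load qdap; the subjectivity lexicon lives in the qdapDictionaries package
library(qdap)

# key.pol is the default polarity frame: one row per word,
# with y = 1 for positive terms and y = -1 for negative terms
head(qdapDictionaries::key.pol)

# Roughly how many scored words does the lexicon contain?
nrow(qdapDictionaries::key.pol)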
3. Context cluster
When you apply the polarity() function to text, the function identifies words from the subjectivity lexicon. Once a polarized word is "tagged," the function creates a context cluster around that term. In this example, the lexicon contains "good" and it is found in the text. By default, a context cluster includes the four words before and two words after the identified word. So removing stopwords will impact polarity scores because it changes which words fall inside the cluster.
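As a rough sketch of that window (assuming the n.before and n.after arguments documented in ?polarity; the sentence below is made up for illustration):

library(qdap)

# Hypothetical sentence; "good" is the only word tagged from the lexicon
text <- "learning sentiment analysis is a very good idea today"

# The defaults mirror the slide: 4 words before and 2 words after each tagged word
polarity(text, n.before = 4, n.after = 2)

# Changing these values changes which neighboring words are inspected
# for negators, amplifiers, and de-amplifiers
polarity(text, n.before = 2, n.after = 0)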
4. Context cluster, continued
Next, each of the individual words is classified as polarized, neutral, negator, amplifier, or de-amplifier. The identified words from the lexicon are the polarized words, in this case, "good." A neutral word has no impact on the context cluster's polarity, but it does affect the word count, which is important later. In this cluster there are seven neutral terms, such as "learning" and the stopwords. Amplifiers and de-amplifiers are considered valence shifters; valence shifters add to or subtract from the intensity of the author's intent. An example amplifier is "very," as in "Ted's voice is very nice." The amplifier strengthens the niceness of my voice. Lastly, a negator switches the polarity of the cluster, as in "Ted's voice is NOT very nice." The amplified "very" and the positivity of "nice" are now switched completely to not being nice at all!
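To see valence shifters in action, here is a quick example using the sentences above; the exact scores depend on your lexicon version, so treat the outputs as illustrative:

library(qdap)

# Amplifier: "very" boosts the positive polarity contributed by "nice"
polarity("Ted's voice is very nice")

# Negator: "not" flips the sign of the whole (amplified) cluster
polarity("Ted's voice is not very nice")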
5. Context cluster glossary
As a quick review, here are the terms we covered. Once polarity() has created a context cluster, it classifies terms into each of these types: polarized, neutral, negators, and valence shifters.
6. Context cluster scoring
As you may expect, a positive word has a value of 1 and a negative term has a value of -1.
This context cluster does not contain a negator, so we don't have to switch any values.
In the polarity() function, an amplifier like "very" is valued at 0.8, while a de-amplifier receives -0.8.
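If you want to experiment with that weight, polarity() exposes it as the amplifier.weight argument; per qdap's documentation this single value is applied to both amplifiers and de-amplifiers, a detail worth confirming with ?polarity:

library(qdap)

# The default weight of 0.8 for valence shifters
polarity("Ted's voice is very nice", amplifier.weight = 0.8)

# A smaller weight dampens the influence of "very" on the score
polarity("Ted's voice is very nice", amplifier.weight = 0.2)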
7. Polarity calculation
In the end, all the polarity values are summed. Keep in mind we didn't remove stopwords, so the entire passage has 9 terms. The word "good" counts as a 1. The amplifier "very" adds another 0.8. So the total polarity is 1.8 with a total word count of 9. Then 1.8 is divided by the square root of 9, that is, 1.8 divided by 3, so the polarity score is 0.6. Dividing by the square root of the total number of words accounts for polarity term density, the idea being that densely packed polarized words, negators, and valence shifters engender stronger polarity feelings.
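Here is a minimal sketch of that arithmetic, using a made-up nine-word sentence (not the actual slide passage) that happens to contain "good" amplified by "very", so the hand calculation can be checked against polarity()'s output:

library(qdap)

# Hypothetical nine-word passage: "good" is polarized, "very" amplifies it
text <- "learning sentiment analysis is a very good idea today"

# Hand calculation: 1 (good) + 0.8 (very), divided by sqrt(9) words
(1 + 0.8) / sqrt(9)   # 0.6

# polarity() should land on a comparable score for this passage
polarity(text)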
Now that you know how the polarity() function works, you will close out this chapter by applying it and adjusting the subjectivity lexicon to fit your particular text. This is important for polarity analysis. For example, in the Twitter-sphere, terms like "lol" for laugh out loud should be added as positive words to the lexicon. Without these channel-specific terms, your analysis may completely miss important context clusters.
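As a sketch of how that adjustment might look (assuming qdap's sentiment_frame() helper and the positive.words and negative.words vectors from qdapDictionaries; the added terms here are just examples):

library(qdap)

# Build a custom polarity frame by extending the default word lists
custom_frame <- sentiment_frame(
  c(qdapDictionaries::positive.words, "lol", "rofl"),
  c(qdapDictionaries::negative.words, "smh")
)

# Pass the custom frame to polarity() via the polarity.frame argument
polarity("lol this chapter is good", polarity.frame = custom_frame)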
Remember, the next chapters will introduce the tidytext package, visuals, and then a practical application of sentiment analysis using property rental reviews. Can't wait!
8. Let's practice!