How many words do YOU know? Zipf's law & subjectivity lexicon

1. How many words do YOU know? Zipf's law & subjectivity lexicon

Boom! A little refresher, and a polarity visualization. Pretty good start. Many sentiment analysis methods use a subjectivity lexicon. Let's learn what a subjectivity lexicon is and why it works.

2. Subjectivity lexicon

The polarity function you just applied uses a subjectivity lexicon. A subjectivity lexicon is a predefined list of words, each associated with a specific emotion or with positive or negative feeling. For example, the words bad, awful and terrible can all reasonably be associated with a negative state. In contrast, perfect or ideal can be connected with positivity. In its simplest form, sentiment analysis is merely a comparison between the words in the author’s text and the predefined subjectivity lexicon. The visual you made is based on a subjectivity lexicon compared to some fictitious text. More on that later. For now, focus on the subjectivity lexicon and why it works.
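To make the "comparison against a predefined list" idea concrete, here is a minimal plain-Python sketch. It is not how qdap's polarity() is implemented; the five-word lexicon, the +1/-1 scores, and the function name `toy_polarity` are all assumptions made up for illustration, reusing the example words from above.

```python
import re

# Hypothetical toy lexicon built from the example words above.
# Real subjectivity lexicons hold thousands of entries.
TOY_LEXICON = {
    "bad": -1,
    "awful": -1,
    "terrible": -1,
    "perfect": 1,
    "ideal": 1,
}


def toy_polarity(text, lexicon=TOY_LEXICON):
    """Return the net score: matched positive words minus matched negative words."""
    # Lowercase the text and pull out runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    # Words missing from the lexicon contribute 0.
    return sum(lexicon.get(word, 0) for word in words)


print(toy_polarity("The service was awful and the food was terrible"))  # -2
print(toy_polarity("An ideal day, just perfect"))                       # 2
print(toy_polarity("The train left at noon"))                           # 0
```

A sentence with no lexicon words scores zero, which is exactly the worry the next slides address: the approach only works if a short list still catches most of the words people actually use.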

3. Where to get subjectivity lexicons?

In this course we primarily work with qdap's polarity() function, which uses an academic lexicon from the University of Illinois at Chicago. We also spend a lot of time working with the tidytext sentiments tibble, which is made up of three different lexicons.

4. library(lexicon)

There is even an R package called lexicon that has a bunch of subjectivity and other word lists. While we use the standard ones from qdap and tidytext, it's worth exploring the many word lists from the lexicon library. Plus, you should always think about adjusting your lexicon, since an off-the-shelf list is unlikely to be accurate for your exact need.

5. No way! Too few words.

So now you may be saying to yourself, little lists of words aren’t going to help me with my text because people are so expressive. In fact, you may know more than fifty thousand words, and most subjectivity lexicons contain only a few thousand…how can searching for these terms provide an accurate sentiment analysis? The answer is based on two principles. First, a linguist named George Zipf formulated “Zipf’s law”. Second, the principle of least effort also helps support using small lexicons.

6. Zipf's Law in action

Zipf's law states that given some text, the frequency of any single word is inversely proportional to its rank in the frequency table. More simply, if you counted up the word frequencies in a passage, the second most frequent word would appear about half as often, fifty percent, as the first. That is one over the word's place on the list, or one over two. Similarly, the third most frequent word would appear about one third as often as the first, and so on. One over three because it's the third term in the list. Zipf's law can be observed outside of language too. Here is a table of US city populations. Zipf's law can be observed in our language, the way we settle cities and sometimes even in industry market shares.
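The "one over the rank" arithmetic can be sketched in a few lines of plain Python. The function name `zipf_expected_counts` and the top count of 1200 are assumptions chosen for illustration; they do not come from the course data.

```python
def zipf_expected_counts(top_count, n_ranks):
    """Expected count at each rank if frequency equals top_count / rank."""
    return [top_count / rank for rank in range(1, n_ranks + 1)]


# Hypothetical passage whose most frequent word appears 1200 times.
expected = zipf_expected_counts(top_count=1200, n_ranks=4)
for rank, count in enumerate(expected, start=1):
    print(rank, count)
# 1 1200.0
# 2 600.0
# 3 400.0
# 4 300.0
```

Real text only follows this pattern approximately, so observed counts should be read against these values as a rough guide rather than an exact prediction.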

7. Principle of Least Effort

The other principle at play here is called “the principle of least effort”, from library science. A speaker or writer doesn't want to exert a lot of effort when communicating, while the audience doesn't want to spend a lot of energy interpreting. So the word choice, or lexical diversity, becomes limited. What does all this mean? People are lazy…at least when it comes to expressing themselves. They may know tens of thousands of words but really only consistently use a few to express meaning. This works out well for a subjectivity lexicon because it means the word lists don't have to be so long.
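As a rough back-of-the-envelope check on how concentrated word usage can be, the sketch below assumes an idealized Zipf distribution (frequency exactly proportional to one over rank) and asks what share of all word occurrences the top-ranked words account for. The function names and the 3,000 and 50,000 figures are assumptions that echo the "few thousand" and "fifty thousand" numbers mentioned earlier. Note that a subjectivity lexicon is not simply the most frequent words, so this illustrates the concentration of usage rather than the coverage of any real lexicon.

```python
def harmonic(n):
    """Sum of 1/rank for ranks 1..n (the n-th harmonic number)."""
    return sum(1.0 / rank for rank in range(1, n + 1))


def zipf_coverage(top_k, vocab_size):
    """Share of word occurrences taken by the top_k ranks under ideal Zipf."""
    return harmonic(top_k) / harmonic(vocab_size)


# Hypothetical speaker: knows 50,000 words, leans on the top 3,000.
share = zipf_coverage(top_k=3000, vocab_size=50000)
print(f"{share:.0%} of word occurrences come from the top 3,000 words")
```

Under these idealized assumptions the top 3,000 of 50,000 words account for roughly three quarters of everything said, which is the intuition behind relying on comparatively short word lists.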

8. Up next...

In this section you will create a visual demonstrating Zipf’s law on three million tweets and then apply qdap’s polarity() function to some actual text to see a subjectivity lexicon in action.

9. Let's practice!