
Intro to word clouds

1. Intro to word clouds

A plain frequency plot isn't always visually engaging. A more popular way to show word frequencies is a word cloud.

2. A simple word cloud

In a simple word cloud, the size of each word is relative to its frequency. Word clouds can be more complex though, as we'll show later. For this simple word cloud, we call as.matrix and then rowSums on the term-document matrix (TDM) built from a corpus to get a summed term frequency vector as a foundation. Next, you create a data frame with two columns. The first column holds the terms, which you get by calling names on the term frequency object. Remember, in this case the names are just the terms of the TDM. The second column is the summed vector, term_frequency. Both columns are passed to the wordcloud function along with the maximum number of words, 100, and the color of the word cloud, "red".
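The steps just described can be sketched in R as follows. This is a minimal illustration rather than the course's exact code: it assumes the tm and wordcloud packages are installed, the three tweets are made up, and the column names term and num are illustrative choices.

```r
# A minimal sketch, assuming the tm and wordcloud packages are installed.
library(tm)
library(wordcloud)

# Made-up tweets, for illustration only
tweets <- c(
  "chardonnay with soul music",
  "a glass of chardonnay",
  "chardonnay and jazz tonight"
)

# Build a corpus and a term-document matrix (rows = terms, columns = tweets)
corpus <- VCorpus(VectorSource(tweets))
tdm <- TermDocumentMatrix(corpus)

# Sum each row to get one frequency per term
term_frequency <- rowSums(as.matrix(tdm))

# First column: the terms (the names of the vector); second column: the counts
word_freqs <- data.frame(
  term = names(term_frequency),
  num = term_frequency,
  stringsAsFactors = FALSE
)

# min.freq = 1 is set only because these toy counts are tiny (the default is 3)
wordcloud(word_freqs$term, word_freqs$num,
          min.freq = 1, max.words = 100, colors = "red")
```

With real data you would likely leave min.freq at its default or raise it, so that rare terms don't clutter the plot.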

3. The impact of stop words

It's important to carefully select preprocessing steps and stop words when working on your text. For example, it's very difficult to capture proper nouns if you make everything lower case, so you may need to adjust the clean_corpus function beyond removePunctuation, stripWhitespace, removeNumbers, and tolower. Equally important is choosing which stop words to remove. This word cloud was constructed with a clean_corpus custom function that doesn't remove "chardonnay". Since all 1000 tweets mention chardonnay, it's not surprising that it's the largest word. That's a problem because it masks any underlying insight!
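To make the problem concrete, here is a hedged sketch of what a baseline clean_corpus function with those four steps might look like. The function body and the toy tweets are assumptions for illustration, not the course's actual definition. Because the search term is never removed, it tops the frequency ranking.

```r
# A minimal sketch, assuming the tm package is installed.
library(tm)

# Hypothetical baseline clean_corpus() using the four steps named above
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  # tolower is a base R function, so it is wrapped in content_transformer()
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Run last to collapse the gaps left by the removals above
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Made-up tweets, for illustration only
tweets <- c(
  "Chardonnay, 2 glasses!",
  "Soul music and chardonnay",
  "More Chardonnay tonight"
)

cleaned <- clean_corpus(VCorpus(VectorSource(tweets)))
term_frequency <- rowSums(as.matrix(TermDocumentMatrix(cleaned)))

# Rank terms from most to least frequent; the search term dominates
top_terms <- sort(term_frequency, decreasing = TRUE)
print(head(top_terms, 3))
```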

4. Removing uninformative words

So you have to change the function so that the words you expect, like "chardonnay", "wine" and "glass", are among the words passed to removeWords. This allows some unexpected words to surface in the visual. It looks like some people enjoy soul music while having a glass of wine.
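One way to make that change is sketched below. The extra_stops argument name, the use of stopwords("en") as the base stop word list, and the toy tweets are all assumptions for illustration. Lowercasing runs before removeWords because removeWords matches case-sensitively.

```r
# A minimal sketch, assuming the tm package is installed.
library(tm)

# Hypothetical clean_corpus() that also drops the expected, uninformative terms
clean_corpus <- function(corpus,
                         extra_stops = c("chardonnay", "wine", "glass")) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Combine a standard English stop word list with the custom words
  corpus <- tm_map(corpus, removeWords,
                   words = c(stopwords("en"), extra_stops))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Made-up tweets, for illustration only
tweets <- c(
  "Chardonnay and soul music tonight",
  "A glass of chardonnay wine, soul playing"
)

cleaned <- clean_corpus(VCorpus(VectorSource(tweets)))
term_frequency <- rowSums(as.matrix(TermDocumentMatrix(cleaned)))

# The expected words are gone, so less obvious terms can surface
print(sort(term_frequency, decreasing = TRUE))
```

Passing the custom words as an argument keeps the function reusable for other search terms.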

5. Let's practice!