Session Ready
Exercise

Treemaps for groups of documents

Often you will find yourself working with documents in groups, such as author, product or by company. This exercise lets you learn about the text while retaining the groups in a compact visual. For example, with customer reviews grouped by product you may want to explore multiple dimensions of the customer reviews at the same time. First you could calculate the polarity() of the reviews. Another dimension may be length. Document length can demonstrate the emotional intensity. If a customer leaves a short "great shoes!" one could infer they are actually less enthusiastic compared to a lengthier positive review. You may also want to group reviews by product type such as women's, men's and children's shoes. A treemap lets you examine all of these dimensions.

For text analysis, within a treemap each individual box represents a document such as a tweet. Documents are grouped in some manner such as author. The size of each box is determined by a numeric value such as number of words or letters. The individual colors are determined by a sentiment score.

After you organize the tibble, you use the treemap library containing the function treemap() to make the visual. The code example below declares the data, grouping variables, size, color and other aesthetics.

treemap(
  data_frame,
  index = c("group", "individual_document"),
  vSize = "doc_length",
  vColor = "avg_score",
  type = "value",
  title = "Sentiment Scores by Doc",
  palette = c("red", "white", "green")
)

The pre-loaded all_books object contains a combined tidy format corpus with 4 Shakespeare, 3 Melville and 4 Twain books. Based on the treemap you should be able to tell who writes longer books, and the polarity of the author as a whole and for individual books.

Instructions 1/3
undefined XP
  • 1
  • 2
  • 3

Calculate each book's length in a new object called book_length using count() with the book column.