More than words: tokenization (2)

The tidytext package lets you analyze text data using "tidyverse" packages such as dplyr and sparklyr. A full treatment of sentiment analysis is beyond the scope of this course; you can learn more in a dedicated sentiment analysis course. This exercise is designed to give you a quick taste of how to do it on Spark.

Sentiment analysis essentially lets you assign a score or emotion to each word. For example, in the AFINN lexicon, the word "outstanding" has a score of +5, since it is almost always used in a positive context. "grace" is a slightly positive word, and has a score of +1. "fraud" is usually used in a negative context, and has a score of -4. The AFINN scores dataset is returned by get_sentiments("afinn"). For convenience, the unnested word data and the sentiment lexicon have been copied to Spark.
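As a rough illustration of word-level scoring, the sketch below builds a tiny stand-in for the lexicon by hand, using only the three words and scores quoted above. The names afinn_sample and title_words are hypothetical, and the data lives in local tibbles rather than in Spark. The column is called score to match this exercise; some tidytext versions name it value instead.

```r
library(dplyr)

# Hand-built stand-in for the AFINN lexicon (assumption: only the three
# words quoted in the text, not the real lexicon).
afinn_sample <- tibble(
  word  = c("outstanding", "grace", "fraud"),
  score = c(5, 1, -4)
)

# Hypothetical tokenized title, one word per row.
title_words <- tibble(word = c("outstanding", "fraud", "the"))

# Look up each word's score; "the" has no lexicon entry, so the
# inner join drops it.
scored <- inner_join(title_words, afinn_sample, by = "word")

# Overall score of the title: the sum of the matched word scores.
total_score <- sum(scored$score)
```

Words that are missing from the lexicon contribute nothing to the total, which is why an inner join is a natural fit here.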

Typically, you want to compare the sentiment of several groups of data. To do this, the code pattern is as follows.

text_data %>%
  inner_join(sentiments, by = "word") %>%
  group_by(some_group) %>%
  summarize(positivity = sum(score))

An inner join takes all the values from the first table, and looks for matches in the second table. If it finds a match, it adds the data from the second table. Unlike a left join, it will drop any rows where it doesn't find a match. The principle is shown in this diagram.
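The difference between the two joins can be sketched on small local tibbles. The table names and contents below are made up for illustration; the same dplyr verbs apply to Spark tibbles.

```r
library(dplyr)

# Hypothetical tokenized titles: one row per word.
words <- tibble(
  artist_name = c("A", "A", "B"),
  word        = c("grace", "river", "fraud")
)

# Hypothetical two-word lexicon; "river" is deliberately missing.
lexicon <- tibble(
  word  = c("grace", "fraud"),
  score = c(1, -4)
)

# Inner join: keeps only the rows of `words` that have a match in `lexicon`.
inner <- inner_join(words, lexicon, by = "word")

# Left join: keeps every row of `words`; unmatched rows get NA for `score`.
left <- left_join(words, lexicon, by = "word")
```

Here the inner join returns two rows, while the left join returns all three and leaves the score for "river" as NA.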

Figure: An inner join, explained using a table of colors.

Like left joins, inner joins are a type of mutating join, since they add columns to the first table. See if you can guess which function to use for inner joins, and how to use it. (Hint: the usage is really similar to left_join(), anti_join(), and semi_join()!)

A Spark connection has been created for you as spark_conn. Tibbles attached to the title words and sentiment lexicon stored in Spark have been pre-defined as title_text_tbl and afinn_sentiments_tbl respectively.

Create a variable named sentimental_artists from title_text_tbl.
- Use inner_join() to join afinn_sentiments_tbl to title_text_tbl by "word".
- Group by the artist_name.
- Summarize to define a variable positivity, equal to the sum of the score field.
Find the top 5 artists with the most negative song titles.
- Arrange the sentimental_artists by ascending positivity.
- Use head() to keep the first 5 rows.
Find the top 5 artists with the most positive song titles.
- Arrange the sentimental_artists by descending positivity.
- Get the top 5 results.
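The steps above can be sketched on local data as follows. This is a minimal illustration, not the exercise solution as it would run on the course's Spark tables: title_text and afinn_sentiments are small hypothetical tibbles standing in for title_text_tbl and afinn_sentiments_tbl, and head() is used as one of several ways to take the first rows of a sorted table (only 2 rows here, because the toy data has just three artists).

```r
library(dplyr)

# Hypothetical stand-in for title_text_tbl: one row per title word.
title_text <- tibble(
  artist_name = c("Artist A", "Artist A", "Artist B", "Artist B", "Artist C"),
  word        = c("outstanding", "grace", "fraud", "grace", "fraud")
)

# Hypothetical stand-in for afinn_sentiments_tbl, using the scores
# quoted earlier in the text.
afinn_sentiments <- tibble(
  word  = c("outstanding", "grace", "fraud"),
  score = c(5, 1, -4)
)

# Join the lexicon to the words, then total the scores per artist.
sentimental_artists <- title_text %>%
  inner_join(afinn_sentiments, by = "word") %>%
  group_by(artist_name) %>%
  summarize(positivity = sum(score))

# Most negative titles: sort ascending, keep the first rows.
most_negative <- sentimental_artists %>%
  arrange(positivity) %>%
  head(2)

# Most positive titles: sort descending, keep the first rows.
most_positive <- sentimental_artists %>%
  arrange(desc(positivity)) %>%
  head(2)
```

With real data, replacing 2 with 5 gives the top 5 in each direction.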
