More than words: tokenization (2)
The tidytext package lets you analyze text data using "tidyverse" packages such as dplyr and sparklyr. How to do sentiment analysis is beyond the scope of this course; you can see more in the Sentiment Analysis. This exercise is designed to give you a quick taste of how to do it on Spark.
Sentiment analysis essentially lets you assign a score or emotion to each word. For example, in the AFINN lexicon, the word "outstanding" has a score of +5, since it is almost always used in a positive context. "grace" is a slightly positive word, and has a score of +1. "fraud" is usually used in a negative context, and has a score of -4. The AFINN scores dataset is returned by get_sentiments("afinn"). For convenience, the unnested word data and the sentiment lexicon have been copied to Spark.
Typically, you want to compare the sentiment of several groups of data. To do this, the code pattern is as follows.
text_data %>%
  inner_join(sentiments, by = "word") %>%
  group_by(some_group) %>%
  summarize(positivity = sum(score))
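To make the pattern concrete, here is a minimal sketch that runs on small local tibbles rather than Spark tables. The data is made up for illustration: text_data, some_group and the stand-in sentiments table are assumptions, and only the three scores quoted above are taken from the AFINN lexicon.

```r
library(dplyr)

# Hypothetical word-level data: one row per (group, word) pair
text_data <- tibble(
  some_group = c("a", "a", "a", "b", "b"),
  word       = c("outstanding", "grace", "the", "fraud", "grace")
)

# Tiny stand-in for the lexicon, using the three AFINN scores quoted above
sentiments <- tibble(
  word  = c("outstanding", "grace", "fraud"),
  score = c(5, 1, -4)
)

positivity_by_group <- text_data %>%
  # "the" has no entry in the lexicon, so the inner join drops that row
  inner_join(sentiments, by = "word") %>%
  group_by(some_group) %>%
  summarize(positivity = sum(score))

positivity_by_group
```

Group "a" ends up with a positivity of 6 (5 + 1) and group "b" with -3 (-4 + 1). The same verbs should translate to Spark when the inputs are Spark tibbles, but this sketch only exercises the local dplyr behavior.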
An inner join takes all the values from the first table, and looks for matches in the second table. If it finds a match, it adds the data from the second table. Unlike a left join, it will drop any rows where it doesn't find a match. The principle is shown in this diagram.
Like left joins, inner joins are a type of mutating join, since they add columns to the first table. See if you can guess which function to use for inner joins, and how to use it. (Hint: the usage is really similar to left_join(), anti_join(), and semi_join()!)
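The difference between the two mutating joins can be seen on a toy example. This is a sketch on local tibbles with made-up names (words, lexicon); it is not the exercise data.

```r
library(dplyr)

# Hypothetical first table: three words, one of which has no score
words <- tibble(word = c("outstanding", "the", "fraud"))

# Hypothetical second table: scores for only two of those words
lexicon <- tibble(
  word  = c("outstanding", "fraud"),
  score = c(5, -4)
)

# Left join keeps every row of the first table; "the" gets score = NA
left_joined <- left_join(words, lexicon, by = "word")

# Inner join keeps only rows with a match; "the" is dropped
inner_joined <- inner_join(words, lexicon, by = "word")

left_joined
inner_joined
```

Both results gain the score column, which is what makes these joins "mutating"; they differ only in whether unmatched rows survive.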
This exercise is part of the course
Introduction to Spark with sparklyr in R
Exercise instructions
A Spark connection has been created for you as spark_conn. Tibbles attached to the title words and sentiment lexicon stored in Spark have been pre-defined as title_text_tbl and afinn_sentiments_tbl respectively.
- Create a variable named sentimental_artists from title_text_tbl.
  - Use inner_join() to join afinn_sentiments_tbl to title_text_tbl by "word".
  - Group by the artist_name.
  - Summarize to define a variable positivity, equal to the sum of the score field.
- Find the top 5 artists with the most negative song titles.
  - Arrange the sentimental_artists by ascending positivity.
  - Use slice_max to get the top 5 results.
- Find the top 5 artists with the most positive song titles.
  - Arrange the sentimental_artists by descending positivity.
  - Get the top 5 results.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# title_text_tbl, afinn_sentiments_tbl have been pre-defined
title_text_tbl
afinn_sentiments_tbl

sentimental_artists <- title_text_tbl %>%
  # Inner join with sentiments on word field
  ___ %>%
  # Group by artist
  ___ %>%
  # Summarize to get positivity
  ___

sentimental_artists %>%
  # Arrange by ascending positivity
  ___ %>%
  # Get top 5
  ___

sentimental_artists %>%
  # Arrange by descending positivity
  ___ %>%
  # Get top 5
  ___