More than words: tokenization (1)
Common uses of text mining include analyzing shopping reviews to ascertain purchasers' feelings about a product, or analyzing financial news to predict sentiment regarding stock prices. To analyze text data, common pre-processing steps are to convert the text to lower-case (see tolower()) and to split sentences into individual words.
ft_tokenizer() performs both of these steps. Its usage follows the same pattern as the other transformations that you have seen, with no other arguments.
shop_reviews %>%
  ft_tokenizer("review_text", "review_words")
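To make the two steps concrete, here is a rough local analogue in base R. It is only a sketch: the review_text vector is hypothetical sample data, and splitting on runs of whitespace is an assumption about how the tokenizer behaves, not a description of Spark's implementation.

```r
# Hypothetical review text, standing in for a text column of shop_reviews
review_text <- c("Great value for MONEY", "Would buy again")

# Step 1: convert the text to lower-case
lower <- tolower(review_text)

# Step 2: split each sentence into words on runs of whitespace (assumed rule)
review_words <- strsplit(lower, "\\s+")

# The result is a list with one character vector per input row,
# and the vectors can differ in length
word_counts <- lengths(review_words)
```

Because each row can yield a different number of words, the result has to be a list rather than a plain character vector.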
Since the output can contain a different number of words in each row, output.col is a list column, where every element is a list of strings. To analyze text data, it is usually preferable to have one word per row in the data. The list-of-list-of-strings format can be transformed to a single character vector using unnest() from the tidyr package. There is currently no method for unnesting data on Spark, so for now, you have to collect it to R before transforming it. The code pattern to achieve this is as follows.
library(tidyr)
text_data %>%
  ft_tokenizer("sentences", "word") %>%
  collect() %>%
  mutate(word = lapply(word, as.character)) %>%
  unnest(word)
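The unnesting step can be tried locally without a Spark connection. The sketch below assumes a small hypothetical tibble named reviews whose word column is already a list of character vectors, which is the shape the mutate() step is meant to produce after collecting.

```r
library(tibble)
library(tidyr)

# Hypothetical local data standing in for a collected, tokenized result
reviews <- tibble(
  id = 1:2,
  word = list(c("great", "value"), c("would", "buy", "again"))
)

# Expand the list column so that each word gets its own row;
# the id value is repeated for every word from the same review
one_word_per_row <- unnest(reviews, word)

n_words <- nrow(one_word_per_row)
```

The other columns are repeated alongside each word, so every word can still be traced back to the row it came from.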
If you want to learn more about using the tidyr package, take the Cleaning Data in R course.
This exercise is part of the course
Introduction to Spark with sparklyr in R
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.
- Create a variable named title_text from track_metadata_tbl.
  - Select the artist_name and title fields.
  - Use ft_tokenizer() to create a new field, word, which contains the title split into words.
  - Collect the result.
  - Mutate the word column, flattening it to a list of character vectors using lapply and as.character.
  - Use unnest() to flatten the list column, and get one word per row.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
track_metadata_tbl
title_text <- track_metadata_tbl %>%
  # Select artist_name, title
  ___ %>%
  # Tokenize title to words
  ___ %>%
  # Collect the result
  ___ %>%
  # Flatten the word column
  ___ %>%
  # Unnest the list column
  ___