More than words: tokenization (1)

Common uses of text-mining include analyzing shopping reviews to ascertain purchasers' feeling about the product, or analyzing financial news to predict the sentiment regarding stock prices. In order to analyze text data, common pre-processing steps are to convert the text to lower-case (see tolower()), and to split sentences into individual words.

ft_tokenizer() performs both these steps. Its usage takes the same pattern as the other transformations that you have seen, with no other arguments.

shop_reviews %>%
  ft_tokenizer("review_text", "review_words")

Since the output can contain a different number of words in each row, output.col is a list column, where every element is a list of strings. To analyze text data, it is usually preferable to have one word per row in the data. The list-of-list-of-strings format can be transformed to a single character vector using unnest() from the tidyr package. There is currently no method for unnesting data on Spark, so for now, you have to collect it to R before transforming it. The code pattern to achieve this is as follows.

library(tidyr)
text_data %>%
  ft_tokenizer("sentences", "word") %>%
  collect() %>%
  mutate(word = lapply(word, as.character)) %>%
  unnest(word)

If you want to learn more about using the tidyr package, take the Cleaning Data in R course.

This exercise is part of the course

Introduction to Spark with sparklyr in R

View Course

Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

Create a variable named title_text from track_metadata_tbl.
- Select the artist_name and title fields.
- Use ft_tokenizer() to create a new field, word, which contains the title split into words.
- Collect the result.
- Mutate the word column, flattening it to a list of character vectors using lapply and as.character.
- Use unnest() to flatten the list column, and get one word per row.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# track_metadata_tbl has been pre-defined
track_metadata_tbl

title_text <- track_metadata_tbl %>%
  # Select artist_name, title
  ___ %>%
  # Tokenize title to words
  ___ %>%
  # Collect the result
  ___ %>%
  # Flatten the word column 
  ___ %>% 
  # Unnest the list column
  ___

Edit and Run Code