More than words: tokenization (3)
`ft_tokenizer()` uses a simple technique to generate words by splitting text data on spaces. For more advanced usage, you can split the text data with regular expressions. This is done via the `ft_regex_tokenizer()` function, which has the same usage as `ft_tokenizer()` but takes an extra `pattern` argument for the splitter.
```r
a_tibble %>%
  ft_regex_tokenizer("x", "y", pattern = regex_pattern)
```
The return value from `ft_regex_tokenizer()`, like that of `ft_tokenizer()`, is a list of lists of character vectors.

The dataset contains a field named `artist_mbid` that holds an ID for the artist on MusicBrainz, a music metadata encyclopedia website. The IDs take the form of hexadecimal numbers separated by hyphens, for example, `65b785d9-499f-48e6-9063-3a1fd1bd488d`.
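To make that return shape concrete, here is a minimal sketch. It assumes a local Spark installation is available, and the tibble name `a_tibble` and column names `x` and `y` are placeholders, not part of the course dataset:

```r
library(sparklyr)
library(dplyr)

# Start a local Spark session (assumes Spark is installed locally).
sc <- spark_connect(master = "local")

# A small example tibble with a hyphen-separated string column.
a_tibble <- copy_to(sc, dplyr::tibble(x = c("a-b-c", "d-e")), "a_tibble")

a_tibble %>%
  ft_regex_tokenizer("x", "y", pattern = "-") %>%
  pull(y)
# y is a list column: each element is a character vector of chunks,
# e.g. "a-b-c" becomes c("a", "b", "c")
```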
This exercise is part of the course *Introduction to Spark with sparklyr in R*.

Exercise instructions
- Select the `artist_mbid` field from `track_metadata_tbl`.
- Split the MusicBrainz IDs into chunks of hexadecimal numbers.
  - Call `ft_regex_tokenizer()`.
  - The output column should be called `artist_mbid_chunks`.
  - Use a hyphen, `"-"`, for the `pattern` argument.
Interactive exercise

Try this exercise by completing the sample code below.
```r
# track_metadata_tbl has been pre-defined
track_metadata_tbl

track_metadata_tbl %>%
  # Select artist_mbid column
  select(artist_mbid) %>%
  # Split it by hyphens
  ft_regex_tokenizer("artist_mbid", "artist_mbid_chunks", pattern = "-")
```