Exercise

Matching a single grapheme

A related problem is matching a single character. You've used ANY_CHAR to do this up until now, but it will only match a character represented by a single code point. Take these three names:

x <- c("Adele", "Ad\u00e8le", "Ad\u0065\u0300le")
writeLines(x)

They look the same, but this regular expression only matches two of them:

str_view(x, "Ad" %R% ANY_CHAR %R% "le")

because in the third name, è is represented by two code points: the letter e followed by a combining grave accent. The Unicode standard has the concept of a grapheme, which represents a single displayed character but may be composed of several code points. To match any grapheme, you can use GRAPHEME.

str_view(x, "Ad" %R% GRAPHEME %R% "le")
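The difference between code points and graphemes can also be checked numerically. The following is a minimal sketch that uses stringi directly rather than rebus; it assumes that ANY_CHAR and GRAPHEME correspond to the ICU regular expressions "." and "\\X" respectively, which is not stated in this exercise. The variable names are illustrative only.

```r
library(stringi)

x <- c("Adele", "Ad\u00e8le", "Ad\u0065\u0300le")

# Number of code points in each name: the third name has one extra,
# because its accent is stored as a separate combining code point
code_points <- stri_length(x)                               # 5 5 6

# Number of graphemes (user-perceived characters) in each name
graphemes <- stri_count_boundaries(x, type = "character")   # 5 5 5

# Assumption: "." plays the role of ANY_CHAR (one code point)
matches_dot <- stri_detect_regex(x, "^Ad.le$")              # TRUE TRUE FALSE

# Assumption: "\\X" plays the role of GRAPHEME (one grapheme cluster)
matches_grapheme <- stri_detect_regex(x, "^Ad\\Xle$")       # TRUE TRUE TRUE
```

Comparing the code point count with the grapheme count is a quick way to spot strings that contain combining characters.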

Names of rulers from the Vietnamese Tây Sơn dynasty, with diacritics given as separate code points, are pre-defined as tay_son_separate.

Instructions 1/3

  • 1

    Use str_view_all() with ANY_CHAR as the pattern to view each character in tay_son_separate.

  • 2

    Do the same again with GRAPHEME as a pattern, to see the difference between characters and graphemes.

  • 3
    • Use stri_trans_nfc() to combine the diacritics with their associated characters, storing the result as tay_son_builtin.
    • Use str_view_all() to view each grapheme in tay_son_builtin.
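
As a rough illustration of what the normalization in step 3 does, the sketch below applies stri_trans_nfc() to the third name from the introduction instead of tay_son_separate, whose contents are not shown here. The variable names separate and builtin are hypothetical.

```r
library(stringi)

# "e" followed by a combining grave accent: two code points, one grapheme
separate <- "Ad\u0065\u0300le"

# NFC normalization composes the pair into the single code point U+00E8
builtin <- stri_trans_nfc(separate)

n_before <- stri_length(separate)         # 6 code points
n_after <- stri_length(builtin)           # 5 code points
same_string <- builtin == "Ad\u00e8le"    # TRUE
```

After normalization each accented letter in this example is a single code point, so a pattern that matches one code point at a time and a pattern that matches one grapheme at a time split the string in the same places.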