Exercise

Matching a specific code point or code groups

Things get tricky when a character can be specified in two ways. For example, è, an e with a grave accent, can be written either as the single code point \u00e8 or as the letter \u0065 followed by the combining grave accent \u0300. They look the same:

x <- c("\u00e8", "\u0065\u0300")
writeLines(x)

But specifying the single code point matches only that version:

str_view(x, "\u00e8")
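
To make the mismatch concrete without relying on the viewer, here is a minimal sketch that checks the two strings programmatically. It assumes stringr is installed; the variable names n_code_points and has_single are illustrative only.

```r
library(stringr)

x <- c("\u00e8", "\u0065\u0300")

# Both elements render as "è", but the first is one code point
# and the second is two (a letter plus a combining accent).
n_code_points <- str_length(x)

# A pattern containing only U+00E8 finds the precomposed form,
# not the letter-plus-accent form.
has_single <- str_detect(x, "\u00e8")

print(n_code_points)
print(has_single)
```

The pattern is compared code point by code point, so two strings that look identical on screen can still fail to match each other.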

The stringi package that stringr is built on contains functions for converting between the two forms. stri_trans_nfc() composes characters with combining accents into a single character. stri_trans_nfd() decomposes characters with accents into separate letter and accent characters. You can see how the two forms differ by looking at the hexadecimal codes of their code points.

as.hexmode(utf8ToInt(stri_trans_nfd("\u00e8")))
as.hexmode(utf8ToInt(stri_trans_nfc("\u0065\u0300")))
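
One way to use these functions, sketched here under the assumption that both stringi and stringr are installed, is to normalize the input to a single form before matching, so that either spelling of è is found. The names x_nfc, x_nfd, match_nfc and match_nfd are illustrative only.

```r
library(stringi)
library(stringr)

x <- c("\u00e8", "\u0065\u0300")

# Compose to NFC: both elements now use the single code point U+00E8.
x_nfc <- stri_trans_nfc(x)
match_nfc <- str_detect(x_nfc, "\u00e8")

# Decompose to NFD: both elements are now "e" plus U+0300,
# so the pattern has to be written in decomposed form too.
x_nfd <- stri_trans_nfd(x)
match_nfd <- str_detect(x_nfd, "\u0065\u0300")

print(match_nfc)
print(match_nfd)
```

Which form to normalize to is a design choice: NFC keeps strings short, while NFD exposes the accents as separate characters that can be matched on their own.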

In Unicode, accent characters have the Diacritic property, and you can match any character with that property using the rebus value UP_DIACRITIC.

Vietnamese makes heavy use of diacritics to denote the tones in its words. In this exercise, you'll manipulate the diacritics in the names of Vietnamese rulers.

Instructions

Names of rulers from the 18th-century Vietnamese Tây Sơn dynasty are shown in the script.

  • tay_son_builtin has the accents built into each letter. Run the code that defines and prints this variable.
  • Call stri_trans_nfd() to decompose the letters with accents into separate letter and accent characters, and assign the result to tay_son_separate.
  • Print tay_son_separate to verify the names still display the same way.
  • View all the accents by calling str_view_all() and matching UP_DIACRITIC. The match is shown after the letter that the diacritic belongs to.
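
The same workflow can be sketched on a stand-in string. The variable example_builtin below is hypothetical and holds only the dynasty name, not the ruler names from the exercise script. The pattern "\\p{Diacritic}" is assumed here to be equivalent to the rebus value UP_DIACRITIC, and str_count() is used in place of the viewer so the result can be checked as a number.

```r
library(stringi)
library(stringr)

# Hypothetical stand-in for tay_son_builtin: "Tây Sơn" with
# the accents built into the letters.
example_builtin <- "T\u00e2y S\u01a1n"

# Decompose into separate letter and accent characters.
example_separate <- stri_trans_nfd(example_builtin)

# The decomposed string still displays the same way.
writeLines(example_separate)

# Assumed equivalent of rebus's UP_DIACRITIC.
diacritic <- "\\p{Diacritic}"

# Count standalone accent characters before and after decomposing.
n_builtin <- str_count(example_builtin, diacritic)
n_separate <- str_count(example_separate, diacritic)

print(n_builtin)
print(n_separate)
```

The accents can only be matched as separate characters after decomposition, which is why the exercise calls stri_trans_nfd() first.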