Matching a specific code point or code groups

Things get tricky when a character can be specified in two ways. For example, è, an e with a grave accent, can be specified either as the single code point \u00e8 or as the combination of \u0065 (a plain e) and the combining grave accent \u0300. Both forms look the same when printed:

x <- c("\u00e8", "\u0065\u0300")
writeLines(x)

But specifying the single code point matches only that version:

library(stringr)
str_view(x, "\u00e8")
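To make the difference visible without relying on how the strings are rendered, here is a minimal base R sketch. It redefines x as above; the names n_points, points, and found are illustrative only and not part of the exercise.

```r
# Two spellings of "è": precomposed vs. letter plus combining accent
x <- c("\u00e8", "\u0065\u0300")

# Number of code points in each string
n_points <- nchar(x, type = "chars")

# The underlying code points, as integers
points <- lapply(x, utf8ToInt)

# A literal (fixed) search for the single code point
found <- grepl("\u00e8", x, fixed = TRUE)

print(n_points)  # 1 2
print(points)    # 232, then 101 768
print(found)     # TRUE FALSE
```

The second string contains two code points, so a search for \u00e8 alone cannot find it.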

The stringi package, which stringr is built on, contains functions for converting between the two forms. stri_trans_nfc() composes a letter and its combining accents into a single character. stri_trans_nfd() decomposes accented characters into separate letter and accent characters. You can see how the two forms differ by looking at their hexadecimal code points.

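library(stringi)

# Decomposed form: letter followed by combining accent
as.hexmode(utf8ToInt(stri_trans_nfd("\u00e8")))
# Composed form: a single code point
as.hexmode(utf8ToInt(stri_trans_nfc("\u0065\u0300")))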
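One practical consequence is that normalizing both spellings to the same form before matching makes them behave identically. The sketch below assumes stringi is installed; x_nfc, x_nfd, and found_nfc are illustrative names, not part of the exercise.

```r
library(stringi)

# Precomposed and decomposed spellings of the same character
x <- c("\u00e8", "\u0065\u0300")

# Normalize every element to the composed (NFC) and decomposed (NFD) forms
x_nfc <- stri_trans_nfc(x)
x_nfd <- stri_trans_nfd(x)

# Code points per string after normalization
print(stri_length(x_nfc))  # 1 1
print(stri_length(x_nfd))  # 2 2

# After NFC normalization, the single code point matches both elements
found_nfc <- stri_detect_fixed(x_nfc, "\u00e8")
print(found_nfc)           # TRUE TRUE
```

Which form to normalize to depends on the task: the composed form suits matching whole accented letters, while the decomposed form exposes the accents as separate characters.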

In Unicode, accents have the Diacritic Unicode property, and you can match any character with that property using the rebus constant UP_DIACRITIC.
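As a rough illustration of property matching, the sketch below uses stringi directly rather than rebus. It assumes that the regular expression \p{Diacritic} is equivalent to what UP_DIACRITIC expands to; that equivalence is an assumption here, not something stated in the exercise. The variable names are hypothetical.

```r
library(stringi)

# A name with accents built into the letters
name_builtin <- "Nguy\u1ec5n"

# The same name with accents split out as combining characters
name_separate <- stri_trans_nfd(name_builtin)

# Assumed equivalent of rebus's UP_DIACRITIC, written as a raw ICU regex
diacritic_pattern <- "\\p{Diacritic}"

# Count the characters that have the Diacritic property in each form
n_builtin <- stri_count_regex(name_builtin, diacritic_pattern)
n_separate <- stri_count_regex(name_separate, diacritic_pattern)
print(n_builtin)   # 0
print(n_separate)  # 2

# Extract the matched accents and show their code points
marks <- stri_extract_all_regex(name_separate, diacritic_pattern)[[1]]
print(as.hexmode(utf8ToInt(paste(marks, collapse = ""))))
```

The accents only become matchable as separate characters after decomposition, which is why the exercise decomposes the names before matching.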

Vietnamese makes heavy use of diacritics to denote the tones in its words. In this exercise, you'll manipulate the diacritics in the names of Vietnamese rulers.

This exercise is part of the course

String Manipulation with stringr in R


Exercise instructions

The names of rulers from the 18th-century Vietnamese Tây Sơn dynasty are shown in the script.

tay_son_builtin has the accents built into each letter. Run the code that defines and prints this variable.
Call stri_trans_nfd() to decompose the letters with accents into separate letter and accent characters, and assign the result to tay_son_separate.
Print tay_son_separate to verify the names still display the same way.
View all the accents by calling str_view_all() and matching UP_DIACRITIC. The match is shown after the letter that the diacritic belongs to.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Names with builtin accents
(tay_son_builtin <- c(
  "Nguy\u1ec5n Nh\u1ea1c", 
  "Nguy\u1ec5n Hu\u1ec7",
  "Nguy\u1ec5n Quang To\u1ea3n"
))

# Convert to separate accents
tay_son_separate <- ___

# Verify that the string prints the same
tay_son_separate

# Match all accents
str_view_all(tay_son_separate, ___)
