Matching a specific code point or code groups
Matching gets tricky when a character can be specified in two ways. For example, è, an e with a grave accent, can be written either as the single code point \u00e8 or as the combination of the letter e, \u0065, and a combining grave accent, \u0300. Both forms look the same when printed:
x <- c("\u00e8", "\u0065\u0300")
writeLines(x)
However, a pattern that specifies the single code point matches only that version:
str_view(x, "\u00e8")
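The mismatch is easier to see once you look at what each string contains. The following is a minimal base R sketch, reusing the x vector from above; the helper names code_points, n_code_points and same_string are introduced here only for illustration.

```r
# The two spellings of "è" used in the text above
x <- c("\u00e8", "\u0065\u0300")

# Integer code points that make up each string
code_points <- lapply(x, utf8ToInt)
code_points

# The precomposed form holds one code point, the combining form holds two
n_code_points <- lengths(code_points)
n_code_points

# The strings are therefore not identical, even though they print alike
same_string <- identical(x[1], x[2])
same_string
```

Because the pattern "\u00e8" is a single code point, it cannot match a string that spells the same letter with two different code points.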
The stringi package, which stringr is built on, contains functions for converting between the two forms. stri_trans_nfc() composes characters with combining accents into a single character. stri_trans_nfd() decomposes characters with accents into separate letter and accent characters. You can see how the characters differ by looking at their hexadecimal codes.
as.hexmode(utf8ToInt(stri_trans_nfd("\u00e8")))
as.hexmode(utf8ToInt(stri_trans_nfc("\u0065\u0300")))
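One way to make matching consistent is to normalize every string to the same form before searching. The sketch below assumes the stringi package is installed and uses its stri_trans_isnfc(), stri_trans_nfc() and stri_detect_fixed() functions; the variable names are illustrative only.

```r
library(stringi)

# The two spellings of "è": precomposed, and letter plus combining accent
x <- c("\u00e8", "\u0065\u0300")

# Check which inputs are already in composed (NFC) form
already_nfc <- stri_trans_isnfc(x)
already_nfc

# Normalize both strings to the composed form
x_nfc <- stri_trans_nfc(x)

# A literal search for the single code point now finds both strings
found <- stri_detect_fixed(x_nfc, "\u00e8")
found
```

Normalizing to the decomposed form with stri_trans_nfd() would work equally well, provided the pattern is written in the same form as the text.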
In Unicode, an accent is a character with the Diacritic Unicode property, and you can match it using the rebus value UP_DIACRITIC.
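To illustrate what matching on this property does, here is a hedged sketch that uses stringi directly rather than rebus. It assumes that the ICU regular expression \p{Diacritic} behaves like the rebus constant; that equivalence is an assumption made for illustration, not something taken from the rebus source.

```r
library(stringi)

# Assumed ICU pattern for the Diacritic property (stand-in for UP_DIACRITIC)
diacritic <- "\\p{Diacritic}"

# Precomposed "è", then "e" followed by a combining grave accent
x <- c("\u00e8", "\u0065\u0300")

# Only the decomposed string contains a separate accent code point
n_marks <- stri_count_regex(x, diacritic)
n_marks

# After decomposing, both strings expose their accent
x_nfd <- stri_trans_nfd(x)
n_marks_nfd <- stri_count_regex(x_nfd, diacritic)
n_marks_nfd

# Removing the matched accents leaves the bare letter
stripped <- stri_replace_all_regex(x_nfd, diacritic, "")
stripped
```

This is why the text is decomposed before matching: in the precomposed form the accent is part of the letter's code point, so there is no separate diacritic to match.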
Vietnamese makes heavy use of diacritics to denote the tones in its words. In this exercise, you'll manipulate the diacritics in the names of Vietnamese rulers.
This exercise is part of the course String Manipulation with stringr in R.
Exercise instructions
Names of rulers from the 18th-century Vietnamese Tây Sơn dynasty are shown in the script.
- tay_son_builtin has the accents built into each letter. Run the code that defines and prints this variable.
- Call stri_trans_nfd() to decompose the letters with accents into separate letter and accent characters, and assign the result to tay_son_separate.
- Print tay_son_separate to verify the names still display the same way.
- View all the accents by calling str_view_all() and matching UP_DIACRITIC. The match is shown after the letter that the diacritic belongs to.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Names with builtin accents
(tay_son_builtin <- c(
"Nguy\u1ec5n Nh\u1ea1c",
"Nguy\u1ec5n Hu\u1ec7",
"Nguy\u1ec5n Quang To\u1ea3n"
))
# Convert to separate accents
tay_son_separate <- ___
# Verify that the string prints the same
tay_son_separate
# Match all accents
str_view_all(tay_son_separate, ___)