Matching a single grapheme
A related problem is matching a single character. You've used ANY_CHAR to do this up until now, but it will only match a character represented by a single code point. Take these three names:
x <- c("Adele", "Ad\u00e8le", "Ad\u0065\u0300le")
writeLines(x)
They look the same, but this regular expression only matches two of them:
str_view(x, "Ad" %R% ANY_CHAR %R% "le")
because in the third name è is represented by two code points: the letter e followed by a combining grave accent. The Unicode standard has the concept of a grapheme, which represents a single display character but may be composed of several code points. To match any grapheme you can use GRAPHEME.
str_view(x, "Ad" %R% GRAPHEME %R% "le")
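The difference between code points and graphemes can also be checked numerically. The following is a minimal sketch, assuming the stringi package is installed (stringr is built on top of it). It also assumes that ANY_CHAR and GRAPHEME correspond to the raw patterns "." and "\\X"; that mapping is my reading of rebus, not something stated above.

```r
library(stringi)

x <- c("Adele", "Ad\u00e8le", "Ad\u0065\u0300le")

# Number of code points in each name
code_points <- stri_length(x)

# Number of graphemes (display characters) in each name
graphemes <- stri_count_boundaries(x, type = "character")

# "." consumes exactly one code point; "\\X" consumes one whole grapheme
dot_match <- stri_detect_regex(x, "^Ad.le$")
grapheme_match <- stri_detect_regex(x, "^Ad\\Xle$")

# Put the four results side by side, one row per name
data.frame(code_points, graphemes, dot_match, grapheme_match)
```

The third name has one more code point than the other two but the same number of graphemes, which is why only the grapheme pattern matches all three.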
Names of rulers from the Vietnamese Tây Sơn dynasty, with diacritics given as separate combining code points, are pre-defined as tay_son_separate.
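To get a feel for what such a string looks like, here is a small sketch on a stand-in value. The string tay_son_demo is hypothetical and written by hand with combining marks; it is not the pre-defined tay_son_separate, whose exact contents are not shown here. It again assumes stringi is available and uses the raw patterns "." and "\\X".

```r
library(stringi)

# Hypothetical stand-in: "Tây Sơn" with each diacritic as its own code point
# (U+0302 combining circumflex, U+031B combining horn)
tay_son_demo <- "Ta\u0302y So\u031bn"

# Splitting with "." yields one piece per code point,
# so the diacritics are separated from their base letters
by_code_point <- stri_extract_all_regex(tay_son_demo, ".")[[1]]

# Splitting with "\\X" keeps each base letter together with its diacritic
by_grapheme <- stri_extract_all_regex(tay_son_demo, "\\X")[[1]]

length(by_code_point)
length(by_grapheme)
```

The two lengths differ by the number of combining marks in the string, here two.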
This exercise is part of the course
String Manipulation with stringr in R
Have a go at this exercise by completing this sample code.
# tay_son_separate has been pre-defined
tay_son_separate
# View all the characters in tay_son_separate
___