1. Unicode and pattern matching
You first encountered Unicode characters back in Chapter 1, but let's refresh your memory and go into a little more detail.
2. Unicode
Unicode is a standard that associates every letter or symbol with a hexadecimal code point. For example, the lower-case
3. Unicode
letter a has the code 61, the Greek letter
4. Unicode
mu, often used as a symbol for a population mean in statistics, has the code 3BC, and the
5. Unicode
smiley face emoji has the code 1F600.
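As a quick check of the codes just quoted, here is a minimal sketch that assumes nothing beyond base R. The variable names are illustrative only, and the emoji is left out because larger code points can behave differently across platforms.

```r
# Build the Greek letter mu from its hexadecimal code point, 3BC
mu <- intToUtf8(0x3BC)

# utf8ToInt() goes the other way and returns the code point as a decimal integer
mu_code <- utf8ToInt(mu)   # 956 in decimal
a_code  <- utf8ToInt("a")  # 97 in decimal

# Formatting as hexadecimal recovers the codes quoted above (in lower case)
mu_hex <- format(as.hexmode(mu_code))  # "3bc"
a_hex  <- format(as.hexmode(a_code))   # "61"
```

The decimal and hexadecimal values are the same number written in two bases, which is why 97 and 61 both identify the letter a.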
6. Unicode in R
In R, you can put a Unicode character in a string using either a backslash and a lower-case u followed by a code of up to four digits, or a backslash and an upper-case U followed by a code of up to eight digits. If a character's code has more than four digits, you have to use the upper-case U variant. In particular, most emojis have five-digit codes, so you'll need the upper-case U version to give yourself a round of applause. There is one caution: if you are on Windows, Unicode code points with more than four digits aren't handled correctly. So, if you notice different behavior on your local system compared to DataCamp, that may be why. You can see the hexadecimal code for a character
7. Unicode in R
by combining utf8ToInt() with as.hexmode(). So what happens if you want to match a Unicode character in a regular expression? You can treat them just
8. Matching Unicode
like any other character: simply specify the code point in the pattern. For example, take a look at this sentence. To match the ? you just need to know its code point. If you need to find a code point, a Google search will often get you there. You can also try looking through the charts on the official Unicode site, or try a dedicated Unicode search tool. Unicode also has a number of ways to specify collections of characters: categories,
9. Matching Unicode groups
scripts and blocks. Any of these collections can be specified in a regular expression using \p followed by the name of the collection in curly braces. rebus provides direct access to a lot of these collections using a function with the corresponding name. For example, you might look for all characters in the Greek and Coptic block with the corresponding function. If you are curious, you might look at the other properties you can match in the rebus help pages. There are a few other tricky aspects to working with Unicode that you'll learn about in the following exercises.
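To make both kinds of match concrete, here is a minimal sketch in base R rather than rebus. The example string is hypothetical, and the collection used is the Greek script, written \p{Greek}, instead of the Greek and Coptic block. It assumes R's PCRE build supports Unicode properties, which pcre_config() should report.

```r
# Hypothetical example string containing two Greek letters (mu and sigma)
x <- "Normal(\u03BC = 0, \u03C3 = 1)"

# Match one specific character by writing its code point in the pattern
mu_hit <- regmatches(x, regexpr("\u03BC", x))

# Match a whole collection: \p{Greek} covers characters in the Greek script.
# perl = TRUE selects the PCRE engine, which understands \p{...}
greek_hits <- regmatches(x, gregexpr("\\p{Greek}", x, perl = TRUE))[[1]]

# Look up the hexadecimal code points of everything that matched
greek_codes <- format(as.hexmode(
  vapply(greek_hits, utf8ToInt, integer(1), USE.NAMES = FALSE)
))
```

Note the doubled backslash in "\\p{Greek}": R's string parser consumes one backslash, so the regular expression engine receives \p{Greek}. The "\u03BC" escape, by contrast, is resolved by R itself before the pattern reaches the engine.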
10. Let's practice!