Character classes and repetitions
1. Character classes and repetitions
So far, we've looked for specific letters and numbers, like the letter "K" or the number "3". But what if we want to search for something more general, such as any letter or any number, without specifying it in advance? This is where character classes and repetitions come in.
2. Available character classes
Character classes describe different types of characters. When we work with data, we are often working with numerical data. To search for numerical data with regular expressions, we use the digit character class. It matches any single digit from zero to nine. It can be written either as a "d" with two leading backslashes ("\\d") or as "digit" enclosed in colons and square brackets ("[:digit:]"). In this course we will work primarily with the first version, as it is shorter and more commonly used across programming languages.
The next character class that we use very often is the word character. It is written as two backslashes followed by a "w" ("\\w"). It's a bit unintuitive at first: it matches not only alphabetical letters but also digits and underscores. The reason is that in many programming languages these are the characters allowed in variable names.
However, if you only want to search for letters, whether lowercase or uppercase, you can create a custom pattern by using square brackets containing capital A, a hyphen, capital Z, lowercase a, a hyphen, and lowercase z ("[A-Za-z]"). This is no longer called a character class, but it has a similar effect: a character may be any of the letters or numbers listed within the square brackets. In the same way we can create custom patterns with fewer options. Square brackets containing the letters "a, e, i, o, u" will, for example, match all lowercase vowels.
The third and last character class we will use in this lesson is the "space", written as two backslashes and an "s" ("\\s"). It matches not only spaces between words, but also tabs and line breaks.
3. A concrete example
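As a minimal sketch of the classes introduced in the previous section, the R snippet below applies each one to a short string. It assumes the stringr package is installed; the sample string and the variable names are invented for illustration and do not come from the lesson.

```r
library(stringr)

# Hypothetical sample text containing letters, digits, an underscore,
# a space, a tab and a line break.
sample_text <- "id_7 ok\t42\nZ"

# Both digit notations should select the same characters.
digits_short <- str_extract_all(sample_text, "\\d")[[1]]
digits_posix <- str_extract_all(sample_text, "[[:digit:]]")[[1]]

# The word class covers letters, digits and the underscore.
word_chars <- str_extract_all(sample_text, "\\w")[[1]]

# A custom set restricted to letters only.
letters_only <- str_extract_all(sample_text, "[A-Za-z]")[[1]]

# The space class covers the space, the tab and the line break.
whitespace <- str_extract_all(sample_text, "\\s")[[1]]

print(word_chars)
print(letters_only)
```

Note that the underscore and the digits appear in `word_chars` but not in `letters_only`, which is the difference between the word class and the custom letter set.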
Let's apply these classes and patterns to the same string and compare the results. With the function str_match_all() we match all characters that meet the current pattern. For example, when we look for the digit character class, it returns 3 and 5. All characters except the space match the word character class, but only the alphabetical letters match our custom letter pattern, and only "i" and "o" match our vowel pattern. Lastly, the space between the two words matches the space character class, and nothing else does.
4. Repetitions
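The comparison described above can be sketched roughly as follows. The lesson's actual input string is not given in the transcript, so `"Disco 35"` is a hypothetical stand-in chosen only because it produces results consistent with the description (digits 3 and 5, vowels "i" and "o", one space). The helper names are also assumptions.

```r
library(stringr)

# Hypothetical input; the string used in the lesson is not known.
example_text <- "Disco 35"

# The five patterns discussed in the lesson, named for readability.
patterns <- c(
  digit  = "\\d",
  word   = "\\w",
  letter = "[A-Za-z]",
  vowel  = "[aeiou]",
  space  = "\\s"
)

# str_match_all() returns a list with one matrix per input string;
# the first column of that matrix holds the full matches.
matches <- lapply(patterns, function(p) {
  str_match_all(example_text, p)[[1]][, 1]
})

print(matches)
```

Looping over a named vector of patterns keeps the input fixed, so any difference in the output comes from the pattern alone.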
So what if you want to search for multiple instances of the same character class? Say you want to match two word characters in a row. You could write backslash, backslash, "w", backslash, backslash, "w", but that would become tedious pretty quickly, right? That's why regular expressions also offer ways to define repetitions of character classes.
If we want to match two word characters in a row, we can use curly braces. We append an opening and a closing curly brace to our word character class, and between them we write the number of repetitions that we are looking for. If we don't know exactly how many repetitions we want to match, we can define a minimum and a maximum instead. We separate these with a comma, so "two comma three" will match two or three word characters in a row. We can even leave the maximum out. The minimum still applies, so with "two comma" we match two or more word characters in a row.
If we don't want to specify a count at all, we can use the plus sign. It matches an arbitrary number of word characters in a row, but at least one. The last special character in this lesson, the asterisk, is even more permissive: it matches any number of word characters, including zero. This can be useful, for example, if you search for three things and you're not sure whether the part in the middle is present or not.
5. Inversion of character classes
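As a recap of the repetition syntax described above, here is a small R sketch, again assuming the stringr package. The input strings and the file-name pattern are made up for illustration.

```r
library(stringr)

# Hypothetical input with runs of one to four word characters.
runs <- "a bb ccc dddd"

exactly_two  <- str_extract_all(runs, "\\w{2}")[[1]]    # exactly two
two_to_three <- str_extract_all(runs, "\\w{2,3}")[[1]]  # two or three
two_or_more  <- str_extract_all(runs, "\\w{2,}")[[1]]   # at least two
one_or_more  <- str_extract_all(runs, "\\w+")[[1]]      # at least one

# Zero or more: the part in the middle may be missing entirely.
file_names <- c("file.txt", "file2.txt", "file_old.txt")
star_match <- str_extract(file_names, "file\\w*\\.txt")
plus_match <- str_extract(file_names, "file\\w+\\.txt")

print(two_to_three)
print(star_match)
print(plus_match)
```

With the asterisk all three file names match, while the plus sign rejects `"file.txt"` (the result is `NA`) because nothing sits between "file" and the dot.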
But what if you want to match anything but the character class? This is where the negated forms of the classes come in. While backslash, backslash, lowercase "d" matches all digits, backslash, backslash, uppercase "D" matches the opposite, that is, everything but a digit. The same goes for the "w": the lowercase version matches all word characters, and uppercase "W" matches everything else. And, last but not least, it also applies to spaces: lowercase "s" matches spaces, and uppercase "S" matches everything that is not a space. For the pattern we saw before that matches the letters of the alphabet, there is also a negation. We can invert the match by adding a caret inside the square brackets, at the start. It will then match everything that is not a regular letter.
6. Custom pattern with classes
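The negated forms described above can be tried out with a sketch like the following. It assumes the stringr package; the input string is hypothetical.

```r
library(stringr)

# Hypothetical input with letters, one space and two digits.
negation_text <- "Disco 35"

non_digits  <- str_extract_all(negation_text, "\\D")[[1]]        # not a digit
non_word    <- str_extract_all(negation_text, "\\W")[[1]]        # not a word character
non_space   <- str_extract_all(negation_text, "\\S")[[1]]        # not whitespace
non_letters <- str_extract_all(negation_text, "[^A-Za-z]")[[1]]  # not a letter

print(non_digits)
print(non_letters)
```

Each uppercase class returns exactly the characters that its lowercase counterpart skips, and the caret does the same for the custom letter set.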
We can also combine the two concepts and create a custom pattern that matches, for example, all digits and all spaces. If we put the digit class and the space class inside square brackets, the pattern matches any character that is either a digit or a space.
7. Let's practice!
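Combining classes inside square brackets, as described above, might look like this in R. The sketch assumes the stringr package; the input string and variable names are hypothetical.

```r
library(stringr)

# Hypothetical input mixing letters, digits, spaces and a comma.
address <- "Room 12, floor 3"

# One set that accepts either a digit or a whitespace character.
digits_and_spaces <- str_extract_all(address, "[\\d\\s]")[[1]]

# Removing those characters leaves only letters and punctuation.
remainder <- str_remove_all(address, "[\\d\\s]")

print(digits_and_spaces)
print(remainder)
```

The comma survives in `remainder` because it is neither a digit nor whitespace, which shows that the set matches only what is listed inside the brackets.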
Alright, you've seen how we can search for numbers, letters, or spaces, and how to configure a pattern for the number of occurrences. Let's put this into practice!