1. How to write regular expressions in Python?
In this lesson we'll cover how to write regular expressions in Python.
2. Definition
A regular expression is a sequence of ordinary characters and special characters, so-called metacharacters, that defines a pattern to search for in a text. The simplest pattern is just a sequence of letters,
as in this example.
If we provide some text and search for this sequence,
3. Definition
we will find the following matches. Note that, apart from the word "cat", the pattern also matches the beginning of the word "catches". But what do we do if we want to search for more complicated patterns?
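As a minimal sketch of such a literal pattern in Python: the sample sentence below is an assumption, since the text shown in the lesson is not part of this transcript.

```python
import re

# Hypothetical sample text; the lesson's own example is not reproduced here.
text = "The cat catches a mouse."

# A plain sequence of letters matches itself wherever it occurs,
# including inside longer words.
matches = re.findall(r"cat", text)
print(matches)  # ['cat', 'cat']: the word "cat" and the start of "catches"
```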
4. Complex patterns
Let's look at this text. We want a pattern that matches
5. Complex patterns
all the e-mail addresses in it. A simple character sequence isn't enough, so metacharacters are needed.
6. Special characters
In a regular expression, each metacharacter is mapped to the real characters it can match.
Simple characters, such as letters and digits, are mapped onto themselves.
A dot metacharacter is mapped to any single character except, by default, a newline.
But a dot prefixed with a backslash maps to a literal dot character.
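The difference between the two dots can be sketched as follows. The sample string is made up for illustration.

```python
import re

# Hypothetical sample: "a.c", "abc", and "a", newline, "c".
text = "a.c abc a\nc"

# An unescaped dot stands for any single character except a newline.
any_char = re.findall(r"a.c", text)
print(any_char)  # ['a.c', 'abc']

# An escaped dot stands only for a literal dot.
literal_dot = re.findall(r"a\.c", text)
print(literal_dot)  # ['a.c']
```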
7. Special characters
The following metacharacters consist of a backslash followed by a letter.
\w (lowercase "w") maps to any alphanumeric character or underscore.
\d (lowercase "d") maps to any digit.
\s (lowercase "s") maps to any whitespace character.
8. Square brackets
Several characters or metacharacters can be enclosed
in square brackets, which are themselves a metacharacter. In this case, the expression maps to any one of the enclosed characters. There are also shorthand ranges for some frequently used sets:
[a-z] for any lowercase letter,
[A-Z] for any uppercase letter,
[0-9] for any digit.
We can also combine them inside a single pair of brackets.
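Here is a small sketch of square brackets in use. The sample strings are assumptions chosen for illustration.

```python
import re

# A bracketed set matches exactly one of the enclosed characters.
either = re.findall(r"[Gg]r[ae]y", "Gray or grey?")
print(either)  # ['Gray', 'grey']

sample = "aB3_!"
lower = re.findall(r"[a-z]", sample)           # any lowercase letter
upper = re.findall(r"[A-Z]", sample)           # any uppercase letter
digit = re.findall(r"[0-9]", sample)           # any digit
combined = re.findall(r"[a-zA-Z0-9]", sample)  # ranges combined in one set

print(lower, upper, digit)  # ['a'] ['B'] ['3']
print(combined)             # ['a', 'B', '3']
```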
9. Repetitions
Simple or complex metacharacters can be followed by symbols indicating how many times the preceding character may be repeated.
"*" indicates that the character occurs zero or more times.
"+" indicates that the character occurs at least once.
"?" indicates that the character occurs either once or not at all.
"{}" enclose the lower and upper bounds on the number of repetitions, written as {m,n}.
10. Regular expression for an e-mail
Returning to our previous example, a regular expression matching an e-mail address
can look like this. Let's break it down.
11. Regular expression for an e-mail
This part maps to one or more characters, each of which is a letter, digit, underscore, or dot.
12. Regular expression for an e-mail
The '@' symbol maps to itself.
13. Regular expression for an e-mail
This part maps to one or more lowercase letters.
14. Regular expression for an e-mail
A backslash followed by a dot simply maps to a literal dot character.
15. Regular expression for an e-mail
And again, this part maps to one or more lowercase letters.
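The expression on the slide is not part of this transcript, so the pattern below is a reconstruction from the five parts just described and should be read as an assumption. The addresses and the surrounding sentence are made up. It is a teaching pattern, not a complete e-mail validator.

```python
import re

# Reconstructed from the description (an assumption, not the slide itself):
#   [\w.]+   one or more letters, digits, underscores, or dots
#   @        the "@" symbol itself
#   [a-z]+   one or more lowercase letters
#   \.       a literal dot
#   [a-z]+   one or more lowercase letters
email_regex = r"[\w.]+@[a-z]+\.[a-z]+"

# Hypothetical sample text.
text = "Contact john.smith@mailbox.com or lisa_01@mydomain.org for details."

emails = re.findall(email_regex, text)
print(emails)  # ['john.smith@mailbox.com', 'lisa_01@mydomain.org']
```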
16. re package
We have defined a regular expression. But how do we use it programmatically?
The re module comes to the rescue! Once we have defined an expression,
we can pass it to the re.compile() function. Note that we use the "r" prefix before the expression: it marks a raw string, so Python leaves the backslashes untouched.
The next step is to use it against our text. We'll cover a couple of functions to do so.
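A minimal sketch of compiling a pattern, and of what the "r" prefix changes. The e-mail pattern is the reconstruction assumed from the description above, not a copy of the slide.

```python
import re

# Compile once; the resulting object can be reused for many searches.
pattern = re.compile(r"[\w.]+@[a-z]+\.[a-z]+")
print(pattern.pattern)  # the original expression, stored as a string

# Without the "r" prefix, every backslash would have to be doubled.
same_string = r"\d+" == "\\d+"
print(same_string)  # True

# In a raw string, "\n" stays a backslash plus "n" (two characters)
# instead of becoming a single newline character.
raw_length = len(r"\n")
plain_length = len("\n")
print(raw_length, plain_length)  # 2 1
```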
17. re.finditer()
The finditer() function takes a pattern and a text and returns an iterator.
We can use this object in a for loop. In this case, each item is a Match object containing the information about a single match in our text.
18. re.finditer()
To retrieve this information, we can call the following methods on our Match object. The .group() method returns the matching substring. The .start() and .end() methods return the indices in the text where the matching substring starts and ends; the end index points just past the last matched character.
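Putting finditer() and the Match methods together might look like this. The text and the e-mail pattern are assumptions made for the sketch.

```python
import re

# Hypothetical text and the e-mail pattern assumed earlier.
text = "Write to ann@site.com or bob@web.org."
pattern = re.compile(r"[\w.]+@[a-z]+\.[a-z]+")

found = []
for match in re.finditer(pattern, text):
    # group() is the matched substring; start() and end() are its
    # positions in the text, with end() being exclusive.
    found.append((match.group(), match.start(), match.end()))

print(found)  # [('ann@site.com', 9, 21), ('bob@web.org', 25, 36)]

# Slicing the text with start and end gives back the matched substring.
first_email, first_start, first_end = found[0]
print(text[first_start:first_end] == first_email)  # True
```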
19. re.findall()
If we are only interested in the matching substrings, we can use the findall() function.
It simply returns a list of substrings representing the matches to our pattern.
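A short sketch of findall() on the same made-up text, again with the assumed e-mail pattern:

```python
import re

# Hypothetical text and the e-mail pattern assumed earlier.
text = "Write to ann@site.com or bob@web.org."
pattern = re.compile(r"[\w.]+@[a-z]+\.[a-z]+")

# findall() returns the matched substrings directly, without positions.
emails = re.findall(pattern, text)
print(emails)  # ['ann@site.com', 'bob@web.org']
```

One thing to keep in mind: if the pattern contains capturing groups in parentheses, findall() returns the group contents rather than the whole match.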
20. re.split()
Another interesting function is called split().
Instead of returning matches, it splits a given string at every match of the pattern. This results in a list of strings of the following form.
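The splitting behavior can be sketched like this, using the same hypothetical text and the assumed e-mail pattern:

```python
import re

# Hypothetical text and the e-mail pattern assumed earlier.
text = "Write to ann@site.com or bob@web.org."
pattern = re.compile(r"[\w.]+@[a-z]+\.[a-z]+")

# The matches act as separators; what remains is the text between them.
parts = re.split(pattern, text)
print(parts)  # ['Write to ', ' or ', '.']
```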
21. Let's practice!
That was a concise refresher on regular expressions. Now, let's practice our skills!