1. Introduction to regular expressions
We arrive at an important part of our journey:
regular expressions.
2. What is a regular expression?
A regular expression, or regex, is a string that combines normal and special characters to describe a pattern for finding text within a larger text. This sounds very complicated.
Let's break it down to understand it better.
Here, we have an example of what a regular expression looks like. In Python, the r at the beginning indicates a raw string, which keeps every backslash as a literal character. It is always advisable to use it.
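To see what the r prefix changes, here is a minimal sketch. The strings are illustrative assumptions, not the example from the slide.

```python
# Illustrative comparison; these strings are assumptions, not the slide's example.
plain = "\n"    # one character: a newline
raw = r"\n"     # two characters: a backslash and the letter n

print(len(plain))  # 1
print(len(raw))    # 2

# A raw regex such as r"\d" is the same string as "\\d" written without the prefix.
print(r"\d" == "\\d")  # True
```

With the raw string, the backslash reaches the regex engine untouched, so the pattern reads the same way it is written.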
3. What is a regular expression?
We said that a regex contains normal characters, or literal characters we already know.
The normal characters match themselves. In the case shown in our slide, st exactly matches an s followed by a t.
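As a quick sketch of literal matching, the snippet below looks for st in a sentence. The sentence is an assumption chosen for illustration.

```python
import re

# Illustrative sentence (an assumption, not taken from the slide).
text = "The first step is the hardest."

# The literal pattern "st" matches an s immediately followed by a t.
literal_matches = re.findall(r"st", text)
print(literal_matches)  # ['st', 'st', 'st']
```

The three matches come from first, step and hardest.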
4. What is a regular expression?
They also contain special characters.
Metacharacters represent types of characters. Let's look at them one by one as they appear on the slide. Backslash d represents a digit,
5. What is a regular expression?
backslash s a whitespace,
6. What is a regular expression?
backslash w a word character. Metacharacters also represent ideas, such as location or quantity.
7. What is a regular expression?
In the example, 3 and 10 inside curly braces indicate that the character immediately to the left, in this case backslash w, should appear between 3 and 10 times.
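The pieces described so far can be put together in a small sketch. The pattern below is assembled from the parts named in the narration (st, backslash d, backslash s, backslash w with 3 and 10 in curly braces); that it is the exact pattern on the slide is an assumption, and the test strings are made up.

```python
import re

# Assumed pattern: literal "st", one digit, one whitespace, then 3 to 10 word characters.
pattern = r"st\d\s\w{3,10}"

# fullmatch succeeds only if the whole string fits the pattern.
ok = re.fullmatch(pattern, "st1 abc")                # 3 word characters: match
too_short = re.fullmatch(pattern, "st1 ab")          # 2 word characters: no match
too_long = re.fullmatch(pattern, "st1 abcdefghijk")  # 11 word characters: no match

print(ok is not None)     # True
print(too_short is None)  # True
print(too_long is None)   # True
```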
8. What is a regular expression?
We said that regex describes a pattern.
A pattern is a sequence of characters that maps to words or punctuation.
9. What is a regular expression?
As a data scientist,
you will use pattern matching
to find and replace specific text,
and to validate strings such as passwords or email addresses.
Why use regex? Regular expressions are powerful and fast. They let you search for complex patterns that would be very difficult to find otherwise.
10. The re module
Python has a useful library, the re module, to handle regex.
You can import it as shown in the code. Let's see how it works.
To find all matches of a pattern,
we use the dot findall method. It takes two arguments: the regex and the string.
In the code, we want to find all the matches of hashtag movies in the specified string.
The method returns a list with the two matches found.
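A sketch of this call might look as follows. The input string is an assumption written to contain the hashtag twice.

```python
import re

# Illustrative string (an assumption) containing "#movies" twice.
text = "Love #movies! I had fun yesterday going to the #movies"

# findall takes the regex first and the string second.
hashtags = re.findall(r"#movies", text)
print(hashtags)  # ['#movies', '#movies']
```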
11. The re module
To split a string at each pattern match,
we can use the dot split method.
In the example, we want to split the specified string at every exclamation mark match.
It returns a list of the substrings as you can see in the output.
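Here is a minimal sketch of splitting at exclamation marks. The review text is an assumption used for illustration.

```python
import re

# Illustrative string (an assumption) with three exclamation marks.
text = "Nice place to eat! I'll come back! Excellent meat!"

# split cuts the string at every match and drops the matched text.
parts = re.split(r"!", text)
print(parts)
# ['Nice place to eat', " I'll come back", ' Excellent meat', '']
```

Because the string ends with a match, the last element of the list is an empty string.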
12. The re module
Finally,
we could replace any pattern match with another string
using the dot sub method. It takes three arguments: the regex, replacement and string.
In the example, we replace every match of yellow with the word nice.
We get the following output.
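The replacement can be sketched like this. The sentence is an assumption; only the words yellow and nice come from the narration.

```python
import re

# Illustrative sentence (an assumption) that mentions "yellow" three times.
text = "I have a yellow car and a yellow house in a yellow neighborhood"

# sub takes the regex, the replacement, and then the string.
replaced = re.sub(r"yellow", "nice", text)
print(replaced)
# I have a nice car and a nice house in a nice neighborhood
```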
13. Supported metacharacters
Let's look at the supported metacharacters.
In the example, we want to find all matches of the pattern containing User followed by a digit.
We use backslash d to represent the digit.
We get the following matches.
Next, we find matches of the pattern containing User followed by a non-digit.
In that case, we use backslash capital D
obtaining the following match.
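Both cases can be sketched on one string. The string itself is an assumption, built so that two user names end in a digit and one ends in a letter.

```python
import re

# Illustrative string (an assumption, not necessarily the one on the slide).
text = "The winners are: User9, UserN, User8"

# \d matches any digit.
with_digit = re.findall(r"User\d", text)
print(with_digit)  # ['User9', 'User8']

# \D matches any character that is not a digit.
with_non_digit = re.findall(r"User\D", text)
print(with_non_digit)  # ['UserN']
```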
14. Supported metacharacters
If we want to find all matches of the pattern containing User followed by any word character, that is, a letter, a digit or an underscore,
we can use backslash w.
We get the following matches.
In the next example, we need to find the price in a string.
We use backslash capital W, which matches any non-word character, to match the dollar sign, followed by backslash d for the digit,
obtaining the following output.
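A sketch of both metacharacters follows. The two strings are assumptions chosen to mirror the narration.

```python
import re

# Illustrative strings (assumptions, not necessarily the ones on the slide).
winners = "The winners are: User9, UserN, User8"
offer = "This skirt is on sale, only $5 today!"

# \w matches a letter, a digit or an underscore.
users = re.findall(r"User\w", winners)
print(users)  # ['User9', 'UserN', 'User8']

# \W matches any non-word character, here the dollar sign; \d matches the digit.
price = re.findall(r"\W\d", offer)
print(price)  # ['$5']
```

Note that backslash capital W is not specific to the dollar sign: it would also match a space or a comma placed directly before a digit.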
15. Supported metacharacters
Finally, we use backslash s to specify
the pattern Data whitespace science
getting the following match.
In the second example, we use backslash capital S to
detect the matches of ice, followed by any non-whitespace character, followed by cream, and replace them with the words ice cream.
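The last two metacharacters can be sketched as follows. Both input strings are assumptions written to fit the narration.

```python
import re

# Illustrative strings (assumptions, not necessarily the ones on the slide).
sentence = "I enjoy learning Data Science"
dessert = "I really like ice-cream"

# \s matches a single whitespace character.
found = re.findall(r"Data\sScience", sentence)
print(found)  # ['Data Science']

# \S matches any single non-whitespace character, here the hyphen.
fixed = re.sub(r"ice\Scream", "ice cream", dessert)
print(fixed)  # I really like ice cream
```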
16. Let's practice!
Now, it's time for you to practice regular expressions!