RegEx with spaCy

1. RegEx with spaCy

Welcome! Let's learn about rule-based information extraction.

2. What is RegEx?

Rule-based information extraction is useful for many NLP tasks. Certain types of entities, such as dates or phone numbers, have distinct formats that can be recognized by a set of rules without training any model. Regular expressions, or RegEx, support rule-based information extraction through complex string-matching patterns. RegEx can be used to retrieve matching patterns or to replace them in a string with other text. For example, given a text, we can use regular expressions to find any links or phone numbers it contains.

3. RegEx strengths and weaknesses

Nearly all data scientists and engineers use RegEx at some stage in their workflow, from cleaning data to implementing machine learning models. RegEx has several advantages. Its rich syntax allows programmers to write robust rules. It can match many variations of a string, it runs quickly, and it is supported by many programming languages. Despite these advantages, RegEx has a few weaknesses. Its syntax is quite difficult for beginners, and writing good RegEx patterns requires knowing all the ways a pattern may vary in texts.

4. RegEx in Python

Python comes prepackaged with a RegEx library called re. Let's assume we want to find the phone numbers in a text. The first step is to define a pattern. Assuming a phone number is always written as 3 digits-3 digits-4 digits, a pattern to find such phone numbers is shown. In this pattern, backslash-d is a metacharacter that matches any digit from 0 to 9. A number within curly brackets shows how many occurrences of the preceding element are expected. Hence, parenthesis backslash-d followed by 3 in curly brackets matches three digits. We also use a dash between the groups of digits to match the shape of the phone number.
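As a minimal sketch of the pattern described above: the exact pattern string is reconstructed from the narration and may differ in detail from the one on the slide, and the sample strings are illustrative.

```python
import re

# Assumed pattern: three digits, a dash, three digits, a dash, four digits.
# (\d) matches one digit from 0 to 9; {3} and {4} set how many times
# the preceding group must repeat.
pattern = r"(\d){3}-(\d){3}-(\d){4}"

# re.fullmatch succeeds only when the whole string fits the pattern.
well_formed = re.fullmatch(pattern, "832-123-5555") is not None
malformed = re.fullmatch(pattern, "832-1235-555") is not None

print(well_formed, malformed)
```

The raw-string prefix (r"...") keeps Python from treating the backslash as an escape character before the pattern reaches the re module.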

5. RegEx in Python

To find all matches of a pattern in a given text, we can use the re-dot-finditer() function from the re package. We can iterate through the matches it finds. Every match records the start and end character positions of the matching section of the text, which are accessible using the match-dot-start() and match-dot-end() methods. We can see that two phone numbers matching the given pattern, 832-123-5555 and 425-123-4567, are found along with their start and end positions in the input text.
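The search described above can be sketched as follows. The input sentence is an assumption made for illustration; only the two phone numbers come from the lesson.

```python
import re

# Hypothetical input text containing the two phone numbers from the lesson.
text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."
pattern = r"\d{3}-\d{3}-\d{4}"

# re.finditer yields one match object per non-overlapping match, left to right.
matches = [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

for start, end, number in matches:
    # start is inclusive and end is exclusive, so text[start:end] is the match.
    print(f"Start: {start}, end: {end}, match: {number}")
```

Because the end position is exclusive, slicing the text with the two positions returns exactly the matched phone number.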

6. RegEx in spaCy

spaCy has quick ways to implement RegEx in three pipes: Matcher, PhraseMatcher, and EntityRuler. Matcher and PhraseMatcher do not add the matched patterns as entities to doc-dot-ents. For this reason, we use EntityRuler to implement regular expressions. We have already learned to use EntityRuler to improve entity recognition accuracy in spaCy. We will learn more about Matcher and PhraseMatcher later on. Let's look at an example of using EntityRuler to find phone numbers. The patterns are given as a list of dictionaries, each with two keys: label and pattern. In this instance, the label is set to PHONE_NUMBER. To match a pattern such as 3 digits-3 digits-4 digits, we use a pattern that consists of five smaller dictionaries, where each dictionary represents one token of the matching pattern. The first, third and fifth dictionaries, with the key SHAPE, match tokens with a shape of three or four digits by using three or four d's. The second and fourth dictionaries, with the key ORTH, match an exact string, which is set to a dash in this pattern. Writing patterns in spaCy requires practice, and the spaCy documentation provides more information about the different pattern attributes.
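A minimal sketch of the EntityRuler example described above, assuming spaCy version 3. Two things here are assumptions rather than part of the lesson: the blank English pipeline (chosen so that no trained model needs to be downloaded) and the input sentence.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) keeps this sketch
# independent of any downloaded model.
nlp = spacy.blank("en")

# One entry per entity type; each inner dictionary describes one token.
patterns = [
    {
        "label": "PHONE_NUMBER",
        "pattern": [
            {"SHAPE": "ddd"},   # three digits
            {"ORTH": "-"},      # a literal dash
            {"SHAPE": "ddd"},   # three digits
            {"ORTH": "-"},      # a literal dash
            {"SHAPE": "dddd"},  # four digits
        ],
    }
]

# Add the EntityRuler to the pipeline and register the patterns.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# Hypothetical input text containing the two phone numbers from the lesson.
text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."
doc = nlp(text)

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

The five-dictionary pattern relies on the tokenizer splitting each phone number into five tokens (digits, dash, digits, dash, digits), which is why the dashes need their own ORTH entries.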

7. Let's practice!

Let's practice!
