spaCy Matcher and PhraseMatcher

1. spaCy Matcher and PhraseMatcher

Welcome! Let us learn more about rule-based information extraction using spaCy.

2. Matcher in spaCy

RegEx patterns are not trivial to read and debug. For this reason, spaCy provides a readable, production-level, and maintainable alternative: the Matcher class. The Matcher class matches predefined rules against a sequence of tokens in a Doc container. Let's look at an example. We first import spaCy and the Matcher class. We then load the en_core_web_sm model and run it on the given text to generate a Doc container. Next, we initialize a Matcher object with the model's vocabulary by calling Matcher(nlp.vocab).

3. Matcher in spaCy

Next, we define a pattern to match lowercase "good" followed by "morning". A pattern is a list of dictionaries, one per token: the first has the key "LOWER" and the value "good", and the second has the key "LOWER" and the value "morning". We then add this pattern to the Matcher object under a custom name, such as morning_greeting, and run the matcher on the Doc container. The output of a Matcher object is a list of matches, each a tuple of a match ID and the start and end token indices of the matched span.
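The steps above can be sketched as follows. This is a minimal illustration, not the exact code from the slides: it assumes spaCy v3 (where `matcher.add` takes a name and a list of patterns), the example sentence is made up, and a blank English pipeline stands in for en_core_web_sm so that the sketch runs without a model download.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) replaces
# en_core_web_sm here; LOWER only needs the tokenizer.
nlp = spacy.blank("en")
doc = nlp("Good morning, this is our first day on campus.")

# The Matcher shares the pipeline's vocabulary.
matcher = Matcher(nlp.vocab)

# One dictionary per token: "good" followed by "morning", in any casing.
pattern = [{"LOWER": "good"}, {"LOWER": "morning"}]
matcher.add("morning_greeting", [pattern])

# Each match is a (match_id, start, end) tuple of token indices.
matches = matcher(doc)
for match_id, start, end in matches:
    # The match ID is a hash; the vocabulary maps it back to the name.
    print(nlp.vocab.strings[match_id], start, end, doc[start:end].text)
# morning_greeting 0 2 Good morning
```

Note that the end index is exclusive, so `doc[start:end]` gives the matched span directly.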

4. Matcher extended syntax support

The Matcher class makes patterns more expressive by allowing operators inside the curly brackets, in place of a plain attribute value. These operators perform extended comparisons and look similar to Python's in, not in, and comparison operators: IN, NOT_IN, ==, >=, <=, >, and <. The table shows a list of supported operators in the Matcher class.
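As a rough illustration of a comparison operator, the sketch below matches any token that is at least ten characters long. The sentence and the pattern name `long_token` are made up for this example, and a blank English pipeline is assumed in place of a trained model.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline; LENGTH needs no trained model.
nlp = spacy.blank("en")
doc = nlp("The matcher finds extraordinarily long tokens.")

matcher = Matcher(nlp.vocab)

# The value of LENGTH is itself a dictionary holding the operator:
# match tokens whose character length is >= 10.
pattern = [{"LENGTH": {">=": 10}}]
matcher.add("long_token", [pattern])

long_tokens = [doc[start:end].text for _, start, end in matcher(doc)]
print(long_tokens)
# ['extraordinarily']
```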

5. Matcher extended syntax support

For instance, if we want to match both lowercase "good morning" and "good evening" in a text, we can use a single pattern and the IN operator. In this case, the pattern is a list of two dictionaries. The first one is {"LOWER": "good"} and the second one is {"LOWER": {"IN": ["morning", "evening"]}}.
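A minimal sketch of this pattern, assuming spaCy v3, a made-up example sentence, and a blank English pipeline instead of en_core_web_sm:

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline; LOWER needs no trained model.
nlp = spacy.blank("en")
doc = nlp("Good morning and good evening!")

matcher = Matcher(nlp.vocab)

# The second token may be either "morning" or "evening" (lowercased).
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}]
matcher.add("greeting", [pattern])

# Sort by start index so the results follow the order of the text.
spans = sorted((start, end) for _, start, end in matcher(doc))
texts = [doc[start:end].text for start, end in spans]
print(texts)
# ['Good morning', 'good evening']
```

Without IN, the same result would need two separate patterns, one per greeting.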

6. PhraseMatcher in spaCy

While processing unstructured text, we often have long lists and dictionaries of terms that we want to find in given texts. Matcher patterns are handcrafted, and each token needs to be coded individually, so if we have a long list of phrases, Matcher is no longer the best option. In this case, the PhraseMatcher class helps us match long dictionaries of terms. As an example, let's assume that we want to match two terms in a given text, Bill Gates and John Smith. First, we import spaCy and the PhraseMatcher class. Then, we load the en_core_web_sm model and initialize the PhraseMatcher object using PhraseMatcher(nlp.vocab).

7. PhraseMatcher in spaCy

Next, we create patterns for the PhraseMatcher object by calling the nlp.make_doc() method on each term. This method converts each term into a Doc object, which is the pattern format the PhraseMatcher class uses. Then, we follow similar steps as with the Matcher class: we add the patterns, run the PhraseMatcher object on the Doc container of a given text, and iterate through the matches to extract the start and end token indices of the matched patterns.
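Put together, the PhraseMatcher workflow might look like the sketch below. It assumes spaCy v3 (where `matcher.add` takes a name and a list of Doc patterns). The pattern name `PeopleOfInterest` and the example sentence are made up, and a blank English pipeline stands in for en_core_web_sm so no model download is needed.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenizer-only pipeline in place of en_core_web_sm.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# Each term becomes a Doc; make_doc only tokenizes, which keeps
# pattern creation cheap for long term lists.
terms = ["Bill Gates", "John Smith"]
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("PeopleOfInterest", patterns)

doc = nlp("Bill Gates met John Smith for an important discussion.")

# Collect (start, end, text) for every match, in text order.
found = sorted(
    (start, end, doc[start:end].text) for _, start, end in matcher(doc)
)
for start, end, text in found:
    print(start, end, text)
# 0 2 Bill Gates
# 3 5 John Smith
```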

8. PhraseMatcher in spaCy

The previous example shows how we can match patterns by their exact values. If we want to match patterns regardless of casing, or match by the shape of a pattern, we can use the attr (attribute) argument of the PhraseMatcher class. In the first example, we set the attr argument to LOWER, so PhraseMatcher compares the lowercase form of each token and finds matches in any casing. In the second example, by setting the attr argument to SHAPE, we are asking PhraseMatcher to match tokens that have the same shape as the patterns. In this instance, we want to retrieve IP addresses from a text, so we provide multiple examples of them, such as 110.0.0.0, to the PhraseMatcher class.
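Both uses of the attr argument can be sketched as follows. The terms, sentences, and pattern names are made up for illustration (only 110.0.0.0 comes from the example above), spaCy v3 is assumed, and a blank English pipeline replaces en_core_web_sm.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenizer-only pipeline; LOWER and SHAPE are lexical
# attributes, so no trained model is required.
nlp = spacy.blank("en")

# attr="LOWER": compare lowercase forms, so casing is ignored.
lower_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
lower_terms = ["Government", "Investment"]
lower_matcher.add("Terms", [nlp.make_doc(t) for t in lower_terms])

doc1 = nlp("Our company plans to introduce a new government investment.")
lower_hits = [
    doc1[start:end].text
    for _, start, end in sorted(lower_matcher(doc1), key=lambda m: m[1])
]
print(lower_hits)
# ['government', 'investment']

# attr="SHAPE": compare token shapes such as "ddd.d.d.d", so the
# example addresses act as templates rather than literal values.
shape_matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
ip_examples = ["110.0.0.0", "101.243.0.0"]
shape_matcher.add("IPAddresses", [nlp.make_doc(ip) for ip in ip_examples])

doc2 = nlp("The log lists two IP addresses: 234.135.0.0 and 111.213.0.0")
ip_hits = [
    doc2[start:end].text
    for _, start, end in sorted(shape_matcher(doc2), key=lambda m: m[1])
]
print(ip_hits)
# ['234.135.0.0', '111.213.0.0']
```

Shape matching only finds addresses whose digit grouping matches one of the examples, so the example list should cover every grouping we expect to see.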

9. Let's practice!

Let's practice!