1. Rule-based Matching
In this video, we'll take a look at spaCy's matcher, which lets you write rules to find words and phrases in text.
2. Why not just regular expressions?
Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.
It's also more flexible: you can match not only on a token's text, but also on other lexical attributes.
You can even write rules that use the model's predictions.
For example, find the word "duck" only if it's a verb, not a noun.
3. Match patterns
Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.
In this example, we're looking for two tokens with the text "iPhone" and "X".
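A sketch of this pattern in code (the variable name is just illustrative):

```python
# Match two tokens whose exact texts are "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
```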
We can also match on other token attributes. Here, we're looking for two tokens whose lowercase forms equal "iphone" and "x".
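As a sketch:

```python
# Match on the lowercase forms, so "IPHONE X", "iphone x", etc. also match
pattern = [{"LOWER": "iphone"}, {"LOWER": "x"}]
```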
We can even write patterns using attributes predicted by the model. Here, we're matching a token with the lemma "buy", plus a noun. The lemma is the base form, so this pattern would match phrases like "buying milk" or "bought flowers".
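A sketch of that pattern:

```python
# Match a token whose lemma is "buy", followed by any noun
pattern = [{"LEMMA": "buy"}, {"POS": "NOUN"}]
```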
4. Using the Matcher (1)
To use a pattern, we first import the matcher from spacy dot matcher.
We also load a model and create the nlp object.
The matcher is initialized with the shared vocabulary, nlp dot vocab. You'll learn more about this later – for now, just remember to always pass it in.
The matcher dot add method lets you add a pattern. The first argument is a unique ID to identify which pattern was matched. The second argument is a list of patterns.
To match the pattern on a text, we can call the matcher on any doc.
This will return the matches.
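Put together, a minimal sketch might look like this. The pipeline name en_core_web_sm and the example text are just illustrative:

```python
import spacy
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher under a unique ID
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])

# Process a text and call the matcher on the doc
doc = nlp("Upcoming iPhone X release date leaked")
matches = matcher(doc)
```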
5. Using the Matcher (2)
When you call the matcher on a doc, it returns a list of tuples.
Each tuple consists of three values: the match ID, the start index and the end index of the matched span.
This means we can iterate over the matches and create a Span object: a slice of the doc at the start and end index.
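Continuing the sketch from before:

```python
# Iterate over the (match_id, start, end) tuples
for match_id, start, end in matches:
    # Create a Span: a slice of the doc at the start and end index
    matched_span = doc[start:end]
    print(matched_span.text)
```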
6. Matching lexical attributes
Here's an example of a more complex pattern using lexical attributes.
We're looking for five tokens:
A token consisting of only digits.
Three case-insensitive tokens for "fifa", "world" and "cup".
And a token that consists of punctuation.
The pattern matches the tokens "2018 FIFA World Cup:".
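One way to write this pattern:

```python
pattern = [
    {"IS_DIGIT": True},   # a token consisting of only digits
    {"LOWER": "fifa"},    # case-insensitive "fifa"
    {"LOWER": "world"},   # case-insensitive "world"
    {"LOWER": "cup"},     # case-insensitive "cup"
    {"IS_PUNCT": True}    # a punctuation token
]
```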
7. Matching other token attributes
In this example, we're looking for two tokens:
A verb with the lemma "love", followed by a noun.
This pattern will match "loved dogs" and "love cats".
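As a sketch:

```python
pattern = [
    {"LEMMA": "love", "POS": "VERB"},  # a verb with the lemma "love"
    {"POS": "NOUN"}                    # followed by a noun
]
```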
8. Using operators and quantifiers (1)
Operators and quantifiers let you define how often a token should be matched. They can be added using the "OP" key.
Here, the "?" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.
9. Using operators and quantifiers (2)
"OP" can have one of four values:
An "!" negates the token, so it's matched 0 times.
A "?" makes the token optional, and matches it 0 or 1 times.
A "+" matches a token 1 or more times.
And finally, an "*" matches 0 or more times.
Operators can make your patterns a lot more powerful, but they also add more complexity – so use them wisely.
10. Let's practice!
Token-based matching opens up a lot of new possibilities for information extraction. So let's try it out and write some patterns!