1. Combining models and rules
Combining statistical models with rule-based systems is one of the most powerful tricks you should have in your NLP toolbox.
In this video, we'll take a look at how to do it with spaCy.
2. Statistical predictions vs. rules
Statistical models are useful if your application needs to be able to generalize based on a few examples.
For instance, detecting product or person names usually benefits from a statistical model. Instead of providing a list of all person names ever, your application will be able to predict whether a span of tokens is a person name. Similarly, you can predict dependency labels to find subject/object relationships.
To do this, you would use spaCy's entity recognizer, dependency parser or part-of-speech tagger.
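As a quick illustration, here's a minimal sketch of those predictions. The pipeline name en_core_web_sm and the example sentence are assumptions for the demo, not part of the lesson.

```python
import spacy

# Load a small English pipeline (assumes en_core_web_sm is installed)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple hired Sarah Connor as a designer.")

# Named entities predicted by the entity recognizer
for ent in doc.ents:
    print(ent.text, ent.label_)

# Part-of-speech tags and dependency labels predicted by the model
for token in doc:
    print(token.text, token.pos_, token.dep_)
```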
3. Statistical predictions vs. rules
Rule-based approaches, on the other hand, come in handy if there's a more or less finite number of instances you want to find. For example, all countries or cities of the world, drug names or even dog breeds.
In spaCy, you can achieve this with custom tokenization rules, as well as the matcher and phrase matcher.
4. Recap: Rule-based Matching
In the last chapter, you learned how to use spaCy's rule-based matcher to find complex patterns in your texts. Here's a quick recap.
The matcher is initialized with the shared vocabulary – usually nlp dot vocab.
Patterns are lists of dictionaries, and each dictionary describes one token and its attributes. Patterns can be added to the matcher using the matcher dot add method.
Operators let you specify how often to match a token. For example, "+" will match one or more times.
Calling the matcher on a doc object will return a list of the matches. Each match is a tuple consisting of the match ID and the start and end token indices in the document.
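Putting the recap together, here's a minimal sketch. The pattern and example text are illustrative, and matcher dot add uses spaCy v3's signature, which takes a list of patterns.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# One dictionary per token; "OP": "+" matches one or more times
pattern = [{"LEMMA": "love"}, {"POS": "NOUN", "OP": "+"}]
matcher.add("LOVE_PATTERN", [pattern])

doc = nlp("I love cats and I love hot dogs.")

# Each match is a tuple: (match_id, start, end)
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```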
5. Adding statistical predictions
Here's an example of a matcher rule for "golden retriever".
If we iterate over the matches returned by the matcher, we can get the match ID and the start and end index of the matched span. We can then find out more about it. Span objects give us access to the original document and all other token attributes and linguistic features predicted by the model.
For example, we can get the span's root token. If the span consists of more than one token, this will be the token that decides the category of the phrase. For example, the root of "Golden Retriever" is "Retriever". We can also find the head token of the root. This is the syntactic "parent" that governs the phrase – in this case, the verb "have".
Finally, we can look at the previous token and its attributes. In this case, it's a determiner, the article "a".
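Here's what that example could look like in code, assuming spaCy v3 and a pipeline with a parser, such as en_core_web_sm:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

# Matcher rule for "golden retriever" (case-insensitive via LOWER)
matcher = Matcher(nlp.vocab)
matcher.add("DOG", [[{"LOWER": "golden"}, {"LOWER": "retriever"}]])

doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span:", span.text)
    # The root decides the category of the phrase: "Retriever"
    print("Root token:", span.root.text)
    # The root's head is the syntactic parent: the verb "have"
    print("Root head token:", span.root.head.text)
    # The previous token and its part-of-speech tag: "a", DET
    print("Previous token:", doc[start - 1].text, doc[start - 1].pos_)
```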
6. Efficient phrase matching (1)
The phrase matcher is another helpful tool to find sequences of words in your data.
It performs a keyword search on the document, but instead of only finding strings, it gives you direct access to the tokens in context.
It takes Doc objects as patterns.
It's also really fast.
This makes it very useful for matching large dictionaries and word lists on large volumes of text.
7. Efficient phrase matching (2)
Here's an example.
The phrase matcher can be imported from spacy dot matcher and follows the same API as the regular matcher.
Instead of a list of dictionaries, we pass in a Doc object as the pattern.
We can then iterate over the matches in the text, which gives us the match ID, and the start and end of the match. This lets us create a Span object for the matched tokens "Golden Retriever" to analyze it in context.
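Put together, the example could look like this. The pattern text and example sentence follow the ones described above, and matcher dot add again uses spaCy v3's list-of-patterns signature.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)

# The pattern is a Doc object instead of a list of dictionaries
pattern = nlp("Golden Retriever")
matcher.add("DOG", [pattern])

doc = nlp("I have a Golden Retriever")

# Each match gives the match ID plus start and end token indices
for match_id, start, end in matcher(doc):
    # Create a Span for the matched tokens to analyze it in context
    span = doc[start:end]
    print("Matched span:", span.text)
```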
8. Let's practice!
Let's try out some of the new techniques for combining rules with statistical models.