spaCy EntityRuler

1. spaCy EntityRuler

Welcome! Let's learn about EntityRuler, a component in spaCy that allows us to include or modify named entities using pattern matching rules.

2. spaCy EntityRuler

EntityRuler lets us add entities to doc.ents. It can be combined with EntityRecognizer, spaCy's pipeline component for named-entity recognition, to boost accuracy, or used on its own to implement a purely rule-based entity recognition system.

We add named entities to a Doc container using entity patterns. An entity pattern is a dictionary with two keys. The "label" key specifies the label to assign to the entity when the pattern matches, and the "pattern" key holds what to match. The entity ruler accepts two types of patterns: phrase entity patterns and token entity patterns.

A phrase entity pattern is used for exact string matches. For example, to match Microsoft exactly as a named entity with the label ORG, we use an entity pattern dictionary with "label" set to ORG and "pattern" set to the string "Microsoft". A token entity pattern uses one dictionary to describe each token. For example, to match san francisco in any casing as a GPE (a location type), we use an entity pattern dictionary with "label" set to GPE and "pattern" set to a list of two dictionaries, each with the key "LOWER": the first has the value "san" and the second has the value "francisco".
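The two pattern types described above can be written out as plain Python dictionaries. This is a minimal sketch; the variable names are illustrative and not part of spaCy.

```python
# Phrase entity pattern: "pattern" is a string that is matched exactly.
phrase_pattern = {"label": "ORG", "pattern": "Microsoft"}

# Token entity pattern: "pattern" is a list with one dictionary per token.
# LOWER compares against the lowercased token text, so "San Francisco"
# and "san francisco" both match.
token_pattern = {
    "label": "GPE",
    "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}],
}

# Both kinds of pattern can sit in the same list.
patterns = [phrase_pattern, token_pattern]
```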

3. Adding EntityRuler to spaCy pipeline

The EntityRuler can be added to a spaCy pipeline by calling the nlp.add_pipe() method with the component name "entity_ruler". When the nlp object is called on a text, the ruler finds matches in the Doc container and adds them to doc.ents, using each pattern's label as the entity label. As an example, we load a blank spaCy model and call nlp.add_pipe("entity_ruler") to add the EntityRuler component. Next, we define a list of patterns, which can mix phrase entity and token entity patterns. We then add these patterns to the EntityRuler component using its .add_patterns() method.
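The steps just described might look like the sketch below, assuming spaCy v3, where add_pipe() takes the component's string name and returns the component. The patterns are the Microsoft and San Francisco examples from earlier.

```python
import spacy

# Start from a blank English pipeline (tokenizer only, no trained components).
nlp = spacy.blank("en")

# add_pipe returns the newly created EntityRuler component.
ruler = nlp.add_pipe("entity_ruler")

# A mix of one phrase entity pattern and one token entity pattern.
patterns = [
    {"label": "ORG", "pattern": "Microsoft"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]
ruler.add_patterns(patterns)

print(nlp.pipe_names)
```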

4. Adding EntityRuler to spaCy pipeline

Next, we run the model on a given text to generate a Doc container. The nlp model uses the EntityRuler component to populate the .ents attribute of the Doc container. In this instance, Microsoft and San Francisco are extracted as entities with the ORG and GPE labels respectively.
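Running the pipeline end to end could look like the following sketch. The input sentence is an illustrative assumption, since the transcript does not give the exact text used on the slide.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Microsoft"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

# Hypothetical example sentence (not taken from the original slides).
doc = nlp("Microsoft is hiring software developers in San Francisco.")

# Each entity is a Span; label_ holds the label from the matching pattern.
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)  # [('Microsoft', 'ORG'), ('San Francisco', 'GPE')]
```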

5. EntityRuler in action

The entity ruler is designed to integrate with spaCy’s existing components and improve the named entity recognizer's performance. Let us look at the example sentence "Manhattan associates is a company in the US". In this case, the model fails to classify Manhattan associates as an ORG.

6. EntityRuler in action

We can add an EntityRuler component to the current nlp pipeline. If we add the ruler after an existing ner component by setting the "after" argument of the .add_pipe() method to "ner", the entity ruler will only add entities to doc.ents if they don’t overlap with existing entities predicted by the model. In this case, Manhattan keeps its incorrect GPE label, because the ruler component runs after the model's existing ner (EntityRecognizer) component and its overlapping match is skipped.
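The overlap rule can be reproduced without downloading a trained model. In the sketch below, a second EntityRuler named "ner" stands in for the statistical EntityRecognizer and tags Manhattan as GPE. That stand-in is purely an assumption for illustration: a real pretrained ner component is a statistical model, and its predictions depend on the model and version. With a pretrained pipeline, only the add_pipe("entity_ruler", after="ner") call and the pattern would be needed.

```python
import spacy

nlp = spacy.blank("en")

# ASSUMPTION: a rule-based stand-in that imitates a pretrained "ner"
# component mislabelling "Manhattan" as GPE. Not a real EntityRecognizer.
stand_in_ner = nlp.add_pipe("entity_ruler", name="ner")
stand_in_ner.add_patterns([{"label": "GPE", "pattern": "Manhattan"}])

# Add the ruler AFTER "ner": it does not overwrite existing entities.
ruler = nlp.add_pipe("entity_ruler", after="ner")
ruler.add_patterns([{"label": "ORG", "pattern": "Manhattan associates"}])

doc = nlp("Manhattan associates is a company in the US.")
ents_after = [(ent.text, ent.label_) for ent in doc.ents]
print(nlp.pipe_names)  # ['ner', 'entity_ruler']
print(ents_after)      # [('Manhattan', 'GPE')]
```

The ORG pattern does match, but the match overlaps the entity that is already set, so the ruler skips it.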

7. EntityRuler in action

However, if we add an EntityRuler before the ner component by setting the "before" argument of the .add_pipe() method to "ner", with a pattern that recognizes Manhattan associates as an ORG, the entity recognizer will respect the existing entity spans and adjust its predictions around the entities set by the EntityRuler. This can improve model accuracy in our case.
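The effect of component order can be sketched in the same model-free way. As an assumption for illustration, an EntityRuler named "ner" again stands in for the pretrained EntityRecognizer. It only imitates the behaviour of leaving already-set entity spans untouched, and a real statistical ner component may predict different additional entities.

```python
import spacy

nlp = spacy.blank("en")

# ASSUMPTION: rule-based stand-in for a pretrained "ner" component that
# would otherwise label "Manhattan" as GPE. Not a real EntityRecognizer.
stand_in_ner = nlp.add_pipe("entity_ruler", name="ner")
stand_in_ner.add_patterns([{"label": "GPE", "pattern": "Manhattan"}])

# Add the ruler BEFORE "ner": its entities are set first and are kept.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "ORG", "pattern": "Manhattan associates"}])

doc = nlp("Manhattan associates is a company in the US.")
ents_before = [(ent.text, ent.label_) for ent in doc.ents]
print(nlp.pipe_names)  # ['entity_ruler', 'ner']
print(ents_before)     # [('Manhattan associates', 'ORG')]
```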

8. Let's practice!

Let's practice!