RegEx with EntityRuler in spaCy
Regular expressions, or RegEx, are used for rule-based information extraction with complex string-matching patterns. RegEx can be used to retrieve matching patterns or to replace them in a string with other text. In this exercise, you will practice using EntityRuler in spaCy to find phone numbers in a given text.
The spaCy package is already imported for your use. In your string patterns, you can use \d, a metacharacter that matches any digit from 0 to 9.
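To see the \d metacharacter on its own, here is a minimal sketch that uses Python's standard re module rather than spaCy. The sample strings and variable names are made up for illustration.

```python
import re

# \d matches one digit; {10} repeats it exactly ten times.
ten_digits = re.compile(r"\d{10}")

# fullmatch succeeds only if the whole string consists of ten digits.
is_phone = ten_digits.fullmatch("4251234567") is not None
too_short = ten_digits.fullmatch("425123") is not None

# search finds the first ten-digit run anywhere in a longer string.
found = ten_digits.search("Our phone number is 4251234567.")
number = found.group() if found else None

print(is_phone, too_short, number)
```

The raw-string prefix (r"...") keeps Python from treating the backslash in \d as a string escape.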
A spaCy token pattern can use REGEX as an operator on a token attribute such as TEXT. In this case, a pattern has the shape [{"TEXT": {"REGEX": "<a given pattern>"}}].
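As a rough illustration of this pattern shape, the sketch below passes a REGEX token pattern to spaCy's Matcher instead of the EntityRuler. It assumes spaCy 3.x is installed; the rule name TEN_DIGITS and the anchored expression are choices made for this example, not part of the exercise.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline provides a tokenizer without any trained components.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One token whose text is ten digits; the ^ and $ anchors make the
# whole-token match explicit.
pattern = [{"TEXT": {"REGEX": r"^\d{10}$"}}]
matcher.add("TEN_DIGITS", [pattern])  # "TEN_DIGITS" is a hypothetical rule name

doc = nlp("Our phone number is 4251234567.")

# Each match is a (match_id, start, end) triple of token indices.
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

Each dictionary in the list describes a single token, so the regular expression is tested against one token's text at a time rather than against the whole sentence.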
This is a part of the course “Natural Language Processing with spaCy”.
Exercise instructions
- Define a pattern to match phone numbers of the form 8888888888 to be used by the EntityRuler.
- Load a blank spaCy English model and add an EntityRuler component to the pipeline.
- Add the compiled pattern to the EntityRuler component.
- Run the model and print the tuple of text and type of entities for the given text.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
text = "Our phone number is 4251234567."
# Define a pattern to match phone numbers
patterns = [{"label": "PHONE_NUMBERS", "pattern": [{"TEXT": {"REGEX": "(____){____}"}}]}]
# Load a blank model and add an EntityRuler
nlp = spacy.____("en")
ruler = nlp.____("entity_ruler")
# Add the compiled patterns to the EntityRuler
ruler.____(patterns)
# Print a tuple of text and type for each entity in the given text
doc = ____(____)
print([(ent.____, ent.____) for ent in doc.____])
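For reference, one possible completion of the scaffold above is sketched here. It assumes spaCy 3.x, where add_pipe accepts the component name as a string, and it writes the regular expression as a raw string; the entities variable is introduced only to make the result easy to inspect.

```python
import spacy

text = "Our phone number is 4251234567."

# A single token made up of ten digits, labeled PHONE_NUMBERS.
patterns = [{"label": "PHONE_NUMBERS", "pattern": [{"TEXT": {"REGEX": r"(\d){10}"}}]}]

# Load a blank English pipeline and add an EntityRuler to it.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Register the patterns with the EntityRuler.
ruler.add_patterns(patterns)

# Run the pipeline and collect (text, label) pairs for each entity found.
doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

The trailing period is split off by the tokenizer, so the phone number is a token of its own and the pattern can match it.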