RegEx with EntityRuler in spaCy
Regular expressions, or RegEx, support rule-based information extraction with complex string-matching patterns. RegEx can be used to retrieve matching substrings or to replace them in a string with other text. In this exercise, you will practice using EntityRuler in spaCy to find phone numbers in a given text.
The spaCy package is already imported for you. In a pattern, the metacharacter \d matches any single digit from 0 to 9.
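As a small illustration of \d outside spaCy, the sketch below uses Python's standard re module. The sample sentence and variable names are made up for this example and are not part of the exercise.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "Call extension 4521 before 9 am."

# \d matches one digit at a time.
single_digits = re.findall(r"\d", sample)

# A quantifier in braces repeats the preceding element: \d{4} is exactly four digits in a row.
four_digit_runs = re.findall(r"\d{4}", sample)

# The same metacharacter can drive a replacement: every digit becomes "#".
masked = re.sub(r"\d", "#", sample)

print(single_digits)    # ['4', '5', '2', '1', '9']
print(four_digit_runs)  # ['4521']
print(masked)           # Call extension #### before # am.
```

The braces-quantifier form is the same mechanism the exercise template hints at with its (____){____} placeholder.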
A spaCy pattern can use REGEX as an attribute. In that case, the pattern has the form [{"TEXT": {"REGEX": "<a given pattern>"}}].
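To make the pattern form concrete, here is a minimal sketch that applies it to a different task: tagging a five-digit ZIP code. It assumes spaCy v3, where add_pipe accepts the string name "entity_ruler". The ZIP_CODE label, the sample sentence, and the variable names are hypothetical choices for this illustration.

```python
import spacy

# Assumption: spaCy v3.x is installed. A blank English pipeline needs no model download.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# One token whose text consists of exactly five digits.
# The ^ and $ anchors keep the regex from matching inside longer tokens.
zip_patterns = [
    {"label": "ZIP_CODE", "pattern": [{"TEXT": {"REGEX": r"^\d{5}$"}}]}
]
ruler.add_patterns(zip_patterns)

# Hypothetical sample sentence.
doc = nlp("Ship it to ZIP 98101 today.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)  # [('98101', 'ZIP_CODE')]
```

Each dictionary inside the "pattern" list describes one token, so a single-element list like this one only matches entities that are exactly one token long.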
This exercise is part of the course
Natural Language Processing with spaCy
Exercise instructions
- Define a pattern to match phone numbers of the form 8888888888 to be used by the EntityRuler.
- Load a blank spaCy English model and add an EntityRuler component to the pipeline.
- Add the compiled pattern to the EntityRuler component.
- Run the model and print the tuple of text and type of entities for the given text.
Hands-on interactive exercise
Try this exercise by completing this sample code.
text = "Our phone number is 4251234567."
# Define a pattern to match phone numbers
patterns = [{"label": "PHONE_NUMBERS", "pattern": [{"TEXT": {"REGEX": "(____){____}"}}]}]
# Load a blank model and add an EntityRuler
nlp = spacy.____("en")
ruler = nlp.____("entity_ruler")
# Add the compiled patterns to the EntityRuler
ruler.____(patterns)
# Print a tuple of text and type for each entity in the given text
doc = ____(____)
print([(ent.____, ent.____) for ent in doc.____])