LoslegenKostenlos loslegen

PhraseMatcher in spaCy

While processing unstructured text, you often have long lists and dictionaries that you want to scan and match in given texts. The Matcher patterns are handcrafted and each token needs to be coded individually. If you have a long list of phrases, Matcher is no longer the best option. In this instance, PhraseMatcher class helps us match long dictionaries. In this exercise, you will practice to retrieve patterns with matching shapes to multiple terms using PhraseMatcher class.

en_core_web_sm model is already loaded and ready for you to use as nlp. PhraseMatcher class is imported. A text string and a list of terms are available for your use.

Diese Übung ist Teil des Kurses

Natural Language Processing with spaCy

Kurs anzeigen

Anleitung zur Übung

  • Initialize a PhraseMatcher class with an attr to match to shape of given terms.
  • Create patterns to add to the PhraseMatcher object.
  • Find matches to the given patterns and print start and end token indices and matching section of the given text.

Interaktive Übung

Versuche dich an dieser Übung, indem du diesen Beispielcode vervollständigst.

text = "There are only a few acceptable IP addresse: (1) 127.100.0.1, (2) 123.4.1.0."
terms = ["110.0.0.0", "101.243.0.0"]

# Initialize a PhraseMatcher class to match to shapes of given terms
matcher = ____(nlp.____, attr = ____)

# Create patterns to add to the PhraseMatcher object
patterns = [nlp.make_doc(____) for term in terms]
matcher.____("IPAddresses", patterns)

# Find matches to the given patterns and print start and end characters and matches texts
doc = ____
matches = ____
for match_id, start, end in matches:
    print("Start token: ", ____, " | End token: ", ____, "| Matched text: ", doc[____:____].text)
Code bearbeiten und ausführen