PhraseMatcher in spaCy
When processing unstructured text, you often have long lists or dictionaries of terms that you want to find in a given text. Matcher patterns are handcrafted, and each token must be coded individually, so Matcher is no longer the best option once you have a long list of phrases. In that case, the PhraseMatcher class helps you match long dictionaries of terms. In this exercise, you will practice using the PhraseMatcher class to find spans in a text whose shape matches the shape of any of several terms.
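To make the idea of a token's shape concrete, here is a minimal pure-Python sketch of the kind of mapping involved: digits become d, lowercase letters become x, uppercase letters become X, other characters are kept as they are, and runs of the same shape character are capped at four. The helper name sketch_shape is hypothetical and only approximates spaCy's own shape attribute; it is not part of the spaCy API.

```python
def sketch_shape(token_text: str) -> str:
    """Approximate a token's shape (hypothetical helper, not spaCy's API)."""
    shape = []
    last = ""   # previous shape character
    run = 0     # how many times the current shape character has repeated
    for char in token_text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation such as "." is kept verbatim
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        if run < 4:  # keep at most four of the same shape character in a row
            shape.append(shape_char)
    return "".join(shape)


# The terms from the exercise and the two addresses that appear in its text.
terms = ["110.0.0.0", "101.243.0.0"]
candidates = ["127.100.0.1", "123.4.1.0"]

term_shapes = {sketch_shape(term) for term in terms}
for candidate in candidates:
    shape = sketch_shape(candidate)
    print(candidate, shape, shape in term_shapes)
```

Under this approximation, each address in the exercise text shares a shape with one of the terms even though the digits differ, which is why shape-based matching suits patterns such as IP addresses.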
The en_core_web_sm model is already loaded and ready for you to use as nlp. The PhraseMatcher class is imported. A text string and a list of terms are available for your use.
This exercise is part of the course
Natural Language Processing with spaCy
Exercise instructions
- Initialize a PhraseMatcher class with an attr to match the shape of the given terms.
- Create patterns to add to the PhraseMatcher object.
- Find matches to the given patterns and print the start and end token indices and the matching section of the given text.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
text = "There are only a few acceptable IP addresses: (1) 127.100.0.1, (2) 123.4.1.0."
terms = ["110.0.0.0", "101.243.0.0"]
# Initialize a PhraseMatcher class to match to shapes of given terms
matcher = ____(nlp.____, attr = ____)
# Create patterns to add to the PhraseMatcher object
patterns = [nlp.make_doc(____) for term in terms]
matcher.____("IPAddresses", patterns)
# Find matches to the given patterns and print start and end token indices and matched text
doc = ____
matches = ____
for match_id, start, end in matches:
    print("Start token: ", ____, " | End token: ", ____, "| Matched text: ", doc[____:____].text)