Efficient phrase matching
Sometimes it's more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things – like all countries of the world.
We already have a list of countries, so let's use this as the basis of our information extraction script. A list of string names is available as the variable COUNTRIES. The nlp object and a test doc have already been created, and doc.text has been printed to the shell.
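To make the contrast between token patterns and exact phrase patterns concrete, here is a minimal sketch. It assumes a spaCy release that accepts a list of patterns as the second argument to add (the v3 form). The blank English pipeline and the sample sentence are stand-ins for illustration, not the course's preloaded nlp and doc.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

# Assumption: a blank English pipeline (tokenizer only) stands in for the course's `nlp`.
nlp = spacy.blank("en")
# Hypothetical sample sentence, not the course's test doc.
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Token-based matching: one dictionary per token, written by hand for each country.
token_matcher = Matcher(nlp.vocab)
token_matcher.add("COUNTRY", [[{"TEXT": "Czech"}, {"TEXT": "Republic"}]])

# Phrase-based matching: the pattern is simply a Doc created from the exact string.
phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add("COUNTRY", [nlp("Czech Republic")])

token_spans = [doc[start:end].text for _, start, end in token_matcher(doc)]
phrase_spans = [doc[start:end].text for _, start, end in phrase_matcher(doc)]
print(token_spans, phrase_spans)
```

Both matchers find the same span here, but the phrase pattern needs no per-token description, which is what makes it convenient for a long, fixed list such as country names.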
This exercise is part of the course Advanced NLP with spaCy.
Exercise instructions
- Import the PhraseMatcher and initialize it with the shared vocab as the variable matcher.
- Add the phrase patterns and call the matcher on the doc.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the PhraseMatcher and initialize it
from spacy.____ import ____
matcher = ____(____)
# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
# Note: this is the spaCy v2 signature; spaCy v3 expects matcher.add('COUNTRY', patterns)
matcher.add('COUNTRY', None, *patterns)
# Call the matcher on the test document and print the result
matches = ____(____)
print([doc[start:end] for match_id, start, end in matches])
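
One way the completed exercise might look is sketched below. It is not the course's reference solution. The blank English pipeline, the short COUNTRIES list, and the test sentence are hypothetical stand-ins for the objects the exercise environment preloads, and add is called in the list form used by spaCy v3.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline stands in for the preloaded `nlp` object.
nlp = spacy.blank("en")

# Hypothetical stand-ins for the exercise's COUNTRIES list and test doc.
COUNTRIES = ["Czech Republic", "Slovakia", "Germany"]
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Initialize the PhraseMatcher with the shared vocab.
matcher = PhraseMatcher(nlp.vocab)

# nlp.pipe processes the strings as a stream, which is faster than calling nlp on each one.
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

# Calling the matcher on a Doc returns (match_id, start, end) tuples of token indices.
matches = matcher(doc)
found = [doc[start:end].text for match_id, start, end in matches]
print(found)
```

Slicing the doc with each start and end index turns the raw match tuples into Span objects, so the matched text can be read directly.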