Linguistic features in spaCy

1. Linguistic features in spaCy

Welcome back! Let's learn about some of spaCy's linguistic features, such as the part-of-speech tagger and the named entity recognizer, which we can use to extract information from text.

2. POS tagging

POS stands for part of speech. A part of speech is a grammatical category that groups words by their function and context within a sentence. For example, the English language has nine main POS categories, including verb, noun, adjective, adverb and conjunction.

3. POS tagging with spaCy

One use case for POS tagging is to disambiguate the meaning of a word. For example, some words such as "watch" can be either a noun or a verb. Once a text has been processed by the nlp pipeline, spaCy stores the POS tag of each token in the token-dot-pos-underscore attribute. spacy-dot-explain() can be called on a given tag to get a short explanation of that tag.

4. POS tagging with spaCy

Let's look at an example of extracting part-of-speech tags for two sentences: "I watch TV" and "I left without my watch". We use a list comprehension to collect each token's text, its POS tag via token-dot-pos-underscore, and the tag's explanation by passing token-dot-pos-underscore to spacy-dot-explain. The word "watch" is correctly tagged as a verb in the first sentence and as a noun in the second.

5. Named entity recognition

On to named entity recognition! A named entity is a word or phrase that refers to a specific entity with a name, such as an organization. Named-entity recognition (NER) is an NLP task that classifies named entities found in unstructured text into pre-defined categories such as person names. spaCy supports a wide range of entity types, such as PERSON for a named person, ORG for an organization such as a company, GPE for a geo-political entity like a country, LOC for other locations such as mountain ranges, and DATE and TIME.
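To see how spaCy itself describes these entity types, one option is to look the labels up with spacy.explain, which reads from spaCy's built-in glossary and does not need a trained pipeline. A small sketch (the variable names are illustrative):

```python
import spacy

# Entity labels mentioned in the lesson.
labels = ["PERSON", "ORG", "GPE", "LOC", "DATE", "TIME"]

# spacy.explain returns a short glossary description, or None for unknown labels.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```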

6. NER and spaCy

spaCy models can predict named entities and their corresponding labels as part of the NER component. Named entities are available via the doc-dot-ents property of a Doc container. spaCy will also tag each entity with its corresponding label, which represents an entity type. The label of an entity is available via the -dot-label_ property.

7. NER and spaCy

The code snippet illustrates how we extract named entities from "Albert Einstein was genius". We can iterate through the entities by using the doc-dot-ents attribute, and access the entity text, the start and end character offsets of each entity, and the entity labels by using -dot-text, -dot-start_char, -dot-end_char and -dot-label_ respectively. In this instance, Albert Einstein is detected as a PERSON that starts at character offset 0 and ends at character offset 15 of the given text (the end offset is exclusive).

8. NER and spaCy

An alternative way to extract entities and their types is to work with the Token class directly, rather than going through doc-dot-ents, which only contains the extracted named entities. spaCy tags each token in a given Doc container with its entity type if the token is part of an entity. We can access a Token's -dot-text and -dot-ent_type_ attributes. If a token is not part of an entity, such as the words "was" and "genius", we will see an empty string as the entity type.

9. displaCy

We can also visualize these entities using displaCy. displaCy offers several visualizers, such as the entity visualizer, which highlights named entities and their labels in a text. For example, we can use the displacy-dot-serve function to visualize the named entities of the previous example, "Albert Einstein was genius". Here the displacy-dot-serve function takes two arguments: a Doc container and the style of displaCy visualization, which is "ent" (entities) in this instance.

10. Let's practice!

Let's practice what we've learned!
