Token classification

1. Token classification

NLP goes beyond text classification, with applications in token classification, question answering, and sequence generation. Let's begin with token classification.

2. Text versus token classification

So far, we've focused on text classification where we classify entire sentences or pairs of texts.

3. Text versus token classification

In token classification, we assign a label to each individual word or token within a sentence. These are the same tokens we learned how to produce in Chapter 1. Token classification is the foundation for two common NLP tasks: Named Entity Recognition, or NER, and Part of Speech Tagging, or PoS tagging.
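
To make the contrast concrete, here is a minimal sketch in plain Python. No model is involved: the labels are assigned by hand purely for illustration, and whitespace splitting stands in for a real tokenizer, which would typically produce subword tokens.

```python
sentence = "Apple opened a new office in Toronto"

# Text classification: one label for the whole input.
# (Hypothetical label, chosen by hand for illustration.)
text_label = "business"

# Token classification: one label per token.
# Whitespace splitting is a simplification of real tokenization.
tokens = sentence.split()
token_labels = ["ORG", "O", "O", "O", "O", "O", "LOC"]  # "O" = not an entity

# Pair every token with its label.
labeled_tokens = list(zip(tokens, token_labels))
for token, label in labeled_tokens:
    print(f"{token:<8} {label}")
```

The point to notice is the shape of the output: a single label for the sentence in one case, and a list of labels as long as the list of tokens in the other.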

4. Named entity recognition (NER)

Let's start with Named Entity Recognition. NER identifies and labels entities such as names, locations, organizations, and dates. For example, in the sentence "Apple opened a new office in Toronto in March 2023", NER would tag Apple as an organization, Toronto as a location, and March 2023 as a date. Note that each model recognizes only the entity types it was trained on. NER is especially useful in two kinds of applications: information retrieval, where it helps gather key details, such as identifying all companies and dates mentioned in financial reports; and question answering systems, which use NER to find specific facts, like the location linked to a company when asked, "Where is Apple headquartered?"

5. NER in code

To perform Named Entity Recognition, we use the pipeline function with the task set to "ner" and a suitable model. We set grouped_entities=True to merge tokens that belong to the same entity, like combining "United" and "States" into "United States". In recent releases of Hugging Face transformers, the same grouping is requested with aggregation_strategy="simple". Passing a sentence to the pipeline returns a list of recognized entities. Each result includes the entity type (for example, PER for person, ORG for organization, or LOC for location), a confidence score, the full entity text, and the character positions of the entity in the original text.
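
The steps above might look like the following sketch. The lesson does not name a model, so `dslim/bert-base-NER` is an assumption here, as are the helper names `build_ner_pipeline`, `run_ner`, and `entities_by_type`. The sketch uses `aggregation_strategy="simple"`, which plays the role of `grouped_entities=True` in current versions of transformers. A model limited to labels such as PER, ORG, and LOC would not tag "March 2023" as a date.

```python
from collections import defaultdict


def build_ner_pipeline(model_name="dslim/bert-base-NER"):
    """Create an NER pipeline. The default model name is an assumption."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges the pieces of one entity,
    # the behaviour the lesson describes for grouped_entities=True.
    return pipeline("ner", model=model_name, aggregation_strategy="simple")


def entities_by_type(results):
    """Group pipeline results into {entity_type: [entity_text, ...]}."""
    grouped = defaultdict(list)
    for item in results:
        grouped[item["entity_group"]].append(item["word"])
    return dict(grouped)


def run_ner(text, model_name="dslim/bert-base-NER"):
    """Run NER on text, print each entity, and return entities by type."""
    ner = build_ner_pipeline(model_name)
    results = ner(text)
    for item in results:
        # start/end are character offsets into the original text.
        span = text[item["start"]:item["end"]]
        print(item["entity_group"], round(float(item["score"]), 3), item["word"], span)
    return entities_by_type(results)
```

Calling `run_ner("Apple opened a new office in Toronto in March 2023")` downloads the model on first use. Grouping the results by type is one simple way to support the information retrieval use case mentioned earlier, such as listing every organization found in a report.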

6. Part of speech (PoS) tagging

Now let's move on to Part of Speech Tagging. PoS tagging assigns a grammatical category, like noun, verb, or adjective, to each word in a sentence. For example, in "The quick fox jumps over the lazy dog", "fox" is a noun and "jumps" is a verb. This is useful in tasks like syntactic parsing, grammar correction, and text generation, where understanding the sentence structure is essential.

7. PoS tagging in code

To perform PoS tagging, we create a pipeline with the task set to "token-classification" using a suitable model. We set grouped_entities=True to group tokens into meaningful words or phrases. Passing a sentence to the pipeline returns a list of results. Each result includes the part of speech tag (for example, PROPN for proper noun, VERB for verb, or ADP for adposition, which covers prepositions), a confidence score, the word or phrase identified, and the character positions of the word in the text.
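
A PoS tagging pipeline can be sketched the same way. The model name `vblagoje/bert-english-uncased-finetuned-pos` is an assumption (the lesson does not specify one), and `build_pos_pipeline`, `words_with_tag`, and `run_pos` are hypothetical helper names. As in current versions of transformers, `aggregation_strategy="simple"` stands in for `grouped_entities=True`.

```python
def build_pos_pipeline(model_name="vblagoje/bert-english-uncased-finetuned-pos"):
    """Create a PoS tagging pipeline. The default model name is an assumption."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_name,
        aggregation_strategy="simple",  # merge subword pieces into words
    )


def words_with_tag(results, tag):
    """Return the text of every result whose PoS tag equals `tag`."""
    return [item["word"] for item in results if item["entity_group"] == tag]


def run_pos(text, model_name="vblagoje/bert-english-uncased-finetuned-pos"):
    """Tag text, print each result, and return the raw pipeline output."""
    tagger = build_pos_pipeline(model_name)
    results = tagger(text)
    for item in results:
        # Each result carries the tag, a confidence score, the word,
        # and its character offsets in the original text.
        print(item["entity_group"], round(float(item["score"]), 3),
              item["word"], item["start"], item["end"])
    return results
```

For example, `words_with_tag(run_pos("The quick fox jumps over the lazy dog"), "VERB")` would collect the words tagged as verbs. With this kind of grouping, adjacent words that share a tag may be merged into a single entry, so results can contain phrases as well as single words.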

8. Let's practice!

Let's try out both pipelines and see what kind of insights we can extract from text!
