1. Part-of-speech tagging
In this lesson,
we will cover part-of-speech tagging, which is one of the most widely used feature engineering techniques in NLP.
2. Applications
Part-of-speech tagging, or POS tagging, has a wide range of applications in NLP. It is used in word-sense disambiguation
to identify the sense of a word in a sentence. For instance, consider the sentences "the bear is a majestic animal"
and "please bear with me".
Both sentences use the word 'bear', but they mean different things. POS tagging helps make this distinction by tagging 'bear' as a noun in the first sentence and as a verb in the second. Consequently, POS tagging is also used in sentiment analysis,
question answering systems
and linguistic approaches to detect fake news
and opinion spam. For example, one paper found that fake news headlines, on average, tend to use fewer common nouns and more proper nouns than mainstream headlines. Generating the POS tags for these words proved extremely useful in detecting false or hyperpartisan news.
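As a rough illustration of the headline finding mentioned above, the sketch below counts common nouns and proper nouns in a list of (token, tag) pairs. The function name noun_counts, the sample headline, and its hand-written tags are all assumptions made for this example; they are not taken from the paper and were not produced by a tagger.

```python
from collections import Counter


def noun_counts(tagged_tokens):
    """Count common nouns (NOUN) and proper nouns (PROPN) in (token, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    # A Counter returns 0 for tags that never occurred.
    return {"NOUN": counts["NOUN"], "PROPN": counts["PROPN"]}


# Hypothetical headline with hand-assigned tags (illustrative only).
headline = [
    ("Senator", "PROPN"),
    ("Smith", "PROPN"),
    ("slams", "VERB"),
    ("new", "ADJ"),
    ("tax", "NOUN"),
    ("plan", "NOUN"),
]

print(noun_counts(headline))
```

Counts like these could then be compared across many headlines, although a single headline on its own says little.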
3. POS tagging
So what is POS tagging? It is the process of assigning a part of speech
to every word (or token) in a piece of text. For instance, consider the sentence
"Jane is an amazing guitarist". A typical POS tagger will label Jane
as a proper noun, is
as a verb, an
as a determiner (or an article), amazing
as an adjective and finally, guitarist
as a noun.
4. POS tagging using spaCy
POS tagging is extremely easy with spaCy's models, and performing it is almost identical to generating tokens or lemmas. As usual,
we import the spacy library and load the en_core_web_sm model as nlp. We will use the same sentence
"Jane is an amazing guitarist" from before. We will then create a Doc object
by passing the sentence to nlp, which performs POS tagging by default.
5. POS tagging using spaCy
Using a list comprehension,
we generate a list of tuples, pos, where the first element of each tuple is the token text, obtained with token.text, and the second element is its POS tag, obtained with token.pos_. Printing pos
will give us the following output. Note how the tagger correctly identified all the parts of speech as discussed earlier. That said, remember that POS tagging is not an exact science. spaCy infers the POS tags of these words from the predictions of its pre-trained models. In other words, the accuracy of the POS tagging depends on the data the model was trained on and the data it is being used on.
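The steps described on the last two slides can be sketched roughly as follows. This assumes that spaCy and the en_core_web_sm model are installed. The helper names pos_tags and tag_sentence are ours and are not part of spaCy.

```python
def pos_tags(doc):
    """Return (token text, POS tag) tuples for every token in a spaCy Doc."""
    return [(token.text, token.pos_) for token in doc]


def tag_sentence(text, model_name="en_core_web_sm"):
    """Load a spaCy model, run it on the text, and return the tagged tokens."""
    # Imported inside the function so that pos_tags stays usable on its own.
    import spacy

    nlp = spacy.load(model_name)
    doc = nlp(text)  # the default pipeline includes POS tagging
    return pos_tags(doc)
```

Calling tag_sentence("Jane is an amazing guitarist") should return one tuple per token, such as ('Jane', 'PROPN'). The exact tags depend on the model version: recent models may label "is" as AUX (auxiliary) rather than VERB.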
6. POS annotations in spaCy
spaCy can identify close to 20 parts of speech and, as we saw on the previous slide, it uses specific annotations to denote each one. For instance, PROPN
refers to a proper noun and DET
refers to a determiner. You can find the complete list of POS annotations used by spaCy
in spaCy's documentation. Here
is a snapshot of the web page.
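Besides the documentation, the annotations can also be looked up from code. The sketch below uses spacy.explain, which returns a short description for a tag from spaCy's built-in glossary. The wrapper describe_tags and its explain parameter are our own additions for illustration, not part of spaCy.

```python
def describe_tags(tags, explain=None):
    """Map each POS annotation to a human-readable description."""
    if explain is None:
        # Fall back to spaCy's glossary lookup when no function is supplied.
        import spacy

        explain = spacy.explain
    return {tag: explain(tag) for tag in tags}
```

With spaCy installed, describe_tags(["PROPN", "DET"]) should map PROPN to "proper noun" and DET to "determiner". Passing a different explain function makes it possible to swap in another glossary.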
7. Let's practice!
Great! Let's now practice POS tagging in the next few exercises.