Part-of-speech tagging

1. Part-of-speech tagging

In this lesson, we will cover part-of-speech tagging, which is one of the most popularly used feature engineering techniques in NLP.

2. Applications

Part-of speech tagging or POS tagging has an immense number of applications in NLP. It is used in word-sense disambiguation to identify the sense of a word in a sentence. For instance, consider the sentences "the bear is a majestic animal" and "please bear with me". Both sentences use the word 'bear' but they mean different things. POS tagging helps in identifying this distinction by identifying one bear as a noun and the other as a verb. Consequentially, POS tagging is also used in sentiment analysis, question answering systems and linguistic approaches to detect fake news and opinion spam. For example, one paper discovered that fake news headlines, on average, tend to use lesser common nouns and more proper nouns than mainstream headlines. Generating the POS tags for these words proved extremely useful in detecting false or hyperpartisan news.

3. POS tagging

So what is POS tagging? It is the process of assigning every word (or token) in a piece of text, its corresponding part-of-speech. For instance, consider the sentence "Jane is an amazing guitarist". A typical POS tagger will label Jane as a proper noun, is as a verb, an as a determiner (or an article), amazing as an adjective and finally, guitarist as a noun.

4. POS tagging using spaCy

POS Tagging is extremely easy to do using spaCy's models and performing it is almost identical to generating tokens or lemmas. As usual, we import the spacy library and load the en_core_web_sm model as nlp. We will use the same sentence "Jane is an amazing guitarist" from before. We will then create a Doc object that will perform POS tagging, by default.

5. POS tagging using spaCy

Using list comprehension, we generate a list of tuples pos where the first element of the tuple is the token and is generated using token.text and the second element is its POS tag, which is generated using token.pos_. Printing pos will give us the following output. Note how the tagger correctly identified all the parts-of-speech as we had discussed earlier. That said, remember that POS tagging is not an exact science. spaCy infers the POS tags of these words based on the predictions given by its pre-trained models. In other words, the accuracy of the POS tagging is dependent on the data that the model has been trained on and the data that it is being used on.

6. POS annotations in spaCy

spaCy is capable of identifying close to 20 parts-of-speech and as we saw in the previous slide, it uses specific annotations to denote a particular part of speech. For instance, PROPN referred to a proper noun and DET referred to a determinant. You can find the complete list of POS annotations used by spaCy in spaCy's documentation. Here is a snapshot of the web page.

7. Let's practice!

Great! Let's now practice our understanding of POS tagging in the next few exercises.