Tokenization and Lemmatization

1. Tokenization and Lemmatization

In NLP, we usually have to deal with texts from a variety of sources. For instance,

2. Text sources

it could be a news article where the text is grammatically correct and proofread. It could be tweets containing shorthand and hashtags. It could also be comments on YouTube, where people have a tendency to abuse capital letters and punctuation.

3. Making text machine friendly

It is important that we standardize these texts into a machine-friendly format. We want our models to treat similar words as the same word. Consider the words Dogs and dog. Strictly speaking, they are different strings. However, they connote the same thing. Similarly, reduction, reducing and reduce should be standardized to the same string regardless of their form and case usage. Other examples include don't and do not, and won't and will not. In the next couple of lessons, we will learn techniques to achieve this.

4. Text preprocessing techniques

The text preprocessing techniques you use depend on the application you're working on. We'll be covering the common ones, including converting words into lowercase, removing unnecessary whitespace, removing punctuation, removing commonly occurring words or stopwords, expanding contracted words like don't, and removing special characters such as numbers and emojis.
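As a preview, here is a minimal plain-Python sketch of a few of these steps; the sample text and the tiny stopword set are purely illustrative, and the following lessons show how to do this properly with spaCy.

```python
import string

# Illustrative text and a toy stopword set (not a full stopword list).
text = "  The DOGS don't like the RAIN!!  "
stopwords = {"the", "a", "an"}

# Lowercase the text and strip unnecessary whitespace.
text = text.lower().strip()

# Remove punctuation characters.
text = "".join(ch for ch in text if ch not in string.punctuation)

# Split on whitespace and drop stopwords.
words = [word for word in text.split() if word not in stopwords]

print(words)  # ['dogs', 'dont', 'like', 'rain']
```

Notice that this naive approach mangles don't into dont instead of expanding it into do not, which is one reason we turn to proper tokenization next.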

5. Tokenization

To do this, we must first understand tokenization. Tokenization is the process of splitting a string into its constituent tokens. These tokens may be sentences, words or punctuation marks, and tokenization is specific to a particular language. In this course, we will primarily focus on word and punctuation tokens. For instance, consider this sentence. Tokenizing it into its constituent words and punctuation marks will yield the following list of tokens. Tokenization also involves expanding contracted words. Therefore, a word like don't gets decomposed into two tokens, do and n't, as can be seen in this example.

6. Tokenization using spaCy

To perform tokenization in Python, we will use the spaCy library. We first import spacy. Next, we load a pre-trained English model, 'en_core_web_sm', using spacy.load(). This returns a Language object that has the know-how to perform tokenization. We store it in the variable nlp. Let's now define a string we want to tokenize. We pass this string into nlp to generate a spaCy Doc object, which we store in a variable named doc. This Doc object contains the required tokens (and many other things, as we will soon find out). We generate the list of tokens using a list comprehension as shown. This essentially loops over doc and extracts the text of each token in each iteration. The result is as follows.
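Putting those steps together, a minimal runnable sketch might look like the following; the example sentence is illustrative rather than the one shown on the slide.

```python
import spacy

# Load the pre-trained English model; this returns a Language object.
nlp = spacy.load('en_core_web_sm')

# An illustrative string to tokenize.
text = "Don't do this."

# Pass the string into nlp to create a spaCy Doc object.
doc = nlp(text)

# Loop over the Doc and extract the text of each token.
tokens = [token.text for token in doc]

print(tokens)  # ["Do", "n't", "do", "this", "."]
```

Note how the contracted word Don't is split into the two tokens Do and n't, just as described above.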

7. Lemmatization

Lemmatization is the process of converting a word into its lowercased base form, or lemma. This is an extremely powerful standardization technique. For instance, the words reducing, reduces, reduced and reduction, when lemmatized, are all converted into the base form reduce. Similarly, forms of the verb be, such as am, are and is, are converted into be. Lemmatization also allows us to convert words with apostrophes into their full forms. Therefore, n't is converted to not and 've is converted to have.

8. Lemmatization using spaCy

When you pass the string into nlp, spaCy automatically performs lemmatization by default. Therefore, generating lemmas is identical to generating tokens, except that inside the list comprehension we extract token.lemma_ in each iteration instead of token.text. Also, observe how spaCy converted each I into -PRON-. This is standard behavior, where every pronoun is converted into the string '-PRON-'.
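Here is a minimal sketch of that pattern; the sentence is illustrative, and the exact lemmas depend on the model version (spaCy v2 models produce '-PRON-' as described above, while spaCy v3 and later return the pronoun itself).

```python
import spacy

# Load the pre-trained English model.
nlp = spacy.load('en_core_web_sm')

# An illustrative sentence with a contraction and inflected forms.
doc = nlp("I've been reducing the numbers.")

# Extract the lemma of each token instead of its text.
lemmas = [token.lemma_ for token in doc]

print(lemmas)
# With a spaCy v2 model, roughly: ['-PRON-', 'have', 'be', 'reduce', 'the', 'number', '.']
# With spaCy v3 and later, the pronoun is returned as-is: ['I', 'have', 'be', ...]
```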

9. Let's practice!

Once we understand how to perform tokenization and lemmatization, applying the text preprocessing techniques described earlier becomes much easier. Before we move on to that, let's first practice our understanding of the concepts introduced so far.