1. Tokenization and Lemmatization
In NLP,
we usually have to deal with texts from a variety of sources. For instance,
2. Text sources
it can be a news article
where the text is grammatically correct and proofread. It could be tweets
containing shorthand and hashtags. It could also be comments
on YouTube, where people have a tendency to abuse capital letters and punctuation.
3. Making text machine friendly
It is important that we standardize these texts into a machine-friendly format. We want our models to treat similar words as the same. Consider the words
Dogs and dog. Strictly speaking, they are different strings. However, they mean the same thing. Similarly,
reduction, reducing and reduce should also be standardized to the same string, regardless of their form or case.
don't and do not,
and won't and will not. In the next couple of lessons, we will learn techniques to achieve this.
4. Text preprocessing techniques
The text preprocessing techniques you use depend on the application you're working on. We'll be covering the common ones, including converting words into lowercase,
removing unnecessary whitespace,
removing punctuation,
removing commonly occurring words
or stopwords, expanding contracted words
like don't and removing special characters
such as numbers and emojis.
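As a preview, here is a minimal sketch of a few of these steps using spaCy's token attributes. The example sentence is made up for illustration, and the lessons that follow cover these techniques in more detail.

import spacy

nlp = spacy.load('en_core_web_sm')

text = "  The 2 dogs BARKED loudly !!  🐶 "

# Strip surrounding whitespace and lowercase the text
doc = nlp(text.strip().lower())

# Keep only alphabetic, non-stopword tokens
# (is_alpha drops punctuation, numbers and emojis; is_stop drops words like 'the')
cleaned = [token.text for token in doc if token.is_alpha and not token.is_stop]
print(cleaned)  # ['dogs', 'barked', 'loudly']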
5. Tokenization
To do this, we must first understand tokenization. Tokenization is the process of splitting a string into its constituent tokens. These tokens may be sentences, words or punctuation marks, and tokenization is specific to a particular language. In this course, we will primarily focus on word and punctuation tokens. For instance, consider this sentence.
Tokenizing it into its constituent words and punctuation marks will yield the following
list of tokens. Tokenization also involves expanding contracted words.
Therefore, a word like don't gets decomposed into two tokens:
do and n't, as can be seen in this example.
6. Tokenization using spaCy
To perform tokenization in Python, we will use the spaCy library. We first import
the spacy library. Next, we load a pre-trained English model, 'en_core_web_sm',
using spacy.load(). This returns a Language object that has the know-how to perform tokenization. This is stored in the variable nlp. Let's now define a string
we want to tokenize. We pass this string into nlp
to generate a spaCy Doc object. We store this in a variable named doc. This Doc object contains the required tokens (and many other things, as we will soon find out). We generate the list of tokens by using
list comprehension as shown. This is essentially looping over doc and extracting the text of each token in each iteration.
The result is as follows.
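Put together, the code described above might look like the following sketch. The example string is an assumption, since the slide's sentence is not reproduced in the transcript.

import spacy

# Load the pre-trained English model; this returns a Language object
nlp = spacy.load('en_core_web_sm')

# An assumed example string
string = "Don't worry, tokenization isn't hard!"

# Pass the string into nlp to generate a spaCy Doc object
doc = nlp(string)

# Extract the text of each token with a list comprehension
tokens = [token.text for token in doc]
print(tokens)
# ['Do', "n't", 'worry', ',', 'tokenization', 'is', "n't", 'hard', '!']

Notice how the contractions don't and isn't are expanded into two tokens each.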
7. Lemmatization
Lemmatization is the process of converting a word
into its lowercased base form or lemma. This is an extremely powerful process of standardization. For instance, the words
reducing, reduces, reduced and reduction, when lemmatized, are all converted into the base form reduce. Similarly, forms of the verb be,
such as am, are and is, are converted into be. Lemmatization also allows us to convert words with apostrophes into their full forms. Therefore,
n't is converted to not
and 've is converted to have.
8. Lemmatization using spaCy
When you pass the string into nlp, spaCy automatically performs lemmatization by default. Therefore, generating lemmas is identical
to generating tokens except that we extract
token.lemma_ in each iteration inside the list comprehension instead of token.text. Also, observe how
spaCy converted each I into -PRON-. This is standard behavior: every pronoun is converted into the string '-PRON-'.
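A minimal sketch, again with an assumed example string. Note that the '-PRON-' lemma is specific to spaCy v2.x; spaCy v3 and later no longer use it.

import spacy

nlp = spacy.load('en_core_web_sm')

string = "I've been reducing stress."
doc = nlp(string)

# Same list comprehension as before, but extracting token.lemma_ instead of token.text
lemmas = [token.lemma_ for token in doc]
print(lemmas)
# With spaCy v2.x: ['-PRON-', 'have', 'be', 'reduce', 'stress', '.']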
9. Let's practice!
Once we understand how to perform tokenization and lemmatization, performing the text preprocessing techniques described earlier becomes easier. Before we move to that, let's first practice our understanding of the concepts introduced so far.