1. Stemming and lemmatization
In a language, words are often derived from other words, meaning words can share the same root. When we create a numeric transformation of a text feature, we might want to strip a word down to its root. This is the topic of this lesson.
2. What is stemming?
This process is called stemming. More formally, stemming can be defined as the transformation of words to their root forms, even if the stem itself is not a valid word in the language.
For example, staying, stays, stayed will be mapped to the root 'stay', and house, houses, housing will be mapped to the root 'hous'. In general, stemming will tend to chop off suffixes such as '-ed', '-ing', '-er', as well as plural or possessive forms.
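As a minimal sketch of the mappings just described, we can run the two word families through nltk's PorterStemmer, which is introduced later in this lesson. The variable names here are only illustrative.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# The two word families mentioned above
stay_words = ["staying", "stays", "stayed"]
house_words = ["house", "houses", "housing"]

# Stem each word on its own
stay_stems = [porter.stem(word) for word in stay_words]
house_stems = [porter.stem(word) for word in house_words]

print(stay_stems)   # ['stay', 'stay', 'stay']
print(house_stems)  # ['hous', 'hous', 'hous']
```

Note that 'hous' is not an English word, which is exactly what separates a stem from a lemma.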
3. What is lemmatization?
Lemmatization is quite a similar process to stemming, with the main difference that with lemmatization, the resulting roots are valid words in the language.
Going back to our examples of words derived from 'stay', lemmatization reduces them to 'stay';
and words derived from 'house' are reduced to the noun 'house'.
4. Stemming vs. lemmatization
You might wonder when to use stemming and when to use lemmatization.
The main difference is in the obtained roots. With lemmatization they are actual words and with stemming they might not be.
So if in your problem it's important to retain words, not only roots, lemmatization would be more suitable.
However, if you use nltk, which is what we will use in this course, stemming is typically faster than lemmatization: the stemmer applies a fixed set of suffix-stripping rules, while the lemmatizer has to look words up in the WordNet database. Furthermore, lemmatization depends on knowing the part of speech of the word you want to lemmatize, that is, whether it is a noun, a verb, an adjective, and so on.
5. Stemming of strings
One popular stemmer is the PorterStemmer class in the nltk.stem package. The PorterStemmer is not the only stemmer in nltk but it's quite fast and easy to use, so it's often a standard choice.
We create an instance of the PorterStemmer class and store it under the name porter. We can then call porter.stem on a string, for example, 'wonderful'. The result is 'wonder'.
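The steps just described can be sketched as follows. The name result is only illustrative; porter and the input string come from the lesson.

```python
from nltk.stem import PorterStemmer

# Create a stemmer instance and store it under the name porter
porter = PorterStemmer()

# Stem a single word
result = porter.stem("wonderful")
print(result)  # wonder
```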
6. Non-English stemmers
Stemming is possible for other languages as well, such as Danish, Dutch, French, Spanish, German, etc.
To use non-English stemmers we need the SnowballStemmer class from nltk. When we create the stemmer, we specify the language we want to use. Then we apply the stem function to our string. For example, we have imported a Dutch stemmer and fed it a Dutch verb. The result is the root of the verb.
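A minimal sketch of this could look as follows. The lesson does not say which Dutch verb is used, so 'beginnen' ('to begin') is our own assumption, and the names dutch_stemmer, verb and root are only illustrative.

```python
from nltk.stem.snowball import SnowballStemmer

# Languages supported by the Snowball stemmer in the installed nltk version
print(SnowballStemmer.languages)

# Create a stemmer for Dutch
dutch_stemmer = SnowballStemmer("dutch")

# Assumption: 'beginnen' is an example verb chosen for this sketch
verb = "beginnen"
root = dutch_stemmer.stem(verb)
print(root)
```

Printing SnowballStemmer.languages first is a quick way to check how the language name should be spelled before creating the stemmer.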
7. How to stem a sentence?
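If you apply the PorterStemmer to a whole sentence, the result is essentially the original sentence, because the stemmer treats the entire string as a single word (apart from lowercasing, which nltk's PorterStemmer applies by default). We see that no word in our 'Today is a wonderful day!' sentence has been stemmed.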
We need to stem each word in the sentence separately. Therefore, as a first step, we need to transform the sentence into tokens using the familiar word_tokenize function. In the second step, we apply the stemming function on each word of the sentence, using a list comprehension.
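The two steps can be sketched like this. The word_tokenize function needs nltk's tokenizer data to be installed; the fallback to wordpunct_tokenize is our own assumption, added only so the sketch also runs without that data. For this particular sentence both tokenizers give the same tokens.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, wordpunct_tokenize

porter = PorterStemmer()
sentence = "Today is a wonderful day!"

# Stemming the whole string: it is treated as one long "word",
# so 'wonderful' is left as it is
whole = porter.stem(sentence)
print(whole)

# Step 1: split the sentence into tokens
try:
    tokens = word_tokenize(sentence)
except LookupError:
    # Assumption: tokenizer data is missing, use a data-free tokenizer
    tokens = wordpunct_tokenize(sentence)

# Step 2: stem every token with a list comprehension
stems = [porter.stem(token) for token in tokens]
print(stems)  # ['today', 'is', 'a', 'wonder', 'day', '!']
```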
8. Lemmatization of a string
The lemmatization of strings is similar to stemming. We import the WordNetLemmatizer from the nltk.stem module. It uses the WordNet database to look up lemmas of words.
We create an instance of the WordNetLemmatizer class and store it under the name WNlemmatizer. We can then call WNlemmatizer.lemmatize() on 'wonderful'. Note that we have specified a part of speech, given by the 'pos' argument. The default pos is noun, or 'n'. Here we specify an adjective, which is why pos = 'a'. The result is 'wonderful'. Recall that stemming returned 'wonder' as a result.
9. Let's practice!
Let's solve some exercises and reinforce the concepts related to stemming and lemmatization.