Text normalization techniques

1. Text normalization techniques

Now let's jump into text normalization.

2. Text normalization

Language is flexible; people write the same idea in different ways. But machines prefer consistency. Normalization brings words into a standard form, so variations of the same word are treated the same. This is important for tasks like classification, search, or sentiment analysis. Let's explore three normalization techniques: lowercasing, stemming, and lemmatization.

3. Lowercasing

A word like "data" can be written with different capitalization ("Data", "DATA", "data") but still represent the same concept. Without lowercasing, these variants will be treated as separate tokens. This can mislead models and unnecessarily inflate the vocabulary size. We can resolve this using the .lower() method on the original text. The result is a version where all instances of "data" are unified.
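
As a minimal sketch of this step (the sample sentence below is just an assumed illustration):

# Sample text with inconsistent capitalization (assumed example)
text = "Data analysis turns DATA into insight, and data is everywhere."

# str.lower() returns a copy of the string with every character lowercased
normalized = text.lower()

print(normalized)
# data analysis turns data into insight, and data is everywhere.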

4. Lowercasing applicability

Lowercasing simplifies text for many NLP tasks that rely on keyword analysis such as text classification or sentiment analysis.

5. Lowercasing applicability

But in cases where case affects meaning, like "us" versus "US", it's better to preserve capitalization.

6. Stemming

Next is stemming, a technique that reduces words to their root form by trimming off suffixes. For example, "running" becomes "run", and "reading" becomes "read". It's fast and useful in tasks like search, where we care more about grouping similar words than grammar. The trade-off is that stemming is crude: it can map "organization" to the unrelated word "organ", or produce non-dictionary forms such as "studi" from "studies".

7. Stemming in code

To use it in code, we import PorterStemmer from nltk.stem and create a stemmer. For a given list of tokens, we apply stemmer.stem() to each word using a list comprehension. The result is a list of stemmed words, some of which may not be actual words we'd find in a dictionary.
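
Here is a runnable sketch of that slide, assuming an example token list:

from nltk.stem import PorterStemmer

# Create a Porter stemmer, the classic suffix-stripping algorithm in NLTK
stemmer = PorterStemmer()

# Assumed example tokens; in practice these come from a tokenizer
tokens = ["running", "reading", "organization", "studies"]

# Stem each token with a list comprehension
stemmed = [stemmer.stem(token) for token in tokens]

print(stemmed)
# ['run', 'read', 'organ', 'studi']

Note how "studi" is not a dictionary word, and "organ" has drifted to an unrelated one: suffix stripping is fast, but crude.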

8. Lemmatization

Lemmatization, like stemming, reduces words to their base form, but uses vocabulary and grammar rules to return real words. For example, "organizations" becomes "organization" rather than the stemmer's "organ". Some words, like "running", may stay the same because the lemmatizer treats each word as a noun unless told otherwise, and "running" is already a valid noun. Lemmatization is ideal when grammatical accuracy matters, such as in document similarity or question answering. However, it is slower than stemming.

9. Lemmatization in code

To use it in code, we import WordNetLemmatizer from nltk.stem, download the wordnet resource from NLTK, and initialize the lemmatizer. We apply it to a given list of tokens using lemmatizer.lemmatize() inside a list comprehension, resulting in a list of clean, valid English words.
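
A minimal sketch of that slide, using the same assumed token list:

import nltk
from nltk.stem import WordNetLemmatizer

# WordNet is the lexical database the lemmatizer consults
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

# Assumed example tokens
tokens = ["organizations", "studies", "running"]

# Without a part-of-speech hint, lemmatize() treats each word as a noun
lemmas = [lemmatizer.lemmatize(token) for token in tokens]

print(lemmas)
# ['organization', 'study', 'running']

Passing a part-of-speech tag changes the result: lemmatizer.lemmatize("running", pos="v") returns "run", because the word is then looked up as a verb.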

10. Stemming vs. lemmatization

We can think of lemmatization as carefully trimming a tree to preserve its structure: slower but more accurate, always producing valid words. Stemming is like using a chainsaw: faster but often chopping words into non-dictionary forms. Choose lemmatization when grammar matters, like in chatbots, translation, or text generation. For speed and large datasets, like in search engines or topic tagging, stemming is a solid, lightweight option.
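
To make the contrast concrete, here is a small side-by-side comparison under the same assumptions as the earlier sketches:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Assumed example words; stems may be crude, lemmas stay valid words
for word in ["organizations", "studies", "running"]:
    print(f"{word}: stem={stemmer.stem(word)}, lemma={lemmatizer.lemmatize(word)}")

# organizations: stem=organ, lemma=organization
# studies: stem=studi, lemma=study
# running: stem=run, lemma=running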

11. Let's practice!

Time for some practice!
