1. Text cleaning
Now that we know how to convert a string into a list of lemmas, we are in a good position to perform basic text cleaning.
2. Text cleaning techniques
Some of the most common text cleaning steps include removing extra whitespace, escape sequences, punctuation, special characters such as numbers, and stopwords.
In other words, it is very common to remove non-alphabetic tokens and words that occur so commonly that they are not very useful for analysis.
3. isalpha()
Every Python string has an isalpha() method that returns True if all the characters in the string are alphabetic. Therefore, "Dog".isalpha() will return True, but "3dogs".isalpha() will return False because it contains the non-alphabetic character 3. Similarly, strings containing numbers, punctuation, or emojis will return False too. This makes isalpha() an extremely convenient way to remove all (lemmatized) tokens that are or contain numbers, punctuation, or emojis.
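To make the behavior concrete, here is how isalpha() responds to a few sample strings:

```python
# isalpha() is True only when every character in the string is alphabetic
print("Dog".isalpha())    # True
print("3dogs".isalpha())  # False: contains the digit 3
print("!?".isalpha())     # False: punctuation
print("2023".isalpha())   # False: numbers
print("🐶".isalpha())     # False: emoji
```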
4. A word of caution
If isalpha() seems too good to be true as a silver bullet that cleans text meticulously, that's because it is. Remember that isalpha() has a tendency to return False on words we would not want to remove. Examples include abbreviations such as U.S.A. and U.K., which have periods in them, and proper nouns with numbers in them, such as word2vec and xto10x. For such nuanced cases, isalpha() may not be sufficient. It may be advisable to write your own custom functions, typically using regular expressions, to ensure you're not inadvertently removing useful words.
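As a rough sketch of such a custom check, the function below keeps tokens that start with a letter and otherwise contain only letters, digits, or periods. The pattern is one illustrative choice, not a definitive rule; you would adapt it to your own data:

```python
import re

def keep_token(token):
    # Illustrative filter: keep tokens that start with a letter and
    # contain only letters, digits, or periods
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9.]*", token) is not None

print(keep_token("U.S.A."))    # True
print(keep_token("word2vec"))  # True
print(keep_token("!!!"))       # False
```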
5. Removing non-alphabetic characters
Consider the string here. It has a lot of punctuation, unnecessary extra whitespace, escape sequences, numbers, and emojis. We will generate the lemmatized tokens like before.
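A minimal sketch of this step, assuming spaCy's en_core_web_sm model and a made-up messy string standing in for the one on the slide (in spaCy 2.x, pronouns lemmatize to the placeholder -PRON-):

```python
import spacy

nlp = spacy.load('en_core_web_sm')

# Hypothetical stand-in for the messy string on the slide
string = "  He   won the lottery!!! 100%  \n guaranteed 🎉 "

# Generate the lemmatized tokens, like before
doc = nlp(string)
lemmas = [token.lemma_ for token in doc]
print(lemmas)
```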
6. Removing non-alphabetic characters
Next, we loop through the tokens again and keep only those that are either the -PRON- placeholder or contain only alphabetic characters. Let's now print out the sanitized string. We see that all the non-alphabetic characters have been removed and each word is separated by a single space.
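Continuing the sketch above, the filtering and joining step could look like this:

```python
# Keep only -PRON- placeholders and purely alphabetic lemmas
a_lemmas = [lemma for lemma in lemmas
            if lemma == '-PRON-' or lemma.isalpha()]

# Print the sanitized string
print(' '.join(a_lemmas))
```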
7. Stopwords
There are some words in the English language that occur so commonly that it is often a good idea to just ignore them. Examples include articles such as a and the, be-verbs such as is and am, and pronouns such as he and she.
8. Removing stopwords using spaCy
spaCy has a built-in list of stopwords, which we can access using spacy.lang.en.stop_words.STOP_WORDS.
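For instance, the built-in list can be inspected like this (the exact number of stopwords depends on your spaCy version):

```python
from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a set of common English stopwords
print(len(STOP_WORDS))       # size varies by spaCy version
print('the' in STOP_WORDS)   # True
print('he' in STOP_WORDS)    # True
```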
9. Removing stopwords using spaCy
We make a small tweak to the a_lemmas generation step. Notice that we have removed the -PRON- condition, as pronouns are stopwords anyway and should be removed. Additionally, we have introduced a new condition to check whether the word belongs to spaCy's list of stopwords. The output is as follows. Notice how the string now consists only of base-form words. Always exercise caution while using third-party stopword lists: it is common for an application to find certain words useful even though a third-party list considers them stopwords. It is often advisable to create your own custom stopword lists.
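A sketch of the tweaked generation step, reusing the lemmas and STOP_WORDS from the snippets above:

```python
# Keep only alphabetic lemmas that are not stopwords
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in STOP_WORDS]

# Print the sanitized string
print(' '.join(a_lemmas))
```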
10. Other text preprocessing techniques
There are other preprocessing techniques that are used but have been omitted for the sake of brevity. Some of them include removing HTML or XML tags, replacing accented characters, and correcting spelling errors and shorthands.
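As a rough, standard-library-only sketch of two of these steps (tag stripping and accent replacement); for real HTML, a dedicated parser such as BeautifulSoup would be more robust:

```python
import re
import unicodedata

text = "<p>Café déjà vu</p>"

# Strip HTML/XML tags with a simple, illustrative regex
no_tags = re.sub(r"<[^>]+>", " ", text)

# Replace accented characters with their closest ASCII equivalents
ascii_text = (unicodedata.normalize("NFKD", no_tags)
              .encode("ascii", "ignore")
              .decode("ascii"))

print(ascii_text)  # " Cafe deja vu "
```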
11. A word of caution
We have covered a lot of text preprocessing techniques in the last couple of lessons. However, a word of caution is in order. The text preprocessing techniques you use always depend on the application. There are many applications that may find punctuation, numbers, and emojis useful, so it may be wise not to remove them. In other cases, words written in all caps may themselves carry useful information.
Remember to always use only those techniques that are relevant to your particular use case.
12. Let's practice!
It's now time to practice!