1. Text cleaning
Now that we know how to convert a string into a list of lemmas, we are in a good position to perform basic text cleaning.
2. Text cleaning techniques
Some of the most common text cleaning steps include removing extra whitespace, escape sequences, punctuation, special characters such as numbers, and stopwords.
In other words, it is very common to remove non-alphabetic tokens and words that occur so commonly that they are not very useful for analysis.
3. isalpha()
Every Python string has an isalpha() method that returns True if all the characters in the string are alphabetic. Therefore, "Dog".isalpha() will return True, but "3dogs".isalpha() will return False because it contains the non-alphabetic character 3. Similarly, strings containing numbers, punctuation, or emojis will return False too. This makes isalpha() an extremely convenient way to remove all (lemmatized) tokens that are or contain numbers, punctuation, or emojis.
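To make the behavior concrete, here is how isalpha() responds to a few sample strings:

```python
# isalpha() is True only when every character in the string is alphabetic
print("Dog".isalpha())    # True
print("3dogs".isalpha())  # False: contains the digit 3
print("!?".isalpha())     # False: punctuation
print("2023".isalpha())   # False: numbers
print("🐶".isalpha())     # False: emoji
```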
4. A word of caution
If isalpha() seems too good to be true as a silver bullet that cleans text meticulously, that's because it is. Remember that isalpha() has a tendency to return False on words we would not want to remove. Examples include abbreviations such as U.S.A. and U.K., which have periods in them, and proper nouns with numbers in them, such as word2vec and xto10x. For such nuanced cases, isalpha() may not be sufficient. It may be advisable to write your own custom functions, typically using regular expressions, to ensure you're not inadvertently removing useful words.
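As a rough sketch of such a custom check, the function below keeps tokens that start with a letter and otherwise contain only letters, digits, or periods. The pattern is one illustrative choice, not a definitive rule; you would adapt it to your own data:

```python
import re

def keep_token(token):
    # Illustrative filter: keep tokens that start with a letter and
    # contain only letters, digits, or periods
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9.]*", token) is not None

print(keep_token("U.S.A."))    # True
print(keep_token("word2vec"))  # True
print(keep_token("!!!"))       # False
```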
5. Removing non-alphabetic characters
Consider the string here. It has a lot of punctuation, unnecessary extra whitespace, escape sequences, numbers, and emojis. We will generate the lemmatized tokens like before.
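A minimal sketch of this step, assuming spaCy's en_core_web_sm model and a made-up messy string standing in for the one on the slide (in spaCy 2.x, pronouns lemmatize to the placeholder -PRON-):

```python
import spacy

nlp = spacy.load('en_core_web_sm')

# Hypothetical stand-in for the messy string on the slide
string = "  He   won the lottery!!! 100%  \n guaranteed 🎉 "

# Generate the lemmatized tokens, like before
doc = nlp(string)
lemmas = [token.lemma_ for token in doc]
print(lemmas)
```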
6. Removing non-alphabetic characters
Next, we loop through the tokens again and keep only those that are either the -PRON- placeholder or contain only alphabetic characters. Let's now print out the sanitized string. We see that all the non-alphabetic characters have been removed and each word is separated by a single space.
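Continuing the sketch above, the filtering and joining step could look like this:

```python
# Keep only -PRON- placeholders and purely alphabetic lemmas
a_lemmas = [lemma for lemma in lemmas
            if lemma == '-PRON-' or lemma.isalpha()]

# Print the sanitized string
print(' '.join(a_lemmas))
```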
7. Stopwords
There are some words in the English language that occur so commonly that it is often a good idea to just ignore them. Examples include articles such as a and the, be-verbs such as is and am, and pronouns such as he and she.
8. Removing stopwords using spaCy
spaCy has a built-in list of stopwords, which we can access using spacy.lang.en.stop_words.STOP_WORDS.
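For instance, the built-in list can be inspected like this (the exact number of stopwords depends on your spaCy version):

```python
from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a set of common English stopwords
print(len(STOP_WORDS))       # size varies by spaCy version
print('the' in STOP_WORDS)   # True
print('he' in STOP_WORDS)    # True
```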
9. Removing stopwords using spaCy
We make a small tweak to the a_lemmas generation step. Notice that we have removed the -PRON- condition, as pronouns are stopwords anyway and should be removed. Additionally, we have introduced a new condition to check whether the word belongs to spaCy's list of stopwords. The output is as follows. Notice how the string now consists only of base-form words. Always exercise caution while using third-party stopword lists: it is common for an application to find certain words useful even though a third-party list considers them stopwords. It is often advisable to create your own custom stopword lists.
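A sketch of the tweaked generation step, reusing the lemmas and STOP_WORDS from the snippets above:

```python
# Keep only alphabetic lemmas that are not stopwords
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in STOP_WORDS]

# Print the sanitized string
print(' '.join(a_lemmas))
```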
10. Other text preprocessing techniques
There are other preprocessing techniques that are used but have been omitted for the sake of brevity. Some of them include removing HTML or XML tags, replacing accented characters, and correcting spelling errors and shorthands.
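As a rough, standard-library-only sketch of two of these steps (tag stripping and accent replacement); for real HTML, a dedicated parser such as BeautifulSoup would be more robust:

```python
import re
import unicodedata

text = "<p>Café déjà vu</p>"

# Strip HTML/XML tags with a simple, illustrative regex
no_tags = re.sub(r"<[^>]+>", " ", text)

# Replace accented characters with their closest ASCII equivalents
ascii_text = (unicodedata.normalize("NFKD", no_tags)
              .encode("ascii", "ignore")
              .decode("ascii"))

print(ascii_text)  # " Cafe deja vu "
```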
11. A word of caution
We have covered a lot of text preprocessing techniques in the last couple of lessons. However, a word of caution is in order. The text preprocessing techniques you use always depend on the application. There are many applications that may find punctuation, numbers, and emojis useful, so it may be wise not to remove them. In other cases, words written in all caps may themselves carry useful information.
Remember to always use only those techniques that are relevant to your particular use case.
12. Let's practice!
It's now time to practice!