Simple text preprocessing

1. Simple text preprocessing

In this video, we will cover some simple text preprocessing.

2. Why preprocess?

Text processing helps make for better input data when performing machine learning or other statistical methods. For example, in the last few exercises you have applied small bits of preprocessing (like tokenization) to create a bag of words. You also noticed that applying simple techniques like lowercasing all of the tokens, can lead to slightly better results for a bag-of-words model. Preprocessing steps like tokenization or lowercasing words are commonly used in NLP. Other common techniques are things like lemmatization or stemming, where you shorten the words to their root stems, or techniques like removing stop words, which are common words in a language that don't carry a lot of meaning -- such as and or the, or removing punctuation or unwanted tokens. Of course, each model and process will have different results -- so it's good to try a few different approaches to preprocessing and see which works best for your task and goal.

3. Preprocessing example

We have here some example input and output text we might expect from preprocessing. First we have a simple two sentence string about pets. Then we have some example output tokens we want. You can see that the text has been tokenized and that everything is lowercase. We also notice that stopwords have been removed and the plural nouns have been made singular.

4. Text preprocessing with Python

We can perform text preprocessing using many of the tools we already know and have learned. In this code, we are using the same text as from our previous video, a few sentences about a cat with a box. We can use list comprehensions to tokenize the sentences which we first make lowercase using the string lower method. The string is_alpha method will return True if the string has only alphabetical characters. We use the is_alpha method along with an if statement iterating over our tokenized result to only return only alphabetic strings (this will effectively strip tokens with numbers or punctuation). To read out the process in both code and English we say we take each token from the word_tokenize output of the lowercase text if it contains only alphabetical characters. In the next line, we use another list comprehension to remove words that are in the stopwords list. This stopwords list for english comes built in with the NLTK library. Finally, we can create a counter and check the two most common words, which are now cat and box (unlike the and box which were the two tokens returned in our first result). Preprocessing has already improved our bag of words and made it more useful by removing the stopwords and non-alphabetic words.

5. Let's practice!

You can now get started by preprocessing your own text!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.