Stop words and punctuation handling

1. Stop words and punctuation handling

After tokenization, the next step in preprocessing is cleaning by removing stop words and punctuation.

2. Stop words

Not all words in a sentence carry useful information. Words like "the", "and", or "is" appear frequently in text but contribute little to the machine's understanding of context. These are called stop words. Because they don't add much value in many NLP tasks, removing stop words can help models focus on the more important, content-rich words.

3. Stop words removal

Whether to remove stop words depends on the task. For example, when identifying the topic of a text such as a product review, words like "great" or "product" are more informative than common words like "this" or "a".

4. Stop words removal

In tasks like translation, however, even stop words contribute to the meaning and should be preserved.

5. Accessing stop words

NLTK provides a built-in list of stop words for several languages, including English. These are based on linguistic research and are commonly used for basic filtering. To access them, we import stopwords from nltk.corpus and download the data with nltk.download('stopwords'). We can then view or use the stop words of a given language by calling stopwords.words() with the language name. Here we can see the first 10 words.

6. Removing stop words

Once we have our stop words, we can filter them out of our tokens. We define the text and tokenize it to obtain a list of words. Then, using a list comprehension, we keep only the tokens for which `word.lower() not in stop_words` is true, that is, the tokens that do not appear in the stop_words list. Calling .lower() makes the comparison case-insensitive, which matters because the entries in the stop word list are lowercase. The output is a list of the more meaningful words, with the stop words removed.
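The filtering step can be sketched as follows. To keep the example self-contained, it assumes a tiny hand-picked stop word set and a hand-written token list; both are stand-ins for the NLTK stop word list and the tokenizer output used in the lesson.

```python
# Stand-in for the NLTK English stop word list (an illustrative subset only).
stop_words = {"the", "is", "a", "on", "this", "and"}

# Tokens as a word tokenizer might return them (written by hand for this sketch).
tokens = ["This", "is", "a", "great", "product", "and",
          "the", "price", "is", "fair", "."]

# Keep a token only if its lowercase form is not a stop word.
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print(filtered_tokens)  # ['great', 'product', 'price', 'fair', '.']
```

Note that the period survives this step, since it is not a stop word. Storing the stop words in a set rather than a list also makes each membership check faster, which can matter on larger texts.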

7. Punctuation

Next, let's handle punctuation. Punctuation marks such as commas, periods, and question marks, along with special characters such as hashtags or slashes, help structure language for humans, but in many NLP applications they don't add meaningful information.

8. Punctuation removal

For example, if we want to find the most common or important words in a group of documents, like identifying key topics in customer reviews, news articles, or social media posts, punctuation can get in the way and add noise.
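To illustrate the kind of noise meant here, the following sketch counts token frequencies with collections.Counter on a hand-written token list (the tokens are made up for the example):

```python
from collections import Counter

# Hypothetical tokens from a short review, with punctuation kept as tokens.
tokens = ["great", ",", "great", "price", ",", "fast", "delivery", ",", "."]

# Count how often each token occurs and look at the two most frequent ones.
top_tokens = Counter(tokens).most_common(2)

print(top_tokens)  # [(',', 3), ('great', 2)]
```

The comma ranks above every actual word, which is exactly the sort of result that obscures the key terms in a frequency analysis.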

9. Punctuation removal

However, in tasks like text summarization or text generation, punctuation marks help maintain sentence structure and clarity. Without them, the output can become hard to read, inconsistent, or grammatically incorrect.

10. Accessing and removing punctuation

Python's string module includes string.punctuation, a string containing the common ASCII punctuation marks and special characters. We use it to further clean the filtered_tokens list from the previous step by removing the tokens that are punctuation marks. Using a list comprehension, we keep only the tokens that are not punctuation. The result is a list of meaningful words, free from both stop words and punctuation. For example, the period at the end of the sentence has been removed, making the text cleaner for analysis.
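A minimal sketch of this step, assuming a hand-written filtered_tokens list in place of the output from the stop word step. Converting string.punctuation to a set is a choice made for this sketch, so that each token is compared against individual characters rather than matched as a substring of the whole punctuation string.

```python
import string

# Stand-in for the tokens left after stop word removal (written by hand).
filtered_tokens = ["great", "product", ",", "price", "fair", "."]

# string.punctuation is a single string of ASCII punctuation characters;
# a set of its characters gives exact, per-character membership checks.
punctuation_marks = set(string.punctuation)

# Keep a token only if it is not a single punctuation character.
clean_tokens = [word for word in filtered_tokens if word not in punctuation_marks]

print(clean_tokens)  # ['great', 'product', 'price', 'fair']
```

This only drops tokens that consist of exactly one punctuation character. Multi-character tokens such as "..." or words with embedded punctuation such as "don't" are left untouched and would need separate handling if the task calls for it.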

11. Let's practice!

Remember, preprocessing depends on your task. Always align cleaning steps with your end goal. Now, let's try this out in code.

Natural Language Processing (NLP) in Python