Stop words

1. Stop words

In every language, there are words that occur very frequently but carry little information. Sometimes, it is useful to get rid of them before we build a machine learning model.

2. What are stop words and how to find them?

Words that occur very frequently but carry little information are called stop words. But how do we know which words are not informative? First, in every language there is a set of words that most practitioners agree are not useful and should be removed when performing a natural language processing task. For instance, in English the definite and indefinite articles ('the', 'a', 'an'), conjunctions ('and', 'but', 'for'), prepositions ('on', 'in', 'at'), etc. are stop words. Second, depending on the context, we might want to expand the standard set of stop words. For example, in the movie reviews dataset, we might want to exclude words such as 'film', 'movie', 'cinema', etc.
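To make the idea concrete, here is a minimal sketch in plain Python. The stop word sets and the sample review are made up for illustration and are far shorter than any real list; splitting on whitespace is also a simplification of proper tokenization.

```python
# Tiny illustrative stop word set; real default lists are much longer.
base_stop_words = {"the", "a", "an", "and", "but", "for", "on", "in", "at"}

# Context-specific additions for a movie reviews dataset.
domain_stop_words = {"film", "movie", "cinema"}

# Combine both sets; duplicates are dropped automatically.
stop_words = base_stop_words | domain_stop_words

# Hypothetical review text, lowercased and split on whitespace.
review = "The film was slow at the start but the story in the end was moving"
tokens = review.lower().split()

# Keep only the tokens that are not stop words.
kept = [token for token in tokens if token not in stop_words]
print(kept)
```

The remaining tokens are the ones more likely to say something about the review itself.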

3. Stop words with word clouds

Maybe you recall from a previous video that we built word clouds using movie reviews. Here is an example of two word clouds built from the movie reviews. In the picture on the left, the stop words have not been removed. Words that pop up are 'film' and 'br', where 'br' is a remnant of the HTML line-break tag. In the cloud on the right, stop words have been removed and now we see words such as 'character', 'see', 'good', 'story'.

4. Remove stop words from word clouds

How do we remove stop words when creating a word cloud? Let's start by reviewing how we built a word cloud. First, we import the WordCloud class from wordcloud. We also import the default list of STOPWORDS from wordcloud. To create our list of stop words, we can take a set of the default list. A set is like a list but contains only unique, non-repeating items. We can extend the set of stop words by calling its update method and passing it a list of extra words. We pass our stop words, called my_stopwords, to the stopwords argument of WordCloud. Then we display the cloud. So, the only new thing we added here is the stopwords argument. Everything else stays the same.

5. Stop words with BOW

Removing non-informative words when we are building a BOW transformation can also be very useful. This can easily be incorporated in the CountVectorizer. First, we need to import the default set of English stop words, ENGLISH_STOP_WORDS, from the same feature_extraction.text module of scikit-learn. Let's assume we want to enrich this default set with movie-specific words. To do that, we call the union method on the default set. Remember that a union of two sets A and B consists of all elements of A and all elements of B, with no element repeated. In our case, the union adds each new word to the default stop words if that word is not already there. To use the constructed set, we specify the stop_words argument of the CountVectorizer to be equal to our defined set. Everything else stays the same and should look pretty familiar by now. One important thing to note is that removing stop words reduces the size of the vocabulary we build using BOW or another approach.
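A minimal sketch of this workflow follows. The two reviews are made up for illustration. One assumption to flag: recent scikit-learn releases validate that stop_words is a list, so the sketch converts the set with list() before passing it; on older versions the set itself may be accepted.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical mini-corpus standing in for the movie reviews.
reviews = [
    "This film was a great movie with a good story",
    "The movie had a weak story but a good character",
]

# union returns a new set containing the default stop words
# plus any movie-specific words not already present.
my_stop_words = ENGLISH_STOP_WORDS.union(["film", "movie", "cinema"])

# Baseline: vocabulary without any stop word removal.
vect_all = CountVectorizer()
vect_all.fit(reviews)

# Same vectorizer, but with our stop words removed.
# list() is used because newer scikit-learn versions expect a list here.
vect_stop = CountVectorizer(stop_words=list(my_stop_words))
vect_stop.fit(reviews)

# vocabulary_ maps each retained term to its column index.
print(len(vect_all.vocabulary_), len(vect_stop.vocabulary_))
print(sorted(vect_stop.vocabulary_))
```

Comparing the two vocabulary sizes shows the reduction mentioned above: every stop word that is removed is one column fewer in the BOW matrix.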

6. Let's practice!

Let's solve some exercises where you will practice removing stop words!
