
Removing stopwords

In the following exercises you're going to clean the Enron emails so that the data can be used in a topic model. Text cleaning can be challenging, so you'll learn some steps to do this well. The dataframe df containing the emails is available. As a first step, you need to define the list of stopwords and punctuation characters that will be removed from the text data in the next exercise. Let's give it a try.

This exercise is part of the course

Fraud Detection in Python


Exercise instructions

  • Import the stopwords from nltk.
  • Define the 'english' stopwords and assign them to the variable stop.
  • Get the punctuation set from the string package and assign it to exclude.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import nltk packages and string 
from nltk.corpus import ____
import string

# Define stopwords to exclude
stop = set(____.____('____'))
stop.update(("to","cc","subject","http","from","sent", "ect", "u", "fwd", "www", "com"))

# Define punctuation to exclude
exclude = set(____.____)
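If you want to check your answer outside the exercise environment, a completed version might look like the sketch below. Note that NLTK's 'english' stopword list requires a one-time corpus download, so this sketch falls back to a small hand-written stopword list (an assumption, not NLTK's full list) when the corpus is unavailable; the sample sentence and the cleaning loop at the end are illustrative additions, not part of the exercise.

```python
import string

# NLTK's stopword corpus needs a one-time download (nltk.download('stopwords')).
# Fall back to a small hand-written list if NLTK or its data is unavailable,
# so this sketch still runs. The fallback list is an assumption, not NLTK's
# full 'english' list.
try:
    from nltk.corpus import stopwords
    stop = set(stopwords.words("english"))
except (ImportError, LookupError):
    stop = {"the", "a", "an", "and", "or", "is", "are", "to", "of", "in"}

# Add domain-specific tokens that are noise in the Enron emails
stop.update(("to", "cc", "subject", "http", "from", "sent", "ect", "u",
             "fwd", "www", "com"))

# Punctuation characters to strip
exclude = set(string.punctuation)

# Quick check: clean one sample sentence
text = "Subject: FWD: meeting tomorrow!"
no_punct = "".join(ch for ch in text.lower() if ch not in exclude)
tokens = [word for word in no_punct.split() if word not in stop]
print(tokens)
```

Using a set for both stop and exclude makes the membership tests in the cleaning loop O(1), which matters once you apply this to the full email corpus.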