Removing stopwords
In the following exercises you're going to clean the Enron emails so that the data can be used in a topic model. Text cleaning can be challenging, so you'll learn some steps to do it well. The dataframe containing the emails, df, is available. As a first step, you need to define the list of stopwords and punctuation characters that will be removed from the text data in the next exercise. Let's give it a try.
This exercise is part of the course Fraud Detection in Python.
Exercise instructions
- Import the stopwords from nltk.
- Define 'english' words to use as stopwords under the variable stop.
- Get the punctuation set from the string package and assign it to exclude.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import nltk packages and string
from nltk.corpus import ____
import string
# Define stopwords to exclude
stop = set(____.____('____'))
stop.update(("to","cc","subject","http","from","sent", "ect", "u", "fwd", "www", "com"))
# Define punctuation to exclude
exclude = set(____.____)