Removing stopwords

In the following exercises you'll clean the Enron emails so that the data can be used in a topic model. Text cleaning can be challenging, so you'll learn some steps for doing it well. The dataframe containing the emails, df, is available. As a first step, define the stopwords and punctuation characters that will be removed from the text data in the next exercise. Let's give it a try.

This exercise is part of the course

Fraud Detection in Python

Exercise instructions

  • Import stopwords from nltk.corpus.
  • Load the 'english' stopwords and store them as a set in the variable stop.
  • Build a set from the punctuation characters in the string module and assign it to exclude.

Interactive exercise

Try this exercise by completing the sample code.

# Import nltk packages and string 
from nltk.corpus import ____
import string

# Define stopwords to exclude
stop = set(____.____('____'))
stop.update(("to","cc","subject","http","from","sent", "ect", "u", "fwd", "www", "com"))

# Define punctuation to exclude
exclude = set(____.____)
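
To show what the two sets are for, here is a minimal sketch of how stop and exclude might be applied to a single email. It is an illustration under stated assumptions, not the course's solution: the small hard-coded word set stands in for the NLTK English stopword list so the snippet runs without downloading the corpus, and the clean_tokens helper and the sample sentence are hypothetical.

```python
import string

# Stand-in for the NLTK English stopword list (assumption for this sketch):
# a handful of common English words, so no corpus download is needed.
stop = {"the", "a", "an", "is", "are", "and", "of", "in", "on", "at",
        "this", "that", "it", "be", "will", "we", "you", "your", "for"}

# Email-specific tokens added in the exercise template.
stop.update(("to", "cc", "subject", "http", "from", "sent",
             "ect", "u", "fwd", "www", "com"))

# Every ASCII punctuation character, as a set for fast membership tests.
exclude = set(string.punctuation)


def clean_tokens(text):
    """Hypothetical helper: lowercase, strip punctuation, drop stopwords."""
    lowered = text.lower()
    # Remove punctuation character by character.
    no_punct = "".join(ch for ch in lowered if ch not in exclude)
    # Split on whitespace and keep only tokens that are not stopwords.
    return [tok for tok in no_punct.split() if tok not in stop]


sample = "Subject: Fwd: Please review the Q3 trading report, sent to you on Monday!"
tokens = clean_tokens(sample)
print(tokens)
```

Storing both collections as sets keeps each membership check cheap, which matters when the same filter runs over every token of every email. Note that stripping punctuation character by character also fuses strings such as URLs into a single token, so a real pipeline may want to handle those separately.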