Cleaning text data
Now that you've defined the stopwords and punctuation, let's use these to further clean the Enron emails in the DataFrame df. The lists containing the stopwords and punctuation are available under stop and exclude respectively.
There are a few more steps to take before you have clean data, such as lemmatization of words and stemming of verbs. The verbs in the email data are already stemmed, and the lemmatization is done for you in this exercise.
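As a reminder, stop and exclude were built in an earlier step. A minimal sketch of how they might look (the course derives stop from NLTK's English stopword list; a small hand-picked sample is hardcoded here so the sketch runs without downloading NLTK data):

```python
import string

# Sample stopwords -- an assumption standing in for nltk.corpus.stopwords.words('english')
stop = {"the", "a", "an", "and", "or", "in", "of", "to", "is"}

# All ASCII punctuation characters, removed one character at a time later on
exclude = set(string.punctuation)

print(len(exclude))  # string.punctuation contains 32 characters
```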
This exercise is part of the course
Fraud Detection in Python
Interactive exercise
Have a go at this exercise by completing the sample code.
# Import the lemmatizer from nltk
from nltk.stem.wordnet import WordNetLemmatizer
lemma = WordNetLemmatizer()

# Define word cleaning function
def clean(text, stop):
    # Strip trailing whitespace
    text = text.rstrip()
    # Remove stopwords and purely numeric tokens
    stop_free = " ".join([word for word in text.lower().split() if ((word not in stop) and (not word.isdigit()))])
    # Remove punctuation, character by character
    punc_free = ''.join(word for word in stop_free if word not in exclude)
    # Lemmatize all words
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
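To see the cleaning steps in action, here is a self-contained sketch of the same pipeline. The lemmatization step is omitted so the example runs without NLTK's WordNet data, and the sample emails are made up for illustration; in the course you would apply clean() to the email text column of df instead:

```python
import string

stop = {"the", "a", "to", "and", "is"}   # sample stopwords (assumption)
exclude = set(string.punctuation)

def clean(text, stop):
    """Simplified clean(): stopword, digit, and punctuation removal.
    Lemmatization is left out here so no NLTK data is required."""
    text = text.rstrip()
    # Drop stopwords and purely numeric tokens
    stop_free = " ".join(word for word in text.lower().split()
                         if word not in stop and not word.isdigit())
    # Strip punctuation character by character
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    return punc_free

emails = ["Please send the report to John by 5!",
          "Meeting is at 10, don't be late."]
cleaned = [clean(e, stop) for e in emails]
print(cleaned)
```

Note that tokens like "5!" survive the isdigit() check because of the attached punctuation; the punctuation is only stripped in the following step, which is why the order of the two filters matters.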