Cleaning text data
Now that you've defined the stopwords and punctuation, let's use them to further clean the Enron emails in the dataframe df. The lists containing the stopwords and the punctuation characters are available as stop and exclude, respectively.
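For reference, here is a minimal sketch of how such lists are typically built, assuming NLTK's English stopword list and Python's built-in string.punctuation (the actual definitions come from the previous exercise and may differ):

from nltk.corpus import stopwords
import string

# Assumed definitions: English stopwords and all ASCII punctuation characters
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)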
There are a few more steps to take before you have clean data, such as lemmatizing words and stemming verbs. The verbs in the email data are already stemmed, and the lemmatization setup is done for you in this exercise.
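To see the difference between the two techniques, here is a small self-contained comparison using NLTK's PorterStemmer and WordNetLemmatizer (a demo outside the exercise; it assumes the WordNet data has been downloaded, e.g. via nltk.download('wordnet')):

from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stemmer = PorterStemmer()
lemma = WordNetLemmatizer()

print(stemmer.stem("running"))          # 'run'   - crude suffix stripping
print(lemma.lemmatize("geese"))         # 'goose' - dictionary-based, nouns by default
print(lemma.lemmatize("running", "v"))  # 'run'   - lemmatized as a verb

Stemming chops suffixes mechanically, while lemmatization maps each word to its dictionary form, which is why the two are treated as separate steps.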
This exercise is part of the course Fraud Detection in Python.
Hands-on interactive exercise
Have a go at this exercise by working through the sample code below.
# Import the lemmatizer from nltk
from nltk.stem.wordnet import WordNetLemmatizer
lemma = WordNetLemmatizer()

# Define word cleaning function
def clean(text, stop):
    # Strip trailing whitespace
    text = text.rstrip()
    # Remove stopwords and purely numeric tokens
    stop_free = " ".join([word for word in text.lower().split() if ((word not in stop) and (not word.isdigit()))])
    # Remove punctuation characters (exclude is the punctuation set defined earlier)
    punc_free = ''.join(word for word in stop_free if word not in exclude)
    # Lemmatize all words
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
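As a usage sketch, you could then apply clean() to the email bodies. Note that the column names content and clean_content are hypothetical placeholders; the exercise does not specify which column of df holds the email text:

# Clean every email body and store the result in a new column
df['clean_content'] = df['content'].apply(lambda x: clean(x, stop))
print(df['clean_content'].head())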