Cleaning TED talks in a dataframe
In this exercise, we will revisit the TED Talks from the first chapter. You have been given a dataframe ted consisting of 5 TED Talks. Your task is to clean these talks using the techniques discussed earlier by writing a preprocess function and applying it to the transcript feature of the dataframe.
The stopwords list is available as stopwords.
This exercise is part of the course Feature Engineering for NLP in Python.
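As a quick refresher on the pandas side of the task, here is a minimal sketch of how Series.apply maps a function over every value in a column. The toy dataframe and the str.lower transformation are invented for illustration and are not part of the exercise setup.

import pandas as pd

# Toy dataframe standing in for ted (invented for illustration)
toy = pd.DataFrame({'transcript': ['Hello World!', 'TED Talks are great.']})

# Series.apply calls the given function once per value in the column
toy['transcript'] = toy['transcript'].apply(str.lower)
print(toy['transcript'])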
Exercise instructions
- Generate the Doc object for text. Ignore the disable argument for now.
- Generate lemmas with a list comprehension, using the lemma_ attribute of each token.
- Remove non-alphabetic characters using isalpha() in the if condition.
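To see these steps in isolation, here is a minimal sketch of spaCy lemmatization on a toy sentence. It assumes the en_core_web_sm model is installed locally; the sample code below expects a comparable model to already be loaded as nlp.

import spacy

# Load a small English model (the exercise environment preloads nlp)
nlp = spacy.load('en_core_web_sm')

doc = nlp("The cats were running past 3 barking dogs.")

# Each token exposes its base form through the lemma_ attribute
lemmas = [token.lemma_ for token in doc]
print(lemmas)

# isalpha() filters out tokens such as numbers and punctuation
print([lemma for lemma in lemmas if lemma.isalpha()])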
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(____, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.____ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas
                if lemma.____ and lemma not in stopwords]
    return ' '.join(a_lemmas)

# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(____)
print(ted['transcript'])
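For reference, here is one way the blanks might be filled in, following the instructions above. This is a sketch rather than the official solution, and it assumes nlp, stopwords, and ted are already defined as described in the exercise. The disable argument simply skips pipeline components that are not needed for lemmatization, which speeds up processing.

# Function to preprocess text
def preprocess(text):
    # Create Doc object; NER and parsing are not needed for lemmas
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas
                if lemma.isalpha() and lemma not in stopwords]
    return ' '.join(a_lemmas)

# Apply preprocess to every transcript in the dataframe
ted['transcript'] = ted['transcript'].apply(preprocess)
print(ted['transcript'])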