Cleaning TED talks in a dataframe
In this exercise, we will revisit the TED Talks from the first chapter. You have been given a dataframe ted consisting of 5 TED Talks. Your task is to clean these talks using the techniques discussed earlier by writing a function preprocess and applying it to the transcript feature of the dataframe.
The stopwords list is available as stopwords.
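In the exercise environment, nlp, stopwords, and ted are already defined for you. If you want to follow along locally, a minimal setup along these lines would work (a sketch that assumes the en_core_web_sm model and spaCy's built-in English stopword list; the placeholder transcripts below stand in for the five full talks provided by the course):

import spacy
import pandas as pd
from spacy.lang.en.stop_words import STOP_WORDS

# Load a small English model and use spaCy's default English stopword list
nlp = spacy.load('en_core_web_sm')
stopwords = list(STOP_WORDS)

# Placeholder dataframe; the course data contains 5 full TED Talk transcripts
ted = pd.DataFrame({'transcript': ["We're going to talk about design today.",
                                   "Thank you so much. It is really an honour to be here."]})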
This exercise is part of the course Feature Engineering for NLP in Python.
Exercise instructions
- Generate the Doc object for text. Ignore the disable argument for now.
- Generate lemmas using a list comprehension and the lemma_ attribute.
- Remove non-alphabetic characters using isalpha() in the if condition.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(____, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.____ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas
                if lemma.____ and lemma not in stopwords]
    return ' '.join(a_lemmas)

# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(____)
print(ted['transcript'])
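For reference, here is one way the blanks could be filled in (a sketch that assumes the spaCy v2-style API used in the course, where calling nlp on a text accepts a disable argument):

# Function to preprocess text
def preprocess(text):
    # Create Doc object, skipping the NER and parser components for speed
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas for every token
    lemmas = [token.lemma_ for token in doc]
    # Keep only alphabetic lemmas that are not stopwords
    a_lemmas = [lemma for lemma in lemmas
                if lemma.isalpha() and lemma not in stopwords]
    return ' '.join(a_lemmas)

# Apply preprocess to every transcript in the dataframe
ted['transcript'] = ted['transcript'].apply(preprocess)
print(ted['transcript'])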