Cleaning TED talks in a dataframe
In this exercise, we will revisit the TED Talks from the first chapter. You have been given a dataframe ted consisting of 5 TED Talks. Your task is to clean these talks using the techniques discussed earlier by writing a function preprocess and applying it to the transcript feature of the dataframe.
The stopwords list is available as stopwords.
This exercise is part of the course
Feature Engineering for NLP in Python
Exercise instructions
- Generate the Doc object for text. Ignore the disable argument for now.
- Generate lemmas with a list comprehension using the lemma_ attribute.
- Remove non-alphabetic characters using isalpha() in the if condition.
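The filtering step in the last instruction can be tried on its own, without spaCy. The sketch below is only an illustration: the lemmas list is hand-written rather than produced by a Doc object, the stopwords set is a tiny made-up stand-in for the list provided in the exercise, and filter_lemmas is a hypothetical helper name. The "-PRON-" entry mimics the placeholder that spaCy 2.x used as the lemma of pronouns.

```python
# Hand-written stand-ins (assumptions for illustration only):
# in the exercise, lemmas come from a spaCy Doc and stopwords is provided.
lemmas = ["-PRON-", "have", "3", "idea", ",", "and", "they",
          "be", "worth", "spread", "!"]
stopwords = {"have", "and", "they", "be"}


def filter_lemmas(lemmas, stopwords):
    """Keep lemmas that are purely alphabetic and are not stopwords."""
    return [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]


# Join the surviving lemmas back into a single cleaned string
cleaned = " ".join(filter_lemmas(lemmas, stopwords))
print(cleaned)  # prints: idea worth spread
```

Because isalpha() returns False for any string containing digits, punctuation, or hyphens, the single condition drops numbers, punctuation tokens, and the "-PRON-" placeholder in one pass.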
Hands-on interactive exercise
Try this exercise by completing the sample code.
# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(____, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.____ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas
                if lemma.____ and lemma not in stopwords]

    return ' '.join(a_lemmas)

# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(____)
print(ted['transcript'])