Cleaning TED talks in a dataframe
In this exercise, we will revisit the TED Talks from the first chapter. You have been given a dataframe ted consisting of 5 TED Talks. Your task is to clean these talks using techniques discussed earlier by writing a function preprocess and applying it to the transcript feature of the dataframe.
The stopwords list is available as stopwords.
This exercise is part of the course Feature Engineering for NLP in Python
Exercise instructions
- Generate the Doc object for text. Ignore the disable argument for now.
- Generate lemmas with a list comprehension using the lemma_ attribute.
- Remove non-alphabetic characters using isalpha() in the if condition.
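
To see what the filtering step does on its own, here is a minimal sketch in plain Python. The lemma list and the small stopword set are hypothetical stand-ins made up for illustration; in the exercise, the lemmas come from the Doc object and stopwords is already provided. The "-PRON-" entry mimics the pronoun placeholder that older spaCy 2.x releases produce, which is an assumption about the version in use.

```python
# Hypothetical lemmas, standing in for the output of the lemma step
lemmas = ["talk", "be", "about", "18", "minute", ",", "-PRON-", "idea", "!"]

# Small illustrative stopword set (assumption); the exercise supplies its own
stopwords = {"be", "about", "the", "a"}

# Keep only lemmas that are purely alphabetic and are not stopwords
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]

# Join the surviving lemmas back into a single string
cleaned = " ".join(a_lemmas)
print(cleaned)  # talk minute idea
```

Numbers, punctuation, and the hyphenated "-PRON-" placeholder all fail isalpha(), so a single condition removes them without any special-casing.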
Hands-on interactive exercise
Try this exercise by completing the sample code.
# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(____, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.____ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas
                if lemma.____ and lemma not in stopwords]
    return ' '.join(a_lemmas)

# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(____)
print(ted['transcript'])