
Cleaning TED talks in a dataframe

In this exercise, we will revisit the TED Talks from the first chapter. You have been given a dataframe, ted, consisting of 5 TED Talks. Your task is to clean these talks using the techniques discussed earlier: write a function preprocess and apply it to the transcript feature of the dataframe.

The stopwords list is available as stopwords.

This exercise is part of the course

Feature Engineering for NLP in Python


Exercise instructions

  • Generate the Doc object for text. Ignore the disable argument for now.
  • Generate lemmas with a list comprehension, using the lemma_ attribute of each token.
  • Remove non-alphabetic lemmas by calling isalpha() in the if condition.
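
The filtering step can be tried on its own before touching the dataframe. The snippet below is a minimal sketch, not the exercise solution: the lemmas list is written by hand to stand in for the output of the list comprehension over the Doc, and the small stopwords set is an illustrative assumption (the exercise supplies its own stopwords list).

```python
# Hypothetical lemmas, written by hand to stand in for the lemmas taken from a Doc
lemmas = ["people", "love", "the", "idea", ",", "and", "it", "spread", "in", "2006", "!"]

# Tiny illustrative stopword set (an assumption; the exercise provides `stopwords`)
stopwords = {"the", "and", "it", "in"}

# Keep only lemmas that are purely alphabetic and are not stopwords
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]

# Join the surviving lemmas back into a single cleaned string
cleaned = " ".join(a_lemmas)
print(cleaned)  # people love idea spread
```

Punctuation and numbers fail the isalpha() check, so they are dropped together with the stopwords in a single pass.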

Hands-on interactive exercise

Try this exercise by completing the sample code.

# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(____, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.____ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.____ and lemma not in stopwords]
    
    return ' '.join(a_lemmas)
  
# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(____)
print(ted['transcript'])