
Cleaning TED talks in a dataframe

In this exercise, we will revisit the TED Talks from the first chapter. You have been given a dataframe ted consisting of 5 TED Talks. Your task is to clean these talks using the techniques discussed earlier: write a function preprocess and apply it to the transcript feature of the dataframe.

The stopwords list is available as stopwords.

This exercise is part of the course

<Kurs>Feature Engineering for NLP in Python</Kurs>

Exercise instructions

  • Generate the Doc object for text. Ignore the disable argument for now.
  • Generate lemmas with a list comprehension over the tokens in doc, using each token's lemma_ attribute.
  • Remove non-alphabetic lemmas by calling isalpha() in the if condition.
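
To make the filtering step concrete, here is a minimal sketch that needs no NLP library. The lemmas list and the tiny stopwords set are hand-written, hypothetical stand-ins (the exercise provides its own stopwords list and obtains lemmas from the Doc object); only the isalpha() check and the stopword test mirror the instructions above.

```python
# Hypothetical lemmas, roughly what a lemmatizer might return for a short sentence.
lemmas = ["the", "speaker", "give", "3", "talk", "in", "2019", ",",
          "and", "it", "be", "great", "!"]

# Hypothetical, deliberately tiny stopword set for illustration only.
stopwords = {"the", "in", "and", "it", "be"}

# Keep a lemma only if it is purely alphabetic and not a stopword.
# str.isalpha() is False for digits ("3", "2019") and punctuation (",", "!").
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]

cleaned = " ".join(a_lemmas)
print(cleaned)  # speaker give talk great
```

Note that isalpha() is applied to a whole lemma, so it drops entire tokens such as numbers and punctuation rather than stripping individual characters.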

Interactive hands-on exercise

Try this exercise by completing this sample code.

# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(____, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.____ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.____ and lemma not in stopwords]
    
    return ' '.join(a_lemmas)
  
# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(____)
print(ted['transcript'])
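
For reference, one possible way to write the same pipeline outside the exercise environment is sketched below. It is not the course's official solution. It departs from the template in one respect: nlp and stopwords are passed in as parameters (an assumption made here so the function does not rely on globals), and the commented usage lines assume that spaCy, its en_core_web_sm model, and a dataframe ted with a transcript column are available.

```python
def preprocess(text, nlp, stopwords):
    """Lemmatize text, then drop stopwords and non-alphabetic lemmas."""
    # Create the Doc object (any callable returning tokens with .lemma_ works)
    doc = nlp(text)
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic lemmas
    a_lemmas = [lemma for lemma in lemmas
                if lemma.isalpha() and lemma not in stopwords]
    return " ".join(a_lemmas)


# Possible usage, assuming spaCy and the en_core_web_sm model are installed
# and that `ted` and `stopwords` exist as described in the exercise:
#
# import spacy
# nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])
# ted["transcript"] = ted["transcript"].apply(
#     lambda t: preprocess(t, nlp, stopwords))
```

Disabling the ner and parser components at load time skips work the lemmas do not need; the template achieves a similar effect by passing disable when calling nlp.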