Cleaning a blog post

In this exercise, you have been given an excerpt from a blog post. Your task is to clean this text into a more machine friendly format. This will involve converting to lowercase, lemmatization and removing stopwords, punctuations and non-alphabetic characters.

The excerpt is available as a string blog and has been printed to the console. The list of stopwords are available as stopwords.

Diese Übung ist Teil des Kurses

<Kurs>Feature Engineering for NLP in Python</Kurs>

Kurs ansehen

Übungsanweisungen

Using list comprehension, loop through doc to extract the lemma_ of each token.
Remove stopwords and non-alphabetic tokens using stopwords and isalpha().

Interaktive praktische Übung

Versuche dich an dieser Übung, indem du diesen Beispielcode vervollständigst.

# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(blog)

# Generate lemmatized tokens
lemmas = [token.____ for token in ____]

# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas 
            if lemma.____ and lemma not in ____]

# Print string after text cleaning
print(' '.join(a_lemmas))

Code bearbeiten und ausführen