Cleaning a blog post
In this exercise, you have been given an excerpt from a blog post. Your task is to clean this text into a more machine friendly format. This will involve converting to lowercase, lemmatization and removing stopwords, punctuations and non-alphabetic characters.
The excerpt is available as a string blog
and has been printed to the console. The list of stopwords are available as stopwords
.
This exercise is part of the course
Feature Engineering for NLP in Python
Exercise instructions
- Using list comprehension, loop through
doc
to extract thelemma_
of each token. - Remove stopwords and non-alphabetic tokens using
stopwords
andisalpha()
.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(blog)
# Generate lemmatized tokens
lemmas = [token.____ for token in ____]
# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas
if lemma.____ and lemma not in ____]
# Print string after text cleaning
print(' '.join(a_lemmas))