Cleaning a blog post
In this exercise, you have been given an excerpt from a blog post. Your task is to clean this text into a more machine-friendly format. This involves converting to lowercase, lemmatizing, and removing stopwords, punctuation, and non-alphabetic characters.
The excerpt is available as a string blog and has been printed to the console. The list of stopwords is available as stopwords.
This exercise is part of the course Feature Engineering for NLP in Python.
Instructions
- Using list comprehension, loop through doc to extract the lemma_ of each token.
- Remove stopwords and non-alphabetic tokens using stopwords and isalpha().
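The filtering step can be tried on its own, without spaCy, as a minimal sketch. The names sample_lemmas, sample_stopwords and keep_alpha_non_stopwords are hypothetical and exist only for illustration; in the exercise, the lemmas come from doc and the stopword list is the provided stopwords.

```python
# Hypothetical lemmas, standing in for the lemmas extracted from a spaCy Doc
sample_lemmas = ["the", "fox", "jump", "over", "2", "lazy", "dog", "!"]

# Hypothetical stopword set; the exercise supplies its own `stopwords` list
sample_stopwords = {"the", "over"}


def keep_alpha_non_stopwords(lemmas, stopwords):
    """Keep lemmas that are purely alphabetic and are not stopwords."""
    return [
        lemma for lemma in lemmas
        if lemma.isalpha() and lemma not in stopwords
    ]


cleaned = keep_alpha_non_stopwords(sample_lemmas, sample_stopwords)

# Join the surviving lemmas back into a single cleaned string
print(" ".join(cleaned))  # fox jump lazy dog
```

Here str.isalpha() rejects "2" and "!" because neither consists only of letters, and the membership test drops "the" and "over". A set is used for the stopwords in this sketch because membership checks on a set are fast; a list would give the same result.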
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(blog)
# Generate lemmatized tokens
lemmas = [token.____ for token in ____]
# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas
            if lemma.____ and lemma not in ____]
# Print string after text cleaning
print(' '.join(a_lemmas))