Cleaning a blog post
In this exercise, you have been given an excerpt from a blog post. Your task is to clean this text into a more machine-friendly format. This involves converting the text to lowercase, lemmatizing it, and removing stopwords, punctuation and non-alphabetic characters.
The excerpt is available as a string blog and has been printed to the console. The list of stopwords is available as stopwords.
This exercise is part of the course Feature Engineering for NLP in Python
Exercise instructions
- Using a list comprehension, loop through doc to extract the lemma_ of each token.
- Remove stopwords and non-alphabetic tokens using stopwords and isalpha().
Hands-on interactive exercise
Try this exercise by completing the sample code.
import spacy
# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(blog)
# Generate lemmatized tokens
lemmas = [token.____ for token in ____]
# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas
            if lemma.____ and lemma not in ____]
# Print string after text cleaning
print(' '.join(a_lemmas))
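The filtering step can be tried without spaCy. The following is a minimal sketch, not the exercise's reference solution: the lemmas list and the stopwords set are small hand-written stand-ins for the preloaded variables, and clean_lemmas is a hypothetical helper name. It also lowercases each lemma explicitly, on the assumption that the lemmatizer may not lowercase every token by itself.

```python
# Hand-written stand-ins (assumptions) for the exercise's preloaded data.
# In the exercise itself, the lemmas would come from the spaCy Doc:
#     lemmas = [token.lemma_ for token in doc]
lemmas = ["Twenty", "-", "first", "century", "politic", "have", "witness",
          "an", "alarming", "rise", "of", "populism", "in", "the", "2010"]
stopwords = {"have", "an", "of", "in", "the"}


def clean_lemmas(lemmas, stopwords):
    """Keep lowercased, purely alphabetic lemmas that are not stopwords."""
    lowered = (lemma.lower() for lemma in lemmas)
    return [lemma for lemma in lowered
            if lemma.isalpha() and lemma not in stopwords]


a_lemmas = clean_lemmas(lemmas, stopwords)

# Print string after text cleaning
print(" ".join(a_lemmas))
# twenty first century politic witness alarming rise populism
```

Here str.isalpha() returns False for "-" and "2010", so punctuation and numbers are dropped along with the stopwords. Using a set for stopwords keeps each membership check fast, though a list would behave the same way.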