Text preprocessing practice
Now it's your turn to apply the techniques you've learned to clean up text for better NLP results. You'll remove stop words and non-alphabetic tokens, lemmatize what remains, and build a new bag-of-words from the cleaned text.
You start with the same tokens you created in the last exercise, lower_tokens. You also have the Counter class imported.
This exercise is part of the course Introduction to Natural Language Processing in Python.
Exercise instructions
- Import the WordNetLemmatizer class from nltk.stem.
- Create a list alpha_only that contains only alphabetic tokens. You can use the .isalpha() method to check for this.
- Create another list called no_stops consisting of words from alpha_only that are not contained in english_stops.
- Initialize a WordNetLemmatizer object called wordnet_lemmatizer and use its .lemmatize() method on the tokens in no_stops to create a new list called lemmatized.
- Create a new Counter called bow with the lemmatized words.
- Lastly, print the 10 most common tokens.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import WordNetLemmatizer
____
# Retain alphabetic words: alpha_only
alpha_only = [t for t in ____ if ____]
# Remove all stop words: no_stops
no_stops = [t for t in ____ if t not in ____]
# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = ____
# Lemmatize all tokens into a new list: lemmatized
lemmatized = [____ for t in ____]
# Create the bag-of-words: bow
bow = ____(____)
# Print the 10 most common tokens
print(____.____(__))