Text preprocessing practice
Now it's your turn to apply the techniques you've learned to clean up text for better NLP results. You'll remove stop words and non-alphabetic characters, lemmatize the tokens, and then build a new bag-of-words from your cleaned text.
You start with the same tokens you created in the last exercise: `lower_tokens`. You also have the `Counter` class imported.
This exercise is part of the course Introduction to Natural Language Processing in Python.
Exercise instructions
- Import the `WordNetLemmatizer` class from `nltk.stem`.
- Create a list `alpha_only` that contains only alphabetical characters. You can use the `.isalpha()` method to check for this.
- Create another list called `no_stops` consisting of words from `alpha_only` that are not contained in `english_stops`.
- Initialize a `WordNetLemmatizer` object called `wordnet_lemmatizer` and use its `.lemmatize()` method on the tokens in `no_stops` to create a new list called `lemmatized`.
- Create a new `Counter` called `bow` with the lemmatized words.
- Lastly, print the 10 most common tokens.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import WordNetLemmatizer
____
# Retain alphabetic words: alpha_only
alpha_only = [t for t in ____ if ____]
# Remove all stop words: no_stops
no_stops = [t for t in ____ if t not in ____]
# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = ____
# Lemmatize all tokens into a new list: lemmatized
lemmatized = [____ for t in ____]
# Create the bag-of-words: bow
bow = ____(____)
# Print the 10 most common tokens
print(____.____(__))