Text preprocessing practice

Now it's your turn to apply the techniques you've learned to clean up text for better NLP results. You'll need to remove stop words and non-alphabetic tokens, lemmatize the remaining tokens, and build a new bag-of-words from the cleaned text.

You start with the same tokens you created in the last exercise: lower_tokens. You also have the Counter class imported.

This exercise is part of the course

Introduction to Natural Language Processing in Python

Exercise instructions

  • Import the WordNetLemmatizer class from nltk.stem.
  • Create a list alpha_only that contains only the tokens from lower_tokens made up entirely of alphabetic characters. You can use the .isalpha() method to check for this.
  • Create another list called no_stops consisting of words from alpha_only that are not contained in english_stops.
  • Initialize a WordNetLemmatizer object called wordnet_lemmatizer and use its .lemmatize() method on the tokens in no_stops to create a new list called lemmatized.
  • Create a new Counter called bow with the lemmatized words.
  • Lastly, print the 10 most common tokens.
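
The filtering and lemmatizing steps above can be sketched on a toy token list. This is a minimal illustration, not the course solution: the contents of lower_tokens and english_stops here are made-up stand-ins for the preloaded objects, and toy_lemmatize is a hypothetical dictionary lookup used in place of WordNetLemmatizer so the sketch runs without downloading NLTK data.

```python
# Toy stand-ins for the exercise's preloaded objects (assumptions, not the course data).
lower_tokens = ["the", "cats", "sat", "on", "2", "mats", ",",
                "and", "the", "dogs", "barked", "."]
english_stops = {"the", "on", "and"}

# Hypothetical lookup standing in for WordNetLemmatizer; in the exercise you
# would call wordnet_lemmatizer.lemmatize(t) instead.
toy_lemmas = {"cats": "cat", "mats": "mat", "dogs": "dog"}


def toy_lemmatize(token):
    """Return the toy lemma for a token, or the token itself if unknown."""
    return toy_lemmas.get(token, token)


# Keep only tokens made up entirely of alphabetic characters.
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Drop stop words.
no_stops = [t for t in alpha_only if t not in english_stops]

# Reduce each remaining token to its (toy) lemma.
lemmatized = [toy_lemmatize(t) for t in no_stops]

print(lemmatized)
```

Note that "2", "," and "." are dropped by the .isalpha() check before the stop-word filter runs, so the order of the two filters does not change the result here.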

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import WordNetLemmatizer
____

# Retain alphabetic words: alpha_only
alpha_only = [t for t in ____ if ____]

# Remove all stop words: no_stops
no_stops = [t for t in ____ if t not in ____]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = ____

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [____ for t in ____]

# Create the bag-of-words: bow
bow = ____(____)

# Print the 10 most common tokens
print(____.____(__))
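
As a rough illustration of the last two steps, the sketch below builds a Counter from a short list and asks for its most frequent entries. The token list is a hypothetical stand-in for the exercise's lemmatized list, and it requests the top 2 rather than the top 10 only because the toy list is so small.

```python
from collections import Counter

# Hypothetical cleaned tokens standing in for the exercise's lemmatized list.
lemmatized = ["cat", "dog", "cat", "mat", "dog", "cat"]

# A Counter maps each distinct token to the number of times it occurs.
bow = Counter(lemmatized)

# most_common(n) returns a list of (token, count) tuples, highest count first.
top_two = bow.most_common(2)
print(top_two)  # [('cat', 3), ('dog', 2)]
```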