Text preprocessing practice

Now, it's your turn to apply the techniques you've learned to help clean up text for better NLP results. You'll need to remove stop words and non-alphabetic characters, lemmatize, and perform a new bag-of-words on your cleaned text.

You start with the same tokens you created in the last exercise: lower_tokens. You also have the Counter class imported.

Bu egzersiz

Introduction to Natural Language Processing in Python

kursunun bir parçasıdır

Kursu Görüntüle

Egzersiz talimatları

Import the WordNetLemmatizer class from nltk.stem.
Create a list alpha_only that contains only alphabetical characters. You can use the .isalpha() method to check for this.
Create another list called no_stops consisting of words from alpha_only that are not contained in english_stops.
Initialize a WordNetLemmatizer object called wordnet_lemmatizer and use its .lemmatize() method on the tokens in no_stops to create a new list called lemmatized.
Create a new Counter called bow with the lemmatized words.
Lastly, print the 10 most common tokens.

Uygulamalı interaktif egzersiz

Bu örnek kodu tamamlayarak bu egzersizi bitirin.

# Import WordNetLemmatizer
____

# Retain alphabetic words: alpha_only
alpha_only = [t for t in ____ if ____]

# Remove all stop words: no_stops
no_stops = [t for t in ____ if t not in ____]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = ____

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [____ for t in ____]

# Create the bag-of-words: bow
bow = ____(____)

# Print the 10 most common tokens
print(____.____(__))

Kodu Düzenle ve Çalıştır