Text preprocessing practice

Now it's your turn to apply the techniques you've learned to clean up text for better NLP results. You'll need to remove stop words and non-alphabetic tokens, lemmatize, and build a new bag-of-words from the cleaned text.

Start with the same tokens you created in the previous exercise: lower_tokens. You have also imported the Counter class.
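
The previous exercise is not reproduced here, so the following is only a rough sketch of what the starting point might look like. The sample text in article and the regex-based tokenizer are assumptions made for illustration; the course supplies its own article and may tokenize it differently.

```python
import re
from collections import Counter

# Hypothetical sample text (assumption); the course uses its own article.
article = "The cats sat on the mats. Cats like mats, and 2 mats are better than 1!"

# Stand-in tokenizer (assumption): runs of word characters, or single
# punctuation marks, lowercased afterwards.
lower_tokens = [t.lower() for t in re.findall(r"\w+|[^\w\s]", article)]

# Bag-of-words on the raw tokens, before any cleaning.
raw_bow = Counter(lower_tokens)
print(raw_bow.most_common(3))
```

In this uncleaned count, stop words such as "the" and punctuation marks still appear among the tokens, which is what the preprocessing steps are meant to fix.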

This exercise is part of the course

Introduction to Natural Language Processing in Python

Exercise instructions

Import the WordNetLemmatizer class from nltk.stem.
Create a list alpha_only that contains only the tokens from lower_tokens made up entirely of alphabetic characters. You can use the .isalpha() method to check this.
Create another list called no_stops consisting of the words from alpha_only that are not in english_stops.
Initialize a WordNetLemmatizer object called wordnet_lemmatizer and use its .lemmatize() method on the tokens in no_stops to create a new list called lemmatized.
Create a new Counter called bow from the lemmatized words.
Finally, print the 10 most common tokens.
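
Before filtering, it can help to see which tokens .isalpha() actually rejects. The short sketch below uses a hypothetical token list (sample_tokens is not part of the exercise) to show that the check applies to every character of a token, so anything containing digits, hyphens, or apostrophes is dropped, while accented letters still count as alphabetic.

```python
# Hypothetical tokens (assumption) chosen to show edge cases of str.isalpha().
sample_tokens = ["mats", "2", "!", "e-mail", "n't", "café", "covid19"]

# A token is kept only if every character in it is a letter.
kept = [t for t in sample_tokens if t.isalpha()]
dropped = [t for t in sample_tokens if not t.isalpha()]

print(kept)     # ['mats', 'café']
print(dropped)  # ['2', '!', 'e-mail', "n't", 'covid19']
```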

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Import WordNetLemmatizer
____

# Retain alphabetic words: alpha_only
alpha_only = [t for t in ____ if ____]

# Remove all stop words: no_stops
no_stops = [t for t in ____ if t not in ____]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = ____

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [____ for t in ____]

# Create the bag-of-words: bow
bow = ____(____)

# Print the 10 most common tokens
print(____.____(__))
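
One possible way to fill in the blanks is sketched below. It is not an official solution. The contents of lower_tokens and english_stops are small hand-written stand-ins (assumptions), since the course provides its own. The try/except block and the _PluralStripper class are also hypothetical additions: they only let the sketch run where NLTK or its WordNet data is not installed, and the fallback is a crude plural stripper, not a real lemmatizer.

```python
from collections import Counter

# Assumed inputs: in the course these already exist in the workspace.
lower_tokens = ["the", "cats", "sat", "on", "the", "mats", ".", "cats", "like",
                "mats", ",", "and", "2", "mats", "are", "better", "than", "1", "!"]
english_stops = ["the", "on", "and", "are", "than"]  # tiny stand-in list

try:
    # Import WordNetLemmatizer
    from nltk.stem import WordNetLemmatizer

    # Instantiate the WordNetLemmatizer
    wordnet_lemmatizer = WordNetLemmatizer()
    wordnet_lemmatizer.lemmatize("cats")  # forces the WordNet data to load
except (ImportError, LookupError):
    # Hypothetical fallback so the sketch runs without NLTK or WordNet data.
    class _PluralStripper:
        """Crude stand-in, not a real lemmatizer: drops a final plural 's'."""

        def lemmatize(self, word):
            if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
                return word[:-1]
            return word

    wordnet_lemmatizer = _PluralStripper()

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))
```

With these sample tokens, "cats" and "mats" are counted under their singular forms, so the most frequent entries reflect content words rather than stop words or punctuation.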
