Punctuation, numbers and tokens

At the end of the previous chapter you loaded a dataset of SMS messages which had been labeled as either "spam" (label 1) or "ham" (label 0). You're now going to use those data to build a classifier model.

But first you'll need to prepare the SMS messages as follows:

remove punctuation and numbers
tokenize (split into individual words)
remove stop words
apply the hashing trick
convert to TF-IDF representation.

In this exercise you'll remove punctuation and numbers, then tokenize the messages.

The SMS data are available as sms.

Questo esercizio fa parte del corso

Machine Learning with PySpark

Visualizza il corso

Istruzioni dell'esercizio

Import the function to replace regular expressions and the feature to tokenize.
Replace all punctuation characters from the text column with a space. Do the same for all numbers in the text column.
Split the text column into tokens. Name the output column words.

Esercizio pratico interattivo

Prova a risolvere questo esercizio completando il codice di esempio.

# Import the necessary functions
from pyspark.sql.functions import ____
from pyspark.ml.feature import ____

# Remove punctuation (REGEX provided) and numbers
wrangled = sms.withColumn('text', ____(sms.text, '[_():;,.!?\\-]', ____))
wrangled = wrangled.withColumn(____, ____(____, ____, ____))

# Merge multiple spaces
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, ' +', ' '))

# Split the text into words
wrangled = ____(inputCol='text', outputCol=____).____(wrangled)

wrangled.show(4, truncate=False)

Modifica ed esegui il codice