Punctuation, numbers and tokens

At the end of the previous chapter you loaded a dataset of SMS messages which had been labeled as either "spam" (label 1) or "ham" (label 0). You're now going to use those data to build a classifier model.

But first you'll need to prepare the SMS messages as follows:

remove punctuation and numbers
tokenize (split into individual words)
remove stop words
apply the hashing trick
convert to TF-IDF representation.

In this exercise you'll remove punctuation and numbers, then tokenize the messages.

The SMS data are available as sms.

Deze oefening maakt deel uit van de cursus

Machine Learning with PySpark

Cursus bekijken

Oefeninstructies

Import the function to replace regular expressions and the feature to tokenize.
Replace all punctuation characters from the text column with a space. Do the same for all numbers in the text column.
Split the text column into tokens. Name the output column words.

Praktische interactieve oefening

Probeer deze oefening eens door deze voorbeeldcode in te vullen.

# Import the necessary functions
from pyspark.sql.functions import ____
from pyspark.ml.feature import ____

# Remove punctuation (REGEX provided) and numbers
wrangled = sms.withColumn('text', ____(sms.text, '[_():;,.!?\\-]', ____))
wrangled = wrangled.withColumn(____, ____(____, ____, ____))

# Merge multiple spaces
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, ' +', ' '))

# Split the text into words
wrangled = ____(inputCol='text', outputCol=____).____(wrangled)

wrangled.show(4, truncate=False)

Code bewerken en uitvoeren