Get startedGet started for free

Punctuation, numbers and tokens

At the end of the previous chapter you loaded a dataset of SMS messages which had been labeled as either "spam" (label 1) or "ham" (label 0). You're now going to use those data to build a classifier model.

But first you'll need to prepare the SMS messages as follows:

  • remove punctuation and numbers
  • tokenize (split into individual words)
  • remove stop words
  • apply the hashing trick
  • convert to TF-IDF representation.

In this exercise you'll remove punctuation and numbers, then tokenize the messages.

The SMS data are available as sms.

This exercise is part of the course

Machine Learning with PySpark

View Course

Exercise instructions

  • Import the function to replace regular expressions and the feature to tokenize.
  • Replace all punctuation characters from the text column with a space. Do the same for all numbers in the text column.
  • Split the text column into tokens. Name the output column words.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import the necessary functions
from pyspark.sql.functions import ____
from pyspark.ml.feature import ____

# Remove punctuation (REGEX provided) and numbers
wrangled = sms.withColumn('text', ____(sms.text, '[_():;,.!?\\-]', ____))
wrangled = wrangled.withColumn(____, ____(____, ____, ____))

# Merge multiple spaces
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, ' +', ' '))

# Split the text into words
wrangled = ____(inputCol='text', outputCol=____).____(wrangled)

wrangled.show(4, truncate=False)
Edit and Run Code