Punctuation, numbers and tokens
At the end of the previous chapter you loaded a dataset of SMS messages which had been labeled as either "spam" (label 1) or "ham" (label 0). You're now going to use those data to build a classifier model.
But first you'll need to prepare the SMS messages as follows:
- remove punctuation and numbers
- tokenize (split into individual words)
- remove stop words
- apply the hashing trick
- convert to TF-IDF representation.
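The last two steps in the list above — the hashing trick and TF-IDF weighting — can be sketched outside Spark in plain Python. This is an illustration only, not Spark's implementation: the helper names and the vocabulary size are made up, and a deterministic CRC32 hash stands in for Spark's internal hash function. The smoothed IDF formula, log((m + 1) / (d + 1)) for m documents and document frequency d, matches the one Spark's `IDF` estimator uses.

```python
import math
import zlib


def hashed_tf(tokens, num_features=16):
    """Hashing trick: hash each token to one of num_features buckets
    and count occurrences, so no vocabulary needs to be stored."""
    vec = [0] * num_features
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % num_features] += 1
    return vec


def tfidf(docs_tokens, num_features=16):
    """Weight hashed term frequencies by smoothed inverse document
    frequency: idf = log((m + 1) / (df + 1)) over m documents."""
    tf = [hashed_tf(toks, num_features) for toks in docs_tokens]
    m = len(docs_tokens)
    df = [sum(1 for vec in tf if vec[j] > 0) for j in range(num_features)]
    idf = [math.log((m + 1) / (d + 1)) for d in df]
    return [[t * w for t, w in zip(vec, idf)] for vec in tf]


weighted = tfidf([["win", "free", "free"], ["hello", "free"]])
print(weighted)
```

A term like "free" that occurs in every document gets an IDF of log(3/3) = 0, so its weight vanishes; rarer terms keep a positive weight.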
In this exercise you'll remove punctuation and numbers, then tokenize the messages.
The SMS data are available as `sms`.
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Import the function to replace regular expressions and the feature to tokenize.
- Replace all punctuation characters in the `text` column with a space. Do the same for all numbers in the `text` column.
- Split the `text` column into tokens. Name the output column `words`.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the necessary functions
from pyspark.sql.functions import ____
from pyspark.ml.feature import ____
# Remove punctuation (REGEX provided) and numbers
wrangled = sms.withColumn('text', ____(sms.text, '[_():;,.!?\\-]', ____))
wrangled = wrangled.withColumn(____, ____(____, ____, ____))
# Merge multiple spaces
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, ' +', ' '))
# Split the text into words
wrangled = ____(inputCol='text', outputCol=____).____(wrangled)
wrangled.show(4, truncate=False)
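Stripped of the Spark API, the string transformations the template walks through look like this in plain Python. This is an illustrative sketch, not the exercise solution: the sample message is made up, `re.sub` stands in for `regexp_replace`, and lower-casing plus a whitespace split mimics what Spark's `Tokenizer` does.

```python
import re


def clean_and_tokenize(text):
    # Replace punctuation (same character class as the template) with a space
    text = re.sub(r'[_():;,.!?\-]', ' ', text)
    # Replace digits with a space
    text = re.sub(r'[0-9]', ' ', text)
    # Merge runs of spaces, as the ' +' replacement does
    text = re.sub(r' +', ' ', text)
    # Tokenizer lower-cases and splits on whitespace
    return text.lower().strip().split(' ')


print(clean_and_tokenize("Win a FREE prize! Call 0800-123-456 now."))
```

Tracing the sample message through each step makes it clear why the space-merging pass is needed: stripping the digits from "0800-123-456" would otherwise leave a long run of spaces behind.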