Punctuation, numbers and tokens
At the end of the previous chapter you loaded a dataset of SMS messages which had been labeled as either "spam" (label 1) or "ham" (label 0). You're now going to use those data to build a classifier model.
But first you'll need to prepare the SMS messages as follows:
- remove punctuation and numbers
- tokenize (split into individual words)
- remove stop words
- apply the hashing trick
- convert to TF-IDF representation
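The five steps above can be sketched in plain Python before doing them in Spark. Everything here is an illustrative assumption: a tiny two-message corpus, a stand-in stop-word list, and a 16-bucket hashed feature vector with a smoothed IDF. In Spark these stages are handled by `Tokenizer`, `StopWordsRemover`, `HashingTF` and `IDF`.

```python
import hashlib
import math
import re
from collections import Counter

# Stand-in data and parameters (assumptions for illustration only;
# the course data lives in the `sms` DataFrame).
corpus = [
    "Win a FREE prize now!!! Call 0900-123-456",
    "Are you free for lunch today?",
]
STOP_WORDS = {"a", "are", "for", "you"}  # tiny stand-in list
NUM_BUCKETS = 16                         # size of the hashed feature vector

def bucket(token):
    # Hashing trick: map a token to a fixed-size index via a stable hash.
    return int(hashlib.md5(token.encode()).hexdigest(), 16) % NUM_BUCKETS

tokenized = []
for text in corpus:
    text = re.sub(r"[_():;,.!?\-]", " ", text)           # 1. remove punctuation
    text = re.sub(r"[0-9]", " ", text)                   #    ... and numbers
    tokens = text.lower().split()                        # 2. tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. remove stop words
    tokenized.append(tokens)

# 4. Hashing trick: term frequencies counted per bucket, not per word.
tf = [Counter(bucket(t) for t in tokens) for tokens in tokenized]

# 5. TF-IDF: down-weight buckets that occur in many documents (smoothed IDF).
n_docs = len(corpus)
df = Counter(b for counts in tf for b in counts)
tfidf = [{b: c * math.log((n_docs + 1) / (df[b] + 1)) for b, c in counts.items()}
         for counts in tf]
print(tfidf[0])
```

Note that after hashing, different words can share a bucket (a collision); this is the trade-off the hashing trick makes for a fixed-size feature vector.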
In this exercise you'll remove punctuation and numbers, then tokenize the messages.
The SMS data are available as `sms`.
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Import the function to replace regular expressions and the feature to tokenize.
- Replace all punctuation characters in the `text` column with a space. Do the same for all numbers in the `text` column.
- Split the `text` column into tokens. Name the output column `words`.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the necessary functions
from pyspark.sql.functions import ____
from pyspark.ml.feature import ____
# Remove punctuation (REGEX provided) and numbers
wrangled = sms.withColumn('text', ____(sms.text, '[_():;,.!?\\-]', ____))
wrangled = wrangled.withColumn(____, ____(____, ____, ____))
# Merge multiple spaces
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, ' +', ' '))
# Split the text into words
wrangled = ____(inputCol='text', outputCol=____).____(wrangled)
wrangled.show(4, truncate=False)