Punctuation, numbers and tokens
At the end of the previous chapter you loaded a dataset of SMS messages which had been labeled as either "spam" (label 1) or "ham" (label 0). You're now going to use those data to build a classifier model.
But first you'll need to prepare the SMS messages as follows:
- remove punctuation and numbers
- tokenize (split into individual words)
- remove stop words
- apply the hashing trick
- convert to TF-IDF representation.
In this exercise you'll remove punctuation and numbers, then tokenize the messages.
The SMS data are available as sms.
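To see what the cleaning steps above actually do to a message, here is a small plain-Python sketch that mimics them outside Spark (an illustration only; the exercise itself uses PySpark's regexp_replace and Tokenizer, and the sample message is made up):

```python
import re

def clean_and_tokenize(text):
    """Mimic the cleaning steps with plain Python: punctuation and digits
    become spaces, repeated spaces collapse, then lowercase and split."""
    text = re.sub(r'[_():;,.!?\-]', ' ', text)  # punctuation -> space
    text = re.sub(r'[0-9]', ' ', text)          # digits -> space
    text = re.sub(r' +', ' ', text)             # merge multiple spaces
    return text.strip().lower().split(' ')

print(clean_and_tokenize("Win a FREE prize! Call 0800-123-456 now."))
# -> ['win', 'a', 'free', 'prize', 'call', 'now']
```

Note that the lowercasing mirrors Spark's Tokenizer, which also lowercases before splitting on whitespace.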
This exercise is part of the course
Machine Learning with PySpark
Exercise instructions
- Import the function to replace regular expressions and the feature to tokenize.
- Replace all punctuation characters in the text column with a space. Do the same for all numbers in the text column.
- Split the text column into tokens. Name the output column words.
Hands-on interactive exercise
Try this exercise by completing the sample code.
# Import the necessary functions
from pyspark.sql.functions import regexp_replace
from pyspark.ml.feature import Tokenizer
# Remove punctuation (REGEX provided) and numbers
wrangled = sms.withColumn('text', regexp_replace(sms.text, '[_():;,.!?\\-]', ' '))
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, '[0-9]', ' '))
# Merge multiple spaces
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, ' +', ' '))
# Split the text into words
wrangled = Tokenizer(inputCol='text', outputCol='words').transform(wrangled)
wrangled.show(4, truncate=False)