Exercise

Punctuation, numbers and tokens

At the end of the previous chapter you loaded a dataset of SMS messages which had been labeled as either "spam" (label 1) or "ham" (label 0). You're now going to use those data to build a classifier model.

But first you'll need to prepare the SMS messages as follows:

  • remove punctuation and numbers
  • tokenize (split into individual words)
  • remove stop words
  • apply the hashing trick
  • convert to a TF-IDF representation.
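
To make the last three steps more concrete, here is a minimal pure-Python sketch of stop-word removal, the hashing trick and TF-IDF weighting on two already-tokenized messages. It is an illustration only, not the library code used in the course: the stop-word list, the bucket count, the use of `zlib.crc32` as the hash function and the smoothed IDF formula are all assumptions chosen to keep the example small.

```python
import math
import zlib
from collections import Counter

# Hypothetical stop-word list, far smaller than a real one.
STOP_WORDS = {"a", "the", "to", "is", "you", "and"}
NUM_BUCKETS = 16  # illustrative; real feature spaces are much larger


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def hash_tokens(tokens, num_buckets=NUM_BUCKETS):
    """Hashing trick: map each token to a bucket index and count occurrences."""
    counts = Counter()
    for t in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        counts[zlib.crc32(t.encode("utf-8")) % num_buckets] += 1
    return counts


def tf_idf(docs_counts):
    """Weight each bucket count by the smoothed IDF log((N + 1) / (df + 1))."""
    n_docs = len(docs_counts)
    df = Counter()
    for counts in docs_counts:
        df.update(counts.keys())  # document frequency per bucket
    return [
        {b: tf * math.log((n_docs + 1) / (df[b] + 1)) for b, tf in counts.items()}
        for counts in docs_counts
    ]


docs = [
    ["win", "a", "free", "prize"],
    ["are", "you", "free", "to", "talk"],
]
filtered = [remove_stop_words(d) for d in docs]
hashed = [hash_tokens(d) for d in filtered]
weighted = tf_idf(hashed)
```

Because "free" occurs in both messages, its bucket gets a weight of zero under this IDF formula, which shows how TF-IDF plays down terms that do not help tell documents apart.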

In this exercise you'll remove punctuation and numbers, then tokenize the messages.

The SMS data are available as sms.

Instructions
  • Import the function that replaces regular expression matches and the feature transformer that tokenizes text.
  • Replace every punctuation character in the text column with a space. Do the same for every number in the text column.
  • Split the text column into tokens. Name the output column words.
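
The logic behind these instructions can be tried out on a single message in plain Python before applying it to the whole column. The sketch below is not the exercise solution: the sample message, the helper names `clean` and `tokenize`, the collapsing of repeated spaces and the lowercasing are assumptions made for illustration. In Spark the corresponding pieces would presumably be `regexp_replace` from `pyspark.sql.functions` and `Tokenizer` from `pyspark.ml.feature`, although the exercise text does not name them.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
PUNCTUATION = re.compile("[" + re.escape(string.punctuation) + "]")
DIGITS = re.compile(r"[0-9]")
SPACES = re.compile(r"\s+")


def clean(text):
    """Replace punctuation and digits with spaces, then collapse repeated spaces."""
    text = PUNCTUATION.sub(" ", text)
    text = DIGITS.sub(" ", text)
    return SPACES.sub(" ", text).strip()


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


# Hypothetical SMS message, not taken from the course dataset.
message = "WINNER!! Claim your prize: call 0800-123-456 now."
words = tokenize(clean(message))
print(words)
# ['winner', 'claim', 'your', 'prize', 'call', 'now']
```

Replacing with a space rather than an empty string keeps words apart when punctuation is the only thing separating them, as in "0800-123-456" or "prize:call".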