SMS spam pipeline

You haven't looked at the SMS data for quite a while. Last time, you did the following:

split the text into tokens
removed stop words
applied the hashing trick
converted the term counts to TF-IDF and
trained a logistic regression model.

Each of these steps was done independently. This seems like a great application for a pipeline!
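
As a reminder of what the hashing trick and the IDF step compute, here is a minimal pure-Python sketch that runs without Spark. It is an illustration under assumptions, not Spark's implementation: zlib.crc32 stands in for the MurmurHash3 function that HashingTF uses, NUM_FEATURES is a deliberately tiny hypothetical value, and the documents are made up. The IDF weight follows the smoothed form log((m + 1) / (df + 1)) that Spark's IDF documents.

```python
import math
import zlib
from collections import Counter

# Hypothetical, deliberately tiny; Spark's HashingTF defaults to 2**18 buckets
NUM_FEATURES = 16


def hash_counts(terms, num_features=NUM_FEATURES):
    """Hashing trick: map each term to a bucket index and count hits per bucket."""
    # zlib.crc32 is a stand-in hash; Spark's HashingTF uses MurmurHash3
    return Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in terms)


def idf_weights(count_vectors):
    """Smoothed IDF per bucket: log((n_docs + 1) / (doc_freq + 1))."""
    n_docs = len(count_vectors)
    # Each bucket counts once per document in which it appears
    doc_freq = Counter(bucket for vec in count_vectors for bucket in vec)
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}


# Toy documents, already tokenized with stop words removed (made-up data)
docs = [["free", "prize", "call"], ["call", "later"], ["free", "free", "entry"]]
counts = [hash_counts(d) for d in docs]
idf = idf_weights(counts)
tfidf = [{b: c * idf[b] for b, c in vec.items()} for vec in counts]
```

A bucket that occurs in every document gets a weight of zero, which is why very common terms contribute little to the final features.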

The Pipeline and LogisticRegression classes have already been imported into the session, so you don't need to worry about that!

This exercise is part of the course

Machine Learning with PySpark

Exercise instructions

Create an object for splitting text into tokens.
Create an object to remove stop words. Rather than explicitly giving the input column name, use the getOutputCol() method on the previous object.
Create objects for applying the hashing trick and transforming the data into a TF-IDF. Use the getOutputCol() method again.
Create a pipeline which wraps all of the above steps as well as an object to create a Logistic Regression model.
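
The getOutputCol() chaining described in the instructions can be sketched without Spark. The Stage class below is a hypothetical stand-in that mimics only the getInputCol() and getOutputCol() accessors of a PySpark stage; it is not one of the classes the exercise asks for. The point is that each stage reads its input column name from the stage before it, so a column name is typed only once.

```python
class Stage:
    """Hypothetical stand-in that mimics only the column getters of a PySpark stage."""

    def __init__(self, inputCol, outputCol):
        self._input_col = inputCol
        self._output_col = outputCol

    def getInputCol(self):
        return self._input_col

    def getOutputCol(self):
        return self._output_col


first = Stage(inputCol="text", outputCol="words")
# Read the column name from the previous stage instead of retyping 'words'
second = Stage(inputCol=first.getOutputCol(), outputCol="terms")
third = Stage(inputCol=second.getOutputCol(), outputCol="hash")

stages = [first, second, third]
# Every stage consumes exactly the column that the stage before it produces
linked = all(a.getOutputCol() == b.getInputCol() for a, b in zip(stages, stages[1:]))
```

If an output column is later renamed, the downstream stage picks up the new name automatically, which avoids a mismatch between hard-coded strings.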

Hands-on interactive exercise

Try this exercise by completing this sample code.

from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

# Break text into tokens (Tokenizer lowercases the text and splits on whitespace)
tokenizer = ____(inputCol='text', outputCol='words')

# Remove stop words
remover = ____(inputCol=____, outputCol='terms')

# Apply the hashing trick and transform to TF-IDF
hasher = ____(inputCol=____, outputCol="hash")
idf = ____(inputCol=____, outputCol="features")

# Create a logistic regression object and add everything to a pipeline
logistic = LogisticRegression()
pipeline = Pipeline(stages=[____, ____, ____, ____, logistic])
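
Once the pipeline above is complete, calling its fit() method runs the stages in order: a transformer is applied directly, while an estimator is fitted first and the resulting model transforms the data for the next stage. The following pure-Python sketch illustrates that pattern only. Every class and function in it (Lowercase, Vocabulary, VocabularyModel, fit_pipeline, transform_pipeline) is a hypothetical toy, not part of PySpark, and the text rows are made up.

```python
class Lowercase:
    """Toy transformer: nothing to learn, so it only has transform()."""

    def transform(self, rows):
        return [r.lower() for r in rows]


class VocabularyModel:
    """Toy fitted model: keeps only the words seen during fit()."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [" ".join(w for w in r.split() if w in self.vocab) for r in rows]


class Vocabulary:
    """Toy estimator: fit() learns from the data and returns a model."""

    def fit(self, rows):
        return VocabularyModel({w for r in rows for w in r.split()})


def fit_pipeline(stages, rows):
    """Fit stages in order, feeding each one the output of the stages before it."""
    fitted = []
    for stage in stages:
        # Estimators are fitted first; transformers are used as they are
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted stage in order to new data."""
    for model in fitted:
        rows = model.transform(rows)
    return rows


train = ["Free PRIZE call", "Call later"]
fitted = fit_pipeline([Lowercase(), Vocabulary()], train)
result = transform_pipeline(fitted, ["FREE entry call"])
```

Because the fitted stages are stored together, exactly the same preprocessing is applied to new data as was applied to the training data.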
