SMS spam pipeline
You haven't looked at the SMS data for quite a while. Last time we did the following:
- split the text into tokens
- removed stop words
- applied the hashing trick
- converted the counts to TF-IDF, and
- trained a logistic regression model.
Each of these steps was done independently. This seems like a great application for a pipeline!
The Pipeline and LogisticRegression classes have already been imported into the session, so you don't need to worry about that!
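As a reminder of what the hashing and TF-IDF steps in the recap above compute, here is a minimal sketch in plain Python on three made-up messages. It is only an illustration, not Spark's implementation: it uses `zlib.crc32` as a stand-in hash (Spark's `HashingTF` uses MurmurHash3, so bucket indices will differ), and `NUM_FEATURES`, `hashed_counts` and `idf_weights` are hypothetical names chosen for this example. The IDF weight uses the smoothed form log((m + 1) / (df + 1)) given in the Spark ML documentation.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # deliberately tiny; an assumption for this sketch


def hashed_counts(tokens, num_features=NUM_FEATURES):
    """Hashing trick: map each token to a bucket index and count hits per bucket."""
    counts = Counter()
    for token in tokens:
        # crc32 is a stand-in hash, chosen because it is deterministic across runs
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        counts[bucket] += 1
    return counts


def idf_weights(docs_counts):
    """Smoothed IDF per bucket: log((n_docs + 1) / (doc_freq + 1))."""
    n_docs = len(docs_counts)
    doc_freq = Counter()
    for counts in docs_counts:
        # Each bucket counts once per document it appears in
        doc_freq.update(counts.keys())
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}


# Toy, already tokenized messages (made up for illustration)
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "free", "entry"],
]

tf = [hashed_counts(d) for d in docs]   # term counts per hashed bucket
idf = idf_weights(tf)                   # one weight per bucket seen in the corpus
tfidf = [{b: c * idf[b] for b, c in counts.items()} for counts in tf]

for row in tfidf:
    print(row)
```

Buckets that occur in many messages get weights close to zero, while rare buckets are weighted up. With so few buckets, distinct words can collide in the same bucket, which is the price the hashing trick pays for a fixed-size feature vector.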
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Create an object for splitting text into tokens.
- Create an object to remove stop words. Rather than explicitly giving the input column name, use the getOutputCol() method on the previous object.
- Create objects for applying the hashing trick and transforming the data into TF-IDF. Use the getOutputCol() method again.
- Create a pipeline which wraps all of the above steps as well as an object to create a logistic regression model.
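The chaining idea in these instructions, where each stage reads the column the previous stage wrote, can be sketched in plain Python without Spark. Everything here is a hypothetical stand-in: `Stage`, `run_stages`, `splitter`, `cleaner` and the toy stop word list are invented for illustration and are not PySpark classes. The sketch only covers stages that transform data; a real Pipeline also fits estimators such as the IDF and logistic regression steps.

```python
class Stage:
    """Hypothetical stand-in for a feature transformer; not a PySpark class."""

    def __init__(self, func, inputCol, outputCol):
        self.func = func
        self.inputCol = inputCol
        self.outputCol = outputCol

    def getOutputCol(self):
        return self.outputCol

    def transform(self, row):
        # Return a copy of the row with one extra column added
        new_row = dict(row)
        new_row[self.outputCol] = self.func(row[self.inputCol])
        return new_row


STOP_WORDS = {"the", "a", "to", "you"}  # toy list, made up for this sketch

splitter = Stage(lambda text: text.lower().split(),
                 inputCol="text", outputCol="words")

# Reads whatever column the previous stage writes, without repeating its name
cleaner = Stage(lambda words: [w for w in words if w not in STOP_WORDS],
                inputCol=splitter.getOutputCol(), outputCol="terms")


def run_stages(stages, row):
    """Apply each stage in order, feeding the result to the next one."""
    for stage in stages:
        row = stage.transform(row)
    return row


result = run_stages([splitter, cleaner], {"text": "Call to claim the prize"})
print(result["terms"])  # ['call', 'claim', 'prize']
```

Because the second stage asks the first for its output column, renaming `words` in one place keeps the chain consistent, which is the reason the instructions prefer getOutputCol() over a hard-coded name.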
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
# Break text into lowercase tokens at whitespace
tokenizer = ____(inputCol='text', outputCol='words')
# Remove stop words
remover = ____(inputCol=____, outputCol='terms')
# Apply the hashing trick and transform to TF-IDF
hasher = ____(inputCol=____, outputCol='hash')
idf = ____(inputCol=____, outputCol='features')
# Create a logistic regression object and add everything to a pipeline
logistic = LogisticRegression()
pipeline = Pipeline(stages=[____, ____, ____, ____, logistic])