SMS spam pipeline

You haven't looked at the SMS data for quite a while. Last time you did the following:

split the text into tokens
removed stop words
applied the hashing trick
converted the data from counts to TF-IDF weights and
trained a logistic regression model.
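
To make the hashing trick and the IDF step above concrete, here is a minimal plain-Python sketch. It is not Spark's implementation: `zlib.crc32` stands in for the MurmurHash3 function that HashingTF uses, the bucket count of 16, the sample messages, and the stop-word list are all made up for illustration. The IDF weight uses the smoothed form log((N + 1) / (df + 1)), which is the formula Spark documents for its IDF class.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 16                        # assumed; Spark's HashingTF defaults to 2**18
STOP_WORDS = {"the", "a", "to", "is"}   # tiny hypothetical stop-word list


def tokenize(text):
    # Lowercase and split on whitespace
    return text.lower().split()


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def hash_tf(terms, num_buckets=NUM_BUCKETS):
    # Hashing trick: map each term to a bucket index and count hits per bucket.
    # crc32 is a stand-in hash; different terms may collide in one bucket.
    counts = Counter(zlib.crc32(t.encode("utf-8")) % num_buckets for t in terms)
    return [counts.get(i, 0) for i in range(num_buckets)]


def idf_weights(tf_vectors):
    # Smoothed inverse document frequency per bucket: log((N + 1) / (df + 1))
    n_docs = len(tf_vectors)
    weights = []
    for bucket in range(len(tf_vectors[0])):
        df = sum(1 for vec in tf_vectors if vec[bucket] > 0)
        weights.append(math.log((n_docs + 1) / (df + 1)))
    return weights


# Made-up example messages
messages = ["Win a free prize now", "Are you free to talk", "Call me now"]
tf = [hash_tf(remove_stop_words(tokenize(m))) for m in messages]
idf = idf_weights(tf)
tfidf = [[count * w for count, w in zip(vec, idf)] for vec in tf]
```

Buckets that are hit by every message get a weight of zero, while buckets that are hit rarely get a larger weight, so frequent terms contribute less to the final feature vector.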

Each of these steps was done independently. This seems like a great application for a pipeline!

The Pipeline and LogisticRegression classes have already been imported into the session, so you don't need to worry about that!

This exercise is part of the course Machine Learning with PySpark.


Exercise instructions

Create an object for splitting text into tokens.
Create an object to remove stop words. Rather than explicitly giving the input column name, use the getOutputCol() method on the previous object.
Create objects for applying the hashing trick and transforming the data into TF-IDF. Use the getOutputCol() method again.
Create a pipeline which wraps all of the above steps as well as an object to create a Logistic Regression model.
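
The chaining described in these instructions can be sketched without Spark. The toy classes below are hypothetical and only mimic the idea: each stage reads one named column and writes another, the next stage takes its input name from the previous stage's `getOutputCol()`, and the pipeline runs the stages in list order. Real PySpark stages operate on DataFrames, and some of them (such as IDF and LogisticRegression) must be fitted before they can transform data; this sketch skips both.

```python
class ToyStage:
    """Hypothetical stand-in for a transformer: reads inputCol, writes outputCol."""

    def __init__(self, func, inputCol, outputCol):
        self.func = func
        self.inputCol = inputCol
        self.outputCol = outputCol

    def getOutputCol(self):
        return self.outputCol

    def transform(self, row):
        # Return a copy of the row with the new column added
        new_row = dict(row)
        new_row[self.outputCol] = self.func(row[self.inputCol])
        return new_row


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: applies stages in list order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, row):
        for stage in self.stages:
            row = stage.transform(row)
        return row


STOP_WORDS = {"a", "the", "to"}  # tiny made-up list

tokenizer = ToyStage(lambda text: text.lower().split(),
                     inputCol="text", outputCol="words")
# The input column is taken from the previous stage rather than typed out again
remover = ToyStage(lambda words: [w for w in words if w not in STOP_WORDS],
                   inputCol=tokenizer.getOutputCol(), outputCol="terms")

pipeline = ToyPipeline(stages=[tokenizer, remover])
result = pipeline.transform({"text": "Win a prize"})
```

Renaming the tokenizer's output column then needs a change in one place only, which is the main reason to prefer `getOutputCol()` over a repeated string literal.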

Hands-on interactive exercise

Complete this sample code to finish the exercise.

from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

# Break text into tokens (Tokenizer lowercases the text and splits on whitespace)
tokenizer = ____(inputCol='text', outputCol='words')

# Remove stop words
remover = ____(inputCol=____, outputCol='terms')

# Apply the hashing trick and transform to TF-IDF
hasher = ____(inputCol=____, outputCol="hash")
idf = ____(inputCol=____, outputCol="features")

# Create a logistic regression object and add everything to a pipeline
logistic = LogisticRegression()
pipeline = Pipeline(stages=[____, ____, ____, ____, logistic])
