1
Introduction
Spark is a framework for working with Big Data. In this chapter you'll cover some background about Spark and Machine Learning. You'll then find out how to connect to Spark using Python and load CSV data.
2
Classification
Now that you are familiar with getting data into Spark, you'll move on to building two types of classification model: Decision Trees and Logistic Regression. You'll also find out about a few approaches to data preparation.
3
Regression
Next you'll learn to create Linear Regression models. You'll also find out how to augment your data by engineering new predictors, and you'll learn a robust approach to selecting only the most relevant predictors.
4
Ensembles & Pipelines
Finally you'll learn how to make your models more efficient. You'll find out how to use pipelines to make your code clearer and easier to maintain. Then you'll use cross-validation to better test your models and select good model parameters. To round off, you'll dabble in two types of ensemble model.


SMS spam pipeline

You haven't looked at the SMS data for quite a while. Last time we did the following:

split the text into tokens
removed stop words
applied the hashing trick
converted the data from counts to TF-IDF and
trained a logistic regression model.

Each of these steps was done independently. This seems like a great application for a pipeline!

The Pipeline and LogisticRegression classes have already been imported into the session, so you don't need to worry about that!

Create an object for splitting text into tokens.
Create an object to remove stop words. Rather than explicitly giving the input column name, use the getOutputCol() method on the previous object.
Create objects for applying the hashing trick and transforming the data into TF-IDF. Use the getOutputCol() method again.
Create a pipeline which wraps all of the above steps as well as an object to create a Logistic Regression model.