Machine Learning Pipelines
In the next two chapters you'll step through every stage of the machine learning pipeline, from data intake to model evaluation. Let's get to it!
At the core of the pyspark.ml module are the Transformer and Estimator classes. Almost every other class in the module behaves similarly to these two basic classes.
Transformer classes have a .transform() method that takes a DataFrame and returns a new DataFrame, usually the original one with a new column appended. For example, you might use the Bucketizer class to create discrete bins from a continuous feature, or the PCA class to reduce the dimensionality of your dataset using principal component analysis.
Estimator classes all implement a .fit() method. These methods also take a DataFrame, but instead of returning another DataFrame they return a model object. This can be something like a StringIndexerModel for including categorical data saved as strings in your models, or a RandomForestClassificationModel that uses the random forest algorithm for classification (with a RandomForestRegressionModel counterpart for regression).