SMS spam optimised
The pipeline you built earlier for the SMS spam model used the default parameters for every stage in the pipeline. It's very unlikely that these defaults will give a particularly good model, though. In this exercise you're going to run the pipeline for a selection of parameter values. We're going to do this in a systematic way: the values for each of the hyperparameters will be laid out on a grid, and then the pipeline will be run at each point in the grid.
In this exercise you'll set up a parameter grid which can be used with cross validation to choose a good set of parameters for the SMS spam classifier.
The following are already defined:
- hasher: a HashingTF object
- logistic: a LogisticRegression object
This exercise is part of the course
Machine Learning with PySpark
Exercise instructions
- Create a parameter grid builder object.
- Add grid points for the numFeatures and binary parameters of the HashingTF object, giving values 1024, 4096 and 16384, and True and False, respectively.
- Add grid points for the regParam and elasticNetParam parameters of the LogisticRegression object, giving values of 0.01, 0.1, 1.0 and 10.0, and 0.0, 0.5 and 1.0, respectively.
- Build the parameter grid.
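To get a feel for what "a grid" means here, the following is a minimal sketch in plain Python (not the PySpark API) that enumerates every combination of the values listed in the instructions. The names grid_values and grid_points are hypothetical and used only for illustration; a built PySpark parameter grid is, analogously, a list with one parameter map per combination.

```python
from itertools import product

# Candidate values taken from the exercise instructions.
# The dictionary keys mirror the parameter names; this is plain Python,
# not a PySpark object.
grid_values = {
    "numFeatures": [1024, 4096, 16384],
    "binary": [True, False],
    "regParam": [0.01, 0.1, 1.0, 10.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
}

# Every grid point is one combination of values, one per parameter.
names = list(grid_values)
grid_points = [
    dict(zip(names, combo)) for combo in product(*grid_values.values())
]

print(len(grid_points))  # 3 * 2 * 4 * 3 = 72 combinations
print(grid_points[0])    # the first point in the grid
```

The grid therefore has 72 points, so the pipeline would be fitted 72 times for each cross-validation fold. This is worth keeping in mind when adding more values to any one parameter.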
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create parameter grid
params = ____()
# Add grid for hashing trick parameters
params = params.____(____, ____) \
.____(____, ____)
# Add grid for logistic regression parameters
params = params.____(____, ____) \
.____(____, ____)
# Build parameter grid
params = ____.____()