Tell Spark how to tune your ALS model
Now we'll need to create a ParamGrid to tell Spark which hyperparameters we want it to tune and how to tune them, and then build an evaluator so Spark knows how to measure the algorithm's performance.
This exercise is part of the course Building Recommendation Engines with PySpark.
Exercise instructions
- Import RegressionEvaluator from pyspark.ml.evaluation, and ParamGridBuilder and CrossValidator from pyspark.ml.tuning (CrossValidator isn't called in this exercise; see the sketch after this list).
- Build a ParamGrid called param_grid using the ParamGridBuilder provided. Call the .addGrid() method for each hyperparameter, passing the name of the model and the name of the hyperparameter (ex: .addGrid(als.rank, [])). Do this for the rank, maxIter, and regParam hyperparameters, along with the respective lists of values that Spark should try:
  - rank: [10, 50, 100, 150]
  - maxIter: [5, 50, 100, 200]
  - regParam: [.01, .05, .1, .15]
- Create a RegressionEvaluator called evaluator. Set the metricName to "rmse", set the labelCol to "rating", and set the predictionCol to "prediction" so Spark knows what to call the column of generated predictions.
- Run len(param_grid) to confirm that the param_grid was created and that the right number of hyperparameter combinations will be tested. It should equal the number of rank values * the number of maxIter values * the number of regParam values in the ParamGridBuilder (4 * 4 * 4 = 64 here).
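Although CrossValidator is imported in this exercise, it isn't used yet. For orientation, here is a minimal sketch of how the estimator, grid, and evaluator typically come together in a later step; the variable name cv and numFolds=5 are illustrative assumptions, not part of this exercise:

# Sketch only: combining the pieces built in this exercise (cv and numFolds are assumed)
cv = CrossValidator(estimator=als,                  # the ALS model being tuned
                    estimatorParamMaps=param_grid,  # hyperparameter combinations to try
                    evaluator=evaluator,            # RMSE evaluator defined in this exercise
                    numFolds=5)                     # assumed number of cross-validation folds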
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the requisite items
from pyspark.ml.evaluation import ____
from pyspark.ml.____ import ____, ____
# Add hyperparameters and their respective values to param_grid
____ = ParamGridBuilder() \
.addGrid(als.rank, [____, ____, ____, ____]) \
.addGrid(als.____, [____, ____, ____, ____]) \
.addGrid(als.____, [____, ____, ____, ____]) \
.build()
# Define evaluator as RMSE and print length of param_grid
____ = RegressionEvaluator(metricName="____", labelCol="____", predictionCol="____")
print("Num models to be tested: ", len(param_grid))
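For reference, one possible completed version of the sample code above, assuming an ALS estimator named als is already defined in the exercise environment (it is not created in this snippet):

# Import the requisite items
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Add hyperparameters and their respective values to param_grid
param_grid = ParamGridBuilder() \
    .addGrid(als.rank, [10, 50, 100, 150]) \
    .addGrid(als.maxIter, [5, 50, 100, 200]) \
    .addGrid(als.regParam, [.01, .05, .1, .15]) \
    .build()

# Define evaluator as RMSE and print length of param_grid
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

print("Num models to be tested: ", len(param_grid))  # 4 * 4 * 4 = 64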