View Schema
As you know from previous chapters, Spark's implementation of ALS requires that movieIds and userIds be provided as integer data types. Many datasets need to be prepared accordingly before they will work properly with Spark. A common issue is that Spark reads numbers as strings, and vice versa.
Here, you'll use the .cast() method to address these problems. Let's take a look at the schema of the dataset to ensure it's in the correct format.
This exercise is part of the course Building Recommendation Engines with PySpark.
Exercise instructions
- Use .printSchema() to check whether the ratings dataset contains the proper data types for ALS to function correctly. Are the userIds and movieIds provided as integer data types? Are the ratings in numeric format?
- Ensure that the columns of the ratings dataframe are the correct data types. Call the .cast() method on each column, specifying the userId and movieId columns to be type "integer" and the rating column to be of type "double". (We don't need the timestamp column, so we can leave that out.)
- Call .printSchema() again on ratings to confirm that the data types are now correct.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Use .printSchema() to see the datatypes of the ratings dataset
ratings.____()
# Tell Spark to convert the columns to the proper data types
ratings = ratings.select(ratings.userId.cast("____"), ratings.movieId.cast("____"), ratings.rating.cast("____"))
# Call .printSchema() again to confirm the columns are now in the correct format
ratings.____()