View Schema
As you know from previous chapters, Spark's implementation of ALS requires that movieIds and userIds be provided as integer datatypes. Many datasets need to be prepared accordingly so that they work properly with Spark. A common issue is that Spark reads numbers in as strings, and vice versa.
Here, you'll use the .cast() method to address these type problems. Let's take a look at the schema of the dataset to ensure it's in the correct format.
This exercise is part of the course
Building Recommendation Engines with PySpark
Instructions
- Use .printSchema() to check whether the ratings dataset contains the proper data types for ALS to function correctly. Are the userIds and movieIds provided as integer datatypes? Are the ratings in numeric format?
- Ensure that the columns of the ratings dataframe are the correct data types. Call the .cast() method on each column, specifying the userId and movieId columns as type "integer" and the rating column as type "double". (We don't need the timestamp column, so we can leave that out.)
- Call .printSchema() again on ratings to confirm that the data types are now correct.
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Use .printSchema() to see the datatypes of the ratings dataset
ratings.____()
# Tell Spark to convert the columns to the proper data types
ratings = ratings.select(ratings.userId.cast("____"), ratings.movieId.cast("____"), ratings.rating.cast("____"))
# Call .printSchema() again to confirm the columns are now in the correct format
ratings.____()