View Schema
As you know from previous chapters, Spark's implementation of ALS requires that movieIds and userIds be provided as integer datatypes. Many datasets must be prepared accordingly before they will work properly with Spark. A common issue is that Spark reads numeric columns as strings, or the reverse; for example, when a CSV file is loaded without schema inference, every column comes in as a string.
Here, you'll use the .cast() method to address these types of problems. Let's take a look at the schema of the dataset to ensure it's in the correct format.
This exercise is part of the course Building Recommendation Engines with PySpark.
Exercise instructions
- Use .printSchema() to check whether the ratings dataset contains the proper data types for ALS to function correctly. Are the userIds and movieIds provided as integer datatypes? Are the ratings in numeric format?
- Ensure that the columns of the ratings dataframe are the correct data types. Call the .cast() method on each column, specifying type "integer" for the userId and movieId columns and type "double" for the rating column. (We don't need the timestamp column, so we can leave that out.)
- Call .printSchema() again on ratings to confirm that the data types are now correct.
Hands-on interactive exercise
Try this exercise by completing the sample code below.
# Use .printSchema() to see the datatypes of the ratings dataset
ratings.____()
# Tell Spark to convert the columns to the proper data types
ratings = ratings.select(ratings.userId.cast("____"), ratings.movieId.cast("____"), ratings.rating.cast("____"))
# Call .printSchema() again to confirm the columns are now in the correct format
ratings.____()