MSD summary statistics
Let's get familiar with the Million Songs Echo Nest Taste Profile data subset. For the purposes of this course, we'll just call it the Million Songs dataset, or msd. Let's get the number of users and the number of songs. Let's also see which songs have the most plays in this subset.
This exercise is part of the course
Building Recommendation Engines with PySpark
Exercise instructions
- Use the .show() method to see what the data looks like.
- Complete the code to count the number of distinct userIds. Select the userId column, then call .distinct() and .count().
- Now do the same thing for the songIds, so count the number of distinct songIds. Select the songId column and call .distinct() and .count() on it.
Interactive exercise
Complete the sample code to successfully finish this exercise.
# Look at the data
msd.____()
# Count the number of distinct userIds
user_count = msd.select("____").____().count()
print("Number of users: ", user_count)
# Count the number of distinct songIds
song_count = msd.select("____").____().count()
print("Number of songs: ", song_count)