Read Dask DataFrames from Parquet
In Chapter 1, you analyzed some Spotify data, which was split across multiple files, to find the top hits of 2005-2020. You did this using the dask.delayed() function and a loop. Let's see how much easier this analysis becomes using Dask DataFrames.
dask.dataframe has been imported for you as dd.
This exercise is part of the course Parallel Programming with Dask in Python.
Exercise instructions
- Load the Parquet data folder located in "data/spotify_parquet".
- Use the DataFrame's .nlargest() method to find the top 10 songs by 'popularity'.
- Convert the delayed object into a pandas DataFrame by computing it.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Read the spotify_parquet folder
df = ____
# Find the 10 most popular songs
top_10_songs = ____
# Convert the delayed result to a pandas DataFrame
top_10_songs_df = ____
print(top_10_songs_df)