A PySpark groupby
You've seen how to use the Dask framework and its DataFrame abstraction to do some calculations. However, as you've seen in the video, in the big data world Spark is probably a more popular choice for data processing.
In this exercise, you'll use the PySpark package to handle a Spark DataFrame. The data is the same as in previous exercises: participants of Olympic events between 1896 and 2016.
The Spark DataFrame, athlete_events_spark, is available in your workspace.
The methods you're going to use in this exercise are:
- .printSchema(): prints the schema of a Spark DataFrame.
- .groupBy(): groups rows for an aggregation.
- .mean(): takes the mean over each group.
- .show(): shows the results.
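To make concrete what a grouped mean produces, here is a plain-Python sketch of the same computation, using a small made-up sample of (year, age) records (the data values are illustrative, not taken from the real dataset):

```python
from collections import defaultdict

# Made-up (year, age) records standing in for the Olympic data
records = [(1896, 23.0), (1896, 27.0), (2016, 25.0), (2016, 29.0)]

# Collect the ages for each year (what .groupBy('Year') does conceptually)
ages_by_year = defaultdict(list)
for year, age in records:
    ages_by_year[year].append(age)

# Take the mean over each group (what .mean('Age') does conceptually)
mean_age_by_year = {year: sum(ages) / len(ages) for year, ages in ages_by_year.items()}
print(mean_age_by_year)  # {1896: 25.0, 2016: 27.0}
```

Unlike this eager Python version, Spark would only build up a description of this computation until you ask for the results.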
This exercise is part of the course Introduction to Data Engineering.
Exercise instructions
- Find out the type of athlete_events_spark.
- Find out the schema of athlete_events_spark.
- Print out the mean age of the Olympians, grouped by year. Notice that Spark has not actually calculated anything yet; this is called lazy evaluation.
- Take the previous result and call .show() on it to calculate the mean age.
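Lazy evaluation can be illustrated with a Python generator (this is only an analogy, not Spark itself): defining the pipeline does no work, and the computation runs only when the results are consumed, much like a Spark plan runs only when an action such as .show() is called.

```python
log = []

def ages():
    # The side effect on `log` records when work actually happens
    for age in (23, 27, 25):
        log.append(age)
        yield age

pipeline = ages()   # pipeline defined: nothing has been computed yet
assert log == []    # no work done so far

total = sum(pipeline)   # consuming the generator triggers the computation
assert log == [23, 27, 25]
print(total)  # 75
```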
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Print the type of athlete_events_spark
print(____(athlete_events_spark))
# Print the schema of athlete_events_spark
print(athlete_events_spark.____())
# Group by the Year, and find the mean Age
print(athlete_events_spark.____('Year').mean(____))
# Group by the Year, find the mean Age, and show the results
print(athlete_events_spark.____('Year').mean(____).____())