Exercise

A PySpark groupby

You've seen how to use the Dask framework and its DataFrame abstraction to do some calculations. However, as you saw in the video, Spark is probably the more popular choice for data processing in the big data world.

In this exercise, you'll use the PySpark package to handle a Spark DataFrame. The data is the same as in previous exercises: participants of Olympic events between 1896 and 2016.

The Spark DataFrame athlete_events_spark is available in your workspace.
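
If you're following along outside the course workspace, a DataFrame like this could be loaded from a local copy of the data. The sketch below is only an assumption about how that might look; the file name athlete_events.csv is made up for illustration.

    # Hypothetical local setup; the CSV file name is an assumption.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("olympics").getOrCreate()

    athlete_events_spark = spark.read.csv(
        "athlete_events.csv", header=True, inferSchema=True
    )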

The methods you're going to use in this exercise are:

  • .printSchema(): prints the schema of a Spark DataFrame.
  • .groupBy(): groups the rows for an aggregation.
  • .mean(): takes the mean of a column over each group.
  • .show(): shows the results, triggering the actual computation (see the sketch after this list).
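
To see how these methods chain together, here is a minimal sketch on a tiny made-up DataFrame; the column names (Year, Age) and the values are assumptions for illustration only, not the course data.

    # Minimal sketch: the four methods on a small in-memory DataFrame.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("groupby-demo").getOrCreate()

    df = spark.createDataFrame(
        [(2012, 24.0), (2012, 30.0), (2016, 27.0)],
        ["Year", "Age"],
    )

    df.printSchema()                           # prints column names and types

    mean_age = df.groupBy("Year").mean("Age")  # lazy: nothing is computed yet

    mean_age.show()                            # triggers the computation and prints it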

Instructions

  • Find out the type of athlete_events_spark.
  • Find out the schema of athlete_events_spark.
  • Print out the mean age of the Olympians, grouped by year. Notice that Spark hasn't actually calculated anything yet; this is called lazy evaluation.
  • Take the previous result and call .show() on it to calculate the mean age per year. A sketch of these steps follows below.
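
Putting the steps together, here is a hedged sketch of what a solution might look like, assuming the DataFrame has Year and Age columns (the exact column names aren't stated above):

    # Sketch of the exercise steps; the "Year" and "Age" column names are assumptions.

    # Step 1: find out the type of athlete_events_spark.
    print(type(athlete_events_spark))

    # Step 2: print out the schema.
    athlete_events_spark.printSchema()

    # Step 3: build the aggregation. Printing it shows a DataFrame object,
    # not the numbers, because Spark hasn't computed anything yet.
    print(athlete_events_spark.groupBy("Year").mean("Age"))

    # Step 4: call .show() to actually compute and display the mean age per year.
    athlete_events_spark.groupBy("Year").mean("Age").show()

In Spark terms, .groupBy() and .mean() are transformations that are only planned, while .show() is an action that triggers the actual computation.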