Get startedGet started for free

From task to subtasks

For this exercise, you will be using parallel computing to apply the function take_mean_age() that calculates the average athlete's age in a given year in the Olympics events dataset. The DataFrame athlete_events has been loaded for you and contains amongst others, two columns:

  • Year: the year the Olympic event took place
  • Age: the age of the Olympian

You will be using the multiprocessor.Pool API which allows you to distribute your workload over several processes. The function parallel_apply() is defined in the sample code. It takes in as input the function being applied, the grouping used, and the number of cores needed for the analysis. Note that the @print_timing decorator is used to time each operation.

This exercise is part of the course

Introduction to Data Engineering

View Course

Exercise instructions

  • Complete the code, so you apply take_mean_age with 1 core first, then 2 and finally 4 cores.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Function to apply a function over multiple cores
@print_timing
def parallel_apply(apply_func, groups, nb_cores):
    with Pool(nb_cores) as p:
        results = p.map(apply_func, groups)
    return pd.concat(results)

# Parallel apply using 1 core
parallel_apply(take_mean_age, athlete_events.groupby('Year'), ____)

# Parallel apply using 2 cores
parallel_apply(take_mean_age, athlete_events.groupby('Year'), ____)

# Parallel apply using 4 cores
parallel_apply(take_mean_age, athlete_events.groupby('Year'), ____)
Edit and Run Code