From task to subtasks
In this exercise, you will use parallel computing to apply the function take_mean_age(), which calculates the average age of athletes in a given year in the Olympic events dataset. The DataFrame athlete_events has been loaded for you and contains, among others, two columns:
- Year: the year the Olympic event took place
- Age: the age of the Olympian
You will use the multiprocessing.Pool API, which lets you distribute your workload over several processes. The function parallel_apply() is defined in the sample code. It takes as input the function being applied, the grouping used, and the number of cores to use for the analysis. Note that the @print_timing decorator is used to time each operation.
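The exercise does not show the bodies of take_mean_age() or @print_timing, so the following is only a sketch of what they might look like. It assumes that each call receives one (year, sub-DataFrame) pair, which is what iterating over a pandas groupby yields, and it uses a tiny made-up DataFrame rather than the real athlete_events data.

```python
import time
from functools import wraps

import pandas as pd


def print_timing(func):
    """Hypothetical timing decorator: prints how long each call takes."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f} s")
        return result
    return wrapper


def take_mean_age(year_and_group):
    """Assumed shape: takes one (year, sub-DataFrame) pair from a groupby."""
    year, group = year_and_group
    # One-row DataFrame indexed by year, so the pieces can be concatenated
    return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])


# Tiny made-up sample, not the real athlete_events data
sample = pd.DataFrame({
    "Year": [1996, 1996, 2000, 2000],
    "Age": [20.0, 30.0, 22.0, 26.0],
})

# Serial equivalent of what the pool does: one call per (year, group) pair
serial_result = pd.concat([take_mean_age(pair) for pair in sample.groupby("Year")])
print(serial_result)
```

Returning a one-row DataFrame per year is a design choice of this sketch: it lets pd.concat() stitch the per-group results back together in a single call.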
This exercise is part of the course
Introduction to Data Engineering
Exercise instructions
- Complete the code so that you apply take_mean_age with 1 core first, then 2, and finally 4 cores.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Function to apply a function over multiple cores
@print_timing
def parallel_apply(apply_func, groups, nb_cores):
    with Pool(nb_cores) as p:
        results = p.map(apply_func, groups)
    return pd.concat(results)
# Parallel apply using 1 core
parallel_apply(take_mean_age, athlete_events.groupby('Year'), ____)
# Parallel apply using 2 cores
parallel_apply(take_mean_age, athlete_events.groupby('Year'), ____)
# Parallel apply using 4 cores
parallel_apply(take_mean_age, athlete_events.groupby('Year'), ____)
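To try the same pattern outside the exercise environment, a self-contained sketch along these lines may help. The body of take_mean_age() and the make_sample() helper are assumptions made for illustration, the data is made up, and the timing is done inline rather than with the @print_timing decorator.

```python
import time
from multiprocessing import Pool

import pandas as pd


def take_mean_age(year_and_group):
    # Assumed signature: one (year, sub-DataFrame) pair per call
    year, group = year_and_group
    return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])


def parallel_apply(apply_func, groups, nb_cores):
    # Same body as the exercise's function, minus the timing decorator
    with Pool(nb_cores) as p:
        results = p.map(apply_func, groups)
    return pd.concat(results)


def make_sample():
    # Hypothetical helper: a made-up stand-in for athlete_events
    return pd.DataFrame({
        "Year": [1992, 1992, 1996, 1996, 2000, 2000],
        "Age": [21.0, 25.0, 30.0, 20.0, 22.0, 26.0],
    })


# The guard matters on platforms that start workers by re-importing this file
if __name__ == "__main__":
    sample = make_sample()
    for nb_cores in (1, 2, 4):
        start = time.perf_counter()
        result = parallel_apply(take_mean_age, sample.groupby("Year"), nb_cores)
        print(f"{nb_cores} core(s): {time.perf_counter() - start:.3f} s")
    print(result)
```

On a dataset this small, adding cores will likely not make things faster, because starting worker processes and pickling each group has a fixed cost. Any speedup should only be expected when the per-group work is large enough to outweigh that overhead.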