Exercise

Building a pipeline with delayed

If we use dask.delayed, we don't need generators; the Dask scheduler manages memory usage for us. In this version of the flight delay analysis, you'll compute the percentage of flights that were delayed over the whole year.

Along with pandas, the decorator function delayed has been imported for you from dask, and the following decorated function, which calls pd.read_csv() on a single file, has been created for you:

@delayed
def read_one(filename):
    return pd.read_csv(filename)
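
To make the effect of the decorator concrete, here is a minimal sketch that calls read_one() on a small temporary CSV. The file name sample.csv and its contents are hypothetical, created only for this illustration. Calling the decorated function returns a lazy Delayed object; the file is read only when compute() is called.

```python
import os
import tempfile

import pandas as pd
from dask import delayed
from dask.delayed import Delayed


@delayed
def read_one(filename):
    return pd.read_csv(filename)


# Hypothetical one-column file, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
pd.DataFrame({'DEP_DELAY': [4, -2, 0]}).to_csv(path, index=False)

# Nothing is read yet: this only records the task
lazy_df = read_one(path)
is_lazy = isinstance(lazy_df, Delayed)

# compute() runs the task and returns an ordinary pandas DataFrame
df = lazy_df.compute()
print(is_lazy, len(df))
```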

Your job is to define three decorated functions to complete the pipeline: a function to total the number of flights, a function to count the number of delayed flights, and a function to aggregate the results.

Instructions
  • Define a @delayed function count_flights() that accepts a single DataFrame df as input and returns len(df), the number of rows in that DataFrame.
  • Define a @delayed function count_delayed() that accepts a single DataFrame df as input and returns (df['DEP_DELAY']>0).sum(), the number of flights with a positive departure delay.
  • Define a @delayed function pct_delayed() that accepts n_delayed and n_flights (the per-file counts) as input and returns 100 * sum(n_delayed) / sum(n_flights).
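
One possible way to put these pieces together is sketched below. It is not the exercise's reference solution. The two CSV files and their DEP_DELAY values are hypothetical stand-ins for the real flight data, and the loop that builds the lists of per-file counts is an assumption about how the functions would be wired up.

```python
import os
import tempfile

import pandas as pd
from dask import delayed


@delayed
def read_one(filename):
    return pd.read_csv(filename)


@delayed
def count_flights(df):
    # Total number of rows (flights) in one DataFrame
    return len(df)


@delayed
def count_delayed(df):
    # Number of flights whose departure delay is positive
    return (df['DEP_DELAY'] > 0).sum()


@delayed
def pct_delayed(n_delayed, n_flights):
    # Both arguments are lists of per-file counts
    return 100 * sum(n_delayed) / sum(n_flights)


# Hypothetical sample data standing in for the real flight files
tmpdir = tempfile.mkdtemp()
samples = {
    'flights_a.csv': [5, -3, 0, 12],         # 2 of 4 delayed
    'flights_b.csv': [-1, 7, 20, -4, 0, 3],  # 3 of 6 delayed
}
filenames = []
for name, delays in samples.items():
    path = os.path.join(tmpdir, name)
    pd.DataFrame({'DEP_DELAY': delays}).to_csv(path, index=False)
    filenames.append(path)

# Build the task graph; nothing is read or counted at this point
n_delayed, n_flights = [], []
for path in filenames:
    df = read_one(path)
    n_delayed.append(count_delayed(df))
    n_flights.append(count_flights(df))

result = pct_delayed(n_delayed, n_flights)  # still a lazy Delayed object

# compute() executes the whole graph
percentage = result.compute()
print(percentage)
```

With this sample data, 5 of the 10 flights have a positive delay, so the computed percentage is 50. Passing lists of Delayed objects to pct_delayed() works because Dask resolves delayed values nested inside list arguments before calling the function.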