
Task graphs and scheduling methods

1. Task graphs and scheduling methods

Let's take a closer look at what Dask is doing behind the scenes.

2. Visualizing a task graph

Here we create two objects using a delayed function and add them. We can visualize the task graph for this using the result object's dot-visualize method. The figure produced is read from bottom to top and shows the tasks required to compute the result.
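The pattern described above can be sketched as follows. This is a minimal example, not the course's exact code; the two small lambdas are placeholders for whatever delayed computations are being added.

```python
from dask import delayed

# Two hypothetical delayed objects, standing in for the course's example
x = delayed(lambda v: v + 1)(1)   # lazily computes 1 + 1
y = delayed(lambda v: v * 2)(2)   # lazily computes 2 * 2
result = x + y                    # still lazy: nothing has run yet

# Render the task graph (requires the graphviz package);
# the figure is read from bottom to top
# result.visualize(filename="task_graph.png")

print(result.compute())  # 6
```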

3. Overlapping task graph

This example, which we saw in the last lesson, has a shared intermediate result.

4. Overlapping task graph

But if we look at the task graphs for each result, we see that the graphs don't reflect this. If we run the compute method on delayed_result_one and delayed_result_two independently, then delayed_intermediate will be calculated twice.

5. Overlapping task graph

Just like we used the dask-dot-compute function in the last lesson, we can use the dask-dot-visualize function to plot the task graph for both results together. Dask notices that the results share delayed_intermediate. When we compute these results together, the intermediate is only computed once. We should compute results together as much as possible, since this allows Dask to optimize shared steps like this. Now let's look at our two scheduling methods.
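A minimal sketch of a shared intermediate follows. The function names here (expensive_intermediate, add_one, add_two) are hypothetical stand-ins for the course's example, but the structure matches: two results depending on one intermediate node.

```python
import dask
from dask import delayed

@delayed
def expensive_intermediate(x):
    return x * 10

@delayed
def add_one(x):
    return x + 1

@delayed
def add_two(x):
    return x + 2

# Both results depend on the same intermediate node
intermediate = expensive_intermediate(5)
delayed_result_one = add_one(intermediate)
delayed_result_two = add_two(intermediate)

# Plotting both graphs at once shows the shared node (requires graphviz)
# dask.visualize(delayed_result_one, delayed_result_two)

# Computing both together means the intermediate runs only once
result_one, result_two = dask.compute(delayed_result_one, delayed_result_two)
print(result_one, result_two)  # 51 52
```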

6. Multi-threading vs. parallel processing

Parallel processing uses multiple instances of Python. Each instance, or process, uses its own memory space in RAM. Multi-threading uses only one Python instance, so all threads share the same memory.
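In Dask, the choice between these two methods can be made with the scheduler keyword when computing. A minimal sketch, using a trivial delayed sum as the workload:

```python
from dask import delayed

total = delayed(sum)([1, 2, 3])

# Multi-threading: one Python process, all threads share memory
print(total.compute(scheduler="threads"))    # 6

# Parallel processing: separate Python instances with separate memory;
# inputs and results are pickled and copied between processes
print(total.compute(scheduler="processes"))  # 6
```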

7. Multi-threading vs. parallel processing

In this example, if we use parallel processes to calculate sum1 and sum2, then the big_arrays are copied over from the main Python session into new processes to run these sums. The result must be copied back into the main session. Copying data between processes is slow, and in this case, takes longer than computing the sum.

8. Multi-threading vs. parallel processing

Multi-threading occurs inside the main Python session, so these arrays don't need to be copied at all.

9. The GIL

Unlike many other programming languages, Python has a global interpreter lock, the GIL, which limits its multi-threading ability. The GIL means that only one thread can execute Python code at a time; while one thread holds the lock, the others must wait their turn. It is kind of like a team of cooks passing round a single recipe book. Once a cook knows their next step, they hand the book to someone else. If the steps are short, the cooks spend a lot of time waiting to read. Parallel processing avoids this, since each process has its own GIL, so processes can run the code in parallel. This example function sums the numbers from zero to n. If we want to run it multiple times, multi-threading won't give us a speedup, but parallel processing will.
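The contrast can be sketched with the standard library's thread and process pools. This is an illustration under assumed names (sum_to_n is a stand-in for the course's example function), not the course's exact code:

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def sum_to_n(n):
    # Pure-Python loop: this thread holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i
    return total

if __name__ == "__main__":
    jobs = [200_000] * 6

    # Threads take turns holding the GIL, so there is little or no speedup
    with ThreadPoolExecutor(max_workers=3) as ex:
        thread_results = list(ex.map(sum_to_n, jobs))

    # Each process has its own GIL, so the sums can run truly in parallel
    with ProcessPoolExecutor(max_workers=3) as ex:
        process_results = list(ex.map(sum_to_n, jobs))

    print(thread_results == process_results)  # same answers, different speeds
```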

10. Example timings - GIL

This figure shows the times taken to run the function 16 times, using a for-loop, using 3 threads, and using 3 processes. Each orange bar shows when a task started and finished. Processes took a while to start up, but were much faster. The 3 threads took 3 times longer to complete each task because they kept locking each other out.

11. Functions which release the GIL

Some functions, like loading data, release the GIL so that other threads can read the code while that thread is waiting.
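A small sketch of this behavior: time.sleep releases the GIL while waiting, much like a thread blocked on file I/O, so three waits in three threads overlap instead of running one after another.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_io(seconds):
    # time.sleep releases the GIL, mimicking a thread waiting on file I/O
    time.sleep(seconds)
    return seconds

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as ex:
    results = list(ex.map(slow_io, [0.2, 0.2, 0.2]))
elapsed = time.perf_counter() - start

# The three waits overlap, so the total is roughly 0.2s rather than 0.6s
print(elapsed < 0.5)
```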

12. Example timings - Loading data

This figure shows the times taken to load in CSV files. The threads don't lock each other out here, so they are fast.

13. Summary

Let's recap these scheduling methods. Both processes and threads take time to start up, but threads start much faster. Processes have separate memory pools, while threads share their memory between them. This means it takes time to transfer data into, out of, and between processes. Processes each have their own GIL, while threads share a single one, which can slow threads down.

14. Let's practice!

Now let's go into some exercises.