1. Introduction to Dask
Hi, and welcome to this course on parallel programming with Dask.
2. Speeding up computations using multiple cores
Modern computers have multiple cores in the CPU. To make the most of our computing resources, we need to write code which uses multiple cores. In this course, we will learn to use Dask to do this, and therefore speed up our calculations.
3. Concurrent programming
Regular programming is sequential. We work through a script in order, completing one step after another. This path through the script is known as a thread of execution.
4. Multi-threading
If some steps in the script are independent of each other, we could split these across two or more threads. We don't need to have completed task 1 to do task 5.
5. Multi-threading
Modern computers have multiple cores in the CPU, which can run simultaneously. So we can run these threads in parallel on different cores.
6. Multi-threading
This all happens inside a single Python process. A process is an instance of a program.
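As a rough illustration of several threads living inside one process, here is a minimal sketch that uses Python's standard library rather than Dask. The function name task and the choice of two worker threads are assumptions made only for this example.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

def task(n):
    # Hypothetical task: report which process and which thread ran it
    return n, os.getpid(), threading.get_ident()

# Run five independent tasks on a pool of two threads
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(task, range(1, 6)))

# Collect the process IDs seen by the tasks
process_ids = {pid for _, pid, _ in results}

# Every task reports the same process ID as the main script
print(process_ids == {os.getpid()})
```

However the tasks are shared between the threads, they all run inside the same Python process.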
7. Parallel processing
An alternative to multi-threading is parallel processing. Parallel processing takes each of these sequences of tasks and runs it inside a separate Python process. This is like having Python open two or more times.
8. Parallel programming
Parallel programming is writing code so that it can be run in parallel using threads or processes. We'll cover the differences between the two later.
9. Lazy evaluation
We can use the Dask package to do parallel programming in Python. To do this, Dask relies on lazy evaluation, which means that variables are not computed until the point in our script where they are actually needed. Instead, a task list is created.
When we finally need the result in our script, Dask will analyze the steps required to compute it, and then split those steps across multiple threads or processes to finish faster.
10. Dask delayed
To convert regular code into lazily evaluated code, we can use Dask's delayed function. This function takes another function as an argument and returns a lazy version of it.
11. Dask delayed
When we run this function on some input value, it will return a delayed object. The original function hasn't actually been run yet.
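The two steps just described might look something like the sketch below. The body of my_square_function and the variable names are assumptions for illustration; only delayed itself comes from Dask.

```python
from dask import delayed

def my_square_function(x):
    # Assumed implementation: return the square of x
    return x ** 2

# Step 1: wrap the function to get a lazy version of it
delayed_square_function = delayed(my_square_function)

# Step 2: calling the lazy version records a task; nothing is squared yet
delayed_result = delayed_square_function(4)

# The result is a Delayed object rather than a number
print(type(delayed_result).__name__)
```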
12. Dask delayed
In practice, we can delay a function and use it in a single line. Here we delay my_square_function, and call it with an input value of 4.
13. Computing the answer
We use the delayed object's .compute() method to run the calculation. This returns the answer that the non-delayed function would have given.
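Putting the one-line form and the compute step together, a minimal sketch could look like this. The body of my_square_function is again an assumption.

```python
from dask import delayed

def my_square_function(x):
    # Assumed implementation: return the square of x
    return x ** 2

# Delay the function and call it in a single line
delayed_result = delayed(my_square_function)(4)

# Only now is the function actually run
real_result = delayed_result.compute()

print(real_result)  # 16, the same as my_square_function(4)
```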
14. Using operations on delayed objects
We can perform standard mathematical operations on the delayed object. Each operation returns another delayed object, and Dask keeps track of all the tasks needed to compute the answer.
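For instance, arithmetic on a delayed object could be sketched as follows. The numbers and variable names are chosen only for illustration.

```python
from dask import delayed

def my_square_function(x):
    # Assumed implementation: return the square of x
    return x ** 2

delayed_result1 = delayed(my_square_function)(4)

# Ordinary arithmetic on a delayed object gives another delayed object
delayed_result2 = (4 + delayed_result1) * 5
print(type(delayed_result2).__name__)

# Dask runs the squaring, the addition and the multiplication only now
answer = delayed_result2.compute()
print(answer)  # (4 + 16) * 5 = 100
```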
15. Lazy evaluation
In this example, each number is squared and then added together, so the squaring operation could be run in parallel across multiple threads or processes.
16. Lazy evaluation
When the compute method is run, Dask automatically splits the tasks across threads or processes. We'll come back to how to choose which is used. For now, we'll let Dask use its default, which in this case is threads.
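A sum-of-squares calculation of this kind might be written roughly as below. The input list and the variable names are assumptions, not taken from the slides.

```python
from dask import delayed

def my_square_function(x):
    # Assumed implementation: return the square of x
    return x ** 2

x_list = [1, 2, 3, 4]  # illustrative input values

# Build up the sum lazily; each squaring is an independent task
sum_of_squares = 0
for x in x_list:
    delayed_square = delayed(my_square_function)(x)
    sum_of_squares = sum_of_squares + delayed_square

# Dask decides how to spread the tasks out when compute is called
result = sum_of_squares.compute()
print(result)  # 1 + 4 + 9 + 16 = 30
```

Because no squaring task depends on another, Dask is free to run them at the same time.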
17. Sharing computation
Using lazy evaluation means Dask will not automatically store the intermediate results. In this example, delayed_intermediate will be calculated twice: once to calculate delayed_result1 and once to calculate delayed_result2. This is inefficient.
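One way to see the repeated work is to count how often the intermediate step runs. In this sketch the function slow_intermediate and the calls list are hypothetical helpers, and the counting relies on the default threaded scheduler running everything in the same process.

```python
from dask import delayed

calls = []  # records each time the intermediate step actually runs

def slow_intermediate(x):
    # Hypothetical stand-in for an expensive intermediate calculation
    calls.append(x)
    return x * 2

delayed_intermediate = delayed(slow_intermediate)(10)
delayed_result1 = delayed_intermediate + 1
delayed_result2 = delayed_intermediate - 1

# Two separate compute calls each redo the intermediate step
result1 = delayed_result1.compute()
result2 = delayed_result2.compute()

print(result1, result2)  # 21 19
print(len(calls))        # 2, the intermediate step ran twice
```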
18. Sharing computation
Instead, we can use the dask.compute() function to compute multiple delayed objects at once and share intermediate results between them.
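A minimal sketch of sharing the intermediate result is shown here. As before, slow_intermediate and the calls list are hypothetical helpers used only to count how often the intermediate step runs under the default threaded scheduler.

```python
import dask
from dask import delayed

calls = []  # records each time the intermediate step actually runs

def slow_intermediate(x):
    # Hypothetical stand-in for an expensive intermediate calculation
    calls.append(x)
    return x * 2

delayed_intermediate = delayed(slow_intermediate)(10)
delayed_result1 = delayed_intermediate + 1
delayed_result2 = delayed_intermediate - 1

# Compute both results together so the intermediate step is shared
result1, result2 = dask.compute(delayed_result1, delayed_result2)

print(result1, result2)  # 21 19
print(len(calls))        # 1, the intermediate step ran only once
```

dask.compute returns a tuple with one value per delayed object passed in, which is why the results can be unpacked directly.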
19. Let's practice!
These are the first steps in learning Dask, and later in this chapter, you'll use these functions to analyze some Spotify data. For now, let's master these basics in some exercises.