1. Using processes and threads
We've learned how to build pipelines to process different kinds of data using Dask, but so far, we have used Dask's default scheduling method.
2. Dask default scheduler
By default, Dask arrays, DataFrames, and delayed objects use threads, while Dask bags use parallel processes. But these defaults won't always be the best choice for our calculation.
3. Choosing the scheduler
We can choose the scheduler Dask uses to compute an answer by setting the scheduler parameter inside the object's compute method. We set the scheduler to threads to use multi-threading and to processes to use parallel processes. We can also set this parameter inside the dask-dot-compute function.
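As a sketch of this step, the delayed function below is illustrative, not from the course exercises. Note that the processes scheduler spawns worker processes, so when running as a plain script it is safest to put that call behind a main guard:

```python
import dask


# An illustrative delayed computation
@dask.delayed
def double(x):
    return 2 * x


# Summing delayed objects produces another delayed object
total = sum(double(i) for i in range(4))

# Choose the scheduler inside the object's compute method
result = total.compute(scheduler="threads")  # multi-threading

if __name__ == "__main__":
    # Parallel processes; guarded because this spawns new processes
    result_p = total.compute(scheduler="processes")

# The same parameter works inside the dask.compute function
(result2,) = dask.compute(total, scheduler="threads")
```

Either spelling computes the same answer; only the workers used to compute it change.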
4. Recap - threads vs. processes
We learned previously that processes are slower to start up than threads, and that transferring data to them is slow. However, they avoid the global interpreter lock, or GIL. We can use our knowledge of the calculation we are running to predict which of these methods will be faster, but we can also just test them.
5. Creating a local cluster
Dask has even more options than using just processes or just threads to run your calculations, and these other options will sometimes be faster. We can use the LocalCluster object from the dask-dot-distributed subpackage to use threads and processes together. Here, we create a cluster object. We tell it to use processes and to create 2 individual processes. We also tell it to allow each of these processes to use 2 threads. This means there are a total of 4 threads.
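A minimal sketch of this cluster setup follows. Because this launches worker processes, the creation is wrapped in a main guard, which is needed when the code runs as a script:

```python
from dask.distributed import LocalCluster

if __name__ == "__main__":  # guard process creation when run as a script
    cluster = LocalCluster(
        processes=True,        # use separate processes...
        n_workers=2,           # ...2 of them
        threads_per_worker=2,  # each process may use 2 threads: 4 threads total
    )
    n_started = len(cluster.workers)  # 2 worker processes
    cluster.close()
```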
6. Creating a local cluster
Similarly, we can create a local cluster which uses threads. Here we create 2 workers, and each worker can itself use 2 threads.
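The threaded version looks almost identical; we only switch off processes. Since no new processes are spawned, no main guard is needed:

```python
from dask.distributed import LocalCluster

# 2 workers running inside the current process, each using 2 threads
cluster = LocalCluster(
    processes=False,
    n_workers=2,
    threads_per_worker=2,
)
n_workers = len(cluster.workers)  # 2 in-process workers
cluster.close()
```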
7. Simple local cluster
If we choose not to set the number of workers and threads when creating a cluster, Dask will choose these for us by checking our computer's specifications. This computer has 4 cores, which is the number of physical CPUs. Each core has 2 logical processors, which means each core can run 2 threads at once. So when using parallel processing, Dask will choose to use 4 processes and 2 threads per process. If instead we chose to not use processes, then Dask creates one thread per logical processor, for a total of 8.
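We can peek at part of what Dask inspects ourselves. As a sketch: `os.cpu_count` reports logical processors, and on a machine like the one described, Dask's defaults would work out to 4 processes with 2 threads each, or 8 threads without processes:

```python
import os

# os.cpu_count() reports logical processors: on the machine above this
# would be 8 (4 physical cores x 2 logical processors per core)
n_logical = os.cpu_count()
```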
8. Creating a client
In order to use this cluster to run calculations, we need to create a client. The client passes our calculations to the cluster we created. We use the Client class from dask-dot-distributed and pass it our cluster.
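A minimal sketch, using a threaded cluster so no process guard is required:

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(processes=False, n_workers=2, threads_per_worker=2)

# The client passes our calculations to the cluster
client = Client(cluster)
n_workers = len(client.scheduler_info()["workers"])  # 2

client.close()
cluster.close()
```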
9. Creating a client easily
To save on some lines of code, we can make the Client class create its own local cluster. Instead of creating a cluster and passing it to the client, like the code on the left, we can pass the local cluster arguments to the client and have it create its own cluster, like the code on the right. These two blocks of code are equivalent.
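The two equivalent forms can be sketched like this (again using a threaded cluster so the example runs without a process guard):

```python
from dask.distributed import Client, LocalCluster

# Left: build the cluster ourselves, then hand it to the client
cluster = LocalCluster(processes=False, n_workers=2, threads_per_worker=2)
client = Client(cluster)
n_left = len(client.scheduler_info()["workers"])
client.close()
cluster.close()

# Right: pass the same arguments to Client and let it build its own cluster
client = Client(processes=False, n_workers=2, threads_per_worker=2)
n_right = len(client.scheduler_info()["workers"])
client.close()
```

Both clients end up backed by the same kind of 2-worker local cluster.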
10. Using the cluster
Once we have created the client, Dask will use it for all computations by default, whether they are bags, DataFrames, or anything else. However, we can still change the scheduler by setting it explicitly in the compute method. We can also explicitly pass the computation to the client.
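A sketch of these three ways of running a computation, using an illustrative bag (not from the course exercises):

```python
import dask.bag as db
from dask.distributed import Client

# A client backed by its own threaded local cluster
client = Client(processes=False, n_workers=2, threads_per_worker=2)

numbers = db.from_sequence(range(8))
total = numbers.sum()

default_result = total.compute()                      # runs on the client by default
threaded_result = total.compute(scheduler="threads")  # explicit scheduler overrides it
explicit_result = client.compute(total).result()      # explicitly pass to the client

client.close()
```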
11. Other kinds of cluster
We can also use Dask to split a computation across multiple computers using different cluster classes. These are beyond the scope of this course, but if we have access to a cluster, we can use the same Dask workflow we have learned in this course to scale up our computations massively.
12. Let's practice!
Now let's practice.