1. Dask arrays
So far, we have seen how Dask can be used as a general-purpose tool for multi-threading and parallel processing. But there are more specialized parts of the Dask library for using common data types. The first data type we'll cover are arrays.
2. Chunking arrays
Imagine our dataset is highly structured so that the entire dataset could be stored in a NumPy array. Just like with CSVs in the last chapter, we may want to chunk this dataset into smaller pieces so that it doesn't take up as much RAM at once, and so we can split up calculations.
3. Chunking arrays
Dask arrays do exactly this. Each chunk inside the Dask array is a NumPy array. When we want to perform some operation over the full dataset, Dask loads it piece by piece to perform the analysis.
4. NumPy vs. Dask arrays
The syntax for creating and operating on Dask arrays is almost identical to that for NumPy arrays. Here, we create arrays filled with ones. The only difference is that Dask requires the chunk size to be specified. The NumPy and Dask arrays are 2-dimensional arrays of size four thousand by six thousand. Each chunk in the Dask array is size one thousand by two thousand. This means the array is split into 4 pieces along one dimension and 3 pieces along the other - so a total of 12 chunks. The Dask version of creating and summing this array is a lot faster here.
5. Dask array task graph
This is the task graph for the previous calculation. Dask will create 12 array chunks and sum each one. Then it will aggregate all the sums into the final answer. From the task graph, we can see that these sums could be computed in parallel.
By default, Dask arrays use threads. In this case, Dask can split up the tasks of creating the array chunks and summing across threads.
6. Dask array methods
Dask arrays are created lazily but have almost all of the same methods as NumPy arrays, such as max, min, and sum. When these methods are used, they return another lazy Dask array.
When we run the compute method, Dask will return a NumPy array.
7. Treating Dask arrays like NumPy arrays
Similar to using array methods, we can perform mathematical operations, slices, or even run NumPy functions on a Dask array, and it will return another lazy Dask array.
8. Loading arrays of images
There are lots of different ways to load data from disk into a Dask array. In this lesson, we'll look at the particular case of loading images.
Inside dask-dot-array is the subpackage called image. We can use this to lazily load in images. Here, we tell it to look inside the images folder and find all the files ending in dot-png. When we print the Dask array this command created, we can see that there are 40 thousand images, which are each 256 by 256 pixels and have 3 channels. These are the red, blue, and green values for each pixel.
9. Applying custom functions over chunks
Sometimes, when using Dask arrays, we won't want to run a function across the entire array. The function might only be designed to operate on a chunk. For example, when manipulating images, this might be to apply a function to each image separately.
In order to apply this filter function to all images, we use the image-array's map_blocks method, and pass in the function. A block is the same thing as a chunk. Since each block is a single image, this function will be applied to each image separately. This returns another lazy array.
10. Let's practice!
Okay, let's practice by using Dask to process images of American Sign Language signs.