Building delayed pipelines

1. Building delayed pipelines

So far, we've seen how Dask can be used as a general-purpose tool for parallel programming.

2. Chunks of data

Dask is also a really useful tool for building pipelines to process large amounts of data. As our datasets grow, there comes a point where we can no longer load the whole dataset into memory at once. Although the data fits on our computer's hard drive, it will not fit into RAM, which is where computations run. So we need to load and process the data in chunks, which is generally slower than if the whole dataset could be loaded at once. Thankfully, we can use Dask to speed up this process.

3. Spotify songs dataset

Let's look at an example where we have a directory of CSV files, each containing data about Spotify songs released in a particular year. There is too much data to load it all into memory at once.

4. Spotify songs dataset

Each CSV file has a number of columns, but let's say we want to compute some statistics on the track duration. In particular, we want to find the maximum track length across all years.

5. Analyzing the data

To compute this, we will need to load each CSV, find its maximum track length, and store this value. Once we have looped through all the files, we find the maximum of all these maximums. In this example, we spend a lot of time loading data, so we may be able to speed up the analysis using multi-threading or parallel processing.
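The eager version of this loop might look like the sketch below. The file names and the duration_ms column are assumptions for illustration, and two tiny synthetic CSVs stand in for the real Spotify data.

```python
import os
import tempfile

import pandas as pd

# Stand-in data: two tiny CSVs, one per year (the file names and the
# 'duration_ms' column are assumptions for illustration).
tmpdir = tempfile.mkdtemp()
for year, durations in [(2019, [200_000, 350_000]), (2020, [180_000, 410_000])]:
    pd.DataFrame({"duration_ms": durations}).to_csv(
        os.path.join(tmpdir, f"{year}.csv"), index=False
    )

maximums = []
for filename in sorted(os.listdir(tmpdir)):
    df = pd.read_csv(os.path.join(tmpdir, filename))  # load one file at a time
    maximums.append(df["duration_ms"].max())          # max for this year

max_length = max(maximums)  # maximum across all years
print(max_length)  # 410000
```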

6. Analyzing the data

To convert this to lazy evaluation, all we need to do is delay the pandas read-csv function,

7. Analyzing the data

and delay the final max function.
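Putting those two delayed calls together, the lazy version of the loop might look like this sketch. As before, the synthetic CSVs and the duration_ms column are assumptions for illustration.

```python
import os
import tempfile

import dask
import pandas as pd

# Stand-in data (file names and the 'duration_ms' column are assumptions).
tmpdir = tempfile.mkdtemp()
for year, durations in [(2019, [200_000, 350_000]), (2020, [180_000, 410_000])]:
    pd.DataFrame({"duration_ms": durations}).to_csv(
        os.path.join(tmpdir, f"{year}.csv"), index=False
    )

maximums = []
for filename in sorted(os.listdir(tmpdir)):
    # Delay read_csv: nothing is loaded yet, a task is added to the graph
    df = dask.delayed(pd.read_csv)(os.path.join(tmpdir, filename))
    maximums.append(df["duration_ms"].max())  # also lazy

# Delay the final max over the per-file maximums
max_length = dask.delayed(max)(maximums)
print(max_length.compute())  # 410000
```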

8. Using methods of a delayed object

The max-length variable calculated inside the loop is a delayed object. This is because df is also a delayed object. When you use a method of a delayed object, it returns another delayed object. Dask will add the task of running this method to the task graph. The same is also true for accessing properties of the object, like df-dot-shape.
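A minimal sketch of this behavior, building the delayed DataFrame in memory rather than from a CSV:

```python
import dask
import pandas as pd
from dask.delayed import Delayed

# A delayed DataFrame (constructed in memory here for illustration)
df = dask.delayed(pd.DataFrame)({"duration_ms": [200_000, 350_000]})

max_length = df["duration_ms"].max()  # calling a method -> another Delayed
shape = df.shape                      # accessing a property -> also Delayed

print(isinstance(max_length, Delayed))  # True
print(shape.compute())                  # (2, 1)
```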

9. Using methods of a delayed object

Here, Dask remembers that it will need to run the fake method on df. It doesn't know whether df will actually have that method once it is evaluated. So if a method doesn't exist, it won't cause an error until the compute method is run.
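A quick sketch of this deferred error, again using an in-memory delayed DataFrame for illustration (fake_method is, of course, made up):

```python
import dask
import pandas as pd

df = dask.delayed(pd.DataFrame)({"duration_ms": [200_000]})

lazy = df.fake_method()  # no error here: Dask just records the call
try:
    lazy.compute()       # the missing method only surfaces now
except AttributeError:
    print("AttributeError raised at compute time")
```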

10. Using methods of a delayed object

The maximums variable is a list of delayed objects, and we can pass this into a delayed max function.

11. Computing lists of delayed objects

We can compute the results of the entire list of maximums using the dask-dot-compute function. This returns a tuple with a list of results inside.
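A minimal sketch of this, with two delayed objects standing in for the per-file maximums:

```python
import dask

# Two delayed objects standing in for per-file maximums
maximums = [dask.delayed(max)([200_000, 350_000]),
            dask.delayed(max)([180_000, 410_000])]

results = dask.compute(maximums)  # a tuple containing the list of results
print(results)     # ([350000, 410000],)
print(results[0])  # [350000, 410000]
```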

12. Computing lists of delayed objects

We can get the list back just by selecting the first item.

13. To delay or not to delay

Remember, just because we use a function, that doesn't mean we need to delay it. Inside this loop, we moved the line of code which finds the maximum track length into a function. The code still works as before, and what comes out of the function is a delayed object. This is because the function only uses methods and properties of the delayed object passed into it.
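A sketch of this pattern; the function name and the duration_ms column are assumptions for illustration:

```python
import dask
import pandas as pd
from dask.delayed import Delayed

def get_max_duration(df):
    # Not itself delayed: it only uses methods of df, so when df is a
    # delayed object the return value is a delayed object too.
    return df["duration_ms"].max()

df = dask.delayed(pd.DataFrame)({"duration_ms": [200_000, 410_000]})
result = get_max_duration(df)

print(isinstance(result, Delayed))  # True
print(result.compute())             # 410000
```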

14. Deeper task graph

This is the full task graph which will be used to calculate the final maximum. Note that the 14 branches here don't pass information between each other, and each branch only outputs a single number at the end. We only pass in a filename string at the start of each branch, so there is very little data transfer. Either processes or threads could be faster here, and which wins will depend on our computer and the size of each data file.
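The scheduler can be chosen at compute time, so the same graph can be tried with both. A tiny sketch with two independent branches feeding one final task (the numbers are made up):

```python
import dask

# Two independent branches, each producing a single number
branch_maxes = [dask.delayed(max)([200_000, 350_000]),
                dask.delayed(max)([180_000, 410_000])]
total = dask.delayed(max)(branch_maxes)

# "threads" often wins when the time goes into I/O such as loading files;
# "processes" can win for CPU-bound work that holds the GIL.
print(total.compute(scheduler="threads"))  # 410000
```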

15. Let's practice!

Let's build some lazy pipelines.