
Using any data in Dask bags

1. Using any data in Dask bags

So far, we've learned how to use Dask bags to load unstructured data from text files and load semi-structured data from JSON.

2. Complex mixed data formats

But we can use Dask bags for any kind of data that we can load in Python. This could be video, audio, or any kind of Python object. Let's say we are working with some video data. The displayed images, or frames, form a three-dimensional array. Each video also has a one-dimensional array which stores the video's sound. As well as this, there may be some metadata. If the video is in mp4 format, then it may have directors, producers, writers, a copyright notice, and many other pieces of data stored in the mp4 file. This mixed data won't fit into a single Dask array or DataFrame. Instead, we can work with it inside a Dask bag.

3. Creating a Dask bag

We can start by using glob to create a list of all of our mp4 files.
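As a minimal sketch of this step, the snippet below builds a throwaway directory with a few empty placeholder files so it can run anywhere. The directory and file names are assumptions made for illustration, not part of the course data.

```python
import glob
import os
import tempfile

# Create a temporary directory with empty placeholder files.
# The names are hypothetical; real mp4 files would live in your own folder.
tmp_dir = tempfile.mkdtemp()
for name in ["a.mp4", "b.mp4", "notes.txt"]:
    open(os.path.join(tmp_dir, name), "w").close()

# glob returns every path that matches the wildcard pattern.
# Sorting makes the order predictable, since glob does not guarantee one.
video_filenames = sorted(glob.glob(os.path.join(tmp_dir, "*.mp4")))
print(video_filenames)
```

Only the two files ending in .mp4 are matched; the text file is ignored.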

4. Creating a Dask bag

Then we use this list to create a Dask bag using the from_sequence function. This creates a Dask bag where every element in the bag is just the filename.
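A short sketch of this step might look like the following. The filenames are hypothetical, and the synchronous scheduler is set only so the example runs in a single process; Dask bags otherwise default to a multiprocessing scheduler.

```python
import dask
import dask.bag as db

# Run everything in one process so the sketch behaves the same
# in a script and in a notebook.
dask.config.set(scheduler="synchronous")

# Hypothetical filenames; in practice this list would come from glob.
video_filenames = ["videos/a.mp4", "videos/b.mp4", "videos/c.mp4"]

# Each element of the bag is just one filename string.
filename_bag = db.from_sequence(video_filenames)

# compute() turns the bag back into an ordinary Python list.
elements = filename_bag.compute()
print(elements)
```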

5. Loading custom data

For whatever kind of data we are using, we will need to have a function which can load it. Here, the load_mp4 function takes a filename and returns a dictionary which contains the video and audio arrays, and the filename.
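The shape of such a loading function can be sketched as below. This is a stand-in, not a real decoder: it returns placeholder NumPy arrays instead of reading the file, and the array sizes are arbitrary assumptions. A real load_mp4 would call a video library at the marked point.

```python
import numpy as np

def load_mp4(filename):
    """Stand-in loader that returns placeholder arrays for one video."""
    # A real implementation would decode the file here.
    # Assumed layout for the frames: (frames, height, width).
    video = np.zeros((10, 48, 64), dtype=np.uint8)
    # The sound is a one-dimensional array of samples.
    audio = np.zeros(16000, dtype=np.int16)
    return {"video": video, "audio": audio, "filename": filename}

video_dict = load_mp4("videos/a.mp4")  # hypothetical path
print(sorted(video_dict.keys()))
```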

6. Loading custom data

We map this function over all of the file names. In this new bag, every element is one of our video dictionaries. But remember that this is performed lazily. So no data has been loaded yet.
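The lazy mapping can be sketched as follows. The load_mp4 function here is a hypothetical stand-in that returns small placeholder lists, and the filenames are made up for illustration.

```python
import dask
import dask.bag as db

# Single-process scheduler, chosen only to keep the sketch simple.
dask.config.set(scheduler="synchronous")

def load_mp4(filename):
    # Hypothetical stand-in for a real decoder.
    return {"video": [[0]], "audio": [0], "filename": filename}

filename_bag = db.from_sequence(["a.mp4", "b.mp4"])

# map only records the work to do; load_mp4 has not been called yet.
data_bag = filename_bag.map(load_mp4)
print(data_bag)

# The loading function runs only when results are requested.
loaded = data_bag.compute()
print([d["filename"] for d in loaded])
```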

7. Loading custom data

Alternatively, we could have loaded the data like we did in chapter one. We could loop over the filenames and append the delayed loaded files to a list. Constructing this list or using a Dask bag are very similar approaches, and in fact, we can convert between them.
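The delayed alternative can be sketched like this, again with a hypothetical stand-in for load_mp4 and made-up filenames.

```python
import dask

def load_mp4(filename):
    # Hypothetical stand-in for a real decoder.
    return {"video": [[0]], "audio": [0], "filename": filename}

video_filenames = ["a.mp4", "b.mp4"]

# Wrap each call in dask.delayed so nothing is loaded yet.
delayed_videos = []
for filename in video_filenames:
    delayed_videos.append(dask.delayed(load_mp4)(filename))

# dask.compute runs all of the delayed calls and returns a tuple of results.
results = dask.compute(*delayed_videos)
print([r["filename"] for r in results])
```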

8. List of delayed objects vs. Dask bag

Using the from_delayed function, we can convert a list of Dask delayed objects into a Dask bag. Using the to_delayed method does the opposite and converts a Dask bag into a list of Dask delayed objects. Generally, the code will be tidier if we use bags, but not always. Which one we choose is really just a style choice.
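A sketch of the round trip is shown below. One detail worth noting: from_delayed treats each delayed object as one partition, so each one should evaluate to an iterable of elements, and to_delayed likewise returns one delayed object per partition. The load_batch helper and the filenames are assumptions made for illustration.

```python
import dask
import dask.bag as db

# Single-process scheduler, chosen only to keep the sketch simple.
dask.config.set(scheduler="synchronous")

@dask.delayed
def load_batch(filenames):
    # Hypothetical helper: returns a list of elements for one partition.
    return [{"filename": f} for f in filenames]

delayed_parts = [load_batch(["a.mp4", "b.mp4"]), load_batch(["c.mp4"])]

# List of delayed objects -> bag with one partition per delayed object.
bag = db.from_delayed(delayed_parts)
names = [d["filename"] for d in bag.compute()]

# Bag -> list of delayed objects, again one per partition.
back_to_delayed = bag.to_delayed()
last_partition = list(back_to_delayed[1].compute())
print(names, last_partition)
```

If each delayed object returned a single dictionary instead of a list, the bag would iterate over that dictionary's keys, which is rarely what is wanted.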

9. Further analysis

Once we have loaded our data into a bag, we can use the map method to run further analysis. For example, our transcribe_audio function takes each video dictionary and analyzes the audio to create a transcript. It adds this new information to the video dictionary and returns it.
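The pattern can be sketched as below. The transcription logic is a placeholder assumption: it returns a fixed string for any non-empty audio, where a real function would run a speech-to-text model. The sketch also returns a new dictionary rather than modifying the input, which is one possible design choice.

```python
import dask
import dask.bag as db

# Single-process scheduler, chosen only to keep the sketch simple.
dask.config.set(scheduler="synchronous")

def transcribe_audio(video_dict):
    # Placeholder logic standing in for real speech recognition.
    transcript = "placeholder transcript" if len(video_dict["audio"]) else ""
    # Copy the dictionary and add the new key.
    return {**video_dict, "transcript": transcript}

# Hypothetical, already-loaded video dictionaries.
data_bag = db.from_sequence([
    {"filename": "a.mp4", "audio": [0.1, 0.2]},
    {"filename": "b.mp4", "audio": []},
])

transcribed_bag = data_bag.map(transcribe_audio)
transcribed = transcribed_bag.compute()
print([d["transcript"] for d in transcribed])
```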

10. Further analysis

It is possible to build out a full data pipeline by using the map and filter methods with custom functions. Here we filter out dictionaries which have blank transcripts. We also analyze whether the transcripts are positive or negative, and add this to the dictionary too. As before, after we have built up the task graph, we can drop any variables we don't want and convert to a Dask DataFrame.
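The filter and map steps can be sketched as follows. The sentiment rule is a toy word lookup, assumed here purely for illustration, and the records are made up.

```python
import dask
import dask.bag as db

# Single-process scheduler, chosen only to keep the sketch simple.
dask.config.set(scheduler="synchronous")

def has_transcript(video_dict):
    # Keep only dictionaries whose transcript is not blank.
    return video_dict["transcript"].strip() != ""

def add_sentiment(video_dict):
    # Toy rule: any word from a small positive list marks the text positive.
    positive_words = {"great", "good", "love"}
    words = set(video_dict["transcript"].lower().split())
    label = "positive" if words & positive_words else "negative"
    return {**video_dict, "sentiment": label}

# Hypothetical transcribed records.
bag = db.from_sequence([
    {"filename": "a.mp4", "transcript": "I love this"},
    {"filename": "b.mp4", "transcript": ""},
    {"filename": "c.mp4", "transcript": "this is bad"},
])

# Chain the steps; nothing runs until compute is called.
pipeline = bag.filter(has_transcript).map(add_sentiment)
result = pipeline.compute()
print(result)
```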

11. Results

Finally, we use the compute method to run the calculations, and collect the results in a pandas DataFrame.

12. Using .wav files

Any unstructured data can be used with Dask bags, but in the following exercises, you are going to be analyzing audio files in .wav format. To load these files, we can use the wavfile module from scipy.io. The read function returns the sampling frequency, which is like the frame rate for recorded audio, and the audio itself.

13. Using .wav files

The sampling frequency is just an integer, and the audio data is just a NumPy array of amplitudes at each time point.
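To see these two return values, the sketch below first writes a short synthetic tone to a temporary .wav file and then reads it back. The tone, the file name, and the 8000 Hz rate are assumptions chosen so the example is self-contained.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

# Build one second of a 440 Hz sine wave as 16-bit samples.
sampling_freq = 8000
t = np.arange(sampling_freq) / sampling_freq
tone = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)

# Write the tone to a temporary file (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "tone.wav")
wavfile.write(path, sampling_freq, tone)

# read returns the sampling frequency and the audio samples.
rate, audio = wavfile.read(path)
print(rate, audio.shape, audio.dtype)
```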

14. Let's practice!

Now let's practice!