1. Using any data in Dask bags
So far, we've learned how to use Dask bags to load unstructured data from text files and load semi-structured data from JSON.
2. Complex mixed data formats
But we can use Dask bags for any kind of data that we can load in Python. This could be video, audio, or any kind of Python object.
Let's say we are working with some video data. The displayed images, or frames, form a three-dimensional array, or a four-dimensional one if the frames are in color. Each video also has a one-dimensional array which stores the video's sound. There may also be some metadata. If the video is in mp4 format, then it may have directors, producers, writers, a copyright notice, and many other pieces of data stored in the mp4 file. This mixed data won't fit into a single Dask array or DataFrame. Instead, we can work with it inside a Dask bag.
3. Creating a Dask bag
We can start by using glob to create a list of all of our mp4 files.
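As a rough sketch of this step, the snippet below builds the list with glob. The temporary folder and the placeholder file names are assumptions made only so the example runs on its own; in practice the pattern would point at a real video folder.

```python
import glob
import os
import tempfile

# Hypothetical folder with empty placeholder files, created only for this example.
video_dir = tempfile.mkdtemp()
for name in ["clip_a.mp4", "clip_b.mp4", "notes.txt"]:
    open(os.path.join(video_dir, name), "wb").close()

# Collect every path ending in .mp4; sorted() makes the order predictable.
video_filenames = sorted(glob.glob(os.path.join(video_dir, "*.mp4")))
print(video_filenames)
```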
4. Creating a Dask bag
Then we use this list to create a Dask bag using the from_sequence function. This creates a Dask bag where every element in the bag is just the filename.
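A minimal sketch of this step might look like the following. The file names are made up for illustration; normally the list would come from glob.

```python
import dask.bag as db

# Hypothetical file names standing in for the list returned by glob.
video_filenames = ["videos/clip_a.mp4", "videos/clip_b.mp4", "videos/clip_c.mp4"]

# Every element of the resulting bag is one filename string.
filename_bag = db.from_sequence(video_filenames)

# scheduler="synchronous" runs in-process, which is enough for a toy example.
bag_contents = filename_bag.compute(scheduler="synchronous")
print(bag_contents)
```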
5. Loading custom data
For whatever kind of data we are using, we will need to have a function which can load it.
Here, the load_mp4 function takes a filename and returns a dictionary which contains the video and audio arrays, and the filename.
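To make the shape of that return value concrete, here is a stand-in version of load_mp4. It does not decode any file: the zero-filled arrays and their sizes are assumptions, and a real loader would call a video library at the marked spot.

```python
import numpy as np

def load_mp4(filename):
    """Stand-in loader: returns placeholder arrays instead of decoding the file."""
    # A real implementation would decode the mp4 with a video library here.
    video = np.zeros((10, 48, 64), dtype=np.uint8)  # (frames, height, width)
    audio = np.zeros(16000, dtype=np.float32)       # one-dimensional sound samples
    return {"filename": filename, "video": video, "audio": audio}

example = load_mp4("videos/clip_a.mp4")
print(example["filename"], example["video"].shape, example["audio"].shape)
```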
6. Loading custom data
We map this function over all of the file names. In this new bag, every element is one of our video dictionaries.
But remember that this is performed lazily. So no data has been loaded yet.
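We can check the laziness with a small experiment. In this sketch, load_mp4 is a hypothetical stand-in that records each call in a list, so we can see that mapping alone triggers no loading.

```python
import dask.bag as db

calls = []  # records every filename the loader is actually asked to load

def load_mp4(filename):
    # Hypothetical stand-in loader; a real one would decode the file.
    calls.append(filename)
    return {"filename": filename, "video": None, "audio": None}

filename_bag = db.from_sequence(["a.mp4", "b.mp4", "c.mp4"])
video_bag = filename_bag.map(load_mp4)

# Only the task graph exists so far, so the loader has not run.
calls_before_compute = len(calls)
print("loads before compute:", calls_before_compute)

# The in-process scheduler lets the calls list above see the work being done.
videos = video_bag.compute(scheduler="synchronous")
print("loads after compute:", len(calls))
```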
7. Loading custom data
Alternatively, we could have loaded the data like we did in chapter one. We could loop over the filenames, wrap each load in a delayed call, and append the resulting delayed object to a list.
Constructing this list or using a Dask bag are very similar approaches, and in fact, we can convert between them.
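The loop-based alternative could be sketched as follows. As before, load_mp4 is a hypothetical stand-in and the file names are invented.

```python
import dask

def load_mp4(filename):
    # Hypothetical stand-in loader; a real one would decode the file.
    return {"filename": filename, "video": None, "audio": None}

video_filenames = ["a.mp4", "b.mp4", "c.mp4"]

# Build one delayed load per file; nothing is loaded inside the loop.
delayed_videos = []
for filename in video_filenames:
    delayed_videos.append(dask.delayed(load_mp4)(filename))

# dask.compute runs all of the delayed loads and returns a tuple of results.
videos = dask.compute(*delayed_videos, scheduler="synchronous")
print(len(videos))
```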
8. List of delayed objects vs. Dask bag
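Using the from_delayed function, we can convert a list of Dask delayed objects into a Dask bag. Each delayed object becomes one partition of the bag, so it should evaluate to a list or other sequence of elements. The to_delayed method does the opposite and converts a Dask bag into a list of Dask delayed objects, one per partition.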
Generally, the code will be tidier if we use bags, but not always. Which one we choose is really just a style choice.
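Here is a small sketch of converting in both directions. The load_record function is hypothetical, and it returns a one-element list because from_delayed treats each delayed result as a whole partition, that is, a sequence of bag elements.

```python
import dask
import dask.bag as db

def load_record(filename):
    # Hypothetical loader. The result is wrapped in a list so that each
    # delayed object evaluates to a sequence of elements (one partition).
    return [{"filename": filename}]

filenames = ["a.mp4", "b.mp4"]
delayed_parts = [dask.delayed(load_record)(f) for f in filenames]

# List of delayed objects -> bag (one partition per delayed object).
bag = db.from_delayed(delayed_parts)

# Bag -> list of delayed objects (one per partition).
back_to_delayed = bag.to_delayed()

records = bag.compute(scheduler="synchronous")
print(records)
```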
9. Further analysis
Once we have loaded our data into a bag, we can use the map method to run further analysis. For example, our transcribe_audio function takes each video dictionary and analyzes the audio to create a transcript. It adds this new information to the video dictionary and returns it.
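As an illustration only, the sketch below maps a placeholder transcribe_audio over a bag of video dictionaries. There is no real speech recognition here: the rule that silent audio gives a blank transcript is an assumption made to keep the example self-contained.

```python
import dask.bag as db
import numpy as np

def transcribe_audio(video_dict):
    # Placeholder logic: a real version would run speech recognition on the audio.
    has_sound = bool(np.any(video_dict["audio"] != 0))
    transcript = "hello world" if has_sound else ""
    # Return a new dictionary with the extra key rather than mutating the input.
    return {**video_dict, "transcript": transcript}

video_bag = db.from_sequence([
    {"filename": "a.mp4", "audio": np.array([0.1, -0.2, 0.3])},
    {"filename": "b.mp4", "audio": np.zeros(3)},
])

transcribed_bag = video_bag.map(transcribe_audio)
transcribed = transcribed_bag.compute(scheduler="synchronous")
print([(d["filename"], d["transcript"]) for d in transcribed])
```

Returning a copy instead of editing the dictionary in place is a design choice in this sketch; it avoids surprises if the same input element is reused elsewhere in the task graph.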
10. Further analysis
It is possible to build out a full data pipeline by using the map and filter methods with custom functions.
Here we filter out dictionaries which have blank transcripts. We also analyze whether the transcripts are positive or negative, and add this to the dictionary too.
As before, after we have built up the task graph, we can drop any variables we don't want and convert to a Dask DataFrame.
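The filter and map steps of such a pipeline might be chained like this. The keyword-based sentiment rule and all three helper functions are hypothetical stand-ins for real analysis code.

```python
import dask.bag as db

def transcript_is_not_blank(video_dict):
    return video_dict["transcript"].strip() != ""

def analyze_sentiment(video_dict):
    # Toy rule standing in for a real sentiment model.
    positive_words = {"great", "good", "love"}
    words = set(video_dict["transcript"].lower().split())
    sentiment = "positive" if words & positive_words else "negative"
    return {**video_dict, "sentiment": sentiment}

def keep_summary_fields(video_dict):
    # Drop large or unwanted entries before any conversion to a DataFrame.
    return {key: video_dict[key] for key in ("filename", "transcript", "sentiment")}

transcribed_bag = db.from_sequence([
    {"filename": "a.mp4", "audio": [0.1], "transcript": "I love this"},
    {"filename": "b.mp4", "audio": [0.0], "transcript": ""},
    {"filename": "c.mp4", "audio": [0.2], "transcript": "This is awful"},
])

pipeline_bag = (
    transcribed_bag
    .filter(transcript_is_not_blank)
    .map(analyze_sentiment)
    .map(keep_summary_fields)
)

summaries = pipeline_bag.compute(scheduler="synchronous")
print(summaries)
```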
11. Results
Finally, we use the compute method to run the calculations, and collect the results in a pandas DataFrame.
12. Using .wav files
Any unstructured data can be used with Dask bags, but in the following exercises, you are going to be analyzing audio files in .wav format.
To load these files, we can use the wavfile module from scipy.io. The read function returns the sampling frequency, which is like the frame rate for recorded audio, and the audio itself.
13. Using .wav files
The sampling frequency is just an integer, and the audio data is just a NumPy array of amplitudes at each time point.
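To see both return values, the sketch below first writes a short synthetic tone to a temporary .wav file and then reads it back. The tone, the 8000 Hz rate, and the file name are assumptions made so the example needs no external audio.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

# Build one second of a 440 Hz tone as 16-bit samples (illustrative data only).
sampling_freq = 8000
t = np.arange(sampling_freq) / sampling_freq
tone = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

wav_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
wavfile.write(wav_path, sampling_freq, tone)

# read returns the sampling frequency and the audio samples.
loaded_freq, loaded_audio = wavfile.read(wav_path)
print(loaded_freq, type(loaded_audio), loaded_audio.shape, loaded_audio.dtype)
```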
14. Let's practice!
Now let's practice!