Multidimensional arrays

1. Multidimensional arrays

Earlier in the chapter, we saw how we could load images into a 4-dimensional Dask array. But there are lots more formats for array-like data.

2. Types of multi-dimensional data

Array-like data could be weather predictions, 3D biomedical scans, satellite images, or measurements from all kinds of modern sensors.

3. HDF5

One popular format for storing array-like data is HDF5, which stands for Hierarchical Data Format version 5. Hierarchical means that within the HDF5 file, the data is arranged into groups and subgroups, much like folders and subfolders.

4. What does an HDF5 file look like?

If we were looking at an HDF5 file using our file browser, we would see it as a single file.

5. What does an HDF5 file look like?

But inside each file, there can be multiple datasets, just as a folder can hold multiple files.

6. Navigating HDF5 files with h5py

The HDF5 format works with lots of different programming languages. In Python, we can open these files using the h5py package. In this example, we open the data-dot-hdf5 file. Once opened, we can print the keys of this file as if it were a Python dictionary. This shows us the different datasets inside the file.
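As a minimal sketch of this step, the code below opens a file with h5py and lists its keys. The contents of the file are not shown in this transcript, so the block first writes a small stand-in file; the dataset names A and B and their shapes are assumptions for illustration only.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a small stand-in file so the example runs on its own.
# The dataset names and shapes here are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "data.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("A", data=np.zeros((4, 3)))
    f.create_dataset("B", data=np.ones((2, 2)))

# Open the file read-only and list its top-level contents,
# much like listing the keys of a dictionary.
with h5py.File(path, "r") as f:
    dataset_names = list(f.keys())
    print(dataset_names)
```

Opening the file inside a with block closes it automatically once we are done.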

7. Navigating HDF5 files with h5py

We can select dataset A like selecting from a dictionary, using slash-A as the key. The leading slash marks the root of the file, and h5py also accepts the name without it. If we print the item we have extracted, we can see its shape and data type. The actual data inside A hasn't been loaded yet.
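The selection step might look like the sketch below. The stand-in file and the shape of A are assumptions made so the block runs on its own; the point is that selecting the dataset gives a handle with a shape and data type, and values are only read when we index into it.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in file containing one dataset called A.
path = os.path.join(tempfile.mkdtemp(), "data.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("A", data=np.zeros((200, 40, 40), dtype="float32"))

with h5py.File(path, "r") as f:
    a = f["/A"]         # a handle to the dataset, not the values themselves
    print(a)            # summary of the dataset, including shape and type
    a_shape = a.shape
    a_dtype = a.dtype
    first_slice = a[0]  # indexing is what actually reads values from disk
```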

8. Loading from HDF5

We can turn this dataset into a Dask array using the from-array function inside the dask-dot-array subpackage. We also need to specify the chunk size that Dask will use. There is no single correct chunk size, but each chunk should be smaller than the full dataset. It should also be small enough that several chunks fit in memory at once, since each thread or process loads its own chunk. The array returned is a lazy Dask array.
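A rough sketch of this conversion is shown below. The stand-in file, the shape of A, and the chunk size of 100 by 20 by 20 are all assumptions chosen for illustration, not recommended values.

```python
import os
import tempfile

import dask.array as da
import h5py
import numpy as np

# Hypothetical stand-in file containing one dataset called A.
path = os.path.join(tempfile.mkdtemp(), "data.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("A", data=np.ones((200, 40, 40), dtype="float32"))

with h5py.File(path, "r") as f:
    a = f["/A"]

    # Wrap the on-disk dataset in a lazy Dask array.
    # Each chunk covers a 100 x 20 x 20 block of the full array.
    a_dask = da.from_array(a, chunks=(100, 20, 20))
    print(a_dask)

    n_chunks = a_dask.npartitions

    # Nothing is read until compute() is called, so the file
    # must still be open at this point.
    total = float(a_dask.sum().compute())
```

Note that the computation happens inside the with block: because the Dask array is lazy, closing the file before calling compute would leave it with nothing to read from.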

9. Zarr

Another format which is becoming increasingly popular is Zarr. Like HDF5, this is a hierarchical data format. Zarr is a more modern format, so it was built with chunking in mind. It was also designed so that we can efficiently stream data from cloud computing services. Also, unlike HDF5, a Zarr dataset is typically stored as a directory, so we can open it in our file browser and see each chunk of each array. If we only plan to work on our own computer, there is little to choose between the two formats.

10. Loading from Zarr

We can open a Zarr dataset using the from-zarr function inside dask-dot-array. We also need to specify which component of the dataset we want to load. We don't need to specify the chunk size, as Dask will automatically use the same chunks that are saved on disk.

11. Let's practice!

In the following exercises, you will run some analysis on a dataset of European weather data in different file formats. Let's practice.