1. Xarray
One of the hard parts of working with multi-dimensional data is remembering what each axis means. Maybe you found this to be the case in the exercises.
2. Xarray - like pandas in more dimensions
pandas helps us use tabular data by assigning labels to the columns and rows. For higher-dimensional data, there is Xarray. If you already know how to use pandas, then you are most of the way to knowing how to use Xarray. And best of all, Xarray can use Dask behind the scenes.
3. DataFrame
Let's say we have some tabular data. Each column represents a different property that was measured. The row index represents a coordinate, like time. Together, the columns make up a DataFrame.
4. Dataset
Xarray Datasets are similar, but instead of there being one coordinate index, there may be multiple coordinates, like time and space. Instead of one-dimensional columns, we now have DataArrays, which can have two or more dimensions. Together, the DataArrays make up an Xarray Dataset.
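To make the comparison concrete, here is a minimal sketch of building a Dataset by hand. The variable names temp and precip and the coordinates time, lat, and lon mirror the ones used later in this lesson, but the values and coordinate ranges are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical coordinates: 4 days, 2 latitudes, 3 longitudes.
time = pd.date_range("2020-01-01", periods=4, freq="D")
lat = [10.0, 20.0]
lon = [100.0, 110.0, 120.0]

# Made-up values; each variable has shape (time, lat, lon).
shape = (len(time), len(lat), len(lon))
temp = np.arange(24, dtype=float).reshape(shape)
precip = np.zeros(shape)

# Each data variable is given as (dimension names, values).
ds = xr.Dataset(
    data_vars={
        "temp": (("time", "lat", "lon"), temp),
        "precip": (("time", "lat", "lon"), precip),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)

# Printing shows the dimensions, the coordinates, and the data variables.
print(ds)
```

Each entry in data_vars becomes a DataArray that shares the same labeled coordinates, much like columns in a DataFrame share one row index.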
5. Loading a Dataset from Zarr
Now, instead of loading just one of the DataArrays inside a Zarr file, we can load them all into an Xarray Dataset. We do this using Xarray's open_zarr function. When we print it, we can see that there are three indices. These are the coordinates lat, lon, and time. There are two DataArrays, precip and temp. Each of the DataArrays is three-dimensional. We can see that the DataArrays loaded from Zarr are actually Dask arrays. The chunk sizes match those on the disk. As usual, this is done lazily, so no data has actually been loaded yet except for the coordinates, which take up much less memory than the DataArrays.
6. DataFrame vs. Dataset
Performing analysis with Datasets is very similar to performing analysis with DataFrames. We can slice them, but since there are multiple dimensions, we need to select the dimension by name. Here, we use pandas-style date selection, which is also available in Xarray. Note that where pandas uses loc and iloc, Xarray uses sel and isel: sel selects by coordinate label, and isel selects by integer position. We can select variables using the same square-bracket syntax as pandas.
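A short sketch of these selections, using a made-up Dataset with the same variable and coordinate names as above. The dates and coordinate values are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up Dataset: 60 daily time steps, 2 latitudes, 3 longitudes.
shape = (60, 2, 3)
ds = xr.Dataset(
    data_vars={
        "temp": (("time", "lat", "lon"), np.arange(360, dtype=float).reshape(shape)),
        "precip": (("time", "lat", "lon"), np.zeros(shape)),
    },
    coords={
        "time": pd.date_range("2020-01-01", periods=60, freq="D"),
        "lat": [10.0, 20.0],
        "lon": [100.0, 110.0, 120.0],
    },
)

# Label-based selection with sel, naming the dimension explicitly.
# pandas-style partial date strings select a whole month.
january = ds.sel(time="2020-01")

# A label slice includes both endpoints, as with pandas loc.
ten_days = ds.sel(time=slice("2020-01-10", "2020-01-19"))

# Position-based selection with isel, the counterpart of pandas iloc.
first_step = ds.isel(time=0)

# Several dimensions can be selected by name in one call.
point = ds.sel(lat=20.0, lon=110.0)

# Selecting a variable uses the same bracket syntax as a DataFrame column.
temp = ds["temp"]
```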
7. DataFrame vs. Dataset
We can do mathematical operations like finding the mean, but we also need to specify which dimensions to calculate the mean across. We specify these by dimension name instead of axis number, and we can give one or more dimensions. We can also do more complex operations like groupbys and rolling window operations. This example computes the rolling mean along the dimension dim1, with a rolling window size of 5 steps. As with Dask arrays, these will all return lazy objects, so we will need to run the compute method to calculate the actual answer.
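The operations above can be sketched as follows. The data here is made up and backed by NumPy, so results are computed immediately. With a Dask-backed Dataset the same calls would return lazy objects, and you would call compute on the result. The sketch rolls along time rather than the slide's dim1, which is an assumption made so the example matches the other examples.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up temperature data: 40 daily time steps, 2 latitudes, 3 longitudes.
temp = xr.DataArray(
    np.arange(240, dtype=float).reshape(40, 2, 3),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2020-01-01", periods=40, freq="D"),
        "lat": [10.0, 20.0],
        "lon": [100.0, 110.0, 120.0],
    },
    name="temp",
)

# Mean across two named dimensions, leaving only time.
spatial_mean = temp.mean(dim=("lat", "lon"))

# Rolling mean along time with a window of 5 steps.
# The first 4 steps are NaN because the window is not yet full.
rolled = temp.rolling(time=5).mean()

# Groupby on a datetime component: one mean per calendar month.
monthly = temp.groupby("time.month").mean()
```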
8. Plotting
pandas lets us make plots using the DataFrame's plot method, and so does Xarray. This makes it really fast to explore our datasets. We can run some analysis and see the answer plotted using the plot method. If the DataArray is one-dimensional, the plot will be a line plot.
9. Plotting
If it is 2-D, the plot will be a heatmap.
10. Plotting
If the DataArray has three or more dimensions, Xarray will take the values from across all the dimensions and flatten them into a one-dimensional histogram.
When we use the plot method, Xarray will automatically run the compute method for us. We can't have a lazy plot.
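Here is a rough sketch of the three plotting cases, using a made-up random DataArray. It assumes Matplotlib is installed and uses a non-interactive backend so it runs without a display. The layout of three side-by-side axes is just for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no display needed

import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

# Made-up 3-D data with named dimensions.
rng = np.random.default_rng(0)
da = xr.DataArray(
    rng.normal(size=(6, 4, 5)),
    dims=("time", "lat", "lon"),
    name="temp",
)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# 1-D input (mean over lat and lon): line plot.
line = da.mean(dim=("lat", "lon")).plot(ax=axes[0])

# 2-D input (a single time step): heatmap-style plot.
mesh = da.isel(time=0).plot(ax=axes[1])

# 3-D input: all values are flattened into a histogram.
hist = da.plot(ax=axes[2])
```

If da were backed by Dask, each plot call would trigger the computation before drawing.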
11. Let's practice!
Alright, let's speed up our array analysis with Xarray.