1. Dask bag operations
Dask bags are a really flexible way of processing unstructured data.
2. The map method
Let's say we have this custom function, which takes a string and returns the number of words in that string.
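As a rough sketch, such a function might look like the following (the name count_words is just for illustration):

# A possible word-counting function (the name is illustrative)
def count_words(text):
    # Split the string on whitespace and count the pieces
    return len(text.split())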
3. The map method
We would like to apply it to all of the strings in this Dask bag. We can do this by using the bag's dot-map method. When we pass a function into the map method, it will run this function on every element in the bag. This returns another delayed Dask bag.
4. The map method
We need to run the compute method in order to return the actual results. We can see that the function was applied to each string separately.
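Putting the two steps together, a minimal sketch could look like this, assuming the count_words function from above and a small example bag built with from_sequence:

import dask.bag as db

# Build a small example bag of strings (example data for illustration)
string_bag = db.from_sequence(['Dask bags are flexible', 'Hello world'])

# map() is lazy, so this only builds up the task graph
word_counts = string_bag.map(count_words)

# compute() runs the graph and returns the actual results, e.g. [4, 2]
print(word_counts.compute())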
5. JSON data
JSON is a very common data type used with Dask bags. JSON data is stored in plain text files, which usually end in dot-json.
Inside these files, the data looks like a bunch of Python dictionaries. These dictionaries have a series of keys and their values. The JSON can even be nested so that the value is another dictionary. In this example, the values under the employment key are actually lists of dictionaries. This nested structure is the reason this data is loaded into Dask bags rather than DataFrames. We can use the read_text function to read each line of the JSON file into a new element of a bag. However, this loads it in as a string.
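A sketch of this step might look like the following, where the filename employees.json is just a placeholder:

import dask.bag as db

# Each line of the file becomes one element of the bag
text_bag = db.read_text('employees.json')

# At this point every element is still a plain string, not a dictionary
print(text_bag.take(1))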
6. Converting JSON from string to dictionary
Once it has been read in as a string, we use the bag's map method to apply the loads function from the json package to each string. This converts each JSON string into a Python dictionary.
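Continuing from the text_bag sketched above, the conversion might look like this:

import json

# Apply json.loads to every string element, turning each one into a dict
dict_bag = text_bag.map(json.loads)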
7. Filtering
Let's say we want to select only the elements in the bag which meet some condition. This is possible using the bag's dot-filter method. Here, we have written a function which checks whether the years_service key of each dictionary has a value less than one.
We pass this function into the bag's dot-filter method. This applies the function to each element in the bag. If the function returns True, then that element is kept; otherwise, it is filtered out. Therefore, the new bag is smaller.
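Assuming the dict_bag from the previous step, and that each dictionary has a years_service key, the filtering might be sketched like this (the function name is illustrative):

# Returns True for employees with less than one year of service
def is_new_employee(employee):
    return employee['years_service'] < 1

# Keep only the elements for which the function returns True
new_employees = dict_bag.filter(is_new_employee)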
8. Filtering
Another common way of doing this is to pass in a lambda function to perform the same filtering. This lambda function does the same thing as the function above.
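The equivalent lambda version of the same filter might look like this:

# Same filter written inline as a lambda
new_employees = dict_bag.filter(lambda employee: employee['years_service'] < 1)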
9. Pluck method
Sometimes we might only be interested in a single item of the dictionary. Let's say we only want the data under the employment key. In this case, we can use the bag's pluck method to select it. This returns a new Dask bag of this data. Each element in the new bag is a list of dictionaries. This is because the values under the employment key were themselves lists of dictionaries.
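A sketch of this selection, assuming the filtered new_employees bag from the previous step:

# Select the value stored under the 'employment' key of each dictionary
employment_bag = new_employees.pluck('employment')

# Each element is now a list of dictionaries, one per previous job
print(employment_bag.take(1))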
10. Pluck method
This can be useful to build more complex analysis pipelines. The employment key lists all the previous jobs each employee has had. Now that we have selected this list using pluck, we could use the map method and the len function to find the total number of jobs each employee has had. This would create a bag where each element is an integer.
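Continuing the pipeline, counting each employee's previous jobs could be sketched as:

# len() of each plucked list gives the number of previous jobs for that employee
num_jobs = employment_bag.map(len)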
11. Aggregations
When the bag only contains numerical values, then we can use its min, max, and mean methods to calculate aggregate statistics. Here we find the minimum, maximum, and mean number of previous jobs of the new employees across the bag. Just like before, we use dask-dot-compute to allow Dask to share parts of the calculation between these three different results. This means we won't have to load and process the data three separate times.
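A sketch of the aggregation step, assuming the num_jobs bag built above:

import dask

# Each of these is a lazy result until it is computed
min_jobs = num_jobs.min()
max_jobs = num_jobs.max()
mean_jobs = num_jobs.mean()

# Computing them together lets Dask share the loading and parsing work
print(dask.compute(min_jobs, max_jobs, mean_jobs))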
12. Let's practice!
Okay, let's practice.