Introduction to Dask bags

1. Introduction to Dask bags

So far, we have used Dask to analyze structured data, the kind of data which easily fits into arrays or DataFrames. However, Dask can also be used to analyze unstructured and semi-structured data.

2. What is unstructured data?

Text data stored in strings is an example of unstructured data. Data stored in dictionaries is semi-structured. In Python, we might store multiple records of these data types in lists like these.
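As a rough illustration of the two kinds of data described above, the lists might look like the following sketch. The variable names and record contents are invented for this example and are not taken from the course data.

```python
# Unstructured data: free text stored as plain strings.
string_list = [
    "Really good service",
    "The room was too small",
    "Great location near the station",
]

# Semi-structured data: dictionaries whose keys can vary between records.
dict_list = [
    {"name": "Alice", "employment": {"role": "engineer"}},
    {"name": "Bob", "children": ["Carol"]},
]

# Records in the same list need not share the same fields.
shared_keys = set(dict_list[0]) & set(dict_list[1])
print(shared_keys)
```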

3. Dask bags

When using Dask, we store unstructured or semi-structured data in Dask bags. We can use the from_sequence function from dask.bag to convert a list directly into a Dask bag. Each element of the list becomes an element of the bag. Here, we set the number of partitions to 5. This tells Dask how many chunks to split the data into for processing, similar to how Dask DataFrames and arrays are chunked. Dask bags are also lazy, just like Dask arrays and DataFrames. We can extract elements from the Dask bag using the take method. This is similar to using the Dask DataFrame's head method. The take method loads and returns a tuple of items from the bag. In this case, we request only one example.
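A minimal sketch of these steps is shown here. The list contents are made up for illustration, and the synchronous scheduler is set only to keep this small example simple to run; it is an assumption of the sketch, not something the lesson requires.

```python
import dask
import dask.bag as db

# Assumption for this sketch: run everything in a single thread.
dask.config.set(scheduler="synchronous")

# Hypothetical data: ten short strings.
string_list = [f"review number {i}" for i in range(10)]

# Convert the list into a bag split across 5 partitions.
bag = db.from_sequence(string_list, npartitions=5)

# Nothing is loaded until we ask for results; take(1) returns a tuple.
first = bag.take(1)
print(bag.npartitions)
print(first)
```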

4. Dask bags

But we can get more results by passing a larger value to the take method. If we wanted to load the whole bag into memory, we would use the compute method instead of the take method.
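The difference between take and compute can be sketched as follows, again with invented data. One detail worth knowing: by default, take only reads from the first partition, so this sketch uses two partitions of four elements each so that three elements are available. Passing npartitions=-1 to take would search all partitions instead.

```python
import dask
import dask.bag as db

# Assumption for this sketch: run everything in a single thread.
dask.config.set(scheduler="synchronous")

# Hypothetical data: eight strings in two partitions of four.
string_list = [f"review number {i}" for i in range(8)]
bag = db.from_sequence(string_list, npartitions=2)

# take(3) returns a tuple holding the first three elements.
first_three = bag.take(3)

# compute() loads the whole bag into memory as a list.
everything = bag.compute()

print(first_three)
print(len(everything))
```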

5. Number of elements

If we want to find out how many elements are inside the bag, we can use its count method. This returns a lazy result rather than a number, so to get the answer, we need to run its compute method.
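A short sketch of counting elements, using invented data and the single-threaded scheduler as a simplifying assumption:

```python
import dask
import dask.bag as db

# Assumption for this sketch: run everything in a single thread.
dask.config.set(scheduler="synchronous")

# Hypothetical data: ten strings in five partitions.
string_list = [f"review number {i}" for i in range(10)]
bag = db.from_sequence(string_list, npartitions=5)

# count() is lazy; it describes the computation without running it.
lazy_count = bag.count()

# compute() runs the computation and returns a plain integer.
n_elements = lazy_count.compute()
print(n_elements)
```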

6. Loading in text data

Text is a very common form of unstructured data. For this, Dask has the read_text function. To use the function, we first need to create a list of the files we want Dask to load. Here, we use the glob function from the glob package. This makes a list of all the files in the data directory that end in .txt. Alternatively, instead of passing a list of files into the read_text function, we can pass the same string we passed to glob. It will use this to find the same files, so these two methods do the same thing. By default, Dask uses one partition per text file. Here there are three text files, so there are three partitions.
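The two ways of calling read_text can be sketched like this. Because the lesson's data directory is not available here, the sketch first writes three small files into a temporary directory; the file names and contents are invented for illustration.

```python
import glob
import os
import tempfile

import dask
import dask.bag as db

# Assumption for this sketch: run everything in a single thread.
dask.config.set(scheduler="synchronous")

# Create a hypothetical data directory holding three text files.
data_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(data_dir, f"review_{i}.txt"), "w") as f:
        f.write(f"first line of file {i}\nsecond line of file {i}\n")

pattern = os.path.join(data_dir, "*.txt")

# Option 1: build the list of files ourselves with glob.
filenames = sorted(glob.glob(pattern))
bag_from_list = db.read_text(filenames)

# Option 2: hand the same pattern string straight to read_text.
bag_from_pattern = db.read_text(pattern)

# One partition per file by default; each line becomes one element.
print(bag_from_list.npartitions, bag_from_pattern.npartitions)
n_lines = bag_from_pattern.count().compute()
print(n_lines)
```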

7. String operations

The read_text function will extract each line from the text files and add them as separate elements inside the bag. Once we have loaded the text data lazily, we can use the bag's string accessor methods to manipulate the text. We can use the str.lower method to convert every line to lowercase.
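A small sketch of the string accessor in use. To keep it self-contained, the bag is built from an invented list of lines rather than from files, and it is placed in a single partition so that take can return both elements.

```python
import dask
import dask.bag as db

# Assumption for this sketch: run everything in a single thread.
dask.config.set(scheduler="synchronous")

# Hypothetical lines, standing in for lines read from text files.
lines = ["Really GOOD service\n", "The Room Was Small\n"]
text_bag = db.from_sequence(lines, npartitions=1)

# str.lower is lazy and returns a new bag.
lowered = text_bag.str.lower()

# take(2) triggers the work and returns the lowercased lines.
result = lowered.take(2)
print(result)
```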

8. String operations

We can use the string accessor's replace method to replace words in the text. Here we replace the word good with great. We can also use the string accessor's count method to count the number of times certain words appear in each string in the bag. Great appeared zero, one, and five times in the first three strings. Each of these methods returns another Dask bag, so we use the take method to show a few examples, or the compute method to return the full list. There are many more methods available under the string accessor to manipulate text, but we will leave them for you to explore yourself.
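The replace and count steps can be chained, as in the sketch below. The three strings are invented for illustration, so the counts differ from the ones quoted in the lesson.

```python
import dask
import dask.bag as db

# Assumption for this sketch: run everything in a single thread.
dask.config.set(scheduler="synchronous")

# Hypothetical review text, already lowercased.
lines = [
    "the room was clean.",
    "a good stay overall.",
    "good location, good food, good staff.",
]
text_bag = db.from_sequence(lines)

# Replace every occurrence of "good" with "great" (still lazy).
replaced = text_bag.str.replace("good", "great")

# Count how often "great" appears in each string (also lazy).
great_counts = replaced.str.count("great")

# compute() returns full lists rather than a few examples.
replaced_lines = replaced.compute()
counts = great_counts.compute()
print(replaced_lines)
print(counts)
```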

9. Let's practice!

For now, let's go to the exercises to practice using Dask bags to process a text dataset of TripAdvisor reviews.