Converting unstructured data to a DataFrame

1. Converting unstructured data to a DataFrame

Most of the time, we will be using Dask bags to extract structured information from unstructured data. This means that at some point in our pipeline, our data might fit into a DataFrame. Once the data has structure, DataFrames are generally easier to work with than bags.

2. Nested JSON data

Let's start with the data from the last lesson. It has a nested structure, which means it can't be turned directly into a tidy DataFrame. In order to convert the data from a bag to a DataFrame, we need to modify each dictionary so that its values are basic data types like numbers, datetimes, and strings. Perhaps for this data, we aren't interested in the full employment list, but just want to extract the number of previous jobs each person has held.
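To make the problem concrete, here is a minimal sketch of what one nested record might look like. The record itself and the "name", "role", and "years" fields are hypothetical stand-ins, since the actual data from the last lesson isn't shown here; only the employment key comes from the lesson.

```python
# Hypothetical record standing in for one element of the bag.
# Only the "employment" key is taken from the lesson; the rest is made up.
record = {
    "name": "Ada",
    "employment": [
        {"role": "analyst", "years": 2},
        {"role": "engineer", "years": 5},
    ],
}

# Values that are lists or dictionaries cannot become a single tidy column.
nested_keys = [key for key, value in record.items()
               if isinstance(value, (list, dict))]
print(nested_keys)  # ['employment']
```

Any key reported here would need to be summarized or removed before the conversion.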

3. Restructuring a dictionary

To do this, we write a custom function. It accepts a dictionary, since every element in the bag is a dictionary. It counts the number of entries in the list stored under the employment key and adds this value to the dictionary under the key number_of_previous_jobs. We then map this function over all the elements in the bag.
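The function described here could be sketched roughly as follows. The function name and the sample record are assumptions made for illustration; only the employment and number_of_previous_jobs keys come from the lesson.

```python
def add_number_of_jobs(person_dict):
    """Add a count of the entries in the employment list (hypothetical name)."""
    # Count the jobs stored under the "employment" key
    person_dict["number_of_previous_jobs"] = len(person_dict["employment"])
    return person_dict

# Hypothetical record standing in for one element of the bag
record = {"name": "Ada", "employment": [{"role": "analyst"}, {"role": "engineer"}]}

result = add_number_of_jobs(record)
print(result["number_of_previous_jobs"])  # 2
```

With a Dask bag, the same function would be applied to every element by passing it to the bag's map method, for example bag.map(add_number_of_jobs).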

4. Removing parts of the dictionary

After we have done this, we want to remove the original list of jobs from each dictionary. We can do this by writing another custom function. This function accepts a dictionary and a key to be dropped. It deletes the key from the dictionary using the del keyword, and returns the dictionary without this entry. Because this function has two input parameters, we need to supply the second one when mapping: when we pass a function like this into the map method, we can also specify the extra parameters for the function. We set the key_to_drop parameter to the employment key so that it will be deleted.
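A minimal sketch of such a function is shown below. The function name and the sample record are assumptions; the key_to_drop parameter and the use of del follow the description above.

```python
def delete_dictionary_entry(dictionary, key_to_drop):
    """Remove one key from the dictionary and return it (hypothetical name)."""
    # del removes the key and its value in place
    del dictionary[key_to_drop]
    return dictionary

# Hypothetical record after the job count has been added
record = {
    "name": "Ada",
    "employment": [{"role": "analyst"}, {"role": "engineer"}],
    "number_of_previous_jobs": 2,
}

cleaned = delete_dictionary_entry(record, key_to_drop="employment")
print(sorted(cleaned))  # ['name', 'number_of_previous_jobs']
```

On a Dask bag, the extra argument can be given as a keyword to map, for example bag.map(delete_dictionary_entry, key_to_drop="employment"). Note that del modifies the dictionary it is given rather than making a copy.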

5. Selecting parts of the dictionary

It could be that there are more keys we want to drop than we want to keep, so we could write a complementary function which keeps only the keys we ask for. To this function, we pass the dictionary and a list of keys which should be kept. Inside the function, we create a new dictionary, loop through the desired keys, and copy each key and its value into the new dictionary. We then return the new dictionary.
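The complementary function could look something like this sketch. The function name, the parameter names, and the sample record (including its "city" field) are assumptions made for illustration.

```python
def filter_dictionary(dictionary, keys_to_keep):
    """Return a new dictionary holding only the requested keys (hypothetical name)."""
    new_dict = {}
    for key in keys_to_keep:
        # Copy each wanted key and its value into the new dictionary
        new_dict[key] = dictionary[key]
    return new_dict

# Hypothetical record with more fields than we need
record = {
    "name": "Ada",
    "city": "London",
    "employment": [{"role": "analyst"}],
    "number_of_previous_jobs": 1,
}

kept = filter_dictionary(record, keys_to_keep=["name", "number_of_previous_jobs"])
print(kept)  # {'name': 'Ada', 'number_of_previous_jobs': 1}
```

Unlike the del-based approach, this version builds a new dictionary and leaves the original untouched. On a bag it would be mapped the same way, for example bag.map(filter_dictionary, keys_to_keep=["name", "number_of_previous_jobs"]).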

6. Converting to DataFrame

We have cleaned up the dictionaries so they are no longer nested and have just two simple keys. Now we can use the bag's to_dataframe method to convert the Dask bag into a Dask DataFrame. As usual, this is lazily evaluated: the full JSON data hasn't been loaded yet, but we have constructed a pipeline which processes it and inserts it into a DataFrame. (If no meta argument is given, Dask does compute one element of the bag in order to infer the column names and types.) Now we could use any of the Dask DataFrame methods, and run the compute method to turn the result into a pandas DataFrame.

7. Let's practice!

Now let's practice adding structure to some unstructured data in the exercises.