Get Started

Reading nested data into a DataFrame

1. Reading nested data into a DataFrame

In the previous videos, we have seen different techniques

2. Review

to reshape DataFrame and Series, to explode list contained in columns, and to split and concatenate string columns.

3. JSON format

But there are other important data structures we can handle. JSON format is one of them. JSON stands for JavaScript Object Notation. It is a data-interchange format that is easy for humans to read and write, but it is also easy for machines to parse and generate. That's why is frequently used. JSON files are commonly used when sharing data between applications and data scientist, saving information, or metadata, about a dataset, or reading data from an API.

4. JSON format

Let's see what a JSON looks like. It's very similar to a dictionary. The JSON in the slide has a very simple structure.

5. Nested JSON

But sometimes we have to work with a nested JSON, like this one. We can see that we have a dictionary inside a list. And a dictionary inside a dictionary. These are more complex data to work with.

6. Data normalization

For those cases, we can use the json_normalize function from pandas. It takes our nested JSON object, flattens it out, and reads it into a DataFrame. By default, json_normalize uses a period to indicate nested levels.

7. Data normalization

To change that, we can use the sep parameter. For example, we can specify an underscore as a separator. Now, the last two column names are separated by an underscore.

8. Data normalization

We can take advantage of the names. We can reshape the DataFrame using the wide_to_long function, setting 'books' as stubnames, an underscore as the separator, and the expression backslash w plus as a suffix, to indicate the columns end with in a word.

9. Complex JSON

Let's analyze the following JSON. As you can see, it is more complex than the previous one. We have not only nested dictionaries, but the books keys now have a list of dictionaries as values.

10. Complex JSON

If we apply json_normalize() as before, the books column is still not flattened out.

11. Record path

In this case, we can use the record_path parameter. It tells pandas what key path leads to each individual observation in the JSON. In the output, the books column has been converted into two columns, title and year.

12. Metadata

We can add a third parameter, the meta parameter. It tells pandas what data we want to include from the rest of the JSON. Now, we can see the rest of the data present in the resulting DataFrame.

13. Let's practice!

It's time to put these concepts into practice.