1. Reading nested data into a DataFrame
In the previous videos, we have seen different techniques
2. Review
to reshape DataFrame and Series, to explode list contained in columns, and to split and concatenate string columns.
3. JSON format
But there are other important data structures we can handle. JSON format is one of them. JSON stands for JavaScript Object Notation. It is a data-interchange format that is easy for humans to read and write, but it is also easy for machines to parse and generate. That's why is frequently used.
JSON files are commonly used when sharing data between applications and data scientist, saving information, or metadata, about a dataset, or reading data from an API.
4. JSON format
Let's see what a JSON looks like. It's very similar to a dictionary. The JSON in the slide has a very simple structure.
5. Nested JSON
But sometimes we have to work with a nested JSON, like this one. We can see that we have a dictionary inside a list. And a dictionary inside a dictionary. These are more complex data to work with.
6. Data normalization
For those cases, we can use the json_normalize function from pandas. It takes our nested JSON object, flattens it out, and reads it into a DataFrame. By default, json_normalize uses a period to indicate nested levels.
7. Data normalization
To change that, we can use the sep parameter. For example, we can specify an underscore as a separator. Now, the last two column names are separated by an underscore.
8. Data normalization
We can take advantage of the names. We can reshape the DataFrame using the wide_to_long function, setting 'books' as stubnames, an underscore as the separator, and the expression backslash w plus as a suffix, to indicate the columns end with in a word.
9. Complex JSON
Let's analyze the following JSON. As you can see, it is more complex than the previous one. We have not only nested dictionaries, but the books keys now have a list of dictionaries as values.
10. Complex JSON
If we apply json_normalize() as before, the books column is still not flattened out.
11. Record path
In this case, we can use the record_path parameter. It tells pandas what key path leads to each individual observation in the JSON. In the output, the books column has been converted into two columns, title and year.
12. Metadata
We can add a third parameter, the meta parameter. It tells pandas what data we want to include from the rest of the JSON. Now, we can see the rest of the data present in the resulting DataFrame.
13. Let's practice!
It's time to put these concepts into practice.