1. Intro to non-rectangular data
Welcome to the final chapter!
2. Rectangular data
So far, we've always worked with data in a rectangular, or tabular, structure of columns and rows, such as this sample of Soviet space dogs data stored in a spreadsheet. To save this kind of data, we often use the CSV format, where values are separated by commas.
While this rectangular format is convenient for our analyses, not all data comes in this shape.
3. Non-rectangular formats
Take, for example, JSON and XML files. Both formats are widely used and aim to be both human- and machine-readable, but clearly, they don't have a rectangular shape with columns and rows. Instead, they have hierarchical, tree-like structures, with high-level elements branching into lower-level elements.
These structures pose a challenge for data analysis, but don't worry: tidyr can help you turn them into a rectangular format.
4. A list of lists of lists
When you read a JSON file with the rjson package's fromJSON() function, you'll end up with a nested list.
The example shown here has a top-level list with two elements. Each element corresponds to a Star Wars character and contains a named list with two items: the name of the character, and the films they appeared in. The films item is itself a list.
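To make that structure concrete, here is a minimal sketch that builds a comparable nested list by hand instead of reading it from a file. The variable name star_wars and the film titles are assumptions filled in for illustration; only the shape (two characters, four films and two films) follows the description above.

```r
# Hand-built stand-in for the nested list described above.
# The name `star_wars` and the film titles are illustrative assumptions.
star_wars <- list(
  list(
    name  = "Darth Vader",
    films = list("Revenge of the Sith", "Return of the Jedi",
                 "The Empire Strikes Back", "A New Hope")
  ),
  list(
    name  = "Jar Jar Binks",
    films = list("Attack of the Clones", "The Phantom Menace")
  )
)

length(star_wars)              # number of top-level elements (characters)
names(star_wars[[1]])          # item names inside the first character
length(star_wars[[1]]$films)   # number of films for the first character
str(star_wars, max.level = 2)  # compact view of the first two levels
```

Depending on the rjson version and its settings, a films array read with fromJSON() may come back as a character vector rather than a list, so it is worth checking the result with str() before moving on.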
5. A first step to rectangling
As a first step to rectangling this data, we'll pass the list to the tibble() function to create a data frame with a single column, which we'll name character.
You'll see that this data frame already has two rows, one for each top-level element of the input list. However, the values in these rows are lists themselves, whereas we would like to have the elements of these lists spread out over multiple columns.
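This first step could look roughly like the sketch below. It assumes the tibble package is installed; the hand-built star_wars list and the name star_wars_df are illustrative stand-ins, not the course's actual objects.

```r
library(tibble)

# Illustrative stand-in for the parsed JSON (names and titles are assumptions).
star_wars <- list(
  list(name = "Darth Vader",
       films = list("Revenge of the Sith", "Return of the Jedi",
                    "The Empire Strikes Back", "A New Hope")),
  list(name = "Jar Jar Binks",
       films = list("Attack of the Clones", "The Phantom Menace"))
)

# One list column named `character`; each top-level element becomes a row.
star_wars_df <- tibble(character = star_wars)

nrow(star_wars_df)                # one row per character
ncol(star_wars_df)                # a single column so far
is.list(star_wars_df$character)   # the column holds lists, not plain values
```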
6. Unnesting lists to columns
This can be done with tidyr's unnest_wider() function. It spreads the two elements of the character list into two columns: name and films. The unnest_wider() function was able to come up with these column names because we passed it a named list.
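A minimal sketch of this step, assuming the tibble and tidyr packages are installed. The star_wars list and the object names star_wars_df and wide_df are hypothetical, chosen only for this example.

```r
library(tibble)
library(tidyr)

# Illustrative stand-in for the parsed JSON (names and titles are assumptions).
star_wars <- list(
  list(name = "Darth Vader",
       films = list("Revenge of the Sith", "Return of the Jedi",
                    "The Empire Strikes Back", "A New Hope")),
  list(name = "Jar Jar Binks",
       films = list("Attack of the Clones", "The Phantom Menace"))
)
star_wars_df <- tibble(character = star_wars)

# Spread the named items of each `character` element over separate columns.
wide_df <- unnest_wider(star_wars_df, character)

names(wide_df)           # column names taken from the names inside each list
is.list(wide_df$films)   # films is still a list column at this point
```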
We could take our unnesting one step further and also apply it to our films column, as this, too, is a list column. It contains four films for Darth Vader and two for Jar Jar Binks.
7. Unnesting lists to columns
Since the elements of this column weren't named, tidyr had to improvise when creating the column names. And because the two characters appeared in a different number of films, NA values were added for Jar Jar Binks' third and fourth films. This data isn't exactly tidy, and we'll come back to this example later on.
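The second unnesting step can be sketched as follows. One assumption to flag: depending on the tidyr version, unnest_wider() may raise an error for unnamed elements instead of improvising names, so this sketch passes the names_sep argument to build the column names explicitly. The resulting names (films_1 to films_4) therefore differ from the improvised ones mentioned above, and the star_wars list and object names are again illustrative.

```r
library(tibble)
library(tidyr)

# Illustrative stand-in for the parsed JSON (names and titles are assumptions).
star_wars <- list(
  list(name = "Darth Vader",
       films = list("Revenge of the Sith", "Return of the Jedi",
                    "The Empire Strikes Back", "A New Hope")),
  list(name = "Jar Jar Binks",
       films = list("Attack of the Clones", "The Phantom Menace"))
)
star_wars_df <- tibble(character = star_wars)

# Step 1: spread name and films into their own columns.
wide_df <- unnest_wider(star_wars_df, character)

# Step 2: spread the unnamed films; names_sep joins the column name
# and the element position, giving films_1, films_2, and so on.
wider_df <- unnest_wider(wide_df, films, names_sep = "_")

names(wider_df)
wider_df$films_3   # the second entry is NA: Jar Jar Binks has only two films
```

The missing values in the later film columns are the reason this wide layout isn't tidy: the number of columns depends on whichever character has the most films.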
8. Let's practice!
But first, it's your turn to rectangle nested lists.