1. Extracting non-tabular data
So far, we've built pipelines using tabular data sources. In this lesson, we'll learn more about working with non-tabular data.
2. Extracting non-tabular data
Throughout this chapter, we'll use an architecture diagram to help guide our progress. First, we'll explore tools for extracting non-tabular data, before transforming and loading that same data for downstream use. Let's jump in!
3. Types of non-tabular data
In an ever-connected world, data is produced and consumed at incredibly rapid rates. According to MIT, more than 80% of data that is produced and consumed is unstructured, or non-tabular. This includes data sources such as text, audio, image, video, spatial, and IoT data.
With non-tabular data, data engineers will often spend a great deal of time transforming and extracting features from raw, unstructured data, into a "tabular" format.
4. Working with APIs and JSON data
An API, or application programming interface, is software that allows data consumers to send or receive data, without directly accessing the database in which information is stored. APIs are one of the most commonly-used tools to gather data from third parties.
Working with an API is like going to a bank. In a single trip, you can deposit or withdraw money, or both. The bank won't let you directly interact with their bank vault, but it allows for specially-trained individuals to do so. This is just like an API, which prevents a user from directly interacting with a database. Most APIs only allows for users to read data, to prevent accidental deletions or overwriting of data.
APIs typically serve data in JSON format, which stands for JavaScript object notation. JSON data is non-tabular, and stores information in key-value pairs. JSON data is schema-less, and looks and feels quite similar to dictionaries in Python.
5. Reading JSON files with pandas
The data structure above depicts data in JSON format, containing four key-value pairs. Each key stores a list of values, column-wise. To extract data from a JSON file in this format, we'll use the pd-dot-read_json function, which returns a DataFrame.
The first argument passed to read_json is the file_path the JSON file is stored at. The orient parameter denotes how data is stored in the JSON file. Here, the data is stored in a columnar manner, with each key-value pair corresponding to column headers and values. When data is stored as a list of dictionaries, orient should be set to "records". "index" is passed to orient when key-value pairs are made up of indicies and a dictionary containing the remainder of the record.
6. Nested or unstructured JSON data
Sometimes, data stored in a JSON file is not "DataFrame-ready". This might be the case when the source data contains nested objects or has a varying schema, as shown here.
When this occurs, the best approach is to first load the file into a dictionary, with the intent to transform the data into a DataFrame.
7. Reading JSON files with json
To do this, we'll use the load function, from the json library. This function takes a file object, and returns a Python data type, typically a dictionary.
In our example, we call the json-dot-load function on the file object to extract the contents of the file into the raw_stock_data variable.
Using the type function we see that raw_stock_data is indeed a dictionary. This dictionary should look nearly identical to the contents of the JSON file. Best of all, we can transform this dictionary into a format such that it can be easily stored as a pandas DataFrame.
8. Let's practice!
Nice work! It's time to practice extracting data from JSON files. Best of luck!