Get startedGet started for free

Transforming non-tabular data

1. Transforming non-tabular data

Before we begin transforming non-tabular data, let's take a look back at our architecture diagram for this chapter.

2. Transforming non-tabular data

In the last lesson, we practiced using the json-dot-load function to store complex JSON data as a dictionary. Now, we'll explore tools and techniques to transform this data, so it can be stored as a DataFrame.

3. Storing data in dictionaries

In this dictionary, each key represents a unique index, and the associated value is a nested dictionary which stores the remainder of the record. To make this dictionary DataFrame-ready, it needs to be wrangled into a data structure that can easily be converted to a DataFrame, such as a list of lists.

4. Iterating over dictionary components

First, we'll want to iterate over the dictionaries keys and values. To do this, we can use the keys, values, and items methods. The keys and values methods create a list made up of keys or values, respectively. These can then be looped over, and leveraged as needed. The items method returns a list of tuples, where each tuple is a key-value pair. With the items method, both a key and value can be accessed in a single iteration of a for loop. This will come in handy later.

5. Parsing data from dictionaries

In addition to iterating over a dictionary, we'll also want to extract individual values from a dictionary. We can do this using the dot-get method. get takes a key, and if that key exists in the dictionary, the associated value will be returned. Otherwise, the return value will be None. To parse the "volume" from the dictionary shown above, "volume" is passed to the get method, and the results are stored in the volume variable. If a second value is passed to the get method, this value will be returned if the key does not exist in the dictionary. This is best practice when working with JSON data parsed from a file that does not follow a set format. To parse data from a nested dictionary, the get method can be called twice. To extract the "open" value, get is first called to parse the dictionary stored at key "price", before being called again to return the "open" value.

6. Creating a DataFrame from a list of lists

Once data has been converted from a dictionary into a list of lists, it can be passed to the pd-dot-DataFrame function to create a new DataFrame. Column names for the DataFrame can set by assigning a list of columns names to the columns attribute. The set_index method takes a column name, or list of column names, and creates a new index.

7. Transforming stock data

Now, let's put it all together. To iterate through the raw_stock_data dictionary, we'll call the items method. Using a for loop, each key and value in the dictionary can be accessed. Here, the keys are timestamps, and the values are a dictionary containing the remaining data. From this nested dictionary, the open, close, and volume values are parsed using the get method, and appended as a list to the parsed_stock_data list. This list is then converted to a DataFrame using the DataFrame function, and the column names and index are set.

8. Let's practice!

Being able to transform non-tabular data into a DataFrame will come in handy when building data pipelines. Now, time for some practice!

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.