Semi-Structured Data Types

1. Semi-Structured Data Types

I love working with semi-structured data. I like the fact that it can look intimidating at first – it’s not just in a normal row-column format. But it turns out it really does have a structure, you just have to figure out what that structure is, and once you’ve done that, you can systematically extract whatever info you want from it. We’re going to cover semi-structured data over the course of two videos. In this initial video we’re going to practice working with semi-structured data. First we’ll take a look at an example of semi-structured data, and then we’ll talk about the definition of semi-structured data more generally. Then we’ll cover the Snowflake data types that can hold semi-structured data, and get you up-to-speed on one primary method of ingesting semi-structured data in Snowflake. In the following video, we’ll practice manipulating each of those data types, and practice a few ways to work with semi-structured data after you’ve got it in Snowflake. So you might not remember this, but we’ve actually ingested semi-structured data before in this course! There’s a column of semi-structured values in our TastyBytes menu. Let’s take a look at that. SELECT MENU_ITEM_NAME , MENU_ITEM_HEALTH_METRICS_OBJ FROM FROSTBYTE_TASTY_BYTES.RAW_POS.MENU; When we run this query, we see that for each menu item, we get a list of ingredients, and then a few health metrics flags (is it dairy free, etc.). Each entry is in a JSON format, which is one of five semi-structured data formats that Snowflake accepts – the others are Avro, ORC, Parquet, and XML. We’ll focus on JSON examples in this video. So what *is* “semi-structured” data? [If](https://www.snowflake.com/guides/semi-structured-data-101/) you think of “structured” data as tabular data – so data that usually has a schema and exists in a flat table – and you think of “unstructured” data as raw images or raw text, then “semi-structured” data is somewhere in the middle – It has *some* sort of structure, some sort of meaningful organization to it, as you can see here, but that structure can be *hierarchical* in a way that differentiates it from the straightforward tabular data you often see in relational databases. Informally, you can think of it as data that can be flexibly represented, like a dictionary – so organized into key-value pair mappings – and the keys and values can take on very flexible forms (they can be arrays, etc.). [You](https://www.snowflake.com/guides/semi-structured-data-101/)’ll run into it a lot if you work with sensor data or web data. In any case, Snowflake lets you take in data like these JSON examples, or Avro, ORC, Parquet, and XML, and then store them as one of three Snowflake data types – VARIANT, OBJECT, and ARRAY. We’re going to discuss the VARIANT type first. If we run: DESCRIBE TABLE FROSTBYTE_TASTY_BYTES.RAW_POS.MENU; We can see that the MENU_ITEM_HEALTH_METRICS_OBJ is of the VARIANT type, and sure enough, if we look back at the code we used to create the menu data, we can see that that was our doing – We specified it to be VARIANT. Back then, we when we created our empty menu table, we specified that menu_item_health_metrics_obj should be of the VARIANT data type: -- menu table build CREATE TABLE frostbyte_tasty_bytes.raw_pos.menu ( menu_id NUMBER(19,0), menu_type_id NUMBER(38,0), menu_type VARCHAR(16777216), truck_brand_name VARCHAR(16777216), menu_item_id NUMBER(38,0), menu_item_name VARCHAR(16777216), item_category VARCHAR(16777216), item_subcategory VARCHAR(16777216), cost_of_goods_usd NUMBER(38,4), sale_price_usd NUMBER(38,4), menu_item_health_metrics_obj VARIANT ); So what is a VARIANT type? The key thing to know is that it is an extremely flexible data type – [It](https://docs.snowflake.com/sql-reference/data-types-semistructured#characteristics-of-a-variant) can hold values of any other type. So it’s especially useful when loading data with an unusual structure, because you can just create a table with a VARIANT column, and put that data there. The VARIANT type then detects the underlying data type and keeps track of that, in addition to keeping track of the value itself, and when you perform operations on a VARIANT column, it will look at the underlying data type and let you perform operations relevant to that underlying type. To make this more concrete, as a quick example, let’s create a table with one column in it, and make that column of the VARIANT type, but put NUMBER data into it. We haven’t learned CREATE TABLE AS SELECT (CTAs) in this course, but if you’re a SQL user, you’re likely already familiar with them. The nice thing about them is it lets you easily create a table from an existing table. So we’ll create a table with one VARIANT column, and we’ll call that column cost_of_goods_variant, but what we’ll put into that is the cost_of_goods_usd data, which was originally a number. CREATE TABLE FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU (cost_of_goods_variant) AS SELECT cost_of_goods_usd::VARIANT FROM FROSTBYTE_TASTY_BYTES.RAW_POS.MENU; We can confirm that the new column is a VARIANT column. DESCRIBE TABLE FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU; But since VARIANT columns save both the value *and* the underlying data type, we can check to see that Snowflake recognizes that the data type inside this VARIANT column is a number: We’ll use the typeof() function for this: SELECT TYPEOF(cost_of_goods_variant) FROM frostbyte_tasty_bytes.raw_pos.test_menu; And then to wrap this up, if we multiply our variant column by 2.0, say, we can see that Snowflake automatically realized that even though this is a VARIANT, and it might be non-sensical to multiply some VARIANT columns depending on the underlying data type, the underlying data type here was a number, and could be multiplied. SELECT cost_of_goods_variant, cost_of_goods_variant*2.0 FROM frostbyte_tasty_bytes.raw_pos.test_menu; So again – The takeaway: The VARIANT column is extremely flexible, and can hold other data types inside it, and Snowflake keeps track of those data types. Although the VARIANT type doesn’t only have to be used with semi-structured data, in practice if you’re working with semi-structured data, that’s often what you’ll use, because it’s such a flexible data type. So let’s check out the underlying type of the semi-structured data – the JSON data – that we loaded into our VARIANT column when we originally loaded the TastyBytes menu data: SELECT TYPEOF(MENU_ITEM_HEALTH_METRICS_OBJ) FROM frostbyte_tasty_bytes.raw_pos.menu; You’ll see that this says OBJECT. An OBJECT is the second Snowflake data type you should know that comes up in the semi-structured data world. It’s like a dictionary in Python. It always has a key-value pair, but the value can be of the VARIANT type, and since the VARIANT type can hold any other type, the OBJECT type can also hold any other type – through VARIANTs. This may sound confusing. Mostly you don’t have to worry about it – The key thing to be aware of is that OBJECTS always have key-value pairs, and this means that you can interact with them differently from VARIANTs – you can pull data from the OBJECT type just based on the key, using brackets like this: SELECT MENU_ITEM_HEALTH_METRICS_OBJ, MENU_ITEM_HEALTH_METRICS_OBJ['menu_item_id'] FROM frostbyte_tasty_bytes.raw_pos.menu; If we double click on the JSON, we’ll see that each one has this key called “menu_item_id,” and by using the brackets, we can pull out only the values associated with that key for each entry. Okay, so the last data type we want to talk about is the ARRAY type. The ARRAY is like an array in other languages – a list of ordered entries, and you can access these entries based on that order. [Each](https://docs.snowflake.com/en/sql-reference/data-types-semistructured#characteristics-of-an-array) element in a Snowflake ARRAY is of the VARIANT type. Again, that sounds confusing, and you don’t have to worry about it for now. The key thing to know is that you can pull the first entry out of an array using [0], the second using [1], etc. In the next video, we’ll pull all of this together, and run through an example where we dig our way through the various levels of our JSON column. You’ll see how fun and messy semi-structured data can get, and I hope it will also demonstrate the style of thinking that you need to employ to work with JSON data.

2. Let's practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.