Semi-Structured Data Manipulation

1. Semi-Structured Data Manipulation

In the previous video, we learned about the five semi-structured data formats Snowflake accepts; we learned about Snowflake’s VARIANT, ARRAY, and OBJECT data types; and we learned about checking the data held inside a VARIANT column using the TYPEOF function. In this video, we’ll practice working with semi-structured data – specifically JSON data. To be specific, we’ll look at our Tasty Bytes menu data, and pull out the first ingredient for each menu item. Then we’ll quickly cover one way you can load semi-structured data of your own. Okay, so we’ve got our JSON data. One way we can pull out the value of a top-level key inside that JSON data is to use dot notation, like this: SELECT MENU_ITEM_HEALTH_METRICS_OBJ:menu_item_health_metrics FROM FROSTBYTE_TASTY_BYTES.RAW_POS.MENU; Notice that colon – that’s the dot notation I’m talking about. This pulls out the value associated with the “menu_item_health_metrics” key. We could also have used brackets with the name of the top-level key in it, and that would have done the same thing: SELECT MENU_ITEM_HEALTH_METRICS_OBJ['menu_item_health_metrics'] FROM FROSTBYTE_TASTY_BYTES.RAW_POS.MENU; I’ll use brackets going forward. So what we do next is going to feel a little bit like the movie Inception, because we’re going to dig our way through the various levels of our JSON. The goal here is to show one way to traverse a JSON in Snowflake, and it gets a little involved, because hierarchical data can be structured so flexibly. This will help us practice accessing data from Variants, Objects, and Arrays. After we do this, I’ll introduce an alternative method that some find simpler. Okay, so let’s start digging. Let’s say our goal is to extract the first ingredient from each one of these JSONs. If we take a look at an example JSON, we’ll see this will actually take four layers of digging. We’re going to create a new table at each layer so we can carefully track what we’re doing – checking the underlying type of data in each case, etc. CREATE TABLE FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU1 (ingredients) AS SELECT MENU_ITEM_HEALTH_METRICS_OBJ FROM FROSTBYTE_TASTY_BYTES.RAW_POS.MENU; SELECT * FROM FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU1; So this is a table that just pulls the MENU_ITEM_HEALTH_METRICS_OBJ column. That’s all. If we click on the first JSON, we can see that we’ll have to dig through four layers, because we have to get to menu_item_health_metrics, then the first element inside menu_item_health_metrics, then ingredients, then the first element inside ingredients. Okay, so if we describe the table, we see that this is a VARIANT column. Great. DESCRIBE TABLE FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU1; And it’s categorized its contents as being of the OBJECT type. SELECT TYPEOF(ingredients) FROM FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU1; OBJECTs are like dictionaries, so we can pull out the first value corresponding to the menu_item_health_metrics key like this: CREATE TABLE FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU2 (ingredients) AS SELECT MENU_ITEM_HEALTH_METRICS_OBJ['menu_item_health_metrics'] FROM FROSTBYTE_TASTY_BYTES.RAW_POS.MENU; SELECT * FROM FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU2; Great. Now we’re in one layer. If we check the TYPE of data inside our new menu_item_health_metrics column, we can see that it’s an ARRAY. SELECT TYPEOF(ingredients) FROM FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU2; This means, to pull out the first element, we should use brackets with the element order – so [0]. Let’s do that. CREATE TABLE FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU3 (ingredients) AS SELECT MENU_ITEM_HEALTH_METRICS_OBJ['menu_item_health_metrics'][0] FROM FROSTBYTE_TASTY_BYTES.RAW_POS.MENU; SELECT * FROM FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU3; If we check the TYPE of the data now, it’s an OBJECT, so we should pull out the next element based on the key “ingredients.” CREATE TABLE FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU4 (ingredients) AS SELECT MENU_ITEM_HEALTH_METRICS_OBJ['menu_item_health_metrics'][0]['ingredients'] FROM FROSTBYTE_TASTY_BYTES.RAW_POS.MENU; SELECT * FROM FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU4; And finally, we can look at the TYPE of the data now, and see that it’s an ARRAY: SELECT TYPEOF(ingredients) FROM FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU4; That means to pull out the first element, we should use brackets with the element order again. Let’s do that. CREATE TABLE FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU5 (ingredients) AS SELECT MENU_ITEM_HEALTH_METRICS_OBJ['menu_item_health_metrics'][0]['ingredients'][0] FROM FROSTBYTE_TASTY_BYTES.RAW_POS.MENU; SELECT * FROM FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU5; And there we are! We’ve dug through the layers of the JSON and made it to the first ingredient! This time, if we check the data type of the contents of our VARIANT column, we see it’s a VARCHAR. SELECT TYPEOF(ingredients) FROM FROSTBYTE_TASTY_BYTES.RAW_POS.TEST_MENU5; So to do all of that digging at once, we’d just need to run the following: SELECT MENU_ITEM_HEALTH_METRICS_OBJ['menu_item_health_metrics'][0]['ingredients'][0] FROM FROSTBYTE_TASTY_BYTES.RAW_POS.MENU; And we get the first ingredient. You don’t *have* to do it this way, though. There’s another way that many find easier, though at first glance the syntax may seem unusual to you. (It was unfamiliar syntax for me when I first saw it.) You can use a *lateral flatten* table function, like this: SELECT * FROM frostbyte_tasty_bytes.raw_pos.menu m, LATERAL FLATTEN (input \=> m.menu_item_health_metrics_obj:menu_item_health_metrics) obj; Let’s go slowly here before we run this command – You are selecting from your menu table, like normal, but after that, we’re putting “LATERAL FLATTEN” followed by the thing in our menu table that we want to flatten – in this case, menu_item_health_metrics within menu_item_health_metrics_obj. What this does is automatically pull out the content of “menu_item_health_metrics” as a new column. Okay, now let’s run it. There are still the same number of rows as before – 100 – but there are a bunch of new columns now that LATERAL FLATTEN created. We can see that the “VALUE” column brings us much closer to what we want if we’re looking to pull the first ingredient. Now all we need to do to get to that first ingredient is this: SELECT m.menu_item_name, value:"ingredients"[0] AS ingredients FROM frostbyte_tasty_bytes.raw_pos.menu m, LATERAL FLATTEN (input \=> m.menu_item_health_metrics_obj:menu_item_health_metrics); We go to the value column, pull out “ingredients” by key, and pull out the first element of that array. Let’s wrap up our semi-structured data work by creating a view and doing some analytics on it. We’ll create the view directly in our tippity-top schema, analytics, though normally we’d put it in the harmonized schema first and promote it. Our goal here will be to count the number of menu items that are healthy, gluten free, dairy free, and nut free. Here’s the view we’ll make: -- Creating our View Using our Semi-Structured Flattening SQL CREATE OR REPLACE VIEW frostbyte_tasty_bytes.analytics.menu_v AS SELECT m.menu_item_health_metrics_obj:menu_item_id::integer AS menu_item_id, value:"is_healthy_flag"::VARCHAR(1) AS is_healthy_flag, value:"is_gluten_free_flag"::VARCHAR(1) AS is_gluten_free_flag, value:"is_dairy_free_flag"::VARCHAR(1) AS is_dairy_free_flag, value:"is_nut_free_flag"::VARCHAR(1) AS is_nut_free_flag FROM frostbyte_tasty_bytes.raw_pos.menu m, LATERAL FLATTEN (input \=> m.menu_item_health_metrics_obj:menu_item_health_metrics); Notice that it’s using the LATERAL FLATTEN table function, and then pulling out the dummy variables from the “value” column (which was created by LATERAL FLATTEN), and then it’s casting those as 1-character strings. Now all that’s left is to sum up the counts for each flag! -- Providing Metrics to Executives SELECT COUNT(DISTINCT menu_item_id) AS total_menu_items, SUM(CASE WHEN is_healthy_flag \= 'Y' THEN 1 ELSE 0 END) AS healthy_item_count, SUM(CASE WHEN is_gluten_free_flag \= 'Y' THEN 1 ELSE 0 END) AS gluten_free_item_count, SUM(CASE WHEN is_dairy_free_flag \= 'Y' THEN 1 ELSE 0 END) AS dairy_free_item_count, SUM(CASE WHEN is_nut_free_flag \= 'Y' THEN 1 ELSE 0 END) AS nut_free_item_count FROM frostbyte_tasty_bytes.analytics.menu_v; 95 nut-free items. 81-dairy free. Just 23 healthy items. Good information to know, and if we hadn’t learned how to pull all of this from its semi-structured format, we would have been in trouble! To recap: In this video, we learned a few things. We learned how to use dot notation and position-based indexing to pull elements out of semi-structured data that’s been loaded as Snowflake’s VARIANT, ARRAY, and OBJECT data types. We then put our skills to practice by digging through four layers of a column in our menu data to find out the first ingredient. Then we learned about a table function, LATERAL FLATTEN, that can save us time by doing some of that initial digging for us. It explodes a VARIANT column into multiple columns. We’re doing great! We’re approaching the end of this first module, and we’ve added knowledge of semi-structured data and views to our earlier learnings about virtual warehouses, stages, databases, schemas, and tables. Pretty soon, we’ll be getting into exciting Snowflake features like Time Travel and Cloning. It’s going to be awesome.

2. Let's practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.