Semi-Structured Data Querying

1. Semi-Structured Data Querying

Harbr ingests shipment events from dozens of carrier partners — and every carrier sends JSON in a slightly different shape.

2. Semi-structured Data Loading

Snowflake supports semi-structured data loading. Once loaded, Snowflake lets you query semi-structured data.

3. Managing Different Data Shapes

Let's go back to our example. Harbr receives shipment events from dozens of carrier partners. Every carrier sends JSON, but the structures vary - some nest the delivery address inside a carrier object, others include an array of intermediate stops, some omit stops entirely. With a rigid schema requirement you'd spend more time wrangling formats than analysing data. Snowflake's VARIANT type changes that.

4. Transformation Example

You can query the data in JSON, Avro, Parquet, ORC, and XML format directly — without flattening it into columns first. This shows the query that took the JSON data on the left and turned it into structured data on the right. The data sits in a special column type called VARIANT, which can hold any semi-structured shape. Some nest addresses inside a carrier object; others include arrays of transit stops. In this video you'll learn how Snowflake handles that variety using the VARIANT column type.

5. Snowflake Data Types

Before diving into semi-structured data, it's worth seeing where VARIANT fits in Snowflake's type system. Snowflake's scalar types — numbers, strings, dates, and booleans — work exactly as you'd expect from any relational database. Each column has a fixed type declared when the table is created, and Snowflake enforces it on every row that comes in. VARIANT is different. It's Snowflake's type for semi-structured data: JSON, Parquet, Avro, ORC, or XML. Unlike VARCHAR or NUMBER, a VARIANT column accepts any structure - a flat object, a deeply nested document, an array. The schema isn't enforced on arrival; it's resolved at query time. Harbr's delivery_events table uses both: structured columns like shipment_id, region, and dispatch_date sit alongside a VARIANT column that holds the raw event payload from each carrier. Everything in this lesson is about getting typed, queryable values back out of that VARIANT column.

6. The VARIANT type

VARIANT is a column type that stores semi-structured data as-is. You load JSON into it without declaring what fields exist. Snowflake simply stores the raw structure and figures out the data structure when you query it. That's called schema-on-read. The schema isn't enforced when data arrives - it's applied when you ask a question. This means Harbr can ingest Carrier A and Carrier B's completely different JSON structures into the same VARIANT column, and handle the differences at query time rather than at load time.

7. Navigating JSON

Harbr's shipment events land in a table with a VARIANT column called event - that's where the raw JSON lives. Each row holds a nested structure: a shipment ID, a carrier object with a name field, and an array of stops. To pull out specific values, Snowflake gives you three navigation tools. The colon gets you into the JSON. Dot notation handles nested fields - so event_data colon carrier dot name reaches inside the carrier object. Square brackets pick from an array by position - stops at index zero gives you the first stop. The double colon at the end casts each value to STRING, giving you a clean flat table from what started as nested JSON.

8. Turning Arrays into Rows wih FLATTEN

Dot notation gets you one value per row which works when you’re pulling a single value, but it doesn’t work for arrays. Arrays need a different approach. Harbr doesn’t want to see one row per shipment, it needs to see one row per stop. That’s where FLATTEN comes in, FLATTEN takes an array and turns it into own row so instead of having one row with a list, you get multiple rows - one for each stop.

9. Using LATERAL FLATTEN

When we use LATERAL FLATTEN, it runs for each row in shipment_events. So each shipment gets expanded independently. The alias f gives us access to each stop through f.value then from there, we use the same navigation as before. Now if we look at the output. SHP-001 had three stops so it became three rows. That nested array is now flattened into a table. And once it’s in that shape, you can filter it, aggregate it, or join it like any other dataset.

10. Let's practice!

You’ve covered Snowflake’s type system and where VARIANT fits, the VARIANT type for schema-on-read ingestion, dot notation and type casting to extract typed values from JSON, and LATERAL FLATTEN to turn arrays into rows. Let’s practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.