The Medallion Architecture
1. The medallion architecture
Welcome back! Now let's tackle the next challenge - organizing raw data into something trustworthy.
2. From raw to ready
When raw data lands in your lakehouse, how do you turn messy, unvalidated records into clean, business-ready insights? That's the problem the medallion architecture solves. It organizes data into layers, with data quality improving at each one.
3. The data quality problem
Imagine it's your first week as a data engineer. Your company ingests data from dozens of sources - taxi ride logs, IoT sensors, web events. Some records have missing fields. Others have duplicates or mismatched types. Analysts want trustworthy dashboards, but the raw data isn't clean enough to support them yet. You need a system to organize the cleanup. That system is the medallion architecture.
4. Three layers, one purpose
Think of it like a professional kitchen. Bronze is your raw ingredients, straight from the market - unwashed, uncut. Silver is the prep station, where everything gets washed, chopped, and validated. Gold is the finished dish, plated and ready to serve. Different chefs pick up from different stages, just like different teams in your organization. Three layers, each with a clear purpose.
5. The bronze layer
Let's make this concrete with city taxi trip data. The bronze layer captures raw data exactly as it arrives - no transformations, no filtering. Notice the raw JSON: timestamps use different formats, some fare fields are null, and distance units are inconsistent. Bronze is append-only, meaning new records are added but existing ones are never deleted or overwritten. This is your safety net. If anything goes wrong downstream, you can always trace back to the original records.
6. The silver layer
At the silver layer, we clean things up. The same taxi data now has timestamps parsed into proper datetime types, records with null fares removed, duplicates eliminated, and zone codes validated against a reference table. The schema is enforced - every column has a consistent type. This is where data scientists typically start exploring. The data is reliable and still granular, with one row per ride, ready for detailed analysis.
7. The gold layer
The gold layer takes things one step further. Here, data is aggregated and shaped for specific business needs. Instead of individual rides, you see a summary table - average fare by pickup zone per day. This is what powers dashboards, executive reports, and key performance metrics. Gold tables are optimized for consumption: fast queries, pre-joined data, and business-level granularity.
8. Who typically uses which layer?
So who typically consumes each layer? There are no hard rules, but common patterns emerge. Data engineers usually work across all three layers, building the pipelines that move data from bronze to gold. Data scientists often explore silver tables for analysis and feature engineering, though they may also use gold. Business analysts and dashboards typically consume gold tables for reporting. Treat these as typical patterns rather than strict boundaries - any role may access any layer depending on the use case.
9. Summary
Let's recap. The medallion architecture gives your lakehouse a clear organizational pattern. Bronze captures everything raw - your safety net. Silver cleans and validates - your reliable foundation. Gold aggregates for business - your served insights. Together, these three layers create a path from raw ingestion to trusted, actionable data.
10. Let's practice!
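Before the exercises, here is a minimal sketch of the bronze-to-gold path in plain Python. It is an illustration under stated assumptions, not how a real lakehouse is implemented: the layers are in-memory lists rather than tables, and the record fields (trip_id, pickup_ts, zone, fare), the zone codes, the two timestamp formats, the choice to deduplicate on trip_id, and every function name are hypothetical, chosen only to mirror the taxi example.

```python
import json
from collections import defaultdict
from datetime import datetime

# Hypothetical raw taxi events; all field names and values are invented.
RAW_EVENTS = [
    '{"trip_id": "t1", "pickup_ts": "2024-03-01T08:15:00", "zone": "Z01", "fare": 12.5}',
    '{"trip_id": "t2", "pickup_ts": "03/01/2024 09:30", "zone": "Z01", "fare": 7.5}',
    '{"trip_id": "t2", "pickup_ts": "03/01/2024 09:30", "zone": "Z01", "fare": 7.5}',
    '{"trip_id": "t3", "pickup_ts": "2024-03-01T10:00:00", "zone": "Z02", "fare": null}',
    '{"trip_id": "t4", "pickup_ts": "2024-03-02T11:45:00", "zone": "Z99", "fare": 20.0}',
    '{"trip_id": "t5", "pickup_ts": "2024-03-02T12:00:00", "zone": "Z02", "fare": 30.0}',
]
VALID_ZONES = {"Z01", "Z02"}  # stand-in for the zone reference table
TIMESTAMP_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%m/%d/%Y %H:%M")  # assumed formats


def ingest_bronze(bronze, raw_events):
    """Bronze: append raw payloads untouched; existing rows are never changed."""
    return bronze + [{"raw": event} for event in raw_events]


def parse_timestamp(text):
    """Try each assumed format; return None if none of them match."""
    for fmt in TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def build_silver(bronze):
    """Silver: parse timestamps, drop null fares, validate zones, deduplicate."""
    silver, seen_trip_ids = [], set()
    for row in bronze:
        record = json.loads(row["raw"])
        pickup = parse_timestamp(record["pickup_ts"])
        if pickup is None or record["fare"] is None:
            continue  # unparseable timestamp or missing fare
        if record["zone"] not in VALID_ZONES:
            continue  # zone code not in the reference set
        if record["trip_id"] in seen_trip_ids:
            continue  # duplicate trip
        seen_trip_ids.add(record["trip_id"])
        silver.append({
            "trip_id": record["trip_id"],
            "pickup_ts": pickup,           # enforced type: datetime
            "zone": record["zone"],        # enforced type: str
            "fare": float(record["fare"]), # enforced type: float
        })
    return silver


def build_gold(silver):
    """Gold: average fare by pickup zone per day."""
    fares = defaultdict(list)
    for trip in silver:
        fares[(trip["zone"], trip["pickup_ts"].date())].append(trip["fare"])
    return [
        {"zone": zone, "day": day.isoformat(), "avg_fare": sum(vals) / len(vals)}
        for (zone, day), vals in sorted(fares.items())
    ]


bronze = ingest_bronze([], RAW_EVENTS)
silver = build_silver(bronze)
gold = build_gold(silver)

for row in gold:
    print(row)
```

Note that bronze still holds all six raw payloads after silver and gold are built, including the duplicate and the rejected records. If a cleaning rule turns out to be wrong, silver and gold can be rebuilt from bronze without re-ingesting anything.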
Now it's your turn. In the exercises ahead, you'll query bronze, silver, and gold tables and see the differences for yourself.