Benchmarking and continuous improvement
1. Benchmarking and continuous improvement
How do you know if your Genie space is performing well? Benchmarking gives you a data-driven answer.

2. What is benchmarking
A benchmark is a collection of "Gold Standard" questions and answers run against your space to calculate an accuracy score. It gives you a data-driven Pass/Fail report instead of guesswork about whether changes helped, and it confirms that fixing one problem hasn't accidentally broken another, so you can iterate with confidence.

3. Benchmarking frequency
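Concretely, a Gold Standard suite can be as simple as a list of question/SQL pairs, tagged by difficulty. A minimal sketch with hypothetical Bakehouse-style entries (table and column names are illustrative, not from a real schema):

```python
# Hypothetical Gold Standard entries for a Bakehouse-style space.
# Table and column names are illustrative, not from a real schema.
gold_standard = [
    {
        "question": "How many transactions were there?",
        "expected_sql": "SELECT COUNT(*) FROM sales_transactions",
        "tier": "simple",
    },
    {
        "question": "What is total revenue by city?",
        "expected_sql": (
            "SELECT city, SUM(total_price) AS revenue "
            "FROM sales_transactions GROUP BY city"
        ),
        "tier": "aggregate",
    },
    {
        "question": "What is gross margin by franchise?",
        # Assumed to map to a Trusted Asset rather than raw tables.
        "expected_sql": "SELECT * FROM gross_margin_kpi",
        "tier": "trusted_kpi",
    },
]
```

Tagging each entry with a tier lets you track accuracy separately for simple counts versus Trusted-KPI questions.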
Build your test suite from 10 to 20 Gold Standard questions, ranging from simple counts to complex Trusted KPIs. When do you run it? During initial development, run it daily or after every significant curation change to check that new synonyms and relationships haven't broken existing logic. After schema changes, run the full suite immediately when adding new tables or columns. In production, run it weekly or monthly as a health check to track how the model handles new real-world questions over time.

4. Interpreting results
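Running a suite and scoring it can be sketched as follows. Here `ask_genie` is a hypothetical stand-in for whatever call returns the SQL your space generates, and the match is a deliberately naive normalized-string comparison:

```python
def normalize(sql: str) -> str:
    """Collapse whitespace and case so trivial formatting differences still pass."""
    return " ".join(sql.lower().split())

def run_benchmark(gold_standard, ask_genie):
    """Run every Gold Standard case through the space and score it.

    gold_standard: list of {"question": ..., "expected_sql": ...} dicts.
    ask_genie: callable(question) -> generated SQL (hypothetical API).
    Returns (pass_rate, per-case results).
    """
    results = []
    for case in gold_standard:
        generated = ask_genie(case["question"])
        passed = normalize(generated) == normalize(case["expected_sql"])
        results.append({**case, "generated_sql": generated, "passed": passed})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results
```

In practice you would compare query results rather than SQL strings, since semantically equivalent SQL can be written many ways; the string match here is only a sketch.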
After running a benchmark, you'll see a pass rate. But the real value is in interpreting failures. Say your Bakehouse benchmark shows 100% accuracy for "Total sales by city" but 0% for "Total revenue by municipality". That tells you Genie understands the schema but not user terminology: it needs Synonyms, not a logic change. Compare the Expected SQL against the Generated SQL on failed tests to find the logic gap.

5. Prioritizing fixes
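Given per-question results (assumed here to be dicts with `question`, `expected_sql`, `generated_sql`, and `passed` keys, as a benchmark runner might produce), pulling up Expected versus Generated SQL for the failures is straightforward:

```python
def failure_report(results):
    """Side-by-side Expected vs Generated SQL for every failed test.

    results: list of dicts with "question", "expected_sql",
    "generated_sql", and "passed" keys (an assumed shape).
    """
    lines = []
    for r in results:
        if r["passed"]:
            continue  # only failures are worth a close read
        lines.append(f"Q: {r['question']}")
        lines.append(f"  expected:  {r['expected_sql']}")
        lines.append(f"  generated: {r['generated_sql']}")
    return "\n".join(lines)
```

Reading a few of these side by side is usually enough to spot whether the gap is terminology (wrong column picked for a synonym) or logic (wrong join or filter).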
Identify failure patterns. If "Supplier" or "revenue" queries fail most often in your Bakehouse space, focus there. Jump from the benchmark result straight to the Curation menu to add synonyms and examples. Prioritize based on frequency and business impact: fix high-frequency, high-impact failures first for maximum improvement with minimal effort.

6. Continuous improvement cycle
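One way to operationalize frequency-times-impact prioritization is a simple weighted ranking; pattern labels and impact weights below are hypothetical:

```python
from collections import Counter

def prioritize(failure_patterns, impact_weights):
    """Rank failure patterns by (how often they fail) * (business impact).

    failure_patterns: one label per failed test, e.g. "revenue synonym".
    impact_weights: pattern -> weight (assumed scale); unknown patterns
    default to 1.
    """
    freq = Counter(failure_patterns)
    return sorted(
        freq,
        key=lambda p: freq[p] * impact_weights.get(p, 1),
        reverse=True,
    )
```

Work the top of the resulting list first: that is where one curation change fixes the most failed questions.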
The cycle has four steps: build your test suite, execute and interpret results, prioritize and curate fixes, then verify improvement by re-running the benchmark. Each cycle raises accuracy. The key discipline is always verifying before you deploy: confirm the score went up before releasing changes to users.

10. Scaling across the organization
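The verify-before-deploy discipline reduces to a simple gate: re-run the benchmark after curation and only release if the score actually improved. A sketch (function and variable names are illustrative):

```python
def verify_before_deploy(run_benchmark, baseline_score):
    """Gate a release on the benchmark score having improved.

    run_benchmark: zero-arg callable returning the current pass rate
    (an assumed interface). Returns (deploy?, new_score).
    """
    new_score = run_benchmark()
    return new_score > baseline_score, new_score
```

If the gate returns False, the curation change gets reworked rather than released, which is what keeps the cycle from silently regressing.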
Scaling Genie across an organization follows a few principles. Centralize data and permissions with Unity Catalog. Use Trusted Assets for company-wide KPIs as a single source of truth. Reserve "Editor" access for trained Data Stewards; most users only need "Viewer". Mandate thumbs-up and thumbs-down feedback so curators can surface failures. And run regular benchmarks to catch accuracy drift before it affects users.

11. Let's practice!
Interpret benchmark results and prioritize curation tasks. Let's practice!