Symantic DataXpress  ·  Data Modeling

Why Your Semantic Layer Matters More Than Your AI Agent

How I built a governed semantic model that eliminates silent analytics failures: the ones where the query succeeds and the number is wrong.

Narendra Devarasetty  ·  2026
6 min read

A data warehouse knows what columns exist. A semantic model knows what they mean. This is the story of how I built the layer between the two, and why I believe the model, not the AI agent, is the product.

In this article
  1. The problem nobody solves with better prompts
  2. The approach: separate what changes for different reasons
  3. How the model solves the hard problems
  4. Why you can trust it
  5. Getting started: how AI solves the cold-start problem
  6. Keeping it clean: preventing metric proliferation
  7. What I learned building it

The problem nobody solves with better prompts

Every data platform eventually confronts the same question: where does the definition of "revenue" live? Is it in the SQL query an analyst wrote last quarter? In the transformation pipeline a data engineer maintains? In the dashboard someone configured two years ago that nobody wants to touch? In the Slack thread where someone explained to the CFO that "net revenue" doesn't include refunds but "total revenue" does?

In most organizations, the answer is "all of the above." Metric definitions are scattered across queries, dashboards, documentation, and tribal knowledge. The same column is interpreted differently by different teams. Hand an LLM the raw warehouse schema and it will confidently join two tables that shouldn't be joined at the grain of the question, because nothing in the schema says otherwise.

A user asks

"Show me Net Revenue by Consumer Region for last quarter."

Seven words, three layers of ambiguity. "Net Revenue" is not a column. It is a derived metric spanning three tables with different grains. "Consumer Region" requires a historically versioned join where the correct row depends on when the order was placed, not today's date. "Last quarter" could mean the calendar quarter or the fiscal quarter, depending on which team is asking. An LLM will make a confident choice for each ambiguity. It will be wrong about at least one, and the resulting number will look plausible.

The silent failure mode

A SQL syntax error is visible: the query fails, the user retries. A wrong metric definition is invisible: the query succeeds, the number lands in a slide deck, and a decision is made on a figure that was never correct.

The semantic model exists to make this second failure mode structurally impossible.

More schema documentation does not close this gap. You can annotate every column with a description, and an LLM still will not know that two columns in different tables represent the same concept, or that a dimension table requires a time-windowed join. What you need is not more metadata on the schema. You need a separate layer that defines what the data means, independent of how it is stored.

· · ·

The approach: separate what changes for different reasons

The central design principle is simple: things that change for different reasons should live in different places.

Warehouse schemas change when data teams add new tables, introduce new columns, migrate to a new warehouse, or restructure data models. Business definitions change when analysts refine what "net revenue" means or a new team needs a different lens on the same data. These are different rates of change, driven by different people for different reasons. Mixing them in the same layer means every schema change risks breaking a metric definition, and every metric change requires touching schema files.

I separated them into two layers.

Two-Layer Architecture
MODEL LAYER — Business meaning What "Net Revenue" means · Which metrics are certified · How to disambiguate "revenue" Which dimensions can slice which metrics · How entities compose through inheritance Curated perspectives for different teams · Sample questions for AI context Changes when business definitions change · Owned by analysts and domain experts PHYSICAL LAYER — Warehouse truth Which tables exist · Which columns they have · What the data types are Every valid join path between tables · Which dimensions are historically versioned Which aggregate tables exist and what restrictions they carry Changes when the warehouse schema changes · Owned by data engineers Model references physical · Physical never references model

The physical layer is a complete mirror of the warehouse schema: every table, every column, every join path between tables. It is the single source of truth about what the warehouse contains. If a column is not declared here, the system cannot reference it. If a join is not declared here, the system cannot perform it.

The model layer sits above it and defines what the data means. Business domains like "delivery" or "subscription" are modeled as entities that pull from multiple physical tables, define metrics with precise formulas, and compose through inheritance. An analyst thinks in domains, and the model layer mirrors that vocabulary exactly.

What this separation buys you

When the warehouse changes (a column is renamed, a table is migrated): only the physical layer is updated. Every metric and business definition continues to work untouched.

When a business definition changes ("net revenue" now includes a new deduction category): only the model layer is updated. The warehouse schema files are untouched.

When a new team needs a different lens (a Finance perspective on the same delivery data): one new file is added. The core domain model it inherits from does not change.

Each change touches exactly one layer. This is the core design constraint, and it is what makes the system maintainable as the number of metrics, teams, and use cases grows.

Everything is defined in version-controlled configuration files, reviewed in pull requests the same way application code is. A metric definition change is a two-line diff. The alternative (metric definitions stored in a database, managed through a UI) trades reviewability for convenience. I chose reviewability.
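The two-layer split can be sketched as plain Python structures. This is a minimal illustration, not the system's actual file format, and every name in it is hypothetical:

```python
# PHYSICAL LAYER: the single source of truth about what the warehouse
# contains. If a column is not declared here, it cannot be referenced.
PHYSICAL = {
    "fct_orders":   {"order_id", "consumer_id", "amount", "ordered_at"},
    "dim_consumer": {"consumer_id", "region", "valid_from", "valid_to"},
}

# MODEL LAYER: business meaning, referencing the physical layer by name.
# "deductions" below is deliberately undeclared to show the failure.
MODEL = {
    "net_revenue": {
        "formula": "SUM(amount) - SUM(deductions)",
        "references": [("fct_orders", "amount"), ("fct_orders", "deductions")],
    },
}

def undeclared_references(model, physical):
    """Return every (table, column) a metric uses that the physical
    layer does not declare: caught at definition time, not query time."""
    return [
        (table, column)
        for metric in model.values()
        for table, column in metric["references"]
        if column not in physical.get(table, set())
    ]
```

Here `undeclared_references(MODEL, PHYSICAL)` flags `("fct_orders", "deductions")`: the metric change is rejected in review, before any query runs.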

· · ·

How the model solves the hard problems

A key architectural decision: the AI interprets the user's question, translates it into a semantic query (selecting the right metrics, dimensions, filters, and date ranges from the governed catalog) and hands it to the semantic engine, which handles all SQL generation. The AI understands intent. The engine guarantees correctness. Same semantic query in, same SQL out, every time.
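A sketch of that handoff, assuming a hypothetical shape for the semantic query and catalog (the article does not specify the actual schema):

```python
# The structured handoff between the AI (intent) and the engine (SQL).
SEMANTIC_QUERY = {
    "metrics":    ["net_revenue"],
    "dimensions": ["ordering_consumer.region"],
    "filters":    [],
    "date_range": {"grain": "fiscal_quarter", "offset": -1},  # "last quarter"
}

CATALOG = {
    "metrics":    {"net_revenue", "gross_revenue", "total_revenue"},
    "dimensions": {"ordering_consumer.region", "receiving_consumer.region"},
}

def is_governed(query, catalog):
    """The engine accepts only metrics and dimensions drawn from the
    governed catalog: the AI selects, it never invents."""
    return (set(query["metrics"]) <= catalog["metrics"]
            and set(query["dimensions"]) <= catalog["dimensions"])
```

Because the AI's output is this constrained structure rather than free-form SQL, the same semantic query always produces the same SQL.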

This separation is what makes governed analytics possible. But it only works if the model carries enough information to handle the hard problems. Here are the ones I had to solve.

When multiple metrics share the same name

A user asks for "revenue." The model contains total revenue, net revenue, gross revenue, and revenue per order. Without guidance, the agent picks one (silently wrong most of the time) or asks a clarifying question when the answer is obvious from context.

I solved this by making every metric carry more than just a formula. Each metric has a certification level (is this verified by the data team or still experimental?), a semantic type (is this a rate that should never be summed, a duration that should be averaged, or a cumulative total?), and free-form disambiguation hints written by the metric author: "This is gross revenue before deductions. It is the default revenue metric. Use net revenue for post-deduction figures." The agent reads these hints in context and resolves ambiguity without asking. Each of these metadata fields was added to fix a specific, observed failure. They are not documentation; they are operational guidance for the AI.
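A sketch of that per-metric metadata (field names are hypothetical; the hint text is the example quoted above):

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    formula: str
    certification: str        # e.g. "certified" or "experimental"
    semantic_type: str        # e.g. "rate", "duration", "cumulative"
    disambiguation_hint: str  # free-form guidance the agent reads in context

GROSS_REVENUE = Metric(
    name="gross_revenue",
    formula="SUM(orders.amount)",
    certification="certified",
    semantic_type="cumulative",
    disambiguation_hint=(
        "This is gross revenue before deductions. It is the default "
        "revenue metric. Use net revenue for post-deduction figures."
    ),
)

def agent_context(metrics):
    """Render hints into the AI's operating context, so ambiguity is
    resolved from the author's guidance rather than from guessing."""
    return "\n".join(
        f"{m.name} ({m.certification}, {m.semantic_type}): {m.disambiguation_hint}"
        for m in metrics
    )
```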

When joins are ambiguous or historically versioned

In a parcel delivery business, a delivery has two consumers: the person who placed the order and the person who receives it. Both live in the same consumer table. "Show me revenue by consumer region" is ambiguous: which consumer?

And even once you pick the right consumer, their region may have changed over time. The consumer table has multiple historical rows per person. Joining without accounting for the time dimension silently shifts revenue between regions — or worse, multiplies it.

I solved both problems in the model. Every join between tables is explicitly declared with its cardinality, its join logic, and whether it requires time-windowed matching. The query generator does not infer joins from column names or foreign keys. If a join is not declared, it does not happen. For historically versioned dimensions, the system automatically resolves the correct row based on when the transaction occurred, not the current date. And when the same dimension joins in two different roles (ordering consumer vs. receiving consumer), the model names them separately, generates separate SQL aliases, and presents them as distinct choices in the UI. No ambiguity reaches the user.
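The two consumer roles might be declared along these lines (table, column, and field names are all illustrative; the point is that nothing is inferred and each role gets its own alias):

```python
JOINS = {
    "ordering_consumer": {
        "on": "fct_delivery.ordering_consumer_id = dim_consumer.consumer_id",
        "cardinality": "many_to_one",
        "time_windowed": True,   # resolve the row valid at order time
    },
    "receiving_consumer": {
        "on": "fct_delivery.receiving_consumer_id = dim_consumer.consumer_id",
        "cardinality": "many_to_one",
        "time_windowed": True,
    },
}

def join_clause(role):
    """Render one declared join. A time-windowed join carries the
    validity predicate, so historical rows never multiply revenue."""
    if role not in JOINS:
        raise ValueError(f"join '{role}' is not declared, so it cannot happen")
    j = JOINS[role]
    sql = f"JOIN dim_consumer AS {role} ON {j['on']}"
    if j["time_windowed"]:
        sql += (f" AND fct_delivery.ordered_at >= {role}.valid_from"
                f" AND fct_delivery.ordered_at < {role}.valid_to")
    return sql
```

Each role renders with its own alias, which is what lets the UI present "ordering consumer region" and "receiving consumer region" as distinct choices.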

When pre-aggregated tables exist and the AI can't tell when to use them over the fact table

Pre-aggregated tables exist for performance: pre-computed daily rollups that answer common queries faster than scanning millions of fact rows. But they carry restrictions. A daily delivery aggregate might be pre-built excluding cancelled deliveries. If a user asks for "revenue by delivery status" and the system routes to that aggregate, the cancelled row is missing entirely. The total is lower than reality. There is no error, no warning. The number looks right. It isn't.

I solved this with a strict eligibility check. Before routing any query to an aggregate, the system evaluates whether the aggregate can safely answer the entire question: are all the requested metrics available, do all the requested dimensions exist, are the aggregate's content restrictions compatible with the user's filters? The decision is binary: the aggregate serves the whole query, or the system falls back to the full table. There is no partial routing, no hybrid mode. It is better to be correct and slower than fast and wrong.
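The eligibility check reduces to a conjunction of set checks. A sketch under hypothetical names (the real rule set is richer, but the binary shape is the point):

```python
def can_use_aggregate(query, agg):
    """Binary eligibility: the aggregate serves the entire question or
    none of it. Any miss means falling back to the full fact table."""
    asked = set(query["dimensions"]) | set(query["filter_columns"])
    return (
        set(query["metrics"]) <= agg["metrics"]            # all metrics pre-computed
        and set(query["dimensions"]) <= agg["dimensions"]  # all dimensions present
        and asked.isdisjoint(agg["restricted_columns"])    # restrictions compatible
    )

DAILY_AGG = {
    "metrics": {"revenue", "order_count"},
    "dimensions": {"order_date", "region", "delivery_status"},
    # Built excluding cancelled deliveries, so any question that slices
    # or filters on delivery_status must not be routed here.
    "restricted_columns": {"delivery_status"},
}

SAFE   = {"metrics": ["revenue"], "dimensions": ["region"], "filter_columns": []}
UNSAFE = {"metrics": ["revenue"], "dimensions": ["delivery_status"], "filter_columns": []}
```

"Revenue by region" routes to the aggregate; "revenue by delivery status" falls back to the fact table, and the cancelled row stays in the answer.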

"The aggregate routing system embodies a core principle: correctness is not negotiable. Performance is earned one rule at a time."

When the AI guesses filter values

"Show me cancelled orders." Is the filter value "cancelled", "Cancelled", "CANCELLED", or "cancel"? A wrong guess produces a valid query that returns zero rows. The user assumes there are no cancelled orders.

I solved this by declaring the exact valid values for every enumerated column in the physical layer. These canonical values are extracted when the model loads and provided to the AI as part of its operating context. The agent matches against exact strings. No guessing, no case-sensitivity bugs, no empty results from typos.
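Canonical-value matching can be sketched as a case-insensitive lookup against the declared values (table name and values hypothetical):

```python
# Declared in the physical layer and loaded into the AI's context.
CANONICAL = {
    ("orders", "status"): ["CREATED", "SHIPPED", "DELIVERED", "CANCELLED"],
}

def resolve_filter_value(table, column, user_term):
    """Map the user's word to the exact stored string, or None.
    Never a guessed value that silently matches zero rows."""
    for value in CANONICAL.get((table, column), []):
        if value.casefold() == user_term.casefold():
            return value
    return None
```

"Cancelled", "cancelled", and "CANCELLED" all resolve to the one string the warehouse actually stores; an unrecognized term returns None instead of an empty result set.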

When the AI misunderstands how a metric should be aggregated

"What's the total cancellation rate across all regions?" If the agent sums individual region rates instead of computing a weighted average, the resulting number is meaningless. This is a common analytical mistake, one that even experienced analysts make.

I solved this by tagging metrics with semantic types: rate metrics should be displayed as percentages and never summed, duration metrics should be compared with averages and include their unit, cumulative metrics should not be double-counted across time periods. These rules flow into the AI's context as behavioral guidance. The query engine itself does not enforce them. A power user making a deliberate choice is not blocked. But the AI follows them when suggesting aggregations, which prevents the most common class of analytical errors from reaching the end user.
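A worked example of the failure the "rate" semantic type prevents, with illustrative numbers:

```python
regions = [
    {"region": "North", "orders": 1000, "cancelled": 50},  # 5% rate
    {"region": "South", "orders": 100,  "cancelled": 20},  # 20% rate
]

# Summing the per-region rates produces a meaningless 25%:
naive = sum(r["cancelled"] / r["orders"] for r in regions)

# The weighted average is the true overall rate, 70 / 1100, about 6.4%:
weighted = (sum(r["cancelled"] for r in regions)
            / sum(r["orders"] for r in regions))
```

The small region's extreme rate dominates the naive sum; weighting by order volume restores the real picture. The semantic type tells the AI to suggest the second computation, not the first.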

· · ·

Why you can trust it

A governed model that allows incorrect definitions is worse than no model at all. Without one, analysts know they are on their own and double-check everything. With a model they trust, they stop checking. If a metric formula is wrong, every query that uses it is silently corrupted. The bar for reliability has to be higher than "it usually works."

Three design decisions set that bar.

Errors are caught when the model is defined, not when a user runs a query. Every model goes through structural validation before it can serve a single query. Circular dependencies, missing fields on historically versioned joins, broken references between tables. All of these are rejected immediately when the model author saves their work. A model with a latent error cannot reach production. The feedback loop for the model author is seconds, not the days or weeks it would take for an end user to stumble onto the bug.
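One of those checks, rejecting circular dependencies between metric formulas, is ordinary cycle detection over a dependency map. A sketch, assuming a hypothetical shape for the map:

```python
def has_cycle(deps):
    """deps maps each metric to the metrics its formula references.
    Depth-first search: a node revisited while still on the current
    path means the definitions are circular."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True          # back-edge: cycle found
        visiting.add(node)
        if any(visit(d) for d in deps.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(n) for n in deps)

OK_MODEL  = {"net_revenue": ["gross_revenue"], "gross_revenue": []}
BAD_MODEL = {"a": ["b"], "b": ["a"]}
```

A model like `BAD_MODEL` is rejected the moment the author saves it, long before any user could hit an infinite expansion at query time.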

The query generator is deterministic and traceable. Given the same inputs, it produces the same SQL every time. No probabilistic reasoning, no optimization tricks that make the output hard to trace. A human reading the generated SQL can follow every clause back to a specific model definition. When something looks wrong, you can diagnose it. This matters more than cleverness when the system's primary job is producing correct numbers.

The model detects when the warehouse changes under it. The physical layer is a contract that says "these tables have these columns with these types." Warehouses evolve: columns are renamed, types change, tables are rebuilt. When the two diverge, queries fail with cryptic errors. The system compares its definitions against the live database and reports exactly where they disagree: which columns were renamed, which were added, which changed type. This turns an invisible problem into a visible, actionable one.
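The drift report can be sketched as a three-way diff of column-to-type maps (the report's actual shape in the system is an assumption here):

```python
def schema_drift(declared, live):
    """Compare the physical layer's contract against the live warehouse
    schema and report exactly where they disagree."""
    return {
        "missing": sorted(set(declared) - set(live)),  # dropped or renamed away
        "added":   sorted(set(live) - set(declared)),  # new, undeclared columns
        "retyped": sorted(c for c in set(declared) & set(live)
                          if declared[c] != live[c]),
    }

DECLARED = {"order_id": "string", "amount": "decimal", "region": "string"}
LIVE     = {"order_id": "string", "amount": "float",   "consumer_region": "string"}
```

Against this pair, the report says `region` is missing, `consumer_region` appeared, and `amount` changed type: a rename and a retype surfaced as an actionable list instead of a cryptic query failure.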

Underneath all of this sits a test suite that covers over a thousand scenarios: every query generation path, every filter operator, every edge case in aggregate routing, every inheritance pattern. These are not superficial checks. Each test builds a model, runs the query generator, and verifies the exact SQL output. If a metric formula shifts, a join order changes, or a filter appears in the wrong place, a test fails. The suite is the safety net that lets me iterate on the model with confidence.

· · ·

Getting started: how AI solves the cold-start problem

If the semantic model is this valuable, why doesn't every organization have one? Because building it by hand is prohibitively expensive. A data engineer sits with business stakeholders, maps every metric definition, every relationship, every temporal rule, table by table, column by column. For an organization with hundreds of tables and thousands of columns, this is months of work. The value is clear. The investment stops most teams before they start. And so they either don't build a semantic layer at all, or they build a partial one that covers their most-used dashboards and nothing else.

I believe AI can compress this timeline from months to weeks, not by replacing human judgment, but by eliminating the blank-page problem.

The approach: an AI-powered Studio Agent reads the physical data model: table structures, column patterns, query history, existing transformation logic. From this analysis, it generates a structured draft of the semantic model: proposed metric definitions, inferred join paths, temporal rules, and dimension hierarchies. Every proposal carries supporting evidence: why this join was inferred, what query patterns support this metric grouping, what table characteristics suggest a historically versioned dimension.

The cognitive shift

Building a semantic model from scratch is authoring a document from a blank page. Reviewing a structured, evidence-backed draft is reviewing a pull request. The cognitive load is fundamentally different. The person doing the review still needs domain expertise. They still decide whether the proposed "net revenue" formula is right for their business. But they are correcting and approving, not creating from nothing.

· · ·

Keeping it clean: preventing metric proliferation

Getting started is one problem. Keeping the model healthy as it grows is another. Without guardrails, teams create variations instead of reusing existing definitions: "revenue," "revenue adjusted," "revenue adjusted v2," "net revenue excluding refunds new." Each made sense at the time it was created. Collectively, they degrade the model because five metrics now match any revenue-related question, and the AI has no basis for choosing the right one.

The guardrails are structural. When a new metric is proposed, the system checks for semantic overlap, not just name similarity, but definition similarity. Equivalent metrics with different filters get redirected to the existing metric with a filter parameter. Genuine variants require disambiguation annotations explaining when to use each one. These annotations feed directly into the AI agent's context, the same disambiguation hints that help choose between "net revenue" and "gross revenue" when a user just says "revenue."

The result is a model that grows intentionally. New metrics are added because they represent genuinely new business concepts, not because someone didn't know the existing metric already existed.

· · ·

What I learned building it

This system was not designed on a whiteboard. Each layer emerged from a specific failure of the previous one. These are the lessons that shaped my architecture decisions.

LESSON 01
Separate what changes for different reasons

The earliest version mixed table schemas and metric definitions in the same place. When a warehouse column was renamed, every metric formula that referenced it had to be updated. The two-layer architecture was born from this pain. Now a schema change touches one layer. Business definition changes touch the other. Neither cascades. The cost is two layers to maintain. The benefit is that each layer evolves independently, owned by the people who understand it best.

LESSON 02
Choose correctness over cleverness

An early prototype tried to optimize the generated SQL: reordering joins, combining filters, eliminating redundant subqueries. The optimized SQL was sometimes faster. But when it produced wrong results, diagnosing the problem was nearly impossible because the SQL bore no resemblance to the model's logical structure. I switched to a strict pipeline where the output is predictable and traceable. Correctness beats performance when the system's primary job is producing trustworthy numbers.

LESSON 03
AI metadata is infrastructure, not documentation

Disambiguation hints, certification levels, semantic types, and canonical filter values were all late additions. I initially treated them as "nice to have." Then I watched the agent consistently choose the wrong metric when two had similar names, suggest summing a rate, and use the wrong casing on filter values. Each metadata field was added to fix a specific, observed failure. The enriched context provided to the AI is not a description of the model. It is an operating manual that prevents the AI from repeating the same mistake twice. Every field in it earns its place by eliminating a failure.

LESSON 04
Never serve a partially correct answer

The temptation with aggregate tables is to use them whenever possible and fall back for anything they can't handle. I tried pulling some metrics from an aggregate and others from the full table in the same query. The results were subtly wrong when the two sources interacted. The system now enforces a binary decision: the aggregate serves the entire query, or it doesn't serve any of it. No partial credit. This simplified the logic dramatically and eliminated an entire category of silent errors.

· · ·

The model is the product

The semantic model is not scaffolding you build so the AI can do interesting work. The model is the interesting work. It encodes institutional knowledge about what the data means, how metrics relate to each other, which dimensions make sense together, and where the edge cases hide. The AI agent is a delivery mechanism for that knowledge. The model is the substance.

Every improvement to the model (a disambiguation hint that resolves an ambiguity, a canonical value that prevents a filter mistake, a semantic type that stops an aggregation error) directly improves every query the system will ever generate. In my experience, improving the model delivers higher returns than any amount of prompt engineering. The model is the leverage point.

"A great model with a mediocre agent produces better answers than a mediocre model with a great agent."

Read the full technical deep-dive: entity inheritance, formula qualification, the complete query generation pipeline, and the full test architecture.