Oct 15, 2025

Modern Data Architecture: Medallion, Data Mesh & Beyond

Medallion Architecture Data Mesh Data Platform dbt Databricks

The Modern Challenge

Traditional centralized data teams don’t scale. As organizations grow, they face:

⏱️ Bottlenecks: Central team can’t serve all domains
🔄 Slow Iterations: Weeks/months for new data products
📉 Quality Issues: Teams far from data don’t understand it
🏢 Silos: Departments build shadow data systems

Modern architectures address these through layered organization (Medallion) and decentralized ownership (Data Mesh).

The Underlying Principle: Separation of Concerns (SoC)

Both Medallion and Data Mesh embody the software engineering principle of Separation of Concerns:

Architecture	SoC Application
Medallion	Each layer has ONE job: Bronze (ingest), Silver (clean), Gold (serve)
Data Mesh	Each domain owns its data end-to-end, platform provides shared infrastructure
Combined	Clear boundaries prevent “spaghetti pipelines” where everything depends on everything

Why SoC Matters in Data: Without clear separation, a schema change in Bronze breaks Gold reports. With SoC, each layer acts as a contract boundary—changes are isolated and testable.

1. Medallion Architecture

Definition

Medallion Architecture is a data design pattern that organizes data into progressive layers of quality and refinement: Bronze → Silver → Gold.

The Three Layers

┌─────────────────────────────────────────────────────────────────┐
│                    Medallion Architecture                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │  Source  │───▶│  Bronze  │───▶│  Silver  │───▶│   Gold   │  │
│  │ Systems  │    │   (Raw)  │    │ (Clean)  │    │  (Mart)  │  │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘  │
│                        │              │              │          │
│                        ▼              ▼              ▼          │
│                   Raw Data       Processed      Business-       │
│                   Exact Copy     Validated      Ready Data      │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

🥉 Bronze Layer (Raw / Landing)

Aspect	Description
Purpose	Preserve raw data exactly as received
Content	JSON, CSV, Parquet dumps, API responses
Transformations	Minimal: add ingestion timestamp, source metadata
Schema	Schema-on-read or inferred
Retention	Long-term (years) for replay/audit

-- Bronze table example
CREATE TABLE bronze.raw_orders (
    _ingestion_timestamp TIMESTAMP,
    _source_file STRING,
    _raw_data STRING  -- Original JSON/payload
);

-- Or structured but raw
CREATE TABLE bronze.orders (
    order_id STRING,
    customer_id STRING,
    order_date STRING,      -- Still string, not parsed
    amount STRING,          -- Still string
    _ingestion_ts TIMESTAMP,
    _source STRING
);

Why keep raw data?

🔄 Replayability: Re-process if logic changes
🔍 Debugging: Trace issues to source
⚖️ Compliance: Audit trail for regulators
📊 Flexibility: Build new Silver tables without re-ingesting

🥈 Silver Layer (Clean / Staging)

Aspect	Description
Purpose	Single source of truth, cleaned and validated
Content	Typed, deduplicated, standardized tables
Transformations	Parsing, casting, deduplication, validation
Schema	Enforced, documented
Consumers	Analysts, data scientists, Gold layer

-- Silver table: cleaned, typed, deduplicated
CREATE TABLE silver.orders (
    order_id INT PRIMARY KEY,
    customer_id INT NOT NULL,
    order_date DATE NOT NULL,
    amount DECIMAL(12,2),
    currency VARCHAR(3),
    status VARCHAR(20),
    _loaded_at TIMESTAMP,
    _updated_at TIMESTAMP
);

-- Transformation from Bronze to Silver
INSERT INTO silver.orders
SELECT 
    CAST(order_id AS INT),
    CAST(customer_id AS INT),
    PARSE_DATE(order_date),
    CAST(amount AS DECIMAL(12,2)),
    UPPER(TRIM(currency)),
    status,
    CURRENT_TIMESTAMP,
    CURRENT_TIMESTAMP
FROM bronze.orders
WHERE order_id IS NOT NULL
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingestion_ts DESC) = 1;

Silver Layer Responsibilities:

✅ Data type casting
✅ Null handling
✅ Deduplication
✅ Standardization (formats, codes)
✅ Data quality checks
✅ Soft deletes / SCD handling

🥇 Gold Layer (Serve / Mart)

Aspect	Description
Purpose	Business-ready, optimized for consumption
Content	Aggregated KPIs, Star Schemas, feature tables
Transformations	Business logic, aggregations, dimensional modeling
Schema	Star/Snowflake Schema
Consumers	BI tools, executives, ML models

-- Gold table: aggregated, business-ready
CREATE TABLE gold.daily_sales_summary (
    date_key INT REFERENCES gold.dim_date(date_key),
    store_key INT REFERENCES gold.dim_store(store_key),
    product_category VARCHAR(50),
    total_orders INT,
    total_revenue DECIMAL(15,2),
    avg_order_value DECIMAL(10,2),
    unique_customers INT,
    _computed_at TIMESTAMP
);

-- Gold transformation
INSERT INTO gold.daily_sales_summary
SELECT
    d.date_key,
    s.store_key,
    p.category,
    COUNT(*) as total_orders,
    SUM(o.amount) as total_revenue,
    AVG(o.amount) as avg_order_value,
    COUNT(DISTINCT o.customer_id) as unique_customers,
    CURRENT_TIMESTAMP
FROM silver.orders o
JOIN gold.dim_date d ON o.order_date = d.full_date
JOIN gold.dim_store s ON o.store_id = s.store_id
JOIN gold.dim_product p ON o.product_id = p.product_id
GROUP BY d.date_key, s.store_key, p.category;

Gold Layer Patterns:

⭐ Star Schema: Fact + Dimension tables (Kimball)
📊 Wide Tables: Pre-joined, denormalized for specific use cases
🤖 Feature Store: ML-ready feature tables
📈 Aggregates: Pre-computed KPIs and metrics

Semantic Layer Trend (2024-2025): To prevent “What is Revenue?” debates (different definitions across departments), modern stacks add a Semantic Layer (Metric Store) on top of Gold:

Tools: dbt Semantic Layer, Cube, Looker, Metricflow

Purpose: Single source of truth for metric definitions, consumed by all BI tools

Pattern: Gold Tables → Semantic Layer → BI Dashboards

Layer Comparison

Aspect	Bronze	Silver	Gold
Data Quality	As-is	Validated	Business-ready
Schema	Loose	Enforced	Optimized
Transformations	None	Technical	Business logic
Consumers	Engineers	Analysts, DS	BI, Executives
Query Pattern	Rare	Exploratory	High-frequency
Update Frequency	Real-time / micro-batch	Near real-time	Scheduled batch

dbt Implementation

models/
├── staging/          # Bronze → Silver
│   ├── stg_orders.sql
│   └── stg_customers.sql
├── intermediate/     # Silver transformations
│   └── int_orders_enriched.sql
└── marts/           # Gold layer
    ├── dim_customer.sql
    ├── dim_product.sql
    └── fct_sales.sql

-- models/staging/stg_orders.sql
{{ config(materialized='view') }}

SELECT
    order_id::int as order_id,
    customer_id::int as customer_id,
    order_date::date as order_date,
    amount::decimal(12,2) as amount,
    UPPER(TRIM(status)) as status
FROM {{ source('raw', 'orders') }}
WHERE order_id IS NOT NULL

-- models/marts/fct_sales.sql
{{ config(materialized='table') }}

SELECT
    {{ dbt_utils.generate_surrogate_key(['o.order_id']) }} as sale_key,
    d.date_key,
    c.customer_key,
    p.product_key,
    o.quantity,
    o.amount as revenue
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('dim_date') }} d ON o.order_date = d.full_date
JOIN {{ ref('dim_customer') }} c ON o.customer_id = c.customer_id
JOIN {{ ref('dim_product') }} p ON o.product_id = p.product_id

Multi-hop Architecture: How Many Layers?

The standard Medallion pattern uses 3 layers, but real-world complexity may require more or fewer.

Layer Spectrum

Simplest              Standard Medallion           Complex Enterprise
    │                        │                            │
    ▼                        ▼                            ▼
┌─────────┐           ┌─────────────────┐          ┌─────────────────────┐
│ Raw→Gold│           │Bronze→Silver→Gold│          │Bronze→Silver→Silver+│
│ (2 hop) │           │    (3 hop)      │          │  →Gold→Gold+        │
└─────────┘           └─────────────────┘          │    (4-5 hop)        │
                                                    └─────────────────────┘

Layer Count Decision Matrix

Scenario	Recommended Hops	Reason
Simple BI reporting	2 (Raw → Mart)	Low complexity, fast delivery
Standard analytics platform	3 (Bronze/Silver/Gold)	Balance flexibility & maintenance
Multi-source integration	3-4	Split Silver: `cleaned` + `integrated`
Multi-consumer requirements	3-4	Split Gold: `core` + `dept_specific`
Strict regulatory environment	4+	Additional audit/compliance layer

Signals to Add More Layers

Signal	Action
Silver models too complex	Split into `silver_stg` + `silver_conform`
Multiple Gold tables repeat same JOINs	Add `Intermediate` layer
Different departments need different granularity	Add department-specific Gold
Need complete audit trail	Add `archive` layer

dbt Common Layer Structure

models/
├── staging/       # Bronze → Silver (1st hop)
├── intermediate/  # Silver → Silver (2nd hop) ← Optional
├── marts/         # Silver → Gold (3rd hop)
│   ├── core/      # Shared Gold
│   └── finance/   # Department-specific Gold (4th hop)
└── reports/       # Pre-aggregated reports (5th hop) ← Optional

Rule of Thumb: Start with 3 layers. Add more only when you see clear signals of complexity. Each additional layer increases maintenance overhead.

2. Data Mesh

Definition

Data Mesh is a sociotechnical approach that decentralizes data ownership to domain teams while providing a self-serve infrastructure.

Created by Zhamak Dehghani at Thoughtworks (2019)

The Four Principles

┌─────────────────────────────────────────────────────────────────┐
│                    Data Mesh Principles                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌────────────────────┐    ┌────────────────────┐              │
│  │  1. Domain         │    │  2. Data as a      │              │
│  │     Ownership      │    │     Product        │              │
│  │                    │    │                    │              │
│  │  Teams own their   │    │  Treat data like   │              │
│  │  domain's data     │    │  a product with    │              │
│  │                    │    │  SLAs & docs       │              │
│  └────────────────────┘    └────────────────────┘              │
│                                                                  │
│  ┌────────────────────┐    ┌────────────────────┐              │
│  │  3. Self-Serve     │    │  4. Federated      │              │
│  │     Platform       │    │     Governance     │              │
│  │                    │    │                    │              │
│  │  Infrastructure    │    │  Global standards, │              │
│  │  teams can use     │    │  local autonomy    │              │
│  │  independently     │    │                    │              │
│  └────────────────────┘    └────────────────────┘              │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Principle 1: Domain Ownership

Teams that create/understand the data own and serve it.

Traditional (Centralized):

┌──────────┐         ┌──────────────────┐         ┌──────────┐
│  Sales   │────────▶│   Central Data   │────────▶│  All     │
│  Team    │         │      Team        │         │ Consumers│
└──────────┘         │                  │         └──────────┘
┌──────────┐         │   (Bottleneck)   │
│ Marketing│────────▶│                  │
└──────────┘         └──────────────────┘

Data Mesh (Decentralized):

┌─────────────────┐                      ┌──────────┐
│   Sales Domain  │─────────────────────▶│          │
│   (owns sales   │                      │ Marketing│
│    data)        │                      │ Domain   │
└─────────────────┘                      │          │
         │                               └──────────┘
         │                                    │
         ▼                                    ▼
┌─────────────────┐                      ┌──────────┐
│ Marketing Domain│◀─────────────────────│  Sales   │
│ (owns marketing │                      │  Domain  │
│  data)          │                      │          │
└─────────────────┘                      └──────────┘

Principle 2: Data as a Product

Each domain publishes data products with product management discipline.

Characteristic	Description
Discoverable	Listed in data catalog with metadata
Addressable	Clear endpoint/path to access
Trustworthy	Quality SLAs, freshness guarantees
Self-describing	Schema, lineage, documentation
Interoperable	Standard formats, global IDs
Secure	Access controls, privacy compliance

# Example: Data Product Contract
name: sales.orders
version: 2.0.0
owner: sales-data-team@company.com
description: All completed customer orders

schema:
  - name: order_id
    type: integer
    description: Unique identifier
  - name: customer_id
    type: integer
    description: FK to customers.customer_id

sla:
  freshness: 1 hour
  completeness: 99.9%
  availability: 99.5%

access:
  public: true
  pii: false

Data Contracts = “Shift Left” for Data

Just as software engineering shifted testing left (earlier in the pipeline), Data Contracts shift data quality left:

Without Contracts: Upstream team changes schema → downstream pipelines break → data team discovers days later

With Contracts: Schema change fails contract validation → caught in CI/CD before deployment

Data Contracts are the bridge between Microservices and the Data Platform. They prevent software engineers from accidentally breaking data pipelines with schema changes.

Principle 3: Self-Serve Platform

A data platform that enables domain teams to build/ship data products independently.

┌─────────────────────────────────────────────────────────────────┐
│                    Self-Serve Data Platform                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────────────────────────────────────────────┐       │
│  │                   Domain Teams Use                    │       │
│  ├──────────────────────────────────────────────────────┤       │
│  │  Data Product Templates  │  CI/CD Pipelines          │       │
│  │  Quality Frameworks      │  Access Control           │       │
│  │  Schema Registry         │  Data Catalog             │       │
│  └──────────────────────────────────────────────────────┘       │
│                          │                                       │
│                          ▼                                       │
│  ┌──────────────────────────────────────────────────────┐       │
│  │                 Platform Team Provides                │       │
│  ├──────────────────────────────────────────────────────┤       │
│  │  Compute (Spark/DBT)     │  Storage (Lakehouse)      │       │
│  │  Orchestration           │  Observability            │       │
│  │  Data Mesh Infra         │  Security/Governance      │       │
│  └──────────────────────────────────────────────────────┘       │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Principle 4: Federated Computational Governance

Balance between global standards and local autonomy.

Global (Federated)	Local (Domain)
Naming conventions	Schema design
Data quality minimums	Quality tests
PII handling rules	Business logic
Interoperability standards	Implementation
Audit requirements	Development pace

Data Mesh Challenges

Challenge	Mitigation
Coordination overhead	Strong platform team, clear contracts
Skill gaps	Training, embedded data engineers
Inconsistent quality	Automated quality gates
Discovery complexity	Robust data catalog
Governance complexity	Clear ownership, federated policies

When to Use Data Mesh

✅ Good Fit	❌ Poor Fit
Large organization (>100 engineers)	Small team (<20)
Multiple distinct domains	Single product/domain
Domain teams have data skills	No data skills in domains
Bottleneck at central data team	Central team working fine
Need for speed + scale	Simple analytics needs

3. Combining Medallion + Data Mesh

The architectures complement each other:

┌─────────────────────────────────────────────────────────────────┐
│                  Medallion + Data Mesh                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────────────────────────────────────────┐        │
│  │               Sales Domain                           │        │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐              │        │
│  │  │ Bronze  │─▶│ Silver  │─▶│  Gold   │ ◀── Product  │        │
│  │  └─────────┘  └─────────┘  └─────────┘              │        │
│  └─────────────────────────────────────────────────────┘        │
│                          │                                       │
│                          │ Published Data Product                │
│                          ▼                                       │
│  ┌─────────────────────────────────────────────────────┐        │
│  │             Marketing Domain                         │        │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐              │        │
│  │  │ Bronze  │─▶│ Silver  │─▶│  Gold   │ ◀── Product  │        │
│  │  │(consumes│  └─────────┘  └─────────┘              │        │
│  │  │ Sales)  │                                         │        │
│  │  └─────────┘                                         │        │
│  └─────────────────────────────────────────────────────┘        │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Implementation Pattern

Each domain has its own Medallion layers (Bronze/Silver/Gold)
Gold layer tables are the published Data Products
Cross-domain consumption: Domain B’s Bronze can ingest Domain A’s Gold
Platform provides shared tooling, catalog, governance

4. Tool Ecosystem

Platform Components

┌─────────────────────────────────────────────────────────────────┐
│                   Modern Data Platform Stack                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐        │
│  │   Ingestion   │  │   Storage     │  │  Transform    │        │
│  ├───────────────┤  ├───────────────┤  ├───────────────┤        │
│  │ Fivetran      │  │ Databricks    │  │ dbt           │        │
│  │ Airbyte       │  │ Snowflake     │  │ Spark         │        │
│  │ Debezium      │  │ Delta Lake    │  │ SQLMesh       │        │
│  └───────────────┘  └───────────────┘  └───────────────┘        │
│                                                                  │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐        │
│  │ Orchestration │  │   Catalog     │  │   Quality     │        │
│  ├───────────────┤  ├───────────────┤  ├───────────────┤        │
│  │ Dagster       │  │ DataHub       │  │ Great Expect. │        │
│  │ Airflow       │  │ Atlan         │  │ Soda          │        │
│  │ Prefect       │  │ Unity Catalog │  │ Monte Carlo   │        │
│  └───────────────┘  └───────────────┘  └───────────────┘        │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Orchestration Deep Dive: Airflow vs Prefect

Two popular orchestration tools with different philosophies:

Aspect	Apache Airflow	Prefect
Philosophy	DAGs as configuration	Python-first, decorators
Created	2014 (Airbnb)	2018
Definition Style	Python + explicit DAG class	`@flow` / `@task` decorators
Dynamic DAGs	Requires workarounds	Native support
Local Development	Requires full setup	Simple `python script.py`
Managed Cloud	Astronomer, MWAA, GCP Composer	Prefect Cloud (free tier)
Ecosystem	Very mature, many plugins	Newer but growing fast
Learning Curve	Medium-high	Lower

Code Comparison

Airflow:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract(): ...
def transform(): ...

with DAG('etl_pipeline', start_date=datetime(2024,1,1), schedule='@daily') as dag:
    t1 = PythonOperator(task_id='extract', python_callable=extract)
    t2 = PythonOperator(task_id='transform', python_callable=transform)
    t1 >> t2

Prefect:

from prefect import flow, task

@task
def extract(): ...

@task
def transform(data): ...

@flow(name="etl_pipeline")
def etl():
    data = extract()
    transform(data)

if __name__ == "__main__":
    etl()

When to Choose

Choose Airflow	Choose Prefect
Enterprise with Airflow experience	New projects, rapid iteration
Need extensive connectors	Python-focused team
Using managed services (GCP/AWS)	Want simple local development
Long-term stable project	Dynamic workflows

Data Mesh Specific Tools

Category	Tools
Data Product Management	Datamesh Manager, DataOS
Data Contracts	Soda, datacontract CLI
Federated Catalog	DataHub, Atlan, Collibra
Self-Serve Platform	Databricks Unity Catalog, Starburst Galaxy

5. Decision Framework

Architecture Selection

┌─────────────────────────────────────────────────────────────────┐
│                   Architecture Decision                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Organization Size?                                              │
│      │                                                           │
│      ├── Small (<20 data people)                                │
│      │       └──▶ Medallion (centralized)                       │
│      │                                                           │
│      ├── Medium (20-100)                                        │
│      │       └──▶ Medallion + domain alignment                  │
│      │                                                           │
│      └── Large (100+)                                           │
│              │                                                   │
│              ├── Domains have data skills?                      │
│              │       │                                           │
│              │       ├── Yes ──▶ Full Data Mesh                 │
│              │       └── No ──▶ Medallion + gradual shift       │
│              │                                                   │
│              └── Central team bottleneck?                       │
│                      │                                           │
│                      ├── Yes ──▶ Data Mesh                      │
│                      └── No ──▶ Medallion (centralized)         │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

FinOps Warning: The Hidden Cost of Decentralization

Data Mesh and distributed Lakehouse architectures bring agility but can cause cost explosion:

Risk	Cause	Mitigation
Duplicate compute	Multiple domains running similar queries	Shared intermediate tables, cross-domain marts
Unoptimized queries	Domain teams lack SQL optimization skills	Query analysis tools, training, guardrails
Storage sprawl	Every team creates copies “just in case”	Data lifecycle policies, automated cleanup
Uncapped serverless	Snowflake/BigQuery pay-per-query	Budget alerts, query cost limits

Platform Team Responsibility: Provide Cost Observability dashboards showing:

Per-domain compute/storage costs

Cost trends and anomalies

Query efficiency metrics

Tools: Snowflake Cost Management, Databricks Cost Explorer, select star, Zenith

6. Summary: The City Analogy 🏙️

Concept	City Analogy
Medallion	Highway system (Raw imports → Processing zones → Downtown business)
Bronze	Port/Airport (raw goods arrive as-is)
Silver	Warehouses (quality checked, sorted, standardized)
Gold	Retail stores (consumer-ready products)
Data Mesh	City districts (each district manages its own services, follows city codes)
Self-Serve Platform	City utilities (water, power everyone can connect to)
Federated Governance	City laws (global building codes, local zoning)

Key Takeaways

Medallion Architecture provides a clear layer pattern for data quality progression
Data Mesh decentralizes ownership while maintaining interoperability
They work together: Medallion as the pattern, Mesh as the organizational model
Start simple: Begin with Medallion, evolve to Mesh as organization scales
Platform is key: Self-serve infrastructure enables both patterns

← Previous Scaling the Robot Army: High-Density Infrastructure for Enterprise RPA

Next → Beyond Click Automation: RPA Integration with Documents, Databases, and Files