Hero image for Modern Data Architecture: Medallion, Data Mesh & Beyond

Modern Data Architecture: Medallion, Data Mesh & Beyond

Medallion Architecture Data Mesh Data Platform dbt Databricks

The Modern Challenge

Traditional centralized data teams don’t scale. As organizations grow, they face:

  • ⏱️ Bottlenecks: Central team can’t serve all domains
  • 🔄 Slow Iterations: Weeks/months for new data products
  • 📉 Quality Issues: Teams far from data don’t understand it
  • 🏢 Silos: Departments build shadow data systems

Modern architectures address these through layered organization (Medallion) and decentralized ownership (Data Mesh).

The Underlying Principle: Separation of Concerns (SoC)

Both Medallion and Data Mesh embody the software engineering principle of Separation of Concerns:

ArchitectureSoC Application
MedallionEach layer has ONE job: Bronze (ingest), Silver (clean), Gold (serve)
Data MeshEach domain owns its data end-to-end, platform provides shared infrastructure
CombinedClear boundaries prevent “spaghetti pipelines” where everything depends on everything

Why SoC Matters in Data: Without clear separation, a schema change in Bronze breaks Gold reports. With SoC, each layer acts as a contract boundary—changes are isolated and testable.


1. Medallion Architecture

Definition

Medallion Architecture is a data design pattern that organizes data into progressive layers of quality and refinement: Bronze → Silver → Gold.

The Three Layers

┌─────────────────────────────────────────────────────────────────┐
│                    Medallion Architecture                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │  Source  │───▶│  Bronze  │───▶│  Silver  │───▶│   Gold   │  │
│  │ Systems  │    │   (Raw)  │    │ (Clean)  │    │  (Mart)  │  │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘  │
│                        │              │              │          │
│                        ▼              ▼              ▼          │
│                   Raw Data       Processed      Business-       │
│                   Exact Copy     Validated      Ready Data      │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

🥉 Bronze Layer (Raw / Landing)

AspectDescription
PurposePreserve raw data exactly as received
ContentJSON, CSV, Parquet dumps, API responses
TransformationsMinimal: add ingestion timestamp, source metadata
SchemaSchema-on-read or inferred
RetentionLong-term (years) for replay/audit
-- Bronze table example
CREATE TABLE bronze.raw_orders (
    _ingestion_timestamp TIMESTAMP,
    _source_file STRING,
    _raw_data STRING  -- Original JSON/payload
);

-- Or structured but raw
CREATE TABLE bronze.orders (
    order_id STRING,
    customer_id STRING,
    order_date STRING,      -- Still string, not parsed
    amount STRING,          -- Still string
    _ingestion_ts TIMESTAMP,
    _source STRING
);

Why keep raw data?

  • 🔄 Replayability: Re-process if logic changes
  • 🔍 Debugging: Trace issues to source
  • ⚖️ Compliance: Audit trail for regulators
  • 📊 Flexibility: Build new Silver tables without re-ingesting

🥈 Silver Layer (Clean / Staging)

AspectDescription
PurposeSingle source of truth, cleaned and validated
ContentTyped, deduplicated, standardized tables
TransformationsParsing, casting, deduplication, validation
SchemaEnforced, documented
ConsumersAnalysts, data scientists, Gold layer
-- Silver table: cleaned, typed, deduplicated
CREATE TABLE silver.orders (
    order_id INT PRIMARY KEY,
    customer_id INT NOT NULL,
    order_date DATE NOT NULL,
    amount DECIMAL(12,2),
    currency VARCHAR(3),
    status VARCHAR(20),
    _loaded_at TIMESTAMP,
    _updated_at TIMESTAMP
);

-- Transformation from Bronze to Silver
INSERT INTO silver.orders
SELECT 
    CAST(order_id AS INT),
    CAST(customer_id AS INT),
    PARSE_DATE(order_date),
    CAST(amount AS DECIMAL(12,2)),
    UPPER(TRIM(currency)),
    status,
    CURRENT_TIMESTAMP,
    CURRENT_TIMESTAMP
FROM bronze.orders
WHERE order_id IS NOT NULL
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingestion_ts DESC) = 1;

Silver Layer Responsibilities:

  • ✅ Data type casting
  • ✅ Null handling
  • ✅ Deduplication
  • ✅ Standardization (formats, codes)
  • ✅ Data quality checks
  • ✅ Soft deletes / SCD handling

🥇 Gold Layer (Serve / Mart)

AspectDescription
PurposeBusiness-ready, optimized for consumption
ContentAggregated KPIs, Star Schemas, feature tables
TransformationsBusiness logic, aggregations, dimensional modeling
SchemaStar/Snowflake Schema
ConsumersBI tools, executives, ML models
-- Gold table: aggregated, business-ready
CREATE TABLE gold.daily_sales_summary (
    date_key INT REFERENCES gold.dim_date(date_key),
    store_key INT REFERENCES gold.dim_store(store_key),
    product_category VARCHAR(50),
    total_orders INT,
    total_revenue DECIMAL(15,2),
    avg_order_value DECIMAL(10,2),
    unique_customers INT,
    _computed_at TIMESTAMP
);

-- Gold transformation
INSERT INTO gold.daily_sales_summary
SELECT
    d.date_key,
    s.store_key,
    p.category,
    COUNT(*) as total_orders,
    SUM(o.amount) as total_revenue,
    AVG(o.amount) as avg_order_value,
    COUNT(DISTINCT o.customer_id) as unique_customers,
    CURRENT_TIMESTAMP
FROM silver.orders o
JOIN gold.dim_date d ON o.order_date = d.full_date
JOIN gold.dim_store s ON o.store_id = s.store_id
JOIN gold.dim_product p ON o.product_id = p.product_id
GROUP BY d.date_key, s.store_key, p.category;

Gold Layer Patterns:

  • Star Schema: Fact + Dimension tables (Kimball)
  • 📊 Wide Tables: Pre-joined, denormalized for specific use cases
  • 🤖 Feature Store: ML-ready feature tables
  • 📈 Aggregates: Pre-computed KPIs and metrics

Semantic Layer Trend (2024-2025): To prevent “What is Revenue?” debates (different definitions across departments), modern stacks add a Semantic Layer (Metric Store) on top of Gold:

  • Tools: dbt Semantic Layer, Cube, Looker, Metricflow
  • Purpose: Single source of truth for metric definitions, consumed by all BI tools
  • Pattern: Gold Tables → Semantic Layer → BI Dashboards

Layer Comparison

AspectBronzeSilverGold
Data QualityAs-isValidatedBusiness-ready
SchemaLooseEnforcedOptimized
TransformationsNoneTechnicalBusiness logic
ConsumersEngineersAnalysts, DSBI, Executives
Query PatternRareExploratoryHigh-frequency
Update FrequencyReal-time / micro-batchNear real-timeScheduled batch

dbt Implementation

models/
├── staging/          # Bronze → Silver
│   ├── stg_orders.sql
│   └── stg_customers.sql
├── intermediate/     # Silver transformations
│   └── int_orders_enriched.sql
└── marts/           # Gold layer
    ├── dim_customer.sql
    ├── dim_product.sql
    └── fct_sales.sql
-- models/staging/stg_orders.sql
{{ config(materialized='view') }}

SELECT
    order_id::int as order_id,
    customer_id::int as customer_id,
    order_date::date as order_date,
    amount::decimal(12,2) as amount,
    UPPER(TRIM(status)) as status
FROM {{ source('raw', 'orders') }}
WHERE order_id IS NOT NULL

-- models/marts/fct_sales.sql
{{ config(materialized='table') }}

SELECT
    {{ dbt_utils.generate_surrogate_key(['o.order_id']) }} as sale_key,
    d.date_key,
    c.customer_key,
    p.product_key,
    o.quantity,
    o.amount as revenue
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('dim_date') }} d ON o.order_date = d.full_date
JOIN {{ ref('dim_customer') }} c ON o.customer_id = c.customer_id
JOIN {{ ref('dim_product') }} p ON o.product_id = p.product_id

Multi-hop Architecture: How Many Layers?

The standard Medallion pattern uses 3 layers, but real-world complexity may require more or fewer.

Layer Spectrum

Simplest              Standard Medallion           Complex Enterprise
    │                        │                            │
    ▼                        ▼                            ▼
┌─────────┐           ┌─────────────────┐          ┌─────────────────────┐
│ Raw→Gold│           │Bronze→Silver→Gold│          │Bronze→Silver→Silver+│
│ (2 hop) │           │    (3 hop)      │          │  →Gold→Gold+        │
└─────────┘           └─────────────────┘          │    (4-5 hop)        │
                                                    └─────────────────────┘

Layer Count Decision Matrix

ScenarioRecommended HopsReason
Simple BI reporting2 (Raw → Mart)Low complexity, fast delivery
Standard analytics platform3 (Bronze/Silver/Gold)Balance flexibility & maintenance
Multi-source integration3-4Split Silver: cleaned + integrated
Multi-consumer requirements3-4Split Gold: core + dept_specific
Strict regulatory environment4+Additional audit/compliance layer

Signals to Add More Layers

SignalAction
Silver models too complexSplit into silver_stg + silver_conform
Multiple Gold tables repeat same JOINsAdd Intermediate layer
Different departments need different granularityAdd department-specific Gold
Need complete audit trailAdd archive layer

dbt Common Layer Structure

models/
├── staging/       # Bronze → Silver (1st hop)
├── intermediate/  # Silver → Silver (2nd hop) ← Optional
├── marts/         # Silver → Gold (3rd hop)
│   ├── core/      # Shared Gold
│   └── finance/   # Department-specific Gold (4th hop)
└── reports/       # Pre-aggregated reports (5th hop) ← Optional

Rule of Thumb: Start with 3 layers. Add more only when you see clear signals of complexity. Each additional layer increases maintenance overhead.


2. Data Mesh

Definition

Data Mesh is a sociotechnical approach that decentralizes data ownership to domain teams while providing a self-serve infrastructure.

Created by Zhamak Dehghani at Thoughtworks (2019)

The Four Principles

┌─────────────────────────────────────────────────────────────────┐
│                    Data Mesh Principles                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌────────────────────┐    ┌────────────────────┐              │
│  │  1. Domain         │    │  2. Data as a      │              │
│  │     Ownership      │    │     Product        │              │
│  │                    │    │                    │              │
│  │  Teams own their   │    │  Treat data like   │              │
│  │  domain's data     │    │  a product with    │              │
│  │                    │    │  SLAs & docs       │              │
│  └────────────────────┘    └────────────────────┘              │
│                                                                  │
│  ┌────────────────────┐    ┌────────────────────┐              │
│  │  3. Self-Serve     │    │  4. Federated      │              │
│  │     Platform       │    │     Governance     │              │
│  │                    │    │                    │              │
│  │  Infrastructure    │    │  Global standards, │              │
│  │  teams can use     │    │  local autonomy    │              │
│  │  independently     │    │                    │              │
│  └────────────────────┘    └────────────────────┘              │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Principle 1: Domain Ownership

Teams that create/understand the data own and serve it.

Traditional (Centralized):

┌──────────┐         ┌──────────────────┐         ┌──────────┐
│  Sales   │────────▶│   Central Data   │────────▶│  All     │
│  Team    │         │      Team        │         │ Consumers│
└──────────┘         │                  │         └──────────┘
┌──────────┐         │   (Bottleneck)   │
│ Marketing│────────▶│                  │
└──────────┘         └──────────────────┘

Data Mesh (Decentralized):

┌─────────────────┐                      ┌──────────┐
│   Sales Domain  │─────────────────────▶│          │
│   (owns sales   │                      │ Marketing│
│    data)        │                      │ Domain   │
└─────────────────┘                      │          │
         │                               └──────────┘
         │                                    │
         ▼                                    ▼
┌─────────────────┐                      ┌──────────┐
│ Marketing Domain│◀─────────────────────│  Sales   │
│ (owns marketing │                      │  Domain  │
│  data)          │                      │          │
└─────────────────┘                      └──────────┘

Principle 2: Data as a Product

Each domain publishes data products with product management discipline.

CharacteristicDescription
DiscoverableListed in data catalog with metadata
AddressableClear endpoint/path to access
TrustworthyQuality SLAs, freshness guarantees
Self-describingSchema, lineage, documentation
InteroperableStandard formats, global IDs
SecureAccess controls, privacy compliance
# Example: Data Product Contract
name: sales.orders
version: 2.0.0
owner: sales-data-team@company.com
description: All completed customer orders

schema:
  - name: order_id
    type: integer
    description: Unique identifier
  - name: customer_id
    type: integer
    description: FK to customers.customer_id

sla:
  freshness: 1 hour
  completeness: 99.9%
  availability: 99.5%

access:
  public: true
  pii: false

Data Contracts = “Shift Left” for Data

Just as software engineering shifted testing left (earlier in the pipeline), Data Contracts shift data quality left:

  • Without Contracts: Upstream team changes schema → downstream pipelines break → data team discovers days later
  • With Contracts: Schema change fails contract validation → caught in CI/CD before deployment

Data Contracts are the bridge between Microservices and the Data Platform. They prevent software engineers from accidentally breaking data pipelines with schema changes.


Principle 3: Self-Serve Platform

A data platform that enables domain teams to build/ship data products independently.

┌─────────────────────────────────────────────────────────────────┐
│                    Self-Serve Data Platform                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────────────────────────────────────────────┐       │
│  │                   Domain Teams Use                    │       │
│  ├──────────────────────────────────────────────────────┤       │
│  │  Data Product Templates  │  CI/CD Pipelines          │       │
│  │  Quality Frameworks      │  Access Control           │       │
│  │  Schema Registry         │  Data Catalog             │       │
│  └──────────────────────────────────────────────────────┘       │
│                          │                                       │
│                          ▼                                       │
│  ┌──────────────────────────────────────────────────────┐       │
│  │                 Platform Team Provides                │       │
│  ├──────────────────────────────────────────────────────┤       │
│  │  Compute (Spark/DBT)     │  Storage (Lakehouse)      │       │
│  │  Orchestration           │  Observability            │       │
│  │  Data Mesh Infra         │  Security/Governance      │       │
│  └──────────────────────────────────────────────────────┘       │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Principle 4: Federated Computational Governance

Balance between global standards and local autonomy.

Global (Federated)Local (Domain)
Naming conventionsSchema design
Data quality minimumsQuality tests
PII handling rulesBusiness logic
Interoperability standardsImplementation
Audit requirementsDevelopment pace

Data Mesh Challenges

ChallengeMitigation
Coordination overheadStrong platform team, clear contracts
Skill gapsTraining, embedded data engineers
Inconsistent qualityAutomated quality gates
Discovery complexityRobust data catalog
Governance complexityClear ownership, federated policies

When to Use Data Mesh

✅ Good Fit❌ Poor Fit
Large organization (>100 engineers)Small team (<20)
Multiple distinct domainsSingle product/domain
Domain teams have data skillsNo data skills in domains
Bottleneck at central data teamCentral team working fine
Need for speed + scaleSimple analytics needs

3. Combining Medallion + Data Mesh

The architectures complement each other:

┌─────────────────────────────────────────────────────────────────┐
│                  Medallion + Data Mesh                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────────────────────────────────────────┐        │
│  │               Sales Domain                           │        │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐              │        │
│  │  │ Bronze  │─▶│ Silver  │─▶│  Gold   │ ◀── Product  │        │
│  │  └─────────┘  └─────────┘  └─────────┘              │        │
│  └─────────────────────────────────────────────────────┘        │
│                          │                                       │
│                          │ Published Data Product                │
│                          ▼                                       │
│  ┌─────────────────────────────────────────────────────┐        │
│  │             Marketing Domain                         │        │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐              │        │
│  │  │ Bronze  │─▶│ Silver  │─▶│  Gold   │ ◀── Product  │        │
│  │  │(consumes│  └─────────┘  └─────────┘              │        │
│  │  │ Sales)  │                                         │        │
│  │  └─────────┘                                         │        │
│  └─────────────────────────────────────────────────────┘        │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Implementation Pattern

  1. Each domain has its own Medallion layers (Bronze/Silver/Gold)
  2. Gold layer tables are the published Data Products
  3. Cross-domain consumption: Domain B’s Bronze can ingest Domain A’s Gold
  4. Platform provides shared tooling, catalog, governance

4. Tool Ecosystem

Platform Components

┌─────────────────────────────────────────────────────────────────┐
│                   Modern Data Platform Stack                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐        │
│  │   Ingestion   │  │   Storage     │  │  Transform    │        │
│  ├───────────────┤  ├───────────────┤  ├───────────────┤        │
│  │ Fivetran      │  │ Databricks    │  │ dbt           │        │
│  │ Airbyte       │  │ Snowflake     │  │ Spark         │        │
│  │ Debezium      │  │ Delta Lake    │  │ SQLMesh       │        │
│  └───────────────┘  └───────────────┘  └───────────────┘        │
│                                                                  │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐        │
│  │ Orchestration │  │   Catalog     │  │   Quality     │        │
│  ├───────────────┤  ├───────────────┤  ├───────────────┤        │
│  │ Dagster       │  │ DataHub       │  │ Great Expect. │        │
│  │ Airflow       │  │ Atlan         │  │ Soda          │        │
│  │ Prefect       │  │ Unity Catalog │  │ Monte Carlo   │        │
│  └───────────────┘  └───────────────┘  └───────────────┘        │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Orchestration Deep Dive: Airflow vs Prefect

Two popular orchestration tools with different philosophies:

AspectApache AirflowPrefect
PhilosophyDAGs as configurationPython-first, decorators
Created2014 (Airbnb)2018
Definition StylePython + explicit DAG class@flow / @task decorators
Dynamic DAGsRequires workaroundsNative support
Local DevelopmentRequires full setupSimple python script.py
Managed CloudAstronomer, MWAA, GCP ComposerPrefect Cloud (free tier)
EcosystemVery mature, many pluginsNewer but growing fast
Learning CurveMedium-highLower

Code Comparison

Airflow:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract(): ...
def transform(): ...

with DAG('etl_pipeline', start_date=datetime(2024,1,1), schedule='@daily') as dag:
    t1 = PythonOperator(task_id='extract', python_callable=extract)
    t2 = PythonOperator(task_id='transform', python_callable=transform)
    t1 >> t2

Prefect:

from prefect import flow, task

@task
def extract(): ...

@task
def transform(data): ...

@flow(name="etl_pipeline")
def etl():
    data = extract()
    transform(data)

if __name__ == "__main__":
    etl()

When to Choose

Choose AirflowChoose Prefect
Enterprise with Airflow experienceNew projects, rapid iteration
Need extensive connectorsPython-focused team
Using managed services (GCP/AWS)Want simple local development
Long-term stable projectDynamic workflows

Data Mesh Specific Tools

CategoryTools
Data Product ManagementDatamesh Manager, DataOS
Data ContractsSoda, datacontract CLI
Federated CatalogDataHub, Atlan, Collibra
Self-Serve PlatformDatabricks Unity Catalog, Starburst Galaxy

5. Decision Framework

Architecture Selection

┌─────────────────────────────────────────────────────────────────┐
│                   Architecture Decision                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Organization Size?                                              │
│      │                                                           │
│      ├── Small (<20 data people)                                │
│      │       └──▶ Medallion (centralized)                       │
│      │                                                           │
│      ├── Medium (20-100)                                        │
│      │       └──▶ Medallion + domain alignment                  │
│      │                                                           │
│      └── Large (100+)                                           │
│              │                                                   │
│              ├── Domains have data skills?                      │
│              │       │                                           │
│              │       ├── Yes ──▶ Full Data Mesh                 │
│              │       └── No ──▶ Medallion + gradual shift       │
│              │                                                   │
│              └── Central team bottleneck?                       │
│                      │                                           │
│                      ├── Yes ──▶ Data Mesh                      │
│                      └── No ──▶ Medallion (centralized)         │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

FinOps Warning: The Hidden Cost of Decentralization

Data Mesh and distributed Lakehouse architectures bring agility but can cause cost explosion:

RiskCauseMitigation
Duplicate computeMultiple domains running similar queriesShared intermediate tables, cross-domain marts
Unoptimized queriesDomain teams lack SQL optimization skillsQuery analysis tools, training, guardrails
Storage sprawlEvery team creates copies “just in case”Data lifecycle policies, automated cleanup
Uncapped serverlessSnowflake/BigQuery pay-per-queryBudget alerts, query cost limits

Platform Team Responsibility: Provide Cost Observability dashboards showing:

  • Per-domain compute/storage costs
  • Cost trends and anomalies
  • Query efficiency metrics
  • Tools: Snowflake Cost Management, Databricks Cost Explorer, select star, Zenith

6. Summary: The City Analogy 🏙️

ConceptCity Analogy
MedallionHighway system (Raw imports → Processing zones → Downtown business)
BronzePort/Airport (raw goods arrive as-is)
SilverWarehouses (quality checked, sorted, standardized)
GoldRetail stores (consumer-ready products)
Data MeshCity districts (each district manages its own services, follows city codes)
Self-Serve PlatformCity utilities (water, power everyone can connect to)
Federated GovernanceCity laws (global building codes, local zoning)

Key Takeaways

  1. Medallion Architecture provides a clear layer pattern for data quality progression
  2. Data Mesh decentralizes ownership while maintaining interoperability
  3. They work together: Medallion as the pattern, Mesh as the organizational model
  4. Start simple: Begin with Medallion, evolve to Mesh as organization scales
  5. Platform is key: Self-serve infrastructure enables both patterns