Mar 11, 2025

MongoDB Schema Design: Embedding vs Referencing

nosql mongodb database schema-design data-modeling

Prerequisites: Understanding of MongoDB basics. See NOSQL 02 MongoDB Basics.

The biggest mindset shift from SQL to MongoDB is how you structure your data. Forget normalization — embrace denormalization (when appropriate).

Part A: The Two Approaches

1. Embedding vs Referencing

graph TD
    subgraph "Embedding (Denormalized)"
        E1["Order Document"]
        E2["{ customer: {...}, items: [...] }"]
        E1 --> E2
    end
    
    subgraph "Referencing (Normalized)"
        R1["Order Document"]
        R2["{ customerId: ObjectId }"]
        R3["Customer Document"]
        R1 --> R2
        R2 -.->|lookup| R3
    end
    
    style E2 fill:#27ae60,color:#fff
    style R2 fill:#3498db,color:#fff

Approach	When to Use	Trade-off
Embedding	Data read together	Duplication, larger docs
Referencing	Data updates independently	Requires $lookup (join)

2. Embedding (Denormalization)

Put related data inside the parent document.

// Order with embedded customer and items
{
  "_id": ObjectId("order123"),
  "orderDate": ISODate("2024-03-15"),
  "status": "shipped",
  
  "customer": {                    // Embedded!
    "name": "Alice Chen",
    "email": "alice@example.com",
    "address": "123 Main St, Taipei"
  },
  
  "items": [                       // Embedded array!
    { "product": "Laptop", "price": 1200, "qty": 1 },
    { "product": "Mouse", "price": 25, "qty": 2 }
  ],
  
  "total": 1250
}

Advantages

Benefit	Explanation
Single read	Get everything in one query
Atomic writes	Update order + items in one operation
No joins	No $lookup needed

Disadvantages

Drawback	Explanation
Data duplication	Customer info copied to every order
Update anomalies	If Alice changes email, update ALL her orders
Document size limit	16 MB max per document

[!WARNING] Write Contention: Even for 1:Few relationships, embedding can cause problems if the document is frequently updated by many concurrent writers. Example: A viral post’s likes array with thousands of likes per second causes write contention on that single document. In high-concurrency scenarios, consider referencing or using counters in a separate document.

3. Referencing (Normalization)

Store related data in separate collections with ID references.

// Order document (references customer)
{
  "_id": ObjectId("order123"),
  "orderDate": ISODate("2024-03-15"),
  "customerId": ObjectId("customer456"),  // Reference!
  "items": [
    { "productId": ObjectId("prod789"), "qty": 1 },
    { "productId": ObjectId("prod012"), "qty": 2 }
  ]
}

// Customer document (separate collection)
{
  "_id": ObjectId("customer456"),
  "name": "Alice Chen",
  "email": "alice@example.com"
}

// Product document (separate collection)
{
  "_id": ObjectId("prod789"),
  "name": "Laptop",
  "price": 1200
}

Advantages

Benefit	Explanation
No duplication	Customer data in one place
Easy updates	Change email once, everywhere sees it
Smaller documents	References are just ObjectIds

Disadvantages

Drawback	Explanation
Multiple queries	Need $lookup to join data
Not atomic	Can’t update across collections atomically
Application complexity	More queries to manage

[!WARNING] Sharding Limitation: In distributed (sharded) clusters, $lookup across different shards has significant performance overhead. Earlier MongoDB versions had restrictions on cross-shard $lookup. If you plan to scale horizontally, prefer embedding to keep related data on the same shard.

Part B: Decision Framework

4. The Embedding vs Reference Decision Tree

flowchart TD
    START[Data Relationship] --> Q1{How often read<br/>together?}
    Q1 -->|Always| EMBED[Embed]
    Q1 -->|Sometimes| Q2{Child data<br/>changes often?}
    Q2 -->|Rarely| EMBED
    Q2 -->|Frequently| REF[Reference]
    Q1 -->|Rarely| REF
    
    REF --> Q3{Is child shared<br/>by many parents?}
    Q3 -->|Yes| REF_FINAL[Reference]
    Q3 -->|No| Q4{Child has<br/>own lifecycle?}
    Q4 -->|Yes| REF_FINAL
    Q4 -->|No| EMBED
    
    style EMBED fill:#27ae60,color:#fff
    style REF_FINAL fill:#3498db,color:#fff

5. Quick Reference Table

Scenario	Recommendation	Reason
User + Addresses	Embed	Addresses belong to user
Order + Items	Embed	Items only exist in that order
Blog + Comments	Embed (if few)	Comments read with blog
Blog + Comments	Reference (if many)	Avoid huge documents
Product + Category	Reference	Category shared by many products
User + Orders	Reference	Orders have own lifecycle

Part C: Real-World Patterns

6. One-to-One: Embed

Example: User profile + settings

// ✅ Good: Embed
{
  "_id": ObjectId("user123"),
  "username": "alice",
  "email": "alice@example.com",
  "settings": {                    // 1:1 relationship
    "theme": "dark",
    "language": "en",
    "notifications": true
  }
}

7. One-to-Few: Embed

Example: User + addresses (typical person has 1-3 addresses)

// ✅ Good: Embed
{
  "_id": ObjectId("user123"),
  "name": "Alice",
  "addresses": [
    { "type": "home", "city": "Taipei" },
    { "type": "work", "city": "Hsinchu" }
  ]
}

8. One-to-Many: It Depends

Small number of children → Embed

// Blog with 10-20 comments: Embed
{
  "_id": ObjectId("blog123"),
  "title": "MongoDB Guide",
  "comments": [
    { "user": "Bob", "text": "Great post!", "date": ISODate(...) },
    // ... up to ~20 comments
  ]
}

Large number of children → Reference

// Blog with 1000+ comments: Reference
// Blog document
{ "_id": ObjectId("blog123"), "title": "MongoDB Guide" }

// Comments collection
{ "_id": ObjectId("c1"), "blogId": ObjectId("blog123"), "text": "Great!" }
{ "_id": ObjectId("c2"), "blogId": ObjectId("blog123"), "text": "Thanks!" }
// ... thousands of comments

9. Many-to-Many: Reference with Array

Example: Students and Courses

// Student document
{
  "_id": ObjectId("student1"),
  "name": "Alice",
  "courseIds": [
    ObjectId("course1"),
    ObjectId("course2")
  ]
}

// Course document
{
  "_id": ObjectId("course1"),
  "name": "Database 101",
  "studentIds": [
    ObjectId("student1"),
    ObjectId("student2")
  ]
}

Part D: E-Commerce Case Study

10. SQL Design (Normalized)

erDiagram
    Customers ||--o{ Orders : places
    Orders ||--o{ OrderItems : contains
    OrderItems }o--|| Products : references
    Products }o--|| Categories : belongs_to
    
    Customers {
        int id PK
        string name
        string email
    }
    Orders {
        int id PK
        int customer_id FK
        datetime order_date
    }
    OrderItems {
        int id PK
        int order_id FK
        int product_id FK
        int quantity
    }
    Products {
        int id PK
        string name
        decimal price
        int category_id FK
    }

Problem: Displaying one order requires 4 table JOINs.

11. MongoDB Design (Denormalized)

graph LR
    subgraph "SQL: 4 Tables + JOINs"
        direction TB
        C[Customers] -.-> O[Orders]
        O -.-> OI[OrderItems]
        OI -.-> P[Products]
    end
    
    subgraph "MongoDB: 1 Document"
        direction TB
        DOC["{<br/>  customer: {...},<br/>  items: [{...}],<br/>  total: 2624<br/>}"]
    end
    
    style C fill:#e74c3c,color:#fff
    style O fill:#e74c3c,color:#fff
    style OI fill:#e74c3c,color:#fff
    style P fill:#e74c3c,color:#fff
    style DOC fill:#27ae60,color:#fff

// Orders collection - Hybrid approach
{
  "_id": ObjectId("order123"),
  "orderDate": ISODate("2024-03-15"),
  "status": "delivered",
  
  // Embedded: Snapshot of customer at order time
  "customer": {
    "_id": ObjectId("cust456"),  // Keep reference for updates
    "name": "Alice Chen",
    "shippingAddress": "123 Main St"
  },
  
  // Embedded: Order items with product snapshot
  "items": [
    {
      "productId": ObjectId("prod789"),  // Reference for lookup
      "name": "MacBook Pro",             // Snapshot (won't change)
      "priceAtPurchase": 2499,           // Historical record
      "quantity": 1
    }
  ],
  
  "subtotal": 2499,
  "tax": 125,
  "total": 2624
}

Why This Works

Field	Strategy	Reason
`customer._id`	Reference	Can lookup current customer
`customer.name`	Snapshot	Display without $lookup
`items[].productId`	Reference	Can link to product page
`items[].priceAtPurchase`	Snapshot	Historical record (price may change)

Part E: Anti-Patterns to Avoid

12. Massive Arrays (Unbounded Growth)

// ❌ Bad: Array grows forever
{
  "blogId": ObjectId("..."),
  "comments": [
    // ... 50,000 comments -> Document too large!
  ]
}

// ✅ Good: Separate collection with pagination
db.comments.find({ blogId: ObjectId("...") })
           .sort({ date: -1 })
           .limit(20);

13. Deep Nesting

// ❌ Bad: Too deep (hard to query/update)
{
  "level1": {
    "level2": {
      "level3": {
        "level4": {
          "data": "buried treasure"
        }
      }
    }
  }
}

// ✅ Good: Keep nesting to 2-3 levels max
{
  "settings": {
    "notifications": {
      "email": true,
      "sms": false
    }
  }
}

14. Storing Lookup Tables as Embedded

// ❌ Bad: Category duplicated in every product
{ "name": "Laptop", "category": { "id": 1, "name": "Electronics" } }
{ "name": "Phone", "category": { "id": 1, "name": "Electronics" } }
// If category name changes -> update ALL products!

// ✅ Good: Reference for shared data
{ "name": "Laptop", "categoryId": ObjectId("cat1") }
{ "name": "Phone", "categoryId": ObjectId("cat1") }
// + Categories collection with single source of truth

Summary

Embedding vs Referencing

	Embed	Reference
Read performance	Fast (single query)	Slower ($lookup needed)
Write complexity	Simple (atomic)	Complex (multiple collections)
Data consistency	May have duplicates	Single source of truth
Document size	Can grow large	Stays small

Quick Decision Guide

Relationship	Size	Update Frequency	Recommendation
1:1	Any	Any	Embed
1:Few	< 20	Low	Embed
1:Many	< 100	Low	Embed
1:Many	100+	Any	Reference
Many:Many	Any	Any	Reference (arrays)

Golden Rules

Embed what you read together
Reference what changes independently
Denormalize for read performance
Keep documents under 16 MB
Avoid unbounded array growth

💡 Practice Questions

Conceptual

When should you embed data vs reference data in MongoDB?
What is the 16 MB document size limit and why is it a concern for embedded arrays?
Explain the concept of a “snapshot” in schema design. When would you store a snapshot instead of a reference?

Hands-on

// Design a schema for a blog application with:
// - Posts (title, content, author, createdAt)
// - Comments (user, text, createdAt) - assume ~50 comments per post average
// Should comments be embedded or referenced? Show your document structure.

💡 View Answer

For ~50 comments per post, embedding is reasonable:

// Posts collection
{
  "_id": ObjectId("post123"),
  "title": "MongoDB Schema Design",
  "content": "...",
  "author": {
    "_id": ObjectId("user789"),
    "name": "Alice"  // Snapshot for display
  },
  "createdAt": ISODate("2024-03-15"),
  "comments": [
    { "user": "Bob", "text": "Great post!", "createdAt": ISODate(...) },
    // ... up to ~50-100 comments
  ]
}

If comments grow to 1000+, move to separate collection:

// Comments collection
{ "_id": ObjectId(...), "postId": ObjectId("post123"), "user": "Bob", "text": "..." }

Scenario

Anti-pattern: A developer embedded an array of “followers” inside each user document. A popular user has 1 million followers. What problems will occur?

💡 View Answer

Problems:

16 MB Hard Limit: 1 million ObjectId values = ~12 MB (each ObjectId is 12 bytes). Add other fields and you easily exceed the 16 MB limit, causing BSONObj size is invalid error — the document cannot be saved.
Write Contention: Every new follower requires updating the same document, causing lock contention under high concurrency.
Slow Operations: Loading/updating a 12+ MB array is extremely slow.

Solution: Use a separate collection for follower relationships:

// Followers collection (Reference pattern)
{ "_id": ObjectId(...), "userId": ObjectId("popular_user"), "followerId": ObjectId("fan123") }

Store only a follower count in the user document, use $inc for atomic updates.

← Previous In-Memory OLTP: Memory-Optimized Tables & Natively Compiled Procedures

Next → Columnstore Index: Analytics Optimization & Data Warehouse Performance