Hero image for MongoDB Schema Design: Embedding vs Referencing

MongoDB Schema Design: Embedding vs Referencing

nosql mongodb database schema-design data-modeling

Prerequisites: Understanding of MongoDB basics. See NOSQL 02 MongoDB Basics.

The biggest mindset shift from SQL to MongoDB is how you structure your data. Forget normalization — embrace denormalization (when appropriate).


Part A: The Two Approaches

1. Embedding vs Referencing

graph TD
    subgraph "Embedding (Denormalized)"
        E1["Order Document"]
        E2["{ customer: {...}, items: [...] }"]
        E1 --> E2
    end
    
    subgraph "Referencing (Normalized)"
        R1["Order Document"]
        R2["{ customerId: ObjectId }"]
        R3["Customer Document"]
        R1 --> R2
        R2 -.->|lookup| R3
    end
    
    style E2 fill:#27ae60,color:#fff
    style R2 fill:#3498db,color:#fff
ApproachWhen to UseTrade-off
EmbeddingData read togetherDuplication, larger docs
ReferencingData updates independentlyRequires $lookup (join)

2. Embedding (Denormalization)

Put related data inside the parent document.

// Order with embedded customer and items
{
  "_id": ObjectId("order123"),
  "orderDate": ISODate("2024-03-15"),
  "status": "shipped",
  
  "customer": {                    // Embedded!
    "name": "Alice Chen",
    "email": "alice@example.com",
    "address": "123 Main St, Taipei"
  },
  
  "items": [                       // Embedded array!
    { "product": "Laptop", "price": 1200, "qty": 1 },
    { "product": "Mouse", "price": 25, "qty": 2 }
  ],
  
  "total": 1250
}

Advantages

BenefitExplanation
Single readGet everything in one query
Atomic writesUpdate order + items in one operation
No joinsNo $lookup needed

Disadvantages

DrawbackExplanation
Data duplicationCustomer info copied to every order
Update anomaliesIf Alice changes email, update ALL her orders
Document size limit16 MB max per document

[!WARNING] Write Contention: Even for 1:Few relationships, embedding can cause problems if the document is frequently updated by many concurrent writers. Example: A viral post’s likes array with thousands of likes per second causes write contention on that single document. In high-concurrency scenarios, consider referencing or using counters in a separate document.


3. Referencing (Normalization)

Store related data in separate collections with ID references.

// Order document (references customer)
{
  "_id": ObjectId("order123"),
  "orderDate": ISODate("2024-03-15"),
  "customerId": ObjectId("customer456"),  // Reference!
  "items": [
    { "productId": ObjectId("prod789"), "qty": 1 },
    { "productId": ObjectId("prod012"), "qty": 2 }
  ]
}

// Customer document (separate collection)
{
  "_id": ObjectId("customer456"),
  "name": "Alice Chen",
  "email": "alice@example.com"
}

// Product document (separate collection)
{
  "_id": ObjectId("prod789"),
  "name": "Laptop",
  "price": 1200
}

Advantages

BenefitExplanation
No duplicationCustomer data in one place
Easy updatesChange email once, everywhere sees it
Smaller documentsReferences are just ObjectIds

Disadvantages

DrawbackExplanation
Multiple queriesNeed $lookup to join data
Not atomicCan’t update across collections atomically
Application complexityMore queries to manage

[!WARNING] Sharding Limitation: In distributed (sharded) clusters, $lookup across different shards has significant performance overhead. Earlier MongoDB versions had restrictions on cross-shard $lookup. If you plan to scale horizontally, prefer embedding to keep related data on the same shard.


Part B: Decision Framework

4. The Embedding vs Reference Decision Tree

flowchart TD
    START[Data Relationship] --> Q1{How often read<br/>together?}
    Q1 -->|Always| EMBED[Embed]
    Q1 -->|Sometimes| Q2{Child data<br/>changes often?}
    Q2 -->|Rarely| EMBED
    Q2 -->|Frequently| REF[Reference]
    Q1 -->|Rarely| REF
    
    REF --> Q3{Is child shared<br/>by many parents?}
    Q3 -->|Yes| REF_FINAL[Reference]
    Q3 -->|No| Q4{Child has<br/>own lifecycle?}
    Q4 -->|Yes| REF_FINAL
    Q4 -->|No| EMBED
    
    style EMBED fill:#27ae60,color:#fff
    style REF_FINAL fill:#3498db,color:#fff

5. Quick Reference Table

ScenarioRecommendationReason
User + AddressesEmbedAddresses belong to user
Order + ItemsEmbedItems only exist in that order
Blog + CommentsEmbed (if few)Comments read with blog
Blog + CommentsReference (if many)Avoid huge documents
Product + CategoryReferenceCategory shared by many products
User + OrdersReferenceOrders have own lifecycle

Part C: Real-World Patterns

6. One-to-One: Embed

Example: User profile + settings

// ✅ Good: Embed
{
  "_id": ObjectId("user123"),
  "username": "alice",
  "email": "alice@example.com",
  "settings": {                    // 1:1 relationship
    "theme": "dark",
    "language": "en",
    "notifications": true
  }
}

7. One-to-Few: Embed

Example: User + addresses (typical person has 1-3 addresses)

// ✅ Good: Embed
{
  "_id": ObjectId("user123"),
  "name": "Alice",
  "addresses": [
    { "type": "home", "city": "Taipei" },
    { "type": "work", "city": "Hsinchu" }
  ]
}

8. One-to-Many: It Depends

Small number of children → Embed

// Blog with 10-20 comments: Embed
{
  "_id": ObjectId("blog123"),
  "title": "MongoDB Guide",
  "comments": [
    { "user": "Bob", "text": "Great post!", "date": ISODate(...) },
    // ... up to ~20 comments
  ]
}

Large number of children → Reference

// Blog with 1000+ comments: Reference
// Blog document
{ "_id": ObjectId("blog123"), "title": "MongoDB Guide" }

// Comments collection
{ "_id": ObjectId("c1"), "blogId": ObjectId("blog123"), "text": "Great!" }
{ "_id": ObjectId("c2"), "blogId": ObjectId("blog123"), "text": "Thanks!" }
// ... thousands of comments

9. Many-to-Many: Reference with Array

Example: Students and Courses

// Student document
{
  "_id": ObjectId("student1"),
  "name": "Alice",
  "courseIds": [
    ObjectId("course1"),
    ObjectId("course2")
  ]
}

// Course document
{
  "_id": ObjectId("course1"),
  "name": "Database 101",
  "studentIds": [
    ObjectId("student1"),
    ObjectId("student2")
  ]
}

Part D: E-Commerce Case Study

10. SQL Design (Normalized)

erDiagram
    Customers ||--o{ Orders : places
    Orders ||--o{ OrderItems : contains
    OrderItems }o--|| Products : references
    Products }o--|| Categories : belongs_to
    
    Customers {
        int id PK
        string name
        string email
    }
    Orders {
        int id PK
        int customer_id FK
        datetime order_date
    }
    OrderItems {
        int id PK
        int order_id FK
        int product_id FK
        int quantity
    }
    Products {
        int id PK
        string name
        decimal price
        int category_id FK
    }

Problem: Displaying one order requires 4 table JOINs.

11. MongoDB Design (Denormalized)

graph LR
    subgraph "SQL: 4 Tables + JOINs"
        direction TB
        C[Customers] -.-> O[Orders]
        O -.-> OI[OrderItems]
        OI -.-> P[Products]
    end
    
    subgraph "MongoDB: 1 Document"
        direction TB
        DOC["{<br/>  customer: {...},<br/>  items: [{...}],<br/>  total: 2624<br/>}"]
    end
    
    style C fill:#e74c3c,color:#fff
    style O fill:#e74c3c,color:#fff
    style OI fill:#e74c3c,color:#fff
    style P fill:#e74c3c,color:#fff
    style DOC fill:#27ae60,color:#fff
// Orders collection - Hybrid approach
{
  "_id": ObjectId("order123"),
  "orderDate": ISODate("2024-03-15"),
  "status": "delivered",
  
  // Embedded: Snapshot of customer at order time
  "customer": {
    "_id": ObjectId("cust456"),  // Keep reference for updates
    "name": "Alice Chen",
    "shippingAddress": "123 Main St"
  },
  
  // Embedded: Order items with product snapshot
  "items": [
    {
      "productId": ObjectId("prod789"),  // Reference for lookup
      "name": "MacBook Pro",             // Snapshot (won't change)
      "priceAtPurchase": 2499,           // Historical record
      "quantity": 1
    }
  ],
  
  "subtotal": 2499,
  "tax": 125,
  "total": 2624
}

Why This Works

FieldStrategyReason
customer._idReferenceCan lookup current customer
customer.nameSnapshotDisplay without $lookup
items[].productIdReferenceCan link to product page
items[].priceAtPurchaseSnapshotHistorical record (price may change)

Part E: Anti-Patterns to Avoid

12. Massive Arrays (Unbounded Growth)

// ❌ Bad: Array grows forever
{
  "blogId": ObjectId("..."),
  "comments": [
    // ... 50,000 comments -> Document too large!
  ]
}

// ✅ Good: Separate collection with pagination
db.comments.find({ blogId: ObjectId("...") })
           .sort({ date: -1 })
           .limit(20);

13. Deep Nesting

// ❌ Bad: Too deep (hard to query/update)
{
  "level1": {
    "level2": {
      "level3": {
        "level4": {
          "data": "buried treasure"
        }
      }
    }
  }
}

// ✅ Good: Keep nesting to 2-3 levels max
{
  "settings": {
    "notifications": {
      "email": true,
      "sms": false
    }
  }
}

14. Storing Lookup Tables as Embedded

// ❌ Bad: Category duplicated in every product
{ "name": "Laptop", "category": { "id": 1, "name": "Electronics" } }
{ "name": "Phone", "category": { "id": 1, "name": "Electronics" } }
// If category name changes -> update ALL products!

// ✅ Good: Reference for shared data
{ "name": "Laptop", "categoryId": ObjectId("cat1") }
{ "name": "Phone", "categoryId": ObjectId("cat1") }
// + Categories collection with single source of truth

Summary

Embedding vs Referencing

EmbedReference
Read performanceFast (single query)Slower ($lookup needed)
Write complexitySimple (atomic)Complex (multiple collections)
Data consistencyMay have duplicatesSingle source of truth
Document sizeCan grow largeStays small

Quick Decision Guide

RelationshipSizeUpdate FrequencyRecommendation
1:1AnyAnyEmbed
1:Few< 20LowEmbed
1:Many< 100LowEmbed
1:Many100+AnyReference
Many:ManyAnyAnyReference (arrays)

Golden Rules

  1. Embed what you read together
  2. Reference what changes independently
  3. Denormalize for read performance
  4. Keep documents under 16 MB
  5. Avoid unbounded array growth

💡 Practice Questions

Conceptual

  1. When should you embed data vs reference data in MongoDB?

  2. What is the 16 MB document size limit and why is it a concern for embedded arrays?

  3. Explain the concept of a “snapshot” in schema design. When would you store a snapshot instead of a reference?

Hands-on

// Design a schema for a blog application with:
// - Posts (title, content, author, createdAt)
// - Comments (user, text, createdAt) - assume ~50 comments per post average
// Should comments be embedded or referenced? Show your document structure.
💡 View Answer

For ~50 comments per post, embedding is reasonable:

// Posts collection
{
  "_id": ObjectId("post123"),
  "title": "MongoDB Schema Design",
  "content": "...",
  "author": {
    "_id": ObjectId("user789"),
    "name": "Alice"  // Snapshot for display
  },
  "createdAt": ISODate("2024-03-15"),
  "comments": [
    { "user": "Bob", "text": "Great post!", "createdAt": ISODate(...) },
    // ... up to ~50-100 comments
  ]
}

If comments grow to 1000+, move to separate collection:

// Comments collection
{ "_id": ObjectId(...), "postId": ObjectId("post123"), "user": "Bob", "text": "..." }

Scenario

  1. Anti-pattern: A developer embedded an array of “followers” inside each user document. A popular user has 1 million followers. What problems will occur?
💡 View Answer

Problems:

  1. 16 MB Hard Limit: 1 million ObjectId values = ~12 MB (each ObjectId is 12 bytes). Add other fields and you easily exceed the 16 MB limit, causing BSONObj size is invalid error — the document cannot be saved.

  2. Write Contention: Every new follower requires updating the same document, causing lock contention under high concurrency.

  3. Slow Operations: Loading/updating a 12+ MB array is extremely slow.

Solution: Use a separate collection for follower relationships:

// Followers collection (Reference pattern)
{ "_id": ObjectId(...), "userId": ObjectId("popular_user"), "followerId": ObjectId("fan123") }

Store only a follower count in the user document, use $inc for atomic updates.