NoSQL Introduction: Types, CAP Theorem & Storage Engines
Prerequisites: Understanding of relational databases. See DB 02 Software Layer for SQL fundamentals.
NoSQL databases were created to solve problems that traditional SQL databases struggle with: massive scale, flexible schemas, and distributed computing.
Part A: The Four Types of NoSQL
1. Document Database
Structure: JSON-like documents (nested, flexible)
{
"_id": "user123",
"name": "Alice",
"orders": [
{ "item": "Laptop", "price": 1200 },
{ "item": "Mouse", "price": 25 }
]
}
| Product | Use Case |
|---|---|
| MongoDB | Content management, e-commerce catalogs |
| Couchbase | Mobile apps with offline sync |
Why use it: When your data has variable structure (products with different attributes).
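To make the "variable structure" point concrete, here is a minimal sketch in plain Python rather than a real MongoDB client. The `products` list and the `find` helper are hypothetical names used only for illustration; they show how documents in one collection can carry different fields and still be queried uniformly.

```python
# Hypothetical in-memory "collection": each document is a dict, and
# documents in the same collection may carry different fields.
products = [
    {"_id": "p1", "type": "laptop", "name": "Laptop", "price": 1200, "ram_gb": 16},
    {"_id": "p2", "type": "mouse", "name": "Mouse", "price": 25, "wireless": True},
    {"_id": "p3", "type": "book", "name": "SQL Basics", "price": 30, "pages": 410},
]

def find(collection, **criteria):
    """Return documents whose fields equal every given criterion.

    A document that lacks a queried field simply does not match
    (for non-None criteria); no schema change is needed.
    """
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

wireless_items = find(products, wireless=True)
print([doc["name"] for doc in wireless_items])  # ['Mouse']
```

Only the mouse has a `wireless` field, and the other documents are skipped without any error or migration.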
2. Key-Value Store
Structure: Simple dictionary — key → value
session:abc123 → { userId: 1, expires: "2024-03-15" }
cache:product:99 → { name: "Laptop", price: 1200 }
| Product | Use Case |
|---|---|
| Redis | Caching, session storage, real-time leaderboards |
| DynamoDB | Serverless apps, high-throughput workloads |
Why use it: Blazing fast reads/writes for simple lookups.
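The lookup-by-key model can be sketched as a dictionary with optional expiry. This is a toy illustration, not the Redis or DynamoDB API; `KeyValueStore`, `ttl_seconds`, and the injectable `clock` are assumptions made for the example.

```python
import time

class KeyValueStore:
    """Toy key-value store with optional per-key expiry (hypothetical API)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expires_at or None)
        self._clock = clock  # injectable so expiry can be tested

    def set(self, key, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries on read
            return default
        return value

store = KeyValueStore()
store.set("session:abc123", {"userId": 1}, ttl_seconds=1800)
store.set("cache:product:99", {"name": "Laptop", "price": 1200})
print(store.get("session:abc123"))
```

Every operation is a single hash lookup on the key, which is why this model suits sessions and caches but not queries over the values themselves.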
3. Column-Family Store
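Structure: Data organized by columns and column families rather than fixed rows; each row is located by its key and may hold a different (sparse) set of columns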
graph LR
subgraph "Row Store (SQL)"
R1["Row 1: ID, Name, Age, City"]
R2["Row 2: ID, Name, Age, City"]
end
subgraph "Column Store (NoSQL)"
C1["Column: All IDs"]
C2["Column: All Names"]
C3["Column: All Ages"]
end
style R1 fill:#e74c3c,color:#fff
style R2 fill:#e74c3c,color:#fff
style C1 fill:#27ae60,color:#fff
style C2 fill:#27ae60,color:#fff
style C3 fill:#27ae60,color:#fff
| Product | Use Case |
|---|---|
| Cassandra | Time-series data, IoT sensor logs, high-velocity writes |
| HBase | Hadoop ecosystem, log data ingestion |
Why use it: Excellent write throughput and handling of sparse data. (Note: For pure analytical aggregations, consider Column-Oriented DBs like ClickHouse.)
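The "sparse data" point can be illustrated with a nested mapping of row key, column family, and column. This is a simplified sketch of the data model only, not how Cassandra or HBase lay out data on disk; `table`, `put`, and `get_family` are hypothetical names.

```python
from collections import defaultdict

# row key -> column family -> {column name: value}
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, column, value):
    """Write one cell; rows and columns are created on demand."""
    table[row_key][family][column] = value

def get_family(row_key, family):
    """Read one column family of one row; absent cells take no space."""
    if row_key not in table:
        return {}
    return dict(table[row_key].get(family, {}))

put("sensor-1", "readings", "2024-03-15T10:00", 21.5)
put("sensor-1", "readings", "2024-03-15T10:01", 21.7)
put("sensor-1", "meta", "location", "roof")
put("sensor-2", "readings", "2024-03-15T10:00", 19.0)  # no "meta" family at all

print(get_family("sensor-1", "readings"))
print(get_family("sensor-2", "meta"))  # {}
```

Each write touches only the cells it names, so a row with two columns and a row with two thousand can live in the same table.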
4. Graph Database
Structure: Nodes (entities) + Edges (relationships)
graph LR
A[Alice] -->|FRIENDS_WITH| B[Bob]
B -->|WORKS_AT| C[Google]
A -->|LIKES| D[Coffee]
B -->|LIKES| D
style A fill:#3498db,color:#fff
style B fill:#3498db,color:#fff
style C fill:#27ae60,color:#fff
style D fill:#f39c12,color:#fff
| Product | Use Case |
|---|---|
| Neo4j | Social networks, fraud detection |
| Amazon Neptune | Knowledge graphs, recommendation engines |
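Why use it: Following a relationship costs roughly constant time per hop in native graph engines such as Neo4j, thanks to index-free adjacency: each node stores direct references to its neighbors, so a traversal does not need an index lookup at every step the way a chain of SQL JOINs typically does. The total cost still grows with the number of relationships visited.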
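The graph in the diagram can be sketched with adjacency sets, where each node holds its own neighbors. This is an illustration in plain Python, not a Neo4j or Neptune query; `edges`, `relate`, and `friends_who_like` are hypothetical names.

```python
from collections import defaultdict

# node -> relationship type -> set of neighbor nodes
edges = defaultdict(lambda: defaultdict(set))

def relate(source, rel_type, target):
    """Add a directed, typed edge from source to target."""
    edges[source][rel_type].add(target)

relate("Alice", "FRIENDS_WITH", "Bob")
relate("Bob", "WORKS_AT", "Google")
relate("Alice", "LIKES", "Coffee")
relate("Bob", "LIKES", "Coffee")

def friends_who_like(person, thing):
    """Follow FRIENDS_WITH edges, keep friends that have a LIKES edge to thing."""
    return sorted(
        friend for friend in edges[person]["FRIENDS_WITH"]
        if thing in edges[friend]["LIKES"]
    )

print(friends_who_like("Alice", "Coffee"))  # ['Bob']
```

The query only visits Alice's own neighbors, so its cost depends on how many friends she has and not on the total size of the graph.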
5. Comparison Summary
| Type | Data Model | Best For | Example Query |
|---|---|---|---|
| Document | JSON objects | Flexible schemas | "Get user with all their orders" |
| Key-Value | Key → Value | Caching | "Get session by ID" |
| Column-Family | Column families | High-volume writes, sparse data | "Get all readings for one sensor from the last hour" |
| Graph | Nodes + Edges | Relationships | "Friends who also bought X" |
Part B: CAP Theorem
6. The Impossible Triangle
In a distributed system, you can guarantee at most two of these three properties at the same time: Consistency, Availability, and Partition Tolerance.
💡 Modern Understanding of CAP
In reality, P (Partition Tolerance) is non-negotiable — networks WILL fail. So when a partition occurs, you must choose between C (Consistency) and A (Availability). “Pick 2” is a simplification; the real choice is C vs A during network failures.
graph TD
subgraph "CAP Theorem"
C[Consistency<br/>All nodes see same data]
A[Availability<br/>Every request gets response]
P[Partition Tolerance<br/>System works despite network splits]
end
C --- A
A --- P
P --- C
style C fill:#3498db,color:#fff
style A fill:#27ae60,color:#fff
style P fill:#e74c3c,color:#fff
| Combination | Sacrifice | Example |
|---|---|---|
| CA | Partition Tolerance | Traditional SQL (single server) |
| CP | Availability | MongoDB (with strict write concern) |
| AP | Consistency | Cassandra, DynamoDB (with default, eventually consistent settings) |
7. Real-World Example
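Scenario: A network partition splits your 3 MongoDB servers into two groups, cutting the primary off from the other two. The diagram contrasts the two possible reactions; a MongoDB replica set takes the CP branch, because a primary that cannot reach a majority steps down and stops accepting writes.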
sequenceDiagram
participant Client
participant Server1 as Server 1 (Primary)
participant Server2 as Server 2
participant Server3 as Server 3
Note over Server1,Server3: Network Partition!
rect rgb(255, 200, 200)
Server1--xServer2: Cannot reach
Server1--xServer3: Cannot reach
end
Client->>Server1: Write order
alt CP Mode (Consistency Priority)
Server1-->>Client: ERROR - Cannot confirm write
Note right of Client: Available = NO
else AP Mode (Availability Priority)
Server1-->>Client: OK - Written locally
Note right of Client: Consistent = NO (other servers outdated)
end
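The two branches of the diagram can be sketched as a single decision function. This is a simplified model under stated assumptions, not MongoDB's or Cassandra's replication protocol; `handle_write`, `WriteRejected`, and the parameters are hypothetical.

```python
class WriteRejected(Exception):
    """Raised when a CP-mode node cannot reach a majority of the cluster."""

def handle_write(mode, cluster_size, reachable_peers, local_log, record):
    """Decide what a node does with a write during a partition.

    mode: "CP" refuses writes without a majority; "AP" accepts them locally.
    reachable_peers: number of other nodes this node can currently contact.
    """
    majority = cluster_size // 2 + 1
    nodes_in_touch = reachable_peers + 1  # count this node itself
    if mode == "CP" and nodes_in_touch < majority:
        raise WriteRejected("cannot confirm write: no majority")
    local_log.append(record)  # in AP mode this may diverge from other nodes
    return "OK"

log = []
print(handle_write("AP", 3, 0, log, "order-1"))  # OK, but only stored locally
try:
    handle_write("CP", 3, 0, log, "order-2")
except WriteRejected as err:
    print("ERROR -", err)
```

With one reachable peer the same CP node would accept the write, since two of three nodes form a majority.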
Part C: BASE vs ACID
8. ACID (SQL Databases)
| Property | Meaning | Example |
|---|---|---|
| Atomicity | All or nothing | Bank transfer: both debit and credit succeed, or neither |
| Consistency | Valid state → Valid state | Total money in the system stays the same |
| Isolation | Concurrent transactions don’t interfere | Two users can’t both buy the last item |
| Durability | Once committed, permanent | Survives power failure |
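The bank-transfer example for atomicity can be tried with Python's built-in `sqlite3` module. This is a minimal sketch; the `accounts` table, its `CHECK` constraint, and the `transfer` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money in one transaction; any failure rolls back both updates."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return {name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

transfer(conn, "alice", "bob", 30)   # succeeds
transfer(conn, "alice", "bob", 500)  # debit violates CHECK, so the credit is undone too
print(balances(conn))                # {'alice': 70, 'bob': 80}
```

The second transfer credits Bob first and then fails on the debit, yet Bob's balance is unchanged afterwards. The total of 150 is also preserved, which is the consistency row of the table.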
9. BASE (NoSQL Databases)
| Property | Meaning |
|---|---|
| Basically Available | System always responds (maybe stale data) |
| Soft state | Data may change over time (syncing) |
| Eventual consistency | Given time, all nodes will agree |
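Eventual consistency can be sketched with two replicas that reconcile using a last-write-wins rule. This is one possible convergence strategy chosen for illustration, not the mechanism of any particular database; `Replica` and `anti_entropy` are hypothetical names.

```python
class Replica:
    """Toy replica that stores (version, value) per key."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)  # highest version wins

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

def anti_entropy(replicas):
    """Push every known entry to every replica so all keep the highest version."""
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, version)

a, b = Replica(), Replica()
a.write("likes:post1", 10, version=1)
b.write("likes:post1", 10, version=1)
a.write("likes:post1", 11, version=2)  # b has not seen this update yet
print(b.read("likes:post1"))           # 10 (stale read, the "soft state")
anti_entropy([a, b])
print(b.read("likes:post1"))           # 11 (replicas have converged)
```

Between the write and the sync, both replicas answer requests, but one of them returns an outdated value.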
10. Comparison
graph LR
subgraph "ACID (SQL)"
A1[Strong Consistency]
A2[Immediate]
A3[Slower writes]
end
subgraph "BASE (NoSQL)"
B1[Eventual Consistency]
B2[Faster at scale]
B3[May read stale data]
end
A1 -->|Trade-off| B2
style A1 fill:#e74c3c,color:#fff
style B1 fill:#27ae60,color:#fff
| | ACID | BASE |
|---|---|---|
| Priority | Correctness | Availability |
| Scale | Harder to scale out | Built for scale out |
| Use Case | Banking, inventory | Social feeds, analytics |
Part D: MongoDB Storage Engine (WiredTiger)
11. What is WiredTiger?
WiredTiger is MongoDB’s default storage engine since version 3.2. Think of it as the “V8 engine” that powers MongoDB.
graph TD
subgraph "MongoDB Architecture"
APP[Application] --> DRIVER[MongoDB Driver]
DRIVER --> QUERY[Query Engine]
QUERY --> STORAGE[WiredTiger Engine]
STORAGE --> DISK[Disk Storage]
end
style STORAGE fill:#27ae60,color:#fff
12. Document-Level Locking
The Problem: Early MongoDB used database-level locking — if one user writes, the entire database is locked.
WiredTiger’s Solution: Document-level locking — only the specific document being modified is locked.
graph LR
subgraph "Database-Level Lock (Old)"
DB1[Entire Database LOCKED]
U1[User 1 writes Doc A] --> DB1
U2[User 2 wants Doc B] -->|WAIT| DB1
end
subgraph "Document-Level Lock (WiredTiger)"
D1[Doc A LOCKED]
D2[Doc B FREE]
U3[User 1 writes Doc A] --> D1
U4[User 2 writes Doc B] --> D2
end
style DB1 fill:#e74c3c,color:#fff
style D1 fill:#f39c12,color:#fff
style D2 fill:#27ae60,color:#fff
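The granularity difference in the diagram can be sketched with one lock per document. This only illustrates the idea of fine-grained locking; WiredTiger itself relies mostly on optimistic concurrency control rather than a table of mutexes, and `DocumentLocks` is a hypothetical class.

```python
import threading

class DocumentLocks:
    """Hands out one lock per document ID instead of one lock for everything."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def lock_for(self, doc_id):
        with self._guard:
            return self._locks.setdefault(doc_id, threading.Lock())

locks = DocumentLocks()
lock_a = locks.lock_for("docA")
lock_b = locks.lock_for("docB")

lock_a.acquire()                             # user 1 is writing Doc A
b_acquired = lock_b.acquire(blocking=False)  # user 2 writes Doc B: not blocked
a_acquired = lock_a.acquire(blocking=False)  # a second writer on Doc A must wait
print(b_acquired, a_acquired)                # True False

lock_b.release()
lock_a.release()
```

With a single database-wide lock, the write to Doc B would have been refused as well.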
💡 MVCC: Why Reads Don’t Block Writes
WiredTiger uses MVCC (Multi-Version Concurrency Control): readers see the old version while writers create a new version. This is why reads and writes don’t block each other — true non-blocking concurrency.
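A toy version store shows the idea: each write appends a new version, and a reader keeps seeing the version that was current when its snapshot was taken. This is a conceptual sketch, not WiredTiger's implementation; `VersionedStore` and its methods are hypothetical.

```python
class VersionedStore:
    """Toy MVCC store: each write appends a new version instead of overwriting."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0

    def write(self, key, value):
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        """Return a timestamp; reads at this timestamp ignore later writes."""
        return self._clock

    def read(self, key, snapshot_ts):
        visible = [v for ts, v in self._versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

mvcc = VersionedStore()
mvcc.write("docA", {"stock": 5})
reader_ts = mvcc.snapshot()                # a reader starts here
mvcc.write("docA", {"stock": 4})           # a writer commits afterwards
print(mvcc.read("docA", reader_ts))        # {'stock': 5}
print(mvcc.read("docA", mvcc.snapshot()))  # {'stock': 4}
```

The reader never waits for the writer and the writer never waits for the reader, because they work on different versions.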
13. Compression
WiredTiger compresses data on disk:
| Compression | CPU Usage | Typical Space Savings (varies with the data) |
|---|---|---|
| snappy (default) | Low | ~50% |
| zlib | Medium | ~70% |
| zstd | Low-Medium | ~60% |
// Check current engine
db.serverStatus().storageEngine
// { "name": "wiredTiger", ... }
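The CPU-versus-space trade-off in the table can be explored with Python's standard `zlib` module. This is a rough experiment on made-up documents, so the ratios it prints are not the figures from the table and will differ from what WiredTiger achieves on real data.

```python
import json
import zlib

# Hypothetical, highly repetitive documents (field names repeat in every one).
docs = [
    {"_id": i, "name": "Laptop", "category": "electronics", "price": 1200 + i}
    for i in range(1000)
]
raw = json.dumps(docs).encode("utf-8")

fast = zlib.compress(raw, level=1)   # less CPU, usually larger output
small = zlib.compress(raw, level=9)  # more CPU, usually smaller output

for label, blob in (("level 1", fast), ("level 9", small)):
    saving = 1 - len(blob) / len(raw)
    print(f"{label}: {len(blob)} bytes, {saving:.0%} smaller than {len(raw)} bytes")

# Compression is lossless: the original bytes come back exactly.
restored = zlib.decompress(small)
```

Repeated field names are a large part of why document data tends to compress well.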
14. Journaling & Checkpoints
| Feature | Purpose |
|---|---|
| Journal | Write-ahead log (WAL) for crash recovery |
| Checkpoint | Periodic flush of a consistent snapshot to disk (every 60 s, or after 2 GB of journal data in older versions) |
This is similar to SQL Server’s transaction log!
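How a journal and checkpoints work together can be sketched as follows. This is a simplified model in which Python lists stand in for files on disk; `JournaledStore` and its methods are hypothetical and do not reflect WiredTiger's file formats.

```python
import json

class JournaledStore:
    """Toy store: every write is appended to a journal before it is applied."""

    def __init__(self, journal=None):
        self.journal = journal if journal is not None else []  # stands in for a log file
        self.data = {}          # in-memory state, lost on a crash
        self.checkpointed = {}  # stands in for the data files on disk

    def put(self, key, value):
        self.journal.append(json.dumps({"key": key, "value": value}))  # 1. log first
        self.data[key] = value                                          # 2. then apply

    def checkpoint(self):
        """Persist the current state, after which the journal can be truncated."""
        self.checkpointed = dict(self.data)
        self.journal.clear()

    @classmethod
    def recover(cls, checkpointed, journal):
        """Rebuild state after a crash: last checkpoint plus journal replay."""
        store = cls(journal=list(journal))
        store.checkpointed = dict(checkpointed)
        store.data = dict(checkpointed)
        for line in journal:
            entry = json.loads(line)
            store.data[entry["key"]] = entry["value"]
        return store

db = JournaledStore()
db.put("order:1", "paid")
db.checkpoint()
db.put("order:2", "pending")  # exists only in memory and in the journal
recovered = JournaledStore.recover(db.checkpointed, db.journal)
print(recovered.data)
```

The write made after the last checkpoint survives the simulated crash because it was journaled before being applied.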
Summary
NoSQL Type Selection Guide
| Need | Choose |
|---|---|
| Flexible product catalog | Document (MongoDB) |
| Super-fast caching | Key-Value (Redis) |
| Time-series / IoT data | Column (Cassandra) |
| Social network relationships | Graph (Neo4j) |
CAP/BASE Quick Reference
| Concept | Meaning |
|---|---|
| CAP | Often summarized as "pick 2" of Consistency, Availability, Partition Tolerance; in practice, choose C or A when a partition occurs |
| CP | Strong consistency, sacrifice availability during partition |
| AP | Always available, sacrifice consistency during partition |
| BASE | Eventual consistency model for distributed systems |
WiredTiger Benefits
- ✅ Document-level locking (high concurrency)
- ✅ Compression (50-70% space savings)
- ✅ Journaling (crash recovery)
💡 Practice Questions
Conceptual
- Name the 4 types of NoSQL databases and give one use case for each.
- Explain the CAP theorem. Why can’t a distributed system have all three properties?
- What is the difference between ACID and BASE consistency models?
- Describe two features of MongoDB’s WiredTiger storage engine.
Scenario
- Decision: Your company needs to build a social network. Which NoSQL database type would you recommend for storing friend relationships, and why?
- Trade-off: An e-commerce site needs both fast reads and strong consistency for inventory counts. According to CAP theorem, what challenges might you face with a distributed database?