Feb 25, 2025

SQL Server Infrastructure & High Availability: RAID, Backup & DR

sql-server database infrastructure high-availability backup

Prerequisites: Understanding of SQL Server basics. This article focuses on DBA/Operations topics.

When your database serves millions of users, you need reliability, redundancy, and recovery. This is the infrastructure layer.

Part A: RAID — Protecting Against Disk Failure

1. What is RAID?

RAID (Redundant Array of Independent Disks) = Multiple disks working together for speed and/or safety.

graph LR
    subgraph "Single Disk"
        D1[1 Disk = 1 Point of Failure]
    end
    
    subgraph "RAID Array"
        R1[Disk 1]
        R2[Disk 2]
        R3[Disk 3]
        R4[Disk 4]
    end
    
    style D1 fill:#e74c3c,color:#fff
    style R1 fill:#27ae60,color:#fff
    style R2 fill:#27ae60,color:#fff
    style R3 fill:#27ae60,color:#fff
    style R4 fill:#27ae60,color:#fff

2. RAID Levels Comparison

RAID	How it Works	Min Disks	Capacity	Speed	Fault Tolerance	Use Case
RAID 0	Striping only	2	100%	⚡⚡⚡ Fast	❌ None	Temp/scratch
RAID 1	Mirroring	2	50%	⚡ Normal	✅ 1 disk	OS drive
RAID 5	Striping + parity	3	n-1 disks	⚡⚡ Good	✅ 1 disk	General use
RAID 6	Striping + double parity	4	n-2 disks	⚡ Medium	✅ 2 disks	Critical data, NAS
RAID 10	Mirror + Stripe	4	50%	⚡⚡⚡ Fast	✅ 1 per mirror	SQL Server
RAID 50	RAID 5 + Striping	6	Varies	⚡⚡ Good	✅ 1 per RAID 5	Large storage arrays
RAID 60	RAID 6 + Striping	8	Varies	⚡ Medium	✅ 2 per RAID 6	Enterprise storage

3. RAID Visual Explanation

RAID 0 (Striping) — Speed, No Safety

graph LR
    DATA["Data: A B C D E F G H"]
    
    subgraph "Disk 1"
        D1[A, C, E, G]
    end
    
    subgraph "Disk 2"
        D2[B, D, F, H]
    end
    
    DATA --> D1
    DATA --> D2
    
    style D1 fill:#e74c3c,color:#fff
    style D2 fill:#e74c3c,color:#fff

✅ Read/Write speed: 2x (parallel access)
❌ Any disk fails = ALL DATA LOST

RAID 1 (Mirroring) — Safety, Less Space

graph LR
    DATA["Data: A B C D"]
    
    subgraph "Mirror"
        D1["Disk 1: A B C D"]
        D2["Disk 2: A B C D<br/>(Exact Copy)"]
    end
    
    DATA --> D1
    DATA --> D2
    D1 <-.->|Same Data| D2
    
    style D1 fill:#27ae60,color:#fff
    style D2 fill:#27ae60,color:#fff

✅ One disk fails? The other has everything
❌ Lose 50% capacity

RAID 5 (Striping + Parity) — Balanced

graph TD
    subgraph "RAID 5 Array"
        D1["Disk 1<br/>A1, B2, Cp"]
        D2["Disk 2<br/>A2, Bp, C1"]
        D3["Disk 3<br/>Ap, B1, C2"]
    end
    
    P["p = Parity<br/>(Can rebuild)"]
    
    D1 --- D2 --- D3
    
    style D1 fill:#f39c12,color:#fff
    style D2 fill:#f39c12,color:#fff
    style D3 fill:#f39c12,color:#fff

✅ Any ONE disk can fail, data survives
⚠️ Rebuild is slow and stresses remaining disks
⚠️ Write Penalty: Slower writes due to parity calculation. Not recommended for write-heavy OLTP systems.

RAID 6 (Double Parity) — Extra Protection

graph TD
    subgraph "RAID 6 Array (4+ disks)"
        D1["Disk 1<br/>A1, B2, Cp, Dq"]
        D2["Disk 2<br/>A2, Bp, Cq, D1"]
        D3["Disk 3<br/>Ap, Bq, C1, D2"]
        D4["Disk 4<br/>Aq, B1, C2, Dp"]
    end
    
    NOTE["p, q = Two Parity Blocks<br/>(Different algorithms)"]
    
    D1 --- D2 --- D3 --- D4
    
    style D1 fill:#9b59b6,color:#fff
    style D2 fill:#9b59b6,color:#fff
    style D3 fill:#9b59b6,color:#fff
    style D4 fill:#9b59b6,color:#fff

✅ Any TWO disks can fail simultaneously, data survives
⚠️ Write performance lower than RAID 5 (double parity calculation)
📦 Popular for NAS and backup storage

RAID 10 (Mirror + Stripe) — Best for SQL Server

graph TD
    subgraph "Mirror Set 1"
        D1[Disk 1<br/>A, C]
        D2[Disk 2<br/>A, C]
    end
    
    subgraph "Mirror Set 2"
        D3[Disk 3<br/>B, D]
        D4[Disk 4<br/>B, D]
    end
    
    D1 <-->|Mirror| D2
    D3 <-->|Mirror| D4
    
    STRIPE[Striped Data]
    D1 --> STRIPE
    D3 --> STRIPE
    
    style D1 fill:#27ae60,color:#fff
    style D2 fill:#27ae60,color:#fff
    style D3 fill:#3498db,color:#fff
    style D4 fill:#3498db,color:#fff
    style STRIPE fill:#9b59b6,color:#fff

RAID 10 Advantages:

✅ Fast reads AND writes
✅ Can lose 1 disk per mirror
❌ Need minimum 4 disks, 50% capacity

Microsoft recommends RAID 10 for SQL Server data files

RAID 10 Architecture

4. SQL Server & RAID Best Practices

File Type	Recommended RAID	Reason
Data files (.mdf)	RAID 10	Heavy random I/O
Log files (.ldf)	RAID 1 or RAID 10	Sequential writes, critical
TempDB	RAID 10 or RAID 0*	High I/O, can recreate
Backups	RAID 5 or RAID 6	Large sequential reads

*RAID 0 only if tempdb can be recreated on restart

☁️ Cloud Context: In cloud environments (AWS EBS, Azure Managed Disk), RAID is typically handled at the infrastructure layer. You focus on selecting IOPS and Throughput tiers instead of managing disk arrays directly.

Part B: Backup Strategies

5. Why Backup?

RAID protects against disk failure, but NOT against:

❌ Accidental DELETE
❌ Ransomware encryption
❌ Application bugs corrupting data
❌ Natural disasters

Backup = Your last line of defense

6. Backup Types

graph LR
    subgraph "Backup Types"
        FULL[Full Backup]
        DIFF[Differential]
        LOG[Transaction Log]
    end
    
    FULL -->|Daily| DIFF
    DIFF -->|Hourly| LOG
    LOG -->|Every 15 min| LOG
    
    style FULL fill:#e74c3c,color:#fff
    style DIFF fill:#f39c12,color:#fff
    style LOG fill:#27ae60,color:#fff

Type	Contains	Size	Restore Need
Full	Entire database	Large	Just this backup
Differential	Changes since last Full	Medium	Full + this Diff
Transaction Log	Changes since last Log	Small	Full + all Logs

7. Backup Commands

-- 0. Integrity Check (Critical!)
-- Check for corruption BEFORE backing up
DBCC CHECKDB(MyDB) WITH NO_INFOMSGS;

-- 1. Full backup:
BACKUP DATABASE MyDB 
TO DISK = 'D:\Backups\MyDB_Full.bak'
WITH COMPRESSION, INIT, CHECKSUM;
-- CHECKSUM ensures page integrity during backup

-- 2. Differential backup:
BACKUP DATABASE MyDB 
TO DISK = 'D:\Backups\MyDB_Diff.bak'
WITH DIFFERENTIAL, COMPRESSION, CHECKSUM;

-- 3. Transaction log backup:
-- IMPORTANT: This truncates the log, allowing space reuse
BACKUP LOG MyDB 
TO DISK = 'D:\Backups\MyDB_Log.trn'
WITH COMPRESSION;

⚠️ Why DBCC CHECKDB First?

If you backup a corrupted database, your backup is also corrupted. Run DBCC CHECKDB regularly (weekly minimum) to catch corruption early.

📋 Log Truncation Explained

In Full Recovery Model, the transaction log grows forever until you back it up. BACKUP LOG marks old log records as reusable (“truncates” the log). This is why Full Recovery requires regular log backups — or your disk fills up!

8. Common Backup Strategies

Backup Strategies Overview

Simple Strategy (Small DBs)

Sunday: Full
Monday-Saturday: Full (daily)

Standard Strategy (Most DBs)

Sunday 12AM: Full
Mon-Sat 12AM: Differential
Every hour: Transaction Log

Enterprise Strategy (Mission-Critical)

Sunday 12AM: Full
Daily 12AM: Differential
Every 15 minutes: Transaction Log
Copy to offsite storage: Daily

9. Recovery Models

Model	Log Backup Possible?	Point-in-Time Restore?	Use Case
Simple	❌ No	❌ No	Dev/Test
Full	✅ Yes	✅ Yes	Production
Bulk-Logged	✅ Yes	⚠️ Limited	ETL periods

-- Check recovery model:
SELECT name, recovery_model_desc FROM sys.databases;

-- Change recovery model:
ALTER DATABASE MyDB SET RECOVERY FULL;

10. Point-in-Time Recovery

-- Restore to specific time (requires FULL recovery model):
RESTORE DATABASE MyDB FROM DISK = 'MyDB_Full.bak' WITH NORECOVERY;
RESTORE LOG MyDB FROM DISK = 'MyDB_Log1.trn' WITH NORECOVERY;
RESTORE LOG MyDB FROM DISK = 'MyDB_Log2.trn' 
WITH STOPAT = '2024-03-15 14:30:00', RECOVERY;

Part C: Replication — Keeping Copies in Sync

11. Types of Replication

graph LR
    subgraph "Replication Types"
        SNAP[Snapshot]
        TRANS[Transactional]
        MERGE[Merge]
    end
    
    PUB[Publisher] --> SNAP --> SUB1[Subscriber]
    PUB --> TRANS --> SUB2[Subscriber]
    PUB --> MERGE --> SUB3[Subscriber]

Type	How it Works	Latency	Use Case
Snapshot	Periodic full copy	High (minutes-hours)	Reporting, rarely changed
Transactional	Stream changes in near-real-time	Low (seconds)	Read replicas
Merge	Bi-directional sync	Medium	Mobile/disconnected

12. Transactional Replication Flow

sequenceDiagram
    participant P as Publisher<br/>(Main DB)
    participant D as Distributor<br/>(Queue)
    participant S as Subscriber<br/>(Copy)
    
    P->>D: 1. Log Reader reads transaction log
    D-->>D: 2. Store in distribution DB
    D->>S: 3. Distribution Agent pushes changes
    S-->>S: Apply changes

Part D: Always On Availability Groups — Enterprise HA

13. What is Always On AG?

Always On Availability Groups = Multiple database replicas with automatic failover.

Always On Availability Group Architecture

graph TD
    subgraph "Primary Replica"
        P[SQL Server 1<br/>Read/Write]
    end
    
    subgraph "Secondary Replicas"
        S1[SQL Server 2<br/>Read-Only]
        S2[SQL Server 3<br/>Read-Only]
    end
    
    P -->|Synchronous| S1
    P -->|Asynchronous| S2
    
    L[Listener: ag-listener.company.com]
    L --> P
    
    style P fill:#27ae60,color:#fff
    style S1 fill:#3498db,color:#fff
    style S2 fill:#3498db,color:#fff

14. Sync vs Async Mode

Mode	Commit Behavior	Data Loss on Failover	Latency	Use Case
Synchronous	Wait for secondary to confirm	❌ None (zero data loss)	Higher	Same datacenter
Asynchronous	Don’t wait	⚠️ Possible	Lower	DR site (different city)

15. Failover Types

Type	Trigger	Downtime
Automatic	Primary failure detected	Seconds (30-60s typical)
Manual Planned	DBA initiates	Seconds
Forced	Primary unreachable	Potential data loss

15.1 Always On AG vs Failover Cluster Instance (FCI)

Feature	Always On AG	Failover Cluster Instance
Architecture	Shared-Nothing	Shared-Storage
Data Storage	Each node has own copy	Single shared disk
Storage Cost	Higher (N copies)	Lower (1 copy)
Failover Scope	Per-database	Entire instance
Read from Secondary	✅ Yes	❌ No
Requires SAN	❌ No	✅ Yes

When to choose?

AG: Modern choice. Better for DR across datacenters, readable secondaries, per-database control.

FCI: Legacy or when shared storage (SAN) already exists. Simpler but less flexible.

Part E: Disaster Recovery (DR)

16. Key Metrics: RTO and RPO

Metric	Meaning	Question
RTO	Recovery Time Objective	How fast must we recover?
RPO	Recovery Point Objective	How much data can we lose?

Scenario	RTO	RPO	Solution
E-commerce	1 hour	5 min	Always On + Log shipping
Banking	0	0	Synchronous Always On
Archive system	24 hours	1 day	Daily backup to cloud

17. DR Strategy Tiers

graph TD
    subgraph "Tier 1: Backup Only"
        T1["Backup to tape/cloud<br/>RTO: Hours-Days | RPO: Last backup"]
    end
    
    subgraph "Tier 2: Warm Standby"
        T2["Log shipping to secondary<br/>RTO: 1-4 hours | RPO: 15 min"]
    end
    
    subgraph "Tier 3: Hot Standby"
        T3["Always On AG (async)<br/>RTO: Minutes | RPO: Near-zero"]
    end
    
    subgraph "Tier 4: Active-Active"
        T4["Distributed AG<br/>RTO: Seconds | RPO: Zero"]
    end
    
    T1 --> T2 --> T3 --> T4
    
    style T1 fill:#e74c3c,color:#fff
    style T2 fill:#f39c12,color:#fff
    style T3 fill:#3498db,color:#fff
    style T4 fill:#27ae60,color:#fff

18. Log Shipping — Simple DR

-- On Primary:
BACKUP LOG MyDB TO DISK = '\\FileShare\Logs\MyDB_Log.trn';

-- Copy job runs every 15 minutes to secondary

-- On Secondary:
RESTORE LOG MyDB FROM DISK = '\\FileShare\Logs\MyDB_Log.trn' 
WITH STANDBY = 'MyDB_Undo.ldf';

Summary: When to Use What

Technology Selection

Requirement	Solution
Protect against disk failure	RAID 10
Recover from accidental DELETE	Backup + Point-in-time restore
Read replicas for reporting	Transactional replication or AG
Automatic failover (same DC)	Always On AG (synchronous)
DR site in another city	Always On AG (async) or Log Shipping
Zero data loss requirement	Synchronous Always On + SAN replication

Cost vs Protection

Level	Components	Cost
Basic	RAID + Daily backup	$
Standard	+ Differential + Log backups	$$
Professional	+ Replication or Log Shipping	$$$
Enterprise	+ Always On AG + DR site	$$$$
Mission Critical	+ Active-Active + SAN replication	$$$$$

Quick Reference: Backup Strategy

Small DB (< 10 GB):
  └── Full backup daily

Medium DB (10-100 GB):
  └── Full weekly + Diff daily + Log hourly

Large DB (100 GB+):
  └── Full weekly + Diff daily + Log every 15 min

Critical DB:
  └── Full weekly + Diff daily + Log every 5 min + Offsite copy

💡 Practice Questions

Conceptual

Compare RAID 5, RAID 6, and RAID 10. Which would you use for SQL Server data files and why?
Explain the difference between Full, Differential, and Transaction Log backups. How do they work together?
What is the difference between Replication and Always On Availability Groups?
Define RPO and RTO. How do they influence your backup/DR strategy?

Hands-on

-- Design a backup strategy for a 50GB database with RPO of 15 minutes and RTO of 1 hour.
-- List: backup types, frequencies, and retention period.

💡 View Answer

Backup Strategy:

Full backup: Weekly (Sunday night)
Differential backup: Daily (every night)
Transaction log backup: Every 15 minutes

Retention:

Keep 4 weeks of full backups
Keep 7 days of differential backups
Keep 24 hours of log backups

Recovery scenario (RTO < 1 hour):

Restore latest full backup (~20 min for 50GB)
Restore latest differential (~10 min)
Restore transaction logs up to failure point (~10 min)
Total: ~40 minutes ✅

Scenario

Disaster Planning: Your company’s main datacenter loses power. You have an Always On Availability Group with a secondary in another city. Walk through the failover process.
Capacity Planning: A database is growing 10GB per month. Current RAID 10 array is 80% full. What would you recommend?

← Previous SQL Server Advanced Features: Triggers, Temporal Tables & More

Next → SQL Server Troubleshooting: Connection Issues & Debugging