Hero image for SQL Server Infrastructure & High Availability: RAID, Backup & DR

SQL Server Infrastructure & High Availability: RAID, Backup & DR

sql-server database infrastructure high-availability backup

Prerequisites: Understanding of SQL Server basics. This article focuses on DBA/Operations topics.

When your database serves millions of users, you need reliability, redundancy, and recovery. This is the infrastructure layer.


Part A: RAID — Protecting Against Disk Failure

1. What is RAID?

RAID (Redundant Array of Independent Disks) = Multiple disks working together for speed and/or safety.

graph LR
    subgraph "Single Disk"
        D1[1 Disk = 1 Point of Failure]
    end
    
    subgraph "RAID Array"
        R1[Disk 1]
        R2[Disk 2]
        R3[Disk 3]
        R4[Disk 4]
    end
    
    style D1 fill:#e74c3c,color:#fff
    style R1 fill:#27ae60,color:#fff
    style R2 fill:#27ae60,color:#fff
    style R3 fill:#27ae60,color:#fff
    style R4 fill:#27ae60,color:#fff

2. RAID Levels Comparison

RAIDHow it WorksMin DisksCapacitySpeedFault ToleranceUse Case
RAID 0Striping only2100%⚡⚡⚡ Fast❌ NoneTemp/scratch
RAID 1Mirroring250%⚡ Normal✅ 1 diskOS drive
RAID 5Striping + parity3n-1 disks⚡⚡ Good✅ 1 diskGeneral use
RAID 6Striping + double parity4n-2 disks⚡ Medium✅ 2 disksCritical data, NAS
RAID 10Mirror + Stripe450%⚡⚡⚡ Fast✅ 1 per mirrorSQL Server
RAID 50RAID 5 + Striping6Varies⚡⚡ Good✅ 1 per RAID 5Large storage arrays
RAID 60RAID 6 + Striping8Varies⚡ Medium✅ 2 per RAID 6Enterprise storage

3. RAID Visual Explanation

RAID 0 (Striping) — Speed, No Safety

graph LR
    DATA["Data: A B C D E F G H"]
    
    subgraph "Disk 1"
        D1[A, C, E, G]
    end
    
    subgraph "Disk 2"
        D2[B, D, F, H]
    end
    
    DATA --> D1
    DATA --> D2
    
    style D1 fill:#e74c3c,color:#fff
    style D2 fill:#e74c3c,color:#fff
  • ✅ Read/Write speed: 2x (parallel access)
  • ❌ Any disk fails = ALL DATA LOST

RAID 1 (Mirroring) — Safety, Less Space

graph LR
    DATA["Data: A B C D"]
    
    subgraph "Mirror"
        D1["Disk 1: A B C D"]
        D2["Disk 2: A B C D<br/>(Exact Copy)"]
    end
    
    DATA --> D1
    DATA --> D2
    D1 <-.->|Same Data| D2
    
    style D1 fill:#27ae60,color:#fff
    style D2 fill:#27ae60,color:#fff
  • ✅ One disk fails? The other has everything
  • ❌ Lose 50% capacity

RAID 5 (Striping + Parity) — Balanced

graph TD
    subgraph "RAID 5 Array"
        D1["Disk 1<br/>A1, B2, Cp"]
        D2["Disk 2<br/>A2, Bp, C1"]
        D3["Disk 3<br/>Ap, B1, C2"]
    end
    
    P["p = Parity<br/>(Can rebuild)"]
    
    D1 --- D2 --- D3
    
    style D1 fill:#f39c12,color:#fff
    style D2 fill:#f39c12,color:#fff
    style D3 fill:#f39c12,color:#fff
  • ✅ Any ONE disk can fail, data survives
  • ⚠️ Rebuild is slow and stresses remaining disks
  • ⚠️ Write Penalty: Slower writes due to parity calculation. Not recommended for write-heavy OLTP systems.

RAID 6 (Double Parity) — Extra Protection

graph TD
    subgraph "RAID 6 Array (4+ disks)"
        D1["Disk 1<br/>A1, B2, Cp, Dq"]
        D2["Disk 2<br/>A2, Bp, Cq, D1"]
        D3["Disk 3<br/>Ap, Bq, C1, D2"]
        D4["Disk 4<br/>Aq, B1, C2, Dp"]
    end
    
    NOTE["p, q = Two Parity Blocks<br/>(Different algorithms)"]
    
    D1 --- D2 --- D3 --- D4
    
    style D1 fill:#9b59b6,color:#fff
    style D2 fill:#9b59b6,color:#fff
    style D3 fill:#9b59b6,color:#fff
    style D4 fill:#9b59b6,color:#fff
  • ✅ Any TWO disks can fail simultaneously, data survives
  • ⚠️ Write performance lower than RAID 5 (double parity calculation)
  • 📦 Popular for NAS and backup storage

RAID 10 (Mirror + Stripe) — Best for SQL Server

graph TD
    subgraph "Mirror Set 1"
        D1[Disk 1<br/>A, C]
        D2[Disk 2<br/>A, C]
    end
    
    subgraph "Mirror Set 2"
        D3[Disk 3<br/>B, D]
        D4[Disk 4<br/>B, D]
    end
    
    D1 <-->|Mirror| D2
    D3 <-->|Mirror| D4
    
    STRIPE[Striped Data]
    D1 --> STRIPE
    D3 --> STRIPE
    
    style D1 fill:#27ae60,color:#fff
    style D2 fill:#27ae60,color:#fff
    style D3 fill:#3498db,color:#fff
    style D4 fill:#3498db,color:#fff
    style STRIPE fill:#9b59b6,color:#fff

RAID 10 Advantages:

  • ✅ Fast reads AND writes
  • ✅ Can lose 1 disk per mirror
  • ❌ Need minimum 4 disks, 50% capacity

Microsoft recommends RAID 10 for SQL Server data files

RAID 10 Architecture

4. SQL Server & RAID Best Practices

File TypeRecommended RAIDReason
Data files (.mdf)RAID 10Heavy random I/O
Log files (.ldf)RAID 1 or RAID 10Sequential writes, critical
TempDBRAID 10 or RAID 0*High I/O, can recreate
BackupsRAID 5 or RAID 6Large sequential reads

*RAID 0 only if tempdb can be recreated on restart

☁️ Cloud Context: In cloud environments (AWS EBS, Azure Managed Disk), RAID is typically handled at the infrastructure layer. You focus on selecting IOPS and Throughput tiers instead of managing disk arrays directly.


Part B: Backup Strategies

5. Why Backup?

RAID protects against disk failure, but NOT against:

  • ❌ Accidental DELETE
  • ❌ Ransomware encryption
  • ❌ Application bugs corrupting data
  • ❌ Natural disasters

Backup = Your last line of defense

6. Backup Types

graph LR
    subgraph "Backup Types"
        FULL[Full Backup]
        DIFF[Differential]
        LOG[Transaction Log]
    end
    
    FULL -->|Daily| DIFF
    DIFF -->|Hourly| LOG
    LOG -->|Every 15 min| LOG
    
    style FULL fill:#e74c3c,color:#fff
    style DIFF fill:#f39c12,color:#fff
    style LOG fill:#27ae60,color:#fff
TypeContainsSizeRestore Need
FullEntire databaseLargeJust this backup
DifferentialChanges since last FullMediumFull + this Diff
Transaction LogChanges since last LogSmallFull + all Logs

7. Backup Commands

-- 0. Integrity Check (Critical!)
-- Check for corruption BEFORE backing up
DBCC CHECKDB(MyDB) WITH NO_INFOMSGS;

-- 1. Full backup:
BACKUP DATABASE MyDB 
TO DISK = 'D:\Backups\MyDB_Full.bak'
WITH COMPRESSION, INIT, CHECKSUM;
-- CHECKSUM ensures page integrity during backup

-- 2. Differential backup:
BACKUP DATABASE MyDB 
TO DISK = 'D:\Backups\MyDB_Diff.bak'
WITH DIFFERENTIAL, COMPRESSION, CHECKSUM;

-- 3. Transaction log backup:
-- IMPORTANT: This truncates the log, allowing space reuse
BACKUP LOG MyDB 
TO DISK = 'D:\Backups\MyDB_Log.trn'
WITH COMPRESSION;

⚠️ Why DBCC CHECKDB First?

If you backup a corrupted database, your backup is also corrupted. Run DBCC CHECKDB regularly (weekly minimum) to catch corruption early.

📋 Log Truncation Explained

In Full Recovery Model, the transaction log grows forever until you back it up. BACKUP LOG marks old log records as reusable (“truncates” the log). This is why Full Recovery requires regular log backups — or your disk fills up!

8. Common Backup Strategies

Backup Strategies Overview

Simple Strategy (Small DBs)

Sunday: Full
Monday-Saturday: Full (daily)

Standard Strategy (Most DBs)

Sunday 12AM: Full
Mon-Sat 12AM: Differential
Every hour: Transaction Log

Enterprise Strategy (Mission-Critical)

Sunday 12AM: Full
Daily 12AM: Differential
Every 15 minutes: Transaction Log
Copy to offsite storage: Daily

9. Recovery Models

ModelLog Backup Possible?Point-in-Time Restore?Use Case
Simple❌ No❌ NoDev/Test
Full✅ Yes✅ YesProduction
Bulk-Logged✅ Yes⚠️ LimitedETL periods
-- Check recovery model:
SELECT name, recovery_model_desc FROM sys.databases;

-- Change recovery model:
ALTER DATABASE MyDB SET RECOVERY FULL;

10. Point-in-Time Recovery

-- Restore to specific time (requires FULL recovery model):
RESTORE DATABASE MyDB FROM DISK = 'MyDB_Full.bak' WITH NORECOVERY;
RESTORE LOG MyDB FROM DISK = 'MyDB_Log1.trn' WITH NORECOVERY;
RESTORE LOG MyDB FROM DISK = 'MyDB_Log2.trn' 
WITH STOPAT = '2024-03-15 14:30:00', RECOVERY;

Part C: Replication — Keeping Copies in Sync

11. Types of Replication

graph LR
    subgraph "Replication Types"
        SNAP[Snapshot]
        TRANS[Transactional]
        MERGE[Merge]
    end
    
    PUB[Publisher] --> SNAP --> SUB1[Subscriber]
    PUB --> TRANS --> SUB2[Subscriber]
    PUB --> MERGE --> SUB3[Subscriber]
TypeHow it WorksLatencyUse Case
SnapshotPeriodic full copyHigh (minutes-hours)Reporting, rarely changed
TransactionalStream changes in near-real-timeLow (seconds)Read replicas
MergeBi-directional syncMediumMobile/disconnected

12. Transactional Replication Flow

sequenceDiagram
    participant P as Publisher<br/>(Main DB)
    participant D as Distributor<br/>(Queue)
    participant S as Subscriber<br/>(Copy)
    
    P->>D: 1. Log Reader reads transaction log
    D-->>D: 2. Store in distribution DB
    D->>S: 3. Distribution Agent pushes changes
    S-->>S: Apply changes

Part D: Always On Availability Groups — Enterprise HA

13. What is Always On AG?

Always On Availability Groups = Multiple database replicas with automatic failover.

Always On Availability Group Architecture

graph TD
    subgraph "Primary Replica"
        P[SQL Server 1<br/>Read/Write]
    end
    
    subgraph "Secondary Replicas"
        S1[SQL Server 2<br/>Read-Only]
        S2[SQL Server 3<br/>Read-Only]
    end
    
    P -->|Synchronous| S1
    P -->|Asynchronous| S2
    
    L[Listener: ag-listener.company.com]
    L --> P
    
    style P fill:#27ae60,color:#fff
    style S1 fill:#3498db,color:#fff
    style S2 fill:#3498db,color:#fff

14. Sync vs Async Mode

ModeCommit BehaviorData Loss on FailoverLatencyUse Case
SynchronousWait for secondary to confirm❌ None (zero data loss)HigherSame datacenter
AsynchronousDon’t wait⚠️ PossibleLowerDR site (different city)

15. Failover Types

TypeTriggerDowntime
AutomaticPrimary failure detectedSeconds (30-60s typical)
Manual PlannedDBA initiatesSeconds
ForcedPrimary unreachablePotential data loss

15.1 Always On AG vs Failover Cluster Instance (FCI)

FeatureAlways On AGFailover Cluster Instance
ArchitectureShared-NothingShared-Storage
Data StorageEach node has own copySingle shared disk
Storage CostHigher (N copies)Lower (1 copy)
Failover ScopePer-databaseEntire instance
Read from Secondary✅ Yes❌ No
Requires SAN❌ No✅ Yes

When to choose?

  • AG: Modern choice. Better for DR across datacenters, readable secondaries, per-database control.
  • FCI: Legacy or when shared storage (SAN) already exists. Simpler but less flexible.

Part E: Disaster Recovery (DR)

16. Key Metrics: RTO and RPO

MetricMeaningQuestion
RTORecovery Time ObjectiveHow fast must we recover?
RPORecovery Point ObjectiveHow much data can we lose?
ScenarioRTORPOSolution
E-commerce1 hour5 minAlways On + Log shipping
Banking00Synchronous Always On
Archive system24 hours1 dayDaily backup to cloud

17. DR Strategy Tiers

graph TD
    subgraph "Tier 1: Backup Only"
        T1["Backup to tape/cloud<br/>RTO: Hours-Days | RPO: Last backup"]
    end
    
    subgraph "Tier 2: Warm Standby"
        T2["Log shipping to secondary<br/>RTO: 1-4 hours | RPO: 15 min"]
    end
    
    subgraph "Tier 3: Hot Standby"
        T3["Always On AG (async)<br/>RTO: Minutes | RPO: Near-zero"]
    end
    
    subgraph "Tier 4: Active-Active"
        T4["Distributed AG<br/>RTO: Seconds | RPO: Zero"]
    end
    
    T1 --> T2 --> T3 --> T4
    
    style T1 fill:#e74c3c,color:#fff
    style T2 fill:#f39c12,color:#fff
    style T3 fill:#3498db,color:#fff
    style T4 fill:#27ae60,color:#fff

18. Log Shipping — Simple DR

-- On Primary:
BACKUP LOG MyDB TO DISK = '\\FileShare\Logs\MyDB_Log.trn';

-- Copy job runs every 15 minutes to secondary

-- On Secondary:
RESTORE LOG MyDB FROM DISK = '\\FileShare\Logs\MyDB_Log.trn' 
WITH STANDBY = 'MyDB_Undo.ldf';

Summary: When to Use What

Technology Selection

RequirementSolution
Protect against disk failureRAID 10
Recover from accidental DELETEBackup + Point-in-time restore
Read replicas for reportingTransactional replication or AG
Automatic failover (same DC)Always On AG (synchronous)
DR site in another cityAlways On AG (async) or Log Shipping
Zero data loss requirementSynchronous Always On + SAN replication

Cost vs Protection

LevelComponentsCost
BasicRAID + Daily backup$
Standard+ Differential + Log backups$$
Professional+ Replication or Log Shipping$$$
Enterprise+ Always On AG + DR site$$$$
Mission Critical+ Active-Active + SAN replication$$$$$

Quick Reference: Backup Strategy

Small DB (< 10 GB):
  └── Full backup daily

Medium DB (10-100 GB):
  └── Full weekly + Diff daily + Log hourly

Large DB (100 GB+):
  └── Full weekly + Diff daily + Log every 15 min

Critical DB:
  └── Full weekly + Diff daily + Log every 5 min + Offsite copy

💡 Practice Questions

Conceptual

  1. Compare RAID 5, RAID 6, and RAID 10. Which would you use for SQL Server data files and why?

  2. Explain the difference between Full, Differential, and Transaction Log backups. How do they work together?

  3. What is the difference between Replication and Always On Availability Groups?

  4. Define RPO and RTO. How do they influence your backup/DR strategy?

Hands-on

-- Design a backup strategy for a 50GB database with RPO of 15 minutes and RTO of 1 hour.
-- List: backup types, frequencies, and retention period.
💡 View Answer

Backup Strategy:

  • Full backup: Weekly (Sunday night)
  • Differential backup: Daily (every night)
  • Transaction log backup: Every 15 minutes

Retention:

  • Keep 4 weeks of full backups
  • Keep 7 days of differential backups
  • Keep 24 hours of log backups

Recovery scenario (RTO < 1 hour):

  1. Restore latest full backup (~20 min for 50GB)
  2. Restore latest differential (~10 min)
  3. Restore transaction logs up to failure point (~10 min)
  4. Total: ~40 minutes ✅

Scenario

  1. Disaster Planning: Your company’s main datacenter loses power. You have an Always On Availability Group with a secondary in another city. Walk through the failover process.

  2. Capacity Planning: A database is growing 10GB per month. Current RAID 10 array is 80% full. What would you recommend?