Jan 21, 2025

B-Tree Index: How SQL Server Finds Your Data

sql-server database performance indexing

Prerequisites: Basic understanding of SQL Server and database tables. See SQL Server & Networking Fundamentals for background.

The Problem: Finding Data Without Indexes

Imagine a library with 1 million books but no catalog system. To find a specific book, you’d have to scan every shelf—one by one.

┌─────────────────────────────────────────────────────────────────┐
│              Without Index: Full Table Scan                     │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   Query: SELECT * FROM Customers WHERE CustomerID = 999999      │
│                                                                 │
│   ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐         │
│   │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │...│...│...│999999│...│ 1M│      │
│   └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───────┴───┴───┘     │
│     ↑   ↑   ↑   ↑   ↑   ↑   ↑   ↑   ↑   ↑     ↑                 │
│   Scan every single row until found!                            │
│                                                                 │
│   Time: O(n) = Check all 1,000,000 rows                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

This is a Table Scan — the slowest way to find data.

The Solution: B-Tree Index

A B-Tree (Balanced Tree) is a sorted, hierarchical data structure that allows SQL Server to find any row in O(log n) time instead of O(n).

┌─────────────────────────────────────────────────────────────────┐
│              With B-Tree Index: Logarithmic Search              │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   Query: SELECT * FROM Customers WHERE CustomerID = 750000      │
│                                                                 │
│                         ┌───────────┐                           │
│                         │  Root     │                           │
│                         │ 1-500K    │                           │
│                         │ 500K-1M   │                           │
│                         └─────┬─────┘                           │
│                               │ "750000 is in 500K-1M range"    │
│                    ┌──────────┴──────────┐                      │
│                    ▼                      ▼                     │
│              ┌──────────┐          ┌──────────┐                 │
│              │ 500-750K │          │ 750K-1M  │ ← Go here       │
│              └──────────┘          └─────┬────┘                 │
│                                          │                      │
│                               ┌──────────┴──────────┐           │
│                               ▼                      ▼          │
│                        ┌──────────┐          ┌──────────┐       │
│                        │ 750-800K │ ← Found! │ 800-850K │       │
│                        └──────────┘          └──────────┘       │
│                                                                 │
│   Time: O(log n) = Only 3-4 levels to traverse for 1M rows      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Rows	Full Scan (O(n))	B-Tree (O(log n))	Improvement
1,000	1,000 reads	~10 reads	100x
1,000,000	1,000,000 reads	~20 reads	50,000x
1,000,000,000	1,000,000,000 reads	~30 reads	33,000,000x

Part A: B-Tree Structure

1. Anatomy of a B-Tree

┌─────────────────────────────────────────────────────────────────┐
│              B-Tree Anatomy                                     │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   Level 0 (Root)        ┌────────────────────────┐              │
│                         │   500  │  1000  │  1500 │             │
│                         └──┬─────┴────┬───┴────┬──┘             │
│                            │          │        │                │
│   Level 1 (Intermediate)   ▼          ▼        ▼                │
│                     ┌──────────┐ ┌──────────┐ ┌──────────┐      │
│                     │100│200│..│ │600│700│..│ │1100│1200│       │
│                     └────┬─────┘ └──────────┘ └──────────┘      │
│                          │                                      │
│   Level 2 (Leaf)         ▼                                      │
│                     ┌──────────────────────────────────┐        │
│                     │ Actual data or pointers to data  │        │
│                     │ (Row 100, Row 101, Row 102...)   │        │
│                     └──────────────────────────────────┘        │
│                                                                 │
│   Key Properties:                                               │
│   • All leaves at same depth (balanced)                         │
│   • Nodes sorted left to right                                  │
│   • Each node = one 8KB page on disk                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

2. Pages: The Building Blocks

SQL Server stores data in 8KB pages. Each B-Tree node = one page.

Page Type	Contents
Root Page	Entry point; pointers to intermediate pages
Intermediate Pages	Navigation pages with key ranges
Leaf Pages	Actual data OR pointers to data (depends on index type)

┌─────────────────────────────────────────────────────────────────┐
│              8KB Page Structure                                 │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ Page Header (96 bytes)                                  │   │
│   │ - Page number, Object ID, Index ID                      │   │
│   │ - Free space info, Previous/Next page pointers          │   │
│   ├─────────────────────────────────────────────────────────┤   │
│   │                                                         │   │
│   │ Row 1: [Key: 100] [Data or Pointer]                     │   │
│   │ Row 2: [Key: 101] [Data or Pointer]                     │   │
│   │ Row 3: [Key: 102] [Data or Pointer]                     │   │
│   │ ...                                                     │   │
│   │ Row N: [Key: 199] [Data or Pointer]                     │   │
│   │                                                         │   │
│   ├─────────────────────────────────────────────────────────┤   │
│   │ Row Offset Array (from bottom up)                       │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│   Typical capacity: 100-500 rows per page (depends on row size) │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Part B: Clustered vs Non-Clustered Index

This is the most important distinction in SQL Server indexing.

3. Clustered Index: The Table IS the Index

A Clustered Index physically reorders the data on disk to match the index order.

┌─────────────────────────────────────────────────────────────────┐
│              Clustered Index = Physical Sort Order              │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   CREATE CLUSTERED INDEX IX_Customers_ID ON Customers(ID)       │
│                                                                 │
│   Before (Heap):              After (Clustered):                │
│   ┌─────────────────┐         ┌─────────────────┐               │
│   │ ID=5, Name=Eve  │         │ ID=1, Name=Alice│ ← Physically  │
│   │ ID=1, Name=Alice│         │ ID=2, Name=Bob  │   sorted!     │
│   │ ID=3, Name=Carol│   →     │ ID=3, Name=Carol│               │
│   │ ID=2, Name=Bob  │         │ ID=4, Name=Dave │               │
│   │ ID=4, Name=Dave │         │ ID=5, Name=Eve  │               │
│   └─────────────────┘         └─────────────────┘               │
│                                                                 │
│   Leaf pages contain ACTUAL ROWS (all columns)                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key Facts:

Only ONE clustered index per table (because data can only be sorted one way)
Primary Key = Clustered Index by default
Leaf level = the actual table data

What Is a Heap? (No Clustered Index)

A table without a clustered index is called a Heap — data is stored in no particular order.

┌─────────────────────────────────────────────────────────────────┐
│              Heap vs Clustered Table                            │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   HEAP (No CI):                    CLUSTERED TABLE:             │
│   ┌─────────────────┐              ┌─────────────────┐          │
│   │ ID=5, Name=Eve  │              │ ID=1, Name=Alice│          │
│   │ ID=1, Name=Alice│  (random)    │ ID=2, Name=Bob  │ (sorted) │
│   │ ID=3, Name=Carol│              │ ID=3, Name=Carol│          │
│   └─────────────────┘              └─────────────────┘          │
│                                                                 │
│   NC Index points to: RID          NC Index points to:          │
│   (FileID:PageID:SlotID)           Clustered Key (ID)           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Aspect	Heap	Clustered Table
Data order	Random (insertion order)	Sorted by clustered key
NC Index pointer	RID (physical location)	Clustered key value
Range scans	Full table scan	Fast (sequential read)
Point lookups	Via RID (fast if known)	Via B-Tree (fast)

When Heaps Are Useful:

⚡ Staging tables: Bulk-load data quickly, process later, then drop
⚡ High-velocity inserts: No page splits, no key ordering overhead
⚡ Bulk export/import: ETL intermediate tables

Rule of Thumb: Most production tables should have a clustered index. Use heaps only for temporary/staging scenarios where write speed matters more than read performance.

4. Non-Clustered Index: The Separate Index

A Non-Clustered Index is a separate structure with pointers to the actual data.

┌─────────────────────────────────────────────────────────────────┐
│              Non-Clustered Index = Separate Lookup              │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   CREATE NONCLUSTERED INDEX IX_Customers_Name ON Customers(Name)│
│                                                                 │
│   ┌────────────────────┐         ┌─────────────────────────┐    │
│   │  NC Index          │         │  Table (Clustered)      │    │
│   │  (sorted by Name)  │         │  (sorted by ID)         │    │
│   ├────────────────────┤         ├─────────────────────────┤    │
│   │ Alice → ID=1 ──────┼────────►│ ID=1, Alice, 555-1234   │    │
│   │ Bob → ID=2 ────────┼────────►│ ID=2, Bob, 555-5678     │    │
│   │ Carol → ID=3 ──────┼────────►│ ID=3, Carol, 555-9999   │    │
│   │ Dave → ID=4 ───────┼────────►│ ID=4, Dave, 555-0000    │    │
│   │ Eve → ID=5 ────────┼────────►│ ID=5, Eve, 555-1111     │    │
│   └────────────────────┘         └─────────────────────────┘    │
│                                                                 │
│   Leaf pages contain: Index key + Row Locator (Clustered Key    │
│   or RID if heap)                                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key Facts:

Up to 999 non-clustered indexes per table
Leaf level contains key + pointer (not full row)
May require a “Key Lookup” to get non-indexed columns

5. Comparison Table

Feature	Clustered Index	Non-Clustered Index
Max per table	1	999
Leaf level contains	All row data	Key + Row locator
Data physically sorted	Yes	No
Speed for range queries	Excellent (sequential read)	Good (may need lookups)
Speed for point lookups	Excellent	Excellent
Storage overhead	None (IS the table)	Additional storage needed

Part C: Index Operations

6. Key Lookup and Covering Indexes

When a non-clustered index doesn’t contain all columns needed:

┌─────────────────────────────────────────────────────────────────┐
│              Key Lookup Problem                                 │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   Query: SELECT Name, Phone FROM Customers WHERE Name='Alice'   │
│   Index: CREATE NONCLUSTERED INDEX IX_Name ON Customers(Name)   │
│                                                                 │
│   Step 1: Find 'Alice' in NC Index                              │
│   ┌────────────────────┐                                        │
│   │ Name    │ ID       │                                        │
│   │ Alice   │ 1  ──────┼───┐                                    │
│   └────────────────────┘   │                                    │
│                             │ Key Lookup (expensive!)           │
│   Step 2: Go to table       ▼                                   │
│   ┌──────────────────────────────────┐                          │
│   │ ID=1, Name=Alice, Phone=555-1234 │ ← Get Phone column       │
│   └──────────────────────────────────┘                          │
│                                                                 │
│   SOLUTION: Covering Index                                      │
│   CREATE NONCLUSTERED INDEX IX_Name_Phone                       │
│   ON Customers(Name) INCLUDE(Phone)                             │
│                                                                 │
│   Now leaf level contains: Name + Phone                         │
│   ┌─────────────────────────────────┐                           │
│   │ Name    │ Phone      │ (No lookup needed!)                  │
│   │ Alice   │ 555-1234   │                                      │
│   └─────────────────────────────────┘                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

7. Page Splits: The Hidden Performance Killer

When inserting into a full page, SQL Server must split it:

┌─────────────────────────────────────────────────────────────────┐
│              Page Split                                         │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   Before: Page is full, need to insert ID=150                   │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ Page 100: [100][101][102][103][104][199] ← Full!        │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│   After: Page split!                                            │
│   ┌───────────────────────────┐ ┌───────────────────────────┐   │
│   │ Page 100: [100][101][102] │ │ Page 200: [103][104][150] │   │
│   │ (half the data)           │ │ [199] (other half + new)  │   │
│   └───────────────────────────┘ └───────────────────────────┘   │
│                                                                 │
│   Problems:                                                     │
│   1. Extra I/O to allocate new page                             │
│   2. Updates to parent page pointers                            │
│   3. Fragmentation (pages no longer contiguous)                 │
│   4. Transaction log bloat                                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

When do splits happen?

Random inserts (non-sequential keys like GUIDs)
Inserts in the middle of an ordered sequence

8. Fill Factor: Leaving Room for Growth

Fill Factor defines how full pages should be when index is built/rebuilt.

┌─────────────────────────────────────────────────────────────────┐
│              Fill Factor                                        │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   Fill Factor = 100% (default)                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ [100][101][102][103][104][105][106][107][108][109] FULL │   │
│   └─────────────────────────────────────────────────────────┘   │
│   → ANY insert causes page split!                               │
│                                                                 │
│   Fill Factor = 80%                                             │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ [100][101][102][103][104][105][106][107]    SPACE       │   │
│   └─────────────────────────────────────────────────────────┘   │
│   → Room for 2 more inserts before split                        │
│                                                                 │
│   Trade-off:                                                    │
│   Lower Fill Factor = Fewer splits, More disk space             │
│   Higher Fill Factor = More splits, Less disk space             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Fill Factor	Use Case
100%	Read-only tables, append-only (sequential keys)
80-90%	Normal OLTP with updates
50-70%	Heavy random inserts (GUID keys)

Part D: Index Fragmentation

9. Types of Fragmentation

Type	Description	Impact
Internal	Pages not fully utilized (empty space)	Wastes disk space
External (Logical)	Pages out of physical order	Slows sequential reads

┌─────────────────────────────────────────────────────────────────┐
│              External Fragmentation                             │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   Ideal (0% fragmentation):                                     │
│   ┌────────┐┌────────┐┌────────┐┌────────┐                      │
│   │ Page 1 ││ Page 2 ││ Page 3 ││ Page 4 │ (contiguous)         │
│   └────────┘└────────┘└────────┘└────────┘                      │
│                                                                 │
│   Fragmented (after many splits):                               │
│   ┌────────┐         ┌────────┐                                 │
│   │ Page 1 │         │ Page 3 │                                 │
│   └────────┘         └────────┘                                 │
│         ┌────────┐         ┌────────┐                           │
│         │ Page 2 │         │ Page 4 │ (scattered on disk)       │
│         └────────┘         └────────┘                           │
│                                                                 │
│   Impact: Disk head has to jump around → slower reads           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

10. Checking and Fixing Fragmentation

Check fragmentation:

SELECT 
    index_type_desc,
    avg_fragmentation_in_percent,
    page_count
FROM sys.dm_db_index_physical_stats(
    DB_ID(), 
    OBJECT_ID('Customers'), 
    NULL, 
    NULL, 
    'LIMITED'
);

Fix fragmentation:

Fragmentation	Action	Command
< 5%	Do nothing	-
5-30%	Reorganize	`ALTER INDEX IX_Name REORGANIZE`
> 30%	Rebuild	`ALTER INDEX IX_Name REBUILD`

Operation	Online?	Log Usage	Time
REORGANIZE	Yes	Minimal	Slower
REBUILD	Enterprise only	Full	Faster

Part E: Index Design Best Practices

11. What Columns to Index?

Good Candidates	Bad Candidates
✅ Columns in WHERE clauses	❌ Low selectivity (gender, yes/no)
✅ Columns in JOIN conditions	❌ Frequently updated columns
✅ Columns in ORDER BY	❌ Wide columns (text, nvarchar(max))
✅ High selectivity columns	❌ Tables with heavy INSERT load

Selectivity, Cardinality, and Statistics

Why does SQL Server sometimes ignore your index? The answer lies in Statistics.

Term	Definition	Example
Cardinality	Number of unique values in a column	`Gender` has cardinality 2; `CustomerID` has cardinality 1,000,000
Selectivity	% of rows returned by a filter	`WHERE Gender='M'` returns 50% (low selectivity); `WHERE CustomerID=123` returns 0.0001% (high selectivity)
Statistics	SQL Server’s histogram of value distribution	Updated automatically or via `UPDATE STATISTICS`

┌─────────────────────────────────────────────────────────────────┐
│         Why SQL Server Ignores Your Index                       │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   Query: SELECT * FROM Orders WHERE Status = 'Completed'        │
│   Index exists on: Status                                       │
│                                                                 │
│   Query Optimizer checks Statistics:                            │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ Status      │ Rows      │ % of Table                    │   │
│   │ Pending     │ 50,000    │ 5%  → Use Index ✅            │   │
│   │ Completed   │ 800,000   │ 80% → Table Scan faster! ❌   │   │
│   │ Cancelled   │ 150,000   │ 15% → Maybe use index ⚠️      │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│   Rule: If query returns >15-20% of table, full scan is faster  │
│   than thousands of random I/O lookups.                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key Takeaways:

❌ Low selectivity (returns many rows) → Index hurts more than helps
✅ High selectivity (returns few rows) → Index is valuable
🔄 Outdated statistics → Wrong decisions. Run UPDATE STATISTICS after large data changes

-- Check statistics age
DBCC SHOW_STATISTICS ('Customers', 'IX_Status');

-- Force statistics update
UPDATE STATISTICS Customers WITH FULLSCAN;

12. Composite Index: Column Order Matters!

CREATE NONCLUSTERED INDEX IX_Name_City ON Customers(LastName, City);

┌─────────────────────────────────────────────────────────────────┐
│              Composite Index: Leftmost Prefix Rule              │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   Index on (LastName, City)                                     │
│                                                                 │
│   ✅ Can use index for:                                         │
│      WHERE LastName = 'Smith'                                   │
│      WHERE LastName = 'Smith' AND City = 'NYC'                  │
│                                                                 │
│   ❌ Cannot use index for:                                     │
│      WHERE City = 'NYC'  (first column not used!)               │
│                                                                 │
│   Think of it like a phone book:                                │
│   Sorted by LastName first, then by City within same LastName   │
│                                                                 │
│   Smith, NYC                                                    │
│   Smith, LA                                                     │
│   Smith, Tokyo   ← Easy to find all Smiths                      │
│   Jones, NYC                                                    │
│   Jones, LA      ← But finding "all NYC" requires full scan     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key Columns vs INCLUDE Columns

Critical distinction often missed by developers:

-- Composite Key Index (both columns sorted)
CREATE INDEX IX_Key ON Orders(CustomerID, OrderDate);

-- Covering Index with INCLUDE (only CustomerID sorted)
CREATE INDEX IX_Cover ON Orders(CustomerID) INCLUDE(OrderDate);

Aspect	Key Columns	INCLUDE Columns
Stored in	All B-Tree levels	Leaf level only
Sorted	✅ Yes, used for seeking	❌ No, just stored alongside
Can be used for	WHERE, ORDER BY, JOIN	SELECT list only
Index size	Larger (in every node)	Smaller (leaf only)

┌─────────────────────────────────────────────────────────────────┐
│         Key Column vs INCLUDE Column Structure                   │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   CREATE INDEX IX_Cover ON Orders(CustomerID) INCLUDE(OrderDate)│
│                                                                 │
│                     ┌───────────────┐                           │
│   Root/Intermediate │ CustomerID    │ ← Only key columns        │
│                     │ (sorted)      │   at upper levels         │
│                     └───────────────┘                           │
│                            │                                    │
│                            ▼                                    │
│   Leaf Level         ┌───────────────────────────────┐          │
│                      │ CustomerID │ OrderDate        │          │
│                      │ (sorted)   │ (just stored)    │          │
│                      └───────────────────────────────┘          │
│                                                                 │
│   Query: WHERE CustomerID = 123 → Seeks using CustomerID        │
│   Query: WHERE OrderDate = '2024-01-01' → Cannot seek! ❌       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

When to use INCLUDE:

✅ Column needed in SELECT but not in WHERE/ORDER BY
✅ Want to avoid Key Lookups without bloating the index
✅ Column has low cardinality (not useful for seeking anyway)

13. Choosing the Right Clustered Index Key

Criteria	Why
Narrow	Used in all non-clustered indexes as pointer
Unique	No need for “uniqueifier”
Static	Changing clustered key = move entire row
Ever-increasing	Prevents page splits (new rows at end)

Best choice: Auto-increment INT (IDENTITY) or sequential GUID (NEWSEQUENTIALID)

Worst choice: Random GUID (NEWID) — causes massive page splits!

Part F: Monitoring and Tuning

14. Finding Missing Indexes

SQL Server tracks queries that would benefit from indexes:

SELECT 
    migs.avg_user_impact,
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns
FROM sys.dm_db_missing_index_groups mig
JOIN sys.dm_db_missing_index_group_stats migs ON mig.index_group_handle = migs.group_handle
JOIN sys.dm_db_missing_index_details mid ON mig.index_handle = mid.index_handle
ORDER BY migs.avg_user_impact DESC;

15. Finding Unused Indexes

Indexes that waste space and slow down writes:

SELECT 
    OBJECT_NAME(i.object_id) AS table_name,
    i.name AS index_name,
    ius.user_seeks,
    ius.user_scans,
    ius.user_lookups,
    ius.user_updates
FROM sys.indexes i
JOIN sys.dm_db_index_usage_stats ius ON i.object_id = ius.object_id AND i.index_id = ius.index_id
WHERE ius.user_seeks = 0 AND ius.user_scans = 0 AND ius.user_lookups = 0
ORDER BY ius.user_updates DESC;

If user_updates is high but seeks/scans/lookups are zero → consider dropping the index.

Part G: What Is an Algorithm?

Before we go deeper, let’s understand the concept that makes all of this work.

16. Algorithm = Recipe (SOP)

An algorithm is simply: “A standard procedure for solving a problem.”

Step	Description
Input	Raw, messy data
Algorithm	Step-by-step processing logic
Output	The desired result

┌─────────────────────────────────────────────────────────────────┐
│              Algorithm = Recipe                                 │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   Input: Flour, Eggs, Sugar                                     │
│                                                                 │
│   Algorithm (Recipe):                                           │
│   1. Mix ingredients                                            │
│   2. Let rise for 30 minutes                                    │
│   3. Bake at 180°C for 20 minutes                               │
│                                                                 │
│   Output: Delicious cake! 🎂                                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

17. Algorithms We’ve Already Learned

System	Problem	Bad Algorithm	Good Algorithm	Result
SQL Query	Find one row in 1M rows	Full Table Scan (O(n))	B-Tree Seek (O(log n))	10 min → 0.1 sec
Navigation (OSPF)	Find fastest route	Random walk	Dijkstra’s algorithm	Lost → Arrive on time
TCP	Packet lost	Resend everything	Resend only lost packet	Crash → Smooth
Hard Disk	Multiple read requests	Jump randomly	Elevator algorithm (sequential)	Overheated → Efficient

Part H: Index ≠ B-Tree (Other Index Types)

Index is the goal (find data fast). B-Tree is the most common implementation—but not the only one!

┌─────────────────────────────────────────────────────────────────┐
│              Index vs Implementation                            │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   Index (Goal)     =     Engine (Purpose: Make car move)        │
│   B-Tree (Method)  =     V8 Engine (One specific design)        │
│                                                                 │
│   Other engines: Electric motor, Rotary engine...               │
│   Other indexes: Hash, Columnstore, Spatial...                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

18. Hash Index

Analogy: Library call number calculated by formula.

How it works: A math function calculates the exact storage location.

Hash("Alice") = 42 → Go directly to slot 42!

Advantage	Disadvantage
✅ Point lookup in O(1)	❌ Cannot do range queries
✅ Faster than B-Tree for single-row lookup	❌ `WHERE Age > 20` doesn’t work

Use Case: Redis, SQL Server Memory-Optimized Tables

19. Columnstore Index

Analogy: Store data “vertically” instead of “horizontally.”

┌─────────────────────────────────────────────────────────────────┐
│              Rowstore vs Columnstore                            │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                 │
│   ROWSTORE (Traditional):          COLUMNSTORE:                 │
│   ┌────────────────────────────┐   ┌──────┬──────┬──────┐       │
│   │ ID │ Name  │ Salary        │   │ ID   │ Name │Salary│       │
│   │ 1  │ Alice │ 50000         │   │ 1    │ Alice│ 50000│       │
│   │ 2  │ Bob   │ 60000         │   │ 2    │ Bob  │ 60000│       │
│   │ 3  │ Carol │ 55000         │   │ 3    │ Carol│ 55000│       │
│   └────────────────────────────┘   └──────┴──────┴──────┘       │
│   (Rows stored together)           (Columns stored together)    │
│                                                                 │
│   Query: SELECT AVG(Salary) FROM Employees                      │
│                                                                 │
│   Rowstore: Must read entire rows (wasted I/O)                  │
│   Columnstore: Only reads Salary column (10x faster for BI!)    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Use Case: Data warehouses, Business Intelligence, Big Data analytics

20. Spatial Index (R-Tree)

Problem: “Find all gas stations within 5km of my location.” B-Tree can’t handle 2D coordinates!

Solution: R-Tree (Rectangle Tree) uses geometric regions.

┌─────────────────────────────────────────────────────────────────┐
│              R-Tree: Geographic Indexing                         │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                  │
│   ┌──────────────────────────────────────┐                       │
│   │    ┌─────────┐     ┌─────────┐       │                       │
│   │    │ Region A│     │ Region B│       │                       │
│   │    │ ⛽ ⛽    │     │   ⛽     │       │                        │
│   │    └─────────┘     └─────────┘       │                       │
│   │         📍 You are here              │                        │
│   │    ┌─────────┐                       │                       │
│   │    │ Region C│                       │                       │
│   │    │   ⛽     │                       │                       │
│   │    └─────────┘                       │                       │
│   └──────────────────────────────────────┘                       │
│                                                                  │
│   Query: "What's near me?"                                       │
│   R-Tree: Only check Region A (overlaps with search area)        │
│   Skip B, C → Massive performance improvement for maps           │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Use Case: Google Maps, Uber, any location-based service

21. Index Type Comparison

Index Type	Best For	Range Queries?	Point Lookup
B-Tree	General purpose OLTP	✅ Yes	⚡ Fast
Hash	Key-value stores	❌ No	⚡⚡ Fastest
Columnstore	Analytics/BI	✅ Yes (aggregate)	❌ Slow
Spatial (R-Tree)	Geographic data	✅ Yes (2D)	⚡ Fast

Part I: Query Optimizer (SQL Server’s Brain)

22. What Happens When You Execute a Query?

When you write SELECT * FROM Customers, SQL Server doesn’t just run it blindly. A built-in AI called the Query Optimizer makes intelligent decisions.

┌─────────────────────────────────────────────────────────────────┐
│              Query Optimizer Decision Process                    │
├─────────────────────────────────────────────────────────────────┤ 
│                                                                  │
│   Your Query: SELECT * FROM Customers WHERE Name = 'Alice'       │
│                                                                  │
│   Query Optimizer thinks (in 0.01 seconds):                      │
│                                                                  │
│   ┌─────────────────────────────────────────────────────────┐    │
│   │ 1. Is there an index on Name?                           │    │
│   │ 2. How many rows will match? (statistics)               │    │
│   │ 3. Is it cheaper to scan the whole table or use index?  │    │
│   │ 4. Generate multiple possible plans                     │    │
│   │ 5. Calculate cost of each plan                          │    │
│   │ 6. Choose the CHEAPEST plan                             │    │
│   └─────────────────────────────────────────────────────────┘    │
│                                                                  │
│   Output: Execution Plan (Index Seek, Hash Join, etc.)           │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

23. Index Seek vs Index Scan

Operation	Description	Performance
Index Seek	Use index to jump directly to matching rows	⚡ Fast
Index Scan	Read entire index from start to end	🐌 Slow
Table Scan	Read entire table (no index used)	🐌🐌 Very slow

SELECT * FROM Customers WHERE CustomerID = 100  
→ Index Seek (good!)

SELECT * FROM Customers WHERE YEAR(BirthDate) = 1990  
→ Index Scan or Table Scan (bad! function on column)

Part J: Index Anti-Patterns (When Indexes Fail)

Even with indexes, SQL Server may refuse to use them. These are the 禁忌 (taboos) to avoid:

24. Common Index Killers

❌ Anti-Pattern 1: Function on Column

-- BAD: Index on BirthDate is IGNORED
SELECT * FROM Customers WHERE YEAR(BirthDate) = 1990;

-- GOOD: Rewrite to avoid function
SELECT * FROM Customers 
WHERE BirthDate >= '1990-01-01' AND BirthDate < '1991-01-01';

Why? The index stores BirthDate values, not YEAR(BirthDate) values.

❌ Anti-Pattern 2: Implicit Type Conversion

-- BAD: CustomerCode is VARCHAR, but you pass INT
-- Index on CustomerCode is IGNORED
SELECT * FROM Customers WHERE CustomerCode = 12345;

-- GOOD: Match the data type
SELECT * FROM Customers WHERE CustomerCode = '12345';

❌ Anti-Pattern 3: Leading Wildcard

-- BAD: Cannot use index (starts with wildcard)
SELECT * FROM Customers WHERE Name LIKE '%Smith';

-- GOOD: Can use index (wildcard at end)
SELECT * FROM Customers WHERE Name LIKE 'Smith%';

❌ Anti-Pattern 4: OR Condition with Non-Indexed Column

-- BAD: If only one column is indexed
SELECT * FROM Customers WHERE Name = 'Alice' OR Phone = '555-1234';

-- GOOD: Use UNION instead
SELECT * FROM Customers WHERE Name = 'Alice'
UNION
SELECT * FROM Customers WHERE Phone = '555-1234';

❌ Anti-Pattern 5: SELECT * (Forcing Key Lookup)

-- BAD: Forces lookup even with covering index
SELECT * FROM Customers WHERE Name = 'Alice';

-- GOOD: Select only needed columns
SELECT Name, Email FROM Customers WHERE Name = 'Alice';

25. How to Check If Index Is Used

Use Execution Plan (Ctrl+M in SSMS):

SET STATISTICS IO ON;
SELECT * FROM Customers WHERE CustomerID = 100;

Look for:

✅ Index Seek = Good
⚠️ Index Scan = Might be OK for small tables
❌ Table Scan = No index used, full table read

Summary: Quick Reference

Clustered vs Non-Clustered

	Clustered	Non-Clustered
Data location	Leaf = actual rows	Leaf = key + pointer
Per table	1	999
Best for	Range scans, PK	Secondary lookups

Index Types

Type	Use Case	Range Query?
B-Tree	General OLTP	✅
Hash	Key lookup	❌
Columnstore	BI/Analytics	✅ (aggregate)
Spatial	Geographic	✅ (2D)

Maintenance Thresholds

Fragmentation	Action
< 5%	None
5-30%	REORGANIZE
> 30%	REBUILD

Index Killers to Avoid

Anti-Pattern	Example
Function on column	`WHERE YEAR(Date) = 2020`
Type mismatch	`WHERE VarcharCol = 123`
Leading wildcard	`WHERE Name LIKE '%Smith'`
SELECT *	Forces key lookup

Key Design Rules

Clustered key: Narrow, unique, static, ever-increasing
Non-clustered: Put most selective column first
Include columns: Avoid key lookups
Monitor: Check missing indexes, unused indexes

💡 Practice Questions

Conceptual

Explain the difference between a clustered index and a non-clustered index. Can a table have multiple of each?
What is a covering index? Why does it improve query performance?
Describe what happens during a page split. How can you minimize page splits?
What is index fragmentation and when should you rebuild vs reorganize an index?

Hands-on

-- Create the optimal index for this query:
SELECT CustomerName, OrderDate, TotalAmount
FROM Orders
WHERE Status = 'Active' AND Region = 'APAC'
ORDER BY OrderDate DESC;

-- Analyze this execution plan and identify the problem:
-- Table Scan (Cost: 95%) -> suggests what action?

💡 View Answer

-- Optimal covering index (includes all columns needed):
CREATE NONCLUSTERED INDEX IX_Orders_StatusRegion
ON Orders (Status, Region, OrderDate DESC)
INCLUDE (CustomerName, TotalAmount);

-- Why this works:
-- 1. Status, Region: Equality predicates (first columns)
-- 2. OrderDate DESC: Sort order matches ORDER BY
-- 3. INCLUDE: CustomerName, TotalAmount avoids key lookup

Table Scan Analysis:

A Table Scan at 95% cost means NO USEFUL INDEX exists
Action: Create an index on the filtered columns (Status, Region)
The index should cover all SELECT columns to avoid lookups

Scenario

Performance Issue: A query that used to run in 2 seconds now takes 30 seconds. The table grew from 100K to 10M rows. No indexes were added. What happened and how would you fix it?
Index Design: A table has a GUID as primary key and clustered index. Insert performance is terrible. Explain why and propose a solution.
Trade-off Decision: A developer wants to add 15 indexes to a table to speed up various reports. What questions would you ask, and what are the potential downsides?

← Previous SQL Server & Networking Fundamentals: From Disk to Cloud

Next → SQL Server Software Layer: The Developer's Playground