Hero image for Building Fault-Tolerant Web Scrapers: Lessons from Scraping 80,000+ Data Points

Building Fault-Tolerant Web Scrapers: Lessons from Scraping 80,000+ Data Points

Web Scraping Python API Data Engineering Fault Tolerance

Building Fault-Tolerant Web Scrapers

The Problem: Why Scrapers Fail

You write the perfect scraper. It works beautifully on your test run of 10 records. Then you scale it to 10,000 records and… chaos.

❌ ConnectionError: Connection timed out
❌ 429 Too Many Requests
❌ 504 Gateway Timeout

Sound familiar? This article shares the battle-tested patterns I developed while scraping 80,000+ library locations from OpenStreetMap for a global data analysis project.


The Architecture: Defense in Depth

┌─────────────────────────────────────────────────────────────────┐
│                   Fault-Tolerant Scraper                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐            │
│  │   Retry     │──▶│   Backoff   │──▶│  Failover   │            │
│  │   Logic     │   │   Strategy  │   │   Nodes     │            │
│  └─────────────┘   └─────────────┘   └─────────────┘            │
│         │                │                 │                     │
│         ▼                ▼                 ▼                     │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐            │
│  │  Checkpoint │──▶│   Logging   │──▶│ Validation  │            │
│  │   Resume    │   │  & Alerts   │   │   Checks    │            │
│  └─────────────┘   └─────────────┘   └─────────────┘            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Pattern 1: Multiple API Endpoints (Redundancy)

Problem: If your only API endpoint goes down for maintenance, your entire scraper dies.

Solution: Define backup endpoints and rotate between them on failure.

import requests
import time

# Multiple endpoints for redundancy
ENDPOINTS = [
    "https://overpass-api.de/api/interpreter",      # Primary
    "https://overpass.kumi.systems/api/interpreter", # Backup 1
    "https://maps.mail.ru/osm/tools/overpass/api/interpreter"  # Backup 2
]

def query_with_failover(query: str, max_retries: int = 5) -> dict:
    """
    Query with automatic endpoint failover.
    Each retry uses the NEXT endpoint in rotation.
    """
    for attempt in range(max_retries):
        # Rotate through endpoints
        endpoint = ENDPOINTS[attempt % len(ENDPOINTS)]
        
        try:
            response = requests.post(
                endpoint,
                data={"data": query},
                timeout=120
            )
            response.raise_for_status()
            return response.json()
            
        except requests.exceptions.RequestException as e:
            print(f"⚠️ Attempt {attempt + 1} failed on {endpoint}: {e}")
            
            if attempt < max_retries - 1:
                sleep_time = 10 * (attempt + 1)  # Progressive delay
                print(f"💤 Sleeping {sleep_time}s before trying next endpoint...")
                time.sleep(sleep_time)
    
    raise Exception("All endpoints exhausted")

Key Insight: Use attempt % len(ENDPOINTS) to automatically cycle through backup servers. No manual switching required.


Pattern 1.5: Production Best Practices

Before diving into more patterns, here are three essentials that separate hobby scripts from production-grade scrapers:

User-Agent Headers

Problem: Many APIs (including OpenStreetMap) will reject requests without a proper User-Agent, or give stricter rate limits.

# Always identify your scraper
HEADERS = {
    "User-Agent": "MyLibraryScraper/1.0 (contact@youremail.com)",
    "Accept": "application/json"
}

response = requests.post(endpoint, data=query, headers=HEADERS, timeout=120)

Pro Tip: Include a contact email in your User-Agent. Responsible API providers will reach out before blocking you.

Session Objects (Connection Reuse)

Problem: Each requests.get() opens a new TCP connection. For 80,000 requests, that’s 80,000 handshakes.

Solution: Use requests.Session() to reuse connections.

from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session() -> requests.Session:
    """Create a session with retry logic built-in."""
    session = requests.Session()
    
    # Configure automatic retries
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[500, 502, 503, 504]
    )
    
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    
    # Set default headers
    session.headers.update(HEADERS)
    
    return session

# Usage
session = create_session()
response = session.post(endpoint, data=query)  # Reuses connection!

Logging vs Print

Problem: print() statements disappear when your script runs on a server or overnight.

Solution: Use Python’s logging module for production.

import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('scraper.log'),  # Save to file
        logging.StreamHandler()                # Also print to console
    ]
)

logger = logging.getLogger(__name__)

# Replace print() with logger
logger.info(f"Processing {country}...")
logger.warning(f"Rate limited. Backing off...")
logger.error(f"Failed to fetch {country}: {e}")

Production Note: For truly mission-critical scrapers, consider structured logging (JSON format) that can be ingested by tools like ELK Stack or Datadog.


Pattern 2: Exponential Backoff

Problem: After a 429 Too Many Requests, immediately retrying makes things worse. You’ll get rate-limited even harder.

Solution: Wait exponentially longer between retries.

import random

def exponential_backoff(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0) -> float:
    """
    Calculate delay with exponential backoff + jitter.
    
    Attempt 0: ~1s
    Attempt 1: ~2s
    Attempt 2: ~4s
    Attempt 3: ~8s
    ...capped at max_delay
    """
    delay = min(base_delay * (2 ** attempt), max_delay)
    
    # Add jitter (randomness) to prevent thundering herd
    jitter = random.uniform(0, delay * 0.3)
    
    return delay + jitter

def fetch_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    """Fetch with exponential backoff on failures."""
    
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            
            if response.status_code == 429:
                # Rate limited - back off
                delay = exponential_backoff(attempt, base_delay=15)
                print(f"⛔ Rate limited. Backing off for {delay:.1f}s...")
                time.sleep(delay)
                continue
                
            elif response.status_code == 504:
                # Server timeout - back off less aggressively
                delay = exponential_backoff(attempt, base_delay=5)
                print(f"⏱️ Server timeout. Retrying in {delay:.1f}s...")
                time.sleep(delay)
                continue
                
            response.raise_for_status()
            return response
            
        except requests.exceptions.ConnectionError:
            delay = exponential_backoff(attempt)
            print(f"🔌 Connection error. Retrying in {delay:.1f}s...")
            time.sleep(delay)
    
    raise Exception(f"Failed after {max_retries} retries")

Why Jitter Matters

Without jitter, if 100 scrapers all get rate-limited at the same time, they’ll all retry at the exact same moment, causing another wave of rate-limiting. Adding randomness spreads out the retries.

Bonus: Parse Retry-After Header

Some APIs (including Overpass) tell you exactly how long to wait via the Retry-After response header:

def get_backoff_delay(response: requests.Response, attempt: int) -> float:
    """
    Get delay from Retry-After header if available,
    otherwise fall back to exponential backoff.
    """
    retry_after = response.headers.get('Retry-After')
    
    if retry_after:
        try:
            # Retry-After can be seconds (integer) or HTTP-date
            return float(retry_after)
        except ValueError:
            pass  # Not a number, fall through
    
    # Fallback to exponential backoff
    return exponential_backoff(attempt)

# Usage
if response.status_code == 429:
    delay = get_backoff_delay(response, attempt)
    logger.warning(f"Rate limited. Waiting {delay:.1f}s (from Retry-After)...")
    time.sleep(delay)

Why this matters: Respecting Retry-After is both polite and efficient. You won’t wait longer than necessary, and you won’t hammer the server before it’s ready.


Pattern 3: Checkpoint & Resume (Idempotency)

Problem: Your script crashes at country #41 out of 50. Do you really need to re-scrape the first 40?

Solution: Write progress to disk after each unit of work. On restart, skip completed items.

import pandas as pd
import os

RAW_DATA_PATH = "data/libraries_raw.csv"

def get_processed_items(filepath: str) -> set:
    """Get set of already-processed items from existing output."""
    if os.path.exists(filepath):
        df = pd.read_csv(filepath)
        return set(df['country'].unique())
    return set()

def scrape_all_countries(countries: list) -> None:
    """Scrape countries with checkpoint resume."""
    
    # 1. Load existing progress
    processed = get_processed_items(RAW_DATA_PATH)
    print(f"📂 Found {len(processed)} already processed countries")
    
    for country in countries:
        # 2. Skip if already done
        if country in processed:
            print(f"⏭️ Skipping {country} (already processed)")
            continue
        
        print(f"🔄 Processing {country}...")
        
        try:
            # 3. Fetch data
            data = fetch_country_libraries(country)
            
            # 4. Append immediately (mode='a')
            df = pd.DataFrame(data)
            df.to_csv(
                RAW_DATA_PATH,
                mode='a',
                header=not os.path.exists(RAW_DATA_PATH),
                index=False
            )
            
            print(f"✅ {country}: {len(data)} libraries saved")
            
        except Exception as e:
            print(f"❌ {country} failed: {e}")
            # Continue to next country, don't crash
            continue

The Power of mode='a'

Save StrategyRiskRecovery
df.to_csv() at the endCrash = lose everythingMust restart from 0
df.to_csv(mode='a') each iterationCrash = lose 1 itemResume from checkpoint

Pattern 4: Smart Rate Limiting

Problem: You don’t know the API’s rate limit. You either go too fast (get blocked) or too slow (waste time).

Solution: Adaptive rate limiting based on response headers.

import time

class AdaptiveRateLimiter:
    """
    Adjust request rate based on API feedback.
    - If hitting 429s: slow down
    - If succeeding: gradually speed up
    """
    
    def __init__(self, initial_delay: float = 1.0):
        self.delay = initial_delay
        self.min_delay = 0.5
        self.max_delay = 30.0
        self.consecutive_success = 0
    
    def wait(self):
        """Wait before next request."""
        time.sleep(self.delay)
    
    def on_success(self):
        """Gradually speed up after consecutive successes."""
        self.consecutive_success += 1
        
        if self.consecutive_success >= 10:
            self.delay = max(self.delay * 0.9, self.min_delay)
            self.consecutive_success = 0
            print(f"📈 Speeding up. New delay: {self.delay:.2f}s")
    
    def on_rate_limit(self):
        """Slow down after rate limit hit."""
        self.delay = min(self.delay * 2, self.max_delay)
        self.consecutive_success = 0
        print(f"📉 Slowing down. New delay: {self.delay:.2f}s")
    
    def on_error(self):
        """Slight slowdown on other errors."""
        self.delay = min(self.delay * 1.5, self.max_delay)
        self.consecutive_success = 0

# Usage
limiter = AdaptiveRateLimiter(initial_delay=2.0)

for url in urls:
    limiter.wait()
    
    response = requests.get(url)
    
    if response.status_code == 200:
        limiter.on_success()
    elif response.status_code == 429:
        limiter.on_rate_limit()
    else:
        limiter.on_error()

Pattern 5: Automated Validation

Problem: The scraper finishes without errors. But is the data correct? Did you accidentally scrape the wrong coordinates?

Solution: Built-in validation and visual QA.

import folium
import pandas as pd

def validate_scrape_results(df: pd.DataFrame) -> None:
    """Run automated validation checks."""
    
    print("\n📊 Validation Report")
    print("=" * 40)
    
    # Check 1: Expected row count
    total = len(df)
    print(f"Total records: {total:,}")
    
    if total < 1000:
        print("⚠️ WARNING: Unexpectedly low record count!")
    
    # Check 2: Null values
    null_pct = df.isnull().sum() / len(df) * 100
    for col, pct in null_pct.items():
        if pct > 5:
            print(f"⚠️ WARNING: {col} has {pct:.1f}% null values")
    
    # Check 3: Coordinate bounds
    if 'latitude' in df.columns:
        invalid_lat = ((df['latitude'] < -90) | (df['latitude'] > 90)).sum()
        if invalid_lat > 0:
            print(f"❌ ERROR: {invalid_lat} records with invalid latitude!")
    
    # Check 4: Sample per country
    country_counts = df.groupby('country').size()
    underpopulated = country_counts[country_counts < 10]
    if len(underpopulated) > 0:
        print(f"⚠️ WARNING: Countries with <10 records: {list(underpopulated.index)}")

def generate_qa_map(df: pd.DataFrame, output_path: str = "qa_map.html") -> None:
    """Generate interactive map for visual QA."""
    
    # Sample to prevent browser crash
    if len(df) > 5000:
        print(f"📍 Sampling 5000 points from {len(df):,} for map...")
        df = df.sample(n=5000, random_state=42)
    
    # Create map centered on mean coordinates
    m = folium.Map(
        location=[df['latitude'].mean(), df['longitude'].mean()],
        zoom_start=2
    )
    
    # Add points
    for _, row in df.iterrows():
        folium.CircleMarker(
            location=[row['latitude'], row['longitude']],
            radius=3,
            color='blue',
            fill=True
        ).add_to(m)
    
    m.save(output_path)
    print(f"🗺️ QA map saved to {output_path}")

Putting It All Together

def main():
    """Complete fault-tolerant scraping pipeline."""
    
    # Initialize
    limiter = AdaptiveRateLimiter(initial_delay=2.0)
    processed = get_processed_items(RAW_DATA_PATH)
    
    countries = ["USA", "UK", "Japan", "France", ...]  # 40+ countries
    
    for country in countries:
        # Skip completed
        if country in processed:
            continue
        
        # Rate limit
        limiter.wait()
        
        # Fetch with failover + backoff
        for attempt in range(MAX_RETRIES):
            endpoint = ENDPOINTS[attempt % len(ENDPOINTS)]
            
            try:
                data = fetch_country_data(country, endpoint)
                
                # Save immediately (checkpoint)
                save_results(data, RAW_DATA_PATH)
                
                limiter.on_success()
                break
                
            except RateLimitError:
                limiter.on_rate_limit()
                time.sleep(exponential_backoff(attempt))
                
            except TimeoutError:
                limiter.on_error()
                time.sleep(exponential_backoff(attempt))
    
    # Final validation
    df = pd.read_csv(RAW_DATA_PATH)
    validate_scrape_results(df)
    generate_qa_map(df)
    
    print("✅ Scrape complete!")

if __name__ == "__main__":
    main()

Key Takeaways

PatternProblem SolvedImplementation
Multiple EndpointsSingle point of failureattempt % len(endpoints)
Exponential BackoffRate limit spiralsdelay * 2^attempt + jitter
Checkpoint ResumeCrash recoverymode='a' + skip processed
Adaptive Rate LimitUnknown API limitsSuccess → speed up, 429 → slow down
Automated ValidationBad data qualityNull checks + visual QA map

Results from Production

Using these patterns on my library density project:

  • 80,000+ records scraped across 40+ countries
  • 4-5 endpoint failures gracefully handled via failover
  • Zero manual restarts required (checkpoint resume worked)
  • 100% data validation passed

The scraper ran overnight, encountered multiple 429s and timeouts, and still completed successfully. That’s the power of fault-tolerant design.


Remember: A scraper is only as good as its error handling. Plan for failure, and your code will thrive.