Case Study: Designing a Web Crawler at Google Scale
Mental Model
A search-engine crawler is not a brute-force downloader, but a highly disciplined, distributed scheduling machine that balances network politeness, content relevance, dynamic freshness, and duplicate filtering at extreme scale.
A web crawler is an automated system that browses the World Wide Web in a methodical manner to index content for search engines like Google or Bing. At a global scale, indexing the internet requires solving massive distributed systems problems: managing billions of pages, parsing malformed HTML, avoiding infinite crawler traps, respecting server capacity, and handling huge network bandwidth.
Requirements & Core Constraints
To build a search engine crawler that can index the web, we must define clear boundaries for scale, performance, and legal compliance.
Functional Constraints
- Universal Crawling: Traverse the web starting from seed URLs, extracting links, and crawling new pages recursively.
- Seen URL Deduplication: Ensure the crawler never crawls the same exact URL twice.
- Content Deduplication: Detect and discard near-duplicate web pages (e.g., identical content served on different URLs or with minor timestamp changes).
- Politeness Protocol: Strictly respect
robots.txtspecifications and never overload a target web server with a high frequency of hits. - Freshness Control: Periodically recrawl web pages to capture updates, prioritizing high-importance pages.
Non-Functional SLAs
- Unbounded Scalability: The system must be capable of crawling and indexing at least 10 Billion pages per month.
- High Fault Tolerance: The system must run continuously. If individual worker nodes crash, the master scheduler must reassign URLs without losing crawl state.
- Extremely Polite Ingestion: Rate-limit requests to any single domain to one request every few seconds (configurable per host).
- Storage Durability: Retain crawl metadata, link graphs, and content fingerprints across petabytes of data.
Back-of-the-Envelope Estimates
Let's calculate the hardware, network, and storage capacity needed to support 10 Billion crawled pages per month.
1. Ingestion Bandwidth & Throughput
- Total Monthly Crawled Pages: $10\text{ Billion}$
- Average Page Size (HTML + HTTP Headers + Metadata): $100\text{ KB}$
- Total Ingested Data per Month: $10\text{B} \times 100\text{ KB} = 1,000\text{ Terabytes} = 1\text{ Petabyte (PB)}$
- Total Seconds in a Month: $30\text{ days} \times 86,400\text{ seconds} = 2,592,000\text{ seconds}$
- Average Crawling Rate: $10\text{B} / 2,592,000\text{s} \approx 3,858\text{ pages/sec (QPS)}$
- Peak Crawling Rate (3x Average): $\approx 11,574\text{ QPS}$
- Average Network Ingress Bandwidth: $3,858\text{ QPS} \times 100\text{ KB} = 385.8\text{ MB/sec} \approx 3.1\text{ Gbps}$
- Peak Network Ingress Bandwidth: $11,574\text{ QPS} \times 100\text{ KB} = 1,157.4\text{ MB/sec} \approx 9.3\text{ Gbps}$
2. Seen URL Storage Capacity
- Total Crawled URLs Tracked: $100\text{ Billion}$ (representing both crawled pages and found links awaiting processing).
- Average URL String Length: $100\text{ characters}$ ($100\text{ bytes}$).
- Raw URL Storage Size: $100\text{B} \times 100\text{ bytes} = 10\text{ Terabytes (TB)}$.
- Seen URL Bloom Filter Size in RAM: Allocating 10 bits per URL to keep the false positive rate under 1%. $$\text{Memory Size} = 100\text{B URLs} \times 10\text{ bits} = 1\text{ Trillion bits} \approx 125\text{ GB}$$ This can easily fit in the RAM of a single large memory server or be sharded across a small cluster of Redis nodes.
API Design & Core Contracts
The distributed crawler relies on internal messaging contracts to coordinate masters, worker fleets, and deduplication stores.
1. Master-Worker Job Dispatch Protocol
Used by the central URL Frontier to assign crawling tasks to Fetcher workers. This is typically implemented via high-speed gRPC streaming to bypass REST serialization overhead.
gRPC rpc FetchJobs(stream WorkerReport) returns (stream CrawlJob)
CrawlJob Payload (Protobuf schema representation):
syntax = "proto3";
message CrawlJob {
string job_id = 1;
string url = 2;
int32 connection_timeout_ms = 3;
int32 max_redirects = 4;
string robots_txt_rules = 5;
bytes host_ip = 6;
}
2. Worker Execution Report Endpoint
Sent by Fetcher workers back to the URL Frontier when a crawl finishes or fails.
POST /api/v1/crawler/report
Request Payload:
{
"job_id": "job_983147",
"url": "https://codesprintpro.com/courses/system-design",
"status": "COMPLETED",
"http_code": 200,
"elapsed_ms": 482,
"simhash_fingerprint": "98a3b8cd291a84f3",
"extracted_urls": [
"https://codesprintpro.com/courses/lld",
"https://codesprintpro.com/blog/system-design-payment-gateway"
],
"error_message": null
}
High-Level Design (HLD)
Scaling a distributed crawler requires isolating task queuing from content fetching, parsing, and deduplication.
1. Google-Scale Mercator Web Crawler Topology
This high-level architecture separates URL prioritization, dns caching, page fetching, HTML parsing, content deduplication, and indexing.
graph TD
Seeds[Seed URLs List] --> Frontier[URL Frontier Scheduler]
Frontier -->|Assign URL| Fetcher[Fetcher Worker Fleet]
subgraph FetchPhase [Fetch & Resolve Phase]
Fetcher -->|Check Cache| DNSCache[(Local DNS Cache)]
Fetcher -->|HTTP GET Request| WebServer[Target Web Server]
end
WebServer -->|Raw HTML Response| Parser[HTML Parser & Link Extractor]
subgraph ParsePhase [Parsing & Deduplication Phase]
Parser -->|Raw Text| ContentDedup{SimHash Deduplicator}
ContentDedup -->|Duplicate| Drop[Drop Duplicate Page]
ContentDedup -->|Unique Page| HBase[(HBase Page Content Store)]
Parser -->|Extracted URLs| URLFilter[URL Filter & Canonicalization]
URLFilter -->|Filtered URLs| URLSeen{URL Seen Filter}
end
URLSeen -->|Already Crawled| Drop2[Drop URL]
URLSeen -->|New URL| Frontier
Detailed Component Flow:
- URL Frontier: The brain of the crawler. It maintains the list of target URLs, coordinates crawling priority, and enforces politeness delays so we do not overload any target web server.
- Fetcher Worker Fleet: A cluster of thousands of stateless dockerized containers running non-blocking asynchronous HTTP clients. Fetchers query a high-speed local DNS Cache before falling back to network DNS servers, preventing DNS server overload.
- HTML Parser: Parses retrieved raw pages, handles malformed HTML markup, and extracts metadata and anchor tag links.
- SimHash Content Deduplicator: Strips dynamic elements (like timestamps or ads) and generates a 64-bit signature of the page content, verifying it against the Seen Content database.
- URL Filter & Canonicalization: Standardizes links (e.g., converts relative URLs to absolute, forces lowercase, strips session IDs) to avoid indexing the same page under slightly different names.
- URL Seen Filter: Combines an in-memory Bloom filter for fast checks with a persistent HBase Seen DB for 100% accuracy.
2. Deep Dive: URL Frontier Architecture
To guarantee politeness and avoid overwhelming servers, the URL Frontier organizes queues using priority and politeness routing layers.
graph TD
IngressURLs[Incoming Target URLs] --> PriorityRouter[Priority Router]
subgraph PriorityLayer [Priority & Authority Management]
PriorityRouter --> PQueue1[Priority Queue 1: High PageRank]
PriorityRouter --> PQueue2[Priority Queue 2: Medium PageRank]
PriorityRouter --> PQueue3[Priority Queue 3: Low PageRank]
end
PQueue1 --> PolitenessRouter[Politeness Router]
PQueue2 --> PolitenessRouter
PQueue3 --> PolitenessRouter
subgraph PolitenessLayer [Politeness Management]
PolitenessRouter --> PQueueHostA[Queue Host A: codesprintpro.com]
PolitenessRouter --> PQueueHostB[Queue Host B: wikipedia.org]
PolitenessRouter --> PQueueHostC[Queue Host C: github.com]
end
PQueueHostA --> DelayScheduler{Heap Delay Scheduler}
PQueueHostB --> DelayScheduler
PQueueHostC --> DelayScheduler
DelayScheduler -->|Polite Dispatch| FetchWorkers[Active Fetcher Threads]
Low-Level Design (LLD) & Data Models
A distributed crawler needs storage systems optimized for high-throughput writes and rapid, point-in-time reads.
Database Selection & Schema Layout
1. HBase Seen URL & Crawl Metadata DB
We use HBase (running on top of HDFS) as our primary crawl state store. HBase is a wide-column NoSQL store built for massive data scaling.
- Row Key: MD5 or SHA-256 hash of the canonicalized URL. Using a hash ensures even distribution of write loads, preventing hotspotting on lexicographically ordered domain names.
Schema Design (HBase Column Families):
cf_metadata: Stores status, HTTP codes, and retry counters.cf_fingerprint: Stores the SimHash and crawl timestamps.
RowKey (URL MD5 Hash) | Column Family: cf_metadata | Column Family: cf_fingerprint
-----------------------------------|--------------------------------|------------------------------
7f923b8cd2e1a84f938c291a24bc9831 | status: "CRAWLED" | simhash: "98a3b8cd291a84f3"
| http_code: 200 | last_crawl_time: 1779435420
| retry_count: 0 | content_len: 102430
2. Host Politeness Cache (Redis Online Boundary)
The Frontier must query host politeness configuration rapidly to determine active sleep delays. We cache host politeness settings in Redis.
- Key:
host:polite:{hostname}(Type: Redis Hash)
Redis Key-Value Schema Example:
HSET host:polite:codesprintpro.com \
crawl_delay_seconds 5 \
last_crawl_timestamp 1779435425 \
robots_txt_hash "ad8c72bd"
SimHash Content Deduplication Implementation
Websites often contain minor dynamic details, such as changing timestamps or comment counts. To avoid indexing duplicate pages, we use SimHash to construct a 64-bit fingerprint. The hamming distance between two SimHash values indicates how similar their text content is.
Here is a clean, compilable Python implementation of a SimHash calculation and hamming distance checker:
import hashlib
import re
from typing import List
class SimHashEngine:
def __init__(self, hash_bits: int = 64):
self.hash_bits = hash_bits
def tokenize(self, text: str) -> List[str]:
"""
Cleans and tokenizes text input, removing HTML artifacts
and extracting words to form feature tokens.
"""
text = text.lower()
# Remove common non-alphanumeric noise
text = re.sub(r'[^\w\s]', '', text)
return text.split()
def get_md5_hash_bits(self, token: str) -> int:
"""
Computes a 64-bit integer hash for a given token using MD5.
"""
digest = hashlib.md5(token.encode('utf-8')).digest()
# Extract the first 8 bytes (64 bits)
return int.from_bytes(digest[:8], byteorder='big')
def calculate_simhash(self, text: str) -> int:
"""
Calculates the 64-bit SimHash fingerprint of a given text string.
"""
tokens = self.tokenize(text)
if not tokens:
return 0
# Initialize 64-dimension vector to zeroes
vector = [0] * self.hash_bits
for token in tokens:
token_hash = self.get_md5_hash_bits(token)
for i in range(self.hash_bits):
# Check if the i-th bit of token_hash is set
bit_mask = 1 << i
if token_hash & bit_mask:
vector[i] += 1 # Add weight (default weight = 1)
else:
vector[i] -= 1 # Subtract weight
# Reconstruct final 64-bit fingerprint
fingerprint = 0
for i in range(self.hash_bits):
if vector[i] > 0:
fingerprint |= (1 << i)
return fingerprint
def hamming_distance(self, hash1: int, hash2: int) -> int:
"""
Calculates the Hamming distance (number of differing bits)
between two 64-bit SimHash values.
"""
xor_result = hash1 ^ hash2
# Count the number of set bits (1s) in the XOR result
return bin(xor_result).count('1')
def is_near_duplicate(self, hash1: int, hash2: int, threshold: int = 3) -> bool:
"""
If the Hamming distance is less than or equal to the threshold
(commonly 3 bits), the pages are near-duplicates.
"""
return self.hamming_distance(hash1, hash2) <= threshold
# Verification block
if __name__ == "__main__":
engine = SimHashEngine()
doc1 = "Learn system design on CodeSprintPro and master distributed caching systems."
doc2 = "Learn system design on CodeSprintPro and master distributed caching system! [Updated 2026]"
doc3 = "Cooking healthy soup with fresh chicken, carrot and potato."
sh1 = engine.calculate_simhash(doc1)
sh2 = engine.calculate_simhash(doc2)
sh3 = engine.calculate_simhash(doc3)
dist_1_2 = engine.hamming_distance(sh1, sh2)
dist_1_3 = engine.hamming_distance(sh1, sh3)
print(f"Distance between doc1 & doc2: {dist_1_2} bits")
print(f"Distance between doc1 & doc3: {dist_1_3} bits")
assert engine.is_near_duplicate(sh1, sh2, threshold=3) == True
assert engine.is_near_duplicate(sh1, sh3, threshold=3) == False
print("Verification Successful!")
Scaling Challenges & System Bottlenecks
1. Crawler Traps and Infinite Path Loops
Web servers can generate an infinite number of paths (e.g., dynamic calendars displaying /calendar?date=2026-05-22, recursive folder trees, or dynamic user profiles), which can cause the crawler to run out of memory.
- Mitigation (Max Depth, Path Analysis & Pattern Matching): We enforce a maximum URL path depth (e.g., max 16 slash segments). The URL Frontier runs real-time string analysis to flag repeating directory components (e.g.,
/files/files/files/). We also dynamically quarantine hosts that exceed 1 Million crawl jobs with zero inbound PageRank impact.
2. Massive Concurrent DNS Resolution Overload
Synchronous DNS lookups per page fetch would throttle the crawler's throughput.
- Mitigation (Local DNS Caches & Async DNS Resolvers): Instead of relying on the operating system's standard synchronous blocking DNS resolver, Fetcher worker processes contain an in-memory Local DNS Cache refreshed asynchronously in the background. If a domain name's DNS entry misses, the fetcher queries a dedicated cluster of custom DNS proxy servers, preventing DNS resolution from blocking the critical crawl loop.
3. Seen URL Storage Scalability
Checking if a URL has already been crawled against a dataset of 100 Billion records can quickly saturate disk databases.
- Mitigation (Seen Bloom Filter Partitioning): We implement a tiered check strategy. We maintain a cluster of Redis-backed partitioned Bloom Filters in RAM. If the Bloom filter returns
false(meaning the URL has definitely not been crawled), we immediately enqueue it. If it returnstrue(meaning the URL is likely seen), we perform a fast lookup in our HBase Seen DB to resolve any potential false positives.
System Trade-offs & CAP Posture
1. AP vs CP Posture in the Distributed Crawl DB
- Crawl DB is strictly AP (Availability / Eventual Consistency): We prioritize high-speed writes and crawl progress logging over absolute strong consistency. It is acceptable if two fetchers crawling different domains discover the same target link and write it to the crawl database simultaneously, resulting in a temporary duplicate check lag. The system values crawl throughput over a perfectly synchronized database state.
Failure Scenarios & Resilience
1. Fetcher Worker Node Crashes
A fetcher node might crash mid-way through downloading a heavy page, causing the allocated URL queue partition to hang.
- Resilience (Lease-Based URL Reservation): When a Fetcher worker claims a URL from the Frontier, it obtains a 10-minute lease registered in the Frontier queue. If the Fetcher crashes or fails to report execution within 10 minutes, the lease expires. The Frontier marks the URL as unassigned and enqueues it for another worker to process.
[ URL Frontier ]
│
├──(Lease Assigned 10m)──> [ Fetcher Worker A ] (crashes!)
│
└──(Lease Expires -> Re-enqueue)──> [ Fetcher Worker B ] (retries)
2. Target Host Denial of Service (DoS) Accusation
If politeness controls fail due to queue partitioning bugs, a crawler host might trigger thousands of requests per second to a single target domain, triggering firewalls and IP blocking.
- Resilience (Centralized Redis Rate-Limiting Guard): In addition to domain queues, we deploy an out-of-band Redis token bucket rate-limiter for every target IP. If a fetcher is about to connect to an IP, it must acquire a token. If Redis returns a rate-limit error, the fetcher drops the request and enqueues the URL back in the Frontier.
Staff Engineer Perspective (Operational Deep Dive)
[!WARNING] The Real-World Danger of Memory Exhaustion in Frontier Queues Master nodes running the URL Frontier can run out of memory when crawls expand into vast domains containing billions of nested links.
To prevent system crashes:
- Restrict Frontier memory usage by configuring disk-backed spooling queues (e.g., using RocksDB or LevelDB local disk stores on master nodes) rather than keeping all FIFO host queues in RAM.
- Limit RAM allocations strictly to the active Politeness Heap Delay Scheduler, storing inactive queues on high-speed NVMe storage.
DNS Query amplification Prevention
A major real-world issue in distributed crawlers is that DNS lookups can amplify and overwhelm target DNS root servers. If you crawl millions of subdomains (e.g., user-generated blog sites under a primary domain), your local DNS server may execute recursive lookups that look like a distributed denial of service attack. Always implement a strict DNS throttling cache policy that flags and skips crawling subdomains if their primary domain's DNS name server returns recurring timeouts or host lookup failures.
Verbal Script & Mock Interview Guide
An illustrative dialogue between an interviewer and a Principal Systems Architect.
Interviewer: "How does the URL Frontier maintain politeness across thousands of concurrent fetcher threads? If multiple threads pull URLs from the queue, what stops them from hitting the same host at the same time?"
Candidate: "We solve this by separating priority queues from politeness queues. We maintain two separate arrays of queues in the URL Frontier. First, incoming URLs are routed into Priority Queues based on page authority (PageRank). Then, a background worker consumes URLs from these priority queues and routes them into Politeness Queues. We maintain exactly one FIFO queue per target IP address. To coordinate fetching, we use a Heap-based Delay Scheduler containing hostnames and their next eligible crawl timestamps. A fetcher thread queries the heap for the next available host whose delay has expired, retrieves a URL from that host's polite queue, updates the host's timestamp in the heap with a 5-second sleep offset, and executes the HTTP fetch. This ensures that no two threads can hit the same domain simultaneously."
Interviewer: "Excellent. If the Bloom filter for seen URLs returns a false positive, what is the impact on the crawler?"
Candidate: "A Bloom filter can have false positives, meaning it might report that we have seen a URL when we actually haven't. The impact of a false positive is that the crawler will skip crawling a new URL, missing a page. However, it will never cause us to crawl a URL twice. We minimize this risk by sizing the Bloom filter to allocate 10 bits per URL, keeping the false positive rate under 1%. For highly authoritative seed domains where missing pages is unacceptable, we bypass the Bloom filter and query the HBase Seen database directly to guarantee 100% crawling completeness."
Interviewer: "How do you handle redirects (like HTTP 301 and 302) during page fetches?"
Candidate: "Redirects are a common vector for infinite loops. Fetcher nodes parse the 'Location' header of redirect responses but do not follow them immediately. Instead, they format the redirected URL, run it through the canonicalization filters, and insert it back into the URL Frontier. This ensures that redirected URLs are subjected to the same seen checks, prioritizations, and politeness delays as standard URLs, protecting the fetcher fleet from infinite redirect loops."