System Design: Designing a Distributed Search Engine
Search is the most common way humans interact with massive datasets. Building a system that can perform full-text search across billions of documents with millisecond latency requires a complete rethink of traditional database storage.
1. Core Requirements
- Full-Text Search: Supporting complex queries, fuzzy matching, and ranking.
- High Throughput: Thousands of search queries per second.
- Real-time Indexing: New documents should be searchable within seconds.
- Scalability: Handling petabytes of data across hundreds of nodes.
2. The Inverted Index (The Foundation)
A traditional database table is like a book's table of contents (ID -> Data). An Inverted Index is like the index at the back of a textbook (Word -> List of IDs).
- The Process:
- Tokenization: Splitting text into individual words.
- Normalization: Lowercasing and removing punctuation.
- Indexing: Mapping each word to a "Postings List" of document IDs where it appears.
3. Storage Internals: Segments and Lucene
Elasticsearch is built on Apache Lucene.
- Segments: An index is made of multiple immutable "Segments." Once a segment is written to disk, it never changes.
- Merging: In the background, Lucene merges smaller segments into larger ones. This process (Segment Merging) is critical for performance and is where "deleted" documents are physically removed.
4. Distributed Architecture: Sharding
To scale, a single index is split into multiple Shards.
- Primary Shard: Handles write operations.
- Replica Shard: Provides high availability and handles read (search) traffic.
- Routing: When a query arrives, the "Coordinator Node" broadcasts the search to all relevant shards, merges the results (Scatter-Gather pattern), and returns them to the user.
5. Ranking and Relevance (TF-IDF vs. BM25)
How does the engine know which document is most relevant?
- Term Frequency (TF): How often does the word appear in this document?
- Inverse Document Frequency (IDF): Is this word rare across the whole dataset?
- BM25: The modern industry standard algorithm that builds on TF-IDF to provide highly accurate relevance scores.
6. Real-time Ingestion Pipeline
Elasticsearch doesn't fsync to disk for every document (it's too slow).
- Documents are written to an in-memory Indexing Buffer.
- Every 1 second (Refresh interval), the buffer is turned into a new Segment and made searchable.
- For durability, documents are also appended to a Translog (WAL).
Summary
Building a distributed search engine is about mastering the Inverted Index and managing the complex lifecycle of Immutable Segments. By combining these with a robust sharding strategy, you can build a system that powers everything from e-commerce search to petabyte-scale log analysis.
