The single biggest reason candidates fail system design interviews is lack of structure. Without a framework, you risk jumping straight into drawing diagrams, only to realize twenty minutes later that you missed critical requirements or have designed the wrong system entirely. The PEDAL framework is designed to keep your thoughts structured, verify your assumptions early, and ensure you cover all crucial phases in a standard 45-minute interview.
An interview is not a test of your ability to memorize a specific blueprint. It is an interactive session where the interviewer evaluates your problem-solving process, your communication skills, and your depth of understanding under time pressure. Adhering to a rigid, logical progression prevents you from skipping crucial design phases or getting stuck in low-level details before the high-level design is established.
System Requirements
The first phase of the PEDAL framework (Parameters) focuses on clarifying requirements. You must spend the first 5 minutes mapping out functional scope and operational constraints.
When you are given an ambiguous prompt like "Design Twitter," the interviewer is intentionally withholding details. Your job is to define the boundaries of the problem through structured questioning.
Functional Requirements (Scope definition)
- Core Action Scope: What is the primary user action? For example, in a Twitter design, it is posting tweets and viewing a timeline.
- Secondary Features: Ask if features like search, notifications, or media uploads are in scope.
- Write vs. Read Operations: Clarify the expected ratio of write-to-read actions to understand if the system is read-heavy or write-heavy.
- User Relationships: Clarify if users can follow other users, and if these relationships are unidirectional (like Twitter) or bidirectional (like Facebook).
Non-Functional Requirements (Scale and SLAs)
- High Availability: What is the target SLA? For example, is it three-nines (99.9%) or five-nines (99.999%) availability?
- P99 Latency Targets: What are the response time requirements for core API paths (e.g. read latency less than 100 milliseconds)?
- Data Consistency Standard: Does the system require strong consistency (e.g. payment systems) or eventual consistency (e.g. social feeds)?
- System Growth Assumptions: Are we designing for a fixed active user base, or must the system scale to support a tenfold increase in users over five years?
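To make the availability targets above concrete, it can help to convert a percentage into a yearly downtime budget. The following is a minimal sketch; the function name `downtime_minutes_per_year` is illustrative, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for target in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h/year)")
```

Three-nines allows roughly 8.76 hours of downtime per year, while five-nines allows only about 5.3 minutes, which is why the target SLA changes how much redundancy the design needs.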
API Design and Interface Contracts
The fourth phase of the framework (APIs & Data) requires defining clean endpoints to bind the system.
During the interview, do not handwave APIs. Write out structured HTTP or gRPC contracts to demonstrate that you understand how client-server interactions work.
Interview API Guidelines
- Focus on REST or gRPC: Explicitly define HTTP methods, paths, request payloads, and response objects.
- Handle Ingestion and Retrieval: Provide at least one write contract and one read contract to cover the system flow.
- Include Idempotency: For mutation APIs, document key fields like idempotency_key or request_id to protect against duplicate writes when a client retries.
Mock API Example (Post Tweet):
- Protocol: HTTP/1.1 POST
- Path: /v1/tweets
Request Payload:
{
"user_id": "usr_9982",
"text": "Hello, distributed world!",
"media_urls": ["https://s3.amazonaws.com/bucket/img1.png"],
"client_timestamp": "2026-06-16T12:00:00Z"
}
Response Payload:
{
"tweet_id": "tw_88221199",
"status": "PUBLISHED",
"created_at": "2026-06-16T12:00:01Z"
}
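The idempotency guideline can be sketched on the server side as follows. This is a hedged, in-memory illustration only: the class `TweetWriteService`, its `post_tweet` method, and the dictionary used as a key store are assumptions for the example, and a real service would keep idempotency keys in a shared store with an expiry.

```python
import uuid
from datetime import datetime, timezone


class TweetWriteService:
    """In-memory stand-in for the write path (hypothetical, not production storage)."""

    def __init__(self):
        # idempotency_key -> response that was returned for the first attempt
        self._responses_by_key = {}

    def post_tweet(self, idempotency_key, user_id, text):
        # A retried request carries the same key, so return the stored
        # response instead of creating a second tweet.
        cached = self._responses_by_key.get(idempotency_key)
        if cached is not None:
            return cached

        response = {
            "tweet_id": f"tw_{uuid.uuid4().hex[:8]}",
            "status": "PUBLISHED",
            "created_at": datetime.now(timezone.utc).isoformat(),
        }
        self._responses_by_key[idempotency_key] = response
        return response


service = TweetWriteService()
first = service.post_tweet("key-123", "usr_9982", "Hello, distributed world!")
retry = service.post_tweet("key-123", "usr_9982", "Hello, distributed world!")
print(first["tweet_id"] == retry["tweet_id"])  # the retry maps to the same tweet
```

In an interview it is usually enough to say where the key travels (a header or a body field) and how long the server remembers it.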
High-Level Architecture
The third phase (Diagrams) involves laying out the system topology. In an interview, avoid jumping straight to complex configurations. Begin with a clean, top-down flow of requests.
Start with the entry point where the user interacts. Show the client request routing through DNS services, a global CDN for static asset caching, and a Load Balancer that routes traffic into the internal API Gateway. The Gateway handles validation, auth, and rate limiting before dispatching requests to stateless write and read services.
graph TD
Client[Mobile/Web Client] -->|1. DNS / CDN| LB[Global Load Balancer]
LB -->|2. Route Requests| Gateway[API Gateway / Auth]
subgraph Microservices Cluster
Gateway -->|3. Post Tweet| WriteService[Tweet Write Service]
Gateway -->|4. Read Timeline| ReadService[Timeline Feed Service]
end
WriteService -->|5. Store metadata| DB[(Metadata Database)]
WriteService -->|6. Cache Event| Cache[(Redis Cache)]
ReadService -->|7. Fast Read| Cache
The PEDAL Execution Lifecycle
This diagram illustrates how the PEDAL framework flows sequentially from parameter gathering to the final deep-dive phase.
graph LR
P[Parameters: Clarify Scope] --> E[Estimates: Capacity Planning]
E --> D[Diagrams: High-Level Architecture]
D --> A[APIs & Data: Contracts & Schema]
A --> L[Logic: Technical Deep Dive]
Low-Level Design and Schema
As part of the APIs & Data phase, you must draft a database schema. Do not just list database names; write out table definitions, partition keys, and index keys to prove you understand data access patterns.
Relational Schema Blueprint (PostgreSQL)
If designing a system with structured relations and transactional safety (like user accounts or billing):
CREATE TABLE users (
user_id VARCHAR(64) PRIMARY KEY,
username VARCHAR(100) UNIQUE NOT NULL,
email VARCHAR(255) NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE tweets (
tweet_id VARCHAR(64) PRIMARY KEY,
user_id VARCHAR(64) REFERENCES users(user_id),
content VARCHAR(280) NOT NULL,
created_at TIMESTAMP WITH TIME ZONE NOT NULL
);
-- Optimize queries for user timeline retrieval
CREATE INDEX idx_tweets_user_time ON tweets(user_id, created_at DESC);
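To check that the composite index matches the timeline access pattern, you can run the query against a throwaway database. The sketch below uses SQLite through Python's standard `sqlite3` module as a stand-in for PostgreSQL, which is an assumption made so the example is self-contained; column types are simplified to `TEXT`, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets (
        tweet_id   TEXT PRIMARY KEY,
        user_id    TEXT NOT NULL,
        content    TEXT NOT NULL,
        created_at TEXT NOT NULL  -- ISO-8601 strings sort chronologically
    );
    CREATE INDEX idx_tweets_user_time ON tweets(user_id, created_at DESC);
""")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?, ?)",
    [
        ("tw_1", "usr_1", "first", "2026-06-16T12:00:00Z"),
        ("tw_2", "usr_2", "other user", "2026-06-16T12:05:00Z"),
        ("tw_3", "usr_1", "second", "2026-06-16T12:10:00Z"),
    ],
)

TIMELINE_SQL = """
    SELECT tweet_id, content FROM tweets
    WHERE user_id = ?
    ORDER BY created_at DESC
    LIMIT 20
"""
timeline = conn.execute(TIMELINE_SQL, ("usr_1",)).fetchall()
print(timeline)  # this user's tweets only, newest first

# The last column of each plan row is a human-readable description of the step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + TIMELINE_SQL, ("usr_1",))]
print(plan)
```

The index leads with `user_id` because the query filters on it with equality, and follows with `created_at` so the rows come back already ordered. In PostgreSQL the equivalent check would be `EXPLAIN` on the same query.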
NoSQL/DynamoDB Blueprint
If scaling write-heavy feeds where relational joins are too slow:
- Partition Key (PK): USER#<user_id>
- Sort Key (SK): TWEET#<timestamp>
- Attributes: content, media_links, like_count
Specifying partition keys and sort keys proves that you understand NoSQL access patterns. A DynamoDB Query must name the partition key and can narrow the result with a condition on the sort key, so it reads only one user's items in sorted order instead of performing a costly full-table Scan.
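The access pattern behind this key design can be modeled in a few lines. The following is a toy, in-memory sketch and not the DynamoDB API: the class `ItemCollectionStore` and its `put` and `query` methods are hypothetical names chosen for illustration.

```python
from bisect import bisect_left, insort
from collections import defaultdict


class ItemCollectionStore:
    """Toy model of a key-value store with a partition key and a sort key."""

    def __init__(self):
        self._sort_keys = defaultdict(list)  # pk -> sorted list of sort keys
        self._items = {}                     # (pk, sk) -> attributes

    def put(self, pk, sk, attributes):
        if (pk, sk) not in self._items:
            insort(self._sort_keys[pk], sk)  # keep each partition ordered
        self._items[(pk, sk)] = attributes

    def query(self, pk, sk_from="", limit=None, newest_first=True):
        # Only one partition is touched; other users' items are never read.
        keys = self._sort_keys.get(pk, [])
        selected = keys[bisect_left(keys, sk_from):]
        if newest_first:
            selected = selected[::-1]
        if limit is not None:
            selected = selected[:limit]
        return [(sk, self._items[(pk, sk)]) for sk in selected]


store = ItemCollectionStore()
store.put("USER#usr_1", "TWEET#2026-06-16T12:00:00Z", {"content": "first"})
store.put("USER#usr_2", "TWEET#2026-06-16T12:05:00Z", {"content": "other user"})
store.put("USER#usr_1", "TWEET#2026-06-16T12:10:00Z", {"content": "second"})

latest = store.query("USER#usr_1", sk_from="TWEET#2026-06-16T12:05", limit=10)
print([attrs["content"] for _, attrs in latest])  # ['second']
```

Because the timestamp is embedded in the sort key, "latest N tweets for a user" becomes a range read within a single partition.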
Scaling Challenges and Capacity Estimation
The second phase (Estimates) requires performing back-of-the-envelope calculations to size the storage, throughput, and memory constraints of the system.
45-Minute Interview Timeline Gantt Chart
To manage your time successfully, adhere to the following time budgeting rules:
gantt
title System Design Interview Timeline (45 Mins)
dateFormat mm
axisFormat %M
section Phases
Clarify Requirements (P) :00, 5m
Capacity Estimation (E) :05, 5m
High-Level Diagrams (D) :10, 10m
APIs & Data Models (A) :20, 5m
Deep Dive & Logic (L) :25, 20m
Capacity Estimation Example (100M Daily Active Users)
Let's calculate the system capacity bounds:
- Throughput (RPS):
- 100 Million active users, with each posting 2 times per day.
- Daily Writes: $100\text{M} \times 2 = 200\text{M}$ writes per day.
- Average RPS: $200\text{M} / 86400 \approx 2,300 \text{ write requests per second}$.
- Peak RPS (Average $\times$ 2): $\approx 4,600 \text{ writes per second}$.
- Storage Sizing (5 Years):
- Each tweet metadata row size is 500 bytes.
- Daily Storage: $200\text{M} \times 500 \text{ bytes} = 100 \text{ GB per day}$.
- Yearly Storage: $100 \text{ GB} \times 365 \approx 36.5 \text{ TB per year}$.
- 5-Year Storage: $36.5 \text{ TB} \times 5 \approx 182.5 \text{ TB}$.
- Memory Sizing (Redis Cache):
- Cache the home timelines of the active users.
- If we cache the top 100 tweets for the 100M users:
- $100\text{M} \times 100 \text{ tweets} \times 500 \text{ bytes} = 5 \text{ TB}$ of raw memory.
- With a 30% overhead for indexes, we need $\approx 6.5 \text{ TB}$ of Redis memory, partitioned across 65 cache nodes with 100 GB of RAM each.
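The arithmetic above can be reproduced with a short script, which is a convenient way to double-check your mental math while practicing. This is a sketch using the same inputs as the example; the constant names are illustrative, and decimal units (1 GB = 10^9 bytes) are assumed.

```python
SECONDS_PER_DAY = 86_400
GB, TB = 10**9, 10**12  # decimal units, as in the estimates above

DAU = 100_000_000
POSTS_PER_USER_PER_DAY = 2
TWEET_ROW_BYTES = 500
PEAK_FACTOR = 2
CACHED_TWEETS_PER_USER = 100
CACHE_OVERHEAD_PERCENT = 30
NODE_RAM_BYTES = 100 * GB

# Throughput
daily_writes = DAU * POSTS_PER_USER_PER_DAY
avg_write_rps = daily_writes / SECONDS_PER_DAY
peak_write_rps = avg_write_rps * PEAK_FACTOR

# Storage
daily_storage_bytes = daily_writes * TWEET_ROW_BYTES
five_year_storage_bytes = daily_storage_bytes * 365 * 5

# Cache memory (integer math avoids floating-point rounding)
raw_cache_bytes = DAU * CACHED_TWEETS_PER_USER * TWEET_ROW_BYTES
cache_bytes = raw_cache_bytes * (100 + CACHE_OVERHEAD_PERCENT) // 100
cache_nodes = -(-cache_bytes // NODE_RAM_BYTES)  # ceiling division

print(f"avg writes/s:   {avg_write_rps:,.0f}")
print(f"peak writes/s:  {peak_write_rps:,.0f}")
print(f"daily storage:  {daily_storage_bytes / GB:,.0f} GB")
print(f"5-year storage: {five_year_storage_bytes / TB:,.1f} TB")
print(f"cache memory:   {cache_bytes / TB:,.1f} TB on {cache_nodes} nodes")
```

The exact average is about 2,315 writes per second; rounding to 2,300 in the interview is fine, since the goal is the order of magnitude rather than precision.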
Failure Scenarios and Resilience
In the Logic / Deep Dive phase, you must address failure boundaries and show how your architecture degrades under load.
1. Load Balancer & Rate Limiter Resilience
- Scenario: A DDoS attack or a rogue bot flood attempts to exhaust backend threads.
- Mitigation: Deploy a rate-limiting middleware (Token Bucket or Leaky Bucket) at the API Gateway level. Return HTTP 429 Too Many Requests to prevent service degradation.
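A Token Bucket limiter is small enough to sketch on the whiteboard. The version below is a minimal single-process illustration; the class name, the `handle_request` helper, and the injectable `clock` parameter are assumptions for the example. A real gateway would typically keep one bucket per client and store the counters in a shared store.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `refill_rate` requests per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._clock = clock
        self._tokens = capacity
        self._last_refill = clock()

    def allow(self):
        # Refill lazily based on the time elapsed since the last call.
        now = self._clock()
        elapsed = now - self._last_refill
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        self._last_refill = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def handle_request(bucket):
    """Return the HTTP status the gateway would send for one request."""
    return 200 if bucket.allow() else 429


# A fake clock keeps the demonstration deterministic.
fake_now = [0.0]
bucket = TokenBucket(capacity=3, refill_rate=1.0, clock=lambda: fake_now[0])

first_burst = [handle_request(bucket) for _ in range(4)]
print(first_burst)  # [200, 200, 200, 429]

fake_now[0] += 2.0  # two seconds pass, so two tokens are refilled
after_wait = [handle_request(bucket) for _ in range(3)]
print(after_wait)   # [200, 200, 429]
```

The bucket capacity controls how large a burst is tolerated, while the refill rate sets the sustained limit.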
2. Database Replica Failover
- Scenario: The primary database instance crashes during a write spike.
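- Mitigation: Configure multi-AZ deployments with automatic health checks. When the primary node fails, the standby replica is promoted to primary; a target such as 30 seconds is reasonable to state, though the actual failover time depends on the database and its configuration. Writes that arrive during the failover can be buffered in a durable queue such as Kafka and applied once the new primary accepts traffic.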
3. Circuit Breaker Configuration
- Scenario: A downstream service (e.g. user recommendation) becomes slow, blocking upstream service threads.
- Mitigation: Wrap the service client inside a Circuit Breaker (using a library like Resilience4j). If the failure rate of calls exceeds 50%, the breaker trips open, and fallback responses (e.g. popular static tweets) are served instantly.
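The breaker behavior described above can be sketched as follows. This is a simplified, hand-rolled illustration and not the Resilience4j API: the class, its parameters, and the helper functions are hypothetical, and the half-open probing state that production libraries implement is reduced here to a plain reset after the wait period.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens when the failure rate over the last `window_size` calls exceeds the threshold."""

    def __init__(self, failure_rate_threshold=0.5, window_size=10,
                 open_seconds=30.0, clock=time.monotonic):
        self.failure_rate_threshold = failure_rate_threshold
        self.open_seconds = open_seconds
        self._window_size = window_size
        self._clock = clock
        self._outcomes = deque(maxlen=window_size)  # True means the call failed
        self._opened_at = None

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.open_seconds:
                return fallback()  # open: fail fast without calling downstream
            # Wait period is over: close the breaker and start a fresh window.
            self._opened_at = None
            self._outcomes.clear()
        try:
            result = func()
        except Exception:
            self._record(failed=True)
            return fallback()
        self._record(failed=False)
        return result

    def _record(self, failed):
        self._outcomes.append(failed)
        window_full = len(self._outcomes) == self._window_size
        failure_rate = sum(self._outcomes) / len(self._outcomes)
        if window_full and failure_rate > self.failure_rate_threshold:
            self._opened_at = self._clock()


calls = {"count": 0}


def flaky_recommendations():
    calls["count"] += 1
    raise TimeoutError("recommendation service too slow")


def popular_static_tweets():
    return ["tw_popular_1", "tw_popular_2"]


fake_now = [0.0]  # fake clock keeps the demonstration deterministic
breaker = CircuitBreaker(window_size=4, clock=lambda: fake_now[0])
results = [breaker.call(flaky_recommendations, popular_static_tweets) for _ in range(6)]
print(calls["count"])  # 4: the last two calls never reached the slow service
```

Every caller still receives the fallback response, but once the breaker is open the upstream threads are no longer blocked waiting on the slow dependency.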
Architectural Trade-offs
Seniority is proven by presenting alternatives and justifying your selections. Avoid declaring one architecture as "the best."
| Choice Area | Option A | Option B | Core Trade-off |
|---|---|---|---|
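| Database | PostgreSQL (Relational) | Cassandra (Wide-column NoSQL) | PostgreSQL supports ACID transactions and complex queries; Cassandra scales write throughput by adding nodes, but gives up joins and offers tunable rather than strong-by-default consistency. |
| Communication | REST (HTTP/1.1) | gRPC (HTTP/2 Protobuf) | REST has broad compatibility and is easy to debug; gRPC uses compact binary Protobuf messages over multiplexed HTTP/2 connections, which usually lowers payload size and latency but requires generated clients and is harder to inspect by hand. |
| Caching | Redis (In-memory KV) | Local Memory (App Cache) | Redis gives every application node the same shared cache at the cost of a network round trip; local memory avoids the network hop, but each node holds its own copy, which can go stale or diverge. |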
Every design decision is a choice between two sets of limitations. A Staff Engineer is someone who understands these trade-offs and selects the option that aligns with business objectives.
Staff Engineer Perspective
Verbal Script
Interviewer: "Welcome to the system design interview. Today, I'd like you to design a distributed notification system like the one used at Netflix. How would you start?"
Candidate: "I'll begin by clarifying the parameters of the system. First, let's look at the functional scope: what kinds of notifications are we supporting? SMS, Email, and Mobile Push? And are we supporting scheduling, prioritization, or batching?"
Interviewer: "Yes, we support all three channels. Prioritization is in scope. Scheduling can be a secondary target."
Candidate: "Understood. On the non-functional side: what is the daily active user count we are targeting, and what is the maximum delivery latency SLA for high-priority notifications like OTPs?"
Interviewer: "We target 200 Million active users, and high-priority notifications must be delivered within 2 seconds."
Candidate: "Excellent. Now that we have established our parameters, I will perform some quick capacity estimations to determine our throughput and database constraints. If we send an average of 10 notifications per active user daily, that is 2 Billion notifications per day. This equates to an average of about 23,000 notifications processed per second. Assuming each notification payload averages 1 kilobyte of data, our system ingestion rate is 23 Megabytes per second, and our daily storage requirement is 2 Terabytes. Based on this, a single relational database instance will be write-saturated quickly. I will design a horizontally-scalable architecture utilizing a distributed log queue to buffer writes, combined with NoSQL document stores for notification tracking."