System Design: Designing a Video Conferencing System
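Designing a real-time video conferencing system like Zoom or Microsoft Teams is fundamentally different from designing a video streaming service like YouTube. YouTube can buffer several seconds of video to maximize quality and resolution. A conferencing system cannot, so it prioritizes latency. Once one-way delay climbs much past roughly 150 ms, participants start talking over each other and natural conversation becomes difficult.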
1. Core Requirements
- Real-time Video/Audio: Bi-directional streaming with sub-200ms latency.
- Large Meetings: Supporting hundreds or thousands of participants.
- Screen Sharing: Sharing a high-resolution, low-framerate stream.
- Resilience: Handling varying network conditions (packet loss, low bandwidth).
2. The Protocol: UDP vs. TCP
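- The Choice: UDP (User Datagram Protocol) is the default transport for real-time media. TCP is normally used only as a fallback when a network blocks UDP.
- Why? TCP guarantees reliable, in-order delivery by retransmitting lost packets, and everything behind a lost packet has to wait until the retransmission arrives (head-of-line blocking). In a call, it's better to lose a single frame of video (a minor glitch) than to stall the whole stream while waiting for that frame to arrive.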
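The trade-off above can be illustrated with a toy timing model. This is a minimal sketch, not a network simulator: the function names are made up for illustration, and it assumes a fixed one-way delay and that a retransmission costs exactly one extra round trip.

```python
from typing import List, Optional, Set


def first_arrivals(n_packets: int, interval_ms: float, one_way_ms: float,
                   lost: Set[int]) -> List[Optional[float]]:
    """Time each packet first reaches the receiver, or None if it was lost."""
    return [None if i in lost else i * interval_ms + one_way_ms
            for i in range(n_packets)]


def deliver_reliable_in_order(arrivals, interval_ms, one_way_ms, rtt_ms):
    """TCP-like: lost packets are resent, and later packets wait behind them."""
    delivered = []
    latest = 0.0
    for i, t in enumerate(arrivals):
        if t is None:
            # Assumption: a retransmission costs one extra round trip.
            t = i * interval_ms + one_way_ms + rtt_ms
        latest = max(latest, t)  # cannot be handed over before its predecessors
        delivered.append(latest)
    return delivered


def deliver_unreliable(arrivals):
    """UDP-like: each packet is handed over on arrival; lost ones stay None."""
    return list(arrivals)


# Five packets sent 20 ms apart, 40 ms one-way delay, packet 1 lost.
arrivals = first_arrivals(5, interval_ms=20, one_way_ms=40, lost={1})
tcp_like = deliver_reliable_in_order(arrivals, interval_ms=20, one_way_ms=40, rtt_ms=80)
udp_like = deliver_unreliable(arrivals)
```

In this model the TCP-like receiver holds packets 2, 3 and 4 until the retransmitted packet 1 shows up at 140 ms, even though they arrived at 80, 100 and 120 ms. The UDP-like receiver plays them on time and simply skips the lost packet.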
3. Communication Technology: WebRTC
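WebRTC is the standard for real-time communication in the browser. A WebRTC call relies on:
- STUN/TURN Servers: STUN lets a client discover its public address so that peers behind NAT can connect directly. TURN relays the media when a direct path is blocked. ICE, the connectivity framework WebRTC uses, tries the candidate paths and picks one that works.
- Signaling: Exchanging metadata (like "I'm calling you", plus session descriptions and network candidates) before the media starts flowing. WebRTC does not define the signaling transport itself; applications commonly use WebSockets.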
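To make the signaling step concrete, here is a minimal sketch of the kind of messages peers exchange before media flows. It is an in-memory stand-in for a WebSocket server, not a WebRTC API: the class name, the message fields ("from", "to", "type", "payload") and the placeholder SDP strings are all assumptions chosen for illustration.

```python
import json
from collections import defaultdict

ALLOWED_TYPES = {"offer", "answer", "candidate"}


class SignalingRelay:
    """Hypothetical in-memory relay standing in for a WebSocket signaling server."""

    def __init__(self):
        self.inboxes = defaultdict(list)

    def send(self, sender, recipient, kind, payload):
        """Queue a JSON-encoded signaling message for the recipient."""
        if kind not in ALLOWED_TYPES:
            raise ValueError(f"unknown signaling message type: {kind}")
        message = json.dumps(
            {"from": sender, "to": recipient, "type": kind, "payload": payload}
        )
        self.inboxes[recipient].append(message)

    def receive(self, peer):
        """Return and clear every message waiting for this peer."""
        return [json.loads(m) for m in self.inboxes.pop(peer, [])]


relay = SignalingRelay()
relay.send("alice", "bob", "offer", {"sdp": "<alice's session description>"})
relay.send("alice", "bob", "candidate", {"candidate": "<alice's network candidate>"})
for msg in relay.receive("bob"):
    if msg["type"] == "offer":
        # Bob replies with his own session description.
        relay.send("bob", msg["from"], "answer", {"sdp": "<bob's session description>"})
answers = relay.receive("alice")
```

Only this small metadata exchange goes through the server. Once both sides have each other's descriptions and candidates, the media itself travels over the path that ICE selects.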
4. Scaling the Meeting: SFU vs. MCU
How do you deliver 100 video streams to 100 participants?
Option A: Peer-to-Peer (Mesh)
Every user sends their stream to every other user.
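- Limit: Only practical for very small calls (roughly 2-4 people). Each user uploads a separate copy of their stream to every other participant, so upload bandwidth and encoding cost grow with every person who joins.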
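A quick back-of-the-envelope calculation shows why mesh breaks down. This is a sketch under an assumed video bitrate of 1,500 kbps per stream, which is an illustrative figure rather than one from any particular product.

```python
def mesh_upload_kbps(participants: int, stream_kbps: int) -> int:
    """Upload needed by one user who sends a copy to every other participant."""
    return max(participants - 1, 0) * stream_kbps


def mesh_connections(participants: int) -> int:
    """Number of peer-to-peer links in a full mesh."""
    return participants * (participants - 1) // 2


STREAM_KBPS = 1500  # assumed bitrate of one video stream

for n in (2, 4, 10):
    print(n, "people:", mesh_upload_kbps(n, STREAM_KBPS), "kbps upload each,",
          mesh_connections(n), "links")
```

With these numbers, a 10-person call needs 13,500 kbps of upload from every participant, which is more than many home and mobile connections can sustain.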
Option B: MCU (Multipoint Control Unit)
The server receives all streams, mixes them into one single video (like a collage), and sends that one stream to everyone.
- Pros: Low bandwidth for the client.
- Cons: Extremely CPU-intensive for the server, which must decode, mix, and re-encode every stream. The extra processing also adds latency.
Option C: SFU (Selective Forwarding Unit) - The Standard
The server receives all streams but doesn't mix them. It simply forwards the relevant streams to each participant.
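- The Optimization: The SFU forwards only what each participant actually needs. If a participant is muted and their camera is off, there is nothing to forward for them, and in a large meeting each client typically receives only the active speaker and the tiles currently on screen. This kind of selective forwarding is what makes meetings of 1,000 people feasible.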
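The forwarding decision can be sketched as a small selection function. This is an illustration under assumptions, not how any specific SFU works: the Publisher fields, the audio_level score and the "loudest first, capped at max_tiles" policy are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Publisher:
    user_id: str
    camera_on: bool
    audio_level: float  # hypothetical loudness score, 0.0 (silent) to 1.0


def streams_to_forward(publishers: List[Publisher], receiver_id: str,
                       max_tiles: int) -> List[str]:
    """Pick which video streams the SFU forwards to one receiver."""
    # Never send a user their own stream, and skip cameras that are off.
    candidates = [p for p in publishers
                  if p.user_id != receiver_id and p.camera_on]
    # Assumed policy: the loudest speakers get the limited on-screen tiles.
    candidates.sort(key=lambda p: p.audio_level, reverse=True)
    return [p.user_id for p in candidates[:max_tiles]]


room = [
    Publisher("a", camera_on=True, audio_level=0.9),
    Publisher("b", camera_on=False, audio_level=0.5),
    Publisher("c", camera_on=True, audio_level=0.1),
    Publisher("d", camera_on=True, audio_level=0.4),
]
for_c = streams_to_forward(room, receiver_id="c", max_tiles=2)  # ["a", "d"]
```

Because the cap is per receiver, the number of streams each client downloads stays bounded by max_tiles no matter how many people join the meeting.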
5. Handling Network Jitter (Adaptive Bitrate)
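- Simulcast: The client encodes and sends several versions of its video at once (typically High, Medium, and Low quality) to the SFU. The SFU forwards the High-quality version to users with fast connections and the Low-quality version to users on slow mobile data, and it can switch a receiver between versions as that receiver's available bandwidth changes.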
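The per-receiver choice between simulcast versions might look like the sketch below. The layer bitrates and the 80% headroom factor are assumptions chosen for illustration, and a real SFU would get its bandwidth estimate from congestion-control feedback rather than a function argument.

```python
# Assumed simulcast layers, ordered from best to worst quality (bitrate in kbps).
LAYERS = [("high", 2500), ("medium", 800), ("low", 200)]


def pick_layer(estimated_kbps: float, headroom: float = 0.8) -> str:
    """Choose the best layer that fits within the receiver's estimated bandwidth."""
    # Leave some headroom so audio and bursts do not saturate the link.
    budget = estimated_kbps * headroom
    for name, kbps in LAYERS:
        if kbps <= budget:
            return name
    # Nothing fits: fall back to the lowest layer. A real system might
    # instead pause video and keep only audio.
    return LAYERS[-1][0]


choices = {kbps: pick_layer(kbps) for kbps in (5000, 1200, 300, 100)}
```

Here a receiver with an estimated 5,000 kbps gets the high layer, 1,200 kbps gets medium, and both 300 kbps and 100 kbps get low. The sender's work stays the same regardless, because the SFU only chooses which already-encoded version to forward.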
6. Global Scalability
Video servers must be placed in data centers geographically close to participants to minimize propagation delay, which is ultimately bounded by the speed of light.
- Geo-routing: If users in London are talking, the meeting should be hosted on a server in London, not New York.
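Both ideas can be sketched in a few lines: the physical floor on delay, and a simple region picker. The assumptions here are that light in fiber travels at roughly 200,000 km/s, that London to New York is about 5,570 km, and that the round-trip figures in the table are made-up illustrative values. The "minimize the worst participant's round trip" policy is one reasonable choice among several.

```python
from typing import Dict, List


def one_way_propagation_ms(distance_km: float,
                           fiber_km_per_s: float = 200_000) -> float:
    """Lower bound on one-way delay over fiber, ignoring routing and queuing."""
    return distance_km / fiber_km_per_s * 1000


def pick_region(rtt_ms: Dict[str, Dict[str, float]],
                participant_locations: List[str]) -> str:
    """Pick the server region whose worst participant round trip is smallest."""
    def worst_rtt(region: str) -> float:
        return max(rtt_ms[region][loc] for loc in participant_locations)
    return min(rtt_ms, key=worst_rtt)


# Hypothetical round-trip times (ms) from each server region to user locations.
RTT_MS = {
    "london": {"london": 5, "paris": 15, "new_york": 75},
    "new_york": {"london": 75, "paris": 85, "new_york": 5},
}

floor_ms = one_way_propagation_ms(5570)          # roughly 28 ms London to New York
host = pick_region(RTT_MS, ["london", "paris"])  # "london"
```

Even under ideal conditions, routing a London-only meeting through New York would add tens of milliseconds in each direction, which eats a large share of the latency budget before any encoding or buffering happens.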
Summary
The engineering of video conferencing is a masterclass in low-latency networking. By leveraging UDP, SFU architectures, and simulcast for adaptive quality, you can build a platform that makes global communication feel close to a face-to-face meeting.
