WhatsApp: A Deep Dive

Tarun Jain
16 min read · Aug 30, 2024


· Key Component
· Message Delivery Flow:
· Key Considerations:
· User Database (User DB)
Types of Data Stored
1. Users Table:
2. User Devices Table:
3. User Contacts Table:
4. User Settings Table:
· Notification Service
a) In-App Notifications:
b) Third-Party Push Notifications:
c) Detecting App State:
d) Notification Flow:
· WebSocket Session Management:
a) Session Storage:
b) Stored Information:
c) Session Management Process:
d) Scalability:
e) Handling Reconnections:
· Why do we need a Message Queue between App server 1 and app server 2?
Why use a Message Queue:
Scenarios where a Message Queue might not be necessary:
Alternatives to a dedicated Message Queue:
Hybrid Approach:
· High-Level Architecture Diagram
Step-by-step process of sending and receiving a message:
· Real-life solutions for each component:
· System Functionality
· Functional Requirements
· Non-Functional Requirements
· Back-of-the-Envelope Calculation
· API Design
Key API endpoints:
Authentication
User Management
Messaging /1:1 Chat API:
Group Messaging API:
Notification API:
· Database Schema Design
Group Tables
Group members tables
Notification tables
Database type rationale:
Database Selection Guide for Messaging Systems
· Technology Stack Justification
Programming Languages:
Frameworks:
Databases:
Messaging Protocols:
Infrastructure:
Load Balancing
CDN
Search capability
· web servers vs app servers

Key Component

1. Client Layer:
— Sender’s and Receiver’s devices run the WhatsApp client application.

2. Load Balancer:
— Distributes incoming requests across multiple application servers.

3. Application Servers:
— Handle client requests, message processing, and business logic.

4. Message Queue:
— Ensures reliable message delivery and handles high traffic.

5. Databases:
— User DB: Stores user profiles, contacts, and metadata.
— Message DB: Stores messages (possibly using a NoSQL database for scalability).

6. Caching Layer:
— Redis cluster for caching frequently accessed data (user info, recent messages).

7. Notification Service:
— Handles push notifications to inform users of new messages.

8. WebSocket Servers

Message Delivery Flow:

1. Sender’s device sends a message to the load balancer.
2. Load balancer routes the request to an available application server.
3. App server stores the message in the Message DB.
4. Message is enqueued in the Message Queue for processing.
5. Another app server dequeues the message.
6. App server fetches receiver’s info from User DB.
7. Receiver’s info is cached in Redis for faster future access.
8. Push notification is sent to the Notification Service.
9. Receiver’s device is notified of a new message.
10. Receiver’s device requests the message from the load balancer.
11. Request is routed to an available app server.
12. App server retrieves the message from Message DB.
13. Message is sent to the receiver’s device.
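
The flow above can be sketched with in-memory stand-ins for each component. This is a minimal illustration rather than a production design: the class name `MiniMessenger`, its fields, and the use of plain dictionaries and a `deque` in place of the Message DB, Message Queue, Redis, and Notification Service are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from itertools import count


@dataclass
class MiniMessenger:
    """Hypothetical in-memory stand-ins for the components in the flow."""
    message_db: dict = field(default_factory=dict)     # message_id -> message
    queue: deque = field(default_factory=deque)        # pending message ids
    user_db: dict = field(default_factory=dict)        # user_id -> device token
    cache: dict = field(default_factory=dict)          # stand-in for Redis
    notifications: list = field(default_factory=list)  # stand-in for push service
    _ids: count = field(default_factory=lambda: count(1))

    def send(self, sender, recipient, content):
        # Steps 3-4: persist the message first, then enqueue it for processing.
        message_id = next(self._ids)
        self.message_db[message_id] = {
            "from": sender, "to": recipient, "content": content,
        }
        self.queue.append(message_id)
        return message_id

    def process_next(self):
        # Steps 5-9: dequeue, look up the receiver (cache first), then notify.
        message_id = self.queue.popleft()
        recipient = self.message_db[message_id]["to"]
        if recipient not in self.cache:
            self.cache[recipient] = self.user_db[recipient]
        self.notifications.append((self.cache[recipient], message_id))

    def fetch(self, message_id):
        # Steps 10-13: the receiver pulls the message content.
        return self.message_db[message_id]["content"]


app = MiniMessenger(user_db={"bob": "device-token-bob"})
mid = app.send("alice", "bob", "Hello!")
app.process_next()
print(app.notifications)  # [('device-token-bob', 1)]
print(app.fetch(mid))     # Hello!
```

Persisting before enqueueing means a crash between the two steps leaves a stored but undelivered message, which a recovery job could re-enqueue; the reverse order could notify the receiver about a message that does not exist yet.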

Key Considerations:

1. Encryption: End-to-end encryption for message security.
2. Scalability: Horizontal scaling of app servers and databases.
3. Consistency: Eventual consistency model for distributed systems.
4. Offline Support: Queue messages for offline users.
5. Real-time Communication: WebSocket or long polling for real-time updates.
6. Multi-device Support: Synchronize messages across user’s devices.


User Database (User DB)

Now, let’s break down the schema and explain the type of data each table holds:

Types of Data Stored

  • Personal Information: Phone number, username, full name, profile picture URL.
  • Status Information: User status, last seen timestamp.
  • Contact Information: User’s contacts and their custom names.
  • Device Information: Types of devices used, device tokens for notifications.
  • User Preferences: Notification, privacy, and theme settings.

1. Users Table:

  • Stores core user information
  • Fields include: user_id, phone_number, username, full_name, profile_picture_url, status, last_seen, created_at, updated_at, is_online, and account_status
  • This table forms the basis of user identity in the system
CREATE TABLE users (
user_id BIGINT PRIMARY KEY,
phone_number VARCHAR(15) UNIQUE NOT NULL,
username VARCHAR(50) UNIQUE,
full_name VARCHAR(100),
profile_picture_url VARCHAR(255),
status VARCHAR(140),
last_seen TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
is_online BOOLEAN DEFAULT FALSE,
account_status ENUM('active', 'inactive', 'suspended') DEFAULT 'active'
);

2. User Devices Table:

— Keeps track of devices associated with each user
— Useful for multi-device support and push notifications
— Fields include: device_id, user_id (foreign key to users table), device_type, device_token, and last_login

CREATE TABLE user_devices (
device_id BIGINT PRIMARY KEY,
user_id BIGINT,
device_type VARCHAR(50),
device_token VARCHAR(255),
last_login TIMESTAMP,
FOREIGN KEY (user_id) REFERENCES users(user_id)
);

3. User Contacts Table:

  • Manages user contacts and relationships.
  • Fields include: contact_id, user_id (foreign key to users table), contact_user_id (also a foreign key to users table), contact_name, and is_blocked
  • Allows for custom contact names and blocking functionality
CREATE TABLE user_contacts (
contact_id BIGINT PRIMARY KEY,
user_id BIGINT,
contact_user_id BIGINT,
contact_name VARCHAR(100),
is_blocked BOOLEAN DEFAULT FALSE,
FOREIGN KEY (user_id) REFERENCES users(user_id),
FOREIGN KEY (contact_user_id) REFERENCES users(user_id)
);

4. User Settings Table:

— Stores user-specific settings and preferences
— Fields include: setting_id, user_id (foreign key to users table), privacy settings for last seen, profile photo, and about, as well as notification preferences

CREATE TABLE user_settings (
setting_id BIGINT PRIMARY KEY,
user_id BIGINT,
privacy_last_seen ENUM('everyone', 'contacts', 'nobody') DEFAULT 'everyone',
privacy_profile_photo ENUM('everyone', 'contacts', 'nobody') DEFAULT 'everyone',
privacy_about ENUM('everyone', 'contacts', 'nobody') DEFAULT 'everyone',
notifications_enabled BOOLEAN DEFAULT TRUE,
FOREIGN KEY (user_id) REFERENCES users(user_id)
);

This schema allows for efficient storage and retrieval of user-related data, supporting key features of a WhatsApp-like messaging system such as:

1. User authentication and identification
2. Contact management
3. Multi-device support
4. User status and presence information
5. Privacy settings
6. Push notification management

The User DB would typically be implemented using a relational database management system (RDBMS) like PostgreSQL or MySQL, which can handle the complex relationships between these tables efficiently.

Notification Service

The Notification Service in a WhatsApp-like system typically handles both in-app notifications and third-party push notifications. It’s designed to work as follows:

a) In-App Notifications:

  • When the app is open and active, real-time updates are typically handled through WebSocket connections.
  • The In-App Notification Handler component manages these real-time updates.
  • This approach is more efficient and provides a better user experience than relying on third-party push services when the app is already open.

b) Third-Party Push Notifications:

  • For Android, services like Firebase Cloud Messaging (FCM) are commonly used.
  • For iOS, Apple Push Notification service (APNs) is used.
  • These are utilized when the app is not actively running or when the WebSocket connection is not established.

c) Detecting App State:

  • The app regularly sends heartbeat signals to the server when it’s open and in the foreground.
  • The WebSocket connection status also indicates if the app is active.
  • The app state (foreground/background) is typically sent as part of the WebSocket connection metadata or regular API calls.

d) Notification Flow:

  1. When a new message arrives, the system first checks if the recipient has an active WebSocket connection.
  2. If connected, the message is sent as an in-app notification through the WebSocket.
  3. If not connected, or if the app is in the background, a push notification is sent via FCM/APNs.
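
The routing decision in this flow can be sketched as a small function. The names `Channel`, `RecipientState`, and `choose_channel` are hypothetical; a real service would read this state from the session store rather than receive it as an argument.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    WEBSOCKET = "websocket"
    PUSH = "push"


@dataclass
class RecipientState:
    has_active_socket: bool  # is a WebSocket connection currently open?
    in_foreground: bool      # last app state reported by the client


def choose_channel(state: RecipientState) -> Channel:
    """Deliver in-app only when the socket is open and the app is in the foreground."""
    if state.has_active_socket and state.in_foreground:
        return Channel.WEBSOCKET
    # Not connected, or connected but backgrounded: fall back to FCM/APNs.
    return Channel.PUSH


print(choose_channel(RecipientState(True, True)))    # Channel.WEBSOCKET
print(choose_channel(RecipientState(True, False)))   # Channel.PUSH
print(choose_channel(RecipientState(False, False)))  # Channel.PUSH
```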

WebSocket Session Management:

WebSocket sessions are crucial for maintaining real-time connections with clients.

Here’s how they’re typically managed:

a) Session Storage:

  • WebSocket session information is usually stored in a fast, distributed data store like Redis.
  • A dedicated Session DB (as shown in the updated diagram) can also be used for more persistent storage.

b) Stored Information:

  • User ID
  • Device information
  • Connection timestamp
  • Last activity timestamp
  • Session token

c) Session Management Process:

  1. When a user logs in, a new WebSocket connection is established.
  2. The session information is stored in the Session DB and cached in Redis for quick access.
  3. The WebSocket Server Cluster maintains these connections and updates session status.
  4. If a connection is lost, the session is marked as inactive but may be kept for a period to allow for quick reconnection.

d) Scalability:

  • The WebSocket Server Cluster can be scaled horizontally to handle millions of concurrent connections.
  • Load balancers distribute incoming WebSocket connections across the cluster.

e) Handling Reconnections:

  • If a client loses connection and reconnects, it can provide its last known session token.
  • The system checks if the session is still valid and can quickly re-establish the connection without a full authentication process.
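
A minimal sketch of this reconnection check follows, using a dictionary in place of Redis or the Session DB. The `SessionStore` class, its method names, and the 300-second grace period are assumptions chosen for illustration; the clock is injectable so the expiry logic can be exercised without waiting.

```python
import secrets
import time


class SessionStore:
    """Dict-backed stand-in for the session storage described above."""

    def __init__(self, grace_seconds=300, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self.clock = clock
        self.sessions = {}  # session token -> session record

    def connect(self, user_id, device):
        # New login: create a session and hand the token to the client.
        token = secrets.token_urlsafe(16)
        self.sessions[token] = {
            "user_id": user_id,
            "device": device,
            "active": True,
            "last_activity": self.clock(),
        }
        return token

    def disconnect(self, token):
        # Connection lost: mark inactive but keep the record for the grace period.
        session = self.sessions[token]
        session["active"] = False
        session["last_activity"] = self.clock()

    def reconnect(self, token):
        """Return True if the session was resumed, False if a full login is needed."""
        session = self.sessions.get(token)
        if session is None:
            return False
        if self.clock() - session["last_activity"] > self.grace_seconds:
            del self.sessions[token]  # expired: force re-authentication
            return False
        session["active"] = True
        session["last_activity"] = self.clock()
        return True
```

With Redis, the grace period would more naturally be expressed as a key expiry, so stale sessions disappear without an explicit check.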

This architecture allows for efficient handling of both online and offline users, providing real-time updates when possible and falling back to push notifications when necessary. The WebSocket session management ensures that the system can maintain millions of concurrent connections while providing a seamless experience for users across multiple devices.


Why do we need a Message Queue between App server 1 and app server 2?

Now, let’s analyze the role of the Message Queue and consider scenarios where it might or might not be necessary:

Why use a Message Queue:

a) Decoupling:

  • The queue decouples the sending process (App Server 1) from the receiving process (App Server 2).
  • This allows each part of the system to operate independently, enhancing system resilience.

b) Load Balancing:

  • During high traffic periods, the queue can buffer messages, preventing the receiving servers from being overwhelmed.

c) Asynchronous Processing:

  • The sending server can quickly enqueue a message and move on to handle other requests.
  • The receiving server can process messages at its own pace.

d) Reliability:

  • If App Server 2 is temporarily unavailable, messages are safely stored in the queue.
  • This prevents message loss during server downtime or network issues.

e) Scalability:

  • Easy to add more consumer servers (like App Server 2) to handle increased load.

Scenarios where a Message Queue might not be necessary:

a) Low Traffic Systems:

  • In a system with consistently low message volume, direct server-to-server communication might be simpler and equally effective.

b) Ultra-Low Latency Requirements:

  • For systems where every millisecond counts, the additional hop through a queue might introduce unacceptable delay.

c) Simple, Non-Distributed Systems:

  • In a monolithic application or a simple system running on a single server, a queue might add unnecessary complexity.

Alternatives to a dedicated Message Queue:

a) Direct Server-to-Server Communication:

  • App Server 1 could make a direct API call to App Server 2.
  • Pros: Lower latency for normal operations.
  • Cons: Less resilient to high load or server failures.

b) Database-Based Queue:

  • Use a database table as a simple queue.
  • Pros: Simpler setup if you’re already using a database.
  • Cons: Not as efficient for high-volume messaging.

c) In-Memory Data Structure:

  • Use Redis not just for caching, but as a lightweight message broker.
  • Pros: Very fast, can handle high throughput.
  • Cons: Less durable than some dedicated message queue solutions.

Hybrid Approach:

  • Attempt direct server-to-server communication first.
  • Fall back to a message queue if the direct attempt fails or if the system is under high load.
  • This combines the low latency of direct communication with the reliability of a queue.
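
The hybrid approach can be sketched as follows. The `deliver` function, the `DeliveryError` exception, and the `under_high_load` flag are hypothetical names; how load is measured (queue depth, CPU, error rate) is left open here.

```python
from collections import deque


class DeliveryError(Exception):
    """Raised by the direct path when the receiving server cannot be reached."""


def deliver(message, send_direct, queue, under_high_load=False):
    """Try the direct path first; fall back to the queue on failure or high load."""
    if not under_high_load:
        try:
            send_direct(message)
            return "direct"
        except DeliveryError:
            pass  # fall through to the queue
    queue.append(message)
    return "queued"


def unreachable(message):
    # Simulates App Server 2 being down.
    raise DeliveryError("receiver unavailable")


received = []
backlog = deque()
print(deliver("hi", received.append, backlog))                        # direct
print(deliver("hi again", unreachable, backlog))                      # queued
print(deliver("busy", received.append, backlog, under_high_load=True))  # queued
```

One caveat with this design: a message sent directly can overtake older messages still waiting in the queue, so receivers need sequence numbers or timestamps if ordering matters.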

In the context of a WhatsApp-like system, which typically deals with high volume and requires high availability, a Message Queue is often beneficial. However, the specific implementation could vary based on exact requirements:

For a smaller-scale application, you might start with direct communication and introduce a queue as you scale.

For a large-scale system, you might use a robust, distributed queue system like Apache Kafka or RabbitMQ from the start.

You could also implement a hybrid system that uses direct communication for urgent messages and queues for less time-sensitive operations.

The choice ultimately depends on your specific requirements for scalability, reliability, and latency.

High-Level Architecture Diagram

Let’s start with a detailed architecture diagram:

Step-by-step process of sending and receiving a message:

  1. Send Message: The sender’s client (mobile app or web client) sends a message to the load balancer.
  2. Route Request: The load balancer routes the request to an available Nginx web server.
  3. Forward to App Server: The web server forwards the request to a Node.js API server.
  4. Authenticate User: The app server authenticates the user using the session information stored in Redis.
  5. Store Message: The app server stores the message in the Cassandra message database.
  6. Enqueue Notification: The app server enqueues a notification in RabbitMQ for processing.
  7. Dequeue Notification: The Socket.io Notification Server dequeues the notification from RabbitMQ.
  8. Check Recipient Status: The app server checks the recipient’s online status in the Redis cache.

9a. If Online, Send via WebSocket: If the recipient is online, the message is sent to the Socket.io Cluster for real-time delivery.

9b. If Offline, Send Push Notification: If the recipient is offline, a push notification is sent via the FCM/APNs Integration Service.

10a. Deliver Real-time Message: If the recipient is online, the Socket.io Cluster delivers the message in real-time.

10b. Send Push Notification: If the recipient is offline, a push notification is sent to their device.

11. Fetch Message: When the recipient opens the app (either from the real-time notification or push notification), their client sends a request to fetch the message.

12. Route Request: The load balancer routes the fetch request to an available Nginx web server.

13. Forward to App Server: The web server forwards the fetch request to a Node.js API server.

14. Retrieve Message: The app server retrieves the message from the Cassandra message database.

15. Send Message to Recipient: The app server sends the message content to the recipient’s client.

Real-life solutions for each component:

  • Mobile App: Native iOS (Swift) and Android (Kotlin) apps
  • Web App: React.js for a responsive single-page application
  • API Gateway: NGINX for load balancing and reverse proxy
  • Authentication Service: OAuth 2.0 with JWT for secure authentication
  • User Service: Node.js for efficient handling of user-related operations
  • Messaging Service: Go for high-performance message processing
  • Group Service: Java Spring for robust group management
  • Notification Service: Python for flexible notification handling
  • Presence Service: Node.js for real-time user status updates
  • Media Service: Python with libraries like Pillow for image processing
  • Databases: PostgreSQL (users), Cassandra (messages), MongoDB (groups)
  • Cache: Redis for fast data retrieval
  • File Storage: Amazon S3 for scalable object storage
  • Message Queue: Apache Kafka for reliable, high-throughput message streaming
  • Push Notification: Firebase Cloud Messaging for cross-platform notifications
  • CDN: Cloudflare for global content delivery

System Functionality

  • 1:1 Messaging: Users can send text, images, videos, and files directly to other users.
  • Group Messaging: Users can create groups, add/remove members, and send messages to all group members simultaneously.
  • Notifications: Users receive push notifications for new messages when the app is in the background.

Functional Requirements

For seasoned architects: The system is designed as a distributed microservices architecture to handle the core functionalities:

  • 1:1 messaging is managed by the Messaging Service, which handles message creation, storage, and retrieval.
  • Group messaging is handled by a combination of the Group Service (for group management) and the Messaging Service (for message distribution).
  • Notification delivery is managed by the Notification Service in conjunction with external push notification services.

For junior developers: Think of the system as a set of specialized workers, each responsible for a specific task:

  • The Messaging Service is like a postal worker, handling sending and receiving messages.
  • The Group Service is like a club organizer, managing who’s in what group.
  • The Notification Service is like a town crier, announcing new messages to users.

Non-Functional Requirements

Scalability:

  • System should handle 10x growth in users without major architectural changes.
  • The microservices architecture allows horizontal scaling of individual components. We use database sharding and caching to manage data growth.
  • NoSQL databases generally offer better horizontal scalability.
  • Some NewSQL databases provide both SQL features and horizontal scalability.

Availability:

  • 99.99% uptime (about 52.6 minutes of downtime per year).
  • We use multi-region deployment with active-active setup and implement circuit breakers to prevent cascading failures.
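
As a quick check on what an availability target allows, the yearly downtime budget can be computed directly. The helper name `downtime_budget_minutes` is made up for this example, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


print(round(downtime_budget_minutes(99.99), 2))  # 52.56
print(round(downtime_budget_minutes(99.9), 1))   # 525.6
```

Each additional nine cuts the budget by a factor of ten, which is why the step from 99.9% to 99.99% usually requires multi-region redundancy rather than faster repairs alone.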

Latency:

  • Message delivery latency should be under 100 ms for 99% of messages (the 99th percentile).
  • Real-time messaging needs low-latency databases.
  • In-memory databases or caching layers can help reduce latency.

Performance:

  • We implement a distributed caching layer (e.g., Redis) to reduce database load and message queues (e.g., Kafka) for asynchronous processing.

Security:

  • End-to-end encryption for all messages, TLS for all client-server communication, robust authentication mechanisms, and data privacy compliance.

Reliability:

  • We implement retry mechanisms, idempotent APIs, and use eventual consistency where appropriate.
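
Retries and idempotency work together: a retry is only safe if repeating the request cannot create a duplicate. The sketch below assumes a client-supplied idempotency key; `MessageStore`, `with_retries`, and `TransientError` are hypothetical names, and the dictionary stands in for the real message database.

```python
import time


class TransientError(Exception):
    """A failure worth retrying, such as a timeout or a lost acknowledgement."""


class MessageStore:
    """Stores each message at most once, keyed by a client-supplied idempotency key."""

    def __init__(self):
        self.by_key = {}

    def save(self, idempotency_key, content):
        # A repeated key returns the original record instead of creating a duplicate.
        return self.by_key.setdefault(idempotency_key, {"content": content})


def with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
```

If the first attempt stores the message but the acknowledgement is lost, the retry calls `save` with the same key and gets the existing record back, so the recipient never sees the message twice.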

Maintainability:

  • Microservices architecture for independent scaling and updates of components.

Compliance and Data Governance:

  • Some databases offer better support for data encryption, auditing, and compliance features.

For junior developers:

  • Scalability: It’s like adding more cashiers at a busy store. We can add more servers to handle more users.
  • Performance: We use caches (like a notepad for quick reference) and message queues (like a to-do list) to keep things fast.
  • Availability: We have multiple backup systems, like having spare tires in a car.
  • Security: We lock messages in a safe (encryption) and check IDs carefully (authentication).
  • Reliability: If something fails, we try again automatically, like redialing a phone number when the line is busy.

Back-of-the-Envelope Calculation

Assumptions for 1 million users:
- Average of 50 messages sent per user per day
- Average message size: 500 bytes (accounting for metadata)
- 10% of messages contain media (average 100 KB)

Daily calculations:
- Total messages: 1,000,000 * 50 = 50,000,000
- Text data: 50,000,000 * 90% * 500 bytes = 22.5 GB
- Media data: 50,000,000 * 10% * 100 KB = 500 GB
- Total daily data: 522.5 GB

Storage requirements:
- Assuming 30 days retention: 522.5 GB * 30 = 15,675 GB ≈ 15.7 TB

Network requirements:
- Peak rate (assuming 5x average): (522.5 GB * 5) / 86400 seconds ≈ 30 MB/s

Compute resources:
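- Assuming roughly 3,000 requests/second at peak (50,000,000 messages / 86,400 seconds ≈ 579 per second on average, times the 5x peak factor):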
— Load Balancers: 2–4 high-performance instances
— Application Servers: 20–40 instances (depending on the efficiency of implementation)
— Database Servers: 10–20 instances (mix of read replicas and write nodes)
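
The arithmetic above is easy to re-run with different assumptions when written as a short script. The constants mirror the stated assumptions (decimal units, 1 GB = 10^9 bytes); the variable names are my own.

```python
USERS = 1_000_000
MESSAGES_PER_USER_PER_DAY = 50
TEXT_BYTES = 500          # average text message, including metadata
MEDIA_BYTES = 100_000     # average media message (100 KB)
MEDIA_PERCENT = 10
RETENTION_DAYS = 30
PEAK_FACTOR = 5
SECONDS_PER_DAY = 86_400

messages = USERS * MESSAGES_PER_USER_PER_DAY
media_messages = messages * MEDIA_PERCENT // 100
text_messages = messages - media_messages

text_gb = text_messages * TEXT_BYTES / 1e9
media_gb = media_messages * MEDIA_BYTES / 1e9
daily_gb = text_gb + media_gb
storage_tb = daily_gb * RETENTION_DAYS / 1e3
peak_mb_per_s = daily_gb * 1e3 * PEAK_FACTOR / SECONDS_PER_DAY
peak_msgs_per_s = messages * PEAK_FACTOR / SECONDS_PER_DAY

print(f"daily data: {daily_gb:.1f} GB")               # 522.5 GB
print(f"30-day storage: {storage_tb:.3f} TB")         # 15.675 TB
print(f"peak bandwidth: {peak_mb_per_s:.1f} MB/s")    # 30.2 MB/s
print(f"peak message rate: {peak_msgs_per_s:.0f}/s")  # 2894/s
```

Media dominates the total: at 10% of messages it accounts for about 96% of the bytes, so the media size assumption is the one worth validating first.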

API Design

Key API endpoints:

POST /api/v1/messages
GET /api/v1/messages/{messageId}
POST /api/v1/groups
POST /api/v1/groups/{groupId}/messages
PUT /api/v1/users/{userId}/status
POST /api/v1/notifications/register

Authentication

  • JWT-based authentication with refresh tokens
POST /api/v1/auth/login
POST /api/v1/auth/logout
POST /api/v1/auth/refresh-token

User Management

GET /api/v1/users/{userId}
PUT /api/v1/users/{userId}
GET /api/v1/users/{userId}/contacts
POST /api/v1/users/{userId}/contacts

Rate limiting: Implement token bucket algorithm at the API Gateway level
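
A minimal token bucket might look like the sketch below. The `TokenBucket` class and its parameters are illustrative; at an API gateway there would be one bucket per user or API key, typically kept in a shared store such as Redis rather than in process memory.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated_at = clock()

    def allow(self, cost=1):
        """Return True and consume tokens if the request fits, else False."""
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill lazily based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A rejected request would normally be answered with HTTP 429 (Too Many Requests), which fits the status-code-based error handling used by the rest of the API.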

Error handling: Use standard HTTP status codes with detailed error messages

Messaging /1:1 Chat API:

POST /api/v1/messages
GET /api/v1/messages/{chatId}
PUT /api/v1/messages/{messageId}/status
POST /api/v1/messages
{
"sender_id": "user123",
"recipient_id": "user456",
"content": "Hello!",
"type": "text"
}
Response: 201 Created
{
"message_id": "msg789",
"timestamp": "2023-05-20T10:30:00Z"
}

Group Messaging API:

POST /api/v1/groups
GET /api/v1/groups/{groupId}
POST /api/v1/groups/{groupId}/members
DELETE /api/v1/groups/{groupId}/members/{userId}
POST /api/v1/groups/{groupId}/messages
{
"sender_id": "user123",
"content": "Hello group!",
"type": "text"
}
Response: 201 Created
{
"message_id": "msg790",
"timestamp": "2023-05-20T10:31:00Z"
}

Notification API:

PUT /api/v1/users/{userId}/notification-settings
POST /api/v1/notifications/register-device
POST /api/v1/notifications/register
{
"user_id": "user123",
"device_token": "fcm_token_123",
"platform": "android"
}
Response: 200 OK
{
"status": "registered"
}

Database Schema Design

-- Users table (PostgreSQL):
CREATE TABLE users (
user_id UUID PRIMARY KEY,
phone_number VARCHAR(15) UNIQUE,
username VARCHAR(50),
status VARCHAR(100),
last_seen TIMESTAMP
);
-- Messages table (Cassandra):

CREATE TABLE messages (
chat_id UUID,
message_id TIMEUUID,
sender_id UUID,
content TEXT,
sent_at TIMESTAMP,
PRIMARY KEY ((chat_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

or

-- Messages table (relational SQL alternative):
CREATE TABLE Messages (
message_id UUID PRIMARY KEY,
sender_id UUID REFERENCES Users(user_id),
receiver_id UUID REFERENCES Users(user_id),
content TEXT,
media_url VARCHAR(255),
sent_at TIMESTAMP,
delivered_at TIMESTAMP,
read_at TIMESTAMP,
chat_id UUID
);

Group Tables

CREATE TABLE Groups (
group_id UUID PRIMARY KEY,
name VARCHAR(100),
description TEXT,
created_at TIMESTAMP,
created_by UUID REFERENCES Users(user_id)
);

Group members tables

CREATE TABLE GroupMembers (
group_id UUID REFERENCES Groups(group_id),
user_id UUID REFERENCES Users(user_id),
joined_at TIMESTAMP,
role VARCHAR(20),
PRIMARY KEY (group_id, user_id)
);

Notification tables

CREATE TABLE Notifications (
notification_id UUID PRIMARY KEY,
user_id UUID REFERENCES Users(user_id),
content TEXT,
created_at TIMESTAMP,
read_at TIMESTAMP
);

Database type rationale:

  • PostgreSQL (Users): ACID compliance for critical user data, complex querying capabilities.
  • Cassandra (Messages): High write throughput, time-series data model suits message storage.
  • MongoDB (Groups): Flexible schema for varying group structures, good for frequent reads.
  • Redis (Cache): In-memory data structure store for fast data retrieval and real-time features such as presence information.
  • Amazon S3 or Google Cloud Storage (Media): Object storage for media files.

Database Selection Guide for Messaging Systems

When choosing a database for a messaging system like WhatsApp, consider the following factors:

Data Model:

  • Relational (SQL) databases are suitable for structured data with complex relationships.
  • NoSQL databases are better for unstructured or semi-structured data and horizontal scaling.

Scalability:

  • NoSQL databases generally offer better horizontal scalability.
  • Some NewSQL databases provide both SQL features and horizontal scalability.

Consistency vs. Availability:

  • Choose based on CAP theorem trade-offs.
  • For messaging, often prefer availability and partition tolerance over strict consistency.

Read/Write Patterns:

  • High-volume messaging systems typically have write-heavy workloads.
  • Consider databases optimized for write operations.

Data Size and Growth Rate:

  • Estimate data volume and growth to choose a database that can handle the load.

Latency Requirements:

  • Real-time messaging needs low-latency databases.
  • In-memory databases or caching layers can help reduce latency.

Search Capabilities:

  • If full-text search is needed, consider databases with built-in search or integrate a search engine.

Compliance and Data Governance:

  • Some databases offer better support for data encryption, auditing, and compliance features.

Technology Stack Justification

Programming Languages:

  • Go (Messaging Service): High concurrency support, fast execution for message handling.
  • Node.js (User and Presence Services): Asynchronous I/O for handling many concurrent connections, which suits scalable, real-time application logic.
  • Java Spring (Group Service): Robust framework for complex business logic in group management.
  • Python (Notification and Media Services): Rich ecosystem of libraries for various tasks.

Frameworks:

  • React.js (Web App): Component-based architecture for building interactive UIs.
  • Express.js (Node.js services): Minimalist web framework for building APIs quickly.

Databases:

  • PostgreSQL: ACID compliance, complex querying for user data.
  • Cassandra: Distributed nature, high write throughput for message storage.
  • MongoDB: Flexible schema for group data, good read performance.
  • Redis: In-memory data structure for caching and real-time features.

Messaging Protocols:

  • WebSocket: Full-duplex communication for real-time messaging (for example, via Socket.io).
  • MQTT: Lightweight publish-subscribe protocol for mobile notifications.
  • Message queues (Kafka or RabbitMQ): Asynchronous processing between services.

Infrastructure:

  • Docker: Containerization for consistent development and deployment environments.
  • Kubernetes: Container orchestration for managing microservices at scale.
  • AWS/GCP: Cloud infrastructure for scalability and managed services.

This technology stack provides a balance of performance, scalability, and developer productivity, crucial for a complex system like a WhatsApp-like messaging platform.

Load Balancing

  • Layer 7 load balancing using NGINX or HAProxy
  • Consistent hashing for distributing WebSocket connections
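
A minimal consistent-hash ring for pinning users to WebSocket servers is sketched below. The `HashRing` class, the server names, and the choice of 100 virtual nodes per server are assumptions for illustration, not details taken from any particular load balancer.

```python
import bisect
import hashlib


class HashRing:
    """Maps keys to nodes so that adding or removing a node moves few keys."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes  # virtual nodes per server smooth out the distribution
        self._ring = []       # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def node_for(self, key):
        """Return the first node clockwise from the key's position on the ring."""
        if not self._ring:
            raise LookupError("ring is empty")
        index = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["ws-1", "ws-2", "ws-3"])
server = ring.node_for("user123")  # the same user always maps to the same server
```

When a server is removed, only the users that were mapped to it are reassigned; everyone else keeps their existing connection target, which is the property that makes this attractive for long-lived WebSocket sessions.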

CDN

  • Content Delivery Network (CDN) for media delivery

Search capability

  • Elasticsearch for message search functionality