Rate Limiter

Tarun Jain
15 min read · Mar 15, 2024


Rate Limiter?

  • First line of defense
  • Any incoming request is first checked against the rate limiter
  • If we are under the limit, the request goes through; otherwise it is rejected with an error (HTTP 429 Too Many Requests) — see the sketch below
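
As a tiny illustration of that flow (the RateLimiter interface and handler names here are assumptions, not tied to any particular framework; the interface with tryAcquire() appears in the leaky bucket example below):

// Hypothetical request-handling flow around a rate limiter.
public class RateLimitedHandler {

    private final RateLimiter rateLimiter; // e.g., one of the implementations later in this post

    public RateLimitedHandler(RateLimiter rateLimiter) {
        this.rateLimiter = rateLimiter;
    }

    // Returns the HTTP status code for a request: 200 if it was processed, 429 if it was throttled.
    public int handle(String request) {
        if (!rateLimiter.tryAcquire()) {
            return 429; // Too Many Requests
        }
        process(request);
        return 200;
    }

    private void process(String request) {
        // Real request handling would go here.
    }
}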

Benefits of Rate Limiting

  • Prevent resource starvation — e.g., due to DDoS attacks
  • Reduce cost — limit cost overruns by preventing the overuse of a resource and limiting outbound paid requests
  • Prevent overload

Core Concepts of Rate Limiting

  • Identifier — a unique attribute that differentiates individual callers (IP address, userId, etc.)
  • Window — the time period over which the limit applies
  • Limit — the ceiling on allowable requests or actions within the designated time span

Leaky Bucket

📺Leaky Bucket Algorithm | Rate Limiting | System Design

The classic faucet-and-bucket picture illustrates the leaky bucket algorithm well.

  • The leaky bucket algorithm is usually pictured in the context of traffic shaping.
  • If we map it to our use case of limiting requests to a server, the water drops from the faucet are the requests, the bucket is the request queue, and the drops leaking out of the bucket are the responses. Just as water dripping into a full bucket overflows, requests that arrive after the queue is full are rejected.
  • In the leaky bucket algorithm, requests are processed at an approximately constant rate, which smooths out bursts of requests. Even though the incoming requests can be bursty, the outgoing responses always leave at the same rate.

A simple implementation for demonstration purposes:

//Credits - https://medium.com/@devenchan/implementing-rate-limiting-in-java-from-scratch-leaky-bucket-and-tokenn-bucket-implementation-63a944ba93aa
import java.util.concurrent.atomic.AtomicLong;

public class LeakyBucketRateLimiter implements RateLimiter {

    // The maximum capacity of the bucket. Once water reaches this level, further requests are rejected.
    private final long threshold;
    // Time unit for measuring the leak rate, set to one second (1000 milliseconds).
    private final long windowUnit = 1000;
    // Current level of water in the bucket, managed atomically to ensure thread safety.
    private final AtomicLong water = new AtomicLong(0);
    // Timestamp of the last leak calculation, used to determine how much water has leaked over time.
    private volatile long lastLeakTimestamp;

    /**
     * Constructs a LeakyBucketRateLimiter with a specified threshold.
     *
     * @param threshold the maximum number of requests that can be handled in the time window.
     */
    public LeakyBucketRateLimiter(long threshold) {
        this.threshold = threshold;
        this.lastLeakTimestamp = System.currentTimeMillis();
    }

    /**
     * Tries to acquire a permit for a request based on the current state of the bucket.
     *
     * @return true if the request can be accommodated (water level is below threshold), false otherwise.
     */
    @Override
    public boolean tryAcquire() {
        long currentTime = System.currentTimeMillis();
        // Calculate the amount of water that has leaked since the last check.
        long leakedAmount = ((currentTime - lastLeakTimestamp) / windowUnit) * threshold;

        // If any water has leaked, reduce the water level accordingly.
        if (leakedAmount > 0) {
            water.addAndGet(-leakedAmount);
            lastLeakTimestamp = currentTime;
        }

        // Ensure the water level does not go below zero.
        if (water.get() < 0) {
            water.set(0);
        }

        // If the bucket is not full, increment the water level and allow the request.
        if (water.get() < threshold) {
            water.getAndIncrement();
            return true;
        }

        // If the bucket is full, reject the request.
        return false;
    }
}
public class LeakyBucket extends RateLimiter {

    private long nextAllowedTime;

    private final long REQUEST_INTERVAL_MILLIS;

    protected LeakyBucket(int maxRequestPerSec) {
        super(maxRequestPerSec);
        REQUEST_INTERVAL_MILLIS = 1000 / maxRequestPerSec;
        nextAllowedTime = System.currentTimeMillis();
    }

    @Override
    boolean allow() {
        long curTime = System.currentTimeMillis();
        synchronized (this) {
            if (curTime >= nextAllowedTime) {
                nextAllowedTime = curTime + REQUEST_INTERVAL_MILLIS;
                return true;
            }
            return false;
        }
    }
}
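
Note that the two snippets above come from different sources: the first implements a RateLimiter interface exposing tryAcquire(), while the second — and the FixedWindowCounter, SlidingWindowLog, and SlidingWindow classes later in this post — extends an abstract RateLimiter base class that is never shown. A minimal sketch of that base class, reconstructed here as an assumption so the later examples read as complete:

// Assumed base class for the window-based examples (not shown in the original sources).
public abstract class RateLimiter {

    // Maximum number of requests allowed per second, shared by all subclasses.
    protected final int maxRequestPerSec;

    protected RateLimiter(int maxRequestPerSec) {
        this.maxRequestPerSec = maxRequestPerSec;
    }

    // Returns true if the current request is within the limit, false if it should be throttled.
    abstract boolean allow();
}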

Token Bucket

https://www.geeksforgeeks.org/token-bucket-algorithm/

implementation in Java

Source — https://youtu.be/FU4WlwfS3G0?si=wAo_lp1TdkE5yoVB
  • In a cluster environment where hosts must coordinate to implement effective rate limiting, the process involves several key components.
  • Initially, a job scheduler retrieves rules from a remote service and creates token buckets stored in a cache. When a client request is received, a unique identifier is generated, allowing the system to access the appropriate token bucket. The system then checks if the request can be allowed based on the tokens available.
Source — https://youtu.be/FU4WlwfS3G0?si=wAo_lp1TdkE5yoVB

Simple solution with one host (non-distributed); a minimal sketch follows below.

Source — https://youtu.be/FU4WlwfS3G0?si=wAo_lp1TdkE5yoVB
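
The linked video explains the token bucket verbally rather than in code, so here is a minimal single-host token bucket sketch in the spirit of the other examples (the class name, capacity, and refill behavior are assumptions, not taken from the video):

// Minimal single-host token bucket: up to `capacity` tokens, refilled at refillRatePerSec.
public class TokenBucketRateLimiter {

    private final long capacity;           // maximum tokens the bucket can hold
    private final double refillRatePerSec; // tokens added per second
    private double tokens;                 // current token count
    private long lastRefillTimestamp;      // last refill time, in milliseconds

    public TokenBucketRateLimiter(long capacity, double refillRatePerSec) {
        this.capacity = capacity;
        this.refillRatePerSec = refillRatePerSec;
        this.tokens = capacity;
        this.lastRefillTimestamp = System.currentTimeMillis();
    }

    public synchronized boolean tryAcquire() {
        refill();
        if (tokens >= 1) {
            tokens -= 1;   // consume one token for this request
            return true;
        }
        return false;      // bucket is empty, throttle the request
    }

    private void refill() {
        long now = System.currentTimeMillis();
        double tokensToAdd = (now - lastRefillTimestamp) / 1000.0 * refillRatePerSec;
        tokens = Math.min(capacity, tokens + tokensToAdd);
        lastRefillTimestamp = now;
    }
}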

Fixed window rate limiter

  • From an implementation perspective, one advantage of this approach is that, unlike the token bucket where we have to lock the bucket while taking tokens from it, we can use an atomic operation to increment the counter in each window, which makes the code lock-free.
  • The fixed window counter algorithm only guarantees the average rate within each window, not across windows.
  • For example, if the expected rate is 2 per minute and there are 2 requests at 00:00:58 and 00:00:59 as well as 2 requests at 00:01:01 and 00:01:02, then the rate of both windows [00:00, 00:01) and [00:01, 00:02) is 2 per minute. But the rate of the window [00:00:30, 00:01:30) is in fact 4 per minute.
Source — https://hechao.li/img/fixed-window-counter.png
//Source - https://hechao.li/2018/06/25/Rate-Limiter-Part1/
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class FixedWindowCounter extends RateLimiter {

    // TODO: Clean up stale entries
    private final ConcurrentMap<Long, AtomicInteger> windows = new ConcurrentHashMap<>();

    protected FixedWindowCounter(int maxRequestPerSec) {
        super(maxRequestPerSec);
    }

    @Override
    boolean allow() {
        // Truncate the current time to the start of the 1-second window.
        long windowKey = System.currentTimeMillis() / 1000 * 1000;
        windows.putIfAbsent(windowKey, new AtomicInteger(0));
        return windows.get(windowKey).incrementAndGet() <= maxRequestPerSec;
    }
}

/* Cleaning up of stale entries is omitted in the above code.
We can run a job to clean up stale windows regularly.
For instance, schedule a task at 00:00:00 to remove all entries
created on the previous day. */

The same fixed window counter can also be implemented using Redis, as sketched below.
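
A minimal sketch of the Redis variant, assuming the Jedis client; the key layout (one key per client per window, incremented atomically and left to expire) is an assumption, not from the original post:

import redis.clients.jedis.Jedis;

// Fixed window counter backed by Redis. Each (clientId, window) pair gets its own key,
// which Redis increments atomically and expires once the window is no longer needed.
public class RedisFixedWindowCounter {

    private final Jedis jedis;
    private final int maxRequestPerSec;

    public RedisFixedWindowCounter(Jedis jedis, int maxRequestPerSec) {
        this.jedis = jedis;
        this.maxRequestPerSec = maxRequestPerSec;
    }

    public boolean allow(String clientId) {
        long window = System.currentTimeMillis() / 1000; // current 1-second window
        String key = "rate:" + clientId + ":" + window;

        long count = jedis.incr(key);  // atomic increment; creates the key at 1 if absent
        if (count == 1) {
            jedis.expire(key, 2);      // let stale window keys expire on their own
        }
        return count <= maxRequestPerSec;
    }
}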

Sliding window log

  • Sliding window log algorithm keeps a log of request timestamps for each user.
  • When a request comes in, we first pop all outdated timestamps before appending the new request time to the log. Then we decide whether the request should be processed based on whether the log size exceeds the limit. For example, suppose the rate limit is 2 requests per minute: a third request arriving within the same minute would find 2 timestamps still in the log and be rejected.
//Source https://hechao.li/2018/06/25/Rate-Limiter-Part1/

import java.util.LinkedList;
import java.util.Queue;

public class SlidingWindowLog extends RateLimiter {

    private final Queue<Long> log = new LinkedList<>();

    protected SlidingWindowLog(int maxRequestPerSec) {
        super(maxRequestPerSec);
    }

    @Override
    boolean allow() {
        long curTime = System.currentTimeMillis();
        // Anything older than one second falls outside the sliding window.
        long boundary = curTime - 1000;
        synchronized (log) { // in practice, Redis's sorted set and ZREMRANGEBYSCORE can replace this lock
            // Evict timestamps that have slid out of the window.
            while (!log.isEmpty() && log.element() <= boundary) {
                log.poll();
            }
            log.add(curTime);
            return log.size() <= maxRequestPerSec;
        }
    }
}

/* Though in the above implementation we use a lock while performing operations
on the log, in practice Redis's sorted set and the ZREMRANGEBYSCORE
command can provide atomic operations to accomplish this. */

advantage of sliding window log over fixed window counter

  • Instead of only ensuring an average rate within each window, it provides a more accurate rate limit because the window boundary is dynamic instead of fixed.
  • For example, if the time unit is 1 minute, the fixed window counter guarantees an average rate in the windows [00:00, 00:01), [00:01, 00:02), etc. But the sliding window log guarantees that for each request arriving at time t, the rate in the window (t - 1, t] won’t exceed the limit.

disadvantage

  • The main disadvantage of this approach is its memory footprint: even if a request is rejected, its timestamp is still recorded in the log, making the log grow without bound.

Sliding window (counter)

📺Video: Sliding window counter Intro
📺Video: Sliding window counter Implementation

  • Sliding window counter is similar to fixed window counter, but it smooths out bursts of traffic by adding a weighted count of the previous window to the count in the current window.
  • For example, suppose the limit is 10 per minute. There are 9 requests in window [00:00, 00:01) and 5 requests in window [00:01, 00:02).
  • For a request arriving at 00:01:15, which is at the 25% position of window [00:01, 00:02), we calculate the request count with the formula: 9 × (1 − 25%) + 5 = 11.75 > 10.
  • Thus we reject this request. Even though neither window exceeds the limit on its own, the request is rejected because the weighted sum of the previous and current windows does exceed the limit.
  • This is still not perfectly accurate because it assumes that the distribution of requests in the previous window is even, which may not be true.
  • But compared to the fixed window counter, which only guarantees the rate within each window, and the sliding window log, which has a huge memory footprint, the sliding window counter is more practical.

A simple implementation for demonstration purposes:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SlidingWindow extends RateLimiter {

    // TODO: Clean up stale entries
    private final ConcurrentMap<Long, AtomicInteger> windows = new ConcurrentHashMap<>();

    protected SlidingWindow(int maxRequestPerSec) {
        super(maxRequestPerSec);
    }

    @Override
    boolean allow() {
        long curTime = System.currentTimeMillis();
        long curWindowKey = curTime / 1000 * 1000;
        windows.putIfAbsent(curWindowKey, new AtomicInteger(0));
        long preWindowKey = curWindowKey - 1000;
        AtomicInteger preCount = windows.get(preWindowKey);
        if (preCount == null) {
            // No previous window: fall back to plain fixed window counting.
            return windows.get(curWindowKey).incrementAndGet() <= maxRequestPerSec;
        }

        // Weight the previous window by how much of it still overlaps the sliding window.
        double preWeight = 1 - (curTime - curWindowKey) / 1000.0;
        long count = (long) (preCount.get() * preWeight
                + windows.get(curWindowKey).incrementAndGet());
        return count <= maxRequestPerSec;
    }
}
Source — https://youtu.be/gVVDo2h6DwA?si=qp8LQkLHYOZrTx6q

Distributed Rate limiter

Communication among different hosts

[The key question is how each host will share token information.]

Transitioning to a distributed setting, the challenge is how to share token information among multiple hosts.

  • For instance, in a cluster with three hosts aiming to permit four requests per second per client, each host’s bucket should contain four tokens. This is crucial because requests for the same client may not be evenly spread across hosts.

Effective communication among hosts is crucial in distributed systems, particularly for implementing rate limiting solutions.

  1. The initial method discussed is full mesh communication, where each host is aware of and shares information with every other host in the cluster. However, the scalability of this approach is limited, as the number of messages increases quadratically with the number of hosts.
  2. To address this, gossip protocols present a more efficient alternative, mimicking the way epidemics spread by having hosts randomly select peers to share data.
  3. Another option involves using a distributed cache, like Redis, allowing for independent scaling and sharing among service teams.
  4. Additionally, a coordination service can streamline communication by designating a leader to aggregate information, significantly reducing the message load.
  5. Although maintaining a reliable coordination service poses challenges, simpler leader-election algorithms may lead to multiple leaders, which can still function effectively without significant disruption.
Source — https://youtu.be/FU4WlwfS3G0?si=UdfFIpBJqBA4_Xf8

Communication Protocols

  • TCP — guarantees delivery of all data, in order.
  • UDP — no guarantee that all data is delivered, or that it arrives in order.

For rate limiting that is more accurate but carries a performance overhead — TCP

For a less accurate but faster solution — UDP

How to integrate rate limiting service

There are two primary integration strategies for implementing a rate limiter within a service.

  1. The first option is to integrate the rate limiter directly into the service as a collection of classes or a library. This is faster, since it avoids inter-process communication, and more resilient, since there are no such calls to fail.
  2. The second option runs the rate limiter as a separate daemon process and requires two libraries: one for the daemon itself and another for the client that handles communication between the service and the daemon. This approach offers programming-language flexibility, since the rate limiter daemon can be written in a different language from the service.
    The daemon-based rate limiter operates in its own memory space, which isolates the service from the daemon and makes memory allocation more predictable, because the service process does not have to hold a large number of buckets in memory.
    Service teams are often cautious about integrating new libraries, raising questions about memory consumption, CPU usage, and behavior during network failures. These concerns reflect a broader worry about keeping the service reliable despite any bugs in the rate limiter library.

Ultimately, while each approach has its strengths and weaknesses, the choice between them hinges on specific use cases and the particular needs of the service team, with the daemon-based option facilitating communication with other hosts.

https://youtu.be/FU4WlwfS3G0?si=srl-p1GbUEmYfwHm
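
To make the two integration options concrete, here is a hypothetical sketch (the interface and class names are illustrative, not from the video); the service calls the same method whether the limiter runs in-process as a library or out-of-process as a daemon:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical integration interface; names are illustrative only.
public interface RateLimiterClient {
    boolean allowRequest(String clientId);
}

// Option 1: in-process library. Fast (no inter-process call), but shares the service's memory and CPU.
class InProcessRateLimiter implements RateLimiterClient {
    // One bucket per client, created lazily (reuses the TokenBucketRateLimiter sketch from above).
    private final ConcurrentMap<String, TokenBucketRateLimiter> buckets = new ConcurrentHashMap<>();

    @Override
    public boolean allowRequest(String clientId) {
        return buckets
                .computeIfAbsent(clientId, id -> new TokenBucketRateLimiter(10, 10))
                .tryAcquire();
    }
}

// Option 2: thin client that talks to a local rate limiter daemon (e.g., over a UNIX socket or localhost port).
// On any communication failure it fails open, so a broken daemon never blocks traffic.
class DaemonRateLimiterClient implements RateLimiterClient {
    @Override
    public boolean allowRequest(String clientId) {
        try {
            return askDaemon(clientId); // IPC call to the daemon process
        } catch (Exception e) {
            return true; // fail open
        }
    }

    private boolean askDaemon(String clientId) throws Exception {
        // Transport details (UNIX socket, localhost TCP, shared memory, ...) are omitted in this sketch.
        return true;
    }
}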

Popularity of the daemon approach in distributed systems for tasks like auto-discovery of service hosts

  • This method is recognized for its ability to enhance communication among hosts within a cluster, facilitating tasks such as the auto-discovery of service hosts.
  • Such a capability allows individual hosts to identify one another seamlessly, which is crucial for maintaining efficient operations in distributed environments.
  • This pattern has gained popularity due to its adaptability to various use cases and the specific needs of service teams.

Functional Requirement

  • A client can send only a limited number of requests to a server within a time window
  • A client should get an error message if the defined request threshold is crossed, whether for a single server or across a combination of servers

boolean allowRequest(request)

Non-Functional Requirements:

  1. The system should be highly available since it protects our service from external attacks.
  2. Performance is an important factor for any system, so we need to make sure the rate limiter service does not add substantial latency.
  3. Scalable — supports an arbitrarily large number of clusters.
  4. Accurate
  5. Ease of integration

Back of the envelope memory calculation

  • Store a key-value pair per client

key is userId / IP address

value is timestamp + count
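
A rough, illustrative estimate (all numbers here are assumptions): a userId or IP key of about 8–16 bytes, an 8-byte timestamp plus a 4-byte counter for the value, and some hash-table overhead comes to roughly 40–60 bytes per entry. For 10 million active clients that is on the order of 400–600 MB, which fits comfortably in the memory of a single host.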

Relational DB vs. in-memory cache access time

Rate limiting is also available out of the box in the Google Guava library (the RateLimiter class).
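
For reference, a minimal usage sketch of Guava's RateLimiter, which implements a smooth permits-per-second model; the rate of 2 permits per second is just an example:

import com.google.common.util.concurrent.RateLimiter;

public class GuavaExample {
    public static void main(String[] args) {
        // Allow roughly 2 requests per second.
        RateLimiter limiter = RateLimiter.create(2.0);

        for (int i = 0; i < 5; i++) {
            if (limiter.tryAcquire()) {          // non-blocking check
                System.out.println("request " + i + " allowed");
            } else {
                System.out.println("request " + i + " throttled");
            }
        }

        limiter.acquire(); // blocking variant: waits until a permit is available
    }
}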

potential practical issues and failure modes

creation and removal of token buckets in memory based on client request activity.

  • Management of token buckets in memory, particularly in response to client requests: when numerous clients send simultaneous requests, many token buckets can be created, but a bucket does not need to be retained in memory if its client stops sending requests for a while.
  • Initially, when a client makes a request, a corresponding bucket is created and kept in memory as long as subsequent requests keep arriving within a short interval. If the gap between requests grows beyond several seconds, the bucket is removed and re-created on the client’s next request (see the sketch after this list).
  • Potential failure modes:
    → If a daemon fails, other hosts in the cluster may lose visibility of it, and the affected host disengages from the group while continuing to enforce throttling on its own.
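
A minimal sketch of this bucket lifecycle, using a ConcurrentHashMap plus a periodic cleanup task and reusing the TokenBucketRateLimiter sketch from the token bucket section; the idle timeout, per-client limit, and class names are assumptions for illustration:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Keeps one token bucket per client and evicts buckets that have been idle too long.
public class TokenBucketCache {

    private static final long IDLE_TIMEOUT_MILLIS = 10_000; // assumed idle timeout

    private static class Entry {
        final TokenBucketRateLimiter bucket;
        volatile long lastAccess;

        Entry(TokenBucketRateLimiter bucket) {
            this.bucket = bucket;
            this.lastAccess = System.currentTimeMillis();
        }
    }

    private final ConcurrentMap<String, Entry> buckets = new ConcurrentHashMap<>();
    private final ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();

    public TokenBucketCache() {
        // Periodically drop buckets that haven't been touched within the idle timeout.
        cleaner.scheduleAtFixedRate(this::evictIdle, 10, 10, TimeUnit.SECONDS);
    }

    public boolean allow(String clientId) {
        Entry entry = buckets.computeIfAbsent(clientId,
                id -> new Entry(new TokenBucketRateLimiter(10, 10))); // assumed per-client limit
        entry.lastAccess = System.currentTimeMillis();
        return entry.bucket.tryAcquire();
    }

    private void evictIdle() {
        long cutoff = System.currentTimeMillis() - IDLE_TIMEOUT_MILLIS;
        buckets.entrySet().removeIf(e -> e.getValue().lastAccess < cutoff);
    }
}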

failure modes such as daemon failure or network partitions and their impact on the rate limiting solution

Failure modes can significantly affect rate limiting solutions, particularly through daemon failures or network partitions.

  • When a daemon fails, other hosts in the cluster lose visibility of it; the affected host keeps throttling requests independently, using only local information, so fewer requests are throttled overall.
  • Similarly, during a network partition, some hosts may be unable to communicate with the rest of the cluster, so each host processes requests without coordination.
    → This can increase the total number of requests allowed, as demonstrated by the earlier example involving multiple hosts and tokens.
    → In essence, when communication between hosts is disrupted, more requests are permitted and throttling becomes less effective.
    → To address these challenges, a self-service tool for rule management may be needed to maintain control over request limits.

need for a self-service tool for rule management by service teams

  • In the context of a rate limiting solution, failures can lead to an increase in allowed requests while reducing the effectiveness of throttling.
  • This situation highlights the importance of implementing a self-service tool for rule management, enabling service teams to create, update, and delete their rules as necessary. Such a tool would empower teams to manage their rate limiting configurations efficiently, ensuring that they can respond swiftly to changing requirements.
  • Additionally, synchronization plays a crucial role in this system, particularly within the token bucket mechanism, where proper coordination is essential to maintain optimal performance and reliability.
  • This underscores the complexity of managing rate limiting in both local and distributed systems.

synchronization needs in the token bucket class and cache

  • Service teams have the flexibility to create, update, and delete their own rules as necessary, highlighting the dynamic nature of system design.
  • Within this context, synchronization becomes crucial, particularly in the implementation of the token bucket class.
  • To enhance thread-safety in this class, the use of atomic references is recommended.
  • Another significant aspect that necessitates synchronization is the token bucket cache.
  • When there is a need to manage the lifecycle of buckets — deleting unused ones and recreating them — synchronization issues may arise. To address these challenges, employing a concurrent hash map is suggested, which serves as a thread-safe alternative to the standard hash map in Java.
  • While concerns about synchronization bottlenecks are valid, they typically emerge only in scenarios with extraordinarily high request rates. For the majority of services, even basic synchronization strategies do not introduce substantial performance penalties.

How to handle throttled requests?

  • The client can queue them up and resend them later
  • Retry throttled requests with exponential backoff and jitter (see the sketch below)
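
A minimal client-side retry sketch with exponential backoff and full jitter; the base delay, cap, and attempt count are illustrative assumptions:

import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public class RetryWithBackoff {

    // Retries a call while it reports "throttled", waiting an exponentially growing,
    // randomly jittered delay between attempts (full jitter).
    public static boolean callWithRetry(Supplier<Boolean> call) throws InterruptedException {
        long baseDelayMillis = 100;   // assumed base delay
        long maxDelayMillis = 5_000;  // assumed cap on the backoff
        int maxAttempts = 5;          // assumed attempt budget

        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (call.get()) {
                return true; // request was allowed
            }
            long expDelay = Math.min(maxDelayMillis, baseDelayMillis * (1L << attempt));
            long sleep = ThreadLocalRandom.current().nextLong(expDelay + 1); // full jitter in [0, expDelay]
            Thread.sleep(sleep);
        }
        return false; // still throttled after all attempts
    }
}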

How to decide on rate limiter design?

# nodes < 1,000 and # active buckets ≤ 10,000 → gossip + UDP will be fast and accurate

# nodes > 10,000 → distributed cache
→ increased latency
→ increased operational cost

Rate limiting in AWS

API Gateway throttling — burst limit vs rate limit

API Gateway throttling is an implementation of the token bucket algorithm; the rate limit is expressed per second.

  • The burst limit defines the number of requests your API can handle concurrently. [bucket size]
  • The rate limit defines the number of allowed requests per second. [refill rate]

Concurrently means that requests run in parallel. Assuming that one request takes 10 ms, you could have 100 requests per second with a concurrency of 1 if they were all executed in series. But if they were all executed at the same moment, the concurrency would be 100. In both cases a rate limit of 100 would suffice. In the first case, a burst limit of 1 would allow all requests to succeed; in the second case, it would deny 99 requests.

The official documentation only mentions the Token bucket algorithm briefly.
