Bitly | TinyURL | URL shortener
--
· Benefits of Short URLs
· Questions/Function Requirement
· User stories
∘ Writes per second
∘ Scaling reads
∘ Storage Requirement
∘ Simple Design
∘ Short Identifiers — Requirement for short Identifier
∘ Auto-generated integers
∘ Can we use UUID?
∘ Can we use MD5?
∘ Can we use Base62?
∘ Can we use Base58?
∘ Database based base62 encoded shortened URL
∘ Integer to Custom Encoding
∘ Why 7 chars long?
∘ Problems with bigger code
∘ How to avoid very small short codes or variable length?
· Approaches to generate short code
∘ Pregenerate all 7 size keys
∘ Ticket server approach
∘ Snowflake approach
∘ Zookeeper approach
∘ Final
· Primary key can be integer
· SQL Database is a bad choice for short URLs
∘ Cache
· Load balancers
∘ Load Balancer Configuration
· Base62 encoding
· URL redirection
∘ Working TinyURL Code
· Custom Short URLs
· Things to consider
· Security Considerations
∘ Input validation
∘ Rate limiting
∘ Captcha
∘ HTTPS
∘ Monitoring and logging
· Monitoring and Logging
∘ Monitoring
∘ Logging
∘ Metrics to Monitor
∘ Alerting Rules
· Business Analytics
· TODO
· References
Benefits of Short URLs
- easier to type
- Don’t take much space when you share them — tweets, promotional images etc.
- analytics
Questions/Function Requirement
- How short should the URL code be? — as short as we can
- Traffic
→ How many new URLs will be added per month? — 500M (100:1 read:write ratio)
→ For how long the URLs should be kept? — 5 years
→ Total number of URLs? → 500M*12*5 = 30 Billion URLs
→ 30 billion * 1 KB ≈ 30 TB (an upper bound; the Storage Requirement section below assumes ~100 bytes per URL, giving ~3 TB)
- Link expiration time?
- What characters can we use? [0–9 a-z A-Z]
User stories
Writes per second
[1 million requests per day ≈ 12 requests per second]
- 500M URLs shortened per month => ~600M/30
- 20M per day => ~24M/24
- 1M per hour => ~1.2M/60
- 20K per minute => ~18k / 60
- 300 URLs per second
- x10 peak? => ~3K URLs per second
=> writes are within the limits of a single database instance
=> we will still need multiple application servers to handle the peak load
Scaling reads
single instance of Db can handle — 10k rps
single instance of redis can handle — 100k rps
- 300 writes per second
- 30k reads per second (100:1 reads to writes)
- x10 peaks are possible => 300K rps at peak
- scale out the database [use a LB between servers and database if replicas change frequently; otherwise the server can decide which replica to use]
- use Redis, but it is limited by RAM (we would need ~50 machines with 64 GB RAM to hold 3 TB of data)
Storage Requirement
- 30 billion URLs
- Each URL 100 bytes (URL, short identifier and other info) on average => 3 Trillion bytes => 3 Billion KB => 3 Million MB => 3000 GB => 3 TB
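As a quick sanity check, here is a minimal back-of-the-envelope sketch of the numbers above (500M URLs/month, 100:1 read:write ratio, ~100 bytes per record and 5-year retention are the figures stated in these sections; the rounding is deliberately generous):

# back-of-the-envelope estimates for writes, reads and storage
new_urls_per_month = 500_000_000
writes_per_second  = new_urls_per_month / (30 * 24 * 3600)   # ~190, round up to ~300 for headroom
reads_per_second   = 100 * writes_per_second                 # 100:1 read:write => ~20-30K
total_urls         = new_urls_per_month * 12 * 5              # 30 billion over 5 years
storage_bytes      = total_urls * 100                         # ~100 bytes per record => ~3 TB

print(f"writes/s ~ {writes_per_second:,.0f}, reads/s ~ {reads_per_second:,.0f}")
print(f"total URLs = {total_urls:,}, storage ~ {storage_bytes / 1e12:.1f} TB")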
Simple Design
- 300 writes per second is ok for single DB instance
- 3 TB will also fit into a single database.
Short Identifiers — Requirement for short Identifier
- As short as possible, but not too short (3–4 characters would be too easy to guess)
- Random
- Human readable [avoid ambiguous characters like o and 0, and symbols like / and $]
Auto-generated incrementing integers
- not random
- can be very short (variable length)
- sequential, so someone can scrape/enumerate all URLs
Can we use UUID?
- A Universally Unique Identifier (UUID) is 128 bits long and is represented in hexadecimal as 32 characters
- substring(Base58(UUID(URL)), 8)
- Hexadecimal digits: 0–9 and A–F
- 16³² = (2⁴)³² = 2¹²⁸ => too long to type
- Short ❌
- Random ✅
- Human readable ✅
Can we use MD5?
- Size: 2¹²⁸, same as UUID
substring(Base58(Md5(URL)), 8)
- Not random! MD5 is deterministic — the same URL always produces the same hash
- Short ❌
- Random ❌
- Human readable ✅
Can we use Base62?
- problem is that some characters look very similar (o and 0)
- All English characters = 52(a-zA-Z)+10 number = 62
- UUID = 32 character
- ❌ Base62(UUID) = 22 characters [we only need 6–7 characters, since 62⁶ ≈ 56 billion; 22 is unnecessarily long]
- 62⁵ = 916 Million
- 62⁶ = 56 Billion
- 62⁷ = 3 Trillion
- 62⁸ = 218 Trillion
- Short ✅
- Random ✅
- Human readable ❌
Can we use Base58?
- All Base62 characters minus zero (0), capital I, capital O and lowercase l
- 58⁶ = 38 Billion
- 58⁷ = 2 Trillion
- 58⁸ = 128 Trillion
- approach # 1 — substring(base58(UUID), 8) — see the runnable sketch after this list
- approach # 2 —
result = []
for _ in range(8):                                   # pick 8 random characters
    result.append(ALPHABET[random.randint(0, 57)])   # ALPHABET = the 58-char Base58 alphabet
short_code = "".join(result)
- Short ✅
- Random ✅
- Human readable ✅
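A rough sketch of approach #1 in Python, assuming the standard Base58 alphabet (the base58_encode helper below is illustrative, not a specific library):

import uuid

# Base58 alphabet: 0, O, I and l are removed for readability
BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_encode(num: int) -> str:
    out = ""
    while num > 0:
        num, rem = divmod(num, 58)
        out = BASE58[rem] + out
    return out or BASE58[0]

# approach #1: base58-encode a random UUID and keep the first 8 characters
short_code = base58_encode(uuid.uuid4().int)[:8]
print(short_code)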
Database based base62 encoded shortened URL
A URL can have any length. Therefore, to generate a fixed-length output, we first hash the URL using sha256, which generates a 256-bit value, i.e. a 64-character hex string.
sha256(https://this-is-a-long-url.com/desing/url/shortner) =
2553a57edb1b87b60ac40d12ba491560fdf1cc886b6f9171b2e42b2caf4eede5
To shorten the generated hash, we can use base62 encoding (a-z, A-Z, 0–9; excluding + and /).
base62(2553a57edb1b87b60ac40d12ba491560fdf1cc886b6f9171b2e42b2caf4eede5) =
8qmTY1gUvtjJA6mmUKYvRswXGxwdujJ5JW12xxiPXfR
For an input of 64 hex characters, this generates a 43-character encoded string. To shorten it further, we can pick 7 random characters from the encoded string.
random_7chars(8qmTY1gUvtjJA6mmUKYvRswXGxwdujJ5JW12xxiPXfR) = UKmTY1g
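A minimal sketch of this flow using Python's standard hashlib and secrets modules (the base62_encode helper and function names are illustrative):

import hashlib
import secrets

BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62_encode(num: int) -> str:
    out = ""
    while num > 0:
        num, rem = divmod(num, 62)
        out = BASE62[rem] + out
    return out or BASE62[0]

def shorten(url: str) -> str:
    digest = hashlib.sha256(url.encode()).hexdigest()   # 256 bits => 64 hex chars
    encoded = base62_encode(int(digest, 16))             # ~43 base62 chars
    return "".join(secrets.choice(encoded) for _ in range(7))   # pick 7 random chars

print(shorten("https://this-is-a-long-url.com/desing/url/shortner"))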
Integer to Custom Encoding
- can still be reverse engineered with a lot of effort
- the problem is not the encoding scheme but the auto-incrementing integers — generate random numbers instead
Why 7 chars long?
6 letter long key = 62⁶ = ~56.8 billion
7 letter long key = 62⁷ = ~3.5 trillion
8 letter long key = 62⁸ = ~218 trillion
Based on our application needs, we can choose the length of the shortened URL. A 7-character length will be enough for most applications, therefore we will go with 7 chars.
Problems with bigger code
- Storage size → more disk and more memory used for a bigger shortened URL code
- Efficiency → a larger key takes more CPU time to compare
- Simplicity → not easy to remember
How to avoid very short codes or variable length?
- a small counter value like 1000 will generate a short code of only 2–3 characters, which is not acceptable (why? such codes are trivially guessable and of inconsistent length)
- start the counter at a value large enough that the encoded code already has the desired length (see the sketch below)
- padding? 0000abc
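A small sketch of both ideas, assuming a base62-encoded integer counter (62⁶ is the smallest integer whose base62 encoding is 7 characters long):

# avoid variable-length / very short codes when base62-encoding a counter
MIN_7_CHAR_VALUE = 62 ** 6        # 56,800,235,584 — first value that encodes to 7 chars
MAX_7_CHAR_VALUE = 62 ** 7 - 1    # ~3.5 trillion — last value that encodes to 7 chars

def counter_with_offset(n: int) -> int:
    # option 1: offset the auto-incrementing id so every code is already 7 chars
    return MIN_7_CHAR_VALUE + n

# option 2: keep the raw counter and left-pad the encoded string instead,
# e.g. an encoding like "abc" becomes "0000abc" when padded to 7 characters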
Approaches to generate short code
Pregenerate all 7 size keys
- pre-generate the 7-character keys and track each with a boolean value to check whether it is allocated or not
key : is_allocated
abcvedd : true
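A hedged sketch of how an application server might atomically claim one pre-generated key from such a table (Postgres-style SQL wrapped in Python; the keys table/column names follow the example above, and conn is an assumed psycopg2-style connection):

# claim one unallocated key; FOR UPDATE SKIP LOCKED lets concurrent app
# servers grab different rows without blocking each other (PostgreSQL)
CLAIM_KEY_SQL = """
UPDATE keys
   SET is_allocated = TRUE
 WHERE key = (SELECT key FROM keys
               WHERE is_allocated = FALSE
               LIMIT 1
               FOR UPDATE SKIP LOCKED)
RETURNING key;
"""

def claim_short_key(conn):
    with conn.cursor() as cur:
        cur.execute(CLAIM_KEY_SQL)
        row = cur.fetchone()
    conn.commit()
    return row[0] if row else None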
Ticket server approach
Snowflake approach
timestamp : machine_id : sequence number
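A rough sketch of a Snowflake-style ID in Python: pack a millisecond timestamp, a machine id and a per-machine sequence number into one 64-bit integer (the 41/10/12 bit split and the custom epoch below are common defaults, assumed here rather than taken from this design):

import time

CUSTOM_EPOCH_MS = 1_600_000_000_000   # assumed custom epoch (Sep 2020)

def snowflake_id(machine_id: int, sequence: int) -> int:
    timestamp = int(time.time() * 1000) - CUSTOM_EPOCH_MS
    # 41 bits timestamp | 10 bits machine id | 12 bits sequence
    return (timestamp << 22) | ((machine_id & 0x3FF) << 12) | (sequence & 0xFFF)

print(snowflake_id(machine_id=3, sequence=0))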
Zookeeper approach
- distributed coordinator — good for maintaining centralized configuration information
- a server will reach out to ZooKeeper to get an unused range of values; once the range is exhausted (or the server crashes and comes back online) it will reach out to ZooKeeper for a new range
- one server will generate short codes in an increasing sequence, hence they can be guessed? — add another 10–12 random bits after the counter number to increase entropy (see the sketch below)
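A minimal sketch of this idea, assuming the server has already been handed a [range_start, range_end) block by ZooKeeper (the ZooKeeper call itself is elided); 10 random bits are appended after the counter so consecutive codes are harder to guess:

import secrets

class RangeAllocator:
    def __init__(self, range_start: int, range_end: int):
        self.next_value = range_start    # block handed out by ZooKeeper
        self.range_end = range_end

    def next_id(self) -> int:
        if self.next_value >= self.range_end:
            raise RuntimeError("range exhausted - fetch a new block from ZooKeeper")
        value, self.next_value = self.next_value, self.next_value + 1
        # append 10 random bits after the counter number to increase entropy
        return (value << 10) | secrets.randbits(10)

allocator = RangeAllocator(1_000_000, 2_000_000)
print(allocator.next_id())   # encode this integer with base62/base58 to get the short code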
Read replicas for the database/MySQL — it will multiply our data storage, i.e. if the total data is 3 TB and we have 3 read replicas then total storage will be 12 TB
Distributed caching instead of MySQL — it would be very expensive to run ~50 Redis machines, and the data is not persistent by default
Caching + MySQL approach + in-memory cache?
- 1% of 3TB in cache => 30GB
- 300k peak reads
- 300k/4 redis machines => less than 100k reads per second
Final
Primary key can be integer
- searching is much faster on an integer than on a string
- building and maintaining a B+ tree index is cheaper for numbers than for strings
SQL Database is a bad choice for short URLs
CREATE TABLE url_mappings (
    id BIGSERIAL PRIMARY KEY,
    short_url VARCHAR(10) UNIQUE NOT NULL,
    original_url TEXT NOT NULL,
    user_id UUID,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expiration_date TIMESTAMP,
    click_count BIGINT DEFAULT 0
);
-- An inline INDEX clause is MySQL syntax; in Postgres the UNIQUE constraint above
-- already indexes short_url (or create it explicitly: CREATE INDEX idx_short_url ON url_mappings (short_url);)
CREATE TABLE analytics (
    id BIGSERIAL PRIMARY KEY,
    short_url_id BIGINT REFERENCES url_mappings(id),
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    ip_address INET,
    user_agent TEXT,
    referrer TEXT,
    country VARCHAR(2)
);
The effect of Random UUID on database performance
- In all cases, the insertion rate is at first CPU bound, but as soon as the table is larger than the buffer pool, the insertion rapidly becomes IO bound. This is expected and shouldn’t surprise anyone. The use of a smaller representation for the UUID values just allows more rows to fit in the buffer pool, but in the long run it doesn’t really help the performance, as the random insertion order dominates. If you are using random UUID values as primary keys, your performance is limited by the amount of memory you can afford.
- Postgres will be faster at inserting rows in the table itself, which is organized by rowid. It has to maintain a B-tree for the UUID PK values, so it will eventually struggle like InnoDB, maybe a bit later in terms of the number of rows.
Cache
- 80/20 rule — 20% of URLs will generate 80% of the traffic
- modern-day caches can range from 6 GB to 256 GB
- LRU eviction
- 1-hour TTL
- Write-through caching
# redis cache lookup
def get_original_url(short_url):
    # Try cache first
    original_url = redis_cache.get(short_url)
    if original_url:
        return original_url
    # Cache miss, query database
    original_url = db.query_url(short_url)
    if original_url:
        redis_cache.set(short_url, original_url, ex=3600)  # 1 hour expiry
    return original_url
Load balancers
- between client and servers
- between servers and databases
- between servers and caches
Load Balancer Configuration
- Use consistent hashing for distribution
- Health checks every 30 seconds
- Session persistence using client IP
upstream backend {
    hash $remote_addr consistent;
    server backend1.example.com:8080 max_fails=3 fail_timeout=30s;
    server backend2.example.com:8080 max_fails=3 fail_timeout=30s;
}
Base62 encoding
URL redirection
- HTTP status 301 (permanent redirect)
- with caching disabled, so every click still reaches our servers and can be counted for analytics
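A hedged sketch of the redirect handler, assuming a Flask app and the get_original_url cache helper from the Cache section (returning None for unknown codes is an assumption):

from flask import Flask, redirect, abort

app = Flask(__name__)

@app.route("/<short_url>")
def follow_short_url(short_url):
    original_url = get_original_url(short_url)   # cache first, then database
    if original_url is None:
        abort(404)
    response = redirect(original_url, code=301)  # permanent redirect
    # disable caching so every click reaches our servers and can be counted
    response.headers["Cache-Control"] = "no-store"
    return response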
Custom Short URLs
- to avoid collisions with system-generated short codes, custom URLs must be at least 8 characters long
Things to consider
- security
- availability
- scalability
- deployments / infra
- tech stack
- testing — automated, integration, load, chaos
Security Considerations
Input validation
- Validate inputs before storing them, to avoid storing malicious links and to prevent SQL injection
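A small sketch of basic input validation, assuming we only accept http/https URLs up to a maximum length (the limit and the blocklist below are placeholders):

from urllib.parse import urlparse

MAX_URL_LENGTH = 2048
BLOCKED_DOMAINS = {"malware.example"}   # placeholder blocklist of known-bad hosts

def is_valid_url(url: str) -> bool:
    if not url or len(url) > MAX_URL_LENGTH:
        return False
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    if parsed.hostname in BLOCKED_DOMAINS:
        return False
    return True

# and always send the URL to the database as a bound parameter, never by string
# concatenation, so a crafted URL cannot inject SQL:
# cur.execute("INSERT INTO url_mappings (short_url, original_url) VALUES (%s, %s)",
#             (short_url, original_url))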
Rate limiting
- track number of requests by a user or ip address
- token bucket algorithm — each IP address is allowed a certain number of requests per time period (e.g. 100 requests per minute); the Redis example below uses a sliding-window count over a sorted set
# using redis
def is_rate_limited(user_id):
    pipe = redis_client.pipeline()
    now = time.time()
    key = f'rate_limit:{user_id}'
    pipe.zadd(key, {now: now})
    pipe.zremrangebyscore(key, 0, now - 60)  # 1 minute window
    pipe.zcard(key)
    pipe.expire(key, 60)
    _, _, count, _ = pipe.execute()
    return count > 100  # 100 requests per minute
Captcha
HTTPS
- use https to protect user data
Monitoring and logging
- discussed below
Monitoring and Logging
- ELK stack for log aggregation
- Prometheus + Grafana for metrics
- Distributed tracing with Jaeger
Monitoring
Logging
Metrics to Monitor
- Success rate of URL shortening
- Latency percentiles (p50, p90, p99)
- Cache hit ratio
- Database connection pool utilization
- Rate limiter effectiveness
- SQL queries
Alerting Rules
alerts:
  - name: high_error_rate
    condition: error_rate > 1%
    for: 5m
    severity: critical
  - name: high_latency
    condition: p99_latency > 500ms
    for: 5m
    severity: warning
Business Analytics
- all info like user IP address, location, device ID, etc.
- Click data
— Timestamp
— IP address (anonymized)
— User agent
— Referrer
— Geolocation
- URL data
— Creation time
— Expiration
— Owner (if authenticated)