Benefits of Short URLs

  • Easier to type
  • Take up less space when shared — tweets, promotional images, etc.
  • Enable click analytics

Questions / Functional Requirements

  • How short should the URL code be? — as short as we can make it
  • Traffic
    → How many new URLs will be added per month?
    — 500M (100:1 read-to-write ratio)
    → For how long should the URLs be kept? — 5 years
    → Total number of URLs? → 500M * 12 * 5 = 30 billion URLs
    → 30 billion * 1 KB ≈ 30 TB (later refined to ~100 bytes per URL ≈ 3 TB)
  • Link expiration time?
  • What characters can we use? [0–9 a-z A-Z]

User stories

Writes per second

[1 million requests per day ≈ 12 requests per second]

  • 500M URLs shortened per month ≈ 600M/30 ≈ 20M per day
  • 20M per day ≈ 24M/24 ≈ 1M per hour
  • 1M per hour ≈ 1.2M/60 ≈ 20K per minute
  • 20K per minute ≈ 18K/60 ≈ 300 per second
  • ~300 URLs shortened per second
  • ×10 peak? => ~3K URLs per second
    => within the write limits of a single database
    => but we will need multiple application servers to handle this load

Scaling reads

A single DB instance can handle ~10K rps.

A single Redis instance can handle ~100K rps.

  • 300 writes per second
  • 30K reads per second (100:1 reads to writes)
    ×10 peaks are possible => 300K rps at peak
  • scale out the database [use a LB between servers and database if replicas change often; otherwise each server can decide which replica to use]
  • use Redis, but it is limited by RAM (we would need ~50 machines with 64 GB RAM each to hold 3 TB of data)

Storage Requirement

  • 30 billion URLs
  • Each URL ~100 bytes on average (URL, short identifier, and other info) => 3 trillion bytes => 3 billion KB => 3 million MB => 3000 GB => 3 TB

Simple Design

  • 300 writes per second is ok for single DB instance
  • 3 TB will also fit into a single database.

Short Identifiers — Requirements for the short identifier

  • As short as possible, but not too short (3–4 characters)
  • Random
  • Human readable [no look-alikes such as o/0, no symbols such as / or $]

Auto-generated incrementing integers

  • not random
  • very short
  • someone can scrape all URLs by enumerating the IDs

Can we use UUID?

  • A Universally Unique Identifier (UUID) is 128 bits long and is represented in hexadecimal as 32 characters
  • substring(Base58(UUID(URL)), 8)
  • Hexadecimal digits: 0–9 and A–F
  • 16³² = (2⁴)³² = 2¹²⁸ => too long to type
  • Short ❌
  • Random ✅
  • Human readable ✅

Can we use MD5?

  • Size: 2¹²⁸, same as UUID
    substring(Base58(Md5(URL)), 8)
  • Not random!! (MD5 is deterministic — the same URL always hashes to the same value)
  • Short ❌
  • Random ❌
  • Human readable ✅

Can we use Base62?

  • problem: some characters look very similar (o and 0)
  • All English letters = 52 (a–z, A–Z) + 10 digits = 62
  • UUID = 32 characters
  • ❌ Base62(UUID) = 22 characters [only 6 are needed — 62⁶ already covers our 30B URLs — so 22 is wasteful]
  • 62⁵ = ~916 million
  • 62⁶ = ~56.8 billion
  • 62⁷ = ~3.5 trillion
  • 62⁸ = ~218 trillion
  • Short ✅
  • Random ✅
  • Human readable ❌

Can we use Base58?

  • All Base62 characters minus zero (0), capital i (I), capital o (O), and lowercase L (l)
  • 58⁶ = 38 Billion
  • 58⁷ = 2 Trillion
  • 58⁸ = 128 Trillion
  • approach #1 — substring(base58(UUID), 8)
  • approach #2 — pick 8 random characters from the Base58 alphabet:

import secrets

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
result = []
for _ in range(8):
    result.append(BASE58_ALPHABET[secrets.randbelow(58)])  # Random(0, 57)
code = "".join(result)
  • Short ✅
  • Random ✅
  • Human readable ✅

Database-backed Base62-encoded shortened URL

A URL can have any length, so to generate a fixed-length output we first hash the URL with SHA-256, which produces 256 bits, i.e. a 64-character hex string.

sha256(https://this-is-a-long-url.com/desing/url/shortner) = 
2553a57edb1b87b60ac40d12ba491560fdf1cc886b6f9171b2e42b2caf4eede5

To shorten the generated hash, we can use Base62 encoding (a-z, A-Z, 0–9 — i.e. Base64 without + and /).

base62(2553a57edb1b87b60ac40d12ba491560fdf1cc886b6f9171b2e42b2caf4eede5) =
8qmTY1gUvtjJA6mmUKYvRswXGxwdujJ5JW12xxiPXfR

For the 64-character hex input, this produced a 43-character encoded string. To shorten it further, we can take 7 random characters from the encoded string.

random_7chars(8qmTY1gUvtjJA6mmUKYvRswXGxwdujJ5JW12xxiPXfR) = UKmTY1g
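
A minimal sketch of this pipeline, assuming a standard Base62 alphabet (base62_encode and short_code are illustrative names, not an established API):

import hashlib
import secrets

BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62_encode(num):
    # repeatedly divide by 62, collecting remainders as digits
    out = []
    while num:
        num, rem = divmod(num, 62)
        out.append(BASE62[rem])
    return "".join(reversed(out)) or BASE62[0]

def short_code(url, length=7):
    digest = hashlib.sha256(url.encode()).hexdigest()  # 64 hex chars
    encoded = base62_encode(int(digest, 16))           # ~43 Base62 chars
    # pick `length` random characters from the encoded string
    return "".join(secrets.choice(encoded) for _ in range(length))

Because the final characters are picked at random, the generated code must still be checked against the database for collisions before it is accepted.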

Integer to Custom Encoding

  • can still be reverse engineered with a lot of effort
  • the problem is not the encoding scheme but the auto-incrementing integers — use random numbers instead

Why 7 chars long?

6-letter key → 62⁶ ≈ 56.8 billion
7-letter key → 62⁷ ≈ 3.5 trillion
8-letter key → 62⁸ ≈ 218 trillion

Based on our application's needs, we can choose the length of the shortened URL. A 7-character code is enough for most applications, so we will go with 7 characters.

Problems with a bigger code

  • Storage size → more disk and memory used for a bigger shortened URL code
  • Efficiency → a larger key takes more CPU time to compare
  • Simplicity → harder to remember

How to avoid very short codes or variable length?

  • starting the counter at 1000 would generate a code only ~3 characters long, which is not acceptable (codes would be variable length and trivially guessable)
  • start the counter at a value whose encoding already has the desired length (see the sketch after this list)
  • padding? 0000abc
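
A minimal sketch of the offset idea, reusing the Base62 alphabet from earlier — starting the counter at 62⁶ guarantees every code is exactly 7 characters (encode_id is an illustrative name):

BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
OFFSET = 62 ** 6  # smallest integer whose Base62 encoding is 7 characters long

def encode_id(n):
    num, out = OFFSET + n, []
    while num:
        num, rem = divmod(num, 62)
        out.append(BASE62[rem])
    return "".join(reversed(out))

# encode_id(0) == "1000000", encode_id(1) == "1000001", ... always 7 characters

Padding ("0000abc") is the alternative: it achieves the same fixed length without shifting the key space.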

Approaches to generate short code

Pre-generate all 7-character keys

  • pre-generate every key of length 7 and track each with a boolean flag showing whether it has been allocated:

key : is_allocated
abcvedd : true
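
One way to claim a free key atomically — a sketch assuming a psycopg2-style connection and a keys table shaped like the example above; SKIP LOCKED lets concurrent app servers grab different rows instead of blocking each other:

def allocate_key(conn):
    with conn.cursor() as cur:
        cur.execute("""
            UPDATE keys SET is_allocated = TRUE
            WHERE key = (SELECT key FROM keys
                         WHERE is_allocated = FALSE
                         LIMIT 1 FOR UPDATE SKIP LOCKED)
            RETURNING key
        """)
        row = cur.fetchone()
    conn.commit()
    return row[0] if row else None  # None => key space exhausted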

Ticket server approach

Snowflake approach

timestamp : machine_id : sequence_number
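
A sketch of a Snowflake-style generator, assuming the classic layout of 41 bits of millisecond timestamp, 10 bits of machine id, and 12 bits of per-millisecond sequence (production versions subtract a custom epoch from the timestamp):

import time
import threading

class Snowflake:
    def __init__(self, machine_id):
        self.machine_id = machine_id & 0x3FF  # 10 bits
        self.seq = 0                          # 12-bit counter within one ms
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF
                if self.seq == 0:  # 4096 ids issued this ms — wait for next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.machine_id << 12) | self.seq

The resulting 64-bit integer can then be Base62-encoded into a short code; note it is time-ordered, so it shares the guessability concern raised for incrementing integers.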

Zookeeper approach

  • distributed coordinator — good for maintaining centralized configuration information
  • each server reaches out to ZooKeeper to get an unused range of values; once the range is exhausted (or the server crashes and comes back online) it asks ZooKeeper for a new range (see the sketch below)
  • a single server generates short codes in increasing sequence, so they can be guessed — append another 10–12 random bits after the counter value to increase entropy
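
A sketch of the range idea — fetch_range here is a hypothetical stand-in for a ZooKeeper transaction that atomically reserves a block of values:

import secrets

RANGE_SIZE = 1_000_000

class RangeAllocator:
    def __init__(self, fetch_range):
        self.fetch_range = fetch_range  # returns (start, end) of a fresh range
        self.next_val, self.end = fetch_range(RANGE_SIZE)

    def next_id(self):
        if self.next_val >= self.end:  # range exhausted — lease a new one
            self.next_val, self.end = self.fetch_range(RANGE_SIZE)
        val = self.next_val
        self.next_val += 1
        # append 12 random bits so consecutive codes are not guessable
        return (val << 12) | secrets.randbits(12)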

Read replicas for the database/MySQL — they multiply our data storage, i.e. if the total data is 3 TB and we have 3 read replicas, total storage becomes 12 TB (primary + 3 full copies).

Distributed caching instead of MySQL — it would be very expensive to run ~50 Redis machines, and the data is not persistent.

Caching + MySQL approach + in-memory cache?

  • 1% of 3 TB in cache => 30 GB
  • 300K peak reads per second
  • 300K / 4 Redis machines => under 100K reads per second each, within a single instance's limit

Final

Primary key can be integer

  • searching is much faster on integers than on strings
  • building a B+ tree index is cheaper for numbers than for strings

Is a SQL database a bad choice for short URLs?

CREATE TABLE url_mappings (
    id BIGSERIAL PRIMARY KEY,
    short_url VARCHAR(10) UNIQUE NOT NULL,
    original_url TEXT NOT NULL,
    user_id UUID,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expiration_date TIMESTAMP,
    click_count BIGINT DEFAULT 0
);

-- inline INDEX is MySQL syntax; with BIGSERIAL (Postgres) create it separately
CREATE INDEX idx_short_url ON url_mappings (short_url);

CREATE TABLE analytics (
    id BIGSERIAL PRIMARY KEY,
    short_url_id BIGINT REFERENCES url_mappings(id),
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    ip_address INET,
    user_agent TEXT,
    referrer TEXT,
    country VARCHAR(2)
);

Further reading — random UUIDs as primary keys are bad for performance:

  • "(Random) UUIDs Are Bad for Performance in MySQL — Is Postgres Better? Let Us Discuss" / "The effect of Random UUID on database performance" / "URL Shortener UUIDs"
  • https://www.percona.com/blog/uuids-are-popular-but-bad-for-performance-lets-discuss/
  • In all cases, the insertion rate is at first CPU bound but as soon the table is larger than the buffer pool, the insertion rapidly becomes IO bound. This is expected and shouldn’t surprise anyone. The use of a smaller representation for the UUID values just allows more rows to fit in the buffer pool but in the long run, it doesn’t really help the performance, as the random insertion order dominates. If you are using random UUID values as primary keys, your performance is limited by the amount of memory you can afford.
  • Postgres will be faster at inserting rows in the table itself which is organized by rowid. It has to maintain a btree for the uuid PK values so it will eventually struggle like InnoDB, maybe a bit later in term of number of rows.

Cache

  • 80/20 rule — 20% of URLs generate 80% of the traffic
  • modern cache machines range from ~6 GB to 256 GB of RAM
  • LRU eviction
  • 1-hour TTL
  • write-through caching (the snippet below is actually cache-aside: populate on read miss)
# redis cache
def get_original_url(short_url):
    # Try cache first
    original_url = redis_cache.get(short_url)
    if original_url:
        return original_url

    # Cache miss, query database
    original_url = db.query_url(short_url)
    if original_url:
        redis_cache.set(short_url, original_url, ex=3600)  # 1 hour expiry
    return original_url

Load balancers

  • between client and servers
  • between servers and databases
  • between servers and caches

Load Balancer Configuration

  • Use consistent hashing for distribution
  • Health checks every 30 seconds
  • Session persistence using client IP
upstream backend {
    hash $remote_addr consistent;
    server backend1.example.com:8080 max_fails=3 fail_timeout=30s;
    server backend2.example.com:8080 max_fails=3 fail_timeout=30s;
}

Base62 encoding

URL redirection

  • HTTP status 301 (permanent redirect)
  • with caching disabled, so repeat clicks still reach our servers and can be counted for analytics (see the handler sketch below)
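
A minimal handler sketch, assuming Flask and the get_original_url helper from the cache section above:

from flask import Flask, abort, redirect

app = Flask(__name__)

@app.route("/<short_url>")
def follow(short_url):
    original_url = get_original_url(short_url)  # cache first, then DB
    if not original_url:
        abort(404)
    resp = redirect(original_url, code=301)
    # stop browsers/proxies from caching the redirect so clicks stay countable
    resp.headers["Cache-Control"] = "no-store"
    return resp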

Custom Short URLs

  • to avoid collisions with system-generated short codes, custom URLs must be at least 8 characters long

Things to consider

  1. security
  2. availability
  3. scalability
  4. deployments / infra
  5. tech stack
  6. testing — automated, integration, load, chaos

Security Considerations

Input validation

  • Validate inputs before storing them, to avoid malicious links and SQL injection (see the sketch below)
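
A minimal validation sketch using only the standard library (the scheme allow-list and length cap are assumptions); SQL injection itself is prevented by always binding the URL as a query parameter rather than concatenating strings:

from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}
MAX_URL_LENGTH = 2048

def is_valid_url(raw):
    if not raw or len(raw) > MAX_URL_LENGTH:
        return False
    parts = urlparse(raw)
    # require an explicit http(s) scheme and a host; rejects javascript:, data:, etc.
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)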

Rate limiting

  • track the number of requests per user or IP address
  • token bucket algorithm — each IP address is allowed a certain number of requests per time period (e.g. 100 requests per minute)
# using redis
import time

def is_rate_limited(user_id):
    pipe = redis_client.pipeline()
    now = time.time()
    key = f'rate_limit:{user_id}'

    pipe.zadd(key, {now: now})               # record this request
    pipe.zremrangebyscore(key, 0, now - 60)  # drop entries outside the 1-minute window
    pipe.zcard(key)                          # count requests left in the window
    pipe.expire(key, 60)

    _, _, count, _ = pipe.execute()
    return count > 100  # 100 requests per minute
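
Strictly speaking, the snippet above is a sliding-window log rather than a token bucket. A minimal in-process token bucket (the algorithm named in the bullet) looks like this — keeping per-IP buckets in application memory is an assumption:

import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill in proportion to the time elapsed since the last call
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# ~100 requests/minute per IP: TokenBucket(rate=100 / 60, capacity=100)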

Captcha

HTTPS

  • use HTTPS to protect user data in transit

Monitoring and logging

  • discussed below

Monitoring and Logging

  1. ELK stack for log aggregation
  2. Prometheus + Grafana for metrics
  3. Distributed tracing with Jaeger

Monitoring

Logging

Metrics to Monitor

  1. Success rate of URL shortening
  2. Latency percentiles (p50, p90, p99)
  3. Cache hit ratio
  4. Database connection pool utilization
  5. Rate limiter effectiveness
  6. Slow SQL queries

Alerting Rules

alerts:
  - name: high_error_rate
    condition: error_rate > 1%
    for: 5m
    severity: critical

  - name: high_latency
    condition: p99_latency > 500ms
    for: 5m
    severity: warning

Business Analytics

  • all info like user IP address, location, device ID, etc.
  • Click data
    — Timestamp
    — IP address (anonymized)
    — User agent
    — Referrer
    — Geolocation
  • URL data
    — Creation time
    — Expiration
    — Owner (if authenticated)
