Bitly | TinyURL | URL shortener
--
· Benefits of Short URLs
· Questions/Function Requirement
· User stories
∘ Writes per second
∘ Scaling reads
∘ Storage Requirement
∘ Simple Design
∘ Short Identifiers — Requirement for short Identifier
∘ Auto-generated integers
∘ Can we use UUID?
∘ Can we use MD5?
∘ Can we use Base62?
∘ Can we use Base58?
∘ Database based base62 encoded shortened URL
∘ Integer to Custom Encoding
∘ Why 7 chars long?
∘ Problems with bigger code
∘ How to avoid very small short codes or variable length?
· Approaches to generate short code
∘ Pregenerate all 7 size keys
∘ Ticket server approach
∘ Snowflake approach
∘ Zookeeper approach
∘ Final
· Primary key can be integer
· SQL Database is a bad choice for short URLs
∘ Cache
· Load balancers
∘ Load Balancer Configuration
· Base62 encoding
· URL redirection
∘ Working TinyURL Code
· Custom Short URLs
· Things to consider
· Security Considerations
∘ Input validation
∘ Rate limiting
∘ Captcha
∘ HTTPS
∘ Monitoring and logging
· Monitoring and Logging
∘ Monitoring
∘ Logging
∘ Metrics to Monitor
∘ Alerting Rules
· Business Analytics
· TODO
· References
Benefits of Short URLs
- easier to type
- Don’t take much space when you share them — tweets, promotional images etc.
- analytics
Questions/Function Requirement
- How short should the URL code be? — as short as we can
- Traffic
→ How many new URLs will be added per month? — 500M (100:1 read:write ratio)
→ For how long the URLs should be kept? — 5 years
→ Total number of URLs? → 500M*12*5 = 30 Billion URLs
→ 30 billion * 1 KB ≈ 30 TB (an upper bound; the Storage Requirement section below assumes ~100 bytes per URL, giving ~3 TB)
- Link expiration time?
- What characters can we use? [0–9 a-z A-Z]
User stories
Writes per second
[1 million requests per day ≈ 12 requests per second]
- 500M URLs shortened per month => ~600M/30
- 20M per day => ~24M/24
- 1M per hour => ~1.2M/60
- 20K per minute => ~18k / 60
- 300 URLs per second
- x10 peak? => ~3K URLs per second
=> writes are within the limits of a single database instance
=> we will still need multiple application servers to handle the peak load
Scaling reads
single instance of Db can handle — 10k rps
single instance of redis can handle — 100k rps
- 300 writes per second
- 30k reads per second (100:1 reads to writes)
- x10 peaks are possible => 300K rps at peak
- scale out the database [use a LB between servers and database if replicas change frequently; otherwise the server can decide which replica to use]
- use Redis, but it is limited by RAM (we would need ~50 machines with 64 GB RAM to hold 3 TB of data)
Storage Requirement
- 30 billion URLs
- Each URL 100 bytes (URL, short identifier and other info) on average => 3 Trillion bytes => 3 Billion KB => 3 Million MB => 3000 GB => 3 TB
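As a quick sanity check, here is a minimal back-of-the-envelope sketch of the numbers above (500M URLs/month, 100:1 read:write ratio, ~100 bytes per record and 5-year retention are the figures stated in these sections; the rounding is deliberately generous):

# back-of-the-envelope estimates for writes, reads and storage
new_urls_per_month = 500_000_000
writes_per_second  = new_urls_per_month / (30 * 24 * 3600)   # ~190, round up to ~300 for headroom
reads_per_second   = 100 * writes_per_second                 # 100:1 read:write => ~20-30K
total_urls         = new_urls_per_month * 12 * 5              # 30 billion over 5 years
storage_bytes      = total_urls * 100                         # ~100 bytes per record => ~3 TB

print(f"writes/s ~ {writes_per_second:,.0f}, reads/s ~ {reads_per_second:,.0f}")
print(f"total URLs = {total_urls:,}, storage ~ {storage_bytes / 1e12:.1f} TB")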
Simple Design
- 300 writes per second is ok for single DB instance
- 3 TB will also fit into a single database.
Short Identifiers — Requirement for short Identifier
- As short as possible, but not too short (3–4 characters would be too easy to guess)
- Random
- Human readable [avoid ambiguous characters like o and 0, and symbols like / and $]
Auto-generated incrementing integers
- not random
- can be very short (variable length)
- sequential, so someone can scrape/enumerate all URLs
Can we use UUID?
- A Universally Unique Identifier (UUID) is 128 bits long and is represented in hexadecimal as 32 characters
- substring(Base58(UUID(URL)), 8)
- Hexadecimal digits: 0–9 and A–F
- 16³² = (2⁴)³² = 2¹²⁸ => too long to type
- Short ❌
- Random ✅
- Human readable ✅
Can we use MD5?
- Size: 2¹²⁸, same as UUID
substring(Base58(Md5(URL)), 8)
- Not random! MD5 is deterministic — the same URL always produces the same hash
- Short ❌
- Random ❌
- Human readable ✅
Can we use Base62?
- problem is that some characters look very similar (o and 0)
- All English characters = 52(a-zA-Z)+10 number = 62
- UUID = 32 character
- ❌ Base62(UUID) = 22 characters [we only need 6–7 characters, since 62⁶ ≈ 56 billion; 22 is unnecessarily long]
- 62⁵ = 916 Million
- 62⁶ = 56 Billion
- 62⁷ = 3 Trillion
- 62⁸ = 218 Trillion
- Short ✅
- Random ✅
- Human readable ❌
Can we use Base58?
- All Base62 characters minus zero (0), capital I, capital O and lowercase l
- 58⁶ = 38 Billion
- 58⁷ = 2 Trillion
- 58⁸ = 128 Trillion
- approach # 1 — substring(base58(UUID), 8) — see the runnable sketch after this list
- approach # 2 —
result = []
for _ in range(8):                                   # pick 8 random characters
    result.append(ALPHABET[random.randint(0, 57)])   # ALPHABET = the 58-char Base58 alphabet
short_code = "".join(result)
- Short ✅
- Random ✅
- Human readable ✅
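A rough sketch of approach #1 in Python, assuming the standard Base58 alphabet (the base58_encode helper below is illustrative, not a specific library):

import uuid

# Base58 alphabet: 0, O, I and l are removed for readability
BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_encode(num: int) -> str:
    out = ""
    while num > 0:
        num, rem = divmod(num, 58)
        out = BASE58[rem] + out
    return out or BASE58[0]

# approach #1: base58-encode a random UUID and keep the first 8 characters
short_code = base58_encode(uuid.uuid4().int)[:8]
print(short_code)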
Database based base62 encoded shortened URL
A URL can have any length. Therefore, to generate a fixed-length output, we first hash the URL using sha256, which generates a 256-bit value, i.e. a 64-character hex string.
sha256(https://this-is-a-long-url.com/desing/url/shortner) =
2553a57edb1b87b60ac40d12ba491560fdf1cc886b6f9171b2e42b2caf4eede5
To shorten the generated hash, we can use base62 encoding (a-z, A-Z, 0–9; excluding + and /).
base62(2553a57edb1b87b60ac40d12ba491560fdf1cc886b6f9171b2e42b2caf4eede5) =
8qmTY1gUvtjJA6mmUKYvRswXGxwdujJ5JW12xxiPXfR
For an input of 64 hex characters, this generates a 43-character encoded string. To shorten it further, we can pick 7 random characters from the encoded string.
random_7chars(8qmTY1gUvtjJA6mmUKYvRswXGxwdujJ5JW12xxiPXfR) = UKmTY1g
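A minimal sketch of this flow using Python's standard hashlib and secrets modules (the base62_encode helper and function names are illustrative):

import hashlib
import secrets

BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62_encode(num: int) -> str:
    out = ""
    while num > 0:
        num, rem = divmod(num, 62)
        out = BASE62[rem] + out
    return out or BASE62[0]

def shorten(url: str) -> str:
    digest = hashlib.sha256(url.encode()).hexdigest()   # 256 bits => 64 hex chars
    encoded = base62_encode(int(digest, 16))             # ~43 base62 chars
    return "".join(secrets.choice(encoded) for _ in range(7))   # pick 7 random chars

print(shorten("https://this-is-a-long-url.com/desing/url/shortner"))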
Integer to Custom Encoding
- can still be reverse engineered with a lot of effort
- the problem is not the encoding scheme but the auto-incrementing integers — generate random numbers instead
Why 7 chars long?
6 letter long key = 62⁶ = ~56.8 billion
7 letter long key = 62⁷ = ~3.5 trillion
8 letter long key = 62⁸ = ~218 trillion
Based on our application needs, we can choose the length of the shortened URL. A 7-character length will be enough for most applications, therefore we will go with 7 chars.
Problems with bigger code
- Storage size → more disk and more memory used for a bigger shortened URL code
- Efficiency → a larger key takes more CPU time to compare
- Simplicity → not easy to remember
How to avoid very short codes or variable length?
- a small counter value like 1000 will generate a short code of only 2–3 characters, which is not acceptable (why? such codes are trivially guessable and of inconsistent length)
- start the counter at a value large enough that the encoded code already has the desired length (see the sketch below)
- padding? 0000abc
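A small sketch of both ideas, assuming a base62-encoded integer counter (62⁶ is the smallest integer whose base62 encoding is 7 characters long):

# avoid variable-length / very short codes when base62-encoding a counter
MIN_7_CHAR_VALUE = 62 ** 6        # 56,800,235,584 — first value that encodes to 7 chars
MAX_7_CHAR_VALUE = 62 ** 7 - 1    # ~3.5 trillion — last value that encodes to 7 chars

def counter_with_offset(n: int) -> int:
    # option 1: offset the auto-incrementing id so every code is already 7 chars
    return MIN_7_CHAR_VALUE + n

# option 2: keep the raw counter and left-pad the encoded string instead,
# e.g. an encoding like "abc" becomes "0000abc" when padded to 7 characters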
Approaches to generate short code
Pregenerate all 7 size keys
- pre-generate the 7-character keys and track each with a boolean value to check whether it is allocated or not
key : is_allocated
abcvedd : true
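A hedged sketch of how an application server might atomically claim one pre-generated key from such a table (Postgres-style SQL wrapped in Python; the keys table/column names follow the example above, and conn is an assumed psycopg2-style connection):

# claim one unallocated key; FOR UPDATE SKIP LOCKED lets concurrent app
# servers grab different rows without blocking each other (PostgreSQL)
CLAIM_KEY_SQL = """
UPDATE keys
   SET is_allocated = TRUE
 WHERE key = (SELECT key FROM keys
               WHERE is_allocated = FALSE
               LIMIT 1
               FOR UPDATE SKIP LOCKED)
RETURNING key;
"""

def claim_short_key(conn):
    with conn.cursor() as cur:
        cur.execute(CLAIM_KEY_SQL)
        row = cur.fetchone()
    conn.commit()
    return row[0] if row else None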
Ticket server approach
Snowflake approach
timestamp : machine_id : sequence number
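A rough sketch of a Snowflake-style ID in Python: pack a millisecond timestamp, a machine id and a per-machine sequence number into one 64-bit integer (the 41/10/12 bit split and the custom epoch below are common defaults, assumed here rather than taken from this design):

import time

CUSTOM_EPOCH_MS = 1_600_000_000_000   # assumed custom epoch (Sep 2020)

def snowflake_id(machine_id: int, sequence: int) -> int:
    timestamp = int(time.time() * 1000) - CUSTOM_EPOCH_MS
    # 41 bits timestamp | 10 bits machine id | 12 bits sequence
    return (timestamp << 22) | ((machine_id & 0x3FF) << 12) | (sequence & 0xFFF)

print(snowflake_id(machine_id=3, sequence=0))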
Zookeeper approach
- distributed coordinator — good for maintaining centralized configuration information
- a server will reach out to ZooKeeper to get an unused range of values; once the range is exhausted (or the server crashes and comes back online) it will reach out to ZooKeeper for a new range
- one server will generate short codes in an increasing sequence, hence they can be guessed? — add another 10–12 random bits after the counter number to increase entropy (see the sketch below)
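A minimal sketch of this idea, assuming the server has already been handed a [range_start, range_end) block by ZooKeeper (the ZooKeeper call itself is elided); 10 random bits are appended after the counter so consecutive codes are harder to guess:

import secrets

class RangeAllocator:
    def __init__(self, range_start: int, range_end: int):
        self.next_value = range_start    # block handed out by ZooKeeper
        self.range_end = range_end

    def next_id(self) -> int:
        if self.next_value >= self.range_end:
            raise RuntimeError("range exhausted - fetch a new block from ZooKeeper")
        value, self.next_value = self.next_value, self.next_value + 1
        # append 10 random bits after the counter number to increase entropy
        return (value << 10) | secrets.randbits(10)

allocator = RangeAllocator(1_000_000, 2_000_000)
print(allocator.next_id())   # encode this integer with base62/base58 to get the short code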
Read replicas for the database/MySQL — it will multiply our data storage, i.e. if the total data is 3 TB and we have 3 read replicas then total storage will be 12 TB
Distributed caching instead of MySQL — it would be very expensive to run ~50 Redis machines, and the data is not persistent by default
Caching + MySQL approach + in-memory cache?
- 1% of 3TB in cache => 30GB
- 300k peak reads
- 300k/4 redis machines => less than 100k reads per second
Final
Primary key can be integer
- searching is much faster on an integer than on a string
- building and maintaining a B+ tree index is cheaper for numbers than for strings
SQL Database is a bad choice for short URLs
CREATE TABLE url_mappings (
    id BIGSERIAL PRIMARY KEY,
    short_url VARCHAR(10) UNIQUE NOT NULL,
    original_url TEXT NOT NULL,
    user_id UUID,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expiration_date TIMESTAMP,
    click_count BIGINT DEFAULT 0
);
-- An inline INDEX clause is MySQL syntax; in Postgres the UNIQUE constraint above
-- already indexes short_url (or create it explicitly: CREATE INDEX idx_short_url ON url_mappings (short_url);)
CREATE TABLE analytics (
    id BIGSERIAL PRIMARY KEY,
    short_url_id BIGINT REFERENCES url_mappings(id),
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    ip_address INET,
    user_agent TEXT,
    referrer TEXT,
    country VARCHAR(2)
);
The effect of Random UUID on database performance
- In all cases, the insertion rate is at first CPU bound, but as soon as the table is larger than the buffer pool, the insertion rapidly becomes IO bound. This is expected and shouldn’t surprise anyone. The use of a smaller representation for the UUID values just allows more rows to fit in the buffer pool, but in the long run it doesn’t really help the performance, as the random insertion order dominates. If you are using random UUID values as primary keys, your performance is limited by the amount of memory you can afford.
- Postgres will be faster at inserting rows in the table itself, which is organized by rowid. It has to maintain a B-tree for the UUID PK values, so it will eventually struggle like InnoDB, maybe a bit later in terms of the number of rows.
Cache
- 80/20 rule — 20% of URLs will generate 80% of the traffic
- modern-day caches can range from 6 GB to 256 GB
- LRU eviction
- 1-hour TTL
- Write-through caching
# redis cache lookup
def get_original_url(short_url):
    # Try cache first
    original_url = redis_cache.get(short_url)
    if original_url:
        return original_url
    # Cache miss, query database
    original_url = db.query_url(short_url)
    if original_url:
        redis_cache.set(short_url, original_url, ex=3600)  # 1 hour expiry
    return original_url
Load balancers
- between client and servers
- between servers and databases
- between servers and caches
Load Balancer Configuration
- Use consistent hashing for distribution
- Health checks every 30 seconds
- Session persistence using client IP
upstream backend {
    hash $remote_addr consistent;
    server backend1.example.com:8080 max_fails=3 fail_timeout=30s;
    server backend2.example.com:8080 max_fails=3 fail_timeout=30s;
}
Base62 encoding
URL redirection
- HTTP status 301 (permanent redirect)
- with caching disabled, so every click still reaches our servers and can be counted for analytics
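A hedged sketch of the redirect handler, assuming a Flask app and the get_original_url cache helper from the Cache section (returning None for unknown codes is an assumption):

from flask import Flask, redirect, abort

app = Flask(__name__)

@app.route("/<short_url>")
def follow_short_url(short_url):
    original_url = get_original_url(short_url)   # cache first, then database
    if original_url is None:
        abort(404)
    response = redirect(original_url, code=301)  # permanent redirect
    # disable caching so every click reaches our servers and can be counted
    response.headers["Cache-Control"] = "no-store"
    return response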
Custom Short URLs
- to avoid collisions with system-generated short codes, custom URLs must be at least 8 characters long
Things to consider
- security
- availability
- scalability
- deployments / infra
- tech stack
- testing — automated, integration, load, chaos
Security Considerations
Input validation
- Validate inputs before storing them, to avoid storing malicious links and to prevent SQL injection
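A small sketch of basic input validation, assuming we only accept http/https URLs up to a maximum length (the limit and the blocklist below are placeholders):

from urllib.parse import urlparse

MAX_URL_LENGTH = 2048
BLOCKED_DOMAINS = {"malware.example"}   # placeholder blocklist of known-bad hosts

def is_valid_url(url: str) -> bool:
    if not url or len(url) > MAX_URL_LENGTH:
        return False
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    if parsed.hostname in BLOCKED_DOMAINS:
        return False
    return True

# and always send the URL to the database as a bound parameter, never by string
# concatenation, so a crafted URL cannot inject SQL:
# cur.execute("INSERT INTO url_mappings (short_url, original_url) VALUES (%s, %s)",
#             (short_url, original_url))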
Rate limiting
- track number of requests by a user or ip address
- token bucket algorithm — each IP address is allowed a certain number of requests per time period (e.g. 100 requests per minute); the Redis example below uses a sliding-window count over a sorted set
# using redis
def is_rate_limited(user_id):
    pipe = redis_client.pipeline()
    now = time.time()
    key = f'rate_limit:{user_id}'
    pipe.zadd(key, {now: now})
    pipe.zremrangebyscore(key, 0, now - 60)  # 1 minute window
    pipe.zcard(key)
    pipe.expire(key, 60)
    _, _, count, _ = pipe.execute()
    return count > 100  # 100 requests per minute
Captcha
HTTPS
- use https to protect user data
Monitoring and logging
- discussed below
Monitoring and Logging
- ELK stack for log aggregation
- Prometheus + Grafana for metrics
- Distributed tracing with Jaeger
Monitoring
Logging
Metrics to Monitor
- Success rate of URL shortening
- Latency percentiles (p50, p90, p99)
- Cache hit ratio
- Database connection pool utilization
- Rate limiter effectiveness
- SQL queries
Alerting Rules
alerts:
  - name: high_error_rate
    condition: error_rate > 1%
    for: 5m
    severity: critical
  - name: high_latency
    condition: p99_latency > 500ms
    for: 5m
    severity: warning
Business Analytics
- all info like user IP address, location, device ID, etc.
- Click data
— Timestamp
— IP address (anonymized)
— User agent
— Referrer
— Geolocation
- URL data
— Creation time
— Expiration
— Owner (if authenticated)