Distributed Web crawler — [Notes]
· What’s a web crawler in a nutshell
· Questions
· When is a URL “new”?
∘ Worth Revisiting?
∘ Bloom filter to check uniqueness of a page?
∘ Redis to check uniqueness of a page?
∘ Relational database to check uniqueness of a page?
∘ Summary
· Priority
∘ Politeness
∘ Frequency
∘ Final
· Avoid crawling duplicate URLs at Google scale - using a Bloom filter
What’s a web crawler in a nutshell
- Get a URL
- Fetch contents
- Store
- Extract new URLs from content
- Repeat
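A minimal single-machine sketch of this loop (Python standard library only; the regex link extraction, the 5-second timeout, and the page limit are simplifications for illustration):

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

HREF_RE = re.compile(r'href="([^"#]+)"')      # naive link extraction, good enough for a sketch

def store(url, html):
    print(url, len(html))                     # placeholder for "send the page to the indexer"

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)               # URLs waiting to be fetched
    seen = set(seed_urls)                     # URLs already scheduled (deduplication)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except OSError:
            continue                          # unreachable page: skip and move on
        fetched += 1
        store(url, html)
        for href in HREF_RE.findall(html):
            new_url = urljoin(url, href)      # resolve relative links against the page URL
            if new_url.startswith("http") and new_url not in seen:
                seen.add(new_url)
                frontier.append(new_url)
```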
Questions
- How many links do we need to crawl? — The entire Internet (~1.5 billion sites)
- Where should I start my crawl? — You’ll be provided a list to begin with (reference pages/seed pages)
- When should our crawl end? — Never
- What’s the result of my crawl? — Crawler should send all pages to an indexer
When is a URL “new”?
- Normalize the URL: if it is a relative URL it has no host, so we need to attach the host of the page it was found on.
- Sort the query parameters, because they may appear in a different order.
- Compare two URLs:
- If they have the same host, the same path, and the same (sorted) parameters, we assume they are the same URL (normalization sketched below).
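A sketch of that normalization with the standard library (the exact set of rules, e.g. dropping the fragment, is an assumption for the example):

```python
from urllib.parse import urljoin, urlparse, parse_qsl, urlencode, urlunparse

def normalize(url, base=None):
    """Return a canonical form of `url` so that equivalent URLs compare equal."""
    if base:
        url = urljoin(base, url)                           # attach host/path for relative URLs
    parts = urlparse(url)
    query = urlencode(sorted(parse_qsl(parts.query)))      # sort parameters
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),                              # host is case-insensitive
        parts.path or "/",
        "",                                                # drop the params component
        query,
        "",                                                # drop the fragment: it never reaches the server
    ))

# Both normalize to the same string, so they count as the same URL:
assert normalize("HTTP://Example.com/a?b=2&a=1") == normalize("/a?a=1&b=2", base="http://example.com/")
```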
Worth Revisiting?
- Another complication we have is time. The content of the same URL changes with time and those changes are relevant to us.
- The main page of a news site at the start of the year will look nothing like the main page of the same site at the end of the year. Everything will change, although the URL is exactly the same.
- The frequency of changes is different for every host and every subdomain.
Bloom filter to check uniqueness of a page?
Cons:
- Memory requirements
→ 1.5B sites × ~10 pages per site ≈ 15B pages, requiring roughly 50 GB of RAM
- False positives
→ At a 1-in-a-million false-positive rate, about 15K pages would never be indexed
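A quick back-of-the-envelope check of those numbers, using the standard Bloom filter sizing formula (the 15B pages and the 1-in-a-million error rate come from the bullets above):

```python
import math

n = 15e9     # pages: 1.5B sites x ~10 pages each
p = 1e-6     # target false-positive rate (1 in a million)

# Standard sizing formula: m = -n * ln(p) / (ln 2)^2 bits
m_bits = -n * math.log(p) / (math.log(2) ** 2)
print(f"~{m_bits / 8 / 1e9:.0f} GB of memory")   # ~54 GB, i.e. the ~50 GB quoted above
```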
Redis to check uniqueness of a page?
- Is a good option to store URLs
- Average URL = 50 bytes
- 15B URLs × 50 bytes = 750 × 10⁹ bytes ≈ 732 × 10⁶ KB ≈ 715,000 MB ≈ 700 GB
→ roughly 70 shards with 10 GB of RAM each
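A minimal sketch of the uniqueness check with Redis (assuming a local Redis instance, the redis-py client, and a made-up key name `seen_urls`):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def is_new_url(url: str) -> bool:
    # SADD returns 1 only if the member was not already in the set,
    # so the membership test and the insert happen in one atomic command.
    return r.sadd("seen_urls", url) == 1

if is_new_url("http://example.com/a?a=1&b=2"):
    print("schedule for crawling")
else:
    print("already seen, skip")
```

At 15B URLs a single set will not fit on one node, so the key space would be partitioned across the ~70 shards estimated above, e.g. by hashing the URL to choose a shard.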
Relational database to check uniqueness of a page?
- Can easily hold the ~700 GB of URL data (a single instance can hold up to ~2 TB)
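The same check against a relational database, sketched with SQLite purely so the example is self-contained (a production setup would use a server-grade RDBMS, but the idea of relying on a PRIMARY KEY / UNIQUE constraint is the same):

```python
import sqlite3

db = sqlite3.connect("crawler.db")
db.execute("CREATE TABLE IF NOT EXISTS seen_urls (url TEXT PRIMARY KEY)")

def is_new_url(url: str) -> bool:
    try:
        db.execute("INSERT INTO seen_urls (url) VALUES (?)", (url,))
        db.commit()
        return True                   # insert succeeded: first time we see this URL
    except sqlite3.IntegrityError:
        return False                  # PRIMARY KEY violation: URL was already stored

print(is_new_url("http://example.com/a"))   # True the first time, False afterwards
```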
Summary
- High throughput, weak consistency — Bloom filters
- Medium throughput, medium consistency — Redis
- Low throughput, high consistency — RDBMS
Priority
- Politeness — we shouldn’t crawl the same domain too often
- Frequency — we should crawl different domains at different paces, matching how often their content changes
Politeness
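A minimal sketch of one common way to enforce this: keep a queue of URLs per host and a minimum delay between requests to the same host (the 1-second delay is an assumed default, not something fixed by these notes):

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

MIN_DELAY = 1.0                                   # assumed: at most one request per host per second

host_queues = defaultdict(deque)                  # host -> URLs waiting for that host
next_allowed = defaultdict(float)                 # host -> earliest time we may hit it again

def enqueue(url):
    host_queues[urlparse(url).netloc].append(url)

def next_url():
    """Return a URL whose host is currently allowed to be crawled, or None."""
    now = time.monotonic()
    for host, queue in host_queues.items():
        if queue and now >= next_allowed[host]:
            next_allowed[host] = now + MIN_DELAY  # push back this host's next slot
            return queue.popleft()
    return None
```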
Frequency
- To start: set the revisit interval for each site to some default
- Each time a site is crawled:
→ if the content has changed => halve the interval (revisit sooner)
→ if the content has not changed => double the interval (revisit later)
- Problems?
→ How do we decide whether the content has changed?
→ If a site rarely changes, its interval keeps doubling towards infinity and we never revisit it again
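A sketch of this adaptive schedule (the default/minimum/maximum intervals and the content-hash change check are assumptions; the cap on the interval is one simple way to keep it from growing without bound, addressing the second problem above):

```python
import hashlib

DEFAULT_INTERVAL = 24 * 3600          # assumed: revisit once a day by default
MIN_INTERVAL = 3600                   # never more often than hourly
MAX_INTERVAL = 30 * 24 * 3600         # never less often than monthly (the cap)

intervals = {}                        # url -> current revisit interval in seconds
last_hash = {}                        # url -> hash of the content we saw last time

def on_crawled(url: str, content: bytes) -> int:
    """Update and return the revisit interval for `url` after a crawl."""
    digest = hashlib.sha256(content).hexdigest()       # crude "did it change?" check
    interval = intervals.get(url, DEFAULT_INTERVAL)
    if last_hash.get(url) != digest:
        interval = max(MIN_INTERVAL, interval // 2)    # changed: revisit sooner
    else:
        interval = min(MAX_INTERVAL, interval * 2)     # unchanged: back off
    intervals[url] = interval
    last_hash[url] = digest
    return interval
```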
Final
Avoid crawling duplicate URLs at Google scale - using a Bloom filter
How to avoid crawling duplicate URLs at Google scale?
Option 1: Use a Set data structure to check if a URL already exists. A set is fast, but it is not space-efficient at this scale.
Option 2: Store URLs in a database and check whether a new URL is already in it. This can work, but the load on the database will be very high.
Option 3: Bloom filter. This option is preferred. The Bloom filter was proposed by Burton Howard Bloom in 1970. It is a probabilistic data structure used to test whether an element is a member of a set.
🔹 false: the element is definitely not in the set.
🔹 true: the element is probably in the set.
False-positive matches are possible, but false negatives are not.
The diagram below illustrates how the Bloom filter works. The underlying data structure is a bit vector; each bit corresponds to a position that a hash function can map an element to.
Step 1: To add an element to the bloom filter, we feed it to 3 different hash functions (A, B, and C) and set the bits at the resulting positions. Note that both “www.myweb1.com” and “www.myweb2.com” mark the same bit with 1 at index 5. False positives are possible because a bit might be set by another element.
Step 2: When testing the existence of a URL string, the same hash functions A, B, and C are applied to the URL string. If all three bits are 1, then the URL may exist in the dataset; if any of the bits is 0, then the URL definitely does not exist in the dataset.
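A minimal Bloom filter along these lines (the bit-array size and the way the three hash positions are derived from a single SHA-256 digest are illustrative choices, not the murmur/xxhash functions mentioned below):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)          # the bit vector

    def _positions(self, item: str):
        # Derive `num_hashes` positions from one SHA-256 digest (illustrative only).
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[i * 8:(i + 1) * 8], "big")
            yield chunk % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)       # set the bit

    def might_contain(self, item: str) -> bool:
        # True  -> probably in the set (false positives possible)
        # False -> definitely not in the set
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add("www.myweb1.com")
print(seen.might_contain("www.myweb1.com"))   # True (probably seen)
print(seen.might_contain("www.myweb3.com"))   # False (definitely new) in almost all cases
```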
Hash function choices are important. They must be uniformly distributed and fast. For example, RedisBloom and Apache Spark use murmur, and InfluxDB uses xxhash.
Question — In our example, we used three hash functions. How many hash functions should we use in reality? What are the trade-offs?
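One way to reason about it, as a sketch: for a filter of m bits holding n elements, the false-positive rate is minimized at k = (m/n)·ln 2. Fewer hash functions are cheaper per insert/lookup but raise the error rate; more than the optimum cost extra CPU and fill the filter faster without helping. Reusing the earlier 15B-page estimate:

```python
import math

n = 15e9                          # expected elements (pages), from the earlier estimate
m = int(54e9 * 8)                 # ~54 GB of bits, matching the sizing above

k_opt = (m / n) * math.log(2)     # optimal number of hash functions
fp_rate = (1 - math.exp(-k_opt * n / m)) ** k_opt   # resulting false-positive rate

print(f"k ~= {k_opt:.1f} hash functions, false-positive rate ~= {fp_rate:.1e}")
# -> about 20 hash functions for a ~1e-6 error rate at this size
```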