Distributed Web crawler — [Notes]
· What’s a web crawler in a nutshell
· Questions
· When is a URL “new”?
∘ Worth Revisiting?
∘ Bloom filter to check uniqueness of a page?
∘ Redis to check uniqueness of a page?
∘ Relational database to check uniqueness of a page?
∘ Summary
· Priority
∘ Politeness
∘ Frequency
∘ Final
· Avoid crawling duplicate URLs at Google scale - using a Bloom filter
What’s a web crawler in a nutshell
- Get a URL
- Fetch contents
- Store
- Extract new URLs from content
- Repeat
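A minimal single-machine sketch of this loop (Python standard library only; the regex link extraction, the 5-second timeout, and the page limit are simplifications for illustration):

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

HREF_RE = re.compile(r'href="([^"#]+)"')      # naive link extraction, good enough for a sketch

def store(url, html):
    print(url, len(html))                     # placeholder for "send the page to the indexer"

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)               # URLs waiting to be fetched
    seen = set(seed_urls)                     # URLs already scheduled (deduplication)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except OSError:
            continue                          # unreachable page: skip and move on
        fetched += 1
        store(url, html)
        for href in HREF_RE.findall(html):
            new_url = urljoin(url, href)      # resolve relative links against the page URL
            if new_url.startswith("http") and new_url not in seen:
                seen.add(new_url)
                frontier.append(new_url)
```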
Questions
- How many links do we need to crawl? — The entire Internet (~1.5 billion sites)
- Where should I start my crawl? — You’ll be provided a list to begin with (reference pages/seed pages)
- When should our crawl end? — Never
- What’s the result of my crawl? — Crawler should send all pages to an indexer
When is a URL “new”?
- Normalize the URL: if it is a relative URL it has no host, so we need to attach the host of the page it was found on.
- Sort the query parameters, because they may appear in a different order.
- Compare two URLs:
- If they have the same host, the same path, and the same (sorted) parameters, we assume they are the same URL (normalization sketched below).
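A sketch of that normalization with the standard library (the exact set of rules, e.g. dropping the fragment, is an assumption for the example):

```python
from urllib.parse import urljoin, urlparse, parse_qsl, urlencode, urlunparse

def normalize(url, base=None):
    """Return a canonical form of `url` so that equivalent URLs compare equal."""
    if base:
        url = urljoin(base, url)                           # attach host/path for relative URLs
    parts = urlparse(url)
    query = urlencode(sorted(parse_qsl(parts.query)))      # sort parameters
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),                              # host is case-insensitive
        parts.path or "/",
        "",                                                # drop the params component
        query,
        "",                                                # drop the fragment: it never reaches the server
    ))

# Both normalize to the same string, so they count as the same URL:
assert normalize("HTTP://Example.com/a?b=2&a=1") == normalize("/a?a=1&b=2", base="http://example.com/")
```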
Worth Revisiting?
- Another complication we have is time. The content of the same URL changes with time and those changes are relevant to us.
- The main page of a news site at the start of the year will look nothing like the main page of the same site at the end of the year. Everything will change, although the URL is exactly the same.
- The frequency of changes is different for every host and every subdomain.
Bloom filter to check uniqueness of a page?
Cons:
- Memory requirements
→ 1.5B sites × ~10 pages per site ≈ 15B pages, requiring roughly 50 GB of RAM
- False positives
→ At a 1-in-a-million false-positive rate, about 15K pages would never be indexed
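A quick back-of-the-envelope check of those numbers, using the standard Bloom filter sizing formula (the 15B pages and the 1-in-a-million error rate come from the bullets above):

```python
import math

n = 15e9     # pages: 1.5B sites x ~10 pages each
p = 1e-6     # target false-positive rate (1 in a million)

# Standard sizing formula: m = -n * ln(p) / (ln 2)^2 bits
m_bits = -n * math.log(p) / (math.log(2) ** 2)
print(f"~{m_bits / 8 / 1e9:.0f} GB of memory")   # ~54 GB, i.e. the ~50 GB quoted above
```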
Redis to check uniqueness of a page?
- Is a good option to store URLs
- Average URL = 50 bytes
- 15B URLs × 50 bytes = 750 × 10⁹ bytes ≈ 732 × 10⁶ KB ≈ 715,000 MB ≈ 700 GB
→ roughly 70 shards with 10 GB of RAM each
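A minimal sketch of the uniqueness check with Redis (assuming a local Redis instance, the redis-py client, and a made-up key name `seen_urls`):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def is_new_url(url: str) -> bool:
    # SADD returns 1 only if the member was not already in the set,
    # so the membership test and the insert happen in one atomic command.
    return r.sadd("seen_urls", url) == 1

if is_new_url("http://example.com/a?a=1&b=2"):
    print("schedule for crawling")
else:
    print("already seen, skip")
```

At 15B URLs a single set will not fit on one node, so the key space would be partitioned across the ~70 shards estimated above, e.g. by hashing the URL to choose a shard.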
Relational database to check uniqueness of a page?
- Can easily hold the ~700 GB of URL data (a single instance can hold up to ~2 TB)
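The same check against a relational database, sketched with SQLite purely so the example is self-contained (a production setup would use a server-grade RDBMS, but the idea of relying on a PRIMARY KEY / UNIQUE constraint is the same):

```python
import sqlite3

db = sqlite3.connect("crawler.db")
db.execute("CREATE TABLE IF NOT EXISTS seen_urls (url TEXT PRIMARY KEY)")

def is_new_url(url: str) -> bool:
    try:
        db.execute("INSERT INTO seen_urls (url) VALUES (?)", (url,))
        db.commit()
        return True                   # insert succeeded: first time we see this URL
    except sqlite3.IntegrityError:
        return False                  # PRIMARY KEY violation: URL was already stored

print(is_new_url("http://example.com/a"))   # True the first time, False afterwards
```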
Summary
- High throughput, weak consistency — Bloom filters
- Medium throughput, medium consistency — Redis
- Low throughput, high consistency — RDBMS
Priority
- Politeness — we shouldn’t crawl the same domain too often
- Frequency — we should crawl different domains at different paces, matching how often their content changes
Politeness
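A minimal sketch of one common way to enforce this: keep a queue of URLs per host and a minimum delay between requests to the same host (the 1-second delay is an assumed default, not something fixed by these notes):

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

MIN_DELAY = 1.0                                   # assumed: at most one request per host per second

host_queues = defaultdict(deque)                  # host -> URLs waiting for that host
next_allowed = defaultdict(float)                 # host -> earliest time we may hit it again

def enqueue(url):
    host_queues[urlparse(url).netloc].append(url)

def next_url():
    """Return a URL whose host is currently allowed to be crawled, or None."""
    now = time.monotonic()
    for host, queue in host_queues.items():
        if queue and now >= next_allowed[host]:
            next_allowed[host] = now + MIN_DELAY  # push back this host's next slot
            return queue.popleft()
    return None
```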
Frequency
- To start: set the revisit interval for each site to some default
- Each time a site is crawled:
→ if the content has changed => halve the interval (revisit sooner)
→ if the content has not changed => double the interval (revisit later)
- Problems?
→ How do we decide whether the content has changed?
→ If a site rarely changes, its interval keeps doubling towards infinity and we never revisit it again
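A sketch of this adaptive schedule (the default/minimum/maximum intervals and the content-hash change check are assumptions; the cap on the interval is one simple way to keep it from growing without bound, addressing the second problem above):

```python
import hashlib

DEFAULT_INTERVAL = 24 * 3600          # assumed: revisit once a day by default
MIN_INTERVAL = 3600                   # never more often than hourly
MAX_INTERVAL = 30 * 24 * 3600         # never less often than monthly (the cap)

intervals = {}                        # url -> current revisit interval in seconds
last_hash = {}                        # url -> hash of the content we saw last time

def on_crawled(url: str, content: bytes) -> int:
    """Update and return the revisit interval for `url` after a crawl."""
    digest = hashlib.sha256(content).hexdigest()       # crude "did it change?" check
    interval = intervals.get(url, DEFAULT_INTERVAL)
    if last_hash.get(url) != digest:
        interval = max(MIN_INTERVAL, interval // 2)    # changed: revisit sooner
    else:
        interval = min(MAX_INTERVAL, interval * 2)     # unchanged: back off
    intervals[url] = interval
    last_hash[url] = digest
    return interval
```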
Final
Avoid crawling duplicate URLs at Google scale - using a Bloom filter
How to avoid crawling duplicate URLs at Google scale?
Option 1: Use a Set data structure to check if a URL already exists. A set is fast, but it is not space-efficient at this scale.
Option 2: Store URLs in a database and check whether a new URL is already in it. This can work, but the load on the database will be very high.
Option 3: Bloom filter. This option is preferred. The Bloom filter was proposed by Burton Howard Bloom in 1970. It is a probabilistic data structure used to test whether an element is a member of a set.
🔹 false: the element is definitely not in the set.
🔹 true: the element is probably in the set.
False-positive matches are possible, but false negatives are not.
The diagram below illustrates how the Bloom filter works. The underlying data structure is a bit vector; each bit corresponds to a position that a hash function can map an element to.
Step 1: To add an element to the bloom filter, we feed it to 3 different hash functions (A, B, and C) and set the bits at the resulting positions. Note that both “www.myweb1.com” and “www.myweb2.com” mark the same bit with 1 at index 5. False positives are possible because a bit might be set by another element.
Step 2: When testing the existence of a URL string, the same hash functions A, B, and C are applied to the URL string. If all three bits are 1, then the URL may exist in the dataset; if any of the bits is 0, then the URL definitely does not exist in the dataset.
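A minimal Bloom filter along these lines (the bit-array size and the way the three hash positions are derived from a single SHA-256 digest are illustrative choices, not the murmur/xxhash functions mentioned below):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)          # the bit vector

    def _positions(self, item: str):
        # Derive `num_hashes` positions from one SHA-256 digest (illustrative only).
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[i * 8:(i + 1) * 8], "big")
            yield chunk % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)       # set the bit

    def might_contain(self, item: str) -> bool:
        # True  -> probably in the set (false positives possible)
        # False -> definitely not in the set
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add("www.myweb1.com")
print(seen.might_contain("www.myweb1.com"))   # True (probably seen)
print(seen.might_contain("www.myweb3.com"))   # False (definitely new) in almost all cases
```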
Hash function choices are important. They must be uniformly distributed and fast. For example, RedisBloom and Apache Spark use murmur, and InfluxDB uses xxhash.
Question — In our example, we used three hash functions. How many hash functions should we use in reality? What are the trade-offs?
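One way to reason about it, as a sketch: for a filter of m bits holding n elements, the false-positive rate is minimized at k = (m/n)·ln 2. Fewer hash functions are cheaper per insert/lookup but raise the error rate; more than the optimum cost extra CPU and fill the filter faster without helping. Reusing the earlier 15B-page estimate:

```python
import math

n = 15e9                          # expected elements (pages), from the earlier estimate
m = int(54e9 * 8)                 # ~54 GB of bits, matching the sizing above

k_opt = (m / n) * math.log(2)     # optimal number of hash functions
fp_rate = (1 - math.exp(-k_opt * n / m)) ** k_opt   # resulting false-positive rate

print(f"k ~= {k_opt:.1f} hash functions, false-positive rate ~= {fp_rate:.1e}")
# -> about 20 hash functions for a ~1e-6 error rate at this size
```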