Friday, April 22, 2022

Udemy course: Pragmatic system design | Alexey Soshin | Section 11: Design a web crawler (aka Google Crawler)


I think I made a good choice in purchasing this course. I love learning from mock interview sections like this one:

Section 11: Design a web crawler (aka Google Crawler). 

I was asked to work on a system design for a web crawler a few years ago, and I am always looking for good ideas to solve this system design question.

Current design | Uniqueness checker 



Bloom filter | uniqueness design talk 

Cons:
  • Memory requirements (rough sizing sketch below)
    • 1.5B sites, if each site has 10 pages on average (≈15B URLs)
      • ≈50 GB of RAM
  • False positives
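To make the memory point concrete, here is a toy Bloom filter sketch in Python. This is not the course's implementation; the 1% false positive rate and the demo values are my own assumptions. For 15B URLs, a 1% rate needs only about 18 GB, so the ~50 GB figure implies roughly 27 bits per URL and a much stricter false positive rate (on the order of a few per million).

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter for URL uniqueness checks (illustrative only)."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Optimal number of bits: m = -n * ln(p) / (ln 2)^2
        self.num_bits = math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
        )
        # Optimal number of hash functions: k = (m / n) * ln 2
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, url: str):
        # Double hashing: derive k bit positions from two digests of the URL.
        h1 = int.from_bytes(hashlib.md5(url.encode()).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(url.encode()).digest(), "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url: str) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))


# Small demo instance (do NOT size it for 15B URLs on a laptop).
bf = BloomFilter(expected_items=1_000_000, false_positive_rate=0.01)
bf.add("https://example.com/")
print(bf.might_contain("https://example.com/"))       # True
print(bf.might_contain("https://example.com/other"))  # almost certainly False

# Back-of-the-envelope memory for 15B URLs at a 1% false positive rate:
n, p = 15_000_000_000, 0.01
print(f"~{-n * math.log(p) / (math.log(2) ** 2) / 8 / 1e9:.0f} GB")  # ≈18 GB
```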



Key-value store? | uniqueness design talk 

  • Average URL = 50 bytes
  • 15B URLs
  • = 750,000,000,000 bytes
  • ≈ 732,000,000 KB
  • ≈ 715,000 MB
  • ≈ 700 GB
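The course notes don't include code for the key-value option, but here is a minimal sketch of what a Redis-backed uniqueness check could look like, assuming the redis-py client and a local Redis instance (the host, port, and the seen: key prefix are my assumptions). SET with NX only writes when the key is absent, so its return value doubles as a "have we seen this URL?" test.

```python
import redis  # assumes the redis-py client is installed (pip install redis)

# Connection details are assumptions for illustration.
r = redis.Redis(host="localhost", port=6379)


def is_new_url(url: str) -> bool:
    """Return True the first time a URL is seen, False on repeats.

    SET with NX succeeds only if the key does not exist yet, so the
    result tells us whether this URL is new. Storing an empty value
    keeps memory close to the ~50 bytes-per-URL estimate above.
    """
    return r.set(f"seen:{url}", b"", nx=True) is True
```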

Plain old DB? 
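The notes leave this option open. For reference, the usual relational approach is a UNIQUE or PRIMARY KEY constraint on the URL column, so the database rejects duplicates atomically; this is the "low throughput, high consistency" option from the summary below. A minimal sketch with Python's built-in sqlite3 module (the file, table, and column names are my own):

```python
import sqlite3

conn = sqlite3.connect("crawler.db")  # file name is an assumption
conn.execute("CREATE TABLE IF NOT EXISTS seen_urls (url TEXT PRIMARY KEY)")


def is_new_url(url: str) -> bool:
    """Insert the URL; the primary key rejects duplicates atomically."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO seen_urls (url) VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError:
        return False
```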




Summary | Three choices: Bloom filters, Redis, RDBMS 

  • High throughput, weak consistency - Bloom filters
  • Medium throughput, medium consistency - Redis
  • Low throughput, high consistency - RDBMS
