Saturday, August 15, 2020

System design: Build a Web Crawler


#1 – Basic solution

How to build a rudimentary web crawler?

One simple idea we’ve talked about in 8 Things You Need to Know Before a System Design Interview is to start simple. Let’s focus on building a very rudimentary web crawler that runs on a single machine with a single thread. With this simple solution in place, we can keep optimizing later on.

To crawl a single web page, all we need to do is issue an HTTP GET request to the corresponding URL and parse the response data, which is the core of any crawler. With that in mind, a basic web crawler can work like this (sketched in code after the list):

  • Start with a URL pool that contains all the websites we want to crawl.
  • For each URL, issue an HTTP GET request to fetch the web page content.
  • Parse the content (usually HTML) and extract potential URLs that we want to crawl.
  • Add new URLs to the pool and keep crawling.
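
Below is a minimal single-threaded sketch of this loop. It assumes the third-party requests and beautifulsoup4 packages; the function name, the max_pages cap, and the seed URL are illustrative choices, not part of any particular design.

```python
import requests                      # third-party: pip install requests
from bs4 import BeautifulSoup        # third-party: pip install beautifulsoup4
from urllib.parse import urljoin, urlparse

def crawl(seed_urls, max_pages=100):
    pool = list(seed_urls)           # step 1: the URL pool
    seen = set(seed_urls)            # never fetch the same URL twice
    fetched = 0
    while pool and fetched < max_pages:
        url = pool.pop(0)
        try:
            resp = requests.get(url, timeout=5)         # step 2: HTTP GET
        except requests.RequestException:
            continue                                    # skip unreachable pages
        fetched += 1
        soup = BeautifulSoup(resp.text, "html.parser")  # step 3: parse the HTML
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])         # resolve relative links
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                pool.append(link)                       # step 4: grow the pool

crawl(["https://example.com"])       # walk outward from a single seed
```

Even this toy version surfaces the real design questions: the seen set is what keeps the crawler from looping forever, and pool.pop(0) makes the crawl breadth-first; swapping in a different data structure changes the crawl order.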

Depending on the specific problem, we may also have a separate system that generates URLs to crawl. For instance, a program can keep listening to RSS feeds, and for every new article, add its URL to the crawling pool.
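
As a rough illustration, assuming the third-party feedparser package (and an arbitrary polling interval), such a listener could share the crawler’s pool and seen set:

```python
import time
import feedparser                    # third-party: pip install feedparser

def watch_feed(feed_url, pool, seen, poll_seconds=300):
    while True:
        for entry in feedparser.parse(feed_url).entries:
            if entry.link not in seen:        # only brand-new articles
                seen.add(entry.link)
                pool.append(entry.link)       # hand the URL to the crawler
        time.sleep(poll_seconds)              # wait before polling again
```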
