#1 – Basic solution
How to build a rudimentary web crawler?
One simple idea we've talked about in 8 Things You Need to Know Before a System Design Interview is to start simple. Let's focus on building a very rudimentary web crawler that runs on a single machine with a single thread. Starting from this simple solution, we can keep optimizing later on.
To crawl a single web page, all we need to do is issue an HTTP GET request to the corresponding URL and parse the response data, which is the core of a crawler. With that in mind, a basic web crawler can work like this (a short sketch follows the list):
- Start with a URL pool that contains all the websites we want to crawl.
- For each URL, issue an HTTP GET request to fetch the web page content.
- Parse the content (usually HTML) and extract potential URLs that we want to crawl.
- Add new URLs to the pool and keep crawling.
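To make this concrete, here is a minimal single-threaded sketch in Python using only the standard library. The names (`LinkExtractor`, `crawl`), the seed URL, and the `max_pages` cap are illustrative assumptions, not code from the original discussion:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    pool = deque(seed_urls)   # URL pool of pages waiting to be crawled
    seen = set(seed_urls)     # avoid re-crawling the same URL
    while pool and len(seen) <= max_pages:
        url = pool.popleft()
        try:
            # Issue the HTTP GET request and read the response body.
            with urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue          # skip pages that fail to fetch
        # Parse the HTML and extract candidate URLs.
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                pool.append(absolute)      # add new URLs back to the pool

crawl(["https://example.com"])
```

The `seen` set is already a small optimization over the bare loop above: without it, pages that link to each other would be fetched over and over.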
Depending on the specific problem, we may sometimes have a separate system that generates URLs to crawl. For instance, a program can keep listening to RSS feeds, and for every new article it can add the URL to the crawling pool.
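As a rough sketch of such a producer, the loop below polls an RSS feed and hands any unseen article URLs to the crawler. It assumes the third-party feedparser library, and the `add_to_pool` callback and poll interval are hypothetical choices:

```python
import time
import feedparser

def watch_feed(feed_url, add_to_pool, interval=300):
    seen = set()
    while True:
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            link = getattr(entry, "link", None)
            if link and link not in seen:
                seen.add(link)
                add_to_pool(link)  # hand the new article URL to the crawler
        time.sleep(interval)       # poll the feed every few minutes
```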