Saturday, August 15, 2020

System design: Build a Web Crawler


#1 – Basic solution

How to build a rudimentary web crawler?

One simple idea we’ve talked about in 8 Things You Need to Know Before a System Design Interview is to start simple. Let’s focus on building a very rudimentary web crawler that runs on a single machine with a single thread. With this simple solution in place, we can keep optimizing later on.

To crawl a single web page, all we need to do is issue an HTTP GET request to the corresponding URL and parse the response data, which is the core of any crawler. With that in mind, a basic web crawler can work like this (sketched in code after the list):

  • Start with a URL pool that contains all the websites we want to crawl.
  • For each URL, issue an HTTP GET request to fetch the web page content.
  • Parse the content (usually HTML) and extract potential URLs that we want to crawl.
  • Add new URLs to the pool and keep crawling.
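
Below is a minimal single-threaded sketch of this loop. It assumes the third-party requests and beautifulsoup4 packages; the function name, the max_pages cap, and the seed URL are illustrative choices, not part of any particular design.

```python
import requests                      # third-party: pip install requests
from bs4 import BeautifulSoup        # third-party: pip install beautifulsoup4
from urllib.parse import urljoin, urlparse

def crawl(seed_urls, max_pages=100):
    pool = list(seed_urls)           # step 1: the URL pool
    seen = set(seed_urls)            # never fetch the same URL twice
    fetched = 0
    while pool and fetched < max_pages:
        url = pool.pop(0)
        try:
            resp = requests.get(url, timeout=5)         # step 2: HTTP GET
        except requests.RequestException:
            continue                                    # skip unreachable pages
        fetched += 1
        soup = BeautifulSoup(resp.text, "html.parser")  # step 3: parse the HTML
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])         # resolve relative links
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                pool.append(link)                       # step 4: grow the pool

crawl(["https://example.com"])       # walk outward from a single seed
```

Even this toy version surfaces the real design questions: the seen set is what keeps the crawler from looping forever, and pool.pop(0) makes the crawl breadth-first; swapping in a different data structure changes the crawl order.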

Depending on the specific problem, we may also have a separate system that generates URLs to crawl. For instance, a program can keep listening to RSS feeds, and for every new article, add its URL to the crawling pool.
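
As a rough illustration, assuming the third-party feedparser package (and an arbitrary polling interval), such a listener could share the crawler’s pool and seen set:

```python
import time
import feedparser                    # third-party: pip install feedparser

def watch_feed(feed_url, pool, seen, poll_seconds=300):
    while True:
        for entry in feedparser.parse(feed_url).entries:
            if entry.link not in seen:        # only brand-new articles
                seen.add(entry.link)
                pool.append(entry.link)       # hand the URL to the crawler
        time.sleep(poll_seconds)              # wait before polling again
```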
