Julia's coding blog - Practice makes perfect

From January 2015, she started to practice leetcode questions; she trains herself to stay focus, develops "muscle" memory when she practices those questions one by one. 2015年初, Julia开始参与做Leetcode, 开通自己第一个博客. 刷Leet code的题目, 她看了很多的代码, 每个人那学一点, 也开通Github, 发表自己的代码, 尝试写自己的一些体会. She learns from her favorite sports – tennis, 10,000 serves practice builds up good memory for a great serve. Just keep going. Hard work beats talent when talent fails to work hard.

Wednesday, June 15, 2022

System Design Interview: Search Engine

Here is the link.

SystemDesign

Sep 15, 2021

12 min read

As the volume of information available on the internet continues to grow exponentially, so does the importance of a powerful search engine such as Google and Yahoo Search. These search engines are built to find, understand, and organize billions of web pages available on the internet to present the most relevant suggestions to the queries that the users type into the search bar. Let’s design a mini search engine similar to Google that can help searchers locate what they’re looking for.

What Are The Requirements Of The System

Functional Requirements

The search engine should be able to find and return from its database, all files relevant to the keyword typed in the search bar by a user.
The files should be arranged in order, according to their relevance to the searched keyword and popularity.

Non Functional Requirements

The system should have low latency. Whenever a user makes a query, we want the search results to appear as fast as possible.
The system should have a high throughput. Our search engine should be able to respond to several queries at a time.
The system should be scalable. It should be able to accommodate a growing number of users and files in its database.

The 3 Main Components Of A Search Engine

Here’s the high level architecture of a search engine that shows how the 3 main building blocks that make up the system are arranged:

At one end is the web with millions of pages and at the other end is the user who tries to access some selected pages by typing in a query into the browser. Our search engine is responsible for making the web pages available to the user in a systematic manner. The search engine is made up of three main components: crawler, indexer and retriever.

Crawler

The first component of our search engine is the crawler. The purpose of the crawler is to fetch all the different web pages from the internet through links available on the pages or those present in robot.txt files of websites. Google uses distributed crawlers to crawl millions of pages all over the web. The crawled pages are cached and fed into the indexer, which is the next component of our system.

Indexer

The indexer is the second component of the search engine that continuously runs at the backend, building an index of the files cached by the crawler. As new files become available on the internet, it’s included in the index by the search engine. The purpose of indexing is to make lookup faster since all the file information is stored against an index and can be retrieved from that unique index.

Retriever

The retriever is the third major component of the search engine. This subsystem uses the index created by the indexer to answer users’ queries that are typed into the browser.

Let’s understand how these components work in detail.

Component 1 : Crawler

Web crawler, spider, and robot are different names for the same system, the purpose of which is to fetch pages from the web. Designing a web crawler is in itself a broad topic, discussed in detail in the course Designing a Web Crawler by Educative, but here’s an overview:

The crawler starts with a handful of ‘seed’ pages arranged in a priority queue.
It fetches the pages in the priority queue and caches them in a persistent database.
Besides caching the pages, it also extracts the hyperlinks from them.
These hyperlinks are added to the priority queue.
The URLs in the priority queue are fetched and cached, and the cycle continues.

Though the above steps are the simplest depiction of a crawler, there are several challenges that a web crawler actually faces. It needs to support the different file types and URL extensions available on the internet, comply with the robot exclusion protocols used by websites, and devise suitable crawling frequencies for websites that are updated regularly.

Component 2 : Indexer

In simple terms, an index makes it easier to identify a file based on a simple ID, rather than the complete file name or URL. The main question here is how to index the web pages. Let’s look at the options:

Trivial Solution

As the simplest solution, you can store the files in a list or an array, such as this one:

F = { a, b, c, d … }

Suppose a, b, c, d and so on are files. The index of file ‘a’ is 0, the index for file ‘b’ is 1, for ‘c’ it is 2, and so on. To retrieve file ‘b’, we simply point at that specific index of the list, i.e., F[1].

This is the simplest possible way for indexing files. In the database, indexes can also be stored in a table such as the one in the diagram below.

In the diagram, we have files, 1 through n, that we want to index. Now, the database stores a table where the first column is the index of the file, and the second column stores the content of that file.

When the system wants to retrieve a file, it passes the index of that file through a lookup function to receive its content from the index table in the database.

Shortcomings Of This Approach

The major challenge of the trivial solution discussed above is scanning the database. With this solution, we have a table with two columns: index and content. When the user searches for a keyword, the search system uses LIKE operator in a WHERE clause to search for patterns between the keywords and the content and retrieves the file that includes the keywords.

This approach to search for keywords in the documents will involve a high latency when there are many documents to scan. Scalability and efficiency are two main challenges that this trivial technique to index documents fails to solve, presenting the need for a smarter solution.

Standard Approach — Inverted Index

To create a scalable and efficient search engine, most real-world applications use inverted index as a standard approach for indexing web pages.

The system creates an inverted index table from the crawled web pages. In simpler terms, you can call it a word-document table since each word is stored against the document IDs in which it appears. When stored in a hashmap, the word is the key and its value is the document list in which it is present. When the user types in the search keyword, the index is searched for that keyword, and all the documents in which it appears are returned. Alternatively, the inverted index can also be stored in B-Tree or trie data structure. All these data structures allow you to locate the desired index much faster.

The following are the steps that the system follows to build an inverted index from the documents fetched by the crawler:

Pre-Processing

In the first step, the ‘stop words’, grammatical tenses and casing are removed from the documents since they don’t matter to our application. Removing these before generating the word document table saves storage space and search time. ‘Stop words’ are words like ‘the’, ‘I’, ‘in’, ‘you’, ‘an’ etc., which occur frequently in documents and are not useful in searches.

Grammatical tenses, such as ‘form’, ‘formed’, ‘formation’ are considered the same and stored in a single row in the index as ‘form’ to save storage space and make lookup faster. This is done by a procedure called ‘stemming of the root word’. Similarly, words like ‘jumped’, and ‘jumping’ are stemmed to extract the root word, and stored as ‘jump’ in the index. ‘Porter’s Stemmer’ is a tool that you can use for stemming the root word. There are other tools available as well.

Build Inverted Index

Once the documents are pre-processed and we have all the meaningful words we want to include in the inverted index, the next step is to build the inverted index. The index is built by parsing the documents for words one by one. For each word, the reference of the document it is present in is included in the table.

While scanning a document, each word that the system comes across is located in the index and the ‘Documents’ column for that word is updated to include the document being scanned. If, while scanning a document, the system comes across a word that’s not already present in the index, a new row is created for the word, and the reference for the document is added to the ‘Documents’ column.

For each word in the table, the ‘Documents’ column will have references to all the documents that the word appears in. Depending on the application, you can also add additional columns in the index, such as frequency of the word, it’s location in the documents etc.

Consider the example below. For simplicity, we will assume that there are only 2 documents to index. After removing the stop words and stemming the root words from the content of the two documents, we have a handful of words to include in the index. These words are included in the first column of the index as shown in the table:

The word ‘apple’ appears in both document 1 and document 2 so the document IDs for both documents are included in the ‘Documents’ column. The word ‘health’ also appears in both documents, while ‘fruit’ only appears in the document with ID 1. Similarly, there will be one row entry for each unique word and the document IDs in which it appears will be included in the ‘Documents’ column.

Inverted index helps search faster for any keyword that the user enters in the search box. Once the search engine has the keyword, it locates it in the index and returns all the document IDs in which it appears to the user.

Component 3 : Retriever / Query Processor

Now that we have the inverted index in the metadata database for our search engine, the next component to discuss is the retriever. The function of the retriever is to use the inverted index stored in the database to respond to the users’ queries.

Preprocessing Query

As the initial step, this module of the search engine processes, and, if needed, rewrites the query to make it easier to match it to the right documents. Similar to preprocessing the documents before building an inverted index, ‘stop words’ removal, spelling correction and case folding are some of the operations carried out on the users’ queries before retrieving documents based on them.

Document Retrieval

Since our inverted index has individual words stored in each row and the corresponding documents in the column, document retrieval based on single-word queries is the simplest case. However, users often search for phrases and sentences, not single words. Even after removal of ‘stop words’, queries will have multiple words. With phrases typed in the search engine, there are two types of document searches: conjunctive and disjunctive search.

Conjunctive Search

Conjunctive search is one in which the search engine locates documents which contain all the words present in the user’s query. To be more specific, conjunctive search is AND (intersection) operation on the individual results of each word in the query.

Take, for instance, the same inverted index that we built earlier in the blog using two documents. Now suppose someone searches for “healthy fruits”. After stemming the root word, we have two words, ‘health’ and ‘fruit’, as shown in the diagram below:

‘Health’ appears in documents 1 and 2, and fruit only appears in document 1. We want to pick the documents that have all the words typed by the user, so we’ll perform an AND operation on the results and get document 1 as the result as it has both words, health and fruit. It is returned to the user as the best match for the query that they typed in.

Disjunctive Search

Disjunctive search is the one in which the search engine returns a document as a match if it contains at least one word present in the search query. Simply put, it is an OR operation on the result of all the words in the search query.

Let’s perform a disjunctive search for the same example that we discussed above. The user types in “healthy fruits” into the search box. For disjunctive search, we want to return all the documents that have at least one of the words in the search query.

After stemming the root word, we have two words, “health” and “fruit” to search in the inverted index. For the word “health”, we get documents with ID 1 and 2 from the inverted index, while for “fruit”, we get document 1 in the inverted index. An OR operation (union) is performed on the results. The final result comprising of documents 1 and 2 is returned to the user.

Document Retrieval In Applications

We’ve learned the basics of conjunctive and disjunctive search. Search engines, like Google and Yahoo Search, typically use a combination of the two techniques to deliver results. The documents that have all the search words appear on the top of the search results while those that have fewer search words appear below. Those that have only 1 search word will appear at the bottom.

Other factors are important too. Sometimes, it can be critical to our application that the search words appear in the same order as the one typed by the user. For example, if you search “Paris Hilton”, you’re interested in the information for the actress. If the search engine shows results for “Hilton Paris” instead, it could be information about Hilton hotel in Paris. In this case, it matters to the user that the documents in the results have words appearing in the same order.

If we want the search engine to accommodate such cases, the inverted index should also store the position of the words. During document retrieval, the search engine can check if the position is valid and the document can appear in the search results.

The frequency in which the search words appear in the documents can also be important to our application and may be incorporated when building the inverted index and in the document retrieval process. The framework that the system uses to display the results depends on the application’s purpose and its use cases.

Conclusion

So this is how a mini search engine, like Google, is designed. While we have covered the basics, in reality, Google uses complex models to rank pages appearing in search results. Furthermore, it shards databases across multiple machines to accommodate the large number of documents in searches.