How Search Engines Work
"All the books are on the floor." The quote says it all about the need for search engines in the context of the World Wide Web. The Internet works differently from the real world. In the real world, if you want to buy a book, you go to the nearest bookstore, select the one you want, pay for it and leave. The bookstore normally has a sign and is known to people in the locality, so you can ask for directions. But on the Internet you only have your screen in front of you: you can't see anything, there are no signs, and there is no one to ask for directions. So how do you find the books you want? The answer lies in search engines.
Therefore, if you are looking for books, you go to your favourite search engine and type the word "books" into the query box. The search engine then gives you all the results pertaining to "books", in order of popularity (4,620,000,000 results). As you can see, there are far too many pages to sift through unless what you want features on the first page. To narrow down your search, you can add an author's name to the word "books", for example "Charles Dickens", and the results will show all pages related to books written by that author (2,510,000 results). You want more? Type "books", the author's name and your city, and the search engine will deliver all the pages related to books by that author from your city, in all likelihood including the bookstore that is selling them (34,900 results). So how does it work?
With so much information spread across billions of pages, the content needs to be organised and delivered in a way that makes it easier for users to find what they are looking for; otherwise they would be searching for the proverbial "needle in a haystack". But before we start learning how to optimise web pages and websites for search, it is imperative that we learn how search engines work, in order to better appreciate how small tweaks to a page make a difference to the way it is ranked by the search engines, and hence to its position in the search result pages.
According to Wikipedia, the free online encyclopaedia, a web search engine "is designed to search for information on the World Wide Web. The search results are usually presented in a list of results and are commonly called hits. The information may consist of web pages, images, information and other types of files. Some search engines also mine data available in databases or open directories. Unlike Web directories, which are maintained by human editors, search engines operate algorithmically, or are a mixture of algorithmic and human input."
Typically, a search engine comprises three main parts: a web crawler, an indexer and the query software.
The web crawler, also called the robot, bot or spider, is the part of the search engine responsible for finding new pages or websites, looking for updates in pages or sites already indexed, and collecting data from those pages. Contrary to the impression given by the name, or even by the way the word is used, the spider is not actually crawling through the web's cables and machines. So when you hear phrases like "the spider visits the page" or "the frequency of crawl", what is meant is simply that the software is performing the action it was designed to perform.
The spider is, in effect, a program that visits web pages much as we all do, with the difference that it is much faster and goes through all the pages of a website unless the website has restricted access to spiders. Just as we request a website to send pages to our browser by typing the address into the address bar or by clicking a link, the spider requests content from the website. All the data collected by the spider is sent to the second program, called the indexer. Spiders are also tasked with following the links on a page; that is how new pages are found faster than they otherwise would be. The spider also makes repeated visits to websites, depending on how often a website is updated as well as other factors such as its popularity.
The actual process works something like this: the spider starts its search and lands on your website, either directly or through some other website that links to your site or page. The spider then proceeds to the home page and downloads the data in the head part of the HTML page, which includes the title and the meta tags (description and keywords), and also checks the robot instructions in the site's robots.txt file. Next, the spider goes through the contents, looks for the keywords you specified in the meta tags, notes their frequency of usage, notes all the alt tags, subtitles and headings used, and checks whether there are any links on the page. Note that some spiders will extract the entire page for analysis rather than only the usage of keywords. If the spider does find any links on the page, internal or external, it will proceed to those pages and process them in the same way.
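The extraction step described above can be sketched with Python's standard-library HTML parser. This is a toy illustration, not any real crawler's code; the sample page and its contents are invented for the example.

```python
from html.parser import HTMLParser

# A toy extractor in the spirit of what a crawler does with a downloaded
# page: pull out the title, the description/keywords meta tags, and any
# links to follow next.
class PageExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# A made-up sample page, standing in for content fetched over HTTP.
sample = """<html><head>
<title>Indian History</title>
<meta name="description" content="A short history of India">
<meta name="keywords" content="history, India">
</head><body>
<a href="/timeline.html">Timeline</a>
<a href="https://example.com/books">Books</a>
</body></html>"""

p = PageExtractor()
p.feed(sample)
print(p.title)             # Indian History
print(p.meta["keywords"])  # history, India
print(p.links)             # ['/timeline.html', 'https://example.com/books']
```

A real spider would fetch the page over the network and feed the discovered links back into its crawl queue; here the parsing step alone is shown.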
Here one needs to appreciate that, given the sheer volume of documents available on the web and the rate at which content is being added, it is practically impossible for the crawlers to keep pace, and hence certain policies need to be laid down for the actions the spiders undertake. The first is to decide which pages are to be crawled, i.e. to prioritise the visits to these pages. The exact metrics that influence this prioritisation are known only to the search companies, but may include anything from breadth to links to page rank. The job is also made easier by restricting visits to pages with certain extensions, such as .html or .asp, or by what is called focussed crawling, wherein the crawler looks for similar pages.
Once a website has been visited and its pages downloaded and indexed, a re-visit policy needs to be framed. It is important for a search engine not to miss updates on pages that may be important, so it cannot afford to treat all pages and websites equally and visit them at uniform intervals. A factor like the frequency of updates to a website therefore plays an important role in formulating the re-visit policy: obviously, news websites will have a higher visit rate than an average website.
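One simple way to picture such a re-visit policy is a priority queue keyed on when each site is next due, with the interval set by how often the site changes. This is a minimal sketch with invented site names and intervals; real schedulers weigh many more signals.

```python
import heapq

# Assumed re-visit intervals (in hours), invented for illustration:
# a frequently updated news site gets a short interval, a static
# site a long one.
intervals = {
    "news-site.example": 1,      # updates constantly: revisit hourly
    "blog.example": 24,          # revisit daily
    "static-site.example": 168,  # rarely changes: revisit weekly
}

# Heap of (next-due-time, site); the most overdue site surfaces first.
schedule = [(interval, site) for site, interval in intervals.items()]
heapq.heapify(schedule)

# Simulate the crawler taking the next due site five times and
# re-scheduling it one interval later each time.
order = []
for _ in range(5):
    due, site = heapq.heappop(schedule)
    order.append(site)
    heapq.heappush(schedule, (due + intervals[site], site))

print(order)  # the hourly news site is due every time in these first slots
```

With these intervals, the news site monopolises the early slots, which is exactly the bias the text describes: frequently updated sites are crawled far more often.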
Because of their speed and fully automated actions, web crawlers consume a lot of resources and can end up overloading both the network and the server being crawled. It is therefore desirable that a crawler does its job without overloading the server, in terms of bandwidth and the number of requests it makes within a given time period. For this purpose, the crawler has to be well written and should leave a time delay between subsequent requests. There is also the robots.txt exclusion standard, wherein webmasters can indicate to the crawler which pages on the website are not to be visited, thereby conserving their own bandwidth as well as the time the crawler spends getting to the relevant pages.
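Python's standard library ships a parser for exactly this exclusion standard, which makes the mechanism easy to demonstrate. The robots.txt rules below are hypothetical, written as a webmaster might publish them.

```python
import urllib.robotparser

# A hypothetical robots.txt: all crawlers are asked to skip /private/
# and to wait 10 seconds between requests.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt)

print(rp.can_fetch("*", "https://example.com/index.html"))  # True
print(rp.can_fetch("*", "https://example.com/private/x"))   # False
print(rp.crawl_delay("*"))                                  # 10
```

A well-behaved crawler consults these rules before every request and honours the stated delay; note that the standard is advisory, relying on the crawler's cooperation rather than any enforcement.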
Since crawling is a highly time-consuming and resource-intensive operation, it is essential that resources are not wasted by different processes of the same search engine inadvertently visiting the same pages or websites. As mentioned earlier, the idea is to find fresh and updated pages. The efficiency of a search engine in crawling, indexing and serving fresh, relevant results will therefore depend on the policies it implements.
All the pages that the spider finds go into the index, or the catalogue. The index is like a huge collection of all the pages the spider has ever visited, and it is updated whenever the spider next visits a page. You may have observed that when a website cannot be opened by your browser, the search result page offers to show you a cached copy of the page. When you view the cached copy, you are actually seeing the page from the index as it was last updated: it is delivered to your browser from the search engine's index, not from the website's server.
The number of pages on the web, and hence in the search engine's catalogue, is enormous. If the search server had to go through billions of pages each time someone looked for a particular word, it would spend far too long finding the relevant documents. This is where indexing comes in. The index records, for each word, the documents (identified by numbers) in which it appears. For example, the word "history" may appear in documents n1, n3, n5, n6 and n9, while the word "India" may appear in documents n2, n5, n7, n8 and n9. Once the index is built, the search engine is ready to answer the search string entered by the user: when someone searches for "history", the first set of documents is presented, and when someone searches for "India", the second set is presented. The list of documents containing a given word is called a posting list. Now what happens when someone looks for both words together, i.e. "history of India" or "Indian history"? In such cases, a simple walk through the two posting lists provides the result; this is called intersecting the posting lists. In the above example, the result page would comprise documents n5 and n9.
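The intersection described above can be shown in a few lines of Python, using the same toy document numbers as the text.

```python
# A toy inverted index built from the document numbers used in the text:
# "history" appears in n1, n3, n5, n6, n9 and "india" in n2, n5, n7, n8, n9.
index = {
    "history": ["n1", "n3", "n5", "n6", "n9"],
    "india":   ["n2", "n5", "n7", "n8", "n9"],
}

def intersect(p1, p2):
    """Walk two sorted posting lists in step, keeping common documents."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect(index["history"], index["india"]))  # ['n5', 'n9']
```

Because each posting list is kept sorted, the intersection needs only one pass over both lists, which is what makes multi-word queries fast even over billions of documents.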
The example given above is that of an inverted index. There is also the forward index scheme, which stores, for each page, all the words appearing on it. To make things more complicated still, there is the concept of index merging, but we will not go into those technical details here. Suffice it to say that indexing is itself a complicated process, as it has to carry out the jobs of updating content and serving search queries at the same time. Added to that is the question of the disk space required for storing the content, which is optimised by compression and has a direct impact on the time and processing power required for decompression.
Now the index is ready, but there is still the question of relevance: which of these pages are more relevant to the search string? To decide this, other parameters of the page come in. For instance, in the above example, a web page having both "history" and "India" in the title is obviously more relevant than, say, a page with "India" in the title but "history" somewhere in the text. Similarly, the two words appearing together carry more weight than the two words appearing separately. Next comes the number of links in the article to other pages on the same subject, and the popularity of the websites linking to the page is another factor in its ranking. Once the ranking is done, the search engine is ready to serve the web pages to the user as search results: the higher a page ranks, the higher it appears in the results. The analysis and ranking of pages is done by a complex mathematical formula, referred to as an algorithm. As mentioned earlier, the exact algorithm is a well-guarded secret, but there is only so much information in a web page; we can easily work out which parameters are important, and common logic should do the rest.
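These signals can be combined into a deliberately simplified scoring sketch. The weights, pages and link counts below are invented for illustration; real engines combine hundreds of closely guarded signals, and the crude substring matching here is nothing like production text analysis.

```python
# Toy ranking: query words in the title outweigh words in the body,
# the words appearing together as a phrase score extra, and inbound
# links act as a crude popularity proxy. All weights are invented.
def score(page, query_words):
    s = 0.0
    title = page["title"].lower()
    body = page["body"].lower()
    for w in query_words:
        if w in title:          # crude substring check, for illustration
            s += 3.0            # title match: strong signal
        if w in body:
            s += 1.0            # body match: weaker signal
    phrase = " ".join(query_words)
    if phrase in title or phrase in body:
        s += 2.0                # the words appear together
    s += 0.5 * page["inbound_links"]
    return s

# Two made-up pages: one matches the query strongly, one only weakly
# but has more inbound links.
pages = [
    {"title": "Indian History", "body": "a survey of indian history",
     "inbound_links": 4},
    {"title": "India", "body": "notes that mention history once",
     "inbound_links": 10},
]

ranked = sorted(pages, key=lambda p: score(p, ["indian", "history"]),
                reverse=True)
print([p["title"] for p in ranked])  # ['Indian History', 'India']
```

Even though the second page is more "popular", the strong title and phrase matches of the first page outweigh it, which mirrors the intuition in the text.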
The third and final part of the search engine is the query software. This is the front end of the search engine, the part you actually see, and it communicates with the index on your behalf. When a user initiates a query, the search engine matches the query against its index and provides the results in the order of relevance decided by the ranking.
The entire process can be understood with the help of the block diagram.
Having data centres around the world makes the process faster. The central network coordinates the activities and, after receiving the results, delivers them to the user.
What one must remember here is that the search is not really live: the search engine is not going through billions of pages in those few fractions of a second, but merely extracting information from previously indexed and stored pages. Therefore, there is always a likelihood that your page has not been crawled since your last update, or perhaps has been crawled but not yet indexed. Patience generally helps one tide over such issues.
The queries can be broadly classified into the following three categories:
Informational: A query covering a broad topic, like "history", which might have millions of results.
Navigational: A query which is seeking a specific website or page, e.g., CNN.
Transactional: A query wherein the user wants to perform an action, e.g. buying or downloading something.
What we really need to understand about the entire system is how users behave and why search engine optimisation is important. Multiple independent studies of web user behaviour have revealed the following:
1. The average search query is more than two words long, i.e. people search for phrases and know that adding words gives more relevant results.
2. 50% of users do not go beyond the second page of the search engine result pages.
3. A large portion of the queries by any given user are repeat queries, wherein the user is trying to re-find a particular page and in most cases will click on the same result as last time.
4. 80-20 rule: a small number of search words account for a major portion of the terms used in queries, simply meaning that some words are used far more often than others.
5. Less than 5% of users use the advanced search features offered by the search engines.
It is this user behaviour that is pushing the search engines towards semantic search, meaning that they are evolving to improve search accuracy by trying to understand user intent and the contextual meaning of what is typed into the query box. The idea is to deliver targeted results to the user rather than random pages related to certain words. As the search engines evolve, it is best that websites keep pace and are optimised to target better rankings in the search engine results.