How Search Engines Work
September 29, 2007 – 5:30 pm
Anyone who has used Google, Yahoo, or Ask.com may wonder how search engines get their information. The term “search engine” is often used loosely to describe different ways of gathering information about websites and building a database from it. There are two main approaches: crawler-based search engines and human-powered directories. They work very differently.
Crawler-Based Search Engines
Crawler-based search engines, like Google, create their listings automatically. A computer program called a “spider” “crawls” the web, and what it finds is stored in a database. End users then search through those results.
But what happens if you change your web page? A crawler-based search engine eventually returns to your site and finds the changes. Listings are based on keywords that appear on your web page; page titles, body copy, and other elements all play a role.
The Parts of a Crawler-Based Search Engine
A crawler-based search engine has three parts. The first is the spider, or crawler, which visits a web page, reads it, and follows the links to other pages within the same site. This is what it means for a site to be “spidered” or “crawled.” The spider returns to the site regularly, such as every month or two, to look for changes. This matters: once the spider has found a site, the site becomes part of the territory it covers and will not be ignored.
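The visit, read, and follow-links loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `LinkCollector`, `crawl`, and `PAGES` are made up for the example, and the "web" is a small in-memory dictionary rather than real sites, so nothing here reflects how any particular search engine's spider is built.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Visit start_url, then follow links that stay on the same site.

    `fetch` is any function mapping a URL to its HTML (or None if missing).
    Returns a dict of URL -> HTML for every page reached.
    """
    site = urlparse(start_url).netloc
    to_visit = [start_url]
    seen = {}
    while to_visit:
        url = to_visit.pop(0)
        if url in seen:
            continue
        html = fetch(url)
        if html is None:
            continue
        seen[url] = html
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            # Only queue links that point back into the same site.
            if urlparse(absolute).netloc == site and absolute not in seen:
                to_visit.append(absolute)
    return seen


# Hypothetical three-page site held in memory, so no network access is needed.
PAGES = {
    "http://example.com/": (
        '<a href="/about.html">About</a> '
        '<a href="http://other.example/">Elsewhere</a>'
    ),
    "http://example.com/about.html": '<a href="contact.html">Contact</a>',
    "http://example.com/contact.html": "<p>Write to us.</p>",
}

crawled = crawl("http://example.com/", PAGES.get)
print(sorted(crawled))
```

The link to `other.example` is skipped because it leaves the site, while the contact page is reached even though the home page never links to it directly.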
The second part is the index, sometimes called the catalog. Everything the spider finds goes into it. The index is like a giant book containing a copy of every web page that the spider finds. If a web page changes, then this book is updated with the new information. A spider often revisits a website at a rate that matches how frequently the site changes, and this affects how up to date the index is.
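To make the idea of an index that is updated when a page changes more concrete, here is a minimal sketch. The `Index` class, its `add` and `lookup` methods, and the sample pages are all assumptions invented for this example; a real index is far larger and more sophisticated.

```python
import re
from collections import defaultdict


class Index:
    """Toy index: keeps a copy of each page plus a word -> URLs lookup."""

    def __init__(self):
        self.pages = {}                # URL -> stored copy of the page text
        self.words = defaultdict(set)  # word -> set of URLs containing it

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, url, text):
        """Store a page, replacing any older copy of the same URL."""
        if url in self.pages:
            # Forget the words from the old copy before storing the new one.
            for word in set(self.tokenize(self.pages[url])):
                self.words[word].discard(url)
        self.pages[url] = text
        for word in set(self.tokenize(text)):
            self.words[word].add(url)

    def lookup(self, word):
        """Return the URLs whose stored copy contains the word."""
        return sorted(self.words.get(word.lower(), set()))


index = Index()
index.add("http://example.com/", "Fresh coffee beans roasted daily")
index.add("http://example.com/tea.html", "Loose leaf tea and coffee gifts")

# The spider comes back and finds the home page has changed.
index.add("http://example.com/", "Fresh bread baked daily")

print(index.lookup("coffee"))
print(index.lookup("bread"))
```

After the update, the home page no longer matches "coffee" but does match "bread", which is the behavior the paragraph above describes.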
Can a web page be spidered but not indexed? Yes. It can take a while for new pages or changes that the spider finds to be added to the index. Until that happens, the page is not available to people searching.
The final part is the search engine software. This program sifts through the millions of pages recorded in the index to find matches to a search and ranks them in order of what it believes is most relevant. The algorithms that do this are the key to the success of Google, Yahoo, and other search engines. They are proprietary and closely guarded secrets.
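Since the real ranking algorithms are secret, the sketch below uses a deliberately simple stand-in: a page's score is the number of times the query words appear in it. The `search` function, the scoring rule, and the `INDEX` data are assumptions for illustration only, not the method of any actual search engine.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def search(index, query):
    """Return URLs matching every query word, best score first.

    Score = how many times the query words occur on the page. This is a
    stand-in only; real engines keep their ranking formulas secret.
    """
    terms = tokenize(query)
    results = []
    for url, text in index.items():
        counts = Counter(tokenize(text))
        if terms and all(counts[t] > 0 for t in terms):
            score = sum(counts[t] for t in terms)
            results.append((score, url))
    # Highest score first; ties broken alphabetically by URL.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for score, url in results]


# Hypothetical index: URL -> stored page text.
INDEX = {
    "http://example.com/a": "coffee beans and coffee grinders",
    "http://example.com/b": "tea, cocoa and one bag of coffee",
    "http://example.com/c": "garden tools",
}

print(search(INDEX, "coffee"))
```

Page `a` mentions "coffee" twice and page `b` once, so `a` is listed first; page `c` does not match and is left out.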
Human-Powered Directories
If you want human judgment behind the search, a human-powered directory, such as the Open Directory Project, is a place to start. Here, you submit a short description of your entire site to the directory, or editors write one for sites they review. A search then looks for matches only in the descriptions that have been submitted.
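The practical consequence of searching only the descriptions can be shown with a short sketch. The `DIRECTORY` entries and the `directory_search` function are hypothetical and are not taken from any real directory.

```python
# Hypothetical directory entries: each site has one short description.
DIRECTORY = [
    {
        "url": "http://example.com/",
        "description": "Family bakery selling bread and cakes",
    },
    {
        "url": "http://example.org/",
        "description": "Guide to growing tomatoes at home",
    },
]


def directory_search(entries, query):
    """Return URLs whose description contains every word of the query."""
    words = query.lower().split()
    matches = []
    for entry in entries:
        described = entry["description"].lower().split()
        if words and all(word in described for word in words):
            matches.append(entry["url"])
    return matches


print(directory_search(DIRECTORY, "bread"))
print(directory_search(DIRECTORY, "croissants"))
```

Even if the bakery's pages mention croissants, that search finds nothing, because the word is not in the submitted description. This is why the wording of the description matters so much for a directory listing.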
One company, Maholo, provides a working framework for building a peer-to-peer web search engine, in which people jointly form a search engine without central servers or a central actor. Another is ChaCha, which guides you through your searches; it currently has 10,000 guides working on search requests. Sproose allows people to vote on the quality of a website, which can move its ranking up or down.
Hybrid Search Engines
Hybrid search engines are directories that use spiders to index the pages you submit to them, and you pay for inclusion. Their spiders crawl only the pages you submitted. Because inclusion is paid, this can be an expensive option for large websites, so you have to weigh the cost of submission against the benefit to see its real value to you.

