My Personal Search-Engine

Current Strategies

  Are we giving up on real spiders? Should we instead focus on the quality of the `real' engines by solving their problems directly?

One of the main reasons why the `real' spiders cannot fulfill our need to discover information is that their development is slowed down by their current structure. Everyone who wants to experiment with improving them has to re-implement every part of the system -- each time. Bright new ideas cannot be tried out unless a lot of money is invested, because building a useful search engine takes a lot of effort and time.

Once `real' spiders are improved, both meta-spiders and specialized spiders become superfluous. Once implementation is easier, more variations can be tested, and "My personal search engine" is born.

Fetching the pages

Severe problems exist with the way spiders obtain their raw data. I will mention some of the worst here:
  • Each search-engine needs to retrieve the contents of pages held on web sites on a regular basis to build its search tables. Engines which span the whole WWW take more and more time to visit all the pages, to check their existence, and to index their content. As mentioned earlier, AltaVista needs about seven weeks to scan the Net once. Some pages, however, change hourly. Quite some delay!

  • Each spider uses its own method to fetch pages. The first versions are typically rapid-firing: a rapid-firing spider fetches pages site-by-site, so the site it is working on is fully occupied serving that spider and unable to serve `normal' visitors. This approach is easier to implement but destructive for the visited sites, blocking all access to them for real users. The HTTP/1.1 protocol performs much better than its 1.0 predecessor, but seems too complicated for some implementors to utilise successfully. (A sketch of more polite fetching follows this list.)

  • Some producers of server software (no names here) encourage web site designers to put all of their pages into a private database. Access to these pages demands a lot from the underlying (operating?) system. Such systems are heavily loaded even by a few normal users, so any superfluous access should be avoided to keep them stable.

  • There are many spiders. The Big Search-Engine Index lists 420 search-engines. Happily, not all of those spiders cover the whole Internet; some are restricted to a country, a language, or a subject. Still, no site is happy when all of these spiders extract all of its pages over and over again; it can seem that only spiders are interested in your content.

  • A large number of interesting sites require registration, and some even require payment. So these sites are not indexed, even though the person looking for facts might be willing to register or pay.

The badly-implemented spiders and the exhaustive database retrievals have forced many sites to block access to every spider. Well-behaved spiders obey the robots.txt convention. From my own count, I estimate that about 30 percent of all sites currently block all access by spiders. This loses such a site a lot of visitors, because about 35 percent of the visitors to an average-sized site arrive via a search engine. They may decide to bookmark the site and become regular visitors -- but that only works if you are listed!
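
To make the fetching problems concrete, here is a minimal sketch, in Python (a language this article does not otherwise use), of a more polite fetcher: it honours robots.txt, pauses between requests to the same host instead of rapid-firing, and uses a conditional request so that an unchanged page is not downloaded again. The user-agent name and the ten-second delay are illustrative assumptions, not values taken from any existing spider.

    # Sketch of a polite page fetcher: honours robots.txt, pauses between
    # requests to the same host, and re-fetches only changed pages.
    # The user-agent name and the 10-second delay are arbitrary examples.
    import time
    import urllib.error
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urlsplit

    AGENT = "my-personal-spider/0.1"   # hypothetical spider name
    DELAY = 10                         # seconds between hits on one host

    last_hit = {}                      # host -> time of the previous request

    def allowed(url):
        """Ask the site's robots.txt whether we may fetch this page."""
        parts = urlsplit(url)
        robots = urllib.robotparser.RobotFileParser()
        robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        robots.read()
        return robots.can_fetch(AGENT, url)

    def fetch(url, last_modified=None):
        """Fetch one page politely; return None if blocked or unchanged."""
        if not allowed(url):
            return None                          # the site blocks spiders

        host = urlsplit(url).netloc
        wait = DELAY - (time.time() - last_hit.get(host, 0))
        if wait > 0:
            time.sleep(wait)                     # no rapid-firing on one site

        request = urllib.request.Request(url, headers={"User-Agent": AGENT})
        if last_modified:                        # conditional re-fetch
            request.add_header("If-Modified-Since", last_modified)

        try:
            with urllib.request.urlopen(request) as reply:
                last_hit[host] = time.time()
                return reply.read()
        except urllib.error.HTTPError as error:
            last_hit[host] = time.time()
            if error.code == 304:                # unchanged since the last visit
                return None
            raise

The per-host delay is what separates a polite spider from a rapid-firing one, and the conditional request addresses the freshness problem mentioned above: a page that has not changed costs the site almost nothing to report as unchanged.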

In summary, there are many reasons for the very bad hit-rate on pages. An investigation by Search Engine Watch shows that even the largest engines of today get their hands on only 8 to 27 percent of the existing pages, and this number is decreasing.

Spider Functions

Each spider has to develop a page-fetching mechanism before it can concentrate on how to work with the raw data it retrieves. Ignoring page-fetching for the moment: what kinds of data does the spider supply to the `user', and how does the searching work?
  • Some spiders build indexes based on HTML fetched over HTTP alone. Other spiders also supply search facilities for other document formats (XML, PDF, doc), other protocols (FTP, Usenet), or host-names. Some experimental indexing of images is even available, for example from AltaVista.

  • A number of spiders do a blunt search on keywords, possibly with some boolean algebra (which few people understand; see the inverted-index sketch after this list). Others try to build artificial intelligence (or fuzzy search, or whatever you want to call it) into their indexing.

  • Some know about synonyms, and include those in their search. Unfortunately, this is usually only implemented for English.

  • The ability to specify the language a page is written in is common; however, you often have no way to restrict your search to the languages you know. The meaning of a word can differ between two languages, so even if you do speak both, you need to be able to restrict your search to one of them.

  • Most spiders show results by displaying a part of the page which contains the keyword. In some cases the first few lines are shown; other spiders show the actual place where the word was found. When the page contains the meta-tag `description', that is usually displayed instead, which is often a better representation of the content of a document than one paragraph is (see the meta-tag sketch after this list).
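
To make the `blunt search on keywords with some boolean algebra' concrete, here is a minimal sketch, again in Python, of an inverted index with AND and OR over the result sets. The three documents are invented examples, not real pages.

    # Sketch of a keyword index with boolean AND/OR over result sets.
    # The documents are invented examples.
    documents = {
        "page1.html": "perl modules for building a search engine",
        "page2.html": "search the web with a spider",
        "page3.html": "cooking recipes and menus",
    }

    # Inverted index: word -> set of pages containing that word.
    index = {}
    for page, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(page)

    def search_and(*words):
        """Pages containing all of the given words (boolean AND)."""
        found = [index.get(word.lower(), set()) for word in words]
        return set.intersection(*found) if found else set()

    def search_or(*words):
        """Pages containing at least one of the given words (boolean OR)."""
        return set().union(*(index.get(word.lower(), set()) for word in words))

    print(search_and("search", "spider"))   # only page2.html
    print(search_or("perl", "recipes"))     # page1.html and page3.html

A real engine adds ranking, stemming, and phrase searches on top of this, but the boolean core is essentially this kind of set arithmetic.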
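
The value of the `description' meta-tag can also be shown concretely. The following sketch pulls that tag out of a page with Python's standard html.parser module; the sample page is an invented one.

    # Sketch: prefer the `description' meta-tag over the raw page text
    # when presenting a search result. The sample HTML is invented.
    from html.parser import HTMLParser

    class DescriptionFinder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.description = None

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and (attrs.get("name") or "").lower() == "description":
                self.description = attrs.get("content")

    sample = """<html><head>
    <meta name="description" content="A short, author-written summary.">
    </head><body><p>First paragraph, possibly misleading.</p></body></html>"""

    finder = DescriptionFinder()
    finder.feed(sample)
    print(finder.description or "fall back to a paragraph from the page")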

Many variations on many themes. But... I never find the answers to my questions quickly. Why?

Blocked Development

Many bright ideas have been implemented in search engines. Some spiders have a good collection of such functions (as described in the previous section) and contain a relatively good set of data to apply them to. At the same time, a lot of features are left out because of a lack of time and financial means, or because of intellectual-property problems. Many promising ideas relating to small parts of the search process are never implemented, because one either builds a complete search-engine -- from page-fetch to display -- or has nothing at all.

If we are able to remove some of the blocks to implementing new ideas, we will see better behaviour, and hence a more valuable World Wide Web.
