My Personal Search-Engine

Current Strategies

  Spiders certainly do not deliver good-quality results. Various smart algorithms have been developed to find the pages which best serve the user's query (you can find more about the current strategies at Search Engine Watch). However, everyone has experienced the huge list of unwanted hits returned in response to a simple question. Current text-retrieval algorithms are designed to work on large amounts of data, but not for the huge quantity of data on the Internet.

Real spiders

Going through all the references returned by a spider takes a lot of time. The user interface is usually a disaster; you can see that the programmers found "getting the thing to work" complicated enough, and they are happy that they can at least deliver some results.

Apart from the sheer number of references returned, many are outdated because the network is too large to get a timely overview. AltaVista currently can only visit a site once every seven weeks, and they retire sites from their database after a few failed fetch attempts, so it takes months before their database is correct.

The best part of current search engines is that they retrieve their data very quickly. The performance of computers is not the bottleneck.

The real spiders fetch all their pages themselves. They are built on expensive hardware (huge quantities of disk space and memory are required) and complicated software. The figure on the left shows the structure of the simplest of those spiders. Typically, this functionality is spread over a few computers to get the required performance.
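
To make the fetch-and-index cycle concrete, the sketch below (in Python, with an in-memory index, no politeness rules, and no retiring of dead pages, so purely illustrative) fetches a page, adds its words to the index, and queues the links it finds:

 # Minimal sketch of the fetch/index cycle of a `real' spider; a production
 # engine spreads this work over many machines and much larger storage.
 import re
 import urllib.request
 from collections import defaultdict
 from html.parser import HTMLParser
 from urllib.parse import urljoin

 class LinkParser(HTMLParser):
     """Collects the HREF targets of <A> tags."""
     def __init__(self):
         super().__init__()
         self.links = []
     def handle_starttag(self, tag, attrs):
         if tag == "a":
             self.links += [value for name, value in attrs
                            if name == "href" and value]

 def crawl(start_urls, max_pages=100):
     index = defaultdict(set)              # word -> URLs containing it
     queue, seen = list(start_urls), set()
     while queue and len(seen) < max_pages:
         url = queue.pop(0)
         if url in seen:
             continue
         seen.add(url)
         try:
             page = urllib.request.urlopen(url).read().decode("utf-8", "replace")
         except (OSError, ValueError):
             continue                      # dead or unusable link
         for word in re.findall(r"[a-z]+", page.lower()):
             index[word].add(url)          # grow the full-text index
         parser = LinkParser()
         parser.feed(page)
         queue += [urljoin(url, link) for link in parser.links]
     return index

A query interface then only has to look up the query words in the index and intersect the resulting sets of URLs.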

To solve the drawbacks of `real'-spiders, three variations were introduced:

specialization: improving results by reducing the number of searched sites;
meta-spiders: improving results by combining the results of a few spiders; or
meta-information: adding information to (HTML) pages which is used by spiders when building indexes.

Specialized spiders

Specialized spiders try to improve the results of a spider by manually adding a pre-selection of subject material. This results in something between a real spider and a manual index. As long as the list of related sites is not too large and is well maintained, this will work better than a normal indexing-site with categories. However, a lot of manual intervention is required.
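
In terms of the crawler sketch above, specialization boils down to filtering the queue against the hand-maintained list of related sites; a minimal illustration (the site names are invented for the example):

 from urllib.parse import urlparse

 # Hypothetical, manually maintained pre-selection of sites on one subject.
 ALLOWED_SITES = {"www.wildlife.example.nl", "zoo.example.edu"}

 def on_topic(url):
     """Keep only URLs hosted on one of the pre-selected sites."""
     return urlparse(url).hostname in ALLOWED_SITES

 # A specialized spider applies this filter before queueing a link, e.g.
 #     queue += [link for link in new_links if on_topic(link)]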

Meta-spiders

Meta-spiders call many `real'-spiders and combine their results, as shown in the figure on the left.

The meta-spider passes the user's request to many real spiders, which query their own databases and return appropriate results. The meta-spider counts how many real spiders return a certain page. Those pages are then ranked and returned to the user. This way, the collective result may be better than each separate result.
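
A sketch of the counting idea, assuming each real spider simply returns a list of URLs (real meta-spiders also take each engine's own ranking into account):

 from collections import Counter

 def merge_results(results_per_engine):
     """Rank pages by the number of engines that returned them."""
     votes = Counter()
     for results in results_per_engine:
         votes.update(set(results))        # one vote per engine per page
     return [url for url, count in votes.most_common()]

 # Made-up result lists from three engines:
 print(merge_results([
     ["a.html", "b.html", "c.html"],
     ["b.html", "d.html"],
     ["b.html", "a.html"],
 ]))                   # b.html first (3 votes), then a.html (2 votes)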

This procedure may result in less optimal coverage than that of a single search engine: the combined performance leads to the average result, while a single engine may be better balanced.

Meta-information

Steps are taken by a site's maintainer to add information to documents to facilitate the spiders. The HTML meta-tags keywords and description are examples of such extensions. In this situation, the publisher of the document is adding human intelligence.

An example of HTML with meta-information:

 <HTML>
 <HEAD>
  <TITLE>Wildlife in Holland</TITLE>
  <META NAME="keywords" CONTENT="bears, apes, rabbits">
  <META NAME="description" CONTENT="An overview of
      wildlife in The Netherlands.">
 </HEAD>
 <BODY>
  ....
 </BODY>
 </HTML>

The NAME parameters of the META tag are not officially standardized, but these two are commonly used and most spiders treat them specially. However, the method relies on the person developing the page supplying an accurate abstract of its content. These fields are often abused, for instance by sex sites or car-selling companies, to lure visitors.
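
As an illustration of the spider side (not any particular engine's code), the two fields can be picked up while parsing a page:

 from html.parser import HTMLParser

 class MetaExtractor(HTMLParser):
     """Collects the CONTENT of the keywords and description META-tags."""
     def __init__(self):
         super().__init__()
         self.meta = {}
     def handle_starttag(self, tag, attrs):
         if tag == "meta":
             attrs = dict(attrs)
             name = (attrs.get("name") or "").lower()
             if name in ("keywords", "description"):
                 self.meta[name] = attrs.get("content", "")

 extractor = MetaExtractor()
 extractor.feed(open("wildlife.html").read())   # hypothetical saved copy of the page above
 print(extractor.meta)      # {'keywords': 'bears, apes, rabbits', 'description': ...}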

The Web is also changing from pages written mainly in HTML towards pages published in many formats. More and more documents appear as Word, XML, PDF, and so on, and most of those formats offer no way to specify this kind of meta-information.

For all these reasons, this way of adding meta-information has only limited use.

The Librarian Way

A different method of providing extra information is to set up a library-like structure. A few projects try to find solutions for the massive amount of information this way, for instance Harvest, CHOICE, and DESIRE. These systems add BibTeX-like information to the pages. Whilst adding META-tags to HTML pages is relatively easy, the librarian way requires more effort to capture the content of the page. A site has to run a special service alongside its web-server that supplies the meta-information to the indexing-systems.
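
As a rough illustration of such a side service (the record layout, URL, and port are invented for the example; Harvest and the other projects define their own record formats and protocols), a site could hand BibTeX-like records to an indexing system like this:

 # Toy side service that returns the site's metadata records as plain text.
 from http.server import BaseHTTPRequestHandler, HTTPServer

 RECORDS = """\
 @webpage{wildlife,
   title    = {Wildlife in Holland},
   keywords = {bears, apes, rabbits},
   abstract = {An overview of wildlife in The Netherlands.},
   url      = {http://www.example.nl/wildlife.html}
 }
 """

 class MetaInfoHandler(BaseHTTPRequestHandler):
     def do_GET(self):
         # Every request simply returns all metadata records for the site.
         self.send_response(200)
         self.send_header("Content-Type", "text/plain")
         self.end_headers()
         self.wfile.write(RECORDS.encode())

 if __name__ == "__main__":
     HTTPServer(("", 8049), MetaInfoHandler).serve_forever()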

But is everyone willing to become a librarian? For institutes, universities, and the like, the effort to comply with such a system is not too high, but it is a lot to ask of small companies and private persons. A large part of the knowledge on the Internet will never be contained in this information structure.
