My Personal Search-Engine

Search Engines

Back in 1994, when the World-Wide Web was just emerging, the first index of sites in the Netherlands appeared. It was named the Dutch Home Page. Of course, today there is a large number of such indexing sites, such as Yahoo!; at that time there were only ten sites in the whole of the Netherlands, and they could easily be displayed on one small page!

Within a year, the amount of work needed to maintain the index overwhelmed this initiative, and software was developed to ease the manual administrative task. New sites were first registered by their webmaster, who submitted a form. The site was then visited by an administrator, who checked it for credibility and importance: personal home pages were only included when they contained information useful to people other than family and friends. When a site passed these tests, it was added to the database from which the index pages were produced.
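
To make this workflow concrete, the following is a minimal sketch of such a submit-review-publish pipeline in Python. The Submission fields, the review rule, and all names are assumptions made for illustration; they are not the actual software described above.

    from dataclasses import dataclass

    @dataclass
    class Submission:
        url: str
        title: str
        is_personal_page: bool = False
        useful_beyond_family: bool = False   # the administrator's judgement
        approved: bool = False

    def review(submission):
        # The editorial rule described above: personal home pages are
        # only included when they are useful to a wider audience than
        # family and friends.
        if submission.is_personal_page and not submission.useful_beyond_family:
            return False
        return True

    database = []   # the database from which the index pages are produced

    def register(submission):
        # A webmaster submits a form; an administrator visits the site
        # and applies the tests; approved sites enter the database.
        if review(submission):
            submission.approved = True
            database.append(submission)

    register(Submission("http://www.example.nl/", "An example site"))
    print(len(database))   # 1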

However, with the number of registered sites growing exponentially, adding those hundreds of new sites by hand became harder and harder to manage. It is not useful to put many thousands of sites onto one page, so the sites have to be classified into categories. As the number of sites within each category grows, the category definitions have to become narrower to limit the number of sites in each of them. This strategy worked for three years, but the number of categories grows exponentially too.
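
The arithmetic behind this is simple, as the sketch below shows; the readability limit of forty sites per page and the yearly doubling are assumed numbers for illustration only.

    import math

    MAX_SITES_PER_PAGE = 40   # assumed readability limit per index page

    def categories_needed(total_sites):
        # Minimum number of leaf categories so that no single index
        # page lists more than MAX_SITES_PER_PAGE sites.
        return math.ceil(total_sites / MAX_SITES_PER_PAGE)

    # If the number of sites doubles each year, the number of
    # categories must double with it: the classification work is
    # never finished.
    for year, sites in enumerate([1000, 2000, 4000, 8000], start=1):
        print(year, sites, categories_needed(sites))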

For an example of a site that indexes sites by category, see Yahoo!, the largest such site on the Internet. It claims to have half a million sites registered in twenty-five thousand categories. A single organization or company could fit into a huge number of detailed categories, but is usually listed under only a small number of major subjects, and will not be listed under all of its other activities.

Categorization has its own problems, which are best illustrated by an example. A university will be active in many specialized fields, but is not listed under all of those hundreds of distinct subjects. This means that someone looking for information about, say, superconductors in the index ends up with companies specialized in that area, but not with a university which may be performing leading research in the same area.

Specialized subject indices maintained by people `in the field' are useful, as long as they are updated regularly (which is often not the case) and are well known to their target community. In such an index, a university department will get its place. However, for many subjects these indices are not well maintained, and they are often not easy to find. As the Internet grows, even these lists can become overwhelming. Additionally, knowing that a site related to a subject exists does not tell one what is actually present there.

As the amount of information on the WWW grew, textual search systems were introduced. These search engines (also called spiders or crawlers) do not try to categorize sites, but use brute-force methods to scan for pages where certain keywords can be found. With search engines, the effort for the builders of the indexing site has shifted from manual administration to writing smart software.
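
To make the brute-force keyword scanning concrete, here is a minimal sketch of such a keyword index in Python. The page texts, URLs, and function names are assumptions made for illustration; a real spider fetches the pages itself by following links and uses far more sophisticated software.

    import re
    from collections import defaultdict

    def build_index(pages):
        # Map each keyword to the set of pages on which it occurs.
        # `pages` maps URL -> page text.
        index = defaultdict(set)
        for url, text in pages.items():
            for word in re.findall(r"[a-z]+", text.lower()):
                index[word].add(url)
        return index

    def search(index, *keywords):
        # Return the pages that contain all of the given keywords.
        sets = [index.get(word.lower(), set()) for word in keywords]
        return set.intersection(*sets) if sets else set()

    pages = {
        "http://example.edu/physics": "leading research on superconductors",
        "http://example.com/sales": "we sell superconductors and magnets",
    }
    index = build_index(pages)
    print(search(index, "superconductors"))   # finds both pages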

Search engines try to apply natural-language techniques, which can assist searchers in finding knowledge that does not have the highest priority from the site publisher's point of view. A spider will find this knowledge, whereas an index will not. Indexing sites that categorize sites have more control over the quality of the information they show to their visitors, because of the human intelligence used in their construction. Spiders, however, can be more useful to the searcher because they find items which are not in the site's description, and hence transcend the restrictions imposed by an index.

However, a spider will return many results: a clear case of quantity rather than quality, and the searcher is subject to information overload.

Most people nowadays use search engines instead of indexes. This is not because the results are satisfying, but because indexing systems give even less useful results.

I assert that the usefulness of general indexing sites has come to an end. This paper suggests a layered architecture which improves the access of search engines to data on the Net. This will improve their quality, and thereby benefit everyone who has an interest in the Internet.
