My Personal Search-Engine

Search Engines
Back in 1994, when the World-Wide Web was just emerging, the first index
of sites in The Netherlands appeared. It was named the Dutch Home Page.
Of course, today there is a large number of such indexing sites, such as
Yahoo! -- at that time there were only 10 sites in the whole of the
Netherlands and they could easily be displayed on one small page!
Within a year, the amount of work to maintain the index overwhelmed this
initiative. Software was developed to ease the manual administrative task.
New sites were first registered by their web master, who submitted a form.
The site was then visited by an administrator, who checked it for
credibility and importance: personal home-pages were only included when
they contained information which was useful to people other than family
and friends. When these tests passed, the site was added to the database
from which the index pages were produced.

However, with the number of registered sites growing exponentially, adding
those hundreds of new sites by hand became harder and harder to manage.
It is not useful to put many thousands of sites on one page, so the sites
have to be classified into categories. As the number of sites within each
category grows, the category definitions have to become narrower to limit
the number of sites in each of them. This strategy worked for three years,
but the number of categories grows exponentially too. For an example of a
site that indexes sites by category see Yahoo!, the largest such site on
the Internet; it claims to have half a million sites registered in
twenty-five thousand categories. A single organization or company will fit
into a huge number of detailed categories, but is usually listed under a
small number of major subjects and will not be listed under all of its
other activities.
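The administrative workflow just described, and the way the categories
multiply as sites are added, can be pictured with a small sketch. The
Python below is purely illustrative: the class names, the review flag and
the limit of fifty sites per category are assumptions made for this paper,
not the software that actually ran the Dutch Home Page.

    # Illustrative sketch only: the names and the per-category limit are
    # invented for this example, not the real Dutch Home Page software.

    from dataclasses import dataclass, field

    @dataclass
    class Submission:
        url: str
        description: str
        category: str           # chosen by the web master on the form
        approved: bool = False   # set by an administrator after a visit

    @dataclass
    class Index:
        categories: dict = field(default_factory=dict)  # category -> URLs

        def review(self, sub: Submission, useful_beyond_family: bool) -> None:
            # The manual check: personal home-pages are only accepted when
            # they are useful to people other than family and friends.
            sub.approved = useful_beyond_family

        def publish(self, sub: Submission, max_per_category: int = 50) -> None:
            if not sub.approved:
                return
            sites = self.categories.setdefault(sub.category, [])
            sites.append(sub.url)
            # Once a category grows too large it has to be split into
            # narrower sub-categories, which is why the number of
            # categories grows almost as fast as the number of sites.
            if len(sites) > max_per_category:
                print(f"category '{sub.category}' should be split")

Even in this toy version the split warning fires as soon as a category
fills up, which is exactly the narrowing of categories described above.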
Categorization has its own problems, which are best illustrated by an
example. A university will be active in many specialized fields, but
is not listed under all of those hundreds of distinctive subjects. This
means that someone looking for information about, say, superconductivity,
will not find the university listed under that subject.
Specialized subject indices which are maintained by people `in the field'
are useful, so long as they are updated regularly (which is often not the
case) and well known to their target-audience.
As the amount of information on the WWW grew, textual search
systems were introduced. These search engines (also called
spiders or crawlers) do not try
to categorize sites, but use brute-force methods to scan for pages where
certain keywords can be found.
With search-engines,
the effort for the builders of the indexing site has shifted from
manual administration to writing smart software.
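To make the contrast with manual indexing concrete, here is a minimal
sketch of such a brute-force spider. It is an illustration under
assumptions, not the code of any particular search engine: the seed URL
passed to crawl() is a placeholder, and real crawlers add politeness
delays, robots.txt handling and ranking of the results.

    # Minimal brute-force spider, for illustration only.

    import re
    import urllib.request
    from collections import defaultdict
    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        """Collects the href attributes of all <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(value for name, value in attrs
                                  if name == "href" and value)

    def crawl(seed, max_pages=10):
        """Scan pages reachable from seed, build a keyword -> pages index."""
        index = defaultdict(set)
        queue, seen = [seed], set()
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                page = urllib.request.urlopen(url, timeout=5)
                html = page.read().decode("utf-8", errors="replace")
            except Exception:
                continue
            # Brute force: every word in the page (mark-up included)
            # becomes an entry in the index, pointing back at this URL.
            for word in re.findall(r"[a-z]{3,}", html.lower()):
                index[word].add(url)
            parser = LinkParser()
            parser.feed(html)
            queue.extend(link for link in parser.links
                         if link.startswith("http"))
        return index

Calling crawl('http://example.org/') would return a dictionary mapping
every word it encountered to the set of pages that word was found on;
nothing in this loop cares about categories or the importance of a page.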
Search-engines try to apply
natural language techniques, which can assist searchers in finding
knowledge which does not have the highest priority from the
site-publisher's point of view. A spider
will find this knowledge whereas an index will not.
On the other hand, sites that categorize have more control over the
quality of the information they show to their visitors, because of the
human intelligence used in their construction.
Spiders can be more useful to the searcher because they find items which
are not in the
site's description and hence transcend the restrictions imposed by an index.
However, a spider will return many results -- a clear case of
quantity rather than quality -- and the searcher is
subject to information overload.
Most people use search-engines nowadays.
I assert that the usefulness of general indexing sites has come to an end.
This paper suggests a layered architecture which improves the access of
search-engines to data on the Net. This will improve their quality, and
thereby benefit everyone who has an interest in the Internet.