To be able to test my interface, I was looking for data. To my (not
very great) surprise, there is no testbed for new technology in this
field. If you want to do anything with web pages, for any research or
application, you have to implement all parts of a spider yourself. This
is very costly and time-consuming, and it needs improvement.
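To give an impression of what "all parts of a spider" involves, the
sketch below shows the bare minimum in Python: fetch a page, extract
its links, and queue them breadth-first. It is only an illustration;
the start URL, the page limit, and all names are assumptions, not part
of any existing system.

    # Minimal illustrative spider: fetch a page, extract its links, and
    # follow them breadth-first. The start URL and limits are hypothetical.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collect the href targets of all <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=10):
        """Breadth-first crawl from start_url, fetching at most max_pages pages."""
        seen = {start_url}
        queue = deque([start_url])
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                with urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue  # unreachable or broken page: skip it
            fetched += 1
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
            yield url, html  # a real spider would hand the page to an indexer here

    if __name__ == "__main__":
        for page_url, _content in crawl("http://example.com/"):
            print("fetched", page_url)

Even this toy version leaves out everything that makes a real spider
expensive: politeness delays, duplicate detection, error handling,
storage, and the indexing itself.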
There are three groups of indexing systems on the current Internet:
- Manually maintained indexes, where people add site descriptions and
  sites are categorized. Their usefulness is diminishing, as shown
  before. An example is Yahoo!.
- Distributed manual indexes, implemented like a library with many
  places where people add information to web pages to improve the
  retrievability of the information. Many of these systems exist in
  research projects, such as DESIRE and ROADS. They usually require a
  piece of extra software to be installed on each of the participating
  web sites.
- Fully automatic indexes are the most widespread. They index all
  pages (sometimes limited to a country, sometimes the whole Internet)
  without any manual intervention. An example is AltaVista.
Most people use the general, fully automatic engines. However, there
are about 480 of them, each of which tries to gather the information
on its own. This is one of the reasons why most Internet sites have
closed their doors to search engines: they do not want all those
engines slowing down access for their normal users.
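The usual way a site closes its doors is the Robots Exclusion Protocol:
a robots.txt file that well-behaved spiders consult before fetching
anything. A minimal sketch of that check, using Python's standard
urllib.robotparser; the site and the user agent name are hypothetical:

    # Check a site's robots.txt before fetching a page. The URL and the
    # user agent "MySpider/1.0" are made-up examples.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser("http://example.com/robots.txt")
    robots.read()  # download and parse the site's robots.txt

    if robots.can_fetch("MySpider/1.0", "http://example.com/some/page.html"):
        print("allowed to fetch")
    else:
        print("the site has closed this path to our spider")

A site that is crawled by hundreds of engines has a strong incentive to
disallow most of them in exactly this way.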
Mark A.C.J. Overmeer,
AT Computing bv, 1999.