Are we giving up on real spiders? Should we instead focus on the
quality of `real' engines by solving their problems directly?
One of the main reasons why the `real' spiders cannot fulfill
our need to discover information is that their development is
slowed down by their current structure.
Everyone who wants to experiment with improving them has to re-implement
all parts of the system -- each time. Bright new
ideas cannot be tried out unless a lot of money is invested, because
building a useful search engine takes a lot of effort and time.
When `real' spiders are improved,
both meta-spiders and specialized spiders become superfluous. When
implementation is easier, more variations can be tested and
"my personal search engine" is born.
Fetching the pages
Severe problems exist with the way spiders obtain
their raw data. I will mention some of the worst here:
- Each search-engine needs to retrieve the contents of pages
held on web sites on a regular basis to build its search tables.
Engines which span the whole WWW take more and more time to visit
all the pages, to check their existence and to index their content.
As mentioned earlier, AltaVista needs about seven weeks to
scan the Net once. Some pages, however, change hourly. Quite some delay!
-
Each spider uses its own method to fetch pages. The first versions
are typically rapid-firing
(a rapid-firing spider fetches pages site-by-site; the site it
is working on is fully occupied serving that spider, hence not able to
serve `normal' visitors), which is easier to implement but
destructive for the visited sites, blocking all access to them for real
users of the site. The HTTP/1.1 protocol
performs much better than its 1.0 predecessor, but seems too
complicated for some implementors to utilise successfully.
A more polite approach is sketched after this list.
-
Some producers of server software (no names here) encourage web-site
designers
to put all of their pages into a private database. Access to these pages
demands a lot from the underlying (operating?) system. Such systems
are heavily loaded even by a few normal users, so
any superfluous access should be avoided to keep them stable.
-
There are many spiders.
The Big Search-Engine Index
lists 420 search-engines.
Fortunately, not all of those spiders cover the whole Internet; some are
restricted to a country, a language, or a subject.
Still, no-one is happy when all of these spiders extract every page from
their site over and over again. It can seem that only spiders are
interested in your content.
-
A large number of interesting sites require registration. Some even
require payment. So these sites are not indexed, even though the
person who is looking for facts might be willing to register or pay.
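To make the contrast with rapid-firing concrete, here is a minimal sketch
of a polite fetcher: it reuses one persistent HTTP/1.1 connection per host
and pauses between requests, so the site can keep serving its normal
visitors. The host name, user-agent string, and delay are illustrative
assumptions, not values taken from the text.

    import http.client
    import time

    def polite_fetch(host, paths, delay_seconds=5):
        """Fetch several paths from one host over a single persistent connection."""
        conn = http.client.HTTPConnection(host)   # http.client speaks HTTP/1.1
        pages = {}
        try:
            for path in paths:
                conn.request("GET", path,
                             headers={"User-Agent": "example-spider/0.1"})
                response = conn.getresponse()
                pages[path] = response.read()      # read the body before reusing the connection
                time.sleep(delay_seconds)          # pause so the server is not monopolised
        finally:
            conn.close()
        return pages

    # Hypothetical use:
    # pages = polite_fetch("www.example.org", ["/", "/about.html"])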
Badly implemented spiders and exhaustive database retrievals have forced
many sites to block access to every spider. All spiders obey the
robots.txt
convention. From my own count I estimate that about 30 percent of
all sites currently block all access by spiders. This loses such a site
a lot of visitors, because about 35 percent of the people
find one's -- average-sized -- site via an engine. They may decide
to bookmark your site and
become regular visitors. This only works if you are listed!
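For completeness, here is a minimal sketch of how a spider can honour the
robots.txt convention before fetching a page, using Python's standard
urllib.robotparser module. The URL and user-agent name are hypothetical.

    from urllib.parse import urlsplit, urlunsplit
    from urllib.robotparser import RobotFileParser

    def allowed_to_fetch(page_url, user_agent="example-spider"):
        """Return True if the site's robots.txt permits this user-agent to fetch the page."""
        parts = urlsplit(page_url)
        robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
        parser = RobotFileParser()
        parser.set_url(robots_url)
        parser.read()                              # download and parse robots.txt
        return parser.can_fetch(user_agent, page_url)

    # Hypothetical use:
    # if allowed_to_fetch("http://www.example.org/articles/index.html"):
    #     ...fetch and index the page...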
In summary, there are many reasons for the very bad hit-rate on
pages. An investigation by
Search-Engine Watch
shows that even the largest engines of today get their hands on only 8
to 27 percent of the existing pages. And this number is decreasing.
Spider Functions
Each spider has to develop a page-fetching mechanism before it can
concentrate
on how to work with the raw data it retrieves. Ignoring page-fetching
for the moment: what kinds of data does the spider supply
to the `user', and how does the searching work?
- Some spiders build indexes based on HTML fetched over HTTP alone. Other
spiders also supply search facilities for other document formats (XML, PDF, doc),
other protocols (FTP, Usenet), or host-names.
Even some experimental indexing of images is available, for example from
AltaVista.
- A number of spiders do a blunt search on keywords, possibly with some
boolean
algebra (which few people understand); a minimal sketch of such a search
follows this list. Others try to build artificial
intelligence (or fuzzy search, or whatever you want to call it) into
their indexing.
- Some know about synonyms, and include those in their search. Unfortunately
this is usually only implemented in English.
- The ability to specify the language a page is written in is common, but
often you have no way to restrict your search to the languages you know.
The meaning of a word can differ between two languages: even if you do
speak both languages, you need to be able to restrict the search to one of them.
- Most spiders show results by displaying a part of the page which
contains the keyword.
In some cases the first few lines are shown; other spiders show the actual
place where the word was found. When the
page contains the meta-tag `description', that is usually displayed, which
is often a better representation of the content of a document than one
paragraph is. A sketch of extracting this tag also follows the list.
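As promised above, here is a minimal sketch of the blunt keyword search
most engines offer: an inverted index from words to page identifiers,
queried with a boolean AND over all query terms. The sample documents are
invented for illustration.

    def build_index(documents):
        """documents: page-id -> text. Returns an inverted index: word -> set of page-ids."""
        index = {}
        for page_id, text in documents.items():
            for word in text.lower().split():
                index.setdefault(word, set()).add(page_id)
        return index

    def search_all_words(index, query):
        """Return the page-ids containing every word of the query (boolean AND)."""
        results = None
        for word in query.lower().split():
            pages = index.get(word, set())
            results = pages if results is None else results & pages
        return results or set()

    docs = {"a.html": "spiders index the web",
            "b.html": "the web changes hourly"}
    index = build_index(docs)
    print(search_all_words(index, "the web"))      # {'a.html', 'b.html'}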
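And a minimal sketch of preferring the `description' meta-tag over an
arbitrary paragraph when presenting a result, using Python's standard
html.parser module; the HTML snippet is invented.

    from html.parser import HTMLParser

    class DescriptionFinder(HTMLParser):
        """Collect the content of a <meta name="description" content="..."> tag."""
        def __init__(self):
            super().__init__()
            self.description = None

        def handle_starttag(self, tag, attrs):
            attributes = dict(attrs)
            if tag == "meta" and attributes.get("name", "").lower() == "description":
                self.description = attributes.get("content")

    page = '<html><head><meta name="description" content="Notes on web spiders."></head></html>'
    finder = DescriptionFinder()
    finder.feed(page)
    print(finder.description or "fall back to the first paragraph")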
Many variations on many themes. But... I never find the answer to my
questions fast. Why?
Blocked Development
Many bright ideas have been implemented in search engines.
Some spiders have a good collection of such functions (as described in
the previous section) and contain a relatively good set of data to apply
them to.
But at the same time a lot of features are
left out because of a lack of time and financial means, or because of
intellectual-property problems.
Many promising ideas relating to small parts of the search process are never
implemented, because one has to build a complete search-engine, from
page-fetch to display, or have nothing at all.
If we can remove some of the obstacles to implementing new ideas, we
will see better behavior, and hence a more valuable World Wide Web.