A Search Interface for my Questions
The Selection Process

Limiters on Words

For each of the three word-groups, special limiters can be set. The limiters currently work only on whole word-groups, not on single words. This might change in future versions of this interface if experiments show that working with groups does not give sufficient detail.

The action of each limiter is shown as a histogram (the data displayed in the figure is only an example), with the possibility to adjust the lower and upper bound for your request, to choose linear or logarithmic scaling, and to choose a cumulative, spike, or Gaussian presentation. Each word-group has its own column (keywords, related words, and forbidden words, respectively). Selecting a box sets a limiter. A stopping hand signals that the limiter is temporarily disabled. The value in the box shows the effect of the limiter on the number of answers to the question.
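The histogram behaviour described above can be sketched as follows. This is only an illustration under assumptions: the function names and sample values are hypothetical, and the "effectiveness" shown in the box is assumed here to be the number of sites that remain within the chosen bounds.

```python
from collections import Counter
from itertools import accumulate


def histogram(values):
    """Map each distinct value to the number of sites that have that value."""
    return dict(sorted(Counter(values).items()))


def cumulative(hist):
    """Running total of site counts, in order of increasing value."""
    keys = list(hist)
    return dict(zip(keys, accumulate(hist[k] for k in keys)))


def effectiveness(values, lower, upper):
    """Number of sites whose value lies within [lower, upper] (assumed meaning)."""
    return sum(1 for v in values if lower <= v <= upper)


# Hypothetical hits-per-site values for six sites.
hits_per_site = [1, 1, 2, 5, 5, 40]

hist = histogram(hits_per_site)             # spike presentation
cum = cumulative(hist)                      # cumulative presentation
kept = effectiveness(hits_per_site, 2, 10)  # sites left after the limiter
```

Logarithmic scaling and the Gaussian presentation only change how the same counts are drawn, so they are left out of this sketch.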

At the moment, the following limiters are intended:

  • Hits per page per site. The average number of hits per page that is hit, in relation to the number of sites with hits. This visualizes the density of the hits with respect to sites.
  • Hits per site. The total number of hits for a site, in relation to the number of sites which have that many hits. With this limiter, you can find sites with a large amount of information about the subject, independent of the size of the site.
  • Pages hit per site. The total number of pages of a site which contain any of the words, in relation to the number of sites which have that many pages hit. This indicates sites which are likely to have a specialized section about the subject.
  • Pages hit percentage per site. The number of pages hit in a site as a percentage of the pages of the whole site, in relation to the number of sites which have that percentage. This limiter can be used to find sites specialized in the subject you are looking for.
  • Limiter on location. This is not a histogram, but a checklist which restricts the appearance of words to any combination of
    • the title of the page,
    • the meta-keyword line in HTML-pages,
    • the meta-description line in HTML-pages, and
    • the content of the page.
    This limiter is not yet implemented.
More than one limiter can be set at the same time, for any of the three groups of words.
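As a minimal sketch of how the four histogram limiters could be computed and combined, assume the search engine can report, per site, the total number of pages and the number of hits on each page that is hit. The `SiteStats` class, the `apply_limiters` function, and the sample sites are hypothetical names chosen for this illustration, not part of the interface.

```python
from dataclasses import dataclass, field


@dataclass
class SiteStats:
    total_pages: int                                   # pages in the whole site
    hits_by_page: dict = field(default_factory=dict)   # page -> hits (hit pages only)

    def hits_per_site(self):
        return sum(self.hits_by_page.values())

    def pages_hit(self):
        return len(self.hits_by_page)

    def hits_per_page(self):
        # Average over the pages that are hit, not over all pages of the site.
        return self.hits_per_site() / self.pages_hit() if self.hits_by_page else 0.0

    def pages_hit_percentage(self):
        return 100.0 * self.pages_hit() / self.total_pages if self.total_pages else 0.0


def apply_limiters(sites, limiters):
    """Keep the sites for which every limited metric lies within its bounds."""
    return {
        name
        for name, stats in sites.items()
        if all(lo <= getattr(stats, metric)() <= hi
               for metric, (lo, hi) in limiters.items())
    }


sites = {
    "a.example": SiteStats(100, {"/x": 3, "/y": 1}),
    "b.example": SiteStats(10, {f"/p{i}": 1 for i in range(8)}),
    "c.example": SiteStats(50),
}

# Two limiters set at the same time: specialized sites with at least 5 hits.
selected = apply_limiters(sites, {
    "pages_hit_percentage": (50, 100),
    "hits_per_site": (5, 1000),
})
```

In this example only `b.example` passes both limiters: 80% of its pages are hit and it has 8 hits in total, while `a.example` has only 2% of its pages hit.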

These limiters require more detailed information from the search engine than is currently available. The first implementation of the interface will not correct for overlapping hits: when two words from the same group occur on the same page, that page is counted twice. Hence, the histogram will not show the real distribution of suitable sites. This is certainly not optimal, considering that many words will overlap because they are related.

The reason not to implement the best solution is the exponential behavior of this data: the search engine has to recalculate all the possible combinations, and do this again for each word every time a word is added to, moved within, or deleted from the list of selected keywords. The intention is to have the spider produce a lot of suggestions for words, to pinpoint the question optimally. Exponential behavior would be destructive. By simply ignoring the overlaps, the situation changes to simple linear lookups. Experiments will show whether this simplification gives acceptable results.
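The effect of ignoring overlaps can be illustrated with a small sketch. It assumes, hypothetically, that the search engine can return the set of pages hit for each word; the function names and the sample data are made up for the illustration. The simplified count needs only one stored number per word, whereas the exact count needs the page sets themselves.

```python
def pages_hit_ignoring_overlap(pages_by_word, group):
    """Simplified count: one lookup per word, shared pages are counted repeatedly."""
    return sum(len(pages_by_word[word]) for word in group)


def pages_hit_exact(pages_by_word, group):
    """Exact count: every page is counted once, however many words it contains."""
    return len(set().union(*(pages_by_word[word] for word in group)))


# Hypothetical pages of one site, per word of the same word-group.
pages_by_word = {
    "spider": {"p1", "p2", "p3"},
    "crawler": {"p2", "p3", "p4"},
}
group = ["spider", "crawler"]

simplified = pages_hit_ignoring_overlap(pages_by_word, group)
exact = pages_hit_exact(pages_by_word, group)
overcount = simplified - exact
```

Here the two related words share pages p2 and p3, so the simplified count reports 6 pages where only 4 distinct pages are hit. With the simplified count, adding or removing a word changes the total by just that word's own count.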
