Juggernautsearch (JS) 1.0
Technical Responses and Comparison
By: Donald T. Kasper
HyperProject Inc.
Email: kasper@1stconnect.com
26 October 2000
Response to HTDIG.ORG
HTDIG.ORG produces a small-scale search engine that originated at San Diego State University. Their web
site hosts a forum in which a number of vicious comments have been made about Juggernautsearch. The obvious conflict of
interest notwithstanding, our responses to specific technical comments posted in that forum by HTDIG
developers follow:
- "No boolean. Limited to two key words for searching. So of course it's fast". By: Aaron Turner.
Response: A Boolean search that returns pages omitting a keyword only works when you have the full document to
search. Search engines extract only the top few keywords, so a request to exclude a word is no guarantee
that the word is absent from a document. It is a mathematical novelty, but its use is pointless. There is no limit to the
number of keywords to search for in JS. This person obviously never used the program. Juggernautsearch is fast
because of its engineering and design, not because its capabilities were reduced.
- "Only the search is available. The crawler is 100% proprietary and not for sale, not for download,
not for rent. They sell the database." and "The tarball consists of 600k of sources (the searcher) and
55 Mb of undocumented cryptic files". By: Loic Dachary. (An HTDIG developer). (He then posted this
on egroups.com). Response: It would help if people bothered to run a program before posting conclusions.
The crawler is called the "pagerunner", which is provided in source. To load the search database, use
the "librarian", which is also provided in source. To run queries, use the cgi scripts, which are also
provided in source. (You start at index.htm, of course.) If we build custom databases, they are either for sale
or put in the public domain. That is our choice, and (last I heard) engaging in commerce for profit is part of our
rights in a democracy. The 55 Mb "undocumented cryptic files" comprise the Juggernautsearch
database. I cannot imagine why someone would document a database. Developers document source code, not databases.
Programs read databases; users don't. Use the query engine to use the database, and the other tools provided to
maintain it. This 55 MB database is a sample that was submitted to CNET, and is available there for free.
- "This implies that the code is GPL but not really GPL. You'd be amazed to seem how apparently well
educated people believe they can tampler with the GPL to restrict it's commercial use." By: Loic Dachary.
(An HTDIG developer). Response: The documentation clearly states that the product is GPL and that basically
says it all. GPL is not shareware. It provides for personal, noncommercial use. The author retains all rights to
the product. This is clear in the GPL license which states that only changes made by users are in the public domain.
As companies don't typically develop software to give it away, personal use is the clear result of this language
in the license. GPL licensing is granted by Juggernautsearch, and can be revoked at any time. The product has never
been submitted to GNU or any other organization under GPL to give it away. Oh, and yes, we are well educated.
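The keyword-exclusion point in the first response above can be illustrated with a minimal sketch. It is written in Python purely for illustration (JS itself is Perl), and the page text, the keyword set, and both function names are hypothetical; none of it reflects the actual JS database format or query code. It only shows that a NOT-query answered from a top-keyword index can let through a page that does contain the excluded word.

```python
# Hypothetical data: the full text of a page, and the few top keywords
# an index might keep for it. This is NOT the real JS database format.
full_text = "perl search engine crawler with boolean query support"
top_keywords = {"perl", "search", "crawler"}  # assumption: only top terms are stored

def excluded_by_index(keywords, banned):
    """True if a NOT-query on `banned` would drop the page, judging by the index alone."""
    return banned in keywords

def word_in_document(text, word):
    """True if `word` really occurs in the full page text."""
    return word in text.split()

banned = "boolean"
# The index has no record of "boolean", so the page passes the NOT-query...
passes_not_query = not excluded_by_index(top_keywords, banned)
# ...even though the word is present in the document itself.
actually_contains = word_in_document(full_text, banned)
```

Here the page survives the exclusion while still containing the excluded word, which is the situation the response describes.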
A Comparison to HTDIG
The basic advantages of Juggernautsearch (JS) over HTDIG are:
- HTDIG requires the program to be compiled, or a compiled version has to be downloaded. For most users,
this is a complex process. JS uses Perl, which is not compiled by a user before being run. If changes are made
to suit a particular installation with JS, nothing is ever recompiled.
- The use of Perl makes JS cross-platform: its scripts never have to be recompiled or modified
for a target system, provided that system runs Linux or the same version of Perl. There is no version control for JS, since there is
only ONE baseline of code.
- JS can build very large Internet indexes; HTDIG cannot. Databases with several hundred thousand indexes
and URLs are trivial to produce and maintain with JS, while they are at the upper limit of the capability of
HTDIG. Using HTDIG's published example, 13,000 documents on HTDIG use roughly 156 MB, or about 12 kB per document.
By comparison, JS takes 210 MB to store the data on 150,000 URLs (web pages), about 1.4 kB per page, including their
URL names, top keywords, and descriptions.
Downloading and copying documents for local storage is just not feasible over large networks; the philosophy of
JS is to index the pages so that the user can bring them up in a browser if the full content is desired.
- JS allows HTML-based or script-based search. HTDIG only allows HTML-based search. With JS, you can get
the full search result and pipe it to another process or save it to file.
- HTDIG returns the entire document, causing enormous storage overhead. JS determines the relevant
keywords in documents and returns those.
- JS accepts a file containing a list of URLs to begin searching from. HTDIG can only retrieve data starting
at a particular, single URL.
- HTDIG response time for database queries can be extraordinarily slow; times of 30 seconds to 5 minutes
have been commonly reported by users. This problem is due, in part, to the processing that occurs at query time,
such as the sorting that HTSEARCH performs. By comparison, a typical JS search takes 3 seconds or less.
- HTDIG can eliminate files with particular extensions, while JS can eliminate files with any substring, including
extensions. JS can also eliminate undesired URLs. Some sites, such as Yahoo, DoubleClick, and Amazon,
can flood a web crawler with URLs that prevent it from ever searching other sites once one of those "super
sites" is encountered in a web crawl. In addition, JS can check that a site, once indexed, is
not searched again in the next web crawl iteration, eliminating the 70% or so of URL duplicates that are generated
during web crawl activities.
- The queue files generated by the JS web crawler are in a readable text format that can be trivially read by
any program and saved in any user database desired.
- To get more than the default messaging out of HTDIG, you have to recompile the program.
With Perl, if you run into problems with a script, you can use options to generate more messages about the scripts
as they run.
- Banner advertising is built into JS, but not HTDIG. The "Juggernaut Tech Support"
graphic on the initial JS search page is, in fact, a banner. Banner use is not currently documented, but the full
source is provided with JS.
- HTDIG can generate segmentation faults, crashing the program. JS doesn't know what a segmentation fault
is, since different libraries don't have to be matched and debugged for production use. JS uses only Perl libraries,
which are well tested for compatibility and reliability. We never debug Perl libraries.
- HTDIG has a number of bugs relating to string pattern matching. JS uses Perl, a pattern matching language
that makes pattern matching errors very unlikely. An error could still occur in a JS Perl script that requests
the wrong type of processing, but it is very unlikely to be in the underlying pattern matching and processing
code of Perl.
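The URL filtering and duplicate elimination described in the list above can be sketched as follows. This is an illustrative Python sketch, not the pagerunner's Perl code: the function name, the blocked-substring list, and the example URLs are all assumptions made for the example.

```python
def filter_queue(candidates, blocked_substrings, already_indexed):
    """Return the URLs worth crawling next, in first-seen order.

    A URL is dropped if it contains any blocked substring (an extension or
    an unwanted site name), if it was indexed in an earlier pass, or if it
    already appeared earlier in this batch.
    """
    seen = set(already_indexed)
    keep = []
    for url in candidates:
        if any(s in url for s in blocked_substrings):
            continue  # matches an unwanted substring such as ".gif"
        if url in seen:
            continue  # duplicate, or indexed in a previous iteration
        seen.add(url)
        keep.append(url)
    return keep

# Hypothetical crawl output containing a duplicate, an ad server,
# an image file, and a page that was indexed earlier.
candidates = [
    "http://example.com/a.htm",
    "http://ads.doubleclick.net/x",
    "http://example.com/a.htm",
    "http://example.com/pic.gif",
    "http://example.org/b.htm",
    "http://example.net/c.htm",
]
blocked = ["doubleclick", ".gif"]
indexed = ["http://example.org/b.htm"]
next_queue = filter_queue(candidates, blocked, indexed)
```

Because the blocked entries are plain substrings, the same list covers file extensions and whole sites, which is the distinction the comparison draws with extension-only filtering.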
Conclusions:
HTDIG is for sophisticated users or programmers who need to index a very small number of computer sites (typically
100 or fewer). Juggernautsearch is for administrators or end users who want to index any information of interest from
the entire body of the Internet, as well as local sites and Intranets. HTDIG is a WAIS-style indexing program with
the added ability to search on a network using the HTTP protocol. WAIS development was discontinued in 1995 when
the next generation of search engines arrived. Juggernautsearch was one of the five original search engines that
started the current generation of Internet search engine technology. While HTDIG is useful for limited research
use, Juggernautsearch is readily useful in support of a portal site for public search engine access.
From the HTDIG FAQ list (www.htdig.org):
1.1. Can I search the Internet with HTDIG?
"No. HTDIG is a system for indexing and searching a small set of sites or Intranet. In is not meant to
replace any of the many Internet-wide search engines."
1.2. Can I index the Internet with HTDIG?
"No, as above, HTDIG is not meant as an Internet-wide search engine."
Epitaph:
The generation of WAIS engines is over.
HTDIG has run out of steam. It's time to move on.
Response to SLASHDOT.ORG
Our responses to comments about Juggernautsearch on SLASHDOT are as follows:
- "I'm obviously a bit biased, but there *are* strong, open-sourced search engines. Try ht//Dig for example".
By: Geoff Hutchinson. Response: It's always nice for a member of HTDIG to admit he is biased when talking
about a competitor.
- "Well, if I could get in to download it... I'll give it a try, we've been using htdig here at UK for
years, but it's not very useable for the size of index we are try to work with...takes basically all day to rebuild
the index on a Xeon450 w/512M". By: John Soward. Response: You don't have to rebuild indexes in
Juggernautsearch. After building the database, additions can be made at any time.
- "Htdig is a GPL'd search engine that will crawl your site. It can go a certain depth, or start from
any give page (like a site index page).". By: smoser. Response: Juggernautsearch can start searching
from any given page as well.
- "I typed in 'jj thompson' to see would it find my page about the legendary physicist (it's indexed
by most engines). It didn't bother returning any matches". By: rde. Response: The sample database
is only 55 MB. It's not a google.com terabyte database. Obviously, the difference between a demo database that shows
how the product works and a live WWW database is lost on some people.
- "I usually don't like to be a skeptic, but the whole thing just smells funny, especially when there
are very few concrete details about the whole thing". By: dr. Response: The only thing we can
provide beyond the source is our clothes, and you can't have those.
Intuitive Project Management Software(TM)
HyperProject, Inc.
12356 Jolette Ave.
Granada Hills, CA 91344
Phone: 818-831-0404
Email: kasper@1stconnect.com
Web Site: www.hproject.com