A Comparison to HTDIG and a Response to Public Comments Made by HTDIG

Juggernautsearch (JS) 1.0

Technical Responses and Comparison

By: Donald T. Kasper

HyperProject Inc.

Email: kasper@1stconnect.com

26 October 2000

Response to HTDIG.ORG

HTDIG.ORG produces a small-scale search engine that originated at San Diego State University. Their web site hosts a forum in which a number of vicious comments have been made about Juggernautsearch. The obvious conflict of interest notwithstanding, we respond here to specific technical comments made by HTDIG developers in that forum:

"No boolean. Limited to two key words for searching. So of course it's fast". By: Aaron Turner. Response: Boolean search that returns pages omitting a keyword only works when the full document is available to search. Search engines extract only the top few keywords, so a request to exclude a word is no guarantee that the word is absent from a document. Although it is a mathematical novelty, its use is pointless. There is no limit to the number of keywords to search for in JS. This person obviously never used the program. Juggernautsearch is fast because of its engineering and design, not because of reduced capabilities.
"Only the search is available. The crawler is 100% proprietary and not for sale, not for download, not for rent. They sell the database." and "The tarball consists of 600k of sources (the searcher) and 55 Mb of undocumented cryptic files". By: Loic Dachary. (An HTDIG developer). (He then posted this on egroups.com). Response: It would help if people bothered to run a program before posting conclusions. The crawler is called the "pagerunner", which is provided in source. To load the search database, use the "librarian", which is also provided in source. To run queries, use the CGI scripts, which are also provided in source. (You start at index.htm, of course). If we build custom databases, they are either for sale or they are put in the public domain. That is our choice and (last I heard) part of our rights in a democracy to engage in commerce for profit. The 55 MB of "undocumented cryptic files" comprises the Juggernautsearch database. I cannot imagine why someone would document a database. Developers document source code, not databases. Programs read databases; users don't. Use the query engine to use the database and the other tools provided to maintain it. This 55 MB database is a sample that was submitted to CNET, and is available there for free.
"This implies that the code is GPL but not really GPL. You'd be amazed to seem how apparently well educated people believe they can tampler with the GPL to restrict it's commercial use." By: Loic Dachary. (An HTDIG developer). Response: The documentation clearly states that the product is GPL and that basically says it all. GPL is not shareware. It provides for personal, noncommercial use. The author retains all rights to the product. This is clear in the GPL license, which states that only changes made by users are in the public domain. As companies don't typically develop software to give it away, personal use is the clear result of this language in the license. GPL licensing is granted by Juggernautsearch, and can be revoked at any time. The product has never been submitted to GNU or any other organization under GPL to give it away. Oh, and yes, we are well educated.

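The point made above about excluding a keyword can be illustrated with a small sketch. It is written in Python for brevity and is not the Juggernautsearch source. The extraction rule (keep the three most frequent words), the function names, and the example URLs are all assumptions made for illustration.

```python
import re
from collections import Counter


def top_keywords(text, n=3):
    """Return the n most frequent words in a document (hypothetical rule)."""
    words = re.findall(r"[a-z]+", text.lower())
    return {word for word, _ in Counter(words).most_common(n)}


def search_excluding(index, include, exclude):
    """Return URLs whose stored keywords contain `include` but not `exclude`."""
    return sorted(
        url for url, keywords in index.items()
        if include in keywords and exclude not in keywords
    )


# Two made-up pages; only the first mentions "crawler", and only once.
documents = {
    "http://example.com/a": "perl perl perl search search engine engine crawler",
    "http://example.com/b": "perl perl search search index index",
}

# The index keeps only the top keywords, not the full text.
index = {url: top_keywords(text) for url, text in documents.items()}

# Ask for pages about "perl" that do not mention "crawler".
hits = search_excluding(index, include="perl", exclude="crawler")
print(hits)
```

Under these assumptions the first page is still returned, although its full text contains "crawler", because that word did not make it into the stored keywords. The exclusion can only be as reliable as the keyword list it is checked against.
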
A Comparison to HTDIG

The basic advantages of Juggernautsearch (JS) over HTDIG are:

HTDIG requires the program to be compiled, or a compiled version has to be downloaded. For most users, this is a complex process. JS uses Perl, which is not compiled by a user before being run. If changes are made to suit a particular installation with JS, nothing is ever recompiled.
The use of Perl makes JS cross-platform: its scripts never have to be recompiled or modified for a target system running Linux or the same version of Perl. There is no version control for JS, since there is only ONE baseline of code.
JS can build very large Internet indexes; HTDIG cannot. Databases with several hundred thousand indexes and URLs are trivial to produce and maintain with JS, while they are at the upper limit of HTDIG's capability. Using HTDIG's published example, 13,000 documents on HTDIG use roughly 156 MB, or about 12 KB per document. By comparison, JS takes 210 MB to store the data on 150,000 URLs (web pages), including their URL names, top keywords, and descriptions, or about 1.4 KB per URL. Downloading and copying documents for local storage is just not feasible over large networks; the philosophy of JS is to index the pages so that the user can bring them up in a browser if the full content is desired.
JS allows HTML-based or script-based search. HTDIG only allows HTML-based search. With JS, you can get the full search result and pipe it to another process or save it to file.
HTDIG returns the entire document, causing enormous storage overhead. JS determines the relevant keywords in documents and returns those.
JS accepts a URL file containing a list of URLs to begin searching from. HTDIG can only retrieve data starting at a particular, single URL.
HTDIG response time for database queries can be extraordinarily slow; times of 30 seconds to 5 minutes have been commonly reported by users. This problem is due, in part, to the processing that occurs upon querying, such as the sort processing HTSEARCH performs. By comparison, a typical JS search takes 3 seconds or less.
HTDIG can eliminate files with particular extensions, while JS can eliminate files with any substring, including extensions. JS can also eliminate undesired URLs. Some URLs, such as those of Yahoo, DoubleClick, and Amazon, can flood a web crawler with URLs that prevent it from ever searching other sites once one of those "super sites" is encountered in a web crawl. In addition, JS can check that a site, once indexed, is not searched again in the next web crawl iteration, eliminating the 70% or so of URL duplicates that are generated during web crawl activities.
The queue files generated by the JS web crawler are in a readable text format that can be trivially read by any program and saved in any user database desired.
In HTDIG, you have to recompile the program to get more than its default messaging. With Perl, if you run into problems with a script, you can use options to generate more messages about the scripts as they run.
Banner advertising is built into JS, but not HTDIG. The "Juggernaut Tech Support" graphic on the initial JS search page is, in fact, a banner. Banner use is not currently documented, but the full source is provided with JS.
HTDIG can generate segmentation faults, crashing the program. JS doesn't know what a segmentation fault is, since different libraries don't have to be matched and debugged for production use. JS uses only Perl libraries that are well tested for compatibility and reliability. We never debug Perl libraries.
HTDIG has a number of bugs relating to string pattern matching. JS uses Perl, a pattern matching language that makes pattern matching errors very unlikely. An error could occur in a JS Perl script if the wrong type of processing is requested, but it is very unlikely to be in the underlying pattern matching and processing code of Perl.
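
The crawl-queue behavior described in the list above (substring-based exclusion, skipping sites that are already indexed, and dropping duplicate URLs) can be sketched as follows. This is a minimal illustration in Python, not the pagerunner source. The function name `filter_queue`, the exclusion strings, and the example URLs are hypothetical.

```python
def filter_queue(candidates, excluded_substrings, already_indexed):
    """Return the URLs to crawl next, in order, with rejects and duplicates removed."""
    seen = set(already_indexed)  # URLs indexed in earlier iterations
    queue = []
    for url in candidates:
        url = url.strip()
        if not url or url in seen:
            continue  # blank line, duplicate, or already indexed
        if any(fragment in url for fragment in excluded_substrings):
            continue  # matches an unwanted substring (site name, extension, ...)
        seen.add(url)
        queue.append(url)
    return queue


# Candidate URLs as they might appear, one per line, in a text queue file.
candidates = [
    "http://example.org/index.html",
    "http://ads.doubleclick.net/banner?id=1",
    "http://example.org/logo.gif",
    "http://example.org/index.html",      # duplicate of the first entry
    "http://example.org/docs/faq.html",
    "http://example.net/",                # indexed in a previous iteration
]
excluded = ["doubleclick", ".gif"]        # any substring, not only extensions
indexed = {"http://example.net/"}

next_queue = filter_queue(candidates, excluded, indexed)
print("\n".join(next_queue))
```

Because the queue is plain text with one URL per line, the same filter could be applied to a seed file before the first crawl iteration as well as to the queue produced by each later one.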

Conclusions:

HTDIG is for sophisticated users or programmers to index a very small number of computer sites (typically 100 or less). Juggernautsearch is for administrators or end users to index any information of interest from the entire body of the Internet, as well as local sites and Intranets. HTDIG is a WAIS-style indexing program with the added ability to search on a network using the HTTP protocol. WAIS development was discontinued in 1995 when the next generation of search engines arrived. Juggernautsearch was one of the five original search engines that started the current generation of Internet search engine technology. While HTDIG is useful for limited research use, Juggernautsearch is readily useful in support of a portal site for public search engine access.

From the HTDIG FAQ list (www.htdig.org):

1.1. Can I search the Internet with HTDIG?

"No. HTDIG is a system for indexing and searching a small set of sites or Intranet. It is not meant to replace any of the many Internet-wide search engines."

1.2. Can I index the Internet with HTDIG?

"No, as above, HTDIG is not meant as an Internet-wide search engine."

Epitaph:

The generation of WAIS engines is over.

HTDIG has run out of steam. It's time to move on.

Response to SLASHDOT.ORG

Our responses to comments about Juggernautsearch on SLASHDOT are as follows:

"I'm obviously a bit biased, but there *are* strong, open-sourced search engines. Try ht//Dig for example". By: Geoff Hutchinson. Response: It's always nice for a member of HTDIG to admit he is biased when talking about a competitor.
"Well, if I could get in to download it... I'll give it a try, we've been using htdig here at UK for years, but it's not very useable for the size of index we are try to work with...takes basically all day to rebuild the index on a Xeon450 w/512M". By: John Soward. Response: You don't have to rebuild indexes in Juggernautsearch. After building the database, additions can be made at any time.
"Htdig is a GPL'd search engine that will crawl your site. It can go a certain depth, or start from any give page (like a site index page).". By: smoser. Response: Juggernautsearch can start searching from any given page as well.
"I typed in 'jj thompson' to see would it find my page about the legendary physicist (it's indexed by most engines). It didn't bother returning any matches". By: rde. Response: The sample database is only 55 MB. It's not a google.com terabyte database. Obviously, the difference between a demo database to show how the product works, and a live WWW database is lost on some people.
"I usually don't like to be a skeptic, but the whole thing just smells funny, especially when there are very few concrete details about the whole thing". By: dr. Response: The only thing we can provide beyond the source is our clothes, and you can't have those.

Intuitive Project Management Software™
HyperProject, Inc.
12356 Jolette Ave.
Granada Hills, CA 91344

Phone: 818-831-0404
Email: kasper@1stconnect.com
Web Site: www.hproject.com