
Mailing List Archive: kinosearch: discuss

large index size

 

 



haomself at gmail

Jul 25, 2008, 7:25 PM

Post #1 of 2 (3705 views)
large index size

After indexing some HTML files (4.7G), I got a _1.cfs file that is 8.4G. Is
this normal? I only modified the directory of the sample invindex.plx file
for my indexing, and I confirmed that the $invindexer->finish; line is in
there. (The files were much larger before indexing finished.)

Thanks for this great piece of work,

Hao


marvin at rectangular

Jul 26, 2008, 1:12 AM

Post #2 of 2 (3327 views)
Re: large index size

On Jul 25, 2008, at 7:25 PM, hao chen wrote:

> After indexing some html files (4.7G), I got a _1.cfs file that is
> 8.4G. Is this normal?

Probably. There are the index files used for lookup/scoring, the stored
fields returned when retrieving hits, and the data used by the
highlighter, which basically duplicates what's in the index files.

If you don't care about highlighting/excerpting and you just want to
fetch titles, set "stored" and "vectorized" to 0 for everything but
the "title" field and you'll cut down significantly on disk usage.
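
A minimal sketch of what that might look like, assuming the KinoSearch
0.1x InvIndexer API, where fields are declared with spec_field(). The
invindex path and the field names "title" and "bodytext" are
assumptions for illustration; use whatever your own script declares.

```perl
use strict;
use warnings;
use KinoSearch::InvIndexer;
use KinoSearch::Analysis::PolyAnalyzer;

my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

my $invindexer = KinoSearch::InvIndexer->new(
    invindex => '/path/to/invindex',    # hypothetical path
    create   => 1,
    analyzer => $analyzer,
);

# Keep the title retrievable so it can be shown in search results.
$invindexer->spec_field( name => 'title' );

# Still searchable, but neither stored for retrieval nor vectorized
# for the highlighter, which is where the disk savings come from.
$invindexer->spec_field(
    name       => 'bodytext',    # assumed field name
    stored     => 0,
    vectorized => 0,
);

# ... add documents here, then:
$invindexer->finish;
```

With these settings, hits can still match on "bodytext", but only the
title can be fetched back and excerpts are no longer available.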

> I only modified the directory of the sample invindex.plx file for my
> indexing

I strongly recommend using a real HTML parser rather than the cheesy
regex tag stripper in the sample app. It's only there because it's
easy to grok at a glance.
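
As one possible approach, here is a sketch that uses HTML::TreeBuilder
from CPAN to pull out the title and body text. The choice of module and
the helper name extract_title_and_text are assumptions; any real HTML
parser would serve.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns ( $title, $bodytext ) for one HTML page.
sub extract_title_and_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my $title_elem = $tree->look_down( _tag => 'title' );
    my $title      = $title_elem ? $title_elem->as_text : '';

    # Fall back to the whole tree if the page has no <body> element.
    my $body_elem = $tree->look_down( _tag => 'body' );
    my $bodytext  = $body_elem ? $body_elem->as_text : $tree->as_text;

    $tree->delete;    # release the parse tree explicitly
    return ( $title, $bodytext );
}
```

The two returned strings can then be fed to the indexer in place of the
output of the regex stripper.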

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch
