
erickerickson at gmail
Apr 29, 2007, 6:39 PM
Post #8 of 10
(1065 views)
Permalink
|
Really take a look at the thread I mentioned, as well as search the user list archives. There's more information than you knew existed <G>. My main thought is that I don't see any evidence that there's an actual problem. That is, what behavior of the simple FS based way of creating an index aren't you happy with? I can't emphasize enough that making things faster, when you *don't* have an observable performance problem is unwise. I'll concede that if you've had experience in a particular problem domain and face a similar problem, you probably can say at the outset that you need to build in certain efficiencies. But if you haven't, worrying about this kind of performance issue is almost certainly a waste of time.... So, again, what behavior are you actually seeing that's causing the problem? And why does it need to be faster? "Indexing huge amounts of data" is irrelevant if you only have to build the index once and change it yearly thereafter. Especially if it builds, say, overnight. If you have to build the index daily and it takes three days to run in the simple-minded way, it's another story. So why don't you try indexing, say, 1,000,000 of your documents the simple way, and 1,000,000 the complex way you started to and see whether 1> the complex way saves you any time. and 2> if it does, is it enough time to make the complexity worthwhile? Best Erick On 4/29/07, Chandan Tamrakar <chandan [at] ccnep> wrote: > > Thanks Erik , so FSDirectory seems better option than RAMDirectory ? Also > I > think O.S can cache files in which case FSDirectory may not be bad , your > thoughts ? > > -----Original Message----- > From: Erick Erickson [mailto:erickerickson [at] gmail] > Sent: Sunday, April 29, 2007 7:07 PM > To: java-user [at] lucene > Subject: Re: batch indexing > > As I understand it, FSDirectory *is* RAMdirectory, at least until > it flushes. There have been several discussions of this, > search the mail archive for things like MergeFactor, MaxBufferedDocs > and the like. You'll find quite a bit of information about how these > parameters interact. > > Particularly, see the thread titled > > "MergeFactor and MaxBufferedDocs value should ...?" > > I suspect, that if you're explicitly creating RAMDirectories and > merging them that your code is more complex than it needs > to be to no good purpose (of course, I've been wrong before, > once or twice at least <G>). > > If you really require parallel indexing, you might try running > FSDirectory based processes on several machines, then merging > the resulting indexes as a final step. How you tell each of these > processes which documents to index I leave as an exercise for > the reader..... > > Best > Erick > > > On 4/29/07, Chandan Tamrakar <chandan [at] ccnep> wrote: > > > > I am trying to index a huge documents on batches . Batch size is > > parameterized to the application say X docs , that means it will hold X > > no. > > of > > > > Docs in the RAM before I flush to file system using > > IndexWriter.addIndexes(Directory[]) method > > > > > > > > My question is : > > > > > > > > Do I need to set mergefactor ? , will it hold default mergefactor docs > in > > memory before it is written to disk as segment . > > > > (But my application will call indexwriter.addindexes function only after > X > > no of documents are in memory) > > > > > > > > If the index sizes are big , at some point of time there might be a out > of > > memory exceptions , ( yes I could check a memory before another > > ramdirectory > > is being created) But what would be the best solution ? Is FSDirectory > is > > better option than Ramdirectory for huge text indexing ? I have roughly > 50 > > GB of fulltext to index? > > > > > > > > > > > > Thks in advance. > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscribe [at] lucene > For additional commands, e-mail: java-user-help [at] lucene > >
|