
Mailing List Archive: Lucene: Java-User

batch indexing

 

 



halacsy.peter at axelero

Aug 6, 2002, 2:19 PM

Post #1 of 10 (1152 views)
batch indexing

Hello everybody,
there has been a lot of discussion about batch indexing. I've attached a BatchIndexWriter class that can speed up indexing. I haven't tested it (release early, release often).

Unfortunately one has to modify the Lucene code to use it: add two methods to IndexWriter.java

/** Sets the analyzer with which the text will be analyzed. */
public synchronized void setAnalyzer(Analyzer a) {
    this.analyzer = a;
}

/** Returns the analyzer that is used to analyze the text. */
public synchronized Analyzer getAnalyzer() {
    return analyzer;
}


Developers! Could you add these methods to CVS? They're very helpful if one wants to write a wrapper or decorator class.

peter
Attachments: BatchIndexWriter.java (3.27 KB)
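
For readers without the attachment, here is a minimal sketch of why a wrapper wants a getter on the writer it wraps. It is not the attached BatchIndexWriter and it does not use Lucene: `SimpleAnalyzer`, `SimpleWriter`, and `BufferingWriter` are hypothetical stand-ins invented for illustration. The only point is that the decorator can reuse the wrapped writer's analyzer through a read-only accessor, without needing a setter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an analyzer: turns text into tokens.
interface SimpleAnalyzer {
    List<String> tokenize(String text);
}

// Hypothetical stand-in for the wrapped ("real") writer.
class SimpleWriter {
    private final SimpleAnalyzer analyzer;
    private final List<List<String>> stored = new ArrayList<>();

    SimpleWriter(SimpleAnalyzer analyzer) {
        this.analyzer = analyzer;
    }

    // Read-only accessor: the piece a wrapper needs.
    SimpleAnalyzer getAnalyzer() {
        return analyzer;
    }

    // Accepts a whole batch of already analyzed documents in one call.
    void addAnalyzed(List<List<String>> docs) {
        stored.addAll(docs);
    }

    int size() {
        return stored.size();
    }
}

// Decorator: analyzes documents into an in-memory batch, then hands the
// batch to the real writer once batchSize documents have accumulated.
class BufferingWriter {
    private final SimpleWriter real;
    private final SimpleAnalyzer analyzer;
    private final int batchSize;
    private final List<List<String>> batch = new ArrayList<>();

    BufferingWriter(SimpleWriter real, int batchSize) {
        this.real = real;
        this.analyzer = real.getAnalyzer(); // reuse the wrapped writer's analyzer
        this.batchSize = batchSize;
    }

    void addDocument(String text) {
        batch.add(analyzer.tokenize(text));
        if (batch.size() >= batchSize) {
            flush();
        }
    }

    // Pushes any buffered documents to the real writer.
    void flush() {
        real.addAnalyzed(new ArrayList<>(batch));
        batch.clear();
    }
}
```

With a batch size of 2, the first `addDocument` call leaves the real writer empty, and the second call pushes both documents at once.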


cutting at lucene

Aug 6, 2002, 2:35 PM

Post #2 of 10 (1070 views)
Re: batch indexing [In reply to]

Halácsy Péter wrote:
> Unfortunately one has to modify lucene code to use it: add two methods to IndexWriter.java
> public synchronized void setAnalyzer(Analyzer a) {
> this.analyzer = a;
> }
> public synchronized Analyzer getAnalyzer() {
> return analyzer;
> }

I can see the case for getAnalyzer(), but setAnalyzer() could be dangerous.

In any case, all that you invoke in your code is getAnalyzer(), and,
actually, you could use null instead where you call it, since the only
IndexWriter method that uses the analyzer is addDocument, and you never
invoke addDocument() on "realWriter".

So no change to IndexWriter is required, but I would also not object to
the addition of a getAnalyzer() method, especially if someone can come
up with a use for it!

Doug


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe [at] jakarta>
For additional commands, e-mail: <mailto:lucene-user-help [at] jakarta>


halacsy.peter at axelero

Aug 7, 2002, 4:48 AM

Post #3 of 10 (1070 views)
RE: batch indexing [In reply to]

> -----Original Message-----
> From: Doug Cutting [mailto:cutting [at] lucene]
> Sent: Tuesday, August 06, 2002 11:36 PM
> To: Lucene Users List
> Subject: Re: batch indexing
>
>
> Halácsy Péter wrote:
> > Unfortunately one has to modify lucene code to use it: add two methods to IndexWriter.java
> > public synchronized void setAnalyzer(Analyzer a) {
> > this.analyzer = a;
> > }
> > public synchronized Analyzer getAnalyzer() {
> > return analyzer;
> > }
>
> I can see the case for getAnalyzer(), but setAnalyzer() could be dangerous.
>
> In any case, all that you invoke in your code is getAnalyzer(), and,
> actually, you could use null instead where you call it, since the only
> IndexWriter method that uses the analyzer is addDocument, and you never
> invoke addDocument() on "realWriter".

I think I have to use an analyzer to add documents to the RAM index, don't I?
Do you mean that I can write
m_ramWriter = new IndexWriter(m_ramDirectory, null, true);

instead of

m_ramWriter = new IndexWriter(m_ramDirectory, m_realWriter.getAnalyzer(), true); ?

peter






cutting at lucene

Aug 7, 2002, 9:29 AM

Post #4 of 10 (1066 views)
Re: batch indexing [In reply to]

Halácsy Péter wrote:
> I think I have to use an analyzer to add document to the ram, haven't I?

You're right. I misread the code. You do need getAnalyzer().

I just added this method to IndexWriter.

Sorry for the confusion.

Doug





halacsy.peter at axelero

Aug 7, 2002, 9:34 AM

Post #5 of 10 (1064 views)
RE: batch indexing [In reply to]

> -----Original Message-----
> From: Doug Cutting [mailto:cutting [at] lucene]
> Sent: Wednesday, August 07, 2002 6:29 PM
> To: Lucene Users List
> Subject: Re: batch indexing
>
>
> Halácsy Péter wrote:
> > I think I have to use an analyzer to add document to the ram, haven't I?
>
> You're right. I misread the code. You do need getAnalyzer().
>
> I just added this method to IndexWriter.
>
great! thanks



dmitrys at earthlink

Aug 8, 2002, 2:01 PM

Post #6 of 10 (1065 views)
RE: batch indexing [In reply to]

I was just thinking about doing something similar, but after looking at
your code I wondered: couldn't the same thing be done by adjusting the
mergeFactor on the existing IndexWriter? It already indexes n documents
in memory before writing a new disk segment. I just looked at it again,
but I can't tell without a detailed study whether the mergeFactor applies
only to merging from RAM to disk or to merging on-disk segments as
well. If it applies to both, perhaps we could add a separate field to
the IndexWriter so that the two values can differ? Am I missing
something?

Dmitry.
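
To make the question concrete, here is a toy model of the scenario described above, in which a single value controls both how many documents accumulate before the first segment is written and how many segments are merged at each later level. It is a sketch under that assumption, not Lucene's code: `MergeModel` and `simulate` are hypothetical names, and the rule "merge whenever the newest mergeFactor segments are the same size" is a simplification chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class MergeModel {
    // Returns the document counts of the segments that remain after
    // adding numDocs documents one at a time.
    static List<Integer> simulate(int numDocs, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        List<Integer> segments = new ArrayList<>();
        for (int i = 0; i < numDocs; i++) {
            segments.add(1); // each new document starts as a one-document segment
            // Keep merging while the newest mergeFactor segments are equal in size.
            while (tailIsUniform(segments, mergeFactor)) {
                int merged = 0;
                for (int k = 0; k < mergeFactor; k++) {
                    merged += segments.remove(segments.size() - 1);
                }
                segments.add(merged);
            }
        }
        return segments;
    }

    // True if at least mergeFactor segments exist and the last mergeFactor
    // of them all hold the same number of documents.
    private static boolean tailIsUniform(List<Integer> segments, int mergeFactor) {
        int n = segments.size();
        if (n < mergeFactor) {
            return false;
        }
        int size = segments.get(n - 1);
        for (int k = n - mergeFactor; k < n; k++) {
            if (segments.get(k) != size) {
                return false;
            }
        }
        return true;
    }
}
```

In this model, `simulate(25, 10)` leaves two 10-document segments plus five single-document segments, so the same value of 10 sets both the size of the first merged segment and the fan-in of every later merge. Separating those two roles would need a second parameter, which is what the post proposes.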



chandan at ccnep

Apr 29, 2007, 5:59 PM

Post #7 of 10 (1049 views)
RE: batch indexing [In reply to]

Thanks Erick, so FSDirectory seems a better option than RAMDirectory? Also, I
think the OS can cache files, in which case FSDirectory may not be bad. Your
thoughts?

-----Original Message-----
From: Erick Erickson [mailto:erickerickson [at] gmail]
Sent: Sunday, April 29, 2007 7:07 PM
To: java-user [at] lucene
Subject: Re: batch indexing

As I understand it, FSDirectory *is* RAMDirectory, at least until
it flushes. There have been several discussions of this,
search the mail archive for things like MergeFactor, MaxBufferedDocs
and the like. You'll find quite a bit of information about how these
parameters interact.

Particularly, see the thread titled

"MergeFactor and MaxBufferedDocs value should ...?"

I suspect that if you're explicitly creating RAMDirectories and
merging them that your code is more complex than it needs
to be to no good purpose (of course, I've been wrong before,
once or twice at least <G>).

If you really require parallel indexing, you might try running
FSDirectory based processes on several machines, then merging
the resulting indexes as a final step. How you tell each of these
processes which documents to index I leave as an exercise for
the reader.....

Best
Erick
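
One possible answer to the exercise of telling each process which documents to index is to assign documents by a stable hash of their identifier. The sketch below assumes every document has a string id and that the number of workers is fixed up front; `DocPartitioner`, `workerFor`, and `shardFor` are hypothetical names, and nothing here touches Lucene itself.

```java
import java.util.ArrayList;
import java.util.List;

class DocPartitioner {
    // Worker index in [0, workers) for a document id. String.hashCode() is
    // specified by the Java language, so the result is the same on every machine.
    static int workerFor(String docId, int workers) {
        if (workers <= 0) {
            throw new IllegalArgumentException("workers must be positive");
        }
        return Math.floorMod(docId.hashCode(), workers);
    }

    // The subset of docIds that the given worker should index.
    static List<String> shardFor(List<String> docIds, int worker, int workers) {
        List<String> mine = new ArrayList<>();
        for (String id : docIds) {
            if (workerFor(id, workers) == worker) {
                mine.add(id);
            }
        }
        return mine;
    }
}
```

Each process would be started with its own worker index and would skip every document that `workerFor` assigns elsewhere, so no coordination is needed while indexing; the per-process indexes are then merged as the final step.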


On 4/29/07, Chandan Tamrakar <chandan [at] ccnep> wrote:
>
> I am trying to index a huge number of documents in batches. The batch size
> is a parameter of the application, say X docs, which means it will hold X
> docs in RAM before I flush to the file system using the
> IndexWriter.addIndexes(Directory[]) method.
>
> My question is:
>
> Do I need to set mergeFactor? Will it hold the default mergeFactor number
> of docs in memory before they are written to disk as a segment?
> (My application will call IndexWriter.addIndexes only after X docs are in
> memory.)
>
> If the indexes are big, at some point there might be an out-of-memory
> exception (yes, I could check memory before another RAMDirectory is
> created). But what would be the best solution? Is FSDirectory a better
> option than RAMDirectory for indexing huge amounts of text? I have roughly
> 50 GB of full text to index.
>
> Thanks in advance.
>



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


erickerickson at gmail

Apr 29, 2007, 6:39 PM

Post #8 of 10 (1065 views)
Re: batch indexing [In reply to]

Really take a look at the thread I mentioned, as well as search
the user list archives. There's more information than you knew
existed <G>.

My main thought is that I don't see any evidence that there's an
actual problem. That is, what behavior of the simple FS based
way of creating an index aren't you happy with? I can't emphasize
enough that making things faster, when you *don't* have an
observable performance problem is unwise. I'll concede that
if you've had experience in a particular problem domain and face
a similar problem, you probably can say at the outset that you
need to build in certain efficiencies. But if you haven't, worrying
about this kind of performance issue is almost certainly a waste of
time....

So, again, what behavior are you actually seeing that's causing
the problem? And why does it need to be faster? "Indexing huge
amounts of data" is irrelevant if you only have to build the index
once and change it yearly thereafter. Especially if it builds, say,
overnight. If you have to build the index daily and it takes three days
to run in the simple-minded way, it's another story.

So why don't you try indexing, say, 1,000,000 of your documents
the simple way, and 1,000,000 the complex way you started with, and
see 1> whether the complex way saves you any time, and 2> if it
does, whether it saves enough time to make the complexity worthwhile?

Best
Erick
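
The comparison suggested above can be set up with a small timing harness. The following is a sketch only: `IndexTiming`, `timeMillis`, and `worthIt` are hypothetical names, the per-document indexing work is passed in as a callback rather than implemented here, and the threshold for "enough time" is an assumption the reader has to choose.

```java
import java.util.function.IntConsumer;

class IndexTiming {
    // Runs indexOneDoc for document numbers 0..docCount-1 and returns the
    // elapsed wall-clock time in milliseconds.
    static long timeMillis(IntConsumer indexOneDoc, int docCount) {
        long start = System.nanoTime();
        for (int i = 0; i < docCount; i++) {
            indexOneDoc.accept(i);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // True if the complex strategy saved at least minSavingFraction of the
    // simple strategy's time (for example 0.2 for a 20% saving).
    static boolean worthIt(long simpleMillis, long complexMillis, double minSavingFraction) {
        if (simpleMillis <= 0) {
            return false;
        }
        double saving = (simpleMillis - complexMillis) / (double) simpleMillis;
        return saving >= minSavingFraction;
    }
}
```

Both strategies should be timed on the same documents and the same machine, ideally more than once, since a single run can be skewed by disk caching and JVM warm-up.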



chrislin0426 at gmail

May 2, 2007, 2:22 AM

Post #9 of 10 (1032 views)
Re: batch indexing [In reply to]

Sorry, I have a question.
You say FSDirectory is RAMDirectory, at least until it flushes.

I don't understand what you mean. Could you please explain?

FSDirectory is stored in the filesystem, and RAMDirectory is stored in RAM.
Do the MergeFactor and MaxBufferedDocs settings limit and control the
maximum number of docs and the size of the temporary index during indexing?

If I am wrong, please tell me. Thank you.

=============================
Chris Lin
http://search20.portal20.com.tw
chrislin [at] pchome
Taipei , Taiwan.
-----------------------------------------------------------


2007/4/29, Erick Erickson <erickerickson [at] gmail>:
> As I understand it, FSDirectory *is* RAMdirectory, at least until
> it flushes. There have been several discussions of this,
> search the mail archive for things like MergeFactor, MaxBufferedDocs
> and the like. You'll find quite a bit of information about how these
> parameters interact.



erickerickson at gmail

May 2, 2007, 5:40 AM

Post #10 of 10 (1031 views)
Re: batch indexing [In reply to]

For some intermediate period of time when indexing, the documents are
buffered in RAM. There is a complex interplay between several of the
parameters to an IndexWriter that govern how many documents are
indexed in RAM before being flushed to the FSDirectory.

Of course, if you specify a RAMDirectory, it's never written to disk.
But if you specify an FSDirectory, as you write the index some
number of documents are indexed in RAM, then flushed to
disk.

For more detailed information, search the archives and particularly
look at the thread titled

"MergeFactor and MaxBufferedDocs value should ...?"


Best
Erick
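
The buffering described above can be pictured with a deliberately simplified model: documents accumulate in memory and are written out as one segment once a limit is reached. This is a sketch, not Lucene's implementation. `BufferedFlushModel` is a hypothetical class, its `maxBufferedDocs` field only borrows the name of the setting mentioned in the thread, and the real interplay between the IndexWriter parameters is more involved than a single counter.

```java
class BufferedFlushModel {
    private final int maxBufferedDocs;
    private int buffered = 0;        // documents currently held in RAM
    private int flushedSegments = 0; // segments written to disk so far
    private int docsOnDisk = 0;      // documents contained in those segments

    BufferedFlushModel(int maxBufferedDocs) {
        if (maxBufferedDocs < 1) {
            throw new IllegalArgumentException("maxBufferedDocs must be >= 1");
        }
        this.maxBufferedDocs = maxBufferedDocs;
    }

    // Buffers one document and flushes when the buffer is full.
    void addDocument() {
        buffered++;
        if (buffered >= maxBufferedDocs) {
            flush();
        }
    }

    // Writes the buffered documents out as one new segment, if there are any.
    void flush() {
        if (buffered == 0) {
            return;
        }
        docsOnDisk += buffered;
        flushedSegments++;
        buffered = 0;
    }

    // Closing the writer flushes whatever is still in RAM.
    void close() {
        flush();
    }

    int buffered() { return buffered; }
    int flushedSegments() { return flushedSegments; }
    int docsOnDisk() { return docsOnDisk; }
}
```

With a limit of 10, adding 25 documents produces two flushes and leaves five documents in RAM until `close()` writes them out as a third segment.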

