
Mailing List Archive: Lucene: Java-User

Directory flushing / commit / openIfChanged

 

 



Harald.Kirsch at raytion

Aug 6, 2012, 4:22 AM

Post #1 of 5
Directory flushing / commit / openIfChanged

Hi,

in my application I have to write tons of small documents to the index,
but with a twist. Many of the documents are actually aggregations of
pieces of information that appear in a data stream, usually close
together, but nevertheless merged with information for other documents.

When information a1 for my document A arrives, I create my A-object,
store it with index.addDocument() and forget about it. Later, when a2
arrives, I fetch A from the index, delete it from the index, update it,
and store its updated version. To fetch it from the index, I use a
reader retrieved with IndexReader.openIfChanged(). So for one piece of
information I have roughly the following sequence:

get searcher via IndexReader.openIfChanged()
find previously stored document, if any
if document already available {
    update document object
    index.deleteDocument(new Term(IDFIELD, id))
} else {
    create document object
}
index.addDocument()


The overall speed is not too bad, but I wonder if more is possible. I
changed RAMBufferSizeMB from the default 16 to 200 but saw no
improvement in speed.

I would think that keeping documents in RAM for some time, such that
many updates happen in RAM rather than being written to disk, would
improve the overall running time.

Any hints how to configure and use Lucene to improve the speed without
layering my own caching on top of it?

Harald.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


simon.willnauer at gmail

Aug 6, 2012, 4:55 AM

Post #2 of 5
Re: Directory flushing / commit / openIfChanged

hey harald,

When you re-open a reader from an IW (near-real-time), you flush
documents to disk each time you reopen the NRT reader. With a high
update rate that likely means you don't keep stuff in memory for very
long, so increasing the RAM buffer size won't help much.

What I would try to exploit is the fact that you only need to open a
new reader if the document (or its latest update) you are looking for
has not been flushed to disk yet, i.e. is not in the reader you already
have open. Lucene ships with some handy tools that help you implement
this. I'd likely use org.apache.lucene.search.NRTManager, which exposes
the methods of IW (update/add/delete) and returns a sequence ID that
you can later use to request an NRT reader.

Let's say you have document X indexed with sequence ID 15 and you now
want to update it. You look up the ID of doc X in a hashmap or
something like this to get the last changed sequence ID. Then you ask
the NRTManager to refresh the searcher it holds right now with
NRTManager#waitForGeneration(15). If the generation is already
refreshed it will return immediately; otherwise it will wait until it
is opened. Then you can just acquire a new searcher and check the
document.

something like this:

String id = doc.getId();
Long seqId = mapping.get(id);   // generation of the last change to this doc, if any

if (seqId != null) {
    // block until a searcher covering that generation is available
    nrtManager.waitForGeneration(seqId);
}

IndexSearcher s = nrtManager.acquire();
try {
    IndexReader reader = s.getIndexReader();
    // do something
} finally {
    nrtManager.release(s);
}

from time to time you can prune the mapping for sequence ids that are
already flushed.
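
The mapping and its pruning could look roughly like the sketch below. It is plain Java and does not touch Lucene at all: the class name GenerationMap and its methods are made up for illustration, and how you obtain the generation that is currently searchable is left to the caller.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: remembers, per document id, the generation of its latest change. */
final class GenerationMap {
    private final Map<String, Long> lastChange = new ConcurrentHashMap<>();

    /** Record the generation the writer returned for a change to this id. */
    void record(String id, long generation) {
        // Keep the highest generation seen for the id.
        lastChange.merge(id, generation, Long::max);
    }

    /** Generation to wait for before looking up id, or null if any searcher will do. */
    Long pending(String id) {
        return lastChange.get(id);
    }

    /** Drop entries that a searcher at visibleGeneration already covers. */
    int prune(long visibleGeneration) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = lastChange.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() <= visibleGeneration) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

With this, pending(id) plays the role of mapping.get(id) in the snippet above, and prune() is the "from time to time" cleanup.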

hope that helps

simon


Harald.Kirsch at raytion

Aug 7, 2012, 6:39 AM

Post #3 of 5
Re: Directory flushing / commit / openIfChanged

Hello Simon,

ok, I'll try this out. Just to be sure: I was after a way to update
documents before they are even written to disk, but this seems not to be
the Lucene way. From what you propose I understand that this approach
tries to keep documents from being written up to the time they actually
need to be changed.

If I need to keep some kind of map myself anyway, I wonder if I should
not just cache the documents themselves rather than only their sequence
IDs. Once they are "old" enough I migrate them into the index. For the
sequence IDs I would need a retirement strategy too.
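
One way such a document cache with a retirement strategy might be sketched is an access-ordered LinkedHashMap that hands its least recently used entry to the index when it grows past a fixed capacity. Everything here is an assumption for illustration: the class name MigratingCache, the capacity, and the migrate callback, which in a real application would wrap whatever call writes the document to the index.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical cache: recently touched documents stay in RAM, the eldest is migrated. */
final class MigratingCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;
    private final BiConsumer<K, V> migrate; // assumed hook that writes one entry to the index

    MigratingCache(int capacity, BiConsumer<K, V> migrate) {
        super(16, 0.75f, true); // access order: get/put make an entry the youngest
        this.capacity = capacity;
        this.migrate = migrate;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        if (size() <= capacity) {
            return false;
        }
        // Over capacity: write the least recently used entry out, then drop it.
        migrate.accept(eldest.getKey(), eldest.getValue());
        return true;
    }
}
```

On a cache miss the document may already have been migrated, so the caller would still have to look it up in the index before creating a fresh one.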

It was exactly this additional caching that I hoped to avoid. :-(

Harald.



--
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49-211-550266-0
Fax +49-211-550266-19
http://www.raytion.com



Harald.Kirsch at raytion

Aug 10, 2012, 3:10 AM

Post #4 of 5
Re: Directory flushing / commit / openIfChanged

Maybe I did something wrong, maybe it does indeed not help, but pushing
data into Lucene was not any faster than before.

I would like to remove my project-specific baggage and try to rephrase
my question by means of a simple example.

Suppose a Lucene document is used to count events of certain types. For
each type of event I have one document. Whenever a new event arrives, I
must read the respective document from the index, increment the count,
delete the document from the index and write the new one into the index.

In addition, assume the events follow a typical Zipf distribution, i.e.
a small number of event types occur rather frequently, while other
types of events may appear just once.

What is the most efficient sequence of Lucene operations for such a
scenario?

Harald.



lucene at mikemccandless

Aug 10, 2012, 10:09 AM

Post #5 of 5
Re: Directory flushing / commit / openIfChanged

This is a hard use case to do with pure Lucene ... NRTManager (plus
NRTCachingDirectory) is the closest you can get, but given the Zipf
distribution you'll be flushing/opening a new reader very frequently,
which leads to poor performance.

I think you have to have a cache above, which buffers up changes, and
then periodically writes them to Lucene.
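
For the event-counting example, such a cache could be sketched as below. This is only an illustration in plain Java: CountBuffer, maxPendingTypes and the sink callback are invented names, and the sink stands in for whatever code reads the stored document, adds the delta and writes it back to Lucene.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical buffer: coalesces counts in RAM, writes each pending type once per flush. */
final class CountBuffer {
    private final Map<String, Long> pending = new HashMap<>();
    private final int maxPendingTypes;
    private final BiConsumer<String, Long> sink; // assumed hook: apply (type, delta) to the index

    CountBuffer(int maxPendingTypes, BiConsumer<String, Long> sink) {
        this.maxPendingTypes = maxPendingTypes;
        this.sink = sink;
    }

    /** Count one event; flush when too many distinct types are buffered. */
    void onEvent(String type) {
        pending.merge(type, 1L, Long::sum);
        if (pending.size() >= maxPendingTypes) {
            flush();
        }
    }

    /** Hand every buffered delta to the sink and start over. */
    void flush() {
        pending.forEach(sink);
        pending.clear();
    }
}
```

With a Zipf distribution most events hit a few hot types, so many increments collapse into a single write per flush. A timer calling flush() would give the "periodically" part; the size threshold here is just the simplest trigger.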

You should be able to use the new IndexWriter.tryDeleteDocument to
speed things up.

See the PKLookupUpdatePerfTest example here:

http://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/perf/PKLookupUpdatePerfTest.java

The code is sort of messy (has lots of if statements to enable
different approaches) but it's basically doing the same thing that you
need to do.

Mike McCandless

http://blog.mikemccandless.com


