
Mailing List Archive: Lucene: Java-User

realtime indexing

 

 



john.wang at gmail

Nov 15, 2007, 10:44 PM

Post #1 of 4
realtime indexing

Hi:

It was interesting hearing about the need for real time indexing
at the BirdsOfAFeather round table. We also needed to solve this
problem. We took this approach:

A large disk index is built in batches: the indexer sleeps for some time
while requests queue up, then wakes up and indexes them.
While the disk indexer is sleeping, the same requests are also added to a
RAM index, and while the disk indexer is working, requests received are
added to another RAM index.

When the new disk index is published, the first RAM index points to the
secondary RAM index, and the secondary RAM index is flushed.
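A minimal sketch of the buffering scheme described above, not the poster's actual code. Plain Java lists stand in for the Lucene disk and RAM indexes, and every name here (BufferedIndexer, beginBatch, publish, searchable) is hypothetical. It also assumes "flushed" means the secondary RAM index is replaced by an empty one after the swap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: lists of strings play the role of the indexes.
final class BufferedIndexer {
    private final List<String> disk = new ArrayList<>();   // large disk index
    private List<String> primaryRam = new ArrayList<>();   // filled while the disk indexer sleeps
    private List<String> secondaryRam = new ArrayList<>(); // filled while the disk indexer works
    private List<String> queue = new ArrayList<>();        // requests waiting for the next batch
    private List<String> batch = null;                      // batch in flight, null while sleeping

    // Every request is queued for the disk index and also added to a RAM
    // index so that it is searchable immediately.
    synchronized void add(String doc) {
        queue.add(doc);
        (batch == null ? primaryRam : secondaryRam).add(doc);
    }

    // The disk indexer wakes up and takes everything queued so far.
    synchronized void beginBatch() {
        batch = queue;
        queue = new ArrayList<>();
    }

    // The new disk index is published: the first RAM index now points to
    // the secondary one, and the secondary one starts over empty.
    synchronized void publish() {
        if (batch == null) {
            throw new IllegalStateException("no batch in progress");
        }
        disk.addAll(batch);
        batch = null;
        primaryRam = secondaryRam;
        secondaryRam = new ArrayList<>();
    }

    // What a search across disk plus both RAM indexes would see.
    synchronized List<String> searchable() {
        List<String> all = new ArrayList<>(disk);
        all.addAll(primaryRam);
        all.addAll(secondaryRam);
        return all;
    }
}
```

In this sketch a document is visible exactly once at every point of the cycle: it sits in a RAM index until the batch containing it is published, and the RAM index that held it is dropped at that moment.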

We keep one IndexReader open for the disk index, and create new
IndexReaders for the RAM indexes per request (this seems to be OK
because the RAM indexes are small).

We use MultiSearcher across these readers.

Duplicates are also handled with our scheme.

I am curious to see if anyone else is trying this. It would be
interesting to hear comments from the experts.

Thanks

-John

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


ab at taktik

Nov 16, 2007, 2:59 AM

Post #2 of 4
Re: realtime indexing

Hi,

I'm trying to implement a similar solution.


Could you be more precise on how you handle duplicates, as well as
document deletion?


Thx,


Antoine



kay.roepke at epublica

Nov 16, 2007, 3:42 AM

Post #3 of 4
Re: realtime indexing

On Nov 16, 2007, at 11:59 AM, Antoine Baudoux wrote:

> I'm trying to implement a similar solution.
>
>
> Could you be more precise on how you handle duplicates, as well as
> document deletion?

The key probably is (it was for us, anyway) that you have a fast way
of determining whether or not a given document is in an index.
We use (and John et al, too, I suppose) the unique id (!= doc id) each
document has for that purpose. The basic idea for that should be in
the archives.

So, back to the question:
By definition anything in the RAM index is newer than anything on
disk, so documents found in the RAM index should supersede docs from
disk when they have the same unique id (user id, primary key, whatever).
When you have the hits of the query you can easily determine duplicate
primary keys, and for those you look up from which index they came (by
asking an enhanced MultiReader that knows the indices and their doc id
ranges). The last operation obviously has to be very fast, thus we use
our custom id => docid mapping mechanism (and I think John is using
his own, too).
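A rough sketch of the duplicate handling described above, under stated assumptions. It does not use Lucene classes: UidHit and DuplicateFilter are hypothetical names, each hit is assumed to already carry its unique id, and the sub-indexes are assumed to occupy strictly ascending, non-empty doc id ranges with the RAM indexes placed after the disk index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical hit: a global doc id plus the document's unique id (uid).
final class UidHit {
    final int docId;
    final int uid;

    UidHit(int docId, int uid) {
        this.docId = docId;
        this.uid = uid;
    }
}

final class DuplicateFilter {
    private final int[] starts;      // starts[i] = first global doc id of sub-index i
    private final int firstRamIndex; // sub-indexes at or after this position are RAM indexes

    DuplicateFilter(int[] starts, int firstRamIndex) {
        this.starts = starts;
        this.firstRamIndex = firstRamIndex;
    }

    // Find the sub-index whose doc id range contains docId.
    int subIndexOf(int docId) {
        int pos = Arrays.binarySearch(starts, docId);
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // owning sub-index is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    boolean isFromRam(int docId) {
        return subIndexOf(docId) >= firstRamIndex;
    }

    // Drop disk hits whose uid also appears among the RAM hits, since
    // anything in RAM is newer than anything on disk.
    List<UidHit> dedupe(List<UidHit> hits) {
        Set<Integer> ramUids = new HashSet<>();
        for (UidHit h : hits) {
            if (isFromRam(h.docId)) {
                ramUids.add(h.uid);
            }
        }
        List<UidHit> result = new ArrayList<>();
        for (UidHit h : hits) {
            if (isFromRam(h.docId) || !ramUids.contains(h.uid)) {
                result.add(h);
            }
        }
        return result;
    }
}
```

For example, with starts = {0, 100} and firstRamIndex = 1, a disk hit (docId 5, uid 7) is dropped when a RAM hit (docId 101, uid 7) is present in the same result list.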

There are probably even more clever ways of doing this, but it should
give you an idea. :)

cheers,
-k



john.wang at gmail

Nov 16, 2007, 4:05 AM

Post #4 of 4
Re: realtime indexing

Thanks Kay.

I am doing exactly what you are saying.

Just to elaborate:

So whatever is submitted to the RAM index is always the latest. Any
deletes (an update is a delete + an add) submitted to any of the
RAM indexes are recorded with the uid (and discarded when the RAM index
is discarded).

That "delete list" is passed on to the searcher handling the disk
index. In the hitCollector, we do a quick lookup of the uid given a
docid, then check whether that uid is in the delete list and
discard the hit if it is. (In reality, you can have your own searcher
implementation and check before score is called, to avoid unnecessary
scoring computation if you expect the delete list to be large.)

For us, we've implemented a way (by hacking into the Lucene guts) to
look up a uid very fast (it amounts to an array lookup), and
then the check is just an integer hash lookup (our uid is an integer).

(I started a thread on the dev list on how to quickly look up the primary
id (uid) given a Lucene doc id.)
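The delete-list check described above could look roughly like the following. This is only an illustration: DeleteFilteringCollector is a hypothetical class that does not extend any Lucene type, the int[] array stands in for the fast docid-to-uid lookup mentioned above (whose real implementation is not shown in this thread), and the delete list is assumed to be a plain hash set of integer uids.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical collector for hits coming from the disk index.
final class DeleteFilteringCollector {
    private final int[] uidByDocId;         // array lookup: docid -> uid
    private final Set<Integer> deletedUids; // "delete list" recorded by the RAM indexes
    private final List<Integer> accepted = new ArrayList<>();

    DeleteFilteringCollector(int[] uidByDocId, Set<Integer> deletedUids) {
        this.uidByDocId = uidByDocId;
        this.deletedUids = deletedUids;
    }

    // Called once per matching disk document.
    void collect(int docId, float score) {
        int uid = uidByDocId[docId];      // array lookup
        if (deletedUids.contains(uid)) {  // integer hash lookup
            return;                       // deleted or superseded in RAM: discard
        }
        accepted.add(docId);
    }

    List<Integer> acceptedDocIds() {
        return accepted;
    }
}
```

In this sketch the check happens after scoring; doing it earlier, as suggested above for large delete lists, would need a custom searcher, which is not shown here.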


Hope this helps.

-John

