
Mailing List Archive: Lucene: Java-Dev

A new Lucene Directory available

 

 



s.grinovero at sourcesense

Nov 14, 2009, 11:44 AM

Post #1 of 11 (460 views)
A new Lucene Directory available

Hello all,
I'm a Lucene user and fan, and I wanted to tell you that we have just
released a first technology preview of a distributed in-memory Directory
for Lucene.

The release announcement:
http://infinispan.blogspot.com/2009/11/second-release-candidate-for-400.html

From there you'll find links to the Wiki, to the sources, to the issue
tracker. A minimal demo is included with the sources.

This was developed together with Google Summer of Code student Lukasz
Moren, with much support from the Infinispan and Hibernate Search teams:
we store the index segments on Infinispan and use its atomic
distributed locks to implement a Lucene LockFactory.
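
To illustrate the locking idea, here is a minimal sketch and not the actual Infinispan code. It assumes that the clustered cache can be treated as a ConcurrentMap whose putIfAbsent is atomic across nodes; a plain ConcurrentHashMap stands in for it here. The class name MapBackedLock and the owner strings are hypothetical, and the method names only loosely mirror Lucene's Lock (obtain, release, isLocked).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical lock built on an atomic putIfAbsent; not the real Infinispan LockFactory. */
final class MapBackedLock {
    // Stand-in for a clustered cache: any ConcurrentMap with atomic putIfAbsent.
    private final ConcurrentMap<String, String> locks;
    private final String lockName;
    private final String owner;

    MapBackedLock(ConcurrentMap<String, String> locks, String lockName, String owner) {
        this.locks = locks;
        this.lockName = lockName;
        this.owner = owner;
    }

    /** Tries once to take the lock; succeeds only if no entry existed for this name. */
    boolean obtain() {
        return locks.putIfAbsent(lockName, owner) == null;
    }

    /** Releases the lock, but only if this owner is the one holding it. */
    void release() {
        locks.remove(lockName, owner);
    }

    /** Reports whether anybody currently holds the lock. */
    boolean isLocked() {
        return locks.containsKey(lockName);
    }
}

final class MapBackedLockDemo {
    public static void main(String[] args) {
        ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
        MapBackedLock nodeA = new MapBackedLock(cache, "write.lock", "node-A");
        MapBackedLock nodeB = new MapBackedLock(cache, "write.lock", "node-B");

        System.out.println(nodeA.obtain()); // true: node-A becomes the single writer
        System.out.println(nodeB.obtain()); // false: the lock is already taken
        nodeA.release();
        System.out.println(nodeB.obtain()); // true: the lock was free again
    }
}
```

The conditional remove(key, value) keeps a node that never obtained the lock from releasing somebody else's.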

The initial idea was to contribute it directly to Lucene, but as
Infinispan is an LGPL dependency we had to distribute it with
Infinispan (the other way around would have introduced some legal
issues). Still, we hope you appreciate the effort and are interested in
giving it a try.
All kinds of feedback are welcome, especially on benchmarking
methodologies, as I have yet to run any serious performance tests.

Main code, build with Maven2:
svn co http://anonsvn.jboss.org/repos/infinispan/tags/4.0.0.CR2/lucene-directory/ infinispan-directory

Demo, see the Readme:
svn co http://anonsvn.jboss.org/repos/infinispan/tags/4.0.0.CR2/demos/lucene-directory/ lucene-demo

Best Regards,
Sanne

--
Sanne Grinovero
Sourcesense - making sense of Open Source: http://www.sourcesense.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


john.wang at gmail

Nov 14, 2009, 2:15 PM

Post #2 of 11 (437 views)
Re: A new Lucene Directory available [In reply to]

Hi Sanne,

Very interesting!

What kind of performance should we expect from this, compared to a regular
FSDirectory on a local hard disk?

Thanks

-John



s.grinovero at sourcesense

Nov 14, 2009, 8:33 PM

Post #3 of 11 (428 views)
Re: A new Lucene Directory available [In reply to]

Hi John,
I haven't run a reliable long-running benchmark yet, so at the moment I
can't really give numbers.
Suggestions and help on performance testing are welcome. I expect it
to shine in some situations but not necessarily in all of them, so no
single choice of writer/searcher ratio, number of nodes in the cluster
and resources per node will make for a completely fair comparison
between this Directory and others.

On paper the premises are good: it's all in memory as long as the data
fits. It distributes data across nodes, and overflow to disk is
supported (this is called passivation). A permanent store can be
configured, so you could set it to periodically and incrementally flush
to slower storage such as a database, a filesystem or a cloud storage
service. This makes it possible to avoid losing state even when all
nodes are shut down.
A RAMDirectory is AFAIK not recommended, as you could hit memory limits
and because it's basically a synchronized HashMap; Infinispan's cache
implements the ConcurrentMap interface and doesn't need that
synchronization.
Even if the data is replicated across nodes, each node has its own
local cache, so when caches are warm and all segments fit in memory it
should be, theoretically, the fastest Directory ever. The more it has
to read from disk, the more it will behave like an FSDirectory
with some buffers.
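
The passivation idea described above can be sketched roughly as follows. This is a single-JVM toy under stated assumptions, not Infinispan's implementation: a bounded LinkedHashMap plays the in-memory cache, a plain HashMap stands in for the slower store (disk, database), and the class and method names (PassivatingCache, inMemory, passivated) are made up for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded cache that overflows least-recently-used entries to a store. */
final class PassivatingCache<K, V> {
    // Stand-in for slower storage such as a file or a database table.
    private final Map<K, V> store = new HashMap<K, V>();
    private final LinkedHashMap<K, V> memory;

    PassivatingCache(final int maxInMemory) {
        // accessOrder = true makes the eldest entry the least recently used one.
        this.memory = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() > maxInMemory) {
                    store.put(eldest.getKey(), eldest.getValue()); // passivate
                    return true;
                }
                return false;
            }
        };
    }

    void put(K key, V value) {
        store.remove(key); // drop any stale passivated copy
        memory.put(key, value);
    }

    V get(K key) {
        V value = memory.get(key);
        if (value == null) {
            value = store.remove(key);
            if (value != null) {
                memory.put(key, value); // activate; may passivate another entry
            }
        }
        return value;
    }

    int inMemory() { return memory.size(); }

    int passivated() { return store.size(); }
}

final class PassivatingCacheDemo {
    public static void main(String[] args) {
        PassivatingCache<String, String> cache = new PassivatingCache<String, String>(2);
        cache.put("segments_1", "a");
        cache.put("_0.cfs", "b");
        cache.put("_1.cfs", "c"); // pushes "segments_1" out to the store
        System.out.println(cache.inMemory() + " in memory, " + cache.passivated() + " passivated");
        System.out.println(cache.get("segments_1")); // loaded back from the store
    }
}
```

Nothing is lost when memory runs out; reads of overflowed entries simply pay the cost of the slower store.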

As per Lucene's design, writes can happen on only one node at a time:
a single IndexWriter owns the lock, but IndexReaders and Searchers are
not blocked, so this Directory should behave exactly as if you had
multiple processes sharing a local NIOFSDirectory. Basically, you
can't scale on writers, but you can scale near-linearly on readers by
adding power from more machines.

Besides performance, the reasons for implementing this were to be able to
easily add or remove processing power to a service (clouds), to make it
easier to share indexes across nodes, and last but not least to remove
single points of failure: all data is distributed and there is no
notion of a master, so services will continue running fine when any
node is killed.

I hope this piques your interest; sorry that I couldn't provide numbers.

Regards,
Sanne



lukas.vlcek at gmail

Nov 14, 2009, 11:50 PM

Post #4 of 11 (427 views)
Re: A new Lucene Directory available [In reply to]

Hi,

This sounds very interesting. Do you know which versions of Lucene are
supported?
Do you know if it would work with the upcoming Lucene 3.0.x?
https://jira.jboss.org/jira/browse/ISPN-275

Regards,
Lukas

http://blog.lukas-vlcek.com/




earwin at gmail

Nov 15, 2009, 3:39 AM

Post #5 of 11 (422 views)
Re: A new Lucene Directory available [In reply to]

The Terracotta guys "easy-clustered" Lucene a few years ago. I have yet
to see a single person say it worked all right for them.

This new directory ain't gonna be faster than RAMDirectory: the syncs
on a map don't matter, as they are taken once per opened file -> once
per reopen, which doesn't happen thousands of times a second.
Glancing at the code (svn trunk), it is actually much slower. I
mean, compare the IndexInput.readByte() implementations: a whole slew
of code and method calls plus a ChunkCacheKey created for each byte
read (brutal on the GC) versus an if, an increment and an array access
for RAMDirectory.

I wouldn't be too optimistic about the doesn't-fit-in-memory case versus
FSDirectory either. The OS's paging and file caching skills are hard to
match, plus the OS file cache resides outside the Java heap, which (as
real-life experience dictates) is immensely good for your GC pauses.

Now to the networking part. Infinispan is based on JGroups. Last time
I saw it, it exploded under a moderate load on 20 nodes. I believe the
library is still good when properly configured and for lesser loads, but
not for distributing a Lucene index that is frequently updated and
merged on each node of the cluster.

Please excuse me if I've gone overboard in places, and correct me if I am wrong.




--
Kirill Zakharenko/Кирилл Захаренко (earwin [at] gmail)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785



s.grinovero at sourcesense

Nov 15, 2009, 4:14 AM

Post #6 of 11 (422 views)
Re: A new Lucene Directory available [In reply to]

Hi Lukas,
Our reference during early design was Lucene 2.4.1, but we look
forward to compatibility with newer versions and to new tricks.
The current trunk is compatible with Lucene's trunk, but I won't close
ISPN-275 until it's confirmed against a released Lucene 3.0.0;
hopefully that will come before the Infinispan 4 release.

Regards,
Sanne



s.grinovero at sourcesense

Nov 15, 2009, 5:13 AM

Post #7 of 11 (419 views)
Re: A new Lucene Directory available [In reply to]

Hi Earwin,
thanks for the insight. As I mentioned, I have no proper benchmarks to
back my statements, only my observations of how it behaves, so I could
absolutely be too optimistic.
They are currently profiling Infinispan and speeding up some
internals, so I'll wait for these tasks to finish before we begin
testing on our side. In the meantime I'm collecting suggestions: how do
you think I should test it properly? Which kinds of comparisons would
you like to see?

I'm currently working on JIRA clustering (called Scarlet), so the
typical index usage pattern of that application is going to be my
favorite scenario.

I know about the Terracotta efforts. I agree with you, and I have
collected a lot of feedback about the problems that arose by talking
directly with the people maintaining such systems. I even got to hear
some success stories, but yes, they are scarce and there are some
problems; rest assured that we analyzed them carefully before
deciding on this design. I'm not a Terracotta expert myself, but
specialists helped me with this. My personal opinion resulting from
these talks is that Terracotta works, but is too tricky to set up and
not viable when the indexes change frequently.

About the RAMDirectory comparison: as you said yourself, the bytes
aren't read constantly but only at index reopen, so I wouldn't be too
worried about the "bunch of methods", as they're executed once per
segment load. I'll improve that if possible, thanks for looking!
I'm sure many parts can be improved; patches are welcome.

Instances of ChunkCacheKey are not created for every single byte read
but for each byte[] buffer, and the size of these buffers is
configurable. We decided this after observing that "chunking" segments
into smaller pieces performed better than keeping huge arrays of
bytes, but if you like you can configure it so that it degenerates
towards one key per segment.
Comparing to a RAMDirectory is unfair, as with InfinispanDirectory I
can scale :-) Still, I take the point: I'll also run some tests in
single-node mode to compare them, just for fun, as the use cases are a
bit different. I'm confident I could surprise you when I get to choose
the scenario.
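
As a rough illustration of the chunking described above, the following sketch stores a file as fixed-size byte[] chunks in a map and builds a key only when a read crosses a chunk boundary. It is an assumption-laden toy, not the InfinispanDirectory code: ChunkKey, ChunkedInput, the keysCreated counter and the use of a plain HashMap as the cache are all hypothetical, and the real ChunkCacheKey may look different.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical cache key: one instance per (file, chunk) pair. */
final class ChunkKey {
    final String fileName;
    final int chunkId;

    ChunkKey(String fileName, int chunkId) {
        this.fileName = fileName;
        this.chunkId = chunkId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ChunkKey)) return false;
        ChunkKey other = (ChunkKey) o;
        return chunkId == other.chunkId && fileName.equals(other.fileName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(fileName, chunkId);
    }
}

/** Hypothetical sequential reader over a file stored as chunks in a map. */
final class ChunkedInput {
    private final Map<ChunkKey, byte[]> cache;
    private final String fileName;
    private byte[] current = new byte[0];
    private int posInChunk = 0;
    private int nextChunkId = 0;
    int keysCreated = 0; // counts cache lookups, for illustration only

    ChunkedInput(Map<ChunkKey, byte[]> cache, String fileName) {
        this.cache = cache;
        this.fileName = fileName;
    }

    /** Splits data into pieces of at most chunkSize bytes, one cache entry each. */
    static void write(Map<ChunkKey, byte[]> cache, String fileName, byte[] data, int chunkSize) {
        for (int id = 0, off = 0; off < data.length; id++, off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(data, off, chunk, 0, len);
            cache.put(new ChunkKey(fileName, id), chunk);
        }
    }

    byte readByte() {
        if (posInChunk == current.length) {
            // Only at a chunk boundary is a key built and the cache consulted.
            keysCreated++;
            byte[] next = cache.get(new ChunkKey(fileName, nextChunkId++));
            if (next == null) {
                throw new IllegalStateException("read past end of " + fileName);
            }
            current = next;
            posInChunk = 0;
        }
        return current[posInChunk++]; // hot path: a check, an increment, an array access
    }
}

final class ChunkedInputDemo {
    public static void main(String[] args) {
        Map<ChunkKey, byte[]> cache = new HashMap<ChunkKey, byte[]>();
        byte[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        ChunkedInput.write(cache, "_0.cfs", data, 4); // chunks of 4, 4 and 2 bytes

        ChunkedInput in = new ChunkedInput(cache, "_0.cfs");
        int sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += in.readByte();
        }
        System.out.println("sum=" + sum + ", keys created=" + in.keysCreated); // sum=45, keys created=3
    }
}
```

With a larger chunk size the number of keys per file drops, down to one key per file when the chunk size is at least the file length.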

About JGroups, I'm not technically prepared for a match, but I've heard
several stories of business-critical clusters much bigger than 20 nodes
working very well. Sure, it won't scale without proper
configuration at all levels: OS, JGroups and infrastructure.

Thank you very much for your considerations, they're much appreciated.
Regards,
Sanne

On Sun, Nov 15, 2009 at 12:39 PM, Earwin Burrfoot <earwin [at] gmail> wrote:
> Terracotta guys "easy-clustered" Lucene a few years ago. I'm yet to
> see at least one person saying it worked for him allright.
>
> This new directory ain't gonna be faster than RAMDirectory, as syncs
> on a map doesn't matter, they are taken once per opened file -> once
> per reopen, which is not happening thousands of times a sec.
> Taking a glance at the code (svn trunk), it actually is much slower. I
> mean, compare IndexInput.readByte()s. A whole slew of code and method
> calls plus a ChunkCacheKey created per each byte read (violent GC
> rape, ring the police!) VS if, incr, array access for RAMDir.
>
> I wouldn't be too optimistic in doesn't-fit-in-memory case VS
> FSDirectory either. OS' paging/file caching skills are hard to match,
> plus OS file cache resides outside of Java heap, which (as reallife
> experience dictates) is immensely good for your GC pauses.
>
> Now to the networking part. Infinispan is based on JGroups. Last time
> I saw it, it exploded under a moderate load on 20 nodes. I believe the
> library is still good, properly configured and for lesser loads, but
> not for distributing Lucene index that is frequently updated and
> merged on each node of the cluster.
>
> Please excuse me if I'm overboard in places, and correct me if I am wrong.
>



--
Sanne Grinovero
Sourcesense - making sense of Open Source: http://www.sourcesense.com



earwin at gmail

Nov 15, 2009, 6:43 AM

Post #8 of 11 (420 views)
Permalink
Re: A new Lucene Directory available [In reply to]

> About the RAMDirectory comparison, as you said yourself the bytes
> aren't read constantly but just at index reopen so I wouldn't be too
> worried about the "bunch of methods" as they're executed once per
> segment loading;
The bytes /are/ read constantly (readByte() method). I believe that is
the innermost loop you can hope to find in Lucene.

> A RAMDirectory is AFAIK not recommended as you could hit memory limits and because it's basically a synchronized HashMap;
On the other hand, as I mentioned, the only access to said
synchronized HashMap happens when you open an InputStream on a file.
That, unlike readByte(), happens rarely, as InputStreams are cloned
after creation as needed.
As for memory limits, your unbounded local cache hits them just as easily.
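
To illustrate the point about the synchronized map being touched only
when a stream is opened, here is a simplified model. TinyRamDirectory
and TinyInput are hypothetical classes written for this illustration;
they are not Lucene's RAMDirectory or its input stream.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified in-memory directory: the map is guarded by
// synchronization, but it is only consulted when a file is opened.
final class TinyRamDirectory {
    private final Map<String, byte[]> files = new HashMap<String, byte[]>();
    int lookups = 0;   // counts synchronized map reads, for illustration only

    synchronized void put(String name, byte[] content) {
        files.put(name, content);
    }

    synchronized TinyInput openInput(String name) {
        lookups++;
        byte[] content = files.get(name);
        if (content == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
        return new TinyInput(content);
    }
}

// Hypothetical input: readByte() touches only the local array, and
// clones share that array while keeping their own position.
final class TinyInput implements Cloneable {
    private final byte[] content;
    private int position = 0;

    TinyInput(byte[] content) {
        this.content = content;
    }

    byte readByte() {
        return content[position++];   // no map access, no locking
    }

    @Override
    public TinyInput clone() {
        try {
            return (TinyInput) super.clone();   // shallow copy: same byte[], copied position
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
}
```

In this model, opening one input and cloning it any number of times
leaves the lookup counter at one, however many bytes are read.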

> Instances of ChunkCacheKey are not created for each single byte read
> but for each byte[] buffer, being the size of these buffers configurable.
No, they are! :-)
InfinispanIndexIO.java, rev. 1103:
120 public byte readByte() throws IOException {
.........
132 buffer = getChunkFromPosition(cache, fileKey, filePosition, bufferSize);
.........
141 }
getChunkFromPosition() is called each time readByte() is invoked. It
creates 1-2 instances of ChunkCacheKey.
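
One way to avoid the per-byte allocation is to keep the current chunk
in a local buffer and build a key only when that buffer runs out. The
following is a minimal sketch under that assumption, not the actual
Infinispan code or the eventual fix: ChunkKey, BufferedChunkInput and
the plain Map standing in for the cache are hypothetical names.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.Map;

// Hypothetical cache key: one per (file, chunk), not one per byte.
final class ChunkKey {
    static int created = 0;   // counts allocations, for illustration only
    final String fileName;
    final int chunkId;

    ChunkKey(String fileName, int chunkId) {
        this.fileName = fileName;
        this.chunkId = chunkId;
        created++;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ChunkKey)) {
            return false;
        }
        ChunkKey k = (ChunkKey) o;
        return chunkId == k.chunkId && fileName.equals(k.fileName);
    }

    @Override
    public int hashCode() {
        return 31 * fileName.hashCode() + chunkId;
    }
}

// Hypothetical input: the cache is consulted only at chunk boundaries.
final class BufferedChunkInput {
    private final Map<ChunkKey, byte[]> cache;   // stand-in for the distributed cache
    private final String fileName;
    private byte[] buffer = new byte[0];
    private int bufferPosition = 0;
    private int nextChunk = 0;

    BufferedChunkInput(Map<ChunkKey, byte[]> cache, String fileName) {
        this.cache = cache;
        this.fileName = fileName;
    }

    byte readByte() throws IOException {
        if (bufferPosition >= buffer.length) {
            refill();                      // slow path, once per chunk
        }
        return buffer[bufferPosition++];   // fast path, no allocation
    }

    private void refill() throws IOException {
        byte[] chunk = cache.get(new ChunkKey(fileName, nextChunk));
        if (chunk == null || chunk.length == 0) {
            throw new EOFException("read past end of " + fileName);
        }
        buffer = chunk;
        bufferPosition = 0;
        nextChunk++;
    }
}
```

With this shape, reading six bytes stored as a four-byte chunk followed
by a two-byte chunk creates two keys rather than six.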

> This was decided after observations that it was
> improving performance to "chunk" segments in smaller pieces rather
> than have huge arrays of bytes, but if you like you can configure it
> to degenerate to approach the one key per segment ratio.
Locally, it's better not to chunk segments (unless you hit the 2 GB
barrier). When shuffling them over the network, I can't say.
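
For reference, the arithmetic behind fixed-size chunking is small. The
sketch below uses hypothetical names (ChunkMath, chunkSize) and is not
taken from the Infinispan sources. The 2 GB limit follows from Java
arrays being indexed by int, so a single byte[] cannot hold a larger
segment.

```java
// Hypothetical helper showing how a file position maps onto fixed-size chunks.
final class ChunkMath {
    private ChunkMath() {
    }

    // Number of chunks needed to hold fileLength bytes (ceiling division).
    static long chunkCount(long fileLength, int chunkSize) {
        return (fileLength + chunkSize - 1) / chunkSize;
    }

    // Index of the chunk that contains the given absolute position.
    static long chunkIndex(long filePosition, int chunkSize) {
        return filePosition / chunkSize;
    }

    // Offset of the given absolute position inside its chunk.
    static int offsetInChunk(long filePosition, int chunkSize) {
        return (int) (filePosition % chunkSize);
    }

    // A Java array holds at most Integer.MAX_VALUE elements (JVMs may
    // allow slightly fewer), so larger files must be split.
    static boolean fitsInOneArray(long fileLength) {
        return fileLength <= Integer.MAX_VALUE;
    }
}
```

Setting chunkSize at or above the segment length gives the
one-key-per-segment case, as long as the segment fits in one array.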

> Comparing to a RAMDirectory is unfair, as with InfinispanDirectory I can scale :-)
I'm just following two of your initial comparisons. And the only
characteristic that can be scaled with such an approach is queries/s.
Index size: definitely not. Updates/s: questionable.

> About JGroups I'm not technically prepared for a match, but I've heard
> of different stories of much bigger than 20 nodes business critical
> clusters working very well. Sure, it won't scale without a proper
> configuration at all levels: os, jgroups and infrastructure.
The volume of messages travelling around and the length of GC delays
versus cluster size and messaging mode all matter.
They used reliable synchronous multicasts, so once one node starts
collecting, all others wait (or worse, send retries).
Another one starts collecting, then another, partially delivered
messages hold threads: kaboom!
How is locking handled here? With a central broker it can probably work.

--
Kirill Zakharenko/Кирилл Захаренко (earwin [at] gmail)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785



s.grinovero at sourcesense

Nov 15, 2009, 8:11 AM

Post #9 of 11 (421 views)
Permalink
Re: A new Lucene Directory available [In reply to]

Hi again Earwin,
thank you very much for spotting the byte reading issue; it's
definitely not as I wanted it.
https://jira.jboss.org/jira/browse/ISPN-276

I never tried to defend an improved updates/s ratio, except maybe
compared to scheduled rsyncs :-)
Our goal is to scale on queries/sec while usage semantics stay
unchanged, so you can open an IndexWriter as if it were local to make
updates cluster-wide. This is very useful for clustering the many
products already using Lucene that currently implement exotic index
management workarounds or shared filesystems, as they weren't designed
for it from the beginning as Solr was.
I mentioned JIRA: have you noticed how slow it can get on larger
deployments? That is because there is currently no way to deploy it
clustered (besides using Terracotta), as it relies heavily on Lucene
and index changes need to be applied in real time.

About locking and JGroups: please switch over to
infinispan-dev [at] lists so you can get better answers and I
don't have to spam the Lucene developers.

Regards,
Sanne






--
Sanne Grinovero
Sourcesense - making sense of Open Source: http://www.sourcesense.com



manik at jboss

Nov 16, 2009, 2:33 AM

Post #10 of 11 (399 views)
Permalink
Re: A new Lucene Directory available [In reply to]

@Sanne, thanks for announcing this, good stuff!

@Earwin, note that this is a tech preview and hardly production-ready code yet. The more eyes that scan the code, try it out, and report bugs and bottlenecks, the better. So thanks for spotting ISPN-276; we look forward to more feedback/patches. :) As for your comments on locking, cluster-wide syncs, performance and tuning JGroups, I agree with Sanne that you should post your concerns on infinispan-dev [at] lists and we can talk about it in greater depth there while keeping things relevant.

Cheers
Manik

--
Manik Surtani
manik [at] jboss
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org










sergio.bossa at gmail

Nov 16, 2009, 2:47 AM

Post #11 of 11 (399 views)
Permalink
Re: A new Lucene Directory available [In reply to]

Sanne,

I'd be very interested in knowing what kind of problems you analyzed
and resolved regarding the Terracotta clustered solution, as quoted
below:

> I know about the Terracotta efforts, I agree with you and have
> collected much feedback about which problems were arising directly
> talking with the people maintaining such systems. I even got to hear
> some success cases, but yes they are scarce and there are some
> problems; be assured that we have analyzed them carefully before
> deciding for this design. I'm not a Terracotta expert myself, but was
> helped on this by specialists. My personal opinion resulting from
> these talks is that Terracotta works, but is too tricky to set up and
> not viable in case the indexes change frequently.

In other words, how is the Infinispan solution superior to the Terracotta one?

Thanks,
Cheers,

Sergio B.

--
Sergio Bossa
Software Passionate and Open Source Enthusiast.
URL: http://www.linkedin.com/in/sergiob

