
Mailing List Archive: Lucene: Java-Dev

anyone has interests about mg4j's new integer compression algorithm?

 

 



fancyerii at gmail

Jun 23, 2012, 3:38 AM

Post #1 of 6
anyone has interests about mg4j's new integer compression algorithm?

http://mg4j.di.unimi.it/
http://vigna.di.unimi.it/papers.php#VigQSI

Sounds very interesting and attractive.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


dawid.weiss at cs

Jun 24, 2012, 12:31 AM

Post #2 of 6
Re: anyone has interests about mg4j's new integer compression algorithm?

Fyi. I contacted Sebastiano and will get hold of the data set and
benchmarks he used to repeat his experiment with current trunk
(curiosity). Any hints on which configuration should be used will be
welcome.

Dawid



dawid.weiss at cs

Jul 6, 2012, 1:20 AM

Post #3 of 6
Re: anyone has interests about mg4j's new integer compression algorithm?

I've repeated Sebastiano's experiments (and so did he). A few quotes
from the communication.

> The index appears to be larger now--43.1GB. Probably they have better skipping structures that take more space.
>
> From what I can see the format is the same as before--the .frq file contains document pointers and positions. So my SearchFiles class still reads documents *and* counts.
>
> But the most interesting part I've read in a blog is that now Lucene has a pluggable index format. This means that someone can actually write a QS index for Lucene and test what happens in production. That's a most interesting change!

and:

> Well, they did a great job:
>
> trec-40-text unscored terms result: 5511 494901
> trec-40-text unscored and result: 2193 769110
> trec-40-text unscored phrase result: 6615 148663
> trec-40-text unscored spans result: 12407 545090
>
> So conjunction is still better, but by a much smaller margin. The worst part is term scanning--they are now significantly faster than QS indices.

Dawid





fancyerii at gmail

Jul 6, 2012, 2:47 AM

Post #4 of 6
Re: anyone has interests about mg4j's new integer compression algorithm?

I can understand these quotes, but what's the conclusion from your communication?



dawid.weiss at cs

Jul 6, 2012, 2:53 AM

Post #5 of 6
Re: anyone has interests about mg4j's new integer compression algorithm?

The conclusion is that Lucene 4.0 is significantly faster than 3.6 for
this benchmark, and that there were minor glitches in the benchmarking
code itself.

Dawid



rcmuir at gmail

Jul 6, 2012, 3:06 AM

Post #6 of 6
Re: anyone has interests about mg4j's new integer compression algorithm?

I reviewed the benchmarking code on his website very quickly:

* I don't like his NullCollector: it sets acceptsDocsOutOfOrder() =
false, but it's doing nothing but counting. By returning false here, he
is declaring that the collector cares about docid order (which it
doesn't), and preventing the use of BooleanScorer. He could just use
TotalHitCountCollector:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollector.java

* I'm not sure I like that he uses SpanNearQuery for the 'proximity
window' benchmarking. For just a list of terms, I think a sloppy
PhraseQuery (a PhraseQuery with a slop value) is the more natural
choice and would be faster: "foo bar baz"~5 or whatever.
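
To make the first point concrete, here is a minimal, self-contained
sketch of why a counting-only collector has no reason to demand docid
order. It does not use Lucene at all: HitCollector, CountingCollector
and search() are hypothetical names made up for this illustration, and
the sort is only a stand-in for "a more expensive ordered path", not a
description of what Lucene's scorers actually do.

```java
import java.util.Arrays;

final class CollectorOrderSketch {

    /** Hypothetical stand-in for a hit collector; NOT Lucene's Collector API. */
    interface HitCollector {
        void collect(int docId);

        /** True if the result does not depend on the order in which hits arrive. */
        boolean acceptsDocsOutOfOrder();
    }

    /** Counts hits only; the flag lets us compare both declarations side by side. */
    static final class CountingCollector implements HitCollector {
        private final boolean outOfOrderOk;
        private int totalHits;

        CountingCollector(boolean outOfOrderOk) {
            this.outOfOrderOk = outOfOrderOk;
        }

        @Override
        public void collect(int docId) {
            totalHits++; // the docid itself is never inspected
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return outOfOrderOk;
        }

        int getTotalHits() {
            return totalHits;
        }
    }

    /**
     * Toy searcher. Hits arrive in arbitrary order; if the collector insists
     * on docid order, extra work (here: a sort) is done first.
     * Returns the name of the path that was taken.
     */
    static String search(int[] hitsInArrivalOrder, HitCollector collector) {
        int[] hits = hitsInArrivalOrder.clone();
        String path = "out-of-order";
        if (!collector.acceptsDocsOutOfOrder()) {
            Arrays.sort(hits); // work forced only by the collector's declaration
            path = "in-order";
        }
        for (int docId : hits) {
            collector.collect(docId);
        }
        return path;
    }

    public static void main(String[] args) {
        int[] hits = {42, 7, 19, 3};
        CountingCollector strict = new CountingCollector(false);
        CountingCollector relaxed = new CountingCollector(true);
        System.out.println(search(hits, strict) + " " + strict.getTotalHits());
        System.out.println(search(hits, relaxed) + " " + relaxed.getTotalHits());
    }
}
```

Both collectors end up with the same count; only the path taken by the
toy searcher differs. That is the sense in which returning false from a
counting-only collector buys nothing and may rule out a faster path.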




--
lucidimagination.com
