
Mailing List Archive: Lucene: Java-Dev

Omit positions but not TF


ab at getopt

Nov 7, 2009, 4:47 PM

Post #1 of 6
Omit positions but not TF

Hi,

During one of the discussions at ApacheCon it occurred to me that it would
be useful to have an option to discard positional information but still
keep the term frequency. Position-dependent queries wouldn't work then,
but all other queries would work fine and we would still get the
right scoring.

I believe it should be possible to do this without changing the file
format, if we used a negative term frequency for terms without positions
- we would have to check for that condition in SegmentTermDocs, change
the flags there and flip the sign of the term frequency. Eventually we
may want to add a separate flag for this and bump the format version.
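
To make the sign trick concrete, here is a minimal sketch of the idea, not actual Lucene code. The class and method names (SignFlagFreq, encode, positionsOmitted, termFreq) are hypothetical and only illustrate how a reader could tell the two cases apart. It relies on the assumption that a stored term frequency is always at least 1, so the sign is free to act as a flag.

```java
// Hypothetical illustration only; none of these names exist in Lucene.
final class SignFlagFreq {

    // Store the frequency negated when positions were omitted for this term.
    static int encode(int termFreq, boolean omitPositions) {
        if (termFreq < 1) {
            throw new IllegalArgumentException("termFreq must be >= 1");
        }
        return omitPositions ? -termFreq : termFreq;
    }

    // A negative stored value signals "no positions were written".
    static boolean positionsOmitted(int stored) {
        return stored < 0;
    }

    // Flip the sign back to recover the real term frequency for scoring.
    static int termFreq(int stored) {
        return Math.abs(stored);
    }

    public static void main(String[] args) {
        int stored = encode(3, true);
        System.out.println("stored=" + stored
            + " omitted=" + positionsOmitted(stored)
            + " freq=" + termFreq(stored));
    }
}
```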

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


lucene at mikemccandless

Nov 8, 2009, 2:35 AM

Post #2 of 6
Re: Omit positions but not TF

+1

I guess we'd add a Fieldable.setOmitPositions? And then save that in
FieldInfos, and fix the postings writing/reading to respect it? I.e.,
we can just change the index format. Encoding as negative numbers
isn't great because the termFreq is written as a vInt, which consumes
5 bytes to encode any negative number. Wanna cough up a patch?
Probably this should wait until 3.1.
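
The 5-byte cost can be checked with a small standalone sketch of a variable-length int writer: 7 data bits per byte, with the high bit set on every byte except the last. This is an illustration written for this thread, not Lucene's own class; VIntDemo and its writeVInt helper are assumed names. A negative int has its top bit set, so all 32 bits must be emitted, which takes ceil(32 / 7) = 5 bytes.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch; VIntDemo is a made-up name, not a Lucene class.
final class VIntDemo {

    // Emit 7 bits at a time, low bits first; the high bit of each byte
    // marks "more bytes follow".
    static byte[] writeVInt(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7; // unsigned shift, so negative values terminate
        }
        out.write(i);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] samples = {1, 127, 128, 16384, -1, -5};
        for (int v : samples) {
            System.out.println(v + " -> " + writeVInt(v).length + " byte(s)");
        }
    }
}
```

Typical term frequencies are small and fit in a single byte, so negating them would multiply the per-document cost of that field by five.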

Mike



ab at getopt

Nov 8, 2009, 8:14 AM

Post #3 of 6
Re: Omit positions but not TF

Michael McCandless wrote:
> +1
>
> I guess we'd add a Fieldable.setOmitPositions? And then save that in
> FieldInfos, and fix the postings writing/reading to respect it? Ie,
> we can just change the index format. Encoding as negative numbers

Yes, that's what I had in mind. I was a bit shy of bumping the format
version, but likely there will be other changes that we can put under
the same next version of the format.

> isn't great because the termFreq is written as a vInt, which consumes
> 5 bytes to encode any negative number. Wanna cough up a patch?

Heh .. that's the right term for it, I haven't looked at the details of
oal.index.* since 2.4-ish or so ... we'll see ;)

> Probably this should wait until 3.1.

+1.

--
Best regards,
Andrzej Bialecki <><


ab at getopt

Nov 9, 2009, 8:26 AM

Post #4 of 6
Re: Omit positions but not TF

Andrzej Bialecki wrote:
> Heh .. that's the right term for it, I haven't looked at the details of
> oal.index.* since 2.4-ish or so ... we'll see ;)

Ehh, sorry - I think I'll give up for now, after looking at the
combinatorial increase in the number of arguments to the various
indexing classes ...

--
Best regards,
Andrzej Bialecki <><


lucene at mikemccandless

Nov 9, 2009, 9:03 AM

Post #5 of 6
Re: Omit positions but not TF

How about opening an issue? This way someone else can come along and
pick up the torch...

Mike



simon.willnauer at googlemail

Nov 9, 2009, 9:14 AM

Post #6 of 6
Re: Omit positions but not TF

On Mon, Nov 9, 2009 at 6:03 PM, Michael McCandless
<lucene [at] mikemccandless> wrote:
> How about opening an issue?  This way someone else can come along and
> pick up the torch...
+1