Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev

[jira] Issue Comment Edited: (LUCENE-1990) Add unsigned packed int impls in oal.util

 

 

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded


jira at apache

Nov 10, 2009, 6:01 AM

Post #1 of 4 (200 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1990) Add unsigned packed int impls in oal.util

[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775420#action_12775420 ]

Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 2:01 PM:
---------------------------------------------------------------

Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high.

{code}
Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ...
Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ...
Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ...
Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ...
Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...
Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ...
Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ...
Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ...
{code}

- represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0.

And now, if we go "horizontally" we will end up with 8 arrays of int[]. What if we go "vertically"? Field could be encoded as 3-bit (8 different values).

CONSTRAINT: specifically for FieldCache, each Column must have the only "1".

And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc().


hope I am reinventing bycycle :)


was (Author: funtick):
Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high.

{code}
Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ...
Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ...
Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ...
Value3 0 0 0 0 0 0 0 0 0 0 0 10 0 0 ...
Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...
Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ...
Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ...
Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ...
{code}

- represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0.

And now, if we go "horizontally" we will end up with 8 arrays of int[]. What if we go "vertically"? Field could be encoded as 3-bit (8 different values).

CONSTRAINT: specifically for FieldCache, each Column must have the only "1".

And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc().


hope I am reinventing bycycle :)


> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
> Key: LUCENE-1990
> URL: https://issues.apache.org/jira/browse/LUCENE-1990
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Priority: Minor
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl. EG the terms dict index in
> the standard codec in LUCENE-1458 could subsantially reduce it's RAM
> usage. FieldCache.StringIndex could as well. And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs {
> long get(long index);
> void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting. If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
> PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is an good existing impl
> that has a compatible license that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 10, 2009, 8:11 AM

Post #2 of 4 (180 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1990) Add unsigned packed int impls in oal.util [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775420#action_12775420 ]

Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 4:11 PM:
---------------------------------------------------------------

Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high.

{code}
Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ...
Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ...
Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ...
Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ...
Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...
Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ...
Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ...
Value7 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ...
{code}

- represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0.

And now, if we go "horizontally" we will end up with 8 arrays of int[]. What if we go "vertically"? Field could be encoded as 3-bit (8 different values).

CONSTRAINT: specifically for FieldCache, each Column must have the only "1".

And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc().


hope I am reinventing bycycle :)

P.S.
Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases.

1. For a given document ID, find a value for a field
2. For a given query results, sort it by a field values
3. For a given query results, count "facet" for each field value

I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM.

Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays...

was (Author: funtick):
Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high.

{code}
Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ...
Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ...
Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ...
Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ...
Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...
Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ...
Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ...
Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ...
{code}

- represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0.

And now, if we go "horizontally" we will end up with 8 arrays of int[]. What if we go "vertically"? Field could be encoded as 3-bit (8 different values).

CONSTRAINT: specifically for FieldCache, each Column must have the only "1".

And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc().


hope I am reinventing bycycle :)

P.S.
Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases.

1. For a given document ID, find a value for a field
2. For a given query results, sort it by a field values
3. For a given query results, count "facet" for each field value

I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM.

Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays...

> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
> Key: LUCENE-1990
> URL: https://issues.apache.org/jira/browse/LUCENE-1990
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Priority: Minor
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl. EG the terms dict index in
> the standard codec in LUCENE-1458 could subsantially reduce it's RAM
> usage. FieldCache.StringIndex could as well. And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs {
> long get(long index);
> void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting. If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
> PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is an good existing impl
> that has a compatible license that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 10, 2009, 8:11 AM

Post #3 of 4 (181 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1990) Add unsigned packed int impls in oal.util [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775420#action_12775420 ]

Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 4:10 PM:
---------------------------------------------------------------

Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high.

{code}
Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ...
Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ...
Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ...
Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ...
Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...
Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ...
Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ...
Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ...
{code}

- represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0.

And now, if we go "horizontally" we will end up with 8 arrays of int[]. What if we go "vertically"? Field could be encoded as 3-bit (8 different values).

CONSTRAINT: specifically for FieldCache, each Column must have the only "1".

And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc().


hope I am reinventing bycycle :)

P.S.
Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases.

1. For a given document ID, find a value for a field
2. For a given query results, sort it by a field values
3. For a given query results, count "facet" for each field value

I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM.

Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays...

was (Author: funtick):
Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high.

{code}
Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ...
Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ...
Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ...
Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ...
Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...
Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ...
Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ...
Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ...
{code}

- represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0.

And now, if we go "horizontally" we will end up with 8 arrays of int[]. What if we go "vertically"? Field could be encoded as 3-bit (8 different values).

CONSTRAINT: specifically for FieldCache, each Column must have the only "1".

And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc().


hope I am reinventing bycycle :)


> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
> Key: LUCENE-1990
> URL: https://issues.apache.org/jira/browse/LUCENE-1990
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Priority: Minor
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl. EG the terms dict index in
> the standard codec in LUCENE-1458 could subsantially reduce it's RAM
> usage. FieldCache.StringIndex could as well. And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs {
> long get(long index);
> void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting. If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
> PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is an good existing impl
> that has a compatible license that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 10, 2009, 8:13 AM

Post #4 of 4 (181 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1990) Add unsigned packed int impls in oal.util [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775420#action_12775420 ]

Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 4:11 PM:
---------------------------------------------------------------


Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high.

{code}
Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ...
Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ...
Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ...
Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ...
Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...
Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ...
Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ...
Value7 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ...
{code}

- represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0.

And now, if we go "horizontally" we will end up with 8 arrays of int[]. What if we go "vertically"? Field could be encoded as 3-bit (8 different values).

CONSTRAINT: specifically for FieldCache, each Column must have the only "1".

And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc().


hope I am reinventing bycycle :)

P.S.
Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases.

1. For a given document ID, find a value for a field
2. For a given query results, sort it by a field values
3. For a given query results, count "facet" for each field value

I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM.

Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays...

was (Author: funtick):
Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high.

{code}
Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ...
Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ...
Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ...
Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ...
Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...
Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ...
Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ...
Value7 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ...
{code}

- represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0.

And now, if we go "horizontally" we will end up with 8 arrays of int[]. What if we go "vertically"? Field could be encoded as 3-bit (8 different values).

CONSTRAINT: specifically for FieldCache, each Column must have the only "1".

And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc().


hope I am reinventing bycycle :)

P.S.
Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases.

1. For a given document ID, find a value for a field
2. For a given query results, sort it by a field values
3. For a given query results, count "facet" for each field value

I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM.

Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays...

> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
> Key: LUCENE-1990
> URL: https://issues.apache.org/jira/browse/LUCENE-1990
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Priority: Minor
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl. EG the terms dict index in
> the standard codec in LUCENE-1458 could subsantially reduce it's RAM
> usage. FieldCache.StringIndex could as well. And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs {
> long get(long index);
> void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting. If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
> PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is an good existing impl
> that has a compatible license that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.