
Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields

 

 



jira at apache

Aug 10, 2006, 6:29 PM

Post #1 of 17
[jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields

[ http://issues.apache.org/jira/browse/LUCENE-648?page=comments#action_12427385 ]

Grant Ingersoll commented on LUCENE-648:
----------------------------------------

Just curious: have you tried other values here to see what kind of difference they make before we go looking for a solution? Could you maybe put together a little benchmark that tries out the various levels and report back?

We could add another addDocument method to IndexWriter so the level can be changed per document, make it part of the IndexWriter constructor, or do it as mentioned above. I'm not sure yet which way is best.

I think this may also fall under the notion of the Flexible Indexing thread that we have been talking about (someday it will get implemented).

> Allow changing of ZIP compression level for compressed fields
> -------------------------------------------------------------
>
> Key: LUCENE-648
> URL: http://issues.apache.org/jira/browse/LUCENE-648
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.0.0, 1.9, 2.0.1, 2.1
> Reporter: Michael McCandless
> Priority: Minor
>
> In response to this thread:
> http://www.gossamer-threads.com/lists/lucene/java-user/38810
> I think we should allow changing the compression level used in the call to java.util.zip.Deflater in FieldsWriter.java. Right now it's hardwired to "best":
> compressor.setLevel(Deflater.BEST_COMPRESSION);
> Unfortunately, this can apparently cause the zip library to take a very long time (10 minutes for 4.5 MB in the above thread) and so people may want to change this setting.
> One approach would be to read the default from a Java system property, but it seems that recently (pre-2.0, I think) there was an effort to stop relying on Java system properties (many were removed).
> A second approach would be to add static methods (and a static class attribute) to set the compression level globally.
> A third approach would be in the document.Field class, e.g. setCompressLevel/getCompressLevel. But then every time a document is created with this field you'd have to call setCompressLevel, since Lucene doesn't have a global Field schema (like Solr).
> Any other ideas / preferences for these approaches?
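For concreteness, the hardwired line could become a parameter. A minimal standalone sketch (hypothetical helper name, not the actual Lucene patch) of validating and applying a caller-chosen level:

```java
import java.util.zip.Deflater;

// Today FieldsWriter does, in effect:
//   compressor.setLevel(Deflater.BEST_COMPRESSION);
// A configurable variant might take the level as a parameter instead.
// (CompressionLevelSketch is a hypothetical name for illustration.)
public class CompressionLevelSketch {
    public static Deflater newCompressor(int level) {
        if (level < 0 || level > 9) {
            throw new IllegalArgumentException("level must be in 0..9: " + level);
        }
        Deflater compressor = new Deflater();
        compressor.setLevel(level); // 0 = NO_COMPRESSION .. 9 = BEST_COMPRESSION
        return compressor;
    }
}
```

The open question in the issue is only where this `level` value should come from: a system property, a static setter, or a per-Field attribute.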

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Aug 10, 2006, 7:38 PM

Post #2 of 17
[jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]

[ http://issues.apache.org/jira/browse/LUCENE-648?page=comments#action_12427394 ]

Michael McCandless commented on LUCENE-648:
-------------------------------------------

Good question! I will try to get the original document if possible and also run some simple tests to see the variance of CPU time consumed vs % compressed.



jira at apache

Aug 10, 2006, 10:18 PM

Post #3 of 17
[jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]

[ http://issues.apache.org/jira/browse/LUCENE-648?page=comments#action_12427421 ]

Michael Busch commented on LUCENE-648:
--------------------------------------

I think the compression level is only one part of the performance problem. Another drawback of the current implementation is how compressed fields are merged: the FieldsReader uncompresses the fields, the SegmentMerger concatenates them, and the FieldsWriter compresses the data again. The uncompress/compress steps are completely unnecessary and add a large overhead. Before a document is written to disk, its field data is even compressed twice: first when the DocumentWriter writes the single-document segment to the RAMDirectory, and again when the SegmentMerger merges the segments inside the RAMDirectory to write the merged segment to disk.

Please check out JIRA issue LUCENE-629 (http://issues.apache.org/jira/browse/LUCENE-629), where I recently posted a patch that fixes this problem and increases indexing speed significantly. I also included performance test results that quantify the improvement. Mike, it would be great if you could also try the patched version for your tests with the compression level.



rengels at ix

Aug 10, 2006, 10:45 PM

Post #4 of 17
Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]

I don't understand why compressed fields are not just handled
externally in the Document class - just add uncompress/compress
methods. That way all Lucene needs to understand is binary fields,
and you don't have any of these problems during merging or initial
indexing.
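The external approach described here can be sketched with plain java.util.zip helpers (hypothetical class and method names; Lucene itself would only ever see the resulting byte[] as an opaque binary stored field):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch of Document-level compress/uncompress helpers (hypothetical
// names): the application compresses before storing and uncompresses
// after loading, so merging never touches the payload.
public class FieldZip {
    public static byte[] compress(byte[] plain) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(plain);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static byte[] uncompress(byte[] packed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            int n = inflater.inflate(buf);
            if (n == 0 && inflater.needsInput()) break; // truncated input
            out.write(buf, 0, n);
        }
        inflater.end();
        return out.toByteArray();
    }
}
```

With helpers like these, the compression level (or even the algorithm) is entirely the application's choice.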



jira at apache

Aug 10, 2006, 11:31 PM

Post #5 of 17
[jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]

[ http://issues.apache.org/jira/browse/LUCENE-648?page=comments#action_12427449 ]

Jason Polites commented on LUCENE-648:
--------------------------------------

If you find that the compression level has a meaningful impact (which, as suggested, it may not), one low-impact fix would be to allow the end user to specify their own Inflater/Deflater when creating the IndexWriter. If not specified, behaviour remains as is. If the user specifies a different compression level when retrieving the document, that's their bad luck.
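This suggestion could look roughly like the following (a hypothetical sketch in modern Java, not Lucene's actual IndexWriter API): the writer takes a caller-supplied Deflater factory and falls back to today's BEST_COMPRESSION default when none is given.

```java
import java.util.function.Supplier;
import java.util.zip.Deflater;

// Hypothetical configuration object (not in Lucene): a caller-supplied
// Deflater factory, defaulting to the current hardwired behaviour.
public class WriterConfigSketch {
    private final Supplier<Deflater> deflaterSupplier;

    public WriterConfigSketch() {
        // Default: reproduce today's hardwired BEST_COMPRESSION setting.
        this(() -> {
            Deflater d = new Deflater();
            d.setLevel(Deflater.BEST_COMPRESSION);
            return d;
        });
    }

    public WriterConfigSketch(Supplier<Deflater> deflaterSupplier) {
        this.deflaterSupplier = deflaterSupplier;
    }

    // The indexing code would call this each time it compresses a field.
    public Deflater newDeflater() {
        return deflaterSupplier.get();
    }
}
```

A factory (rather than a single shared Deflater instance) sidesteps thread-safety and reset issues, since Deflater is stateful.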



lucene at mikemccandless

Aug 11, 2006, 3:39 AM

Post #6 of 17
Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]

> I don't understand why the compressed fields are not just handled
> externally in the Document class - just add uncompress/compress methods.
> This way all Lucene needs to understand is binary fields, and you don't
> have any of these problems during merging or initial indexing.

This is an excellent point. Since these are just stored fields, why not
have Lucene treat them as opaque binary fields so they are not touched
during normal index operations? Then only when a Document actually
needs to provide the value would it be decompressed.

Was there a counterbalance that justified having the compress logic
inside the indexing code?

Mike



lucene at mikemccandless

Aug 11, 2006, 4:07 AM

Post #7 of 17
Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]

> I don't understand why the compressed fields are not just handled
> externally in the Document class - just add uncompress/compress methods.
> This way all Lucene needs to understand is binary fields, and you don't
> have any of these problems during merging or initial indexing.

The original poster of this issue (on java-user) raised another aspect
of his use case: he needs to update documents that have large compressed
fields.

Ideally, one could pull out a Document, change one of the other (not
compressed) fields, then re-index it (deleting the original), all
without uncompressing / recompressing the untouched compressed field.
I guess this would require the ability to mark a Field as compressed
but also mark that it's already in compressed form (to avoid
re-compressing it!).

It's of course always possible, as a workaround, to do all of this
outside of Lucene (as the original poster has done), but I think these
use cases should be "in scope" for Lucene: if we are going to offer
compressed fields we should try hard to make them efficient for basic
use cases as well as for document updates.

Mike



rengels at ix

Aug 11, 2006, 7:00 AM

Post #8 of 17
Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]

If you make the compression external, this is already done. To do what
the poster requires, you still need to read and update fields without
reading the entire document; you just do this at the binary-field
level, and do all of the compression/decompression externally.

I think putting the compression into Lucene needlessly complicates
matters. All that is required is in-place field updating and binary
field support.




jira at apache

Aug 11, 2006, 2:24 PM

Post #9 of 17
[jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]

[ http://issues.apache.org/jira/browse/LUCENE-648?page=comments#action_12427630 ]

Michael McCandless commented on LUCENE-648:
-------------------------------------------


OK I ran some basic benchmarks to test the effect on indexing of
varying the ZIP compression level from 0-9.

Lucene currently hardwires the compression level at 9 (= BEST).

I found a decent text corpus here:

http://people.csail.mit.edu/koehn/publications/europarl

I ran all tests on the "Portuguese-English" data set, a total of
327.5 MB of plain text across 976 files.

I just ran the demo IndexFiles, modified to add the file contents only
as a compressed stored field (i.e., not indexed). Note that this
"amplifies" the cost of compression, because in a real setting there
would also be a number of indexed fields.

I didn't change any of the default merge factor settings. I'm running
on an Ubuntu Linux 6.06, single-CPU (2.4 GHz Pentium 4) desktop machine
with the index stored on an internal ATA hard drive.

I first tested indexing time with and without the patch from
LUCENE-629 here:

old version: 648.7 sec

patched version: 145.5 sec

We clearly need to get that patch committed & released! Compressed
fields are far more costly than they ought to be, and people are now
using them (as of the 1.9 release).

So, then I ran all subsequent tests with the above patch applied. All
numbers are avg. of 3 runs:

Level   Index time (sec)   Index size (MB)

None         65.3              322.3
0            92.3              322.3
1            80.8              128.8
2            80.6              122.2
3            81.3              115.8
4            89.8              111.3
5           104.0              106.2
6           121.8              103.6
7           131.7              103.1
8           144.8              102.9
9           145.5              102.9

Quick conclusions:

* There is indeed a substantial variance when you change the compression
level.

* The "sweet spot" above seems to be around 4 or 5 -- should we
change the default from 9?

* I would still say we should make it possible for Lucene users to
change the compression level?
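A sweep in the spirit of the table above can be reproduced with a few lines of java.util.zip code (illustrative only -- the corpus here is synthetic, not the Europarl data used for the numbers above):

```java
import java.util.zip.Deflater;

// Compress the same buffer at every level 0..9 and report size and time.
// (LevelSweep is a hypothetical harness name, not Lucene code.)
public class LevelSweep {
    static int deflatedSize(byte[] input, int level) {
        Deflater d = new Deflater(level);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 64];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf);
        }
        d.end();
        return total;
    }

    public static void main(String[] args) {
        // 1 MB of mildly repetitive synthetic text.
        byte[] sample = new byte[1 << 20];
        for (int i = 0; i < sample.length; i++) {
            sample[i] = (byte) ("lucene compression level sweep ".charAt(i % 31));
        }
        for (int level = 0; level <= 9; level++) {
            long t0 = System.nanoTime();
            int size = deflatedSize(sample, level);
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.println("level " + level + ": " + size + " bytes, " + ms + " ms");
        }
    }
}
```

The shape of the curve (time rising steeply past level 5-6 for little size gain) is what makes the mid-range levels attractive as a default.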




nicolas.lalevee at anyware-tech

Aug 12, 2006, 7:27 AM

Post #10 of 17
Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]


I agree with you.
The API should be kept compatible between versions, but what about
breaking compatibility in trunk? Would it be a problem if the function
Fieldable.isCompressed() were removed?

Nicolas



jira at apache

Aug 12, 2006, 11:13 PM

Post #11 of 17
[jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]

[ http://issues.apache.org/jira/browse/LUCENE-648?page=comments#action_12427735 ]

Otis Gospodnetic commented on LUCENE-648:
-----------------------------------------

I agree. I like the idea of externalizing this, too, as suggested by Robert on the mailing list.



lucene at mikemccandless

Aug 14, 2006, 11:44 AM

Post #12 of 17
Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]


OK I think this makes total sense. I've opened an issue to track this:

http://issues.apache.org/jira/browse/LUCENE-652

Mike



nicolas.lalevee at anyware-tech

Aug 16, 2006, 5:32 AM

Post #13 of 17
Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]

On Monday, August 14, 2006 at 20:44, Michael McCandless wrote:
> OK I think this makes total sense. I've opened an issue to track this:
>
> http://issues.apache.org/jira/browse/LUCENE-652

Hi,

In the issue, you wrote that "This way the indexing level just stores opaque
binary fields, and then Document handles compress/uncompressing as needed."

I have looked into the Lucene code, and it seems to me that it is Field that
should take care of compress/uncompress, and it is the FieldsReader and
FieldsWriter that should only view binary data.
Or you mean that compression should be completely external to Lucene ?

In fact, at the end of the other thread "Flexible index format / Payloads
Cont'd", I was discussing how to customize the way data are stored. I have
looked deeper into the code and I think I have found a way to do so. Since
you can change the way data are stored, you can also define the compression
level, or plug in your own compression algorithm. I will show you a patch,
but I have modified so much code over my several tries that I first need to
remove the unnecessary changes. To describe it shortly:
- I have provided a way to supply your own FieldsReader and FieldsWriter
(via a factory). To create an IndexReader, you have to provide that factory;
the current API just uses a default factory.
- I have moved the field-data reading code of FieldsReader and FieldsWriter
to a new class, FieldData. The FieldsReader instantiates a FieldData, calls
fieldData.read(input), and then does new Field(fieldData, ...). The
FieldsWriter calls field.getFieldData().write(output).
- So by extending FieldsReader you can provide your own implementation of
FieldData, and thus control how data are stored and read.
The tests pass successfully, but I have one issue with this design: one
thing that is important, I think, is that in the current design we can read
an index in an old format and just do a writer.addIndexes() into a new
format. With the new design you cannot, because the writer will use the
FieldData.write provided by the reader.
To be continued...
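A rough sketch of the factory/FieldData idea described above. All class names here are hypothetical; none of this is Lucene's actual API (the real patch was later posted as LUCENE-662). A compressing variant of FieldData could deflate in write() and inflate in read() at whatever level it chooses:

```java
import java.io.*;

// Hypothetical: a pluggable unit of stored-field serialization.
interface FieldData {
    void read(DataInput in) throws IOException;    // called by a FieldsReader
    void write(DataOutput out) throws IOException; // called by a FieldsWriter
    byte[] value();
}

// Hypothetical: the factory an IndexReader would be given.
interface FieldDataFactory {
    FieldData newFieldData();
}

// A FieldData that stores bytes verbatim, as a length-prefixed record.
class PlainFieldData implements FieldData {
    private byte[] data = new byte[0];

    public void read(DataInput in) throws IOException {
        int len = in.readInt();
        data = new byte[len];
        in.readFully(data);
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(data.length);
        out.write(data);
    }

    public byte[] value() { return data; }

    void set(byte[] b) { data = b; }
}

public class FieldDataSketch {
    public static void main(String[] args) throws IOException {
        PlainFieldData fd = new PlainFieldData();
        fd.set("stored value".getBytes("UTF-8"));

        // FieldsWriter side: serialize the field data.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        fd.write(new DataOutputStream(bytes));

        // FieldsReader side: deserialize into a fresh FieldData.
        PlainFieldData back = new PlainFieldData();
        back.read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(new String(back.value(), "UTF-8")); // prints "stored value"
    }
}
```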

cheers,
Nicolas



gsingers at syr

Aug 16, 2006, 5:51 AM

Post #14 of 17 (5750 views)
Permalink
Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]

On Aug 16, 2006, at 8:32 AM, Nicolas Lalevée wrote:

> Hi,
>
> In the issue, you wrote that "This way the indexing level just stores
> opaque binary fields, and then Document handles compress/uncompressing
> as needed."
>
> I have looked into the Lucene code, and it seems to me that it is Field
> that should take care of compressing and uncompressing, while
> FieldsReader and FieldsWriter should only see binary data.
> Or do you mean that compression should be completely external to Lucene?
>
>

I believe the consensus is it should be done externally.

> In fact, at the end of the other thread "Flexible index format /
> Payloads Cont'd", I was discussing how to customize the way data are
> stored. I have looked deeper into the code and I think I have found a
> way to do so. Since you can change the way data are stored, you can
> also define the compression level, or plug in your own compression
> algorithm. I will show you a patch, but I have modified so much code
> over my several tries that I first need to remove the unnecessary
> changes. To describe it shortly:
> - I have provided a way to supply your own FieldsReader and
> FieldsWriter (via a factory). To create an IndexReader, you have to
> provide that factory; the current API just uses a default factory.
> - I have moved the field-data reading code of FieldsReader and
> FieldsWriter to a new class, FieldData. The FieldsReader instantiates
> a FieldData, calls fieldData.read(input), and then does
> new Field(fieldData, ...). The FieldsWriter calls
> field.getFieldData().write(output).
> - So by extending FieldsReader you can provide your own implementation
> of FieldData, and thus control how data are stored and read.
> The tests pass successfully, but I have one issue with this design: in
> the current design we can read an index in an old format and just do a
> writer.addIndexes() into a new format. With the new design you cannot,
> because the writer will use the FieldData.write provided by the reader.
> To be continued...

I would love to see this patch. I think one could make a pretty good
argument for this kind of implementation being done "cleanly"; that
is, it shouldn't necessarily involve reworking the internals, but
instead could represent the foundation for a new, codec-based
indexing mechanism (with an implementation that can read/write the
existing file format).


>
> cheers,
> Nicolas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
> For additional commands, e-mail: java-dev-help [at] lucene
>

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886






rengels at ix

Aug 16, 2006, 5:33 PM

Post #15 of 17 (5765 views)
Permalink
Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]

I just think the compressed field type should be removed from Lucene altogether. Only the binary field type should remain, and the application can compress/uncompress fields externally using a facade/containment hierarchy around Document.

That is:

class MyDocument {
    Document doc;

    String getField(String name) {
        if (isCompressed(name)) {
            return decompress(doc.getBinaryField(name));
        } else {
            return doc.getField(name);
        }
    }
}

Or some such thing, and not deal with compression at the Lucene level. For Lucene to deal with the compression, you would really need to settle on the compression type and parameters, and how they would be stored; otherwise cross-platform implementations (e.g. Plucene) would never be able to access the index. If the compression were external, all an implementation needs is binary field support, and it would only be unable to access the compressed fields if it did not have a suitable way to decompress them.

Otherwise, I think you need a much more advanced compression scheme, similar to the PDF specification, because different fields would ideally be compressed using different algorithms, and forcing one size to fit all doesn't normally work well in such a low-level library.
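A minimal sketch of the external approach described above, using java.util.zip directly. The helper names here are illustrative, not part of any Lucene API; the point is that the application, not Lucene, chooses the compression level before storing the bytes in a plain binary field:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ExternalCompression {
    // Compress a field value before storing it as a binary field.
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a stored binary field back into the original bytes.
    static byte[] decompress(byte[] input) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] original = "some stored field text".getBytes("UTF-8");
        // The application picks the level (BEST_SPEED here, not the
        // BEST_COMPRESSION that FieldsWriter hardwires).
        byte[] packed = compress(original, Deflater.BEST_SPEED);
        byte[] unpacked = decompress(packed);
        System.out.println(new String(unpacked, "UTF-8")); // prints "some stored field text"
    }
}
```

With this in place, a wrapper like the MyDocument class above only has to call compress() before adding the binary field and decompress() after reading it back.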









gsingers at syr

Aug 16, 2006, 6:25 PM

Post #16 of 17 (5750 views)
Permalink
Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]

I agree. I would vote for deprecating the compression stuff. I am
still interested in the flexible indexing part mentioned later in
Nicolas' response, but that is a separate thread.










nicolas.lalevee at anyware-tech

Aug 22, 2006, 9:45 AM

Post #17 of 17 (5656 views)
Permalink
Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields [In reply to]

On Wednesday 16 August 2006 at 14:51, Grant Ingersoll wrote:
>
> I would love to see this patch. I think one could make a pretty good
> argument for this kind of implementation being done "cleanly", that
> is, it shouldn't necessarily involve reworking the internals, but
> instead could represent the foundation for a new, codec based
> indexing mechanism (with an implementation that can read/write the
> existing file format.)

here it is : https://issues.apache.org/jira/browse/LUCENE-662

enjoy !

Nicolas

--
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com

