
gsingers at syr
Aug 16, 2006, 6:25 PM
Post #16 of 17
Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields
I agree. I would vote for deprecating the compression stuff.

I am still interested in the flexible indexing part mentioned later in
Nicolas' response, but that is a separate thread.

On Aug 16, 2006, at 8:33 PM, Robert Engels wrote:

> I just think the compressed field type should be removed from
> Lucene altogether. Only the binary field type should remain, and
> the application can externally compress/uncompress fields using a
> facade/containment hierarchy around Document.
>
> That is
>
> class MyDocument {
>   Document doc;
>
>   String getField(String name) {
>     if (isCompressed(name)) {
>       return decompress(doc.getBinaryField(name));
>     } else {
>       return doc.getField(name);
>     }
>   }
> }
>
> Or some such thing, and not deal with the compression at the Lucene
> level. In order to have Lucene deal with the compression, you would
> really need to settle on the compression type, its parameters, and
> how they would be stored. Otherwise cross-platform implementations
> (such as Plucene) would never be able to read the index. If the
> compression were external, all an implementation needs is binary
> field support, and it would only be unable to access the compressed
> fields if it did not have a suitable way to decompress them.
>
> Otherwise, I think you need a much more advanced compression scheme,
> similar to the PDF specification, because different fields would
> ideally be compressed using different algorithms, and forcing a
> one-size-fits-all approach doesn't normally work well in such a
> low-level library.
>
> -----Original Message-----
>> From: Grant Ingersoll <gsingers [at] syr>
>> Sent: Aug 16, 2006 6:51 AM
>> To: java-dev [at] lucene
>> Subject: Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP
>> compression level for compressed fields
>>
>> On Aug 16, 2006, at 8:32 AM, Nicolas Lalevée wrote:
>>
>>> Hi,
>>>
>>> In the issue, you wrote that "This way the indexing level just
>>> stores opaque binary fields, and then Document handles
>>> compress/uncompressing as needed."
>>>
>>> I have looked into the Lucene code, and it seems to me that it is
>>> Field that should take care of compress/uncompress, and it is the
>>> FieldsReader and FieldsWriter that should only view binary data.
>>> Or do you mean that compression should be completely external to
>>> Lucene?
>>
>> I believe the consensus is it should be done externally.
>>
>>> In fact, at the end of the other thread "Flexible index format /
>>> Payloads Cont'd", I was discussing how to customize the way data
>>> are stored. So I have looked deeper into the code and I think I
>>> have found a way to do so. And since you can change the way data
>>> are stored, you can also define the compression level, or handle
>>> your own compression algorithm. I will show you a patch, but I
>>> have modified so much code during my several attempts that I first
>>> need to remove the unnecessary changes. To describe it briefly:
>>> - I have provided a way to supply your own FieldsReader and
>>>   FieldsWriter (via a factory). To create an IndexReader, you have
>>>   to provide that factory; the current API just uses a default
>>>   factory.
>>> - I have moved the code of FieldsReader and FieldsWriter that
>>>   reads and writes the field data to a new class, FieldData. The
>>>   FieldsReader instantiates a FieldData, does a
>>>   fielddata.read(input), and does a new Field(fielddata,...). The
>>>   FieldsWriter does a field.getFieldData().write(output);
>>> - so by extending FieldsReader, you can provide your own
>>>   implementation of FieldData, and thereby implement how data are
>>>   stored and read in whatever way you want.
>>> The tests pass successfully, but I have an issue with that design:
>>> one thing that I think is important is that in the current design,
>>> we can read an index in an old format, and just do a
>>> writer.addIndexes() into a new format. With the new design, you
>>> cannot, because the writer will use the FieldData.write provided
>>> by the reader.
>>> To be continued...
>>
>> I would love to see this patch. I think one could make a pretty good
>> argument for this kind of implementation being done "cleanly", that
>> is, it shouldn't necessarily involve reworking the internals, but
>> instead could represent the foundation for a new, codec-based
>> indexing mechanism (with an implementation that can read/write the
>> existing file format).
>>
>>> cheers,
>>> Nicolas

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene
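
For readers following the "do it externally" argument, the facade Robert outlines can be sketched without touching Lucene at all. What follows is only a minimal sketch under stated assumptions: a plain map stands in for a Document's binary stored fields, and the names CompressingDocument, putCompressed, putPlain and rawBytes are hypothetical. The only real API used is java.util.zip, whose Deflater takes the compression level that LUCENE-648 asks to make configurable.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical facade: the application compresses, the store only sees opaque bytes. */
class CompressingDocument {
    // Assumption: stand-in for a Document's binary stored fields (name -> bytes).
    private final Map<String, byte[]> binaryFields = new HashMap<>();
    // The application, not the index, remembers which fields are compressed.
    private final Set<String> compressedNames = new HashSet<>();
    private final int level;

    /** @param level a java.util.zip.Deflater level, e.g. Deflater.BEST_SPEED */
    CompressingDocument(int level) {
        this.level = level;
    }

    void putCompressed(String name, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        binaryFields.put(name, compress(utf8, level));
        compressedNames.add(name);
    }

    void putPlain(String name, String value) {
        binaryFields.put(name, value.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns the decoded field value, or null if the field is absent. */
    String getField(String name) {
        byte[] raw = binaryFields.get(name);
        if (raw == null) {
            return null;
        }
        byte[] bytes = compressedNames.contains(name) ? decompress(raw) : raw;
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Exposes the stored bytes, i.e. what a binary-only index would hold. */
    byte[] rawBytes(String name) {
        return binaryFields.get(name);
    }

    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    static byte[] decompress(byte[] input) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(input);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported data");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
    }
}
```

With this shape, the compression level and algorithm are purely an application decision: a caller constructs `new CompressingDocument(Deflater.BEST_COMPRESSION)`, stores large text with `putCompressed`, and reads everything back through `getField`. Another implementation reading the same bytes would only need binary field support plus a zlib decoder for the fields it cares about.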