
Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes



jira at apache

May 8, 2006, 2:03 PM

Post #1 of 2 (460 views)
[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

[ http://issues.apache.org/jira/browse/LUCENE-510?page=comments#action_12378519 ]

Marvin Humphrey commented on LUCENE-510:
----------------------------------------

The following patch...

* Changes Lucene to use byte counts as the prefix to all written Strings
* Changes Lucene to write standard UTF-8 rather than Modified UTF-8
* Adds the new test classes MockIndexOutput and TestIndexOutput
* Increases the number of tests in TestIndexInput
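
To illustrate the first two points, here is a minimal sketch, not the patch itself. It assumes the length prefix is a VInt-style variable-length integer (7 data bits per byte, high bit set when more bytes follow); the class and method names are hypothetical. It writes a string as a byte count followed by standard UTF-8, and uses the JDK's DataOutputStream.writeUTF to show how Modified UTF-8 differs in size for a supplementary character.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Lucene or the LUCENE-510 patch. */
final class ByteCountStringSketch {

    /** Writes a variable-length int: 7 data bits per byte, high bit set when more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Encodes the string as a byte-count prefix followed by standard UTF-8 bytes. */
    static byte[] writeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length); // length in bytes, not in Java chars
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    /** Payload size of the JDK's Modified UTF-8 encoding, for comparison only. */
    static int modifiedUtf8Length(String s) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        new DataOutputStream(out).writeUTF(s);
        return out.size() - 2; // writeUTF prepends a 2-byte length field
    }

    public static void main(String[] args) throws IOException {
        String clef = "\uD834\uDD1E"; // U+1D11E: one code point, two Java chars
        System.out.println(clef.length());                // 2 Java chars
        System.out.println(writeString(clef).length - 1); // 4 bytes of standard UTF-8
        System.out.println(modifiedUtf8Length(clef));     // 6 bytes of Modified UTF-8
    }
}
```

With a byte-count prefix, a reader can skip or bulk-read a string without decoding it, which a character-count prefix does not allow.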

It also slows Lucene down -- indexing takes around a 20% speed hit. It would be possible to submit a patch which had a smaller impact on performance, but this one is already over 700 lines long, and its goal is to achieve standard UTF-8 compliance and modify the definition of Lucene strings as simply and reliably as possible. Optimization patches can now be submitted which build upon this one.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
> Key: LUCENE-510
> URL: http://issues.apache.org/jira/browse/LUCENE-510
> Project: Lucene - Java
> Type: Improvement

> Components: Store
> Versions: 2.1
> Reporter: Doug Cutting
> Fix For: 2.1

>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters. This issue has been discussed at:
> http://www.mail-archive.com/java-dev [at] lucene/msg01970.html
> We must increment the file format number to indicate this change. At least the format number in the segments file should change.
> I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


cowtowncoder at yahoo

May 8, 2006, 2:21 PM

Post #2 of 2 (411 views)
Re: [jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

--- "Marvin Humphrey (JIRA)" <jira [at] apache> wrote:

...
> It also slows Lucene down -- indexing takes around a
> 20% speed hit. It would be possible to submit a
> patch which had a smaller impact on performance, but
> this one is already over 700 lines long, and its
> goal is to achieve standard UTF-8 compliance and
> modify the definition of Lucene strings as simply
> and reliably as possible. Optimization patches can
> now be submitted which build upon this one.

I'm quite sure the UTF-8 decoding loop can be improved
quite a bit once the patch is merged, so the eventual
performance hit will probably be lower (assuming this
is a hot spot). Using a tighter inner loop for
single-byte values can give a significant boost (up to
a 50% speedup compared to the default UTF-8 decoder
that JDK 1.5 ships with).
In this case, it's probably best to isolate the hot
spot and measure the impact of changes on it alone
while working on this part, since the direct impact
may otherwise be hard to measure. Then measure the
total effect when integrating the change.
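
A rough sketch of what such a tighter inner loop might look like follows. It is an illustration only, not code from the patch or the JDK: the class name is hypothetical, the input is assumed to be well-formed standard UTF-8, and no validation is done. Whether it is actually faster would need to be measured.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical decoder for illustration; assumes well-formed standard UTF-8 input. */
final class Utf8FastPathSketch {

    /** Decodes bytes[offset, offset + length) into a String without validating the input. */
    static String decode(byte[] bytes, int offset, int length) {
        char[] out = new char[length]; // UTF-16 never needs more chars than UTF-8 bytes
        int end = offset + length;
        int i = offset;
        int o = 0;
        while (i < end) {
            int b = bytes[i++];
            if (b >= 0) {
                out[o++] = (char) b;
                // Tight inner loop: runs of single-byte (ASCII) values cost one test each.
                while (i < end && bytes[i] >= 0) {
                    out[o++] = (char) bytes[i++];
                }
                continue;
            }
            b &= 0xFF;
            if ((b & 0xE0) == 0xC0) {        // 2-byte sequence
                out[o++] = (char) (((b & 0x1F) << 6) | (bytes[i++] & 0x3F));
            } else if ((b & 0xF0) == 0xE0) { // 3-byte sequence
                out[o++] = (char) (((b & 0x0F) << 12)
                        | ((bytes[i++] & 0x3F) << 6)
                        | (bytes[i++] & 0x3F));
            } else {                         // 4-byte sequence: becomes a surrogate pair
                int cp = ((b & 0x07) << 18)
                        | ((bytes[i++] & 0x3F) << 12)
                        | ((bytes[i++] & 0x3F) << 6)
                        | (bytes[i++] & 0x3F);
                o += Character.toChars(cp, out, o);
            }
        }
        return new String(out, 0, o);
    }

    public static void main(String[] args) {
        byte[] utf8 = "plain ascii text".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8, 0, utf8.length));
    }
}
```

Timing this method alone against the stock decoder on representative term data would be one way to isolate the hot spot before measuring the end-to-end indexing time.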

That is to say, I wouldn't worry too much about the
initial hit; much or most of it can be optimized away
quite soon, just as you suggested.

-+ Tatu +-


