
Mailing List Archive: Lucene: Java-Dev

[jira] [Updated] (LUCENE-3820) Wrong trailing index calculation in PatternReplaceCharFilter

 

 



jira at apache

Feb 22, 2012, 6:05 PM

Post #1 of 5 (88 views)
[jira] [Updated] (LUCENE-3820) Wrong trailing index calculation in PatternReplaceCharFilter

[ https://issues.apache.org/jira/browse/LUCENE-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated LUCENE-3820:
--------------------------------

Attachment: LUCENE-3820.patch

A patch with a reimplementation of getReplaceBlock and a test case that fails with an ArrayIndexOutOfBoundsException (AIOOB). To reproduce the error, apply the test changes without modifying PatternReplaceCharFilter.

> Wrong trailing index calculation in PatternReplaceCharFilter
> ------------------------------------------------------------
>
> Key: LUCENE-3820
> URL: https://issues.apache.org/jira/browse/LUCENE-3820
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-3820.patch
>
>
> I need to use PatternReplaceCharFilter's index corrections directly and it fails for me -- the trailing index is not mapped correctly for a pattern "\\.[\\s]*" and replacement ".", input "A. .B.".
> I tried to understand the logic in getReplaceBlock but I eventually failed and simply rewrote it from scratch. After my changes a few tests don't pass but I don't know if it's the tests that are screwed up or my logic. In essence, the difference between the previous implementation and my implementation is how indexes are mapped for shorter replacements. I shift indexes of shorter regions to the "right" of the original index pool and the previous patch seems to squeeze them to the left (don't know why though).
> If anybody remembers how it's supposed to work, feel free to correct me.
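
For readers who want to see the reported case concretely, here is a minimal sketch of the two index-mapping strategies described above. It uses only java.util.regex and is not the Lucene implementation or the attached patch. The class OffsetMappingSketch, its alignRight flag, and the reading of "left" and "right" alignment are assumptions made for illustration, and the replacement is treated as literal text (no group references).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not Lucene code: literal regex replacement plus an offset map. */
final class OffsetMappingSketch {
    /** Text after replacement. */
    final String output;
    /** inputOffsets[i] is the input offset of output position i; the last entry is the trailing offset. */
    final int[] inputOffsets;

    private OffsetMappingSketch(String output, int[] inputOffsets) {
        this.output = output;
        this.inputOffsets = inputOffsets;
    }

    static OffsetMappingSketch replace(String input, Pattern pattern, String replacement, boolean alignRight) {
        StringBuilder out = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        int last = 0;
        while (m.find()) {
            // Unmatched characters keep their original offsets.
            for (int i = last; i < m.start(); i++) {
                out.append(input.charAt(i));
                offsets.add(i);
            }
            int matchLength = m.end() - m.start();
            // For a shorter replacement, "right" alignment maps onto the tail of the match.
            int shift = alignRight ? Math.max(0, matchLength - replacement.length()) : 0;
            // Clamp so a longer replacement never maps past the matched region.
            int lastMatched = Math.max(m.start(), m.end() - 1);
            for (int i = 0; i < replacement.length(); i++) {
                out.append(replacement.charAt(i));
                offsets.add(Math.min(m.start() + shift + i, lastMatched));
            }
            last = m.end();
        }
        for (int i = last; i < input.length(); i++) {
            out.append(input.charAt(i));
            offsets.add(i);
        }
        // Trailing index: the end of the output corresponds to the end of the input.
        offsets.add(input.length());
        int[] map = new int[offsets.size()];
        for (int i = 0; i < map.length; i++) {
            map[i] = offsets.get(i);
        }
        return new OffsetMappingSketch(out.toString(), map);
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\.[\\s]*");
        OffsetMappingSketch left = replace("A. .B.", p, ".", false);
        OffsetMappingSketch right = replace("A. .B.", p, ".", true);
        System.out.println(left.output + " " + Arrays.toString(left.inputOffsets));   // A..B. [0, 1, 3, 4, 5, 6]
        System.out.println(right.output + " " + Arrays.toString(right.inputOffsets)); // A..B. [0, 2, 3, 4, 5, 6]
    }
}
```

In this sketch the two strategies differ only for the first match (". " replaced by "."), and both map the trailing output index 5 to the input length 6, which is the mapping the report says was computed incorrectly.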

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Feb 22, 2012, 6:35 PM

Post #2 of 5 (84 views)
[jira] [Updated] (LUCENE-3820) Wrong trailing index calculation in PatternReplaceCharFilter [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3820:
--------------------------------

Attachment: LUCENE-3820_test.patch

Here's a simple random test showing some existing bugs in the filter:
* There are offset problems, as Dawid noticed.
* The block buffer should always be oversized by one character: if a block ends on a high surrogate (rare), it should do one additional read() so that it doesn't create invalid Unicode.
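
The second point can be sketched as follows. This is only an illustration of the idea, assuming a hypothetical SurrogateSafeBlockReader class; it is not the filter's actual buffering code. The buffer is allocated one char larger than the block size, and one extra char is read when the block would otherwise end on a high surrogate.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not Lucene code: block reads that never split a surrogate pair. */
final class SurrogateSafeBlockReader {

    /** Reads up to blockSize chars, plus one more if the block would end on a high surrogate. Returns null at end of input. */
    static String readBlock(Reader in, int blockSize) throws IOException {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        char[] buffer = new char[blockSize + 1]; // oversized by one char
        int length = 0;
        while (length < blockSize) {
            int n = in.read(buffer, length, blockSize - length);
            if (n < 0) {
                break; // end of input
            }
            length += n;
        }
        if (length == 0) {
            return null;
        }
        if (length == blockSize && Character.isHighSurrogate(buffer[length - 1])) {
            int c = in.read(); // one additional read to complete the pair
            if (c >= 0) {
                buffer[length++] = (char) c;
            }
        }
        return new String(buffer, 0, length);
    }

    /** Convenience wrapper for in-memory text. */
    static List<String> splitIntoBlocks(String text, int blockSize) {
        List<String> blocks = new ArrayList<>();
        try (Reader in = new StringReader(text)) {
            for (String block = readBlock(in, blockSize); block != null; block = readBlock(in, blockSize)) {
                blocks.add(block);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return blocks;
    }

    public static void main(String[] args) {
        // U+1F600 is encoded as the surrogate pair \uD83D\uDE00 and straddles the 3-char boundary.
        List<String> blocks = splitIntoBlocks("ab\uD83D\uDE00cd", 3);
        for (String block : blocks) {
            System.out.println(block.length()); // prints 4, then 2
        }
    }
}
```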





jira at apache

Feb 22, 2012, 7:03 PM

Post #3 of 5 (84 views)
[jira] [Updated] (LUCENE-3820) Wrong trailing index calculation in PatternReplaceCharFilter [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3820:
--------------------------------

Attachment: LUCENE-3820_test.patch

Updated patch; this one tests only ASCII (to avoid stupid problems in outdated regex support).

But there are a lot of offset problems (perhaps this corresponds to the warning in the class's javadocs?), including things like offsets being corrected to negative numbers...
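
As a rough illustration of the kind of invariant a randomized test can assert (this is not the attached LUCENE-3820_test.patch), the sketch below generates random ASCII inputs, applies a literal regex replacement while recording an offset map, and checks that corrected offsets are never negative, never exceed the input length, and never move backwards. The class RandomOffsetCheckSketch, its method names, the alphabet, and the sample patterns are all assumptions; the patterns are chosen so that they never match the empty string.

```java
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.IntStream;

/** Hypothetical randomized invariant check, not Lucene test code. */
final class RandomOffsetCheckSketch {

    /** Input offsets for output positions 0..outputLength (inclusive); the pattern must not match the empty string. */
    static int[] correctedOffsets(String input, Pattern pattern, String replacement) {
        IntStream.Builder offsets = IntStream.builder();
        Matcher m = pattern.matcher(input);
        int last = 0;
        while (m.find()) {
            for (int i = last; i < m.start(); i++) {
                offsets.add(i); // unmatched text keeps its offsets
            }
            for (int i = 0; i < replacement.length(); i++) {
                offsets.add(Math.min(m.start() + i, m.end() - 1)); // stay inside the match
            }
            last = m.end();
        }
        for (int i = last; i < input.length(); i++) {
            offsets.add(i);
        }
        offsets.add(input.length()); // trailing offset
        return offsets.build().toArray();
    }

    /** Returns a description of the first broken invariant, or null if all hold. */
    static String findViolation(int[] offsets, int inputLength) {
        int previous = 0;
        for (int i = 0; i < offsets.length; i++) {
            if (offsets[i] < 0) {
                return "negative offset at output position " + i;
            }
            if (offsets[i] > inputLength) {
                return "offset past end of input at output position " + i;
            }
            if (offsets[i] < previous) {
                return "offset moves backwards at output position " + i;
            }
            previous = offsets[i];
        }
        return null;
    }

    /** Runs the check on random ASCII inputs; returns the first violation found, or null. */
    static String runRandomChecks(long seed, int iterations) {
        Random random = new Random(seed);
        String alphabet = "ab. ";
        Pattern[] patterns = {
            Pattern.compile("\\.[\\s]*"), Pattern.compile("a+"), Pattern.compile("[ab]\\.")
        };
        String[] replacements = {"", ".", "xyz"};
        for (int iter = 0; iter < iterations; iter++) {
            StringBuilder sb = new StringBuilder();
            int length = random.nextInt(20);
            for (int i = 0; i < length; i++) {
                sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
            }
            String input = sb.toString();
            Pattern pattern = patterns[random.nextInt(patterns.length)];
            String replacement = replacements[random.nextInt(replacements.length)];
            String violation = findViolation(correctedOffsets(input, pattern, replacement), input.length());
            if (violation != null) {
                return violation + " (input \"" + input + "\", pattern " + pattern + ")";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String violation = runRandomChecks(42L, 1000);
        System.out.println(violation == null ? "no violations" : violation);
    }
}
```

A fixed seed keeps failures reproducible; a real test against the filter would compare the filter's own corrected offsets against such invariants rather than the sketch's map.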




jira at apache

Feb 23, 2012, 1:43 AM

Post #4 of 5 (81 views)
[jira] [Updated] (LUCENE-3820) Wrong trailing index calculation in PatternReplaceCharFilter [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated LUCENE-3820:
--------------------------------

Attachment: LUCENE-3820.patch

A simplifying patch that includes Robert's random tests (and passes).

I've made a deliberate decision to deprecate and not use block delimiters and block processing. If you think this is a backwards no-no then feel free to correct this patch... I think block processing may be worth dropping given the code clarity without it.




jira at apache

Feb 27, 2012, 5:07 AM

Post #5 of 5 (78 views)
[jira] [Updated] (LUCENE-3820) Wrong trailing index calculation in PatternReplaceCharFilter [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated LUCENE-3820:
--------------------------------

Description: Reimplementation of PatternReplaceCharFilter to pass randomized tests (used to throw exceptions previously). Simplified code, dropped boundary characters, full input buffered for pattern matching. (was: I need to use PatternReplaceCharFilter's index corrections directly and it fails for me -- the trailing index is not mapped correctly for a pattern "\\.[\\s]*" and replacement ".", input "A. .B.".

I tried to understand the logic in getReplaceBlock but I eventually failed and simply rewrote it from scratch. After my changes a few tests don't pass but I don't know if it's the tests that are screwed up or my logic. In essence, the difference between the previous implementation and my implementation is how indexes are mapped for shorter replacements. I shift indexes of shorter regions to the "right" of the original index pool and the previous patch seems to squeeze them to the left (don't know why though).

If anybody remembers how it's supposed to work, feel free to correct me?)
Fix Version/s: 3.6

> Wrong trailing index calculation in PatternReplaceCharFilter
> ------------------------------------------------------------
>
> Key: LUCENE-3820
> URL: https://issues.apache.org/jira/browse/LUCENE-3820
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3820.patch, LUCENE-3820.patch, LUCENE-3820_test.patch, LUCENE-3820_test.patch
>
>
> Reimplementation of PatternReplaceCharFilter to pass randomized tests (used to throw exceptions previously). Simplified code, dropped boundary characters, full input buffered for pattern matching.

