
Mailing List Archive: Lucene: Java-Dev

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

 

 



jira at apache

Feb 23, 2012, 3:05 PM

Post #1 of 24 (185 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215172#comment-13215172 ]

Naomi Dushay commented on LUCENE-3821:
--------------------------------------

How's this for a document:

<doc>
<str name="id">1</str>
<str name="title_245a_display">The Beatles as musicians</str>
<str name="title_245a_search">The Beatles as musicians :</str>
<str name="title_245c_display">Walter Everett</str>
<str name="title_display">The Beatles as musicians : Revolver through the Anthology</str>
<str name="title_full_display">The Beatles as musicians : Revolver through the Anthology / Walter Everett.</str>
<str name="title_245_search">The Beatles as musicians : Revolver through the Anthology / Walter Everett.</str>
<str name="title_sort">Beatles as musicians Revolver through the Anthology</str>
<str name="all_search">The Beatles as musicians : Revolver through the Anthology</str>
</doc>

> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
> Key: LUCENE-3821
> URL: https://issues.apache.org/jira/browse/LUCENE-3821
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 3.5, 4.0
> Reporter: Naomi Dushay
> Attachments: schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop it's not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Feb 23, 2012, 3:23 PM

Post #2 of 24 (176 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215191#comment-13215191 ]

Robert Muir commented on LUCENE-3821:
-------------------------------------

I reviewed the random failures: in every failing case, the query contains repeated terms.




jira at apache

Feb 29, 2012, 3:45 AM

Post #3 of 24 (170 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219123#comment-13219123 ]

Michael McCandless commented on LUCENE-3821:
--------------------------------------------

I just @Ignore'd this test... it's creating a lot of Jenkins noise... but we should fix this bug!!



jira at apache

Feb 29, 2012, 4:53 AM

Post #4 of 24 (175 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219156#comment-13219156 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

Fails here too like this:

ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtestmethod=testRandomIncreasingSloppiness -Dtests.seed=-171bbb992c697625:203709d611c854a5:1ca48cb6d33b3f74 -Dargs="-Dfile.encoding=UTF-8"

I'll look into it



jira at apache

Mar 3, 2012, 3:45 PM

Post #5 of 24 (174 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221737#comment-13221737 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

I understand the problem.

It all has to do - as Robert mentioned - with the repeating terms in the phrase query. I am working on a patch - it will change the way that repeats are handled.

Repeating PPs require additional computation. The current SloppyPhraseScorer attempts to do that additional work efficiently, but it oversimplifies and fails to cover all cases.

At its core, each time a repeating PP is selected (from the queue) and propagated, *all* its sibling repeaters are propagated as well, to prevent a case where two repeating PPs point to the same document position (which was the bug that originally triggered handling repeats in this code).

But this is wrong: because it propagates all sibling repeaters, it misses some cases.

Also, the code is hard to read, as Mike noted in LUCENE-2410 ([this comment|https://issues.apache.org/jira/browse/LUCENE-2410?focusedCommentId=12879443&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12879443]).

So this is a chance to also make the code more maintainable.

I have a working version which is not ready to commit yet. All the tests pass except for one, which I think is a bug in ExactPhraseScorer, but maybe I am missing something.

The case that fails is this:

{noformat}
AssertionError: Missing in super-set: doc 706
q1: field:"(j o s) (i b j) (t d)"
q2: field:"(j o s) (i b j) (t d)"~1
td1: [doc=706 score=7.7783184 shardIndex=-1, doc=175 score=6.222655 shardIndex=-1]
td2: [doc=523 score=5.5001016 shardIndex=-1, doc=957 score=5.5001016 shardIndex=-1, doc=228 score=4.400081 shardIndex=-1, doc=357 score=4.400081 shardIndex=-1, doc=390 score=4.400081 shardIndex=-1, doc=503 score=4.400081 shardIndex=-1, doc=602 score=4.400081 shardIndex=-1, doc=757 score=4.400081 shardIndex=-1, doc=758 score=4.400081 shardIndex=-1]
doc 706: Document<stored,indexed,tokenized<field:s o b h j t j z o>>
{noformat}

It seems that q1 too should not match this document?

> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
> Key: LUCENE-3821
> URL: https://issues.apache.org/jira/browse/LUCENE-3821
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 3.5, 4.0
> Reporter: Naomi Dushay
> Assignee: Doron Cohen
> Attachments: LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop it's not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x



jira at apache

Mar 3, 2012, 4:09 PM

Post #6 of 24 (172 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221746#comment-13221746 ]

Michael McCandless commented on LUCENE-3821:
--------------------------------------------

Doron, do you have the seed for that failure? I can dig into ExactPhraseScorer...



jira at apache

Mar 3, 2012, 4:27 PM

Post #7 of 24 (174 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221750#comment-13221750 ]

Michael McCandless commented on LUCENE-3821:
--------------------------------------------

Hmm patch has this:

{noformat}
import backport.api.edu.emory.mathcs.backport.java.util.Arrays;
{noformat}



jira at apache

Mar 3, 2012, 4:33 PM

Post #8 of 24 (174 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221755#comment-13221755 ]

Robert Muir commented on LUCENE-3821:
-------------------------------------

Thanks for digging into the problem Doron!

I'm going to be ecstatic if that crazy test finds bugs both in exact and sloppy phrase scorers :)



jira at apache

Mar 4, 2012, 2:26 AM

Post #9 of 24 (170 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221840#comment-13221840 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

The remaining failure still happens with the updated patch (same seed), and still seems to me like an ExactPhraseScorer bug.

However, it is probably not a simple one: when I add the test below to TestMultiPhraseQuery, it passes. That is, no documents are matched, as expected, although the test supposedly recreates the exact scenario that failed above.

Perhaps ExactPhraseScorer's behavior is also influenced by other docs processed so far.

{code:title=Add this to TestMultiPhraseQuery}
public void test_LUCENE_XYZ() throws Exception {
  Directory indexStore = newDirectory();
  RandomIndexWriter writer = new RandomIndexWriter(random, indexStore);
  add("s o b h j t j z o", "LUCENE-XYZ", writer);

  IndexReader reader = writer.getReader();
  IndexSearcher searcher = newSearcher(reader);

  MultiPhraseQuery q = new MultiPhraseQuery();
  q.add(new Term[] {new Term("body", "j"), new Term("body", "o"), new Term("body", "s")});
  q.add(new Term[] {new Term("body", "i"), new Term("body", "b"), new Term("body", "j")});
  q.add(new Term[] {new Term("body", "t"), new Term("body", "d")});
  assertEquals("Wrong number of hits", 0,
      searcher.search(q, null, 1).totalHits);

  // just make sure no exc:
  searcher.explain(q, 0);

  writer.close();
  searcher.close();
  reader.close();
  indexStore.close();
}
{code}



jira at apache

Mar 4, 2012, 4:29 AM

Post #10 of 24 (171 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221867#comment-13221867 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

Update: apparently MultiPhraseQuery.toString does not print its "holes".

So the query that failed was not:
{noformat}field:"(j o s) (i b j) (t d)"{noformat}

But rather:
{noformat}"(j o s) ? (i b j) ? ? (t d)"{noformat}

Which is a different story: this query should match the document
{noformat}s o b h j t j z o{noformat}
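The effect of the holes can be checked outside Lucene with a small sketch. Everything here is hypothetical illustration (the class `HoleyPhraseSketch` and its methods are not Lucene code); it only assumes that an exact multi-phrase match means every term set lands on one of its alternatives at `start + queryPosition`. With positions 0, 1, 2 (the query as printed) it reports no match, and with positions 0, 2, 5 (the query with its holes) it reports a match at the start of the document.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not Lucene code: exact multi-phrase matching with explicit query positions. */
final class HoleyPhraseSketch {

    /** True if some start offset puts every term set on one of its alternatives. */
    static boolean matches(String[] doc, List<Set<String>> alternatives, int[] queryPositions) {
        for (int start = 0; start < doc.length; start++) {
            boolean ok = true;
            for (int i = 0; i < queryPositions.length && ok; i++) {
                int pos = start + queryPositions[i];
                ok = pos < doc.length && alternatives.get(i).contains(doc[pos]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    static Set<String> set(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    public static void main(String[] args) {
        String[] doc = "s o b h j t j z o".split(" ");
        List<Set<String>> q = Arrays.asList(set("j", "o", "s"), set("i", "b", "j"), set("t", "d"));
        System.out.println(matches(doc, q, new int[] {0, 1, 2})); // query as printed, no holes
        System.out.println(matches(doc, q, new int[] {0, 2, 5})); // query with its holes
    }
}
```

The second call matches at start 0: 's' at 0, 'b' at 2, 't' at 5.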

ExactPhraseScorer finds a match, but SloppyPhraseScorer with slop 1 does not.
So there is still work to do on SloppyPhraseScorer...

(I'll fix MultiPhraseQuery.toString() as well)



jira at apache

Mar 4, 2012, 4:59 AM

Post #11 of 24 (169 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221879#comment-13221879 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

I think I understand the cause.

The current implementation assumes that once we have landed on the first candidate document, it is possible to check whether there are repeating PPs just by comparing the in-doc positions of the terms.

Tricky as it is, while this is true for plain PhrasePositions, it is not true for MultiPhrasePositions. Assume two MPPs, (a m n) and (b x y), and a first candidate document that starts with "a b". The in-doc positions of the two PPs would be 0 and 1 respectively (for 'a' and 'b'), so we would not even detect that there are repetitions, let alone put them in the same group.

MPPs conflict with the current patch in an additional way: it now assumes that each repetition can be assigned a repetition group.

So assume these PPs (and query positions):
0:a 1:b 3:a 4:b 7:c
There are clearly two repetition groups {0:a, 3:a} and {1:b, 4:b},
while 7:c is not a repetition.

But assume these PPs (and query positions):
0:(a b) 1:(b x) 3:a 4:b 7:(c x)
We end up with a single large repetition group:
{0:(a b) 1:(b x) 3:a 4:b 7:(c x)}

I think if the groups are created correctly at the first candidate document, scorer logic would still work, as a collision is decided only when two pps are in the same in-doc-position. The only impact of MPPs would be performance cost: since repetition groups are larger, it would take longer to check if there are repetitions.

Just need to figure out how to detect repetition groups without relying on in-(first-)doc-positions.
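One possible way to build the groups from the query alone, without looking at any document, is to treat PPs that share a term as connected and take the connected components. The sketch below is only an assumption about how that could be done (the class `RepetitionGroupsSketch` is hypothetical and is not what the patch does); it uses a small union-find over PP indexes and reports components with more than one member.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch, not the patch: repetition groups as connected components over shared terms. */
final class RepetitionGroupsSketch {

    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]]; // path halving
            i = parent[i];
        }
        return i;
    }

    /** pps.get(i) holds the terms of the i-th PP; returns groups (of PP indexes) with 2+ members. */
    static List<List<Integer>> groups(List<List<String>> pps) {
        int[] parent = new int[pps.size()];
        for (int i = 0; i < parent.length; i++) {
            parent[i] = i;
        }
        // Union each PP with the first PP that used the same term.
        Map<String, Integer> firstOwner = new HashMap<>();
        for (int i = 0; i < pps.size(); i++) {
            for (String term : pps.get(i)) {
                Integer owner = firstOwner.putIfAbsent(term, i);
                if (owner != null) {
                    parent[find(parent, i)] = find(parent, owner);
                }
            }
        }
        Map<Integer, List<Integer>> byRoot = new TreeMap<>();
        for (int i = 0; i < parent.length; i++) {
            byRoot.computeIfAbsent(find(parent, i), r -> new ArrayList<>()).add(i);
        }
        List<List<Integer>> result = new ArrayList<>();
        for (List<Integer> group : byRoot.values()) {
            if (group.size() > 1) {
                result.add(group);
            }
        }
        return result;
    }
}
```

For the two examples above, PP indexes 0..4 stand for query positions 0, 1, 3, 4, 7. The plain PPs `a b a b c` give the groups [0, 2] and [1, 3], while the MPP variant `(a b) (b x) a b (c x)` collapses into the single group [0, 1, 2, 3, 4].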



jira at apache

Mar 5, 2012, 7:04 PM

Post #12 of 24 (166 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222958#comment-13222958 ]

Robert Muir commented on LUCENE-3821:
-------------------------------------

{quote}
I would love it to be the 3rd if I just knew how to do it. Otherwise I like the 2nd, just need to keep in mind that the random test might from time to time create this scenario and so there will be noise in the test builds.
{quote}

I think there is no problem in fixing "some of the bugs" to improve the behavior, even if it's still not perfect.

We can take our time thinking about how to handle the remaining scenarios... either way I think we should just go
with your judgement call on this one, since you obviously understand it better than anyone else.




jira at apache

Mar 6, 2012, 12:08 AM

Post #13 of 24 (167 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223077#comment-13223077 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

Thanks Robert, okay, I'll continue with option 2 then.

In addition, perhaps we should try harder to build a sloppy version of the current ExactPhraseScorer, for both performance and correctness reasons.

In ExactPhraseScorer, count[posIndex] is incremented by 1 each time the conditions for a match (still) hold.

A sloppy version of this, with N terms and slop=S could increment differently:
{noformat}
1 + N*S at posIndex
1 + N*S - 1 at posIndex-1 and posIndex+1
1 + N*S - 2 at posIndex-2 and posIndex+2
...
1 + N*S - S at posIndex-S and posIndex+S
{noformat}

For S=0, this falls back to only increment by 1 and only at posIndex, same as the ExactPhraseScorer, which makes sense.

Also, the success criterion in ExactPhraseScorer, when checking term k, is, before adding 1 for term k:
* count[posIndex] == k-1
Or, after adding 1 for term k:
* count[posIndex] == k

In the sloppy version the criterion (after adding term k) would be:
* count[posIndex] >= k*(1+N*S)-S

Again, for S=0 this falls back to the ExactPhraseScorer logic:
* count[posIndex] >= k

Mike (and all), correctness wise, what do you think?

If you are wondering why the increment at the actual position is (1 + N*S): it ensures that only posIndexes to which all terms contributed something can match.

I drew an example with 5 terms and slop=2 and so far it seems correct.
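The counting scheme proposed above can be tried out on small inputs with a sketch like the following. It is not ExactPhraseScorer code and makes several assumptions: the class `SloppyCountSketch` is hypothetical, all positions of each term are known up front, each term's position is normalised by its offset in the query, and overlapping windows of the same term keep the larger increment. It returns the posIndexes whose count reaches N*(1+N*S)-S once all N terms are added.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the proposed sloppy counting, not Lucene code. */
final class SloppyCountSketch {

    /**
     * positions[i] holds the document positions of the i-th query term (query offset i).
     * Returns the posIndexes whose total count is at least N*(1+N*S) - S.
     */
    static List<Integer> matches(int[][] positions, int slop) {
        int n = positions.length;
        int maxIncr = 1 + n * slop;
        Map<Integer, Integer> count = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            // Best contribution of term i per posIndex; overlapping windows keep the max.
            Map<Integer, Integer> best = new HashMap<>();
            for (int p : positions[i]) {
                int posIndex = p - i; // normalise by the term's offset in the query
                for (int d = -slop; d <= slop; d++) {
                    best.merge(posIndex + d, maxIncr - Math.abs(d), Math::max);
                }
            }
            for (Map.Entry<Integer, Integer> e : best.entrySet()) {
                count.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        int threshold = n * maxIncr - slop;
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : count.entrySet()) {
            if (e.getValue() >= threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

For the document "a x b" and the query "a b", `matches(new int[][] {{0}, {2}}, 0)` is empty, while slop 1 yields posIndexes 0 and 1, so more than one posIndex can report the same match.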

I also tried 2 terms and slop=5, and this seems correct as well, except that when there is a match, several posIndexes contribute to the score of the same match. I think this is not too bad, since for these parameters the same behavior would occur in all documents. I would be especially forgiving of this if it gets us some of the performance benefits of ExactPhraseScorer.

If we agree on correctness, we need to understand how to implement it and consider the performance effect. The tricky part is the increment at posIndex-n. Say there are 3 terms in the query and one of the terms is found at indexes 10, 15, 18. Assume the slop is 2. Since N=3, the max increment is:
- 1 + N*S = 1 + 3*2 = 7.

So the increments for this term would be (pos, incr):
{noformat}
Pos  Increment
---  ---------
 8   5
 9   6
10   7
11   6
12   5
13   5
14   6
15   7
16   6 = max(6,5)
17   6 = max(5,6)
18   7
19   6
20   5
{noformat}

So when we get to posIndex 17, we know that posIndex 15 contributes 5, but we do not know yet about the contribution of posIndex 18, which is 6, and should be used instead of 5. So some look-ahead (or some fix-back) is required, which will complicate the code.
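The increments for a single term can be reproduced with a few lines of Java. This is only a sketch (the class `TermIncrementsSketch` is hypothetical): it sees all positions of the term at once, so it sidesteps the look-ahead problem rather than solving it the way a streaming scorer would have to.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch, not Lucene code: per-position increments for one term. */
final class TermIncrementsSketch {

    /** Increment per position: (1 + N*S) minus the distance, keeping the max where windows overlap. */
    static Map<Integer, Integer> increments(int[] termPositions, int numTerms, int slop) {
        int maxIncr = 1 + numTerms * slop;
        Map<Integer, Integer> incr = new TreeMap<>();
        for (int p : termPositions) {
            for (int d = -slop; d <= slop; d++) {
                incr.merge(p + d, maxIncr - Math.abs(d), Math::max);
            }
        }
        return incr;
    }

    public static void main(String[] args) {
        // Term found at 10, 15, 18; N=3 terms; slop 2.
        increments(new int[] {10, 15, 18}, 3, 2)
            .forEach((pos, inc) -> System.out.println(pos + " , " + inc));
    }
}
```

At position 17 the two overlapping windows (5 from the occurrence at 15, 6 from the occurrence at 18) resolve to 6, which is the value that cannot be known before position 18 has been read.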

If this seems promising, should probably open a new issue for it, just wanted to get some feedback first.



jira at apache

Mar 6, 2012, 8:49 AM

Post #14 of 24 (166 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223410#comment-13223410 ]

Michael McCandless commented on LUCENE-3821:
--------------------------------------------

Cool!

I haven't fully thought this out (sloppy phrase matching is hard to think about!), but, tentatively, I think this is correct...?



jira at apache

Mar 6, 2012, 11:07 AM

Post #15 of 24 (171 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223554#comment-13223554 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

OK great!

If you did not point out a problem with this up front, there's a good chance it will work, and I'd like to give it a try.

I have a first patch - not working or anything - it opens ExactPhraseScorer a bit for extensions and adds a class (temporary name) - NonExactPhraseScorer.

The idea is to hide in the ChunkState the details of decaying frequencies due to slops. I will try it over the weekend. If we can make it work this way, I'd prefer to do it in this issue rather than committing the other new code for the fix and then replacing it. If that doesn't go quickly, I'll commit the (other) changes to SloppyPhraseScorer and start a new issue.



jira at apache

Mar 6, 2012, 11:11 AM

Post #16 of 24 (169 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223560#comment-13223560 ]

Robert Muir commented on LUCENE-3821:
-------------------------------------

sounds interesting: ExactPhraseScorer really has a lot of useful recent heuristics and optimizations,
especially about when to next() versus advance() and such?

net/net this idea could possibly improve the overall performance of SloppyPhraseScorer



jira at apache

Mar 6, 2012, 12:29 PM

Post #17 of 24 (168 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223629#comment-13223629 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

bq. sounds interesting: ExactPhraseScorer really has a lot of useful recent heuristics and optimizations, especially about when to next() versus advance() and such?

next()/advance() will remain, but it would still be more costly than exact - score cache won't play, because freqs really are float in this case, and also there would be more computations on the way. But let's see it working first...

> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
> Key: LUCENE-3821
> URL: https://issues.apache.org/jira/browse/LUCENE-3821
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 3.5, 4.0
> Reporter: Naomi Dushay
> Assignee: Doron Cohen
> Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x



jira at apache

Mar 6, 2012, 1:55 PM

Post #18 of 24 (169 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223700#comment-13223700 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

I'm afraid it won't solve the problem.

The complexity of SloppyPhraseScorer stems firstly from the slop.
That part has been handled in the scorer for a long time.

Two additional complications are repeating terms and multi-term phrases.
Each one of these, separately, is handled as well.
Their combination, however, is the cause of this discussion.

To prevent two repeating terms from landing on the same document position, we propagate the smaller of them (smaller in its phrase-position, which takes into account both the doc-position and the offset of that term in the query).

Without this special treatment, a phrase query "a b a"~2 might match a document "a b", because both "a"'s (query terms) will land on the same document's "a". This is illegal and is prevented by such propagation.
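
The "a b a"~2 example can be checked with a brute-force Python sketch. This is not Lucene's algorithm: it assumes a match is any assignment of query terms to document positions whose phrase-positions (doc-position minus query-offset) differ by at most the slop, and the `distinct` flag (a hypothetical name) toggles the rule that two query terms may not share a document position.

```python
from itertools import product


def sloppy_match(doc, query, slop, distinct=True):
    """Brute-force check: does `query` match `doc` within `slop`?

    Assumptions of this sketch: a match is an assignment of every
    query term to a document position holding that term such that
    max(phrase-position) - min(phrase-position) <= slop, where
    phrase-position = doc-position - query-offset. With distinct=True,
    two query terms may not share a document position.
    """
    candidates = [
        [pos for pos, term in enumerate(doc) if term == q] for q in query
    ]
    for assignment in product(*candidates):
        if distinct and len(set(assignment)) < len(assignment):
            continue  # two query terms landed on the same doc position
        phrase_positions = [pos - off for off, pos in enumerate(assignment)]
        if max(phrase_positions) - min(phrase_positions) <= slop:
            return True
    return False


doc = "a b".split()
query = "a b a".split()
print(sloppy_match(doc, query, slop=2, distinct=False))  # True
print(sloppy_match(doc, query, slop=2, distinct=True))   # False
```

Without the distinctness rule both "a" query terms use document position 0, giving phrase-positions 0, 0 and -2, a spread of 2 that fits the slop.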

But when one of the repeating terms is a multi-term, it is not possible to know which of the repeating terms to propagate. This is the unsolved bug.

Now, back to current ExactPhraseScorer.
It does not have this problem with repeating terms.
But not because of the different algorithm - rather because of the different scenario.
It does not have this problem because exact phrase scoring does not have it.
In exact phrase scoring, a match is declared only when all PPs are in the same phrase position.
Recall that phrase-position = doc-position - query-offset. It follows that when two PPs with different query offsets are in the same phrase-position, their doc-positions cannot be the same, and therefore no special handling is needed for repeating terms in exact phrase scorers.
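
A small Python sketch of this argument, using plain sets rather than Lucene's PhrasePositions (the function name `exact_phrase_matches` is hypothetical): an exact match exists at a phrase-position only if every query term has an occurrence that maps to it.

```python
def exact_phrase_matches(doc, query):
    """Return the phrase-positions at which `query` matches `doc` exactly."""
    common = None
    for offset, term in enumerate(query):
        # phrase-position = doc-position - query-offset
        phrase_positions = {
            pos - offset for pos, t in enumerate(doc) if t == term
        }
        common = phrase_positions if common is None else common & phrase_positions
    return sorted(common or [])


print(exact_phrase_matches("a b a b a".split(), "a b a".split()))  # [0, 2]
print(exact_phrase_matches("a b".split(), "a b a".split()))        # []
```

In the second call the two "a" query terms map the single document "a" to phrase-positions 0 and -2, so they can never agree, and no repeat handling is needed.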

However, once we add that sloppy-decaying frequency, a given posIndex will match different phrase-positions. This is because of the slop. So two PPs might land on the same doc-position, and then we start again...

This is really too bad. Sorry for the lengthy post, hopefully this would help when someone wants to get into this.

Back to option 2.



jira at apache

Mar 6, 2012, 4:05 PM

Post #19 of 24 (166 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223839#comment-13223839 ]

Naomi Dushay commented on LUCENE-3821:
--------------------------------------

I would be glad to try out a nightly build with the patch as is against our tests - I would be glad to get the 80% solution if it fixes my bug. I haven't compiled from source yet, though, so am inclined to wait for the patch getting posted to the nightly.



jira at apache

Mar 6, 2012, 4:24 PM

Post #20 of 24 (171 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223856#comment-13223856 ]

Robert Muir commented on LUCENE-3821:
-------------------------------------

{quote}
To prevent two repeating terms from landing on the same document position, we propagate the smaller of them (smaller in its phrase-position, which takes into account both the doc-position and the offset of that term in the query).

Without this special treatment, a phrase query "a b a"~2 might match a document "a b", because both "a"'s (query terms) will land on the same document's "a". This is illegal and is prevented by such propagation.

But when one of the repeating terms is a multi-term, it is not possible to know which of the repeating terms to propagate. This is the unsolved bug.
{quote}

Not understanding really how SloppyPhraseScorer works now, but not trying to add confusion to the issue, I can't help but think
this problem is a variant on LevenshteinAutomata... in fact that was the motivation for the new test, I just stole the testing
methodology from there and applied it to this!

It seems many things are the same but with a few twists:
* fundamentally we are interleaving the streams from the subscorers into the 'index automaton'
* 'query automaton' is produced from the user-supplied terms
* our 'alphabet' is the terms, and holes from position increment are just an additional symbol.
* just like the LevenshteinAutomata case, repeats are problematic because they are different characteristic vectors
* stacked terms at the same position (index or query) just make the automata more complex (so they aren't just strings)

I'm not suggesting we try to re-use any of that code at all, I don't think it will work. But I wonder if we can re-use even
some of the math to redefine the problem more formally to figure out what minimal state/lookahead we need for example...




jira at apache

Mar 9, 2012, 1:32 PM

Post #21 of 24 (167 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226494#comment-13226494 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

{quote}
Not understanding really how SloppyPhraseScorer works now, but not trying to add confusion to the issue, I can't help but think this problem is a variant on LevenshteinAutomata... in fact that was the motivation for the new test, I just stole the testing methodology from there and applied it to this!
{quote}

Interesting! I was not aware of this. I went and read some about this automaton; it is relevant.

{quote}
It seems many things are the same but with a few twists:

* fundamentally we are interleaving the streams from the subscorers into the 'index automaton'
* 'query automaton' is produced from the user-supplied terms
{quote}

True. In fact, the current code works hard to decide on the "correct interleaving order" - while if we had a "Perfect Levenshtein Automaton" that took care of the computation, we could just interleave in the term position order (forget about phrase position and all that) and let the automaton compute the distance.

This might capture the difficulty in making the sloppy phrase scorer correct: it started with the algorithm that was optimized for exact matching, and adapted (hacked?) it for approximate matching.

Instead, starting with a model that fits approximate matching, might be easier and cleaner. I like that.

{quote}
* our 'alphabet' is the terms, and holes from position increment are just an additional symbol.
* just like the LevenshteinAutomata case, repeats are problematic because they are different characteristic vectors
* stacked terms at the same position (index or query) just make the automata more complex (so they aren't just strings)

I'm not suggesting we try to re-use any of that code at all, I don't think it will work. But I wonder if we can re-use even
some of the math to redefine the problem more formally to figure out what minimal state/lookahead we need for example...
{quote}

I agree. I'll think of this.

In the meantime I'll commit this partial fix.



jira at apache

Mar 9, 2012, 4:14 PM

Post #22 of 24 (171 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226623#comment-13226623 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

Committed:
- r1299077 3x
- r1299112 trunk

bq. I would be glad to try out a nightly build with the patch as is against our tests - I would be glad to get the 80% solution if it fixes my bug.

It's in now...

bq. But I wonder if we can re-use even some of the math to redefine the problem more formally to figure out what minimal state/lookahead we need for example...

Robert, this gave me an idea... currently, in case of "collision" between repeaters, we compare them and advance the "lesser" of them (SloppyPhraseScorer.lesser(PhrasePositions, PhrasePositions)) - it should be fairly easy to add lookahead to this logic: if one of the two is multi-term, lesser can also do a lookahead. The amount of lookahead can depend on the slop. I'll give it a try before closing this issue.




jira at apache

Mar 10, 2012, 7:04 AM

Post #23 of 24 (170 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226873#comment-13226873 ]

Robert Muir commented on LUCENE-3821:
-------------------------------------

{quote}
Robert, this gave me an idea... currently, in case of "collision" between repeaters, we compare them and advance the "lesser" of them (SloppyPhraseScorer.lesser(PhrasePositions, PhrasePositions)) - it should be fairly easy to add lookahead to this logic: if one of the two is multi-term, lesser can also do a lookahead. The amount of lookahead can depend on the slop. I'll give it a try before closing this issue.
{quote}

Interesting... it's hard for me to think about, since the edit distance is a little different, but at least in the
levAutomata case the maximum 'context' the thing ever needs is {{2n+1}}, where n is the distance/slop. I don't
know if it applies here... but it seems like it should be at least an upper bound.

On a related note, I think it's possible to implement a more exhaustive test for
SloppyPhraseScorer (rather than relying so much on a random one). The idea would be a twist on
TestLevenshteinAutomata.assertCharVectors: using an alphabet of terms={0,1}, test all possible
'automaton structures'. For SloppyPhraseScorer, that would give us a minimal test method that
tests all the cases...

I'll think on this one...
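
One way such an exhaustive test could look, as a hedged Python sketch rather than a Lucene test: enumerate every short document and query over the alphabet {0,1} and report each case where an exact phrase match exists but the sloppy matcher under test rejects it. All names here are hypothetical, and `reference_sloppy_match` is only a brute-force stand-in for the scorer being tested.

```python
from itertools import product


def exact_match(doc, query):
    """True if `query` occurs in `doc` as a contiguous run of terms."""
    n = len(query)
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))


def reference_sloppy_match(doc, query, slop):
    """Brute-force stand-in for the scorer under test (an assumption)."""
    slots = [[p for p, t in enumerate(doc) if t == q] for q in query]
    for assignment in product(*slots):
        if len(set(assignment)) < len(assignment):
            continue  # repeated query terms must use distinct positions
        phrase_pos = [p - off for off, p in enumerate(assignment)]
        if max(phrase_pos) - min(phrase_pos) <= slop:
            return True
    return False


def find_counterexamples(sloppy, max_len=4, max_slop=2, alphabet=("0", "1")):
    """List every (doc, query, slop) where exact matches but sloppy does not."""
    failures = []
    for doc_len in range(1, max_len + 1):
        for doc in product(alphabet, repeat=doc_len):
            for query_len in range(1, doc_len + 1):
                for query in product(alphabet, repeat=query_len):
                    if not exact_match(doc, query):
                        continue
                    for slop in range(max_slop + 1):
                        if not sloppy(doc, query, slop):
                            failures.append((doc, query, slop))
    return failures


print(find_counterexamples(reference_sloppy_match))  # prints []
```

Passing a different callable as `sloppy` turns the same loop into a check of that implementation against the property from the issue title: whatever the exact matcher finds, the sloppy matcher should find too.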





jira at apache

Mar 12, 2012, 6:28 PM

Post #24 of 24 (152 views)
Permalink
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228118#comment-13228118 ]

Naomi Dushay commented on LUCENE-3821:
--------------------------------------

The commits from March 10 fix my two failing tests - huzzah! Thank you so much! - Naomi

