Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

 

 

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded


jira at apache

Sep 4, 2008, 6:03 AM

Post #1 of 6 (370 views)
Permalink
[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

[ https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628331#action_12628331 ]

Mark Miller commented on LUCENE-1372:
-------------------------------------

Hey Paul,

I agree that your patch is more intuitive than the current behavior, but I don't know how intuitive that is - if the sort worked on multiple tokens, you would expect it to sort lexicographically across each word, and even with your patch it won't, it will just use the first word rather than the last, right? In other words, I see it as a half fix.

So while its low overhead, I wonder if any overhead is worth not getting the full fix? Currently the solution has been that you should only be sorting on single token fields - in fact, there is a check for this (that just isnt very good at checking <g>) that will possibly throw an exception if you sort on a field with multiple tokens - its just not safe unless that check is taken out (FieldCacheImpl string sorting).

It appears that to do this right, we need to pay a cost in the general case and sorting across multiple tokens may not be worth that, as you can get around the limitation by using multiple fields etc now. Personally though, if a patch were to be accepted, I think it would have to fully support the correct sorting and disable that check I mentioned (again, i doubt people want to pay that perf cost though). Finally, even if the committers decide this is a good way to go, the check needs to come out at a minimum.




> Proposal: introduce more sensible sorting when a doc has multiple values for a term
> -----------------------------------------------------------------------------------
>
> Key: LUCENE-1372
> URL: https://issues.apache.org/jira/browse/LUCENE-1372
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.3.2
> Reporter: Paul Cowan
> Priority: Minor
> Attachments: lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting on a field for which multiple values exist for one document. For example, imagine a field "fruit" which is added to a document multiple times, with the values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> if one sorts on the field "fruit", the loop in FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other methods in the various FieldCacheImpl caches) does the following:
> while (termDocs.next()) {
> retArray[termDocs.doc()] = t;
> }
> which means that we look over the terms in their natural order and, on each one, overwrite retArray[doc] with the value for each document with that term. Effectively, this overwriting means that a string sort in this circumstance will sort by the LAST term lexicographically, so the docs above will effecitvely be sorted as if they had the single values ("apple", "banana", "banana", "zebra") which is nonintuitive. To change this to sort on the first time in the TermEnum seems relatively trivial and low-overhead; while it's not perfect (it's not local-aware, for example) the behaviour seems much more sensible to me. Interested to see what people think.
> Patch to follow.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Sep 4, 2008, 3:09 PM

Post #2 of 6 (342 views)
Permalink
[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628480#action_12628480 ]

Paul Smith commented on LUCENE-1372:
------------------------------------

Having a Document sorted last because it has "zebra", even though it has "apple" seems way incorrect. Yes it would be ideal if Lucene _could_ perform the multi-term sort properly, but in the absence of an effective fix in the short term, having the lexographically earlier term 'picked' as the primary sort candidate is likely to generate results that match what users would expect (even if it's not quite perfect).

Right now it looks blatantly silly at the presentation layer when one presents the search results with their data, and show that "apple,zebra" appears last in the list..

> Proposal: introduce more sensible sorting when a doc has multiple values for a term
> -----------------------------------------------------------------------------------
>
> Key: LUCENE-1372
> URL: https://issues.apache.org/jira/browse/LUCENE-1372
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.3.2
> Reporter: Paul Cowan
> Priority: Minor
> Attachments: lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting on a field for which multiple values exist for one document. For example, imagine a field "fruit" which is added to a document multiple times, with the values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> if one sorts on the field "fruit", the loop in FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other methods in the various FieldCacheImpl caches) does the following:
> while (termDocs.next()) {
> retArray[termDocs.doc()] = t;
> }
> which means that we look over the terms in their natural order and, on each one, overwrite retArray[doc] with the value for each document with that term. Effectively, this overwriting means that a string sort in this circumstance will sort by the LAST term lexicographically, so the docs above will effecitvely be sorted as if they had the single values ("apple", "banana", "banana", "zebra") which is nonintuitive. To change this to sort on the first time in the TermEnum seems relatively trivial and low-overhead; while it's not perfect (it's not local-aware, for example) the behaviour seems much more sensible to me. Interested to see what people think.
> Patch to follow.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Sep 4, 2008, 3:35 PM

Post #3 of 6 (340 views)
Permalink
[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628496#action_12628496 ]

Mark Miller commented on LUCENE-1372:
-------------------------------------

Ah, but right now, the documentation will tell you its not supported, and possibly even that you can't do it (though you can because the check that would stop you is basically a guess). So it looks funny when presenting data because you are violating the rules<g> You are not just proposing making it work more unintuitive, but by necessity, you are also proposing Lucene support sorting on a field with multiple tokens in the first place, because the stance right now is that it does not - hence the weak guard around it in the code.


> Proposal: introduce more sensible sorting when a doc has multiple values for a term
> -----------------------------------------------------------------------------------
>
> Key: LUCENE-1372
> URL: https://issues.apache.org/jira/browse/LUCENE-1372
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.3.2
> Reporter: Paul Cowan
> Priority: Minor
> Attachments: lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting on a field for which multiple values exist for one document. For example, imagine a field "fruit" which is added to a document multiple times, with the values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> if one sorts on the field "fruit", the loop in FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other methods in the various FieldCacheImpl caches) does the following:
> while (termDocs.next()) {
> retArray[termDocs.doc()] = t;
> }
> which means that we look over the terms in their natural order and, on each one, overwrite retArray[doc] with the value for each document with that term. Effectively, this overwriting means that a string sort in this circumstance will sort by the LAST term lexicographically, so the docs above will effecitvely be sorted as if they had the single values ("apple", "banana", "banana", "zebra") which is nonintuitive. To change this to sort on the first time in the TermEnum seems relatively trivial and low-overhead; while it's not perfect (it's not local-aware, for example) the behaviour seems much more sensible to me. Interested to see what people think.
> Patch to follow.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Sep 4, 2008, 4:01 PM

Post #4 of 6 (338 views)
Permalink
[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628509#action_12628509 ]

Hoss Man commented on LUCENE-1372:
----------------------------------

bq. Right now it looks blatantly silly at the presentation layer when one presents the search results with their data, and show that "apple,zebra" appears last in the list.

I'm not following this argument. Will it be less silly when {zebra,apple} sorts before {banana} ?

If we're going to break backwards compatibility for FieldCache users, let's break it completely and make the code throw a RuntimeException when it sees that retArray[termDocs.doc()] is non-null ... that way we are quickly alerting the client code that they are doing something very, very wrong.

> Proposal: introduce more sensible sorting when a doc has multiple values for a term
> -----------------------------------------------------------------------------------
>
> Key: LUCENE-1372
> URL: https://issues.apache.org/jira/browse/LUCENE-1372
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.3.2
> Reporter: Paul Cowan
> Priority: Minor
> Attachments: lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting on a field for which multiple values exist for one document. For example, imagine a field "fruit" which is added to a document multiple times, with the values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> if one sorts on the field "fruit", the loop in FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other methods in the various FieldCacheImpl caches) does the following:
> while (termDocs.next()) {
> retArray[termDocs.doc()] = t;
> }
> which means that we look over the terms in their natural order and, on each one, overwrite retArray[doc] with the value for each document with that term. Effectively, this overwriting means that a string sort in this circumstance will sort by the LAST term lexicographically, so the docs above will effecitvely be sorted as if they had the single values ("apple", "banana", "banana", "zebra") which is nonintuitive. To change this to sort on the first time in the TermEnum seems relatively trivial and low-overhead; while it's not perfect (it's not local-aware, for example) the behaviour seems much more sensible to me. Interested to see what people think.
> Patch to follow.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Sep 4, 2008, 4:13 PM

Post #5 of 6 (338 views)
Permalink
[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628513#action_12628513 ]

Paul Smith commented on LUCENE-1372:
------------------------------------

bq. I'm not following this argument. Will it be less silly when {zebra,apple} sorts before {banana} ?

Well, at the presentation layer I don't think you'd present it like that (we don't). We'd sort the list of attributes so that it would appear as "apple,zebra".

> Proposal: introduce more sensible sorting when a doc has multiple values for a term
> -----------------------------------------------------------------------------------
>
> Key: LUCENE-1372
> URL: https://issues.apache.org/jira/browse/LUCENE-1372
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.3.2
> Reporter: Paul Cowan
> Priority: Minor
> Attachments: lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting on a field for which multiple values exist for one document. For example, imagine a field "fruit" which is added to a document multiple times, with the values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> if one sorts on the field "fruit", the loop in FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other methods in the various FieldCacheImpl caches) does the following:
> while (termDocs.next()) {
> retArray[termDocs.doc()] = t;
> }
> which means that we look over the terms in their natural order and, on each one, overwrite retArray[doc] with the value for each document with that term. Effectively, this overwriting means that a string sort in this circumstance will sort by the LAST term lexicographically, so the docs above will effecitvely be sorted as if they had the single values ("apple", "banana", "banana", "zebra") which is nonintuitive. To change this to sort on the first time in the TermEnum seems relatively trivial and low-overhead; while it's not perfect (it's not local-aware, for example) the behaviour seems much more sensible to me. Interested to see what people think.
> Patch to follow.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Sep 4, 2008, 10:45 PM

Post #6 of 6 (330 views)
Permalink
[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628555#action_12628555 ]

Hoss Man commented on LUCENE-1372:
----------------------------------

bq. We'd sort the list of attributes so that it would appear as "apple,zebra".

Again i'm missing something in your argument ... you'll put code in your application which will change the order of stored fields when displaying them so it looks better, but you won't put code in your application to ensure that multiple values aren't indexed in the first place?

The application using Lucene is in the best position to decide "this is the value i want to sort on." FieldCache shouldn't guess which value to use if the application breaks the rules and indexes more then one. the fact that FieldCache currently picks the last one is just an artifact of how it was implemented ... it is "consistent" but "undefined" behavior.

if we are going to change the behavior we should change it should be an error.

> Proposal: introduce more sensible sorting when a doc has multiple values for a term
> -----------------------------------------------------------------------------------
>
> Key: LUCENE-1372
> URL: https://issues.apache.org/jira/browse/LUCENE-1372
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.3.2
> Reporter: Paul Cowan
> Priority: Minor
> Attachments: lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting on a field for which multiple values exist for one document. For example, imagine a field "fruit" which is added to a document multiple times, with the values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> if one sorts on the field "fruit", the loop in FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other methods in the various FieldCacheImpl caches) does the following:
> while (termDocs.next()) {
> retArray[termDocs.doc()] = t;
> }
> which means that we look over the terms in their natural order and, on each one, overwrite retArray[doc] with the value for each document with that term. Effectively, this overwriting means that a string sort in this circumstance will sort by the LAST term lexicographically, so the docs above will effecitvely be sorted as if they had the single values ("apple", "banana", "banana", "zebra") which is nonintuitive. To change this to sort on the first time in the TermEnum seems relatively trivial and low-overhead; while it's not perfect (it's not local-aware, for example) the behaviour seems much more sensible to me. Interested to see what people think.
> Patch to follow.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.