
Mailing List Archive: Lucene: Java-Dev

[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token

 

 



jira at apache

Aug 15, 2012, 2:56 PM

Post #1 of 14 (1175 views)
Permalink
[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token

[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435558#comment-13435558 ]

Jack Krupansky commented on SOLR-3589:
--------------------------------------

The root problem is that with automatic phrase query generation turned off, by default and for the text_general field in particular, the core Lucene query parser is generating a query for the tuple of sub-terms using the default query operator, which is "OR" by default. There is no notion of an "mm" or min-match parameter down at that level in Lucene, which knows nothing about Solr or edismax or request parameters.
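
The behaviour described above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Solr or Lucene code: `analyze` is a hypothetical stand-in for an analyzer chain that splits on hyphens, and `matches` models only the idea that sub-terms from one source term are ORed inside a single top-level clause, while "mm" counts top-level clauses.

```python
import re

def analyze(text):
    """Hypothetical analyzer: lowercase and split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def matches(query, doc_text, mm_percent=100):
    """Toy model: one top-level clause per whitespace-delimited source term.

    A clause matches if ANY of its analyzed sub-terms is in the document
    (default operator OR); mm is applied to the top-level clause count only.
    """
    doc_terms = set(analyze(doc_text))
    clauses = [c for c in (analyze(term) for term in query.split()) if c]
    required = len(clauses) * mm_percent // 100
    matched = sum(1 for sub_terms in clauses
                  if any(t in doc_terms for t in sub_terms))
    return matched >= required

print(matches("fire fly", "fire truck"))   # False: two clauses, only one matches
print(matches("fire-fly", "fire truck"))   # True: one clause, "fire OR fly"
```

Under these assumptions, the hyphenated query matches a document that contains only "fire", even though mm is 100%.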

As things stand, the only option is to set the default query operator, "q.op", to "AND".

You can of course also turn on autoGeneratePhraseQueries or select an analyzer that doesn't split terms.

At this point, I would advise resolving this issue as "Won't Fix", although it could also be spun off into a Lucene issue to add support for min-match down at that level, which edismax can then also communicate with.



> Edismax parser does not honor mm parameter if analyzer splits a token
> ---------------------------------------------------------------------
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
> Issue Type: Bug
> Components: search
> Affects Versions: 3.6
> Reporter: Tom Burton-West
>
> With edismax mm set to 100%, if one of the tokens is split into two tokens by the analyzer chain (e.g. "fire-fly" => fire fly), the mm parameter is ignored and the equivalent of an OR query, "fire OR fly", is produced.
> This is particularly a problem for languages that do not use white space to separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Aug 15, 2012, 7:17 PM

Post #2 of 14 (1150 views)
Permalink
[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435720#comment-13435720 ]

Joel Rosen commented on SOLR-3589:
----------------------------------

It's not just mm. You set q.op to AND and it does the same thing.

The issue is that the query parser should treat the split tokens as separate tokens just as if they were separated by whitespace, but it doesn't. If I use a smart Chinese tokenizer to split up a Chinese sentence into words, why can't the query parser treat those words exactly the same way it treats words from an English sentence?



jira at apache

Aug 16, 2012, 5:15 AM

Post #3 of 14 (1154 views)
Permalink
[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435921#comment-13435921 ]

Jack Krupansky commented on SOLR-3589:
--------------------------------------

bq. It's not just mm. You set q.op to AND and it does the same thing.

Joel, you're right. Upon closer inspection of the code, I see that the reason is that edismax never sets the Lucene default operator directly. Instead, it sets the default value of the "mm" parameter to 100% if "q.op" is "AND", and sets BooleanQuery.minNrShouldMatch to the number of optional terms. That is equivalent to setting the default Lucene query operator at the top-level boolean level, but has no effect on terms that get split down at the analyzer level. Oh well. Scratch that suggestion.
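
A rough illustration of that mapping, as a sketch only: the function names are hypothetical, the percentage handling is simplified to positive values rounded down, and the 0% default for "q.op=OR" is an assumption of this example rather than something stated in the comment.

```python
def min_should_match(optional_clause_count, mm_percent):
    """Simplified: positive percentages only, rounded down."""
    return optional_clause_count * mm_percent // 100

def top_level_min_match(clauses, q_op="OR", mm_percent=None):
    """q.op=AND only changes the default for mm; it never reaches sub-terms."""
    if mm_percent is None:
        # Assumed defaults for this sketch.
        mm_percent = 100 if q_op == "AND" else 0
    return min_should_match(len(clauses), mm_percent)

# "fire-fly larva" after analysis: two top-level clauses,
# the first of which holds two sub-terms.
query = [["fire", "fly"], ["larva"]]
print(top_level_min_match(query, q_op="AND"))  # 2
```

The result counts the two top-level clauses. Nothing in it requires both "fire" and "fly" inside the first clause, which is why "q.op=AND" does not help here.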

I think I'm back to wanting to suggest that edismax should actually set the Lucene-level default query operator if "mm" is 100%. I think that would fix the original problem and allow the user to choose whether to use "mm" or "q.op" to control AND/OR.




jira at apache

Aug 16, 2012, 8:47 AM

Post #4 of 14 (1162 views)
Permalink
[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436045#comment-13436045 ]

Jack Krupansky commented on SOLR-3589:
--------------------------------------

bq. If I use a smart Chinese tokenizer to split up a Chinese sentence into words, why can't the query parser treat those words exactly the same way it treats words from an English sentence?

Indexing of whole documents can in fact treat text as if it were words from an English sentence, and split tokens do in fact behave as such in that context, but a query is not an English sentence or sentence in any natural language. Rather, a query is a structured expression composed of terms and operators, typically separated by whitespace or special operators such as parentheses. Portions of queries may look like natural language phrases or even whole sentences, but in reality they are sequences of terms and operators.

In addition to being parsed according to the syntax of queries, as opposed to natural language processing or the raw token stream processing of an indexer, each of the query terms must be "analyzed" before the final form of the term can be generated into a Lucene Query structure. That analysis is performed separately from the "parsing" of the structured user query expression. That means that the processing of sub-terms that result from analysis is handled at a different level than source-level query terms that happen to "look" like English words. In other words, the sub-terms are processed by the "query generator" while the source terms are processed by the "query parser". We loosely refer to the combination of (user) query parsing and (Lucene) query generation as "the query parser", but it is important to distinguish (user query) "parsing" from (Lucene Query) "generation".
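
The two stages can be sketched as separate functions. This is a toy illustration, not the Lucene implementation: `parse` and `generate` are hypothetical names, the only operator recognised is AND (and any AND makes every clause required), and the "title" field and the hyphen-splitting analysis are assumptions.

```python
import re

def parse(user_query):
    """Stage 1 (toy parser): split on whitespace and recognise the AND operator."""
    tokens = user_query.split()
    require_all = "AND" in tokens
    source_terms = [t for t in tokens if t != "AND"]
    return source_terms, require_all

def generate(source_terms, require_all, field="title"):
    """Stage 2 (toy generator): analyze each source term into sub-terms."""
    clauses = []
    for term in source_terms:
        sub_terms = [s for s in re.split(r"[^a-z0-9]+", term.lower()) if s]
        if not sub_terms:
            continue
        body = " ".join(f"{field}:{s}" for s in sub_terms)
        if len(sub_terms) > 1:
            body = f"({body})"  # sub-terms grouped with the default operator (OR)
        clauses.append(("+" if require_all else "") + body)
    return " ".join(clauses)

print(generate(*parse("fire-fly AND larva")))
# prints: +(title:fire title:fly) +title:larva
```

The AND seen by the parser makes each source term mandatory, but the sub-terms produced during generation are grouped without it.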

The query parser does its best to handle sub-terms reasonably, but expecting that they will magically be handled in exactly the same way as source terms is somewhat impractical. That doesn't mean that there can't be improvement, but simply that a dose of realism is needed when considering the potential, challenges, and limits of query parsing/processing/generation.




jira at apache

Aug 16, 2012, 9:20 AM

Post #5 of 14 (1151 views)
Permalink
[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436069#comment-13436069 ]

Joel Rosen commented on SOLR-3589:
----------------------------------

Sounds to me like this is an English-centric design flaw with dismax. The point of dismax is to intelligently process simple user-entered phrases, right? If I understand correctly, it does this by looking at the terms entered and making some decisions about how to join them with AND or OR. But it assumes that a term is a whitespace-delimited string, yes? This is an incorrect assumption for Chinese. If instead of making this assumption, dismax ran the analyzers first to determine what is and isn't a term, then I imagine you would get more predictable behavior across both whitespace delimited and non-whitespace delimited languages, and you wouldn't need any "magical" handling for different languages.
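
A minimal sketch of the analyze-first idea, under stated assumptions: this is plain Python rather than Solr code, `analyze` is a hypothetical analyzer that splits on non-alphanumerics, and the model ignores query syntax such as field names and operators entirely.

```python
import re

def analyze(text):
    """Hypothetical analyzer: lowercase and split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def matches_analyze_first(query, doc_text, mm_percent=100):
    """Every analyzed term becomes its own top-level clause, so mm sees all of them."""
    doc_terms = set(analyze(doc_text))
    clauses = analyze(query)
    required = len(clauses) * mm_percent // 100
    matched = sum(1 for term in clauses if term in doc_terms)
    return matched >= required

print(matches_analyze_first("fire-fly", "fire truck"))      # False
print(matches_analyze_first("fire-fly", "fire fly larva"))  # True
```

In this model "fire-fly" and "fire fly" behave identically, because term boundaries come from the analyzer rather than from whitespace.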



jira at apache

Aug 16, 2012, 9:41 AM

Post #6 of 14 (1158 views)
Permalink
[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436082#comment-13436082 ]

Jack Krupansky commented on SOLR-3589:
--------------------------------------

Be careful not to confuse dismax and edismax. They are two different query parsers, with different goals.

One of edismax's goals was to support "fielded queries" (e.g., "title:abc AND date:123") and the full Lucene query syntax. No typical analyzer will be able to tell you that title and date are field names.

Not "English-centric", but European/Latin-centric for sure. The edismax and classic Lucene query parsers share that heritage, based on whitespace, but the dismax query parser doesn't "suffer" from that same need to parse field names and operators.

There is no question that better query parser support is needed for non-European/Latin languages, but that requires careful, high-level, overall design, which is a tall order for a fast-paced open source community where features tend to be looked at in isolation.

One clarification...

bq. assumes that a term is a whitespace-delimited string

Yes and no. We need to be careful about distinguishing a "source term" - what the parser recognizes, which is whitespace delimited, from "analyzed terms" which are recognized and output by the field type analyzers. There is no requirement that the output terms be whitespace-delimited or that the input to an analyzer be whitespace-delimited. So, the theory has been that even a whitespace-centric complex-structure query parser can also handle, for example, Chinese text. Obviously that hasn't worked out as cleanly as desired and more work is needed.




jira at apache

Aug 16, 2012, 5:05 PM

Post #7 of 14 (1147 views)
Permalink
[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436454#comment-13436454 ]

Jack Krupansky commented on SOLR-3589:
--------------------------------------

My proposal is for edismax to set the Lucene default query operator to "AND" if either: 1) "q.op" is "AND", or 2) "mm" is "100%".

I think that will address the stated problem.

Any objection?

I'll try to come up with a patch, but a committer will be needed to apply it.




jira at apache

Aug 16, 2012, 5:23 PM

Post #8 of 14 (1145 views)
Permalink
[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436468#comment-13436468 ]

Yonik Seeley commented on SOLR-3589:
------------------------------------

bq. My proposal is for edismax to set the Lucene default query operator to "AND"

Hmmm, I dunno. mm=100% is really only meant to apply to top level query terms, not structured lucene queries.

For example, in (foo:x foo:(a b c))
It doesn't seem like a b c should all be mandatory just because there happens to be a default mm of 100% (and they are not today).
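
The concern can be made concrete with a toy evaluation of that query. This is a sketch in plain Python, not Lucene code: the field name is dropped, the function names are hypothetical, and mm=100% is modelled simply as "every top-level clause must match".

```python
def group_matches(terms, doc_terms, operator):
    """Evaluate a parenthesised group with the given default operator."""
    hits = [t in doc_terms for t in terms]
    return all(hits) if operator == "AND" else any(hits)

def matches(doc_terms, inner_operator):
    """Toy evaluation of (foo:x foo:(a b c)) with mm=100% at the top level."""
    top_level = [
        "x" in doc_terms,
        group_matches(["a", "b", "c"], doc_terms, inner_operator),
    ]
    return all(top_level)  # mm=100%: every top-level clause must match

doc = {"x", "a"}
print(matches(doc, "OR"))   # True: behaviour described as current
print(matches(doc, "AND"))  # False: with the default operator forced to AND
```

A document containing "x" and "a" matches today, but would stop matching if the default operator inside the group became AND.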



jira at apache

Aug 16, 2012, 5:45 PM

Post #9 of 14 (1150 views)
Permalink
[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436484#comment-13436484 ]

Jack Krupansky commented on SOLR-3589:
--------------------------------------

I could back off and simply say that edismax should set the Lucene default query operator to "AND" if "q.op" is "AND", but that would not address this particular issue, which is complaining that mm won't force the split terms to be ANDed.

If we really want to say that mm CAN'T be used to force split terms to be ANDed, then we should really resolve this issue as invalid/won't fix.

I should probably file a separate issue for the fact that q.op is not obeyed for any but the top-level query.

And, the wiki makes no mention of "mm" being intended only for the top level query.




jira at apache

Aug 16, 2012, 5:49 PM

Post #10 of 14 (1147 views)
Permalink
[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436486#comment-13436486 ]

Yonik Seeley commented on SOLR-3589:
------------------------------------

I was not saying that this issue shouldn't be fixed, but merely commenting on the negative consequences of one proposed solution.



jira at apache

Aug 18, 2012, 3:23 PM

Post #11 of 14 (1145 views)
Permalink
[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437426#comment-13437426 ]

Lance Norskog commented on SOLR-3589:
-------------------------------------

See [SOLR-3636], it's the same problem space but with synonym expansion. If "Monkeyhouse" expands to "monkey house", then a dismax or edismax query finds documents with either word ("monkey" OR "house"). The "mm" (minimum-should-match) parameter defaults to 100%, so you would expect this to mean "monkey" AND "house".

This seems to be a multi-part problem.



jira at apache

Aug 19, 2012, 11:19 PM

Post #12 of 14 (1129 views)
Permalink
[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437674#comment-13437674 ]

Bernd Fehling commented on SOLR-3589:
-------------------------------------

I would not mix synonyms into this because they need special, separate treatment.
It might work for "monkeyhouse => monkey house" but what if you have synonyms like "nuclear fission, kernspaltung, fissione nucleare"?
You would expect to get a search like (nuclear AND fission) OR (kernspaltung) OR (fissione AND nucleare).
This is a simplified example just to show that if you include synonyms into this issue you also have to detect/parse/obey the kind of synonym mapping.
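
The expected expansion can be sketched as follows. This is a toy string builder in plain Python, not a Solr synonym filter: `expand_synonyms` is a hypothetical helper, and it assumes each synonym entry is a whitespace-separated phrase.

```python
def expand_synonyms(group):
    """Words within one synonym are ANDed; alternative synonyms are ORed."""
    alternatives = []
    for phrase in group:
        words = phrase.split()
        alternatives.append("(" + " AND ".join(words) + ")")
    return " OR ".join(alternatives)

print(expand_synonyms(["nuclear fission", "kernspaltung", "fissione nucleare"]))
# prints: (nuclear AND fission) OR (kernspaltung) OR (fissione AND nucleare)
```

Getting this shape requires knowing which tokens belong to which synonym entry, information that a flat list of analyzed sub-terms does not carry.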




jira at apache

Aug 23, 2012, 12:30 PM

Post #13 of 14 (1121 views)
Permalink
[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440583#comment-13440583 ]

Tom Burton-West commented on SOLR-3589:
---------------------------------------

I just repeated the tests in Solr 4.0-BETA, and the bug behaves the same.



jira at apache

Aug 23, 2012, 2:12 PM

Post #14 of 14 (1120 views)
Permalink
[jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440669#comment-13440669 ]

Tom Burton-West commented on SOLR-3589:
---------------------------------------

I don't yet understand the Edismax test cases well enough to write unit tests. If someone can point me to an example unit test that I could use as a model, please do.
In the meantime, attached is a file that can be put in the Solr exampledocs directory and indexed. Sample queries demonstrating the problem with English hyphenated words and with CJK are included.
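
For anyone without the attachment, a query of the kind described can be assembled as in the sketch below. The host, port, and handler path (`http://localhost:8983/solr/select`) and the field name `text` are assumptions for illustration, and this is not the content of the attached file. `defType`, `q`, `mm`, `qf`, and `debugQuery` are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumed local example server and handler path; adjust for your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_edismax_url(query, mm="100%", field="text"):
    """Return a select URL that runs `query` through edismax with the given mm."""
    params = {
        "defType": "edismax",
        "q": query,
        "mm": mm,              # "100%" is percent-encoded as 100%25
        "qf": field,           # hypothetical field name
        "debugQuery": "true",  # include the parsed query in the response
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_edismax_url("fire-fly"))
```

With `debugQuery=true`, the parsed query in the response should show whether the split sub-terms were combined as optional clauses.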

> Edismax parser does not honor mm parameter if analyzer splits a token
> ---------------------------------------------------------------------
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
> Issue Type: Bug
> Components: search
> Affects Versions: 3.6, 4.0-BETA
> Reporter: Tom Burton-West
> Attachments: testSolr3589.xml.gz
>
>
> With edismax mm set to 100%, if one of the tokens is split into two tokens by the analyzer chain (e.g. "fire-fly" => fire fly), the mm parameter is ignored and the equivalent of an OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html
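
The behavior in the quoted description can be illustrated with a toy model in plain Python. This is not Solr or Lucene code, only a sketch under stated assumptions: `analyze` is a hypothetical stand-in for an analyzer chain that splits on hyphens, documents are bags of whitespace-separated words, and the mm percentage is rounded down.

```python
import re


def analyze(token):
    """Toy analyzer: lowercase and split on hyphens ("fire-fly" -> fire, fly)."""
    return [part for part in re.split(r"-", token.lower()) if part]


def matches_observed(doc_text, query, mm_percent=100):
    """Reported behavior: mm counts whitespace-separated clauses only, and the
    sub-terms the analyzer produces inside one clause are ORed."""
    doc_terms = set(doc_text.lower().split())
    clauses = [analyze(token) for token in query.split()]
    required = len(clauses) * mm_percent // 100
    hits = sum(1 for subs in clauses if any(t in doc_terms for t in subs))
    return hits >= required


def matches_expected(doc_text, query, mm_percent=100):
    """What the reporter expects: mm applies to every term the analyzer emits."""
    doc_terms = set(doc_text.lower().split())
    terms = [t for token in query.split() for t in analyze(token)]
    required = len(terms) * mm_percent // 100
    hits = sum(1 for t in terms if t in doc_terms)
    return hits >= required


print(matches_observed("fire alarm", "fire-fly"))  # True: "fire" alone satisfies the OR
print(matches_expected("fire alarm", "fire-fly"))  # False: "fly" is missing
```

In this model the query "fire-fly" is a single top-level clause, so mm=100% requires only that one clause, and the OR inside it lets a document containing only "fire" through.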

