
jira at apache
Jun 23, 2012, 6:02 AM
Post #1 of 4
(58 views)
Permalink
|
|
[jira] [Comment Edited] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud
|
|
[ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399933#comment-13399933 ] Andy Laird edited comment on SOLR-2592 at 6/23/12 1:01 PM: ----------------------------------------------------------- I've been using some code very similar to Michael's latest patch for a few weeks now and am liking it less and less for our use case. As I described above, we are using this patch to ensure that all docs with the same value for a specific field end up on the same shard -- this is so that the field collapse counting will work for distributed searches, otherwise the returned counts are only an upper bound. The problems we've encountered have entirely to do with our need to update the value of the field we're doing a field-collapse on. Our approach -- conceptually similar to the CompositeIdShardKeyParserFactory in Michael's latest patch -- involved creating a new schema field, indexId, that was a combination of what used to be our uniqueKey plus the field that we collapse on: *Original schema* {code:xml} <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" /> <field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" /> ... <uniqueKey>id</uniqueKey> {code} *Modified schema* {code:xml} <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" /> <field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" /> <field name="indexId" type="string" indexed="true" stored="true" multiValued="false" required="true" /> ... <uniqueKey>indexId</uniqueKey> {code} During indexing we insert the extra "indexId" data in the form, "id:xyz". Our custom ShardKeyParser extracts out the xyz portion of the uniqueKey and returns that as the hash value for shard selection. Everything works great in terms of field collapse, counts, etc. The problems begin when considering what happens when we need to change the value of the field, xyz. Suppose that our document starts out with these values for the 3 fields above: {quote} id=123 xyz=456 indexId=123:456 {quote} We then want to change xyz to the value 789, say. In other words, we want to end up... {quote} id=123 xyz=789 indexId=123:789 {quote} ...so that the doc lives on the same shard along with other docs that have xyz=789 (so that field collapse counts are correct since we use that to drive paging). Before any of this we would simply pass in a new document and all would be good since we weren't changing the uniqueKey. However, now we need to delete the old document (with the old uniqueKey) or we'll end up with duplicates. We don't know whether a given update changes the value of xyz or not and we don't know what the old value for xyz was (without doing an additional lookup) so we must include an extra delete along with every change: *Before* {code:xml} <add> <doc> <field name="id">123</field> <field name="xyz">789</field> <doc> </add> {code} *Now* {code:xml} <delete> <query>id:123 AND NOT xyz:789</query> </delete> <add> <doc> <field name="id">123</field> <field name="xyz">789</field> <field name="clusterId">123:789</field> <-- old clusterId was 123:456 <doc> </add> {code} So in addition to the "unsavory coupling" between id and xyz there is a significant performance hit to this approach (as we're doing this in the context of NRT). The fundamental issue, of course, is that we only have the uniqueKey value (id) and score for the first phase of distributed search -- we really need the other field that we are using for shard ownership, too. One idea is to have another standard schema field similar to uniqueKey that is used for the purposes of shard distribution: {code:xml} <uniqueKey>id</uniqueKey> <shardKey>xyz</shardKey> {code} Then, as standard procedure, the first phase of distributed search would ask for uniqueKey, shardKey and score. Perhaps the ShardKeyParser gets both uniqueKey and shardKey data for maximum flexibility. In addition to a solution to our issue with field collapse counts, date-based sharding could be done by setting the shardKey to a date field and doing appropriate slicing in the ShardKeyParser. was (Author: clavius): I've been using some code very similar to Michael's latest patch for a few weeks now and am liking it less and less for our use case. As I described above, we are using this patch to ensure that all docs with the same value for a specific field end up on the same shard -- this is so that the field collapse counting will work for distributed searches, otherwise the returned counts are only an upper bound. The problems we've encountered have entirely to do with our need to update the value of the field we're doing a field-collapse on. Our approach -- conceptually similar to the CompositeIdShardKeyParserFactory in Michael's latest patch, involved creating a new schema field, indexId, that was a combination of what used to be our uniqueKey plus the field that we collapse on: *Original schema* {code:xml} <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" /> <field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" /> ... <uniqueKey>id</uniqueKey> {code} *Modified schema* {code:xml} <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" /> <field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" /> <field name="indexId" type="string" indexed="true" stored="true" multiValued="false" required="true" /> ... <uniqueKey>indexId</uniqueKey> {code} During indexing we insert the extra "indexId" data in the form, "id:xyz". Our custom ShardKeyParser extracts out the xyz portion of the uniqueKey and returns that as the hash value for shard selection. Everything works great in terms of field collapse, counts, etc. The problems begin when considering what happens when we need to change the value of the field, xyz. Suppose that our document starts out with these values for the 3 fields above: {quote} id=123 xyz=456 indexId=123:456 {quote} We then want to change xyz to the value 789, say. In other words, we want to end up... {quote} id=123 xyz=789 indexId=123:789 {quote} ...so that the doc lives on the same shard along with other docs that have xyz=789 (so that field collapse counts are correct since we use that to drive paging). Before any of this we would simply pass in a new document and all would be good since we weren't changing the uniqueKey. However, now we need to delete the old document (with the old uniqueKey) or we'll end up with duplicates. We don't know whether a given update changes the value of xyz or not and we don't know what the old value for xyz was (without doing an additional lookup) so we must include an extra delete along with every change: *Before* {code:xml} <add> <doc> <field name="id">123</field> <field name="xyz">789</field> <doc> </add> {code} *Now* {code:xml} <delete> <query>id:123 AND NOT xyz:789</query> </delete> <add> <doc> <field name="id">123</field> <field name="xyz">789</field> <field name="clusterId">123:789</field> <-- old clusterId was 123:456 <doc> </add> {code} So in addition to the "unsavory coupling" between id and xyz there is a significant performance hit to this approach (as we're doing this in the context of NRT). The fundamental issue, of course, is that we only have the uniqueKey value (id) and score for the first phase of distributed search -- we really need the other field that we are using for shard ownership, too. One idea is to have another standard schema field similar to uniqueKey that is used for the purposes of shard distribution: {code:xml} <uniqueKey>id</uniqueKey> <shardKey>xyz</shardKey> {code} Then, as standard procedure, the first phase of distributed search would ask for uniqueKey, shardKey and score. Perhaps the ShardKeyParser gets both uniqueKey and shardKey data for maximum flexibility. In addition to a solution to our issue with field collapse counts, date-based sharding could be done by setting the shardKey to a date field and doing appropriate slicing in the ShardKeyParser. > Pluggable shard lookup mechanism for SolrCloud > ---------------------------------------------- > > Key: SOLR-2592 > URL: https://issues.apache.org/jira/browse/SOLR-2592 > Project: Solr > Issue Type: New Feature > Components: SolrCloud > Affects Versions: 4.0 > Reporter: Noble Paul > Attachments: SOLR-2592.patch, dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch > > > If the data in a cloud can be partitioned on some criteria (say range, hash, attribute value etc) It will be easy to narrow down the search to a smaller subset of shards and in effect can achieve more efficient search. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscribe [at] lucene For additional commands, e-mail: dev-help [at] lucene
|