fancyerii at gmail
Apr 23, 2012, 7:56 PM
Post #8 of 8
(160 views)
Permalink

Re: why the of advance(int target) function of DocIdSetIterator is defined with uncertain?
[In reply to]


I got it. I found my test codes some problem. the new test result is: when minShould=1, new algorithm is a little bit faster than old one. when On Sun, Apr 22, 2012 at 3:32 AM, Mikhail Khludnev < mkhludnev [at] griddynamics> wrote: > Li, > > I can give you my understanding of difference in DisjunctionSumScorer and > https://issues.apache.org/jira/browse/LUCENE2686 > > When you choose steady children approach, even you omit call score() you > have to enumerate top of child scorers heap twice. > Check nextDoc() from the patch: first loop pushes top of heap first, then > the second loop through the top is done by > countMatches(1); countMatches(2); > > Current DisjunctionSumScorer enumerate top of heap once per every doc: > while it loops through top of heap it counts number of clauses matched; > accumulate score; and pushes top children one step forward. > > what's minimumNrMatchers do you have? can you upload your test github? > (mail list rips attachments off) > > On Thu, Apr 19, 2012 at 7:34 AM, Li Li <fancyerii [at] gmail> wrote: > >> Michael McCandless wrote: >> >> So... the good news is I made a new scorer (basically copied >> DisjunctionMaxScorer and then tweaked from there) that scores the ORonly >> case. All tests pass w/ this new scorer. >> >> And more good news is that if you don't score (I sort by doctitle to do >> that), you get a speedup – 7.7% in my simplistic test (prefix query unit*, >> expands to 988 terms, but I force it to do a scoring BQ rewrite, plus force >> it to use BS2 not BS – the nocommits in the patch). >> >> But the bad news is with scoring on it's 22.7% slower! >> >> And, the weird news is, I discovered accidentally that BS2 is much (> 2X) >> faster for this one query. I think we need to modify the criteria that >> decides whether to use BS or BS2... maybe when there are lots of >> lowishdocFreq terms, BS2 is better? >>  >> why new algorithm is 22% slower than old one? >> I read the codes and feel them almost the same. >> >> On Tue, Apr 17, 2012 at 8:16 PM, Mikhail Khludnev < >> mkhludnev [at] griddynamics> wrote: >> >>> Hello, >>> >>> I can't help with the particular question, but can share some >>> experience. My task is roughly the same I've found the patch >>> https://issues.apache.org/jira/browse/LUCENE2686 is absolutely useful >>> (with one small addition, I'll post it in comments soon). By using it I >>> have disjunction summing query with steady subscorers. >>> >>> Regards >>> >>> On Tue, Apr 17, 2012 at 2:37 PM, Li Li <fancyerii [at] gmail> wrote: >>> >>>> hi all, >>>> I am now hacking the BooleanScorer2 to let it keep the docID() of >>>> the leaf scorer(mostly possible TermScorer) the same as the toplevel >>>> Scorer. Why I want to do this is: When I Collect a doc, I want to know >>>> which term is matched(especially for BooleanClause whose Occur is SHOULD). >>>> we have discussed some solutions, such as adding bit masks in disjunction >>>> scorers. with this method, when we finds a matched doc, we can recursively >>>> find which leaf scorer is matched. But we think it's not very efficient and >>>> not convenient to use(this is my proposal but not agreed by others in our >>>> team). and then we came up with another one: Modifying DisjunctionSumScorer. >>>> we analysed the codes and found that the only Scorers used by >>>> BooleanScorer2 that will make the children scorers' docID() not equal to >>>> parent is an anonymous class inherited from DisjunctionSumScorer. All other >>>> ones including SingleMatchScorer, countingConjunctionSumScorer(anonymous), >>>> dualConjuctionSumScorer, ReqOptSumScorer and ReqExclScorer are fit our need. >>>> The implementation algorithm of DisjunctionSumScorer use a heap to >>>> find the smallest doc. after finding a matched doc, the currentDoc is the >>>> matched doc and all the scorers in the heap will call nextDoc() so all of >>>> the scorers' current docID the nextDoc of currentDoc. if there are N level >>>> DisjunctionSumScorer, the leaf scorer's current doc is the nth next docId >>>> of the root of the scorer tree. >>>> So we modify the DisjuctionSumScorer and let it behavior as we >>>> expected. And then I wrote some TestCase and it works well. And also I >>>> wrote some random generated TermScorer and compared the nextDoc(),score() >>>> and advance(int) method of original DisjunctionSumScorer and modified one. >>>> nextDoc() and score() and exactly the same. But for advance(int target), we >>>> found some interesting and strange things. >>>> at the beginning, I think if target is less than current docID, it >>>> will just return current docID and do nothing. this assumption let my >>>> algorithm go wrong. Then I read the codes of TermScorer and found each call >>>> of advance(int) of TermScorer will call nextDoc() no matter whether current >>>> docID is larger than target or not. >>>> So I am confused and then read the javadoc of DocIdSetIterator: >>>>  javadoc of DocIdSetIterator.advance(int >>>> target) >>>> >>>> int org.apache.lucene.search.DocIdSetIterator.advance(int target) >>>> throws IOException >>>> >>>> Advances to the first beyond (see NOTE below) the current whose >>>> document number is greater than or equal >>>> to target. Returns the current document number or NO_MORE_DOCS if >>>> there are no more docs in the set. >>>> Behaves as if written: >>>> int advance(int target) { >>>> int doc; >>>> while ((doc = nextDoc()) < target) { >>>> } >>>> return doc; >>>> } >>>> Some implementations are considerably more efficient than that. >>>> NOTE: when target < current implementations may opt not to advance >>>> beyond their current docID(). >>>> NOTE: this method may be called with NO_MORE_DOCS for efficiency by >>>> some Scorers. If your >>>> implementation cannot efficiently determine that it should exhaust, it >>>> is recommended that you check for >>>> that value in each call to this method. >>>> NOTE: after the iterator has exhausted you should not call this method, >>>> as it may result in unpredicted >>>> behavior. >>>>  >>>> Then I modified my algorithm again and found that >>>> DisjunctionSumScorer.advance(int target) has some strange behavior. most of >>>> the cases, it will return currentDoc if target < currentDoc. but in some >>>> boundary condition, it will not. >>>> it's not a bug but let me sad. I thought my algorithm has some bug >>>> because it's advance method is not exactly the same as original >>>> DisjunctionSumScorer's. >>>> codes of DisjunctionSumScorer >>>> @Override >>>> public int advance(int target) throws IOException { >>>> if (scorerDocQueue.size() < minimumNrMatchers) { >>>> return currentDoc = NO_MORE_DOCS; >>>> } >>>> if (target <= currentDoc) { >>>> return currentDoc; >>>> } >>>> .... >>>>  >>>> for most case if (target <= currentDoc) it will return currentDoc; >>>> but if previous advance will make sub scorers exhausted, then if may >>>> return NO_MORE_DOCS >>>> an example is: >>>> currentDoc=1 >>>> minimumNrMatchers=1 >>>> subScorers: >>>> TermScorer: docIds: [1, 2, 6] >>>> TermScorer: docIds: [2, 4] >>>> after first call advance(5) >>>> currentDoc=6 >>>> only first scorer is now in the heap, scorerDocQueue.size()==1 >>>> then call advance(6) >>>> because scorerDocQueue.size() < minimumNrMatchers, it just return >>>> NO_MORE_DOCS >>>> >>>> My question is why the advance(int target) method is defined like this? >>>> for the reason of efficient or any other reasons? >>>> >>>> >>> >>> >>> >>>  >>> Sincerely yours >>> Mikhail Khludnev >>> gedel [at] yandex >>> >>> <http://www.griddynamics.com> >>> <mkhludnev [at] griddynamics> >>> >>> >> > > >  > Sincerely yours > Mikhail Khludnev > gedel [at] yandex > > <http://www.griddynamics.com> > <mkhludnev [at] griddynamics> > >
