
Mailing List Archive: Lucene: Java-Dev

How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

 

 



siberski at l3s

Jan 11, 2005, 1:55 AM

Post #1 of 51 (264 views)
How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

As I'm very interested in resolving this bug,
I would like to resume the discussion about it.
Chuck Williams (the original bug reporter) and I
have both already provided patches. Are any of the
committers willing to review them?
If changes are necessary, or another way of handling
this issue turns out to be more appropriate, I would
gladly put more work into that area.
But I need the support of at least one committer, and
IMHO some additional discussion about how to tackle
the issue wouldn't hurt either.

--Wolf


bugzilla [at] apache wrote:
> DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
> RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
> <http://issues.apache.org/bugzilla/show_bug.cgi?id=31841>.
> ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
> INSERTED IN THE BUG DATABASE.
>
> http://issues.apache.org/bugzilla/show_bug.cgi?id=31841
>
>
> daniel.naber [at] t-online changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> CC| |feigao [at] sohu-inc
>
>
>
>
> ------- Additional Comments From daniel.naber [at] t-online 2005-01-04 23:49 -------
> *** Bug 32053 has been marked as a duplicate of this bug. ***
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe [at] jakarta
For additional commands, e-mail: lucene-dev-help [at] jakarta


chuck at manawiz

Jan 11, 2005, 12:01 PM

Post #2 of 51 (269 views)
RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

I agree that this bug is important to fix, but don't believe we have a
solid fix yet. Idf-normalization is essential to get correct for large
distributed-index apps. I have a client evaluating Lucene for this now.
As Wolf does, I hope a committer with deep knowledge of Lucene's design
in this area will weigh in on the issue and help to resolve it.

I've read through Wolf's patch and see a few issues (please correct
anything wrong here):
1. DfMapSimilarity works only with a limited set of queries. A
complete solution should support all Query types, and certainly must
support fundamental Query types like RangeQuery. Could this be
addressed by using primitive queries rather than surface queries (i.e.,
after rewriting)? There may be a more fundamental issue for Query's
that generate large numbers of clauses, because it is very inefficient
to access all the RemoteSearchable's for each Term.
2. The patch hardwires the use of DfMapSimilarity into MultiSearcher.
As Wolf points out in his comments, this needs to be configurable. At
present, it would be impossible to use a custom Similarity, e.g. to
change the numerical computation of idf() from the docfreq. The ability
to configure custom Similarity's needs to be robust in the presence of
MultiSearcher, i.e. an application should be able to make the kinds of
changes currently made in a subclass of DefaultSimilarity while
inheriting the behavior that makes it work properly with MultiSearcher.
3. Philosophically, I'm not convinced that Similarity's are the right
solution. Similarity's are currently used for application-specific
scoring customizations. The issue here is idf-normalization in the
presence of multiple searchers, which should be an orthogonal
consideration.
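
To make the idf-normalization problem concrete, here is a minimal sketch. It assumes an idf formula of the same shape as the one in Lucene's DefaultSimilarity; the class name IdfSketch and all the counts are made up for illustration. It compares the idf each index slice would compute on its own with the idf computed from the summed statistics.

```java
class IdfSketch {
    // Modelled on the idf() formula in Lucene's DefaultSimilarity:
    // log(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical collection split over two index slices (made-up counts).
        int[] docFreq = {1, 99};      // documents containing the term, per slice
        int[] numDocs = {1000, 1000}; // documents in each slice

        // Per-slice idf: the same term is weighted differently on each node.
        for (int i = 0; i < docFreq.length; i++) {
            System.out.println("slice " + i + " idf = " + idf(docFreq[i], numDocs[i]));
        }

        // Collection-wide idf: sum the statistics first, then apply the formula.
        int totalDocFreq = docFreq[0] + docFreq[1];
        int totalDocs = numDocs[0] + numDocs[1];
        System.out.println("global idf = " + idf(totalDocFreq, totalDocs));
    }
}
```

Because the two slices disagree about how rare the term is, scores from the slices are not comparable unless every slice uses the collection-wide value.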

My patch with a topmostSearcher field also has issues, especially the
fatal problem that it doesn't work for RemoteSearchable's.

A burning question for me is, what is the right solution for
RemoteSearchable's? With Wolf's patch, the MultiSearcher analyzes each
Query to identify the terms it uses and then calls each RemoteSearchable
to get the docFreq's from its index, sums them, extends the Query with a
Map of these sums (within a created Similarity), and then passes this
information back to the RemoteSearchable's to use during their scoring.
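
The summing step described above can be sketched roughly as follows. This is not the code from either patch: Slice, sliceOf and aggregate are hypothetical names, and a plain map stands in for each remote index.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DocFreqAggregation {
    /** Hypothetical stand-in for one (possibly remote) index slice. */
    interface Slice {
        int docFreq(String term);
    }

    /** Wraps a plain term -> docFreq map as a Slice, for illustration only. */
    static Slice sliceOf(Map<String, Integer> docFreqs) {
        return term -> docFreqs.getOrDefault(term, 0);
    }

    /** Sums the docFreq of every query term over all slices. */
    static Map<String, Integer> aggregate(Collection<String> terms, List<Slice> slices) {
        Map<String, Integer> sums = new HashMap<>();
        for (String term : terms) {
            int sum = 0;
            for (Slice slice : slices) {
                sum += slice.docFreq(term); // one call per term and slice
            }
            sums.put(term, sum);
        }
        return sums;
    }
}
```

The nested loop shows where the cost comes from: the number of calls grows with the number of terms times the number of slices, which is the concern raised for queries that expand to many terms.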

An alternative approach would be to precompute the docFreq sums and
distribute them to all the RemoteSearchable's ahead of time, independent
of Query's. Incremental indexing would need to recompute and propagate
the revised sums. Having the sums pre-distributed would make
Query-processing efficient. Is something along those lines possible?

Chuck



cutting at apache

Jan 11, 2005, 2:13 PM

Post #3 of 51 (261 views)
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Chuck Williams wrote:
> As Wolf does, I hope a committer with deep knowledge of Lucene's design
> in this area will weigh in on the issue and help to resolve it.

The root of the bug is in MultiSearcher.search(). This should construct
a Weight, weight the query, then score the now-weighted query.

Here's a potential way to fix it:

1. Replace all of the

... search(Query, ...)

methods in Searchable.java with

... search(Weight, ...)

methods.

2. Add search(Query, ...) convenience methods to Searcher.java which do
something like:

public ... search(Query query, ...) {
return search(query.weight(this), ...);
}

3. Update the search() methods in IndexSearcher, MultiSearcher and
RemoteSearchable to operate on Weight's instead of queries.
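
The three steps above might look roughly like this in a stripped-down model. This is only a sketch under simplifying assumptions, not Lucene's actual classes: a query is reduced to a single term string, a Weight carries just an idf value, and the names Weight, Searchable, Searcher, IndexSearcher and MultiSearcher are re-declared here as small stand-ins.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WeightSearchSketch {

    /** A query term already weighted with collection statistics. */
    static final class Weight {
        final String term;
        final double idf;
        Weight(String term, double idf) { this.term = term; this.idf = idf; }
    }

    /** Low-level interface: searches take a Weight, not a raw query. */
    interface Searchable {
        int docFreq(String term);
        int maxDoc();
        Map<String, Double> search(Weight weight); // docId -> score
    }

    /** Convenience layer: weight the query against this searcher, then search. */
    abstract static class Searcher implements Searchable {
        Weight weight(String term) {
            double idf = Math.log(maxDoc() / (double) (docFreq(term) + 1)) + 1.0;
            return new Weight(term, idf);
        }
        Map<String, Double> search(String term) {
            return search(weight(term));
        }
    }

    /** One in-memory index: term -> (docId -> term frequency). */
    static final class IndexSearcher extends Searcher {
        private final Map<String, Map<String, Integer>> postings;
        private final int maxDoc;
        IndexSearcher(Map<String, Map<String, Integer>> postings, int maxDoc) {
            this.postings = postings;
            this.maxDoc = maxDoc;
        }
        public int docFreq(String term) {
            return postings.getOrDefault(term, Collections.emptyMap()).size();
        }
        public int maxDoc() { return maxDoc; }
        public Map<String, Double> search(Weight weight) {
            Map<String, Double> hits = new HashMap<>();
            Map<String, Integer> docs = postings.getOrDefault(weight.term, Collections.emptyMap());
            for (Map.Entry<String, Integer> e : docs.entrySet()) {
                hits.put(e.getKey(), e.getValue() * weight.idf);
            }
            return hits;
        }
    }

    /** Fans out to sub-searchers; its statistics cover the whole collection. */
    static final class MultiSearcher extends Searcher {
        private final List<Searchable> subs;
        MultiSearcher(List<Searchable> subs) { this.subs = new ArrayList<>(subs); }
        public int docFreq(String term) {
            int sum = 0;
            for (Searchable sub : subs) sum += sub.docFreq(term);
            return sum;
        }
        public int maxDoc() {
            int sum = 0;
            for (Searchable sub : subs) sum += sub.maxDoc();
            return sum;
        }
        public Map<String, Double> search(Weight weight) {
            Map<String, Double> hits = new HashMap<>();
            for (Searchable sub : subs) hits.putAll(sub.search(weight)); // same Weight everywhere
            return hits;
        }
    }

    /** Builds an index of maxDoc documents in which the listed docs contain the term once. */
    static IndexSearcher index(int maxDoc, String term, String... docIds) {
        Map<String, Integer> docs = new HashMap<>();
        for (String id : docIds) docs.put(id, 1);
        Map<String, Map<String, Integer>> postings = new HashMap<>();
        postings.put(term, docs);
        return new IndexSearcher(postings, maxDoc);
    }
}
```

The point of the design is that the Weight is created once, by whichever searcher the application calls, so a MultiSearcher weights the query with summed statistics and every sub-searcher then scores with that same Weight.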

Does that make sense?

Doug



chuck at manawiz

Jan 11, 2005, 4:14 PM

Post #4 of 51 (267 views)
RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

This is a nice solution! By having MultiSearcher create the Weight, it
can pass itself in as the searcher, thereby allowing the correct
docFreq() method to be called. This is similar to what I tried to do
with topmostSearcher, but a much better way to do it.

I'm still left wondering if having MultiSearcher query all the
RemoteSearchable's on every call to docFreq() within each TermQuery,
PhraseQuery, SpanQuery and PhrasePrefixQuery is the way to go long term,
although it seems like the best thing to do right now. The calls only
happen when the Weight's are created, so maybe it's not too bad. Longer
term, it might be better to distribute the idf information out to the
RemoteSearchable's to minimize the required number of remote accesses
for each Query.

Wolf, do you want to implement Doug's solution?

Chuck



cutting at apache

Jan 11, 2005, 4:45 PM

Post #5 of 51 (261 views)
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Chuck Williams wrote:
> This is a nice solution! By having MultiSearcher create the Weight, it
> can pass itself in as the searcher, thereby allowing the correct
> docFreq() method to be called.

Glad to hear it at least makes sense... Now I hope it works!

> I'm still left wondering if having MultiSearcher query all the
> RemoteSearchable's on every call to docFreq() within each TermQuery,
> PhraseQuery, SpanQuery and PhrasePrefixQuery is the way to go long term,
> although it seems like the best thing to do right now. The calls only
> happen when the Weight's are created, so maybe it's not too bad. Longer
> term, it might be better to distribute the idf information out to the
> RemoteSearchable's to minimize the required number of remote accesses
> for each Query.

I'm not sure exactly what you mean by "distribute the idf information
out to the RemoteSearchable". I think one might profitably implement a
docFreq() cache in RemoteSearchable. This could be a simple cache, or
it could be fairly aggressive, pre-fetching all the docFreqs. (As an
optimization, it could only pre-fetch those greater than 1, and, when a
term is not in the cache, assume its docFreq is 1. As a lossy
optimization, it could only pre-fetch those greater than N, and somehow
estimate those not in the cache.) Is that what you meant?
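
A minimal sketch of the first optimization, assuming the full docFreq table is available as a map when the cache is built; DocFreqCacheSketch and cachedTerms are hypothetical names, not part of Lucene.

```java
import java.util.HashMap;
import java.util.Map;

class DocFreqCacheSketch {
    private final Map<String, Integer> cache = new HashMap<>();

    /** Pre-fetches only the terms whose docFreq is greater than 1. */
    DocFreqCacheSketch(Map<String, Integer> allDocFreqs) {
        for (Map.Entry<String, Integer> e : allDocFreqs.entrySet()) {
            if (e.getValue() > 1) {
                cache.put(e.getKey(), e.getValue());
            }
        }
    }

    /** A miss is assumed to be a docFreq of 1, so no remote call is needed. */
    int docFreq(String term) {
        return cache.getOrDefault(term, 1);
    }

    int cachedTerms() {
        return cache.size();
    }
}
```

One consequence of this shortcut is that a term that occurs nowhere is also reported with a docFreq of 1.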

Doug



chuck at manawiz

Jan 11, 2005, 5:01 PM

Post #6 of 51 (262 views)
RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Doug Cutting wrote:
> I'm not sure exactly what you mean by "distribute the idf information
> out to the RemoteSearchable". I think one might profitably implement a
> docFreq() cache in RemoteSearchable. This could be a simple cache, or
> it could be fairly aggressive, pre-fetching all the docFreqs. (As an
> optimization, it could only pre-fetch those greater than 1, and, when a
> term is not in the cache, assume its docFreq is 1. As a lossy
> optimization, it could only pre-fetch those greater than N, and somehow
> estimate those not in the cache.) Is that what you meant?

I was thinking of the aggressive version with an index-time solution,
although I don't know the Lucene architecture for distributed indexing
and searching well enough to formulate the idea precisely.
Conceptually, I'd like each server that owns a slice of the index in a
distributed environment to have the complete docFreq data, i.e. to have
docFreq's that represent the collection as a whole, not just its index
slice. If this was achieved at index-time, then the current
implementation would work at query time. I.e., MultiSearcher could send
the queries out to the remote Searcher's and these Searcher's could
consult their local indexes for the correct docFreq's to use.

Chuck



cutting at apache

Jan 12, 2005, 9:57 AM

Post #7 of 51 (261 views)
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Chuck Williams wrote:
> I was thinking of the aggressive version with an index-time solution,
> although I don't know the Lucene architecture for distributed indexing
> and searching well enough to formulate the idea precisely.
> Conceptually, I'd like each server that owns a slice of the index in a
> distributed environment to have the complete docFreq data, i.e. to have
> docFreq's that represent the collection as a whole, not just its index
> slice. If this was achieved at index-time, then the current
> implementation would work at query time. I.e., MultiSearcher could send
> the queries out to the remote Searcher's and these Searcher's could
> consult their local indexes for the correct docFreq's to use.

This is different than what I described. I described keeping a docFreq
cache at the central dispatch node, while you describe replicating that
cache on every search node. I don't see the advantage in this
replication. It is both more efficient to maintain a single cache, and
faster to search, since fewer dictionary lookups are involved.

Doug



chuck at manawiz

Jan 12, 2005, 10:42 AM

Post #8 of 51 (260 views)
RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Ahhh, I didn't understand the part about caching the results in the
central dispatch node. I thought you were accessing the remote nodes on
every query to sum the docFreq's in each remote index for each query
term. I was trying to avoid a large number of round-trips to the remote
nodes by allowing them to have the aggregate docFreq's to use when
processing their queries. It would seem to make sense to build the
aggregate docFreq table in the central dispatch node, and so I agree it
therefore makes more sense to weight the query terms on the central node
rather than doing it separately on each remote node.

There needs to be a way to create the aggregate docFreq table and keep
it current under incremental changes to the indices on the various
remote nodes. One approach might be to always maintain the complete
aggregate docFreq table on the central dispatch node and have any remote
node that performs an indexing operation issue a delta-docFreq table to
the central dispatch node (i.e. a table of the changes in its docFreq
values). If you build the cache incrementally on the central dispatch
node (i.e., on demand as terms are used in Query's), this process would
seem to be more difficult, unless the central dispatch node keeps a
separate cache for each remote node. It could invalidate a remote
node's entire cache after an index operation, but this would lead to
slow subsequent query processing (reacquiring all the docFreq values)
and therefore could lead to poor performance in, for example, a
"realtime" indexing environment.

So, it seems to me that keeping a complete aggregate docFreq table on
the central dispatch node that is updated after each remote index update
would be a good way to go. This table shouldn't be that much larger than any
single remote node docFreq table assuming the terms are substantially
the same in each index (although perhaps this isn't true, especially if
highly infrequent terms dominate the tables as is probably the case? I
think you suggested something about dropping such infrequent terms from
the aggregate table to address this issue and assuming a docFreq of 1).
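
The delta-docFreq idea from this message could be sketched as follows. This assumes each remote node reports, per term, how much its docFreq changed after an indexing operation; DeltaDocFreqSketch and applyDelta are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

class DeltaDocFreqSketch {
    private final Map<String, Integer> aggregate = new HashMap<>();

    /** Applies one node's docFreq changes (positive or negative) to the aggregate table. */
    void applyDelta(Map<String, Integer> delta) {
        for (Map.Entry<String, Integer> e : delta.entrySet()) {
            int updated = docFreq(e.getKey()) + e.getValue();
            if (updated > 0) {
                aggregate.put(e.getKey(), updated);
            } else {
                aggregate.remove(e.getKey()); // term no longer occurs anywhere
            }
        }
    }

    int docFreq(String term) {
        return aggregate.getOrDefault(term, 0);
    }
}
```

Only the changed terms cross the wire, so the dispatch node never has to re-fetch a node's whole table after an update.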

Is there a better way, or perhaps I'm missing something?

Chuck



siberski at l3s

Jan 12, 2005, 11:22 AM

Post #9 of 51 (259 views)
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Chuck Williams wrote:
> I've read through Wolf's patch and see a few issues (please correct
> anything wrong here):
> 1. DfMapSimilarity works only with a limited set of queries.[...]
> 2. The patch hardwires the use of DfMapSimilarity into MultiSearcher.[...]
> 3. Philosophically, I'm not convinced that Similarity's are the right
> solution.[...]
I agree with all three points.

Regarding point 1, IMHO it will be very difficult to find an
efficient algorithm for all types of queries, because MultiSearcher
doesn't know in advance for which terms the idfs have to be provided,
and we don't want a bidirectional call relationship between MultiSearcher
and RemoteSearchables (or do we?). What we can do is make the framework
flexible enough that each user can trade efficiency vs. query complexity
by configuring the MultiSearcher according to his needs.

Point 2 can be solved, I just haven't found the right solution.

Point 3 is completely right, too. I was looking for a way to make this
work without too much redesign, but Similarity just isn't the right location.

Doug Cutting wrote:
> The root of the bug is in MultiSearcher.search(). This should construct
> a Weight, weight the query, then score the now-weighted query.

Indeed, Weight is the appropriate abstraction which needs to be modified.

Chuck Williams wrote:
> This is a nice solution! By having MultiSearcher create the Weight, it
> can pass itself in as the searcher, thereby allowing the correct
> docFreq() method to be called. This is similar to what I tried to do
> with topmostSearcher, but a much better way to do it.

This still wouldn't work for RemoteSearchables, except if you allow
call-backs from each RemoteSearchable to the MultiSearcher. For
this, MultiSearcher would have to be remotely callable, too. As I said
above, IMHO we should stay with a simple client/server model here.
From the MultiSearcher's perspective, we just want to query several
information sources instead of one. If this implied that we have to
expose ourselves as a server, it would impose too great a demand (IMHO).
Of course, for some applications this might be the way to go, but
I think we shouldn't make it mandatory.

However, to avoid callbacks the weight implementations
will need to change significantly, because currently they delegate
(via Query->Similarity) to the Searcher. Instead the MultiSearcher
would have to provide them with sufficient information which is then
used directly by the weight (in the same manner as DfMapSimilarity
works in my patch).

I'll take a deeper look at the different Weight implementations
in the next few days to see how this could be done.

--Wolf



cutting at apache

Jan 12, 2005, 11:55 AM

Post #10 of 51 (261 views)
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Chuck Williams wrote:
> There needs to be a way to create the aggregate docFreq table and keep
> it current under incremental changes to the indices on the various
> remote nodes.

I think you're getting ahead of yourself. Searchers are based on
IndexReaders, and hence docFreqs don't change until a new Searcher is
created. So long as this is true, and the central dispatch node uses a
searcher, then a simple cache, perhaps that is pre-fetched, is all
that's feasible. It shouldn't take that long to pre-fetch the cache
when indexes are re-opened. Let's run before we sprint, and hey, let's
even walk first by fixing the bug in question.

Doug



cutting at apache

Jan 12, 2005, 12:04 PM

Post #11 of 51 (261 views)
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Wolf Siberski wrote:
> Chuck Williams wrote:
>
>> This is a nice solution! By having MultiSearcher create the Weight, it
>> can pass itself in as the searcher, thereby allowing the correct
>> docFreq() method to be called. This is similar to what I tried to do
>> with topmostSearcher, but a much better way to do it.
>
> This still wouldn't work for RemoteSearchables, except if you allow
> call-backs from each RemoteSearchable to the MultiSearcher.

I don't see what callbacks are required. When the Weight is constructed
it invokes docFreq for each term, which, if RemoteSearchables are
involved, will result in IPC calls to those RemoteSearchables. Then,
the Weight object is serialized to each RemoteSearchable and a TopDocs
is returned. Where are the callbacks? These are only required for
HitCollector-based methods, which are not advised with RemoteSearchable.

> For
> this, MultiSearcher would have to be remotely callable, too.

A MultiSearcher can be made remotely callable by wrapping it in a
RemoteSearchable, if that's required. But I'm not sure that's your
concern here.

> As I said
> above, IMHO we should stay with a simple client/server model here.

I think we would still have a simple model, unless I'm missing something.

Doug



chuck at manawiz

Jan 12, 2005, 12:18 PM

Post #12 of 51 (262 views)
RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Doug Cutting wrote:
> Searchers are based on
> IndexReaders, and hence docFreqs don't change until a new Searcher is
> created. So long as this is true, and the central dispatch node uses a
> searcher, then a simple cache, perhaps that is pre-fetched, is all
> that's feasible. It shouldn't take that long to pre-fetch the cache
> when indexes are re-opened.
and:
> I don't see what callbacks are required. When the Weight is constructed
> it invokes docFreq for each term, which, if RemoteSearchables are
> involved, will result in IPC calls to those RemoteSearchables. Then,

I don't understand the first statement, and don't see how these
statements are consistent (which is probably due to a misunderstanding
in how all this works). The purpose of the aggregate docFreq table is
to avoid the need to issue IPC calls to each RemoteSearchable for each
Term in each Query, as Query's can have very large numbers of Term's
(e.g., with RangeQuery's). Also, I can't believe the central dispatch
node will have all the indices open on each RemoteNode? The dispatch
node should have just the MultiSearcher, which is accessing the remote
nodes via the RemoteSearchable's. So how does a remote node updating
its index and reopening its Searcher cause an aggregate docFreq table on
the dispatcher to get updated? It is for this purpose that I suggested
a callback from the remote node to the dispatch node that passed a
delta-docFreq table so that this central aggregate table can be updated
easily and efficiently.

The first-order fix would seem to be your original proposal, which
requires the IPC-calls for each Term in each Query. This seems
straightforward to implement, although a bit tedious, as everything
needs to be changed to work with Weight's instead of Query's (and the
Query methods need to be maintained for backward compatibility).
Assuming this is too slow due to a barrage of IPC calls for non-trivial
queries, then the performance optimization is to introduce the central
aggregate docFreq table with a mechanism to keep it correct under remote
node index updates.

If I've got something fundamentally wrong in my presuppositions here,
please help me understand.

Thanks,

Chuck




siberski at l3s

Jan 12, 2005, 5:07 PM

Post #13 of 51 (268 views)
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Doug Cutting wrote:
> Wolf Siberski wrote:
>
>> Chuck Williams wrote:
>>
>>> This is a nice solution! By having MultiSearcher create the Weight, it
>>> can pass itself in as the searcher, thereby allowing the correct
>>> docFreq() method to be called. This is similar to what I tried to do
>>> with topmostSearcher, but a much better way to do it.
>>
>> This still wouldn't work for RemoteSearchables, except if you allow
>> call-backs from each RemoteSearchable to the MultiSearcher.
>
> I don't see what callbacks are required. When the Weight is constructed
> it invokes docFreq for each term, which, if RemoteSearchables are
> involved, will result in IPC calls to those RemoteSearchables. Then,
> the Weight object is serialized to each RemoteSearchable and a TopDocs
> is returned. Where are the callbacks? These are only required for
> HitCollector-based methods, which are not advised with RemoteSearchable.

Yes, I agree. I just wanted to point out that the current Weight
implementations need to be modified heavily to introduce the
behaviour you describe above. For example, take a look at
TermQuery.TermWeight.scorer():
[...]
return new TermScorer(this, termDocs, getSimilarity(searcher),
reader.norms(term.field()));

This typically results in a call to searcher.getSimilarity().
In the new context, the searcher would be a MultiSearcher,
and to resolve that call at one of the RemoteSearchables, the
method getSimilarity() would have to be called remotely on it.
In this case, we can change it so that the Weight is provided
with the Similarity object before it is serialized and sent
to the RemoteSearchables. But I'm not sure if all these cases
can be resolved that easily. As you already have pointed out,
it won't be possible for HitCollector-related Weights.
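
As a rough illustration of a Weight that is given what it needs before it is serialized, here is a sketch using plain Java serialization. TermWeight here is a hypothetical stand-in, not Lucene's TermQuery.TermWeight, and the scoring formula is a simplification chosen for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializableWeightSketch {
    /** Hypothetical weight that carries everything scoring needs. */
    static final class TermWeight implements Serializable {
        private static final long serialVersionUID = 1L;
        final String term;
        final float idf; // computed on the dispatch node from collection-wide docFreqs

        TermWeight(String term, float idf) {
            this.term = term;
            this.idf = idf;
        }

        /** Scores locally without asking any searcher for statistics. */
        float score(int termFreq) {
            return (float) Math.sqrt(termFreq) * idf;
        }
    }

    /** Serializes and deserializes the weight, as a remote call would. */
    static TermWeight roundTrip(TermWeight weight) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(weight);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TermWeight) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }
}
```

Since the deserialized copy holds its own idf, the receiving side has no reason to call back to the searcher that created it.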

But, as I said, I still agree fully with the approach.





chuck at manawiz

Jan 12, 2005, 5:19 PM

Post #14 of 51 (259 views)
RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

I think there is another problem here. It is currently the Weight
implementations that do rewrite(), which requires access to the index,
not just to the idf's. E.g., RangeQuery.rewrite() must find the terms
in the index within the range. So, the Weight cannot be computed in the
MultiSearcher, as it does not have direct access to the remote index.

This seems to put the viability of the whole approach into question.
The better approach may be to distribute an aggregate docFreq table to
each remote node. A simple interim step could be to support a callback
to the dispatcher node from docFreq on the remote node, although this
would be gross (remote node calls dispatcher node to get docFreq which
in turn calls all remote nodes to get all their docFreqs and sum them).

We need an aggregate docFreq table, and it needs to be on the remote
nodes since the Weight's cannot be computed until after the Query is
rewritten, which requires access to the index on the remote node.

Chuck



paul.elschot at xs4all

Jan 13, 2005, 1:17 AM

Post #15 of 51 (261 views)
Permalink
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

On Thursday 13 January 2005 01:19, Chuck Williams wrote:
> I think there is another problem here. It is currently the Weight
> implementations that do rewrite(), which requires access to the index,
> not just to the idf's. E.g., RangeQuery.rewrite() must find the terms
> in the index within the range. So, the Weight cannot be computed in the
> MultiSearcher, as it does not have direct access to the remote index.
>
> This seems to put the viability of the whole approach into question.
> The better approach may be to distribute an aggregate docFreq table to
> each remote node. A simple interim step could be to support a callback
> to the dispatcher node from docFreq on the remote node, although this
> would be gross (remote node calls dispatcher node to get docFreq which
> in turn calls all remote nodes to get all their docFreqs and sum them).
>
> We need an aggregate docFreq table, and it needs to be on the remote
> nodes since the Weight's cannot be computed until after the Query is
> rewritten, which requires access to the index on the remote node.

An alternative is to rewrite against a central cache, which is possible
because it contains all terms and their total document frequencies.
After that all terms and their weights can be sent to the remote searchers,
which can then drop the terms that they don't have.

If it is possible to send a truncated term (or a range) with a centrally
determined weight to the remote searcher, this would avoid sending all terms
to all remote searchers.
In that case the remote searchers might rewrite again to
select only the terms they have indexed themselves.

The question then is whether it is possible to send the query extended with
weights to the remote searchers. Sounds doable to me.
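
A minimal sketch of the idea above, assuming the central cache can be
modelled as a sorted map from term to total document frequency. The
class and method names (CentralRewriteSketch, expandRange,
pruneToLocal) are hypothetical and are not Lucene APIs; the sketch only
shows a range being expanded centrally and then pruned on a remote
node to the terms that node actually has.

```java
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch; none of these names are Lucene APIs.
class CentralRewriteSketch {

    /** Central node: expand [lower, upper] into the terms of the central table. */
    static SortedMap<String, Integer> expandRange(TreeMap<String, Integer> centralDocFreqs,
                                                  String lower, String upper) {
        // Copy the inclusive sub-range so callers cannot modify the central table.
        return new TreeMap<>(centralDocFreqs.subMap(lower, true, upper, true));
    }

    /** Remote node: keep only the expanded terms that occur in the local index. */
    static SortedMap<String, Integer> pruneToLocal(SortedMap<String, Integer> expanded,
                                                   Set<String> localTerms) {
        SortedMap<String, Integer> kept = new TreeMap<>(expanded);
        kept.keySet().retainAll(localTerms);
        return kept;
    }
}
```

The pruned map still carries the total document frequencies, so a
remote node would score with the centrally determined values rather
than its own local counts.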

It's losing simplicity, though. OTOH, with a replicated cache, much the same
thing would need to be done remotely.

Regards,
Paul Elschot.

P.S. Are you sure it is worthwhile to do this?
Term density (and its square root, tf()) varies much more than idf nowadays.



chuck at manawiz

Jan 13, 2005, 9:52 AM

Post #16 of 51 (261 views)
Permalink
RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

It's a good point that the aggregate idf table holds enough information
to do the rewrite()'s. So MultiSearcher can compute the Weights, which
avoids the need to distribute the aggregate tables to the remote nodes.
It is still necessary to compute them and keep them current under index
updates on the remote nodes, for which a delta-docFreq table still seems
to me to be a good approach.

I think idf() is necessary for decent scoring / relevance-ranking and so
this is essential to do. With Paul's observation, one complicating step
has been removed.

Chuck



cutting at apache

Jan 13, 2005, 10:04 AM

Post #17 of 51 (260 views)
Permalink
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Wolf Siberski wrote:
> Yes, I agree. I just wanted to point out that the current Weight
> implementations need to be modified heavily to introduce the
> behaviour you describe above. For example, take a look at
> TermQuery.TermWeight.scorer():
> [...]
> return new TermScorer(this, termDocs, getSimilarity(searcher),
> reader.norms(term.field()));
>
> This typically results in a call to searcher.getSimilarity().
> In the new context, the searcher would be a MultiSearcher,
> and to resolve that call at one of the RemoteSearchables, the
> method getSimilarity() would have to be called remotely on it.

I think this can be handled by:

a. declaring TermQuery.searcher transient -- this should never be needed
remotely and we don't want to serialize it; and

b. adding a non-transient Similarity field to TermQuery.Weight. A
Similarity instance should be efficient to serialize with each Weight.
An instance of DefaultSimilarity has no non-static fields, and static
fields are not serialized, so its serialization mostly consists of just
its class name. Also, serializing the Similarity is required for
correct behaviour, since we need to run some Similarity methods locally
(idf, queryNorm) and some remotely (tf, coord).

We should probably factor this behavior into a Weight base class, since
every Weight implementation should do this the same way.
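
A minimal sketch of points (a) and (b), using standard Java
serialization. SketchSimilarity, SketchSearcher and SketchWeight are
hypothetical stand-ins, not the Lucene classes, and the fixed docFreq
and maxDoc values are assumptions chosen to keep the example small.
It shows the searcher being dropped on serialization while the
Similarity and the locally computed idf travel with the weight.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-ins for illustration; these are not the Lucene classes.
class SketchSimilarity implements Serializable {
    private static final long serialVersionUID = 1L;
    float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
    float tf(float freq) { return (float) Math.sqrt(freq); }
}

class SketchSearcher {                       // deliberately NOT Serializable
    final SketchSimilarity similarity = new SketchSimilarity();
    int docFreq(String term) { return 9; }   // assumed fixed value
    int maxDoc() { return 100; }             // assumed fixed value
}

class SketchWeight implements Serializable {
    private static final long serialVersionUID = 1L;
    transient SketchSearcher searcher;       // (a) never sent to the remote side
    final SketchSimilarity similarity;       // (b) travels with the weight
    final float idf;                         // computed locally, before serialization

    SketchWeight(SketchSearcher searcher, String term) {
        this.searcher = searcher;
        this.similarity = searcher.similarity;
        this.idf = similarity.idf(searcher.docFreq(term), searcher.maxDoc());
    }

    /** Runs on the remote side: uses only fields that survived serialization. */
    float score(float termFreq) { return similarity.tf(termFreq) * idf; }

    /** Serializes and deserializes the weight, imitating the trip to a remote node. */
    static SketchWeight roundTrip(SketchWeight w) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(w);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SketchWeight) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

After the round trip the searcher field is null, so any remote code
path that still reaches for the searcher fails fast instead of
silently calling back to the dispatcher.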

Does this make sense?

Doug




cutting at apache

Jan 13, 2005, 10:14 AM

Post #18 of 51 (261 views)
Permalink
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Chuck Williams wrote:
> I think there is another problem here. It is currently the Weight
> implementations that do rewrite(), which requires access to the index,
> not just to the idf's. E.g., RangeQuery.rewrite() must find the terms
> in the index within the range. So, the Weight cannot be computed in the
> MultiSearcher, as it does not have direct access to the remote index.

rewrite() is actually called before the weight is constructed. In the
remote case, rewrite() is another IPC. So, when a query is executed on
a MultiSearcher of RemoteSearchables, the following remote calls are made:

1. RemoteSearchable.rewrite(Query) is called
2. RemoteSearchable.docFreq(Term) is called for each term in the
rewritten query while constructing a Weight.
3. RemoteSearchable.search(Weight, ...) is called.

So I don't think this is a problem.
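
The call sequence above can be sketched roughly as follows. The
SketchSearchable interface is a simplified, hypothetical stand-in that
does not match the real Searchable signatures: queries are plain
prefixes, a weight is just an idf value per term, and the idf formula
is an assumption for illustration. Step 3 (sending the weights to each
node's search method) is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical, simplified stand-in for a remote searchable.
interface SketchSearchable {
    List<String> rewrite(String prefix);   // step 1: expand the query into terms
    int docFreq(String term);              // step 2: per-node document frequency
    int maxDoc();
}

class InMemoryNode implements SketchSearchable {
    private final Map<String, Integer> docFreqs;
    private final int maxDoc;

    InMemoryNode(Map<String, Integer> docFreqs, int maxDoc) {
        this.docFreqs = docFreqs;
        this.maxDoc = maxDoc;
    }
    public List<String> rewrite(String prefix) {
        List<String> terms = new ArrayList<>();
        for (String t : docFreqs.keySet()) {
            if (t.startsWith(prefix)) terms.add(t);
        }
        return terms;
    }
    public int docFreq(String term) { return docFreqs.getOrDefault(term, 0); }
    public int maxDoc() { return maxDoc; }
}

class MultiSearchSketch {
    /** Dispatcher side: builds term -> idf using docFreqs summed over all nodes. */
    static Map<String, Float> buildWeights(String prefix, List<SketchSearchable> nodes) {
        SortedSet<String> terms = new TreeSet<>();
        int totalDocs = 0;
        for (SketchSearchable node : nodes) {
            terms.addAll(node.rewrite(prefix));        // step 1, one call per node
            totalDocs += node.maxDoc();
        }
        Map<String, Float> weights = new TreeMap<>();
        for (String term : terms) {
            int df = 0;
            for (SketchSearchable node : nodes) {
                df += node.docFreq(term);              // step 2, one call per term and node
            }
            weights.put(term, (float) (Math.log(totalDocs / (double) (df + 1)) + 1.0));
        }
        return weights;                                // step 3 would ship these to each node
    }
}
```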

Doug



chuck at manawiz

Jan 13, 2005, 10:37 AM

Post #19 of 51 (264 views)
Permalink
RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

It just seems like a lot of IPC activity for each query. As things
stand now, I think you are proposing this?
1. MultiSearcher calls the remote node to rewrite the query,
requiring serialization of the query.
2. The remote node returns the rewritten query to the dispatcher
node, which requires serialization of the (potentially much larger)
rewritten query.
3. The dispatcher node computes the weights. This requires a call to
each remote node for each term in the query to compute the docFreq's;
this can be an extremely large number of IPC calls (e.g., 1,000 terms in
a rewritten query times 10 remote nodes = 10,000 IPC calls).
4. The weights are serialized (including the serialized Similarity's)
and passed back to remote node.
5. The remote nodes execute the queries and pass results back to the
dispatcher node for collation.

Is that right? This seems pretty expensive to me.

If we had a central docFreq table on the dispatcher node, the query
processing could be much simpler:
1. MultiSearcher rewrites the query and computes the weights, all
locally on the central node.
2. The rewritten query with weights is passed to each remote node (=
10 IPC calls in the example case above).
3. Each remote node processes the rewritten query. Here, the remote
node could rewrite the query again to eliminate term expansions for
terms it doesn't have as Paul suggests, or it could omit this step (I
believe the only difference in result is scoring, and it's not clear to
me the best way to score this case).
4. The results are passed back and collated.
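
As a rough worked count of the two flows, assuming one rewrite call
and one search call per remote node in the first flow (my reading of
the steps, not a measured figure). IpcCountSketch and its methods are
hypothetical names.

```java
// Hypothetical helper that only counts remote calls; it performs no IPC.
class IpcCountSketch {

    /** Per-query flow: rewrite per node, docFreq per (term, node), search per node. */
    static long perQueryFlow(int terms, int nodes) {
        return (long) nodes + (long) terms * nodes + nodes;
    }

    /** Central-table flow: only the weighted query is sent, once per node. */
    static long centralTableFlow(int nodes) {
        return nodes;
    }
}
```

With 1,000 terms and 10 nodes this gives 10 + 10,000 + 10 = 10,020
calls for the first flow against 10 for the second.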

If the aggregate docFreq table was replicated to each remote node, then
only the raw query would need to be passed as the remote nodes could
each do the rewriting and weighting. However, this would be offset by
the extra complexity to manage the distribution of the aggregate tables,
which is probably not worth it.

The methods required to keep an accurate central docFreq table could be:
1. Compute it initially by having the central node obtain and sum the
contributions from each remote node.
2. On each incremental index on a remote node, send the central node
a set of deltas for each term whose docFreq was changed by the
incremental index.
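
A minimal sketch of these two maintenance steps, assuming deltas are
signed per-term counts. CentralDocFreqTable and its methods are
hypothetical names, not part of Lucene, and the sketch ignores
transport and concurrency entirely.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical central table; not a Lucene class.
class CentralDocFreqTable {
    private final Map<String, Integer> totals = new HashMap<>();

    /** Step 1: add one remote node's full docFreq contribution. */
    void addNode(Map<String, Integer> nodeDocFreqs) {
        applyDelta(nodeDocFreqs);
    }

    /** Step 2: apply signed per-term deltas sent after an incremental index. */
    void applyDelta(Map<String, Integer> deltas) {
        for (Map.Entry<String, Integer> e : deltas.entrySet()) {
            int updated = totals.merge(e.getKey(), e.getValue(), Integer::sum);
            if (updated <= 0) {
                totals.remove(e.getKey());   // term no longer occurs anywhere
            }
        }
    }

    int docFreq(String term) {
        return totals.getOrDefault(term, 0);
    }
}
```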

I think the question is how frequent and how expensive would those two
steps be in comparison to the difference in the query processing.

Chuck



cutting at apache

Jan 13, 2005, 11:29 AM

Post #20 of 51 (259 views)
Permalink
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Chuck Williams wrote:
> It just seems like a lot of IPC activity for each query. As things
> stand now, I think you are proposing this?
> 1. MultiSearcher calls the remote node to rewrite the query,
> requiring serialization of the query.
> 2. The remote node returns the rewritten query to the dispatcher
> node, which requires serialization of the (potentially much larger)
> rewritten query.
> 3. The dispatcher node computes the weights. This requires a call to
> each remote node for each term in the query to compute the docFreq's;
> this can be an extremely large number of IPC calls (e.g., 1,000 terms in
> a rewritten query times 10 remote nodes = 10,000 IPC calls).
> 4. The weights are serialized (including the serialized Similarity's)
> and passed back to remote node.
> 5. The remote nodes execute the queries and pass results back to the
> dispatcher node for collation.
>
> Is that right? This seems pretty expensive to me.

I think that's right. For simple queries with a couple of terms it
should not be too expensive. For queries that expand into thousands of
terms, yes, it is expensive, but these are slow queries anyway. It's
not clear how much worse this would make them. Yes, we might optimize
it some, but first let's get things working correctly!

An easy way to "optimize" this is to avoid queries that expand into
large numbers of terms. I've never permitted wildcard, fuzzy or range
queries in any system that I've deployed: they're simply too slow. When
I need, e.g., date ranges, I use a Filter instead. The auto-filter
proposal I've made could make this a lot easier. So I'd like to see
that implemented before I worry about optimizing remote range or
wildcard queries.
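
To illustrate why a filter sidesteps term expansion, here is a minimal
sketch that models a date-range filter as a bit set of allowed
document numbers. DateRangeFilterSketch is a hypothetical name and
this is not the Lucene Filter class; the per-document date array is an
assumption standing in for the index.

```java
import java.util.BitSet;

// Hypothetical model of a filter as a bit set of allowed document numbers.
class DateRangeFilterSketch {

    /** dates[i] is the yyyyMMdd date of document i; both bounds are inclusive. */
    static BitSet bits(int[] dates, int from, int to) {
        BitSet allowed = new BitSet(dates.length);
        for (int doc = 0; doc < dates.length; doc++) {
            if (dates[doc] >= from && dates[doc] <= to) {
                allowed.set(doc);
            }
        }
        return allowed;
    }

    /** Restricts the query's hits to the allowed documents without touching scores. */
    static BitSet apply(BitSet hits, BitSet allowed) {
        BitSet result = (BitSet) hits.clone();
        result.and(allowed);
        return result;
    }
}
```

Because the range never becomes query terms, no docFreq lookups or
weights are needed for it, locally or remotely.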

> If we had a central docFreq table on the dispatcher node, the query
> processing could be much simpler:

Perhaps, but that's a big "if".

> 1. MultiSearcher rewrites the query and computes the weights, all
> locally on the central node.

This could require a substantial change to rewrite implementations.
Rewriting is currently passed a full IndexReader: a central docFreq
table is not a full IndexReader. So we could add a search API for term
enumeration independent of an IndexReader, then change all rewrite
implementations to use this, and hope that none require other aspects of
the IndexReader.

> 2. The rewritten query with weights is passed to each remote node (=
> 10 IPC calls in the example case above).

This still serializes a huge query. A central docFreq table only
provides a constant factor improvement. The rewritten query only needs
to travel one-way rather than round-trip.

> If the aggregate docFreq table was replicated to each remote node, then
> only the raw query would need to be passed as the remote nodes could
> each do the rewriting and weighting. However, this would be offset by
> the extra complexity to manage the distribution of the aggregate tables,
> which is probably not worth it.
>
> The methods required to keep an accurate central docFreq table could be:
> 1. Compute it initially by having the central node obtain and sum the
> contributions from each remote node.
> 2. On each incremental index on a remote node, send the central node
> a set of deltas for each term whose docFreq was changed by the
> incremental index.

This sounds very hairy to me.

The delta approach is problematic. At present a Searchable instance,
like an IndexReader, does not change the set of documents it searches.
At present, when you want to search an updated collection you construct
a new Searcher. So this is (again) a substantive change. It means,
e.g., if folks cache things based on the Searcher that these caches
might become invalid.

> I think the question is how frequent and how expensive would those two
> steps be in comparison to the difference in the query processing.

I think the first question is: can we get RemoteSearchables to work
correctly and reasonably efficiently for simple queries?

Doug



paul.elschot at xs4all

Jan 13, 2005, 12:06 PM

Post #21 of 51 (259 views)
Permalink
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

On Thursday 13 January 2005 19:29, Doug Cutting wrote:
> Chuck Williams wrote:
> > It just seems like a lot of IPC activity for each query. As things
> > stand now, I think you are proposing this?
> > 1. MultiSearcher calls the remote node to rewrite the query,
> > requiring serialization of the query.
> > 2. The remote node returns the rewritten query to the dispatcher
> > node, which requires serialization of the (potentially much larger)
> > rewritten query.
> > 3. The dispatcher node computes the weights. This requires a call to
> > each remote node for each term in the query to compute the docFreq's;
> > this can be an extremely large number of IPC calls (e.g., 1,000 terms in
> > a rewritten query times 10 remote nodes = 10,000 IPC calls).
> > 4. The weights are serialized (including the serialized Similarity's)
> > and passed back to remote node.
> > 5. The remote nodes execute the queries and pass results back to the
> > dispatcher node for collation.
> >
> > Is that right? This seems pretty expensive to me.
>
> I think that's right. For simple queries with a couple of terms it
> should not be too expensive. For queries that expand into thousands of
> terms, yes, it is expensive, but these are slow queries anyway. It's
> not clear how much worse this would make them. Yes, we might optimize
> it some, but first let's get things working correctly!
>
> An easy way to "optimize" this is to avoid queries that expand into
> large numbers of terms. I've never permitted wildcard, fuzzy or range
> queries in any system that I've deployed: they're simply too slow. When
> I need, e.g., date ranges, I use a Filter instead. The auto-filter
> proposal I've made could make this a lot easier. So I'd like to see
> that implemented before I worry about optimizing remote range or
> wildcard queries.
>
> > If we had a central docFreq table on the dispatcher node, the query
> > processing could be much simpler:
>
> Perhaps, but that's a big "if".
>
> > 1. MultiSearcher rewrites the query and computes the weights, all
> > locally on the central node.
>
> This could require a substantial change to rewrite implementations.
> Rewriting is currently passed a full IndexReader: a central docFreq
> table is not a full IndexReader. So we could add a search API for term
> enumeration independent of an IndexReader, then change all rewrite
> implementations to use this, and hope that none require other aspects of
> the IndexReader.
>
> > 2. The rewritten query with weights is passed to each remote node (=
> > 10 IPC calls in the example case above).
>
> This still serializes a huge query. A central docFreq table only
> provides a constant factor improvement. The rewritten query only needs
> to travel one-way rather than round-trip.

One can pass the original query, with only changed query weights
to take into account the global aspects of the idf.
Term expansion would have to be done centrally to determine the idf weight
factors, and also locally to do the actual searching and scoring without
further idf computations.

Perhaps an easy way to send the term frequencies to the central node is
by sending the field info and term dictionary of each local index segment.
It's not ideal because the FreqDelta, ProxDelta, and SkipDelta are
superfluous, but it would be a relatively easy start.
Even the existing segment merging could be partially reused for central
summing of the document frequencies.

Regards,
Paul Elschot




chuck at manawiz

Jan 13, 2005, 12:31 PM

Post #22 of 51 (261 views)
Permalink
RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

If auto-filters can provide an effective implementation for RangeQuery's
that avoids rewriting, and we can give up MultiTermQuery and PrefixQuery
in the distributed environment, then how about something like this
refinement:
1. No rewriting is done.
2. The central node maintains a cache of aggregate docFreq data that
is incrementally built on demand, and flushed after any remote node
opens a new Searcher.
3. The central node computes the Weights by accessing the docFreq for
each query term. This looks the value up in the cache, or queries it
from each remote node, sums the results, and caches the result.

This seems simple and avoids a great deal of IPC traffic, especially in
the common case where popular query terms are frequently reused.
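
A minimal sketch of steps 2 and 3, assuming each remote node can be
modelled as a function from term to local docFreq. CachingDocFreqSketch
is a hypothetical name; the remoteCalls counter exists only to make
the saving visible and would not be part of a real implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical on-demand cache of aggregate docFreq values on the central node.
class CachingDocFreqSketch {
    private final List<ToIntFunction<String>> nodes;   // each stands in for one remote docFreq call
    private final Map<String, Integer> cache = new HashMap<>();
    private int remoteCalls = 0;

    CachingDocFreqSketch(List<ToIntFunction<String>> nodes) {
        this.nodes = nodes;
    }

    /** Returns the summed docFreq, asking the remote nodes only on a cache miss. */
    int docFreq(String term) {
        Integer cached = cache.get(term);
        if (cached != null) {
            return cached;
        }
        int sum = 0;
        for (ToIntFunction<String> node : nodes) {
            sum += node.applyAsInt(term);
            remoteCalls++;
        }
        cache.put(term, sum);
        return sum;
    }

    /** To be called after any remote node opens a new Searcher. */
    void flush() {
        cache.clear();
    }

    int remoteCalls() {
        return remoteCalls;
    }
}
```

A repeated term costs no remote calls until the next flush, which is
where the saving for popular query terms would come from.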

I presume the auto-filters get pushed out to each remote node as part of
the query?

Chuck

> > If the aggregate docFreq table was replicated to each remote node,
> then
> > only the raw query would need to be passed as the remote nodes
could
> > each do the rewriting and weighting. However, this would be
offset by
> > the extra complexity to manage the distribution of the aggregate
> tables,
> > which is probably not worth it.
> >
> > The methods required to keep an accurate central docFreq table
could
> be:
> > 1. Compute it initially by having the central node obtain and
sum
> the
> > contributions from each remote node.
> > 2. On each incremental index on a remote node, send the central
> node
> > a set of deltas for each term whose docFreq was changed by the
> > incremental index.
>
> This sounds very hairy to me.
>
> The delta approach is problematic. At present a Searchable
instance,
> like an IndexReader, does not change the set of documents it
searches.
> At present, when you want to search an updated collection you
construct
> a new Searcher. So this is (again) a substantive change. It means,
> e.g., if folks cache things based on the Searcher that these caches
> might become invalid.
>
> > I think the question is how frequent and how expensive would those
two
> > steps be in comparison to the difference in the query processing.
>
> I think the first question is: can we get RemoteSearchables to work
> correctly and reasonably efficiently for simple queries?
>
> Doug


cutting at apache

Jan 13, 2005, 12:41 PM

Post #23 of 51 (261 views)
Permalink
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Chuck Williams wrote:
> If auto-filters can provide an effective implementation for RangeQuery's
> that avoids rewriting, and we can give up MultiTermQuery and PrefixQuery
> in the distributed environment, then how about something like this
> refinement:
> 1. No rewriting is done.

It would indeed be nice to be able to short-circuit rewriting for
queries where it is a no-op. Do you have a proposal for how this could
be done?

> 2. The central node maintains a cache of aggregate docFreq data that
> is incrementally built on demand, and flushed after any remote node
> opens a new Searcher.
> 3. The central node computes the Weights by accessing the docFreq for
> each query term. This looks the value up in the cache, or queries it
> from each remote node, sums the results, and caches the result.
>
> This seems simple and avoids a great deal of IPC traffic, especially in
> the common case where popular query terms are frequently reused.

I think this sort of a docFreq cache would be easy to build into either
MultiSearcher or RemoteSearchable.

> I presume the auto-filters get pushed out to each remote node as part of
> the query?

They're not yet implemented, so we don't know. One implementation would
be that Scorers would automatically use filters for amenable query
clauses. If that's the way things are done then yes, the filters would
essentially be a part of the query. No matter how they're implemented,
we should take care to consider remote performance.

Doug



chuck at manawiz

Jan 13, 2005, 1:53 PM

Post #24 of 51 (263 views)
Permalink
RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Doug Cutting wrote:
> It would indeed be nice to be able to short-circuit rewriting for
> queries where it is a no-op. Do you have a proposal for how this
> could be done?

First, this gets into the other part of Bug 31841. I don't believe
MultiSearcher.rewrite() is ever called. Rewriting is done in the
Weight's, which invoke the rewrite() method of the Searcher, which is
always the Searcher invoked by the MultiSearcher, not the MultiSearcher
itself. In fact, MultiSearcher.rewrite() is broken. It requires
Query.combine() which is unsupported except for the derived queries
(i.e., those for which rewriting is not a no-op). When I added
topmostSearcher to get the Weight's to call the MultiSearcher.docFreq(),
that also caused them to call MultiSearcher.rewrite() which blows up on,
for example, a simple TermQuery, because there is no
TermQuery.combine(). That's why my patch contains a new default
implementation for Query.combine() (which as noted in the bug report is
probably not a good idea in general).

So, I don't believe there is any valid rewrite() implementation for
MultiSearcher to start from, unless I've completely misunderstood
something.

To address the question above, RemoteSearchable.rewrite() should be a
no-op, i.e., always return the query unchanged. For good error handling, it should
verify that the query does not require rewriting. This requires some
mechanism to determine whether or not a query requires rewriting. The
challenge here is that some query types have a non-trivial rewrite()
method not because they require rewriting, but because they might have
subqueries that require rewriting (e.g., BooleanQuery). Other query
types (e.g., MultiTermQuery) always require rewriting, while those that
implement Weight's never require it. I think an upward incompatibility
is required in the API to address this.

If that is acceptable, then this could work:
1. Add a new interface called Rewritable that specifies a boolean
rewriteRequired() method.
2. Have Query implement Rewritable but NOT provide an implementation
for rewriteRequired(). This will force all applications to add support
for this in order to upgrade.
3. Change all the Weight's to call Query.maybeRewrite() instead of
Query.rewrite().
4. Have Query.maybeRewrite() only call Query.rewrite() if
Query.rewriteRequired() is true.
5. Have RemoteSearchable.maybeRewrite() throw an Exception if
Query.rewriteRequired() is true.
6. Implement rewriteRequired() for all the built-in Query types
(which is either true for derived queries, false for primitive queries,
or the logical OR of rewriteRequired() over all the subqueries).

Maybe there's a better way, but this should work. It does require an
extra pass over the query. There is a potential hole if there are
applications that implement new primitive queries, i.e. have Weight's
that directly call Query.rewrite(). This hole could be (mostly) plugged
by renaming rewrite(), but that would introduce another upward
incompatibility.
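
A rough sketch of how the Rewritable proposal might fit together, in modern Java. Every class here is a toy stand-in with a hypothetical name (SketchQuery and friends), not Lucene's real Query hierarchy, and the prefix expansion is a fixed placeholder, since a real rewrite would enumerate terms from an index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface from the proposal; not part of Lucene.
interface Rewritable {
    boolean rewriteRequired();
}

// Toy stand-in for Query; rewriteRequired() is deliberately left abstract.
abstract class SketchQuery implements Rewritable {
    SketchQuery rewrite() { return this; } // no-op by default

    // Only rewrite when the query says it is needed.
    final SketchQuery maybeRewrite() {
        return rewriteRequired() ? rewrite() : this;
    }
}

// Primitive query: never needs rewriting.
class SketchTermQuery extends SketchQuery {
    final String term;
    SketchTermQuery(String term) { this.term = term; }
    @Override public boolean rewriteRequired() { return false; }
}

// Derived query: always needs rewriting.
class SketchPrefixQuery extends SketchQuery {
    final String prefix;
    SketchPrefixQuery(String prefix) { this.prefix = prefix; }
    @Override public boolean rewriteRequired() { return true; }
    @Override SketchQuery rewrite() {
        // Placeholder expansion; a real one would consult the index.
        return new SketchTermQuery(prefix + "a");
    }
}

// Compound query: needs rewriting if any subquery does.
class SketchBooleanQuery extends SketchQuery {
    final List<SketchQuery> clauses = new ArrayList<>();
    SketchBooleanQuery add(SketchQuery q) { clauses.add(q); return this; }
    @Override public boolean rewriteRequired() {
        for (SketchQuery q : clauses) {
            if (q.rewriteRequired()) return true;
        }
        return false;
    }
    @Override SketchQuery rewrite() {
        SketchBooleanQuery out = new SketchBooleanQuery();
        for (SketchQuery q : clauses) out.add(q.maybeRewrite());
        return out;
    }
}

// Remote side: refuses queries that would need rewriting.
class SketchRemoteSearchable {
    SketchQuery maybeRewrite(SketchQuery q) {
        if (q.rewriteRequired()) {
            throw new UnsupportedOperationException("query requires rewriting");
        }
        return q;
    }
}
```

The extra pass over the query mentioned above shows up here as the recursive rewriteRequired() call on the compound query.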

An optimization could omit the call to rewriteRequired() in
Query.maybeRewrite(), as this mechanism is really only needed in
RemoteSearchable (and could be beneficial in MultiSearcher).

There is still the need to properly implement Query.combine() for all
query types (which is greatly simplified by a good default
implementation).

Chuck



siberski at l3s

Jan 14, 2005, 7:27 AM

Post #25 of 51 (260 views)
Permalink
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? [In reply to]

Doug Cutting wrote:
> Wolf Siberski wrote:
>> In the new context, the searcher would be a MultiSearcher,
>> and to resolve that call at one of the RemoteSearchables, the
>> method getSimilarity() would have to be called remotely on it.
>
> I think this can be handled by:
>
> a. declaring TermQuery.searcher transient -- this should never be needed
> remotely and we don't want to serialize it; and
>
> b. adding a non-transient Similarity field to TermQuery.Weight.
[...]
>
> Does this make sense?
Sounds good. For my current patch, I already had to make Similarities
serializable, and it posed no problem.
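
To illustrate points (a) and (b), here is a small self-contained sketch in modern Java. The Sketch* classes are hypothetical stand-ins rather than the actual TermQuery.Weight or Similarity code, and the idf formula is included only to give the weight something to compute.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Stand-in for a serializable Similarity (illustrative formula only).
class SketchSimilarity implements Serializable {
    private static final long serialVersionUID = 1L;
    float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
}

// Stand-in for a Searcher; deliberately NOT Serializable.
class SketchSearcher { }

// Stand-in for a weight that travels to the remote node.
class SketchWeight implements Serializable {
    private static final long serialVersionUID = 1L;
    private transient SketchSearcher searcher;  // (a) never sent over the wire
    private final SketchSimilarity similarity;  // (b) serialized with the weight
    private final float idf;

    SketchWeight(SketchSearcher searcher, SketchSimilarity similarity,
                 int docFreq, int numDocs) {
        this.searcher = searcher;
        this.similarity = similarity;
        this.idf = similarity.idf(docFreq, numDocs); // computed centrally
    }

    SketchSearcher searcher() { return searcher; }
    SketchSimilarity similarity() { return similarity; }
    float idf() { return idf; }

    // Simulates shipping the weight to a remote node and back.
    static SketchWeight roundTrip(SketchWeight w) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(w);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SketchWeight) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }
}
```

After the round trip the searcher reference is null, while the Similarity and the centrally computed idf arrive intact, so the remote side has no need to call back for getSimilarity().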
