Mailing List Archive: Lucene: Java-User

Fastest way to get just the "bits" of matching documents

 

 



Robert.Stewart at INFONGEN

Jul 22, 2008, 12:37 PM

Post #1 of 5 (307 views)
Fastest way to get just the "bits" of matching documents

I need to execute a boolean query and get back just the bits of all the matching documents. I do additional filtering (date ranges and entitlements) and then do my own sorting later on. I know that using QueryFilter.Bits() will still compute scores for all matching documents, and I do not want to compute any scores. For queries with large result sets (over 5 million documents) it seems somewhat slow, and maybe computing scores is taking some of the time. I have a 10 million document index, and for some very broad queries (4-5 million matching documents) getting the bits takes about 1.5 seconds. I can do my own sorting of results for the requested page in under 30 ms, since I have efficient cached permutations for sorting by various fields. Is there a way, given a BooleanQuery, to get the matching bits without computing any scores internally? I looked at ConstantScoreQuery, but I believe it still computes scores, since it gets its bits from the underlying query anyway. In fact I tested it, and using ConstantScoreQuery is actually slower than not using it.

Is it possible to use a custom similarity class to make scoring faster (by returning 0 values, etc)?




Thanks,
Bob


eksdev at yahoo

Jul 22, 2008, 1:25 PM

Post #2 of 5 (297 views)
Re: Fastest way to get just the "bits" of matching documents [In reply to]

No, at the moment you cannot make pure boolean queries. But 1.5 seconds on 10 million documents sounds like a bit too much (we are well under 200 ms on a 150 million document collection). Here is what you can do:

1. Use a Filter for high-frequency terms, e.g. via ConstantScoreQuery, as much as you can, but you have to cache the filters (CachingWrapperFilter or something like that). SortedVIntList can help a lot in reducing the memory requirements for filter caching.
2. Use RAMDisk if the index fits in RAM, or MMAPDisk.
3. Provide more details: what is the structure of the query that takes so long, what data is in the index, and so on, so that someone can really help you. Your question is just too abstract right now.
4. Try to sort your index so that the documents you expect in a result end up close together; e.g. if you search predominantly on some number, sort the index on it, if you can. This reduces IO stress thanks to locality.
5. Try https://issues.apache.org/jira/browse/LUCENE-1340, as you do not need term frequencies for scoring.
6. Try using your own HitCollector instead of QueryFilter.Bits() to get your bits.


If you have tried all these options and it still is not fast enough, and you really do have a bottleneck in scoring (I doubt it), then you have two options:
- Wait for Paul to come back from holidays; he wanted to make "pure Boolean" queries, without scoring, possible :)
- Invest in faster CPU/memory


have fun
eks
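
Suggestion 6 in the reply above can be sketched roughly as follows. This is a minimal, self-contained illustration rather than actual Lucene code: `DocCollector` is a hypothetical stand-in that only mirrors the shape of Lucene's `HitCollector.collect(int doc, float score)` callback, and `BitSetCollector` and `filteredBy` are names invented for this example. Note that a collector like this merely ignores the score it is handed; it does not by itself stop the scorer from computing it.

```java
import java.util.BitSet;

/** Hypothetical stand-in for the shape of Lucene's HitCollector: one callback per matching document. */
interface DocCollector {
    void collect(int doc, float score);
}

/** Records matching document ids in a BitSet and ignores the score entirely. */
final class BitSetCollector implements DocCollector {
    private final BitSet bits;

    BitSetCollector(int maxDoc) {
        this.bits = new BitSet(maxDoc);
    }

    @Override
    public void collect(int doc, float score) {
        bits.set(doc); // the score argument is deliberately unused
    }

    /** The raw set of matching document ids. */
    BitSet bits() {
        return bits;
    }

    /** Intersects the collected bits with a cached filter (e.g. entitlements); returns a new BitSet. */
    BitSet filteredBy(BitSet cachedFilter) {
        BitSet result = (BitSet) bits.clone();
        result.and(cachedFilter);
        return result;
    }
}
```

With real Lucene the collector would be passed to the searcher's search method in place of `QueryFilter.Bits()`, and the cached date-range and entitlement bits would then be ANDed in, as `filteredBy` does here.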




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


Robert.Stewart at INFONGEN

Jul 24, 2008, 2:00 PM

Post #3 of 5 (264 views)
RE: Fastest way to get just the "bits" of matching documents [In reply to]

Queries are very complex in our case; some have 100 or more clauses (over several fields), including disjunctions and prohibited clauses. Some queries take over 5 seconds in total on the 10 million document index. I think it is because the queries are too big and complicated. Are there any smarter ways to optimize Boolean queries than using the default Boolean query classes? Would it make sense to pursue some custom "query optimizer"?







paul.elschot at xs4all

Jul 26, 2008, 12:47 PM

Post #4 of 5 (242 views)
Re: Fastest way to get just the "bits" of matching documents [In reply to]

On Thursday 24 July 2008 23:00:33, Robert Stewart wrote:
> Queries are very complex in our case, some have up to 100 or more
> clauses (over several fields), including disjunctions and prohibited
> clauses.

Other than the earlier advice, did you try setAllowDocsOutOfOrder() ?

Regards,
Paul Elschot
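
For the use case in this thread, out-of-order delivery of matches should be harmless, because setting bits does not depend on the order in which document ids arrive. The sketch below illustrates only that property in plain Java; `OutOfOrderDemo` and `collect` are hypothetical names, and no Lucene API is called. In the Lucene Java releases of that period the setting Paul mentions was, to the best of my knowledge, a static switch on `BooleanQuery`; treat that detail as an assumption and check it against the version you run.

```java
import java.util.BitSet;

/** Shows that collecting doc ids into a BitSet gives the same result in any arrival order. */
final class OutOfOrderDemo {
    /** Sets one bit per document id, in whatever order the ids arrive. */
    static BitSet collect(int[] docIdsInArrivalOrder, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : docIdsInArrivalOrder) {
            bits.set(doc);
        }
        return bits;
    }
}
```

Anything that relies on ascending document order, such as building a delta-encoded list directly inside the collector, would not share this property.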



Robert.Stewart at INFONGEN

Jul 28, 2008, 7:36 AM

Post #5 of 5 (223 views)
RE: Fastest way to get just the "bits" of matching documents [In reply to]

BTW, we currently use Lucene .NET, not Java, so the version is 1.9. Unfortunately we don’t have "setAllowDocsOutOfOrder", but we do have "useScorer14", which is almost the same thing for some queries. I did not see much improvement, and for other queries it was slower. We are stuck on 1.9 due to some stability issues (memory leaks?) in 2.0+ of the .NET port.

What I finally did was pre-load the fields that have a limited set of unique values into memory (using a BitArray for values matching >= num_docs/8 documents, and a SortedVIntList for those matching < num_docs/8). Then I "optimize" incoming queries by replacing instances of TermQuery on those fields with my own "CachedTermQuery" objects, which have custom weight and scorer classes that use the cached field data.
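
The choice between the two representations described above can be sketched as follows. This is a rough illustration, not Bob's code and not Lucene's `SortedVIntList`: `CachedTermDocs` and its methods are hypothetical names, and `java.util.BitSet` stands in for the .NET `BitArray`. The num_docs/8 threshold presumably reflects that a bit set costs num_docs/8 bytes regardless of how many documents match, while a delta-encoded variable-length list costs at least one byte per matching document, so the list is only the smaller option below that count.

```java
import java.io.ByteArrayOutputStream;
import java.util.BitSet;

/** Matching doc ids for one field value, held densely or sparsely depending on how many match. */
final class CachedTermDocs {
    private final BitSet dense;   // non-null when the match count is >= numDocs / 8
    private final byte[] sparse;  // delta-encoded variable-length ints otherwise
    private final int numDocs;

    /** sortedDocs must be strictly increasing doc ids in [0, numDocs). */
    CachedTermDocs(int[] sortedDocs, int numDocs) {
        this.numDocs = numDocs;
        if (sortedDocs.length >= numDocs / 8) {
            BitSet bits = new BitSet(numDocs);
            for (int doc : sortedDocs) {
                bits.set(doc);
            }
            dense = bits;
            sparse = null;
        } else {
            dense = null;
            sparse = encode(sortedDocs);
        }
    }

    boolean isDense() {
        return dense != null;
    }

    /** Bytes held by the sparse form, or -1 when the dense form is in use. */
    int sparseSizeInBytes() {
        return sparse == null ? -1 : sparse.length;
    }

    /** Expands either representation into a fresh BitSet for combining with other clauses. */
    BitSet toBitSet() {
        if (dense != null) {
            return (BitSet) dense.clone();
        }
        BitSet bits = new BitSet(numDocs);
        int doc = 0;
        int i = 0;
        while (i < sparse.length) {
            int delta = 0;
            int shift = 0;
            int b;
            do { // read 7 bits per byte, low-order group first; high bit means "more bytes follow"
                b = sparse[i++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            doc += delta;
            bits.set(doc);
        }
        return bits;
    }

    /** Writes the gap between consecutive doc ids as variable-length ints. */
    private static byte[] encode(int[] sortedDocs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int doc : sortedDocs) {
            int delta = doc - previous;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
            previous = doc;
        }
        return out.toByteArray();
    }
}
```

A "CachedTermQuery"-style scorer would read from a structure like this instead of the term's postings; how that plugs into the weight and scorer classes depends on the Lucene version and is not shown here.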


