Mailing List Archive: Lucene: Java-User

Finding the highest term in a field

daniel at nuix

Nov 18, 2009, 7:48 PM


Hi all.

If I want to find the lowest term in a field, I can do something like this:

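public Date computeEarliestDate(IndexReader reader) throws IOException, ParseException {
    // Seek to the first term at or after "00000000" in the "date" field.
    TermEnum terms = reader.terms(new Term("date", "00000000"));
    try {
        Term first = terms.term();
        if (first == null || !"date".equals(first.field())) {
            // No date terms at all: the epoch stands in for "some date before all data".
            return new Date(0L);
        }
        return dateFormat.parse(first.text());
    } finally {
        terms.close();
    }
}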

But what if I want to find the highest? TermEnum can't step backwards.

I am working under these constraints:
* It can't involve iterating every value in the TermEnum because
the number of documents is too large for that to be efficient.
* It has to work with existing text indexes, so I can't cheat by
having another field which sorts in the other direction.

Is my best option to do a sort of binary search, seeking the TermEnum
to different terms until I find a day where there are terms at or
above that day but no terms at or above the next day?
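
The binary search described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread: a NavigableSet<String> stands in for the term dictionary, and its ceiling() call plays the role of one reader.terms(new Term("date", ...)) seek. The class name HighestDateTerm, the helper names and the search bounds are hypothetical. It assumes every term is a yyyyMMdd day and that no term is later than the upper bound. Against a real index, a seek past the last date term would land on a term of a later field rather than returning null, so the field check from computeEarliestDate would still be needed.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

class HighestDateTerm {
    static final DateTimeFormatter FMT = DateTimeFormatter.BASIC_ISO_DATE; // yyyyMMdd

    // Stand-in for one TermEnum seek: is there any term at or after this day?
    static boolean hasTermAtOrAfter(NavigableSet<String> terms, long epochDay) {
        return terms.ceiling(LocalDate.ofEpochDay(epochDay).format(FMT)) != null;
    }

    // Returns the highest term, or null if no term is at or after min.
    // Assumes every term is a yyyyMMdd day and none is later than max.
    static String findHighest(NavigableSet<String> terms, LocalDate min, LocalDate max) {
        long lo = min.toEpochDay();     // invariant: some term >= lo
        long hi = max.toEpochDay() + 1; // invariant: no term >= hi
        if (!hasTermAtOrAfter(terms, lo)) {
            return null;
        }
        while (hi - lo > 1) {
            long mid = lo + (hi - lo) / 2;
            if (hasTermAtOrAfter(terms, mid)) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        // A term exists at or after day lo but none at or after the next day,
        // so this final seek lands on the highest term.
        return terms.ceiling(LocalDate.ofEpochDay(lo).format(FMT));
    }

    public static void main(String[] args) {
        NavigableSet<String> terms =
                new TreeSet<>(Arrays.asList("19990104", "20050630", "20091118"));
        // prints 20091118
        System.out.println(findHighest(terms, LocalDate.of(1970, 1, 1), LocalDate.of(2100, 1, 1)));
    }
}
```

Each loop iteration costs one seek, so a 130-year range needs about 16 seeks regardless of how many terms the field holds.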

Daniel


--
Daniel Noll, Senior Developer, Nuix
http://nuix.com/
Forensic and eDiscovery Software
The world's most advanced email data analysis and eDiscovery software

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


yonik at lucidimagination

Nov 18, 2009, 9:01 PM


On Wed, Nov 18, 2009 at 10:48 PM, Daniel Noll <daniel [at] nuix> wrote:
> But what if I want to find the highest?  TermEnum can't step backwards.

I've also wanted to do the same. It's coming with the new flexible
indexing patch:
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764020#action_12764020

-Yonik
http://www.lucidimagination.com



daniel at nuix

Nov 18, 2009, 10:04 PM


On Thu, Nov 19, 2009 at 16:01, Yonik Seeley <yonik [at] lucidimagination> wrote:
> On Wed, Nov 18, 2009 at 10:48 PM, Daniel Noll <daniel [at] nuix> wrote:
>> But what if I want to find the highest?  TermEnum can't step backwards.
>
> I've also wanted to do the same. It's coming with the new flexible
> indexing patch:
> https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764020#action_12764020

This sounds interesting.

I take it the existing numeric fields can't already do stuff like
this? (We don't have access to them yet anyway for backwards
compatibility reasons, otherwise I would have looked into it. But
next major version...)

For now I am writing a routine which subdivides the term space until
it thinks it's down to some size which is small enough to use
iteration instead of seeking (which seems to be in the realm of
100,000 ~ 1,000,000 terms -- but the hard thing is guessing how many
terms would be either side of the split.)
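
A rough sketch of such a subdivide-then-iterate routine follows. It is a guess at the general shape, not Daniel's actual code. A NavigableSet<String> again stands in for the term dictionary, with ceiling() as the seek and tailSet() iteration as TermEnum.next(). Because the number of terms on either side of a split is unknown, the sketch uses the width of the remaining window in days (the hypothetical scanWindowDays parameter) as a crude proxy for "small enough to iterate". The class and method names are made up for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

class SubdivideThenScan {
    static final DateTimeFormatter FMT = DateTimeFormatter.BASIC_ISO_DATE; // yyyyMMdd

    static String key(long epochDay) {
        return LocalDate.ofEpochDay(epochDay).format(FMT);
    }

    // Returns the highest term at or after key(minDay), or null if there is none.
    // Assumes no term sorts at or after key(maxDay + 1).
    static String findHighest(NavigableSet<String> terms, long minDay, long maxDay,
                              long scanWindowDays) {
        if (terms.ceiling(key(minDay)) == null) {
            return null;
        }
        long window = Math.max(1, scanWindowDays); // guard against a non-terminating loop
        long lo = minDay;     // invariant: some term >= key(lo)
        long hi = maxDay + 1; // invariant: no term >= key(hi)

        // Phase 1: halve the window with seeks while it is still wide.
        while (hi - lo > window) {
            long mid = lo + (hi - lo) / 2;
            if (terms.ceiling(key(mid)) != null) {
                lo = mid;
            } else {
                hi = mid;
            }
        }

        // Phase 2: the window is narrow, so iterate to the end and keep the last term.
        String last = null;
        for (String term : terms.tailSet(key(lo), true)) {
            last = term;
        }
        return last;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(
                Arrays.asList("20090101", "20091118093000", "20091118171500"));
        long min = LocalDate.of(2000, 1, 1).toEpochDay();
        long max = LocalDate.of(2020, 12, 31).toEpochDay();
        // prints 20091118171500
        System.out.println(findHighest(terms, min, max, 7));
    }
}
```

The final scan is what lets this variant cope with terms finer than one day, such as the timestamp-style terms in the usage example, at the cost of iterating over however many terms fall inside the last window.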

Daniel



yonik at lucidimagination

Nov 19, 2009, 6:29 AM


On Thu, Nov 19, 2009 at 1:04 AM, Daniel Noll <daniel [at] nuix> wrote:
> I take it the existing numeric fields can't already do stuff like
> this?

Nope, it's a fundamental limitation of the current TermEnums.

-Yonik
http://www.lucidimagination.com



uwe at thetaphi

Nov 19, 2009, 6:39 AM


Hi Daniel, hi Yonik,

With NumericFields it would be possible to reach the very last position in
the TermEnum faster. You would first iterate over the lowest-precision terms
until the end is reached. That gives you the prefix of the last term. You can
then position the TermEnum on the first term with the same prefix at the next
higher precision and iterate again. You repeat this until you reach the
highest precision. Depending on the precStep value, you can find the end much
faster. E.g. with the default precStep of 4, each precision level needs to
enumerate a theoretical maximum of 16 terms before moving on to the next
level. With 32-bit ints, you need to do this 8 times (32 / 4), so you iterate
over at most 16*8 = 128 terms (a bound that is never reached in reality).
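
To make the descent concrete, here is a small simulation of the idea. It does not use Lucene's NumericUtils or its real term encoding. Each simulated term is a long with the shift in the high 32 bits and the shifted, sign-flipped value in the low 32 bits, which is only an assumption chosen so that terms sort by precision level and then by value. The class name TrieMaxSketch, its helpers and the termsVisited counter are hypothetical.

```java
import java.util.NavigableSet;
import java.util.NoSuchElementException;
import java.util.TreeSet;

class TrieMaxSketch {
    static final int STEP = 4; // plays the role of precStep
    static int termsVisited;   // terms enumerated by the last findMax call

    // Simulated term: shift in the high 32 bits, shifted value in the low 32 bits.
    static long term(int shift, long prefix) {
        return ((long) shift << 32) | prefix;
    }

    // Indexes one int at every precision level (shift 0, 4, ..., 28).
    static void add(NavigableSet<Long> index, int value) {
        // Flip the sign bit so unsigned order matches signed int order.
        long sortable = (value ^ 0x80000000) & 0xFFFFFFFFL;
        for (int shift = 0; shift < 32; shift += STEP) {
            index.add(term(shift, sortable >>> shift));
        }
    }

    static int findMax(NavigableSet<Long> index) {
        termsVisited = 0;
        long prefix = 0; // highest prefix found at the previous, coarser level
        for (int shift = 32 - STEP; shift >= 0; shift -= STEP) {
            long last = -1;
            // Seek to the first possible term under the known prefix,
            // then walk to the end of this precision level (at most 16 terms).
            for (long t : index.tailSet(term(shift, prefix << STEP), true)) {
                if ((t >>> 32) != shift) {
                    break; // ran into the next precision level
                }
                last = t;
                termsVisited++;
            }
            if (last < 0) {
                throw new NoSuchElementException("index is empty");
            }
            prefix = last & 0xFFFFFFFFL;
        }
        return (int) (prefix ^ 0x80000000L); // undo the sign-bit flip
    }

    public static void main(String[] args) {
        NavigableSet<Long> index = new TreeSet<>();
        for (int v : new int[] {5, -3, 1000000, 42}) {
            add(index, v);
        }
        System.out.println(findMax(index) + " found after visiting " + termsVisited + " terms");
    }
}
```

Because the prefix carried down from each coarser level is already the largest one, every term at or after the seek position on the next level shares that prefix, which is what caps each level at 16 enumerated terms.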

Implementing this requires a good understanding of NumericFields, but the
algorithm itself is very simple (simpler than the range splitter in
NumericUtils). If you like, I could help you implement it.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi




