
Mailing List Archive: Lucene: Java-User

Enumerating NumericField using TermEnum?

 

 



phil123 at gmail

Sep 11, 2009, 3:00 PM

Post #1 of 8 (1389 views)
Permalink
Enumerating NumericField using TermEnum?

Hi,

I've used NumericField to store my "hour" field.

Example...

doc.add(new NumericField("hour").setIntValue(Integer.parseInt("12")));

Before, I was using a plain string Field and enumerating the values with
TermEnum, which worked fine.
Now that I'm using NumericFields, I'm not sure how to port this enumeration code.

Any pointers?

This is the code I was using previously for plain Fields.

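ArrayList<Integer> hours = new ArrayList<Integer>();
TermEnum termEnum = reader.terms( new Term( "hour", "" ) );
Term term = null;
while ( ( term = termEnum.term() ) != null ) {

    // stop once the enumeration leaves the "hour" field
    if ( ! term.field().equals( "hour" ) )
        break;

    // term.text() is a String, so it has to be parsed, not cast
    hours.add( Integer.valueOf( term.text() ) );
    termEnum.next();
}
termEnum.close();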

Thanks,
Phil

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


uwe at thetaphi

Sep 12, 2009, 1:55 AM

Post #2 of 8 (1348 views)
Permalink
RE: Enumerating NumericField using TermEnum? [In reply to]

Hi Phil,

thanks for checking out NumericField. I have two comments about your
problem:

> I've used NumericField to store my "hour" field.
>
> Example...
>
> doc.add(new NumericField("hour").setIntValue(Integer.parseInt("12")));

NumericField uses a special encoding of terms for fast NumericRangeQueries.
It indexes more than one term per value. How many terms depends on the
precisionStep constructor parameter. If you set it to infinity (or to
anything greater than or equal to the bit size of your value, 32 for ints),
it indexes exactly one term per value. The additional terms are what make
numeric range queries very fast. This extra overhead only pays off for
fields with high cardinality (more than about 500 distinct values). For a
simple hour field with 24 distinct values, the speed gain from
NumericRangeQuery would be negligible; it may even be a little slower
because of the additional overhead. I would suggest using NumericField only
for real high-cardinality fields (like Unix timestamps, prices,
latitudes/longitudes (all types of floats/doubles), day of year, ...).

Maybe I will add this to the javadocs.
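
To make the relationship between precisionStep and the number of indexed
terms concrete, here is a minimal sketch in plain Java. It does not call
Lucene; it only models the idea described above, namely one term per shift,
each term holding the value with its lowest bits removed. The class and
method names are hypothetical, and the real on-disk term encoding is not
reproduced.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only; it does not call Lucene. */
final class PrecisionStepSketch {

    /** Number of terms indexed for one value that is valueBits wide. */
    static int termsPerValue(int valueBits, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        int count = 0;
        // long counter so that adding Integer.MAX_VALUE cannot overflow
        for (long shift = 0; shift < valueBits; shift += precisionStep) {
            count++;
        }
        return count;
    }

    /** The int value with its lowest `shift` bits cleared, one entry per shift. */
    static List<Integer> indexedPrecisions(int value, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        List<Integer> out = new ArrayList<Integer>();
        for (long shift = 0; shift < 32; shift += precisionStep) {
            int s = (int) shift;
            out.add((value >>> s) << s);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(termsPerValue(32, 4));                 // 8
        System.out.println(termsPerValue(32, Integer.MAX_VALUE)); // 1
        System.out.println(indexedPrecisions(1234, 4));
        // [1234, 1232, 1024, 0, 0, 0, 0, 0]
    }
}
```

With a precisionStep of 4, an int produces eight terms; with
Integer.MAX_VALUE it produces exactly one, which is why a plain term
enumeration works again in that case.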

> Before I was using plain string Field and enumerating them with
> TermEnum, which worked fine.
> Now that I'm using NumericFields, I'm not sure how to port this
> enumeration code.

As explained above, each numeric value is indexed by more than one term, so
your TermEnum code is of no use as it stands. There are some tricks to get
the distinct values, but they need deeper knowledge of the underlying term
encoding (shift value, ...); see the FieldCache parsers for numeric fields.

As your field (hours) is of low cardinality, you can index with
precisionStep=Integer.MAX_VALUE. Range queries will not be faster than with
a normal TermRangeQuery, but your term enum will work. You only have to use
NumericUtils.prefixCodedToInt() to decode the term into an int:

hours.add( Integer.valueOf( NumericUtils.prefixCodedToInt( term.text() ) ) );

This code would also work for other precision steps, but you would get some
additional "lower precision terms" (values with some lower bits removed).
You have to break out of the iteration when you reach them (see the
FieldCache code).
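
The "break out of the iteration" idea can be sketched without Lucene. The
model below assumes that all full-precision terms (shift 0) are enumerated
before any lower-precision term, so the loop can stop at the first term
whose shift is greater than zero. NumericTerm, index() and distinctValues()
are hypothetical names used for illustration only; in real code the shift
would have to be read from the encoded term, as the FieldCache parsers do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Simplified model of the term order; it does not call Lucene. */
final class DistinctValuesSketch {

    /** Hypothetical stand-in for one indexed term. */
    static final class NumericTerm implements Comparable<NumericTerm> {
        final int shift;  // number of low bits removed
        final int value;  // the value with those bits cleared

        NumericTerm(int shift, int value) {
            this.shift = shift;
            this.value = value;
        }

        // Order by shift first, then by value: this is the assumption that
        // full-precision terms come before all lower-precision terms.
        public int compareTo(NumericTerm other) {
            if (shift != other.shift) {
                return shift < other.shift ? -1 : 1;
            }
            return value < other.value ? -1 : (value > other.value ? 1 : 0);
        }
    }

    /** Builds the sorted, de-duplicated set of terms for the given values. */
    static TreeSet<NumericTerm> index(int[] values, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        TreeSet<NumericTerm> terms = new TreeSet<NumericTerm>();
        for (int v : values) {
            for (long shift = 0; shift < 32; shift += precisionStep) {
                int s = (int) shift;
                terms.add(new NumericTerm(s, (v >>> s) << s));
            }
        }
        return terms;
    }

    /** Collects the distinct full-precision values, stopping at the first other term. */
    static List<Integer> distinctValues(Iterable<NumericTerm> termsInOrder) {
        List<Integer> out = new ArrayList<Integer>();
        for (NumericTerm term : termsInOrder) {
            if (term.shift > 0) {
                break; // everything from here on is a lower-precision term
            }
            out.add(term.value);
        }
        return out;
    }

    public static void main(String[] args) {
        TreeSet<NumericTerm> terms = index(new int[] {12, 3, 12, 23}, 4);
        System.out.println(terms.size() + " terms in the model index");
        System.out.println(distinctValues(terms));
    }
}
```

The same distinctValues() loop returns the same list when the model index
is built with Integer.MAX_VALUE as the step; the only difference is that no
lower-precision terms exist to be skipped.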



phil123 at gmail

Sep 13, 2009, 10:20 AM

Post #3 of 8 (1334 views)
Permalink
Re: Enumerating NumericField using TermEnum? [In reply to]

Hi Uwe,

Thanks for the explanation! It really helps. It makes sense that for
a small number of values, such as "hour", NumericField is not going to
help me. I'm experimenting with using an epoch NumericField for sorting,
which, funnily enough, is where I started with 2.4.1, before going down
the usual TooManyClauses path and breaking it down into multiple fields.
2.9 seems a great improvement there. Downloading the new 2.9 rc4...

Thanks,
Phil




--
Mobile: +1 778-233-4935
Website: http://philw.co.uk
Skype: philwhelan76
Twitter: philwhln
Email : phil123 [at] gmail
iChat: philwhln [at] mac



markrmiller at gmail

Sep 13, 2009, 11:09 AM

Post #4 of 8 (1321 views)
Permalink
Re: Enumerating NumericField using TermEnum? [In reply to]

>> NumericField uses a special encoding of terms for fast NumericRangeQueries.
>> It indexes more than one term per value. How many terms depends on the
>> precisionStep constructor parameter. If you set it to infinity (or to
>> anything greater than or equal to the bit size of your value, 32 for ints),
>> it indexes exactly one term per value. The additional terms are what make
>> numeric range queries very fast. This extra overhead only pays off for
>> fields with high cardinality (more than about 500 distinct values). For a
>> simple hour field with 24 distinct values, the speed gain from
>> NumericRangeQuery would be negligible; it may even be a little slower
>> because of the additional overhead. I would suggest using NumericField only
>> for real high-cardinality fields (like Unix timestamps, prices,
>> latitudes/longitudes (all types of floats/doubles), day of year, ...).
>>
>> Maybe I will add this to the javadocs.
>>
+1 - intuition might be to use it for anything numeric.

--
- Mark

http://www.lucidimagination.com






uwe at thetaphi

Sep 13, 2009, 1:26 PM

Post #5 of 8 (1319 views)
Permalink
RE: Enumerating NumericField using TermEnum? [In reply to]

> >> Maybe I will add this to the javadocs.
> >>
> +1 - intuition might be to use it for anything numeric.

If we do not need a new RC for that, I can do it tomorrow! I am not yet sure
what to write. I tend to say: "Use NumericField, but with infinite
precisionStep for low-cardinality fields - and you get the old TermEnum
value list as before (with conversion through NumericUtils)". In general,
users should use NumericField for numbers, but with an appropriate
precisionStep: infinite if range queries cannot get any faster because of
low cardinality.

Uwe




markrmiller at gmail

Sep 13, 2009, 1:37 PM

Post #6 of 8 (1313 views)
Permalink
Re: Enumerating NumericField using TermEnum? [In reply to]

Uwe Schindler wrote:
>>>> Maybe I will add this to the javadocs.
>>>>
>> +1 - intuition might be to use it for anything numeric.
>>
>
> If we do not need a new RC for that, I can do it tomorrow! I am not yet sure
> what to write. I tend to say: "Use NumericField, but with infinite
> precisionStep for low-cardinality fields - and you get the old TermEnum
> value list as before (with conversion through NumericUtils)". In general,
> users should use NumericField for numbers, but with an appropriate
> precisionStep: infinite if range queries cannot get any faster because of
> low cardinality.
>
> Uwe
>
Okay, good point. That makes sense - how about an example of low cardinality,
just for grounding? In the hundreds? Under 10,000? Only tens?

Also, do you mean to use Integer.MAX_VALUE as infinite?

My personal opinion is that we can make javadoc changes for the final
release without doing an RC, as long as no code/build/scripts are touched
at all. Not sure how others feel, though.

--
- Mark

http://www.lucidimagination.com






uwe at thetaphi

Sep 13, 2009, 1:51 PM

Post #7 of 8 (1313 views)
Permalink
RE: Enumerating NumericField using TermEnum? [In reply to]

> > If we do not need a new RC for that, I can do it tomorrow! I am not yet
> > sure what to write. I tend to say: "Use NumericField, but with infinite
> > precisionStep for low-cardinality fields - and you get the old TermEnum
> > value list as before (with conversion through NumericUtils)". In
> > general, users should use NumericField for numbers, but with an
> > appropriate precisionStep: infinite if range queries cannot get any
> > faster because of low cardinality.
>
> Okay, good point. That makes sense - how about an example of low card,
> just for grounding? In the hundreds? Under 10,000? Only 10's?

Phil's example is a good one: 24 distinct values for an hour field. I would
say everything up to 100 is low cardinality. It does not hurt to use small
precisionSteps, but you lose the ability to enumerate the terms easily. A
drop-down list in a web interface that is filled from a list of terms is a
typical example of low cardinality, e.g. credit card expiration dates.
These are not real numeric values (although "hour" is a number); they are
used as a list of preselected terms (and the list is intelligently filled
from the index). I have a lot of these, but they are mostly country names,
project names, and so on (or better said: facets). So: low-cardinality
lists. If you use numbers this way, you can handle them as simple text
terms (and you will not use range queries on them).
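
As a rough illustration of this rule of thumb, the helper below picks a
precisionStep from the expected number of distinct values. It is only a
sketch: the threshold of 100 is the figure suggested above, the fallback
step of 4 is an assumption about the library default, and the class and
method names are hypothetical.

```java
/** Hypothetical helper; encodes the rule of thumb from this thread only. */
final class PrecisionStepChooser {

    // "Everything up to 100 is low cardinality" (figure taken from the thread).
    static final int LOW_CARDINALITY_LIMIT = 100;

    // Assumption: a small step such as 4 is used for high-cardinality fields.
    static final int SMALL_PRECISION_STEP = 4;

    /** Returns an "infinite" step for low-cardinality fields, a small one otherwise. */
    static int choose(int expectedDistinctValues) {
        if (expectedDistinctValues < 0) {
            throw new IllegalArgumentException("expectedDistinctValues must be >= 0");
        }
        return expectedDistinctValues <= LOW_CARDINALITY_LIMIT
                ? Integer.MAX_VALUE      // one term per value, terms stay enumerable
                : SMALL_PRECISION_STEP;  // extra terms for faster range queries
    }

    public static void main(String[] args) {
        System.out.println(choose(24));      // hour of day
        System.out.println(choose(1000000)); // e.g. timestamps
    }
}
```

The cutoff is a judgment call; fields with between roughly 100 and 500
distinct values fall into a gray zone where either setting may be
acceptable.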

> Also, do you mean to use Integer.MAX_VALUE as infinite?

Yes, sorry.

> My personal opinion is that we can make javadoc changes for the final
> without doing an RC, as long as no code/build/scripts at all is touched.
> Not sure how others feel though.

I just wanted to ask for confirmation.

Uwe




markrmiller at gmail

Sep 14, 2009, 9:14 AM

Post #8 of 8 (1288 views)
Permalink
Re: Enumerating NumericField using TermEnum? [In reply to]

Uwe Schindler wrote:
>
>> My personal opinion is that we can make javadoc changes for the final
>> without doing an RC, as long as no code/build/scripts at all is touched.
>> Not sure how others feel though.
>>
>
> I just wanted to ask for confirmation.
>
> Uwe
>
>
I know - we should always check for consensus, I agree. I was just giving
my piece of the consensus pie. Now we are two strong, and if no grumpy
stickler jumps in to thwart us, I think we are good to go!

--
- Mark

http://www.lucidimagination.com



