
Mailing List Archive: Lucene: General

MaxFieldLength in Lucene 3.4

 

 



mrjama at comcast

Nov 27, 2011, 11:09 PM

Post #1 of 4 (771 views)
MaxFieldLength in Lucene 3.4

While upgrading to Lucene 3.4, I noticed the MaxFieldLength values on the
indexers are deprecated. There appears to be a LimitTokenCountAnalyzer
that limits the tokens - so does that mean the default for all other
analyzers is unlimited?

Thanks in advance -
JM


uwe at thetaphi

Nov 27, 2011, 11:40 PM

Post #2 of 4 (758 views)
RE: MaxFieldLength in Lucene 3.4

Hi,

The move is simple - LimitTokenCountAnalyzer is just a wrapper around any
other Analyzer, so I don't really understand your question - of course all
other analyzers are unlimited. If you have myAnalyzer with
myMaxFieldLengthValue used before, you can change your code as follows:

Before:
new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_34,
myAnalyzer).setFoo().setBar().setMaxFieldLength(myMaxFieldLengthValue));

After:
new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_34, new
LimitTokenCountAnalyzer(myAnalyzer,
myMaxFieldLengthValue)).setFoo().setBar());

You only have to do this on the indexing side; on the query side
(QueryParser) just use myAnalyzer without wrapping. With the new code, the
responsibility for cutting the field off after a specific number of tokens
was moved out of the indexing code in Lucene. This is now just an analysis
feature, not an indexing feature anymore.
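
The wrapping idea described above can be sketched without any Lucene
dependency. This is only an illustration of the pattern, not Lucene's
implementation: SketchAnalyzer, WhitespaceSketchAnalyzer and
LimitingSketchAnalyzer are hypothetical stand-ins invented for this example,
and the limit of 3 is arbitrary. The wrapped analyzer is used on the indexing
side, the plain one on the query side.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an analyzer: turns text into a list of tokens. */
interface SketchAnalyzer {
    List<String> tokenize(String text);
}

/** Splits on whitespace and lowercases; places no limit on the token count. */
class WhitespaceSketchAnalyzer implements SketchAnalyzer {
    @Override
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase());
            }
        }
        return tokens;
    }
}

/** Wraps any analyzer and keeps only the first maxTokens tokens it produces. */
class LimitingSketchAnalyzer implements SketchAnalyzer {
    private final SketchAnalyzer delegate;
    private final int maxTokens;

    LimitingSketchAnalyzer(SketchAnalyzer delegate, int maxTokens) {
        this.delegate = delegate;
        this.maxTokens = maxTokens;
    }

    @Override
    public List<String> tokenize(String text) {
        List<String> all = delegate.tokenize(text);
        int end = Math.min(maxTokens, all.size());
        return new ArrayList<String>(all.subList(0, end));
    }
}

class WrapperSketch {
    public static void main(String[] args) {
        SketchAnalyzer myAnalyzer = new WhitespaceSketchAnalyzer();

        // Indexing side: wrap the analyzer so each field is cut off after 3 tokens.
        SketchAnalyzer indexAnalyzer = new LimitingSketchAnalyzer(myAnalyzer, 3);
        System.out.println(indexAnalyzer.tokenize("One two three four five"));

        // Query side: use the unwrapped analyzer; queries are not truncated.
        System.out.println(myAnalyzer.tokenize("One two three four five"));
    }
}
```

Because the limit lives in a wrapper, the underlying analyzer stays unchanged
and can be shared between indexing and querying.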

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi



mrjama at comcast

Dec 1, 2011, 12:23 AM

Post #3 of 4 (748 views)
RE: MaxFieldLength in Lucene 3.4

> "of course all other analyzers are unlimited"

Maybe I am too far behind the times. I was updating some pretty old stuff.
I think it was written originally with Lucene 1.4. I seem to recall that
Lucene v1.x had analyzers where the default was "limited", because I learned
pretty early that I had to set that option during indexing. Perhaps at some
point the switch was made to default unlimited. Thanks, your answer clears
it up.

One question - why even have this option now? Are things more efficient with
a limited token field? If you know your data is 'bounded', should you
always limit the token field to improve performance?

Thanks!




uwe at thetaphi

Dec 1, 2011, 12:32 AM

Post #4 of 4 (752 views)
RE: MaxFieldLength in Lucene 3.4

Hi,

This option is a safety measure for cases where you cannot trust your input
data. Maybe you accidentally tokenize a binary file and produce millions of
random tokens. With the limit in place, only a bounded number of them (maybe
10000) are generated. If your input data is trusted and text-based (e.g. read
from elements in XML files, databases, ...), then you don't need this filter.

> Maybe I am too far behind the times. I was updating some pretty old stuff.
> I think it was written originally with Lucene 1.4. I seem to recall that
> Lucene v1.x had analyzers where the default was "limited", because I learned
> pretty early that I had to set that option during indexing. Perhaps at some
> point the

The limiting option was almost always on IndexWriter, but it defaulted to
10000 tokens from the beginning. The analyzers had nothing to do with this
option.

The recent change removed the token counting from IndexWriter (where it only
made the already complicated code harder to read) and moved it to a simple
TokenFilter, because it's much more reasonable to do it during analysis.
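
To illustrate why a filter in the token stream works as a safety cap, here is
a small library-free sketch. It is not Lucene's TokenFilter API:
JunkTokenSource and LimitTokenFilterSketch are hypothetical names made up for
this example, and plain java.util.Iterator stands in for a token stream. The
filter stops pulling from its input once the limit is reached, so a source
that could emit millions of junk tokens is only asked for the first 10000.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical token source that could yield a huge number of junk tokens. */
class JunkTokenSource implements Iterator<String> {
    private final long total;
    private long produced = 0;

    JunkTokenSource(long total) {
        this.total = total;
    }

    public boolean hasNext() {
        return produced < total;
    }

    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return "tok" + (produced++);
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    /** Number of tokens actually pulled from this source so far. */
    long producedCount() {
        return produced;
    }
}

/** Passes through at most maxTokens tokens, then reports end of stream. */
class LimitTokenFilterSketch implements Iterator<String> {
    private final Iterator<String> input;
    private final int maxTokens;
    private int emitted = 0;

    LimitTokenFilterSketch(Iterator<String> input, int maxTokens) {
        this.input = input;
        this.maxTokens = maxTokens;
    }

    public boolean hasNext() {
        // Check the counter first so the input is never consulted past the limit.
        return emitted < maxTokens && input.hasNext();
    }

    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        emitted++;
        return input.next();
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}

class LimitFilterDemo {
    public static void main(String[] args) {
        JunkTokenSource source = new JunkTokenSource(5000000L);
        Iterator<String> limited = new LimitTokenFilterSketch(source, 10000);

        int seen = 0;
        while (limited.hasNext()) {
            limited.next();
            seen++;
        }
        System.out.println("tokens passed on: " + seen);
        System.out.println("tokens pulled from source: " + source.producedCount());
    }
}
```

In this sketch the consumer of the stream needs no knowledge of the limit,
which mirrors the point above that the counting does not have to live in the
indexing code.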

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi


