
Mailing List Archive: Lucene: Java-User

substring indexing to avoid 'TooManyClauses' exception

 

 



hardy at ferentschik

Nov 12, 2007, 1:44 PM

Post #1 of 4 (2140 views)
substring indexing to avoid 'TooManyClauses' exception

Hi,

I have a question regarding the way I got around the 'TooManyClauses'
exception when using wildcard queries
(http://wiki.apache.org/lucene-java/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831).


I am using Lucene in conjunction with Hibernate Search
(http://www.hibernate.org/410.html). I am indexing 'Company' objects
which contain multiple attributes and the application supports different
types of searches.

One type of search is a right-hand truncated (wildcard query) search of
the company name. If, e.g., the user searches for 'M', I initially
constructed an 'M*' query. I have about 250,000 companies in the index.
Without any modifications I get the 'TooManyClauses' exception, and I
initially kept increasing the 'maxClauseCount'. It works, but performance
was terrible. I haven't tried working with a filter, but instead decided
to try a different approach. I index all leading substrings (prefixes) of
a string, e.g. 'Foo' would be indexed as 'F', 'Fo' and 'Foo'.
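
A minimal sketch of the prefix expansion described above, in plain Java
with no Lucene or Hibernate Search dependency. The class and method names
are hypothetical and only illustrate which terms would end up in the
index; how they are fed to an analyzer is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or Hibernate Search.
final class PrefixTerms {
    // Returns every leading substring of the token, shortest first.
    static List<String> prefixes(String token) {
        List<String> out = new ArrayList<>();
        for (int end = 1; end <= token.length(); end++) {
            out.add(token.substring(0, end));
        }
        return out;
    }
}
```

With terms indexed this way, a search for 'M' becomes a lookup of the
single exact term 'M' instead of an 'M*' query that expands into one
clause per matching term.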

I got rid of the 'TooManyClauses' exception and performance improved by
a magnitude, but I would like to get some feedback from other users on
whether this is a good approach or not.

Of course the index size increased, but that was no issue in this case.
Are there any potential problems with ranking/scoring?

Thanks for any feedback.

--Hardy


--
Hartmut Ferentschik
Ekholmsv.339 ,1, 127 45 Skärholmen, Sweden
Phone: +46 855 923 676 (h); +46 704 225 097 (m)

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


erickerickson at gmail

Nov 13, 2007, 7:12 AM

Post #2 of 4 (2047 views)
Re: substring indexing to avoid 'TooManyClauses' exception [In reply to]

Hardy:

I'm certainly not an expert on ranking and scoring, but I've got to assume
that this approach influences scoring.

Another issue is how you indexed multiple values. If you took a hint from
the SynonymAnalyzer example in Lucene in Action, and indexed all the
substrings with a position increment of 0, you're probably OK with phrase
and span queries. Consider indexing the following: "anne gables". You'd
index a, an, ann and anne followed by g, ga, gab, gabl, gable, gables.
If the increments for the variants of anne were the default of 1, searching
for the phrase "anne gables"~2 would fail because anne is really 5 or so
terms away from gables. You can fix this by ensuring that the increment
is 0 for the variants, but you need to be aware of this.
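
To make the position arithmetic concrete, here is a small plain-Java
simulation. It uses no Lucene classes; the names are hypothetical, and it
assumes the first prefix of each word advances the position by 1 while the
remaining prefixes advance it by a configurable increment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java simulation of token positions; not a Lucene API.
final class PrefixPositions {
    // Maps each prefix term to its position. The first prefix of a word
    // moves one position forward; every further prefix of the same word
    // moves forward by variantIncrement (0 or 1). If two words share a
    // prefix, the later position overwrites the earlier one.
    static Map<String, Integer> positions(String text, int variantIncrement) {
        Map<String, Integer> pos = new LinkedHashMap<>();
        int current = -1;
        for (String word : text.trim().split("\\s+")) {
            for (int end = 1; end <= word.length(); end++) {
                current += (end == 1) ? 1 : variantIncrement;
                pos.put(word.substring(0, end), current);
            }
        }
        return pos;
    }

    // Number of positions from term a to term b.
    static int distance(Map<String, Integer> pos, String a, String b) {
        return pos.get(b) - pos.get(a);
    }
}
```

For "anne gables" with an increment of 1, anne lands at position 3 and
gables at position 9, so they are 6 positions apart and a slop of 2 cannot
bridge the gap. With an increment of 0, all variants of anne share position
0 and all variants of gables share position 1, so the two words are
adjacent again.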

What about false hits? I don't know enough about your problem
space to know whether matching on "ann AND gab" in the
above example would be acceptable (note: no wild cards,
so the user might be left scratching her head wondering why
she got a match).
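
The false-hit case can be sketched the same way. This is plain Java with
hypothetical names, assuming every prefix of every word is stored as an
exact term and that AND means all query terms must be present.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustration only; not a Lucene API.
final class FalseHitDemo {
    // Collects every prefix of every whitespace-separated word.
    static Set<String> indexedTerms(String text) {
        Set<String> terms = new HashSet<>();
        for (String word : text.trim().split("\\s+")) {
            for (int end = 1; end <= word.length(); end++) {
                terms.add(word.substring(0, end));
            }
        }
        return terms;
    }

    // Conjunctive exact-term match: every query term must be indexed.
    static boolean matchesAll(Set<String> indexed, String... queryTerms) {
        return indexed.containsAll(Arrays.asList(queryTerms));
    }
}
```

Here matchesAll(indexedTerms("anne gables"), "ann", "gab") is true even
though neither query term carries a wildcard, which is exactly the
surprise described above.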

There are several approaches. There is a thread titled "I just don't
understand wildcards at all" that has a bunch of information about
wildcards, and searching the archive for "wildcards" will turn up a
wealth of information.

But here are a couple of approaches:
1> use filters. This has worked quite well for me where I construct
the filter with WildcardTermEnum. If you have the potential to re-use
these you can always use CachingWrapperFilter to keep
them around.

1a> you could pre-compute, say, 1 filter for each of the
one-letter possibilities (e.g. a filter for a*, one for b*, etc)
and use the filters for the pathological case of one leading
character and wildcard queries for the 2 or more leading
character wildcard queries, assuming performance is
acceptable for the two-leading-character case.

2> ask your users whether there's really much value in getting
so many hits. That is, if you can restrict the leading number
of characters to, say, two (e.g. ab*), wildcarding might
work acceptably out of the box. I think a legitimate question
is "is 10,000 matching terms a useful thing to allow?" Of course
the immediate response is "yes", but challenging the person
to produce an actual use case often results in the
realization that catching the "too many clauses" error and
responding with a message of "your search is too broad to
be useful" is reasonable.
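
The idea behind approaches 1 and 1a can be sketched in plain Java. This is
not Lucene code: the sorted map stands in for the term dictionary, the bit
set stands in for a filter, and the HashMap stands in for whatever caching
you use. All names are hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Plain-Java stand-in for a term dictionary plus cached prefix filters.
final class PrefixFilterSketch {
    // term -> ids of the documents containing it, kept in sorted term order
    private final NavigableMap<String, BitSet> postings = new TreeMap<>();
    // prefix -> filter bits computed earlier
    private final Map<String, BitSet> cache = new HashMap<>();

    void add(int docId, String term) {
        postings.computeIfAbsent(term, t -> new BitSet()).set(docId);
        cache.clear(); // cached filters are stale once the index changes
    }

    // Walks the sorted terms that start with the prefix and ORs their
    // documents into one bit set, so no per-term query clause is created.
    BitSet filterFor(String prefix) {
        BitSet bits = cache.get(prefix);
        if (bits == null) {
            bits = new BitSet();
            for (Map.Entry<String, BitSet> e
                    : postings.tailMap(prefix, true).entrySet()) {
                if (!e.getKey().startsWith(prefix)) {
                    break; // sorted order: no later term can match
                }
                bits.or(e.getValue());
            }
            cache.put(prefix, bits);
        }
        return (BitSet) bits.clone(); // callers cannot corrupt the cache
    }
}
```

Calling filterFor once for each single letter up front corresponds to
approach 1a. The cost of a filter grows with the number of matching terms
but never hits a clause limit, because the terms are folded into one bit
set instead of being turned into query clauses.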

Best
Erick



hardy at ferentschik

Nov 14, 2007, 1:25 PM

Post #3 of 4 (2045 views)
Re: substring indexing to avoid 'TooManyClauses' exception [In reply to]

On Tue, 13 Nov 2007 16:12:26 +0100, Erick Erickson
<erickerickson [at] gmail> wrote:

Thanks for your help.

> I'm certainly not an expert on ranking and scoring, but I've got to
> assume that this approach influences scoring.
No doubt. The question is whether it matters for this particular use case.
For this particular field I will only ever have a simple right-hand
truncated search. The user cannot use span or phrase queries against this
field, not even an explicit AND. I don't think this approach makes much
sense when indexing a whole block of text. I only want to use it for
indexing a simple name which at most consists of a few words. I guess what
I want to do here is comparable to a single-column SQL LIKE query, e.g.
SELECT * FROM COMPANY WHERE COMPANY.NAME LIKE 'M%'. Of course this is only
the simple case. There are other queries where I combine the name search
with other fields which are indexed using, for example, a SnowballAnalyzer.

> There are several approaches. There is a thread titled "I just don't
> understand wildcards at all" that has a bunch of information about
> wildcards, and searching the archive for "wildcards" will turn up a
> wealth of information.
Great. I will look into it.

Thanks again.

-- Hardy


erickerickson at gmail

Nov 14, 2007, 1:52 PM

Post #4 of 4 (2051 views)
Re: substring indexing to avoid 'TooManyClauses' exception [In reply to]

Hardy:

Since your use case is so restricted, I'd recommend that you
just construct a filter. I think you'll find it's much faster than
you'd think at first glance. Of course, "your mileage may
vary". Is there an equivalent phrase like "your kilometerage
may vary" <G>?

Most of the discussion in the archives has to do with the
more general case, so much of it probably doesn't apply to
your specific case.

Best
Erick

