
Mailing List Archive: Lucene: Java-User

Indexing/Analyzer question - case-insensitive "contains" search

 

 



jattardi at gmail

Jul 30, 2007, 6:56 AM

Post #1 of 4
Indexing/Analyzer question - case-insensitive "contains" search

Hi everyone,

I told you I'd be back with more questions! :-)
Here is my situation. In my application, the field to be searched is
selected via a drop-down box. I want my searches to basically be "contains"
searches - I take what the user typed in, put a wildcard character at the
beginning and end, and put that in a WildcardQuery with the selected field.
In simple cases, this works great.

But the StandardAnalyzer and SimpleAnalyzer are removing some characters I
need. For example, one of my objects has a name of "Joe's Devices". If I
search for "Joe's", it doesn't work, because the apostrophe is stripped out.
I tried using the KeywordAnalyzer, which keeps the string intact, but then
won't my searches be case-sensitive? This is easy to fix of course by
calling toLowerCase() on the text when it is indexed, but then later when
retrieved from the index to be displayed in the search results, "Joe's
Devices" is displayed as "joe's devices". Is there anything I can do here
short of putting two copies of the name in the document - one indexed/not
stored ("joe's devices"), and one stored/not indexed ("Joe's Devices")? Or
can I accomplish this case-insensitive "contains" search some other way -
would I have to write a custom Analyzer, or something?
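
The "two copies" idea in the question can be illustrated without any search library. The following is only a minimal sketch in plain Java, not Lucene code: the class name ContainsSearchSketch and its methods are hypothetical. It keeps a lowercased key that is used only for matching (the analogue of an indexed, not stored field) and the original string that is used only for display (the analogue of a stored, not indexed field).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, library-independent sketch: each entry keeps a lowercased
// key for matching and the original text for display.
class ContainsSearchSketch {
    private final List<String> displayValues = new ArrayList<>(); // "stored" copy
    private final List<String> searchKeys = new ArrayList<>();    // "indexed" copy

    void add(String name) {
        displayValues.add(name);
        searchKeys.add(name.toLowerCase(Locale.ROOT));
    }

    // Case-insensitive "contains": the query is normalized the same way as the keys.
    List<String> search(String userInput) {
        String needle = userInput.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < searchKeys.size(); i++) {
            if (searchKeys.get(i).contains(needle)) {
                hits.add(displayValues.get(i)); // return the original casing
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ContainsSearchSketch index = new ContainsSearchSketch();
        index.add("Joe's Devices");
        index.add("Other Hardware");
        System.out.println(index.search("JOE'S"));
    }
}
```

The point of the sketch is that the apostrophe survives because nothing tokenizes the value, and the display text keeps its casing because the lowercased copy is never shown.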

Thanks in advance!

--
Joe Attardi
jattardi [at] gmail
http://thinksincode.blogspot.com/


a.schrijvers at hippo

Jul 30, 2007, 7:12 AM

Post #2 of 4
RE: Indexing/Analyzer question - case-insensitive "contains" search

Hello,

> Hi everyone,
>
> I told you I'd be back with more questions! :-)
> Here is my situation. In my application, the field to be searched is
> selected via a drop-down box. I want my searches to basically be "contains"
> searches - I take what the user typed in, put a wildcard character at the
> beginning and end, and put that in a WildcardQuery with the selected field.
> In simple cases, this works great.

It does sound very strange to me, to default to a WildCardQuery! Suppose I am looking for "bold", I am getting hits for "old".

IMO, you should remove the WildcardQuery and just use a simple QueryParser (with the same analyzer you use at indexing time). Your problem below arises from the fact that you construct your search with WildcardQuery(Term t) and t = new Term("field","Joe's");

But now you are looking for a term that is very likely not present in the index, even though you indexed text that contains "Joe's". The StandardAnalyzer, for example, would probably split on the apostrophe and drop the s. If you use the QueryParser instead of creating your own term, you are safe (and you have no problems with case sensitivity either).
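
The mismatch described here can be shown with a toy analyzer. This is only a sketch under loud assumptions: analyze() below is a hypothetical stand-in written for illustration, not Lucene's StandardAnalyzer, and its rules (lowercase, split on whitespace, drop a trailing 's, strip other punctuation) are chosen only to mimic the behaviour guessed at above. It shows why a hand-built term "Joe's" finds nothing, while a query passed through the same analysis does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for an analyzer; NOT Lucene's StandardAnalyzer.
class AnalyzerMismatchSketch {
    // Lowercase, split on whitespace, drop a trailing "'s", strip other punctuation.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            String term = raw.endsWith("'s") ? raw.substring(0, raw.length() - 2) : raw;
            term = term.replaceAll("[^a-z0-9]", "");
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("Joe's Devices");
        // Terms that ended up in the (toy) index.
        System.out.println(indexed);
        // A raw, unanalyzed term is compared against the analyzed terms.
        System.out.println(indexed.contains("Joe's"));
        // The same user input, analyzed the same way as at indexing time.
        System.out.println(indexed.containsAll(analyze("Joe's")));
    }
}
```

Luke, mentioned below, is the way to see what a real analyzer produces for the same input.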

If you do not really understand why it works like this, it might be good to play around with Luke: open your index with Luke, go to the plugins tab, enter some text, and see how it is tokenized by a few sample analyzers.

Regards Ard


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


jattardi at gmail

Jul 30, 2007, 7:31 AM

Post #3 of 4
Re: Indexing/Analyzer question - case-insensitive "contains" search

>
> It does sound very strange to me, to default to a WildCardQuery! Suppose I
> am looking for "bold", I am getting hits for "old".

I know - but that's what the requirements dictate. A better example might be
a MAC or IP address, where someone might be searching for a string in the
middle - like, I might search for "102" to get "192.168.102.151" and
"192.168.102.200" as results.
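
As a rough illustration of what wrapping the input in wildcards means, here is a minimal plain-Java sketch that translates a wildcard pattern ('*' for any run of characters, '?' for a single character) into a regular expression and tests it against a whole value. The class WildcardSketch and its methods are hypothetical and are not part of Lucene. Note that a WildcardQuery is matched against individual indexed terms, so this picture only applies if the whole address was indexed as a single term.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of wildcard matching against one indexed term.
class WildcardSketch {
    // Translate a wildcard pattern into a regex: '*' = any run, '?' = one char.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    // "Contains" search: wrap the user input in wildcards, match the whole term.
    static boolean containsMatch(String userInput, String term) {
        return toRegex("*" + userInput + "*").matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("102", "192.168.102.151"));
        System.out.println(containsMatch("102", "192.168.1.1"));
    }
}
```

A pattern that starts with a wildcard cannot use the sorted order of the terms to narrow the search, so every term of the field has to be examined; that cost is worth keeping in mind for large indexes.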


a.schrijvers at hippo

Jul 30, 2007, 8:03 AM

Post #4 of 4
RE: Indexing/Analyzer question - case-insensitive "contains" search

> > It does sound very strange to me, to default to a WildCardQuery! Suppose I
> > am looking for "bold", I am getting hits for "old".
>
> I know - but that's what the requirements dictate. A better example might be
> a MAC or IP address, where someone might be searching for a string in the
> middle - like, I might search for "102" to get "192.168.102.151" and
> "192.168.102.200" as results.

Yes... so you think that if you use StandardAnalyzer, index a field with 192.168.102.151, and then use the QueryParser [1] to search for "102", you do not get a hit? Lucene would not have the status it has if it could not do this. Obviously, how exactly the indexing is done depends on your analyzer.
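
Whether a plain term search for "102" hits depends on how the address was tokenized at indexing time. The sketch below uses two hypothetical tokenizers (wholeValue and splitOnDots, both invented for illustration and not Lucene analyzers) to show the two outcomes. Which of them a given Lucene analyzer resembles is an assumption to verify, for example with Luke.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical tokenizers; neither one is a real Lucene analyzer.
class TokenizationSketch {
    // Tokenizer A: the whole value becomes a single term.
    static List<String> wholeValue(String text) {
        return Collections.singletonList(text);
    }

    // Tokenizer B: the value is split into terms at each dot.
    static List<String> splitOnDots(String text) {
        return Arrays.asList(text.split("\\."));
    }

    // An exact term lookup hits only if the query equals one indexed term.
    static boolean termHit(List<String> indexedTerms, String query) {
        return indexedTerms.contains(query);
    }

    public static void main(String[] args) {
        String address = "192.168.102.151";
        System.out.println(termHit(wholeValue(address), "102"));
        System.out.println(termHit(splitOnDots(address), "102"));
    }
}
```

Under tokenizer A only a wildcard or similar pattern query can find "102" inside the address; under tokenizer B an ordinary term query is enough, but a search for "02" would still miss.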

I would discourage you from using the WildcardQuery when it is not needed.

[1] http://lucene.apache.org/java/docs/queryparsersyntax.html

Regards Ard

