
Mailing List Archive: Lucene: General

issues with wildcard search and snowball english analyzer

 

 



jb4tech at gmail

Jul 24, 2008, 3:39 PM

Post #1 of 4 (376 views)
issues with wildcard search and snowball english analyzer

I am using SnowballAnalyzer (English).
I created one document with a single field whose content is "elephant is a
big animal".
I searched for e*t using QueryParser.
This did not return any results.
I then indexed "lion is a big animal"
and searched for l*n. This returned one result, as expected.
I looked at the index using Luke and found that elephant had been
stemmed to eleph by the analyzer.
I reindexed "elephant is a big animal" and tried e*h; this time I got
one hit.
I like stemming because it reduces tests, tested, testing, etc. to test.
Is there a way to avoid stemming in certain cases?
--
View this message in context: http://www.nabble.com/issues-with-wildcard-search-and-snowball-english-analyzer-tp18641947p18641947.html
Sent from the Lucene - General mailing list archive at Nabble.com.
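
A minimal sketch of why the first search comes back empty, assuming the wildcard pattern is compared against the stored terms as they are and is not itself stemmed (Lucene's QueryParser does not run wildcard terms through the analyzer). This is plain Java, not Lucene code; the class and method names are hypothetical, and only the stem "eleph" comes from the report above. The other terms are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class WildcardOverStems {

    // Translate a wildcard (* = any run of characters, ? = exactly one
    // character) into a regular expression that must match the whole term.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Return the indexed terms that the wildcard matches in full.
    static List<String> matches(String wildcard, List<String> indexedTerms) {
        Pattern pattern = toPattern(wildcard);
        List<String> hits = new ArrayList<>();
        for (String term : indexedTerms) {
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // "eleph" is the stored term seen in Luke; the rest is assumed.
        List<String> stemmed = Arrays.asList("eleph", "big", "anim");
        // What a verbatim (unstemmed) index would hold instead.
        List<String> verbatim = Arrays.asList("elephant", "big", "animal");

        System.out.println(matches("e*t", stemmed));   // prints []
        System.out.println(matches("e*h", stemmed));   // prints [eleph]
        System.out.println(matches("e*t", verbatim));  // prints [elephant]
    }
}
```

The pattern e*t has nothing to match once "elephant" is stored as "eleph", whereas the same pattern finds the term in an unstemmed list.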


andrewgilmartin at yahoo

Jul 24, 2008, 4:25 PM

Post #2 of 4 (349 views)
Re: issues with wildcard search and snowball english analyzer

--- On Thu, 7/24/08, JBTech <jb4tech[at]gmail.com> wrote:

> Is there a way to avoid stemming in certain cases?

As a general rule, make the query intelligent and not the index. Therefore, index your text verbatim. Small changes like changing terms to lowercase and removing possessives are fine. You now have an index upon which you can make intelligent queries.
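
The "small changes" mentioned above could look roughly like the following sketch. It is plain Java with hypothetical names (LightNormalizer, normalize, tokens) and a deliberately simple tokenization rule that is an assumption here, not what any Lucene analyzer does internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LightNormalizer {

    // Lowercase a token and drop a trailing possessive ('s or a bare ').
    static String normalize(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (t.endsWith("'s")) {
            t = t.substring(0, t.length() - 2);
        } else if (t.endsWith("'")) {
            t = t.substring(0, t.length() - 1);
        }
        return t;
    }

    // Split on anything that is not a letter, digit, or apostrophe,
    // then normalize each piece. No stemming is applied.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}']+")) {
            String t = normalize(raw);
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [the, elephant, trunk, is, big]
        System.out.println(tokens("The Elephant's trunk is big"));
    }
}
```

Because the terms are otherwise stored verbatim, "elephant" stays "elephant" in the index and remains reachable by wildcard patterns.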

An intelligent query requires keeping track of several collections of term-to-term(s) mappings. For example, stemmed-term to verbatim-term(s). Now, convert the user's search for "elephant is a big animal" into something akin to

( (elephant^10) OR (A) OR (B) ) AND
( (big^10) OR (C) ) AND
( (animal^10) OR (D) )

Where A and B are other terms with the same stem as elephant, C is another term with the same stem as big, and D is another term with the same stem as animal. Adding the boost ensures that a verbatim match pushes the document's rank higher, so that what the user asked for is closer to the top.

This basic idea of making the queries more intelligent by broadening them and boosting term weights gives you a lot of control over the query and how results are ranked. The same control is not possible by making the index more intelligent.
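
As a rough sketch of the expansion described above: given a stemmer and a stemmed-term to verbatim-term(s) map, each user term becomes a boosted clause OR'ed with its siblings, and the clauses are AND'ed together. The class name, the method signature, the toy stemmer, and the sample map are all hypothetical. The code only builds a query string and assumes the terms contain no query-syntax characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class ExpandedQueryBuilder {

    // Build "(term^boost OR sibling ...) AND (...)" for the user's terms.
    static String build(List<String> userTerms,
                        Function<String, String> stemmer,
                        Map<String, List<String>> stemToVerbatim,
                        int boost) {
        List<String> clauses = new ArrayList<>();
        for (String term : userTerms) {
            StringBuilder clause = new StringBuilder("(");
            // The verbatim term carries the boost so exact matches rank higher.
            clause.append(term).append('^').append(boost);
            List<String> siblings = stemToVerbatim.get(stemmer.apply(term));
            if (siblings != null) {
                for (String sibling : siblings) {
                    if (!sibling.equals(term)) {
                        clause.append(" OR ").append(sibling);
                    }
                }
            }
            clauses.add(clause.append(')').toString());
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        // Toy data, assumed for illustration only.
        Map<String, String> toyStems = new HashMap<>();
        toyStems.put("elephant", "eleph");
        toyStems.put("animal", "anim");
        Function<String, String> stemmer = t -> toyStems.getOrDefault(t, t);

        Map<String, List<String>> stemToVerbatim = new HashMap<>();
        stemToVerbatim.put("eleph", Arrays.asList("elephant", "elephants"));
        stemToVerbatim.put("big", Arrays.asList("big"));
        stemToVerbatim.put("anim", Arrays.asList("animal", "animals"));

        // prints (elephant^10 OR elephants) AND (big^10) AND (animal^10 OR animals)
        System.out.println(build(Arrays.asList("elephant", "big", "animal"),
                                 stemmer, stemToVerbatim, 10));
    }
}
```

The resulting string could then be handed to a query parser, or the same structure could be assembled from query objects directly.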

Don't worry about Lucene's performance with complex queries. My experience is that it is very fast.

And to answer your specific question, a search for "e*t" will work as is.

-- Andrew


jb4tech at gmail

Jul 25, 2008, 7:04 AM

Post #3 of 4 (341 views)
Re: issues with wildcard search and snowball english analyzer

Hi Andrew,
Thanks for your quick reply.
I tried with e*t and that did not return any results.
I am using Lucene 2.2.
The full word elephant returned one hit, as I am using the same analyzer for
indexing and searching.
I have uploaded the Java class I used for testing this.
Thanks
JB

http://www.nabble.com/file/p18652365/Testing.java Testing.java


andrewgilmartin at yahoo

Jul 25, 2008, 7:44 AM

Post #4 of 4 (339 views)
Re: issues with wildcard search and snowball english analyzer

--- On Fri, 7/25/08, JBTech <jb4tech[at]gmail.com> wrote:

> I tried with e*t and that did not return any results.

Hum. Example code would be helpful now.

-- Andrew
