Mailing List Archive: Lucene: General

Lucene Not Throwing Matches Without Spaces

 

 



nishu.soni at 3i-infotech

Nov 17, 2009, 5:16 AM

Post #1 of 5 (698 views)
Lucene Not Throwing Matches Without Spaces

Lucene does not return matches when the search string has no space but the data
in my index has one. For example, if the text "Saddam Hussain" is in the index
and I search for "SaddamHussain", I do not get any matches. I am
using a Boolean Query for the search.

Any help will be highly appreciated.
--
View this message in context: http://old.nabble.com/Lucene-Not-Throwing-Matches-Without-Spaces-tp26389750p26389750.html
Sent from the Lucene - General mailing list archive at Nabble.com.


simon.willnauer at googlemail

Nov 17, 2009, 8:09 AM

Post #2 of 5 (668 views)
Re: Lucene Not Throwing Matches Without Spaces [In reply to]

Nishu,

First, you should send this question to java-users, not to general :)
When you index a doc with the content "mighty duck", your TokenStream
most likely builds two tokens, t1:"mighty" and t2:"duck".
The same happens (most likely) when you search for "mighty duck" with
the QueryParser, so the query will be a boolean query, TermQuery("mighty") OR
TermQuery("duck"). This will retrieve your document. If you search for
"mightyduck", the query will only have one boolean clause (actually
none, it's just a term query) with TermQuery("mightyduck"). Lucene will
not find any matches, as this term is not in the index.
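
A minimal sketch of that behavior, written in plain Java with a toy inverted index. It is not the Lucene API: the class TermLookupSketch and its methods are hypothetical names, and the whitespace-plus-lowercase tokenizer is only an assumption about what a simple analyzer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of term lookup; hypothetical names, NOT the Lucene API.
class TermLookupSketch {
    // Assumed analyzer behavior: lowercase, then split on whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Build a map from each term to the ids of the documents containing it.
    static Map<String, Set<Integer>> index(List<String> docs) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : tokenize(docs.get(id))) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    // OR semantics: a document matches if it contains any of the query terms.
    static Set<Integer> search(Map<String, Set<Integer>> postings, String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : tokenize(query)) {
            hits.addAll(postings.getOrDefault(term, Collections.<Integer>emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> postings = index(Arrays.asList("mighty duck"));
        System.out.println(search(postings, "mighty duck")); // prints [0]
        System.out.println(search(postings, "mightyduck"));  // prints []
    }
}
```

The second search comes back empty because the only terms stored are "mighty" and "duck"; the lookup is an exact match on whole terms, not a substring search.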

Hope that helps for understanding what is going on.

simon



ted.dunning at gmail

Nov 17, 2009, 9:14 AM

Post #3 of 5 (662 views)
Re: Lucene Not Throwing Matches Without Spaces [In reply to]

That is what is going on.

To fix the problem you generally need to do a bit of statistics on your
corpus to discover word pairs that appear both with and without a space.
Once you have that, you have two approaches that will work.

The first approach is to index your text in an ambiguous fashion. Previously,
your "mighty duck" text would have been indexed, as Simon says,
as two terms ["mighty"@0, "duck"@1]. With the pair lexicon, you would instead index
the text as ["mighty duck"@0, "mighty"@0, "duck"@1]. At this point, either
query will work.
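
A rough sketch of the index-time idea in plain Java (not a Lucene TokenFilter). The class PairInjectionSketch and the pairLexicon set are hypothetical. One assumption to flag: the injected token is stored here in its space-free form ("mightyduck"), so that it is identical to the term a query for [mightyduck] would produce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of injecting a joined token at index time.
class PairInjectionSketch {
    // A term together with its position in the token stream.
    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }

        @Override
        public String toString() {
            return "\"" + term + "\"@" + position;
        }
    }

    // pairLexicon holds known pairs written with one space, e.g. "mighty duck".
    static List<Token> analyze(String text, Set<String> pairLexicon) {
        String[] words = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].isEmpty()) {
                continue;
            }
            // If this word and the next form a known pair, emit the joined
            // form at the same position as the first word of the pair.
            if (i + 1 < words.length
                    && pairLexicon.contains(words[i] + " " + words[i + 1])) {
                out.add(new Token(words[i] + words[i + 1], i));
            }
            out.add(new Token(words[i], i));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> lexicon = new HashSet<>(Arrays.asList("mighty duck"));
        // prints ["mightyduck"@0, "mighty"@0, "duck"@1]
        System.out.println(analyze("mighty duck", lexicon));
    }
}
```

Because the joined token shares position 0 with "mighty", phrase queries over the original words should be unaffected, while a single-term query for the joined form now finds the document.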

Another approach, which is easier if you don't want to mess with the indexer
and analyzer chain, is to do the same transformation at query time. If the
user types the query [mightyduck], you would rewrite this to be [mightyduck
OR phrase(mighty duck)]. Similarly, if the user types [mighty duck], you
would rewrite the query to be [mightyduck OR phrase(mighty duck) OR mighty
OR duck].
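
The query-time rewrite can be sketched as a plain string transformation. This is only an illustration in the bracket notation used above, not Lucene query syntax or API; QueryRewriteSketch and pairLexicon are hypothetical names, and only one-word and two-word queries are handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical query rewriter; output uses the thread's informal notation.
class QueryRewriteSketch {
    // pairLexicon holds known pairs written with one space, e.g. "mighty duck".
    static String rewrite(String userQuery, Set<String> pairLexicon) {
        String q = userQuery.trim().toLowerCase(Locale.ROOT);
        String[] words = q.split("\\s+");
        if (words.length == 1) {
            // Single word: check whether it is the joined form of a known pair.
            for (String pair : pairLexicon) {
                if (pair.replace(" ", "").equals(q)) {
                    return q + " OR phrase(" + pair + ")";
                }
            }
        } else if (words.length == 2) {
            String pair = words[0] + " " + words[1];
            if (pairLexicon.contains(pair)) {
                return words[0] + words[1] + " OR phrase(" + pair + ") OR "
                        + words[0] + " OR " + words[1];
            }
        }
        // No known pair involved: return the normalized query unchanged.
        return q;
    }

    public static void main(String[] args) {
        Set<String> lexicon = new HashSet<>(Arrays.asList("mighty duck"));
        System.out.println(rewrite("mightyduck", lexicon));
        System.out.println(rewrite("mighty duck", lexicon));
    }
}
```

In a real application the returned structure would be built as query objects rather than a string, but the string form keeps the mapping to the two examples above easy to check.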


--
Ted Dunning, CTO
DeepDyve


rcmuir at gmail

Nov 17, 2009, 9:24 AM

Post #4 of 5 (654 views)
Re: Lucene Not Throwing Matches Without Spaces [In reply to]

Solr's WordDelimiterFilter has an option, splitOnCaseChange, that I think
might work for your SaddamHussain example.

If you want to use Ted's first approach with Lucene, you could try the
compounds package in Lucene's analysis contrib and give it an English
word list
(or create a very refined custom list of your own, as he suggested).
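
To illustrate what splitting on a case change means, here is a small plain-Java sketch. It is not Solr's WordDelimiterFilter implementation; CaseChangeSplitSketch is a hypothetical name, and the regular expression is just one simple way to cut a token wherever a lowercase letter is followed by an uppercase one.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of splitting a token at lower-to-upper transitions.
class CaseChangeSplitSketch {
    static List<String> splitOnCaseChange(String token) {
        // Zero-width match between a lowercase letter and an uppercase letter.
        return Arrays.asList(token.split("(?<=\\p{Ll})(?=\\p{Lu})"));
    }

    public static void main(String[] args) {
        System.out.println(splitOnCaseChange("SaddamHussain")); // prints [Saddam, Hussain]
        System.out.println(splitOnCaseChange("saddamhussain")); // prints [saddamhussain]
    }
}
```

Note that this only helps when the query keeps its capitalization: the split has to happen before any lowercasing, and an all-lowercase query such as "saddamhussain" gives it nothing to split on.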


--
Robert Muir
rcmuir [at] gmail


nishu.soni at 3i-infotech

Nov 17, 2009, 9:28 PM

Post #5 of 5 (641 views)
Re: Lucene Not Throwing Matches Without Spaces [In reply to]

Simon,

Thanks for replying. Next time, I will take care to post my question in the right
forum.

I understood your explanation. But still, if I want to get matches without
a space in Lucene,
is there any way to do it?

Nishu



--
View this message in context: http://old.nabble.com/Lucene-Not-Throwing-Matches-Without-Spaces-tp26389750p26402748.html
Sent from the Lucene - General mailing list archive at Nabble.com.
