Mailing List Archive: Lucene: General

n-gram and multiword query

 

 



rajeshm at dessci

Jul 14, 2005, 8:23 AM

Post #1 of 4 (2341 views)
n-gram and multiword query

Consider a document with the following contents
" Levenshtein distance is named after the Russian scientist Vladimir
Levenshtein and is also called edit distance"

Possible bi-grams (after removing the stop words at the beginning
and end) are:
"Levenshtein distance", "named after", "Russian scientist", "scientist
Vladimir", "Vladimir Levenshtein", "called edit", "edit distance"
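
A minimal sketch of how the bi-grams listed above could be produced. It is plain Python for illustration, not Lucene's analyzer API. The stop-word list (is, the, and, also) and the rule that a stop word breaks a run of adjacent words are assumptions, chosen only because they reproduce the list above.

```python
import re

# Assumed stop-word list; the original post does not say which list is used.
STOP_WORDS = {"is", "the", "and", "also"}


def bigrams(text, stop_words=STOP_WORDS):
    """Return bi-grams of adjacent non-stop words.

    A stop word ends the current run, so no bi-gram spans the gap it leaves.
    """
    tokens = re.findall(r"\w+", text)
    result = []
    previous = None  # last non-stop token in the current run, if any
    for token in tokens:
        if token.lower() in stop_words:
            previous = None  # the gap breaks adjacency
            continue
        if previous is not None:
            result.append(f"{previous} {token}")
        previous = token
    return result


doc = ("Levenshtein distance is named after the Russian scientist Vladimir "
       "Levenshtein and is also called edit distance")
for gram in bigrams(doc):
    print(gram)
```

Under these assumptions the sketch yields the same seven bi-grams. Note that "after" and "Russian" never pair up, because the stop word "the" sits between them.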

If my query term is "Vladimir levenshtein distance", how does Lucene
compute the similarity to the indexed terms? Are query terms appearing
together given more importance? How does it account for gaps (caused by
stop word removal) while matching multiword query?

thanks,

Rajesh Munavalli


moonshotter at gmail

Jul 14, 2005, 8:39 AM

Post #2 of 4 (2266 views)
Re: n-gram and multiword query [In reply to]

As I recall, Lucene doesn't do anything for proximity.


--
Thanks!
yours, WeiZhu Chen


rajeshm at dessci

Jul 14, 2005, 8:48 AM

Post #3 of 4 (2268 views)
RE: n-gram and multiword query [In reply to]

What if my intention was to find all three words in a document not
necessarily in one sentence? Here is my goal

(1) All three words appearing together should be given Rank 1
(2) Three words appearing somewhere in the same sentence should be given Rank 2
(3) Documents containing words in different sentences should be given
Rank 3
(4) Documents missing one or more of query terms should be given Rank 4
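
The four tiers above can be sketched as a small classifier. This is plain Python for illustration only, not a Lucene query. Two things are assumptions: "appearing together" is read as the query words forming one contiguous run in any order, and sentences are split naively on ".", "!" and "?". The helper names are hypothetical.

```python
import re


def tokenize(text):
    """Lower-cased word tokens."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def contains_adjacent(tokens, terms):
    """True if the terms occur as one contiguous run, in any order (assumption)."""
    n = len(terms)
    wanted = sorted(terms)
    return any(sorted(tokens[i:i + n]) == wanted
               for i in range(len(tokens) - n + 1))


def rank(document, query):
    """Return the tier 1-4 described in the list above."""
    terms = tokenize(query)
    # Naive sentence splitting; a real system would use a proper segmenter.
    sentences = [tokenize(s) for s in re.split(r"[.!?]+", document) if s.strip()]
    all_tokens = {t for s in sentences for t in s}
    if not set(terms) <= all_tokens:
        return 4  # at least one query term is missing
    if any(contains_adjacent(s, terms) for s in sentences):
        return 1  # all terms appear together
    if any(set(terms) <= set(s) for s in sentences):
        return 2  # all terms within one sentence
    return 3      # all terms present, spread over several sentences


query = "Vladimir levenshtein distance"
print(rank("Vladimir Levenshtein distance is a metric.", query))
print(rank("Vladimir was a scientist. Levenshtein distance is useful.", query))
```

With this reading, the example document from the first post falls into Rank 2: all three words are in the one sentence, but never as a single contiguous run.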

Correct me if I am wrong... Proximity search is concerned with query
terms appearing close to one another, within a certain distance, in the
document.

Thanks,

Rajesh Munavalli



moonshotter at gmail

Jul 14, 2005, 8:51 AM

Post #4 of 4 (2243 views)
Re: n-gram and multiword query [In reply to]

Hi Munavalli,
for (1), (2), and (3), it seems only proximity search could solve this
problem. As for (4), Lucene already accounts for it through the
coordination factor of a document.

In my view, you are partially right about proximity search, since
proximity also takes the order of the terms into account.
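
As a rough illustration of the point about (4): a coordination-style factor rewards documents that contain more of the query terms. The sketch below computes the fraction of distinct query terms found in a document. It is plain Python, not Lucene code, and the function name `coord` and its signature are assumptions for illustration; as far as I know, Lucene's classic scoring uses a factor of this shape (matched terms divided by total query terms), multiplied into the rest of the score.

```python
def coord(query_terms, doc_terms):
    """Fraction of distinct query terms that occur in the document."""
    query = {t.lower() for t in query_terms}
    doc = {t.lower() for t in doc_terms}
    if not query:
        return 0.0
    return len(query & doc) / len(query)


query = ["Vladimir", "levenshtein", "distance"]
full = ["Vladimir", "Levenshtein", "edit", "distance"]
partial = ["Levenshtein", "distance", "metric"]

print(coord(query, full))     # all three terms present
print(coord(query, partial))  # one term missing, so a lower factor
```

A document missing one of the three terms gets a smaller factor than one containing all of them, which pushes it down the ranking without excluding it.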



--
Thanks!
yours, WeiZhu Chen
