
Mailing List Archive: Lucene: Java-User

Top field count scoring across documents

 

 



peter4u at hotmail

Nov 22, 2009, 8:42 AM

Post #1 of 3 (444 views)
Top field count scoring across documents

Hello Lucene Experts,

I wonder if someone might be able to shed some light on this interesting scoring question:

The problem:
Build a search query that returns hits ordered by how often each field value occurs across the matched documents (or as close to this as possible).
The built-in scoring is great for scoring number of hits within a document, but is there an efficient way to do this across the same field in a set of matched documents? (maybe scoring isn't the best way?)

Example:
Let's say you have an index containing book information. Each document has a 'title' field.
Let's say the index contains 100 entries, with:
65 'title's containing the word 'tiger'
21 containing 'lion'
6 containing 'panther'
5 containing 'kitten'
3 containing 'slug'

What would be the best way to build a query such that returned documents are ordered in this way:
Rank  Value    Occurrences
==========================
1     tiger    65
2     lion     21
3     panther  6
4     kitten   5
5     slug     3

I can, of course, run a standard query, traverse the returned documents and build such a list, but if the query returned hundreds of thousands of hits, the cost would grow linearly with the number of hits, which is wasteful when only the 'Top 5' are actually required.
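
For reference, the traverse-and-count approach described above might look roughly like the sketch below. It is plain Java with no Lucene calls, and it assumes the field value of each matched document has already been extracted into a list; the class and method names (TopFieldValues, topValues, sampleValues) are made up for illustration. The counting pass is the part that grows linearly with the number of hits; the bounded heap only keeps the final top-n selection cheap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Lucene.
final class TopFieldValues {

    /** Returns the n most frequent values, highest count first. */
    static List<Map.Entry<String, Integer>> topValues(List<String> valuePerMatchedDoc, int n) {
        // One pass over every hit: this is the linear cost the post worries about.
        Map<String, Integer> counts = new HashMap<>();
        for (String value : valuePerMatchedDoc) {
            counts.merge(value, 1, Integer::sum);
        }

        // Min-heap bounded to n entries, so picking the top n avoids a full sort.
        Comparator<Map.Entry<String, Integer>> byCount = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // drop the entry with the smallest count
            }
        }

        List<Map.Entry<String, Integer>> top = new ArrayList<>(heap);
        top.sort(byCount.reversed());
        return top;
    }

    /** Builds the 100-document example from the post (one value per document). */
    static List<String> sampleValues() {
        List<String> values = new ArrayList<>();
        values.addAll(Collections.nCopies(65, "tiger"));
        values.addAll(Collections.nCopies(21, "lion"));
        values.addAll(Collections.nCopies(6, "panther"));
        values.addAll(Collections.nCopies(5, "kitten"));
        values.addAll(Collections.nCopies(3, "slug"));
        return values;
    }

    public static void main(String[] args) {
        int rank = 1;
        for (Map.Entry<String, Integer> entry : topValues(sampleValues(), 5)) {
            System.out.println(rank++ + " " + entry.getKey() + " " + entry.getValue());
        }
    }
}
```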


One idea is to maintain a separate index with this information - the main problem with this is that you essentially need to know what you're searching for at index-time, which isn't ideal.


Has anyone come across and solved this particular issue using Lucene?

Many thanks,
Peter





jake.mannix at gmail

Nov 22, 2009, 9:20 AM

Post #2 of 3 (415 views)
Re: Top field count scoring across documents

Peter,

You want to do a facet query. This kind of functionality is not in
Lucene-core (sadly), but both Solr (the fully featured search application
built on Lucene) and bobo-browse (just a library, like Lucene itself) are
open-source and work with Lucene to provide faceting capabilities for you.
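
To give a feel for what a facet count is, here is a minimal sketch in plain Java. It is not how Solr or bobo-browse are actually implemented, and every name in it (FacetCountSketch, add, topFacets, sampleIndex) is hypothetical. It assumes a per-value set of document ids is built once up front; at query time each value's set is intersected with the set of matched documents, so no per-hit traversal of stored fields is needed.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Solr or bobo-browse API.
final class FacetCountSketch {

    // value -> ids of the documents whose field holds that value (built once, up front)
    private final Map<String, BitSet> docsByValue = new LinkedHashMap<>();

    void add(int docId, String value) {
        docsByValue.computeIfAbsent(value, v -> new BitSet()).set(docId);
    }

    /** Counts, per value, how many matched documents carry it; returns the top n. */
    List<Map.Entry<String, Integer>> topFacets(BitSet matchedDocs, int n) {
        List<Map.Entry<String, Integer>> counts = new ArrayList<>();
        for (Map.Entry<String, BitSet> posting : docsByValue.entrySet()) {
            BitSet hits = (BitSet) posting.getValue().clone();
            hits.and(matchedDocs); // keep only documents that matched the query
            int count = hits.cardinality();
            if (count > 0) {
                counts.add(new AbstractMap.SimpleImmutableEntry<String, Integer>(posting.getKey(), count));
            }
        }
        counts.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return counts.subList(0, Math.min(n, counts.size()));
    }

    /** The 100-document example from the original post, with doc ids 0..99. */
    static FacetCountSketch sampleIndex() {
        FacetCountSketch index = new FacetCountSketch();
        String[] values = {"tiger", "lion", "panther", "kitten", "slug"};
        int[] sizes = {65, 21, 6, 5, 3};
        int docId = 0;
        for (int i = 0; i < values.length; i++) {
            for (int j = 0; j < sizes[i]; j++) {
                index.add(docId++, values[i]);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        BitSet allDocs = new BitSet();
        allDocs.set(0, 100); // pretend the query matched every document
        System.out.println(sampleIndex().topFacets(allDocs, 5));
    }
}
```

The cost of this sketch depends on the number of distinct values rather than on walking each hit's stored fields, which is the trade-off that makes faceting attractive when only a 'Top 5' is needed.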

-jake



peter4u at hotmail

Nov 22, 2009, 9:45 AM

Post #3 of 3 (411 views)
RE: Top field count scoring across documents

Hi Jake,

Many thanks for your quick reply.
I shall check these out.

Thanks!
Peter



