Mailing List Archive: Lucene: General

Question for top term frequency

 

 



bradjoe99 at yahoo

Jun 16, 2009, 1:30 PM

Post #1 of 7 (570 views)
Question for top term frequency

We have an indexed field called "Author" that contains the author name.
We'd like to get the records for the top 10 authors, that is, the authors
who have the most records in the Lucene index. Is there a good way to do it?
I searched the mailing list and did not find a good match.
--
View this message in context: http://www.nabble.com/Question-for-top-term-frequency-tp24062253p24062253.html
Sent from the Lucene - General mailing list archive at Nabble.com.


ted.dunning at gmail

Jun 16, 2009, 2:29 PM

Post #2 of 7 (532 views)
Re: Question for top term frequency [In reply to]

It is easy to get global document frequencies for all authors.

Then it is easy to build a query that accepts documents from any of the top
authors.

It requires more than one query, but only a few lines of code.
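
A minimal sketch of those two steps, written in plain Python rather than against the Lucene API. It counts how many documents each author appears in, then builds a query string that accepts documents from any of the top authors. The `docs` sample data and the helper names `top_authors` and `authors_query` are illustrative assumptions, not part of Lucene; in a real index the counts would come from the index's term statistics for the Author field.

```python
from collections import Counter

def top_authors(docs, n=10):
    """Return the n authors with the highest document frequency."""
    # Each document contributes one count to its author.
    counts = Counter(doc["Author"] for doc in docs)
    return [author for author, _ in counts.most_common(n)]

def authors_query(authors, field="Author"):
    """Build a Lucene-style query string matching any of the given authors."""
    # Quote each name as a phrase and escape embedded double quotes.
    quoted = " OR ".join('"{}"'.format(a.replace('"', '\\"')) for a in authors)
    return "{}:({})".format(field, quoted)

# Hypothetical sample data standing in for the indexed documents.
docs = [
    {"Author": "Ann Lee"},
    {"Author": "Bob Ray"},
    {"Author": "Ann Lee"},
    {"Author": "Cy Poe"},
    {"Author": "Ann Lee"},
    {"Author": "Bob Ray"},
]

print(authors_query(top_authors(docs, n=2)))
```

The first step is one pass over the author terms and the second is an ordinary search, which is why this takes more than one query but very little code.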

On Tue, Jun 16, 2009 at 1:30 PM, zehua <bradjoe99[at]yahoo.com> wrote:

> Is there a
> good way to do it? I searched the mailing list, and did not find a good
> match.
>


bradjoe99 at yahoo

Jun 17, 2009, 2:33 PM

Post #3 of 7 (528 views)
Re: Question for top term frequency [In reply to]

Thanks for the reply.

The problem is that the number of documents may be huge, for example
10,000.
If we return all of these documents and find the top authors by looping over
the term frequencies, it can take a long time.

We are considering using CustomScoreQuery. The first parameter is the normal
query that matches the results.
The second parameter uses the frequency of the "Author" field to increase the
score, so that the results for top authors get a higher score and are
returned. Does that make sense?



Ted Dunning wrote:
>
> It is easy to get global document frequencies for all authors.
>
> Then it is easy to build a query that accepts documents from any of the
> top
> authors.
>
> It requires more than one query, but only a few lines of code.
>
> On Tue, Jun 16, 2009 at 1:30 PM, zehua <bradjoe99[at]yahoo.com> wrote:
>
>> Is there a
>> good way to do it? I searched the mailing list, and did not find a good
>> match.
>>
>
>



bradjoe99 at yahoo

Jun 17, 2009, 2:45 PM

Post #4 of 7 (528 views)
Re: Question for top term frequency [In reply to]

One thing to add is that the top authors are [not] based on all documents.
They are based on the returned results.
For example, if 10,000 results match the query, the top authors are chosen
from among those 10,000 results.
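
To make the requirement concrete, here is a small sketch in plain Python (no Lucene calls) that counts authors only within the hits returned for a query and then keeps the hits written by the most frequent authors. The `results` list and the function names are hypothetical, chosen only for illustration.

```python
from collections import Counter

def top_authors_in_results(results, n=10):
    """Return (author, count) pairs for the n most frequent authors in the hits."""
    counts = Counter(hit["Author"] for hit in results)
    return counts.most_common(n)

def keep_top_author_hits(results, n=10):
    """Keep only the hits whose author is among the top n for this result set."""
    top = {author for author, _ in top_authors_in_results(results, n)}
    return [hit for hit in results if hit["Author"] in top]

# Hypothetical hits for one query; a real result set could hold 10,000 of these.
results = [
    {"id": 1, "Author": "Ann Lee"},
    {"id": 2, "Author": "Bob Ray"},
    {"id": 3, "Author": "Ann Lee"},
    {"id": 4, "Author": "Cy Poe"},
    {"id": 5, "Author": "Ann Lee"},
    {"id": 6, "Author": "Bob Ray"},
]

print(top_authors_in_results(results, n=2))
print([hit["id"] for hit in keep_top_author_hits(results, n=2)])
```

The counts here depend on the query, so they cannot be read from index-wide term statistics alone.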


zehua wrote:
>
> We have one column called "Author" indexed which contains the author name.
> We'd like to
> get the records with the top 10 authors who have most records in the
> lucene. Is there a
> good way to do it? I searched the mailing list, and did not find a good
> match.
>



gsingers at apache

Jun 17, 2009, 3:42 PM

Post #5 of 7 (527 views)
Re: Question for top term frequency [In reply to]

Isn't this just faceting on the author field and then making a query
out of the top ten authors? I think you could do this in Solr pretty
easily. Or maybe I don't understand the question.
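
As a rough illustration, a facet request of this kind could be assembled as below. The parameters `q`, `rows`, `facet`, `facet.field`, `facet.limit`, and `wt` are standard Solr request parameters; the base URL, the example query, and the `facet_url` helper are assumptions made up for this sketch. Faceting counts indexed terms, so this also assumes Author is indexed as a single untokenized value.

```python
from urllib.parse import urlencode

def facet_url(base_url, query, field="Author", limit=10):
    """Build a Solr select URL that facets the matching documents on one field."""
    params = [
        ("q", query),            # the user's normal query
        ("rows", 0),             # only the facet counts are needed here
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", limit),  # top N values by count
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)

# Hypothetical Solr location and query.
url = facet_url("http://localhost:8983/solr/select", "title:lucene")
print(url)
```

The author names that come back can then be turned into a second query, or a filter, that restricts the results to those ten authors.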

-Grant
On Jun 17, 2009, at 5:45 PM, zehua wrote:

>
> One thing to add is that the top author is based on all documents.
> It is
> based on the returned results.
> For example, we have 10000 results match the query, the top authors
> are
> among the 10000 results.
>
>
> zehua wrote:
>>
>> We have one column called "Author" indexed which contains the
>> author name.
>> We'd like to
>> get the records with the top 10 authors who have most records in the
>> lucene. Is there a
>> good way to do it? I searched the mailing list, and did not find a
>> good
>> match.
>>
>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search


ted.dunning at gmail

Jun 17, 2009, 4:28 PM

Post #6 of 7 (527 views)
Re: Question for top term frequency [In reply to]

It is indeed faceting. I misunderstood the original request as being
against the entire corpus. For the very modest result-set size that he is
talking about, Solr faceting should work just fine.

Zehua's loss of the word NOT in his latest message increased my confusion a
bit.

On Wed, Jun 17, 2009 at 3:42 PM, Grant Ingersoll <gsingers[at]apache.org> wrote:

> Isn't this just faceting on the author field and then making a query out of
> the top ten authors? I think you could do this in Solr pretty easily. Or
> maybe I don't understand the question.
>
> -Grant
>
> On Jun 17, 2009, at 5:45 PM, zehua wrote:
>
>
>> One thing to add is that the top author is *[NOT]* based on all
>> documents. It is
>> based on the returned results.
>> For example, we have 10000 results match the query, the top authors are
>> among the 10000 results.
>>
>>
>>


gsingers at apache

Jun 17, 2009, 4:32 PM

Post #7 of 7 (530 views)
Re: Question for top term frequency [In reply to]

On Jun 17, 2009, at 7:28 PM, Ted Dunning wrote:

> It is indeed faceting. I misunderstood the original request as being
> against the entire corpus. For the very modest size result that he is
> talking about, SOLR faceting should work just fine.

Even the entire corpus is fine, just use *:* (MatchAllDocsQuery) in
Solr ;-)
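
A small sketch of reading the facet counts back, assuming the response was requested with `wt=json` and uses Solr's default flat layout, where values and counts alternate in one list. The sample response below is made up for illustration, and `parse_facet_counts` is a hypothetical helper rather than part of any Solr client.

```python
MATCH_ALL = "*:*"  # query that matches every document in the index

def parse_facet_counts(response, field="Author"):
    """Turn Solr's flat [value, count, value, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))

# Made-up response fragment for a request with q=*:* and facet.field=Author.
sample_response = {
    "facet_counts": {
        "facet_fields": {
            "Author": ["Ann Lee", 3, "Bob Ray", 2, "Cy Poe", 1],
        }
    }
}

print(parse_facet_counts(sample_response))
```

Swapping `MATCH_ALL` for a narrower query gives per-result-set counts with the same parsing code.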
