
Mailing List Archive: Lucene: Java-User

Bucketing (was Re: Wikia search goes live today)

 

 



otis_gospodnetic at yahoo

Jan 8, 2008, 11:06 PM

Post #1 of 3 (884 views)
Bucketing (was Re: Wikia search goes live today)

Sounds useful. I suppose this means one would have a custom function for within-bucket reordering? E.g., for a web search you might reorder based on URL length if you think shorter URLs are an indicator of higher quality. It also sounds like something that can easily sit outside Lucene... or do you have something else in mind, such as a mechanism to pass a reordering function into Lucene?

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Andrzej Bialecki <ab [at] getopt>
To: java-user [at] lucene
Sent: Tuesday, January 8, 2008 5:24:01 PM
Subject: Re: Wikia search goes live today

Ryan McKinley wrote:
> Andrzej Bialecki wrote:
>> Lukas Vlcek wrote:
>>> So starring will be accommodated only during the indexing phase. Does it
>>> mean it will be a pretty static value, not a dynamically changing
>>> variable... correct?
>>> In other words, if I add my stars to some document it won't affect the
>>> scoring immediately, but only after an indexing cycle. Correct?
>>
>> (I'm not involved in Wikia development). There are some ways to go
>> about it even in the pure Lucene-land, so that the updates are fast
>> without reindexing the main content. Hint: ParallelReader.
>>
>
> in solr (1.3-dev) you can have an external value source with a function
> query...

True, although function query tends to bring more overhead ...

While we're on the subject of complex scoring - I read an interesting
paper (I don't have a link now), which discussed so-called bucketed
scoring. The idea is that if your basic scoring is good enough to ensure
that the top-N results are highly relevant, then you can split these
results into buckets of k documents (let's say 10 ;) ), and within each
bucket apply an arbitrary re-ranking function, which is then very
inexpensive to perform because of the limited number of documents.

Example: you have a large corpus of web pages, and you want home pages
to appear first, even if they score somewhat lower - and it doesn't pay
off to modify the base scoring, because of overfitting, i.e. the scoring
would be good for home pages but poor for other relevant documents.
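
To make the idea concrete, here is a minimal sketch of the home-page
example, assuming the hits arrive already sorted by base score. Nothing
in it is Lucene API: the Hit type, isHomePage() and the BucketedReranker
class are hypothetical names invented for illustration, and the
home-page test is a deliberately crude stand-in for a real classifier.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch of bucketed re-ranking; all names are hypothetical, not Lucene API. */
final class BucketedReranker {

    /** A search hit as delivered by the base scorer (hypothetical type). */
    static final class Hit {
        final String url;
        final float score;
        Hit(String url, float score) { this.url = url; this.score = score; }
    }

    /** Crude stand-in for home-page detection: no path beyond "/" after the host. */
    static boolean isHomePage(String url) {
        int schemeEnd = url.indexOf("://");
        int hostStart = schemeEnd < 0 ? 0 : schemeEnd + 3;
        int slash = url.indexOf('/', hostStart);
        return slash < 0 || slash == url.length() - 1;
    }

    /**
     * Splits hits (already sorted by base score, best first) into consecutive
     * buckets of size k and sorts each bucket on its own with the given order.
     * Hits never move across bucket boundaries.
     */
    static List<Hit> rerank(List<Hit> hits, int k, Comparator<Hit> withinBucket) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        List<Hit> out = new ArrayList<>(hits.size());
        for (int start = 0; start < hits.size(); start += k) {
            int end = Math.min(start + k, hits.size());
            List<Hit> bucket = new ArrayList<>(hits.subList(start, end));
            bucket.sort(withinBucket); // stable sort: ties keep base-score order
            out.addAll(bucket);
        }
        return out;
    }

    /** Home pages first (false sorts before true); other hits keep their order. */
    static final Comparator<Hit> HOME_PAGES_FIRST =
        Comparator.comparing((Hit h) -> !isHomePage(h.url));
}
```

Because a hit can only move within its own bucket, a home page can climb
at most k-1 positions, so the base ranking still decides which bucket a
document belongs to.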


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene







ab at getopt

Jan 9, 2008, 2:30 AM

Post #2 of 3 (804 views)
Re: Bucketing (was Re: Wikia search goes live today)

Otis Gospodnetic wrote:
> Sounds useful. I suppose this means one would have custom function
> for within-bucket-reordering? e.g. for a web search you might reorder
> based on the URL length if you think shorter URLs are an indicator of


Yes, that's precisely the idea. It combines the advantages of simple
(hence fast) scoring inside the IR system, with a complex (hence slow)
reordering of a small sample of results, performed outside the IR system
prior to delivering the results.


> higher quality. It also sounds like something that can easily sit
> outside Lucene....or do you have something else in mind, such as a
> mechanism to pass a reordering function in Lucene?

It should definitely be something outside Lucene - it's meant for cases
that require more complex (or faster) ranking than what is available
through function query. I only mentioned this here because it is simple
to implement, yet produces useful results that are difficult to obtain
through the usual means (similarity, boosting, even function query).
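
As a rough illustration of what "outside Lucene" could look like, the
sketch below takes the ranked results as a plain list and accepts the
reordering function as a parameter. It is only an assumption about one
possible shape: PostSearchReorderer, reorder() and URL_LENGTH are
hypothetical names, and the URL-length heuristic is the one Otis
suggested. The cost function is evaluated once per result, which matters
when it is the slow part.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Sketch of a post-processing step that lives outside the search engine. */
final class PostSearchReorderer {

    /** Pairs a result with its precomputed cost so the slow function runs once. */
    private static final class Keyed<T> {
        final T item;
        final double cost;
        Keyed(T item, double cost) { this.item = item; this.cost = cost; }
    }

    /**
     * Reorders each consecutive bucket of k results by ascending cost.
     * 'ranked' must already be in base-score order, best first.
     */
    static <T> List<T> reorder(List<T> ranked, int k, ToDoubleFunction<T> cost) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        List<T> out = new ArrayList<>(ranked.size());
        for (int start = 0; start < ranked.size(); start += k) {
            int end = Math.min(start + k, ranked.size());
            List<Keyed<T>> bucket = new ArrayList<>(end - start);
            for (T item : ranked.subList(start, end)) {
                bucket.add(new Keyed<>(item, cost.applyAsDouble(item)));
            }
            // Stable sort: equal costs keep their base-score order.
            bucket.sort(Comparator.comparingDouble((Keyed<T> e) -> e.cost));
            for (Keyed<T> e : bucket) {
                out.add(e.item);
            }
        }
        return out;
    }

    /** Example heuristic: treat a shorter URL as a sign of higher quality. */
    static final ToDoubleFunction<String> URL_LENGTH = String::length;
}
```

A caller would fetch the top-N URLs from the search system as usual and
then call, for example, PostSearchReorderer.reorder(urls, 10,
PostSearchReorderer.URL_LENGTH) just before rendering the results.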






gsingers at apache

Jan 9, 2008, 5:28 AM

Post #3 of 3 (794 views)
Re: Bucketing (was Re: Wikia search goes live today)

Would be a nice contrib module, though...

-Grant

On Jan 9, 2008, at 5:30 AM, Andrzej Bialecki wrote:

> Otis Gospodnetic wrote:
>> Sounds useful. I suppose this means one would have custom function
>> for within-bucket-reordering? e.g. for a web search you might reorder
>> based on the URL length if you think shorter URLs are an indicator of
>
>
> Yes, that's precisely the idea. It combines the advantages of simple
> (hence fast) scoring inside the IR system, with a complex (hence
> slow) reordering of a small sample of results, performed outside the
> IR system prior to delivering the results.
>
>
>> higher quality. It also sounds like something that can easily sit
>> outside Lucene....or do you have something else in mind, such as a
>> mechanism to pass a reordering function in Lucene?
>
> It should definitely be something outside Lucene - it's meant for
> cases that require more complex (or faster) ranking than what is
> available through function query. I only mentioned this here because
> it is simple to implement, yet produces useful results that are
> difficult to obtain through the usual means (similarity, boosting,
> even function query).

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





