Mailing List Archive: Lucene: Java-User

use index, big or small?

teddyyyy123 at gmail

May 4, 2012, 4:47 PM

Post #1 of 2 (139 views)
use index, big or small?

I have an index containing all students. Now I want to search that
index from inside an Apache Hadoop mapper,
i.e.

for each (record from mapper input reader) {
    output = lucene.search("name:" + record.name + " OR id:" + record.id);
    emit(output)
}
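
As a rough illustration of the query-building step in the loop above, here is a minimal sketch in plain Java. The class and method names are hypothetical, and quoting each value as a phrase is an assumption, not something the original post specifies; it only keeps a multi-word name or a stray special character from being parsed as query syntax. Lucene's own QueryParser.escape(String) is an alternative if phrase matching is not wanted.

```java
final class StudentQuery {
    // Hypothetical helper: wraps a value in double quotes so the query
    // parser sees one phrase, escaping backslashes and embedded quotes.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds the query string for one input record, using the field
    // names "name" and "id" from the pseudocode above.
    static String build(String name, String id) {
        return "name:" + quote(name) + " OR id:" + quote(id);
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        System.out.println(build("Ada Lovelace", "42"));
    }
}
```

The resulting string would then be handed to whatever search call the mapper uses.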


My question is whether I should shard the index (partitioning by term,
not splitting the postings list of a single term) or simply replicate it.
The index for the entire dataset is not too big, so it can fit on
my local disk, and I can copy it to every node in the cluster and let
it sit there all the time, so no copy overhead is incurred.
The only argument in favor of sharding is that a smaller index might
be faster. But since an index search takes only O(log n) time, the
time saved may be very small.
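
To put a rough number on the O(log n) argument: if a term lookup is modelled as a binary search over n sorted terms (a simplification made only for illustration, not a description of Lucene's actual term dictionary), splitting the terms evenly across k shards saves about log2(k) steps per lookup, whatever n is. A minimal sketch, with the class name, the term count, and the shard count all hypothetical:

```java
final class ShardSavings {
    // Number of times n must be halved to reach 1, rounded up,
    // i.e. ceil(log2(n)) for n >= 1.
    static int lookupSteps(long n) {
        return 64 - Long.numberOfLeadingZeros(n - 1);
    }

    public static void main(String[] args) {
        long terms = 1_000_000L; // hypothetical number of distinct terms
        int shards = 16;         // hypothetical shard count

        int whole = lookupSteps(terms);
        int perShard = lookupSteps(terms / shards);

        System.out.println("whole index: " + whole + " steps");
        System.out.println("one shard:   " + perShard + " steps");
        System.out.println("saved:       " + (whole - perShard) + " steps");
    }
}
```

With 1,000,000 terms and 16 shards this gives 20 steps against 16, a saving of 4 steps per lookup.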

So will sharding be worth the effort?

thanks
yang

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


ian.lea at gmail

May 10, 2012, 1:22 AM

Post #2 of 2 (130 views)
Re: use index, big or small?

Impossible to say - how big is big? How fast is fast? I'd start with
the simplest option and if it's fast enough, stop.
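
One way to find out whether the simplest option is fast enough is to time it. The sketch below is a hypothetical harness in plain Java, not anything from this thread: the Runnable stands in for one search call against the replicated index, and the iteration count is arbitrary. It ignores JIT warm-up and disk caching, so treat the figure as a first estimate only.

```java
final class LookupTimer {
    // Runs the task `iterations` times and returns the mean wall-clock
    // time per call, in microseconds.
    static double meanMicros(Runnable task, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / 1_000.0 / iterations;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the real search call.
        Runnable standIn = () -> "name:x OR id:1".hashCode();
        System.out.println(meanMicros(standIn, 100_000) + " us per call");
    }
}
```

Multiplying the per-call figure by the number of mapper input records gives a rough idea of the total search time per task.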


--
Ian.

