
Mailing List Archive: Lucene: Java-User

Is Lucene a good choice for PB scale mailbox search?

 

 



tangfulin at gmail

Nov 23, 2009, 6:35 PM

Post #1 of 7 (1297 views)
Is Lucene a good choice for PB scale mailbox search?

We are going to add full-text search to our mailbox service.

The problem is that we have more than 1 PB of mail, and obviously we
don't want to add another PB of storage for the search service, so we
hope the index data can be kept small while search stays fast.

Luckily, each user searches only their own mail, so we can split the
data into many small indexes instead of keeping it all in one big
index.

So, given these concerns, the question is: is Lucene a good choice for
this? If not, what is the right way to do it? Has anyone done this
before?

All opinions and comments are welcome!

fulin


--
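The start of the dream struggles at the edge of the city
The heart's far horizon persists in the instant of each step
My fate has buried an eternity of loneliness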

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


skant at sloan

Nov 23, 2009, 7:59 PM

Post #2 of 7 (1238 views)
Re: Is Lucene a good choice for PB scale mailbox search?

Hi, I have not worked at petascale (yet!), mostly on the scale of tens
of terabytes, but I do think Lucene would be very helpful for such a use
case. I would indeed suggest partitioning the index by user: it seems
the most logical, straightforward way, and it also offers the security
of insulating one user's emails from others.
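
A minimal sketch of what per-user partitioning could look like, using only the Java standard library. The class name `UserShardRouter`, the `shard-NNNN/<user>` directory layout, and the shard count of 64 are assumptions made for illustration; none of them come from Lucene, Solr, or Compass.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical router: maps a mailbox owner to one of N index shards. */
final class UserShardRouter {
    private final int shardCount;

    UserShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** The same user id always maps to the same shard number. */
    int shardFor(String userId) {
        CRC32 crc = new CRC32();
        crc.update(userId.getBytes(StandardCharsets.UTF_8));
        // getValue() is an unsigned 32-bit value held in a long, so the remainder is non-negative.
        return (int) (crc.getValue() % shardCount);
    }

    /** Assumed layout: one small index directory per user, grouped by shard. */
    String indexPathFor(String userId) {
        return String.format("shard-%04d/%s", shardFor(userId), userId);
    }

    public static void main(String[] args) {
        UserShardRouter router = new UserShardRouter(64);
        System.out.println(router.indexPathFor("alice@example.com"));
    }
}
```

Because a query only ever touches one user's directory, no cross-user merging is needed at search time. Plain modulo hashing does mean that changing the shard count relocates most users, so a real deployment would probably want a lookup table or consistent hashing instead.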

Take a look at Compass and Solr (both based on Lucene); they might be
more oriented to your needs.

HTH,
Shashi




jason.rutherglen at gmail

Nov 23, 2009, 9:41 PM

Post #3 of 7 (1235 views)
Re: Is Lucene a good choice for PB scale mailbox search?

A sharded architecture (i.e. many smaller indexes), as used by Google
for example and implemented in open source by the Katta project, may be
best for scaling to sizable levels. Katta is also useful for
redundancy and fault tolerance.



kaykay.unique at gmail

Nov 23, 2009, 10:56 PM

Post #4 of 7 (1235 views)
Re: Is Lucene a good choice for PB scale mailbox search?

fulin tang wrote:
> The lucky is that every user just search with mails of their own , so
> we can split the data into a lot of indexes instead of keeping them in
> a big one .
>
If it is going to be sharded by the 'To' or 'Cc' list, then the mail
data will potentially be duplicated in proportion to the number of
people in an email thread. Sharding on some other dimension, such as
time, might be useful to begin with.
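
To put rough numbers on the duplication concern, here is a small sketch of the arithmetic. The recipient counts are made up, and the model is deliberately simple: it assumes one index entry per recipient mailbox under per-user sharding and ignores the sender's own copy.

```java
/** Hypothetical estimate of how many times messages get indexed. */
final class DuplicationEstimate {

    /** Per-user sharding: every recipient's index holds its own copy of the message. */
    static long copiesWithPerUserSharding(int[] recipientsPerMessage) {
        long total = 0;
        for (int recipients : recipientsPerMessage) {
            total += recipients;
        }
        return total;
    }

    /** Sharding on another dimension (e.g. time): each message is indexed once. */
    static long copiesWithSharedSharding(int[] recipientsPerMessage) {
        return recipientsPerMessage.length;
    }

    public static void main(String[] args) {
        // Made-up recipient counts for four messages.
        int[] recipients = {1, 1, 5, 20};
        System.out.println(copiesWithPerUserSharding(recipients)); // prints 27
        System.out.println(copiesWithSharedSharding(recipients));  // prints 4
    }
}
```

How much this matters depends on the real distribution of recipient counts, and on whether the mail store already keeps one copy per mailbox anyway.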
> So, after all these concerns , the question is , is lucene a good
> choice for this ? or which is the right way to do this ? Does anyone
> have done this before ?
>

With a PB of storage, check out Solr sharding / Katta for prior work in
this area.



otis_gospodnetic at yahoo

Nov 24, 2009, 3:20 PM

Post #5 of 7 (1207 views)
Re: Is Lucene a good choice for PB scale mailbox search?

For what it's worth, AOL uses a Solr cluster to handle searches for @aol users. Each user has his own index.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR





tangfulin at gmail

Nov 25, 2009, 5:34 PM

Post #6 of 7 (1181 views)
Re: Is Lucene a good choice for PB scale mailbox search?

Thanks all for the good suggestions!

But any ideas about storage? How can we make the indexes as small as
possible?

We know compression is the only way, but when and where is it best to
compress so that search stays fast?

Thanks again, all!




ian.lea at gmail

Nov 26, 2009, 1:57 AM

Post #7 of 7 (1174 views)
Re: Is Lucene a good choice for PB scale mailbox search?

If you are planning on using Lucene only for searching, then you don't
need to store much data at all: just the message id or whatever you
use to identify messages. And there won't be much point in
compressing that.
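
As a toy illustration of the search-only idea, the sketch below keeps a map from terms to message ids and never retains the message text. It is not how Lucene stores its index; the class `IdOnlyIndex` and its naive tokenizer (lowercase, split on non-word characters, ASCII-oriented) are assumptions for the example. In Lucene 2.9 terms, this roughly corresponds to indexing the body with `Field.Store.NO` and storing only the id field.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: remembers which message ids contain a term, not the text itself. */
final class IdOnlyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Tokenizes the body and records the message id under each term; the body is then discarded. */
    void add(String messageId, String body) {
        for (String term : body.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(messageId);
            }
        }
    }

    /** Returns the ids of messages containing the term; the caller fetches the mail from the mail store. */
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        IdOnlyIndex index = new IdOnlyIndex();
        index.add("msg-1", "Quarterly report attached");
        index.add("msg-2", "Lunch on Friday?");
        System.out.println(index.search("report")); // prints [msg-1]
    }
}
```

The term-to-id postings still take space, but the original text lives only in the existing mail store, so the search service does not need a second full copy of the mail.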

If, on the other hand, you plan on storing data in Lucene, perhaps for
displaying hits on a web page, you might want to compress it. That
will save some space, but at the cost of some performance at indexing
and retrieval time. If you are storing, say, From:, To: and Subject:
for display in search results, with the message body displayed only
when the user opens the message, you could leave the first three
uncompressed and compress the message body.
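
A minimal sketch of that split, using `java.util.zip` from the Java standard library rather than any Lucene API. The `StoredMessage` class and its field names are hypothetical: the short header fields stay as plain strings for cheap display in result lists, and only the body pays the compression cost.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical stored record: small headers uncompressed, large body compressed. */
final class StoredMessage {
    final String from;
    final String to;
    final String subject;
    final byte[] compressedBody;

    StoredMessage(String from, String to, String subject, String body) {
        this.from = from;
        this.to = to;
        this.subject = subject;
        this.compressedBody = compress(body.getBytes(StandardCharsets.UTF_8));
    }

    /** Decompresses on demand, i.e. only when the user opens the message. */
    String body() {
        try {
            return new String(decompress(compressedBody), StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalStateException("stored body is corrupt", e);
        }
    }

    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib memory
        }
    }

    static byte[] decompress(byte[] input) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(input);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) {
                    throw new DataFormatException("truncated compressed data");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }
}
```

Rendering a page of hits then needs only `from`, `to` and `subject`, so decompression happens at most once per opened message. Very short bodies may not shrink at all, since the compressed format carries a small fixed overhead.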

Personally, I only use compression in indexes that store large fields
but have a low search/retrieval rate. But my indexes are only a few GB
in size.

Lucene's handling of compressed fields is changing in 3.0; see the
release notes or the 2.9 javadocs for Field.Store.COMPRESS.


--
Ian.

