
Mailing List Archive: Lucene: General

Open Relevance Infrastructure Request

 

 



gsingers at apache

May 26, 2009, 5:32 AM

Post #1 of 12 (1654 views)
Open Relevance Infrastructure Request

FYI, I have sent the following message to infrastructure [at] apache. If you
have access to that mailing list, then you can follow the conversation
there. Otherwise, I will report back on it here.

-Grant

Begin forwarded message:

> From: Grant Ingersoll <gsingers [at] apache>
> Date: May 26, 2009 8:27:54 AM EDT
> To: Apache Infrastructure <infrastructure [at] apache>
> Subject: Crawling and Bandwidth
>
> Hi,
>
> Over in Lucene land, we are investigating starting a new project
> that would go out and acquire and redistribute content from the web
> for use in scalability and relevance testing
> (http://wiki.apache.org/lucene-java/OpenRelevance). The content would
> consist of pages that we know are freely redistributable (Creative
> Commons and similar licenses that allow for distribution).
>
> Obviously, this is likely to have a bearing on ASF infrastructure,
> which is why I'm writing. The crawling aspect is likely to be
> discrete events lasting for a few days or a week (depending on
> bandwidth throttling) and is likely to happen a lot as we start up,
> but then will stabilize over time and be less frequent. We can
> likely handle this through our Lucene zone, but are not sure if it
> would be capable performance-wise.
>
> Disk space and download bandwidth, on the other hand, are likely to
> be more of a concern. We anticipate having several collections
> (web, mail, etc.), of varying sizes. Practically speaking, 50-100
> GB is likely the maximum size for a collection, but we probably
> would have other smaller collections ranging from 100s of MBs to a
> few gigs. Even so, people with really big pipes may be interested
> in larger collections. Typically, when others have done this kind
> of thing, they actually send out hard drives containing the data.
> We are not proposing that.
>
> We don't anticipate an overwhelming number of downloads (it's kind
> of a niche area) but we're also not sure how to even go about
> estimating. We're also not sure how this should work w/ the ASF
> mirroring system, if at all.
>
> Another option is to ask the board for funding for us to use
> Amazon. I don't particularly like this approach b/c it is not
> obvious to me how one would cap the cost.
>
> To sum up, this project (we haven't even made it an official project
> yet) is purely exploratory at this point. I'm writing because we
> wanted to get Infrastructure's input before foisting something on
> the ASF that _could_ be a burden.
>
> WDYT? What concerns are we not thinking about in regards to
> infrastructure? Where could we put this data and how can we
> efficiently distribute it without affecting others?
>
> Thanks,
> Grant Ingersoll


markrmiller at gmail

May 26, 2009, 6:50 AM

Post #2 of 12 (1584 views)
Re: Open Relevance Infrastructure Request

Grant Ingersoll wrote:
>
>> Even so, people with really big pipes may be interested in larger
>> collections. Typically, when others have done this kind of thing,
>> they actually send out hard drives containing the data. We are not
>> proposing that.
>>
>>
>> Another option is to ask the board for funding for us to use Amazon.
>> I don't particularly like this approach b/c it is not obvious to me
>> how one would cap the cost.
You can cap the cost by limiting how much data you store, right? You can
use Requester Pays buckets
http://docs.amazonwebservices.com/AmazonS3/latest/index.html?RequesterPaysBuckets.html
to move the cost onto the users who want the data. Per user, it would
still be fairly cheap. You get the added bonus of other S3 services,
like being able to send a device back and forth to import/export on
site. You would just pay for storage and transferring the data in, both
of which can be capped by limiting the amount of data you put in it.

Not a recommendation or anything (it's more convenient to not charge the
downloaders), but I think you could technically cap the costs associated
with putting it on S3.

--
- Mark


gsingers at apache

May 26, 2009, 7:14 AM

Post #3 of 12 (1593 views)
Re: Open Relevance Infrastructure Request

On May 26, 2009, at 9:50 AM, Mark Miller wrote:

> Grant Ingersoll wrote:
>>
>>> Even so, people with really big pipes may be interested in larger
>>> collections. Typically, when others have done this kind of thing,
>>> they actually send out hard drives containing the data. We are
>>> not proposing that.
>>>
>>>
>>> Another option is to ask the board for funding for us to use
>>> Amazon. I don't particularly like this approach b/c it is not
>>> obvious to me how one would cap the cost.
> You can cap the cost by limiting how much data you store, right? You
> can use Requester Pays buckets
> http://docs.amazonwebservices.com/AmazonS3/latest/index.html?RequesterPaysBuckets.html
> to move the cost onto the users who want the data. Per user, it
> would still be fairly cheap. You get the added bonus of other S3
> services, like being able to send a device back and forth to
> import/export on site. You would just pay for storage and transferring
> the data in, both of which can be capped by limiting the amount of data
> you put in it.
>

One of the goals is to make the data available for free, so I don't
think this would work. Currently, one can get the TREC data for a
nominal fee as well.


> Not a recommendation or anything (it's more convenient to not charge
> the downloaders), but I think you could technically cap the costs
> associated with putting it on S3.
>
> --
> - Mark
>
>


simon.willnauer at googlemail

May 26, 2009, 7:32 AM

Post #4 of 12 (1585 views)
Re: Open Relevance Infrastructure Request

I wonder if a P2P network would be an option at all? I doubt that P2P
is feasible for 100s of GB, but we might get peers with bigger pipes
supporting the ASF.
Providing a BitTorrent download could work as soon as it is bootstrapped.

simon



ted.dunning at gmail

May 26, 2009, 8:20 AM

Post #5 of 12 (1588 views)
Re: Open Relevance Infrastructure Request

The cost for storing a few hundred GB of data would be < $100/month.

The cost for transfer would be $17/100GB which could add up fairly quickly
if more than dozens of downloads happen. My guess is that would be
unlikely.
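
The transfer arithmetic above can be sketched in a few lines of Python. This is only an illustration: the $17 per 100 GB rate is the figure quoted in this message, while the function names, the 100 GB collection size, and the $500 budget are assumptions chosen for the example.

```python
# Rough transfer-cost estimate using the $17 per 100 GB rate quoted above.
# Function names and the example numbers are illustrative assumptions.

TRANSFER_USD_PER_100_GB = 17.0


def transfer_cost_usd(collection_gb, downloads):
    """Cost of serving `downloads` full copies of a collection of the given size."""
    return collection_gb * downloads * TRANSFER_USD_PER_100_GB / 100.0


def downloads_within_budget(collection_gb, monthly_budget_usd):
    """Number of whole downloads that fit inside a monthly transfer budget."""
    per_download = transfer_cost_usd(collection_gb, 1)
    return int(monthly_budget_usd // per_download)


if __name__ == "__main__":
    for n in (12, 24, 100):
        print(f"{n} downloads of 100 GB: ${transfer_cost_usd(100, n):,.2f}")
    print("100 GB downloads per $500/month:", downloads_within_budget(100, 500))
```

At this rate, two dozen downloads of a 100 GB collection would cost about $408 in transfer alone, which is where "add up fairly quickly" starts to matter.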

Another option is to request Amazon host the dataset as a public dataset:

http://aws.amazon.com/publicdatasets/





--
Ted Dunning, CTO
DeepDyve


ted.dunning at gmail

May 26, 2009, 8:22 AM

Post #6 of 12 (1590 views)
Re: Open Relevance Infrastructure Request

Plausible, but BitTorrent typically requires a pretty large number of peers
to get over difficulties in penetrating firewalls and finding live seeds.
It works great for the originally intended purpose of piracy or for other
highly popular items such as Linux kernel distribution, but not so well for
long-tail content like an IR data set.

On Tue, May 26, 2009 at 7:32 AM, Simon Willnauer <
simon.willnauer [at] googlemail> wrote:

> I wonder if a P2P network would be an option at all? I doubt that P2P
> is feasible for 100s of GB, but we might get peers with bigger pipes
> supporting the ASF.
> Providing a BitTorrent download could work as soon as it is bootstrapped.
>
>


james at ryley

May 26, 2009, 8:35 AM

Post #7 of 12 (1585 views)
RE: Open Relevance Infrastructure Request

Hey Ted,

What are the parameters of the file bandwidth and access you need? We'd consider donating this, but I'm not sure exactly how it would work. We can't have completely open FTP where abusive people find out about it and start storing movies and such. Would it be read-only, or would we set up accounts for interested parties?

BTW, a mention on the project page would be great, but not necessary -- our site is free so any visibility is always appreciated.

Sincerely,
James Ryley, Ph.D.
www.sumobrain.com / www.freepatentsonline.com





ted.dunning at gmail

May 26, 2009, 9:56 AM

Post #8 of 12 (1586 views)
Re: Open Relevance Infrastructure Request

Grant is the spark-plug here.

I would predict < 10TB of transfer per month backed by 200-300GB of
storage. File sizes would probably be in the 50-100MB range, organized into
larger groups. Access would probably be read-only except for Lucene
committers.
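
For anyone sizing the hosting, these figures translate into an average bandwidth and a rough file count. The following is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 GB = 1000 MB), a 30-day month, and perfectly even traffic. The helper names are made up for this example, and real traffic would be burstier than the average suggests.

```python
# Back-of-the-envelope sizing from the estimates above.
# Assumptions: decimal units, a 30-day month, evenly spread traffic.

SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def average_mbit_per_s(tb_per_month):
    """Sustained rate (Mbit/s) needed to move tb_per_month in one month."""
    bits = tb_per_month * 1e12 * 8
    return bits / SECONDS_PER_MONTH / 1e6


def file_count_range(storage_gb_range, file_mb_range):
    """(fewest, most) files for a storage range and a per-file size range."""
    low_gb, high_gb = storage_gb_range
    low_mb, high_mb = file_mb_range
    fewest = low_gb * 1000 // high_mb   # smallest store, largest files
    most = high_gb * 1000 // low_mb     # largest store, smallest files
    return fewest, most


if __name__ == "__main__":
    print(f"10 TB/month averages {average_mbit_per_s(10):.1f} Mbit/s")
    print("file count range:", file_count_range((200, 300), (50, 100)))
```

Under these assumptions, the 10 TB/month ceiling works out to roughly 31 Mbit/s sustained, and the store would hold somewhere between 2,000 and 6,000 files.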

Others would have to comment on a "bandwidth provided by" link, but it seems
that this already exists (see http://www.apache.org/foundation/thanks.html).

On Tue, May 26, 2009 at 8:35 AM, James <james [at] ryley> wrote:

> Hey Ted,
>
> What are the parameters of the file bandwidth and access you need? We'd
> consider donating this, but I'm not sure exactly how it would work.
>



--
Ted Dunning, CTO
DeepDyve


yonik at lucidimagination

May 26, 2009, 10:11 AM

Post #9 of 12 (1587 views)
Re: Open Relevance Infrastructure Request

On Tue, May 26, 2009 at 11:22 AM, Ted Dunning <ted.dunning [at] gmail> wrote:
> Plausible, but BitTorrent typically requires a pretty large number of peers
> to get over difficulties in penetrating firewalls and finding live seeds.
> It works great for the originally intended purpose of piracy or for other
> highly popular items such as Linux kernel distribution, but not so well for
> long-tail content like an IR data set.

Seems like it could work if a few people/companies offered to seed as
a donation.
BitTorrent clients have nice features built in, like bandwidth shaping,
so donors can limit how much they donate.

-Yonik
http://www.lucidimagination.com

> On Tue, May 26, 2009 at 7:32 AM, Simon Willnauer <
> simon.willnauer [at] googlemail> wrote:
>
>> I wonder if a P2P network would be an option at all? I doubt that P2P
>> is feasible for 100s of GB, but we might get peers with bigger pipes
>> supporting the ASF.
>> Providing a BitTorrent download could work as soon as it is bootstrapped.


gsingers at apache

May 27, 2009, 3:35 AM

Post #10 of 12 (1567 views)
Re: Open Relevance Infrastructure Request

James, Ted,

Ted's estimates sound reasonable to me. One of the responses on
infra@ suggested we might find a few mirrors who can help lend a
hand. Let's keep in touch on this and see how it works out. If we
have a mirroring approach in place, then the need for even project
committers having write access goes away as we can manage writes here
at the ASF and you could just sync.

I have no clue about the "bandwidth provided by" issue, especially b/c this
is a unique situation. We can find out as we proceed.

-Grant

On May 26, 2009, at 12:56 PM, Ted Dunning wrote:

> Grant is the spark-plug here.
>
> I would predict < 10TB of transfer per month backed by 200-300GB of
> storage. File sizes would probably be in the 50-100MB range, organized
> into larger groups. Access would probably be read-only except for
> Lucene committers.
>
> Others would have to comment on a "bandwidth provided by" link, but it
> seems that this already exists (see
> http://www.apache.org/foundation/thanks.html).


ted.dunning at gmail

May 27, 2009, 8:10 AM

Post #11 of 12 (1568 views)
Re: Open Relevance Infrastructure Request

Grant, do mirror sites get a who-we-are link?

On Wed, May 27, 2009 at 3:35 AM, Grant Ingersoll <gsingers [at] apache> wrote:

> I have no clue about the "bandwidth provided by" issue, especially b/c
> this is a unique situation. We can find out as we proceed.




--
Ted Dunning, CTO
DeepDyve


gsingers at apache

May 27, 2009, 8:27 AM

Post #12 of 12 (1571 views)
Re: Open Relevance Infrastructure Request

I don't know. I don't think so.

On May 27, 2009, at 11:10 AM, Ted Dunning wrote:

> Grant, do mirror sites get a who-we-are link?
>
> On Wed, May 27, 2009 at 3:35 AM, Grant Ingersoll
> <gsingers [at] apache> wrote:
>
>> I have no clue about the "bandwidth provided by" issue, especially b/c
>> this is a unique situation. We can find out as we proceed.
>
