
gsingers at apache
May 26, 2009, 5:32 AM
Post #1 of 12
Open Relevance Infrastructure Request
FYI, I have sent the following message to infrastructure [at] a. If you have access to that mailing list, then you can follow the conversation there. Otherwise, I will report back on it here.

-Grant

Begin forwarded message:

> From: Grant Ingersoll <gsingers [at] apache>
> Date: May 26, 2009 8:27:54 AM EDT
> To: Apache Infrastructure <infrastructure [at] apache>
> Subject: Crawling and Bandwidth
>
> Hi,
>
> Over in Lucene land, we are investigating starting a new project
> that would go out and acquire and re-distribute content from the web
> for use in scalability and relevance testing
> (http://wiki.apache.org/lucene-java/OpenRelevance). The content would
> consist of pages that we know are freely re-distributable (Creative
> Commons, etc. that allow for distribution).
>
> Obviously, this is likely to have a bearing on ASF infrastructure,
> which is why I'm writing. The crawling aspect is likely to be
> discrete events lasting for a few days or a week (depending on
> bandwidth throttling) and is likely to happen a lot as we start up,
> but then will stabilize over time and be less frequent. We can
> likely handle this through our Lucene zone, but are not sure if it
> would be capable performance-wise.
>
> Disk space and download bandwidth, on the other hand, are likely to
> be more of a concern. We anticipate having several collections
> (web, mail, etc.), of varying sizes. Practically speaking, 50-100
> GB is likely the maximum size for a collection, but we probably
> would have other smaller collections ranging from 100s of MBs to a
> few gigs. Even so, people with really big pipes may be interested
> in larger collections. Typically, when others have done this kind
> of thing, they actually send out hard drives containing the data.
> We are not proposing that.
>
> We don't anticipate an overwhelming number of downloads (it's kind
> of a niche area) but we're also not sure how to even go about
> estimating. We're also not sure how this should work w/ the ASF
> mirroring system, if at all.
>
> Another option is to ask the board for funding for us to use
> Amazon. I don't particularly like this approach b/c it is not
> obvious to me how one would cap the cost.
>
> To sum up, this project (we haven't even made it an official project
> yet) is purely exploratory at this point. I'm writing because we
> wanted to get Infrastructure's input before foisting something on
> the ASF that _could_ be a burden.
>
> WDYT? What concerns are we not thinking about with regard to
> infrastructure? Where could we put this data and how can we
> efficiently distribute it without affecting others?
>
> Thanks,
> Grant Ingersoll
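
As a rough aid to the estimating question raised in the message, the sketch below works out how long a collection would take to transfer at a throttled rate, and how much outbound traffic a given number of full downloads would generate. Only the 50-100 GB collection size comes from the message; the link rates, the download count, and the function names are illustrative assumptions.

```python
# Back-of-the-envelope sizing for a downloadable collection.
# Assumptions (not from the message): decimal units (1 GB = 10**9 bytes),
# a sustained link rate in megabits per second, and a guessed download count.


def transfer_days(size_gb: float, rate_mbps: float) -> float:
    """Days needed to move size_gb gigabytes at a sustained rate_mbps."""
    bits = size_gb * 1e9 * 8
    seconds = bits / (rate_mbps * 1e6)
    return seconds / 86400


def monthly_egress_gb(size_gb: float, downloads_per_month: int) -> float:
    """Outbound traffic per month if every download fetches the full collection."""
    return size_gb * downloads_per_month


if __name__ == "__main__":
    # Collection sizes from the message; link rates are hypothetical.
    for size in (50, 100):
        for rate in (10, 100):
            days = transfer_days(size, rate)
            print(f"{size} GB at {rate} Mbit/s: {days:.2f} days")
    # Hypothetical demand: 20 full downloads of the largest collection.
    print(f"20 downloads of 100 GB: {monthly_egress_gb(100, 20):.0f} GB/month")
```

Plugging in real throttle limits and observed download counts, once they are known, would turn this into a usable first estimate for the mirroring and hosting questions.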