Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow)
alexandrdmitriromanov at gmail
May 17, 2012, 4:27 AM
Post #19 of 28
[In reply to]

I'd like to point out that the increasingly technical nature of this conversation probably belongs either on wikitech-l or off-list, and that the strident nature of the comments is fast approaching inappropriate.

Wikimedia-l list administrator
2012/5/17 Anthony <wikimail [at] inbox>
> On Thu, May 17, 2012 at 2:06 AM, John <phoenixoverride [at] gmail> wrote:
> > On Thu, May 17, 2012 at 1:52 AM, Anthony <wikimail [at] inbox> wrote:
> >> On Thu, May 17, 2012 at 1:22 AM, John <phoenixoverride [at] gmail> wrote:
> >> > Anthony, the process is linear: you have a PHP script inserting X
> >> > number of rows per Y time frame.
> >> Amazing. I need to switch all my databases to MySQL. It can insert X
> >> rows per Y time frame, regardless of whether the database is 20
> >> gigabytes or 20 terabytes in size, regardless of whether the average
> >> row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
> >> RAID array or a cluster of servers, etc.
> > When referring to X over Y time, it's an average of, say, 1000 revisions
> > per 1 minute; any X over Y period must be considered with averages in
> > mind, or getting a count wouldn't be possible.
> The *average* en.wikipedia revision is more than twice the size of the
> *average* simple.wikipedia revision. The *average* performance of a
> 20 gig database is faster than the *average* performance of a 20
> terabyte database. The *average* performance of your laptop's thumb
> drive is different from the *average* performance of a(n array of)
> drive(s) which can handle 20 terabytes of data.
> > If you set up your server/hardware correctly, it will compress the text
> > information during insertion into the database.
> Is this how you set up your simple.wikipedia test? How long does it
> take to import the data if you're using the same compression mechanism as
> WMF (which, you didn't answer, but I assume is concatenation and
> compression). How exactly does this work "during insertion" anyway?
> Does it intelligently group sets of revisions together to avoid
> decompressing and recompressing the same revision several times? I
> suppose it's possible, but that would introduce quite a lot of
> complication into the import script, slowing things down dramatically.
> What about the answers to my other questions?
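For illustration only (this is my sketch of the concatenate-then-compress idea being debated, not WMF's or MySQL's actual mechanism): grouping a batch of revisions into one compressed blob means each revision is compressed exactly once, and near-duplicate revisions of the same page compress far better together than one at a time. A minimal version with Python's zlib:

```python
import zlib

def compress_batch(revisions, level=6):
    """Concatenate a batch of revision texts and compress them as one blob.

    Adjacent revisions of a page share most of their text, so
    compressing them together lets later revisions be stored as
    cheap back-references into earlier ones.
    """
    blob = b"\x00".join(r.encode("utf-8") for r in revisions)
    return zlib.compress(blob, level)

def decompress_batch(blob):
    """Recover the individual revision texts from a compressed blob."""
    return [r.decode("utf-8") for r in zlib.decompress(blob).split(b"\x00")]

# Three near-identical revisions, as in a typical edit history.
revs = ["Lorem ipsum dolor sit amet. " * 50,
        "Lorem ipsum dolor sit amet. " * 50 + "One edit.",
        "Lorem ipsum dolor sit amet. " * 50 + "Two edits."]

together = len(compress_batch(revs))
separate = sum(len(zlib.compress(r.encode("utf-8"), 6)) for r in revs)
# Batched compression beats per-revision compression on this data.
assert decompress_batch(compress_batch(revs)) == revs
assert together < separate
```

The trade-off Anthony raises is visible here: appending a new revision to an already-compressed batch means decompressing and recompressing the whole blob, which is exactly the bookkeeping an import script would have to manage.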
> >> If you want to put your money where your mouth is, import
> >> en.wikipedia. It'll only take 5 days, right?
> > If I actually had a server or the disc space to do it I would, just to
> > expose your smartass comments as stupid as they actually are. However,
> > given my current resource limitations (fairly crappy internet connection,
> > older laptops, and lack of HDD space) I tried to select something that
> > could give reliable benchmarks. If you're willing to foot the bill for
> > the new hardware, I'll gladly prove my point.
> What you seem to be saying is that you're *not* putting your money
> where your mouth is.
> Anyway, if you want, I'll make a deal with you. A neutral third party
> rents the hardware at Amazon Web Services (AWS). We import
> simple.wikipedia full history (concatenating and compressing during
> import). We take the ratio of revisions in en.wikipedia to
> revisions in simple.wikipedia. We import en.wikipedia full
> history (concatenating and compressing during import). If the ratio
> of time it takes to import en.wikipedia vs simple.wikipedia is greater
> than or equal to twice the ratio of revisions, then you reimburse the
> third party. If the ratio of import time is less than twice the ratio
> of revisions (you claim it is linear, therefore it'll be the same
> ratio), then I reimburse the third party.
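Restating the wager in code (my paraphrase, with made-up numbers; the thread gives no actual timings): John pays if en.wikipedia's import time grows to at least twice what its revision count would predict under linear scaling, Anthony pays otherwise.

```python
def who_pays(t_en, t_simple, rev_en, rev_simple):
    """Settle the wager: John reimburses if the en:simple import-time
    ratio is >= twice the en:simple revision-count ratio; otherwise
    Anthony does (linear scaling would make the two ratios equal)."""
    time_ratio = t_en / t_simple
    rev_ratio = rev_en / rev_simple
    return "John" if time_ratio >= 2 * rev_ratio else "Anthony"

# Hypothetical: en.wikipedia has 20x the revisions of simple.wikipedia.
# Linear scaling (20x the time) means Anthony reimburses; import time
# ballooning to 40x or more means John does.
assert who_pays(t_en=20.0, t_simple=1.0, rev_en=20, rev_simple=1) == "Anthony"
assert who_pays(t_en=45.0, t_simple=1.0, rev_en=20, rev_simple=1) == "John"
```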
> Either way, we save the new dump, with the processing already done,
> and send it to archive.org (and WMF if they're willing to host it).
> So we actually get a useful result out of this. It's not just for the
> purpose of settling an argument.
> Either of us can concede defeat at any point, and stop the experiment.
> At that point if the neutral third party wishes to pay to continue
> the job, s/he would be responsible for the additional costs.
> Shouldn't be too expensive. If you concede defeat after 5 days, then
> your CPU-time costs are $54 (assuming Extra Large High Memory
> Instance). Including 4 terabytes of EBS (which should be enough if
> you compress on the fly) for 5 days should be less than $100.
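The $54 figure is at least internally consistent; back-calculating the implied hourly rate from the email's own numbers (this is arithmetic on the quoted figures, not a quoted AWS price):

```python
# Back-calculate the hourly rate implied by the $54-for-5-days figure.
hours = 5 * 24           # 5 days of instance time
cpu_cost = 54.0          # USD, from the email
implied_rate = cpu_cost / hours
# $54 / 120 hours works out to $0.45 per instance-hour.
assert abs(implied_rate - 0.45) < 1e-9
```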
> I'm tempted to do it even if you don't take the bet.
> Wikimedia-l mailing list
> Wikimedia-l [at] lists
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l