
Mailing List Archive: Wikipedia: Foundation

[Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

 

 



kim at bruning

May 16, 2012, 7:28 PM

Post #1 of 28
[Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

On Wed, May 16, 2012 at 11:11:04PM -0400, John wrote:
> I know from experience that a wiki can be re-built from any one of the
> dumps that are provided, (pages-meta-current) for example contains
> everything needed to reboot a site except its user database
> (names/passwords etc.). See
> http://www.mediawiki.org/wiki/Manual:Moving_a_wiki


Sure. Does this include all images, including commons images, eventually
converted to operate locally?

I'm thinking about full snapshot-and-later-restore, say 25 or 50 years
from now, or in an academic setting (or, FSM forbid, in a worst-case scenario
<knock on wood>). That's what the AT (Archive Team) folks are most interested in.

==Fire Drill==
Has anyone recently set up a full-external-duplicate of (for instance) en.wp?
This includes all images, all discussions, all page history (excepting the user
accounts and deleted pages)

This would be a useful and important exercise; possibly to be repeated once per year.

I get a sneaking feeling that the first few iterations won't go so well.

I'm sure AT would be glad to help out with the running of these fire drills, as
it seems to be in line with their mission.

sincerely,
Kim Bruning



kim at bruning

May 16, 2012, 8:10 PM

Post #2 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

On Thu, May 17, 2012 at 12:03:02AM -0400, John wrote:
> Except for files, getting a content clone up is relatively easy, and can be
> done fairly quickly (i.e. less than two weeks for everything). I
> know there is talk about getting an rsync setup for images.

Ouch, 2 weeks. We need the images to be replicable too though. <scratches head>


sincerely,
Kim Bruning




phoenixoverride at gmail

May 16, 2012, 9:03 PM

Post #3 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

Except for files, getting a content clone up is relatively easy, and can be
done fairly quickly (i.e. less than two weeks for everything). I
know there is talk about getting an rsync setup for images.


phoenixoverride at gmail

May 16, 2012, 9:13 PM

Post #4 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

That two-week estimate was a worst-case scenario. In the best case we are
talking about as little as a few hours for the smaller wikis, up to 5 days or
so for a project the size of enwiki. (See
http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-May/000491.html for
progress on image dumps.)

On Wed, May 16, 2012 at 11:10 PM, Kim Bruning <kim [at] bruning> wrote:

> > [snip]
>
> Ouch, 2 weeks. We need the images to be replicable too though. <scratches head>


phoenixoverride at gmail

May 16, 2012, 9:18 PM

Post #5 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

Take a look at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for
exactly how to import an existing dump. I know the process of re-importing
a cluster for the toolserver normally takes just a few days when they have the
needed dumps.
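
Roughly, the whole thing boils down to a handful of standard MediaWiki
maintenance scripts. A minimal sketch (the dump filename is just an example;
see dumps.wikimedia.org for current ones):

  # Fetch a full-history dump (example filename)
  wget http://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-meta-history.xml.bz2

  # Stream it into an existing MediaWiki install
  bzcat simplewiki-latest-pages-meta-history.xml.bz2 | php maintenance/importDump.php

  # Rebuild derived tables afterwards
  php maintenance/rebuildrecentchanges.php
  php maintenance/rebuildall.php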

On Thu, May 17, 2012 at 12:13 AM, John <phoenixoverride [at] gmail> wrote:

> That two-week estimate was a worst-case scenario. In the best case we are
> talking about as little as a few hours for the smaller wikis, up to 5 days or
> so for a project the size of enwiki. (See
> http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-May/000491.html for progress on image dumps.)
>
> [snip]


wikimail at inbox

May 16, 2012, 9:23 PM

Post #6 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

On Thu, May 17, 2012 at 12:13 AM, John <phoenixoverride [at] gmail> wrote:
> That two-week estimate was a worst-case scenario. In the best case we are
> talking about as little as a few hours for the smaller wikis, up to 5 days or
> so for a project the size of enwiki. (See
> http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-May/000491.html for
> progress on image dumps.)

Where are you getting these figures from?

Are you talking about a full history copy?

Also, what about the copyright issues (especially, attribution)?



wikimail at inbox

May 16, 2012, 9:23 PM

Post #7 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

On Thu, May 17, 2012 at 12:18 AM, John <phoenixoverride [at] gmail> wrote:
> Take a look at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for
> exactly how to import an existing dump. I know the process of re-importing
> a cluster for the toolserver normally takes just a few days when they have the
> needed dumps.

Toolserver doesn't have full history, does it?



phoenixoverride at gmail

May 16, 2012, 9:26 PM

Post #8 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

Toolserver is a clone of the WMF servers minus files; they run database
replication of all wikis. These times depend on available hardware
and may vary, but should provide a decent estimate.


On Thu, May 17, 2012 at 12:23 AM, Anthony <wikimail [at] inbox> wrote:

> On Thu, May 17, 2012 at 12:18 AM, John <phoenixoverride [at] gmail> wrote:
> > [snip]
>
> Toolserver doesn't have full history, does it?
>


phoenixoverride at gmail

May 16, 2012, 9:30 PM

Post #9 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

I'll run a quick benchmark, import the full history of simple.wikipedia
to my laptop wiki-on-a-stick, and give an exact duration.


On Thu, May 17, 2012 at 12:26 AM, John <phoenixoverride [at] gmail> wrote:

> Toolserver is a clone of the WMF servers minus files; they run database
> replication of all wikis. These times depend on available hardware
> and may vary, but should provide a decent estimate.
>
> [snip]


wikimail at inbox

May 16, 2012, 9:37 PM

Post #10 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

On Thu, May 17, 2012 at 12:30 AM, John <phoenixoverride [at] gmail> wrote:
> I'll run a quick benchmark, import the full history of simple.wikipedia to
> my laptop wiki-on-a-stick, and give an exact duration.

Simple.wikipedia is nothing like en.wikipedia. For one thing, there's
no need to turn on $wgCompressRevisions with simple.wikipedia.

Is $wgCompressRevisions still used? I haven't followed this in quite a while.
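
(For reference, where the flag is used it is a one-liner in LocalSettings.php;
a sketch, noting that rows written before the flag is set stay uncompressed
until maintenance/storage/compressOld.php is run over them:)

  # Gzip newly-saved revision text; requires PHP zlib support
  cat >> LocalSettings.php <<'EOF'
  $wgCompressRevisions = true;
  EOF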



phoenixoverride at gmail

May 16, 2012, 9:45 PM

Post #11 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

*Simple.wikipedia is nothing like en.wikipedia*: I'd dispute that
statement. All WMF wikis are set up basically the same (an odd extension
here or there differs, and namespace names vary at times), but for
the purpose of recovery simplewiki_p is a very standard example. This issue
isn't just about enwiki_p but about *all* WMF wikis. Doing a data recovery for
enwiki vs. simplewiki is just a matter of time: for enwiki a 5-day estimate
would be fairly standard (depending on server setup), with lower times for
smaller databases. Typically you can express it as a rate of X revisions
processed per Y time unit, regardless of the project, and that rate should
be similar for everything given the same hardware setup.
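
Back-of-envelope, under that linear-rate assumption (both figures below are
purely illustrative, not measurements):

  # total time = revisions / rate
  # hypothetical ~570M enwiki revisions at a sustained 1,500 revisions/sec
  echo "570000000 / 1500 / 86400" | bc -l   # ~4.4 days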

On Thu, May 17, 2012 at 12:37 AM, Anthony <wikimail [at] inbox> wrote:

> On Thu, May 17, 2012 at 12:30 AM, John <phoenixoverride [at] gmail> wrote:
> > I'll run a quick benchmark, import the full history of
> > simple.wikipedia to my laptop wiki-on-a-stick, and give an exact duration.
>
> Simple.wikipedia is nothing like en.wikipedia. For one thing, there's
> no need to turn on $wgCompressRevisions with simple.wikipedia.
>
> Is $wgCompressRevisions still used? I haven't followed this in quite a
> while.
>


wikimail at inbox

May 16, 2012, 9:54 PM

Post #12 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

On Thu, May 17, 2012 at 12:45 AM, John <phoenixoverride [at] gmail> wrote:
> Simple.wikipedia is nothing like en.wikipedia: I'd dispute that
> statement. All WMF wikis are set up basically the same (an odd extension here
> or there differs, and namespace names vary at times), but for the
> purpose of recovery simplewiki_p is a very standard example. This issue isn't
> just about enwiki_p but about *all* WMF wikis. Doing a data recovery for enwiki vs.
> simplewiki is just a matter of time: for enwiki a 5-day estimate would be
> fairly standard (depending on server setup), with lower times for smaller
> databases. Typically you can express it as a rate of X revisions processed
> per Y time unit, regardless of the project, and that rate should be similar
> for everything given the same hardware setup.

Are you compressing old revisions, or not? Does the WMF database
compress old revisions, or not?

In any case, I'm sorry, a 20 gig MySQL database does not scale
linearly to a 20 terabyte MySQL database.



phoenixoverride at gmail

May 16, 2012, 10:22 PM

Post #13 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

Anthony, the process is linear: you have a PHP script inserting X number of
rows per Y time frame. Yes, rebuilding the externallinks, links, and langlinks
tables will take some additional time and won't scale. However, I have been
working with the toolserver since 2007 and I've lost count of the number of
times that the TS has needed to re-import a cluster (s1-s7), and even
enwiki can be done in a semi-reasonable timeframe. The WMF actually
compresses all text blobs, not just old versions. A complete download and
decompression of simple took only 20 minutes on my 2-year-old consumer-grade
laptop with a standard home cable internet connection; the same download
on the toolserver (minus decompression) was 88s. Yeah, importing will take a
little longer, but it shouldn't be that big of a deal. There will also be some
needed cleanup tasks. But the main point stands: archiving and restoring WMF
wikis isn't an issue, and with moderately recent hardware is no big deal. I'm
putting my money where my mouth is and getting actual valid stats and
figures. It may not be an exactly 1:1 ratio when scaling up, but
given the basics of how importing a dump functions it should remain close
to the same ratio.

On Thu, May 17, 2012 at 12:54 AM, Anthony <wikimail [at] inbox> wrote:

> > [snip]
>
> Are you compressing old revisions, or not? Does the WMF database
> compress old revisions, or not?
>
> In any case, I'm sorry, a 20 gig mysql database does not scale
> linearly to a 20 terabyte mysql database.
>


jamesmikedupont at googlemail

May 16, 2012, 10:48 PM

Post #14 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

Well, to be honest, I am still upset about how much data is deleted
from Wikipedia because it is not "notable";
there are so many articles that I might be interested in that are lost
in the same garbage bin as spam and other things.
We should make non-notable (and non-harmful) articles available in
the backups as well.
mike

On Thu, May 17, 2012 at 2:28 AM, Kim Bruning <kim [at] bruning> wrote:
> [snip]



--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org



wikimail at inbox

May 16, 2012, 10:52 PM

Post #15 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

On Thu, May 17, 2012 at 1:22 AM, John <phoenixoverride [at] gmail> wrote:
> Anthony, the process is linear: you have a PHP script inserting X number of
> rows per Y time frame.

Amazing. I need to switch all my databases to MySQL. It can insert X
rows per Y time frame, regardless of whether the database is 20
gigabytes or 20 terabytes in size, regardless of whether the average
row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
RAID array or a cluster of servers, etc.

> Yes rebuilding the externallinks, links, and langlinks tables
> will take some additional time and wont scale.

And this is part of the process too, right?

> However, I have been working
> with the toolserver since 2007 and I've lost count of the number of times
> that the TS has needed to re-import a cluster (s1-s7), and even enwiki can
> be done in a semi-reasonable timeframe.

Re-importing how? From the compressed XML full history dumps?

> The WMF actually compresses all text
> blobs, not just old versions.

Is http://www.mediawiki.org/wiki/Manual:Text_table still accurate? Is
WMF using gzip or object?

> A complete download and decompression of simple
> took only 20 minutes on my 2-year-old consumer-grade laptop with a standard
> home cable internet connection; the same download on the toolserver (minus
> decompression) was 88s. Yeah, importing will take a little longer but
> shouldn't be that big of a deal.

For the full history English Wikipedia it *is* a big deal.

If you think it isn't, stop playing with simple.wikipedia, and tell us
how long it takes to get a mirror up and running of en.wikipedia.

Do you plan to run compressOld.php? Are you going to import
everything in plain text first, and *then* start compressing? Seems
like an awful lot of wasted hard drive space.

> There will also be some needed cleanup tasks.
> But the main point stands: archiving and restoring WMF wikis isn't an issue, and
> with moderately recent hardware is no big deal. I'm putting my money where my
> mouth is and getting actual valid stats and figures. It may not be an
> exactly 1:1 ratio when scaling up, but given the basics of how importing
> a dump functions it should remain close to the same ratio.

If you want to put your money where your mouth is, import
en.wikipedia. It'll only take 5 days, right?



phoenixoverride at gmail

May 16, 2012, 11:06 PM

Post #16 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

On Thu, May 17, 2012 at 1:52 AM, Anthony <wikimail [at] inbox> wrote:

> On Thu, May 17, 2012 at 1:22 AM, John <phoenixoverride [at] gmail> wrote:
> > Anthony the process is linear, you have a php inserting X number of rows
> per
> > Y time frame.
>
> Amazing. I need to switch all my databases to MySQL. It can insert X
> rows per Y time frame, regardless of whether the database is 20
> gigabytes or 20 terabytes in size, regardless of whether the average
> row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
> RAID array or a cluster of servers, etc.
>

When referring to X over Y time, it's an average of, say, 1000 revisions
per minute; any X-over-Y figure must be considered with averages in mind,
or getting a count wouldn't be possible.



> > Yes rebuilding the externallinks, links, and langlinks tables
> > will take some additional time and wont scale.
>
> And this is part of the process too, right?

That does not need to be completed prior to the site going live; it can be
done after making it public.

> > However I have been working
> > with the toolserver since 2007 and Ive lost count of the number of times
> > that the TS has needed to re-import a cluster, (s1-s7) and even enwiki
> can
> > be done in a semi-reasonable timeframe.
>
> Re-importing how? From the compressed XML full history dumps?


> > The WMF actually compresses all text
> > blobs not just old versions.
>
> Is http://www.mediawiki.org/wiki/Manual:Text_table still accurate? Is
> WMF using gzip or object?
>
> > complete download and decompression of simple
> > only took 20 minutes on my 2 year old consumer grade laptop with a
> standard
> > home cable internet connection, same download on the toolserver (minus
> > decompression) was 88s. Yeah Importing will take a little longer but
> > shouldnt be that big of a deal.
>
> For the full history English Wikipedia it *is* a big deal.
>
> If you think it isn't, stop playing with simple.wikipedia, and tell us
> how long it takes to get a mirror up and running of en.wikipedia.
>
> Do you plan to run compressOld.php? Are you going to import
> everything in plain text first, and *then* start compressing? Seems
> like an awful lot of wasted hard drive space.
>

If you set up your server/hardware correctly it will compress the text
information during insertion into the database; compressOld.php is
actually designed only for cases where you start with an uncompressed
configuration.
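
(For the start-uncompressed case, that cleanup pass is roughly the sketch
below; the flag names follow Manual:CompressOld.php and are worth
double-checking against your MediaWiki version:)

  # Concatenate and compress existing revision text in batches
  php maintenance/storage/compressOld.php --type=concat --chunksize=20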


> > There will also be some need cleanup tasks.
> > However the main issue, archiving and restoring wmf wikis isnt an issue,
> and
> > with moderately recent hardware is no big deal. Im putting my money
> where my
> > mouth is, and getting actual valid stats and figures. Yes it may not be
> an
> > exactly 1:1 ratio when scaling up, however given the basics of how
> importing
> > a dump functions it should remain close to the same ratio
>
> If you want to put your money where your mouth is, import
> en.wikipedia. It'll only take 5 days, right?
>

If I actually had a server or the disk space to do it I would, just to
prove your smartass comments as stupid as they actually are. However, given
my current resource limitations (fairly crappy internet connection, older
laptops, and lack of HDD) I tried to select something that could give
reliable benchmarks. If you're willing to foot the bill for the new hardware
I'll gladly prove my point.


jamesmikedupont at googlemail

May 16, 2012, 11:08 PM

Post #17 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

On Thu, May 17, 2012 at 6:06 AM, John <phoenixoverride [at] gmail> wrote:
> If you're willing to foot the bill for the new hardware
> I'll gladly prove my point.

Given the millions of dollars that Wikipedia has, it should not be a
problem to provide such resources for a good cause like that.

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org



wikimail at inbox

May 17, 2012, 4:23 AM

Post #18 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

On Thu, May 17, 2012 at 2:06 AM, John <phoenixoverride [at] gmail> wrote:
> On Thu, May 17, 2012 at 1:52 AM, Anthony <wikimail [at] inbox> wrote:
>> On Thu, May 17, 2012 at 1:22 AM, John <phoenixoverride [at] gmail> wrote:
>> > Anthony the process is linear, you have a php inserting X number of rows
>> > per
>> > Y time frame.
>>
>> Amazing.  I need to switch all my databases to MySQL.  It can insert X
>> rows per Y time frame, regardless of whether the database is 20
>> gigabytes or 20 terabytes in size, regardless of whether the average
>> row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
>> RAID array or a cluster of servers, etc.
>
> When referring to X over Y time, it's an average of, say, 1000 revisions
> per minute; any X-over-Y figure must be considered with averages in mind,
> or getting a count wouldn't be possible.

The *average* en.wikipedia revision is more than twice the size of the
*average* simple.wikipedia revision. The *average* performance of a
20 gig database is faster than the *average* performance of a 20
terabyte database. The *average* performance of your laptop's thumb
drive is different from the *average* performance of a(n array of)
drive(s) which can handle 20 terabytes of data.

> If you set up your server/hardware correctly it will compress the text
> information during insertion into the database

Is this how you set up your simple.wikipedia test? How long does it
take to import the data if you're using the same compression mechanism as
WMF (which, you didn't answer, but I assume is concatenation and
compression)? How exactly does this work "during insertion", anyway?
Does it intelligently group sets of revisions together to avoid
decompressing and recompressing the same revision several times? I
suppose it's possible, but that would introduce quite a lot of
complication into the import script, slowing things down dramatically.

What about the answers to my other questions?

>> If you want to put your money where your mouth is, import
>> en.wikipedia.  It'll only take 5 days, right?
>
> If I actually had a server or the disk space to do it I would, just to prove
> your smartass comments as stupid as they actually are. However, given my
> current resource limitations (fairly crappy internet connection, older
> laptops, and lack of HDD) I tried to select something that could give
> reliable benchmarks. If you're willing to foot the bill for the new hardware
> I'll gladly prove my point.

What you seem to be saying is that you're *not* putting your money
where your mouth is.

Anyway, if you want, I'll make a deal with you. A neutral third party
rents the hardware at Amazon Web Services (AWS). We import
simple.wikipedia full history (concatenating and compressing during
import). We take the ratio of revisions in en.wikipedia to revisions
in simple.wikipedia. We import en.wikipedia full
history (concatenating and compressing during import). If the ratio
of time it takes to import en.wikipedia vs simple.wikipedia is greater
than or equal to twice the ratio of revisions, then you reimburse the
third party. If the ratio of import time is less than twice the ratio
of revisions (you claim it is linear, therefore it'll be the same
ratio), then I reimburse the third party.

Either way, we save the new dump, with the processing already done,
and send it to archive.org (and WMF if they're willing to host it).
So we actually get a useful result out of this. It's not just for the
purpose of settling an argument.

Either of us can concede defeat at any point, and stop the experiment.
At that point if the neutral third party wishes to pay to continue
the job, s/he would be responsible for the additional costs.

Shouldn't be too expensive. If you concede defeat after 5 days, then
your CPU-time costs are $54 (assuming Extra Large High Memory
Instance). Including 4 terabytes of EBS (which should be enough if
you compress on the fly) for 5 days should be less than $100.

I'm tempted to do it even if you don't take the bet.
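
(The arithmetic behind those figures, using 2012-era list prices that are
assumptions from memory: about $0.45/hour for the instance and about $0.10
per GB-month for EBS:)

  # 5 days of an Extra Large High-Memory instance at an assumed $0.45/hr
  echo "5 * 24 * 0.45" | bc             # = 54.00
  # 4 TB of EBS for 5 days at an assumed $0.10 per GB-month
  echo "4000 * 0.10 * 5 / 30" | bc -l   # ~ 66.67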



alexandrdmitriromanov at gmail

May 17, 2012, 4:27 AM

Post #19 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

I'd like to point out that the increasingly technical nature of this
conversation probably belongs either on wikitech-l, or off-list, and that
the strident nature of the comments is fast approaching inappropriate.

Alex
Wikimedia-l list administrator


2012/5/17 Anthony <wikimail [at] inbox>

> [snip]


wikimail at inbox

May 17, 2012, 4:43 AM

Post #20 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

On Thu, May 17, 2012 at 7:27 AM, J Alexandr Ledbury-Romanov
<alexandrdmitriromanov [at] gmail> wrote:
> I'd like to point out that the increasingly technical nature of this
> conversation probably belongs either on wikitech-l, or off-list, and that
> the strident nature of the comments is fast approaching inappropriate.

Really? I think we're really getting somewhere.

In fact, I think someone at WMF should contact Amazon and see if
they'll let us conduct the experiment for free, in exchange for us
creating the dump for them to host as a public data set
(http://aws.amazon.com/publicdatasets/).

In case you got lost in the technical details, the original post was
asking "Has anyone recently set up a full-external-duplicate of (for
instance) en.wp?" and suggesting that we should do this on a yearly
basis as a fire drill.

My latest post was a concrete proposal for doing exactly that.



wikimail at inbox

May 17, 2012, 4:49 AM

Post #21 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

Please have someone at WMF coordinate this so that there aren't
multiple requests made. In my opinion, it should preferably be made
by a WMF employee.

Fill out the form at
https://aws-portal.amazon.com/gp/aws/html-forms-controller/aws-dataset-inquiry

Tell them you want to create a public data set which is a snapshot of
the English Wikipedia. We can coordinate any questions, and any
implementation details, on a separate list.

On Thu, May 17, 2012 at 7:43 AM, Anthony <wikimail [at] inbox> wrote:
> [snip]



thomas.dalton at gmail

May 17, 2012, 5:11 AM

Post #22 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

On 17 May 2012 12:43, Anthony <wikimail [at] inbox> wrote:
> In fact, I think someone at WMF should contact Amazon and see if
> they'll let us conduct the experiment for free, in exchange for us
> creating the dump for them to host as a public data set
> (http://aws.amazon.com/publicdatasets/).

What dump are you going to create? You are starting from a dump; why
can't Amazon just host that?



wikimail at inbox

May 17, 2012, 5:15 AM

Post #23 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

On Thu, May 17, 2012 at 8:11 AM, Thomas Dalton <thomas.dalton [at] gmail> wrote:
> On 17 May 2012 12:43, Anthony <wikimail [at] inbox> wrote:
>> In fact, I think someone at WMF should contact Amazon and see if
>> they'll let us conduct the experiment for free, in exchange for us
>> creating the dump for them to host as a public data set
>> (http://aws.amazon.com/publicdatasets/).
>
> What dump are you going to create? You are starting from a dump, why
> can't Amazon just host that?

Because the XML dump is semi-useless: it's compressed in all the
wrong places to use for an actual running system.

Anyway, looking at how the AWS Public Data Sets work, it probably
would be best not to even create a dump, but just put up the running
(object compressed) database.



kim at bruning

May 17, 2012, 6:14 AM

Post #24 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

On Thu, May 17, 2012 at 07:43:09AM -0400, Anthony wrote:
>
> In fact, I think someone at WMF should contact Amazon and see if
> they'll let us conduct the experiment for free, in exchange for us
> creating the dump for them to host as a public data set
> (http://aws.amazon.com/publicdatasets/).


That sounds like an excellent plan. At the same time, it might be useful to get Archive Team
involved.

* They have warm bodies. (always useful, one can never have enough volunteers ;)
* They have experience with very large datasets
* They'd be very happy to help (it's their mission)
* Some of them may be able to provide Sufficient Storage(tm) and server capacity, saving us
the Amazon AWS bill.
* We might set a precedent where others might provide their data to AT directly too.

AT's mission dovetails nicely with ours. We provide the sum of all human knowledge to people.
AT ensures that the sum of all human knowledge is not subtracted from.


sincerely,
Kim Bruning



neil at tonal

May 17, 2012, 6:59 AM

Post #25 of 28
Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow ) [In reply to]

On 17/05/12 12:49, Anthony wrote:
> Please have someone at WMF coordinate this so that there aren't
> multiple requests made. In my opinion, it should preferably be made
> by a WMF employee.
>
> Fill out the form at
> https://aws-portal.amazon.com/gp/aws/html-forms-controller/aws-dataset-inquiry
>
> Tell them you want to create a public data set which is a snapshot of
> the English Wikipedia. We can coordinate any questions, and any
> implementation details, on a separate list.
>

That's a fantastic idea, and would give en: Wikipedia yet another public
replica for very little effort. I would imagine that if they are willing
to host enwiki, they may also be willing to host most, or all, of the
other projects.

It will also mean that running Wikipedia data-munching experiments on
EC2 will become much easier.

Neil


