
Mailing List Archive: Wikipedia: Wikitech

Dumps Still Stuck (since 7/1)?

 

 



yegg at alum

Jul 11, 2008, 5:20 AM

Post #1 of 14 (749 views)
Permalink
Dumps Still Stuck (since 7/1)?

The database dumps (http://download.wikimedia.org/backup-index.html)
don't seem to have made any progress since 7/1. I realize they can
appear stalled in the normal process
(http://meta.wikimedia.org/wiki/Data_dumps#Schedule), but in the
recent past (as far as I know) they have not been stalled this long
without there being something actually wrong.

Are they indeed still stuck
(http://lists.wikimedia.org/pipermail/wikitech-l/2008-July/038625.html)?
And is there anything I (or other community members) can do about it?

Thank you for your time.

_______________________________________________
Wikitech-l mailing list
Wikitech-l[at]lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


brion at wikimedia

Jul 12, 2008, 7:29 AM

Post #2 of 14 (702 views)
Permalink
Re: Dumps Still Stuck (since 7/1)? [In reply to]

yegg[at]alum.mit.edu wrote:
> The database dumps (http://download.wikimedia.org/backup-index.html)
> don't seem to have made any progress since 7/1. I realize they can
> appear stalled in the normal process
> (http://meta.wikimedia.org/wiki/Data_dumps#Schedule), but in the
> recent past (as far as I know) they have not been stalled this long
> without there being something actually wrong.
>
> Are they indeed still stuck
> (http://lists.wikimedia.org/pipermail/wikitech-l/2008-July/038625.html)?

Yep.

> And is there anything I (or other community members) can do about it?

Nope. We just gotta get in and unplug it when we have a moment. Right
now it's tending to stick because we're still sharing space between
upload backups and download dumps.

Still waiting on the new fileservers -- this server order has been stuck
for a loooong time, and we're not very happy about it...

-- brion



rlullmann at gmail

Jul 23, 2008, 4:54 AM

Post #3 of 14 (672 views)
Permalink
Re: Dumps Still Stuck (since 7/1)? [In reply to]

The dumps are stuck again (7/19).

An additional problem is that when they are restarted, the pending order
changes: the code is not picking the project with the oldest *successful*
dump to do next. en.wikt was almost at the top when the last restart was
done, and got moved to somewhere near the bottom.
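
A minimal sketch of the ordering asked for here, assuming the queue keeps both the last attempt and the last *successful* dump per project. The record fields and dates are hypothetical and are not taken from the real dump scripts.

```python
from datetime import date

# Hypothetical bookkeeping: one record per project.
projects = [
    {"name": "enwiki", "last_success": date(2008, 6, 20), "last_attempt": date(2008, 7, 19)},
    {"name": "enwiktionary", "last_success": date(2008, 6, 10), "last_attempt": date(2008, 7, 19)},
    {"name": "jawiki", "last_success": date(2008, 7, 1), "last_attempt": date(2008, 7, 1)},
]

def next_project(projects):
    """Pick the project whose last successful dump is oldest."""
    return min(projects, key=lambda p: p["last_success"])

def pending_order(projects):
    """Full queue order, stalest successful dump first."""
    ordered = sorted(projects, key=lambda p: p["last_success"])
    return [p["name"] for p in ordered]
```

Sorting on `last_success` rather than `last_attempt` means a failed or interrupted run does not move a project down the queue.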

It is getting more painful for us, as we have dozens of tools that work from
the XML dumps, and they are all now 6+ weeks out of date.

Maybe we could run en.wikt? Pretty please?
Robert


brion at wikimedia

Jul 23, 2008, 5:35 AM

Post #4 of 14 (672 views)
Permalink
Re: Dumps Still Stuck (since 7/1)? [In reply to]

Robert Ullmann wrote:
> The dumps are stuck again (7/19).

Yes, as stated previously we're still waiting on promised fileservers to
be delivered.

Fun, isn't it?

-- brion



rlullmann at gmail

Aug 1, 2008, 4:39 PM

Post #5 of 14 (614 views)
Permalink
Re: Dumps Still Stuck (since 7/1)? [In reply to]

Brion

For the sake of love, and all that is good and holy, what does it take to
get an en.wikt dump?

Every time it gets stuck, and you "reset" it, we get "dumped" at least
half-way down the queue.

And it seems to be stuck again ...

We can fix this, if there is some way I can possibly be allowed to help?

Best Regards,
Robert



brion at wikimedia

Aug 1, 2008, 5:18 PM

Post #6 of 14 (614 views)
Permalink
Re: Dumps Still Stuck (since 7/1)? [In reply to]


Robert Ullmann wrote:
> Brion
>
> For the sake of love, and all that is good and holy, what does it take to
> get an en.wikt dump?

When the big batch of servers we got in two days ago are all unpacked
and have their disks installed, we can start shuffling some data around.

At that point, I'll have free disk space necessary to run dumps.

-- brion



rlullmann at gmail

Aug 2, 2008, 7:14 AM

Post #7 of 14 (604 views)
Permalink
Re: Dumps Still Stuck (since 7/1)? [In reply to]

It is good that we will have new disks and it likely won't get stuck; but
that doesn't address the primary problem of the length of time these things
take. Let me try to be more constructive.

First thing is that the projects are hugely different in size. This causes a
fundamental queuing problem: with n threads, and more than n huge tasks in
the queue, the threads will all end up doing those. (we recently saw a
number of days in which it was working on enwiki, frwiki, dewiki, and jawiki
and nothing else). This can be fixed with a thread that is restricted to
smaller tasks. Like in a market or a bank, with an express lane. (My bank
has one teller only for deposits and withdrawals in 500s and 1000s notes, no
other transactions.)
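
The express-lane idea can be sketched as a small simulation. This is only an illustration under assumptions: the task durations (in hours) are invented, the queue is plain FIFO, and the express worker simply stops when no small task is left, whereas a real one would wait for new work.

```python
import heapq

def simulate(tasks, n_workers, express_limit=None):
    """Return {task name: finish time} for a FIFO queue of (name, hours) tasks.

    If express_limit is set, worker 0 only accepts tasks with
    hours <= express_limit; the other workers take anything.
    """
    pending = list(tasks)
    free_at = [(0.0, w) for w in range(n_workers)]  # (time free, worker id)
    heapq.heapify(free_at)
    finish = {}
    while pending and free_at:
        now, worker = heapq.heappop(free_at)
        restricted = express_limit is not None and worker == 0
        # Index of the first pending task this worker is allowed to take.
        idx = next((i for i, (_, h) in enumerate(pending)
                    if not restricted or h <= express_limit), None)
        if idx is None:
            continue  # express worker has no small task left and drops out
        name, hours = pending.pop(idx)
        finish[name] = now + hours
        heapq.heappush(free_at, (now + hours, worker))
    return finish

# Invented durations: three huge projects queued ahead of a small one.
tasks = [("enwiki", 100), ("frwiki", 80), ("dewiki", 90),
         ("enwiktionary", 2), ("jawiki", 70)]
plain = simulate(tasks, n_workers=3)
express = simulate(tasks, n_workers=3, express_limit=10)
```

Comparing `plain["enwiktionary"]` with `express["enwiktionary"]` shows the small project no longer waiting behind the big ones; the cost is that the large projects share one worker fewer.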

However there are other problems and opportunities. Each project does a
number of minor tasks, and then 4 larger ones:

* main articles, current versions
* all pages, current
* all-history of all pages, bz2 compressed
* all-history, re-compressed in 7z

For the enwiki, main articles takes (7/14 numbers) 10 hours, 30 min, all
pages 16 hours, 10 min. All-history bz2 was estimated at 67 days when it got
stuck. (would have been shorter, as that estimate was right at the start)
For jawiki (7/24): main was 48 min, all pages 65 min, all history bz2 3
days, 18 hours, 7z 2 days, 12 hours.

Some observations then:

* the main articles dump is a subset of all pages. The latter might usefully
be only a dump of all the pages *not* in the first.
* alternatively, the process could dump the first, then copy the file and
continue with the others for the second (yes, one has to be careful with the
bz2 compression state)
* or it could write both at the same time, saving the DB access time if not
the compression time
* the all-history dump might be only the 7z. Yes, it takes longer than the
bz2, but direct to 7z will be much less total time.
* alternatively, write both bz2 and 7z at the same time (if we must have the
bz2, but I don't see why; methinks anyone would want the 7z)
* make the all-history dump(s) separate tasks, in separate queue(s); without
them the rest will go very well

Note that the all-history dumps are cumulative: each contains everything
that was in the previous, plus all the new versions. We might reconsider
whether we want those at all, or make each an incremental. (I'm not sure
what these are for exactly) A dump that is taken over a several month
period is also hardly a snapshot, from a DB integrity POV it is nearly
useless. But no matter.

* the format of the all-history dump could be changed to store only
differences (going backward from current) in each XML record
* or a variant of the 7z compressor used that knows where to search for the
matching strings, rather than a general search; it would then be *much*
faster. (as it is an LZ77-class method, this doesn't change the decompressor
logic)

Either of these last two would make the all-history dumps at least a couple
of orders of magnitude faster.
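
To make the first of these two ideas concrete, here is a rough sketch of backward deltas: keep the current revision in full and describe each older revision relative to the next newer one. `difflib` is only a stand-in for whatever diff a real dump would use, and the sketch makes no claim about speed.

```python
import difflib

def make_delta(newer, older):
    """Describe `older` in terms of `newer` (both are lists of lines)."""
    sm = difflib.SequenceMatcher(a=newer, b=older, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))          # reuse newer[i1:i2]
        else:
            delta.append(("lines", older[j1:j2]))   # literal lines (empty on delete)
    return delta

def apply_delta(newer, delta):
    """Rebuild the older revision from the newer one plus its delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(newer[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out

def encode_history(revisions):
    """revisions: oldest first. Returns (current, deltas from newest to oldest)."""
    current = revisions[-1]
    deltas, newer = [], current
    for older in reversed(revisions[:-1]):
        deltas.append(make_delta(newer, older))
        newer = older
    return current, deltas

def decode_history(current, deltas):
    """Inverse of encode_history; returns revisions oldest first."""
    revs = [current]
    for delta in deltas:
        revs.append(apply_delta(revs[-1], delta))
    return list(reversed(revs))
```

Reading the current revision needs no diffing at all; each step further back in the history costs one more `apply_delta`.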

best regards,
Robert


marco at harddisk

Aug 2, 2008, 7:34 AM

Post #8 of 14 (603 views)
Permalink
Re: Dumps Still Stuck (since 7/1)? [In reply to]

2008/8/2 Robert Ullmann <rlullmann[at]gmail.com>

> A dump that is taken over a several month
> period is also hardly a snapshot, from a DB integrity POV it is nearly
> useless. But no matter.

You could also take one or two DB slaves out of replication for the whole
dump period to keep the database consistent and then, after the dump is
finished, let it replicate again. Dunno though if that is possible with
MySQL.

Marco


Platonides at gmail

Aug 2, 2008, 12:10 PM

Post #9 of 14 (598 views)
Permalink
Re: Dumps Still Stuck (since 7/1)? [In reply to]

Robert Ullmann wrote:
> It is good that we will have new disks and it likely won't get stuck; but
> that doesn't address the primary problem of the length of time these things
> take. Let me try to be more constructive.

It just parallelizes it ;)


> First thing is that the projects are hugely different in size. This causes a
> fundamental queuing problem: with n threads, and more than n huge tasks in
> the queue, the threads will all end up doing those. (we recently saw a
> number of days in which it was working on enwiki, frwiki, dewiki, and jawiki
> and nothing else). This can be fixed with a thread that is restricted to
> smaller tasks. Like in a market or a bank, with an express lane. (My bank
> has one teller only for deposits and withdrawals in 500s and 1000s notes, no
> other transactions.)

Seems reasonable.


> Some observations then:
>
> * the main articles dump is a subset of all pages. The latter might usefully
> be only a dump of all the pages *not* in the first.
> * alternatively, the process could dump the first, then copy the file and
> continue with the others for the second (yes, one has to be careful with the
> bz2 compression state)
If you mean what I think you mean, it won't work.

> * or it could write both at the same time, saving the DB access time if not
> the compression time
There's one snapshot of the DB for the article content. All the
metadata is extracted at one point (the stub-* files). The stubs are
then filled with content from the last full dump, with new revisions
fetched from the db.

> * the all-history dump might be only the 7z. Yes, it takes longer than the
> bz2, but direct to 7z will be much less total time.
> * alternatively, write both bz2 and 7z at the same time (if we must have the
> bz2, but I don't see why; methinks anyone would want the 7z)
AFAIK the 7z is produced by reading the bz2. It's much easier to
recompress into a different format than to recreate the XML. Plus it's
much less load on the db servers.
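
A sketch of that recompression step, assuming a streaming copy is all that is needed. Python's `lzma` module writes the .xz container, which uses the same LZMA family as 7z but is not a .7z archive, so this stands in for the real 7z step and is not the command the dump scripts run.

```python
import bz2
import lzma
import shutil

def recompress(bz2_source, xz_target, chunk_size=1 << 20):
    """Stream a bzip2 file into an LZMA (.xz) file without touching the database.

    Both arguments may be paths or already-open binary file objects.
    """
    with bz2.open(bz2_source, "rb") as src, lzma.open(xz_target, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

Only one chunk of XML is held uncompressed in memory at a time, and the only input is the existing bz2 file.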

> * make the all-history dump(s) separate tasks, in separate queue(s); without
> them the rest will go very well
That could work. But note that the difference with metadata such as
templatelinks will be even greater.

> Note that the all-history dumps are cumulative: each contains everything
> that was in the previous, plus all the new versions. We might reconsider
> whether we want those at all, or make each an incremental. (I'm not sure
> what these are for exactly)
So you would need all dumps since January (the first full, then
incremental) to get the status at August?
It may be better or worse depending on what you'll do with the data.


> A dump that is taken over a several month
> period is also hardly a snapshot, from a DB integrity POV it is nearly
> useless. But no matter.
See above. The history dump reflects the status at the beginning. You're
just spending a month fetching the contents of that history.
There is a difference with the additional metadata, such as template and
image usage. Not easy to fix if you wanted to, because even if you
dumped them in the same transaction as the revision table, it will
contain outdated information to be updated by the job queue.

> * the format of the all-history dump could be changed to store only
> differences (going backward from current) in each XML record
Has been proposed before for the db store. It was determined that there
was little difference with just compressing.
Moreover, it would make the process slower, as you would also need to
diff the revisions. The worst case would be a history merge, where
there're new intermediate revisions, so you need to recover the full
contents of each revision (from db/undiffing the last dump) and diff it
again.

> * or a variant of the 7z compressor used that knows where to search for the
> matching strings, rather than a general search; it would then be *much*
> faster. (as it is an LZ77-class method, this doesn't change the decompressor
> logic)

Could work. Are you volunteering to write it?

> Either of these last two would make the all-history dumps at least a couple
> of orders of magnitude faster.
>
> best regards,
> Robert




magnusmanske at googlemail

Aug 2, 2008, 1:01 PM

Post #10 of 14 (595 views)
Permalink
Re: Dumps Still Stuck (since 7/1)? [In reply to]

Knowing little about the current dump generation process, but some
about terabyte-scale data handling (actually, we here are well into
the petabyte range by now;-), how about this:
* Set up the usual MySQL replication slave
* At one point in time, disconnect it from the MySQL master, but leave
it running in read-only mode
* Use that as the dump base

This should result in a single-point-in-time snapshot. Also, it will
reduce load to the rest of the system. Not sure if IDs will change
internally, though.


Independent of that,
* Run several parallel processes on several servers (assuming we have several)
* Each process generates the complete history dump of a single
article, or a small group of them, bzip2ed to save intermittent disk
space
* Success/failure is checked, so each process can be rerun if needed
* At the end, all these files are appended into a single bzip2/7zip file

This will need more diskspace while the entire thing is running, as
small text files compress less well than larger ones. Also, it eats
more CPU cycles, for starting all these processes, and then for
re-bzip2ing the intermediate files.

But, it is a lot less error-prone (if a process or a bunch of them
fail, just restart them), and it scales better (just throw more
machines at it to make it faster; or use apaches during low-traffic
hours). Individual processes should be less memory-intensive, so
several of them can run on the same machine.
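
The chunked scheme above can be sketched as follows. Everything here is illustrative: `dump_chunk` is a hypothetical worker that skips XML escaping, and a thread pool stands in for processes spread over several servers. The one fact the sketch relies on is that concatenated bzip2 streams form a valid multi-stream file.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

def dump_chunk(pages):
    """Hypothetical worker: render a group of (title, text) pages and bzip2 them."""
    xml = "".join(f"<page><title>{title}</title><text>{text}</text></page>\n"
                  for title, text in pages)  # no XML escaping in this sketch
    return bz2.compress(xml.encode("utf-8"))

def run_with_retry(pages, attempts=3):
    """Rerun a failed chunk instead of restarting the whole dump."""
    last_error = None
    for _ in range(attempts):
        try:
            return dump_chunk(pages)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("chunk failed after retries") from last_error

def dump_all(chunks, workers=4):
    """Compress chunks in parallel and append the results in chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(run_with_retry, chunks))  # map keeps input order
    return b"".join(parts)
```

`bz2.decompress(dump_all(chunks))` returns the pages in their original order, since the standard decompressor reads multi-stream input. Larger groups per chunk reduce the compression overhead of many small streams.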

My 2c

Magnus



Platonides at gmail

Aug 2, 2008, 1:49 PM

Post #11 of 14 (596 views)
Permalink
Re: Dumps Still Stuck (since 7/1)? [In reply to]

Magnus Manske wrote:
> Knowing little about the current dump generation process, but some
> about terabyte-scale data handling (actually, we here are well into
> the petabyte range by now;-), how about this:
> * Set up the usual MySQL replication slave
> * At one point in time, disconnect it from the MySQL master, but leave
> it running in read-only mode
> * Use that as the dump base
>
> This should result in a single-point-in-time snapshot.

Why? As I already said, the revision status is a snapshot; it's done in
a transaction.

> Also, it will reduce load to the rest of the system. Not sure if
> IDs will change internally, though.
IDs won't change, but you don't need the disconnected slave. Once you
have the revisions, you will be querying external storage. That's where
the load goes.


> Independent of that,
> * Run several parallel processes on several servers (assuming we have several)
> * Each process generates the complete history dump of a single
> article, or a small group of them, bzip2ed to save intermittent disk
> space
> * Success/failure is checked, so each process can be rerun if needed
> * At the end, all these files are appended into a single bzip2/7zip file

The system we use is not exactly that. It's writing compressed data
from the compressed reading of last dump and the revision snapshot. It's
never using uncompressed data.
The little processes would need to know where in the last file is the
section they're doing.
However, if you knew in which part it was in the old dump... it's
worthwhile considering.

> This will need more diskspace while the entire thing is running, as
> small text files compress less well than larger ones. Also, it eats
> more CPU cycles, for starting all these processes, and then for
> re-bzip2ing the intermediate files.

Not necessarily. If the number of files per bzip2 group is large
enough, there is almost no difference.


> But, it is a lot less error-prone (if a process or a bunch of them
> fail, just restart them), and it scales better (just throw more
> machines at it to make it faster; or use apaches during low-traffic
> hours). Individual processes should be less memory-intensive, so
> several of them can run on the same machine.
>
> My 2c
>
> Magnus

We are talking very happily here, but what is actually slowing the dump
process? Brion, Tim, is there any profiling information about that? Is
it I/O wait for the revisions fetched from external storage? Is it disk
speed when reading/writing? Is it CPU for decompressing the previous
dump? Is it CPU for compressing? How is dbzip2 helping with it?*



*I thought you were using dbzip2, but I now see mw:Dbzip2 says "dbzip2
is not ready for public use yet". Has it been indefinitely postponed?




magnusmanske at googlemail

Aug 2, 2008, 2:49 PM

Post #12 of 14 (597 views)
Permalink
Re: Dumps Still Stuck (since 7/1)? [In reply to]

On Sat, Aug 2, 2008 at 9:49 PM, Platonides <Platonides[at]gmail.com> wrote:
> Magnus Manske wrote:
>> Independent of that,
>> * Run several parallel processes on several servers (assuming we have several)
>> * Each process generates the complete history dump of a single
>> article, or a small group of them, bzip2ed to save intermittent disk
>> space
>> * Success/failure is checked, so each process can be rerun if needed
>> * At the end, all these files are appended into a single bzip2/7zip file
>
> The system we use is not exactly that. It's writing compressed data
> from the compressed reading of last dump and the revision snapshot. It's
> never using uncompressed data.
> The little processes would need to know where in the last file is the
> section they're doing.
> However, if you knew in which part it was in the old dump... it's
> worthwhile considering.

Why is it using the old dump instead of the "real" storage? For
performance reasons?

Does that mean that if there's an error in an old dump, it will stay
there forever?

How does this cope with deleted revisions?


>> This will need more diskspace while the entire thing is running, as
>> small text files compress less well than larger ones. Also, it eats
>> more CPU cycles, for starting all these processes, and then for
>> re-bzip2ing the intermediate files.
>
> Not necessarily. If the number of files per bzip2 group is large
> enough, there is almost no difference.

Yes, we'd have to find a balance between many fast processes with lots
of overhead and few slow ones that, when failing, will set back the
dump for weeks.

At work, I'm using a computing farm with several thousand cores, and
the suggested time per process is < 2h. May be worth contemplating,
even though the technical situation for Wikimedia is very much
different.

Magnus



Platonides at gmail

Aug 2, 2008, 4:59 PM

Post #13 of 14 (596 views)
Permalink
Re: Dumps Still Stuck (since 7/1)? [In reply to]

Magnus Manske wrote:
> Why is it using the old dump instead of the "real" storage? For
> performance reasons?

Yes. It's nicer to fill the stub by reading the last dump. Most
revisions are already there and in the order you will need them. If it's
not, it's retrieved from external storage.
From dumpTextPass.php usage: "Use a prior dump file as a text source,
to save pressure on the database."


> Does that mean that if there's an error in an old dump, it will stay
> there forever?

Only until the dump generation fails and a new one is created from
scratch ;)
Any reason for old dumps to be more corruptible than the db blobs?


> How does this cope with deleted revisions?
The revision contents are read from the old dump, but the revisions and
pages are read from the stub, created from db.
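
The behaviour described in this post can be sketched like this. It is a simplification under assumptions: the names are hypothetical, and a dict lookup replaces the sequential read of the prior dump that dumpTextPass.php performs.

```python
def fill_stub(stub, prior_dump_text, fetch_from_storage):
    """Attach text to each revision listed in the stub.

    stub: list of (page_id, rev_id) taken from the database, in dump order.
    prior_dump_text: dict rev_id -> text recovered from the previous dump.
    fetch_from_storage: callable rev_id -> text, used only on a miss.
    """
    filled = []
    for page_id, rev_id in stub:
        text = prior_dump_text.get(rev_id)
        if text is None:
            # Not in the old dump (a new revision): ask external storage.
            text = fetch_from_storage(rev_id)
        filled.append((page_id, rev_id, text))
    return filled
```

Because the stub decides which revisions exist, a revision deleted since the last dump is never written out, even though its text is still present in the old dump file.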




jra at baylink

Aug 6, 2008, 2:04 PM

Post #14 of 14 (544 views)
Permalink
Re: Dumps Still Stuck (since 7/1)? [In reply to]

On Sat, Aug 02, 2008 at 09:01:26PM +0100, Magnus Manske wrote:
> Knowing little about the current dump generation process, but some
> about terabyte-scale data handling (actually, we here are well into
> the petabyte range by now;-), how about this:
> * Set up the usual MySQL replication slave
> * At one point in time, disconnect it from the MySQL master, but leave
> it running in read-only mode
> * Use that as the dump base
>
> This should result in a single-point-in-time snapshot. Also, it will
> reduce load to the rest of the system. Not sure if IDs will change
> internally, though.

That's roughly equivalent to what Phil Greenspun says that the "SQL
studs" at Mass General Hospital do with their backups, though in their
case it's breaking a RAID mirror rather than a replication.

Cheers,
-- jra
--
Jay R. Ashworth Baylink jra[at]baylink.com
Designer The Things I Think RFC 2100
Ashworth & Associates http://baylink.pitas.com '87 e24
St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274

Those who cast the vote decide nothing.
Those who count the vote decide everything.
-- (Josef Stalin)

