
Mailing List Archive: Wikipedia: Wikitech

Broken dump enwiki-20080103-pages-meta-current.xml.bz2

 

 



lev.bishop+wikitech at gmail

Jan 27, 2008, 9:06 PM

Post #1 of 6 (1386 views)
Broken dump enwiki-20080103-pages-meta-current.xml.bz2

The most recent enwiki dump seems corrupt (CRC failure when bunzipping).
Another person (Nessus) has also noticed this, so it's not just me:
http://meta.wikimedia.org/wiki/Talk:Data_dumps#Broken_image_.28enwiki-20080103-pages-meta-current.xml.bz2.29

Steps to reproduce:

lsb32@cm:~/enwiki> md5sum enwiki-20080103-pages-meta-current.xml.bz2
9aa19d3a871071f4895431f19d674650 enwiki-20080103-pages-meta-current.xml.bz2
lsb32@cm:~/enwiki> bzip2 -tvv enwiki-20080103-pages-meta-current.xml.bz2 &> bunzip.log
lsb32@cm:~/enwiki> tail bunzip.log
[3490: huff+mtf rt+rld]
[3491: huff+mtf rt+rld]
[3492: huff+mtf rt+rld]
[3493: huff+mtf rt+rld]
[3494: huff+mtf rt+rld]
[3495: huff+mtf data integrity (CRC) error in data

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
lsb32@cm:~/enwiki> bzip2 -V
bzip2, a block-sorting file compressor. Version 1.0.3, 15-Feb-2005.

Copyright (C) 1996-2005 by Julian Seward.

This program is free software; you can redistribute it and/or modify
it under the terms set out in the LICENSE file, which is included
in the bzip2-1.0 source distribution.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
LICENSE file for more details.

bzip2: I won't write compressed data to a terminal.
bzip2: For help, type: `bzip2 --help'.
lsb32@cm:~/enwiki>
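
The `bzip2 -tvv` test above can also be run from a script. The following is a minimal sketch, assuming Python 3 and its standard `bz2` module, neither of which appears in the thread; the function name `check_bz2` and the chunk size are illustrative. It decompresses the file in chunks and discards the output, so a CRC failure or a truncated stream surfaces as an exception rather than a log line.

```python
import bz2


def check_bz2(path, chunk_size=1 << 20):
    """Decompress `path` fully, discarding the output.

    Returns (ok, detail). `ok` is False when the decoder rejects the
    data (this includes CRC mismatches) or the stream ends early.
    """
    total = 0  # decompressed bytes read so far
    try:
        with bz2.open(path, "rb") as stream:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (OSError, EOFError) as exc:
        # OSError: invalid compressed data; EOFError: truncated stream.
        return False, f"failed after {total} decompressed bytes: {exc}"
    return True, f"{total} bytes decompressed cleanly"
```

Calling `check_bz2("enwiki-20080103-pages-meta-current.xml.bz2")` on the damaged file would be expected to return `False` together with the decoder's error text. Unlike `bzip2 -tvv`, this sketch does not report which block failed.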

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
http://lists.wikimedia.org/mailman/listinfo/wikitech-l


glimmer_phoenix at yahoo

Jan 28, 2008, 4:31 PM

Post #2 of 6 (1335 views)
Re: Broken dump enwiki-20080103-pages-meta-current.xml.bz2 [In reply to]

If you read previous threads, you will see that this is currently the number-one broken feature for researchers and others interested in full dumps.

The dump process has been broken for more than a year now, and the admins are overwhelmed with urgent work, so it will probably take a long while yet to repair it.

Other editions are starting to fall into the same black hole too, as they grow in size.

Best,

Felipe.



lev.bishop+wikitech at gmail

Jan 28, 2008, 7:00 PM

Post #3 of 6 (1334 views)
Re: Broken dump enwiki-20080103-pages-meta-current.xml.bz2 [In reply to]

On Jan 28, 2008 7:31 PM, Felipe Ortega wrote:
> If you read previous threads, you will see that this is currently the number-one broken feature for researchers and others interested in full dumps.

Thank you for responding. I have checked some previous threads and I
see that full dumps (with history) for enwiki, dewiki and others have
been a problem for some years now. I also saw that the dump server
crashed in December, was fixed a few weeks later, and then died
completely, so the machine had to be rebuilt.

However, this seems to be a different problem from the previous issues, because:

1) this is the without-history dump that has the problem
(pages-meta-current not pages-meta-history);

2) the dump appeared to have completed properly: the status page
mentions no error, and the md5 checksum was generated and matches
the md5sum of the downloaded file.

To reinforce why I think this is a new problem: in this message
http://lists.wikimedia.org/pipermail/wikitech-l/2007-November/034561.html
David A. Desrosiers says (in regard to a question about a possibly
corrupted enwiki-20071018-pages-meta-current.xml.bz2):

>I have the whole process of fetch, unpack, import scripted to happen
>unattended and aside from initial debugging, it has not failed yet in
>the last year or more.

Anyway, to save people from spending time and bandwidth downloading
6GB (or larger) files that then turn out to be corrupt and useless,
I would like to request that the dump script be changed to run an
integrity check (bzip2 -t) on the file before updating the status to
"done". The test takes only about 7 minutes on my computer for the
enwiki pages-meta-current file. Compared with the 46 hours it took to
generate the dump in the first place, this should not add
significantly to the time taken to generate dumps.
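
The requested change could look roughly like the sketch below. It is written in Python 3 with the standard `bz2` module as an assumption; the thread does not show the real dump script, and `bz2_is_intact`, `finish_dump`, `overhead_percent` and the status file are hypothetical names used only for illustration. The status becomes "done" only when the whole archive decompresses cleanly. Using the figures from the post, 7 minutes against 46 hours (2760 minutes) is about 0.25% extra time, although the dump server's timing may differ from Lev's machine.

```python
import bz2
from pathlib import Path


def bz2_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses without error."""
    try:
        with bz2.open(path, "rb") as stream:
            while stream.read(chunk_size):
                pass  # output is discarded; only the decoder's checks matter
    except (OSError, EOFError):
        # OSError: invalid data (including CRC mismatch) or unreadable file.
        # EOFError: the compressed stream ends early.
        return False
    return True


def finish_dump(dump_path, status_path):
    """Hypothetical final step: mark the dump "done" only if it verifies."""
    ok = bz2_is_intact(dump_path)
    status = "done" if ok else "failed: bzip2 integrity check"
    Path(status_path).write_text(status + "\n")
    return ok


def overhead_percent(check_minutes, dump_hours):
    """Extra time for the check as a percentage of dump generation time."""
    return 100.0 * check_minutes / (dump_hours * 60.0)
```

With the numbers above, `overhead_percent(7, 46)` evaluates to roughly 0.25.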

Lev



glimmer_phoenix at yahoo

Jan 30, 2008, 11:24 AM

Post #4 of 6 (1328 views)
Re: Broken dump enwiki-20080103-pages-meta-current.xml.bz2 [In reply to]

Lev Bishop <lev.bishop+wikitech [at] gmail> wrote:
> 1) this is the without-history dump that has the problem
> (pages-meta-current not pages-meta-history);

Sorry, then I misunderstood the version.

Felipe.




brion at wikimedia

Feb 29, 2008, 10:31 PM

Post #5 of 6 (1262 views)
Re: Broken dump enwiki-20080103-pages-meta-current.xml.bz2 [In reply to]

Lev Bishop wrote:
> The most recent enwiki dump seems corrupt (CRC failure when bunzipping).
> Another person (Nessus) has also noticed this, so it's not just me:
> http://meta.wikimedia.org/wiki/Talk:Data_dumps#Broken_image_.28enwiki-20080103-pages-meta-current.xml.bz2.29

This file's been regenerated, BTW; now passes bzip2 integrity check.

-- brion vibber (brion @ wikimedia.org)



wikimail at inbox

Apr 20, 2008, 12:13 PM

Post #6 of 6 (1211 views)
Re: Broken dump enwiki-20080103-pages-meta-current.xml.bz2 [In reply to]

On Sat, Mar 1, 2008 at 2:31 AM, Brion Vibber <brion [at] wikimedia> wrote:
> Lev Bishop wrote:
> > The most recent enwiki dump seems corrupt (CRC failure when bunzipping).
> > Another person (Nessus) has also noticed this, so it's not just me:
> > http://meta.wikimedia.org/wiki/Talk:Data_dumps#Broken_image_.28enwiki-20080103-pages-meta-current.xml.bz2.29
>
> This file's been regenerated, BTW; now passes bzip2 integrity check.
>
> -- brion vibber (brion @ wikimedia.org)
>
What's the new md5sum? The old one is still in enwiki-20080103-md5sums.txt

Are there any other files that were regenerated?
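
Checking a download against the checksum list can be sketched as follows, assuming Python 3 with the standard `hashlib` module and assuming that enwiki-20080103-md5sums.txt uses the usual `md5sum` layout of a hex digest followed by a file name; the function names are illustrative. Note what this does and does not show: as reported earlier in the thread, the corrupt file matched its published md5, so a match only confirms that the download equals what the server published. Conversely, a regenerated file would not match until the list is updated.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5sums_text, filename):
    """Look up `filename` in md5sum-style text; return None if absent."""
    for line in md5sums_text.splitlines():
        parts = line.split(None, 1)  # digest, then the rest of the line
        # md5sum marks binary mode with a leading "*" on the file name.
        if len(parts) == 2 and parts[1].strip().lstrip("*") == filename:
            return parts[0].lower()
    return None


def matches_published_md5(path, md5sums_text, filename):
    """True only if `filename` is listed and the local digest agrees."""
    expected = expected_md5(md5sums_text, filename)
    return expected is not None and md5_of_file(path) == expected
```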

