
Mailing List Archive: Wikipedia: Wikitech

dataset1, xml dumps

 

 



ariel at wikimedia

Dec 14, 2010, 1:12 AM

Post #1 of 31 (3555 views)
Permalink
dataset1, xml dumps

For folks who have not been following the saga on
http://wikitech.wikimedia.org/view/Dataset1
we were able to get the RAID array back in service last night on the XML
data dumps server, and we are now busily copying data off of it to
another host. There are about 11 TB of dumps to copy over; once that's done
we will start serving these dumps read-only to the public again.
Because the state of the server hardware is still uncertain, we don't
want to do anything that might put the data at risk until that copy has
been made.

The replacement server is on order and we are watching that closely.

We have also been working on deploying a server to run one round of
dumps in the interim.

Thanks for your patience (which is a way of saying, I know you are all
out of patience, as am I, but hang on just a little longer).

Ariel



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


brion at pobox

Dec 14, 2010, 9:02 AM

Post #2 of 31 (3501 views)
Permalink
Re: dataset1, xml dumps [In reply to]

Great news! Thanks for the update and thanks for all you guys' work getting
it beaten back into shape. Keeping fingers crossed for all going well on the
transfer...

-- brion
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


dvanliere at gmail

Dec 14, 2010, 9:11 AM

Post #3 of 31 (3501 views)
Permalink
Re: dataset1, xml dumps [In reply to]

+1
Diederik


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


emijrp at gmail

Dec 14, 2010, 9:27 AM

Post #4 of 31 (3497 views)
Permalink
Re: dataset1, xml dumps [In reply to]

Thanks.

Double good news:
http://lists.wikimedia.org/pipermail/foundation-l/2010-December/063088.html

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ariel at wikimedia

Dec 15, 2010, 11:57 AM

Post #5 of 31 (3491 views)
Permalink
Re: dataset1, xml dumps [In reply to]

We now have a copy of the dumps on a backup host. Although we are still
resolving hardware issues on the XML dumps server, we think it is safe
enough to serve the existing dumps read-only. DNS was updated to that
effect already; people should see the dumps within the hour.
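
A quick way to see whether the switchover has reached your resolver is to
check the record and the web server directly. An illustrative check only,
assuming dig and curl are installed; this is not an official procedure:

  # Current A record for the download host; once cached TTLs expire it
  # should point at the host now serving the dumps read-only.
  dig +short download.wikimedia.org A

  # Confirm the web server is answering:
  curl -sI http://download.wikimedia.org/ | head -n 1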

Ariel



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


mastigm at gmail

Dec 15, 2010, 12:16 PM

Post #6 of 31 (3500 views)
Permalink
Re: dataset1, xml dumps [In reply to]

Good news, but from a professional point of view, keeping them on just one
array will keep leading to outages like this.
Any plans for a tape backup or a mirror?

masti



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ariel at wikimedia

Dec 15, 2010, 12:30 PM

Post #7 of 31 (3502 views)
Permalink
Re: dataset1, xml dumps [In reply to]

Currently the files have been copied off of the server onto a backup
host, which is the only reason we feel safe about serving them again.

We will be getting a new host (it is due to be shipped soon) which will
host the live data. The current server will have a backup copy. That is
the short-term answer to your question. In the longer term we expect to
have a redundant copy elsewhere and to stop relying on dataset1
altogether.

We are interested in other mirrors of the dumps; see

http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps

Ariel




_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


emijrp at gmail

Dec 15, 2010, 12:53 PM

Post #8 of 31 (3495 views)
Permalink
Re: [Xmldatadumps-l] dataset1, xml dumps [In reply to]

Good work.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


wikimail at inbox

Dec 15, 2010, 12:57 PM

Post #9 of 31 (3497 views)
Permalink
Re: dataset1, xml dumps [In reply to]

On Wed, Dec 15, 2010 at 3:30 PM, Ariel T. Glenn <ariel [at] wikimedia> wrote:
> We are interested in other mirrors of the dumps; see
>
> http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps

On the talk page, it says "torrents are useful to save bandwidth,
which is not our problem". If bandwidth is not the problem, then what
*is* the problem?

If the problem is just to get someone to store the data on hard
drives, then it's a much easier problem than actually *hosting* that
data.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ariel at wikimedia

Dec 15, 2010, 1:03 PM

Post #10 of 31 (3499 views)
Permalink
Re: dataset1, xml dumps [In reply to]


We certainly want people to host it as well. It's not a matter of
bandwidth but of protection: if someone can't get to our copy for
whatever reason, another copy is accessible.

Ariel




_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


bryan.tongminh at gmail

Dec 15, 2010, 1:50 PM

Post #11 of 31 (3494 views)
Permalink
Re: dataset1, xml dumps [In reply to]

Is there a copy in Amsterdam? Seems like that would be the most
obvious choice to put a backup as WMF already has a lot of servers
there.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ariel at wikimedia

Dec 15, 2010, 1:56 PM

Post #12 of 31 (3498 views)
Permalink
Re: dataset1, xml dumps [In reply to]


We want people besides us to host it. We expect to put a copy at the
new data center (at least), as well.

Ariel



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


lars at aronsson

Dec 15, 2010, 3:04 PM

Post #13 of 31 (3506 views)
Permalink
Re: dataset1, xml dumps [In reply to]

On 12/15/2010 09:30 PM, Ariel T. Glenn wrote:
> We are interested in other mirrors of the dumps; see
>
> http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps

Just as a small-scale experiment, I tried to mirror the
Faroese (fowiki) and Sami (sewiki) language projects.
But "wget -m" says that timestamps are turned off,
so it keeps downloading the same files again. Is this
an error on my side or on the server side?

This happens for some files, but not for all.
Here is one example:

--2010-12-15 23:59:54--
http://download.wikimedia.org/fowikisource/20100307/fowikisource-20100307-pages-meta-history.xml.bz2
Reusing existing connection to download.wikimedia.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 95974 (94K) [application/octet-stream]
Last-modified header missing -- time-stamps turned off.
--2010-12-15 23:59:54--
http://download.wikimedia.org/fowikisource/20100307/fowikisource-20100307-pages-meta-history.xml.bz2
Reusing existing connection to download.wikimedia.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 95974 (94K) [application/octet-stream]
Saving to:
`download.wikimedia.org/fowikisource/20100307/fowikisource-20100307-pages-meta-history.xml.bz2'

100%[======================================>] 95,974 156K/s in 0.6s
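
One workaround while the server omits Last-modified is to skip any file
whose size already matches what the server reports. A rough sketch, not an
official recipe (assumes GNU wget, coreutils, and a flat local directory):

  url=http://download.wikimedia.org/fowikisource/20100307/fowikisource-20100307-pages-meta-history.xml.bz2
  file=$(basename "$url")

  # Size the server advertises (Content-Length from the response headers).
  remote_size=$(wget --spider --server-response "$url" 2>&1 \
      | awk 'tolower($1)=="content-length:" {print $2}' | tr -d '\r' | tail -n 1)

  # Size of the local copy, if any.
  local_size=$([ -f "$file" ] && stat -c %s "$file" || echo 0)

  # Fetch only when the sizes differ; without Last-modified this is the
  # cheapest freshness test available.
  [ "$remote_size" != "$local_size" ] && wget -O "$file" "$url"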



--
Lars Aronsson (lars [at] aronsson)
Aronsson Datateknik - http://aronsson.se



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


glimmer_phoenix at yahoo

Dec 15, 2010, 3:32 PM

Post #14 of 31 (3508 views)
Permalink
Re: [Xmldatadumps-l] dataset1, xml dumps [In reply to]

Yeah, great work Ariel. Thanks a lot for the effort.

Best,
F.





_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


yegg at alum

Dec 16, 2010, 1:02 PM

Post #15 of 31 (3494 views)
Permalink
Re: dataset1, xml dumps [In reply to]


Hi, thank you for working so hard on this issue, but I'm still having trouble
with the latest en.wikipedia dump. I downloaded
http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-pages-articles.xml.bz2
and am running into trouble decompressing it.

In particular, bzip2 -d enwiki-20101011-pages-articles.xml.bz2 fails.

And bzip2 -tvv enwiki-20101011-pages-articles.xml.bz2 reports:

[2752: huff+mtf data integrity (CRC) error in data

I ran bzip2recover & then bzip2 -t rec* and got the following:

bzip2: rec02752enwiki-20101011-pages-articles.xml.bz2: data integrity (CRC)
error in data
bzip2: rec08881enwiki-20101011-pages-articles.xml.bz2: data integrity (CRC)
error in data
bzip2: rec26198enwiki-20101011-pages-articles.xml.bz2: data integrity (CRC)
error in data
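
Before doing anything else with a dump this size, it is worth verifying the
download against the published checksums and letting bzip2 test the stream
in place. An illustrative check with standard tools (md5sum, bzip2):

  # Fetch the checksum list for this dump run and verify just this file.
  wget -q http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-md5sums.txt
  grep 'enwiki-20101011-pages-articles.xml.bz2$' enwiki-20101011-md5sums.txt | md5sum -c -

  # Test the compressed stream without writing the decompressed output anywhere.
  bzip2 -tv enwiki-20101011-pages-articles.xml.bz2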



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


emijrp at gmail

Dec 16, 2010, 2:41 PM

Post #16 of 31 (3484 views)
Permalink
Re: dataset1, xml dumps [In reply to]

Have you checked the md5sum?

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


yegg at alum

Dec 16, 2010, 2:48 PM

Post #17 of 31 (3487 views)
Permalink
Re: dataset1, xml dumps [In reply to]

md5sum doesn't match. I get e74170eaaedc65e02249e1a54b1087cb (as opposed
to 7a4805475bba1599933b3acd5150bd4d on
http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-md5sums.txt).

I've downloaded it twice now and have gotten the same md5sum. Can anyone
else confirm?

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


emijrp at gmail

Dec 16, 2010, 2:53 PM

Post #18 of 31 (3504 views)
Permalink
Re: dataset1, xml dumps [In reply to]

If the md5s don't match, the files are obviously different; one of them is
corrupt.

What is the size of your local file? I usually download dumps with the wget
UNIX command and I don't get errors. If you are using FAT32, file size is
limited to 4 GB and anything larger gets truncated. Is that your case?
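
Two quick checks that narrow this down (illustrative only; df -T assumes GNU
coreutils on Linux):

  # Bytes actually on disk; the full archive should be roughly 6 GB.
  ls -l enwiki-20101011-pages-articles.xml.bz2

  # Filesystem type of the download directory; vfat (FAT32) caps files at 4 GB.
  df -T .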

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


yegg at alum

Dec 16, 2010, 2:57 PM

Post #19 of 31 (3484 views)
Permalink
Re: dataset1, xml dumps [In reply to]

I've been downloading this file (using wget on Ubuntu or fetch on FreeBSD)
with no issues for years. The current one is 6.2 GB, as it should be.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ariel at wikimedia

Dec 16, 2010, 3:13 PM

Post #20 of 31 (3485 views)
Permalink
Re: dataset1, xml dumps [In reply to]

I was able to unzip a copy of the file on another host (taken from the
same location) without problems. On the download host itself I get the
correct md5sum: 7a4805475bba1599933b3acd5150bd4d

Ariel




_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


yegg at alum

Dec 16, 2010, 3:18 PM

Post #21 of 31 (3487 views)
Permalink
Re: dataset1, xml dumps [In reply to]

Thx--I guess I'll try again--third time's the charm I suppose :)

Sorry to waste your time,

Gabriel


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Platonides at gmail

Dec 16, 2010, 3:21 PM

Post #22 of 31 (3471 views)
Permalink
Re: dataset1, xml dumps [In reply to]

I downloaded the right file without problems.

You can also try downloading it from
http://archivos.wikimedia-es.org/mirror/wikimedia/dumps/enwiki/2010101/enwiki-20101011-pages-articles.xml.bz2


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ariel at wikimedia

Dec 20, 2010, 3:22 AM

Post #23 of 31 (3433 views)
Permalink
Re: dataset1, xml dumps [In reply to]

Google donated storage space for backups for XML dumps. Accordingly, a
copy of the latest complete dump for each project is being copied over
(public files only). We expect to run similar copies once every two
weeks, keeping the four latest copies as well as one permanent copy at
every six-month interval. That can be adjusted as we see how things go.
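
Purely as an illustration of that rotation (hypothetical directory layout
and naming, not the actual WMF tooling):

  # Biweekly copies land in backups/YYYYMMDD/. Keep the four newest and
  # anything marked as a permanent semiannual snapshot, prune the rest.
  cd backups
  for d in $(ls -d 20?????? | sort -r | tail -n +5); do
      [ -e "$d/KEEP" ] || rm -rf "$d"
  done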

Ariel


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


platonides at gmail

Dec 20, 2010, 8:41 AM

Post #24 of 31 (3423 views)
Permalink
Re: [Xmldatadumps-l] dataset1, xml dumps [In reply to]


Are they readable from somewhere?
Apparently, in order to read them you need to sign up for a list and wait
for an invitation, which is available only to US developers.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ariel at wikimedia

Dec 20, 2010, 9:38 AM

Post #25 of 31 (3421 views)
Permalink
Re: [Xmldatadumps-l] dataset1, xml dumps [In reply to]

I sent mail immediately after my initial mail to these lists, to find
out whether we can make them readable to the public and whether there
would be a fee, etc. As soon as I have more information, I will pass it
on. At the least this gives WMF one more copy. Of course it would be
best if it gave everyone one more copy.

Ariel




_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
