Mailing List Archive: Wikipedia: Wikitech

dumps failing for en.wp

 

 


jdunck at gmail

Oct 14, 2005, 9:41 PM

Post #1 of 10 (757 views)
dumps failing for en.wp

Several en dumps have failed due to the backup.lock file already existing.

Just FYI...
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] wikimedia
http://mail.wikipedia.org/mailman/listinfo/wikitech-l


brion at pobox

Oct 14, 2005, 10:23 PM

Post #2 of 10 (742 views)
Re: dumps failing for en.wp [In reply to]

Jeremy Dunck wrote:
> Several en dumps have failed due to the backup.lock file already existing.

That's because I run it on a separate pass from the others due to its
size; the main pass sees the lock when it gets to enwiki and skips over it.

Now, *those* dumps failing, if they do, are for entirely different
reasons. :P
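
For readers unfamiliar with the pattern, here is a minimal sketch of lock-file skipping as described above. It is not the actual backup script: the function names, the directory layout, and the `dump_one` callback are assumptions made for illustration; only the `backup.lock` file name comes from this thread.

```python
import os

def try_acquire_lock(lock_path):
    """Atomically create lock_path; return True if we now hold the lock."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))
    return True

def run_pass(wikis, dump_root, dump_one):
    """Dump each wiki in turn, skipping any whose backup.lock already exists."""
    dumped, skipped = [], []
    for wiki in wikis:
        wiki_dir = os.path.join(dump_root, wiki)  # hypothetical layout
        os.makedirs(wiki_dir, exist_ok=True)
        lock_path = os.path.join(wiki_dir, "backup.lock")
        if not try_acquire_lock(lock_path):
            # Another pass (e.g. a separate enwiki run) holds the lock.
            skipped.append(wiki)
            continue
        try:
            dump_one(wiki)
            dumped.append(wiki)
        finally:
            os.remove(lock_path)
    return dumped, skipped
```

Under this scheme a "lock already exists" message for enwiki in the main pass is a skip rather than a failure, which matches the explanation above.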

-- brion vibber (brion @ pobox.com)
Attachments: signature.asc (0.25 KB)


jo at magnus

Oct 17, 2005, 7:22 AM

Post #3 of 10 (747 views)
Re: dumps failing for en.wp [In reply to]

Brion,

I think that not only the en dumps failed. I found some errors in the most
recent de dump (20051017_pages_articles.xml): inspecting the XML file
manually, I saw several <title>s which do not belong to the content
that follows them. For example:

<title>Vitruvius</title> contains an article about the planet Venus
<title>Indianische Flöte</title> contains the history of Poland
<title>Marlon Brando</title> contains Madonna (sic!)
and so on
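
A rough way to look for this kind of mismatch without reading the file by hand is sketched below. It is only a heuristic and rests on an assumption that is not from this thread: that a correct article usually mentions its own title somewhere in its text. The function names are made up for the example, and pages whose text starts with "#" are crudely treated as redirects and skipped.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}title'."""
    return tag.rsplit("}", 1)[-1]

def suspicious_pages(xml_source):
    """Yield titles of pages whose text never mentions the title."""
    title, text = None, ""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            is_redirect = text.lstrip().startswith("#")  # crude assumption
            if title and not is_redirect and title.lower() not in text.lower():
                yield title
            title, text = None, ""
            elem.clear()  # drop the page's children to limit memory use
```

Passing an open dump file (or its path) to `suspicious_pages` yields candidate titles to inspect manually; it will produce false positives and cannot prove a dump is clean.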

Besides, there are many articles of exceptional length in the dump
which do not belong to namespace #0.

Cheers

jo


>> Several en dumps have failed due to the backup.lock file already
>> existing.



brion at pobox

Oct 17, 2005, 8:29 AM

Post #4 of 10 (751 views)
Re: Re: dumps failing for en.wp [In reply to]

Jochen Magnus wrote:
> Brion,
>
> I think that not only the en dumps failed. I found some errors in the most
> recent de dump (20051017_pages_articles.xml): inspecting the XML file
> manually, I saw several <title>s which do not belong to the content
> that follows them. For example:
>
> <title>Vitruvius</title> contains an article about the planet Venus
> <title>Indianische Flöte</title> contains the history of Poland
> <title>Marlon Brando</title> contains Madonna (sic!)
> and so on

Hmm, that shouldn't happen. I'll have to debug it. Sigh.

> Besides, there are many articles of exceptional length in the dump
> which do not belong to namespace #0.

?

-- brion vibber (brion @ pobox.com)


brion at pobox

Oct 17, 2005, 11:21 AM

Post #5 of 10 (742 views)
Re: Re: dumps failing for en.wp [In reply to]

> Jochen Magnus wrote:
>> I think that not only the en dumps failed. I found some errors in the most
>> recent de dump (20051017_pages_articles.xml): inspecting the XML file
>> manually, I saw several <title>s which do not belong to the content
>> that follows them. For example:
>>
>> <title>Vitruvius</title> contains an article about the planet Venus
>> <title>Indianische Flöte</title> contains the history of Poland
>> <title>Marlon Brando</title> contains Madonna (sic!)
>> and so on
>
>
> Hmm, that shouldn't happen. I'll have to debug it. Sigh.

Ok, confirmed on dewiki dump; here's a fragment example:
http://meta.wikimedia.org/wiki/User:Brion_VIBBER/Crap

Obviously there's a synchronization bug in my prefetch code. I'll try
and debug this today or tonight and restart the dumps tonight/tomorrow.

Don't use any of the 20051017 dumps; they're all suspect.
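
The prefetch code itself is not shown in this thread, so the following is purely illustrative. It assumes that "prefetch" means reusing revision text from an earlier dump, and it sketches one way to keep such a stream from drifting out of step: match on (page ID, revision ID) keys instead of position, and fall back to the database on any miss. All names and the tuple layout are hypothetical.

```python
def attach_text(current_revisions, previous_dump, fetch_from_db):
    """Pair each current revision with its text.

    current_revisions: iterable of (page_id, rev_id), ascending order.
    previous_dump: iterable of (page_id, rev_id, text), same ordering.
    fetch_from_db: callable taking rev_id, used when prefetch has no match.
    """
    prev_iter = iter(previous_dump)
    prev = next(prev_iter, None)
    for page_id, rev_id in current_revisions:
        # Advance the prefetch stream until it catches up with the current key.
        while prev is not None and (prev[0], prev[1]) < (page_id, rev_id):
            prev = next(prev_iter, None)
        if prev is not None and (prev[0], prev[1]) == (page_id, rev_id):
            yield page_id, rev_id, prev[2]  # exact key match: safe to reuse
        else:
            yield page_id, rev_id, fetch_from_db(rev_id)  # never guess
```

The point of the design is that text is only reused when the keys agree exactly, so a stale or missing entry costs a database fetch rather than attaching one article's text to another's title.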

-- brion vibber (brion @ pobox.com)


messias at gmail

Oct 18, 2005, 6:58 AM

Post #6 of 10 (746 views)
Re: Re: dumps failing for en.wp [In reply to]

2005/10/17, Brion Vibber <brion [at] pobox>:
> Obviously there's a synchronization bug in my prefetch code. I'll try
> and debug this today or tonight and restart the dumps tonight/tomorrow.
>
> Don't use any of the 20051017 dumps; they're all suspect.

Then please delete them.


brion at pobox

Oct 18, 2005, 10:53 AM

Post #7 of 10 (744 views)
Re: Re: dumps failing for en.wp [In reply to]

Folke Behrens wrote:
> 2005/10/17, Brion Vibber <brion [at] pobox>:
>>Don't use any of the 20051017 dumps; they're all suspect.
>
> Then please delete them.

I want to keep them around for bug comparison at the moment, but I have
gone ahead and renamed them with a ".broken" extension which I hope will
discourage people from downloading them. ;)

-- brion vibber (brion @ pobox.com)


jo at magnus

Oct 21, 2005, 4:49 PM

Post #8 of 10 (744 views)
Re: dumps failing for en.wp [In reply to]

Brion,

the latest XML dumps seem to be fine. I tested the de dump from
2005-10-20: No problems while importing it into MySQL and indexing it
with ioda.

Thank you!

jo



brion at pobox

Oct 21, 2005, 9:32 PM

Post #9 of 10 (741 views)
Re: Re: dumps failing for en.wp [In reply to]

Jochen Magnus wrote:
> Brion,
>
> the latest XML dumps seem to be fine. I tested the de dump from
> 2005-10-20: No problems while importing it into MySQL and indexing it
> with ioda.

Yay, something worked! ;)

I forgot to include the detail in my prior message, that some of the
older dumps may contain duplicate page titles due to the existence of
old non-normalized Unicode titles in the databases; that can lead to
'existing key' errors on the unique title indexes when importing with
mwdumper.

Those have been purged, and so it shouldn't happen anymore.
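
To illustrate how non-normalized titles can collide, here is a small sketch that groups titles by their NFC-normalized form. It assumes titles are compared after normalization to NFC; the function name is made up for the example and this is not how mwdumper or MediaWiki implement the check.

```python
import unicodedata
from collections import defaultdict

def duplicate_titles(titles):
    """Group titles that collide once normalized to NFC."""
    groups = defaultdict(list)
    for title in titles:
        groups[unicodedata.normalize("NFC", title)].append(title)
    # Keep only the normalized forms shared by more than one raw title.
    return {key: members for key, members in groups.items() if len(members) > 1}

# "ö" as one precomposed code point versus "o" plus a combining diaeresis.
titles = ["Fl\u00f6te", "Flo\u0308te", "Venus"]
print(duplicate_titles(titles))
```

Both spellings look identical on screen but differ at the code-point level, so a database holding both rows can trip a unique title index once an importer normalizes them.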

I am interested in details of failure reports with importDump.php; the
prior memory leaks with 1.5 prereleases should be fixed as of 1.5.0 but
there might be intermittent or specific problems. Unfortunately
importDump is relatively slow because it works page-by-page (mwdumper
does faster bulk imports, assuming a blank slate empty database to start
with) so it takes a while to test or confirm failures. :)

-- brion vibber (brion @ pobox.com)


puglisi at arcetri

Oct 22, 2005, 8:06 AM

Post #10 of 10 (739 views)
Re: Re: dumps failing for en.wp [In reply to]

On Fri, 21 Oct 2005, Brion Vibber wrote:

> I am interested in details of failure reports with importDump.php; the
> prior memory leaks with 1.5 prereleases should be fixed as of 1.5.0 but
> there might be intermittent or specific problems. Unfortunately
> importDump is relatively slow because it works page-by-page (mwdumper
> does faster bulk imports, assuming a blank slate empty database to start
> with) so it takes a while to test or confirm failures. :)

A test of importDump.php with the it: current-pages XML dump
(20051012_pages_current.xml.7z) went OK.

No apparent memory leaks, memory usage hovered around 17-18 MB.

Alfio
