
Mailing List Archive: Wikipedia: Wikitech

Importing English Wikipedia XML Dumps into MediaWiki

 

 



olson_ot at yahoo

Oct 7, 2009, 10:31 AM

Post #1 of 10 (2163 views)
Importing English Wikipedia XML Dumps into MediaWiki

Hi,

I have been importing the English Wikipedia XML dumps every few
months (the last time was in June). I used xml2sql and it always
worked for me. Now I have attempted the import on the latest dump,
enwiki-20090920-pages-articles.xml, and on the earlier
enwiki-20090810-pages-articles.xml; both fail with this error:

>$ xml2sql enwiki-20090920-pages-articles.xml
unexpected element <redirect>
xml2sql: parsing aborted at line 33 pos 16.

So then I tried mwdumper, and after about 1.4 million pages it fails:
……
1,423,000 pages (957.283/sec), 1,423,000 revs (957.283/sec)
1,424,000 pages (957.465/sec), 1,424,000 revs (957.465/sec)
Exception in thread "main" java.lang.IllegalArgumentException: Invalid contributor
    at org.mediawiki.importer.XmlDumpReader.closeContributor(Unknown Source)
    at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
    at org.mediawiki.dumper.Dumper.main(Unknown Source)
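
A note on the trace above: the exception is raised in closeContributor and is reached through the parser's emptyElement callback. That suggests mwdumper stopped on a self-closing <contributor ... /> element, which may be a separate problem from the <redirect> tag that xml2sql reports. One way to check is to scan the dump for such elements. The Python sketch below is only an illustration under that assumption; the function name find_empty_contributors and the sample lines in the usage note are hypothetical and are not part of mwdumper or MediaWiki.

```python
import re

# Matches a self-closing contributor element such as
# <contributor /> or <contributor someattr="x" />.
# It does not match an ordinary opening tag <contributor>.
EMPTY_CONTRIBUTOR_RE = re.compile(r"<contributor\b[^>]*/>")


def find_empty_contributors(lines, limit=5):
    """Return up to `limit` (line_number, stripped_line) pairs for
    lines that contain a self-closing <contributor/> element."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        # Cheap substring test first; the dump has millions of lines.
        if "<contributor" in line and EMPTY_CONTRIBUTOR_RE.search(line):
            hits.append((lineno, line.strip()))
            if len(hits) >= limit:
                break
    return hits
```

To run it over a real dump, pass an open file object, for example open("enwiki-20090920-pages-articles.xml", encoding="utf-8"). The file is read one line at a time, so it never has to fit in memory.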


I also tried importDump.php (MediaWiki 1.14.0) and I get warnings like these:

Warning: xml_parse(): Unable to call handler in_() in
/var/www/includes/Import.php on line 437
Warning: xml_parse(): Unable to call handler in_() in
/var/www/includes/Import.php on line 437
Warning: xml_parse(): Unable to call handler out_() in
/var/www/includes/Import.php on line 437
….
(Sorry, I don’t know where this error starts; it processes a few
thousand pages, and I get sick of watching it before it fails.)

Has the format of the XML files changed? I could swear that as of
June, or maybe May, I had xml2sql working. I know that I might need to
upgrade MediaWiki to 1.15, but importDump.php usually does not work
for the English Wikipedia anyway.

I would be grateful for any ideas.
Thanks guys,
O. O.

P.S. http://download.wikimedia.org/tools/ does not have the source of
MWDumper. I thought this was open source?


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Platonides at gmail

Oct 7, 2009, 4:19 PM

Post #2 of 10 (2110 views)
Re: Importing English Wikipedia XML Dumps into MediaWiki

O. O. writes:
> (Sorry I don’t know where this error starts, but it processes a few
> thousand pages, up till I get sick of looking at it before failing.)
>
> Any ideas if the format of the XML files have changed because I can
> swear that as of June or may be May, I had xml2sql working. I know that
> I might need to upgrade MediaWiki to 1.15, however importDump.php
> usually does not work for the English Wikipedia anyways.
>
> I would be grateful if someone has any ideas?
> Thanks guys,
> O. O.

Seems it fails on the new <redirect> tag.

> P.S. http://download.wikimedia.org/tools/ does not have the source of
> MWDumper. I thought this was open source?

MWDumper source is available at
http://svn.wikimedia.org/viewvc/mediawiki/trunk/mwdumper/

It should be noted in the README.




olson_ot at yahoo

Oct 8, 2009, 9:14 AM

Post #3 of 10 (2102 views)
Re: Importing English Wikipedia XML Dumps into MediaWiki

Platonides wrote:
> Seems it fails on the new <redirect> tag.
>
>> P.S. http://download.wikimedia.org/tools/ does not have the source of
>> MWDumper. I thought this was open source?
>
> MWDumper source is available at
> http://svn.wikimedia.org/viewvc/mediawiki/trunk/mwdumper/
>
> It should be noted at the readme.

Thanks, Platonides. With the new <redirect> tag, is there any way to
import the new XML files?

Could I simply strip the <redirect /> tags out of the file if I
wanted MWDumper to work? Or, if I upgrade to MediaWiki 1.16, would
importDump.php work without any problems?
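
The stripping idea in the question above can be sketched as a streaming filter. This is a minimal Python illustration, not a tested fix: the names strip_redirect and filter_dump are made up for this example, and the thread does not confirm that a stripped dump imports cleanly with any of the tools. It assumes each <redirect /> element is self-closing and sits on a single line. Literal "<" characters inside article text are escaped as &lt; in XML, so the pattern should not touch page content.

```python
import re
import sys

# Self-closing redirect element, with or without attributes,
# e.g. "<redirect />" or '<redirect title="Foo" />'.
REDIRECT_RE = re.compile(r"<redirect\b[^>]*/>")


def strip_redirect(line):
    """Remove <redirect/> elements from one line of the dump.
    A line that held nothing else is dropped entirely."""
    if "<redirect" not in line:
        return line
    cleaned = REDIRECT_RE.sub("", line)
    return "" if cleaned.strip() == "" else cleaned


def filter_dump(src, dst):
    """Copy src to dst line by line without <redirect/> elements.
    Returns the number of lines that were changed or dropped."""
    changed = 0
    for line in src:
        out = strip_redirect(line)
        if out != line:
            changed += 1
        dst.write(out)
    return changed


if __name__ == "__main__" and len(sys.argv) == 3:
    # Dumps are UTF-8; newline="" keeps the original line endings.
    with open(sys.argv[1], encoding="utf-8", newline="") as src, \
         open(sys.argv[2], "w", encoding="utf-8", newline="") as dst:
        count = filter_dump(src, dst)
    print(f"lines changed: {count}", file=sys.stderr)
```

Saved under a hypothetical name such as strip_redirect.py, it would be run as: python strip_redirect.py enwiki-20090920-pages-articles.xml stripped.xml. It works line by line, so memory use stays flat even on the full English dump.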

(Thanks for the pointer to the MWDumper source. The source is not
mentioned in the README. However, I found it too complicated, or not
well enough documented, for me at this point.)

Thanks again,
O.O.




tfinc at wikimedia

Oct 8, 2009, 10:57 AM

Post #4 of 10 (2102 views)
Re: Importing English Wikipedia XML Dumps into MediaWiki

O. O. wrote:
> Platonides wrote:
>> Seems it fails on the new <redirect> tag.
>>
>>> P.S. http://download.wikimedia.org/tools/ does not have the source of
>>> MWDumper. I thought this was open source?
>> MWDumper source is available at
>> http://svn.wikimedia.org/viewvc/mediawiki/trunk/mwdumper/
>>
>> It should be noted at the readme.
>
> Thanks Platonides. With the new <redirect> tag is there anyway to
> import the new XML Files?
>
> Could I simply strip out the <redirect /> tags from the file, if I
> wanted MWDumper to work. Or if I upgrade to MediaWiki 1.16, would
> import.php work without any problems?

If it's failing due to an old XSD, then:

The updated XSD and a copy of Import.php were just checked into our
repositories, so you can either pull this

http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=54472

and increase the version number, à la

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Export.php?r1=56298&r2=56612

or you can wait until the next tagged release, which will likely include this.

>
> (Thanks for the pointer to the source of MW Dumper. The Source is not
> mentioned in the Readme. However, I found it too complicated - or not
> well documented for me at this point.)

I'll have a peek at this and see if it can be improved.

--tomasz



olson_ot at yahoo

Oct 8, 2009, 4:46 PM

Post #5 of 10 (2102 views)
Re: Importing English Wikipedia XML Dumps into MediaWiki

Tomasz Finc wrote:
> O. O. wrote:
>
> If it's failing due to an old xsd then ..
>
> The updated xsd and copy of Import.php just got checked into our
> repositories so you can either pull this
>
> http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=54472
>
> and increase the version number ala
>
> http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Export.php?r1=56298&r2=56612
>
> Or you can wait till the next tagged release which will likely include this.

Thanks Tomasz. I don’t mind waiting for your next release if it is going
to be in the next month or so.

>
>> (Thanks for the pointer to the source of MW Dumper. The Source is not
>> mentioned in the Readme. However, I found it too complicated - or not
>> well documented for me at this point.)
>
> I'll have a peek at this and see if it can be improved.

I hope someone can update MWDumper to the new XSD. It would help a
lot with importing Wikipedia dumps, because importDump.php is not
practical.




andrew.krizhanovsky at gmail

Oct 9, 2009, 1:14 AM

Post #6 of 10 (2097 views)
Re: Importing English Wikipedia XML Dumps into MediaWiki

Hi!

I have got the same "<redirect>" problem while importing the dump of
Russian Wiktionary. :(

Best regards,
Andrew Krizhanovsky.



olson_ot at yahoo

Oct 9, 2009, 11:18 AM

Post #7 of 10 (2083 views)
Re: Importing English Wikipedia XML Dumps into MediaWiki

Andrew Krizhanovsky wrote:
> Hi!
>
> I have got the same "<redirect>" problem while importing the dump of
> Russian Wiktionary. :(
>
> Best regards,
> Andrew Krizhanovsky.
>

So Andrew, do you import using importDump.php, MWDumper or xml2sql? I am
curious to know what others are using for their imports. (This is for
my personal knowledge.)

From grepping through the English Wikipedia dump, most of the
“<redirect />” tags appear to be blank. I hope someone can fix this soon.
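
The grep observation above can be turned into a count. The following Python sketch tallies <redirect> elements by whether they carry attributes. It is only an illustration: the function name classify_redirects and the two category labels are made up for this example, and it assumes each element sits on a single line of the dump.

```python
import re
from collections import Counter

# Captures whatever sits between "<redirect" and the closing "/>" or ">".
REDIRECT_RE = re.compile(r"<redirect\b([^>]*?)/?>")


def classify_redirects(lines):
    """Count <redirect> elements, split into "blank" (no attributes)
    and "with_attributes"."""
    counts = Counter()
    for line in lines:
        # Cheap substring test before running the regex.
        if "<redirect" not in line:
            continue
        for match in REDIRECT_RE.finditer(line):
            kind = "with_attributes" if match.group(1).strip() else "blank"
            counts[kind] += 1
    return counts
```

Passing an open file object, for example open("enwiki-20090920-pages-articles.xml", encoding="utf-8"), streams the dump one line at a time and returns a Counter with the two totals.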

Thanks to you guys,
O. O.




bilalak at gmail

Oct 9, 2009, 11:28 AM

Post #8 of 10 (2086 views)
Re: Importing English Wikipedia XML Dumps into MediaWiki

I have used xml2sql, mwdumper, importDump.php and the Python script
(xray) to import. The two fastest are xml2sql and the Python script.
The best results are from importDump.php.
mwdumper is slow, but it gives good results.

I have not done any import with the new <redirect> tag.

bilal


--
Verily, with hardship comes ease.


andrew.krizhanovsky at gmail

Oct 10, 2009, 9:37 AM

Post #9 of 10 (2082 views)
Re: Importing English Wikipedia XML Dumps into MediaWiki

Hi!

I have tried xml2sql and importDump.php.
The same error.

Best regards,
Andrew.



olson_ot at yahoo

Oct 12, 2009, 8:24 AM

Post #10 of 10 (2034 views)
Re: Importing English Wikipedia XML Dumps into MediaWiki

Andrew Krizhanovsky wrote:
> Hi!
>
> I have tried xml2sql and importDump.php.
> The same error.
>
> Best regards,
> Andrew.
>

Thanks Bilal and Andrew.
O.O.


