
Mailing List Archive: Wikipedia: Wikitech
Article selection with MWDumper.
 



cjb at laptop

May 6, 2008, 9:28 AM



Dear wikitech list,

I'm preparing a Spanish Wikipedia snapshot for the One Laptop per Child
laptop, and to get the size down I'm trying to do some article selection
using MWDumper¹. As background, here are the steps I've gone through
so far using the eswiki dump:

thunk:cjb~ % mysql -u eswiki -p eswiki < eswiki-20080416-pagelinks.sql

mysql> SELECT pl_namespace, pl_title, COUNT(*)
    -> INTO OUTFILE "/tmp/incominglinks"
    -> FROM pagelinks GROUP BY pl_namespace, pl_title;

The "incominglinks" file, after processing it to be one article per
line and excluding articles with few inbound links, looks like this:

thunk:cjb~ % wc -l incominglinks.names
162974 incominglinks.names
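
A minimal sketch of a filtering step of this kind, in Python. It assumes
the default tab-separated format that SELECT ... INTO OUTFILE writes
(namespace, title, count per line). The MIN_LINKS cutoff and the
restriction to namespace 0 are illustrative assumptions; the values
actually used are not stated in this message.

```python
MIN_LINKS = 5  # hypothetical cutoff; the real threshold is not stated


def select_titles(rows, min_links=MIN_LINKS, namespace=0):
    """Yield titles from tab-separated (namespace, title, count) rows
    that are in the given namespace and have enough inbound links."""
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not have exactly three fields
        ns, title, count = fields
        if int(ns) == namespace and int(count) >= min_links:
            yield title
```

Passing an open handle on /tmp/incominglinks as rows and writing each
yielded title on its own line would give a one-article-per-line file.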

This is where MWDumper comes in -- I'd like to create an XML dump
containing every article listed in incominglinks.names. I tried:

thunk:cjb~ % java -jar mwdumper-2008-04-13.jar \
--output=bzip2:eswiki_limited.xml.bz2 \
--format=xml \
--filter=list:incominglinks.names \
eswiki-20080416-pages-articles.xml.bz2

MWDumper didn't return any errors, and ran through the whole of the
.bz2 to completion. After decompressing the output, I ran:

thunk:cjb~ % perl -nle 'print $1 if /<title>(.*?)<\/title>/' \
< eswiki_limited.xml > incominglinks.output

The resulting file is:

thunk:cjb~ % wc -l incominglinks.output
45395 incominglinks.output
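
The same title extraction can be sketched in Python, reading the
compressed file directly. This is only an illustration under some
assumptions: each <title> element sits on a single line (as the Perl
one-liner above also assumes), and the function names are hypothetical.
Unlike the one-liner, it also decodes XML entities such as &amp;amp; so
that titles can be compared with database titles.

```python
import bz2
import html
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>")


def iter_titles(lines):
    """Yield the text of every <title> element found in the given lines."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match:
            # Dump titles are XML-escaped, so decode entities like &amp;
            yield html.unescape(match.group(1))


def titles_from_dump(path):
    """Stream titles out of a bzip2-compressed XML dump, line by line."""
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        yield from iter_titles(dump)
```

For example, titles_from_dump("eswiki_limited.xml.bz2") would iterate
over the titles without writing a decompressed copy to disk.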

Can anyone think of possible reasons for the discrepancy between the
number of articles requested (162974) and received (45395)? I've looked
briefly at character set and namespaces, and neither seems responsible.
Here are some examples of articles present in the input but not in the
output:

thunk:cjb~ % grep nuclear incominglinks.names.sorted
Abandono_de_la_energía_nuclear
Accidente_nuclear
Arma_nuclear
Armas_nucleares
Bomba_nuclear
Central_nuclear
[59 matches]

thunk:cjb~ % grep nuclear incominglinks.output.sorted
thunk:cjb~ %

Here are complete versions of the two files above:

http://dev.laptop.org/~cjb/incominglinks.names.sorted
http://dev.laptop.org/~cjb/incominglinks.output.sorted
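
One way to list every requested title that is missing from the output
is a set difference over the two files; a minimal sketch in Python
follows. Titles in the pagelinks table use underscores where the XML
dump's <title> elements use spaces, so both sides are normalized to
spaces before comparing. The function names are hypothetical.

```python
def normalize(title):
    """Strip surrounding whitespace and use spaces instead of underscores."""
    return title.strip().replace("_", " ")


def load_titles(lines):
    """Build a set of normalized titles, ignoring blank lines."""
    return {normalize(line) for line in lines if line.strip()}


def missing_titles(requested, received):
    """Return the sorted titles present in requested but not in received."""
    return sorted(load_titles(requested) - load_titles(received))
```

Calling missing_titles with open handles on incominglinks.names.sorted
and incominglinks.output.sorted (both opened as UTF-8) would return the
full list of dropped articles rather than a single grep sample.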

So, a few questions:

* Any ideas on what could cause so many articles to be dropped here?
* Is there any further output I could provide that would be useful?
* Is there a tool other than MWDumper that could do this for me?

Thanks very much for any suggestions!

- Chris.

¹: http://www.mediawiki.org/wiki/MWDumper
--
Chris Ball <cjb[at]laptop.org>

_______________________________________________
Wikitech-l mailing list
Wikitech-l[at]lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

