Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Wikipedia: Wikitech

Datamining infoboxes

 

 

First page Previous page 1 2 Next page Last page  View All Wikipedia wikitech RSS feed   Index | Next | Previous | View Threaded


hippytrail at gmail

Oct 22, 2009, 7:21 PM

Post #1 of 27 (2444 views)
Permalink
Datamining infoboxes

Infoboxes in Wikipedia often contain information which is quite useful
outside Wikipedia but can be surprisingly difficult to data-mine.

I would like to find all Wikipedia pages that use
Template:Infobox_Language and parse the parameters iso3 and
fam1...fam15

But my attempts to find such pages using either the Toolserver's
Wikipedia database or the Mediawiki API have not been fruitful. In
particular, SQL queries on the templatelinks table are intractably
slow. Why are there no keys on tl_from or tl_title?

Andrew Dunbar (hippietrail)

--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


lists at schwen

Oct 22, 2009, 8:33 PM

Post #2 of 27 (2380 views)
Permalink
Re: Datamining infoboxes [In reply to]

> particular, SQL queries on the templatelinks table are intractably
> slow. Why are there no keys on tl_from or tl_title?

How are you planning to get the template parameters? Have I missed a
recent schema change?
I'd be interested in following your progress. I'm not extracting
infobox data, but parameters of the coordinate template. Maybe a
similar approach could be interesting for you:

The coordinate template stuffs all its parameters int an external
link (which can easily be obtained from the externallinks table).
Creating dummy links containing parameters for some infoboxes could be
one way of making the data available for automatic extraction (yes,
it's a hack, but I'd prefer better suggestions over flames).

The link could actually be made useful, it could point to a query page
for the data in these infoboxes.

[[User:Dschwen]]

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


hippytrail at gmail

Oct 22, 2009, 11:13 PM

Post #3 of 27 (2376 views)
Permalink
Re: Datamining infoboxes [In reply to]

2009/10/22 Daniel Schwen <lists [at] schwen>:
>> particular, SQL queries on the templatelinks table are intractably
>> slow. Why are there no keys on tl_from or tl_title?
>
> How are you planning to get the template parameters? Have I missed a
> recent schema change?

I've been trying to parse the wikitext of section 0 with a minimal
parser that uses just the tokens {{ }} {{{ and }}} but it already has
probems when it sees }}}}

> I'd be interested in following your progress. I'm not extracting
> infobox data, but parameters of the coordinate template. Maybe a
> similar approach could be interesting for you:
>
>  The coordinate template stuffs all its parameters int an external
> link (which can easily be obtained from the externallinks table).
> Creating dummy links containing parameters for some infoboxes could be
> one way of making the data available for automatic extraction (yes,
> it's a hack, but I'd prefer better suggestions over flames).
>
> The link could actually be made useful, it could point to a query page
> for the data in these infoboxes.

The template and parameters I'm interested don't generate any such
external links and probably couldn't very easily...

But I have just discovered the rvgeneratexml parameter to
action=query&prop=revisions
This includes a <part> field for each template parameter with a <name>
and a <value> for each...

Andrew Dunbar (hippietrail)

> [[User:Dschwen]]
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


george.herbert at gmail

Oct 22, 2009, 11:37 PM

Post #4 of 27 (2387 views)
Permalink
Re: Datamining infoboxes [In reply to]

This discussion brings to mind several historical threads.

I wonder if a project to simply mine the whole article contents and
provide a DB of some sort with the articles and infobox contents would
be worthwhile. Develop a specific parser and generate and publish the
complete set of article-infobox-(key-value) sets...


On Thu, Oct 22, 2009 at 11:13 PM, Andrew Dunbar <hippytrail [at] gmail> wrote:
> 2009/10/22 Daniel Schwen <lists [at] schwen>:
>>> particular, SQL queries on the templatelinks table are intractably
>>> slow. Why are there no keys on tl_from or tl_title?
>>
>> How are you planning to get the template parameters? Have I missed a
>> recent schema change?
>
> I've been trying to parse the wikitext of section 0 with a minimal
> parser that uses just the tokens {{ }} {{{ and }}} but it already has
> probems when it sees }}}}
>
>> I'd be interested in following your progress. I'm not extracting
>> infobox data, but parameters of the coordinate template. Maybe a
>> similar approach could be interesting for you:
>>
>>  The coordinate template stuffs all its parameters int an external
>> link (which can easily be obtained from the externallinks table).
>> Creating dummy links containing parameters for some infoboxes could be
>> one way of making the data available for automatic extraction (yes,
>> it's a hack, but I'd prefer better suggestions over flames).
>>
>> The link could actually be made useful, it could point to a query page
>> for the data in these infoboxes.
>
> The template and parameters I'm interested don't generate any such
> external links and probably couldn't very easily...
>
> But I have just discovered the rvgeneratexml parameter to
> action=query&prop=revisions
> This includes a <part> field for each template parameter with a <name>
> and a <value> for each...
>
> Andrew Dunbar (hippietrail)
>
>> [[User:Dschwen]]
>>
>> _______________________________________________
>> Wikitech-l mailing list
>> Wikitech-l [at] lists
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>
>
>
> --
> http://wiktionarydev.leuksman.com http://linguaphile.sf.net
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



--
-george william herbert
george.herbert [at] gmail

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


william at scissor

Oct 22, 2009, 11:51 PM

Post #5 of 27 (2373 views)
Permalink
Re: Datamining infoboxes [In reply to]

George Herbert wrote:
> This discussion brings to mind several historical threads.
>
> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile. Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...
>

I don't know anybody on the data side at Metaweb anymore, but I know
that they did something like that to import a lot of structured
Wikipedia data into their Freebase project. They publish some sort of
data dump here:

http://download.freebase.com/wex/

Perhaps they'd be willing to open-source their parser.

William

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


roan.kattouw at gmail

Oct 23, 2009, 12:10 AM

Post #6 of 27 (2373 views)
Permalink
Re: Datamining infoboxes [In reply to]

2009/10/23 Andrew Dunbar <hippytrail [at] gmail>:
> But my attempts to find such pages using either the Toolserver's
> Wikipedia database or the Mediawiki API have not been fruitful. In
> particular, SQL queries on the templatelinks table are intractably
> slow. Why are there no keys on tl_from or tl_title?
>
There are:
CREATE UNIQUE INDEX /*i*/tl_from ON /*_*/templatelinks
(tl_from,tl_namespace,tl_title);
CREATE UNIQUE INDEX /*i*/tl_namespace ON /*_*/templatelinks
(tl_namespace,tl_title,tl_from);

It's just that tl_title is always coupled with tl_namespace because
that's how you should be using it (tl_namespace=10 for the template
namespace). Note that the former index can be used as an index on
(tl_from) as well.

Roan Kattouw (Catrope)

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


jcsahnwaldt at gmail

Oct 23, 2009, 4:55 AM

Post #7 of 27 (2360 views)
Permalink
Re: Datamining infoboxes [In reply to]

On Fri, Oct 23, 2009 at 08:37, George Herbert <george.herbert [at] gmail> wrote:
> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile.  Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...

That's what DBpedia is doing.

The extracted data can be found here, in N-Triples and CSV format:

http://wiki.dbpedia.org/Downloads

The entries in the row labelled 'Infoboxes' are files
that contain the extracted values of all template
properties in each page of a Wikipedia instance.
For large Wikipedias like en, the unzipped files are
pretty big (several GB).

Most of the extraction code can be found in these
PHP classes:

https://dbpedia.svn.sourceforge.net/svnroot/dbpedia/extraction/extractors/InfoboxExtractor.php
https://dbpedia.svn.sourceforge.net/svnroot/dbpedia/extraction/extractors/infobox/


Christopher

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


rlullmann at gmail

Oct 23, 2009, 4:56 AM

Post #8 of 27 (2362 views)
Permalink
Re: Datamining infoboxes [In reply to]

Hi Hippietrail!

What do you mean by "intractably slow"? Just how fast must it be?

If I do
http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Infobox_Language&eilimit=100&einamespace=0
it says (on one given try) that it was served in 0,047 seconds. How
long can it take to read them all? A few minutes?

Seems to me that time would be swamped by the time it takes to pull
the wikitext for the pages?

And methinks you might be trying too hard to parse the text, some
fairly simple regex or such can extract the template invocation and
the parameters; people use it in a pretty regular way.

Oh, and do remember to look for "Template:Infobox language" as well,
depending on which way you find them.

Robert

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


lists at schwen

Oct 23, 2009, 5:10 AM

Post #9 of 27 (2374 views)
Permalink
Re: Datamining infoboxes [In reply to]

> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile.  Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...

That is a brilliant idea...
...that somebody else already had and implemented

Templatetiger
http://toolserver.org/~kolossos/templatetiger/template-choice.php?lang=enwiki

Should have mentioned that earlier.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


hippytrail at gmail

Oct 23, 2009, 5:27 AM

Post #10 of 27 (2363 views)
Permalink
Re: Datamining infoboxes [In reply to]

2009/10/23 Robert Ullmann <rlullmann [at] gmail>:
> Hi Hippietrail!
>
> What do you mean by "intractably slow"? Just how fast must it be?
>
> If I do
> http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Infobox_Language&eilimit=100&einamespace=0
> it says (on one given try) that it was served in 0,047 seconds. How
> long can it take to read them all? A few minutes?

Yes I found how to get it through the API now. It was actually just
the Toolserver database that was intractably slow.

> Seems to me that time would be swamped by the time it takes to pull
> the wikitext for the pages?
>
> And methinks you might be trying too hard to parse the text, some
> fairly simple regex or such can extract the template invocation and
> the parameters; people use it in a pretty regular way.

I've been spending hours on the parsing now and don't find it simple
at all due to the fact that templates can be nested. Just extracting
the Infobox as one big lump is hard due to the need to match nested {{
and }}

Andrew Dunbar (hippietrail)

> Oh, and do remember to look for "Template:Infobox language" as well,
> depending on which way you find them.
>
> Robert
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


rarohde at gmail

Oct 23, 2009, 5:35 AM

Post #11 of 27 (2365 views)
Permalink
Re: Datamining infoboxes [In reply to]

Given the fairly obvious utility for data mining, it might make sense
for someone to extend the Mediawiki API to generate a list of template
calls and the parameters sent in each case.

-Robert Rohde

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


magnusmanske at googlemail

Oct 23, 2009, 5:42 AM

Post #12 of 27 (2367 views)
Permalink
Re: Datamining infoboxes [In reply to]

I am so glad that someone re-re-resurrects this topic :-)


On Fri, Oct 23, 2009 at 1:27 PM, Andrew Dunbar <hippytrail [at] gmail> wrote:
> I've been spending hours on the parsing now and don't find it simple
> at all due to the fact that templates can be nested. Just extracting
> the Infobox as one big lump is hard due to the need to match nested {{
> and }}

Not perfect, but try
http://toolserver.org/~magnus/wiki2xml/w2x.php

1. Unckeck "Use API", chose "Do not use templates"
2. Enter article name(s)
3. Get XML
4. Parse XML, re-submit the wiki text in templates to process the next
level of templates

I should really offer #4 in this...

Caveat: Will break on things like HTML attributes that are filled by
templates etc.

Cheers,
Magnus

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


rlullmann at gmail

Oct 23, 2009, 6:22 AM

Post #13 of 27 (2360 views)
Permalink
Re: Datamining infoboxes [In reply to]

> I've been spending hours on the parsing now and don't find it simple
> at all due to the fact that templates can be nested. Just extracting
> the Infobox as one big lump is hard due to the need to match nested {{
> and }}
>
> Andrew Dunbar (hippietrail)

Hi,

Come now, you are over-thinking it. Find "{{Infobox [Ll]anguage" in
the text, then count braces. Start at depth=2, count up and down 'till
you reach 0, and you are at the end of the template. (you can be picky
about only counting them if paired if you like ;-)

Then just regex match the lines/parameters you want.

However, if you are pulling the wikitext with the API, the XML parse
tree option sounds good; then you can just use elementTree (or the
like) and pull out the parameters directly

Robert

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


hippytrail at gmail

Oct 23, 2009, 7:02 AM

Post #14 of 27 (2358 views)
Permalink
Re: Datamining infoboxes [In reply to]

2009/10/23 Robert Ullmann <rlullmann [at] gmail>:
>> I've been spending hours on the parsing now and don't find it simple
>> at all due to the fact that templates can be nested. Just extracting
>> the Infobox as one big lump is hard due to the need to match nested {{
>> and }}
>>
>> Andrew Dunbar (hippietrail)
>
> Hi,
>
> Come now, you are over-thinking it. Find "{{Infobox [Ll]anguage" in
> the text, then count braces. Start at depth=2, count up and down 'till
> you reach 0, and you are at the end of the template. (you can be picky
> about only counting them if paired if you like ;-)

Actually you have to find "{{[Ii]nfobox[ _][Ll]anguage"
And I wanted to be robust. It's perfectly legal for single unmatched
braces to apear anywhere and I didn't want them to break my code. As
it happens there don't seem to currently be any in the language
infofoxes.
I couldn't be sure whether there would be any cases where a {{{ or }}}
might show up either. And a few other edge cases such as HTML
comments, <nowiki> and friends, template invocations in values, and
even possibly template invokations in names?

> Then just regex match the lines/parameters you want.
>
> However, if you are pulling the wikitext with the API, the XML parse
> tree option sounds good; then you can just use elementTree (or the
> like) and pull out the parameters directly

I've got it extracting the name/value pairs from the XML finally but
parsing XML is always a pain. And it still misses Norwegian, Bokmal,
and Nynorsk which wrap the infobox in another template...

Andrew Dunbar (hippietrail)

> Robert
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


roan.kattouw at gmail

Oct 23, 2009, 7:04 AM

Post #15 of 27 (2362 views)
Permalink
Re: Datamining infoboxes [In reply to]

2009/10/23 Robert Rohde <rarohde [at] gmail>:
> Given the fairly obvious utility for data mining, it might make sense
> for someone to extend the Mediawiki API to generate a list of template
> calls and the parameters sent in each case.
>
We had a discussion about this Tuesday in the tech staff meeting, and
decided that we want to put this data mining possibility in core at
some point (using a table like pagelinks to store these key/value
pairs and modifying the parser). As you may understand this is not a
very high priority project, and I don't know if any of the paid
developers are gonna do it any time soon.

Roan Kattouw (Catrope)

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


neil at tonal

Oct 23, 2009, 8:10 AM

Post #16 of 27 (2366 views)
Permalink
Re: Datamining infoboxes [In reply to]

Robert Ullmann wrote:
>> I've been spending hours on the parsing now and don't find it simple
>> at all due to the fact that templates can be nested. Just extracting
>> the Infobox as one big lump is hard due to the need to match nested {{
>> and }}
>>
>> Andrew Dunbar (hippietrail)
>>
>
> Hi,
>
> Come now, you are over-thinking it. Find "{{Infobox [Ll]anguage" in
> the text, then count braces. Start at depth=2, count up and down 'till
> you reach 0, and you are at the end of the template. (you can be picky
> about only counting them if paired if you like ;-)
>
> Then just regex match the lines/parameters you want.
>
> However, if you are pulling the wikitext with the API, the XML parse
> tree option sounds good; then you can just use elementTree (or the
> like) and pull out the parameters directly
>
> Robert
>

Or you could use the pyparsing Python library, with which you can
implement the grammar of your choice, making matching nested template
extraction trivial. Using the psyco package to accelerate it, you can
parse a whole en: dump in a few hours.

See the code below for a sample grammar...

-- Neil

------------------------------------------------

# Use pyparsing, enablePackrat() _and_ psyco for a considerable speed-up
from pyparsing import *
import psyco
# These two must be in the correct order, or bad things will happen
ParserElement.enablePackrat()
psyco.full()

wikitemplate = Forward()

wikilink = Combine("[[" + SkipTo("]]") + "]]")

wikiargname = CharsNotIn("|{}=")
wikiargval = ZeroOrMore(
wikilink | Group(wikitemplate) | CharsNotIn("[|{}") | "[" | "{" |
Regex("}[^}]"))

wikiarg = Group(Optional(wikiargname + Suppress("="), default="??") +
wikiargval)

wikitemplate << (Suppress("{{") + wikiargname + Optional(Suppress("|") +
delimitedList(wikiarg, "|")) + Suppress("}}"))

wikitext = ZeroOrMore(CharsNotIn("{") | Group(wikitemplate) | "{" )

def parse_page(text):
return wikitext.parseString(text)


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Simetrical+wikilist at gmail

Oct 23, 2009, 8:18 AM

Post #17 of 27 (2360 views)
Permalink
Re: Datamining infoboxes [In reply to]

On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar <hippytrail [at] gmail> wrote:
> Yes I found how to get it through the API now. It was actually just
> the Toolserver database that was intractably slow.

There's nothing slow about the TS database here:

mysql> pager true
PAGER set to 'true'
mysql> SELECT tl_from FROM templatelinks WHERE tl_namespace=10 AND
tl_title IN ('Infobox_Language', 'Infobox_language');
3144 rows in set (0.12 sec)

Your query might have been what was slow.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


lists at schwen

Oct 23, 2009, 8:25 AM

Post #18 of 27 (2357 views)
Permalink
Re: Datamining infoboxes [In reply to]

Fascinating!
It seems to be a repeating pattern on these mailing lists that people
ignore existing solutions and discuss re-inventing wheels (please
correct me if I'm wrong here).
While I agree this is fun some it rarely helps the OP...

[[User:Dschwen]]

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


dgerard at gmail

Oct 23, 2009, 8:28 AM

Post #19 of 27 (2345 views)
Permalink
Re: Datamining infoboxes [In reply to]

2009/10/23 William Pietri <william [at] scissor>:
> George Herbert wrote:

>> This discussion brings to mind several historical threads.
>> I wonder if a project to simply mine the whole article contents and
>> provide a DB of some sort with the articles and infobox contents would
>> be worthwhile.  Develop a specific parser and generate and publish the
>> complete set of article-infobox-(key-value) sets...

> I don't know anybody on the data side at Metaweb anymore, but I know
> that they did something like that to import a lot of structured
> Wikipedia data into their Freebase project. They publish some sort of
> data dump here:
> http://download.freebase.com/wex/
> Perhaps they'd be willing to open-source their parser.


They're right into open source, I suspect they would.


- d.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


hippytrail at gmail

Oct 23, 2009, 9:20 AM

Post #20 of 27 (2349 views)
Permalink
Re: Datamining infoboxes [In reply to]

2009/10/23 Aryeh Gregor <Simetrical+wikilist [at] gmail>:
> On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar <hippytrail [at] gmail> wrote:
>> Yes I found how to get it through the API now. It was actually just
>> the Toolserver database that was intractably slow.
>
> There's nothing slow about the TS database here:
>
> mysql> pager true
> PAGER set to 'true'
> mysql> SELECT tl_from FROM templatelinks WHERE tl_namespace=10 AND
> tl_title IN ('Infobox_Language', 'Infobox_language');
> 3144 rows in set (0.12 sec)
>
> Your query might have been what was slow.

Yes I didn't specify tl_namespace and when I check for which columns
have keys I could see none:
+--------------+-----------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+-----------------+------+-----+---------+-------+
| tl_from | int(8) unsigned | NO | | 0 | |
| tl_namespace | int(11) | NO | | 0 | |
| tl_title | varchar(255) | NO | | | |
+--------------+-----------------+------+-----+---------+-------+
3 rows in set (0.01 sec)

But I don't know much about databases and SQL...

I have reached an important milestone of extracting all the name value
pairs for language infobox ISO 639 language codes and language family
string by the way.

But the values still need some work before I can try to match them
against ISO 639-5 language family codes which is my ultimate goal.

Thanks for all the tips.

Andrew Dunbar (hippietrail)

> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


jcsahnwaldt at gmail

Oct 23, 2009, 9:45 AM

Post #21 of 27 (2345 views)
Permalink
Re: Datamining infoboxes [In reply to]

Because of result count restrictions, these queries don't
return all ISO language codes extracted by DBpedia,
but I think they give a good impression of the data quality
and coverage (or sometimes lack thereof):

http://dbpedia.org/sparql?query=select+distinct+%3Fs%2C+%3Fo+where{%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2Fiso%3E+%3Fo+.%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2FwikiPageUsesTemplate%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FTemplate%3Ainfobox_language%3E+.}

http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+distinct+%3Fs%2C+%3Fo+where{%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2Fiso%3E+%3Fo+.%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2FwikiPageUsesTemplate%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FTemplate%3Ainfobox_language%3E+.}


On Fri, Oct 23, 2009 at 18:20, Andrew Dunbar <hippytrail [at] gmail> wrote:
> 2009/10/23 Aryeh Gregor <Simetrical+wikilist [at] gmail>:
>> On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar <hippytrail [at] gmail> wrote:
>>> Yes I found how to get it through the API now. It was actually just
>>> the Toolserver database that was intractably slow.
>>
>> There's nothing slow about the TS database here:
>>
>> mysql> pager true
>> PAGER set to 'true'
>> mysql> SELECT tl_from FROM templatelinks WHERE tl_namespace=10 AND
>> tl_title IN ('Infobox_Language', 'Infobox_language');
>> 3144 rows in set (0.12 sec)
>>
>> Your query might have been what was slow.
>
> Yes I didn't specify tl_namespace and when I check for which columns
> have keys I could see none:
> mysql> describe templatelinks;
> +--------------+-----------------+------+-----+---------+-------+
> | Field        | Type            | Null | Key | Default | Extra |
> +--------------+-----------------+------+-----+---------+-------+
> | tl_from      | int(8) unsigned | NO   |     | 0       |       |
> | tl_namespace | int(11)         | NO   |     | 0       |       |
> | tl_title     | varchar(255)    | NO   |     |         |       |
> +--------------+-----------------+------+-----+---------+-------+
> 3 rows in set (0.01 sec)
>
> But I don't know much about databases and SQL...
>
> I have reached an important milestone of extracting all the name value
> pairs for language infobox ISO 639 language codes and language family
> string by the way.
>
> But the values still need some work before I can try to match them
> against ISO 639-5 language family codes which is my ultimate goal.
>
> Thanks for all the tips.
>
> Andrew Dunbar (hippietrail)
>
>> _______________________________________________
>> Wikitech-l mailing list
>> Wikitech-l [at] lists
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>
>
>
> --
> http://wiktionarydev.leuksman.com http://linguaphile.sf.net
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


jcsahnwaldt at gmail

Oct 23, 2009, 9:47 AM

Post #22 of 27 (2346 views)
Permalink
Re: Datamining infoboxes [In reply to]

Note: the trailing "}" is part of the URL. Some mail readers may
cut it off.

On Fri, Oct 23, 2009 at 18:45, Jona Christopher Sahnwaldt
<jcsahnwaldt [at] gmail> wrote:
> Because of result count restrictions, these queries don't
> return all ISO language codes extracted by DBpedia,
> but I think they give a good impression of the data quality
> and coverage (or sometimes lack thereof):
>
> http://dbpedia.org/sparql?query=select+distinct+%3Fs%2C+%3Fo+where{%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2Fiso%3E+%3Fo+.%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2FwikiPageUsesTemplate%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FTemplate%3Ainfobox_language%3E+.}
>
> http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+distinct+%3Fs%2C+%3Fo+where{%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2Fiso%3E+%3Fo+.%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2FwikiPageUsesTemplate%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FTemplate%3Ainfobox_language%3E+.}
>
>
> On Fri, Oct 23, 2009 at 18:20, Andrew Dunbar <hippytrail [at] gmail> wrote:
>> 2009/10/23 Aryeh Gregor <Simetrical+wikilist [at] gmail>:
>>> On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar <hippytrail [at] gmail> wrote:
>>>> Yes I found how to get it through the API now. It was actually just
>>>> the Toolserver database that was intractably slow.
>>>
>>> There's nothing slow about the TS database here:
>>>
>>> mysql> pager true
>>> PAGER set to 'true'
>>> mysql> SELECT tl_from FROM templatelinks WHERE tl_namespace=10 AND
>>> tl_title IN ('Infobox_Language', 'Infobox_language');
>>> 3144 rows in set (0.12 sec)
>>>
>>> Your query might have been what was slow.
>>
>> Yes I didn't specify tl_namespace and when I check for which columns
>> have keys I could see none:
>> mysql> describe templatelinks;
>> +--------------+-----------------+------+-----+---------+-------+
>> | Field        | Type            | Null | Key | Default | Extra |
>> +--------------+-----------------+------+-----+---------+-------+
>> | tl_from      | int(8) unsigned | NO   |     | 0       |       |
>> | tl_namespace | int(11)         | NO   |     | 0       |       |
>> | tl_title     | varchar(255)    | NO   |     |         |       |
>> +--------------+-----------------+------+-----+---------+-------+
>> 3 rows in set (0.01 sec)
>>
>> But I don't know much about databases and SQL...
>>
>> I have reached an important milestone of extracting all the name value
>> pairs for language infobox ISO 639 language codes and language family
>> string by the way.
>>
>> But the values still need some work before I can try to match them
>> against ISO 639-5 language family codes which is my ultimate goal.
>>
>> Thanks for all the tips.
>>
>> Andrew Dunbar (hippietrail)
>>
>>> _______________________________________________
>>> Wikitech-l mailing list
>>> Wikitech-l [at] lists
>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>
>>
>>
>>
>> --
>> http://wiktionarydev.leuksman.com http://linguaphile.sf.net
>>
>> _______________________________________________
>> Wikitech-l mailing list
>> Wikitech-l [at] lists
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


rarohde at gmail

Oct 23, 2009, 12:19 PM

Post #23 of 27 (2352 views)
Permalink
Re: Datamining infoboxes [In reply to]

On Fri, Oct 23, 2009 at 7:04 AM, Roan Kattouw <roan.kattouw [at] gmail> wrote:
> 2009/10/23 Robert Rohde <rarohde [at] gmail>:
>> Given the fairly obvious utility for data mining, it might make sense
>> for someone to extend the Mediawiki API to generate a list of template
>> calls and the parameters sent in each case.
>>
> We had a discussion about this Tuesday in the tech staff meeting, and
> decided that we want to put this data mining possibility in core at
> some point (using a table like pagelinks to store these key/value
> pairs and modifying the parser). As you may understand this is not a
> very high priority project, and I don't know if any of the paid
> developers are gonna do it any time soon.

Out of curiousity, did you discuss internal uses for this? The
biggest class of users would probably be external data miners, but I
suspect there might be some uses for such data within Mediawiki as
well, so I'm curious if you had some application(s) in mind.

-Robert Rohde

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Simetrical+wikilist at gmail

Oct 23, 2009, 2:09 PM

Post #24 of 27 (2341 views)
Permalink
Re: Datamining infoboxes [In reply to]

On Fri, Oct 23, 2009 at 12:20 PM, Andrew Dunbar <hippytrail [at] gmail> wrote:
> Yes I didn't specify tl_namespace

In MySQL that will usually make it impossible to effectively use an
index on (tl_namespace, tl_title), so it's essential that you specify
the NS. (Which you should anyway to avoid hitting things like
[[Template talk:Infobox language]].) Some DBMSes (including sometimes
MySQL >= 5.0, although apparently not here) are smart enough to use
this kind of index pretty well even if you don't specify the
namespace, but it would still be somewhat more efficient to specify it
-- the DB would have to do O(1/n) times as many index lookups, where n
is the number of namespaces.

> and when I check for which columns
> have keys I could see none:
> mysql> describe templatelinks;
> +--------------+-----------------+------+-----+---------+-------+
> | Field        | Type            | Null | Key | Default | Extra |
> +--------------+-----------------+------+-----+---------+-------+
> | tl_from      | int(8) unsigned | NO   |     | 0       |       |
> | tl_namespace | int(11)         | NO   |     | 0       |       |
> | tl_title     | varchar(255)    | NO   |     |         |       |
> +--------------+-----------------+------+-----+---------+-------+
> 3 rows in set (0.01 sec)

The toolserver database uses views. In MySQL, views can't have
indexes themselves, but your query is rewritten to run against the
real table -- which you can't access directly, but which does have
indexes. EXPLAIN is your best bet here:

mysql> EXPLAIN SELECT tl_from FROM templatelinks WHERE tl_title IN
('Infobox_Language', 'Infobox_language');
+----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
| id | select_type | table | type | possible_keys | key |
key_len | ref | rows | Extra |
+----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
| 1 | SIMPLE | templatelinks | index | NULL | tl_from |
265 | NULL | 149740990 | Using where; Using index |
+----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT tl_from FROM templatelinks WHERE tl_namespace=10
AND tl_title IN ('Infobox_Language', 'Infobox_language');
+----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key
| key_len | ref | rows | Extra |
+----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
| 1 | SIMPLE | templatelinks | range | tl_namespace |
tl_namespace | 261 | NULL | 6949 | Using where; Using index |
+----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
1 row in set (0.00 sec)

Note the number of rows scanned in each case. Your query was scanning
all of templatelinks, the other is retrieving the exact rows needed
and not looking at any others ("type" = "index" vs. "range"). The
reason for this is given in the "possible_keys" column: MySQL can find
no keys that are usable for lookup, if you omit tl_namespace.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


hippytrail at gmail

Oct 25, 2009, 6:55 PM

Post #25 of 27 (2259 views)
Permalink
Re: Datamining infoboxes [In reply to]

2009/10/23 Jona Christopher Sahnwaldt <jcsahnwaldt [at] gmail>:
> Because of result count restrictions, these queries don't
> return all ISO language codes extracted by DBpedia,
> but I think they give a good impression of the data quality
> and coverage (or sometimes lack thereof):
>
> http://dbpedia.org/sparql?query=select+distinct+%3Fs%2C+%3Fo+where{%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2Fiso%3E+%3Fo+.%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2FwikiPageUsesTemplate%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FTemplate%3Ainfobox_language%3E+.}
>
> http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+distinct+%3Fs%2C+%3Fo+where{%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2Fiso%3E+%3Fo+.%3Fs+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2FwikiPageUsesTemplate%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FTemplate%3Ainfobox_language%3E+.}

This is really amazing and I think I'm going to be learning the query
language and possibly spending some time with dbpedia. Have you
thought about doing the same for Wiktionary?

Andrew Dunbar (hippietrail)

> On Fri, Oct 23, 2009 at 18:20, Andrew Dunbar <hippytrail [at] gmail> wrote:
>> 2009/10/23 Aryeh Gregor <Simetrical+wikilist [at] gmail>:
>>> On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar <hippytrail [at] gmail> wrote:
>>>> Yes I found how to get it through the API now. It was actually just
>>>> the Toolserver database that was intractably slow.
>>>
>>> There's nothing slow about the TS database here:
>>>
>>> mysql> pager true
>>> PAGER set to 'true'
>>> mysql> SELECT tl_from FROM templatelinks WHERE tl_namespace=10 AND
>>> tl_title IN ('Infobox_Language', 'Infobox_language');
>>> 3144 rows in set (0.12 sec)
>>>
>>> Your query might have been what was slow.
>>
>> Yes I didn't specify tl_namespace and when I check for which columns
>> have keys I could see none:
>> mysql> describe templatelinks;
>> +--------------+-----------------+------+-----+---------+-------+
>> | Field        | Type            | Null | Key | Default | Extra |
>> +--------------+-----------------+------+-----+---------+-------+
>> | tl_from      | int(8) unsigned | NO   |     | 0       |       |
>> | tl_namespace | int(11)         | NO   |     | 0       |       |
>> | tl_title     | varchar(255)    | NO   |     |         |       |
>> +--------------+-----------------+------+-----+---------+-------+
>> 3 rows in set (0.01 sec)
>>
>> But I don't know much about databases and SQL...
>>
>> I have reached an important milestone of extracting all the name value
>> pairs for language infobox ISO 639 language codes and language family
>> string by the way.
>>
>> But the values still need some work before I can try to match them
>> against ISO 639-5 language family codes which is my ultimate goal.
>>
>> Thanks for all the tips.
>>
>> Andrew Dunbar (hippietrail)
>>
>>> _______________________________________________
>>> Wikitech-l mailing list
>>> Wikitech-l [at] lists
>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>
>>
>>
>>
>> --
>> http://wiktionarydev.leuksman.com http://linguaphile.sf.net
>>
>> _______________________________________________
>> Wikitech-l mailing list
>> Wikitech-l [at] lists
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

First page Previous page 1 2 Next page Last page  View All Wikipedia wikitech RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.