Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Wikipedia: Wikitech

Datamining infoboxes

 

 

First page Previous page 1 2 Next page Last page  View All Wikipedia wikitech RSS feed   Index | Next | Previous | View Threaded


hippytrail at gmail

Oct 25, 2009, 6:58 PM

Post #26 of 27 (530 views)
Permalink
Re: Datamining infoboxes [In reply to]

2009/10/23 Aryeh Gregor <Simetrical+wikilist [at] gmail>:
> On Fri, Oct 23, 2009 at 12:20 PM, Andrew Dunbar <hippytrail [at] gmail> wrote:
>> Yes I didn't specify tl_namespace
>
> In MySQL that will usually make it impossible to effectively use an
> index on (tl_namespace, tl_title), so it's essential that you specify
> the NS.  (Which you should anyway to avoid hitting things like
> [[Template talk:Infobox language]].)  Some DBMSes (including sometimes
> MySQL >= 5.0, although apparently not here) are smart enough to use
> this kind of index pretty well even if you don't specify the
> namespace, but it would still be somewhat more efficient to specify it
> -- the DB would have to do O(1/n) times as many index lookups, where n
> is the number of namespaces.
>
>> and when I check for which columns
>> have keys I could see none:
>> mysql> describe templatelinks;
>> +--------------+-----------------+------+-----+---------+-------+
>> | Field        | Type            | Null | Key | Default | Extra |
>> +--------------+-----------------+------+-----+---------+-------+
>> | tl_from      | int(8) unsigned | NO   |     | 0       |       |
>> | tl_namespace | int(11)         | NO   |     | 0       |       |
>> | tl_title     | varchar(255)    | NO   |     |         |       |
>> +--------------+-----------------+------+-----+---------+-------+
>> 3 rows in set (0.01 sec)
>
> The toolserver database uses views.  In MySQL, views can't have
> indexes themselves, but your query is rewritten to run against the
> real table -- which you can't access directly, but which does have
> indexes.  EXPLAIN is your best bet here:
>
> mysql> EXPLAIN SELECT tl_from FROM templatelinks WHERE tl_title IN
> ('Infobox_Language', 'Infobox_language');
> +----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
> | id | select_type | table         | type  | possible_keys | key     |
> key_len | ref  | rows      | Extra                    |
> +----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
> |  1 | SIMPLE      | templatelinks | index | NULL          | tl_from |
> 265     | NULL | 149740990 | Using where; Using index |
> +----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
> 1 row in set (0.00 sec)
>
> mysql> EXPLAIN SELECT tl_from FROM templatelinks WHERE tl_namespace=10
> AND tl_title IN ('Infobox_Language', 'Infobox_language');
> +----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
> | id | select_type | table         | type  | possible_keys | key
>   | key_len | ref  | rows | Extra                    |
> +----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
> |  1 | SIMPLE      | templatelinks | range | tl_namespace  |
> tl_namespace | 261     | NULL | 6949 | Using where; Using index |
> +----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
> 1 row in set (0.00 sec)
>
> Note the number of rows scanned in each case.  Your query was scanning
> all of templatelinks, the other is retrieving the exact rows needed
> and not looking at any others ("type" = "index" vs. "range").  The
> reason for this is given in the "possible_keys" column: MySQL can find
> no keys that are usable for lookup, if you omit tl_namespace.

Thanks for the very informative reply. I already knew most of this
stuff passively except database/SQL views. Now I've just got to put it
into more practice.

Andrew Dunbar (hippietrail)

> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l



--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


jcsahnwaldt at gmail

Oct 26, 2009, 7:49 AM

Post #27 of 27 (520 views)
Permalink
Re: Datamining infoboxes [In reply to]

On Mon, Oct 26, 2009 at 02:55, Andrew Dunbar <hippytrail [at] gmail> wrote:
> Have you thought about doing the same for Wiktionary?

Interesting idea. I don't know much about Wiktionary.
Are its pages structured similarly? How difficult would
it be to extract structured data from them? What kind
of data would you expect to extract?

I don't think there is much dictionary data on the Web
of Data yet, so extracting data from wiktionary may
be an interesting first step in that direction. We haven't
really thought about it yet, and it probably won't get to
the top of our to-do-list anytime soon, but it's an
interesting idea!


Christopher

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

First page Previous page 1 2 Next page Last page  View All Wikipedia wikitech RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.