Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Wikipedia: Wikitech

Size of DB/table of enwiki after import into MySQL

 

 

First page Previous page 1 2 Next page Last page  View All Wikipedia wikitech RSS feed   Index | Next | Previous | View Threaded


ryanchan404 at gmail

Nov 20, 2009, 5:13 AM

Post #1 of 43 (2767 views)
Permalink
Size of DB/table of enwiki after import into MySQL

Hello,

Anyone has experience in importing enwiki database dump at
http://download.wikimedia.org/backup-index.html into a real MySQL
server?

1. It seems pages-meta-history has the max. size in term of download,
how much storage space does it take when imported into a table?
(including index)
2. What are the total storage needed for importing the whole enwiki?
3. Do you experience performance problem when querying the database,
seems I think most table if over 10GB size? Any suggestion?


I need this imformation as I require to prepare budget plan to buy a
proper server for doing the job.

Thank you.

Ryan

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


jcsahnwaldt at gmail

Nov 20, 2009, 8:38 AM

Post #2 of 43 (2706 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

Hi Ryan,

pages-meta-history hasn't been generated for enwiki in a while
(it's gotten too big), so I can't tell you anything about it. We're
importing pages-articles.xml (currently about 20 GB, 5 GB as
bzip2) using mwdumper. We're using MyISAM, not InnoDB.
The import takes about 8 hours, most of it (80%) for creating
the indexes.

Besides pages-articles.xml, we also import categorylinks.sql,
imagelinks.sql, image.sql, langlinks.sql and templatelinks.sql.

The MySQL database filled from all these files takes up 39 GB
hard drive space. The largest file is text.MYD - about 20 GB.

With the indexes defined in tables.sql, query performance
is ok. For example, selecting the titles of all articles that
are not redirects takes five or ten minutes (didn't profile it
exactly).

Hope that helps.

Christopher

On Fri, Nov 20, 2009 at 14:13, Ryan Chan <ryanchan404 [at] gmail> wrote:
> Hello,
>
> Anyone has experience in importing enwiki database dump at
> http://download.wikimedia.org/backup-index.html into a real MySQL
> server?
>
> 1. It seems pages-meta-history has the max. size in term of download,
> how much storage space does it take when imported into a table?
> (including index)
> 2. What are the total storage needed for importing the whole enwiki?
> 3. Do you experience performance problem when querying the database,
> seems I think most table if over 10GB size? Any suggestion?
>
>
> I need this imformation as I require to prepare budget plan to buy a
> proper server for doing the job.
>
> Thank you.
>
> Ryan
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ryanchan404 at gmail

Nov 20, 2009, 7:47 PM

Post #3 of 43 (2700 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Sat, Nov 21, 2009 at 12:38 AM, Jona Christopher Sahnwaldt
<jcsahnwaldt [at] gmail> wrote:
> With the indexes defined in tables.sql, query performance
> is ok. For example, selecting the titles of all articles that
> are not redirects takes five or ten minutes (didn't profile it
> exactly).
>

Any reason I would like to ask is why not use PostgreSQL?

Seems MySQL is not suitable for handling large table (e.g. over few
GB), I just wonder why wikipedia don't use PostgreSQL?

It should provide better performance.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


overlordq at gmail

Nov 20, 2009, 8:05 PM

Post #4 of 43 (2705 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Ryan Chan wrote:
> On Sat, Nov 21, 2009 at 12:38 AM, Jona Christopher Sahnwaldt
> <jcsahnwaldt [at] gmail> wrote:
>> With the indexes defined in tables.sql, query performance
>> is ok. For example, selecting the titles of all articles that
>> are not redirects takes five or ten minutes (didn't profile it
>> exactly).
>>
>
> Any reason I would like to ask is why not use PostgreSQL?
>
> Seems MySQL is not suitable for handling large table (e.g. over few
> GB), I just wonder why wikipedia don't use PostgreSQL?
>
> It should provide better performance.
>

MediaWiki has used MySQL since the beginning and has let the code get
away with things that shouldn't have been done which makes switching to
Postgres hard.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEAREIAAYFAksHZv0ACgkQ69PBoSWyJd7JeACfVxxq+t4GyWzAGX3BSMxh80da
GSgAoINjp9zKszTUfnEm+RM2ORJpyMjZ
=cWa5
-----END PGP SIGNATURE-----

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ryanchan404 at gmail

Nov 20, 2009, 8:07 PM

Post #5 of 43 (2702 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Sat, Nov 21, 2009 at 12:05 PM, Q <overlordq [at] gmail> wrote:
>
> MediaWiki has used MySQL since the beginning and has let the code get
> away with things that shouldn't have been done which makes switching to
> Postgres hard.

Isn't that mediawiki also support pgsql?

http://www.mediawiki.org/wiki/Installation

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


overlordq at gmail

Nov 20, 2009, 8:32 PM

Post #6 of 43 (2705 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Ryan Chan wrote:
> On Sat, Nov 21, 2009 at 12:05 PM, Q <overlordq [at] gmail> wrote:
>> MediaWiki has used MySQL since the beginning and has let the code get
>> away with things that shouldn't have been done which makes switching to
>> Postgres hard.
>
> Isn't that mediawiki also support pgsql?
>
> http://www.mediawiki.org/wiki/Installation
>

It does to an extent but some things are broken.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEAREIAAYFAksHbV8ACgkQ69PBoSWyJd46pgCgjWdkrotZyjlXjuebDOQNAmS6
aiYAn0SUsAWLKqL5CP3tfBVVDtW0QvZh
=Af6p
-----END PGP SIGNATURE-----

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Simetrical+wikilist at gmail

Nov 21, 2009, 3:39 PM

Post #7 of 43 (2690 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Fri, Nov 20, 2009 at 10:47 PM, Ryan Chan <ryanchan404 [at] gmail> wrote:
> Any reason I would like to ask is why not use PostgreSQL?
>
> Seems MySQL is not suitable for handling large table (e.g. over few
> GB), I just wonder why wikipedia don't use PostgreSQL?
>
> It should provide better performance.

MySQL is easily capable of handling very large tables, if used
properly. Certainly tables the size of Wikipedia's (which aren't very
big by DB standards). Selecting a list of all titles that are not
redirects will take a long time on any database, unless you have
everything in memory, because it requires a table scan -- there's no
index that covers the relevant columns (IIRC). Of course, if you
don't configure MySQL properly, or don't give it a reasonable amount
of hardware, it will perform poorly, but the database is not much
overtaxed on Wikipedia right now.

It's also worth pointing out that Wikipedia uses a version of MySQL
with substantial modifications, and Wikimedia sysadmins are very
familiar with its behavior. Switching to a new technology might
theoretically be better in the long term (although I wouldn't take
that for granted in this case), but the transition cost would be
substantial. Heck, Wikipedia hasn't even upgraded to MySQL 4.1, let
alone a whole different DBMS.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


wikimail at inbox

Nov 21, 2009, 4:02 PM

Post #8 of 43 (2690 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Sat, Nov 21, 2009 at 6:39 PM, Aryeh Gregor
<Simetrical+wikilist [at] gmail> wrote:
> Selecting a list of all titles that are not
> redirects will take a long time on any database, unless you have
> everything in memory, because it requires a table scan -- there's no
> index that covers the relevant columns (IIRC).

You could build an index on page_is_redirect in the "page" table (see
enwiki-*-page.sql.gz). But I'm pretty sure Postgresql wouldn't use
it, and would do a sequential scan, since pretty much all the pages
are going to have to be accessed anyway.

Five or ten minutes sounds about right. I can't imagine this is a
query you want to run over and over again. If it is, you'd probably
want to use partitioning
(http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html),
but you're still not going to cut down the query time very much, as
it's going to be returning millions of rows.

> It's also worth pointing out that Wikipedia uses a version of MySQL
> with substantial modifications, and Wikimedia sysadmins are very
> familiar with its behavior.  Switching to a new technology might
> theoretically be better in the long term (although I wouldn't take
> that for granted in this case), but the transition cost would be
> substantial.  Heck, Wikipedia hasn't even upgraded to MySQL 4.1, let
> alone a whole different DBMS.

Yes, it can be very hard to switch your DBMS, and that's a very good
thing for MySQL. :)

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Simetrical+wikilist at gmail

Nov 21, 2009, 5:45 PM

Post #9 of 43 (2686 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Sat, Nov 21, 2009 at 7:02 PM, Anthony <wikimail [at] inbox> wrote:
> You could build an index on page_is_redirect in the "page" table (see
> enwiki-*-page.sql.gz).  But I'm pretty sure Postgresql wouldn't use
> it, and would do a sequential scan, since pretty much all the pages
> are going to have to be accessed anyway.
>
> Five or ten minutes sounds about right.  I can't imagine this is a
> query you want to run over and over again.  If it is, you'd probably
> want to use partitioning
> (http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html),
> but you're still not going to cut down the query time very much, as
> it's going to be returning millions of rows.

Yeah, pretty much. If you had an index on (page_is_redirect,
page_namespace, page_title) or such, that would speed it up
significantly (at least in MySQL). It would have to scan through the
whole index, but that only contains three columns plus a row id of
some kind, so it should be quite a lot faster than scanning the whole
table. But this isn't a likely query for optimization.

> Yes, it can be very hard to switch your DBMS, and that's a very good
> thing for MySQL.  :)

Let's not have a DBMS flame war here, please.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


jcsahnwaldt at gmail

Nov 21, 2009, 6:48 PM

Post #10 of 43 (2686 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Sun, Nov 22, 2009 at 02:45, Aryeh Gregor
<Simetrical+wikilist [at] gmail> wrote:
> Yeah, pretty much.  If you had an index on (page_is_redirect,
> page_namespace, page_title) or such, that would speed it up
> significantly (at least in MySQL).  It would have to scan through the
> whole index, but that only contains three columns plus a row id of
> some kind, so it should be quite a lot faster than scanning the whole
> table.  But this isn't a likely query for optimization.

That's correct. We run this query once, and then we do
a SELECT for each title (which only takes a few millis,
because it uses an index) and work with the result.
Building the index would probably take longer than
5 or 10 minutes.

The DB access is fast compared to all the other stuff
we do, so there's not much need to optimize it. MySQL
with the indexes in tables.sql suits us fine so far, and I
have no reason to suspect any other DB would be faster.

The one thing that is slow is builiding the indexes after
the data has been imported (eight hours or so). Maybe
we could omit some indexes that are not used in our
application, but I haven't really looked into that.

Christopher

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


wikimail at inbox

Nov 21, 2009, 7:36 PM

Post #11 of 43 (2686 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Sat, Nov 21, 2009 at 8:45 PM, Aryeh Gregor
<Simetrical+wikilist [at] gmail> wrote:
> Yeah, pretty much. If you had an index on (page_is_redirect,
> page_namespace, page_title) or such, that would speed it up
> significantly (at least in MySQL). It would have to scan through the
> whole index, but that only contains three columns plus a row id of
> some kind, so it should be quite a lot faster than scanning the whole
> table.

Quite a lot? The theoretical max would be about twice as fast, as
(page_is_redirect, page_namespace, page_title) is going to take up at
least half as much space as the whole page table. But I'm not sure
even that theoretical max could be reached by MySQL.

I know it wouldn't be reached by PostgreSQL, which would still do a
sequential scan through the table. If you clustered on
page_is_redirect you'd save yourself from having to go through the
parts of the table which were redirects, but you're still stuck with a
sequential scan.

In either database, if you really wanted the absolute fastest
solution, you'd create a materialized view for exactly that query.
But as you said, "this isn't a likely query for optimization."

>> Yes, it can be very hard to switch your DBMS, and that's a very good
>> thing for MySQL. :)
>
> Let's not have a DBMS flame war here, please.

Aww, c'mon, just a little light ribbing... I couldn't resist, you set
me right up for it.

I'm actually in the process of trying to import enwiki into a
postgresql database right now. Attempt 1 was to import everything
into a MySQL database (with no indexes), export it as a TSV file, then
import from the TSV file into Postgresql. Hit a snag with some data
that Postgres is saying isn't valid UTF8, which is probably due to
something I did wrong with the import, but I can't figure out what it
is.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ryanchan404 at gmail

Nov 21, 2009, 7:40 PM

Post #12 of 43 (2686 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Sun, Nov 22, 2009 at 7:39 AM, Aryeh Gregor
<Simetrical+wikilist [at] gmail> wrote:
> It's also worth pointing out that Wikipedia uses a version of MySQL
> with substantial modifications,

Is the source available in the svn? Can you point me to the right direction?


Thanks.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Simetrical+wikilist at gmail

Nov 22, 2009, 8:26 AM

Post #13 of 43 (2676 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Sat, Nov 21, 2009 at 9:48 PM, Jona Christopher Sahnwaldt
<jcsahnwaldt [at] gmail> wrote:
> The one thing that is slow is builiding the indexes after
> the data has been imported (eight hours or so). Maybe
> we could omit some indexes that are not used in our
> application, but I haven't really looked into that.

MyISAM should be able to build keys quite quickly; it can do so by
sorting if all goes well. That's commonly ten times as fast as
rebuilding by keycache, or more. I don't know offhand how mwdumper
works, but you might want to try increasing the value of
myisam_max_sort_file_size to something larger than all your indexes,
if you have enough disk space. Or you could do something like import
into a table with no keys at all, create a table with the same
definition but with keys and no data, shut down MySQL, copy the new
table's .frm and .MYI over the old table's, run myisamchk
--sort-recover on the old table (with the data and copied .frm and
.MYI), and restart MySQL, but this is at your own risk, of course. :)

On Sat, Nov 21, 2009 at 10:36 PM, Anthony <wikimail [at] inbox> wrote:
> Quite a lot?  The theoretical max would be about twice as fast, as
> (page_is_redirect, page_namespace, page_title) is going to take up at
> least half as much space as the whole page table.

Yes, that's true. Twice as fast is still a pretty good improvement, though. :)

> I know it wouldn't be reached by PostgreSQL, which would still do a
> sequential scan through the table.

MySQL skips the table and just looks at the index, in both InnoDB and MyISAM.

page_is_redirect=0;
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len |
ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
| 1 | SIMPLE | page | ALL | NULL | NULL | NULL |
NULL | 43603 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
1 row in set (0.00 sec)

page_is_redirect=0;
+----+-------------+-------+------+------------------+------------------+---------+-------+-------+-------------+
| id | select_type | table | type | possible_keys | key
| key_len | ref | rows | Extra |
+----+-------------+-------+------+------------------+------------------+---------+-------+-------+-------------+
| 1 | SIMPLE | page2 | ref | page_is_redirect |
page_is_redirect | 1 | const | 22048 | Using index |
+----+-------------+-------+------+------------------+------------------+---------+-------+-------+-------------+
1 row in set (0.00 sec)

Note "Using index" in the second query; it never looks at the table
data, only the index. (Those queries are on a local copy of Simple,
not enwiki, thus the small row counts.)

> I'm actually in the process of trying to import enwiki into a
> postgresql database right now.  Attempt 1 was to import everything
> into a MySQL database (with no indexes), export it as a TSV file, then
> import from the TSV file into Postgresql.  Hit a snag with some data
> that Postgres is saying isn't valid UTF8, which is probably due to
> something I did wrong with the import, but I can't figure out what it
> is.

The most likely problem is that it's not actually valid UTF-8. There
are some places where we end up truncating things bytewise rather than
characterwise, like edit summaries, leaving only the first byte or two
of a multibyte character. Last I checked, edit summaries were just
passed to the database with no hard length check, so MySQL in
non-strict mode with a binary schema (like Wikipedia) will just
truncate them to fit. (In strict mode it will raise an error, and if
we used utf8 rather than binary it would presumably truncate by
characters.)

On Sat, Nov 21, 2009 at 10:40 PM, Ryan Chan <ryanchan404 [at] gmail> wrote:
> Is the source available in the svn? Can you point me to the right direction?

http://svn.wikimedia.org/viewvc/mysql/

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ryanchan404 at gmail

Nov 22, 2009, 8:45 AM

Post #14 of 43 (2675 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Mon, Nov 23, 2009 at 12:26 AM, Aryeh Gregor
<Simetrical+wikilist [at] gmail> wrote:
> On Sat, Nov 21, 2009 at 10:40 PM, Ryan Chan <ryanchan404 [at] gmail> wrote:
>> Is the source available in the svn? Can you point me to the right direction?
>
> http://svn.wikimedia.org/viewvc/mysql/
>


Thanks for the link.

May I ask why still using the 4.0 version? Seems 5.1 above did provide
much performance enhancements?

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Simetrical+wikilist at gmail

Nov 22, 2009, 8:54 AM

Post #15 of 43 (2673 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Sun, Nov 22, 2009 at 11:45 AM, Ryan Chan <ryanchan404 [at] gmail> wrote:
> May I ask why still using the 4.0 version? Seems 5.1 above did provide
> much performance enhancements?

I'm not the one to ask. My understanding is that it's mostly a
question of how much administrative resources it would take to switch.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


wikimail at inbox

Nov 22, 2009, 8:57 AM

Post #16 of 43 (2675 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Sun, Nov 22, 2009 at 11:26 AM, Aryeh Gregor
<Simetrical+wikilist [at] gmail> wrote:
> Last I checked, edit summaries were just
> passed to the database with no hard length check, so MySQL in
> non-strict mode with a binary schema (like Wikipedia) will just
> truncate them to fit.  (In strict mode it will raise an error, and if
> we used utf8 rather than binary it would presumably truncate by
> characters.)

Hmm, okay. Should I be using character set and collation "binary"?
The dumps build the table using character set utf8, and don't say
anything about the collation. Is this specific to edit summaries, or
does it apply to all the text fields?

I guess this is getting off topic, but I'm not quite sure what the
topic of this thread is anyway.

On Sun, Nov 22, 2009 at 11:54 AM, Aryeh Gregor
<Simetrical+wikilist [at] gmail> wrote:
> On Sun, Nov 22, 2009 at 11:45 AM, Ryan Chan <ryanchan404 [at] gmail> wrote:
>> May I ask why still using the 4.0 version? Seems 5.1 above did provide
>> much performance enhancements?
>
> I'm not the one to ask. My understanding is that it's mostly a
> question of how much administrative resources it would take to switch.

I don't know if it was supposed to be taken as sarcasm or not, but Tim
Starling recently commented that "it seems that I'm the only staff
member who knows MySQL." That was a joke, right?

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Simetrical+wikilist at gmail

Nov 22, 2009, 9:02 AM

Post #17 of 43 (2673 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Sun, Nov 22, 2009 at 11:57 AM, Anthony <wikimail [at] inbox> wrote:
> Hmm, okay.  Should I be using character set and collation "binary"?
> The dumps build the table using character set utf8, and don't say
> anything about the collation.  Is this specific to edit summaries, or
> does it apply to all the text fields?
>
> I guess this is getting off topic, but I'm not quite sure what the
> topic of this thread is anyway.

Wikipedia uses binary fields for everything. It has to, since MySQL
4.0 doesn't support anything else. I don't know what other sites
running off Wikipedia dumps do.

> I don't know if it was supposed to be taken as sarcasm or not, but Tim
> Starling recently commented that "it seems that I'm the only staff
> member who knows MySQL."  That was a joke, right?

I assume he was saying that he's the only paid Wikimedia staff right
now who knows a lot about databases, and MySQL in particular. I don't
know without context.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


wikimail at inbox

Nov 22, 2009, 1:40 PM

Post #18 of 43 (2669 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Sun, Nov 22, 2009 at 12:02 PM, Aryeh Gregor
<Simetrical+wikilist [at] gmail> wrote:
> Wikipedia uses binary fields for everything.  It has to, since MySQL
> 4.0 doesn't support anything else.

Reading through https://bugzilla.wikimedia.org/show_bug.cgi?id=164 was
enlightening.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


jcsahnwaldt at gmail

Nov 22, 2009, 3:20 PM

Post #19 of 43 (2672 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Sun, Nov 22, 2009 at 17:57, Anthony <wikimail [at] inbox> wrote:
> Hmm, okay.  Should I be using character set and collation "binary"?

Yes. The main problem with using UTF-8 for the tables is that
MySQL only supports Unicode characters U+0000 .. U+FFFF.
Other characters are silently removed, which leads to problems
with duplicate page titles etc.

See http://dev.mysql.com/doc/refman/5.1/en/charset-unicode.html

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


questpc at rambler

Nov 23, 2009, 12:48 AM

Post #20 of 43 (2635 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

* Ryan Chan <ryanchan404 [at] gmail> [Mon, 23 Nov 2009 00:45:33 +0800]:
> May I ask why still using the 4.0 version? Seems 5.1 above did provide
> much performance enhancements?
>
There was a message at mysql.com site that google performance
ehancements were incorporated into version 5.4.
Dmitriy

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


midom.lists at gmail

Nov 24, 2009, 4:57 PM

Post #21 of 43 (2574 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

Hi!

I was traveling around a bit, missed some of threads entirely!

Ryan writes:

> Any reason I would like to ask is why not use PostgreSQL?

Any reason we should?
> Seems MySQL is not suitable for handling large table (e.g. over few
> GB), I just wonder why wikipedia don't use PostgreSQL?

Is PG more suitable? Last time I looked at it, both engines were using
B+-Trees.

> It should provide better performance.


Do you have any benchmarks on that?

Simetrical writes:
> Heck, Wikipedia hasn't even upgraded to MySQL 4.1, let
> alone a whole different DBMS.

We do have 5.1 servers running in production for quite a while (e.g.
dewiki's lomaria :-)
We were running enwiki slaves on 5.0 too few times :) It is not like
there're any showstoppers for migration at the moment.

Antony writes:
> If it is, you'd probably want to use partitioning


Partitioning makes selects faster only when there's parallel execution
on multiple partitions at once. PG doesn't have that, MySQL doesn't
have that, some commercial PG offsprings (Greenplum?) have it.

Simetrical writes again:
> Let's not have a DBMS flame war here, please.


Oh come on, it has been a while. Nowadays we also need people from
NoSQL camp, telling we should migrate to ultimately scalable erlang-
based key/value/document storages, with lots of javascript map/reduce.

Jona writes:
> The one thing that is slow is builiding the indexes after the data
> has been imported (eight hours or so).

People with not enough of RAM to have efficient b-tree builds can use
either InnoDB Plugin's fast-index-creation, or Tokutek's fractal tree
storage (which is commercial software, but has free license up to 50G,
or for developers, iirc)

Ryan asks:
> May I ask why still using the 4.0 version?


Because it does what we need it to do, is rock-solid and fast enough.
Also because someone was lazy with 5.1 build engineering, but now
there's one nearly ready for production at lp:~wikimedia

> Seems 5.1 above did provide much performance enhancements?

Yes, some of them are same ones we had in our 4.0 builds for years,
others are ones we don't really need. We're read-i/o-constrained and
we're doing quite well at that with our current builds.

> I don't know if it was supposed to be taken as sarcasm or not, but
> Tim Starling recently commented that "it seems that I'm the only
> staff member who knows MySQL." That was a joke, right?

Tim is indeed the only one at the staff who knows really well how to
handle replicated MySQL setups, as well as other advanced MySQL
topics. Apparently it wasn't only WMF staff running Wikipedia's
databases.

Jona comments on utf8:
> Yes. The main problem with using UTF-8 for the tables is that
> MySQL only supports Unicode characters U+0000 .. U+FFFF.
> Other characters are silently removed, which leads to problems
> with duplicate page titles etc.

Actually the main problem with using utf8 is that most of language-
specific collations are case-insensitive, which would mean lots of
pain with case senstive->case insensitive transition (due to how
indexes work, it is relatively difficult to have efficient sorting
order different from equality rules).
And yes, characters outside BMP can be an issue, but we would be
hitting that as a problem only in few page titles.

Dmitry suggests:
> There was a message at mysql.com site that google performance
> ehancements were incorporated into version 5.4.

Google performance enhancements were also incorporated into version
4.0.40. Not all, but most of ones we'd need (I/O related, we're not
really in shape with our datasets where we would care about SMP
performance ;-)

BR,
Domas

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


wikimail at inbox

Nov 24, 2009, 7:17 PM

Post #22 of 43 (2568 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Tue, Nov 24, 2009 at 7:57 PM, Domas Mituzas <midom.lists [at] gmail> wrote:
> Antony writes:
>> If it is, you'd probably want to use partitioning
>
> Partitioning makes selects faster only when there's parallel execution
> on multiple partitions at once.

That's just not at all true, not for PostgreSQL at least. Say you
have 100 million records in the page table, of which 20 million are
is_redirect=1 and 80 million are is_redirect=0. Say the average size
of a record is 100 bytes, so on average 80 records fit in one page.
The table is not clustered, or it is clustered on something other than
is_redirect. If you run select * from page where is_redirect=1 from a
table which is not partitioned, you have to access pretty much all
1.25 million pages. If you partition the table on is_redirect, you
only have to access 250,000 pages.

http://www.postgresql.org/docs/8.1/interactive/ddl-partitioning.html

I have no idea how this works on MySQL, or if it works on MySQL at
all. In MySQL, you could achieve the same thing through clustering,
however.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


wikimail at inbox

Nov 24, 2009, 7:29 PM

Post #23 of 43 (2570 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Tue, Nov 24, 2009 at 10:17 PM, Anthony <wikimail [at] inbox> wrote:
> On Tue, Nov 24, 2009 at 7:57 PM, Domas Mituzas <midom.lists [at] gmail> wrote:
>> Antony writes:
>>> If it is, you'd probably want to use partitioning
>>
>> Partitioning makes selects faster only when there's parallel execution
>> on multiple partitions at once.
> [snip]
> I have no idea how this works on MySQL, or if it works on MySQL at
> all.  In MySQL, you could achieve the same thing through clustering,
> however.
>

http://dev.mysql.com/doc/refman/5.1/en/partitioning-pruning.html

"When the optimizer can make use of partition pruning in performing a
query, execution of the query can be an order of magnitude faster than
the same query against a nonpartitioned table containing the same
column definitions and data. "

I don't see any mention of "parallel execution", and I don't see why
that would be necessary to benefit from partitioning.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


wikimail at inbox

Nov 24, 2009, 8:16 PM

Post #24 of 43 (2566 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

On Tue, Nov 24, 2009 at 7:57 PM, Domas Mituzas <midom.lists [at] gmail> wrote:
> Ryan writes:
>> Any reason I would like to ask is why not use PostgreSQL?
> Any reason we should?

> Actually the main problem with using utf8 is that most of language-
> specific collations are case-insensitive, which would mean lots of
> pain with case senstive->case insensitive transition (due to how
> indexes work, it is relatively difficult to have efficient sorting
> order different from equality rules).

psql (8.4.1)
Type "help" for help.

a=# create table category (title varchar(255) not null primary key);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
"category_pkey" for table "category"
CREATE TABLE
a=# insert into category values ('Apple');
INSERT 0 1
a=# insert into category values ('Banana');
INSERT 0 1
a=# insert into category values ('Pear');
INSERT 0 1
a=# insert into category values ('banana');
INSERT 0 1
a=# insert into category values ('Orange');
INSERT 0 1
a=# insert into category values ('apple');
INSERT 0 1
a=# insert into category values ('Apple');
ERROR: duplicate key value violates unique constraint "category_pkey"
a=# insert into category values ('APPLE');
INSERT 0 1
a=# select * from category order by title;
title
--------
apple
Apple
APPLE
banana
Banana
Orange
Pear
(7 rows)

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


midom.lists at gmail

Nov 24, 2009, 9:39 PM

Post #25 of 43 (2567 views)
Permalink
Re: Size of DB/table of enwiki after import into MySQL [In reply to]

Hi,
> That's just not at all true, not for PostgreSQL at least. Say you
> have 100 million records in the page table, of which 20 million are
> is_redirect=1 and 80 million are is_redirect=0. Say the average size
> of a record is 100 bytes, so on average 80 records fit in one page.
> The table is not clustered, or it is clustered on something other than
> is_redirect. If you run select * from page where is_redirect=1 from a
> table which is not partitioned, you have to access pretty much all
> 1.25 million pages. If you partition the table on is_redirect, you
> only have to access 250,000 pages.

But.... who will partition based on is_redirect? If it is for one-off
task, you can just create two separate tables and do 'manual
partitioning' even in sqlite :)
Even though your is_redirect queries may become faster, you just added
*2 cost for every other index operation (as partitions require you to
loop over all indexes for all the lookups not satisfied by
partitioning key).

> I have no idea how this works on MySQL, or if it works on MySQL at
> all. In MySQL, you could achieve the same thing through clustering,
> however.

Right, indexing can be used to achieve the result, without making
other selects slower (kind of). Thats what indexing is for :)

> http://dev.mysql.com/doc/refman/5.1/en/partitioning-pruning.html
>
> "When the optimizer can make use of partition pruning in performing a
> query, execution of the query can be an order of magnitude faster than
> the same query against a nonpartitioned table containing the same
> column definitions and data. "

This is obvious, dropping a partition is faster than DELETE that has
to go and maintain the index. I'm not talking about DELETE operations,
but SELECTs, you seem to fail at reading there :-)

> a=# insert into category values ('Apple');
> ERROR: duplicate key value violates unique constraint "category_pkey"

And how do native language collations work? (and since which version
are they supported per-database? are they supported per-schema? ;-)
I got somewhat incorrect results once I used lt_LT.UTF-8 for my
'initdb' - and default collation was providing incorrect order too, as
well as unique constraints were not enforcing dictionary-order rules.

postgres=# create database xx encoding 'utf8';
CREATE DATABASE
postgres=# \c xx;
You are now connected to database "xx".
xx=# create table t1 (a varchar(255) primary key);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
"t1_pkey" for table "t1"
CREATE TABLE
xx=# insert into t1 values ('a');
INSERT 0 1
xx=# insert into t1 values ('b');
INSERT 0 1
xx=# insert into t1 values ('Ä…');
INSERT 0 1
xx=# select * from t1 order by a;
a
---
a
b
Ä…
(3 rows)

You guys seem to talk about stuff you never used and never really
understood. Good for you, probably much easier that way.

Domas
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

First page Previous page 1 2 Next page Last page  View All Wikipedia wikitech RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.