Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: DBMail: users

messages table index brainstorming

 

 

DBMail users RSS feed   Index | Next | Previous | View Threaded


lists.hzfw4 at liquidbytes

May 14, 2008, 4:06 AM

Post #1 of 10 (357 views)
Permalink
messages table index brainstorming

Paul J Stevens wrote:
> Michael Mayer wrote:
>> But especially the messages
>> table seems to be over-indexed.
>
> Please propose a better index setup.

An interesting idea for performance improvement of the messages table
would be a PRIMARY KEY consisting of mailbox_idnr and status, as the
primary key serves as clustered index in InnoDB. This saves disk I/O and
speeds up the whole thing, because the typical use case is that the user
wants to download all new (that information is stored in status, right?)
messages from it's mailbox. All these messages would be stored in
clustered data pages. It also slows down writing a bit, because the
table has to be reorganized every time.

But, the primary key is also stored in every row of every index, so it
should be very small. IMHO, this is a shortcoming of MySQL at the moment
and I don't see a quick fix. You should be able to define a clustered
index other than the primary key. On the other hand, it might work
anyways, since InnoDB could use it's own 6 byte row id integer, if it
has to deal with multi-part primary keys. This would even save 2 bytes
compared to the current bigint (8 byte) primary key. In the worst case,
you would end up with a 9 byte primary key.

- The general goal is to reduce data page access as much as possible and
keep data structures small. The current index setup leads to a messages
table where 2/3 of the space is occupied by the index and only 1/3 is
data. That means, if data grows to 1 GB, the table requires 3 GB on the
hard disk.

- Every additional index also degrades write performance a lot.

The bottom line is that table optimization heavily depends on
implementation details of the storage engine and the exact use case.

Michael
_______________________________________________
DBmail mailing list
DBmail[at]dbmail.org
https://mailman.fastxs.nl/mailman/listinfo/dbmail


lists.hzfw4 at liquidbytes

May 14, 2008, 4:50 AM

Post #2 of 10 (342 views)
Permalink
Re: messages table index brainstorming [In reply to]

Michael Mayer wrote:
> An interesting idea for performance improvement of the messages table
> would be a PRIMARY KEY consisting of mailbox_idnr and status, as the

Of course, the physmessage_id must be added to that index as well, so
that it is unique (and the primary key really should be unique). As I
said, it could even make sense to leave out the status, as most
mailboxes can be sorted and filtered in memory. So the final idea is to
use this setup:

PRIMARY KEY(mailbox_idnr, physmessage_id)

UNIQUE(message_idnr)

message_idnr INT AUTO_INCREMENT

mailbox_idnr INT

Maybe INDEX(unique_id), if you need it for something. Delete all the
other indexes on the messages table.

That's it for today, promised,

Michael
_______________________________________________
DBmail mailing list
DBmail[at]dbmail.org
https://mailman.fastxs.nl/mailman/listinfo/dbmail


dbmail at bobich

May 14, 2008, 4:57 AM

Post #3 of 10 (341 views)
Permalink
Re: Re: messages table index brainstorming [In reply to]

I'm still not convinced that over-indexing is harmful in this case. In
general, too many indices is better than too few, and even if the data
fits in memory, an indexed btree search will always be much faster than an
unindexed search. Even if the table only has a few hundred small records,
an indexed search still comes out vastly ahead, in my experience.

But if you can empirically demonstrate that over-indexing signifficantly
slows down the typical (mostly read) workload of DBMail, I'd be most
interested to see it.

Gordan

On Wed, 14 May 2008, Michael Mayer wrote:

> Michael Mayer wrote:
>> An interesting idea for performance improvement of the messages table would
>> be a PRIMARY KEY consisting of mailbox_idnr and status, as the
>
> Of course, the physmessage_id must be added to that index as well, so that it
> is unique (and the primary key really should be unique). As I said, it could
> even make sense to leave out the status, as most mailboxes can be sorted and
> filtered in memory. So the final idea is to use this setup:
>
> PRIMARY KEY(mailbox_idnr, physmessage_id)
>
> UNIQUE(message_idnr)
>
> message_idnr INT AUTO_INCREMENT
>
> mailbox_idnr INT
>
> Maybe INDEX(unique_id), if you need it for something. Delete all the other
> indexes on the messages table.
>
> That's it for today, promised,
>
> Michael
> _______________________________________________
> DBmail mailing list
> DBmail[at]dbmail.org
> https://mailman.fastxs.nl/mailman/listinfo/dbmail
>
_______________________________________________
DBmail mailing list
DBmail[at]dbmail.org
https://mailman.fastxs.nl/mailman/listinfo/dbmail


lists.hzfw4 at liquidbytes

May 14, 2008, 5:17 AM

Post #4 of 10 (339 views)
Permalink
Re: Re: messages table index brainstorming [In reply to]

dbmail[at]bobich.net wrote:
> I'm still not convinced that over-indexing is harmful in this case. In
> general, too many indices is better than too few, and even if the data
> fits in memory, an indexed btree search will always be much faster than
> an unindexed search. Even if the table only has a few hundred small
> records, an indexed search still comes out vastly ahead, in my experience.

The point is, that an index normally holds a pointer to a data page in
which the row can be found. So, the whole page must be read and searched
anyways. Plus, the index page must be read. That means you have to read
32K of data, to find 130 bytes of data (when selecting a single row). It
might vary depending on the exact implementation.

In contrast, storing the messages in a very ineffectively clustered way,
like it currently seems to be, really costs read performance. It's not
that there are not enough indexes, but you must also use the right ones.
It seems like you prefer to define an infinite amount of indexes in the
hope that you hit the right one sooner or later ;)

Another point is, that the MySQL optimizer might not be as mature as the
one of Oracle (which should be pretty sophisticated) - it makes mistakes
sometimes and especially choosing the right index can be a difficult task.

Michael
_______________________________________________
DBmail mailing list
DBmail[at]dbmail.org
https://mailman.fastxs.nl/mailman/listinfo/dbmail


dbmail at bobich

May 14, 2008, 5:28 AM

Post #5 of 10 (341 views)
Permalink
Re: Re: messages table index brainstorming [In reply to]

On Wed, 14 May 2008, Michael Mayer wrote:

> In contrast, storing the messages in a very ineffectively clustered way, like
> it currently seems to be, really costs read performance. It's not that there
> are not enough indexes, but you must also use the right ones. It seems like
> you prefer to define an infinite amount of indexes in the hope that you hit
> the right one sooner or later ;)

In general, I find that defining suitable indices for all the frequently
executed queries is a pretty good starting point. If the performance isn't
holding up, the chances are that there is an unindexed query somewhere
that is causing the problem.

> Another point is, that the MySQL optimizer might not be as mature as the one
> of Oracle (which should be pretty sophisticated) - it makes mistakes
> sometimes and especially choosing the right index can be a difficult task.

While I acknowledge that there is the occasional bug that causes the
optimizer to get it wrong, it has been a fair few years since I've run
into such a condition on MySQL or PostgreSQL. And the problem isn't
generally that hard - using the index with the highest cardinality of the
available ones is a pretty good heuristic.

There are also a couple of other things to consider:
1) IIRC, MySQL will not bother looking up the data page if all the data it
needs is available in the index already, so pruning the selected fields
can be useful.

2) I'm not sure if this is still the case, but a while back MySQL couldn't
use multiple indices per table per query. That means that there was
performance could be improved by making a multi-column index. Ordering of
the index is also important here. The first element should be the one that
is often used on it's own (a multi-column index can be used when only the
first column is required for look-up), and as a secondary optimization,
the ordering should be in the descending cardinality order.

Having said that, by the time you getting to this level of optimization,
you are definitely in the diminishing returns territory. Single-column
indices on frequently used fields will generally get you most of the way
there in the first pass.

Gordan
_______________________________________________
DBmail mailing list
DBmail[at]dbmail.org
https://mailman.fastxs.nl/mailman/listinfo/dbmail


lists.hzfw4 at liquidbytes

May 14, 2008, 9:36 AM

Post #6 of 10 (340 views)
Permalink
Re: Re: messages table index brainstorming [In reply to]

dbmail[at]bobich.net wrote:
> But if you can empirically demonstrate that over-indexing signifficantly
> slows down the typical (mostly read) workload of DBMail, I'd be most
> interested to see it.

I just decided to test it... maybe on the weekend :)

The only question is how to set up the test environment, because my own
dbmail database is too small. Can somebody give me a hint how to produce
test data or even send a dump?

What are the 5 most common queries for an average dbmail server?

Michael
_______________________________________________
DBmail mailing list
DBmail[at]dbmail.org
https://mailman.fastxs.nl/mailman/listinfo/dbmail


dbmail at bobich

May 14, 2008, 9:42 AM

Post #7 of 10 (341 views)
Permalink
Re: Re: messages table index brainstorming [In reply to]

I wish this came up 3 weeks ago. Just before I migrated to dbmail, I
purged all of my spam - all 2GB of it (100K-150K messages, IIRC). I
figured there'd be no point in migrating it. It'd have made for a good
test case. :(

Gordan

On Wed, 14 May 2008, Michael Mayer wrote:

> dbmail[at]bobich.net wrote:
>> But if you can empirically demonstrate that over-indexing signifficantly
>> slows down the typical (mostly read) workload of DBMail, I'd be most
>> interested to see it.
>
> I just decided to test it... maybe on the weekend :)
>
> The only question is how to set up the test environment, because my own
> dbmail database is too small. Can somebody give me a hint how to produce test
> data or even send a dump?
>
> What are the 5 most common queries for an average dbmail server?
>
> Michael
> _______________________________________________
> DBmail mailing list
> DBmail[at]dbmail.org
> https://mailman.fastxs.nl/mailman/listinfo/dbmail
>
_______________________________________________
DBmail mailing list
DBmail[at]dbmail.org
https://mailman.fastxs.nl/mailman/listinfo/dbmail


paul at nfg

May 14, 2008, 11:32 AM

Post #8 of 10 (342 views)
Permalink
Re: Re: messages table index brainstorming [In reply to]

Michael Mayer wrote:
> dbmail[at]bobich.net wrote:
>> But if you can empirically demonstrate that over-indexing
>> signifficantly slows down the typical (mostly read) workload of
>> DBMail, I'd be most interested to see it.
>
> I just decided to test it... maybe on the weekend :)
>
> The only question is how to set up the test environment, because my own
> dbmail database is too small. Can somebody give me a hint how to produce
> test data or even send a dump?

If you have a large set of email from various sources
(spam/ham/plain-text/attachments) you should insert those into dbmail
using something like:

#!/bin/sh
i=1; j=1;
while [ $i -lt 100 ]; do
while [ $j -lt 100 ]; do
cat somebigmailbox.mbox|\
formail -ds dbmail-smtp -u testuser$i -m mailbox$j
let j=$j+1
done
let i=$i+1
done

that should produce a nice set of mailboxes spread acros multiple accounts.

> What are the 5 most common queries for an average dbmail server?

Difficult. I don't have any numbers. Mostly depends on the most
prevalent clients. You can find some of the most used IMAP commands for
TB and Outlook in test-scripts/textimap.py in the testFetch testcase.

--
________________________________________________________________
Paul Stevens paul at nfg.nl
NET FACILITIES GROUP GPG/PGP: 1024D/11F8CD31
The Netherlands________________________________http://www.nfg.nl
_______________________________________________
DBmail mailing list
DBmail[at]dbmail.org
https://mailman.fastxs.nl/mailman/listinfo/dbmail


lists.hzfw4 at liquidbytes

May 14, 2008, 12:20 PM

Post #9 of 10 (339 views)
Permalink
Re: Re: messages table index brainstorming [In reply to]

Paul J Stevens wrote:
> formail -ds dbmail-smtp -u testuser$i -m mailbox$j

Do I have to add those 100 testusers and mailboxes first or will that
happen automatically for some reason?

As far as I understood the script, it will add each mail to 100
mailboxes shared by 100 users, so it will create 10.000 mails for each
input email, right? Or is dbmail smart enough to just create 100
database rows (one for each mailbox or user), because the mail body is
100 times the same in each mailbox? I saw the md5 hashes and that might
be, what they are good for.

Michael
_______________________________________________
DBmail mailing list
DBmail[at]dbmail.org
https://mailman.fastxs.nl/mailman/listinfo/dbmail


paul at nfg

May 14, 2008, 2:06 PM

Post #10 of 10 (341 views)
Permalink
Re: Re: messages table index brainstorming [In reply to]

Michael Mayer wrote:
> Paul J Stevens wrote:
>> formail -ds dbmail-smtp -u testuser$i -m mailbox$j
>
> Do I have to add those 100 testusers and mailboxes first or will that
> happen automatically for some reason?

users are not created automatically, mailboxes are if you use the pipe
interface (dbmail-smtp aka dbmail-deliver).

>
> As far as I understood the script, it will add each mail to 100
> mailboxes shared by 100 users, so it will create 10.000 mails for each
> input email, right? Or is dbmail smart enough to just create 100
> database rows (one for each mailbox or user), because the mail body is
> 100 times the same in each mailbox? I saw the md5 hashes and that might
> be, what they are good for.

The idea is to create 100 users with each 100 mailboxes, and fill each
of these 10000 mailboxes with the number of messages you have in your
test mbox.

If you have 1000 messages that will result in:
10e6 rows in dbmail_messages (one for each message)
and beyond that are the number of rows in the other tables, depending on
the storage data model which differs greatly between 2.2 and 2.3.

The messages table contains 'virtual' messages are represented as a
'message' in a 'mailbox'. They do contain a reference to a 'real'
message as stored in 'messageblks' until 2.2. The md5 unique_id in the
messages table is a field only used by POP3 (and targetted for removal
for being redundant).

In 2.3+ the actual messages are fully recursively dissected into their
constituant mimeparts. Mimeparts are stored as a single-instance rows in
dbmail_mimeparts. The hash field in that table is used to quickly
located possible duplicates to prevent them from being stored more than
once. The actual comparison being byte by byte.

The tables that are prime targets for improvement - next to
dbmail_messages - are dbmail_mailboxes since it's used in just about all
the joins that hit dbmail_messages, and dbmail_headervalue because it
grows too big since each header for each real messages.



--
________________________________________________________________
Paul Stevens paul at nfg.nl
NET FACILITIES GROUP GPG/PGP: 1024D/11F8CD31
The Netherlands________________________________http://www.nfg.nl
_______________________________________________
DBmail mailing list
DBmail[at]dbmail.org
https://mailman.fastxs.nl/mailman/listinfo/dbmail

DBMail users RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact lists@gossamer-threads.com
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.