Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: DBMail: users

marking a mailbox as read is slow, suggesting improvement

 

 

DBMail users RSS feed   Index | Next | Previous | View Threaded


michael.monnerie at is

Mar 3, 2009, 3:47 PM

Post #1 of 4 (505 views)
Permalink
marking a mailbox as read is slow, suggesting improvement

I just had a mailbox with 5k messages, about 4k unread, which was quite slow.
So I tested with a smaller mailbox: In kmail 1.10.3, I say
"mark whole mailbox as read". That generates these commands:

15232 UID STORE 3651501,3651521,3651555,3651557,3651579,3651706,3651745 -FLAGS.SILENT (\SEEN \ANSWERED \FLAGGED \DRAFT)
15233 UID STORE 3651501,3651521,3651555,3651557,3651579,3651706,3651745 +FLAGS.SILENT (\SEEN)

Those are the queries generated to Postgres:

2009-03-04 00:26:48 CET pid=4609 25/120872 idle: LOG: statement: SELECT message_idnr FROM dbmail_messages WHERE mailbox_idnr = 3324 AND status IN (0,1) ORDER BY message_idnr
2009-03-04 00:26:48 CET pid=4609 25/120873 idle: LOG: statement: SELECT message_idnr FROM dbmail_messages WHERE mailbox_idnr = 3324 AND status IN (0,1) ORDER BY message_idnr
2009-03-04 00:26:48 CET pid=4609 25/120874 idle: LOG: statement: SELECT permission,seen_flag,answered_flag,deleted_flag,flagged_flag,recent_flag,draft_flag FROM dbmail_mailboxes WHERE mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120875 idle: LOG: statement: SELECT 'a',COUNT(*) FROM dbmail_messages WHERE mailbox_idnr=3324 AND (status < 2) UNION SELECT 'b',COUNT(*) FROM dbmail_messages WHERE mailbox_idnr=3324 AND (status < 2)
AND seen_flag=1 UNION SELECT 'c',COUNT(*) FROM dbmail_messages WHERE mailbox_idnr=3324 AND (status < 2) AND recent_flag=1
2009-03-04 00:26:48 CET pid=4609 25/120876 idle: LOG: statement: SELECT max(message_idnr)+1 FROM dbmail_messages WHERE mailbox_idnr=3324
2009-03-04 00:26:48 CET pid=4609 25/120877 idle: LOG: statement: SELECT seen_flag, answered_flag, deleted_flag, flagged_flag, draft_flag, recent_flag, TO_CHAR(internal_date, 'YYYY-MM-DD HH24:MI:SS' ), rfcsize, message_idnr FROM dbmail_
messages msg, dbmail_physmessage pm WHERE pm.id = msg.physmessage_id AND message_idnr BETWEEN 3635587 AND 3651745 AND mailbox_idnr = 3324 AND status IN (0,1,2) ORDER BY message_idnr ASC
2009-03-04 00:26:48 CET pid=4609 25/120878 idle: LOG: statement: SELECT permission,seen_flag,answered_flag,deleted_flag,flagged_flag,recent_flag,draft_flag FROM dbmail_mailboxes WHERE mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120879 idle: LOG: statement: SELECT permission,seen_flag,answered_flag,deleted_flag,flagged_flag,recent_flag,draft_flag FROM dbmail_mailboxes WHERE mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120880 idle: LOG: statement: SELECT seen_flag, answered_flag, deleted_flag, flagged_flag, draft_flag, recent_flag, TO_CHAR(internal_date, 'YYYY-MM-DD HH24:MI:SS' ), rfcsize, message_idnr FROM dbmail_
messages msg, dbmail_physmessage pm WHERE pm.id = msg.physmessage_id AND message_idnr BETWEEN 3635587 AND 3651745 AND mailbox_idnr = 3324 AND status IN (0,1,2) ORDER BY message_idnr ASC

(are all those SELECTs really neccessary until here?)

2009-03-04 00:26:48 CET pid=4609 25/120881 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651501 AND status < 2 AND mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120882 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651521 AND status < 2 AND mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120883 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651555 AND status < 2 AND mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120884 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651557 AND status < 2 AND mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120885 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651579 AND status < 2 AND mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120886 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651706 AND status < 2 AND mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120887 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651745 AND status < 2 AND mailbox_idnr = 3324
(here seems to be the 2nd commands start)
2009-03-04 00:26:48 CET pid=4609 25/120888 idle: LOG: statement: SELECT permission,seen_flag,answered_flag,deleted_flag,flagged_flag,recent_flag,draft_flag FROM dbmail_mailboxes WHERE mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120889 idle: LOG: statement: SELECT seen_flag, answered_flag, deleted_flag, flagged_flag, draft_flag, recent_flag, TO_CHAR(internal_date, 'YYYY-MM-DD HH24:MI:SS' ), rfcsize, message_idnr FROM dbmail_
messages msg, dbmail_physmessage pm WHERE pm.id = msg.physmessage_id AND message_idnr BETWEEN 3635587 AND 3651745 AND mailbox_idnr = 3324 AND status IN (0,1,2) ORDER BY message_idnr ASC
(again, why are those 2 SELECTs needed?)
2009-03-04 00:26:48 CET pid=4609 25/120890 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651501 AND status < 2 AND mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120891 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651521 AND status < 2 AND mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120892 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651555 AND status < 2 AND mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120893 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651557 AND status < 2 AND mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120894 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651579 AND status < 2 AND mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120895 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651706 AND status < 2 AND mailbox_idnr = 3324
2009-03-04 00:26:48 CET pid=4609 25/120896 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651745 AND status < 2 AND mailbox_idnr = 3324

Now I know that development to dbmail-2.2 has stopped, but I guess the
process is not a lot different in 2.3 so I want to suggest two improvements:

1) Currently for *every single* message a separate UPDATE is done,
this could easily be replaced by a multi-message UPDATE. And I
guess lots of SELECTs could be dropped also.

dbmail=> EXPLAIN UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr IN (3651501,3651521,3651555) AND status < 2 AND mailbox_idnr = 3324;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on dbmail_messages (cost=19.22..23.24 rows=1 width=67)
Recheck Cond: ((mailbox_idnr = 3324) AND (message_idnr = ANY ('{3651501,3651521,3651555}'::bigint[])))
Filter: (status < 2)
-> BitmapAnd (cost=19.22..19.22 rows=1 width=0)
-> Bitmap Index Scan on dbmail_messages_1 (cost=0.00..6.17 rows=253 width=0)
Index Cond: (mailbox_idnr = 3324)
-> Bitmap Index Scan on dbmail_messages_pkey (cost=0.00..12.80 rows=3 width=0)
Index Cond: (message_idnr = ANY ('{3651501,3651521,3651555}'::bigint[]))
(8 Zeilen)

But I found that leaving "AND mailbox_idnr = 3324" is not needed anyway, and makes the query slower,
so let it away:

dbmail=> EXPLAIN UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr IN (3651501,3651521,3651555) AND status < 2;
QUERY PLAN
------------------------------------------------------------------------------------
Bitmap Heap Scan on dbmail_messages (cost=12.80..24.56 rows=3 width=67)
Recheck Cond: (message_idnr = ANY ('{3651501,3651521,3651555}'::bigint[]))
Filter: (status < 2)
-> Bitmap Index Scan on dbmail_messages_pkey (cost=0.00..12.80 rows=3 width=0)
Index Cond: (message_idnr = ANY ('{3651501,3651521,3651555}'::bigint[]))
(5 Zeilen)

2)
When an IMAP command "UID STORE" arrives, could dbmail wait
to see if a 2nd "UID STORE" arrives next? Currently there are
2 updates, first setting all flags=0, later seen=1. That could
be grouped. Doesn't matter with only some users, but once
you have 500+ concurrent users, or WAL logs (which generate
a record for every single transaction), there's a big difference.


mfg zmi
--
// Michael Monnerie, Ing.BSc ----- http://it-management.at
// Tel: 0660 / 415 65 31 .network.your.ideas.
// PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38 500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net Key-ID: 1C1209B4

_______________________________________________
DBmail mailing list
DBmail[at]dbmail.org
https://mailman.fastxs.nl/mailman/listinfo/dbmail


paul at nfg

Mar 4, 2009, 1:48 PM

Post #2 of 4 (465 views)
Permalink
Re: marking a mailbox as read is slow, suggesting improvement [In reply to]

Michael Monnerie wrote:
> I just had a mailbox with 5k messages, about 4k unread, which was quite slow.
> So I tested with a smaller mailbox: In kmail 1.10.3, I say
> "mark whole mailbox as read". That generates these commands:
>
> 15232 UID STORE 3651501,3651521,3651555,3651557,3651579,3651706,3651745 -FLAGS.SILENT (\SEEN \ANSWERED \FLAGGED \DRAFT)
> 15233 UID STORE 3651501,3651521,3651555,3651557,3651579,3651706,3651745 +FLAGS.SILENT (\SEEN)
>
> Those are the queries generated to Postgres:
>
> 2009-03-04 00:26:48 CET pid=4609 25/120872 idle: LOG: statement: SELECT message_idnr FROM dbmail_messages WHERE mailbox_idnr = 3324 AND status IN (0,1) ORDER BY message_idnr
> 2009-03-04 00:26:48 CET pid=4609 25/120873 idle: LOG: statement: SELECT message_idnr FROM dbmail_messages WHERE mailbox_idnr = 3324 AND status IN (0,1) ORDER BY message_idnr
> 2009-03-04 00:26:48 CET pid=4609 25/120874 idle: LOG: statement: SELECT permission,seen_flag,answered_flag,deleted_flag,flagged_flag,recent_flag,draft_flag FROM dbmail_mailboxes WHERE mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120875 idle: LOG: statement: SELECT 'a',COUNT(*) FROM dbmail_messages WHERE mailbox_idnr=3324 AND (status < 2) UNION SELECT 'b',COUNT(*) FROM dbmail_messages WHERE mailbox_idnr=3324 AND (status < 2)
> AND seen_flag=1 UNION SELECT 'c',COUNT(*) FROM dbmail_messages WHERE mailbox_idnr=3324 AND (status < 2) AND recent_flag=1
> 2009-03-04 00:26:48 CET pid=4609 25/120876 idle: LOG: statement: SELECT max(message_idnr)+1 FROM dbmail_messages WHERE mailbox_idnr=3324
> 2009-03-04 00:26:48 CET pid=4609 25/120877 idle: LOG: statement: SELECT seen_flag, answered_flag, deleted_flag, flagged_flag, draft_flag, recent_flag, TO_CHAR(internal_date, 'YYYY-MM-DD HH24:MI:SS' ), rfcsize, message_idnr FROM dbmail_
> messages msg, dbmail_physmessage pm WHERE pm.id = msg.physmessage_id AND message_idnr BETWEEN 3635587 AND 3651745 AND mailbox_idnr = 3324 AND status IN (0,1,2) ORDER BY message_idnr ASC
> 2009-03-04 00:26:48 CET pid=4609 25/120878 idle: LOG: statement: SELECT permission,seen_flag,answered_flag,deleted_flag,flagged_flag,recent_flag,draft_flag FROM dbmail_mailboxes WHERE mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120879 idle: LOG: statement: SELECT permission,seen_flag,answered_flag,deleted_flag,flagged_flag,recent_flag,draft_flag FROM dbmail_mailboxes WHERE mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120880 idle: LOG: statement: SELECT seen_flag, answered_flag, deleted_flag, flagged_flag, draft_flag, recent_flag, TO_CHAR(internal_date, 'YYYY-MM-DD HH24:MI:SS' ), rfcsize, message_idnr FROM dbmail_
> messages msg, dbmail_physmessage pm WHERE pm.id = msg.physmessage_id AND message_idnr BETWEEN 3635587 AND 3651745 AND mailbox_idnr = 3324 AND status IN (0,1,2) ORDER BY message_idnr ASC
>
> (are all those SELECTs really neccessary until here?)

Yes. They are all part of the 'open mailbox and fetch meta-data'
sequence. That some queries are called twice can only be explained by
kmail opening two seperate imap sessions. If that is not the case I'd
have to see this from dbmail's logs.


>
> 2009-03-04 00:26:48 CET pid=4609 25/120881 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651501 AND status < 2 AND mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120882 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651521 AND status < 2 AND mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120883 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651555 AND status < 2 AND mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120884 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651557 AND status < 2 AND mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120885 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651579 AND status < 2 AND mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120886 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651706 AND status < 2 AND mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120887 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651745 AND status < 2 AND mailbox_idnr = 3324

Yep, that's the first STORE command.

> (here seems to be the 2nd commands start)


> 2009-03-04 00:26:48 CET pid=4609 25/120888 idle: LOG: statement: SELECT permission,seen_flag,answered_flag,deleted_flag,flagged_flag,recent_flag,draft_flag FROM dbmail_mailboxes WHERE mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120889 idle: LOG: statement: SELECT seen_flag, answered_flag, deleted_flag, flagged_flag, draft_flag, recent_flag, TO_CHAR(internal_date, 'YYYY-MM-DD HH24:MI:SS' ), rfcsize, message_idnr FROM dbmail_
> messages msg, dbmail_physmessage pm WHERE pm.id = msg.physmessage_id AND message_idnr BETWEEN 3635587 AND 3651745 AND mailbox_idnr = 3324 AND status IN (0,1,2) ORDER BY message_idnr ASC
> (again, why are those 2 SELECTs needed?)

My guess is that the first STORE command has updated the
dbmail_mailboxes.seq field, which (partially) invalidates the internally
cached view of the mailbox.

> 2009-03-04 00:26:48 CET pid=4609 25/120890 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651501 AND status < 2 AND mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120891 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651521 AND status < 2 AND mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120892 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651555 AND status < 2 AND mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120893 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651557 AND status < 2 AND mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120894 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651579 AND status < 2 AND mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120895 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651706 AND status < 2 AND mailbox_idnr = 3324
> 2009-03-04 00:26:48 CET pid=4609 25/120896 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651745 AND status < 2 AND mailbox_idnr = 3324
>
> Now I know that development to dbmail-2.2 has stopped, but I guess the
> process is not a lot different in 2.3 so I want to suggest two improvements:
>
> 1) Currently for *every single* message a separate UPDATE is done,
> this could easily be replaced by a multi-message UPDATE. And I
> guess lots of SELECTs could be dropped also.

Just because something looks 'obvious' from an sql perspective doesn't
mean it's doable without massive refactoring, lots and lots of testing,
etc.... In this particular case, the same pattern could be applied that
is currently in use in dbmail_imap_session_mailbox_update_recent.

Part of the problem lies with kmail, which apparently isn't able to
issue a simple range-uid command where

x uid store 123456:234567 +flags (\Seen)

*does* result in single-update queries. That was one of the first things
I did when I started working on dbmail back in 2004, iirc.

For single-UID commands something similar can still be done. In
dbmail_imap_session_mailbox_update_recent \Recent flags are updated for
possibly large sets of messages. In that case, the whole set of UID's is
sliced into groups which are send as IN(n1,n2, ...nN) update queries.

However, whereas this might result in fewer queries, this pattern
precludes the possibility of using prepared statements that would reduce
hits on the query-parser/optimizer downstream in the database.

Common sense would dictate that fewer queries are more efficient, but is
that *really* the case?


>
> dbmail=> EXPLAIN UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr IN (3651501,3651521,3651555) AND status < 2 AND mailbox_idnr = 3324;
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on dbmail_messages (cost=19.22..23.24 rows=1 width=67)
> Recheck Cond: ((mailbox_idnr = 3324) AND (message_idnr = ANY ('{3651501,3651521,3651555}'::bigint[])))
> Filter: (status < 2)
> -> BitmapAnd (cost=19.22..19.22 rows=1 width=0)
> -> Bitmap Index Scan on dbmail_messages_1 (cost=0.00..6.17 rows=253 width=0)
> Index Cond: (mailbox_idnr = 3324)
> -> Bitmap Index Scan on dbmail_messages_pkey (cost=0.00..12.80 rows=3 width=0)
> Index Cond: (message_idnr = ANY ('{3651501,3651521,3651555}'::bigint[]))
> (8 Zeilen)
>
> But I found that leaving "AND mailbox_idnr = 3324" is not needed anyway, and makes the query slower,
> so let it away:
>
> dbmail=> EXPLAIN UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr IN (3651501,3651521,3651555) AND status < 2;
> QUERY PLAN
> ------------------------------------------------------------------------------------
> Bitmap Heap Scan on dbmail_messages (cost=12.80..24.56 rows=3 width=67)
> Recheck Cond: (message_idnr = ANY ('{3651501,3651521,3651555}'::bigint[]))
> Filter: (status < 2)
> -> Bitmap Index Scan on dbmail_messages_pkey (cost=0.00..12.80 rows=3 width=0)
> Index Cond: (message_idnr = ANY ('{3651501,3651521,3651555}'::bigint[]))
> (5 Zeilen)

Agreed. The where clause is a remnant from the (my)isam days (<2.0)
where key restraints were non-existant. They are completely redundant.
Feel like trying your hand at fixing that?

> 2)
> When an IMAP command "UID STORE" arrives, could dbmail wait
> to see if a 2nd "UID STORE" arrives next? Currently there are
> 2 updates, first setting all flags=0, later seen=1. That could
> be grouped. Doesn't matter with only some users, but once
> you have 500+ concurrent users, or WAL logs (which generate
> a record for every single transaction), there's a big difference.

That may be possible in some situations, but in this case you have to
avoid being naive in assuming only one single user is accessing that
particular mailbox at that time. Those STORE commands are atomic, and
the changes MUST be accessible in other concurrently connected sessions.
But we might get away by assuming such concurrent clients are all
connected to the same imapd process. If not we'd have to use some
synchronisation channel between different imapd processes using the same
database backend.

It all comes down to keep pushing the code and design and look for ways
to reduce cost of operation - be that for system calls like Artem is
pushing for, or database calls as you are.



--
________________________________________________________________
Paul Stevens paul at nfg.nl
NET FACILITIES GROUP GPG/PGP: 1024D/11F8CD31
The Netherlands________________________________http://www.nfg.nl
_______________________________________________
DBmail mailing list
DBmail[at]mailman.fastxs.nl
http://mailman.fastxs.nl/cgi-bin/mailman/listinfo/dbmail


michael.monnerie at is

Mar 9, 2009, 3:12 AM

Post #3 of 4 (446 views)
Permalink
Re: marking a mailbox as read is slow, suggesting improvement [In reply to]

On Mittwoch 04 März 2009 Paul J Stevens wrote:
> Yes. They are all part of the 'open mailbox and fetch meta-data'
> sequence. That some queries are called twice can only be explained by
> kmail opening two seperate imap sessions. If that is not the case I'd
> have to see this from dbmail's logs.

OK. Do you want me to do that? level 5 I guess?

> Just because something looks 'obvious' from an sql perspective
> doesn't mean it's doable without massive refactoring, lots and lots
> of testing, etc.... In this particular case, the same pattern could
> be applied that is currently in use in
> dbmail_imap_session_mailbox_update_recent.

Right. I did not look at dbmails code. This should be a hint, and maybe
an improvement can be found. If not, OK - then we are at optimal speed
already. Just trying to improve performance :-)

> Part of the problem lies with kmail, which apparently isn't able to
> issue a simple range-uid command where
>
> x uid store 123456:234567 +flags (\Seen)
>
> *does* result in single-update queries. That was one of the first
> things I did when I started working on dbmail back in 2004, iirc.

I see, but using UID STORE
3651501,3651521,3651555,3651557,3651579,3651706,3651745 -FLAGS.SILENT
(\SEEN \ANSWERED \FLAGGED \DRAFT)
doesn't look too bad to me. After all, all the message-IDs are here in
one single statement. Do all the other IMAP clients use ranges like
123456:234567? I doubt. If there are more clients than just kmail using
the form 1,2,3,4,5 then it should be worth the effort of using that in
the resulting queries to the database (...WHERE message_idnr IN
(1,2,3,4,5); )

> For single-UID commands something similar can still be done. In
> dbmail_imap_session_mailbox_update_recent \Recent flags are updated
> for possibly large sets of messages. In that case, the whole set of
> UID's is sliced into groups which are send as IN(n1,n2, ...nN) update
> queries.

Yes, I saw some grouped UPDATEs in the logs. Very good.

> However, whereas this might result in fewer queries, this pattern
> precludes the possibility of using prepared statements that would
> reduce hits on the query-parser/optimizer downstream in the database.
>
> Common sense would dictate that fewer queries are more efficient, but
> is that *really* the case?

Yes, the bigger the user count the more efficient it gets. With only
several users, I think the difference is peanuts, as today's hardware
does that fast enough, with caching and so on. But once you reach 100+
users running queries at the same time, plus several messages/s
arriving, every single transaction becomes more costly. So if there's a
db function where you just throw in the message_idnr's and it updates
them all at once, it is quicker than using 10x and update with 10
message_idnr's each. Thats 10 transactions instead one, meaning your db
engine must be able to do 10x more transactions/s. At some point (number
of concurrent users) transactions start to become expensive.

> > dbmail=> EXPLAIN UPDATE dbmail_messages SET recent_flag=0,
> > seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE
> > message_idnr IN (3651501,3651521,3651555) AND status < 2; QUERY
> > PLAN
> > -------------------------------------------------------------------
> >----------------- Bitmap Heap Scan on dbmail_messages
> > (cost=12.80..24.56 rows=3 width=67) Recheck Cond: (message_idnr =
> > ANY ('{3651501,3651521,3651555}'::bigint[])) Filter: (status < 2)
> > -> Bitmap Index Scan on dbmail_messages_pkey (cost=0.00..12.80
> > rows=3 width=0) Index Cond: (message_idnr = ANY
> > ('{3651501,3651521,3651555}'::bigint[])) (5 Zeilen)
>
> Agreed. The where clause is a remnant from the (my)isam days (<2.0)
> where key restraints were non-existant. They are completely
> redundant. Feel like trying your hand at fixing that?

OK, a quick grep didn't let me find where that query is generated. Can
you guide me? (maybe we should move this thread to -dev?)

> > 2)
> > When an IMAP command "UID STORE" arrives, could dbmail wait
> > to see if a 2nd "UID STORE" arrives next? Currently there are
> > 2 updates, first setting all flags=0, later seen=1. That could
> > be grouped. Doesn't matter with only some users, but once
> > you have 500+ concurrent users, or WAL logs (which generate
> > a record for every single transaction), there's a big difference.
>
> That may be possible in some situations, but in this case you have to
> avoid being naive in assuming only one single user is accessing that
> particular mailbox at that time. Those STORE commands are atomic, and
> the changes MUST be accessible in other concurrently connected
> sessions. But we might get away by assuming such concurrent clients
> are all connected to the same imapd process. If not we'd have to use
> some synchronisation channel between different imapd processes using
> the same database backend.

Right. If STOREs must be atomic, forget about caching it. Syncing
between imap processes doesn't sound like something implemented in an
afternoon ;-)

> It all comes down to keep pushing the code and design and look for
> ways to reduce cost of operation - be that for system calls like
> Artem is pushing for, or database calls as you are.

Yes, and I'd like to help here, for that part that I can help with.

mfg zmi
--
// Michael Monnerie, Ing.BSc ----- http://it-management.at
// Tel: 0660 / 415 65 31 .network.your.ideas.
// PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38 500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net Key-ID: 1C1209B4

_______________________________________________
DBmail mailing list
DBmail[at]dbmail.org
http://mailman.fastxs.nl/cgi-bin/mailman/listinfo/dbmail


paul at nfg

Mar 10, 2009, 1:21 AM

Post #4 of 4 (444 views)
Permalink
Re: marking a mailbox as read is slow, suggesting improvement [In reply to]

Michael Monnerie wrote:
> On Mittwoch 04 März 2009 Paul J Stevens wrote:
>> Yes. They are all part of the 'open mailbox and fetch meta-data'
>> sequence. That some queries are called twice can only be explained by
>> kmail opening two seperate imap sessions. If that is not the case I'd
>> have to see this from dbmail's logs.
>
> OK. Do you want me to do that? level 5 I guess?

Yes please. But only if there are not two separate sessions accessing
the same mailbox.

>> Just because something looks 'obvious' from an sql perspective
>> doesn't mean it's doable without massive refactoring, lots and lots
>> of testing, etc.... In this particular case, the same pattern could
>> be applied that is currently in use in
>> dbmail_imap_session_mailbox_update_recent.
>
> Right. I did not look at dbmails code. This should be a hint, and maybe
> an improvement can be found. If not, OK - then we are at optimal speed
> already. Just trying to improve performance :-)

I've looked into the imap-store code over the weekend, and it will be
far from trivial to optimize for less queries. The same store command
doesn't apply equally to all messages affected since every message has
it's own individual set of flags and labels. Nothing is impossible of
course, but it's not exactly low-hanging fruit.

>> Part of the problem lies with kmail, which apparently isn't able to
>> issue a simple range-uid command where
>>
>> x uid store 123456:234567 +flags (\Seen)
>>
>> *does* result in single-update queries. That was one of the first
>> things I did when I started working on dbmail back in 2004, iirc.

I was wrong there. For store single-update queries are never issued. At
least not for 2.3 which was heavily tuned for imap compliance.

> Yes, I saw some grouped UPDATEs in the logs. Very good.

I did wrap changes on message due to store commands in transactions to
make them atomic.

> At some point (number
> of concurrent users) transactions start to become expensive.

Understood.

>>> dbmail=> EXPLAIN UPDATE dbmail_messages SET recent_flag=0,
>>> seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE
>>> message_idnr IN (3651501,3651521,3651555) AND status < 2; QUERY
>>> PLAN

> OK, a quick grep didn't let me find where that query is generated. Can
> you guide me? (maybe we should move this thread to -dev?)

I have no idea. It's not update_recent, and it's not _ic_store. I can't
find it. What version would that be?


> Right. If STOREs must be atomic, forget about caching it. Syncing
> between imap processes doesn't sound like something implemented in an
> afternoon ;-)

That is what the database is currently used for. I know: smart-gui
anti-pattern.

>
>> It all comes down to keep pushing the code and design and look for
>> ways to reduce cost of operation - be that for system calls like
>> Artem is pushing for, or database calls as you are.
>
> Yes, and I'd like to help here, for that part that I can help with.

You're already helping us tune the schema, and find overly expensive
query patterns.

--
________________________________________________________________
Paul Stevens paul at nfg.nl
NET FACILITIES GROUP GPG/PGP: 1024D/11F8CD31
The Netherlands________________________________http://www.nfg.nl
_______________________________________________
DBmail mailing list
DBmail[at]dbmail.org
http://mailman.fastxs.nl/cgi-bin/mailman/listinfo/dbmail

DBMail users RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact lists@gossamer-threads.com
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.