
paul at nfg
Mar 4, 2009, 1:48 PM
Post #2 of 4
(468 views)
Permalink
|
|
Re: marking a mailbox as read is slow, suggesting improvement
[In reply to]
|
|
Michael Monnerie wrote: > I just had a mailbox with 5k messages, about 4k unread, which was quite slow. > So I tested with a smaller mailbox: In kmail 1.10.3, I say > "mark whole mailbox as read". That generates these commands: > > 15232 UID STORE 3651501,3651521,3651555,3651557,3651579,3651706,3651745 -FLAGS.SILENT (\SEEN \ANSWERED \FLAGGED \DRAFT) > 15233 UID STORE 3651501,3651521,3651555,3651557,3651579,3651706,3651745 +FLAGS.SILENT (\SEEN) > > Those are the queries generated to Postgres: > > 2009-03-04 00:26:48 CET pid=4609 25/120872 idle: LOG: statement: SELECT message_idnr FROM dbmail_messages WHERE mailbox_idnr = 3324 AND status IN (0,1) ORDER BY message_idnr > 2009-03-04 00:26:48 CET pid=4609 25/120873 idle: LOG: statement: SELECT message_idnr FROM dbmail_messages WHERE mailbox_idnr = 3324 AND status IN (0,1) ORDER BY message_idnr > 2009-03-04 00:26:48 CET pid=4609 25/120874 idle: LOG: statement: SELECT permission,seen_flag,answered_flag,deleted_flag,flagged_flag,recent_flag,draft_flag FROM dbmail_mailboxes WHERE mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120875 idle: LOG: statement: SELECT 'a',COUNT(*) FROM dbmail_messages WHERE mailbox_idnr=3324 AND (status < 2) UNION SELECT 'b',COUNT(*) FROM dbmail_messages WHERE mailbox_idnr=3324 AND (status < 2) > AND seen_flag=1 UNION SELECT 'c',COUNT(*) FROM dbmail_messages WHERE mailbox_idnr=3324 AND (status < 2) AND recent_flag=1 > 2009-03-04 00:26:48 CET pid=4609 25/120876 idle: LOG: statement: SELECT max(message_idnr)+1 FROM dbmail_messages WHERE mailbox_idnr=3324 > 2009-03-04 00:26:48 CET pid=4609 25/120877 idle: LOG: statement: SELECT seen_flag, answered_flag, deleted_flag, flagged_flag, draft_flag, recent_flag, TO_CHAR(internal_date, 'YYYY-MM-DD HH24:MI:SS' ), rfcsize, message_idnr FROM dbmail_ > messages msg, dbmail_physmessage pm WHERE pm.id = msg.physmessage_id AND message_idnr BETWEEN 3635587 AND 3651745 AND mailbox_idnr = 3324 AND status IN (0,1,2) ORDER BY message_idnr ASC > 2009-03-04 00:26:48 CET pid=4609 25/120878 idle: LOG: statement: SELECT permission,seen_flag,answered_flag,deleted_flag,flagged_flag,recent_flag,draft_flag FROM dbmail_mailboxes WHERE mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120879 idle: LOG: statement: SELECT permission,seen_flag,answered_flag,deleted_flag,flagged_flag,recent_flag,draft_flag FROM dbmail_mailboxes WHERE mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120880 idle: LOG: statement: SELECT seen_flag, answered_flag, deleted_flag, flagged_flag, draft_flag, recent_flag, TO_CHAR(internal_date, 'YYYY-MM-DD HH24:MI:SS' ), rfcsize, message_idnr FROM dbmail_ > messages msg, dbmail_physmessage pm WHERE pm.id = msg.physmessage_id AND message_idnr BETWEEN 3635587 AND 3651745 AND mailbox_idnr = 3324 AND status IN (0,1,2) ORDER BY message_idnr ASC > > (are all those SELECTs really neccessary until here?) Yes. They are all part of the 'open mailbox and fetch meta-data' sequence. That some queries are called twice can only be explained by kmail opening two seperate imap sessions. If that is not the case I'd have to see this from dbmail's logs. > > 2009-03-04 00:26:48 CET pid=4609 25/120881 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651501 AND status < 2 AND mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120882 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651521 AND status < 2 AND mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120883 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651555 AND status < 2 AND mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120884 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651557 AND status < 2 AND mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120885 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651579 AND status < 2 AND mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120886 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651706 AND status < 2 AND mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120887 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr = 3651745 AND status < 2 AND mailbox_idnr = 3324 Yep, that's the first STORE command. > (here seems to be the 2nd commands start) > 2009-03-04 00:26:48 CET pid=4609 25/120888 idle: LOG: statement: SELECT permission,seen_flag,answered_flag,deleted_flag,flagged_flag,recent_flag,draft_flag FROM dbmail_mailboxes WHERE mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120889 idle: LOG: statement: SELECT seen_flag, answered_flag, deleted_flag, flagged_flag, draft_flag, recent_flag, TO_CHAR(internal_date, 'YYYY-MM-DD HH24:MI:SS' ), rfcsize, message_idnr FROM dbmail_ > messages msg, dbmail_physmessage pm WHERE pm.id = msg.physmessage_id AND message_idnr BETWEEN 3635587 AND 3651745 AND mailbox_idnr = 3324 AND status IN (0,1,2) ORDER BY message_idnr ASC > (again, why are those 2 SELECTs needed?) My guess is that the first STORE command has updated the dbmail_mailboxes.seq field, which (partially) invalidates the internally cached view of the mailbox. > 2009-03-04 00:26:48 CET pid=4609 25/120890 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651501 AND status < 2 AND mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120891 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651521 AND status < 2 AND mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120892 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651555 AND status < 2 AND mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120893 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651557 AND status < 2 AND mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120894 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651579 AND status < 2 AND mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120895 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651706 AND status < 2 AND mailbox_idnr = 3324 > 2009-03-04 00:26:48 CET pid=4609 25/120896 idle: LOG: statement: UPDATE dbmail_messages SET recent_flag=0, seen_flag=1 WHERE message_idnr = 3651745 AND status < 2 AND mailbox_idnr = 3324 > > Now I know that development to dbmail-2.2 has stopped, but I guess the > process is not a lot different in 2.3 so I want to suggest two improvements: > > 1) Currently for *every single* message a separate UPDATE is done, > this could easily be replaced by a multi-message UPDATE. And I > guess lots of SELECTs could be dropped also. Just because something looks 'obvious' from an sql perspective doesn't mean it's doable without massive refactoring, lots and lots of testing, etc.... In this particular case, the same pattern could be applied that is currently in use in dbmail_imap_session_mailbox_update_recent. Part of the problem lies with kmail, which apparently isn't able to issue a simple range-uid command where x uid store 123456:234567 +flags (\Seen) *does* result in single-update queries. That was one of the first things I did when I started working on dbmail back in 2004, iirc. For single-UID commands something similar can still be done. In dbmail_imap_session_mailbox_update_recent \Recent flags are updated for possibly large sets of messages. In that case, the whole set of UID's is sliced into groups which are send as IN(n1,n2, ...nN) update queries. However, whereas this might result in fewer queries, this pattern precludes the possibility of using prepared statements that would reduce hits on the query-parser/optimizer downstream in the database. Common sense would dictate that fewer queries are more efficient, but is that *really* the case? > > dbmail=> EXPLAIN UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr IN (3651501,3651521,3651555) AND status < 2 AND mailbox_idnr = 3324; > QUERY PLAN > ---------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on dbmail_messages (cost=19.22..23.24 rows=1 width=67) > Recheck Cond: ((mailbox_idnr = 3324) AND (message_idnr = ANY ('{3651501,3651521,3651555}'::bigint[]))) > Filter: (status < 2) > -> BitmapAnd (cost=19.22..19.22 rows=1 width=0) > -> Bitmap Index Scan on dbmail_messages_1 (cost=0.00..6.17 rows=253 width=0) > Index Cond: (mailbox_idnr = 3324) > -> Bitmap Index Scan on dbmail_messages_pkey (cost=0.00..12.80 rows=3 width=0) > Index Cond: (message_idnr = ANY ('{3651501,3651521,3651555}'::bigint[])) > (8 Zeilen) > > But I found that leaving "AND mailbox_idnr = 3324" is not needed anyway, and makes the query slower, > so let it away: > > dbmail=> EXPLAIN UPDATE dbmail_messages SET recent_flag=0, seen_flag=0, answered_flag=0, flagged_flag=0, draft_flag=0 WHERE message_idnr IN (3651501,3651521,3651555) AND status < 2; > QUERY PLAN > ------------------------------------------------------------------------------------ > Bitmap Heap Scan on dbmail_messages (cost=12.80..24.56 rows=3 width=67) > Recheck Cond: (message_idnr = ANY ('{3651501,3651521,3651555}'::bigint[])) > Filter: (status < 2) > -> Bitmap Index Scan on dbmail_messages_pkey (cost=0.00..12.80 rows=3 width=0) > Index Cond: (message_idnr = ANY ('{3651501,3651521,3651555}'::bigint[])) > (5 Zeilen) Agreed. The where clause is a remnant from the (my)isam days (<2.0) where key restraints were non-existant. They are completely redundant. Feel like trying your hand at fixing that? > 2) > When an IMAP command "UID STORE" arrives, could dbmail wait > to see if a 2nd "UID STORE" arrives next? Currently there are > 2 updates, first setting all flags=0, later seen=1. That could > be grouped. Doesn't matter with only some users, but once > you have 500+ concurrent users, or WAL logs (which generate > a record for every single transaction), there's a big difference. That may be possible in some situations, but in this case you have to avoid being naive in assuming only one single user is accessing that particular mailbox at that time. Those STORE commands are atomic, and the changes MUST be accessible in other concurrently connected sessions. But we might get away by assuming such concurrent clients are all connected to the same imapd process. If not we'd have to use some synchronisation channel between different imapd processes using the same database backend. It all comes down to keep pushing the code and design and look for ways to reduce cost of operation - be that for system calls like Artem is pushing for, or database calls as you are. -- ________________________________________________________________ Paul Stevens paul at nfg.nl NET FACILITIES GROUP GPG/PGP: 1024D/11F8CD31 The Netherlands________________________________http://www.nfg.nl _______________________________________________ DBmail mailing list DBmail[at]mailman.fastxs.nl http://mailman.fastxs.nl/cgi-bin/mailman/listinfo/dbmail
|