Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: exim: users

strange exim behaviour

 

 

exim users RSS feed   Index | Next | Previous | View Threaded


exim at inode

Nov 25, 2004, 9:37 AM

Post #1 of 17 (909 views)
Permalink
strange exim behaviour

Hej,

first off: I know this problem is quite generic, and can be caused by
various things, but I'm asking anyways..

I'm seeing strange "fluctuations" in the message logs. Doing a
# tail -f mainlog|grep Completed
I'm getting sth. like this:

2004-11-25 12:41:25 1CXH2S-0006yW-00 Completed
2004-11-25 12:41:25 1CXHGQ-0007WF-01 Completed
2004-11-25 12:41:25 1CXHZo-0003JB-00 Completed
2004-11-25 12:41:25 1CXHgm-0007Pt-00 Completed
2004-11-25 12:41:25 1CXHjk-0007HC-01 Completed
2004-11-25 12:41:25 1CXH4S-0008Ld-00 Completed
2004-11-25 12:41:25 1CXHGQ-0007Sw-00 Completed
2004-11-25 12:41:25 1CXI05-0003jR-00 Completed
2004-11-25 12:41:25 1CXHgm-0007HC-00 Completed
2004-11-25 12:41:26 1CXHjk-0007Te-03 Completed
2004-11-25 12:41:26 1CXHZo-0003Dz-00 Completed
2004-11-25 12:41:26 1CXHZm-0003GE-00 Completed
2004-11-25 12:41:26 1CXHkk-0007Te-00 Completed
2004-11-25 12:41:26 1CXH4S-0008HI-00 Completed
2004-11-25 12:41:26 1CXHZo-0003Ly-00 Completed
2004-11-25 12:41:26 1CXHZm-0003Hr-00 Completed
2004-11-25 12:41:26 1CXH4S-0008IN-00 Completed
2004-11-25 12:41:39 1CXHjk-0007Te-00 Completed
2004-11-25 12:41:42 1CXHZm-0003NS-00 Completed
2004-11-25 12:41:43 1CXHgk-0007Ml-00 Completed
2004-11-25 12:41:43 1CXHZo-0003Nx-00 Completed
2004-11-25 12:41:44 1CXH4S-0008Kw-01 Completed
2004-11-25 12:41:45 1CXHZk-0003MM-00 Completed
2004-11-25 12:41:45 1CXHZo-0003MM-00 Completed
2004-11-25 12:41:45 1CXH8S-0002b7-00 Completed
2004-11-25 12:41:45 1CXHGQ-0007Wb-00 Completed
2004-11-25 12:41:45 1CXI0M-0003sm-00 Completed
2004-11-25 12:41:47 1CXHWk-0000tw-00 Completed
2004-11-25 12:41:47 1CXH8S-0002QH-00 Completed
2004-11-25 12:41:47 1CXHZo-0003GE-00 Completed
2004-11-25 12:41:47 1CXHGQ-0007TZ-00 Completed
2004-11-25 12:41:47 1CXI0Q-0003wF-00 Completed


So, between 12:41:26 and 12:41:39 no single message got delivered.
Loadaverage is ~15 at that moment, the following settings in exim's
config may be of prior interest:

smtp_load_reserve = 5 # lowered this to 5 during the "test"
queue_only_load = 40
deliver_queue_load_max = 40

There were no messages coming in from remote hosts, exim was processing
mails in its queue. I'm using Berkeley DB files to check local addresses
and their quota on this host. This is on a dual XEON 2.4GHz with 2GB RAM
and SCSI RAID5.

So, can anyone say something like "Ah, I allready know that, it's
because ..."? AFAICS there's no obvious reason for this behaviour. And
tpop3d which is running too on that box, does not reflect this
fluctuations...


lg,
daniel


peter at bowyer

Nov 25, 2004, 10:03 AM

Post #2 of 17 (892 views)
Permalink
Re: strange exim behaviour [In reply to]

Daniel Tiefnig <exim [at] inode> wrote:

> I'm seeing strange "fluctuations" in the message logs. Doing a
> # tail -f mainlog|grep Completed
>
> So, between 12:41:26 and 12:41:39 no single message got delivered.

Right. What other lines were in the log during that time? What about all the
other application logs, syslogs etc? And why do you consider this 'strange'?

Peter


exim at inode

Nov 25, 2004, 10:56 AM

Post #3 of 17 (886 views)
Permalink
Re: strange exim behaviour [In reply to]

Peter Bowyer wrote:
> Daniel Tiefnig <exim [at] inode> wrote:
>> So, between 12:41:26 and 12:41:39 no single message got delivered.
>
> Right. What other lines were in the log during that time?

In the exim log? None.

> What about all the other application logs,

Well, from 12:41:27 to 12:41:38, there are ~350 POP logins and some IMAP
logins.

> syslogs etc?

Nothing of interest, AFAICS.

> And why do you consider this 'strange'?

Well, I don't know why exim pauses for about 10 seconds, when there are
quite a few hundred mails (all queued for local delivery) in its queue.
I thought it would try to deliver them. If that's not unusual, I'd be
happy if someone could tell me why this happens, even if its nothing
exim specific...


lg,
daniel


exim at inode

Nov 25, 2004, 11:07 AM

Post #4 of 17 (886 views)
Permalink
Re: strange exim behaviour [In reply to]

Daniel Tiefnig wrote:
> So, between 12:41:26 and 12:41:39 no single message got delivered.

One more thing, from the logs, I found this:

2004-11-25 12:41:27 1CXHjk-0007Te-00 => user <censored> R=virtual_user \
T=virtual_delivery S=4510
2004-11-25 12:41:39 1CXHjk-0007Te-00 Completed

So, the message got delivered at 12:41:27, but the final logline was
written at 12:41:39. What could have caused exim to pause between these
two lines? (And not just this exim process, but all exim processes. And
not all processes, but just the exim processes...)

I allready thought it might be just a logging issue, but if loglines
were just delayed, there would not be a gap in the timestamps, would be
there? And tpop3d, as well as courier-imap is able to log without any
problems all the time...


lg,
daniel

PS: I'll try to get an strace of one of these processes, maybe I'm able
to see where it blocks...


steve at lobefin

Nov 25, 2004, 12:23 PM

Post #5 of 17 (889 views)
Permalink
Re: strange exim behaviour [In reply to]

On Thu, Nov 25, 2004 at 07:07:16PM +0100, Daniel Tiefnig said:
> Daniel Tiefnig wrote:
> >So, between 12:41:26 and 12:41:39 no single message got delivered.
>
> One more thing, from the logs, I found this:
>
> 2004-11-25 12:41:27 1CXHjk-0007Te-00 => user <censored> R=virtual_user \
> T=virtual_delivery S=4510
> 2004-11-25 12:41:39 1CXHjk-0007Te-00 Completed
>
> So, the message got delivered at 12:41:27, but the final logline was
> written at 12:41:39. What could have caused exim to pause between these
> two lines? (And not just this exim process, but all exim processes. And
> not all processes, but just the exim processes...)
>
> I allready thought it might be just a logging issue, but if loglines
> were just delayed, there would not be a gap in the timestamps, would be
> there? And tpop3d, as well as courier-imap is able to log without any
> problems all the time...

What's your virtual_delivery transport look like? Could it be using
another program that didn't give a return code for 10 seconds or so?
--
--------------------------------------------------------------------------
| Stephen Gran | All life evolves by the differential |
| steve [at] lobefin | survival of replicating entities. -- |
| http://www.lobefin.net/~steve | Dawkins |
--------------------------------------------------------------------------


exim at inode

Nov 25, 2004, 1:32 PM

Post #6 of 17 (887 views)
Permalink
Re: strange exim behaviour [In reply to]

Stephen Gran wrote:
> On Thu, Nov 25, 2004 at 07:07:16PM +0100, Daniel Tiefnig said:
> > 2004-11-25 12:41:27 1CXHjk-0007Te-00 => user <censored> \
> > R=virtual_user T=virtual_delivery S=4510
> > 2004-11-25 12:41:39 1CXHjk-0007Te-00 Completed
>
> What's your virtual_delivery transport look like? Could it be using
> another program that didn't give a return code for 10 seconds or so?

No. My virtual_delivery transport (and router) is as simple as possible.

begin routers

virtual_user:
driver = accept
condition = ${if
eq{${lookup{$local_part@$domain}dbm{/var/spool/exim4/dbm/pop.bdb}}} \
{172.20.0.4} {yes}{no}}
transport = virtual_delivery

remote_smtp:
...

begin transports

virtual_delivery:
driver = appendfile
maildir_format
directory = /var/spool/exim4/${domain}/${local_part}
maildir_tag = ,S=$message_size
quota_size_regex = ,S=(\d+)
delivery_date_add
return_path_add
user = exim
group = exim
mode = 0660
quota =
${lookup{$local_part@$domain}dbm{/var/spool/exim4/dbm/quota.bdb}{$value}{60}}M
quota_warn_threshold = 70%
quota_warn_message = "..."


That's it.

lg,
daniel


Nigel.Metheringham at dev

Nov 26, 2004, 2:11 AM

Post #7 of 17 (886 views)
Permalink
Re: strange exim behaviour [In reply to]

On Thu, 2004-11-25 at 19:07 +0100, Daniel Tiefnig wrote:
> I allready thought it might be just a logging issue, but if loglines
> were just delayed, there would not be a gap in the timestamps, would be
> there? And tpop3d, as well as courier-imap is able to log without any
> problems all the time...

What filesystem are you using?
If it is a journaling filesystem, and the journal had filled right up
and was being flushed to a busy disk subsystem there is a chance that
all writes could hang. This might well cause a logging gap for exim
(since it often logs after a disk synchronous operation) but not for a
pop daemon which might not perform any disk synchronous writes prior to
logging.


Nigel.
--
[ Nigel Metheringham Nigel.Metheringham [at] InTechnology ]
[. - Comments in this message are my own and not ITO opinion/policy - ]


exim at inode

Nov 26, 2004, 7:37 AM

Post #8 of 17 (885 views)
Permalink
Re: strange exim behaviour [In reply to]

Nigel Metheringham wrote:
> What filesystem are you using?

It's ext3.

> If it is a journaling filesystem, and the journal had filled right up
> and was being flushed to a busy disk subsystem there is a chance
> that all writes could hang.

I think you got it.
Watching the cache statistics of the RAID controler, I can see that the
hangs in the logging appear when the write hits drop down to about 20%
to 30%. It then goes up to about 70% to 90%, and logging/work continues.

Data throughput is not very high at these moments (about 8MByte/s each
read and write to the controler) but all the small files and empty
directories may sure cause a lot of head seeks on the underlying disks.


Thank you,
daniel


Nigel.Metheringham at dev

Nov 26, 2004, 7:48 AM

Post #9 of 17 (886 views)
Permalink
Re: strange exim behaviour [In reply to]

On Fri, 2004-11-26 at 15:37 +0100, Daniel Tiefnig wrote:
> Nigel Metheringham wrote:
> > What filesystem are you using?
>
> It's ext3.

> I think you got it.
> Watching the cache statistics of the RAID controler, I can see that the
> hangs in the logging appear when the write hits drop down to about 20%
> to 30%. It then goes up to about 70% to 90%, and logging/work continues.
>
> Data throughput is not very high at these moments (about 8MByte/s each
> read and write to the controler) but all the small files and empty
> directories may sure cause a lot of head seeks on the underlying disks.

When ext3 was very young, we had long discussions on the idea of
separate stable fast journal devices along with data journaling. This
would prevent an awful lot of mail traffic ever hitting the real disk
because the file would have ceased to exist by the time the system got
to flushing the journal transactions to disk.

I still think this would be a neat disk configuration for mail servers.

What I don't know, and it might well be worth seeing if folks in the
ext3 community have an idea, is whether using data journaling on a disk
device with its own caching system would help.

Nigel.

--
[ Nigel Metheringham Nigel.Metheringham [at] InTechnology ]
[. - Comments in this message are my own and not ITO opinion/policy - ]


lists-exim at silverwraith

Nov 26, 2004, 10:19 AM

Post #10 of 17 (894 views)
Permalink
Re: strange exim behaviour [In reply to]

On Fri, Nov 26, 2004 at 03:37:36PM +0100, Daniel Tiefnig wrote:
> >What filesystem are you using?
>
> It's ext3.
>
> >If it is a journaling filesystem, and the journal had filled right up
> >and was being flushed to a busy disk subsystem there is a chance
> >that all writes could hang.
>
> I think you got it.
> Watching the cache statistics of the RAID controler, I can see that the
> hangs in the logging appear when the write hits drop down to about 20%
> to 30%. It then goes up to about 70% to 90%, and logging/work continues.

ext3 is a really bad idea for this i have to say.
I'm sure a lot of people find it ok, but for high performance (or even
som medium performance!) ext3 is not the choice. It's approximately half
the speed of the non-journalling ext2.
Filesystems like XFS and Reiser will be much faster.


dot at dotat

Aug 26, 2008, 10:16 AM

Post #11 of 17 (888 views)
Permalink
Re: Strange Exim Behaviour [In reply to]

On Tue, 26 Aug 2008, Marc Silver wrote:
>
> The oldest mails in the queue do not seem to be getting hit by the queue
> runners and therefore the queue size is not decreasing nor are old mails being
> expired from the queue. If I manually instruct Exim to flush these messages
> (exim -v -M <msgID>) the messages flush, but a manual queue run just never
> seems to get there.

What command are you using to run Exim? i.e. what is the queue runner
interval? Try reducing it to something like -q5m. You only have 8
concurrent queue runners. Try increasing that to a few dozen. Check your
logs for queue run start and end lines, and see if they are reasonable.
Use exiwhat to check that you have a suitable number of queue runners.

Tony.
--
<fanf [at] exim> <dot [at] dotat> http://dotat.at/ ${sg{\N${sg{\
N\}{([^N]*)(.)(.)(.*)}{\$1\$3\$2\$1\$3\n\$2\$3\$4\$3\n\$3\$2\$4}}\
\N}{([^N]*)(.)(.)(.*)}{\$1\$3\$2\$1\$3\n\$2\$3\$4\$3\n\$3\$2\$4}}

--
## List details at http://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/


marcs at discoverylink

Aug 26, 2008, 10:19 AM

Post #12 of 17 (888 views)
Permalink
Re: Strange Exim Behaviour [In reply to]

Hi Tony,

I'm manually trying to flush the queue with 'exim -v -qff'.

Am I incorrect in assuming that if the internet line is already saturated
then I must have enough queue runners?

Cheers,
Marc

On Tue, 26 Aug 2008 19:16:48 +0200, Tony Finch <dot [at] dotat> wrote:
> What command are you using to run Exim? i.e. what is the queue runner
> interval? Try reducing it to something like -q5m. You only have 8
> concurrent queue runners. Try increasing that to a few dozen. Check your
> logs for queue run start and end lines, and see if they are reasonable.
> Use exiwhat to check that you have a suitable number of queue runners.

--
## List details at http://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/


marcs at discoverylink

Aug 26, 2008, 10:20 AM

Post #13 of 17 (890 views)
Permalink
Re: Strange Exim Behaviour [In reply to]

My bad... I'm starting exim with 'exim -bd -q10m'

On Tue, 26 Aug 2008 19:16:48 +0200, Tony Finch <dot [at] dotat> wrote:
> What command are you using to run Exim? i.e. what is the queue runner
> interval? Try reducing it to something like -q5m. You only have 8
> concurrent queue runners. Try increasing that to a few dozen. Check your
> logs for queue run start and end lines, and see if they are reasonable.
> Use exiwhat to check that you have a suitable number of queue runners.
>
> Tony.

--
## List details at http://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/


dot at dotat

Aug 26, 2008, 4:52 PM

Post #14 of 17 (872 views)
Permalink
Re: Strange Exim Behaviour [In reply to]

On Tue, 26 Aug 2008, Marc Silver wrote:
>
> I'm manually trying to flush the queue with 'exim -v -qff'.

Try omitting the force flags, so that you aren't defeating the retry
logic. Doing so causes Exim to do unnecessary work.

> Am I incorrect in assuming that if the internet line is already
> saturated then I must have enough queue runners?

If that's the case then you have fairly serious problems. You still need
plenty of queue runners to deal with slow DNS lookups etc.

Tony.
--
<fanf [at] exim> <dot [at] dotat> http://dotat.at/ ${sg{\N${sg{\
N\}{([^N]*)(.)(.)(.*)}{\$1\$3\$2\$1\$3\n\$2\$3\$4\$3\n\$3\$2\$4}}\
\N}{([^N]*)(.)(.)(.*)}{\$1\$3\$2\$1\$3\n\$2\$3\$4\$3\n\$3\$2\$4}}

--
## List details at http://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/


marcs at discoverylink

Aug 26, 2008, 10:19 PM

Post #15 of 17 (873 views)
Permalink
Re: Strange Exim Behaviour [In reply to]

Thanks for your help so far Tony.

We're doing on average around 150,000 outbound messages a day on these
servers. In your opinion, how many queue runners should I have set up?
These machines are dual-core 3.0 GHz machines with 4GB of RAM each if that
makes a difference.

Thanks again for your help so far.

On Wed, 27 Aug 2008 01:52:20 +0200, Tony Finch <dot [at] dotat> wrote:

> On Tue, 26 Aug 2008, Marc Silver wrote:
> If that's the case then you have fairly serious problems. You still need
> plenty of queue runners to deal with slow DNS lookups etc.

--
## List details at http://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/


dot at dotat

Aug 27, 2008, 7:29 AM

Post #16 of 17 (861 views)
Permalink
Re: Strange Exim Behaviour [In reply to]

On Wed, 27 Aug 2008, Marc Silver wrote:
>
> We're doing on average around 150,000 outbound messages a day on these
> servers. In your opinion, how many queue runners should I have set up?
> These machines are dual-core 3.0 GHz machines with 4GB of RAM each if that
> makes a difference.

If you ahve deep queues you need lots of queue runners. Divide the size of
your queue (from exim -bpc) by the number of queue runners, and multiply
that by the average time to deal with a message (usually fast but often
very slow - say 10 seconds) and if the result is unreasonably long add
more queue runners. Your machines are fairly studly so they can bear
having dozens of queue runners.

Tony.
--
<fanf [at] exim> <dot [at] dotat> http://dotat.at/ ${sg{\N${sg{\
N\}{([^N]*)(.)(.)(.*)}{\$1\$3\$2\$1\$3\n\$2\$3\$4\$3\n\$3\$2\$4}}\
\N}{([^N]*)(.)(.)(.*)}{\$1\$3\$2\$1\$3\n\$2\$3\$4\$3\n\$3\$2\$4}}

--
## List details at http://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/


exim at inode

Aug 27, 2008, 8:29 AM

Post #17 of 17 (864 views)
Permalink
Re: Strange Exim Behaviour [In reply to]

Tony Finch wrote:
> On Tue, 26 Aug 2008, Marc Silver wrote:
>> I'm manually trying to flush the queue with 'exim -v -qff'.
>
> Try omitting the force flags, so that you aren't defeating the retry
> logic. Doing so causes Exim to do unnecessary work.

Sure, but that would also result in old (older than 4d) messages being
left in the queue, when they are skipped due to retry logic, wouldn't
it? I may be wrong with that, but I think exim doesn't abandon delivery
without a last try, or does it?

>> Am I incorrect in assuming that if the internet line is already
>> saturated then I must have enough queue runners?
>
> If that's the case then you have fairly serious problems.

Yes, that may explain why the system is having these problems. Bandwidth
may just not be sufficient for the amount of e-mails flowing through.
Tuning exim of course will not help here.

24k messages, 45GB in size result in 1.9MB average size. With the
mentioned 150k mails a day, this would result in a required bandwidth of
28MBit, roughly. The servers are delivering "170GB of mail per 24 hour
period (limited by bandwidth entirely)", this would result in ~17MBit
available bandwidth.
Something is obviously wrong here. :o)

hth,
daniel

--
## List details at http://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/

exim users RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.