
Mailing List Archive: Qmail: users

Intermittent tcpserver failure / stuck queue?

 

 



akirk at sourcefire

May 18, 2009, 7:58 PM

Post #1 of 7
Intermittent tcpserver failure / stuck queue?

I've got a strange issue with a netqmail-1.05 system that's been running
smoothly for the last 7+ months (I've been running Qmail on one system or
another for nearly 10 years now). As of today, inbound mail delivery (which
comes on a relay through mx1.mailhop.org and mx2.mailhop.org) has become
very erratic, and the two users I have who use Thunderbird instead of
webmail have been unable to send mail all day. Here are the symptoms as I've
seen them over the last half-hour of attempting to diagnose things:

* Connecting to the mail server via telnet on port 25 succeeds, but I never
get a 220-style greeting.
* If I restart qmail via "qmailctl stop" and "qmailctl start", however, I
can connect.
* This ability to connect lasts ~5 minutes, before I hang forever waiting
for a response from tcpserver/qmail.
* I've got a ton of paired "/var/qmail/bin/qmail-smtpd /home/mail/bin/vchkpw
/usr/bin/true" and "bin/qmail-queue" processes that just don't seem to want
to go away, even after restarting qmail.
* "qmailctl stat" shows an apparently growing number of messages in the
queue, though 0 are not yet processed.
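To keep an eye on the process buildup described above, here is a minimal sketch (not part of the original post) that tallies qmail-smtpd and qmail-queue processes from a process listing on stdin. The function name count_qmail_procs is made up for illustration, and it assumes one command line per input line, as `ps axww -o command` prints.

```shell
# Tally qmail-smtpd and qmail-queue processes from a process listing on stdin.
# Only the first word of each line (the executable) is matched, so the
# tcpserver and supervise lines that merely mention qmail-smtpd as an
# argument are not counted.
count_qmail_procs() {
    awk '
        $1 ~ /(^|\/)qmail-smtpd$/ { smtpd++ }
        $1 ~ /(^|\/)qmail-queue$/ { queue++ }
        END { printf "qmail-smtpd=%d qmail-queue=%d\n", smtpd, queue }
    '
}

# Typical use (run periodically to see whether the numbers keep growing):
#   ps axww -o command | count_qmail_procs
```

If both numbers climb in step, each stuck qmail-smtpd is most likely waiting on its own qmail-queue child.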

At this point, I suspect that either tcpserver - which Qmail is running
under - is running into some bizarre condition that's not letting it hand
off connections to Qmail after a certain point, or that my queue has somehow
become fouled up, and Qmail is freaking out and hanging each time it
attempts to process whatever message has fouled things up.

I'm seeing no errors in /var/log/messages, /var/log/qmail/current, or
/var/log/qmail/smtpd/current, BTW. Also, FWIW, it's FreeBSD 7.0-RELEASE.

Any idea where I should start debugging things?

Alex Kirk


lists-qmail at maexotic

May 19, 2009, 6:02 AM

Post #2 of 7
Re: Intermittent tcpserver failure / stuck queue? [In reply to]

On Mon, May 18, 2009 at 10:58:41PM -0400, Alex Kirk wrote:
> Any idea where I should start debugging things?

A few thoughts:
- you can try to run
# truss -f -o /tmp/qmail.log /var/qmail/bin/qmail-smtpd /home/mail/bin/vchkpw /usr/bin/true
and see if you get anything from its output (/tmp/qmail.log) or if that
works fine

- my first thought was DNS problems
o pro
+ it hangs (in tcpserver?)
+ maybe rblsmtpd (more timeouts, longer hanging)
o contra
+ lots of running qmail-smtpd/qmail-queue
you may want to do a "killall qmail-smtpd" to freshly start over
without any hanging/old processes

- that you can connect after a restart is because tcpserver still has free
connection slots (-c [default 40]). Once those and the backlog (-b) are
filled (the backlog is kinda irrelevant on FreeBSD) you can't even connect.

- do you run tcpserver with "-v"? this might give some more info.
Do the connection lines show the hostname or only IP addresses of
the connecting hosts (unless you are already using -H)?
@400000004a121bc82a78df14 tcpserver: pid 99338 from 10.0.0.2
@400000004a121bc923bb9734 tcpserver: ok 99338 qmail:10.0.0.1:25 remote-host.example.com:10.0.0.2::3291
vs.
@400000004a121bc82a78df14 tcpserver: pid 99338 from 10.0.0.2
@400000004a121bc923bb9734 tcpserver: ok 99338 qmail:10.0.0.1:25 :10.0.0.2::3291

To rule out DNS problems even further you can run tcpserver with -HP
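
As a rough sketch of where these flags go, assuming a "Life with qmail"-style run script for the qmail-smtpd service: the tcpserver path and the helper name write_run_script are assumptions, and any -x/-u/-g/-l options or softlimit wrapper in the existing script should be carried over.

```shell
# Print a candidate run script to stdout so it can be reviewed before use.
# Nothing here touches the live service.
write_run_script() {
    cat <<'EOF'
#!/bin/sh
# Connection limit; tcpserver itself defaults to 40 when -c is omitted.
MAXSMTPD=$(cat /var/qmail/control/concurrencyincoming)
# -v  log connections
# -H  do not look up the remote host name in DNS
# -P  do not do the "paranoid" forward check of that name
# Keep any -x/-u/-g/-l options from the existing script (omitted here).
exec /usr/local/bin/tcpserver -v -H -P -c "$MAXSMTPD" 0 smtp \
    /var/qmail/bin/qmail-smtpd /home/mail/bin/vchkpw /usr/bin/true 2>&1
EOF
}

# Typical use: write it somewhere harmless and diff against the current script.
#   write_run_script > /tmp/run.new
```

With -H the "ok" log lines show only the IP address of the peer, so a hang that disappears after this change points at DNS.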

\Maex


akirk at sourcefire

May 19, 2009, 7:04 AM

Post #3 of 7
Re: Intermittent tcpserver failure / stuck queue? [In reply to]

Thanks for the reply, you guys are one of the most reliably awesome lists
I've ever dealt with in open-source land. Replies inline. :-)

On Tue, May 19, 2009 at 9:02 AM, Markus Stumpf <lists-qmail [at] maexotic> wrote:

> On Mon, May 18, 2009 at 10:58:41PM -0400, Alex Kirk wrote:
> > Any idea where I should start debugging things?
>
> A few thoughts:
> - you can try to run
> # truss -f -o /tmp/qmail.log /var/qmail/bin/qmail-smtpd /home/mail/bin/vchkpw /usr/bin/true
> and see if you get anything from its output (/tmp/qmail.log) or if that
> works fine


It works in that I get a nice "220 www.schnarff.com ESMTP" like I'd expect,
though I had to go kill -9 it from a different shell. It did generate
/tmp/qmail.log, which I can send if desired (for now it seems like it'd
just clog things up), but a look through it shows nothing that jumps out at
me as a problem.

Looking at the truss man page, I decided to attach to the PID of one of my
bin/qmail-queue processes, and it did...nothing. It just hung out, like the
zombie process it appears to be.

>
>
> - my first thought was DNS problems
> o pro
> + it hangs (in tcpserver?)
> + maybe rblsmtpd (more timeouts, longer hanging)
> o contra
> + lots of running qmail-smtpd/qmail-queue
> you may want to do a "killall qmail-smtpd" to freshly start over
> without any hanging/old processes


I had done this when I initially discovered all of those processes, and had
nothing but the "supervise qmail-send" and "supervise qmail-smtpd"
processes running. In fact, I went so far as to reboot the entire system -
it's not so mission critical that it couldn't stand the ~90 seconds of
downtime, and I figured it was worth trying. The problems have persisted
afterwards, much to my chagrin.


> - that you can connect after a restart is because tcpserver still has free
> connection slots (-c [default 40]). Once those and the backlog (-b) are
> filled (the backlog is kinda irrelevant on FreeBSD) you can't even connect.
>

FWIW, I saw in /var/qmail/supervise/qmail-smtpd/run that the -c parameter
here was 20, based on the value of /var/qmail/control/concurrencyincoming; I
bumped it to 100, since the server can handle the extra connectivity just
fine at the OS level.


>
> - do you run tcpserver with "-v"? this might give some more info.


I do, and will be happy to provide additional details as necessary.

>
> Do the connection lines show the hostname or only IP addresses of
> the connecting hosts (unless you are already using -H)?
> @400000004a121bc82a78df14 tcpserver: pid 99338 from 10.0.0.2
> @400000004a121bc923bb9734 tcpserver: ok 99338 qmail:10.0.0.1:25 remote-host.example.com:10.0.0.2::3291
> vs.
> @400000004a121bc82a78df14 tcpserver: pid 99338 from 10.0.0.2
> @400000004a121bc923bb9734 tcpserver: ok 99338 qmail:10.0.0.1:25 :10.0.0.2::3291
>

They look like this, roughly:

@400000004a12b7cc1126e074 tcpserver: status: 20/20
@400000004a12b7cc1128a97c tcpserver: pid 20194 from 216.146.33.13
@400000004a12b7cc1325f07c tcpserver: ok 20194 www.schnarff.com:65.102.233.117:25 mxout-013-bos.mailhop.org:216.146.33.13::63308
@400000004a12b7cc21626cc4 CHKUSER accepted sender: from <Chandra [at] flyingwebsites::> remote <mhfr-03-ewr.dyndns.com: mxout-013-bos.mailhop.org:216.146.33.13> rcpt <> : sender accepted
@400000004a12b7cc21a18cec CHKUSER accepted rcpt: from <Chandra [at] flyingwebsites::> remote <mhfr-03-ewr.dyndns.com: mxout-013-bos.mailhop.org:216.146.33.13> rcpt <alex [at] schnarff> : found existing recipient
@400000004a12b7d51f027424 tcpserver: status: 0/100
@400000004a12b80204dda4c4 tcpserver: status: 1/100
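
The "status: used/limit" lines are the quickest way to see whether tcpserver is running out of connection slots. A minimal sketch, assuming multilog-style lines from "tcpserver -v" on stdin (the function name peak_status is made up for illustration), that reports the highest slot usage seen:

```shell
# Print the highest "status: used/limit" pair seen in tcpserver -v log lines.
# Prints nothing if the input contains no status lines.
peak_status() {
    awk '
        /tcpserver: status: / {
            split($NF, s, "/")          # last field looks like "20/20"
            if (s[1] + 0 > max) { max = s[1] + 0; limit = s[2] }
        }
        END { if (limit != "") print max "/" limit }
    '
}

# Typical use:
#   peak_status < /var/log/qmail/smtpd/current
```

A peak equal to the limit means new connections were being held back until a slot freed up, which matches the symptom of connecting but never seeing a greeting.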



>
> To rule out DNS problems even further you can run tcpserver with -HP
>

Sure, just did this. While I'm still able to connect to Qmail properly some
9 minutes after my last restart - which is much longer than it had been
going - I'm up to 99 connections on port 25, and 213 qmail-related
processes, both of which have been steadily growing. I'm thinking that's not
exactly sustainable.

Do you think there's something blocking actual mail delivery, and that's
what's causing the problem?


Alex Kirk


akirk at sourcefire

May 19, 2009, 7:07 AM

Post #4 of 7
Re: Intermittent tcpserver failure / stuck queue? [In reply to]

Oh, and one extra tidbit that I should have remembered before I hit send: I
downloaded qmHandle-1.3.2 last night, thinking it was a queue issue. I don't
know if it's an issue with qmHandle or an indication that my queue is
actually busted, but while it reports an empty queue, "qmailctl stat" shows
128 messages in my queue. Any thoughts on which one is correct, and/or
suggestions on a better queue management tool?

Alex Kirk


lists-qmail at maexotic

May 19, 2009, 8:21 AM

Post #5 of 7
Re: Intermittent tcpserver failure / stuck queue? [In reply to]

On Tue, May 19, 2009 at 10:04:35AM -0400, Alex Kirk wrote:
> It works in that I get a nice "220 www.schnarff.com ESMTP" like I'd expect
> and I had to go kill -9 it from a different shell. It did generate
> /tmp/qmail.log - which I can send if desired, for now it seems like it'd
> just clog things up - but a look through it has nothing that jumps out at me
> as a problem.

I should have mentioned that qmail-smtpd is then in an SMTP dialog,
reading from STDIN. You can type
HELO example.com
MAIL FROM: <joe [at] example>
RCPT TO: <somelocaluser>
DATA
blabla
.
quit
If the dialogue hangs somewhere it would be interesting to see the
last couple of lines of the truss log.
Look at the log backwards from the end. If you are using some spam/virus
filters the exec()s should show up, which would help to see which
program causes the problem (see also below).
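
The same dialog can be fed from a pipe instead of being typed. A minimal sketch, assuming qmail-smtpd will accept the commands sent back to back without waiting for each reply; the function name smtp_dialog and both addresses are placeholders, and a successful run queues a real test message:

```shell
# Emit a small SMTP dialog with CRLF line endings.
#   $1 = envelope sender, $2 = envelope recipient (both placeholders)
smtp_dialog() {
    printf 'HELO example.com\r\n'
    printf 'MAIL FROM:<%s>\r\n' "$1"
    printf 'RCPT TO:<%s>\r\n' "$2"
    printf 'DATA\r\n'
    printf 'Subject: test\r\n\r\nblabla\r\n.\r\n'
    printf 'QUIT\r\n'
}

# Typical use, tracing the server side while the dialog runs:
#   smtp_dialog joe@example.com somelocaluser | \
#     truss -f -o /tmp/qmail.log \
#       /var/qmail/bin/qmail-smtpd /home/mail/bin/vchkpw /usr/bin/true
```

If the run stalls after the end-of-data dot, the tail of the truss log should show which child process qmail-smtpd is waiting on.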

> Looking at the truss man page, I decided to attach to the PID of one of my
> bin/qmail-queue processes, and it did...nothing. It just hung out, like the
> zombie process it appears to be.

If the process already hangs in a syscall and does nothing truss cannot
report anything.

> @400000004a12b7cc1126e074 tcpserver: status: 20/20
> @400000004a12b7cc1128a97c tcpserver: pid 20194 from 216.146.33.13
> @400000004a12b7cc1325f07c tcpserver: ok 20194
> www.schnarff.com:65.102.233.117:25
> mxout-013-bos.mailhop.org:216.146.33.13::63308
> @400000004a12b7cc21626cc4 CHKUSER accepted sender: from
> <Chandra [at] flyingwebsites::> remote <mhfr-03-ewr.dyndns.com:
> mxout-013-bos.mailhop.org:216.146.33.13> rcpt <> : sender accepted
> @400000004a12b7cc21a18cec CHKUSER accepted rcpt: from
> <Chandra [at] flyingwebsites::> remote <mhfr-03-ewr.dyndns.com:
> mxout-013-bos.mailhop.org:216.146.33.13> rcpt <alex [at] schnarff> : found
> existing recipient

Ok, looks like you are using a patched version of netqmail with CHKUSER
patch. Any other patches?
Are you using simscan/amavis/spamassassin or other spam/virus checking?
I don't know about the CHKUSER patch, but that *seems* to work.
If the messages don't go to the queue and you have spam/virus checking
this might be the source of the problem.

> Do you think there's something blocking actual mail delivery, and that's
> what's causing the problem?

I think the problem is already at the queueing, otherwise the smtpds
would finish.

\Maex


lists-qmail at maexotic

May 19, 2009, 8:30 AM

Post #6 of 7
Re: Intermittent tcpserver failure / stuck queue? [In reply to]

On Tue, May 19, 2009 at 10:07:02AM -0400, Alex Kirk wrote:
> actually busted, but while it reports an empty queue, "qmailctl stat" shows
> 128 messages in my queue. Any thoughts on which one is correct, and/or
> suggestions on a better queue management tool?

IMHO "qmailctl stat" simply calls
/var/qmail/bin/qmail-qstat
you could also try
/var/qmail/bin/qmail-qread
which is a bit more informative about the messages in the queue.

Also qmHandle (never used that) probably reports only "processed" messages,
while qmail-qstat also reports messages in the "todo" (not yet
preprocessed) queue.
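
To cross-check the two tools, here is a minimal sketch that counts files directly in the queue's mess and todo directories. It assumes the stock queue layout under /var/qmail/queue and enough privilege to read it; the function name queue_counts is made up for illustration:

```shell
# Count message files in a qmail queue directory.
#   $1 = queue root (defaults to /var/qmail/queue)
# mess/ holds every message body; todo/ holds envelopes that qmail-send
# has not preprocessed yet.
queue_counts() {
    q=${1:-/var/qmail/queue}
    mess=$(find "$q/mess" -type f | wc -l | tr -d ' ')
    todo=$(find "$q/todo" -type f | wc -l | tr -d ' ')
    echo "messages in queue: $mess"
    echo "messages in queue but not yet preprocessed: $todo"
}

# Typical use (as root):
#   queue_counts
```

A todo count that keeps growing while qmail-send is running would suggest that messages are arriving but not being picked up for delivery.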

\Maex


akirk at sourcefire

May 19, 2009, 10:26 AM

Post #7 of 7
Re: Intermittent tcpserver failure / stuck queue? [In reply to]

On Tue, May 19, 2009 at 11:21 AM, Markus Stumpf <lists-qmail [at] maexotic> wrote:

> On Tue, May 19, 2009 at 10:04:35AM -0400, Alex Kirk wrote:
> > It works in that I get a nice "220 www.schnarff.com ESMTP" like I'd expect
> > and I had to go kill -9 it from a different shell. It did generate
> > /tmp/qmail.log - which I can send if desired, for now it seems like it'd
> > just clog things up - but a look through it has nothing that jumps out at me
> > as a problem.
>
> I should have mentioned that qmail-smtpd is then in an SMTP dialog,
> reading from STDIN. You can type
> HELO example.com
> MAIL FROM: <joe [at] example>
> RCPT TO: <somelocaluser>
> DATA
> blabla
> .
> quit
> If the dialogue hangs somewhere it would be interesting to see the
> last couple of lines of the truss log.
> Look at the log backwards from the end. If you are using some spam/virus
> filters the exec()s should show up, which would help to see which
> program causes the problem (see also below).
>

Actually, the last thing is just it waiting to read:

28969: select(2,0x0,{1},0x0,{1200.000000}) = 1 (0x1)
28969: write(1,"220 www.schnarff.com ESMTP\r\n",28) = 28 (0x1c)
28969: select(1,{0},0x0,0x0,{1200.000000})

I mean, maybe I'm doing it wrong, but when I tried typing in
"ehlo schnarff.com" and hitting enter, it didn't do a damn thing. It's like
it didn't get the input at all. I'm guessing this isn't the underlying Qmail
issue, so much as that I've fouled something up in the test here.


>
> > Looking at the truss man page, I decided to attach to the PID of one of my
> > bin/qmail-queue processes, and it did...nothing. It just hung out, like the
> > zombie process it appears to be.
>
> If the process already hangs in a syscall and does nothing truss cannot
> report anything.
>
> > @400000004a12b7cc1126e074 tcpserver: status: 20/20
> > @400000004a12b7cc1128a97c tcpserver: pid 20194 from 216.146.33.13
> > @400000004a12b7cc1325f07c tcpserver: ok 20194
> > www.schnarff.com:65.102.233.117:25
> > mxout-013-bos.mailhop.org:216.146.33.13::63308
> > @400000004a12b7cc21626cc4 CHKUSER accepted sender: from
> > <Chandra [at] flyingwebsites::> remote <mhfr-03-ewr.dyndns.com:
> > mxout-013-bos.mailhop.org:216.146.33.13> rcpt <> : sender accepted
> > @400000004a12b7cc21a18cec CHKUSER accepted rcpt: from
> > <Chandra [at] flyingwebsites::> remote <mhfr-03-ewr.dyndns.com:
> > mxout-013-bos.mailhop.org:216.146.33.13> rcpt <alex [at] schnarff> : found
> > existing recipient
>
> Ok, looks like you are using a patched version of netqmail with CHKUSER
> patch. Any other patches?


If memory serves, I think I went with the qmail-toaster-0.8.3.patch, which
provides smtp-auth, tls, oversize DNS, qregex, netqmail-maildir++, chkuser,
and SPF.

>
> Are you using simscan/amavis/spamassassin or other spam/virus checking?


Spamassassin, called from .qmail files. Haven't touched it since this
behavior began, and "ps aux" shows no evidence it's causing issues - not a
ton of processes, the ones that are there aren't hogging the CPU, etc.


>
> I don't know about the CHKUSER patch, but that *seems* to work.
> If the messages don't go to the queue and you have spam/virus checking
> this might be the source of the problem.
>
> > Do you think there's something blocking actual mail delivery, and that's
> > what's causing the problem?
>
> I think the problem is already at the queueing, otherwise the smtpds
> would finish.
>

Time for me to look more closely at the queue, then, I guess. I'll see what
I can find and report back if I can solve the problem there, so that it's in
the archives...

Alex Kirk
