
Mailing List Archive: OpenSSH: Dev

Issues with ssh-agent connecting to a large number of hosts at once

 

 


bbelnap at gmail

Apr 17, 2009, 9:04 AM

Post #1 of 4 (1008 views)
Issues with ssh-agent connecting to a large number of hosts at once

Hi,

I'm having problems with ssh-agent when I am connecting to a large number
(several hundred) of hosts at once. I'm using kanif (
http://taktuk.gforge.inria.fr/kanif/), which is a very nice package that
distributes ssh connections across the hosts you are connecting to (a
fan-out sort of approach, so that not all connections come from one host).
However, all hosts have to authenticate, so all of them have to wind their
way back to the ssh-agent. This problem isn't isolated to kanif,
however. I see it when using other utilities that rely on many concurrent
connections to the ssh-agent.

running strace on the ssh-agent, things start out ok, then go sour and it
starts spitting out:

read(160, 0xbf8f300a, 1024) = -1 EAGAIN (Resource temporarily unavailable)
read(160, 0xbf8f300a, 1024) = -1 EAGAIN (Resource temporarily unavailable)
read(160, 0xbf8f300a, 1024) = -1 EAGAIN (Resource temporarily unavailable)

while pegging the CPU. Tracking the number of connections to the agent once
every second (while true; do netstat -x | grep -c <agent socket name>; sleep
1; done) looks like:

5
5
5
35
98
154
155
200
287
287

at that point I kill the agent, but it will stick at that value if I don't.
It's not always 287, but varies. I've seen it as high as 447 connections at
once, but it's usually in the 200 range.
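
The once-a-second count above can also be taken without netstat. The following is a minimal sketch, assuming a Linux system where /proc/net/unix is readable; the function names count_socket_entries and watch are made up for illustration and are not part of any OpenSSH tool.

```python
import time


def count_socket_entries(proc_net_unix_text, socket_path):
    """Count rows of /proc/net/unix-style text whose Path column is socket_path."""
    count = 0
    for line in proc_net_unix_text.splitlines():
        # Columns: Num RefCount Protocol Flags Type St Inode [Path]
        # Unnamed sockets have no Path column, so they yield only 7 fields.
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[7].strip() == socket_path:
            count += 1
    return count


def watch(socket_path, interval=1.0, samples=10):
    """Print the entry count once per interval (Linux only, by assumption)."""
    for _ in range(samples):
        with open("/proc/net/unix") as f:
            print(count_socket_entries(f.read(), socket_path))
        time.sleep(interval)
```

Calling watch() with the value of SSH_AUTH_SOCK should print a column of numbers comparable to the one above. The count includes the agent's listening socket as well as the connections bound to that path.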

I've tried different ssh-agents on different kernels and machines, and
haven't found a combination that works. However, most of the FreeBSD
machines I've tried did not seem to have the problem. Also, using Pageant
on Windows does not have any issues (*gasp*).

It seems to me that I'm hitting some kind of kernel limit (open file limit
perhaps?) But I've fiddled with every sysctl value I can find, and haven't
found the right magic. Anyone run into this or can offer further debugging
suggestions? (btw, ssh -V shows: OpenSSH_5.1p1 Debian-3ubuntu1, OpenSSL
0.9.8g)
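
One way to rule the open file limit in or out is to read it directly rather than infer it from sysctl values. This is only a sketch: it reports the limit of the Python process it runs in, which is inherited from the launching shell, and it assumes (without confirming) that the agent was started under the same limits. The helper names are hypothetical.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, mapping RLIM_INFINITY to the word 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = nofile_limits()
print("open files: soft=%s hard=%s" % (describe(soft), describe(hard)))
```

On Linux, the limits that apply to an already running agent can be read from /proc/<pid>/limits instead.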

Thanks.

--Bob
_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev [at] mindrot
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev


bob at proulx

Apr 17, 2009, 6:48 PM

Post #2 of 4 (924 views)
Re: Issues with ssh-agent connecting to a large number of hosts at once

Bob Belnap wrote:
> It seems to me that I'm hitting some kind of kernel limit (open file limit
> perhaps?) But I've fiddled with every sysctl value I can find, and haven't
> found the right magic. Anyone run into this or can offer further debugging
> suggestions? (btw, ssh -V shows: OpenSSH_5.1p1 Debian-3ubuntu1, OpenSSL
> 0.9.8g)

I don't have a perfect understanding of this, but not seeing anyone
else say anything, I will jump in and make some suggestions, imperfect
though they will be. Different types of kernels handle this
differently, which would account for why different systems behave
differently. But most have a limited amount of memory available for
network resources. Quickly opening and closing network connections
can cause memory to be consumed at a high rate. Once the available
memory is exceeded, system calls fail for lack of resources until
more resources are available. This is what you are seeing.

Why do resources become consumed? Look at RFC793 and you will find
the TCP state diagram. Look particularly at the TIME_WAIT state. You
are probably creating many connections hanging around in the TIME_WAIT
state after they are closed and until the timeout. Each of those
consumes network memory. You can see these connections by looking at
the state reported by netstat. (e.g. 'netstat | grep TIME_WAIT') If
you see many connections in the TIME_WAIT state then this is what you
are running into. In many kernels with a limited amount of network
resources this limits the rate at which connections may be created and
closed.

I am not familiar with TakTuk but it appears to try to avoid this
problem by spreading the load around. That is good. But perhaps you
are still exceeding the system limits. It appears to me that you are.

This isn't really particular to ssh but is generic to anything that
creates TCP connections. Since ssh uses TCP it has the same
limitation as any other program that uses TCP and leaves connections
in the TIME_WAIT state until they timeout and their resources are
reclaimed.

Hope that helps.

Bob


bbelnap at gmail

Apr 20, 2009, 7:21 AM

Post #3 of 4 (936 views)
Re: Issues with ssh-agent connecting to a large number of hosts at once

Thanks Bob, for your detailed and informative response. Comments inline...

On Fri, Apr 17, 2009 at 7:48 PM, Bob Proulx <bob [at] proulx> wrote:

> I don't have a perfect understanding of this but not seeing anyone
> else say anything I will jump in and make some suggestions imperfect
> though they will be. Different types of kernels will handle this
> differently and will account for why different systems behave
> differently. But most have a limited amount of memory available for
> network resources. Quickly opening and closing network connections
> can cause memory to be consumed at a high rate. Once the available
> memory is exceeded system calls fail for being out of resources until
> more resources are available. This is what you are seeing.
>
> Why do resources become consumed? Look at RFC793 and you will find
> the TCP state diagram. Look particularly at the TIME_WAIT state. You
> are probably creating many connections hanging around in the TIME_WAIT
> state after they are closed and until the timeout. Each of those
> consumes network memory. You can see these connections by looking at
> the state reported by netstat. (e.g. 'netstat | grep TIME_WAIT') If
> you see many connections in the TIME_WAIT state then this is what you
> are running into. In many kernels with a limited amount of network
> resources this limits the rate at which connections may be created and
> closed.
>

Connections aren't in the TIME_WAIT state; they are either CONNECTED or
CONNECTING (about evenly split).


> This isn't really particular to ssh but is generic to anything that
> creates TCP connections. Since ssh uses TCP it has the same
> limitation as any other program that uses TCP and leaves connections
> in the TIME_WAIT state until they timeout and their resources are
> reclaimed.


Yes, I realize this is not an issue with ssh in particular, but since it is
triggered by ssh, I had hoped this group could more easily point out which
limit is being hit. I am continuing to research the issue.

--Bob


stevesk at pobox

Apr 22, 2009, 3:58 PM

Post #4 of 4 (911 views)
Re: Issues with ssh-agent connecting to a large number of hosts at once

On Fri, Apr 17, 2009 at 10:04:34AM -0600, Bob Belnap wrote:
: read(160, 0xbf8f300a, 1024) = -1 EAGAIN (Resource temporarily unavailable)

looks like select() tells us a non-blocking fd is ready for reading
but there is nothing to read and we loop forever on EAGAIN.
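
The failure mode described here can be reproduced in isolation. What follows is not ssh-agent's code; it is a small Python sketch, using a socketpair as a stand-in for an agent connection, of how a read on a non-blocking descriptor with no data fails with EAGAIN, and how a reader can treat that as "nothing yet" and go back to waiting rather than retrying at once. The helper name try_read is hypothetical.

```python
import select
import socket


def try_read(sock, nbytes=1024):
    """Return data, b"" on EOF, or None if a non-blocking read would block."""
    try:
        return sock.recv(nbytes)
    except BlockingIOError:
        # EAGAIN/EWOULDBLOCK: no data yet. Retrying immediately in a loop
        # would spin the CPU; the caller should return to select() instead.
        return None


a, b = socket.socketpair()      # connected pair standing in for client/agent
b.setblocking(False)

first = try_read(b)             # nothing has been sent yet

a.sendall(b"ping")
readable, _, _ = select.select([b], [], [], 1.0)
second = try_read(b) if readable else None

a.close()                       # peer hangs up
select.select([b], [], [], 1.0)
third = try_read(b)             # EOF is reported as an empty read
b.close()

print(first, second, third)
```

The three results distinguish the cases a read loop has to handle separately: no data yet, data, and end of file.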

is it an ssh(1) that is connecting to the agent?

there is an ssh-agent -d option, you could add some debug()
to troubleshoot.
