
wombat at us
Jun 1, 2000, 11:01 AM
Post #2 of 7
Triggered by Luis' note, and especially this latter comment:

==========
If this is a future vision, something for when we will have lots of nodes in the cluster, I have one more thought. Broadcasts aren't routed... so your entire cluster must fit in the same subnet... it isn't a good thing if you have load balance and other services that may be located in servers on different subnets. Multicasts would do that with more elegance.

Hugs! Luis

[ Luis Claudio R. Goncalves lclaudio [at] conectiva ]
===========

OK, I can throw in some hard-won experience here, not with Heartbeat, but with our own work. The Topology Services (TS) component in IBM's Phoenix cluster services (see Greg Pfister's "In Search of Clusters," or the reference on the Linux-HA TODO page) performs the yeoman work of "heartbeating." We use both unicast and broadcast messages. In addition, we need to scale from 2 nodes to 512 nodes, which means that we have encountered issues with nodes being on multiple subnets.

For the actual keepalive heartbeats, we use a ring model, where a node only heartbeats to its 'neighbor' (normally we do it unidirectionally; an extension to do it bidirectionally is straightforward). We do not use broadcasts for keepalives; nodes only see the unicast messages from their neighbors. If a node does not see a number of heartbeats (the number is configured by the admin), it considers its neighbor dead. Neighbor is based on IP address, NOT on physical location.

We essentially have a logical hierarchy of nodes: a group leader (GL), mayors, and peons. The GL determines overall cluster health, and distributes that information via broadcasts to the other nodes in the cluster. In the case of multiple subnets, the GL sends a unicast to a mayor node on each remote subnet, and the mayor in turn broadcasts the message to its subnet. Peons are simply all of the other nodes :-) We need to recover when a GL or mayor fails, and elect new ones.
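To make the ring keepalive idea above a bit more concrete, here is a minimal Python sketch. It is not the Phoenix/TS code. The names `ring_neighbor` and `NeighborMonitor`, the `interval` parameter, and the choice of "next higher IP address, wrapping around" as the neighbor are all assumptions made for illustration; the description above only says that the neighbor is chosen by IP address and that the allowed number of missed heartbeats is set by the admin.

```python
import ipaddress
from dataclasses import dataclass


def ring_neighbor(node_ip, all_ips):
    """Return the node that node_ip sends its keepalives to.

    Assumption: nodes are ordered numerically by IP address and each node
    heartbeats to the next one in that order, wrapping around at the end.
    """
    ring = sorted(all_ips, key=ipaddress.ip_address)
    position = ring.index(node_ip)
    return ring[(position + 1) % len(ring)]


@dataclass
class NeighborMonitor:
    """Tracks unicast heartbeats arriving from one neighbor (hypothetical helper)."""
    interval: float      # seconds between expected heartbeats (assumed parameter)
    max_missed: int      # admin-configured number of heartbeats that may be missed
    last_seen: float = 0.0

    def heartbeat_received(self, now):
        # Record the arrival time of the latest heartbeat.
        self.last_seen = now

    def neighbor_dead(self, now):
        # Count how many whole heartbeat intervals have passed in silence.
        missed = int((now - self.last_seen) / self.interval)
        return missed >= self.max_missed
```

For example, with nodes 10.0.0.2, 10.0.0.10, and 10.0.1.5, `ring_neighbor("10.0.1.5", ...)` wraps around to 10.0.0.2 even though the two are on different subnets, which matches the point that the neighbor relation is logical rather than physical. A node that finds `neighbor_dead(...)` true would then report the failure to the GL.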
The TS protocols are all combinations of unicast and broadcast, or all unicast if the comm media doesn't support broadcast (or the admin configures TS not to use it). We've always thought multicast would be good, but there are complexities with it regarding setting up multicast masks, recovery, etc., and when we started this work multicast wasn't supported on all of the networks and adapters we needed to support!

Overall cluster status is built by the GL, by collecting status from the nodes. When a node sees its neighbor die (well, when a node sees nothing from its neighbor) it reports "node dead" to the GL, which subsequently distributes the new cluster membership to all of the surviving nodes.

Initial cluster bringup is a set of smaller clusters forming as nodes PROCLAIM to each other; these subclusters eventually coalesce into the One True Cluster. Note that TS requires a pre-defined list of nodes that are to be allowed into the cluster; we then do discovery based on finding the nodes on that list. Nodes not in the list are ignored. The list of nodes can be dynamically modified ("refresh").

The biggest bite here is the need to handle multiple physical subnets, and the need to distribute messages to large numbers of nodes in a timely fashion. This is heartbeating after all, so all of this must happen within a specific number of seconds, even on 500 nodes. The use of broadcast thus kicks in significant complexity in having to store-and-forward protocol messages. Given it all to do over again, using multicast from the beginning might be more effective. Using pure unicast actually makes the code simpler, but introduces significant network traffic and performance concerns.

One comment about resource starvation, and TS thus causing nodes to be shut down: we've seen that on heavily loaded nodes. We've found that setting TS' process priority and pinning it in memory on AIX is not enough to avoid all possible blockages on a node that is very heavily loaded.
I/O interrupts can be a significant problem, as they block any process-level execution, and the efficiency of their handling is a concern (on AIX, at least). Also, simply pinning TS is not sufficient; the various system libraries used by TS, and their data, need to be pinned as well. All of this begins to drag heavily on the overall memory resources available. We (who know the Phoenix code) don't have enough experience yet with Linux to know what different effects may show up there.

One last point (this is already too long, my apologies): Phoenix is strictly layered. TS ONLY deals with heartbeating, and declares which nodes (in fact, which specific comm adapters and networks) are UP or DOWN. Above TS is Group Services (GS), the n-phase commit protocol driver and group membership service. It uses TS to determine which nodes are up/down, and allows its clients to perform barrier synch protocols. Above GS are various monitors, cluster managers, and such. It is up to the layers above TS and GS to control resources, perform recovery and failover, etc.

Peter R. Badovinatz -- (503)578-5530 (TL 775)
Clusters and High Availability, Beaverton, OR
wombat [at] us or IBMUSM00(WOMBAT)