
tv at wlwonline
Mar 30, 1999, 1:19 AM
Alan Robertson <alanr [at] bell-labs> wrote:
> > systems sending 1 KB packets every second. I think if HA is important for a
> > company, a few $$$ for a 100 Mbit hub aren't a problem.
>
> No they aren't. But, something you can set up for the cost of a couple of
> serial cables, and have greater reliability without worrying about power, etc.,
> then that's a good option to have.

As I see it (and maybe that IS specific to the setup I'm working with), when the UPS fails, there's nothing left to send or receive heartbeats anyway. And in my experience, hubs and the like are quite reliable nowadays. We haven't had a single network failure caused by a hub in the year I've been working there.

> > Definitely. The trouble that will be caused by systems having different
> > opinions about who's dead and who's alive would be serious. I'm VERY
> > interested in hearing suggestions to solve that problem.
>
> If your hub fails (or loses power, etc.), your ethernet is toast, and all
> machines think the others are gone out to lunch.

However, in that case it doesn't matter whether or which IPs they take over, because there's no ethernet left on which the collision could cause trouble. :) There might be trouble when the ethernet comes back, though.

> Similar things happen if you
> scrog your routing.

I don't think so. I already have SO_DONTROUTE set; I don't want any routing. To be honest, I first considered using promiscuous mode and dropped that only because it requires root privileges.

> If a serial driver fails, one of your paths is toast. If
> you use redundant heartbeat methods, you have to lose at least two serial
> drivers, and a hub to make the machines incommunicado. They should have no
> points of failure in common (except the inevitable: CPU, memory, etc.)

I agree completely that multiple systems are good. I'd just like to see ethernet as the base (because it's already there) and the other paths as, well, additional.

> > Has anyone read the CONCEPTS file inside the code on my ftp site?
>
> It looks basically sensible, except that for an HA system you MUST have a
> communications mechanism which you trust [the heartbeat comm path], and you
> can use that to negotiate who gets what interface, instead of using a random
> number.

What kind of negotiation do you have in mind?

> I think the views you express about heartbeat takeover strategies assume a lot
> of symmetry which may not exist in the real world. For example, two machines
> may share a disk, so they might be the only candidates in the cluster for who
> takes over certain services they serve.

I have thought about that and have something that I think will work. The basic idea is to define groups, using the CID I already have in the packet. Only nodes that share the CID of the node that died will even try to take over. That way, by setting the CIDs, you can define your network structure. The downside is that nodes with several jobs have to send several heartbeats, one per CID. However, I see no other way to ensure clean takeovers.

Imagine that you have one powerful server that does both FTP and HTTP. You have two fallback servers, but each of them does only one thing: one does FTP, the other HTTP. With the above structure, if the main server fails, fallback server one cleanly takes over FTP and fallback server two takes over HTTP. (I think that is pretty cool. :) ) I don't see a way to ensure this without the main server sending two heartbeats.

> Some of this has been discussed on the list. It might be appropriate to open
> up the discussion some more on the list, as opposed to in private email.

I agree and have cc'ed the list on this reply. I hope neither of you minds.

--
Tom Vogt
System Manager
WLW Online
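Two points from the exchange above can be sketched together: heartbeats sent over UDP with SO_DONTROUTE set (so they only reach directly attached hosts and never cross a router), and Alan's argument for redundant paths, where a peer counts as dead only when every path has gone silent. This is a minimal sketch, not code from Tom's package: the port number, the `node|cid` wire format, the 3-second timeout and the path names `eth0`/`ttyS0` are all hypothetical.

```python
import socket
import time
from typing import Dict, Optional, Tuple

HEARTBEAT_PORT = 6942  # hypothetical port, not taken from the post
TIMEOUT = 3.0          # assumed: seconds of silence before a path counts as down


def make_heartbeat_socket() -> socket.socket:
    """UDP socket for heartbeats that bypasses routing (SO_DONTROUTE)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Only reach hosts on a directly attached network; never go via a gateway.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, 1)
    # Allow sending to the LAN broadcast address.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    return sock


def encode_heartbeat(node: str, cid: str) -> bytes:
    """Hypothetical wire format: b"node|cid"."""
    return f"{node}|{cid}".encode("ascii")


def decode_heartbeat(data: bytes) -> Tuple[str, str]:
    """Split a received datagram back into (node, cid)."""
    node, cid = data.decode("ascii").split("|", 1)
    return node, cid


class Liveness:
    """Last heartbeat time per (peer, path); a peer is dead only when ALL paths are silent."""

    def __init__(self, timeout: float = TIMEOUT) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, Dict[str, float]] = {}

    def heard(self, peer: str, path: str, now: Optional[float] = None) -> None:
        """Record a heartbeat from `peer` on `path` (e.g. "eth0" or "ttyS0")."""
        stamp = time.monotonic() if now is None else now
        self.last_seen.setdefault(peer, {})[path] = stamp

    def is_alive(self, peer: str, now: Optional[float] = None) -> bool:
        """True if at least one path delivered a heartbeat within the timeout."""
        if now is None:
            now = time.monotonic()
        paths = self.last_seen.get(peer, {})
        return any(now - t <= self.timeout for t in paths.values())
```

A sender would call something like `sock.sendto(encode_heartbeat("web1", "http"), ("<broadcast>", HEARTBEAT_PORT))` once a second. Keeping ethernet as just one entry in the per-peer path table fits Tom's preference (ethernet as the base, serial lines as extras) while still following Alan's rule that a single dead hub or serial driver should not make a peer look dead.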
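The CID grouping Tom describes (each node sends one heartbeat per CID, and only nodes sharing a dead node's CID try to take over) can be sketched as a small planning function. The tie-break used here when several nodes share a CID, lowest node name wins, is an assumption standing in for the negotiation over a trusted heartbeat path that Alan suggests; it is not the random-number scheme from the CONCEPTS file, whose details are not given in the post.

```python
from typing import Dict, Set


def plan_takeover(dead: str, cids: Dict[str, Set[str]], alive: Set[str]) -> Dict[str, str]:
    """Map each CID the dead node announced to one surviving node that shares it.

    `cids` maps node name -> CIDs it sends heartbeats for (one heartbeat per CID).
    Tie-break among several candidates: lowest node name wins. This is an
    assumption standing in for a real negotiation over the heartbeat path.
    CIDs with no surviving member are left out of the result.
    """
    plan: Dict[str, str] = {}
    for cid in sorted(cids.get(dead, set())):
        candidates = sorted(
            node for node in alive
            if node != dead and cid in cids.get(node, set())
        )
        if candidates:
            plan[cid] = candidates[0]
    return plan


# Tom's example: one main server doing FTP and HTTP, two single-purpose fallbacks.
cids = {
    "main": {"ftp", "http"},
    "fallback1": {"ftp"},
    "fallback2": {"http"},
}
print(plan_takeover("main", cids, {"fallback1", "fallback2"}))
# prints {'ftp': 'fallback1', 'http': 'fallback2'}
```

A deterministic rule like this only prevents two nodes from grabbing the same service if every node has the same view of who is alive, which is exactly the "different opinions about who's dead" problem raised earlier in the thread.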