
Mailing List Archive: Linux-HA: Users

HA

 

 



root at yekta

Jun 19, 1998, 9:37 PM

Post #1 of 11 (890 views)
Permalink
HA

Dear all,

Hi!

I have a question about High Availability.
In almost all HA systems, there is a heartbeat mechanism to detect failure
of the primary server. If the backup server does not receive a heartbeat,
it assumes that the primary server is faulty. But it is possible that its
own network interface is faulty! It seems that we need another party to
judge which of them is at fault. Is that so?

--Thanks
--Bagheri


mdonald at sofresimr

Jun 21, 1998, 12:50 AM

Post #2 of 11 (873 views)
Permalink
HA [In reply to]

root wrote:
>
> Dear all,
>
> Hi!
>
> I have a question about High Availability.
> In almost all HA systems, there is a heartbeat mechanism to detect failure
> of the primary server. If the backup server does not receive a heartbeat,
> it assumes that the primary server is faulty. But it is possible that its
> own network interface is faulty! It seems that we need another party to
> judge which of them is at fault. Is that so?

It's good to see that you are thinking along these lines. You don't
need to be paranoid to develop HA software, but it helps.

This is why most heart-beat mechanisms are not implemented using a
network. They use either SCSI buses or serial comms (but not TCP/IP).
Typically each server has two network cards and two IP addresses. The
main IP address is used to communicate with clients and the secondary IP
address is used only to communicate with the other server. The second
network card also serves as a standby if the other server fails (you need
to take over both the IP address and the MAC address from the other
server). In addition, there is a second comms channel (SCSI or serial).

The heart-beat goes over the second comms channel. If the heart-beat
fails n times in a row (8 or 16 seem popular), the other server is then
pinged over the network. This detects whether only the second comms
channel has failed (e.g. someone unplugged the connection).
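
The counting logic in the paragraph above can be sketched roughly as
follows. This is only an illustration in Python: the class name, the
function name and the returned labels are hypothetical, and treating a
failed ping as grounds for fail-over is an assumption that the post does
not state explicitly.

```python
class HeartbeatMonitor:
    """Counts consecutive missed heartbeats on the dedicated channel."""

    def __init__(self, threshold=8):
        # 8 or 16 misses are the values mentioned as popular.
        self.threshold = threshold
        self.missed = 0

    def record(self, received):
        """Record one interval; return True once a network check is due."""
        if received:
            self.missed = 0
            return False
        self.missed += 1
        return self.missed >= self.threshold


def next_action(monitor, heartbeat_received, ping_ok):
    """Decide what to do after one heartbeat interval.

    ping_ok is a caller-supplied function (hypothetical) that pings the
    peer over the network and returns True if it answers.
    """
    if not monitor.record(heartbeat_received):
        return "ok"
    # The peer answers on the network: only the heartbeat channel is down.
    # The peer is silent everywhere: assumed here to justify a fail-over.
    return "channel-fault" if ping_ok() else "fail-over"
```

The ping is only attempted once the miss threshold is reached, so a
single lost heartbeat never triggers any action.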

I've seen one implementation for FreeBSD which uses a Perl daemon for
the heart-beat. The heart-beat went across both channels all the time.
It had to fail on both channels for a fail-over to be triggered.

cheers Matthew
--
____________________________________________________________________
Matthew Donald Ph +61 3 9864 0706
Technical Director mdonald [at] sofresimr
IMR Worldwide http://www.sofresimr.com


root at yekta

Jun 21, 1998, 2:57 AM

Post #3 of 11 (865 views)
Permalink
HA [In reply to]

Dear Matthew,

Hi,
The source of the failure is important!
If the source is the backup itself, a takeover is not required, and
if the source is the primary, then the takeover must be done.
The backup must not take the place of the primary if it is itself faulty
and is the source of the fault. If I am mistaken, please explain further.

--Thanks
--Bagheri

On Sun, 21 Jun 1998, Matthew Donald wrote:

> Mr. Bagheri wrote:
> >
> > Thanks for your good information. Clients communicate with the primary
> > server via the network, so if the network component of the server
> > (including network interface cards and TCP/IP modules) is faulty, then
> > the clients cannot communicate with the server. So it is better that the
> > backup server check the network component of the primary and, when it
> > detects a problem, try to find the source of the fault. The backup can
> > detect whether the primary is up or down via the SCSI bus. If it detects
> > that the primary is up, it knows that there is a problem in a network
> > component. But how does it detect the source of the fault? Is its own
> > network component faulty, or the primary's?
> >
>
> The aim of the HA component is to decide whether to do a take-over or not.
> It is not responsible for detecting the source of the problem - that
> should be left for manual analysis.
>
> So there are three scenarios:
>
> 1. Communication works on both channels - normal operation
> 2. Communication works on one channel - log error, but continue
> normally
> 3. Communication fails on both channels - fail-over.
>
> cheers Matthew
>
>
> --
> ____________________________________________________________________
> Matthew Donald Ph +61 3 9864 0706
> Technical Director mdonald [at] sofresimr
> IMR Worldwide http://www.sofresimr.com
>


mdonald at sofresimr

Jun 21, 1998, 3:12 AM

Post #4 of 11 (865 views)
Permalink
HA [In reply to]

Mr. Bagheri wrote:
>
> Thanks for your good information. Clients communicate with the primary
> server via the network, so if the network component of the server
> (including network interface cards and TCP/IP modules) is faulty, then
> the clients cannot communicate with the server. So it is better that the
> backup server check the network component of the primary and, when it
> detects a problem, try to find the source of the fault. The backup can
> detect whether the primary is up or down via the SCSI bus. If it detects
> that the primary is up, it knows that there is a problem in a network
> component. But how does it detect the source of the fault? Is its own
> network component faulty, or the primary's?
>

The aim of the HA component is to decide whether to do a take-over or not.
It is not responsible for detecting the source of the problem - that
should be left for manual analysis.

So there are three scenarios:

1. Communication works on both channels - normal operation
2. Communication works on one channel - log error, but continue
normally
3. Communication fails on both channels - fail-over.
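
The three scenarios map directly onto a small decision function. A
minimal sketch in Python; the function name, the argument names and the
returned labels are made up for illustration and do not come from any
particular HA package.

```python
def decide(serial_ok, network_ok):
    """Map the state of the two heartbeat channels to an action.

    serial_ok and network_ok are booleans saying whether the peer was
    heard on the second comms channel and on the network respectively.
    """
    working = int(serial_ok) + int(network_ok)
    if working == 2:
        return "normal"      # scenario 1: both channels work
    if working == 1:
        return "log-error"   # scenario 2: one channel down, carry on
    return "fail-over"       # scenario 3: both channels down
```

Note that the function never tries to work out which side is at fault.
It only decides whether to take over.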

cheers Matthew


--
____________________________________________________________________
Matthew Donald Ph +61 3 9864 0706
Technical Director mdonald [at] sofresimr
IMR Worldwide http://www.sofresimr.com


mdonald at sofresimr

Jun 21, 1998, 3:40 AM

Post #5 of 11 (871 views)
Permalink
HA [In reply to]

root wrote:
>
> Dear Matthew,
>
> Hi,
> The source of the failure is important!
> If the source is the backup itself, a takeover is not required, and
> if the source is the primary, then the takeover must be done.
> The backup must not take the place of the primary if it is itself faulty
> and is the source of the fault. If I am mistaken, please explain further.

Sure, the source of the fault is important. But it is not the task of
the HA software to determine the source of the fault. It should simply
log the symptoms. Keep the HA software as simple as possible.

cheers Matthew
--
____________________________________________________________________
Matthew Donald Ph +61 3 9864 0706
Technical Director mdonald [at] sofresimr
IMR Worldwide http://www.sofresimr.com


h.milz at seneca

Jun 21, 1998, 7:14 AM

Post #6 of 11 (870 views)
Permalink
HA [In reply to]

root (root [at] yekta) wrote:
> In almost all HA systems, there is a heartbeat mechanism to detect failure
> of the primary server. If the backup server does not receive a heartbeat,
> it assumes that the primary server is faulty. But it is possible that its
> own network interface is faulty! It seems that we need another party to
> judge which of them is at fault. Is that so?

Please read the Linux HA HOWTO about multiple HB paths. You're right but
this has been invented before.


h.milz at seneca

Jun 21, 1998, 7:17 AM

Post #7 of 11 (869 views)
Permalink
HA [In reply to]

Matthew Donald (mdonald [at] sofresimr) wrote:
> This is why most heart-beat mechanisms are not implemented using a
> network. They use either SCSI buses or serial comms (but not TCP/IP).

You _must_ also use a TCP/IP based HB mechanism because this is the only
way to detect network card problems.

Folks, please - before you start to think about problems that were
solved in the industry years ago, please read the Linux HA HOWTO. It
describes a Linux adaptation of a widely used commercial HA package (namely
IBM's HACMP, for which I happened to be a second level technical support
specialist for some time).

The basis is there.


yzhang at integrix

Jun 22, 1998, 10:21 AM

Post #8 of 11 (868 views)
Permalink
HA [In reply to]

At 06:37 AM 6/20/98 +0200, you wrote:
>Dear all,
>
>Hi!
>
>I have a question about High Availability.
>In almost all HA systems, there is a heartbeat mechanism to detect failure
>of the primary server. If the backup server does not receive a heartbeat,
>it assumes that the primary server is faulty. But it is possible that its
>own network interface is faulty! It seems that we need another party to
>judge which of them is at fault. Is that so?
>

I saw the replies to this post.

I myself orchestrated an HA product on Solaris (SunOS 5.x).
We pinpoint whether a failure is due to a local network adapter fault
at a low level (but in user land). Sun's NIC driver maintains and
increments the I/O packet error and nocarrier counts, which are
accessible via kstat(3) from an application program.

If these counters increase in proportion to the probing
packets you send (once you have lost some heartbeats), then we assert
that the local adapter has failed. We use libcap (link level) to send
the packets (to bypass routing and make sure each packet really is
sent via the suspect NIC).
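
The proportionality test can be sketched as below. This is a rough
Python illustration, not the Solaris product's code: the function name is
hypothetical, the counter values are assumed to have been read already
(for example via kstat), and the 0.8 ratio is an arbitrary assumption.

```python
def local_adapter_suspect(errors_before, errors_after, probes_sent,
                          ratio=0.8):
    """Return True if the error counter grew roughly in step with probes.

    errors_before / errors_after: the NIC error (or nocarrier) counter
    sampled before and after sending the probe packets.
    probes_sent: number of probe packets pushed out of the suspect NIC.
    ratio: assumed fraction of probes that must show up as new errors.
    """
    if probes_sent <= 0:
        raise ValueError("probes_sent must be positive")
    new_errors = errors_after - errors_before
    return new_errors / probes_sent >= ratio
```

For example, 10 probes that add 10 to the error counter point at the
local adapter, while 10 probes that add only 1 do not.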

I do not know if Linux also has a similar interface.

As the "In Search of Clusters" book says,
the first priority of HA is "do no harm"; the heartbeat
should go via network/serial and the SCSI bus.


Y Zhang


alan at lxorguk

Jun 22, 1998, 10:33 AM

Post #9 of 11 (869 views)
Permalink
HA [In reply to]

> If these counters increase in proportion to the probing
> packets you send (once you have lost some heartbeats), then we assert
> that the local adapter has failed. We use libcap (link level) to send
> the packets (to bypass routing and make sure each packet really is
> sent via the suspect NIC).
>
> I do not know if Linux also has a similar interface.

You can see the error counters under Linux and send raw packets, but
not every PC NIC chip keeps useful counters: some lie and some merely
pretend to for compatibility. With 'decent' network cards that should
work fine.
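
One way to read those counters on Linux is to parse /proc/net/dev. The
sketch below is in Python and assumes the column layout used by current
kernels (two header lines, then sixteen numeric fields per interface);
older kernels may lay the file out differently. The function name and the
dictionary keys are made up for illustration.

```python
def parse_net_dev(text):
    """Extract error counters per interface from /proc/net/dev content.

    Assumed field order after the interface name:
      receive : bytes packets errs drop fifo frame compressed multicast
      transmit: bytes packets errs drop fifo colls carrier compressed
    """
    counters = {}
    for line in text.splitlines()[2:]:   # skip the two header lines
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 16:
            continue                     # not a layout we understand
        values = [int(f) for f in fields[:16]]
        counters[name.strip()] = {
            "rx_errs": values[2],
            "tx_errs": values[10],
            "tx_carrier": values[14],
        }
    return counters
```

Reading the file and passing its contents to parse_net_dev() twice, a few
seconds apart, gives the before and after values. As noted above, whether
those numbers mean anything depends on the network card.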


yzhang at integrix

Jun 22, 1998, 10:59 AM

Post #10 of 11 (865 views)
Permalink
HA [In reply to]

At 08:57 PM 6/22/98 +0200, you wrote:
>
>Hello!
>
>On Mon, 22 Jun 1998, Yiming Zhang wrote:
>
>> As the "In Search of Clusters" book says,
>
>Do you have an ISBN number for this book?


As the title indicates, it covers clusters; HA is only one
of the topics.

Make sure what you get is the 2nd edition.


http://www.prenhall.com/ns-search/ptrbooks/ptr_0138997098.html?NS-search-set=/358e9/aaaa0060u8e9aa6&NS-doc-offset=0&


hm at seneca

Jun 22, 1998, 1:37 PM

Post #11 of 11 (869 views)
Permalink
HA [In reply to]

On Mon, Jun 22, 1998 at 11:55:17AM +0430, root wrote:
> Dear Hilz,

Nice abbreviation...

> I have read the Linux HA HOWTO, but if there is a problem on all of the
> multiple HB paths, then neither the backup nor the primary can detect
> what is wrong.

Right. This is described in what is currently chapter 8.4 of the HA HOWTO:
"If heartbeats are no longer received from a node, a timeout counter
starts, and after a number of heartbeat packets missing, a failure of
this node is assumed. "
and in 8.6:
"Set up a "host ping file" on each node, ".

Hope this helps.
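
A rough Python sketch of how a host ping file might be used to tell
"I am cut off" from "the peer is dead". Everything here is an assumption
for illustration: the file format (one host per line, "#" comments), the
function names, and the rule that a node which reaches none of the listed
hosts should suspect its own network. The HOWTO's actual format and
procedure may differ.

```python
def read_ping_hosts(text):
    """Parse an assumed host ping file: one host per line, '#' comments."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.append(line)
    return hosts


def locally_isolated(hosts, reachable):
    """Return True if none of the listed hosts can be reached.

    reachable is a caller-supplied function (hypothetical) that pings
    one host and returns True on success. An empty host list gives no
    evidence either way, so it is reported as not isolated.
    """
    return bool(hosts) and not any(reachable(h) for h in hosts)
```

A node that finds itself isolated in this sense would have good reason
not to take over, since the fault is more likely on its own side.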

>
> --Thanks
> --Bagheri
>
> On Sun, 21 Jun 1998, Harald Milz wrote:
>
> > root (root [at] yekta) wrote:
> > > In almost all HA systems, there is a heartbeat mechanism to detect
> > > failure of the primary server. If the backup server does not receive
> > > a heartbeat, it assumes that the primary server is faulty. But it is
> > > possible that its own network interface is faulty! It seems that we
> > > need another party to judge which of them is at fault. Is that so?
> >
> > Please read the Linux HA HOWTO about multiple HB paths. You're right but
> > this has been invented before.
> >

--
Democracy is a device that insures we shall be governed no better than
we deserve.
-- George Bernard Shaw
