
Mailing List Archive: Linux-HA: Dev

Heartbeat 0.45 experiences

 

 



steve at wirex

Oct 18, 1999, 1:28 PM

Post #1 of 12 (2254 views)
Heartbeat 0.45 experiences

Hi Alan, ha-developers,

I toyed around with 0.45 release last week, and thought I would report
my results. Mostly, I am very happy with it; however, I did run into a
few snags of varying seriousness.

1. "/etc/rc.d/init.d/heartbeat restart" only stops heartbeat, it does
not actually restart it.

2. If heartbeat has failed over to the backup machine, and then the
heartbeat on the backup machine is cleanly stopped, it keeps the
resource even though it claims to have relinquished it (i.e. it
still has the IP address it took over from the original host).

3. One of my test machines is a laptop with a PCMCIA ethernet card.
When I yanked the card out, heartbeat failed over to the other
machine just fine, but when I put my NIC back in, the alias
interface was not recreated. Heartbeat was running on my laptop
the entire time, and was attempting to send out heartbeats on the
interface that no longer existed.

While I can see that laptops are unlikely HA hardware, I can
foresee using PCMCIA cards as hot-swappable devices. Something to
think about, though I can understand a response of "not our problem."

4. Situation: two machines, "good" and "bad". Bad has a failing disk, which
corrupts a few files on its filesystem. Good is the primary, bad is
the backup. On startup, good does not successfully grab the resource.
However, killing the heartbeat on good causes bad to successfully
take over. Restarting the heartbeat on good causes bad to relinquish,
but again good unsuccessfully attempts to take the resource.

Here's the typical sort of log on good:

heartbeat: 1999/10/14_15:24:53 info: ***********************
heartbeat: 1999/10/14_15:24:53 info: Configuration validated. Starting heartbeat.
heartbeat: 1999/10/14_15:24:53 notice: UDP heartbeat started on port 1001 interface eth0
heartbeat: 1999/10/14_15:24:53 error: Cannot open /proc/ha/.control: No such file or directory
heartbeat: 1999/10/14_15:24:59 warn: node bad.int.wirex.com: is dead
heartbeat: 1999/10/14_15:24:59 INFO: Running /etc/ha.d/rc.d/status status

and then nothing.

Occasionally, I would see something like this in good's logs when bad
would start up:

heartbeat: 1999/10/14_14:32:04 info: ***********************
heartbeat: 1999/10/14_14:32:04 info: Configuration validated. Starting heartbeat.
heartbeat: 1999/10/14_14:32:04 notice: UDP heartbeat started on port 1001 interface eth0
heartbeat: 1999/10/14_14:32:04 error: Cannot open /proc/ha/.control: No such file or directory
heartbeat: 1999/10/14_14:32:14 error: string2msg: no MSG_START
heartbeat: 1999/10/14_14:32:14 error: Bad message is: [9^F]

Nothing of interest was found in the debug log.

All of this went away as soon as I stopped using bad, and started
using a third machine with good.

In this instance, the software failed me. It was unable to detect
that one of my machines was mildly insane; furthermore, the way the
problems manifested themselves, I was starting to believe the problem
was with the machine good, until I looked in bad's /var/log/messages
and saw disk errors.

Also note that I am using md5 authentication, and good did not complain
about bad's packets failing authentication (these would have shown up
in the debug log). Which begs me to ask: what is the security model
behind the authentication scheme? What sort of threats are you
attempting to prevent by using it?

If necessary, I can recreate the situation with bad, though I don't
have a lot of time that I can allocate to it.

5. Oh yeah, the proc module does not compile under 2.0.36, which is what
all my machines in my testbed are running.

Hope this is of use to you. Let me know if I can provide you with more
information. Thanks.

Steve


th at ant

Oct 18, 1999, 1:51 PM

Post #2 of 12 (2257 views)
Heartbeat 0.45 experiences [In reply to]

Hi,
On Mon, Oct 18, 1999 at 01:28:18PM -0700, Steve Beattie wrote:
>
> 2. If heartbeat has failed over to the backup machine, and then the
> heartbeat on the backup machine is cleanly stopped, it keeps the
> resource even though it claims to have relinquished it (i.e. it
> still has the IP address it took over from the original host).

Same here. Still looking for some more debug output

>
> 4. Situation: two machines, "good" and "bad". Bad has a failing disk, which
> corrupts a few files on its filesystem. Good is the primary, bad is
> the backup. On startup, good does not successfully grab the resource.
> However, killing the heartbeat on good causes bad to successfully
> take over. Restarting the heartbeat on good causes bad to relinquish,
> but again good unsuccessfully attempts to take the resource.
>
> Here's the typical sort of log on good:
>
> heartbeat: 1999/10/14_15:24:53 info: ***********************
> heartbeat: 1999/10/14_15:24:53 info: Configuration validated. Starting heartbeat.
> heartbeat: 1999/10/14_15:24:53 notice: UDP heartbeat started on port 1001 interface eth0
> heartbeat: 1999/10/14_15:24:53 error: Cannot open /proc/ha/.control: No such file or directory
> heartbeat: 1999/10/14_15:24:59 warn: node bad.int.wirex.com: is dead
> heartbeat: 1999/10/14_15:24:59 INFO: Running /etc/ha.d/rc.d/status status
>
> and then nothing.

Yup the same here ...

This happens here after a longer uptime. After the start on both nodes,
killing and starting heartbeat on the master works as expected. The
next day, the slave will take the resources when heartbeat is stopped on the
master, but it will not release them after the master is started. The result
then is that both nodes have the IP address.

So the question is: how do I debug this?


Thomas
--
-----------------------------------------------
| Thomas Hepper th [at] ant |
| ( If the above address fail try ) |
| ( thomas.hepper [at] planet-interkom) |
-----------------------------------------------


alanr at bell-labs

Oct 18, 1999, 3:03 PM

Post #3 of 12 (2242 views)
Heartbeat 0.45 experiences [In reply to]

Hi Steve,

Thanks for the detailed report.

Steve Beattie wrote:
>
> Hi Alan, ha-developers,
>
> I toyed around with 0.45 release last week, and thought I would report
> my results. Mostly, I am very happy with it; however, I did run into a
> few snags of varying seriousness.
>
> 1. "/etc/rc.d/init.d/heartbeat restart" only stops heartbeat, it does
> not actually restart it.

There are lock problems which might occur if the start actually started before
the stop was complete. I suspect that of being the trouble. Try putting a
sleep 2 in between the StopHA and StartHA calls right around line 257 to see
if that fixes the problem. There should be a message about locking problems,
but it might get lost in this circumstance.

> 2. If heartbeat has failed over to the backup machine, and then the
> heartbeat on the backup machine is cleanly stopped, it keeps the
> resource even though it claims to have relinquished it (i.e. it
> still has the IP address it took over from the original host).

This was just reported to me separately. The particular instance reported to
me earlier was due to a strange behavior of the apache script when given a
"status" argument. There are other possibilities to check out as well.
Another would be that certain buggy versions of ifconfig don't show aliases
when you just do "ifconfig". When they don't tell me, I don't know to take
them down. Logs and config files from the backup machine would help.

I rely on the "status" argument to tell me whether I have a particular
resource, so I can tell whether to give it up.
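
A minimal sketch of that decision in C, for illustration only. It assumes the check amounts to looking for the word "running" in whatever "script-name status" prints; the function name resource_is_held and the extra guard against "not running" are hypothetical and are not taken from the heartbeat sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not from the heartbeat sources: decide from the text
 * printed by "<resource-script> status" whether this node holds the resource. */
bool resource_is_held(const char *status_output)
{
    assert(status_output != NULL);

    /* Added guard (an assumption): "not running" also contains "running". */
    if (strstr(status_output, "not running") != NULL)
        return false;

    return strstr(status_output, "running") != NULL;
}
```

Under these assumptions, a status script that prints anything other than "running" while the resource is up makes the function return false, so the resource would be kept at shutdown. That is consistent with the symptom reported in item 2.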

> 3. One of my test machines is a laptop with a PCMCIA ethernet card.
> When I yanked the card out, heartbeat failed over to the other
> machine just fine, but when I put my NIC back in, the alias
> interface was not recreated. Heartbeat was running on my laptop
> the entire time, and was attempting to send out heartbeats on the
> interface that no longer existed.
>
> While I can see that laptops are unlikely HA hardware, I can
> foresee using PCMCIA cards as hot-swappable devices. Something to
> think about, though I can understand a response of "not our problem."

I highly recommend that you consider redundant heartbeat media for this and
other reasons. This problem is not one that the current code handles
correctly. It's on the TODO list on the web. It's roughly the same as
pulling the cable. That will have the same effect. This is our problem, but
we can't handle it the way you expect right at this point in time.

> 4. Situation: two machines, "good" and "bad". Bad has a failing disk, which
> corrupts a few files on its filesystem. Good is the primary, bad is
> the backup. On startup, good does not successfully grab the resource.
> However, killing the heartbeat on good causes bad to successfully
> take over. Restarting the heartbeat on good causes bad to relinquish,
> but again good unsuccessfully attempts to take the resource.
>
> Here's the typical sort of log on good:
>
> heartbeat: 1999/10/14_15:24:53 info: ***********************
> heartbeat: 1999/10/14_15:24:53 info: Configuration validated. Starting heartbeat.
> heartbeat: 1999/10/14_15:24:53 notice: UDP heartbeat started on port 1001 interface eth0
> heartbeat: 1999/10/14_15:24:53 error: Cannot open /proc/ha/.control: No such file or directory
> heartbeat: 1999/10/14_15:24:59 warn: node bad.int.wirex.com: is dead
> heartbeat: 1999/10/14_15:24:59 INFO: Running /etc/ha.d/rc.d/status status
>
> and then nothing.
>
> Occasionally, I would see something like this in good's logs when bad
> would start up:
>
> heartbeat: 1999/10/14_14:32:04 info: ***********************
> heartbeat: 1999/10/14_14:32:04 info: Configuration validated. Starting heartbeat.
> heartbeat: 1999/10/14_14:32:04 notice: UDP heartbeat started on port 1001 interface eth0
> heartbeat: 1999/10/14_14:32:04 error: Cannot open /proc/ha/.control: No such file or directory
> heartbeat: 1999/10/14_14:32:14 error: string2msg: no MSG_START
> heartbeat: 1999/10/14_14:32:14 error: Bad message is: [9^F]
>
> Nothing of interest was found in the debug log.
>
> All of this went away as soon as I stopped using bad, and started
> using a third machine with good.
>
> In this instance, the software failed me. It was unable to detect
> that one of my machines was mildly insane; furthermore, the way the
> problems manifested themselves, I was starting to believe the problem
> was with the machine good, until I looked in bad's /var/log/messages
> and saw disk errors.
>
> Also note that I am using md5 authentication, and good did not complain
> about bad's packets failing authentication (these would have shown up
> in the debug log). Which begs me to ask: what is the security model
> behind the authentication scheme? What sort of threats are you
> attempting to prevent by using it?

If you crank up the debug level (with several SIGUSR1 signals), then you can
see what's going on here. I seem to recall that I log packets with
invalid auth information.

> If necessary, I can recreate the situation with bad, though I don't
> have a lot of time that I can allocate to it.
>
> 5. Oh yeah, the proc module does not compile under 2.0.36, which is what
> all my machines in my testbed are running.

I mangled Volker's code, and it won't compile correctly on old kernels. I
don't think it's complicated, but I don't have any way to test it to fix it
here.

> Hope this is of use to you. Let me know if I can provide you with more
> information. Thanks.
>
> Steve

I very much appreciate the report. Let me know what you find out.

-- Alan Robertson
alanr [at] bell-labs


alanr at bell-labs

Oct 18, 1999, 3:09 PM

Post #4 of 12 (2209 views)
Heartbeat 0.45 experiences [In reply to]

Thomas Hepper wrote:
>
> Hi,
> On Mon, Oct 18, 1999 at 01:28:18PM -0700, Steve Beattie wrote:
> >
> > 2. If heartbeat has failed over to the backup machine, and then the
> > heartbeat on the backup machine is cleanly stopped, it keeps the
> > resource even though it claims to have relinquished it (i.e. it
> > still has the IP address it took over from the original host).
>
> Same here. Still looking for some more debug output

See my earlier messages on this subject. There's a SuSE bug I identified
earlier. Thomas, could you check to see if that's your situation? These two
things have to be true:
1) I have to be able to rely on script-name status to give "running" when
it's up
2) I have to be able to rely on ifconfig to give the alias names when
they're configured
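
The second condition can be checked by hand with a small C sketch like the one below. It is an illustration, not code from heartbeat: the function name alias_is_listed is hypothetical, and it assumes the classic ifconfig layout in which each interface name (for example eth0:0) starts in the first column of its own line and is followed by whitespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if some line of `ifconfig_output` starts with
 * the interface name `alias` (for example "eth0:0") followed by whitespace.
 * Assumes each interface name begins in the first column of a line. */
bool alias_is_listed(const char *ifconfig_output, const char *alias)
{
    assert(ifconfig_output != NULL && alias != NULL);

    size_t len = strlen(alias);
    if (len == 0)
        return false;

    for (const char *line = ifconfig_output; *line != '\0'; ) {
        if (strncmp(line, alias, len) == 0) {
            char next = line[len];
            if (next == ' ' || next == '\t' || next == '\n' || next == '\0')
                return true;
        }
        /* Move to the start of the next line, if there is one. */
        const char *eol = strchr(line, '\n');
        if (eol == NULL)
            break;
        line = eol + 1;
    }
    return false;
}
```

Feeding it the captured output of a plain "ifconfig" on the node that took over shows whether that ifconfig version reports the alias at all. If it returns false while the address still answers on the network, the alias is invisible to anything that relies on that listing.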
>
> >
> > 4. Situation: two machines, "good" and "bad". Bad has a failing disk, which
> > corrupts a few files on its filesystem. Good is the primary, bad is
> > the backup. On startup, good does not successfully grab the resource.
> > However, killing the heartbeat on good causes bad to successfully
> > take over. Restarting the heartbeat on good causes bad to relinquish,
> > but again good unsuccessfully attempts to take the resource.
> >
> > Here's the typical sort of log on good:
> >
> > heartbeat: 1999/10/14_15:24:53 info: ***********************
> > heartbeat: 1999/10/14_15:24:53 info: Configuration validated. Starting heartbeat.
> > heartbeat: 1999/10/14_15:24:53 notice: UDP heartbeat started on port 1001 interface eth0
> > heartbeat: 1999/10/14_15:24:53 error: Cannot open /proc/ha/.control: No such file or directory
> > heartbeat: 1999/10/14_15:24:59 warn: node bad.int.wirex.com: is dead
> > heartbeat: 1999/10/14_15:24:59 INFO: Running /etc/ha.d/rc.d/status status
> >
> > and then nothing.
>
> Yup the same here ...
>
> This happens here after a longer uptime. After the start on both nodes,
> killing and starting heartbeat on the master works as expected. The
> next day, the slave will take the resources when heartbeat is stopped on the
> master, but it will not release them after the master is started. The result
> then is that both nodes have the IP address.
>
> So the question is: how do I debug this?

I have still never seen this one. Thomas: Is this still happening to you in
0.4.5?

You could try cranking up the debug level on the slave just before restarting
the master. I don't think I've seen your logs for this circumstance...

I still need to see detailed logs from both sides with config files.

What OS are you running? Can I see your (non-auth) config files?


-- Alan Robertson
alanr [at] bell-labs


alanr at bell-labs

Oct 18, 1999, 5:16 PM

Post #5 of 12 (2253 views)
Heartbeat 0.45 experiences [In reply to]

Thomas Hepper wrote:
>
> Hi,
> On Mon, Oct 18, 1999 at 01:28:18PM -0700, Steve Beattie wrote:
> >
> > 2. If heartbeat has failed over to the backup machine, and then the
> > heartbeat on the backup machine is cleanly stopped, it keeps the
> > resource even though it claims to have relinquished it (i.e. it
> > still has the IP address it took over from the original host).
>
> Same here. Still looking for some more debug output
>
> >
> > 4. Situation: two machines, "good" and "bad". Bad has a failing disk, which
> > corrupts a few files on its filesystem. Good is the primary, bad is
> > the backup. On startup, good does not successfully grab the resource.
> > However, killing the heartbeat on good causes bad to successfully
> > take over. Restarting the heartbeat on good causes bad to relinquish,
> > but again good unsuccessfully attempts to take the resource.
> >
> > Here's the typical sort of log on good:
> >
> > heartbeat: 1999/10/14_15:24:53 info: ***********************
> > heartbeat: 1999/10/14_15:24:53 info: Configuration validated. Starting heartbeat.
> > heartbeat: 1999/10/14_15:24:53 notice: UDP heartbeat started on port 1001 interface eth0
> > heartbeat: 1999/10/14_15:24:53 error: Cannot open /proc/ha/.control: No such file or directory
> > heartbeat: 1999/10/14_15:24:59 warn: node bad.int.wirex.com: is dead
> > heartbeat: 1999/10/14_15:24:59 INFO: Running /etc/ha.d/rc.d/status status
> >
> > and then nothing.
>
> Yup the same here ...
>
> This happens here after a longer uptime. After the start on both nodes,
> killing and starting heartbeat on the master works as expected. The
> next day, the slave will take the resources when heartbeat is stopped on the
> master, but it will not release them after the master is started. The result
> then is that both nodes have the IP address.
>
> So the question is: how do I debug this?
>
> Thomas

The two things I thought of for debugging this are:

1) do the "resource-script status" for the resources you think
it should give up before restarting the master. You can
do this at any time. Any time it doesn't show the
"running" on the right resources, things are broken.
Maybe I should do this after taking over any resources.

2) Turn on debug to level 5 (5 SIGUSR1's) on the slave
just before restarting the master. After the master
has restarted (a minute or two elapsed), you can run
the debug level back down again on the slave.

Could you try these? The first one is "harmless" and can be done at any time.
I expect this one to be the problem.
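
Step 2 can be scripted instead of sending each signal by hand. The C sketch below assumes only what is stated above, namely that each SIGUSR1 raises the debug level by one. The helper names send_repeated and raise_debug_level are hypothetical. The short pause between sends is there because ordinary POSIX signals are not queued: several SIGUSR1s that arrive before the first is handled may be delivered as one.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <time.h>

/* Send `sig` to `pid` `count` times, pausing briefly between sends.
 * Returns the number of signals that were sent successfully. */
int send_repeated(pid_t pid, int sig, int count)
{
    assert(count >= 0);

    const struct timespec gap = { 0, 100 * 1000 * 1000 }; /* 100 ms */
    int sent = 0;

    for (int i = 0; i < count; i++) {
        if (kill(pid, sig) != 0)
            break;              /* e.g. no such process, or no permission */
        sent++;
        nanosleep(&gap, NULL);  /* give the receiver time to handle it */
    }
    return sent;
}

/* Raise the debug level by `steps`, assuming one SIGUSR1 per step. */
int raise_debug_level(pid_t heartbeat_pid, int steps)
{
    return send_repeated(heartbeat_pid, SIGUSR1, steps);
}
```

Calling raise_debug_level(pid, 5) with the pid of the running heartbeat on the slave would correspond to "level 5" as described in step 2. A return value below 5 means a send failed partway through.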

Thanks guys!

-- Alan Robertson
alanr [at] bell-labs


alanr at bell-labs

Oct 18, 1999, 5:40 PM

Post #6 of 12 (2240 views)
Heartbeat 0.45 experiences [In reply to]

I sent this, but it's not in the archives, so I'm resending it...

Thomas Hepper wrote:
>
> Hi,
> On Mon, Oct 18, 1999 at 01:28:18PM -0700, Steve Beattie wrote:
> >
> > 2. If heartbeat has failed over to the backup machine, and then the
> > heartbeat on the backup machine is cleanly stopped, it keeps the
> > resource even though it claims to have relinquished it (i.e. it
> > still has the IP address it took over from the original host).
>
> Same here. Still looking for some more debug output

See my earlier messages on this subject. There's a SuSE bug I identified
earlier. Thomas, could you check to see if that's your situation? These two
things have to be true:
1) I have to be able to rely on script-name status to give "running" when
it's up
2) I have to be able to rely on ifconfig to give the alias names when
they're configured
>
> >
> > 4. Situation: two machines, "good" and "bad". Bad has a failing disk, which
> > corrupts a few files on its filesystem. Good is the primary, bad is
> > the backup. On startup, good does not successfully grab the resource.
> > However, killing the heartbeat on good causes bad to successfully
> > take over. Restarting the heartbeat on good causes bad to relinquish,
> > but again good unsuccessfully attempts to take the resource.
> >
> > Here's the typical sort of log on good:
> >
> > heartbeat: 1999/10/14_15:24:53 info: ***********************
> > heartbeat: 1999/10/14_15:24:53 info: Configuration validated. Starting heartbeat.
> > heartbeat: 1999/10/14_15:24:53 notice: UDP heartbeat started on port 1001 interface eth0
> > heartbeat: 1999/10/14_15:24:53 error: Cannot open /proc/ha/.control: No such file or directory
> > heartbeat: 1999/10/14_15:24:59 warn: node bad.int.wirex.com: is dead
> > heartbeat: 1999/10/14_15:24:59 INFO: Running /etc/ha.d/rc.d/status status
> >
> > and then nothing.
>
> Yup the same here ...
>
> This happens here after a longer uptime. After the start on both nodes,
> killing and starting heartbeat on the master works as expected. The
> next day, the slave will take the resources when heartbeat is stopped on the
> master, but it will not release them after the master is started. The result
> then is that both nodes have the IP address.
>
> So the question is: how do I debug this?

I have still never seen this one. Thomas: Is this still happening to you in
0.4.5?

You could try cranking up the debug level on the slave just before restarting
the master. I don't think I've seen your logs for this circumstance...

I still need to see detailed logs from both sides with config files.

What OS are you running? Can I see your (non-auth) config files?


-- Alan Robertson
alanr [at] bell-labs


alanr at bell-labs

Oct 18, 1999, 5:43 PM

Post #7 of 12 (2247 views)
Heartbeat 0.45 experiences [In reply to]

This is not in the archives, so I'm resending...

Hi Steve,

Thanks for the detailed report.

Steve Beattie wrote:
>
> Hi Alan, ha-developers,
>
> I toyed around with 0.45 release last week, and thought I would report
> my results. Mostly, I am very happy with it; however, I did run into a
> few snags of varying seriousness.
>
> 1. "/etc/rc.d/init.d/heartbeat restart" only stops heartbeat, it does
> not actually restart it.

There are lock problems which might occur if the start actually started before
the stop was complete. I suspect that of being the trouble. Try putting a
sleep 2 in between the StopHA and StartHA calls right around line 257 to see
if that fixes the problem. There should be a message about locking problems,
but it might get lost in this circumstance.

> 2. If heartbeat has failed over to the backup machine, and then the
> heartbeat on the backup machine is cleanly stopped, it keeps the
> resource even though it claims to have relinquished it (i.e. it
> still has the IP address it took over from the original host).

This was just reported to me separately. The particular instance reported to
me earlier was due to a strange behavior of the apache script when given a
"status" argument. There are other possibilities to check out as well.
Another would be that certain buggy versions of ifconfig don't show aliases
when you just do "ifconfig". When they don't tell me, I don't know to take
them down. Logs and config files from the backup machine would help.

I rely on the "status" argument to tell me whether I have a particular
resource, so I can tell whether to give it up.

> 3. One of my test machines is a laptop with a PCMCIA ethernet card.
> When I yanked the card out, heartbeat failed over to the other
> machine just fine, but when I put my NIC back in, the alias
> interface was not recreated. Heartbeat was running on my laptop
> the entire time, and was attempting to send out heartbeats on the
> interface that no longer existed.
>
> While I can see that laptops are unlikely HA hardware, I can
> foresee using PCMCIA cards as hot-swappable devices. Something to
> think about, though I can understand a response of "not our problem."

I highly recommend that you consider redundant heartbeat media for this and
other reasons. This problem is not one that the current code handles
correctly. It's on the TODO list on the web. It's roughly the same as
pulling the cable. That will have the same effect. This is our problem, but
we can't handle it the way you expect right at this point in time.

> 4. Situation: two machines, "good" and "bad". Bad has a failing disk, which
> corrupts a few files on its filesystem. Good is the primary, bad is
> the backup. On startup, good does not successfully grab the resource.
> However, killing the heartbeat on good causes bad to successfully
> take over. Restarting the heartbeat on good causes bad to relinquish,
> but again good unsuccessfully attempts to take the resource.
>
> Here's the typical sort of log on good:
>
> heartbeat: 1999/10/14_15:24:53 info: ***********************
> heartbeat: 1999/10/14_15:24:53 info: Configuration validated. Starting heartbeat.
> heartbeat: 1999/10/14_15:24:53 notice: UDP heartbeat started on port 1001 interface eth0
> heartbeat: 1999/10/14_15:24:53 error: Cannot open /proc/ha/.control: No such file or directory
> heartbeat: 1999/10/14_15:24:59 warn: node bad.int.wirex.com: is dead
> heartbeat: 1999/10/14_15:24:59 INFO: Running /etc/ha.d/rc.d/status status
>
> and then nothing.
>
> Occasionally, I would see something like this in good's logs when bad
> would start up:
>
> heartbeat: 1999/10/14_14:32:04 info: ***********************
> heartbeat: 1999/10/14_14:32:04 info: Configuration validated. Starting heartbeat.
> heartbeat: 1999/10/14_14:32:04 notice: UDP heartbeat started on port 1001 interface eth0
> heartbeat: 1999/10/14_14:32:04 error: Cannot open /proc/ha/.control: No such file or directory
> heartbeat: 1999/10/14_14:32:14 error: string2msg: no MSG_START
> heartbeat: 1999/10/14_14:32:14 error: Bad message is: [9^F]
>
> Nothing of interest was found in the debug log.
>
> All of this went away as soon as I stopped using bad, and started
> using a third machine with good.
>
> In this instance, the software failed me. It was unable to detect
> that one of my machines was mildly insane; furthermore, the way the
> problems manifested themselves, I was starting to believe the problem
> was with the machine good, until I looked in bad's /var/log/messages
> and saw disk errors.
>
> Also note that I am using md5 authentication, and good did not complain
> about bad's packets failing authentication (these would have shown up
> in the debug log). Which begs me to ask: what is the security model
> behind the authentication scheme? What sort of threats are you
> attempting to prevent by using it?

If you crank up the debug level (with several SIGUSR1 signals), then you can
see what's going on here. I seem to recall that I log packets with
invalid auth information.

> If necessary, I can recreate the situation with bad, though I don't
> have a lot of time that I can allocate to it.
>
> 5. Oh yeah, the proc module does not compile under 2.0.36, which is what
> all my machines in my testbed are running.

I mangled Volker's code, and it won't compile correctly on old kernels. I
don't think it's complicated, but I don't have any way to test it to fix it
here.

> Hope this is of use to you. Let me know if I can provide you with more
> information. Thanks.
>
> Steve

I very much appreciate the report. Let me know what you find out.


-- Alan Robertson
alanr [at] bell-labs


alanr at bell-labs

Oct 18, 1999, 7:15 PM

Post #8 of 12 (2250 views)
Heartbeat 0.45 experiences [In reply to]

Steve Beattie wrote:
>
> Hi Alan, ha-developers,
>
> I toyed around with 0.45 release last week, and thought I would report
> my results. Mostly, I am very happy with it; however, I did run into a
> few snags of varying seriousness.
>
> 1. "/etc/rc.d/init.d/heartbeat restart" only stops heartbeat, it does
> not actually restart it.

I think I fixed this in my CVS instance at cvs.linux-ha.org. I put in code to
delay exiting from a heartbeat -k until the heartbeat process it's trying to
stop actually exits and disappears from the process table. This is probably
what was causing your problem. I've seen it do that. The code from the
alarm(0)...do...while... is the fix. Here's the new code:

/*
 * We've been asked to shut down the currently running heartbeat
 * process
 */

if (killrunninghb) {

    if (running_hb_pid < 0) {
        fprintf(stderr, "ERROR: Heartbeat not currently running.\n");
        cleanexit(1);
    }

    if (kill(running_hb_pid, SIGTERM) >= 0) {
        /* Wait for the running heartbeat to die */
        alarm(0);
        do {
            sleep(1);
        } while (kill(running_hb_pid, 0) >= 0);
        cleanexit(0);
    }
    fprintf(stderr, "ERROR: Could not kill pid %d", running_hb_pid);
    perror(" ");
    cleanexit(1);
}


Thanks for the report!

-- Alan Robertson
alanr [at] bell-labs


steve at wirex

Oct 18, 1999, 7:37 PM

Post #9 of 12 (2244 views)
Heartbeat 0.45 experiences [In reply to]

Alan Robertson wrote:
>
> I think I fixed this in my CVS instance at cvs.linux-ha.org. I put in code to
> delay exiting from a heartbeat -k until the heartbeat process it's trying to
> stop actually exits and disappears from the process table. This is probably
> what was causing your problem. I've seen it do that. The code from the
> alarm(0)...do...while... is the fix. Here's the new code:

[Patch to heartbeat.c snipped]

Applied and works here. Thanks!

For reference, all the machines that I've been using for testing are
Stackguarded[1] Redhat 5.2 machines running the 2.0.3x kernels.

Steve

[1] See http://www.immunix.org/products.html#stackguard for more
information about Stackguard.


alanr at bell-labs

Oct 18, 1999, 7:51 PM

Post #10 of 12 (2251 views)
Heartbeat 0.45 experiences [In reply to]

Hi Steve,

Steve Beattie wrote:
>
> Alan Robertson wrote:
> >
> > I think I fixed this in my CVS instance at cvs.linux-ha.org. I put in code to
> > delay exiting from a heartbeat -k until the heartbeat process it's trying to
> > stop actually exits and disappears from the process table. This is probably
> > what was causing your problem. I've seen it do that. The code from the
> > alarm(0)...do...while... is the fix. Here's the new code:
>
> [Patch to heartbeat.c snipped]
>
> Applied and works here. Thanks!

You're quite welcome. Your feedback is much appreciated!

You might try pulling it out of the CVS instance, and see if it matches what
you've got on your machine [see announcement I just made].

-- Alan Robertson
alanr [at] bell-labs


th at ant

Oct 19, 1999, 10:34 AM

Post #11 of 12 (2283 views)
Heartbeat 0.45 experiences [In reply to]

Hi,
On Mon, Oct 18, 1999 at 06:16:23PM -0600, Alan Robertson wrote:
>
> The two things I thought of for debugging this are:
>
> 1) do the "resource-script status" for the resources you think
> it should give up before restarting the master. You can
> do this at any time. Any time it doesn't show the
> "running" on the right resources, things are broken.
> Maybe I should do this after taking over any resources.

Not tested yet, will do it

> 2) Turn on debug to level 5 (5 SIGUSR1's) on the slave
> just before restarting the master. After the master
> has restarted (a minute or two elapsed), you can run
> the debug level back down again on the slave.
>
OK, did it. I have attached my debug files from different tests.

One strange thing is that when heartbeat is started by init at startup,
it does not work. Either it hangs, or it thinks it is dead itself (see the logs).

If I stop it from a normal root shell and start it again, it will take
the resources, and the slave will release them.



Thomas
--
-----------------------------------------------
| Thomas Hepper th [at] ant |
| ( If the above address fail try ) |
| ( thomas.hepper [at] planet-interkom) |
-----------------------------------------------

--PEIAKu/WMn1b1Hv9
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="heartbeat-debug.tar.gz"
Content-Transfer-Encoding: base64

[base64-encoded contents of heartbeat-debug.tar.gz omitted]

--PEIAKu/WMn1b1Hv9--


alanr at bell-labs

Oct 20, 1999, 10:29 PM

Post #12 of 12 (2250 views)
Permalink
Heartbeat 0.45 experiences [In reply to]

Hi Thomas,

Thomas Hepper wrote:
>
> Hi,
> On Mon, Oct 18, 1999 at 06:16:23PM -0600, Alan Robertson wrote:
> >
> > The two things I thought of for debugging this are:
> >
> > 1) do the "resource-script status" for the resources you think
> > it should give up before restarting the master. You can
> > do this at any time. Any time it doesn't show the
> > "running" on the right resources, things are broken.
> > Maybe I should do this after taking over any resources.
>
> Not tested yet, will do it
>
> > 2) Turn on debug to level 5 (5 SIGUSR1's) on the slave
> > just before restarting the master. After the master
> > has restarted (a minute or two elapsed), you can run
> > the debug level back down again on the slave.
> >
> OK did it. I have attached my debug files from different tests.
>
> One strange thing is that on startup, when heartbeat is started by init,
> it does not work. Either it hangs, or it thinks it is dead itself (see the logs).

I'm going to ignore this for now. We'll come back to it later... I suspect
SIGSTOP.

> If I stop it in the normal root shell and start it again, it will take
> the resources, and the slave will release them ...

Your debug files were a little strange...

In particular, I noticed that every time /etc/ha.d/rc.d/ip-request-resp was
invoked, it gave a usage message. It shouldn't ever do that. Maybe it didn't
get installed correctly on Debian?

Could you add

    set -x
    echo "$0: $# arguments: $*" >&2
    env >&2

to the top of this script and send me the output when you run it again?
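
For illustration only, here is a minimal sketch of how those debug lines might sit at the top of a script that prints a usage message when it is called with too few arguments. This is not the real ip-request-resp: the function names, the HA_TRACE variable, the usage text, and the argument check are all hypothetical stand-ins. Only the echo and env lines (and the optional set -x) come from the request above.

```shell
#!/bin/sh
# Hypothetical stand-in for a resource script. Only the debug lines
# (set -x, the echo of the arguments, and env) come from the request
# above; everything else is assumed for illustration.

debug_prologue() {
    # Shell tracing is opt-in here via HA_TRACE, a made-up variable,
    # so that the sketch stays quiet unless tracing is asked for.
    if [ -n "${HA_TRACE:-}" ]; then
        set -x
    fi
    # Report the script name, argument count, and arguments on stderr.
    echo "$0: $# arguments: $*" >&2
    # Dump the environment to stderr too, to show what the caller set.
    env >&2
}

main() {
    debug_prologue "$@"
    # Hypothetical argument check: complain if no argument was passed.
    if [ $# -lt 1 ]; then
        echo "usage: $0 ip-address" >&2
        return 1
    fi
    echo "would act on: $1"
}
```

Calling main with no arguments should then show "0 arguments:" on stderr just before the usage message, which would tell us whether the caller is passing nothing at all or passing something the script does not expect.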

Thanks!

-- Alan Robertson
alanr [at] bell-labs
