
Mailing List Archive: Linux-HA: Users

New problem(s) with heartbeat 2.0.3 and STONITH


peinkofe at fhm

Oct 26, 2005, 8:54 AM

Post #1 of 55 (2551 views)
Permalink
New problem(s) with heartbeat 2.0.3 and STONITH

Hello everybody,

Unfortunately I have new problems with the heartbeat 2.0.3 CVS version
and STONITH.

I ran a cvs heartbeat which was checked out on 2005-10-18 and
encountered a problem with stonithd which was killed by signal 11.
The effects were that the stonith resources were NOT_ACTIVE and when I
initiated a split brain no node could fence the other off.

I thought it might already be fixed in CVS, so I checked out a version
today (2005-10-26). Unfortunately this version seems to contain an even
worse problem with STONITH.


After I started heartbeat on the two nodes and waited until it had started
up completely, I initiated the split-brain situation. I had expected
fencing to work because both STONITH resources were active.

In the logs I saw:
Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6): Scheduling Node sarek for STONITH
That's what I want :)
But then the following message appeared:
Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence: cannot add field to ha_msg.

And no node kills the other. They try it over and over again, but it
always fails with the above message.

I have attached the complete logfile of the DC. As well as my ha.cf and
the cib.xml.
Note that both nodes have the problem.

My system: two RHEL 4 Update 2 Kernel 2.6.0-11ELsmp
2 wti_nps power switches.

Many thanks in advance.

Kind regards
Stefan Peinkofer
--
--------------------------------------------------------------------------------
Stefan Peinkofer
Zentrum fuer angewandte Kommunikationstechnologien (ZaK)
Fachhochschule Muenchen, Munich University of Applied Sciences
URL: http://www.fhm.edu/zak/
--------------------------------------------------------------------------------
Attachments: sarek-heartbeat.log.gz (13.9 KB)
  cib-mail.xml (8.78 KB)
  ha.cf (0.59 KB)
  signature.asc (0.18 KB)


alanr at unix

Oct 26, 2005, 9:14 AM

Post #2 of 55 (2490 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Stefan Peinkofer wrote:
> Hello everybody,
>
> unforunately I have new prolbems with the heartbeat 2.0.3 cvs version
> and stonith.
>
> I ran a cvs heartbeat which was checked out on 2005-10-18 and
> encountered a problem with stonithd which was killed by signal 11.
> The effects were that the stonith resources were NOT_ACTIVE and when I
> initiated a split brain no node could fence the other off.
>
> I thought maybe it's already fixed in cvs and checkout a version today
> (2005-10-26). But unfortunately this version seems to contain a even
> worse problem with stonith.
>
>
> After I startup heartbeat on the two nodes, and wait until it's started
> up completely I initiated the split brain situation. I had expected that
> this works as expected because both stonith resources were active.
>
> In the logs I saw:
> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
> Scheduling Node sarek for STONITH
> Thats what I want :)
> But then the following message appeared:
> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
> cannot add field to ha_msg.
>
> And no node kills the other. The try it over and over again but it
> breaks always with the above message.
>
> I have attached the complete logfile of the DC. As well as my ha.cf and
> the cib.xml.
> Note that both nodes have the problem.
>
> My system: two RHEL 4 Update 2 Kernel 2.6.0-11ELsmp
> 2 wti_nps power switches.

IIRC we used to see the signal 11 stuff in our testing a few months ago,
but it went away - so we couldn't fix it.

Can you get us the stack trace from the core dump from this occurrence?

It's odd that the monitoring of the STONITH objects didn't detect that
they weren't running any more. Guess we'll have to look at the logs
more closely.

--
Alan Robertson <alanr [at] unix>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


peinkofe at fhm

Oct 26, 2005, 9:32 AM

Post #3 of 55 (2492 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Hello Alan,
On Wed, 2005-10-26 at 10:14 -0600, Alan Robertson wrote:
> Stefan Peinkofer wrote:
> > Hello everybody,
> >
> > unforunately I have new prolbems with the heartbeat 2.0.3 cvs version
> > and stonith.
> >
> > I ran a cvs heartbeat which was checked out on 2005-10-18 and
> > encountered a problem with stonithd which was killed by signal 11.
> > The effects were that the stonith resources were NOT_ACTIVE and when I
> > initiated a split brain no node could fence the other off.
> >
> > I thought maybe it's already fixed in cvs and checkout a version today
> > (2005-10-26). But unfortunately this version seems to contain a even
> > worse problem with stonith.
> >
> >
> > After I startup heartbeat on the two nodes, and wait until it's started
> > up completely I initiated the split brain situation. I had expected that
> > this works as expected because both stonith resources were active.
> >
> > In the logs I saw:
> > Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
> > Scheduling Node sarek for STONITH
> > Thats what I want :)
> > But then the following message appeared:
> > Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
> > cannot add field to ha_msg.
> >
> > And no node kills the other. The try it over and over again but it
> > breaks always with the above message.
> >
> > I have attached the complete logfile of the DC. As well as my ha.cf and
> > the cib.xml.
> > Note that both nodes have the problem.
> >
> > My system: two RHEL 4 Update 2 Kernel 2.6.0-11ELsmp
> > 2 wti_nps power switches.
>
> IIRC used to see the signal 11 stuff in our testing a few months ago,
> but it went away - so we could't fix it.

> Can you get us the stack trace from the core dump from this occurance?
>
Sorry, my problem description may have been ambiguous. I'm talking about two
presumably independent problems. Problem 1 is the 'killed by signal 11'
problem. That was the reason why I updated my heartbeat to a more recent
CVS version. Unfortunately I haven't kept the logs of this problem
(because I wanted to use the more recent CVS version to provide logs and
stuff).
Problem 2 is the problem with 'cannot add field to ha_msg', and it
appeared with the more recent CVS version. The attached logs are for
Problem 2. I will be able to provide logs, cores and stuff for Problem 1
once Problem 2 is fixed (since it takes place before Problem 1 occurs).
I hope I did a better job this time.

Many thanks in advance.
Stefan Peinkofer
> It's odd that the monitoring of the STONITH objects didn't detect that
> they weren't running any more. Guess we'll have to look at the logs
> more closely.
>
Attachments: signature.asc (0.18 KB)


beekhof at gmail

Oct 26, 2005, 11:27 AM

Post #4 of 55 (2502 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

On 10/26/05, Stefan Peinkofer <peinkofe [at] fhm> wrote:
> Hello Alan,
> On Wed, 2005-10-26 at 10:14 -0600, Alan Robertson wrote:
> > Stefan Peinkofer wrote:
> > > Hello everybody,
> > >
> > > unforunately I have new prolbems with the heartbeat 2.0.3 cvs version
> > > and stonith.
> > >
> > > I ran a cvs heartbeat which was checked out on 2005-10-18 and
> > > encountered a problem with stonithd which was killed by signal 11.
> > > The effects were that the stonith resources were NOT_ACTIVE and when I
> > > initiated a split brain no node could fence the other off.
> > >
> > > I thought maybe it's already fixed in cvs and checkout a version today
> > > (2005-10-26). But unfortunately this version seems to contain a even
> > > worse problem with stonith.
> > >
> > >
> > > After I startup heartbeat on the two nodes, and wait until it's started
> > > up completely I initiated the split brain situation. I had expected that
> > > this works as expected because both stonith resources were active.
> > >
> > > In the logs I saw:
> > > Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
> > > Scheduling Node sarek for STONITH
> > > Thats what I want :)
> > > But then the following message appeared:
> > > Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
> > > cannot add field to ha_msg.
> > >
> > > And no node kills the other. The try it over and over again but it
> > > breaks always with the above message.
> > >
> > > I have attached the complete logfile of the DC. As well as my ha.cf and
> > > the cib.xml.
> > > Note that both nodes have the problem.
> > >
> > > My system: two RHEL 4 Update 2 Kernel 2.6.0-11ELsmp
> > > 2 wti_nps power switches.
> >
> > IIRC used to see the signal 11 stuff in our testing a few months ago,
> > but it went away - so we could't fix it.
>
> > Can you get us the stack trace from the core dump from this occurance?
> >
> Sorry, my problem description may be ambiguous. I'm talking about two
> presumably independent problems. Problem 1 is the 'killed by signal 11'
> problem. That was the reason why I updated my heartbeat to a more recent
> cvs version. Unfortunately I haven't keep the logs of this problem.
> (Because I wanted to use the more recent cvs version to provide logs and
> stuff)
> Problem 2 is the problem with 'cannot add field to ha_msg' and it
> appeared with the more recent cvs version. The logs attached are for
> Prolbem 2. I will be able to provide logs, cores and stuff for Problem 1
> if Problem 2 is fixed (since it takes place before Problem 1 occours).
> I hope I did a better job this time.

I believe IBM China fixed Problem 1 in CVS a while back - or maybe
this is a different problem with the same symptom.

The Problem 2 ERRORs indicate an internal stonithd problem (rather
than a CRM one).

>
> Many thanks in advance.
> Stefan Peinkofer
> > It's odd that the monitoring of the STONITH objects didn't detect that
> > they weren't running any more. Guess we'll have to look at the logs
> > more closely.
> >


pk at q-leap

Oct 27, 2005, 2:05 AM

Post #5 of 55 (2496 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Hello all,

Andrew Beekhof wrote:
> On 10/26/05, Stefan Peinkofer <peinkofe [at] fhm> wrote:
>
>>Hello Alan,
>>On Wed, 2005-10-26 at 10:14 -0600, Alan Robertson wrote:
>>
>>>Stefan Peinkofer wrote:
>>>
>>>>I ran a cvs heartbeat which was checked out on 2005-10-18 and
>>>>encountered a problem with stonithd which was killed by signal 11.
>>>>The effects were that the stonith resources were NOT_ACTIVE and when I
>>>>initiated a split brain no node could fence the other off.
>>>>

I have a similar problem, or maybe the same one. It looks like
stonithd cannot recover if the connection to the power switch
was lost. It stays in the "NOT ACTIVE" state in the output of "crm_mon -1".
In my setup there are two nodes and one power switch (the STONITH
device). The STONITH resources are configured as clones. The
nodes are connected via a crossover cable and
with another link to a switch on a network where they can
reach each other (the second heartbeat link) and the power switch.
When I pull the network cable of this public link on the active
node, a failover occurs (yeah!), and "crm_mon -1" shows:

Clone Set: fence1
fence1:apc1:0 (stonith:apcmastersnmp): NOT ACTIVE
fence1:apc1:1 (stonith:apcmastersnmp): ha-test-2

Unfortunately it stays like this even when I connect the
network again. This is with current CVS; see the attached logs
and cib.xml.
So stonithd does not seem to be able to recover from
a connection loss.
What would I have to do to manually restart stonithd so
that heartbeat marks the device as "ACTIVE"?

Regards,

Peter


pk at q-leap

Oct 27, 2005, 2:07 AM

Post #6 of 55 (2493 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

attachment included...

Attachments: logs.tgz (49.4 KB)


pk at q-leap

Oct 27, 2005, 7:34 AM

Post #7 of 55 (2490 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Hi,

Stefan Peinkofer wrote:
>
> In the logs I saw:
> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
> Scheduling Node sarek for STONITH
> Thats what I want :)
> But then the following message appeared:
> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
> cannot add field to ha_msg.
>

Yes, I see exactly the same behaviour; this is with the current CVS
revision. Additionally, "crm_verify -L -V" gives:

crm_verify[4591]: 2005/10/27_16:34:06 WARN: mask(unpack.c:unpack_config): No value specified for cluster preference: stop_orphan_resources
crm_verify[4591]: 2005/10/27_16:34:06 WARN: mask(unpack.c:unpack_config): No value specified for cluster preference: stop_orphan_actions
crm_verify[4591]: 2005/10/27_16:34:06 WARN: mask(unpack.c:unpack_config): No value specified for cluster preference: remove_after_stop
crm_verify[4591]: 2005/10/27_16:34:06 WARN: mask(unpack.c:unpack_config): No value specified for cluster preference: is_managed_default
crm_verify[4591]: 2005/10/27_16:34:06 WARN: mask(unpack.c:unpack_rsc_op): Processing failed op (rg1:pbs1_stop_0) for rg1:pbs1 on ha-test-1
crm_verify[4591]: 2005/10/27_16:34:06 WARN: mask(unpack.c:unpack_rsc_op): Handling failed stop for rg1:pbs1 on ha-test-1
crm_verify[4591]: 2005/10/27_16:34:06 WARN: mask(unpack.c:unpack_rsc_op): Processing failed op (rg1:drbd1_stop_0) for rg1:drbd1 on ha-test-1
crm_verify[4591]: 2005/10/27_16:34:06 WARN: mask(unpack.c:unpack_rsc_op): Handling failed stop for rg1:drbd1 on ha-test-1

Not sure if this is related.

Peter


beekhof at gmail

Oct 27, 2005, 7:54 AM

Post #8 of 55 (2489 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

On 10/27/05, Peter Kruse <pk [at] q-leap> wrote:
> Hi,
>
> Stefan Peinkofer wrote:
> >
> > In the logs I saw:
> > Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
> > Scheduling Node sarek for STONITH
> > Thats what I want :)
> > But then the following message appeared:
> > Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
> > cannot add field to ha_msg.
> >
>
> yes, I have exactly the same behaviour, this is with current CVS
> revision. Additionally "crm_verify -L -V" gives:
>
> crm_verify[4591]: 2005/10/27_16:34:06 WARN:
> mask(unpack.c:unpack_config): No value specified for cluster preference:
> stop_orphan_resources
> crm_verify[4591]: 2005/10/27_16:34:06 WARN:
> mask(unpack.c:unpack_config): No value specified for cluster preference:
> stop_orphan_actions
> crm_verify[4591]: 2005/10/27_16:34:06 WARN:
> mask(unpack.c:unpack_config): No value specified for cluster preference:
> remove_after_stop
> crm_verify[4591]: 2005/10/27_16:34:06 WARN:
> mask(unpack.c:unpack_config): No value specified for cluster preference:
> is_managed_default
> crm_verify[4591]: 2005/10/27_16:34:06 WARN:
> mask(unpack.c:unpack_rsc_op): Processing failed op (rg1:pbs1_stop_0) for
> rg1:pbs1 on ha-test-1
> crm_verify[4591]: 2005/10/27_16:34:06 WARN:
> mask(unpack.c:unpack_rsc_op): Handling failed stop for rg1:pbs1 on ha-test-1
> crm_verify[4591]: 2005/10/27_16:34:06 WARN:
> mask(unpack.c:unpack_rsc_op): Processing failed op (rg1:drbd1_stop_0)
> for rg1:drbd1 on ha-test-1
> crm_verify[4591]: 2005/10/27_16:34:06 WARN:
> mask(unpack.c:unpack_rsc_op): Handling failed stop for rg1:drbd1 on
> ha-test-1
>
> Not sure if this is related.
>

not in this case


alanr at unix

Oct 27, 2005, 3:06 PM

Post #9 of 55 (2491 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Stefan Peinkofer wrote:
> Hello everybody,
>
> unforunately I have new prolbems with the heartbeat 2.0.3 cvs version
> and stonith.
>
> I ran a cvs heartbeat which was checked out on 2005-10-18 and
> encountered a problem with stonithd which was killed by signal 11.
> The effects were that the stonith resources were NOT_ACTIVE and when I
> initiated a split brain no node could fence the other off.
>
> I thought maybe it's already fixed in cvs and checkout a version today
> (2005-10-26). But unfortunately this version seems to contain a even
> worse problem with stonith.
>
>
> After I startup heartbeat on the two nodes, and wait until it's started
> up completely I initiated the split brain situation. I had expected that
> this works as expected because both stonith resources were active.
>
> In the logs I saw:
> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
> Scheduling Node sarek for STONITH
> Thats what I want :)
> But then the following message appeared:
> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
> cannot add field to ha_msg.

This is some kind of an issue in the lib/fencing/stonithd_lib.c file

    if (   (ha_msg_add_int(request, F_STONITHD_OPTYPE, op->optype) != HA_OK)
        || (ha_msg_add(request, F_STONITHD_NODE, op->node_name) != HA_OK)
        || (op->node_uuid == NULL
            || ha_msg_add(request, F_STONITHD_NODE_UUID, op->node_uuid) != HA_OK)
        || (op->private_data == NULL
            || ha_msg_add(request, F_STONITHD_PDATA, op->private_data) != HA_OK)
        || (ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout) != HA_OK)) {
        stdlib_log(LOG_ERR, "stonithd_node_fence: "
                   "cannot add field to ha_msg.");
        ZAPMSG(request);
        return ST_FAIL;
    }

My guess is that op->node_name or op->optype is NULL. The code should
have validated those. Since they're critical, and they come from
who-knows-where (meaning some doofus user process), they should
definitely have been error checked, and there should be a clear message
about their errors.

Things I don't quite understand...
UUIDs are normally special portable binary values with their own type in
the structure world... Having this be a string violates the law of
least surprise. If they're not really uuids, then they shouldn't be
CALLED uuids.

Normally private_data is also binary. If either of these is actually
binary, then this would also be wrong. Having them be strings violates
the law of least surprise... So, as a design element, it's odd to have
them not be binary blobs. Of course, sending the private data as binary
would cause its own problems with portability.

But, renaming it to private_string_data or something would alleviate the
confusion, and make it clearer.


--
Alan Robertson <alanr [at] unix>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


hasjd at cn

Oct 28, 2005, 3:15 AM

Post #10 of 55 (2512 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Alan Robertson wrote:
> Stefan Peinkofer wrote:
>
>> Hello everybody,
>>
>> unforunately I have new prolbems with the heartbeat 2.0.3 cvs version
>> and stonith.
>>
>> I ran a cvs heartbeat which was checked out on 2005-10-18 and
>> encountered a problem with stonithd which was killed by signal 11.
>> The effects were that the stonith resources were NOT_ACTIVE and when I
>> initiated a split brain no node could fence the other off.
>>
>> I thought maybe it's already fixed in cvs and checkout a version today
>> (2005-10-26). But unfortunately this version seems to contain a even
>> worse problem with stonith.
>>
>> After I startup heartbeat on the two nodes, and wait until it's started
>> up completely I initiated the split brain situation. I had expected that
>> this works as expected because both stonith resources were active.
>>
>> In the logs I saw:
>> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
>> Scheduling Node sarek for STONITH
>> Thats what I want :)
>> But then the following message appeared:
>> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
>> cannot add field to ha_msg.
>
>
> This is some kind of an issue in the lib/fencing/stonithd_lib.c file
>
> if ( (ha_msg_add_int(request, F_STONITHD_OPTYPE, op->optype) !=
> HA_OK )
> ||(ha_msg_add(request, F_STONITHD_NODE, op->node_name ) !=
> HA_OK)
> ||(op->node_uuid == NULL
> || ha_msg_add(request, F_STONITHD_NODE_UUID,
> op->node_uuid) != HA_OK)
> ||(op->private_data == NULL
> || ha_msg_add(request, F_STONITHD_PDATA,
> op->private_data) != HA_OK)
> ||(ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout)
> != HA_OK) ) {
> stdlib_log(LOG_ERR, "stonithd_node_fence: "
> "cannot add field to ha_msg.");
> ZAPMSG(request);
> return ST_FAIL;
> }
>
> My guess is that op->node_name or op->optype is NULL. The code should
> have validated those. Since they're critical, and they come from
> who-knows-where (meaning some doofus user process), they should
> definitely have been error checked, and there should be a clear message
> about their errors.
>

Should be op->private_data == NULL. This condition is not reasonable.
I'll fix it.
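
To illustrate the condition in question: with `op->private_data == NULL ||` inside the test, a request that simply carries no private data is reported as "cannot add field to ha_msg". One way to restructure it is sketched below, assuming node_name is mandatory and node_uuid and private_data are meant to be optional. The demo_* types and functions are hypothetical stand-ins for this sketch, not the real ha_msg API or the actual fix that went into CVS.

```c
#include <stddef.h>
#include <stdio.h>

#define DEMO_OK          1
#define DEMO_FAIL        0
#define DEMO_MAX_FIELDS  8

/* Hypothetical stand-in for a heartbeat message (not struct ha_msg). */
struct demo_msg {
    size_t nfields;
    const char *names[DEMO_MAX_FIELDS];
    const char *values[DEMO_MAX_FIELDS];
};

/* Hypothetical stand-in for the fencing request discussed in the thread. */
struct demo_fence_op {
    const char *node_name;     /* assumed mandatory */
    const char *node_uuid;     /* assumed optional  */
    const char *private_data;  /* assumed optional  */
};

/* Append a name/value pair; fails on NULL arguments or a full message. */
static int demo_msg_add(struct demo_msg *m, const char *name, const char *value)
{
    if (m == NULL || name == NULL || value == NULL
        || m->nfields >= DEMO_MAX_FIELDS) {
        return DEMO_FAIL;
    }
    m->names[m->nfields] = name;
    m->values[m->nfields] = value;
    m->nfields++;
    return DEMO_OK;
}

/* Optional field: a NULL value is skipped and is not treated as an error. */
static int demo_msg_add_optional(struct demo_msg *m, const char *name,
                                 const char *value)
{
    if (value == NULL) {
        return DEMO_OK;
    }
    return demo_msg_add(m, name, value);
}

/* Validate mandatory input first, with its own message, then add fields. */
int demo_build_fence_request(struct demo_msg *m, const struct demo_fence_op *op)
{
    if (m == NULL || op == NULL) {
        fprintf(stderr, "demo_build_fence_request: NULL argument\n");
        return DEMO_FAIL;
    }
    if (op->node_name == NULL) {
        fprintf(stderr, "demo_build_fence_request: node_name is NULL\n");
        return DEMO_FAIL;
    }
    if (demo_msg_add(m, "node", op->node_name) != DEMO_OK
        || demo_msg_add_optional(m, "node_uuid", op->node_uuid) != DEMO_OK
        || demo_msg_add_optional(m, "pdata", op->private_data) != DEMO_OK) {
        fprintf(stderr, "demo_build_fence_request: cannot add field\n");
        return DEMO_FAIL;
    }
    return DEMO_OK;
}
```

With this shape, a request that carries only a node name yields a one-field message, and a missing node name is reported with its own message instead of the generic one.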

> Things I don't quite understand...
> UUIDs are normally special portable binary values with their own type in
> the structure world... Having this be a string violates the law of
> least surprise. If they're not really uuids, then they shouldn't be
> CALLED uuids.
There is a long story regarding this, it's required by Andrew.

>
> Normally private_data is also binary. If either of this is actually
> binary, then this would also be wrong. Having them be strings violates
> the law of least surprise... So, as a design element, it's odd to have
> them not be binary blobs. Of course, sending the private data as binary
> would cause it's own problems with portability.
Yes.
>
> But, renaming it to private_string_data or something would alleviate the
> confusion, and make it clearer.
That makes sense, I'll rename it.
>
>

--
BRs,

Sun Jiang Dong



alanr at unix

Oct 28, 2005, 6:41 AM

Post #11 of 55 (2497 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Sun Jiang Dong wrote:
>
>
> Alan Robertson wrote:
>> Stefan Peinkofer wrote:
>>
>>> Hello everybody,
>>>
>>> unforunately I have new prolbems with the heartbeat 2.0.3 cvs version
>>> and stonith.
>>>
>>> I ran a cvs heartbeat which was checked out on 2005-10-18 and
>>> encountered a problem with stonithd which was killed by signal 11.
>>> The effects were that the stonith resources were NOT_ACTIVE and when I
>>> initiated a split brain no node could fence the other off.
>>>
>>> I thought maybe it's already fixed in cvs and checkout a version today
>>> (2005-10-26). But unfortunately this version seems to contain a even
>>> worse problem with stonith.
>>>
>>> After I startup heartbeat on the two nodes, and wait until it's started
>>> up completely I initiated the split brain situation. I had expected that
>>> this works as expected because both stonith resources were active.
>>>
>>> In the logs I saw:
>>> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
>>> Scheduling Node sarek for STONITH
>>> Thats what I want :)
>>> But then the following message appeared:
>>> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
>>> cannot add field to ha_msg.
>>
>>
>> This is some kind of an issue in the lib/fencing/stonithd_lib.c file
>>
>> if ( (ha_msg_add_int(request, F_STONITHD_OPTYPE, op->optype)
>> != HA_OK )
>> ||(ha_msg_add(request, F_STONITHD_NODE, op->node_name ) !=
>> HA_OK)
>> ||(op->node_uuid == NULL
>> || ha_msg_add(request, F_STONITHD_NODE_UUID,
>> op->node_uuid) != HA_OK)
>> ||(op->private_data == NULL
>> || ha_msg_add(request, F_STONITHD_PDATA,
>> op->private_data) != HA_OK)
>> ||(ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout)
>> != HA_OK) ) {
>> stdlib_log(LOG_ERR, "stonithd_node_fence: "
>> "cannot add field to ha_msg.");
>> ZAPMSG(request);
>> return ST_FAIL;
>> }
>>
>> My guess is that op->node_name or op->optype is NULL. The code should
>> have validated those. Since they're critical, and they come from
>> who-knows-where (meaning some doofus user process), they should
>> definitely have been error checked, and there should be a clear
>> message about their errors.
>>
>
> Should be op->private_data == NULL. This condition is not reasonable.
> I'll fix it.
>
>> Things I don't quite understand...
>> UUIDs are normally special portable binary values with their own type
>> in the structure world... Having this be a string violates the law of
>> least surprise. If they're not really uuids, then they shouldn't be
>> CALLED uuids.
> There is a long story regarding this, it's required by Andrew.


If Andrew requires you to call something a uuid which isn't a UUID,
then he screwed up and he should fix it.

A UUID is not simply a random identifier which is forced to be unique
(like he requires his id= in XML), it's an industry standard term as per
DCE 1.1, ISO/IEC 11578:1996 and RFC 4122.

So, it is not some string guaranteed to be unique. In fact, it isn't a
string at all, but a 128-bit binary value. There are specified ways of
printing UUIDs, but they're not precisely UUIDs, but ASCII
representations of UUIDs.

So, if it's not a 128-bit binary value in compliance with DCE 1.1,
ISO/IEC 11578:1996 or RFC 4122, it's not really a UUID.
http://www.faqs.org/rfcs/rfc4122.html

[This URL even contains a sample UUID implementation]
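
For illustration, the distinction between a UUID and its printed form can be sketched in C roughly as follows. The demo_uuid type and the two functions are hypothetical names made up for this sketch, not the heartbeat or libuuid API; it only shows that the value itself is 16 bytes, while the familiar 36-character string is one representation of it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_UUID_BYTES    16   /* 128 bits */
#define DEMO_UUID_STR_LEN  36   /* 32 hex digits plus 4 dashes */

/* The UUID proper: a 128-bit binary value. */
typedef struct {
    uint8_t bytes[DEMO_UUID_BYTES];
} demo_uuid;

/* Write the 8-4-4-4-12 hex form; out needs DEMO_UUID_STR_LEN + 1 bytes. */
int demo_uuid_to_string(const demo_uuid *u, char *out, size_t out_size)
{
    static const char hex[] = "0123456789abcdef";
    size_t i, pos = 0;

    if (u == NULL || out == NULL || out_size < DEMO_UUID_STR_LEN + 1) {
        return -1;
    }
    for (i = 0; i < DEMO_UUID_BYTES; i++) {
        /* Dashes separate the groups after bytes 4, 6, 8 and 10. */
        if (i == 4 || i == 6 || i == 8 || i == 10) {
            out[pos++] = '-';
        }
        out[pos++] = hex[u->bytes[i] >> 4];
        out[pos++] = hex[u->bytes[i] & 0x0f];
    }
    out[pos] = '\0';
    return 0;
}

/* Map one hex digit to its value, or -1 if it is not a hex digit. */
static int demo_hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse the 36-character text form back into the 16-byte value. */
int demo_uuid_from_string(const char *s, demo_uuid *u)
{
    size_t i, pos = 0;

    if (s == NULL || u == NULL || strlen(s) != DEMO_UUID_STR_LEN) {
        return -1;
    }
    for (i = 0; i < DEMO_UUID_BYTES; i++) {
        int hi, lo;

        if (i == 4 || i == 6 || i == 8 || i == 10) {
            if (s[pos] != '-') {
                return -1;
            }
            pos++;
        }
        hi = demo_hex_value(s[pos]);
        lo = demo_hex_value(s[pos + 1]);
        if (hi < 0 || lo < 0) {
            return -1;
        }
        u->bytes[i] = (uint8_t)((hi << 4) | lo);
        pos += 2;
    }
    return 0;
}
```

An arbitrary unique string such as a node name fails demo_uuid_from_string, which is the point being made: a string can be an ASCII representation of a UUID, but not every unique string is one.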

--
Alan Robertson <alanr [at] unix>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


beekhof at gmail

Oct 28, 2005, 7:18 AM

Post #12 of 55 (2498 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

On 10/28/05, Alan Robertson <alanr [at] unix> wrote:
> Sun Jiang Dong wrote:
> >
> >
> > Alan Robertson wrote:
> >> Stefan Peinkofer wrote:
> >>
> >>> Hello everybody,
> >>>
> >>> unforunately I have new prolbems with the heartbeat 2.0.3 cvs version
> >>> and stonith.
> >>>
> >>> I ran a cvs heartbeat which was checked out on 2005-10-18 and
> >>> encountered a problem with stonithd which was killed by signal 11.
> >>> The effects were that the stonith resources were NOT_ACTIVE and when I
> >>> initiated a split brain no node could fence the other off.
> >>>
> >>> I thought maybe it's already fixed in cvs and checkout a version today
> >>> (2005-10-26). But unfortunately this version seems to contain a even
> >>> worse problem with stonith.
> >>>
> >>> After I startup heartbeat on the two nodes, and wait until it's started
> >>> up completely I initiated the split brain situation. I had expected that
> >>> this works as expected because both stonith resources were active.
> >>>
> >>> In the logs I saw:
> >>> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
> >>> Scheduling Node sarek for STONITH
> >>> Thats what I want :)
> >>> But then the following message appeared:
> >>> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
> >>> cannot add field to ha_msg.
> >>
> >>
> >> This is some kind of an issue in the lib/fencing/stonithd_lib.c file
> >>
> >> if ( (ha_msg_add_int(request, F_STONITHD_OPTYPE, op->optype)
> >> != HA_OK )
> >> ||(ha_msg_add(request, F_STONITHD_NODE, op->node_name ) !=
> >> HA_OK)
> >> ||(op->node_uuid == NULL
> >> || ha_msg_add(request, F_STONITHD_NODE_UUID,
> >> op->node_uuid) != HA_OK)
> >> ||(op->private_data == NULL
> >> || ha_msg_add(request, F_STONITHD_PDATA,
> >> op->private_data) != HA_OK)
> >> ||(ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout)
> >> != HA_OK) ) {
> >> stdlib_log(LOG_ERR, "stonithd_node_fence: "
> >> "cannot add field to ha_msg.");
> >> ZAPMSG(request);
> >> return ST_FAIL;
> >> }
> >>
> >> My guess is that op->node_name or op->optype is NULL. The code should
> >> have validated those. Since they're critical, and they come from
> >> who-knows-where (meaning some doofus user process), they should
> >> definitely have been error checked, and there should be a clear
> >> message about their errors.
> >>
> >
> > Should be op->private_data == NULL. This condition is not reasonable.
> > I'll fix it.
> >
> >> Things I don't quite understand...
> >> UUIDs are normally special portable binary values with their own type
> >> in the structure world... Having this be a string violates the law of
> >> least surprise. If they're not really uuids, then they shouldn't be
> >> CALLED uuids.
> > There is a long story regarding this, it's required by Andrew.
>
>
> If Andrew requires you to call something which isn't a UUID as a uuid,
> then he screwed up and he should fix it.

Delightfully tactful as ever.

From reading this, one would think that it's the first time we've
had this discussion.

>
> A UUID is not simply a random identifier which is forced to be unique
> (like he requires his id= in XML), it's an industry standard term as per
> DCE 1.1, ISO/IEC 11578:1996 and RFC 4122.
>
> So, it is not some string guaranteed to be unique. In fact, it isn't a
> string at all, but a 128-bit binary value. There are specified ways of
> printing UUIDs, but they're not precisely UUIDs, but ASCII
> representations of UUIDs.
>
> So, if it's not a 128-bit binary value in compliance with DCE 1.1,
> ISO/IEC 11578:1996 or RFC 4122, it's not really a UUID.
> http://www.faqs.org/rfcs/rfc4122.html
>
> [This URL even contains a sample UUID implementation]
>
> --
> Alan Robertson <alanr [at] unix>
>
> "Openness is the foundation and preservative of friendship... Let me
> claim from you at all times your undisguised opinions." - William
> Wilberforce
>
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
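
The distinction drawn in the quoted text, between a UUID (a 128-bit binary value) and its printed ASCII form, can be sketched in C. This is only an illustration under stated assumptions: the type name uuid128 and the function uuid128_to_string() are hypothetical and are not taken from the heartbeat sources. The 8-4-4-4-12 hexadecimal layout is the string representation described in RFC 4122.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: the UUID itself is 16 raw octets, not a string. */
typedef struct {
    uint8_t bytes[16];
} uuid128;

/* Length of the printed form: 32 hex digits plus 4 hyphens. */
#define UUID_STR_LEN 36

/*
 * Write the ASCII representation (8-4-4-4-12 lowercase hex) of *u into out.
 * out must have room for UUID_STR_LEN + 1 bytes.
 * Returns 0 on success, -1 on a NULL argument or a too-small buffer.
 */
static int uuid128_to_string(const uuid128 *u, char *out, size_t outlen)
{
    static const char hex[] = "0123456789abcdef";
    size_t i;
    size_t pos = 0;

    if (u == NULL || out == NULL || outlen < UUID_STR_LEN + 1) {
        return -1;
    }
    for (i = 0; i < 16; i++) {
        /* Hyphens separate the groups of 4, 2, 2, 2 and 6 octets. */
        if (i == 4 || i == 6 || i == 8 || i == 10) {
            out[pos++] = '-';
        }
        out[pos++] = hex[u->bytes[i] >> 4];
        out[pos++] = hex[u->bytes[i] & 0x0f];
    }
    out[pos] = '\0';
    return 0;
}
```

The point of the sketch is that the 16-byte value is what gets stored and compared; the 36-character string is produced only when the UUID has to be shown to a human or written into text.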


alanr at unix

Oct 28, 2005, 7:55 AM

Post #13 of 55 (2490 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Andrew Beekhof wrote:
> On 10/28/05, Alan Robertson <alanr [at] unix> wrote:
>> Sun Jiang Dong wrote:
>>>
>>> Alan Robertson wrote:
>>>> Stefan Peinkofer wrote:
>>>>
>>>>> Hello everybody,
>>>>>
>>>>> unfortunately I have new problems with the heartbeat 2.0.3 cvs version
>>>>> and stonith.
>>>>>
>>>>> I ran a cvs heartbeat which was checked out on 2005-10-18 and
>>>>> encountered a problem with stonithd which was killed by signal 11.
>>>>> The effects were that the stonith resources were NOT_ACTIVE and when I
>>>>> initiated a split brain no node could fence the other off.
>>>>>
>>>>> I thought maybe it's already fixed in cvs and checked out a version today
>>>>> (2005-10-26). But unfortunately this version seems to contain an even
>>>>> worse problem with stonith.
>>>>>
>>>>> After I started heartbeat on the two nodes and waited until it had
>>>>> started up completely, I initiated the split-brain situation. I had
>>>>> expected this to work because both stonith resources were active.
>>>>>
>>>>> In the logs I saw:
>>>>> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
>>>>> Scheduling Node sarek for STONITH
>>>>> That's what I want :)
>>>>> But then the following message appeared:
>>>>> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
>>>>> cannot add field to ha_msg.
>>>>
>>>> This is some kind of an issue in the lib/fencing/stonithd_lib.c file
>>>>
>>>> if ( (ha_msg_add_int(request, F_STONITHD_OPTYPE, op->optype)
>>>> != HA_OK )
>>>> ||(ha_msg_add(request, F_STONITHD_NODE, op->node_name ) !=
>>>> HA_OK)
>>>> ||(op->node_uuid == NULL
>>>> || ha_msg_add(request, F_STONITHD_NODE_UUID,
>>>> op->node_uuid) != HA_OK)
>>>> ||(op->private_data == NULL
>>>> || ha_msg_add(request, F_STONITHD_PDATA,
>>>> op->private_data) != HA_OK)
>>>> ||(ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout)
>>>> != HA_OK) ) {
>>>> stdlib_log(LOG_ERR, "stonithd_node_fence: "
>>>> "cannot add field to ha_msg.");
>>>> ZAPMSG(request);
>>>> return ST_FAIL;
>>>> }
>>>>
>>>> My guess is that op->node_name or op->optype is NULL. The code should
>>>> have validated those. Since they're critical, and they come from
>>>> who-knows-where (meaning some doofus user process), they should
>>>> definitely have been error checked, and there should be a clear
>>>> message about their errors.
>>>>
>>> Should be op->private_data == NULL. This condition is not reasonable.
>>> I'll fix it.
>>>
>>>> Things I don't quite understand...
>>>> UUIDs are normally special portable binary values with their own type
>>>> in the structure world... Having this be a string violates the law of
>>>> least surprise. If they're not really uuids, then they shouldn't be
>>>> CALLED uuids.
>>> There is a long story regarding this, it's required by Andrew.
>>
>> If Andrew requires you to call something which isn't a UUID as a uuid,
>> then he screwed up and he should fix it.
>
> delightfully tactful as ever.

Untactful, yes. Delightful, no. I screwed up. Again.

> From reading this, one would think that it's the first time we've
> had this discussion.

I wasn't sure it was this same issue, and I had (foolishly) hoped that
it wasn't really still broken.

The project really does use the concept of a UUID. It is (and has been
and will continue to be) inappropriate to misuse terminology and/or use
it in inconsistent ways. It creates confusion - because that word
already means something else. Confusion violates the principle of least
surprise.

How would you suggest we go about fixing this?

Would it be of value to have a bugzilla for this?

--
Alan Robertson <alanr [at] unix>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


pk at q-leap

Oct 28, 2005, 8:01 AM

Post #14 of 55 (2492 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Hello,

Alan Robertson wrote:

> My guess is that op->node_name or op->optype is NULL. The code should
> have validated those. Since they're critical, and they come from
> who-knows-where (meaning some doofus user process), they should
> definitely have been error checked, and there should be a clear
> message about their errors.
>
I'm sorry, but I don't understand any of this. Does that mean you know the
cause of this error, or just that the error message has no meaning?

Peter


alanr at unix

Oct 28, 2005, 10:16 PM

Post #15 of 55 (2502 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Peter Kruse wrote:
> Hello,
>
> Alan Robertson wrote:
>
>> My guess is that op->node_name or op->optype is NULL. The code should
>> have validated those. Since they're critical, and they come from
>> who-knows-where (meaning some doofus user process), they should
>> definitely have been error checked, and there should be a clear
>> message about their errors.
>>
> I'm sorry, but I don't understand any of this. Does that mean you know the
> cause of this error, or just that the error message has no meaning?

It means I was reading the code, got a clue from it, and was in
effect hinting to the author of that code to look at it in more detail.

From emails that were sent, it appears that he got the hint and looked
at it. From looking at the CVS logs, it looks like a patch was checked
in for this problem.

Exactly what the cause was (from your perspective), I'm not sure.

--
Alan Robertson <alanr [at] unix>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
Attachments: message-rfc822.eml (3.72 KB)


peinkofe at fhm

Oct 30, 2005, 10:29 AM

Post #16 of 55 (2519 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Hello everybody,

On Fri, Oct 28, 2005 at 11:16:28PM -0600, Alan Robertson wrote:
> Peter Kruse wrote:
> > Hello,
> >
> > Alan Robertson wrote:
> >
> >> My guess is that op->node_name or op->optype is NULL. The code should
> >> have validated those. Since they're critical, and they come from
> >> who-knows-where (meaning some doofus user process), they should
> >> definitely have been error checked, and there should be a clear
> >> message about their errors.
> >>
> > I'm sorry, but I don't understand any of this. Does that mean you know the
> > cause of this error, or just that the error message has no meaning?
>
> It means I was reading the code, and got a clue from it, and was in
> effect hinting to the author of that code to look at it in more detail.
>
> From emails that were sent, it appears that he got the hint and looked
> at it. From looking at the CVS logs, it looks like a patch was checked
> in for this problem.
>
Yes, I just tried the current CVS version and it works. Problem 2 (the "cannot add field to ha_msg" error) is gone, and Problem 1 seems to be solved as well.

Many thanks to all who fixed the problems.

Best regards,
Stefan Peinkofer
> Exactly what the cause was (from your perspective), I'm not sure.
>
> --
> Alan Robertson <alanr [at] unix>
>
> "Openness is the foundation and preservative of friendship... Let me
> claim from you at all times your undisguised opinions." - William
> Wilberforce

>
> linux-ha CVS committal
>
> Author : sunjd
> Host :
> Project : linux-ha
> Module : lib
>
> Dir : linux-ha/lib/fencing
>
>
> Modified Files:
> stonithd_lib.c
>
>
> Log Message:
> permit private_data be null
> ===================================================================
> RCS file: /home/cvs/linux-ha/linux-ha/lib/fencing/stonithd_lib.c,v
> retrieving revision 1.18
> retrieving revision 1.19
> diff -u -3 -r1.18 -r1.19
> --- stonithd_lib.c 24 Oct 2005 14:57:44 -0000 1.18
> +++ stonithd_lib.c 29 Oct 2005 03:31:29 -0000 1.19
> @@ -283,8 +283,6 @@
> ||(ha_msg_add(request, F_STONITHD_NODE, op->node_name ) != HA_OK)
> ||(op->node_uuid == NULL
> || ha_msg_add(request, F_STONITHD_NODE_UUID, op->node_uuid) != HA_OK)
> - ||(op->private_data == NULL
> - || ha_msg_add(request, F_STONITHD_PDATA, op->private_data) != HA_OK)
> ||(ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout)
> != HA_OK) ) {
> stdlib_log(LOG_ERR, "stonithd_node_fence: "
> @@ -292,6 +290,14 @@
> ZAPMSG(request);
> return ST_FAIL;
> }
> + if (op->private_data != NULL) {
> + if ( ha_msg_add(request, F_STONITHD_PDATA, op->private_data) != HA_OK) {
> + stdlib_log(LOG_ERR, "stonithd_node_fence: "
> + "Failed to add F_STONITHD_PDATA field to ha_msg.");
> + ZAPMSG(request);
> + return ST_FAIL;
> + }
> + }
>
> /* Send the stonith request message */
> if (msg2ipcchan(request, chan) != HA_OK) {
>
>
> _______________________________________________
> Linux-ha-cvs mailing list
> Linux-ha-cvs [at] lists
> http://lists.community.tummy.com/mailman/listinfo/linux-ha-cvs
>
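
The effect of the patch quoted above can be reproduced in a few lines of C. This is only a sketch under stated assumptions: struct fence_op, add_field() and the two build_request_*() functions are hypothetical stand-ins, not the real stonithd_lib.c code, and the constant values are chosen just for the demo. It shows why chaining `op->private_data == NULL` into the failure condition turns an optional field into a mandatory one, and how checking it separately avoids that. The per-field error messages follow Alan's remark about clear messages; they are not part of the committed patch.

```c
#include <stddef.h>
#include <stdio.h>

/* Demo-only constants; the real values live in the heartbeat headers. */
#define HA_OK   1
#define HA_FAIL 0
#define ST_OK   0
#define ST_FAIL (-1)

/* Hypothetical stand-in holding only the fields discussed in the thread. */
struct fence_op {
    int         optype;
    int         timeout;
    const char *node_name;
    const char *node_uuid;
    const char *private_data;   /* optional: may legitimately be NULL */
};

/* Stand-in for ha_msg_add(): assumed to reject NULL names and values. */
static int add_field(const char *name, const char *value)
{
    return (name != NULL && value != NULL) ? HA_OK : HA_FAIL;
}

/* Old logic: a NULL private_data makes the whole condition true, so the
 * request fails with one generic message. */
static int build_request_old(const struct fence_op *op)
{
    if (add_field("node", op->node_name) != HA_OK
        || (op->node_uuid == NULL
            || add_field("node_uuid", op->node_uuid) != HA_OK)
        || (op->private_data == NULL
            || add_field("pdata", op->private_data) != HA_OK)) {
        fprintf(stderr, "cannot add field to ha_msg.\n");
        return ST_FAIL;
    }
    return ST_OK;
}

/* New logic: mandatory fields are validated by name, and the optional
 * field is only added when it is present. */
static int build_request_new(const struct fence_op *op)
{
    if (op->node_name == NULL || op->node_uuid == NULL) {
        fprintf(stderr, "%s is NULL\n",
                op->node_name == NULL ? "node_name" : "node_uuid");
        return ST_FAIL;
    }
    if (add_field("node", op->node_name) != HA_OK
        || add_field("node_uuid", op->node_uuid) != HA_OK) {
        fprintf(stderr, "cannot add mandatory field\n");
        return ST_FAIL;
    }
    if (op->private_data != NULL
        && add_field("pdata", op->private_data) != HA_OK) {
        fprintf(stderr, "cannot add private_data field\n");
        return ST_FAIL;
    }
    return ST_OK;
}
```

With a request that has a node name and a node UUID but no private data, the old function reports failure and the new one succeeds, which matches the behaviour change described in the commit log ("permit private_data be null").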



hasjd at cn

Oct 31, 2005, 12:48 AM

Post #17 of 55 (2494 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Peter Kruse wrote:
> Hello,
>
> Alan Robertson wrote:
>
>
>>My guess is that op->node_name or op->optype is NULL. The code should
>>have validated those. Since they're critical, and they come from
>>who-knows-where (meaning some doofus user process), they should
>>definitely have been error checked, and there should be a clear
>>message about their errors.
>>
>
> I'm sorry, but I don't understand any of this. Does that mean you know the
> cause of this error, or just that the error message has no meaning?
>
> Peter

Anyway, I think the problem you encountered has been fixed in CVS. Please give it a try.
If you still see it, please tell me. Thanks.


--
BRs,

Sun Jiang Dong



peinkofe at fhm

Oct 31, 2005, 1:58 AM

Post #18 of 55 (2503 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Hello everybody,
On Sun, Oct 30, 2005 at 07:29:31PM +0100, peinkofe [at] fhm wrote:
> Hello everybody,
>
> On Fri, Oct 28, 2005 at 11:16:28PM -0600, Alan Robertson wrote:
> > Peter Kruse wrote:
> > > Hello,
> > >
> > > Alan Robertson wrote:
> > >
> > >> My guess is that op->node_name or op->optype is NULL. The code should
> > >> have validated those. Since they're critical, and they come from
> > >> who-knows-where (meaning some doofus user process), they should
> > >> definitely have been error checked, and there should be a clear
> > >> message about their errors.
> > >>
> > > I'm sorry, but I don't understand any of this. Does that mean you know the
> > > cause of this error, or just that the error message has no meaning?
> >
> > It means I was reading the code, and got a clue from it, and was in
> > effect hinting to the author of that code to look at it in more detail.
> >
> > From emails that were sent, it appears that he got the hint and looked
> > at it. From looking at the CVS logs, it looks like a patch was checked
> > in for this problem.
> >
> Yes, I just tried the current CVS version and it works. Problem 2 (the "cannot add field to ha_msg" error) is gone, and Problem 1 seems to be solved as well.
>
It seems I was a little too optimistic. Problem 1 isn't solved yet. In fact, it worked once and failed many times.
In the case that worked, a timeout of the monitor op was detected:
Oct 30 19:01:46 spock lrmd: [4468]: WARN: on_op_timeout_expired: TIMEOUT: operation monitor[15] on stonith::wti_nps::kill_sarek for client 4469, its parameters: timeout=5000 ipaddr=192.168.1.205 te-target-rc=7 lrm-is-probe=true password=XXXXXXX crm_feature_set=1.0.3 interval=10000 .

Oct 30 19:01:51 spock crmd: [4469]: ERROR: mask(lrm.c:do_lrm_event): LRM operation (15) monitor_10000 on kill_sarek Timed Out

Then it said that stonithd was killed by signal 11 and respawned it:
Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 4467 killed by signal 11.
Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 4467 dumped core
Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Client /usr/lib/heartbeat/stonithd killed by signal 11.
Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Respawning client "/usr/lib/heartbeat/stonithd":
Oct 30 19:01:55 spock heartbeat: [4447]: info: Starting child client "/usr/lib/heartbeat/stonithd" (0,0)
Oct 30 19:01:55 spock heartbeat: [11922]: info: Starting "/usr/lib/heartbeat/stonithd" as uid 0 gid 0 (pid 11922)

Then it said that it wanted to start the stonith resource again:

Oct 30 19:01:59 spock crmd: [4469]: info: mask(lrm.c:do_lrm_rsc_op): Performing op start on kill_sarek

And the resource was active again.
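
For readers unfamiliar with how a parent process can tell that a child was "killed by signal 11" and "dumped core", here is a minimal sketch using the standard waitpid() status macros. It is not the heartbeat code: struct child_exit, classify_status() and run_crashing_child() are hypothetical names, and the child here crashes on purpose by raising SIGSEGV. A POSIX system is assumed; WCOREDUMP is a common extension, so its use is guarded.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical summary of how a child process ended. */
struct child_exit {
    int killed_by_signal;   /* 1 if a signal terminated the child */
    int signo;              /* terminating signal, if any */
    int dumped_core;        /* 1 if a core file was reported */
    int exit_code;          /* exit status for a normal exit */
};

/* Decode a status word as returned by waitpid(). Returns 0 on success. */
static int classify_status(int status, struct child_exit *out)
{
    out->killed_by_signal = 0;
    out->signo = 0;
    out->dumped_core = 0;
    out->exit_code = 0;

    if (WIFSIGNALED(status)) {
        out->killed_by_signal = 1;
        out->signo = WTERMSIG(status);
#ifdef WCOREDUMP
        out->dumped_core = WCOREDUMP(status) ? 1 : 0;
#endif
        return 0;
    }
    if (WIFEXITED(status)) {
        out->exit_code = WEXITSTATUS(status);
        return 0;
    }
    return -1;
}

/* Fork a child that deliberately dies from SIGSEGV, then reap it. */
static int run_crashing_child(struct child_exit *out)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        signal(SIGSEGV, SIG_DFL);   /* make sure the default action applies */
        raise(SIGSEGV);
        _exit(127);                 /* only reached if the signal did not kill us */
    }
    /* Retry if the wait is interrupted by an unrelated signal. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            return -1;
        }
    }
    return classify_status(status, out);
}
```

A supervising process would typically log the decoded result and then start the child again, which is the sequence visible in the heartbeat log lines above. Whether a core file is actually written depends on system settings such as the core size limit.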

In the next case it didn't work.
Again it noticed the monitor op timeout:
Oct 30 19:31:50 spock lrmd: [4468]: WARN: on_op_timeout_expired: TIMEOUT: operation monitor[27] on stonith::wti_nps::kill_sarek for client 4469, its parameters: timeout=5000 ipaddr=192.168.1.205 te-target-rc=7 lrm-is-probe=true password=XXXXXXX crm_feature_set=1.0.3 interval=10000 .

Oct 30 19:31:58 spock crmd: [4469]: ERROR: mask(lrm.c:do_lrm_event): LRM operation (27) monitor_10000 on kill_sarek Timed Out

Then it tried to perform a start op on the stonith resource:
Oct 30 19:32:01 spock crmd: [4469]: info: mask(lrm.c:do_lrm_rsc_op): Performing op start on kill_sarek

which failed
Oct 30 19:32:12 spock crmd: [4469]: ERROR: mask(lrm.c:do_lrm_event): LRM operation (30) start_0 on kill_sarek Error: unknown error

and after that it noticed the death of stonithd and respawned it:
Oct 30 19:32:16 spock heartbeat: [4447]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 11922 killed by signal 11.
Oct 30 19:32:16 spock heartbeat: [4447]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 11922 dumped core
Oct 30 19:32:16 spock heartbeat: [4447]: ERROR: Client /usr/lib/heartbeat/stonithd killed by signal 11.
Oct 30 19:32:16 spock heartbeat: [4447]: ERROR: Respawning client "/usr/lib/heartbeat/stonithd":

I have attached a file which contains all the cases of stonithd being killed by signal 11 that occurred from yesterday to today. The last case is especially interesting: there it says something about STONITH_RA_EXEC: cannot sign on the stonithd, which occurred only in this case.

Btw, pengine reports something about a memory leak.

Many thanks in advance.
Stefan Peinkofer
> Many thanks to all who fixed the Problems.
>
> Best regards,
> Stefan Peinkofer
> > Exactly what the cause was (from your perspective), I'm not sure.
> >
> > --
> > Alan Robertson <alanr [at] unix>
> >
> > "Openness is the foundation and preservative of friendship... Let me
> > claim from you at all times your undisguised opinions." - William
> > Wilberforce
Attachments: cluster.logs.gz (27.6 KB)


alanr at unix

Oct 31, 2005, 7:18 AM

Post #19 of 55 (2488 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

peinkofe [at] fhm wrote:
> Hello everybody,
> On Sun, Oct 30, 2005 at 07:29:31PM +0100, peinkofe [at] fhm wrote:

>> Yes, I just tried the current CVS version and it works. Problem 2 (the "cannot add field to ha_msg" error) is gone, and Problem 1 seems to be solved as well.
>>
> It seems I was a little too optimistic. Problem 1 isn't solved yet. In fact, it worked once and failed many times.
> In the case that worked, a timeout of the monitor op was detected:
> Oct 30 19:01:46 spock lrmd: [4468]: WARN: on_op_timeout_expired: TIMEOUT: operation monitor[15] on stonith::wti_nps::kill_sarek for client 4469, its parameters: timeout=5000 ipaddr=192.168.1.205 te-target-rc=7 lrm-is-probe=true password=XXXXXXX crm_feature_set=1.0.3 interval=10000 .
>
> Oct 30 19:01:51 spock crmd: [4469]: ERROR: mask(lrm.c:do_lrm_event): LRM operation (15) monitor_10000 on kill_sarek Timed Out
>
> Then it said that stonithd was killed by signal 11 and respawned it:
> Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 4467 killed by signal 11.
> Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 4467 dumped core

WE NEED THE STACK TRACE FROM THIS CORE DUMP.

--
Alan Robertson <alanr [at] unix>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


peinkofe at fhm

Oct 31, 2005, 8:26 AM

Post #20 of 55 (2505 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Hello Alan,
On Mon, Oct 31, 2005 at 08:18:10AM -0700, Alan Robertson wrote:
> peinkofe [at] fhm wrote:
> > Hello everybody,
> > On Sun, Oct 30, 2005 at 07:29:31PM +0100, peinkofe [at] fhm wrote:
>
> >> Yes, I just tried the current CVS version and it works. Problem 2 (the "cannot add field to ha_msg" error) is gone, and Problem 1 seems to be solved as well.
> >>
> > It seems I was a little too optimistic. Problem 1 isn't solved yet. In fact, it worked once and failed many times.
> > In the case that worked, a timeout of the monitor op was detected:
> > Oct 30 19:01:46 spock lrmd: [4468]: WARN: on_op_timeout_expired: TIMEOUT: operation monitor[15] on stonith::wti_nps::kill_sarek for client 4469, its parameters: timeout=5000 ipaddr=192.168.1.205 te-target-rc=7 lrm-is-probe=true password=XXXXXXX crm_feature_set=1.0.3 interval=10000 .
> >
> > Oct 30 19:01:51 spock crmd: [4469]: ERROR: mask(lrm.c:do_lrm_event): LRM operation (15) monitor_10000 on kill_sarek Timed Out
> >
> > Then it said that stonithd was killed by signal 11 and respawned it:
> > Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 4467 killed by signal 11.
> > Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting /usr/lib/heartbeat/stonithd process 4467 dumped core
>
> WE NEED THE STACK TRACE FROM THIS CORE DUMP.
>
I'm sorry, I forgot. Attached are some gdb backtraces (I hope that is what you want, since pstack on Linux does not seem to support core files).

To avoid misunderstandings: do you agree that fixing the cause of the stonithd core dump does not solve the whole problem? I mean, stonithd recovers through the respawning mechanism, but what makes the situation worse is that the stonith resources fail to restart and therefore remain inactive.

Many thanks in advance.
Stefan Peinkofer
> --
> Alan Robertson <alanr [at] unix>
>
> "Openness is the foundation and preservative of friendship... Let me
> claim from you at all times your undisguised opinions." - William
> Wilberforce
Attachments: core_backtraces.txt (4.71 KB)


alanr at unix

Oct 31, 2005, 8:35 AM

Post #21 of 55 (2491 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

peinkofe [at] fhm wrote:
> Hello Alan, On Mon, Oct 31, 2005 at 08:18:10AM -0700, Alan Robertson
> wrote:
>> peinkofe [at] fhm wrote:
>>> Hello everybody, On Sun, Oct 30, 2005 at 07:29:31PM +0100,
>>> peinkofe [at] fhm wrote:
>>>> Yes, I just tried the current CVS version and it works.
>>>> Problem 2 (the "cannot add field to ha_msg" error) is gone, and
>>>> Problem 1 seems to be solved as well.
>>>>
>>> It seems I was a little too optimistic. Problem 1 isn't
>>> solved yet. In fact, it worked once and failed many times. In the
>>> case that worked, a timeout of the monitor op was detected:
>>> Oct 30 19:01:46 spock lrmd: [4468]: WARN: on_op_timeout_expired:
>>> TIMEOUT: operation monitor[15] on stonith::wti_nps::kill_sarek
>>> for client 4469, its parameters: timeout=5000
>>> ipaddr=192.168.1.205 te-target-rc=7 lrm-is-probe=true
>>> password=XXXXXXX crm_feature_set=1.0.3 interval=10000 .
>>>
>>> Oct 30 19:01:51 spock crmd: [4469]: ERROR:
>>> mask(lrm.c:do_lrm_event): LRM operation (15) monitor_10000 on
>>> kill_sarek Timed Out
>>>
>>> Then it said that stonithd was killed by signal 11 and respawned
>>> it:
>>> Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting
>>> /usr/lib/heartbeat/stonithd process 4467 killed by signal 11.
>>> Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting
>>> /usr/lib/heartbeat/stonithd process 4467 dumped core
>> WE NEED THE STACK TRACE FROM THIS CORE DUMP.
>>
> I'm sorry, I forgot. Attached are some gdb backtraces (I hope that is
> what you want, since pstack on Linux does not seem to support core files).
>
> To avoid misunderstandings: do you agree that fixing the cause of the
> stonithd core dump does not solve the whole problem? I mean, stonithd
> recovers through the respawning mechanism, but what makes the
> situation worse is that the stonith resources fail to restart and
> therefore remain inactive.

I agree that there are two problems.

IMHO, the more serious of the two is the core dump. The other wouldn't
be a problem if the stonithd hadn't needed to restart.

I don't know why the CRM didn't restart the resources when the monitor
operation failed. (At least, I think it failed)


--
Alan Robertson <alanr [at] unix>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


beekhof at gmail

Oct 31, 2005, 9:18 AM

Post #22 of 55 (2506 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

On 10/28/05, Alan Robertson <alanr [at] unix> wrote:
> Andrew Beekhof wrote:
> > On 10/28/05, Alan Robertson <alanr [at] unix> wrote:
> >> Sun Jiang Dong wrote:
> >>>
> >>> Alan Robertson wrote:
> >>>> Stefan Peinkofer wrote:
> >>>>
> >>>>> Hello everybody,
> >>>>>
> >>>>> unfortunately I have new problems with the heartbeat 2.0.3 cvs version
> >>>>> and stonith.
> >>>>>
> >>>>> I ran a cvs heartbeat which was checked out on 2005-10-18 and
> >>>>> encountered a problem with stonithd which was killed by signal 11.
> >>>>> The effects were that the stonith resources were NOT_ACTIVE and when I
> >>>>> initiated a split brain no node could fence the other off.
> >>>>>
> >>>>> I thought maybe it's already fixed in cvs and checked out a version today
> >>>>> (2005-10-26). But unfortunately this version seems to contain an even
> >>>>> worse problem with stonith.
> >>>>>
> >>>>> After I started heartbeat on the two nodes and waited until it had
> >>>>> started up completely, I initiated the split-brain situation. I had
> >>>>> expected this to work because both stonith resources were active.
> >>>>>
> >>>>> In the logs I saw:
> >>>>> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
> >>>>> Scheduling Node sarek for STONITH
> >>>>> Thats what I want :)
> >>>>> But then the following message appeared:
> >>>>> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
> >>>>> cannot add field to ha_msg.
> >>>>
> >>>> This is some kind of an issue in the lib/fencing/stonithd_lib.c file
> >>>>
> >>>> if ( (ha_msg_add_int(request, F_STONITHD_OPTYPE, op->optype)
> >>>> != HA_OK )
> >>>> ||(ha_msg_add(request, F_STONITHD_NODE, op->node_name ) !=
> >>>> HA_OK)
> >>>> ||(op->node_uuid == NULL
> >>>> || ha_msg_add(request, F_STONITHD_NODE_UUID,
> >>>> op->node_uuid) != HA_OK)
> >>>> ||(op->private_data == NULL
> >>>> || ha_msg_add(request, F_STONITHD_PDATA,
> >>>> op->private_data) != HA_OK)
> >>>> ||(ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout)
> >>>> != HA_OK) ) {
> >>>> stdlib_log(LOG_ERR, "stonithd_node_fence: "
> >>>> "cannot add field to ha_msg.");
> >>>> ZAPMSG(request);
> >>>> return ST_FAIL;
> >>>> }
> >>>>
> >>>> My guess is that op->node_name or op->optype is NULL. The code should
> >>>> have validated those. Since they're critical, and they come from
> >>>> who-knows-where (meaning some doofus user process), they should
> >>>> definitely have been error checked, and there should be a clear
> >>>> message about their errors.
> >>>>
> >>> Should be op->private_data == NULL. This condition is not reasonable.
> >>> I'll fix it.
> >>>
> >>>> Things I don't quite understand...
> >>>> UUIDs are normally special portable binary values with their own type
> >>>> in the structure world... Having this be a string violates the law of
> >>>> least surprise. If they're not really uuids, then they shouldn't be
> >>>> CALLED uuids.
> >>> There is a long story regarding this, it's required by Andrew.
> >>
> >> If Andrew requires you to call something which isn't a UUID as a uuid,
> >> then he screwed up and he should fix it.
> >
> > delightfully tactful as ever.
>
> Untactful, yes. Delightful, no. I screwed up. Again.
>
> > from reading this one would think that its the first time time we've
> > had this discussion.
>
> I wasn't sure it was this same issue, and I had (foolishly) hoped that
> it wasn't really still broken.
>
> The project really does use the concept of a UUID. It is (and has been
> and will continue to be) inappropriate to misuse terminology and/or use
> it in inconsistent ways. It creates confusion - because that word
> already means something else. Confusion violates the principle of least
> surprise.
>
> How would you suggest we go about fixing this?

My basic feeling about it is that requiring a uuid_t (rather than a
char*) doesn't help anyone - so there's nothing to fix :-)

Sure, we could use a uuid_t instead; it's just a call to cl_uuid_parse().

But the first thing that the function is going to (or at least should)
do is unparse it into a char* again so it can log what it's about
to do.

So I just don't see the added value of keeping it in one form vs. another.
But on the other hand, I don't actually care so much... if you're that
keen on a uuid_t then we can use that.

Btw, stonithd doesn't actually use it for anything internally.

> Would it be of value to have a bugzilla for this?

It's about a 2-line change in the TE where it calls stonith.

On the other hand, if you want me using uuid_t EVERYWHERE... that's a
different story.
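
To make the string-vs-binary trade-off concrete, here is a minimal
sketch of the parse/unparse round trip being discussed. It does not use
the heartbeat API: demo_uuid_t, demo_uuid_parse() and
demo_uuid_unparse() are hypothetical stand-ins written only for this
illustration, and the real cl_uuid functions may differ in signature
and behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a binary UUID type (16 raw bytes).
 * This is NOT the heartbeat type; it exists only for illustration. */
typedef struct { unsigned char bytes[16]; } demo_uuid_t;

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse the canonical 36-character form
 * "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx".
 * Returns 0 on success, -1 if the input is not in that form. */
int demo_uuid_parse(const char *in, demo_uuid_t *out)
{
    size_t i = 0, n = 0;

    if (in == NULL || out == NULL || strlen(in) != 36)
        return -1;
    while (i < 36) {
        if (i == 8 || i == 13 || i == 18 || i == 23) {
            if (in[i] != '-')
                return -1;
            i++;
            continue;
        }
        int hi = hex_val(in[i]);
        int lo = hex_val(in[i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out->bytes[n++] = (unsigned char)(hi * 16 + lo);
        i += 2;
    }
    return 0;
}

/* Write the canonical lowercase string form into out (36 chars + NUL). */
void demo_uuid_unparse(const demo_uuid_t *in, char out[37])
{
    static const char digits[] = "0123456789abcdef";
    size_t i, pos = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < 16; i++) {
        if (i == 4 || i == 6 || i == 8 || i == 10)
            out[pos++] = '-';
        out[pos++] = digits[in->bytes[i] >> 4];
        out[pos++] = digits[in->bytes[i] & 0x0f];
    }
    out[pos] = '\0';
}
```

With a binary type, a caller that only wants to log the value has to
unparse it first, which is the extra step described above. On the other
hand, a string that is not in UUID form (a plain node name, say) is
rejected by the parse step instead of being passed along under the name
"uuid".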


peinkofe at fhm

Oct 31, 2005, 9:21 AM

Post #23 of 55 (2505 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Hello Alan,
On Mon, Oct 31, 2005 at 09:35:58AM -0700, Alan Robertson wrote:
> peinkofe [at] fhm wrote:
> > Hello Alan, On Mon, Oct 31, 2005 at 08:18:10AM -0700, Alan Robertson
> > wrote:
> >> peinkofe [at] fhm wrote:
> >>> Hello everybody, On Sun, Oct 30, 2005 at 07:29:31PM +0100,
> >>> peinkofe [at] fhm wrote:
> >>>> Yes, I just tried the current cvs version and it works.
> >>>> (Problem 2 (the "cannot add field to ha_msg" Error) is gone and
> >>>> Problem 1 seems to be solved either)
> >>>>
> >>> Seems that I was a little bit too optimistic. Problem 1 isn't
> >>> solved yet. In fact it worked once and failed many times. In the
> >>> case which worked, a timeout of the monitor op was discovered:
> >>> Oct 30 19:01:46 spock lrmd: [4468]: WARN: on_op_timeout_expired:
> >>> TIMEOUT: operation monitor[15] on stonith::wti_nps::kill_sarek
> >>> for client 4469, its parameters: timeout=5000
> >>> ipaddr=192.168.1.205 te-target-rc=7 lrm-is-probe=true
> >>> password=XXXXXXX crm_feature_set=1.0.3 interval=10000 .
> >>>
> >>> Oct 30 19:01:51 spock crmd: [4469]: ERROR:
> >>> mask(lrm.c:do_lrm_event): LRM operation (15) monitor_10000 on
> >>> kill_sarek Timed Out
> >>>
> >>> The it said that sontihd was killed by signal 11 and respawned
> >>> it. Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting
> >>> /usr/lib/heartbeat/stonithd process 4467 killed by signal 11. Oct
> >>> 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting
> >>> /usr/lib/heartbeat/stonithd process 4467 dumped core
> >> WE NEED THE STACK TRACE FROM THIS CORE DUMP.
> >>
> > Im sorry, I forgot. Attached some gdb backtraces (hope that is what
> > you want, since pstack on linux seems not to support core files).
> >
> > To avoid misunderstandings, do you aggree, that solving the stonithd
> > coredump cause solves not the whole problem. I mean, stonithd
> > recovers through the respawning mechanism but what makes the
> > situation worse is that the stonith resources fail to restart and
> > therefore remain not active.
>
> I agree that there are two problems.
>
> IMHO, the more serious of the two is the core dump. The other wouldn't
> be a problem if the stonithd hadn't needed to restart.
>
From my humble user's point of view it's the other way round because,
to overstate it, a user doesn't care that stonithd segfaults as long as
the cluster does what it's supposed to do. By the way, I personally like
the approach of accepting that failures occur and adding "self-healing"
capabilities to recover, if possible.
> I don't know why the CRM didn't restart the resources when the monitor
> operation failed. (At least, I think it failed)
>
I think the CRM at least tried to restart the stonith resources, and one
time (see the first set of logfiles for this) it even succeeded in doing
so. Maybe there is a timing "problem": in the case where it succeeded,
the announcement of the resource restart came after the stonithd respawn
announcement. In the other cases, where the restart didn't succeed, it
was exactly the other way round.
>
Many thanks in advance.
Stefan Peinkofer


alanr at unix

Oct 31, 2005, 9:41 AM

Post #24 of 55 (2491 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

Andrew Beekhof wrote:
> On 10/28/05, Alan Robertson <alanr [at] unix> wrote:
>> Andrew Beekhof wrote:
>>> On 10/28/05, Alan Robertson <alanr [at] unix> wrote:
>>>> Sun Jiang Dong wrote:
>>>>> Alan Robertson wrote:
>>>>>> Stefan Peinkofer wrote:
>>>>>>
>>>>>>> Hello everybody,
>>>>>>>
>>>>>>> unforunately I have new prolbems with the heartbeat 2.0.3 cvs version
>>>>>>> and stonith.
>>>>>>>
>>>>>>> I ran a cvs heartbeat which was checked out on 2005-10-18 and
>>>>>>> encountered a problem with stonithd which was killed by signal 11.
>>>>>>> The effects were that the stonith resources were NOT_ACTIVE and when I
>>>>>>> initiated a split brain no node could fence the other off.
>>>>>>>
>>>>>>> I thought maybe it's already fixed in cvs and checkout a version today
>>>>>>> (2005-10-26). But unfortunately this version seems to contain a even
>>>>>>> worse problem with stonith.
>>>>>>>
>>>>>>> After I startup heartbeat on the two nodes, and wait until it's started
>>>>>>> up completely I initiated the split brain situation. I had expected that
>>>>>>> this works as expected because both stonith resources were active.
>>>>>>>
>>>>>>> In the logs I saw:
>>>>>>> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
>>>>>>> Scheduling Node sarek for STONITH
>>>>>>> Thats what I want :)
>>>>>>> But then the following message appeared:
>>>>>>> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
>>>>>>> cannot add field to ha_msg.
>>>>>> This is some kind of an issue in the lib/fencing/stonithd_lib.c file
>>>>>>
>>>>>> if ( (ha_msg_add_int(request, F_STONITHD_OPTYPE, op->optype)
>>>>>> != HA_OK )
>>>>>> ||(ha_msg_add(request, F_STONITHD_NODE, op->node_name ) !=
>>>>>> HA_OK)
>>>>>> ||(op->node_uuid == NULL
>>>>>> || ha_msg_add(request, F_STONITHD_NODE_UUID,
>>>>>> op->node_uuid) != HA_OK)
>>>>>> ||(op->private_data == NULL
>>>>>> || ha_msg_add(request, F_STONITHD_PDATA,
>>>>>> op->private_data) != HA_OK)
>>>>>> ||(ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout)
>>>>>> != HA_OK) ) {
>>>>>> stdlib_log(LOG_ERR, "stonithd_node_fence: "
>>>>>> "cannot add field to ha_msg.");
>>>>>> ZAPMSG(request);
>>>>>> return ST_FAIL;
>>>>>> }
>>>>>>
>>>>>> My guess is that op->node_name or op->optype is NULL. The code should
>>>>>> have validated those. Since they're critical, and they come from
>>>>>> who-knows-where (meaning some doofus user process), they should
>>>>>> definitely have been error checked, and there should be a clear
>>>>>> message about their errors.
>>>>>>
>>>>> Should be op->private_data == NULL. This condition is not reasonable.
>>>>> I'll fix it.
>>>>>
>>>>>> Things I don't quite understand...
>>>>>> UUIDs are normally special portable binary values with their own type
>>>>>> in the structure world... Having this be a string violates the law of
>>>>>> least surprise. If they're not really uuids, then they shouldn't be
>>>>>> CALLED uuids.
>>>>> There is a long story regarding this, it's required by Andrew.
>>>> If Andrew requires you to call something which isn't a UUID as a uuid,
>>>> then he screwed up and he should fix it.
>>> delightfully tactful as ever.
>> Untactful, yes. Delightful, no. I screwed up. Again.
>>
>>> from reading this one would think that its the first time time we've
>>> had this discussion.
>> I wasn't sure it was this same issue, and I had (foolishly) hoped that
>> it wasn't really still broken.
>>
>> The project really does use the concept of a UUID. It is (and has been
>> and will continue to be) inappropriate to misuse terminology and/or use
>> it in inconsistent ways. It creates confusion - because that word
>> already means something else. Confusion violates the principle of least
>> surprise.
>>
>> How would you suggest we go about fixing this?
>
> My basic feeling about it is that requiring a uuid_t (rather than a
> char*) doesnt help anyone - so there's nothing to fix :-)
>
> Sure we could use a uuid_t instead, its just a call to cl_uuid_parse().
>
> But the first thing that the function is going to (or at least should)
> do is unparse it into a char* again so they can log what they're about
> to do.
>
> So I just dont see the added value of keeping it in one form vs. another.
> But on the otherhand, I dont actually care so much... if you're that
> keen on a uuid_t then we can use that.
>
> Btw. the stonithd doesn't actually use it for anything internally.
>
>> Would it be of value to have a bugzilla for this?
>
> Its about a 2 line change in the TE where it calls stonith.
>
> On the otherhand, if you want me using uuid_t EVERYWHERE... thats a
> different story.


No, no no.

I just meant - let's not call it a uuid. Call it a charhandle or
something. uniquestring or something.

It's simply a nomenclature issue.
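
Separately from the naming, the quoted check from stonithd_lib.c treats
a NULL private_data as a failure, which is what produced the "cannot
add field to ha_msg" error. Below is a minimal sketch of that logic and
one possible correction. It is not the actual fix that went into CVS:
struct demo_msg, demo_msg_add() and the demo_build_request_*()
functions are hypothetical stand-ins, and treating private_data as
optional is an assumption based on the comment quoted above.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_OK         1
#define DEMO_FAIL       0
#define DEMO_MAX_FIELDS 8

/* Hypothetical stand-in for a message; NOT heartbeat's struct ha_msg. */
struct demo_msg {
    size_t      nfields;
    const char *names[DEMO_MAX_FIELDS];
    const char *values[DEMO_MAX_FIELDS];
};

/* Hypothetical stand-in for a fence request. */
struct demo_op {
    const char *node_name;    /* required */
    const char *private_data; /* assumed optional */
};

/* Hypothetical add-field helper: rejects NULL arguments and overflow. */
static int demo_msg_add(struct demo_msg *m, const char *name,
                        const char *value)
{
    if (m == NULL || name == NULL || value == NULL
        || m->nfields >= DEMO_MAX_FIELDS)
        return DEMO_FAIL;
    m->names[m->nfields] = name;
    m->values[m->nfields] = value;
    m->nfields++;
    return DEMO_OK;
}

/* Same shape as the quoted condition: because of "== NULL ||", a
 * request without private_data always fails. */
int demo_build_request_quoted(struct demo_msg *req,
                              const struct demo_op *op)
{
    assert(req != NULL && op != NULL);
    if (demo_msg_add(req, "node", op->node_name) != DEMO_OK
        || (op->private_data == NULL
            || demo_msg_add(req, "pdata", op->private_data) != DEMO_OK))
        return DEMO_FAIL;
    return DEMO_OK;
}

/* One possible correction: validate the required field explicitly and
 * add the optional field only when it is present. */
int demo_build_request_fixed(struct demo_msg *req,
                             const struct demo_op *op)
{
    assert(req != NULL && op != NULL);
    if (op->node_name == NULL)
        return DEMO_FAIL; /* a real caller would log which field is missing */
    if (demo_msg_add(req, "node", op->node_name) != DEMO_OK
        || (op->private_data != NULL
            && demo_msg_add(req, "pdata", op->private_data) != DEMO_OK))
        return DEMO_FAIL;
    return DEMO_OK;
}
```

The only difference between the two versions is "== NULL ||" versus
"!= NULL &&" on the optional field, plus an explicit check of the
required one so that the failure can be reported clearly.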


--
Alan Robertson <alanr [at] unix>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


alanr at unix

Oct 31, 2005, 9:49 AM

Post #25 of 55 (2491 views)
Permalink
Re: New problem(s) with heartbeat 2.0.3 and STONITH [In reply to]

peinkofe [at] fhm wrote:
> Hello Alan,
> On Mon, Oct 31, 2005 at 09:35:58AM -0700, Alan Robertson wrote:
>> peinkofe [at] fhm wrote:
>>> Hello Alan, On Mon, Oct 31, 2005 at 08:18:10AM -0700, Alan Robertson
>>> wrote:
>>>> peinkofe [at] fhm wrote:
>>>>> Hello everybody, On Sun, Oct 30, 2005 at 07:29:31PM +0100,
>>>>> peinkofe [at] fhm wrote:
>>>>>> Yes, I just tried the current cvs version and it works.
>>>>>> (Problem 2 (the "cannot add field to ha_msg" Error) is gone and
>>>>>> Problem 1 seems to be solved either)
>>>>>>
>>>>> Seems that I was a little bit too optimistic. Problem 1 isn't
>>>>> solved yet. In fact it worked once and failed many times. In the
>>>>> case which worked, a timeout of the monitor op was discovered:
>>>>> Oct 30 19:01:46 spock lrmd: [4468]: WARN: on_op_timeout_expired:
>>>>> TIMEOUT: operation monitor[15] on stonith::wti_nps::kill_sarek
>>>>> for client 4469, its parameters: timeout=5000
>>>>> ipaddr=192.168.1.205 te-target-rc=7 lrm-is-probe=true
>>>>> password=XXXXXXX crm_feature_set=1.0.3 interval=10000 .
>>>>>
>>>>> Oct 30 19:01:51 spock crmd: [4469]: ERROR:
>>>>> mask(lrm.c:do_lrm_event): LRM operation (15) monitor_10000 on
>>>>> kill_sarek Timed Out
>>>>>
>>>>> The it said that sontihd was killed by signal 11 and respawned
>>>>> it. Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting
>>>>> /usr/lib/heartbeat/stonithd process 4467 killed by signal 11. Oct
>>>>> 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting
>>>>> /usr/lib/heartbeat/stonithd process 4467 dumped core
>>>> WE NEED THE STACK TRACE FROM THIS CORE DUMP.
>>>>
>>> Im sorry, I forgot. Attached some gdb backtraces (hope that is what
>>> you want, since pstack on linux seems not to support core files).
>>>
>>> To avoid misunderstandings, do you aggree, that solving the stonithd
>>> coredump cause solves not the whole problem. I mean, stonithd
>>> recovers through the respawning mechanism but what makes the
>>> situation worse is that the stonith resources fail to restart and
>>> therefore remain not active.
>> I agree that there are two problems.
>>
>> IMHO, the more serious of the two is the core dump. The other wouldn't
>> be a problem if the stonithd hadn't needed to restart.

> Form my humble users point of view it's the other way round, because
> overstated a user doesn't care that stonithd segfaults as long as the
> cluster does what it's supposed to do

I understand. Obviously, I have a different perspective.

> By the way I personally like
> the approach to accept that failures occour and to add "self healing"
> capabilities to recover, if possible.

We obviously agree on that. Stuff happens.

>> I don't know why the CRM didn't restart the resources when the
>> monitor operation failed. (At least, I think it failed)

The respawn should usually happen before the monitor fails - unless
the timing was unlucky.

> I think CRM at least tried to restart the stonith resources and one
> time (see the first set of the logfiles for this) it even succeeded
> in doing so. Maybe there is a timing "problem" since the in the case
> it succeeded, the announcement of the resource restart was after the
> stointhd respawn announcment. In the other cases where restart didn't
> succed, it was exactly the other way round. Many thanks in advance.

OK

So it did succeed sometimes.

--
Alan Robertson <alanr [at] unix>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
