
Mailing List Archive: Linux-HA: Users

LRM_RSC_IDLE/LRM_RSC_BUSY

 

 



ikedaj at intellilink

Oct 25, 2007, 9:32 PM

Post #1 of 11
LRM_RSC_IDLE/LRM_RSC_BUSY

Hi,

I have been testing split-brain scenarios again,
and found the following log messages after both nodes recovered from a
split brain.

There are two cases:

debug: do_fsa_action: actions:trace: // A_CL_JOIN_RESULT
info: update_dc: Set DC to prec370e (2.0)
debug: do_cl_join_finalize_respond: Confirming join join-2: join_ack_nack
debug: on_msg_get_state:state of rsc prmDummy is LRM_RSC_IDLE

--- or ---

debug: do_fsa_action: actions:trace: // A_CL_JOIN_RESULT
info: update_dc: Set DC to prec370e (2.0)
debug: do_cl_join_finalize_respond: Confirming join join-2: join_ack_nack
debug: on_msg_get_state:state of rsc prmDummy is LRM_RSC_BUSY

If the state of a resource was LRM_RSC_IDLE,
that resource would stop normally.

If it was LRM_RSC_BUSY,
the fail count would be increased,
and the return code was set to 14 (EXECRA_STATUS_UNKNOWN?).

What does the LRM_RSC_BUSY status mean for a resource?
When does the lrm regard a resource as LRM_RSC_BUSY?

Best Regards,
Junko Ikeda

NTT DATA INTELLILINK CORPORATION


_______________________________________________
Linux-HA mailing list
Linux-HA[at]lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


dejanmm at fastmail

Oct 28, 2007, 4:41 PM

Post #2 of 11
Re: LRM_RSC_IDLE/LRM_RSC_BUSY

Hi,

On Fri, Oct 26, 2007 at 01:32:06PM +0900, Junko IKEDA wrote:
> Hi,
>
> I have been testing split-brain scenarios again,
> and found the following log messages after both nodes recovered from a
> split brain.
>
> There are two cases:
>
> debug: do_fsa_action: actions:trace: // A_CL_JOIN_RESULT
> info: update_dc: Set DC to prec370e (2.0)
> debug: do_cl_join_finalize_respond: Confirming join join-2: join_ack_nack
> debug: on_msg_get_state:state of rsc prmDummy is LRM_RSC_IDLE
>
> --- or ---
>
> debug: do_fsa_action: actions:trace: // A_CL_JOIN_RESULT
> info: update_dc: Set DC to prec370e (2.0)
> debug: do_cl_join_finalize_respond: Confirming join join-2: join_ack_nack
> debug: on_msg_get_state:state of rsc prmDummy is LRM_RSC_BUSY
>
> If the state of a resource was LRM_RSC_IDLE,
> that resource would stop normally.
>
> If it was LRM_RSC_BUSY,
> the fail count would be increased,
> and the return code was set to 14 (EXECRA_STATUS_UNKNOWN?).

That should not have anything to do with it. If the resource is
busy, the requested operation will be postponed until it becomes
idle. The CRM handles such a situation.

> What does the LRM_RSC_BUSY status mean for a resource?
> When does the lrm regard a resource as LRM_RSC_BUSY?

A resource is busy whenever there's an operation running on it,
e.g. a monitor. Idle is the opposite.
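
As a rough illustration of that rule (not the lrmd's actual code), a resource's state can be modelled as depending only on whether any operation is currently in flight. The type and function names below are hypothetical and exist only for this sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the lrmd's resource states (not the real lrm headers). */
enum sketch_rsc_state { SKETCH_RSC_IDLE, SKETCH_RSC_BUSY };

/* Hypothetical resource record: it only tracks how many operations are in flight. */
struct sketch_rsc {
    const char *id;
    int running_ops;   /* operations (start, stop, monitor, ...) currently executing */
};

/* A resource is busy while at least one operation is running, idle otherwise. */
enum sketch_rsc_state sketch_rsc_get_state(const struct sketch_rsc *rsc)
{
    assert(rsc != NULL);
    return rsc->running_ops > 0 ? SKETCH_RSC_BUSY : SKETCH_RSC_IDLE;
}
```

Under this model, a monitor that takes a long time to return keeps the resource busy for as long as it runs.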

Thanks,

Dejan



ikedaj at intellilink

Oct 28, 2007, 9:13 PM

Post #3 of 11
RE: LRM_RSC_IDLE/LRM_RSC_BUSY

> > if it's LRM_RSC_BUSY,
> > a fail count would be increased,
> > and a return code was set as 14 (EXECRA_STATUS_UNKNOWN ?).
>
> That should not have anything to do with it. If the resource is
> busy, the requested operation will be postponed until it becomes
> idle. The CRM handles such a situation.

Do you mean that if the RA is busy, the CRM will wait until it becomes idle?
It seems that the CRM doesn't wait.

lrmd[9049]: 2007/10/29_12:47:35 debug: on_msg_get_state:state of rsc prmDummy is LRM_RSC_BUSY
crmd[9136]: 2007/10/29_12:47:35 WARN: msg_to_op(1173): failed to get the value of field lrm_opstatus from a ha_msg
...
crmd[9136]: 2007/10/29_12:47:35 WARN: msg_to_op(1173): failed to get the
value of field lfailcount: Updating failcount for prmDummy on
9d9ca527-cea9-470c-9e03-e49fe5630bba after failed monitor: rc=14


> A resource is busy whenever there's an operation running on it,
> e.g. a monitor. Idle is the opposite.

I used a modified Dummy resource agent to emulate a slow monitor operation.
This RA calls "sleep 50" immediately after monitoring
(see the attached Dummy RA).
I wonder whether that is what causes the RA's busy status.

Thanks,
Junko
Attachments: Dummy (4.80 KB)
  hb_report.tar.gz (68.5 KB)


beekhof at gmail

Oct 29, 2007, 2:56 AM

Post #4 of 11
Re: LRM_RSC_IDLE/LRM_RSC_BUSY

On 10/29/07, Junko IKEDA <ikedaj[at]intellilink.co.jp> wrote:
> > > if it's LRM_RSC_BUSY,
> > > a fail count would be increased,
> > > and a return code was set as 14 (EXECRA_STATUS_UNKNOWN ?).
> >
> > That should not have anything to do with it. If the resource is
> > busy, the requested operation will be postponed until it becomes
> > idle. The CRM handles such a situation.
>
> Do you mean that if RA is busy, CRM will wait until it becomes idle?

No, we just fire off operations and let the lrmd tell us when they're done.

> It seems that CRM doesn't wait.
>
> lrmd[9049]: 2007/10/29_12:47:35 debug: on_msg_get_state:state of rsc
> prmDummy is LRM_RSC_BUSY
> crmd[9136]: 2007/10/29_12:47:35 WARN: msg_to_op(1173): failed to get the
> value of field lrm_opstatus from a ha_msg
> ...
> crmd[9136]: 2007/10/29_12:47:35 WARN: msg_to_op(1173): failed to get the
> value of field lfailcount: Updating failcount for prmDummy on
> 9d9ca527-cea9-470c-9e03-e49fe5630bba after failed monitor: rc=14

That's the lrm client library by the looks of it (I don't see that
function anywhere in the crm). I wonder what failcount it's modifying.
Possibly a spelling mistake too: "lfailcount".

> > A resource is busy whenever there's an operation running on it,
> > e.g. a monitor. Idle is the opposite.
>
> I used a modified Dummy resource agent to emulate a slow monitor operation.
> This RA calls "sleep 50" immediately after monitoring
> (see the attached Dummy RA).
> I wonder whether that is what causes the RA's busy status.


dejanmm at fastmail

Oct 29, 2007, 5:47 AM

Post #5 of 11
Re: LRM_RSC_IDLE/LRM_RSC_BUSY

Hi,

On Mon, Oct 29, 2007 at 01:13:44PM +0900, Junko IKEDA wrote:
> > > if it's LRM_RSC_BUSY,
> > > a fail count would be increased,
> > > and a return code was set as 14 (EXECRA_STATUS_UNKNOWN ?).
> >
> > That should not have anything to do with it. If the resource is
> > busy, the requested operation will be postponed until it becomes
> > idle. The CRM handles such a situation.
>
> Do you mean that if RA is busy, CRM will wait until it becomes idle?
> It seems that CRM doesn't wait.
>
> lrmd[9049]: 2007/10/29_12:47:35 debug: on_msg_get_state:state of rsc
> prmDummy is LRM_RSC_BUSY
> crmd[9136]: 2007/10/29_12:47:35 WARN: msg_to_op(1173): failed to get the
> value of field lrm_opstatus from a ha_msg

I'd presume because the operation never ran.

> ...
> crmd[9136]: 2007/10/29_12:47:35 WARN: msg_to_op(1173): failed to get the
> value of field lfailcount: Updating failcount for prmDummy on
> 9d9ca527-cea9-470c-9e03-e49fe5630bba after failed monitor: rc=14

That should've read:

tengine[9138]: 2007/10/29_12:47:35 WARN: update_failcount:
Updating failcount for prmDummy on
9d9ca527-cea9-470c-9e03-e49fe5630bba after failed monitor: rc=14

This looks wrong. The CRM shouldn't consider an operation failed
if the operation status is pending (that's what is substituted when
there's no op status) and the rc is set to 14
(EXECRA_STATUS_UNKNOWN).

> > A resource is busy whenever there's an operation running on it,
> > e.g. a monitor. Idle is the opposite.
>
> I used a modified Dummy resource agent to emulate a slow monitor operation.
> This RA calls "sleep 50" immediately after monitoring
> (see the attached Dummy RA).
> I wonder whether that is what causes the RA's busy status.

That sleep is part of the monitor operation. While it's running
the resource is in the busy state.

Thanks,

Dejan



beekhof at gmail

Oct 29, 2007, 10:38 AM

Post #6 of 11
Re: LRM_RSC_IDLE/LRM_RSC_BUSY

On 10/29/07, Dejan Muhamedagic <dejanmm[at]fastmail.fm> wrote:
> Hi,
>
> On Mon, Oct 29, 2007 at 01:13:44PM +0900, Junko IKEDA wrote:
> > > > if it's LRM_RSC_BUSY,
> > > > a fail count would be increased,
> > > > and a return code was set as 14 (EXECRA_STATUS_UNKNOWN ?).
> > >
> > > That should not have anything to do with it. If the resource is
> > > busy, the requested operation will be postponed until it becomes
> > > idle. The CRM handles such a situation.
> >
> > Do you mean that if RA is busy, CRM will wait until it becomes idle?
> > It seems that CRM doesn't wait.
> >
> > lrmd[9049]: 2007/10/29_12:47:35 debug: on_msg_get_state:state of rsc
> > prmDummy is LRM_RSC_BUSY
> > crmd[9136]: 2007/10/29_12:47:35 WARN: msg_to_op(1173): failed to get the
> > value of field lrm_opstatus from a ha_msg
>
> I'd presume because the operation never ran.
>
> > ...
> > crmd[9136]: 2007/10/29_12:47:35 WARN: msg_to_op(1173): failed to get the
> > value of field lfailcount: Updating failcount for prmDummy on
> > 9d9ca527-cea9-470c-9e03-e49fe5630bba after failed monitor: rc=14
>
> That should've read:
>
> tengine[9138]: 2007/10/29_12:47:35 WARN: update_failcount:
> Updating failcount for prmDummy on
> 9d9ca527-cea9-470c-9e03-e49fe5630bba after failed monitor: rc=14
>
> This looks wrong. The CRM shouldn't consider an operation failed
> if the operation status is pending (that's what is replaced when
> there's no op status) and the rc set to 14
> (EXECRA_STATUS_UNKNOWN).

I think this is the right patch...

We can't filter it when the crmd is querying the lrmd because the PE
needs to know that the op has been scheduled. This will stop the TE
from incrementing the failcount though (and from doing pretty much anything
else for pending operations).

diff -r 09fb789b3e82 crm/tengine/events.c
--- a/crm/tengine/events.c Mon Oct 29 13:35:03 2007 +0100
+++ b/crm/tengine/events.c Mon Oct 29 14:42:45 2007 +0100
@@ -501,6 +501,10 @@ process_graph_event(crm_data_t *event, c
         abort_transition(INFINITY, tg_restart,"Bad event", event);
         );

+    if(status == LRM_OP_PENDING) {
+        goto bail;
+    }
+
     if(transition_num == -1) {
         crm_err("Action %s initiated outside of a transition", id);
         abort_transition(INFINITY, tg_restart,"Unexpected event",event);
@@ -532,6 +536,7 @@ process_graph_event(crm_data_t *event, c
         update_failcount(event, event_node, rc);
     }

+  bail:
     crm_free(update_te_uuid);
     return;
 }
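
The effect of the first hunk can be sketched in isolation: an event whose status is still pending bails out before any failure handling, so its failcount is never touched. This is a simplified model, not the tengine code. `sketch_op_status`, `sketch_event` and `sketch_should_update_failcount` are hypothetical names, and the failure test used here (status not done, or rc different from the expected rc) is an assumption made purely for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation statuses; the values are illustrative only. */
enum sketch_op_status { SKETCH_OP_PENDING, SKETCH_OP_DONE, SKETCH_OP_ERROR };

/* Hypothetical event record carrying what the decision needs. */
struct sketch_event {
    enum sketch_op_status status;
    int rc;          /* return code reported for the operation */
    int target_rc;   /* return code that was expected */
};

/* Returns true when the failcount would be incremented for this event. */
bool sketch_should_update_failcount(const struct sketch_event *ev)
{
    assert(ev != NULL);

    /* Mirrors the added "goto bail": pending operations are skipped entirely. */
    if (ev->status == SKETCH_OP_PENDING) {
        return false;
    }

    /* Assumed failure rule for this sketch only. */
    return ev->status != SKETCH_OP_DONE || ev->rc != ev->target_rc;
}
```

With this model, an event with a pending status and rc=14 does not count as a failure, while the same rc with an error status still does.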


ikedaj at intellilink

Oct 30, 2007, 12:11 AM

Post #7 of 11
RE: LRM_RSC_IDLE/LRM_RSC_BUSY

> > This looks wrong. The CRM shouldn't consider an operation failed
> > if the operation status is pending (that's what is replaced when
> > there's no op status) and the rc set to 14
> > (EXECRA_STATUS_UNKNOWN).
>
> I think this is the right patch...
>
> We can't filter it when the crmd is querying the lrmd because the PE
> needs to know that the op has been scheduled. This will stop the TE
> from incrementing the failcount though (and pretty much doing anything
> else for a pending operations).
>

It seems that "status" is not LRM_OP_PENDING but LRM_OP_ERROR here...
I found the following log message.
debug: build_operation_update: Mapping pending operation to ERROR

I will think hard about the test case again.
As Dejan said, the sleep added to the monitor operation causes this strange
condition,
so our test might not make sense.
Thanks anyway.



ikedaj at intellilink

Oct 30, 2007, 3:05 AM

Post #8 of 11
RE: LRM_RSC_IDLE/LRM_RSC_BUSY

> We can't filter it when the crmd is querying the lrmd because the PE
> needs to know that the op has been scheduled. This will stop the TE
> from incrementing the failcount though (and pretty much doing anything
> else for a pending operations).
>
> diff -r 09fb789b3e82 crm/tengine/events.c
> --- a/crm/tengine/events.c Mon Oct 29 13:35:03 2007 +0100
> +++ b/crm/tengine/events.c Mon Oct 29 14:42:45 2007 +0100
> @@ -501,6 +501,10 @@ process_graph_event(crm_data_t *event, c
> abort_transition(INFINITY, tg_restart,"Bad event", event);
> );
>
> + if(status == LRM_OP_PENDING) {
> + goto bail;
> + }
> +
> if(transition_num == -1) {
> crm_err("Action %s initiated outside of a transition", id);
> abort_transition(INFINITY, tg_restart,"Unexpected event",event);
> @@ -532,6 +536,7 @@ process_graph_event(crm_data_t *event, c
> update_failcount(event, event_node, rc);
> }
>
> + bail:
> crm_free(update_te_uuid);
> return;
> }

If this patch does not cause trouble for other people,
would it be possible to apply it together with the one above?
I want the "busy" monitor operation to be handled like any other operation that
doesn't have an interval value,
although it does feel a little strange...

--- a/crm/crmd/lrm.c 2007-10-30 17:54:43.000000000 +0900
+++ b/crm/crmd/lrm.c 2007-10-30 17:57:28.000000000 +0900
@@ -524,10 +524,7 @@
         if(op->rc == 0) {
             crm_debug("Mapping pending operation to DONE");
             op->op_status = LRM_OP_DONE;
-        } else {
-            crm_debug("Mapping pending operation to ERROR");
-            op->op_status = LRM_OP_ERROR;
-        }
+        }
     }

     xml_op = find_entity(xml_rsc, XML_LRM_TAG_RSC_OP, op_id);
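
To make the change in behaviour concrete, here is a small stand-alone model of the mapping before and after this patch. It is only a sketch: the enum, the `sketch_op` struct and the `sketch_map_pending` helper are hypothetical, the outer "only pending operations are remapped" check is an assumption about the surrounding code, and the `patched` flag merely selects which of the two versions to mimic:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation statuses; the values are illustrative only. */
enum sketch_status { SKETCH_PENDING, SKETCH_DONE, SKETCH_ERROR };

/* Hypothetical operation record with just the two fields the mapping uses. */
struct sketch_op {
    enum sketch_status op_status;
    int rc;
};

/* Remap a pending operation; patched != 0 mimics the code after the patch. */
void sketch_map_pending(struct sketch_op *op, int patched)
{
    assert(op != NULL);

    if (op->op_status != SKETCH_PENDING) {
        return;                          /* assumed: only pending ops are remapped */
    }
    if (op->rc == 0) {
        op->op_status = SKETCH_DONE;     /* unchanged by the patch */
    } else if (!patched) {
        op->op_status = SKETCH_ERROR;    /* the branch the patch removes */
    }
    /* patched and rc != 0: the status is left as pending */
}
```

With the patched mapping, a still-running monitor reported with rc=14 stays pending instead of being turned into an error, which is what allows a later check for a pending status to skip it.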


Thanks,
Junko




dejanmm at fastmail

Oct 30, 2007, 4:26 AM

Post #9 of 11
Re: LRM_RSC_IDLE/LRM_RSC_BUSY

Hi,

On Tue, Oct 30, 2007 at 04:11:11PM +0900, Junko IKEDA wrote:
> > > This looks wrong. The CRM shouldn't consider an operation failed
> > > if the operation status is pending (that's what is replaced when
> > > there's no op status) and the rc set to 14
> > > (EXECRA_STATUS_UNKNOWN).
> >
> > I think this is the right patch...
> >
> > We can't filter it when the crmd is querying the lrmd because the PE
> > needs to know that the op has been scheduled. This will stop the TE
> > from incrementing the failcount though (and pretty much doing anything
> > else for a pending operations).
> >
>
> It seems that "status" is not LRM_OP_PENDING but LRM_OP_ERROR here...
> I found the following log message.
> debug: build_operation_update: Mapping pending operation to ERROR
>
> I will think hard about the test case again.
> As Dejan said, the sleep added to the monitor operation causes this strange
> condition,
> so our test might not make sense.

This test does make sense. We have to be able to deal with all
timing issues of resources.

Thanks,

Dejan



beekhof at gmail

Oct 31, 2007, 3:39 AM

Post #10 of 11
Re: LRM_RSC_IDLE/LRM_RSC_BUSY

On 10/30/07, Junko IKEDA <ikedaj[at]intellilink.co.jp> wrote:
> > We can't filter it when the crmd is querying the lrmd because the PE
> > needs to know that the op has been scheduled. This will stop the TE
> > from incrementing the failcount though (and pretty much doing anything
> > else for a pending operations).
> >
> > diff -r 09fb789b3e82 crm/tengine/events.c
> > --- a/crm/tengine/events.c Mon Oct 29 13:35:03 2007 +0100
> > +++ b/crm/tengine/events.c Mon Oct 29 14:42:45 2007 +0100
> > @@ -501,6 +501,10 @@ process_graph_event(crm_data_t *event, c
> > abort_transition(INFINITY, tg_restart,"Bad event", event);
> > );
> >
> > + if(status == LRM_OP_PENDING) {
> > + goto bail;
> > + }
> > +
> > if(transition_num == -1) {
> > crm_err("Action %s initiated outside of a transition", id);
> > abort_transition(INFINITY, tg_restart,"Unexpected event",event);
> > @@ -532,6 +536,7 @@ process_graph_event(crm_data_t *event, c
> > update_failcount(event, event_node, rc);
> > }
> >
> > + bail:
> > crm_free(update_te_uuid);
> > return;
> > }
>
> If this patch does not cause trouble for other people,
> would it be possible to apply it together with the one above?

You're right: the previous patch on its own is not enough; yours is needed as well.

> I want the "busy" monitor operation to be handled like any other operation that
> doesn't have an interval value,
> although it does feel a little strange...
>
> --- a/crm/crmd/lrm.c 2007-10-30 17:54:43.000000000 +0900
> +++ b/crm/crmd/lrm.c 2007-10-30 17:57:28.000000000 +0900
> @@ -524,10 +524,7 @@
> if(op->rc == 0) {
> crm_debug("Mapping pending operation to DONE");
> op->op_status = LRM_OP_DONE;
> - } else {
> - crm_debug("Mapping pending operation to ERROR");
> - op->op_status = LRM_OP_ERROR;
> - }
> + }
> }
>
> xml_op = find_entity(xml_rsc, XML_LRM_TAG_RSC_OP, op_id);


ikedaj at intellilink

Nov 1, 2007, 1:54 AM

Post #11 of 11
RE: LRM_RSC_IDLE/LRM_RSC_BUSY

I made a lot of noise about this issue,
but it seems that this problem wouldn't happen if we use Guochun's patch.
http://hg.linux-ha.org/dev/rev/ee8dea66ae1b
CCM behavior had an impact on various things...

By the way, I think Matsuda-san's patch is also needed to recover from a
split brain.
http://www.gossamer-threads.com/lists/linuxha/dev/43271

Please reconsider it.

Thanks,
Junko
Attachments: hb_report.tar.gz (61.7 KB)
