Mailing List Archive: Linux-HA: Dev

Difference between OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED ?

 

 



pica1dilly at yahoo

Jul 4, 2008, 2:55 AM

Post #1 of 12 (705 views)
Difference between OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED ?

Exactly how does heartbeat handle OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED differently?


beekhof at gmail

Jul 4, 2008, 3:44 AM

Post #2 of 12 (687 views)
Re: Difference between OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED ? [In reply to]

2008/7/4 Joe Bill <pica1dilly[at]yahoo.com>:
> Exactly how does heartbeat handle OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED
> differently ?
>

From some badly formatted and not-quite-finished documentation:

soft = stop and retry
hard = stop and retry - current node is excluded
fatal = stop - all nodes are excluded

OCF Return Code | Description | Recovery Type
0 | Success. The command completed successfully. This is the expected result for all start, stop, promote and demote commands. | N/A
1 | Generic "there was a problem" error code. | soft
2 | Incorrect arguments were passed to the resource agent. | fatal
3 | The requested action is not implemented. | hard
4 | The resource agent does not have sufficient privileges to complete the task. | hard
5 | The requested agent or tool required by the agent is not installed. | hard
6 | The resource's configuration is invalid. | fatal
7 | The resource is safely stopped. The cluster will not attempt to stop a resource that returns this for any action. | N/A
8 | The resource is running in Master mode. | N/A
9 | The resource is in Master mode but has failed. The resource will be demoted, stopped and then started (and possibly promoted) again. | soft
other | Custom error code. | soft
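
As a rough illustration, the table above can be read as a simple lookup. This is only a sketch in Python, not heartbeat's actual code: the names RECOVERY and recovery_type are made up for the example, and the mnemonics are the usual OCF names for codes 0 through 9 (only some of them appear in this thread).

```python
# Sketch only: recovery types exactly as listed in the table above.
# RECOVERY and recovery_type() are illustrative names, not heartbeat APIs.

RECOVERY = {
    0: ("OCF_SUCCESS", None),            # N/A
    1: ("OCF_ERR_GENERIC", "soft"),
    2: ("OCF_ERR_ARGS", "fatal"),
    3: ("OCF_ERR_UNIMPLEMENTED", "hard"),
    4: ("OCF_ERR_PERM", "hard"),
    5: ("OCF_ERR_INSTALLED", "hard"),
    6: ("OCF_ERR_CONFIGURED", "fatal"),
    7: ("OCF_NOT_RUNNING", None),        # N/A
    8: ("OCF_RUNNING_MASTER", None),     # N/A
    9: ("OCF_FAILED_MASTER", "soft"),
}


def recovery_type(rc):
    """Return 'soft', 'hard', 'fatal', or None (N/A) for an OCF return code."""
    if rc in RECOVERY:
        return RECOVERY[rc][1]
    # "other": custom error codes are listed as soft
    return "soft"
```

With this mapping, recovery_type(5) gives "hard" for OCF_ERR_INSTALLED and recovery_type(6) gives "fatal" for OCF_ERR_CONFIGURED, which is the difference the original question asks about.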
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev[at]lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


sergeyfd at gmail

Jul 4, 2008, 4:21 AM

Post #3 of 12 (683 views)
Re: Difference between OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED ? [In reply to]

On Fri, Jul 4, 2008 at 4:44 AM, Andrew Beekhof <beekhof[at]gmail.com> wrote:
> 2008/7/4 Joe Bill <pica1dilly[at]yahoo.com>:
>> Exactly how does heartbeat handle OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED
>> differently ?
>>
>
> >From some badly formatted and not-quite finished documentation:
>
> soft = stop and retry
> hard = stop and retry - current node is excluded
> fatal = stop - all nodes are excluded
>
> OCF Return Code Description Recovery Type
> 0 Success. The command complete successfully. This is the expected
> result for all start, stop, promote and demote commands. N/A
> 1 Generic "there was a problem" error code. soft
> 2 Incorrect arguments were passed to the resource agent. fatal
> 3 The requested action is not implemented. hard

It doesn't look like it works like that. If you remember the discussion
about crm_resource -F, it restarts the resource but doesn't exclude the
node.

> 4 The resource agent does not have sufficient privileges to complete
> the task. hard
> 5 The requested agent or tool required by the agent is not installed. hard
> 6 The resource's configuration is invalid. fatal
> 7 The resource is safely stopped. The cluster will not attempt to
> stop a resource that returns this for any action. N/A
> 8 The resource is running in Master mode. N/A
> 9 The resource is in Master mode but has failed. The resource will be
> demoted, stopped and then started (and possibly promoted) again. soft
> other Custom error code. soft



--
Serge Dubrouski.


beekhof at gmail

Jul 4, 2008, 6:13 AM

Post #4 of 12 (687 views)
Re: Difference between OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED ? [In reply to]

On Fri, Jul 4, 2008 at 13:21, Serge Dubrouski <sergeyfd[at]gmail.com> wrote:
> On Fri, Jul 4, 2008 at 4:44 AM, Andrew Beekhof <beekhof[at]gmail.com> wrote:
>> 2008/7/4 Joe Bill <pica1dilly[at]yahoo.com>:
>>> Exactly how does heartbeat handle OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED
>>> differently ?
>>>
>>
>> >From some badly formatted and not-quite finished documentation:
>>
>> soft = stop and retry
>> hard = stop and retry - current node is excluded
>> fatal = stop - all nodes are excluded
>>
>> OCF Return Code Description Recovery Type
>> 0 Success. The command complete successfully. This is the expected
>> result for all start, stop, promote and demote commands. N/A
>> 1 Generic "there was a problem" error code. soft
>> 2 Incorrect arguments were passed to the resource agent. fatal
>> 3 The requested action is not implemented. hard
>
> Doesn't look like it works like that. If you remember that discussion
> about crm_resource -F, it restarts resource but doesn't exclude the
> node,

This only applies to non-recurring actions; crm_resource -F was
pretending to start a recurring operation.

>> 4 The resource agent does not have sufficient privileges to complete
>> the task. hard
>> 5 The requested agent or tool required by the agent is not installed. hard
>> 6 The resource's configuration is invalid. fatal
>> 7 The resource is safely stopped. The cluster will not attempt to
>> stop a resource that returns this for any action. N/A
>> 8 The resource is running in Master mode. N/A
>> 9 The resource is in Master mode but has failed. The resource will be
>> demoted, stopped and then started (and possibly promoted) again. soft
>> other Custom error code. soft
>
>
>
> --
> Serge Dubrouski.


pica1dilly at yahoo

Jul 4, 2008, 7:52 AM

Post #5 of 12 (676 views)
Re: Difference between OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED ? [In reply to]

>--- On Fri, 7/4/08, Andrew Beekhof <beekhof[at]gmail.com> wrote:
>> Exactly how does heartbeat handle OCF_ERR_CONFIGURED and
>> OCF_ERR_INSTALLED differently ?
>
>From some badly formatted and not-quite finished documentation:
>
>soft = stop and retry
>hard = stop and retry - current node is excluded
>fatal = stop - all nodes are excluded

Since the documentation is not yet finished, I would like to take the opportunity to make the following suggestions:

- "soft" be changed to "error, unexpected"

- "hard" be changed to "fatal, local" or "critical, local", or "fatal, node" or "critical, node", because we have diagnosed that the resource at fault is local to the node on which the fault was detected

- "fatal" be changed to "fatal, common" or "critical, common", or "fatal, cluster" or "critical, cluster", because we have diagnosed that the resource at fault is common to all nodes in the cluster.

>5 The requested agent or tool required by the agent is
> not installed. hard

I believe "resource configuration" to be more appropriate here. HA shouldn't care at this point whether it's a piece of software or a local configuration file that is missing or broken.

add:

- or the resource's local configuration,
- or the node's specific configuration ... are invalid.

>6 The resource's configuration is invalid. fatal

I believe "instance configuration" to be more appropriate here,

replace with:

- the instance's configuration (common, shared, clusterwide resource configuration) is invalid,
- or the resource agent has detected a severe internal (programming,code) error.


Regarding the mnemonics of the return codes...

From your notes above, the status definitions appear to relate more to the restart and blocking effect the HA supervisor has on resources than to the situations the current mnemonics attempt to describe.

I am not sure it is such a good idea to combine a condition with the condition's handling action when defining the states that are reported to the supervisor.

From the description you provided: is the cause the supervisor's concern? Will the supervisor attempt anything to address the cause, or for that matter do anything different, if it receives any of the following statuses: OCF_ERR_UNIMPLEMENTED, OCF_ERR_PERM, OCF_ERR_INSTALLED?

Same question for OCF_ERR_ARGS and OCF_ERR_CONFIGURED.

Now the problem starts when I want to describe a condition where a resource needs an internal file (fixed name, not specified as a resource parameter) but the file is missing on one host and present on the others. Which condition would you choose?

Then there is the situation where a filename is specified as a resource parameter but that file does not exist on one host. Is it an OCF_ERR_INSTALLED error or an OCF_ERR_CONFIGURED error? Why not an OCF_ERR_ARGS? Can I even diagnose an OCF_ERR_ARGS when running the resource agent on only one node if that file DOES exist on other nodes? How is the resource agent going to check on the other nodes to see that the file exists there?







beekhof at gmail

Jul 8, 2008, 6:03 AM

Post #6 of 12 (633 views)
Re: Re: Difference between OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED ? [In reply to]

On Fri, Jul 4, 2008 at 16:52, Joe Bill <pica1dilly[at]yahoo.com> wrote:
>
>>--- On Fri, 7/4/08, Andrew Beekhof <beekhof[at]gmail.com> wrote:
>>> Exactly how does heartbeat handle OCF_ERR_CONFIGURED and
>>> OCF_ERR_INSTALLED differently ?
>>
> >From some badly formatted and not-quite finished documentation:
>>
>>soft = stop and retry
>>hard = stop and retry - current node is excluded
>>fatal = stop - all nodes are excluded
>
> Taking the opportunity then that the documentation is not yet finished, I would like to make the following suggestions:
>
> - "soft" be changed to "error, unexpected"
>
> - "hard" be changed to "fatal, local" or "critical, local", or "fatal, node" or "critical, node" because we have diagnosed that the resource at fault is local to the node where it has been detected on
>
> - "fatal" be changed to "fatal, common" or "critical, common" or "fatal, cluster" or "critical, cluster" because we have diagnosed that the resource at fault is common to all nodes in the cluster.
>
>>5 The requested agent or tool required by the agent is
>> not installed. hard
>
> I believe "resource configuration" to be more appropriate here. HA shouldn't care at this point if it's a piece of software or local configuration file that is missing or screwed.
>
> add:
>
> - or the resource's local configuration,
> - or the node's specific configuration ... are invalid.
>
>>6 The resource's configuration is invalid. fatal
>
> I believe "instance configuration" to be more appropriate here,
>
> replace with:
>
> - the instance's configuration (common, shared, clusterwide resource configuration) is invalid,
> - or the resource agent has detected a severe internal (programming,code) error.

makes sense

>
>
> Regarding the mnemonics of the return codes...
>
> >From your notes above, it seems the status definitions appear to be more related to the restart and blocking effect the HA supervisor has on resources, than what the current mnemonics attempt to describe as situation.
>
> I am not sure it is such a good idea to attempt to combine a condition with the condition's handling action in the process of defining states that are to be reported to the supervisor.

Not sure I follow this...

>
> >From what you provided as description, is it i.e. the supervisor's concern, and will the supervisor attempt anything to address the cause, or for that matter do anything different if it receives any of the following status: OCF_ERR_UNIMPLEMENTED, OCF_ERR_PERM, OCF_ERR_INSTALLED ?
>
> Same question for OCF_ERR_ARGS and OCF_ERR_CONFIGURED ?
>
> Now the problem starts when I want to describe a condition where a resource needs an internal ( fixed name, not specified as resource parameter) file but file is missing on one host and not on others. Which condition would you choose ?

OCF_ERR_ARGS, I guess, since that would exclude the failed node but
not the others.
If the file isn't available anywhere, then the resource will be tried
once on each node and the cluster will then give up.

> Then the situation where a filename is specified as resource parameter but that file does not exist on one host. Is it an OCF_ERR_INSTALLED error, or a OCF_ERR_CONFIGURED error, why not an OCF_ERR_ARGS ? Can I even diagnose a OCF_ERR_ARGS when running the resource agent on only one node if that file DOES exist on other nodes ? How is that resource agent going to check on the another nodes and see that the file does exist there ?

Why would you try to do this? Just let it fail once on each node.
OCF_ERR_CONFIGURED should only be used when the inputs are so bad that
the resource won't be able to run anywhere (e.g. "file" is mandatory but
no value was specified).
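
A minimal sketch of that rule, in Python for brevity (real resource agents are usually shell scripts). The validate function and the "file" parameter are hypothetical. Returning OCF_ERR_INSTALLED for a file that is missing on the local node is an assumption made for the example; it is just one of the node-local ("hard") codes and the discussion does not fix the choice.

```python
import os

# Standard OCF return code values used in this sketch.
OCF_SUCCESS = 0
OCF_ERR_INSTALLED = 5
OCF_ERR_CONFIGURED = 6


def validate(params):
    """Pick a return code for a hypothetical agent with a mandatory 'file' parameter."""
    path = params.get("file")
    if not path:
        # Mandatory input missing: the resource cannot run on any node.
        return OCF_ERR_CONFIGURED
    if not os.path.exists(path):
        # Only this node is known to be affected. Report a node-local
        # problem (assumed code) and let the cluster try the other nodes.
        return OCF_ERR_INSTALLED
    return OCF_SUCCESS
```

The agent never needs to look at other nodes: it reports what it sees locally, and the resource simply fails once on each node where the file is absent.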


beekhof at gmail

Jul 8, 2008, 6:09 AM

Post #7 of 12 (632 views)
Re: Re: Difference between OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED ? [In reply to]

On Tue, Jul 8, 2008 at 15:03, Andrew Beekhof <beekhof[at]gmail.com> wrote:
> On Fri, Jul 4, 2008 at 16:52, Joe Bill <pica1dilly[at]yahoo.com> wrote:
>>
>>>--- On Fri, 7/4/08, Andrew Beekhof <beekhof[at]gmail.com> wrote:
>>>> Exactly how does heartbeat handle OCF_ERR_CONFIGURED and
>>>> OCF_ERR_INSTALLED differently ?
>>>
>> >From some badly formatted and not-quite finished documentation:
>>>
>>>soft = stop and retry
>>>hard = stop and retry - current node is excluded
>>>fatal = stop - all nodes are excluded
>>
>> Taking the opportunity then that the documentation is not yet finished, I would like to make the following suggestions:
>>
>> - "soft" be changed to "error, unexpected"
>>
>> - "hard" be changed to "fatal, local" or "critical, local", or "fatal, node" or "critical, node" because we have diagnosed that the resource at fault is local to the node where it has been detected on
>>
>> - "fatal" be changed to "fatal, common" or "critical, common" or "fatal, cluster" or "critical, cluster" because we have diagnosed that the resource at fault is common to all nodes in the cluster.
>>
>>>5 The requested agent or tool required by the agent is
>>> not installed. hard
>>
>> I believe "resource configuration" to be more appropriate here. HA shouldn't care at this point if it's a piece of software or local configuration file that is missing or screwed.
>>
>> add:
>>
>> - or the resource's local configuration,
>> - or the node's specific configuration ... are invalid.
>>
>>>6 The resource's configuration is invalid. fatal
>>
>> I believe "instance configuration" to be more appropriate here,
>>
>> replace with:
>>
>> - the instance's configuration (common, shared, clusterwide resource configuration) is invalid,
>> - or the resource agent has detected a severe internal (programming,code) error.
>
> makes sense
>
>>
>>
>> Regarding the mnemonics of the return codes...
>>
>> >From your notes above, it seems the status definitions appear to be more related to the restart and blocking effect the HA supervisor has on resources, than what the current mnemonics attempt to describe as situation.
>>
>> I am not sure it is such a good idea to attempt to combine a condition with the condition's handling action in the process of defining states that are to be reported to the supervisor.
>
> Not sure I follow this...
>
>>
>> >From what you provided as description, is it i.e. the supervisor's concern, and will the supervisor attempt anything to address the cause, or for that matter do anything different if it receives any of the following status: OCF_ERR_UNIMPLEMENTED, OCF_ERR_PERM, OCF_ERR_INSTALLED ?
>>
>> Same question for OCF_ERR_ARGS and OCF_ERR_CONFIGURED ?
>>
>> Now the problem starts when I want to describe a condition where a resource needs an internal ( fixed name, not specified as resource parameter) file but file is missing on one host and not on others. Which condition would you choose ?
>
> OCF_ERR_ARGS i guess - since that would exclude the failed node but not the others.

Oops, args doesn't do this.
Probably OCF_ERR_INSTALLED then. Or maybe one of OCF_ERR_ARGS and
OCF_ERR_CONFIGURED needs to be made fatal.

> if the file isn't available anywhere, then the resource will be tried
> once on each node and give up.
>
>> Then the situation where a filename is specified as resource parameter but that file does not exist on one host. Is it an OCF_ERR_INSTALLED error, or a OCF_ERR_CONFIGURED error, why not an OCF_ERR_ARGS ? Can I even diagnose a OCF_ERR_ARGS when running the resource agent on only one node if that file DOES exist on other nodes ? How is that resource agent going to check on the another nodes and see that the file does exist there ?
>
> why would you try and do this? just let it fail once on each node.
> OCF_ERR_CONFIGURED should only be used when the inputs are so bad that
> the resource wont be able to run anywhere (ie. "file" is mandatory but
> no value was specified)
>


beekhof at gmail

Jul 8, 2008, 6:19 AM

Post #8 of 12 (632 views)
Re: Re: Difference between OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED ? [In reply to]

On Tue, Jul 8, 2008 at 15:09, Andrew Beekhof <beekhof[at]gmail.com> wrote:
> On Tue, Jul 8, 2008 at 15:03, Andrew Beekhof <beekhof[at]gmail.com> wrote:
>> On Fri, Jul 4, 2008 at 16:52, Joe Bill <pica1dilly[at]yahoo.com> wrote:
>>>
>>>>--- On Fri, 7/4/08, Andrew Beekhof <beekhof[at]gmail.com> wrote:
>>>>> Exactly how does heartbeat handle OCF_ERR_CONFIGURED and
>>>>> OCF_ERR_INSTALLED differently ?
>>>>
>>> >From some badly formatted and not-quite finished documentation:
>>>>
>>>>soft = stop and retry
>>>>hard = stop and retry - current node is excluded
>>>>fatal = stop - all nodes are excluded
>>>
>>> Taking the opportunity then that the documentation is not yet finished, I would like to make the following suggestions:
>>>
>>> - "soft" be changed to "error, unexpected"
>>>
>>> - "hard" be changed to "fatal, local" or "critical, local", or "fatal, node" or "critical, node" because we have diagnosed that the resource at fault is local to the node where it has been detected on
>>>
>>> - "fatal" be changed to "fatal, common" or "critical, common" or "fatal, cluster" or "critical, cluster" because we have diagnosed that the resource at fault is common to all nodes in the cluster.
>>>
>>>>5 The requested agent or tool required by the agent is
>>>> not installed. hard
>>>
>>> I believe "resource configuration" to be more appropriate here. HA shouldn't care at this point if it's a piece of software or local configuration file that is missing or screwed.
>>>
>>> add:
>>>
>>> - or the resource's local configuration,
>>> - or the node's specific configuration ... are invalid.
>>>
>>>>6 The resource's configuration is invalid. fatal
>>>
>>> I believe "instance configuration" to be more appropriate here,
>>>
>>> replace with:
>>>
>>> - the instance's configuration (common, shared, clusterwide resource configuration) is invalid,
>>> - or the resource agent has detected a severe internal (programming,code) error.
>>
>> makes sense
>>
>>>
>>>
>>> Regarding the mnemonics of the return codes...
>>>
>>> >From your notes above, it seems the status definitions appear to be more related to the restart and blocking effect the HA supervisor has on resources, than what the current mnemonics attempt to describe as situation.
>>>
>>> I am not sure it is such a good idea to attempt to combine a condition with the condition's handling action in the process of defining states that are to be reported to the supervisor.
>>
>> Not sure I follow this...
>>
>>>
>>> >From what you provided as description, is it i.e. the supervisor's concern, and will the supervisor attempt anything to address the cause, or for that matter do anything different if it receives any of the following status: OCF_ERR_UNIMPLEMENTED, OCF_ERR_PERM, OCF_ERR_INSTALLED ?
>>>
>>> Same question for OCF_ERR_ARGS and OCF_ERR_CONFIGURED ?
>>>
>>> Now the problem starts when I want to describe a condition where a resource needs an internal ( fixed name, not specified as resource parameter) file but file is missing on one host and not on others. Which condition would you choose ?
>>
>> OCF_ERR_ARGS i guess - since that would exclude the failed node but not the others.
>
> oops, args doesn't do this.
> probably OCF_ERR_INSTALLED then. or maybe one of OCF_ERR_ARGS and
> OCF_ERR_CONFIGURED needs to be made fatal.

Brain not working today... of course I meant "hard". And having
looked at everything again, I think this is the right approach.
So from now on OCF_ERR_ARGS will be a "hard" error instead of a "fatal" one.
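
To make the practical effect of that change concrete, here is a small Python sketch of the two recovery types as defined earlier in the thread ("hard" excludes the current node, "fatal" excludes all nodes). The place function and the node names are hypothetical and this is not how heartbeat is implemented; "soft" recovery is deliberately not modelled.

```python
# Sketch only: how "hard" vs "fatal" changes where a resource may still run.
# place() is a hypothetical helper, not part of heartbeat.

OCF_SUCCESS = 0
OCF_ERR_ARGS = 2


def place(nodes, results, kind):
    """Return (node that started the resource or None, list of nodes tried).

    results maps node name -> OCF return code of the start action;
    kind is the recovery type applied to any failure: "hard" or "fatal".
    """
    tried = []
    for node in nodes:
        tried.append(node)
        if results[node] == OCF_SUCCESS:
            return node, tried
        if kind == "fatal":
            break  # fatal: stop, all nodes are excluded
        # hard: only the current node is excluded; try the next one
    return None, tried


# The agent reports OCF_ERR_ARGS on node1 but would start fine on node2.
results = {"node1": OCF_ERR_ARGS, "node2": OCF_SUCCESS}

as_fatal = place(["node1", "node2"], results, "fatal")
as_hard = place(["node1", "node2"], results, "hard")
```

Treated as "fatal", the failure on node1 ends the attempt, so as_fatal is (None, ["node1"]). Treated as "hard", node2 still gets its turn, so as_hard is ("node2", ["node1", "node2"]).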


pica1dilly at yahoo

Jul 8, 2008, 9:06 AM

Post #9 of 12 (630 views)
Re: Re: Difference between OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED ? [In reply to]

> So from now on OCF_ERR_ARGS will be a "hard" error
> instead of a "fatal" one.

I copy that.

>> Regarding the mnemonics of the return codes...
>>
>> From your notes above, it seems the status
>> definitions appear to be more related to the
>> restart and blocking effect the HA supervisor
>> has on resources, than what the current mnemonics
>> attempt to describe as situation.
>>
>> I am not sure it is such a good idea to attempt to
>> combine a condition with the condition's handling
>> action in the process of defining states that are
>> to be reported to the supervisor.

> Not sure I follow this

I know it's a bit obscure, so that's why I continued ...

>> From what you provided as description, is it i.e.
>> the supervisor's concern, and will the supervisor
>> attempt anything to address the cause, or for that
>> matter do anything different if it receives any of
>> the following status: OCF_ERR_UNIMPLEMENTED,
>> OCF_ERR_PERM, OCF_ERR_INSTALLED ?

What I meant is: does heartbeat do anything different
depending on which of the 3 return codes directly above it receives
(or 4, if you now include OCF_ERR_ARGS), considering
that all of them cause, as you call it, a "hard" restart
of the resource ?

Or, in other words, are all 4 return codes necessary
if all we want in all 4 cases is to trigger a hard reset ?

In which case, this suggests that specifying the cause of
such a return code, like a "permissions failure" or an
"installation failure", is superfluous in the
mnemonic. So superfluous that it becomes misleading when
it comes to explaining the effect such a return code is
supposed to have on the supervisor.

Unless the supervisor does anything special in any of the 4
cases above, the condition returned is understood better
if only one return code describes it and the mnemonic is
better chosen, e.g. OCF_CRIT_LOCAL or OCF_CRIT_NODE or
OCF_FATAL_LOCAL or OCF_FATAL_NODE.

Similarly, OCF_ERR_CONFIGURED is understood better if
it is renamed to OCF_FATAL_COMMON or OCF_FATAL_CLUSTER.

And OCF_ERR_GENERIC, to plain old OCF_ERROR ...

So, to summarize, the original mnemonics attempt to
describe situations combined with the severity
of a condition to handle (ERR for error), while ignoring the
different effects such conditions have on the supervisor,
whereas the proposed scheme drops the situational
part of the mnemonic to focus on the severity and the
scope of the effect, bringing along a better understanding
of what the supervisor does and needs.






beekhof at gmail

Jul 9, 2008, 3:37 AM

Post #10 of 12 (617 views)
Re: Re: Re: Difference between OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED ? [In reply to]

On Tue, Jul 8, 2008 at 18:06, Joe Bill <pica1dilly[at]yahoo.com> wrote:
>> So from now on OCF_ERR_ARGS will be a "hard" error
>> instead of a "fatal" one.
>
> I copy that.
>
>>> Regarding the mnemonics of the return codes...
>>>
>>> From your notes above, it seems the status
>>> definitions appear to be more related to the
>>> restart and blocking effect the HA supervisor
>>> has on resources, than what the current mnemonics
>>> attempt to describe as situation.
>>>
>>> I am not sure it is such a good idea to attempt to
>>> combine a condition with the condition's handling
>>> action in the process of defining states that are
>>> to be reported to the supervisor.
>
>> Not sure I follow this
>
> I know it's a bit obscur so that's why I continued ...
>
>>> From what you provided as description, is it i.e.
>>> the supervisor's concern, and will the supervisor
>>> attempt anything to address the cause, or for that
>>> matter do anything different if it receives any of
>>> the following status: OCF_ERR_UNIMPLEMENTED,
>>> OCF_ERR_PERM, OCF_ERR_INSTALLED ?
>
> What I meant is: does heartbeat do anything different,
> whether it receives either 3 return codes directly above,
> (or 4, if you now include OCF_ERR_ARGS), considering
> that all of them cause, as you call it, a "hard" restart
> of the resource ?

No. Or at least, not yet.
(Until semi-recently, there was only "soft" recovery, so maybe in the
future we'll do more.)

>
> Or, in other words, are all 4 return codes necessary,
> if all we want in all 4 cases is to trigger a hard reset ?

Programmatically, not really.
But if I'm an admin trying to figure out why the resource won't run on
a given node anymore, I'm sure I'd appreciate them not being merged.

At any rate, these return codes are part of the OCF spec.
We're just following it and indicating what type of recovery we do for each.

>
> In which case, this suggests that whatever the cause for
> such a return code, like "permissions failure", or an
> "installation failure", is superfluous to specify in the
> mnemonic. So superfluous that it becomes misleading when
> it comes to explaining the effect such a return code is
> supposed to cause the supervisor.
>
> Unless the supervisor does anything special in any 4,
> cases above, the condition returned is understood better
> if only one return code describes it, and the mnemonic is
> better chosen, i.e. OCF_CRIT_LOCAL or OCF_CRIT_NODE or
> OCF_FATAL_LOCAL or OCF_FATAL_NODE.
>
> Eventually, OCF_ERR_CONFIGURED is understood better if
> it is renamed to OCF_FATAL_COMMON or OCF_FATAL_CLUSTER.
>
> And OCF_ERR_GENERIC, to plain old OCF_ERROR ...
>
> So, to summarize, the original mnemonics attempted to
> describe situations combined with the severity
> of a condition to handle, ERR for error, while ignoring the
> different effect such conditions have on the supervisor,
> whereas, in the proposed scheme, we drop the situational
> part in the mnemonic to focus on the severity and the
> scope of the effect, bringing along a better understanding
> of what the supervisor does and needs.
>
>
>
>


pica1dilly at yahoo

Jul 9, 2008, 11:05 PM

Post #11 of 12 (602 views)
Re: Re: Re: Difference between OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED ? [In reply to]

--- On Wed, 7/9/08, Andrew Beekhof <beekhof[at]gmail.com> wrote:

>> Or, in other words, are all 4 return codes necessary,
>> if all we want in all 4 cases is to trigger a hard reset ?
>
> programatically, not really.
> but if i'm an admin trying to figure out why the
> resource wont run on a given node anymore, i'm sure
> i'd appreciate them not being merged.

That I fully understand. But shouldn't these conditions be described independently of the condition status, e.g. through a condition code?

> at any rate, these return codes are part of the OCF spec.
> we're just following it and indicating what type of
> recovery we do for each.

I also understand that, and that is precisely why I was saying that mixing a condition with its severity or its handling is not a good idea, for the reasons I already gave.

Right now, the OCF return status code only uses 9 values out of the 256 the return status allows, or in other words 4 bits out of 8.

For an improved scheme, why not use the lower 4 bits to describe the condition as it is described today, with the exception of values 0 and 1, which are the generic success and error codes, and the higher 4 bits to describe the severity and handling?

One can also imagine interpreting the status with the current scheme if the high bits aren't used, and with the improved scheme if they are.
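
The proposal above could look roughly like the following Python sketch. It is purely illustrative and not part of the OCF spec: the severity values and the pack/unpack helpers are invented here, and only the split into a low and a high group of 4 bits comes from the proposal.

```python
# Sketch of the proposed encoding; the severity values are invented here.
SEVERITY_NONE = 0x0   # high bits unused: interpret with the current scheme
SEVERITY_SOFT = 0x1
SEVERITY_HARD = 0x2
SEVERITY_FATAL = 0x3

OCF_ERR_INSTALLED = 5
OCF_ERR_CONFIGURED = 6


def pack(condition, severity=SEVERITY_NONE):
    """Combine a condition code (0-15) and a severity (0-15) into one exit status."""
    if not (0 <= condition <= 0xF and 0 <= severity <= 0xF):
        raise ValueError("condition and severity must each fit in 4 bits")
    return (severity << 4) | condition


def unpack(status):
    """Split an 8-bit exit status into (condition, severity)."""
    return status & 0x0F, (status >> 4) & 0x0F
```

For example, pack(OCF_ERR_INSTALLED, SEVERITY_HARD) yields 0x25, and unpack(0x25) recovers (5, 2). A status such as plain 6 keeps the high bits at zero, so it would still be read with the current scheme.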






beekhof at gmail

Jul 9, 2008, 11:19 PM

Post #12 of 12 (603 views)
Re: Re: Re: Difference between OCF_ERR_CONFIGURED and OCF_ERR_INSTALLED ? [In reply to]

On Thu, Jul 10, 2008 at 08:05, Joe Bill <pica1dilly[at]yahoo.com> wrote:
>
> --- On Wed, 7/9/08, Andrew Beekhof <beekhof[at]gmail.com> wrote:
>
>>> Or, in other words, are all 4 return codes necessary,
>>> if all we want in all 4 cases is to trigger a hard reset ?
>>
>> programatically, not really.
>> but if i'm an admin trying to figure out why the
>> resource wont run on a given node anymore, i'm sure
>> i'd appreciate them not being merged.
>
> That I fully understand. But shouldn't these conditions be described independently from the condition status, through, i.e. a condition code ?

I don't understand.

The RA should tell us what happened. End of story.
What this documents is what the cluster will do based on what the RA told us.

Telling the cluster what you think it wants to hear always leads to
tragedy. Just tell the truth.

>
>> at any rate, these return codes are part of the OCF spec.
>> we're just following it and indicating what type of
>> recovery we do for each.
>
> I also understand that and that's why I was precisely saying mixing a condition with it's severity or it's handling was not a good idea for the reasons I already gave.
>
> Right now, the OCF return status code only uses 9 values out of 256 the return status allows, or in other words 4 bits out of 8.
>
> For an improved scheme, why not use the lower bits to describe the condition as they are described today, with the exception of values 0 and 1 which describe generic success and error codes, and the higher 4 bits to describe the severity and handling ?

Maybe, but I don't write the spec and personally I don't think the RA
should be telling the cluster what type of recovery to perform.
That's a policy decision that should be made by the Policy Engine.

As above, just tell us what happened and let the cluster decide what to do.

>
> One can also imagine, if the high bits aren't used, use the current scheme, and if the high bits are used using the improved scheme.
>
>
>
>
>