
Mailing List Archive: Linux-HA: Dev

Patch: VirtualDomain - fix probe if config is not on shared storage


dominik.klein at gmail

Jun 24, 2011, 3:33 AM

Post #1 of 13 (605 views)
Patch: VirtualDomain - fix probe if config is not on shared storage

This fixes the issue described yesterday.

Comments?

Regards
Dominik
Attachments: VirtualDomain.patch (1.59 KB)
  VirtualDomain.patch2 (0.97 KB)


dejan at suse

Jun 24, 2011, 5:14 AM

Post #2 of 13 (580 views)
Re: Patch: VirtualDomain - fix probe if config is not on shared storage [In reply to]

Hi Dominik,

On Fri, Jun 24, 2011 at 12:33:42PM +0200, Dominik Klein wrote:
> This fixes the issue described yesterday.
>
> Comments?

It's not really necessary to introduce a new parameter. See for
instance oracle or apache RA how they manage that. A quote from
oracle:

LSB_STATUS_STOPPED=3
testoraenv
rc=$?
if [ $rc -ne 0 ]; then
    ocf_log info "Oracle environment for SID $ORACLE_SID does not exist"
    case "$1" in
        stop) exit $OCF_SUCCESS;;
        monitor) exit $OCF_NOT_RUNNING;;
        status) exit $LSB_STATUS_STOPPED;;
        *)
            ocf_log err "Oracle environment for SID $ORACLE_SID broken"
            exit $rc
            ;;
    esac
fi

It should probably be changed a bit for the monitor action, something
like:

monitor) ocf_is_probe && exit $OCF_NOT_RUNNING || exit $OCF_ERR_GENERIC;;

Though as it is, it'll work correctly.
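
A minimal sketch of that per-action mapping, written as a standalone function so it can be run outside a cluster. The helper name missing_env_rc and the literal "probe" argument are illustrative assumptions; a real agent would take the exit codes from ocf-shellfuncs and call ocf_is_probe instead, and the oracle RA exits with the original $rc in the default branch rather than a fixed code.

```shell
#!/bin/sh
# OCF exit codes; a real agent gets these from ocf-shellfuncs.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical helper: prints the exit code an agent could use when its
# environment or config is missing.
#   $1 = action, $2 = "probe" if the monitor call is a probe
missing_env_rc() {
    case "$1" in
        stop)
            # Nothing can be running, so stop has nothing to do.
            echo "$OCF_SUCCESS" ;;
        monitor)
            if [ "$2" = "probe" ]; then
                echo "$OCF_NOT_RUNNING"
            else
                # A recurring monitor implies the resource was started
                # here, so a missing environment is a real failure.
                echo "$OCF_ERR_GENERIC"
            fi
            ;;
        *)
            # Simplification: the oracle RA exits with the original $rc.
            echo "$OCF_ERR_GENERIC" ;;
    esac
}

missing_env_rc stop
missing_env_rc monitor probe
missing_env_rc monitor
```

The point of the distinction is that only a probe may treat "cannot tell" as "not running"; a regular monitor on a node where the resource should be active reports an error.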

Cheers,

Dejan

> Regards
> Dominik

> exporting patch:
> # HG changeset patch
> # User Dominik Klein <dominik.klein [at] gmail>
> # Date 1308909599 -7200
> # Node ID 2b1615aaca2c90f2f4ab93eb443e5902906fb28a
> # Parent 7a11934b142d1daf42a04fbaa0391a3ac47cee4c
> RA VirtualDomain: Fix probe if config is not on shared storage
>
> diff -r 7a11934b142d -r 2b1615aaca2c heartbeat/VirtualDomain
> --- a/heartbeat/VirtualDomain Fri Feb 25 12:23:17 2011 +0100
> +++ b/heartbeat/VirtualDomain Fri Jun 24 11:59:59 2011 +0200
> @@ -19,9 +19,11 @@
> # Defaults
> OCF_RESKEY_force_stop_default=0
> OCF_RESKEY_hypervisor_default="$(virsh --quiet uri)"
> +OCF_RESKEY_config_on_shared_storage_default=1
>
> : ${OCF_RESKEY_force_stop=${OCF_RESKEY_force_stop_default}}
> : ${OCF_RESKEY_hypervisor=${OCF_RESKEY_hypervisor_default}}
> +: ${OCF_RESKEY_config_on_shared_storage=${OCF_RESKEY_config_on_shared_storage_default}}
> #######################################################################
>
> ## I'd very much suggest to make this RA use bash,
> @@ -421,8 +423,8 @@
> # check if we can read the config file (otherwise we're unable to
> # deduce $DOMAIN_NAME from it, see below)
> if [ ! -r $OCF_RESKEY_config ]; then
> - if ocf_is_probe; then
> - ocf_log info "Configuration file $OCF_RESKEY_config not readable during probe."
> + if ocf_is_probe && ocf_is_true $OCF_RESKEY_config_on_shared_storage; then
> + ocf_log info "Configuration file $OCF_RESKEY_config not readable during probe. Assuming it is on shared storage and therefore reporting VM is not running."
> else
> ocf_log error "Configuration file $OCF_RESKEY_config does not exist or is not readable."
> return $OCF_ERR_INSTALLED

> exporting patch:
> # HG changeset patch
> # User Dominik Klein <dominik.klein [at] gmail>
> # Date 1308911272 -7200
> # Node ID 312adf2449eb59dcc41686626b1726428d13227b
> # Parent 2b1615aaca2c90f2f4ab93eb443e5902906fb28a
> RA VirtualDomain: Add metadata for the new parameter
>
> diff -r 2b1615aaca2c -r 312adf2449eb heartbeat/VirtualDomain
> --- a/heartbeat/VirtualDomain Fri Jun 24 11:59:59 2011 +0200
> +++ b/heartbeat/VirtualDomain Fri Jun 24 12:27:52 2011 +0200
> @@ -119,6 +119,16 @@
> <content type="string" default="" />
> </parameter>
>
> +<parameter name="config_on_shared_storage" unique="0" required="0">
> +<longdesc lang="en">
> +If your VMs configuration file is _not_ on shared storage, so that the config
> +file not being in place during a probe means that the VM is not installed/runnable
> +on that node, set this to 0.
> +</longdesc>
> +<shortdesc lang="en">Set to 0 if your VMs config file is not on shared storage</shortdesc>
> +<content type="boolean" default="1" />
> +</parameter>
> +
> </parameters>
>
> <actions>


_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


dejan at suse

Jun 24, 2011, 6:34 AM

Post #3 of 13 (578 views)
Re: Patch: VirtualDomain - fix probe if config is not on shared storage [In reply to]

Hi again,

Can you please try this (very minimal) patch?
Florian, please verify once you find time.

Cheers,

Dejan

Attachments: VirtualDomain.patch (0.65 KB)


dominik.klein at googlemail

Jun 24, 2011, 6:50 AM

Post #4 of 13 (578 views)
Re: Patch: VirtualDomain - fix probe if config is not on shared storage [In reply to]

Hi Dejan,

this way, the cluster never learns that it can't start a resource on
that node.

I don't consider this a solution.

Regards
Dominik


dominik.klein at googlemail

Jun 26, 2011, 10:55 PM

Post #5 of 13 (568 views)
Re: Patch: VirtualDomain - fix probe if config is not on shared storage [In reply to]

I'm not sure my fix is correct.

According to

https://github.com/ClusterLabs/resource-agents/commit/96ff8e9ad3d4beca7e063beef156f3b838a798e1#heartbeat/VirtualDomain

this is a regression which was introduced in April '11.

So the fix should be the other way around: introduce a parameter that
lets the user declare that the config file _is_ on shared storage and, if
this is false or unset, return to the old behaviour of returning
ERR_INSTALLED.
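
A rough sketch of that inverted logic, assuming the parameter keeps the name config_on_shared_storage from the first patch but defaults to 0. The function name probe_rc_without_config is hypothetical, and the simple case match stands in for ocf_is_true from ocf-shellfuncs.

```shell
#!/bin/sh
# OCF exit codes; a real agent gets these from ocf-shellfuncs.
OCF_ERR_INSTALLED=5
OCF_NOT_RUNNING=7

# Assumed default: unset means "config is NOT on shared storage".
: "${OCF_RESKEY_config_on_shared_storage:=0}"

# Hypothetical helper: prints the probe result when the config file
# cannot be read.
probe_rc_without_config() {
    case "$OCF_RESKEY_config_on_shared_storage" in
        1|true|yes|on)
            # Config may simply not be mounted yet on this node.
            echo "$OCF_NOT_RUNNING" ;;
        *)
            # Old behaviour: the VM is not installed on this node.
            echo "$OCF_ERR_INSTALLED" ;;
    esac
}

probe_rc_without_config
```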

Regards
Dominik


dejan at suse

Jun 27, 2011, 2:09 AM

Post #6 of 13 (569 views)
Re: Patch: VirtualDomain - fix probe if config is not on shared storage [In reply to]

Hi Dominik,

On Fri, Jun 24, 2011 at 03:50:40PM +0200, Dominik Klein wrote:
> Hi Dejan,
>
> this way, the cluster never learns that it can't start a resource on
> that node.

This resource depends on shared storage. So, the cluster won't
try to start it unless the shared storage resource is already
running. This is something that needs to be specified using
either a negative preference location constraint or an asymmetrical
cluster. There's no need for yet another mechanism (the extra
parameter) built into the resource agent. It's really
overkill.
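
For illustration, the two cluster-level options could look roughly like this in crm shell syntax. The script only prints the configuration lines; the resource name vm1, the node name node2 and the constraint id are made-up examples, and the lines would be entered through "crm configure" on a real cluster.

```shell
#!/bin/sh
# Hypothetical names, for illustration only.
rsc="vm1"
node="node2"

# Option 1: a negative location constraint that bans the resource from
# a node where it can never run.
ban_constraint() {   # $1 = resource, $2 = node
    printf 'location loc-%s-not-on-%s %s -inf: %s\n' "$1" "$2" "$1" "$2"
}

# Option 2: an asymmetrical (opt-in) cluster, where a resource runs only
# on nodes that a location constraint explicitly allows.
optin_property() {
    printf 'property symmetric-cluster=false\n'
}

ban_constraint "$rsc" "$node"
optin_property
```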

Makes sense?

Cheers,

Dejan

> I don't consider this a solution.
>
> Regards
> Dominik


dominik.klein at googlemail

Jun 27, 2011, 3:00 AM

Post #7 of 13 (566 views)
Re: Patch: VirtualDomain - fix probe if config is not on shared storage [In reply to]

On 06/27/2011 11:09 AM, Dejan Muhamedagic wrote:
> Hi Dominik,
>
> On Fri, Jun 24, 2011 at 03:50:40PM +0200, Dominik Klein wrote:
>> Hi Dejan,
>>
>> this way, the cluster never learns that it can't start a resource on
>> that node.
>
> This resource depends on shared storage. So, the cluster won't
> try to start it unless the shared storage resource is already
> running. This is something that needs to be specified using
> either a negative preference location constraint or asymmetrical
> cluster. There's no need for yet another mechanism (the extra
> parameter) built into the resource agent. It's really an
> overkill.

As requested on IRC, I describe my setup and explain why I think this is
a regression.

2-node cluster with a bunch of DRBD devices.

Each /dev/drbdXX is used as a block device of a VM. The VMs'
configuration files are not on shared storage but have to be copied
manually.

So it happened that during configuration of a VM, the admin forgot to
copy the configuration file to node2. The machine's DRBD was configured
though. So the cluster decided to promote the VM's DRBD on node2 and then
start the master-colocated and ordered VM.

With the agent before the mentioned patch, during probe of a newly
configured resource, the cluster would have learned that the VM is not
available on one of the nodes (ERR_INSTALLED), so it would never start
the resource there.

Now it sees NOT_RUNNING on all nodes during probe and may decide to
start the VM on a node where it cannot run. That, with the current
version of the agent, leads to a failed start, a failed stop during
recovery and therefore: an unnecessary stonith operation.

With Dejan's patch, it would still see NOT_RUNNING during probe, but at
least the stop would succeed. So the difference to the old version would
be that we had an unnecessary failed start on the node that does not
have the VM but it would not harm the node and I'd be fine with applying
that patch.
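
The difference between the two variants can be modelled with a small sketch. It assumes that the unpatched agent returns ERR_INSTALLED from stop when the config is missing (the thread only says the stop fails) and it reduces the cluster's reaction to a single rule: a failed stop gets the node fenced. Function names are hypothetical.

```shell
#!/bin/sh
# OCF exit codes; a real agent gets these from ocf-shellfuncs.
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5

# Exit code of "stop" when the config file is missing.
#   $1 = "patched" for the agent with Dejan's patch, anything else = old
stop_rc_without_config() {
    case "$1" in
        patched) echo "$OCF_SUCCESS" ;;
        *)       echo "$OCF_ERR_INSTALLED" ;;   # assumed old behaviour
    esac
}

# Simplified model of what follows a failed start on that node.
recovery_after_failed_start() {   # $1 = agent variant
    if [ "$(stop_rc_without_config "$1")" -eq "$OCF_SUCCESS" ]; then
        echo "stop ok: resource can be started elsewhere"
    else
        echo "stop failed: node is fenced"
    fi
}

recovery_after_failed_start old
recovery_after_failed_start patched
```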

There's a case though that might keep the VM from running (for some
amount of time): if start-failure-is-fatal is false. Then we would
have $migration-threshold "failed start/succeeded stop" iterations while
the VM's service would not be running.

Of course I do realize that the initial fault is a human one. But the
cluster used to protect from this, does not any more, and that's why I
think this is a regression.

I think the correct way to fix this is to still return ERR_INSTALLED
during probe unless the cluster admin configures that the VM's config is
on shared storage. Finding out about resource states on different nodes
is what the probe was designed to do, was it not? And we work around
that in this resource agent just to support certain setups.

Regards
Dominik


dejan at suse

Jun 27, 2011, 3:26 AM

Post #8 of 13 (567 views)
Re: Patch: VirtualDomain - fix probe if config is not on shared storage [In reply to]

On Mon, Jun 27, 2011 at 12:00:28PM +0200, Dominik Klein wrote:
> On 06/27/2011 11:09 AM, Dejan Muhamedagic wrote:
> > Hi Dominik,
> >
> > On Fri, Jun 24, 2011 at 03:50:40PM +0200, Dominik Klein wrote:
> >> Hi Dejan,
> >>
> >> this way, the cluster never learns that it can't start a resource on
> >> that node.
> >
> > This resource depends on shared storage. So, the cluster won't
> > try to start it unless the shared storage resource is already
> > running. This is something that needs to be specified using
> > either a negative preference location constraint or asymmetrical
> > cluster. There's no need for yet another mechanism (the extra
> > parameter) built into the resource agent. It's really an
> > overkill.
>
> As requested on IRC, I describe my setup and explain why I think this is
> a regression.
>
> 2 node cluster with a bunch of drbd devices.
>
> Each /dev/drbdXX is used as a block device of a VM. The VMs
> configuration files are not on shared storage but have to be copied
> manually.
>
> So it happened that during configuration of a VM, the admin forgot to
> copy the configuration file to node2. The machine's DRBD was configured
> though. So the cluster decided to promote the VMs DRBD on node2 and then
> start the master-colocated and ordered VM.
>
> With the agent before the mentioned patch, during probe of a newly
> configured resource, the cluster would have learned that the VM is not
> available on one of the nodes (ERR_INSTALLED), so it would never start
> the resource there.

This is exactly the problem with shared storage setups, where
such an exit code can prevent a resource from ever being started on
a node which is otherwise perfectly capable of running that
resource.

> Now it sees NOT_RUNNING on all nodes during probe and may decide to
> start the VM on a node where it cannot run.

But really, if a resource can _never_ run on a node, then there
should be a negative location constraint or the cluster should be
set up as asymmetrical. Now, I understand that in your case, it is
actually the administrator's fault.

> That, with the current
> version of the agent, leads to a failed start, a failed stop during
> recovery and therefore: an unnecessary stonith operation.
>
> With Dejan's patch, it would still see NOT_RUNNING during probe, but at
> least the stop would succeed. So the difference to the old version would
> be that we had an unnecessary failed start on the node that does not
> have the VM but it would not harm the node and I'd be fine with applying
> that patch.
>
> There's a case though that might stop the vm from running (for an amount
> of time). And that is if start-failure-is-fatal is false. Then we would
> have $migration-threshold "failed start/succeeded stop" iterations while
> the VMs service would not be running.
>
> Of course I do realize that the initial fault is a human one. but the
> cluster used to protect from this, does not any more and that's why I
> think this is a regression.
>
> I think the correct way to fix this is to still return ERR_INSTALLED
> during probe unless the cluster admin configures that the VMs config is
> on shared storage. Finding out about resource states on different nodes
> is what the probe was designed to do, was it not? And we work around
> that in this resource agent just to support certain setups.

This particular setup is a special case of shared storage. The
images are on shared storage, but the configurations are local. I
think that you really need to make sure that the configurations
are present where they need to be. Best would be that the
configuration is kept on the storage along with the corresponding
VM image. Since you're using a raw device as image, that's
obviously not possible. Otherwise, use csync2 or similar to keep
files in sync.

Cheers,

Dejan

> Regards
> Dominik


dominik.klein at googlemail

Jun 27, 2011, 3:40 AM

Post #9 of 13 (565 views)
Re: Patch: VirtualDomain - fix probe if config is not on shared storage [In reply to]

>> With the agent before the mentioned patch, during probe of a newly
>> configured resource, the cluster would have learned that the VM is not
>> available on one of the nodes (ERR_INSTALLED), so it would never start
>> the resource there.
>
> This is exactly the problem with shared storage setups, where
> such an exit code can prevent resource from ever being started on
> a node which is otherwise perfectly capable of running that
> resource.

I see and understand that that, too, is a valid setup and concern.

> But really, if a resource can _never_ run on a node, then there
> should be a negative location constraint or the cluster should be
> setup as asymmetrical.

There did not have to be a negative location constraint up to now,
because the cluster took care of that.

> Now, I understand that in your case, it is
> actually due to the administrator's fault.

Yes, that's how I noticed the problem with the agent.

> This particular setup is a special case of shared storage. The
> images are on shared storage, but the configurations are local. I
> think that you really need to make sure that the configurations
> are present where they need to be. Best would be that the
> configuration is kept on the storage along with the corresponding
> VM image. Since you're using a raw device as image, that's
> obviously not possible. Otherwise, use csync2 or similar to keep
> files in sync.

Actually, this is a wanted setup. It happened that VM configs were
changed in ways that led to a VM not being startable any more. For that
case, they wanted to be able to start the old config on the other node.

I agree that the cases that led me to finding this change in the agent
are cases that could have been solved with better configuration and that
your suggestions make sense. Still, I feel that the change introduces a
new way of doing things that might affect running and working setups in
unintended ways. I refuse to believe that I am the only one doing HA VMs
like this (although of course I might be wrong on that, too ...).

Regards
Dominik


lmb at suse

Jun 27, 2011, 2:01 PM

Post #10 of 13 (562 views)
Re: Patch: VirtualDomain - fix probe if config is not on shared storage [In reply to]

On 2011-06-27T12:00:28, Dominik Klein <dominik.klein [at] googlemail> wrote:

> Now it sees NOT_RUNNING on all nodes during probe and may decide to
> start the VM on a node where it cannot run. That, with the current
> version of the agent, leads to a failed start, a failed stop during
> recovery and therefore: an unnecessary stonith operation.

Yes, the 'stop' in the old agent is/was broken.

The probe, alas, can't explicitly check all prerequisites, since they
may not be online yet. With 20/20 hindsight, it perhaps was a mistake to
use a "monitor" as the "probe". It seemed an improvement at the time,
but nowadays I'm no longer so sure; it requires the "ocf_is_probe"
special case that I'm not so fond of and leads to discussions like this.
;-)

Dejan is correct: unless the "monitor" op during probe has more evidence
than a missing file, it probably shouldn't return "ERR_INSTALLED" (nor
_CONFIGURED); that'll block the resource from the node completely. It
_is_ a valid return code of course, but inappropriate for bits that
could be on shared storage and simply missing.

Actually, all we _must_ know for "monitor_0" is if the resource is
active in any capacity. Any further requirements probably are best
checked at "start" time.
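
A sketch of that split, assuming a made-up agent whose "active" check is just the existence of a state file (VirtualDomain would have to ask libvirt instead). The function names and paths are hypothetical; the point is only that the probe answers "is it active?" while the missing-prerequisite error is raised at start time.

```shell
#!/bin/sh
# OCF exit codes; a real agent gets these from ocf-shellfuncs.
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5
OCF_NOT_RUNNING=7

# Hypothetical stand-in for "is the resource active in any capacity?".
resource_is_active() {   # $1 = state file
    [ -e "$1" ]
}

# Probe (monitor_0): report only whether the resource is active.
probe() {   # $1 = state file
    if resource_is_active "$1"; then
        echo "$OCF_SUCCESS"
    else
        echo "$OCF_NOT_RUNNING"
    fi
}

# Start: this is where a missing prerequisite becomes a hard error.
start_precheck() {   # $1 = config file
    if [ -r "$1" ]; then
        echo "$OCF_SUCCESS"
    else
        echo "$OCF_ERR_INSTALLED"
    fi
}

probe /nonexistent/vm.state
start_precheck /nonexistent/vm.xml
```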


> I think the correct way to fix this is to still return ERR_INSTALLED
> during probe unless the cluster admin configures that the VMs config is
> on shared storage. Finding out about resource states on different nodes
> is what the probe was designed to do, was it not? And we work around
> that in this resource agent just to support certain setups.

Yeah, and that is a pretty depressing result. But I definitely dislike
the special switch for telling the cluster that the config is on shared
storage like that. That would be a scenario that no admin would test.

So defining a specific "probe" operation seems to be a good idea
going forward; it can, in fact, do exactly the same thing as a
"monitor" (if it has enough definite evidence), but it would be more
obvious that the emphasis is different. And hopefully be less
confusing.


Regards,
Lars

--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



dejan at suse

Jun 28, 2011, 2:03 AM

Post #11 of 13 (562 views)
Re: Patch: VirtualDomain - fix probe if config is not on shared storage [In reply to]

On Mon, Jun 27, 2011 at 12:40:19PM +0200, Dominik Klein wrote:
> >> With the agent before the mentioned patch, during probe of a newly
> >> configured resource, the cluster would have learned that the VM is not
> >> available on one of the nodes (ERR_INSTALLED), so it would never start
> >> the resource there.
> >
> > This is exactly the problem with shared storage setups, where
> > such an exit code can prevent resource from ever being started on
> > a node which is otherwise perfectly capable of running that
> > resource.
>
> I see and understand that that, too, is a valid setup and concern.

In this case this resource wouldn't function at all. The worst
would be that the config is available on one node and the
resource would be started there, but there'll be no failover,
because all other nodes would report ERR_INSTALLED.

> > But really, if a resource can _never_ run on a node, then there
> > should be a negative location constraint or the cluster should be
> > setup as asymmetrical.
>
> There did not have to be a negative location constraint up to now,
> because the cluster took care of that.

Only because it didn't work correctly.

> > Now, I understand that in your case, it is
> > actually due to the administrator's fault.
>
> Yes, that's how I noticed the problem with the agent.
>
> > This particular setup is a special case of shared storage. The
> > images are on shared storage, but the configurations are local. I
> > think that you really need to make sure that the configurations
> > are present where they need to be. Best would be that the
> > configuration is kept on the storage along with the corresponding
> > VM image. Since you're using a raw device as image, that's
> > obviously not possible. Otherwise, use csync2 or similar to keep
> > files in sync.
>
> Actually, this is a wanted setup. It happened that VMs configs were
> changed in ways that lead to a VM not being startable any more. For that
> case, they wanted to be able to start the old config on the other node.

Wow! So, they can have different configurations at different
nodes.

> I agree that the cases that lead me to finding this change in the agent
> are cases that could have been solved with better configuration and that
> your suggestions make sense. Still, I feel that the change introduces a
> new way of doing things that might affect running and working setups in
> unintended ways. I refuse to believe that I am the only one doing HA VMs
> like this (although of course I might be wrong on that, too ...).

The only issue you may have with this cluster is if the
administrator erroneously removes a config on some node, right?
And that then some time afterwards the cluster does a probe on
that node. And then again the cluster wants to fail over this VM
to that node. And that at this point in time no other node can
run this VM and that it is going to repeatedly try to start and
fail. And that "failed start is fatal" isn't configured. No doubt
that this could happen, but what's the probability? And, finally,
that doesn't look like a well maintained cluster.

Thanks,

Dejan

> Regards
> Dominik


dominik.klein at googlemail

Jun 28, 2011, 10:46 PM

Post #12 of 13 (556 views)
Re: Patch: VirtualDomain - fix probe if config is not on shared storage [In reply to]

>> There did not have to be a negative location constraint up to now,
>> because the cluster took care of that.
>
> Only because it didn't work correctly.

Okay.

>> Actually, this is a wanted setup. It happened that VMs configs were
>> changed in ways that lead to a VM not being startable any more. For that
>> case, they wanted to be able to start the old config on the other node.

Please, notice _they_ vs. _me_ here :)

> Wow! So, they can have different configurations at different
> nodes.

Agreed, wow!

> The only issue you may have with this cluster is if the
> administrator erroneously removes a config on some node, right?
> And that then some time afterwards the cluster does a probe on
> that node. And then again the cluster wants to fail over this VM
> to that node. And that at this point in time no other node can
> run this VM and that it is going to repeatedly try to start and
> fail. And that "failed start is fatal" isn't configured. No doubt
> that this could happen, but what's the probability? And, finally,
> that doesn't look like a well maintained cluster.

I guess this is something _they_ have to live with then.

At first glance, I honestly thought this change in the agent introduced
a regression that would hit more than just this configuration, but
you made me realize that it does not, and that it in fact improves the
agent for sane setups.

My vote goes for your patch, i.e. "stop && no config = return SUCCESS"

Thanks
Dominik


dejan at suse

Jun 29, 2011, 5:41 AM

Post #13 of 13 (553 views)
Re: Patch: VirtualDomain - fix probe if config is not on shared storage [In reply to]

Hi Dominik,

On Wed, Jun 29, 2011 at 07:46:57AM +0200, Dominik Klein wrote:
> >> There did not have to be a negative location constraint up to now,
> >> because the cluster took care of that.
> >
> > Only because it didn't work correctly.
>
> Okay.
>
> >> Actually, this is a wanted setup. It happened that VMs configs were
> >> changed in ways that lead to a VM not being startable any more. For that
> >> case, they wanted to be able to start the old config on the other node.
>
> Please, notice _they_ vs. _me_ here :)

Of course, I had no doubt about that.

> > Wow! So, they can have different configurations at different
> > nodes.
>
> Agreed, wow!

Yes, not a very good practice in clusters. Can lead to somewhat
surprising effects. Depending on the perspective, naturally.

> > The only issue you may have with this cluster is if the
> > administrator erroneously removes a config on some node, right?
> > And that then some time afterwards the cluster does a probe on
> > that node. And then again the cluster wants to fail over this VM
> > to that node. And that at this point in time no other node can
> > run this VM and that it is going to repeatedly try to start and
> > fail. And that "failed start is fatal" isn't configured. No doubt
> > that this could happen, but what's the probability? And, finally,
> > that doesn't look like a well maintained cluster.
>
> I guess this is something _they_ have to live with then.
>
> At first glance, I honestly thought this was a change in the agent that
> introduced a regression that not only this configuration would hit, but
> you made me realize that it does not, but that it does improve the agent
> for sane setups.
>
> My vote goes for your patch, ie "stop && no config = return SUCCESS"

Great! Looks like we can ship 3.9.2 now.

Cheers,

Dejan

> Thanks
> Dominik
