
Mailing List Archive: Linux-HA: Dev

"monitor_scripts" parameter for the VirtualDomain RA (was Re: [Linux-HA] ocf resource agent for KVM virtual machines)

 

 



florian.haas at linbit

Dec 2, 2008, 8:06 AM

Post #1 of 7 (1669 views)
"monitor_scripts" parameter for the VirtualDomain RA (was Re: [Linux-HA] ocf resource agent for KVM virtual machines)

[moving this discussion over to -dev]

> Hi,
>
> thanks. That is what I was looking for. Much better than my script.
>
> Is there a chance to add an option to that RA to link to a separate script
> additionally checking the state of the resource. Something like
> monitor_script in the xen RA? Should be quite easy with copy and paste
> from
> that RA.
>
> Thanks.

Having external monitor scripts available sounds like a good idea, however
I'm not quite happy with the current implementation in the Xen RA. I'd
propose the following logic:

- Any script listed in OCF_RESKEY_monitor_scripts must provide
OCF-compliant exit codes by itself (unlike the present implementation in
the Xen RA which just maps any nonzero exit code to $OCF_ERR_GENERIC).
- If OCF_RESKEY_monitor_scripts contains multiple entries, they are
iterated over (just like in the Xen RA).
- The first nonzero exit code encountered from a monitor script stops the
iteration (just like in the Xen RA), and its exit code propagates as the
return value, and hence exit code, of the monitor operation (unlike the
Xen RA).
- The external monitor operation must never time out by itself, it must
keep trying indefinitely until killed by the LRM.

The last one is due to an additional pitfall with respect to implementing
migrate_from/migrate_to (which eventually should work, of course). We can
set start_delay on the monitor op so we make sure we start monitoring only
after the domain has booted completely. So that is fine. But suppose we
have a brief interruption in machine availability during migration. We
can't temporarily disable the external monitor operation then, so we at
least need to make sure that it doesn't time out before the LRM says it
does. I realize no such interruption is supposed to happen during Xen live
migration, but I don't know about KVM, OpenVZ, lxc etc.
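For illustration, the iteration proposed in the list above could be sketched roughly as follows. This is a minimal sketch, not the patch itself: the function name run_monitor_scripts is hypothetical, and it assumes OCF_RESKEY_monitor_scripts holds a whitespace-separated list of executable paths.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): run each external monitor
# script in turn. OCF_RESKEY_monitor_scripts is assumed to be a
# whitespace-separated list of executable paths.
run_monitor_scripts() {
    for script in $OCF_RESKEY_monitor_scripts; do
        # Stop at the first failure and hand its exit code back unchanged,
        # so the script itself is responsible for OCF-compliant codes.
        "$script" || return $?
    done
    return 0
}
```

Under this variant a script exiting with 7 would make the whole monitor operation report 7, which is exactly the point under discussion.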

WDOT?

Cheers,
Florian

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


florian.haas at linbit

Dec 2, 2008, 8:09 AM

Post #2 of 7 (1602 views)
Re: "monitor_scripts" parameter for the VirtualDomain RA (was Re: [Linux-HA] ocf resource agent for KVM virtual machines) [In reply to]

> - The external monitor operation must never time out by itself, it must
> keep trying indefinitely until killed by the LRM.

Sorry, that one looks misleading on second read. Should of course be
"...it must keep trying indefinitely until either succeeding, or being
killed by the LRM."

Cheers,
Florian



lmb at suse

Dec 4, 2008, 7:06 AM

Post #3 of 7 (1572 views)
Re: "monitor_scripts" parameter for the VirtualDomain RA (was Re: [Linux-HA] ocf resource agent for KVM virtual machines) [In reply to]

On 2008-12-02T17:06:46, Florian Haas <florian.haas [at] linbit> wrote:

> - Any script listed in OCF_RESKEY_monitor_scripts must provide
> OCF-compliant exit codes by itself (unlike the present implementation in
> the Xen RA which just maps any nonzero exit code to $OCF_ERR_GENERIC).

Nope.

The "master" RA is responsible for determining whether or not the
instance is active. The external monitor scripts just get a chance to
fail it, but they can't suddenly claim it's not running or anything.

So the external scripts only need "true" or "false".
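A rough sketch of that division of labour, under stated assumptions: domain_is_running is a stand-in for whatever check the RA really performs, vm_monitor is a hypothetical name, and the OCF return codes are hard-coded here only so the fragment is self-contained (a real RA would take them from the OCF shell function library).

```shell
#!/bin/sh
# Standard OCF return code values, hard-coded for this sketch only.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Placeholder (assumption): the RA's own check that the domain is active.
domain_is_running() { return 0; }

# Hypothetical monitor function for the RA.
vm_monitor() {
    # Only the RA itself decides whether the instance is running.
    domain_is_running || return $OCF_NOT_RUNNING
    for script in $OCF_RESKEY_monitor_scripts; do
        # External scripts are treated as true/false: any nonzero exit
        # fails the resource, whatever the actual code was.
        "$script" || return $OCF_ERR_GENERIC
    done
    return $OCF_SUCCESS
}
```

With this shape, an external script cannot make the monitor report "not running"; the worst it can do is turn a running instance into a failed one.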

> - If OCF_RESKEY_monitor_scripts contains multiple entries, they are
> iterated over (just like in the Xen RA).
> - The first nonzero exit code encountered from a monitor script stops the
> iteration (just like in the Xen RA), and its exit code propagates as the
> return value, and hence exit code, of the monitor operation (unlike the
> Xen RA).
> - The external monitor operation must never time out by itself, it must
> keep trying indefinitely until killed by the LRM.

The last one is an implementation detail which is left for the external
script to handle.

> The last one is due to an additional pitfall with respect to implementing
> migrate_from/migrate_to (which eventually should work, of course). We can
> set start_delay on the monitor op so we make sure we start monitoring only
> after the domain has booted completely. So that is fine.

start_delay should never be needed. It was one of the biggest mistakes
to add it. I keep thinking about just making it a no-op; anything which
requires it points to a broken RA.

The resource must be fully operational after start (or migrate_from)
has completed. Monitor must immediately be OK.

> But suppose we have a brief interruption in machine availability
> during migration. We can't temporarily disable the external monitor
> operation then, so we at least need to make sure that it doesn't time
> out before the LRM says it does. I realize no such interruption is
> supposed to happen during Xen live migration, but I don't know about
> KVM, OpenVZ, lxc etc.

If that's what you think you need, have the migrate_from/start op loop
until "monitor" succeeds.

while ! monitor_function ; do sleep 1 ; done


Regards,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



florian.haas at linbit

Dec 4, 2008, 8:20 AM

Post #4 of 7 (1574 views)
Re: "monitor_scripts" parameter for the VirtualDomain RA (was Re: [Linux-HA] ocf resource agent for KVM virtual machines) [In reply to]

On 2008-12-04 16:06, Lars Marowsky-Bree wrote:
> On 2008-12-02T17:06:46, Florian Haas <florian.haas [at] linbit> wrote:
>
>> - Any script listed in OCF_RESKEY_monitor_scripts must provide
>> OCF-compliant exit codes by itself (unlike the present implementation in
>> the Xen RA which just maps any nonzero exit code to $OCF_ERR_GENERIC).
>
> Nope.
>
> The "master" RA is responsible for determining whether or not the
> instance is active. The external monitor scripts just get a chance to
> fail it, but they can't suddenly claim it's not running or anything.
>
> So the external scripts only need "true" or "false".

Fair enough.

>> - If OCF_RESKEY_monitor_scripts contains multiple entries, they are
>> iterated over (just like in the Xen RA).
>> - The first nonzero exit code encountered from a monitor script stops the
>> iteration (just like in the Xen RA), and its exit code propagates as the
>> return value, and hence exit code, of the monitor operation (unlike the
>> Xen RA).
>> - The external monitor operation must never time out by itself, it must
>> keep trying indefinitely until killed by the LRM.
>
> The last one is an implementation detail which is left for the external
> script to handle.

As is the one with the exit codes which you already rejected.

>> The last one is due to an additional pitfall with respect to implementing
>> migrate_from/migrate_to (which eventually should work, of course). We can
>> set start_delay on the monitor op so we make sure we start monitoring only
>> after the domain has booted completely. So that is fine.
>
> start_delay should never be needed. It was one of the biggest mistakes
> to add it. I keep thinking about just making it a no-op; anything which
> requires it points to a broken RA.
>
> The resource must be fully operational after start (or migrate_from)
> has completed. Monitor must immediately be OK.

What?

If I'm not mistaken, the purpose of an external monitor_script in
conjunction with a virtual domain would be to do something like ping it,
try to connect to its TCP port 22, connect to its TCP port 445 (for a
virtual Windows box), etc. Any such monitor script only has a chance to
succeed when the virtual domain is fully booted. The start operation
from the VirtualDomain RA (just like that from the Xen RA) returns
immediately after the virtualization management API has determined that
the virtual domain has successfully _started_ its boot process, not
completed it.
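To make the kind of check meant here concrete, an external monitor script might look roughly like this. It is only a sketch under assumptions: the function names and the default host vm1.example.com are made up, and it presumes an nc binary that accepts -z (connect without sending data) and -w (timeout in seconds).

```shell
#!/bin/sh
# Hypothetical external monitor script body: succeed only if the guest
# accepts a TCP connection. Host and port defaults are placeholders.

# port_open HOST PORT: true if a TCP connection can be opened.
# Assumes an nc that supports -z and -w.
port_open() {
    nc -z -w 2 "$1" "$2" >/dev/null 2>&1
}

# check_guest [HOST] [PORT]: defaults to the SSH port of a made-up host.
check_guest() {
    port_open "${1:-vm1.example.com}" "${2:-22}"
}
```

A real script would end with a call such as check_guest "$@" and exit with its status. Such a check can only succeed once the guest has finished booting and started the service in question.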

What would be your suggestion to determine, from Pacemaker's
perspective, that a virtual domain is fully booted?

>> But suppose we have a brief interruption in machine availability
>> during migration. We can't temporarily disable the external monitor
>> operation then, so we at least need to make sure that it doesn't time
>> out before the LRM says it does. I realize no such interruption is
>> supposed to happen during Xen live migration, but I don't know about
>> KVM, OpenVZ, lxc etc.
>
> If that's what you think you need, have the migrate_from/start op loop
> until "monitor" succeeds.
>
> while ! monitor_function ; do sleep 1 ; done

So what is your suggestion?

1. Augment the monitor operation with any external monitor_script and
block any start or migrate_from until monitor succeeds? In that case,
please educate me as to the purpose of monitor timeouts. Or are you
saying one would have to adjust start and migrate_from timeouts accordingly?

2. Ditch any external monitor_script functionality in the VirtualDomain
RA, as it's useless anyway? In that case, please let me know what it's
for in the Xen RA.

Cheers,
Florian


dejanmm at fastmail

Dec 5, 2008, 3:18 AM

Post #5 of 7 (1562 views)
Re: "monitor_scripts" parameter for the VirtualDomain RA (was Re: [Linux-HA] ocf resource agent for KVM virtual machines) [In reply to]

Hi,

On Thu, Dec 04, 2008 at 05:20:43PM +0100, Florian Haas wrote:
> On 2008-12-04 16:06, Lars Marowsky-Bree wrote:
> > On 2008-12-02T17:06:46, Florian Haas <florian.haas [at] linbit> wrote:
> >
> >> - Any script listed in OCF_RESKEY_monitor_scripts must provide
> >> OCF-compliant exit codes by itself (unlike the present implementation in
> >> the Xen RA which just maps any nonzero exit code to $OCF_ERR_GENERIC).
> >
> > Nope.
> >
> > The "master" RA is responsible for determining whether or not the
> > instance is active. The external monitor scripts just get a chance to
> > fail it, but they can't suddenly claim it's not running or anything.
> >
> > So the external scripts only need "true" or "false".
>
> Fair enough.
>
> >> - If OCF_RESKEY_monitor_scripts contains multiple entries, they are
> >> iterated over (just like in the Xen RA).
> >> - The first nonzero exit code encountered from a monitor script stops the
> >> iteration (just like in the Xen RA), and its exit code propagates as the
> >> return value, and hence exit code, of the monitor operation (unlike the
> >> Xen RA).
> >> - The external monitor operation must never time out by itself, it must
> >> keep trying indefinitely until killed by the LRM.
> >
> > The last one is an implementation detail which is left for the external
> > script to handle.
>
> As is the one with the exit codes which you already rejected.

It is up to the RA, of course, but the best practice is to let
the upper layers (i.e. lrmd) deal with timeouts. Simply because
that's the place where the user can control the timeouts. In
most cases, it is very hard for an RA to take into account all
possible configurations and, in particular, all possible loads.

> >> The last one is due to an additional pitfall with respect to implementing
> >> migrate_from/migrate_to (which eventually should work, of course). We can
> >> set start_delay on the monitor op so we make sure we start monitoring only
> >> after the domain has booted completely. So that is fine.
> >
> > start_delay should never be needed. It was one of the biggest mistakes
> > to add it. I keep thinking about just making it a no-op; anything which
> > requires it points to a broken RA.
> >
> > The resource must be fully operational after start (or migrate_from)
> > has completed. Monitor must immediately be OK.
>
> What?
>
> If I'm not mistaken, the purpose of an external monitor_script in
> conjunction with a virtual domain would be to do something like ping it,
> try to connect to its TCP port 22, connect to its TCP port 445 (for a
> virtual Windows box), etc. Any such monitor script only has a chance to
> succeed when the virtual domain is fully booted. The start operation
> from the VirtualDomain RA (just like that from the Xen RA) returns
> immediately after the virtualization management API has determined that
> the virtual domain has successfully _started_ its boot process, not
> completed it.
>
> What would be your suggestion to determine, from Pacemaker's
> perspective, that a virtual domain is fully booted?

This is not easy to answer in general. It depends on what the
VM should do, i.e. what kind of service it has to provide. I
agree with Lars that the start action should, once it has
finished, really mean that the resource is fully operational.
After all, there could be another resource waiting to start (i.e.
the order dependency) and this other resource may fail if the
previous one hasn't started. The simplest way for the start
operation to ensure this is to invoke monitor itself.

> >> But suppose we have a brief interruption in machine availability
> >> during migration. We can't temporarily disable the external monitor
> >> operation then, so we at least need to make sure that it doesn't time
> >> out before the LRM says it does. I realize no such interruption is
> >> supposed to happen during Xen live migration, but I don't know about
> >> KVM, OpenVZ, lxc etc.
> >
> > If that's what you think you need, have the migrate_from/start op loop
> > until "monitor" succeeds.
> >
> > while ! monitor_function ; do sleep 1 ; done
>
> So what is your suggestion?
>
> 1. Augment the monitor operation with any external monitor_script and
> block any start or migrate_from until monitor succeeds? In that case,
> please educate me as to the purpose of monitor timeouts. Or are you
> saying one would have to adjust start and migrate_from timeouts accordingly?

The start/migrate_from timeout should be generous for Xen and
such. The monitor timeout may/should be shorter, depending on
your service quality policy. If in doubt, use longer timeouts.

> 2. Ditch any external monitor_script functionality in the VirtualDomain
> RA, as it's useless anyway? In that case, please let me know what it's
> for in the Xen RA.

I guess that this has been implicitly answered above :)

Cheers,

Dejan

> Cheers,
> Florian


lmb at suse

Dec 5, 2008, 3:52 AM

Post #6 of 7 (1556 views)
Re: "monitor_scripts" parameter for the VirtualDomain RA (was Re: [Linux-HA] ocf resource agent for KVM virtual machines) [In reply to]

On 2008-12-04T17:20:43, Florian Haas <florian.haas [at] linbit> wrote:

> > The resource must be fully operational after start (or migrate_from)
> > has completed. Monitor must immediately be OK.
>
> What?
>
> If I'm not mistaken, the purpose of an external monitor_script in
> conjunction with a virtual domain would be to do something like ping it,
> try to connect to its TCP port 22, connect to its TCP port 445 (for a
> virtual Windows box), etc. Any such monitor script only has a chance to
> succeed when the virtual domain is fully booted. The start operation
> from the VirtualDomain RA (just like that from the Xen RA) returns
> immediately after the virtualization management API has determined that
> the virtual domain has successfully _started_ its boot process, not
> completed it.
>
> What would be your suggestion to determine, from Pacemaker's
> perspective, that a virtual domain is fully booted?

It's not pacemaker's job to determine that. The RA must wait and not
return until this state has been reached.

The ordering dependencies are one extremely good reason, but monitor ops
could also occur "out of the blue" if the user invokes a reprobe
manually or something.

The easiest way is to loop in start until monitor succeeds, if there is
any doubt that the start action has been achieved.
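As a sketch of what that could look like inside an RA (assumptions: trigger_domain_start and vm_monitor are placeholders for the agent's real functions, vm_start is a hypothetical name, and the OCF codes are hard-coded only to keep the fragment self-contained):

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Placeholders (assumptions) for the RA's real functions.
trigger_domain_start() { return 0; }   # ask the hypervisor to boot the domain
vm_monitor() { return 0; }             # the RA's monitor, external scripts included

# Hypothetical start function: return only once monitor reports success.
vm_start() {
    trigger_domain_start || return $OCF_ERR_GENERIC
    # Deliberately no timeout of its own: the loop runs until monitor
    # succeeds or the LRM kills the operation when the configured start
    # timeout expires.
    while ! vm_monitor; do
        sleep 1
    done
    return $OCF_SUCCESS
}
```

The consequence is that the start (and migrate_from) timeout configured in the cluster has to cover the guest's full boot time, not just the hypervisor call.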


> > while ! monitor_function ; do sleep 1 ; done
> So what is your suggestion?

Uhm, I think the above line is actually valid shell code ;-)

> 1. Augment the monitor operation with any external monitor_script and
> block any start or migrate_from until monitor succeeds? In that case,
> please educate me as to the purpose of monitor timeouts. Or are you
> saying one would have to adjust start and migrate_from timeouts accordingly?

Well, sure. start/migrate_from must cover the full time until the
resource has reached the requested state. Returning earlier is not
allowed, or rather, possibly will cause subtle errors somewhere.

Also consider the UI impact. The GUI would show the resource as "green"
and no longer in transition; still the admin would get a connection
refused; not good.

start means "start the resource and return when it is started, or some
error has occurred." It does not mean "trigger the start and return".

Sorry for being pedantic, but it's quite important to get the semantics
right - they are not very complicated, but observing them really makes
the cluster more dependable.


Regards,
Lars




florian.haas at linbit

Dec 5, 2008, 4:58 AM

Post #7 of 7 (1561 views)
Re: "monitor_scripts" parameter for the VirtualDomain RA (was Re: [Linux-HA] ocf resource agent for KVM virtual machines) [In reply to]

Lars, Dejan,

On 12/05/2008 12:52 PM, Lars Marowsky-Bree wrote:
> On 2008-12-04T17:20:43, Florian Haas <florian.haas [at] linbit> wrote:
>
>>> The resource must be fully operational after start (or migrate_from)
>>> has completed. Monitor must immediately be OK.
>> What?
>>
>> If I'm not mistaken, the purpose of an external monitor_script in
>> conjunction with a virtual domain would be to do something like ping it,
>> try to connect to its TCP port 22, connect to its TCP port 445 (for a
>> virtual Windows box), etc. Any such monitor script only has a chance to
>> succeed when the virtual domain is fully booted. The start operation
>> from the VirtualDomain RA (just like that from the Xen RA) returns
>> immediately after the virtualization management API has determined that
>> the virtual domain has successfully _started_ its boot process, not
>> completed it.
>>
>> What would be your suggestion to determine, from Pacemaker's
>> perspective, that a virtual domain is fully booted?
>
> It's not pacemaker's job to determine that. The RA must wait and not
> return until this state has been reached.

>> 1. Augment the monitor operation with any external monitor_script and
>> block any start or migrate_from until monitor succeeds? In that case,
>> please educate me as to the purpose of monitor timeouts. Or are you
>> saying one would have to adjust start and migrate_from timeouts accordingly?
>
> Well, sure. start/migrate_from must cover the full time until the
> resource has reached the requested state. Returning earlier is not
> allowed, or rather, possibly will cause subtle errors somewhere.

> Also consider the UI impact. The GUI would show the resource as "green"
> and no longer in transition; still the admin would get a connection
> refused; not good.
>
> start means "start the resource and return when it is started, or some
> error has occurred." It does not mean "trigger the start and return".

See, I figured that the RA had done its job when the VMM/hypervisor/etc.
reported that the virtual domain successfully initiated its boot sequence.

My bad; I think I catch the drift now. Updated patch to go into Bugzilla
momentarily.

Cheers,
Florian

--
: Florian G. Haas
: LINBIT Information Technologies GmbH
: Vivenotgasse 48, A-1120 Vienna, Austria

When replying, there is no need to CC my personal address.
I monitor the list on a daily basis. Thank you.

LINBIT® and DRBD® are registered trademarks of LINBIT.
