
dejanmm at fastmail
Dec 5, 2008, 3:18 AM
Post #5 of 7
Re: "monitor_scripts" parameter for the VirtualDomain RA (was Re: [Linux-HA] ocf resource agent for KVM virtual machines)
Hi,

On Thu, Dec 04, 2008 at 05:20:43PM +0100, Florian Haas wrote:
> On 2008-12-04 16:06, Lars Marowsky-Bree wrote:
> > On 2008-12-02T17:06:46, Florian Haas <florian.haas [at] linbit> wrote:
> >
> >> - Any script listed in OCF_RESKEY_monitor_scripts must provide
> >> OCF-compliant exit codes by itself (unlike the present implementation in
> >> the Xen RA which just maps any nonzero exit code to $OCF_ERR_GENERIC).
> >
> > Nope.
> >
> > The "master" RA is responsible for determining whether or not the
> > instance is active. The external monitor scripts just get a chance to
> > fail it, but they can't suddenly claim its not running or anything.
> >
> > So the external scripts only need "true" or "false".
>
> Fair enough.
>
> >> - If OCF_RESKEY_monitor_scripts contains multiple entries, they are
> >> iterated over (just like in the Xen RA).
> >> - The first nonzero exit code encountered from a monitor script stops the
> >> iteration (just like in the Xen RA), and its exit code propagates as the
> >> return value, and hence exit code, of the monitor operation (unlike the
> >> Xen RA).
> >> - The external monitor operation must never time out by itself, it must
> >> keep trying indefinitely until killed by the LRM.
> >
> > The last one is an implementation detail which is left for the external
> > script to handle.
>
> As is the one with the exit codes which you already rejected.

It is up to the RA, of course, but the best practice is to let the
upper layers (i.e. lrmd) deal with timeouts, simply because that's
the place where the user can control the timeouts. In most cases, it
is very hard for an RA to take into account all possible
configurations and, in particular, all possible loads.

> >> The last one is due to an additional pitfall with respect to implementing
> >> migrate_from/migrate_to (which eventually should work, of course). We can
> >> set start_delay on the monitor op so we make sure we start monitoring only
> >> after the domain has booted completely. So that is fine.
> >
> > start_delay should never be needed. It was one of the biggest mistakes
> > to add it. I keep thinking about just making it a no-op; anything which
> > requires it points to a broken RA.
> >
> > The resource must be fully operational after start (or migrate_from)
> > have completed. Monitor must immediately be OK.
>
> What?
>
> If I'm not mistaken, the purpose of an external monitor_script in
> conjunction with a virtual domain would be to do something like ping it,
> try to connect to its TCP port 22, connect to its TCP port 445 (for a
> virtual Windows box), etc. Any such monitor script only has a chance to
> succeed when the virtual domain is fully booted. The start operation
> from the VirtualDomain RA (just like that from the Xen RA) returns
> immediately after the virtualization management API has determined that
> the virtual domain has successfully _started_ its boot process, not
> completed it.
>
> What would be your suggestion to determine, from Pacemaker's
> perspective, that a virtual domain is fully booted?

This is not easy to answer in general. It depends on what the VM
should do, i.e. what kind of service it has to provide. I agree with
Lars that the start action should, once it has finished, really mean
that the resource is fully operational. After all, there could be
another resource waiting to start (i.e. the order dependency) and
this other resource may fail if the previous one hasn't started. The
simplest way for the start operation to ensure this is to invoke
monitor itself.

> >> But suppose we have a brief interruption in machine availability
> >> during migration. We can't temporarily disable the external monitor
> >> operation then, so we at least need to make sure that it doesn't time
> >> out before the LRM says it does. I realize no such interruption is
> >> supposed to happen during Xen live migration, but I don't know about
> >> KVM, OpenVZ, lxc etc.
> >
> > If that's what you think you need, have the migrate_from/start op loop
> > until "monitor" succeeds.
> >
> > while ! monitor_function ; do sleep 1 ; done
>
> So what is your suggestion?
>
> 1. Augment the monitor operation with any external monitor_script and
> block any start or migrate_from until monitor succeeds? In that case,
> please educate me as to the purpose of monitor timeouts. Or are you
> saying one would have to adjust start and migrate_from timeouts accordingly?

The start/migrate_from timeout should be generous for Xen and such.
The monitor timeout may/should be shorter, depending on your service
quality policy. If in doubt, use longer timeouts.

> 2. Ditch any external monitor_script functionality in the VirtualDomain
> RA, as it's useless anyway? In that case, please let me know what it's
> for in the Xen RA.

I guess that this has been implicitly answered above :)

Cheers,

Dejan

> Cheers,
> Florian
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
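
For illustration, the approach discussed in this message could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual VirtualDomain or Xen RA code. Only OCF_RESKEY_monitor_scripts and monitor_function come from the thread; start_domain, domain_status, run_monitor_scripts, start_function and STATE_FILE are hypothetical names, and the two domain helpers are stand-ins for the real virtualization calls. The OCF exit codes are defined locally so the sketch is self-contained; a real RA would normally take them from the resource-agents shell library.

```shell
#!/bin/sh
# Sketch only. The OCF return codes are defined here for self-containment.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical stand-ins for the virtualization layer: the "domain" counts
# as started once a marker file exists.
STATE_FILE="${STATE_FILE:-/tmp/demo-domain.state}"
start_domain() { : > "$STATE_FILE"; }
domain_status() {
    [ -e "$STATE_FILE" ] && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

# External monitor scripts only answer "true" or "false": stop at the first
# failure and do not propagate the script's own exit code.
run_monitor_scripts() {
    # Unquoted on purpose: the parameter is a whitespace-separated list.
    for script in $OCF_RESKEY_monitor_scripts; do
        "$script" >/dev/null 2>&1 || return 1
    done
    return 0
}

# The RA itself decides whether the instance is active; the external
# scripts only get a chance to fail an instance that is already running.
monitor_function() {
    domain_status || return $?
    run_monitor_scripts || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

# start returns only once monitor succeeds. There is deliberately no
# timeout here: the upper layer (lrmd) enforces the start timeout.
start_function() {
    start_domain || return $OCF_ERR_GENERIC
    while ! monitor_function; do sleep 1; done
    return $OCF_SUCCESS
}
```

With this shape, a resource ordered after the virtual domain is not started until the external checks pass, which is why the start and migrate_from timeouts need to be generous while the monitor timeout can stay short.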