phil at macprofessionals
Jul 13, 2012, 8:41 AM
Post #2 of 2
Re: LIO + Pacemaker kernel oops on failover
On 07/03/2012 02:38 PM, Phil Frost wrote:
> It seems there's something about the iSCSI RAs that hit a bug in LIO:
> I seem to be hitting the same problem quite reliably whenever I
> migrate the iSCSI targets in my cluster. Sounds like the OP was able
> to reach a suitable workaround, but I'm not very experienced with LIO
> or iSCSI so the discussion is a bit over my head. Anyone have some
> idea how to implement the changes described there?
I wasn't able to find a way to modify the existing
iSCSI(Target|LogicalUnit) RAs to stop the target in a way that avoided
this bug in LIO. The problem was largely that, with targets and logical
units as separate resources, it was difficult to start the target before
the LUs and also stop the target before the LUs. I tried asymmetric
order constraints, but they did not work well in testing. I don't know
whether that was because the shutdown wasn't completing cleanly, or
because the iSCSILogicalUnit resources were upset that the LUs had been
stopped when Pacemaker wasn't expecting it.
Anyhow, my solution was to write a new RA (attached) which managed the
target and the LUs together, and thus could control the ordering of
starting and stopping them in detail. It's not as featureful or general
as the existing RAs, but in my testing so far it is stable.
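To illustrate the ordering, here is a minimal sketch of the idea rather than the attached RA itself. The four helper functions are hypothetical placeholders that only record the order in which they are called; a real agent would configure or tear down LIO at those points. The names combined_start and combined_stop are also made up for this example.

```shell
#!/bin/sh
# Hypothetical placeholders: a real RA would talk to LIO here.
# In this sketch they only append their name to CALL_LOG.
CALL_LOG=""
start_target() { CALL_LOG="$CALL_LOG start_target"; }
start_lus()    { CALL_LOG="$CALL_LOG start_lus"; }
stop_target()  { CALL_LOG="$CALL_LOG stop_target"; }
stop_lus()     { CALL_LOG="$CALL_LOG stop_lus"; }

# Start: bring the target up first, then the LUs.
combined_start() {
    start_target || return 1   # 1 is OCF_ERR_GENERIC
    start_lus    || return 1
    return 0                   # 0 is OCF_SUCCESS
}

# Stop: take the target down first as well, then the LUs.
# Note this is NOT the reverse of the start order.
combined_stop() {
    stop_target || return 1
    stop_lus    || return 1
    return 0
}
```

Because both steps live in one agent, the stop order no longer depends on how Pacemaker sequences two separate resources.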
This is the first RA I have written, so I would appreciate any comments.
One problem in particular relates to the monitor action -- you can see
it only checks that the target is running. I could add monitoring for
the LUs easily enough, but I'm not clear on what should happen if the
target is up, but the LUs are not. In this state the service is neither
"up" nor "down"; it's broken, and the right thing to do is probably to
attempt a restart. I'm not sure how to communicate that to Pacemaker
from my RA, though. Should I return OCF_ERR_GENERIC? What will Pacemaker
do in this case?
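To make the question concrete, this is roughly the monitor logic I have in mind, written as a standalone sketch. The numeric values are the standard OCF return codes; target_is_up and lus_are_up are hypothetical probes, driven here by two variables so the example runs on its own. Whether OCF_ERR_GENERIC is the right code for the broken state is the part I am unsure about.

```shell
#!/bin/sh
# Standard OCF return codes (a real RA would get these from ocf-shellfuncs).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical probes. A real RA would inspect LIO's state here;
# in this sketch they just read two variables.
TARGET_UP=${TARGET_UP:-yes}
LUS_UP=${LUS_UP:-yes}
target_is_up() { [ "$TARGET_UP" = yes ]; }
lus_are_up()   { [ "$LUS_UP" = yes ]; }

combined_monitor() {
    # Target absent: report the resource as cleanly stopped.
    if ! target_is_up; then
        return $OCF_NOT_RUNNING
    fi
    # Target present but LUs missing: neither up nor down.
    # Assumption: report a generic failure for this state.
    if ! lus_are_up; then
        return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}
```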