
phil at macprofessionals
Jul 13, 2012, 8:41 AM
Re: LIO + Pacemaker kernel oops on failover
On 07/03/2012 02:38 PM, Phil Frost wrote:
> It seems there's something about the iSCSI RAs that hit a bug in LIO:
>
> http://comments.gmane.org/gmane.linux.scsi.target.devel/1568?set_cite=hide
>
> I seem to be hitting the same problem quite reliably whenever I
> migrate the iSCSI targets in my cluster. Sounds like the OP was able
> to reach a suitable workaround, but I'm not very experienced with LIO
> or iSCSI so the discussion is a bit over my head. Anyone have some
> idea how to implement the changes described there?

I wasn't able to find a way to modify the existing iSCSI(Target|LogicalUnit) RAs to stop the target in a way that avoids this bug in LIO. The main problem is that with targets and logical units as separate resources, it is difficult to both start the target before the LUs and stop the target before the LUs. I tried asymmetric order constraints, but they did not work well in testing. I don't know whether that's because the shutdown wasn't working cleanly, or because the iSCSILogicalUnit resources were upset that the LUs were stopped when Pacemaker wasn't expecting it.

Anyhow, my solution was to write a new RA (attached) which manages the target and the LUs together, and thus can control the order in which they are started and stopped in detail. It's not as featureful or general as the existing RAs, but in my testing so far it is stable.

This is the first RA I have written, so I would appreciate any comments. One problem in particular relates to the monitor action: you can see it only checks that the target is running. I could add monitoring for the LUs easily enough, but I'm not clear on what should happen if the target is up but the LUs are not. In this state the service is neither "up" nor "down", it's broken, and the right thing to do is probably to attempt a restart. I'm not sure how to communicate that to Pacemaker from my RA, though. Should I return OCF_ERR_GENERIC? What will Pacemaker do in this case?
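To make the monitor question concrete, here is a minimal sketch of a monitor function that distinguishes the three states. It is not taken from the attached RA. The helper names (target_is_up, lus_are_up, iscsi_monitor), the CONFIGFS_ROOT variable, and the assumption that the LUNs live under tpgt_1 are all hypothetical and chosen for illustration only. The return code values (0, 1, 7) are the standard OCF ones that ocf-shellfuncs normally provides. As far as I understand it, Pacemaker treats a monitor result of OCF_ERR_GENERIC as a failed resource and applies the operation's on-fail policy, which defaults to a restart, but I would welcome confirmation.

```shell
#!/bin/sh
# Standard OCF return codes; ocf-shellfuncs normally defines these.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"

# Assumption: the LIO iSCSI fabric is visible under this configfs path.
# Overridable so the logic can be exercised without a real target.
: "${CONFIGFS_ROOT:=/sys/kernel/config/target/iscsi}"

# Hypothetical helper: the target counts as "up" if its directory exists.
target_is_up() {
    [ -d "$CONFIGFS_ROOT/$1" ]
}

# Hypothetical helper: every expected LUN must exist under TPG 1
# (the tpgt_1 layout is an assumption of this sketch).
lus_are_up() {
    iqn=$1
    shift
    for lun in "$@"; do
        [ -d "$CONFIGFS_ROOT/$iqn/tpgt_1/lun/lun_$lun" ] || return 1
    done
    return 0
}

# Usage: iscsi_monitor <iqn> <lun number>...
iscsi_monitor() {
    iqn=$1
    shift
    if ! target_is_up "$iqn"; then
        # Cleanly stopped: nothing is configured.
        return "$OCF_NOT_RUNNING"
    fi
    if ! lus_are_up "$iqn" "$@"; then
        # Target present but LUs missing: report a failure, not "stopped".
        return "$OCF_ERR_GENERIC"
    fi
    return "$OCF_SUCCESS"
}
```

The point of the design is that OCF_NOT_RUNNING is reserved for the cleanly stopped case, so the half-configured state is reported as an error instead of being mistaken for a normal stop.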