
seligman at nevis
Feb 28, 2012, 12:51 PM
Post #6 of 18
On 2/28/12 2:09 PM, Lars Ellenberg wrote:
> On Tue, Feb 28, 2012 at 01:21:51PM -0500, William Seligman wrote:
>> On 2/27/12 8:40 PM, Andrew Beekhof wrote:
>>
>>> Oh, what does the fence_pcmk file look like?
>>
>> This is a standard part of the pacemaker-1.1.6 package. According to
>>
>> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_configuring_cman_fencing.html>
>>
>> it causes any fencing requests from cman to be redirected to pacemaker.
>>
>> Since you asked, I've attached a copy of the file. I note that if this script is
>> used to fence a system it writes to /var/log/messages using logger, and there is
>> no such log message in my logs. So I guess cman is off the hook.
>
> You say "fencing resource-only" in drbd.conf.
> But you did not show the fencing handler used?
> Did you specify one at all?

It looks like I "over-edited" when I got rid of the comments before I posted my
configuration. The relevant sections are:

disk {
        fencing resource-only;
}
handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        split-brain "/usr/lib/drbd/notify-split-brain.sh sysadmin [at] nevis";
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}

> Besides, for a dual-primary DRBD setup, you must have "fencing
> resource-and-stonith;", and you should use a DRBD fencing handler
> that really fences off the peer. It may additionally set constraints.

Do crm-fence-peer.sh or Lon Hohberger's obliterate-peer.sh "really" fence off a peer?
I suspect your answer will be no, since from what I can tell in a cman+pacemaker
configuration they both wind up calling stonith_admin.

> Also, maybe that post helps to realize some of the problems involved:
> http://www.gossamer-threads.com/lists/linuxha/pacemaker/62927#62927
>
> Especially the part about
> But just because you can shoot someone
> does not mean you have the bi^Wbetter data.
>
> Because of the increased complexity, I strongly recommend against dual
> primary DRBD, unless you have a very good reason to want it.
>
> "Because it can be done" does not count as good reason in that context

<off-topic>
Sigh. I wish that were the reason.

The reason why I'm doing dual-primary is that I've got a single-primary
two-node cluster in production that simply doesn't work. One node runs
resources; the other sits and twiddles its fingers; fine. But when primary goes
down, secondary has trouble starting up all the resources; when we've actually
had primary failures (UPS goes haywire, hard drive failure) the secondary often
winds up in a state in which it runs none of the significant resources.

With the dual-primary setup I have now, both machines are running the resources
that typically cause problems in my single-primary configuration. If one box
goes down, the other doesn't have to fail over anything; it's already running
them. (I needed IPaddr2 cloning to work properly for this to work, which is why
I started that thread... and all the stupider of me for missing that crucial
page in Clusters From Scratch.)

My only remaining problem with the configuration is restoring a fenced node to
the cluster. Hence my tests, and the reason why I started this thread.
</off-topic>

> More comments below.
>
>>> On Tue, Feb 28, 2012 at 11:49 AM, William Seligman
>>> <seligman [at] nevis> wrote:
>>>> I'm trying to set up an active/active HA cluster as explained in Clusters From
>>>> Scratch (which I just re-read after my last problem).
>>>>
>>>> I'll give versions and config files below, but I'll start with what happens. I
>>>> start with an active/active cman+pacemaker+drbd+gfs2 cluster, with fencing
>>>> enabled. My fencing mechanism cuts power to a node by turning the load off in
>>>> its UPS. The two nodes are hypatia-tb and orestes-tb.
>>>>
>>>> I want to test fencing and recovery. I start with both nodes running, and
>>>> resources properly running on both nodes. Then I simulate failure on one node,
>>>> e.g., orestes-tb. I've done this with "crm node standby", "service pacemaker
>>>> off", or by pulling the plug. As expected, all the resources move to hypatia-tb,
>>>> with the drbd resource as Primary.
>>>>
>>>> When I try to bring orestes-tb back into the cluster with "crm node online" or
>>>> "service pacemaker on" (the inverse of how I removed it), orestes-tb is fenced.
>>>> OK, that makes sense, I guess; there's a potential split-brain situation.
>>>
>>> Not really, that should only happen if the two nodes can't see each
>>> other. Which should not be the case.
>>> Only when you pull the plug should orestes-tb be fenced.
>>>
>>> Or if you're using a fencing device that requires the node to have
>>> power, then I can imagine that turning it on again might result in
>>> fencing.
>>> But not for the other cases.
>>
>> I ran a test: I turned off pacemaker (and so DRBD) on orestes-tb. I "touch"ed a
>> file on the hypatia-tb DRBD partition, to make it the "newer" one. Then I turned
>> off pacemaker on hypatia-tb. Finally I turned on just drbd on hypatia-tb, then
>> on orestes-tb.
>>
>> From /var/log/messages on hypatia-tb:
>>
>> Feb 28 11:39:19 hypatia-tb kernel: d-con admin: Starting worker thread (from drbdsetup [21822])
>> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: disk( Diskless -> Attaching )
>> Feb 28 11:39:19 hypatia-tb kernel: d-con admin: Method to ensure write ordering: barrier
>> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: max BIO size = 130560
>> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: Adjusting my ra_pages to backing device's (32 -> 768)
>> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: drbd_bm_resize called with capacity == 5611549368
>> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: resync bitmap: bits=701443671 words=10960058 pages=21407
>> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
>> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: bitmap READ of 21407 pages took 576 jiffies
>> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: recounting of set bits took additional 87 jiffies
>> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: 55 MB (14114 bits) marked out-of-sync by on disk bit-map.
>> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated )
>> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: attached to UUIDs 862A336609FD27CD:BFFB722D5E3E15D7:6E63EC4258C86AF2:6E62EC4258C86AF2
>> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
>> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: Starting receiver thread (from drbd_w_admin [21824])
>> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: receiver (re)started
>> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
>>
>>
>> From /var/log/messages on orestes-tb:
>>
>> Feb 28 11:39:51 orestes-tb kernel: d-con admin: Starting worker thread (from drbdsetup [17827])
>> Feb 28 11:39:51 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
>> Feb 28 11:39:51 orestes-tb kernel: d-con admin: Method to ensure write ordering: barrier
>> Feb 28 11:39:51 orestes-tb kernel: block drbd0: max BIO size = 130560
>> Feb 28 11:39:51 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing device's (32 -> 768)
>> Feb 28 11:39:51 orestes-tb kernel: block drbd0: drbd_bm_resize called with capacity == 5611549368
>> Feb 28 11:39:51 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671 words=10960058 pages=21407
>> Feb 28 11:39:51 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took 735 jiffies
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: recounting of set bits took additional 93 jiffies
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: attached to UUIDs BFFB722D5E3E15D6:0000000000000000:6E63EC4258C86AF2:6E62EC4258C86AF2
>> Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
>> Feb 28 11:39:52 orestes-tb kernel: d-con admin: Starting receiver thread (from drbd_w_admin [17829])
>> Feb 28 11:39:52 orestes-tb kernel: d-con admin: receiver (re)started
>> Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
>> Feb 28 11:39:52 orestes-tb kernel: d-con admin: Handshake successful: Agreed network protocol version 100
>> Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( WFConnection -> WFReportParams )
>> Feb 28 11:39:52 orestes-tb kernel: d-con admin: Starting asender thread (from drbd_r_admin [17835])
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: drbd_sync_handshake:
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: self BFFB722D5E3E15D6:0000000000000000:6E63EC4258C86AF2:6E62EC4258C86AF2 bits:0 flags:0
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: peer 862A336609FD27CC:BFFB722D5E3E15D7:6E63EC4258C86AF2:6E62EC4258C86AF2 bits:14114 flags:0
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 176(1), total 176; compression: 100.0%
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 176(1), total 176; compression: 100.0%
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
>> Feb 28 11:40:01 orestes-tb corosync[2193]: [TOTEM ] A processor failed, forming new configuration.
>> Feb 28 11:40:03 orestes-tb corosync[2193]: [QUORUM] Members[1]: 2
>> Feb 28 11:40:03 orestes-tb corosync[2193]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Feb 28 11:40:03 orestes-tb kernel: dlm: closing connection to node 1
>> Feb 28 11:40:03 orestes-tb corosync[2193]: [CPG ] chosen downlist: sender r(0) ip(129.236.252.14) r(1) ip(192.168.100.6) ; members(old:2 left:1)
>> Feb 28 11:40:03 orestes-tb corosync[2193]: [MAIN ] Completed service synchronization, ready to provide service.
>> Feb 28 11:40:03 orestes-tb fenced[2247]: fencing node hypatia-tb.nevis.columbia.edu
>>
>>
>> As far as I can tell, hypatia-tb's drbd comes up, says "I'm UpToDate" and waits
>> for a connection from orestes-tb. orestes-tb's drbd comes up, says "I'm
>> UpToDate"
>
> No, it clearly says "I'm Outdated" from the logs above:
> | Feb 28 11:39:52 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
>
> It outdated itself voluntarily when it was told to disconnect from a
> still running primary, because of your "fencing resource-only" configuration.
>
> Don't rely on that: in a real incident, the replication link will just
> fail, in which case you really need the "fencing resource-and-stonith",
> and a suitable fence-peer handler.
>
>> and starts the sync process with hypatia-tb. Then cman+corosync steps
>> in on orestes-tb and fences hypatia-tb, before the sync can proceed.
>>
>> I ran another test. I did the same thing as the previous paragraph, except that
>> I made sure both cman and pacemaker were off (I had to reboot to make sure) and
>> just started drbd on both nodes. Sure enough, drbd was able to sync without
>> split-brain or fencing. So this is a cman/corosync issue, not a drbd issue.
>
> You still may retry the whole thing with drbd 8.3.12,
> just to make sure there is no hidden DRBD 8.4.1 instability.

OK, that will be my next step, if resource-and-stonith doesn't solve the problem.
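
For concreteness, the resource-and-stonith change would look roughly like the
fragment below. This is only a sketch, not a tested configuration: it reuses the
fence-peer and after-resync-target handlers from the configuration quoted
earlier in this message, and whether crm-fence-peer.sh is a suitable handler for
dual-primary is exactly the open question above, so treat that line as an
assumption.

```
disk {
        # was "fencing resource-only"; Lars says dual-primary needs this policy
        fencing resource-and-stonith;
}
handlers {
        # assumption: same handler scripts as in my current config; replace
        # fence-peer if this script turns out not to "really" fence the peer
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
```

The other handlers (pri-on-incon-degr and so on) would stay as they are.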

>> While I was setting up the test for the previous paragraph, there was a problem
>> with another resource (ocf:heartbeat:exportfs) that couldn't be properly
>> monitored on either node. This led to a cycle of fencing where each node would
>> successively fence the other because the exportfs resource couldn't run on
>> either node. I had to quickly change my configuration to turn off monitoring on
>> the resource.
>>
>> So it seems like cman+corosync is the issue. It's as if I'm "over-fencing."
>>
>> Any ideas?

--
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/