
seligman at nevis
Mar 1, 2012, 3:28 AM
On 3/1/12 4:15 AM, emmanuel segura wrote:
> can you show me your /etc/cluster/cluster.conf?
>
> because i think your problem it's a fencing-loop

Here it is:

/etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster config_version="17" name="Nevis_HA">
  <logging debug="off"/>
  <cman expected_votes="1" two_node="1" />
  <clusternodes>
    <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
      <altname name="hypatia-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
      <altname name="orestes-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <fence_daemon post_join_delay="30" />
  <rm disabled="1" />
</cluster>

> On 1 March 2012 at 01:03, William Seligman <seligman [at] nevis> wrote:
>
>> On 2/28/12 7:26 PM, Lars Ellenberg wrote:
>>> On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
>>>> <off-topic>
>>>> Sigh. I wish that were the reason.
>>>>
>>>> The reason why I'm doing dual-primary is that I've got a single-primary
>>>> two-node cluster in production that simply doesn't work. One node runs
>>>> resources; the other sits and twiddles its fingers; fine. But when primary goes
>>>> down, secondary has trouble starting up all the resources; when we've actually
>>>> had primary failures (UPS goes haywire, hard drive failure) the secondary often
>>>> winds up in a state in which it runs none of the significant resources.
>>>>
>>>> With the dual-primary setup I have now, both machines are running the resources
>>>> that typically cause problems in my single-primary configuration. If one box
>>>> goes down, the other doesn't have to failover anything; it's already running
>>>> them. (I needed IPaddr2 cloning to work properly for this to work, which is why
>>>> I started that thread... and all the stupider of me for missing that crucial
>>>> page in Clusters From Scratch.)
>>>>
>>>> My only remaining problem with the configuration is restoring a fenced node to
>>>> the cluster. Hence my tests, and the reason why I started this thread.
>>>> </off-topic>
>>>
>>> Uhm, I do think that is exactly on topic.
>>>
>>> Rather fix your resources to be able to successfully take over,
>>> than add even more complexity.
>>>
>>> What resources would that be,
>>> and why are they not taking over?
>>
>> I can't tell you in detail, because the major snafu happened on a production
>> system after a power outage a few months ago. My goal was to get the thing
>> stable as quickly as possible. In the end, that turned out to be a non-HA
>> configuration: One runs corosync+pacemaker+drbd, while the other just runs drbd.
>> It works, in the sense that the users get their e-mail. If there's a power
>> outage, I have to bring things up manually.
>>
>> So my only reference is the test-bench dual-primary setup I've got now, which is
>> exhibiting the same kinds of problems even though the OS versions, software
>> versions, and layout are different. This suggests that the problem lies in the
>> way I'm setting up the configuration.
>>
>> The problems I have seem to be in the general category of "the 'good guy' gets
>> fenced when the 'bad guy' gets into trouble." Examples:
>>
>> - Assuming I start out with two crashed nodes. If I just start up DRBD and
>> nothing else, the partitions sync quickly with no problems.
>>
>> - If the system starts with cman running, and I start drbd, it's likely that
>> the system that is _not_ Outdated will be fenced (rebooted). Same thing if
>> cman+pacemaker is running.
>>
>> - Cloned ocf:heartbeat:exportfs resources are giving me problems as well (which
>> is why I tried making changes to that resource script). Assume I start with one
>> node running cman+pacemaker, and the other stopped. I turn on the stopped
>> node. This will typically result in the running node being fenced, because it
>> times out when stopping the exportfs resource.
>>
>> Falling back to DRBD 8.3.12 didn't change this behavior.
>>
>> My pacemaker configuration is long, so I'll excerpt what I think are the
>> relevant pieces in the hope that it will be enough for someone to say "You fool!
>> This is covered in Pacemaker Explained page 56!" When bringing up a stopped
>> node, in order to restart AdminClone pacemaker wants to stop ExportsClone, then
>> Gfs2Clone, then ClvmdClone. As I said, it's the failure to stop ExportMail on
>> the running node that causes it to be fenced.
>>
>> primitive AdminDrbd ocf:linbit:drbd \
>>         params drbd_resource="admin" \
>>         op monitor interval="60s" role="Master" \
>>         op monitor interval="59s" role="Slave" \
>>         op stop interval="0" timeout="320" \
>>         op start interval="0" timeout="240"
>> ms AdminClone AdminDrbd \
>>         meta master-max="2" master-node-max="1" \
>>         clone-max="2" clone-node-max="1" notify="true"
>>
>> primitive Clvmd lsb:clvmd op monitor interval="30s"
>> clone ClvmdClone Clvmd
>> colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
>> order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
>>
>> primitive Gfs2 lsb:gfs2 op monitor interval="30s"
>> clone Gfs2Clone Gfs2
>> colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
>> order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
>>
>> primitive ExportMail ocf:heartbeat:exportfs \
>>         op start interval="0" timeout="40" \
>>         op stop interval="0" timeout="45" \
>>         params clientspec="mail" directory="/mail" fsid="30"
>> clone ExportsClone ExportMail
>> colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
>> order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone

-- 
Bill Seligman             | mailto://seligman [at] nevis
Nevis Labs, Columbia Univ | http://www.nevis.columbia.edu/~seligman/
PO Box 137                |
Irvington NY 10533 USA    | Phone: (914) 591-2823