
dhoskinson at eng
Jun 30, 2009, 6:08 AM
Re: {*** SPAM 6.2 } Re: Failover problems
I am starting to believe I am cursed, is all :) And that it might be one of these other files. Here you go. I think the disk one said something other than detach, and I tried changing it, with no results.

global {
    usage-count no;
}

resource r0 {
    protocol C;
    startup {
        degr-wfc-timeout 120;
    }
    disk {
        on-io-error detach;
    }
    net {
    }
    syncer {
        rate 100M;
        al-extents 257;
    }
    on mail1.eng.uiowa.edu {
        device /dev/drbd0;
        disk /dev/md3;
        address 192.168.3.1:7788;
        meta-disk internal;
    }
    on mail2.eng.uiowa.edu {
        device /dev/drbd0;
        disk /dev/md3;
        address 192.168.3.2:7788;
        meta-disk internal;
    }
}

On 6/30/09 2:28 AM, "Darren.Mansell[at]opengi.co.uk" <Darren.Mansell[at]opengi.co.uk> wrote:

> All looks fine to me. Can you post your drbd.conf ?
>
>> -----Original Message-----
>> From: linux-ha-bounces[at]lists.linux-ha.org [mailto:linux-ha-bounces[at]lists.linux-ha.org] On Behalf Of David Hoskinson
>> Sent: 29 June 2009 17:24
>> To: General Linux-HA mailing list
>> Subject: Re: [Linux-HA] Failover problems
>>
>> I have made a very simple drbd and filesystem startup and it is still
>> resulting in a split brain. I have to be missing something. Here is my
>> test config.
>>
>> <cib validate-with="pacemaker-1.0" crm_feature_set="3.0.1" have-quorum="1" admin_epoch="0" epoch="189" num_updates="0" cib-last-written="Mon Jun 29 11:04:50 2009" dc-uuid="mail1">
>>   <configuration>
>>     <crm_config>
>>       <cluster_property_set id="cib-bootstrap-options">
>>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.0.4-6dede86d6105786af3a5321ccf66b44b6914f0aa"/>
>>         <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="openais"/>
>>         <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2"/>
>>         <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1245863799"/>
>>         <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
>>         <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
>>         <nvpair id="cib-bootstrap-options-default-resource-stickiness" name="default-resource-stickiness" value="200"/>
>>       </cluster_property_set>
>>     </crm_config>
>>     <nodes>
>>       <node id="mail1" uname="mail1" type="normal"/>
>>       <node id="mail2" uname="mail2" type="normal"/>
>>     </nodes>
>>     <resources>
>>       <master id="ms-drbd0">
>>         <meta_attributes id="ms-drbd0-meta_attributes">
>>           <nvpair id="ms-drbd0-meta_attributes-clone-max" name="clone-max" value="2"/>
>>           <nvpair id="ms-drbd0-meta_attributes-notify" name="notify" value="true"/>
>>           <nvpair id="ms-drbd0-meta_attributes-globally-unique" name="globally-unique" value="false"/>
>>           <nvpair id="ms-drbd0-meta_attributes-target-role" name="target-role" value="Started"/>
>>         </meta_attributes>
>>         <primitive class="ocf" id="drbd0" provider="heartbeat" type="drbd">
>>           <instance_attributes id="drbd0-instance_attributes">
>>             <nvpair id="drbd0-instance_attributes-drbd_resource" name="drbd_resource" value="r0"/>
>>           </instance_attributes>
>>           <operations>
>>             <op id="drbd0-monitor-59s"
>>                 interval="59s" name="monitor" role="Master" timeout="30s"/>
>>             <op id="drbd0-monitor-60s" interval="60s" name="monitor" role="Slave" timeout="30s"/>
>>           </operations>
>>         </primitive>
>>       </master>
>>       <primitive class="ocf" id="fs0" provider="heartbeat" type="Filesystem">
>>         <instance_attributes id="fs0-instance_attributes">
>>           <nvpair id="fs0-instance_attributes-fstype" name="fstype" value="ext3"/>
>>           <nvpair id="fs0-instance_attributes-directory" name="directory" value="/shared"/>
>>           <nvpair id="fs0-instance_attributes-device" name="device" value="/dev/drbd0"/>
>>         </instance_attributes>
>>         <meta_attributes id="fs0-meta_attributes">
>>           <nvpair id="fs0-meta_attributes-target-role-stopped" name="target-role-stopped"/>
>>           <nvpair id="fs0-meta_attributes-target-role" name="target-role" value="Started"/>
>>         </meta_attributes>
>>       </primitive>
>>     </resources>
>>     <constraints>
>>       <rsc_order first="ms-drbd0" first-action="promote" id="ms-drbd-before-fs0" score="INFINITY" then="fs0" then-action="start"/>
>>       <rsc_colocation id="fs0-on-ms-drbd0" rsc="fs0" score="INFINITY" with-rsc="ms-drbd0" with-rsc-role="Master"/>
>>     </constraints>
>>     <rsc_defaults/>
>>     <op_defaults/>
>>   </configuration>
>> </cib>
>>
>> No preferred master.
>>
>> In my test...
>>
>> Crm_mon
>>
>> ============
>> Last updated: Mon Jun 29 11:14:27 2009
>> Stack: openais
>> Current DC: mail2 - partition with quorum
>> Version: 1.0.4-6dede86d6105786af3a5321ccf66b44b6914f0aa
>> 2 Nodes configured, 2 expected votes
>> 2 Resources configured.
>> ============
>>
>> Online: [ mail1 mail2 ]
>>
>> Master/Slave Set: ms-drbd0
>> Masters: [ mail1 ]
>> Slaves: [ mail2 ]
>> fs0 (ocf::heartbeat:Filesystem): Started mail1
>>
>> Mail1 was picked as master, it recognizes mail2 and has loaded the
>> filesystem on mail1.
>>
>>
>> [root[at]mail1 crm]# cat /proc/drbd
>> version: 8.2.6 (api:88/proto:86-88)
>> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn[at]c5-i386-build, 2008-10-03 11:42:32
>>  0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
>>     ns:8 nr:0 dw:4 dr:201 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 oos:0
>>
>>
>> Drbd recognizes mail1 as the primary, sees the secondary and sync is
>> uptodate.
>>
>> When I shut down primary (mail1) in this case I see this as I should using
>> crm_mon on mail2:
>>
>> ============
>> Last updated: Mon Jun 29 11:18:28 2009
>> Stack: openais
>> Current DC: mail2.eng.uiowa.edu - partition WITHOUT quorum
>> Version: 1.0.4-6dede86d6105786af3a5321ccf66b44b6914f0aa
>> 2 Nodes configured, 2 expected votes
>> 2 Resources configured.
>> ============
>>
>> Online: [ mail2 ]
>> OFFLINE: [ mail1 ]
>>
>> Master/Slave Set: ms-drbd0
>> Masters: [ mail2 ]
>> Stopped: [ drbd0:0 ]
>> fs0 (ocf::heartbeat:Filesystem): Started mail2
>>
>> And then as mail1 becomes available.....
>>
>> ============
>> Last updated: Mon Jun 29 11:19:44 2009
>> Stack: openais
>> Current DC: mail2 - partition with quorum
>> Version: 1.0.4-6dede86d6105786af3a5321ccf66b44b6914f0aa
>> 2 Nodes configured, 2 expected votes
>> 2 Resources configured.
>> ============
>>
>> Online: [ mail1 mail2 ]
>>
>> Master/Slave Set: ms-drbd0
>> Masters: [ mail2 ]
>> Slaves: [ mail1 ]
>> fs0 (ocf::heartbeat:Filesystem): Started mail2
>>
>> So far so good, this is what I would expect it to say.
>> However if I look at drbd again:
>>
>> [root[at]mail2 ~]# cat /proc/drbd
>> version: 8.2.6 (api:88/proto:86-88)
>> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn[at]c5-i386-build, 2008-10-03 11:42:32
>>  0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---
>>     ns:0 nr:8 dw:12 dr:197 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 oos:4
>>
>> [root[at]mail1 ~]# cat /proc/drbd
>> version: 8.2.6 (api:88/proto:86-88)
>> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn[at]c5-i386-build, 2008-10-03 11:42:32
>>  0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown r---
>>     ns:0 nr:0 dw:0 dr:0 al:0 bm:2 lo:0 pe:0 ua:0 ap:0 oos:8192
>>
>>
>> It's split again.
>>
>> It has to be something simple that I am missing...
>>
>>
>> On 6/29/09 10:37 AM, "Darren.Mansell[at]opengi.co.uk"
>> <Darren.Mansell[at]opengi.co.uk> wrote:
>>
>>> I may have missed this but are you using the old style drbddisk RA or
>>> the new drbd RA?
>>>
>>> If it's the new one, have you ensured the init script for DRBD is turned off?
>>>
>>> Also do you have an ordering constraint so you aren't trying to mount
>>> the device before it is brought online?
>>>
>>> Some info below I've put together from the clusterlabs web site for my
>>> own config.
>>>
>>>
>>>
>>> Open the crm and start configuring it
>>>
>>> crm
>>> configure
>>>
>>> primitive drbd0 ocf:heartbeat:drbd \
>>>     params drbd_resource=hub_disk \
>>>     op monitor role=Master interval=59s timeout=30s \
>>>     op monitor role=Slave interval=60s timeout=30s
>>>
>>> This means:
>>>
>>> * primitive - It's a primitive resource.
>>> * drbd0 - This is the name we are giving it. It's always the second
>>>   parameter. We could call this anything (within reason)
>>> * ocf:heartbeat:drbd - ocf means the resource agent is an OCF type,
>>>   (Open Cluster Framework), provided by heartbeat and it's the drbd RA.
>>> * params - Give each parameter you require here. Press tab for a
>>>   list.
>>>   drbd_resource is the name you have in the DRBD config.
>>> * op - Put an operation on the resource...
>>> * monitor - Which is a monitor. You are saying monitor with this
>>>   interval and this timeout when the resource instance is a master, then
>>>   you have another monitor with different values for if it's a slave.
>>>
>>> ms ms-drbd0 drbd0 \
>>>     meta clone-max=2 notify=true globally-unique=false
>>>
>>> This means:
>>>
>>> * ms - It's a multi-state (master/slave) resource
>>> * ms-drbd0 - We call it this as it's a master-slave of the drbd0
>>>   resource we configured above
>>> * drbd0 - The primitive resource this multi-state resource wraps
>>> * meta - Specific meta information goes after this. Maximum number
>>>   of clones is 2, notify the RA on a change of role, it's not globally
>>>   unique as it's on 2 servers.
>>>
>>> primitive fs0 ocf:heartbeat:Filesystem \
>>>     params fstype=ext3 directory=/www device=/dev/drbd0 \
>>>     meta migration-threshold="50"
>>>
>>> This means:
>>>
>>> * primitive fs0 - It's another primitive resource, we're calling
>>>   this fs0 for filesystem0.
>>> * ocf:heartbeat:Filesystem - The resource agent is type OCF,
>>>   provided by heartbeat and is the Filesystem RA. It takes care of
>>>   mounting and unmounting a filesystem on a device.
>>> * params - These are the parameters we pass to the RA. In this case
>>>   it's just the 3 things that mount needs to know, the FS type, where to
>>>   mount it and the device name. As we're using drbd it's /dev/drbd0
>>>
>>> primitive proftpd lsb:proftpd \
>>>     op monitor interval="20s" timeout="10s" \
>>>     meta migration-threshold="50"
>>>
>>> This means:
>>>
>>> * It's another primitive resource called proftpd.
>>> * lsb:proftpd - This is an LSB resource agent (/etc/init.d script)
>>> * There are no parameters to pass to this init script. You can build
>>>   them in but don't have to.
>>> * You are putting a monitor operation on it that checks it every 20s
>>>   and times out after 10s.
>>>   The monitor operation just runs
>>>   /etc/init.d/proftpd status. If it gets a return code of 0 it's working.
>>>   A return code of 3 means it's not. The init scripts have to be LSB
>>>   compliant (give the correct return codes) to work.
>>> * Finally the migration threshold is how many failures it can have
>>>   before it will failover to the other node.
>>>
>>> primitive tomcat lsb:tomcat \
>>>     op monitor interval="30s" timeout="20s" \
>>>     meta migration-threshold="50"
>>>
>>> Should be self-explanatory by now. It's a primitive resource called
>>> tomcat using an LSB init script called tomcat. Pacemaker will call the
>>> init script's status function every 30s and wait 20s for a response. If
>>> it fails 50 times it will be migrated over to the other node.
>>>
>>> primitive virtual-ip ocf:heartbeat:IPaddr2 \
>>>     params ip="2.21.4.45" broadcast="2.255.255.255" nic="eth0" cidr_netmask="8" \
>>>     op monitor interval=21s timeout=5s
>>>
>>> And again, an IPaddr2 OCF RA called virtual-ip. Give it the parameters
>>> it needs and monitor it every 21s, timeout 5s.
>>>
>>> group resource-group fs0 proftpd tomcat virtual-ip
>>>
>>> Now we group all our primitive resources together into resource group
>>> called.... resource-group (imaginative eh?)
>>>
>>> order ms-drbd0-before-fs0 inf: ms-drbd0:promote fs0:start
>>>
>>> This sets an order constraint called ms-drbd0-before-fs0. The inf: means
>>> INFINITY scoring (mandatory). The ms-drbd0:promote says to first promote
>>> that resource then the fs0:start means to then start that resource. For
>>> info the XML of that command comes out as:
>>>
>>> <rsc_order first="ms-drbd0" first-action="promote" id="ms-drbd0-before-fs0" score="INFINITY" then="fs0" then-action="start"/>
>>>
>>> colocation res-group-on-ms-drbd0 inf: resource-group ms-drbd0:Master
>>>
>>> This is a colocation constraint. It's to ensure certain resources have
>>> to run together on the same node.
>>> This one is called
>>> res-group-on-ms-drbd0 score INFINITY and resource-group has to be
>>> colocated with ms-drbd0 as the Master.
>>>
>>> location ms-drbd0-master-on-hub1 ms-drbd0 \
>>>     rule id="ms-drbd0-master-on-hub1-rule" role="master" 100: #uname eq hub1
>>>
>>> Finally this is to make the migration-threshold work. The location is
>>> called ms-drbd0-master-on-hub1 using ms-drbd0 resource as something for
>>> the rule to stick to. The role is master for ms-drbd0 score 100 and the
>>> uname of the node has to be hub1.
>>>
>>> commit
>>> end
>>> quit
>>>
>>> So working backwards:
>>>
>>> 1. With a score of 100, the DRBD master is preferred on hub1
>>> 2. The resource group resource-group has to be on the same node as
>>>    the DRBD resource. This score is INFINITY which makes it mandatory.
>>> 3. The resource fs0 has to start after the DRBD resource has been
>>>    promoted, as we can't mount any dirs using the Filesystem resource until
>>>    it's a primary.
>>> 4. The fs0, tomcat and proftpd resources all have a migration
>>>    threshold of 50. If any one of them goes over this it will cause some
>>>    scores to be evaluated and then action will be decided by the crm. If
>>>    the 2nd node has no issues barring the failover of resources onto it
>>>    then that resource will be failed over. As we have colocation
>>>    constraints then those will be taken into account with the evaluation.
>>>
>>> Finally chkconfig off drbd, tomcat and proftpd to be sure they won't
>>> start at boot time (pacemaker will start them).
>>>
>>>> -----Original Message-----
>>>> From: linux-ha-bounces[at]lists.linux-ha.org [mailto:linux-ha-bounces[at]lists.linux-ha.org] On Behalf Of David Hoskinson
>>>> Sent: 29 June 2009 16:12
>>>> To: General Linux-HA mailing list
>>>> Subject: [Linux-HA] Failover problems
>>>>
>>>> I must be missing something here I hope someone can help. I have a
>>>> master/slave setup using latest openais/pacemaker/drbd.
>>>> System starts up perfectly and if I shutdown slave, primary notices
>>>> status change and also notices when slave reconnects. If I shutdown
>>>> master, drbd and services transfer to slave and all works well.
>>>>
>>>> The problem as I see it, is that when the master comes back on line it
>>>> reassumes the drbd and services however I am left with a split brain
>>>> for the drbd. I get split brain messages in logs, and primary machine
>>>> shows primary/unknown in the cat /proc/drbd. And Slave shows
>>>> slave/unknown. I am able to manually reconnect the drives as has been
>>>> suggested earlier but this doesn't seem to be the "normal" way in my
>>>> way of thinking or am I wrong with this. Should it be split brain when
>>>> master takes back over? I want to know if I am struggling over
>>>> something I shouldn't be. It just seems to me that it should
>>>> seamlessly reconnect without enabling the "automatic" split brain
>>>> function in drbd.
>>>>
>>>> Hope this makes sense to someone...
>>>>
>>>>
>>>> _______________________________________________
>>>> Linux-HA mailing list
>>>> Linux-HA[at]lists.linux-ha.org
>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>> See also: http://linux-ha.org/ReportingProblems
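
For anyone comparing the /proc/drbd outputs quoted in this thread, here is a minimal Python sketch (not something posted by either participant) that pulls the cs/st/ds fields out of a DRBD 8.2-style status line and flags the pattern seen above, where one node sits in StandAlone and the other in WFConnection. The names parse_drbd_status and looks_split_brain are hypothetical, and the check is only a heuristic built from the outputs shown here; the split brain messages in the logs remain the real evidence.

```python
import re

# Matches the per-device status line of DRBD 8.2-style /proc/drbd output, e.g.
#  0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+st:(?P<st>\S+)\s+ds:(?P<ds>\S+)"
)


def parse_drbd_status(text):
    """Return one dict per device line found in /proc/drbd-style text."""
    devices = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if not match:
            continue  # version, GIT-hash and counter lines are skipped
        local_role, _, peer_role = match.group("st").partition("/")
        local_disk, _, peer_disk = match.group("ds").partition("/")
        devices.append({
            "minor": int(match.group("minor")),
            "cs": match.group("cs"),
            "local_role": local_role,
            "peer_role": peer_role,
            "local_disk": local_disk,
            "peer_disk": peer_disk,
        })
    return devices


def looks_split_brain(node_a, node_b):
    """Heuristic (an assumption, not a DRBD API): at least one side is
    StandAlone and neither side reports any other connection state
    than StandAlone or WFConnection."""
    states = {node_a["cs"], node_b["cs"]}
    return "StandAlone" in states and states <= {"StandAlone", "WFConnection"}


# Status lines copied from the two outputs quoted in the thread.
mail2 = parse_drbd_status(
    " 0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---"
)[0]
mail1 = parse_drbd_status(
    " 0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown r---"
)[0]
print(looks_split_brain(mail1, mail2))  # prints True
```

The parser deliberately ignores the counter line (ns/nr/dw/...) since the connection state alone is enough to tell the healthy Connected case from the disconnected one.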