
dejanmm at fastmail
Nov 4, 2009, 4:33 AM
Post #2 of 6
Hi,

On Wed, Nov 04, 2009 at 10:40:15AM +0000, S. A. Woltering wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello,
> I hope I'm not posting anything new to the list, but I probably am.
>
> I'm in the process of building up a two-node cluster based on DRBD,
> Pacemaker and OpenAIS.
>
> I've 800GB+200GB RAID partitions of each of two HP DL360s (the 800GB
> allocated to resource "mail", the 200GB to resource "rsync"). Running
> CentOS 5.4, with Pacemaker set up as detailed in Andrew Beekhof's
> "Cluster from scratch - Apache" document. x86_64 architecture.
>
> My DRBD config is as follows:
> # /etc/drbd.conf
> common {
>   protocol C;
>   net {
>     allow-two-primaries;
>     cram-hmac-alg sha1;
>     shared-secret XXXXXX;
>     after-sb-0pri discard-zero-changes;
>     after-sb-1pri discard-secondary;
>     after-sb-2pri disconnect;
>   }
>   disk {
>     fencing resource-only;
>   }
>   syncer {
>     rate 100M;
>     verify-alg sha1;
>   }
>   startup {
>     wfc-timeout 20;
>     degr-wfc-timeout 10;
>   }
>   handlers {
>     fence-peer /usr/lib/drbd/crm-fence-peer.sh;
>     after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
>   }
> }
> resource mail {
>   on gemini {
>     device /dev/drbd0 minor 0;
>     disk /dev/cciss/c0d0p3;
>     address ipv4 XX.hb1.addy.xx:7789;
>     meta-disk internal;
>   }
>   on soyuz {
>     device /dev/drbd0 minor 0;
>     disk /dev/cciss/c0d0p3;
>     address ipv4 XX.hb1.addy.xx:7789;
>     meta-disk internal;
>   }
> }
> resource rsync {
>   on gemini {
>     device /dev/drbd1 minor 1;
>     disk /dev/cciss/c0d0p4;
>     address ipv4 XX.hb2.addy.xx:7789;
>     meta-disk internal;
>   }
>   on soyuz {
>     device /dev/drbd1 minor 1;
>     disk /dev/cciss/c0d0p4;
>     address ipv4 XX.hb2.addy.xx:7789;
>     meta-disk internal;
>   }
> }
>
> Output from "crm configure show" is:
>
> node gemini \
>         attributes standby="off"
> node soyuz \
>         attributes standby="off"
> primitive drbd-mail ocf:linbit:drbd \
>         params drbd_resource="mail" \
>         op monitor interval="15s"
> primitive fs-mail ocf:heartbeat:Filesystem \
>         params device="/dev/drbd0" directory="/data/mail" fstype="ext3"
> primitive ip-mail ocf:heartbeat:IPaddr2 \
>         params ip="xxx.xxx.xxx.xxx" nic="bond0"
> primitive st-gemini stonith:external/riloe \
>         params hostlist="gemini" ilo_hostname="xxx.xxx.xxx.xxx"
>         ilo_user="root" ilo_password="XXXXXX" ilo_can_reset="0"
>         ilo_protocol="2.0" ilo_powerdown_method="button -S" \

"button -S" is not supported; only "button" and "power" are. The
latter is, I think, more reliable; the former is easier on the
hardware. Also, riloe should issue a warning if it doesn't recognize
the method.

>         op monitor interval="60s"
> primitive st-soyuz stonith:external/riloe \
>         params hostlist="soyuz" ilo_hostname="xxx.xxx.xxx.xxx"
>         ilo_user="root" ilo_password="XXXXXX" ilo_can_reset="0"
>         ilo_protocol="2.0" ilo_powerdown_method="button -S" \
>         op monitor interval="60s"
> group mailservice fs-mail ip-mail \
>         meta target-role="Started"
> ms ms-drbd-mail drbd-mail \
>         meta master-max="1" master-node-max="1" clone-max="2"
>         clone-node-max="1" notify="true" target-role="Started"
> ##########I didn't set this... should I delete it?
> location drbd-fence-by-handler-ms-drbd-mail ms-drbd-mail \
>         rule $id="drbd-fence-by-handler-rule-ms-drbd-mail"
>         $role="Master" -inf: #uname ne soyuz
> ##########
> location l-st-gemini st-gemini -inf: gemini
> location l-st-soyuz st-soyuz -inf: soyuz
> colocation mail-on-drbd inf: mailservice ms-drbd-mail:Master
> order mail-after-drbd inf: ms-drbd-mail:promote mailservice:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="true" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1257329728"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="200"
>
> DRBD, on its own, works fine. However, when using the OCF agent as shown
> in the output above, I see some strange effects.
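Regarding ilo_powerdown_method on st-gemini and st-soyuz: a minimal sketch of switching both primitives to a supported value, assuming the installed crm shell provides the `crm resource param <rsc> set <param> <value>` subcommand (check the crm shell help for your version). The helper names `valid_powerdown_method` and `set_powerdown_method` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers. st-gemini and st-soyuz come from the posted
# configuration; "crm resource param ... set" is assumed to exist in
# the installed crm shell.

# Succeed only for the two methods external/riloe understands.
valid_powerdown_method() {
    case "$1" in
        power|button) return 0 ;;
        *) return 1 ;;
    esac
}

# Set ilo_powerdown_method on both STONITH primitives, refusing any
# value riloe would not recognize (such as "button -S").
set_powerdown_method() {
    method="$1"
    if ! valid_powerdown_method "$method"; then
        echo "unsupported ilo_powerdown_method: $method" >&2
        return 1
    fi
    for rsc in st-gemini st-soyuz; do
        crm resource param "$rsc" set ilo_powerdown_method "$method" || return 1
    done
}
```

Calling `set_powerdown_method power` picks the more reliable option; `set_powerdown_method button` is the gentler one. Editing the two primitives by hand with `crm configure edit` achieves the same result.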
>
> If I perform a "destructive" test on one node (ie: yank the power out),
> everything failed over smoothly, but when I brought the downed node back
> online, it refused to reconnect to the DRBD "mail" resource.
>
> I get the following from "crm_mon -1":
> ============
> Last updated: Wed Nov 4 10:05:48 2009
> Stack: openais
> Current DC: soyuz - partition with quorum
> Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
>
> Online: [ soyuz gemini ]
>
> st-gemini (stonith:external/riloe): Started soyuz
> st-soyuz (stonith:external/riloe): Started gemini
>
> Failed actions:
>     drbd-mail:1_start_0 (node=gemini, call=8, rc=-2, status=Timed Out):
> unknown
>
> So, I manually, re-attach and re-connect to the resource on the "quiet"
> node and I see this:
>
> # crm resource show
>  st-gemini (stonith:external/riloe) Started
>  st-soyuz (stonith:external/riloe) Stopped
>  Master/Slave Set: ms-drbd-mail
>      Masters: [ soyuz ]
>      Stopped: [ drbd-mail:1 ]
>  Resource Group: mailservice
>      fs-mail (ocf::heartbeat:Filesystem) Stopped
>      ip-mail (ocf::heartbeat:IPaddr2) Stopped
>
> After about half an hour (and clearing up failure messages) it sorts
> itself out and starts working again correctly.

You have to clean up the resources after manual intervention.

> Can anyone offer some advice as to why it might be doing this, please?

These are two issues:

a) why the drbd resource can't start after the split-brain, and
b) why it takes such a long time for the cluster to recover after the
   resource recovery and cleanup.

Issue a) is, I guess, a configuration issue, i.e. it depends on whether
you chose automatic or manual split-brain recovery. I'm just guessing,
not being an expert on drbd, but there should be very good
documentation at LINBIT's site.

For b) there is not enough information, i.e. an hb_report would be
required. Or was it that you ran the resource cleanup only later?
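For a) and b), a minimal sketch of the manual path, assuming DRBD 8.3-style `drbdadm` and assuming gemini (the node that lost power) is the split-brain victim whose changes get discarded. The `drbdadm` sequence follows the manual split-brain recovery procedure in the DRBD 8.3 User's Guide; passing the master/slave set `ms-drbd-mail` (rather than `drbd-mail:1`) to `crm resource cleanup`, the helper names, and the `DRY_RUN` switch are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for the manual path. Set DRY_RUN=1 to print the
# commands instead of executing them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# a) On the split-brain victim (assumed here to be gemini): demote and
#    reconnect while discarding this node's diverging changes.
#    WARNING: writes made only on this node are thrown away.
recover_victim() {
    run drbdadm secondary "$1" || return 1
    run drbdadm -- --discard-my-data connect "$1"
}

# a) On the survivor (soyuz), only needed if it is in StandAlone state.
recover_survivor() {
    run drbdadm connect "$1"
}

# After any manual intervention, clear the failed start so that
# Pacemaker re-evaluates the resource.
cleanup_cluster() {
    run crm resource cleanup "$1"
}

# b) Gather logs and cluster state from the incident onwards.
#    $1 = start time, $2 = destination for the report.
collect_report() {
    run hb_report -f "$1" "$2"
}
```

In this sketch, `recover_victim mail` would run on gemini, `recover_survivor mail` on soyuz only if needed, then `cleanup_cluster ms-drbd-mail` once on either node; `collect_report "2009/11/04 09:30" /tmp/hb-mail` uses an illustrative start time only. Running any helper with `DRY_RUN=1` shows exactly what it would execute.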
Thanks,

Dejan

> Thanks,
> Ashley
> - --
> Ashley Woltering, Systems Analyst,
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.4-svn0 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iD8DBQFK8VoPh854NVK99FMRAlHBAJ9vum7mZYteZeXjai6fIt4JhHvrOACdHJ2i
> ReEW/RmM9YOnV9y2UN1ncAA=
> =z1nn
> -----END PGP SIGNATURE-----
> _______________________________________________
> Linux-HA mailing list
> Linux-HA[at]lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems