Mailing List Archive: Linux-HA: Users

cman+pacemaker+drbd fencing problem

 

 



seligman at nevis

Feb 27, 2012, 4:49 PM

Post #1 of 18
cman+pacemaker+drbd fencing problem

I'm trying to set up an active/active HA cluster as explained in Clusters From
Scratch (which I just re-read after my last problem).

I'll give versions and config files below, but I'll start with what happens. I
start with an active/active cman+pacemaker+drbd+gfs2 cluster, with fencing
enabled. My fencing mechanism cuts power to a node by turning the load off in
its UPS. The two nodes are hypatia-tb and orestes-tb.

I want to test fencing and recovery. I start with both nodes running, and
resources properly running on both nodes. Then I simulate failure on one node,
e.g., orestes-tb. I've done this with "crm node standby", "service pacemaker
off", or by pulling the plug. As expected, all the resources move to hypatia-tb,
with the drbd resource as Primary.

When I try to bring orestes-tb back into the cluster with "crm node online" or
"service pacemaker on" (the inverse of how I removed it), orestes-tb is fenced.
OK, that makes sense, I guess; there's a potential split-brain situation.

I bring orestes-tb back up, with the intent of adding it back into the cluster.
I make sure the cman, pacemaker, and drbd services are off at system start. On
orestes-tb, I type "service drbd start".

What I expect to happen is that the drbd resource on orestes-tb is marked
"Outdated" or something like that. Then I'd fix it with "drbdadm
--discard-my-data connect admin" or whatever is appropriate.
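
For reference, the usual manual recovery in that situation is roughly the
following (a sketch based on the DRBD documentation, not something I have
tested here; "admin" is the resource name, and the exact syntax differs
between DRBD 8.3 and 8.4):

  # on the node whose changes are to be discarded (here orestes-tb)
  drbdadm secondary admin
  drbdadm connect --discard-my-data admin        # DRBD 8.4 syntax
  # drbdadm -- --discard-my-data connect admin   # DRBD 8.3 syntax

  # on the surviving node, only if it reports StandAlone in /proc/drbd
  drbdadm connect admin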

What actually happens is that hypatia-tb is fenced. Since this is the node
running all the resources, this is bad behavior. It's even more puzzling when I
consider that, at the time, there isn't any fencing resource actually running on
orestes-tb; my guess is that DRBD on hypatia-tb is fencing itself.

Eventually hypatia-tb reboots, and the cluster goes back to normal. But as a
fencing/stability/HA test, this is a failure.

I've repeated this with a number of variations. In the end, both systems have to
be fenced/rebooted before the cluster is working again.

Any ideas?

Versions:

Scientific Linux 6.2
kernel 2.6.32
cman-3.0.12
corosync-1.4.1
pacemaker-1.1.6
drbd-8.4.1

/etc/drbd.d/global-common.conf:

global {
    usage-count yes;
}

common {
    startup {
        wfc-timeout             60;
        degr-wfc-timeout        60;
        outdated-wfc-timeout    60;
    }
}

/etc/drbd.d/admin.res:

resource admin {

    protocol C;

    on hypatia-tb.nevis.columbia.edu {
        volume 0 {
            device              /dev/drbd0;
            disk                /dev/md2;
            flexible-meta-disk  internal;
        }
        address 192.168.100.7:7788;
    }
    on orestes-tb.nevis.columbia.edu {
        volume 0 {
            device              /dev/drbd0;
            disk                /dev/md2;
            flexible-meta-disk  internal;
        }
        address 192.168.100.6:7788;
    }

    startup {
    }

    net {
        allow-two-primaries     yes;
        after-sb-0pri           discard-zero-changes;
        after-sb-1pri           discard-secondary;
        after-sb-2pri           disconnect;
        sndbuf-size             0;
    }

    disk {
        resync-rate     100M;
        c-max-rate      100M;
        al-extents      3389;
        fencing         resource-only;
    }
}

An edited output of "crm configure show":

node hypatia-tb.nevis.columbia.edu
node orestes-tb.nevis.columbia.edu
primitive StonithHypatia stonith:fence_nut \
  params pcmk_host_check="static-list" \
  pcmk_host_list="hypatia-tb.nevis.columbia.edu" \
  ups="sofia-ups" username="admin" password="XXX"
primitive StonithOrestes stonith:fence_nut \
  params pcmk_host_check="static-list" \
  pcmk_host_list="orestes-tb.nevis.columbia.edu" \
  ups="dc-test-stand-ups" username="admin" password="XXX"
location StonithHypatiaLocation StonithHypatia \
  -inf: hypatia-tb.nevis.columbia.edu
location StonithOrestesLocation StonithOrestes \
  -inf: orestes-tb.nevis.columbia.edu

/etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster config_version="17" name="Nevis_HA">
  <logging debug="off"/>
  <cman expected_votes="1" two_node="1" />
  <clusternodes>
    <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
      <altname name="hypatia-private.nevis.columbia.edu" port="5405"
               mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
      <altname name="orestes-private.nevis.columbia.edu" port="5405"
               mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <fence_daemon post_join_delay="30" />
  <rm disabled="1" />
</cluster>


The log messages on orestes-tb, just before hypatia-tb is fenced (there are no
messages in the hypatia-tb log for this time):

Feb 15 16:52:27 orestes-tb kernel: drbd: initialized. Version: 8.4.1
(api:1/proto:86-100)
Feb 15 16:52:27 orestes-tb kernel: drbd: GIT-hash:
91b4c048c1a0e06777b5f65d312b38d47abaea80 build by
root [at] orestes-tb, 2012-02-14 17:05:32
Feb 15 16:52:27 orestes-tb kernel: drbd: registered as block device major 147
Feb 15 16:52:27 orestes-tb kernel: d-con admin: Starting worker thread (from
drbdsetup [2570])
Feb 15 16:52:27 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
Feb 15 16:52:27 orestes-tb kernel: d-con admin: Method to ensure write ordering:
barrier
Feb 15 16:52:27 orestes-tb kernel: block drbd0: max BIO size = 130560
Feb 15 16:52:27 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing
device's (32 -> 768)
Feb 15 16:52:27 orestes-tb kernel: block drbd0: drbd_bm_resize called with
capacity == 5611549368
Feb 15 16:52:27 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671
words=10960058 pages=21407
Feb 15 16:52:27 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
Feb 15 16:52:28 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took
634 jiffies
Feb 15 16:52:28 orestes-tb kernel: block drbd0: recounting of set bits took
additional 92 jiffies
Feb 15 16:52:28 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync
by on disk bit-map.
Feb 15 16:52:28 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
Feb 15 16:52:28 orestes-tb kernel: block drbd0: attached to UUIDs
F5355FCF6114F218:0000000000000000:8A5519C7090D6BD6:8A5419C7090D6BD6
Feb 15 16:52:28 orestes-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
Feb 15 16:52:28 orestes-tb kernel: d-con admin: Starting receiver thread (from
drbd_w_admin [2572])
Feb 15 16:52:28 orestes-tb kernel: d-con admin: receiver (re)started
Feb 15 16:52:28 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
Feb 15 16:52:29 orestes-tb kernel: d-con admin: Handshake successful: Agreed
network protocol version 100
Feb 15 16:52:29 orestes-tb kernel: d-con admin: conn( WFConnection ->
WFReportParams )
Feb 15 16:52:29 orestes-tb kernel: d-con admin: Starting asender thread (from
drbd_r_admin [2579])
Feb 15 16:52:29 orestes-tb kernel: block drbd0: drbd_sync_handshake:
Feb 15 16:52:29 orestes-tb kernel: block drbd0: self
F5355FCF6114F218:0000000000000000:8A5519C7090D6BD6:8A5419C7090D6BD6 bits:0 flags:0
Feb 15 16:52:29 orestes-tb kernel: block drbd0: peer
06B93A6C54D6D631:F5355FCF6114F219:8A5519C7090D6BD6:8A5419C7090D6BD6 bits:615 flags:0
Feb 15 16:52:29 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
Feb 15 16:52:29 orestes-tb kernel: block drbd0: peer( Unknown -> Primary ) conn(
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Feb 15 16:52:29 orestes-tb kernel: block drbd0: receive bitmap stats
[Bytes(packets)]: plain 0(0), RLE 39(1), total 39; compression: 100.0%
Feb 15 16:52:29 orestes-tb kernel: block drbd0: send bitmap stats
[Bytes(packets)]: plain 0(0), RLE 39(1), total 39; compression: 100.0%
Feb 15 16:52:29 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
Feb 15 16:52:50 orestes-tb kernel: d-con admin: PingAck did not arrive in time.
Feb 15 16:52:50 orestes-tb kernel: d-con admin: peer( Primary -> Unknown ) conn(
WFSyncUUID -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Feb 15 16:52:50 orestes-tb kernel: d-con admin: asender terminated
Feb 15 16:52:50 orestes-tb kernel: d-con admin: Terminating asender thread
Feb 15 16:52:51 orestes-tb kernel: block drbd0: bitmap WRITE of 3 pages took 247
jiffies
Feb 15 16:52:51 orestes-tb kernel: block drbd0: 2460 KB (615 bits) marked
out-of-sync by on disk bit-map.
Feb 15 16:52:51 orestes-tb kernel: d-con admin: Connection closed
Feb 15 16:52:51 orestes-tb kernel: d-con admin: conn( NetworkFailure ->
Unconnected )
Feb 15 16:52:51 orestes-tb kernel: d-con admin: receiver terminated
Feb 15 16:52:51 orestes-tb kernel: d-con admin: Restarting receiver thread
Feb 15 16:52:51 orestes-tb kernel: d-con admin: receiver (re)started
Feb 15 16:52:51 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )

--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
PO Box 137 |
Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/


andrew at beekhof

Feb 27, 2012, 5:40 PM

Post #2 of 18
Re: cman+pacemaker+drbd fencing problem [In reply to]

On Tue, Feb 28, 2012 at 11:49 AM, William Seligman
<seligman [at] nevis> wrote:
> I'm trying to set up an active/active HA cluster as explained in Clusters From
> Scratch (which I just re-read after my last problem).
>
> I'll give versions and config files below, but I'll start with what happens. I
> start with an active/active cman+pacemaker+drbd+gfs2 cluster, with fencing
> enabled. My fencing mechanism cuts power to a node by turning the load off in
> its UPS. The two nodes are hypatia-tb and orestes-tb.
>
> I want to test fencing and recovery. I start with both nodes running, and
> resources properly running on both nodes. Then I simulate failure on one node,
> e.g., orestes-tb. I've done this with "crm node standby", "service pacemaker
> off", or by pulling the plug. As expected, all the resources move to hypatia-tb,
> with the drbd resource as Primary.
>
> When I try to bring orestes-tb back into the cluster with "crm node online" or
> "service pacemaker on" (the inverse of how I removed it), orestes-tb is fenced.
> OK, that makes sense, I guess; there's a potential split-brain situation.

Not really, that should only happen if the two nodes can't see each
other. Which should not be the case.
Only when you pull the plug should orestes-tb be fenced.

Or if you're using a fencing device that requires the node to have
power, then I can imagine that turning it on again might result in
fencing.
But not for the other cases.


>
> I bring orestes-tb back up, with the intent of adding it back into the cluster.
> I make sure cman, pacemaker, and drbd services were off at system start. On
> orestes-tb, I type "service drbd start".
>
> What I expect to happen is that the drbd resource on orestes-tb is marked
> "Outdated" or something like that. Then I'd fix it with "drbdadm
> --discard-my-data connect admin" or whatever is appropriate.
>
> What actually happens is that hypatia-tb is fenced. Since this is the node
> running all the resources, this is bad behavior. It's even more puzzling when I
> consider that at, the time, there isn't any fencing resource actually running on
> orestes-tb; my guess is that DRBD on hypatia-tb is fencing itself.
>
> Eventually hypatia-tb reboots, and the cluster goes back to normal. But as a
> fencing/stability/HA test, this is a failure.
>
> I've repeated this with a number of variations. In the end, both systems have to
> be fenced/rebooted before the cluster is working again.
>
> Any ideas?
>
> Versions:
>
> Scientific Linux 6.2
> kernel 2.6.32
> cman-3.0.12
> corosync-1.4.1
> pacemaker-1.1.6
> drbd-8.4.1
>
> /etc/drbd.d/global-common.conf:
>
> global {
> usage-count yes;
> }
>
> common {
> startup {
> wfc-timeout 60;
> degr-wfc-timeout 60;
> outdated-wfc-timeout 60;
> }
> }
>
> /etc/drbd.d/admin.res:
>
> resource admin {
>
> protocol C;
>
> on hypatia-tb.nevis.columbia.edu {
> volume 0 {
> device /dev/drbd0;
> disk /dev/md2;
> flexible-meta-disk internal;
> }
> address 192.168.100.7:7788;
> }
> on orestes-tb.nevis.columbia.edu {
> volume 0 {
> device /dev/drbd0;
> disk /dev/md2;
> flexible-meta-disk internal;
> }
> address 192.168.100.6:7788;
> }
>
> startup {
> }
>
> net {
> allow-two-primaries yes;
> after-sb-0pri discard-zero-changes;
> after-sb-1pri discard-secondary;
> after-sb-2pri disconnect;
> sndbuf-size 0;
> }
>
> disk {
> resync-rate 100M;
> c-max-rate 100M;
> al-extents 3389;
> fencing resource-only;
> }
>
> An edited output of "crm configure show":
>
> node hypatia-tb.nevis.columbia.edu
> node orestes-tb.nevis.columbia.edu
> primitive StonithHypatia stonith:fence_nut \
> params pcmk_host_check="static-list" \
> pcmk_host_list="hypatia-tb.nevis.columbia.edu" \
> ups="sofia-ups" username="admin" password="XXX"
> primitive StonithOrestes stonith:fence_nut \
> params pcmk_host_check="static-list" \
> pcmk_host_list="orestes-tb.nevis.columbia.edu"
> ups="dc-test-stand-ups" username="admin" password="XXX"
> location StonithHypatiaLocation StonithHypatia \
> -inf: hypatia-tb.nevis.columbia.edu
> location StonithOrestesLocation StonithOrestes \
> -inf: orestes-tb.nevis.columbia.edu
>
> /etc/cluster/cluster.conf:
>
> <?xml version="1.0"?>
> <cluster config_version="17" name="Nevis_HA">
> <logging debug="off"/>
> <cman expected_votes="1" two_node="1" />
> <clusternodes>
> <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
> <altname name="hypatia-private.nevis.columbia.edu" port="5405"
> mcast="226.94.1.1"/>
> <fence>
> <method name="pcmk-redirect">
> <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
> <altname name="orestes-private.nevis.columbia.edu" port="5405"
> mcast="226.94.1.1"/>
> <fence>
> <method name="pcmk-redirect">
> <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
> </method>
> </fence>
> </clusternode>
> </clusternodes>
> <fencedevices>
> <fencedevice name="pcmk" agent="fence_pcmk"/>
> </fencedevices>
> <fence_daemon post_join_delay="30" />
> <rm disabled="1" />
> </cluster>
>
>
> The log messages on orestes-tb, just before hypatia-tb is fenced (there are no
> messages in the hypatia-tb log for this time):
>
> Feb 15 16:52:27 orestes-tb kernel: drbd: initialized. Version: 8.4.1
> (api:1/proto:86-100)
> Feb 15 16:52:27 orestes-tb kernel: drbd: GIT-hash:
> 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by
> root [at] orestes-tb, 2012-02-14 17:05:32
> Feb 15 16:52:27 orestes-tb kernel: drbd: registered as block device major 147
> Feb 15 16:52:27 orestes-tb kernel: d-con admin: Starting worker thread (from
> drbdsetup [2570])
> Feb 15 16:52:27 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
> Feb 15 16:52:27 orestes-tb kernel: d-con admin: Method to ensure write ordering:
> barrier
> Feb 15 16:52:27 orestes-tb kernel: block drbd0: max BIO size = 130560
> Feb 15 16:52:27 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing
> device's (32 -> 768)
> Feb 15 16:52:27 orestes-tb kernel: block drbd0: drbd_bm_resize called with
> capacity == 5611549368
> Feb 15 16:52:27 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671
> words=10960058 pages=21407
> Feb 15 16:52:27 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
> Feb 15 16:52:28 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took
> 634 jiffies
> Feb 15 16:52:28 orestes-tb kernel: block drbd0: recounting of set bits took
> additional 92 jiffies
> Feb 15 16:52:28 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync
> by on disk bit-map.
> Feb 15 16:52:28 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
> Feb 15 16:52:28 orestes-tb kernel: block drbd0: attached to UUIDs
> F5355FCF6114F218:0000000000000000:8A5519C7090D6BD6:8A5419C7090D6BD6
> Feb 15 16:52:28 orestes-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
> Feb 15 16:52:28 orestes-tb kernel: d-con admin: Starting receiver thread (from
> drbd_w_admin [2572])
> Feb 15 16:52:28 orestes-tb kernel: d-con admin: receiver (re)started
> Feb 15 16:52:28 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
> Feb 15 16:52:29 orestes-tb kernel: d-con admin: Handshake successful: Agreed
> network protocol version 100
> Feb 15 16:52:29 orestes-tb kernel: d-con admin: conn( WFConnection ->
> WFReportParams )
> Feb 15 16:52:29 orestes-tb kernel: d-con admin: Starting asender thread (from
> drbd_r_admin [2579])
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: drbd_sync_handshake:
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: self
> F5355FCF6114F218:0000000000000000:8A5519C7090D6BD6:8A5419C7090D6BD6 bits:0 flags:0
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: peer
> 06B93A6C54D6D631:F5355FCF6114F219:8A5519C7090D6BD6:8A5419C7090D6BD6 bits:615 flags:0
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: peer( Unknown -> Primary ) conn(
> WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: receive bitmap stats
> [Bytes(packets)]: plain 0(0), RLE 39(1), total 39; compression: 100.0%
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: send bitmap stats
> [Bytes(packets)]: plain 0(0), RLE 39(1), total 39; compression: 100.0%
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
> Feb 15 16:52:50 orestes-tb kernel: d-con admin: PingAck did not arrive in time.
> Feb 15 16:52:50 orestes-tb kernel: d-con admin: peer( Primary -> Unknown ) conn(
> WFSyncUUID -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Feb 15 16:52:50 orestes-tb kernel: d-con admin: asender terminated
> Feb 15 16:52:50 orestes-tb kernel: d-con admin: Terminating asender thread
> Feb 15 16:52:51 orestes-tb kernel: block drbd0: bitmap WRITE of 3 pages took 247
> jiffies
> Feb 15 16:52:51 orestes-tb kernel: block drbd0: 2460 KB (615 bits) marked
> out-of-sync by on disk bit-map.
> Feb 15 16:52:51 orestes-tb kernel: d-con admin: Connection closed
> Feb 15 16:52:51 orestes-tb kernel: d-con admin: conn( NetworkFailure ->
> Unconnected )
> Feb 15 16:52:51 orestes-tb kernel: d-con admin: receiver terminated
> Feb 15 16:52:51 orestes-tb kernel: d-con admin: Restarting receiver thread
> Feb 15 16:52:51 orestes-tb kernel: d-con admin: receiver (re)started
> Feb 15 16:52:51 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
>
> --
> Bill Seligman | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
> PO Box 137 |
> Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/
>
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA [at] lists
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


andrew at beekhof

Feb 27, 2012, 5:41 PM

Post #3 of 18
Re: cman+pacemaker+drbd fencing problem [In reply to]

Oh, what does the fence_pcmk file look like?

On Tue, Feb 28, 2012 at 12:40 PM, Andrew Beekhof <andrew [at] beekhof> wrote:
> [...]
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


seligman at nevis

Feb 28, 2012, 10:21 AM

Post #4 of 18
Re: cman+pacemaker+drbd fencing problem [In reply to]

On 2/27/12 8:40 PM, Andrew Beekhof wrote:

> Oh, what does the fence_pcmk file look like?

This is a standard part of the pacemaker-1.1.6 package. According to

<http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_configuring_cman_fencing.html>

it causes any fencing requests from cman to be redirected to pacemaker.

Since you asked, I've attached a copy of the file. I note that if this script is
used to fence a system it writes to /var/log/messages using logger, and there is
no such log message in my logs. So I guess cman is off the hook.
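
As I understand it, the redirect amounts to something like this (a
paraphrased sketch, not the actual script, which is attached):

  # fence_pcmk, simplified: log the request, then hand it to Pacemaker
  logger -t fence_pcmk "Requesting Pacemaker fence $port ($action)"
  stonith_admin --reboot "$port"

so any fence request that came in through cman should have left a
fence_pcmk line in /var/log/messages.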

> On Tue, Feb 28, 2012 at 11:49 AM, William Seligman
> <seligman [at] nevis> wrote:
>> I'm trying to set up an active/active HA cluster as explained in Clusters From
>> Scratch (which I just re-read after my last problem).
>>
>> I'll give versions and config files below, but I'll start with what happens. I
>> start with an active/active cman+pacemaker+drbd+gfs2 cluster, with fencing
>> enabled. My fencing mechanism cuts power to a node by turning the load off in
>> its UPS. The two nodes are hypatia-tb and orestes-tb.
>>
>> I want to test fencing and recovery. I start with both nodes running, and
>> resources properly running on both nodes. Then I simulate failure on one node,
>> e.g., orestes-tb. I've done this with "crm node standby", "service pacemaker
>> off", or by pulling the plug. As expected, all the resources move to hypatia-tb,
>> with the drbd resource as Primary.
>>
>> When I try to bring orestes-tb back into the cluster with "crm node online" or
>> "service pacemaker on" (the inverse of how I removed it), orestes-tb is fenced.
>> OK, that makes sense, I guess; there's a potential split-brain situation.
>
> Not really, that should only happen if the two nodes can't see each
> other. Which should not be the case.
> Only when you pull the plug should orestes-tb be fenced.
>
> Or if you're using a fencing device that requires the node to have
> power, then I can imagine that turning it on again might result in
> fencing.
> But not for the other cases.

I ran a test: I turned off pacemaker (and so DRBD) on orestes-tb. I "touch"ed a
file on the hypatia-tb DRBD partition, to make it the "newer" one. Then I turned
off pacemaker on hypatia-tb. Finally I turned on just drbd on hypatia-tb, then
on orestes-tb.

From /var/log/messages on hypatia-tb:

Feb 28 11:39:19 hypatia-tb kernel: d-con admin: Starting worker thread (from
drbdsetup [21822])
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: disk( Diskless -> Attaching )
Feb 28 11:39:19 hypatia-tb kernel: d-con admin: Method to ensure write ordering:
barrier
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: max BIO size = 130560
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: Adjusting my ra_pages to backing
device's (32 -> 768)
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: drbd_bm_resize called with
capacity == 5611549368
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: resync bitmap: bits=701443671
words=10960058 pages=21407
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
Feb 28 11:39:19 hypatia-tb kernel: block drbd0: bitmap READ of 21407 pages took
576 jiffies
Feb 28 11:39:20 hypatia-tb kernel: block drbd0: recounting of set bits took
additional 87 jiffies
Feb 28 11:39:20 hypatia-tb kernel: block drbd0: 55 MB (14114 bits) marked
out-of-sync by on disk bit-map.
Feb 28 11:39:20 hypatia-tb kernel: block drbd0: disk( Attaching -> UpToDate )
pdsk( DUnknown -> Outdated )
Feb 28 11:39:20 hypatia-tb kernel: block drbd0: attached to UUIDs
862A336609FD27CD:BFFB722D5E3E15D7:6E63EC4258C86AF2:6E62EC4258C86AF2
Feb 28 11:39:20 hypatia-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
Feb 28 11:39:20 hypatia-tb kernel: d-con admin: Starting receiver thread (from
drbd_w_admin [21824])
Feb 28 11:39:20 hypatia-tb kernel: d-con admin: receiver (re)started
Feb 28 11:39:20 hypatia-tb kernel: d-con admin: conn( Unconnected -> WFConnection )


From /var/log/messages on orestes-tb:

Feb 28 11:39:51 orestes-tb kernel: d-con admin: Starting worker thread (from
drbdsetup [17827])
Feb 28 11:39:51 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
Feb 28 11:39:51 orestes-tb kernel: d-con admin: Method to ensure write ordering:
barrier
Feb 28 11:39:51 orestes-tb kernel: block drbd0: max BIO size = 130560
Feb 28 11:39:51 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing
device's (32 -> 768)
Feb 28 11:39:51 orestes-tb kernel: block drbd0: drbd_bm_resize called with
capacity == 5611549368
Feb 28 11:39:51 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671
words=10960058 pages=21407
Feb 28 11:39:51 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
Feb 28 11:39:52 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took
735 jiffies
Feb 28 11:39:52 orestes-tb kernel: block drbd0: recounting of set bits took
additional 93 jiffies
Feb 28 11:39:52 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync
by on disk bit-map.
Feb 28 11:39:52 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
Feb 28 11:39:52 orestes-tb kernel: block drbd0: attached to UUIDs
BFFB722D5E3E15D6:0000000000000000:6E63EC4258C86AF2:6E62EC4258C86AF2
Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
Feb 28 11:39:52 orestes-tb kernel: d-con admin: Starting receiver thread (from
drbd_w_admin [17829])
Feb 28 11:39:52 orestes-tb kernel: d-con admin: receiver (re)started
Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
Feb 28 11:39:52 orestes-tb kernel: d-con admin: Handshake successful: Agreed
network protocol version 100
Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( WFConnection ->
WFReportParams )
Feb 28 11:39:52 orestes-tb kernel: d-con admin: Starting asender thread (from
drbd_r_admin [17835])
Feb 28 11:39:52 orestes-tb kernel: block drbd0: drbd_sync_handshake:
Feb 28 11:39:52 orestes-tb kernel: block drbd0: self
BFFB722D5E3E15D6:0000000000000000:6E63EC4258C86AF2:6E62EC4258C86AF2 bits:0 flags:0
Feb 28 11:39:52 orestes-tb kernel: block drbd0: peer
862A336609FD27CC:BFFB722D5E3E15D7:6E63EC4258C86AF2:6E62EC4258C86AF2 bits:14114
flags:0
Feb 28 11:39:52 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
Feb 28 11:39:52 orestes-tb kernel: block drbd0: peer( Unknown -> Secondary )
conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Feb 28 11:39:52 orestes-tb kernel: block drbd0: receive bitmap stats
[Bytes(packets)]: plain 0(0), RLE 176(1), total 176; compression: 100.0%
Feb 28 11:39:52 orestes-tb kernel: block drbd0: send bitmap stats
[Bytes(packets)]: plain 0(0), RLE 176(1), total 176; compression: 100.0%
Feb 28 11:39:52 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
Feb 28 11:40:01 orestes-tb corosync[2193]: [TOTEM ] A processor failed,
forming new configuration.
Feb 28 11:40:03 orestes-tb corosync[2193]: [QUORUM] Members[1]: 2
Feb 28 11:40:03 orestes-tb corosync[2193]: [TOTEM ] A processor joined or left
the membership and a new membership was formed.
Feb 28 11:40:03 orestes-tb kernel: dlm: closing connection to node 1
Feb 28 11:40:03 orestes-tb corosync[2193]: [CPG ] chosen downlist: sender
r(0) ip(129.236.252.14) r(1) ip(192.168.100.6) ; members(old:2 left:1)
Feb 28 11:40:03 orestes-tb corosync[2193]: [MAIN ] Completed service
synchronization, ready to provide service.
Feb 28 11:40:03 orestes-tb fenced[2247]: fencing node hypatia-tb.nevis.columbia.edu


As far as I can tell, hypatia-tb's drbd comes up, says "I'm UpToDate" and waits
for a connection from orestes-tb. orestes-tb's drbd comes up, says "I'm
UpToDate" and starts the sync process with hypatia-tb. Then cman+corosync steps
in on orestes-tb and fences hypatia-tb, before the sync can proceed.

I ran another test. I did the same thing as the previous paragraph, except that
I made sure both cman and pacemaker were off (I had to reboot to make sure) and
just started drbd on both nodes. Sure enough, drbd was able to sync without
split-brain or fencing. So this is a cman/corosync issue, not a drbd issue.
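
(For completeness, the sync state can be watched with the usual commands,
e.g.:

  cat /proc/drbd
  drbdadm cstate admin   # SyncTarget/SyncSource while syncing, then Connected
  drbdadm dstate admin   # ends up UpToDate/UpToDate
)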

While I was setting up the test for the previous paragraph, there was a problem
with another resource (ocf:heartbeat:exportfs) that couldn't be properly
monitored on either node. This led to a cycle of fencing where each node would
successively fence the other because the exportfs resource couldn't run on
either node. I had to quickly change my configuration to turn off monitoring on
the resource.
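
(Turning off monitoring for an operation can be done along these lines; the
exportfs primitive isn't in the edited configuration above, so the resource
name here is made up:

  crm configure edit ExportsAdmin
  # in the primitive definition, change the monitor op to:
  #   op monitor interval="30s" enabled="false"
)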

So it seems like cman+corosync is the issue. It's as if I'm "over-fencing."

Any ideas?

>> [...]


--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
PO Box 137 |
Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/
Attachments: fence_pcmk (3.71 KB)


lars.ellenberg at linbit

Feb 28, 2012, 11:09 AM

Post #5 of 18
Re: cman+pacemaker+drbd fencing problem [In reply to]

On Tue, Feb 28, 2012 at 01:21:51PM -0500, William Seligman wrote:
> On 2/27/12 8:40 PM, Andrew Beekhof wrote:
>
> > Oh, what does the fence_pcmk file look like?
>
> This is a standard part of the pacemaker-1.1.6 package. According to
>
> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_configuring_cman_fencing.html>
>
> it causes any fencing requests from cman to be redirected to pacemaker.
>
> Since you asked, I've attached a copy of the file. I note that if this script is
> used to fence a system it writes to /var/log/messages using logger, and there is
> no such log message in my logs. So I guess cman is off the hook.

You say "fencing resource-only" in drbd.conf.
But you did not show the fencing handler used.
Did you specify one at all?

Besides, for a dual-primary DRBD setup, you must have "fencing
resource-and-stonith;", and you should use a DRBD fencing handler
that really fences off the peer. It may additionally set constraints.
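
The configuration side of that is roughly the following (a sketch; the
handler scripts shown ship with DRBD's Pacemaker integration, and whether
a given handler "really" fences the peer is the point being made here):

  disk {
      fencing resource-and-stonith;
  }
  handlers {
      fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }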

Also, maybe that post helps to realize some of the problems involved:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/62927#62927

Especially the part about
    But just because you can shoot someone
    does not mean you have the bi^Wbetter data.

Because of the increased complexity, I strongly recommend against dual
primary DRBD, unless you have a very good reason to want it.

"Because it can be done" does not count as good reason in that context

;-)

More comments below.

> >> [...]
>
> I ran a test: I turned off pacemaker (and so DRBD) on orestes-tb. I "touch"ed a
> file on the hypatia-tb DRBD partition, to make it the "newer" one.
> Then I turned
> off pacemaker on hypatia-tb. Finally I turned on just drbd on hypatia-tb, then
> on orestes-tb.
>
> From /var/log/messages on hypatia-tb:
>
> Feb 28 11:39:19 hypatia-tb kernel: d-con admin: Starting worker thread (from
> drbdsetup [21822])
> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: disk( Diskless -> Attaching )
> Feb 28 11:39:19 hypatia-tb kernel: d-con admin: Method to ensure write ordering:
> barrier
> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: max BIO size = 130560
> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: Adjusting my ra_pages to backing
> device's (32 -> 768)
> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: drbd_bm_resize called with
> capacity == 5611549368
> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: resync bitmap: bits=701443671
> words=10960058 pages=21407
> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: bitmap READ of 21407 pages took
> 576 jiffies
> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: recounting of set bits took
> additional 87 jiffies
> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: 55 MB (14114 bits) marked
> out-of-sync by on disk bit-map.
> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: disk( Attaching -> UpToDate )
> pdsk( DUnknown -> Outdated )
> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: attached to UUIDs
> 862A336609FD27CD:BFFB722D5E3E15D7:6E63EC4258C86AF2:6E62EC4258C86AF2
> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: Starting receiver thread (from
> drbd_w_admin [21824])
> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: receiver (re)started
> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
>
>
> From /var/log/messages on orestes-tb:
>
> Feb 28 11:39:51 orestes-tb kernel: d-con admin: Starting worker thread (from
> drbdsetup [17827])
> Feb 28 11:39:51 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
> Feb 28 11:39:51 orestes-tb kernel: d-con admin: Method to ensure write ordering:
> barrier
> Feb 28 11:39:51 orestes-tb kernel: block drbd0: max BIO size = 130560
> Feb 28 11:39:51 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing
> device's (32 -> 768)
> Feb 28 11:39:51 orestes-tb kernel: block drbd0: drbd_bm_resize called with
> capacity == 5611549368
> Feb 28 11:39:51 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671
> words=10960058 pages=21407
> Feb 28 11:39:51 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took
> 735 jiffies
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: recounting of set bits took
> additional 93 jiffies
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync
> by on disk bit-map.
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: attached to UUIDs
> BFFB722D5E3E15D6:0000000000000000:6E63EC4258C86AF2:6E62EC4258C86AF2
> Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
> Feb 28 11:39:52 orestes-tb kernel: d-con admin: Starting receiver thread (from
> drbd_w_admin [17829])
> Feb 28 11:39:52 orestes-tb kernel: d-con admin: receiver (re)started
> Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
> Feb 28 11:39:52 orestes-tb kernel: d-con admin: Handshake successful: Agreed
> network protocol version 100
> Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( WFConnection ->
> WFReportParams )
> Feb 28 11:39:52 orestes-tb kernel: d-con admin: Starting asender thread (from
> drbd_r_admin [17835])
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: drbd_sync_handshake:
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: self
> BFFB722D5E3E15D6:0000000000000000:6E63EC4258C86AF2:6E62EC4258C86AF2 bits:0 flags:0
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: peer
> 862A336609FD27CC:BFFB722D5E3E15D7:6E63EC4258C86AF2:6E62EC4258C86AF2 bits:14114
> flags:0
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: receive bitmap stats
> [Bytes(packets)]: plain 0(0), RLE 176(1), total 176; compression: 100.0%
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: send bitmap stats
> [Bytes(packets)]: plain 0(0), RLE 176(1), total 176; compression: 100.0%
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
> Feb 28 11:40:01 orestes-tb corosync[2193]: [TOTEM ] A processor failed,
> forming new configuration.
> Feb 28 11:40:03 orestes-tb corosync[2193]: [QUORUM] Members[1]: 2
> Feb 28 11:40:03 orestes-tb corosync[2193]: [TOTEM ] A processor joined or left
> the membership and a new membership was formed.
> Feb 28 11:40:03 orestes-tb kernel: dlm: closing connection to node 1
> Feb 28 11:40:03 orestes-tb corosync[2193]: [CPG ] chosen downlist: sender
> r(0) ip(129.236.252.14) r(1) ip(192.168.100.6) ; members(old:2 left:1)
> Feb 28 11:40:03 orestes-tb corosync[2193]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Feb 28 11:40:03 orestes-tb fenced[2247]: fencing node hypatia-tb.nevis.columbia.edu
>
>
> As far as I can tell, hypatia-tb's drbd comes up, says "I'm UpToDate" and waits
> for a connection from orestes-tb. orestes-tb's drbd comes up, says "I'm
> UpToDate"

No, it clearly says "I'm Outdated" from the logs above:
| Feb 28 11:39:52 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )

It outdated itself voluntarily when it was told to disconnect from a
still running primary, because of your "fencing resource-only" configuration.

Don't rely on that: in a real incident, the replication link will just
fail, in which case you really need the "fencing resource-and-stonith",
and a suitable fence-peer handler.
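
Roughly, the shape would be something like this (the handler path below is only a
placeholder, not a specific script recommendation from this thread):

disk {
        fencing resource-and-stonith;   # suspend I/O on the survivor until the handler reports back
}
handlers {
        # must be a handler that really shoots the peer, not one that only sets a constraint
        fence-peer          "/usr/local/sbin/my-fence-peer-and-stonith.sh";   # placeholder path
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}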

> and starts the sync process with hypatia-tb. Then cman+corosync steps
> in on orestes-tb and fences hypatia-tb, before the sync can proceed.
>
> I ran another test. I did the same thing as the previous paragraph, except that
> I made sure both cman and pacemaker were off (I had to reboot to make sure) and
> just started drbd on both nodes. Sure enough, drbd was able to sync without
> split-brain or fencing. So this is a cman/corosync issue, not a drbd issue.

You still may retry the whole thing with drbd 8.3.12,
just to make sure there is no hidden DRBD 8.4.1 instability.

> While I was setting up the test for the previous paragraph, there was a problem
> with another resource (ocf:heartbeat:exportfs) that couldn't be properly
> monitored on either node. This led to a cycle of fencing where each node would
> successively fence the other because the exportfs resource couldn't run on
> either node. I had to quickly change my configuration to turn off monitoring on
> the resource.
>
> So it seems like cman+corosync is the issue. It's as if I'm "over-fencing."
>
> Any ideas?


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


seligman at nevis

Feb 28, 2012, 12:51 PM

Post #6 of 18 (2365 views)
Permalink
Re: cman+pacemaker+drbd fencing problem [In reply to]

On 2/28/12 2:09 PM, Lars Ellenberg wrote:
> On Tue, Feb 28, 2012 at 01:21:51PM -0500, William Seligman wrote:
>> On 2/27/12 8:40 PM, Andrew Beekhof wrote:
>>
>>> Oh, what does the fence_pcmk file look like?
>>
>> This is a standard part of the pacemaker-1.1.6 package. According to
>>
>> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_configuring_cman_fencing.html>
>>
>> it causes any fencing requests from cman to be redirected to pacemaker.
>>
>> Since you asked, I've attached a copy of the file. I note that if this script is
>> used to fence a system it writes to /var/log/messages using logger, and there is
>> no such log message in my logs. So I guess cman is off the hook.
>
> You say "fencing resource-only" in drbd.conf.
> But you did not show the fencing handler used?
> Did you specify one at all?

It looks like I "over-edited" when I got rid of the comments before I posted my
configuration. The relevant sections are:

disk {
fencing resource-only;
}
handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh;
/usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh;
/usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
local-io-error "/usr/lib/drbd/notify-io-error.sh;
/usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
split-brain "/usr/lib/drbd/notify-split-brain.sh
sysadmin [at] nevis";
fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}


> Besides, for a dual-primary DRBD setup, you must have "fencing
> resource-and-stonith;", and you should use a DRBD fencing handler
> that really fences off the peer. It may additionally set constraints.

Do crm-fence-peer.sh or Lon Hohberger's obliterate-peer.sh "really" fence off a
peer? I suspect your answer will be no, since from what I can tell in a
cman+pacemaker configuration they both wind up calling stonith_admin.

> Also, maybe that post helps to realize some of the problems involved:
> http://www.gossamer-threads.com/lists/linuxha/pacemaker/62927#62927
>
> Especially the part about
> But just because you can shoot someone
> does not mean you have the bi^Wbetter data.
>
> Because of the increased complexity, I strongly recommend against dual
> primary DRBD, unless you have a very good reason to want it.
>
> "Because it can be done" does not count as good reason in that context

<off-topic>
Sigh. I wish that were the reason.

The reason why I'm doing dual-primary is that I've got a single-primary
two-node cluster in production that simply doesn't work. One node runs
resources; the other sits and twiddles its fingers; fine. But when the primary goes
down, the secondary has trouble starting up all the resources; when we've actually
had primary failures (UPS goes haywire, hard drive failure), the secondary often
winds up in a state in which it runs none of the significant resources.

With the dual-primary setup I have now, both machines are running the resources
that typically cause problems in my single-primary configuration. If one box
goes down, the other doesn't have to failover anything; it's already running
them. (I needed IPaddr2 cloning to work properly for this to work, which is why
I started that thread... and all the stupider of me for missing that crucial
page in Clusters From Scratch.)

My only remaining problem with the configuration is restoring a fenced node to
the cluster. Hence my tests, and the reason why I started this thread.
</off-topic>

> More comments below.
>
>>> On Tue, Feb 28, 2012 at 11:49 AM, William Seligman
>>> <seligman [at] nevis> wrote:
>>>> I'm trying to set up an active/active HA cluster as explained in Clusters From
>>>> Scratch (which I just re-read after my last problem).
>>>>
>>>> I'll give versions and config files below, but I'll start with what happens. I
>>>> start with an active/active cman+pacemaker+drbd+gfs2 cluster, with fencing
>>>> enabled. My fencing mechanism cuts power to a node by turning the load off in
>>>> its UPS. The two nodes are hypatia-tb and orestes-tb.
>>>>
>>>> I want to test fencing and recovery. I start with both nodes running, and
>>>> resources properly running on both nodes. Then I simulate failure on one node,
>>>> e.g., orestes-tb. I've done this with "crm node standby", "service pacemaker
>>>> off", or by pulling the plug. As expected, all the resources move to hypatia-tb,
>>>> with the drbd resource as Primary.
>>>>
>>>> When I try to bring orestes-tb back into the cluster with "crm node online" or
>>>> "service pacemaker on" (the inverse of how I removed it), orestes-tb is fenced.
>>>> OK, that makes sense, I guess; there's a potential split-brain situation.
>>>
>>> Not really, that should only happen if the two nodes can't see each
>>> other. Which should not be the case.
>>> Only when you pull the plug should orestes-tb be fenced.
>>>
>>> Or if you're using a fencing device that requires the node to have
>>> power, then I can imagine that turning it on again might result in
>>> fencing.
>>> But not for the other cases.
>>
>> I ran a test: I turned off pacemaker (and so DRBD) on orestes-tb. I "touch"ed a
>> file on the hypatia-tb DRBD partition, to make it the "newer" one.
>> Then I turned
>> off pacemaker on hypatia-tb. Finally I turned on just drbd on hypatia-tb, then
>> on orestes-tb.
>>
>> From /var/log/messages on hypatia-tb:
>>
>> Feb 28 11:39:19 hypatia-tb kernel: d-con admin: Starting worker thread (from
>> drbdsetup [21822])
>> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: disk( Diskless -> Attaching )
>> Feb 28 11:39:19 hypatia-tb kernel: d-con admin: Method to ensure write ordering:
>> barrier
>> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: max BIO size = 130560
>> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: Adjusting my ra_pages to backing
>> device's (32 -> 768)
>> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: drbd_bm_resize called with
>> capacity == 5611549368
>> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: resync bitmap: bits=701443671
>> words=10960058 pages=21407
>> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
>> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: bitmap READ of 21407 pages took
>> 576 jiffies
>> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: recounting of set bits took
>> additional 87 jiffies
>> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: 55 MB (14114 bits) marked
>> out-of-sync by on disk bit-map.
>> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: disk( Attaching -> UpToDate )
>> pdsk( DUnknown -> Outdated )
>> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: attached to UUIDs
>> 862A336609FD27CD:BFFB722D5E3E15D7:6E63EC4258C86AF2:6E62EC4258C86AF2
>> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
>> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: Starting receiver thread (from
>> drbd_w_admin [21824])
>> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: receiver (re)started
>> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
>>
>>
>> From /var/log/messages on orestes-tb:
>>
>> Feb 28 11:39:51 orestes-tb kernel: d-con admin: Starting worker thread (from
>> drbdsetup [17827])
>> Feb 28 11:39:51 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
>> Feb 28 11:39:51 orestes-tb kernel: d-con admin: Method to ensure write ordering:
>> barrier
>> Feb 28 11:39:51 orestes-tb kernel: block drbd0: max BIO size = 130560
>> Feb 28 11:39:51 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing
>> device's (32 -> 768)
>> Feb 28 11:39:51 orestes-tb kernel: block drbd0: drbd_bm_resize called with
>> capacity == 5611549368
>> Feb 28 11:39:51 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671
>> words=10960058 pages=21407
>> Feb 28 11:39:51 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took
>> 735 jiffies
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: recounting of set bits took
>> additional 93 jiffies
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync
>> by on disk bit-map.
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: attached to UUIDs
>> BFFB722D5E3E15D6:0000000000000000:6E63EC4258C86AF2:6E62EC4258C86AF2
>> Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
>> Feb 28 11:39:52 orestes-tb kernel: d-con admin: Starting receiver thread (from
>> drbd_w_admin [17829])
>> Feb 28 11:39:52 orestes-tb kernel: d-con admin: receiver (re)started
>> Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
>> Feb 28 11:39:52 orestes-tb kernel: d-con admin: Handshake successful: Agreed
>> network protocol version 100
>> Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( WFConnection ->
>> WFReportParams )
>> Feb 28 11:39:52 orestes-tb kernel: d-con admin: Starting asender thread (from
>> drbd_r_admin [17835])
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: drbd_sync_handshake:
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: self
>> BFFB722D5E3E15D6:0000000000000000:6E63EC4258C86AF2:6E62EC4258C86AF2 bits:0 flags:0
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: peer
>> 862A336609FD27CC:BFFB722D5E3E15D7:6E63EC4258C86AF2:6E62EC4258C86AF2 bits:14114
>> flags:0
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: peer( Unknown -> Secondary )
>> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: receive bitmap stats
>> [Bytes(packets)]: plain 0(0), RLE 176(1), total 176; compression: 100.0%
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: send bitmap stats
>> [Bytes(packets)]: plain 0(0), RLE 176(1), total 176; compression: 100.0%
>> Feb 28 11:39:52 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
>> Feb 28 11:40:01 orestes-tb corosync[2193]: [TOTEM ] A processor failed,
>> forming new configuration.
>> Feb 28 11:40:03 orestes-tb corosync[2193]: [QUORUM] Members[1]: 2
>> Feb 28 11:40:03 orestes-tb corosync[2193]: [TOTEM ] A processor joined or left
>> the membership and a new membership was formed.
>> Feb 28 11:40:03 orestes-tb kernel: dlm: closing connection to node 1
>> Feb 28 11:40:03 orestes-tb corosync[2193]: [CPG ] chosen downlist: sender
>> r(0) ip(129.236.252.14) r(1) ip(192.168.100.6) ; members(old:2 left:1)
>> Feb 28 11:40:03 orestes-tb corosync[2193]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Feb 28 11:40:03 orestes-tb fenced[2247]: fencing node hypatia-tb.nevis.columbia.edu
>>
>>
>> As far as I can tell, hypatia-tb's drbd comes up, says "I'm UpToDate" and waits
>> for a connection from orestes-tb. orestes-tb's drbd comes up, says "I'm
>> UpToDate"
>
> No, it clearly says "I'm Outdated" from the logs above:
> | Feb 28 11:39:52 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
>
> It outdated itself voluntarily when it was told to disconnect from a
> still running primary, because of your "fencing resource-only" configuration.
>
> Don't rely on that: in a real incident, the replication link will just
> fail, in which case you really need the "fencing resource-and-stonith",
> and a suitable fence-peer handler.
>
>> and starts the sync process with hypatia-tb. Then cman+corosync steps
>> in on orestes-tb and fences hypatia-tb, before the sync can proceed.
>>
>> I ran another test. I did the same thing as the previous paragraph, except that
>> I made sure both cman and pacemaker were off (I had to reboot to make sure) and
>> just started drbd on both nodes. Sure enough, drbd was able to sync without
>> split-brain or fencing. So this is a cman/corosync issue, not a drbd issue.
>
> You still may retry the whole thing with drbd 8.3.12,
> just to make sure there is no hidden DRBD 8.4.1 instability.

OK, that will be my next step, if resource-and-stonith doesn't solve the problem.

>> While I was setting up the test for the previous paragraph, there was a problem
>> with another resource (ocf:heartbeat:exportfs) that couldn't be properly
>> monitored on either node. This led to a cycle of fencing where each node would
>> successively fence the other because the exportfs resource couldn't run on
>> either node. I had to quickly change my configuration to turn off monitoring on
>> the resource.
>>
>> So it seems like cman+corosync is the issue. It's as if I'm "over-fencing."
>>
>> Any ideas?

--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
PO Box 137 |
Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/
Attachments: smime.p7s (4.39 KB)


andrew at beekhof

Feb 28, 2012, 2:27 PM

Post #7 of 18 (2343 views)
Permalink
Re: cman+pacemaker+drbd fencing problem [In reply to]

On Wed, Feb 29, 2012 at 5:21 AM, William Seligman
<seligman [at] nevis> wrote:
> On 2/27/12 8:40 PM, Andrew Beekhof wrote:
>
>> Oh, what does the fence_pcmk file look like?
>
> This is a standard part of the pacemaker-1.1.6 package.

I know, I wrote it :-)
I'm just curious exactly what it contains; there was a rather serious
bug at one point.

> According to
>
> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_configuring_cman_fencing.html>
>
> it causes any fencing requests from cman to be redirected to pacemaker.
>
> Since you asked, I've attached a copy of the file. I note that if this script is
> used to fence a system it writes to /var/log/messages using logger, and there is
> no such log message in my logs. So I guess cman is off the hook.
>
>> On Tue, Feb 28, 2012 at 11:49 AM, William Seligman
>> <seligman [at] nevis> wrote:
>>> I'm trying to set up an active/active HA cluster as explained in Clusters From
>>> Scratch (which I just re-read after my last problem).
>>>
>>> I'll give versions and config files below, but I'll start with what happens. I
>>> start with an active/active cman+pacemaker+drbd+gfs2 cluster, with fencing
>>> enabled. My fencing mechanism cuts power to a node by turning the load off in
>>> its UPS. The two nodes are hypatia-tb and orestes-tb.
>>>
>>> I want to test fencing and recovery. I start with both nodes running, and
>>> resources properly running on both nodes. Then I simulate failure on one node,
>>> e.g., orestes-tb. I've done this with "crm node standby", "service pacemaker
>>> off", or by pulling the plug. As expected, all the resources move to hypatia-tb,
>>> with the drbd resource as Primary.
>>>
>>> When I try to bring orestes-tb back into the cluster with "crm node online" or
>>> "service pacemaker on" (the inverse of how I removed it), orestes-tb is fenced.
>>> OK, that makes sense, I guess; there's a potential split-brain situation.
>>
>> Not really, that should only happen if the two nodes can't see each
>> other.  Which should not be the case.
>> Only when you pull the plug should orestes-tb be fenced.
>>
>> Or if you're using a fencing device that requires the node to have
>> power, then I can imagine that turning it on again might result in
>> fencing.
>> But not for the other cases.
>
> I ran a test: I turned off pacemaker (and so DRBD) on orestes-tb. I "touch"ed a
> file on the hypatia-tb DRBD partition, to make it the "newer" one. Then I turned
> off pacemaker on hypatia-tb. Finally I turned on just drbd on hypatia-tb, then
> on orestes-tb.
>
> From /var/log/messages on hypatia-tb:
>
> Feb 28 11:39:19 hypatia-tb kernel: d-con admin: Starting worker thread (from
> drbdsetup [21822])
> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: disk( Diskless -> Attaching )
> Feb 28 11:39:19 hypatia-tb kernel: d-con admin: Method to ensure write ordering:
> barrier
> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: max BIO size = 130560
> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: Adjusting my ra_pages to backing
> device's (32 -> 768)
> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: drbd_bm_resize called with
> capacity == 5611549368
> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: resync bitmap: bits=701443671
> words=10960058 pages=21407
> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
> Feb 28 11:39:19 hypatia-tb kernel: block drbd0: bitmap READ of 21407 pages took
> 576 jiffies
> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: recounting of set bits took
> additional 87 jiffies
> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: 55 MB (14114 bits) marked
> out-of-sync by on disk bit-map.
> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: disk( Attaching -> UpToDate )
> pdsk( DUnknown -> Outdated )
> Feb 28 11:39:20 hypatia-tb kernel: block drbd0: attached to UUIDs
> 862A336609FD27CD:BFFB722D5E3E15D7:6E63EC4258C86AF2:6E62EC4258C86AF2
> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: Starting receiver thread (from
> drbd_w_admin [21824])
> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: receiver (re)started
> Feb 28 11:39:20 hypatia-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
>
>
> From /var/log/messages on orestes-tb:
>
> Feb 28 11:39:51 orestes-tb kernel: d-con admin: Starting worker thread (from
> drbdsetup [17827])
> Feb 28 11:39:51 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
> Feb 28 11:39:51 orestes-tb kernel: d-con admin: Method to ensure write ordering:
> barrier
> Feb 28 11:39:51 orestes-tb kernel: block drbd0: max BIO size = 130560
> Feb 28 11:39:51 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing
> device's (32 -> 768)
> Feb 28 11:39:51 orestes-tb kernel: block drbd0: drbd_bm_resize called with
> capacity == 5611549368
> Feb 28 11:39:51 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671
> words=10960058 pages=21407
> Feb 28 11:39:51 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took
> 735 jiffies
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: recounting of set bits took
> additional 93 jiffies
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync
> by on disk bit-map.
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: attached to UUIDs
> BFFB722D5E3E15D6:0000000000000000:6E63EC4258C86AF2:6E62EC4258C86AF2
> Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
> Feb 28 11:39:52 orestes-tb kernel: d-con admin: Starting receiver thread (from
> drbd_w_admin [17829])
> Feb 28 11:39:52 orestes-tb kernel: d-con admin: receiver (re)started
> Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
> Feb 28 11:39:52 orestes-tb kernel: d-con admin: Handshake successful: Agreed
> network protocol version 100
> Feb 28 11:39:52 orestes-tb kernel: d-con admin: conn( WFConnection ->
> WFReportParams )
> Feb 28 11:39:52 orestes-tb kernel: d-con admin: Starting asender thread (from
> drbd_r_admin [17835])
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: drbd_sync_handshake:
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: self
> BFFB722D5E3E15D6:0000000000000000:6E63EC4258C86AF2:6E62EC4258C86AF2 bits:0 flags:0
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: peer
> 862A336609FD27CC:BFFB722D5E3E15D7:6E63EC4258C86AF2:6E62EC4258C86AF2 bits:14114
> flags:0
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: receive bitmap stats
> [Bytes(packets)]: plain 0(0), RLE 176(1), total 176; compression: 100.0%
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: send bitmap stats
> [Bytes(packets)]: plain 0(0), RLE 176(1), total 176; compression: 100.0%
> Feb 28 11:39:52 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
> Feb 28 11:40:01 orestes-tb corosync[2193]:   [TOTEM ] A processor failed,
> forming new configuration.
> Feb 28 11:40:03 orestes-tb corosync[2193]:   [QUORUM] Members[1]: 2
> Feb 28 11:40:03 orestes-tb corosync[2193]:   [TOTEM ] A processor joined or left
> the membership and a new membership was formed.
> Feb 28 11:40:03 orestes-tb kernel: dlm: closing connection to node 1
> Feb 28 11:40:03 orestes-tb corosync[2193]:   [CPG   ] chosen downlist: sender
> r(0) ip(129.236.252.14) r(1) ip(192.168.100.6) ; members(old:2 left:1)
> Feb 28 11:40:03 orestes-tb corosync[2193]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Feb 28 11:40:03 orestes-tb fenced[2247]: fencing node hypatia-tb.nevis.columbia.edu
>
>
> As far as I can tell, hypatia-tb's drbd comes up, says "I'm UpToDate" and waits
> for a connection from orestes-tb. orestes-tb's drbd comes up, says "I'm
> UpToDate" and starts the sync process with hypatia-tb. Then cman+corosync steps
> in on orestes-tb and fences hypatia-tb, before the sync can proceed.
>
> I ran another test. I did the same thing as the previous paragraph, except that
> I made sure both cman and pacemaker were off (I had to reboot to make sure) and
> just started drbd on both nodes. Sure enough, drbd was able to sync without
> split-brain or fencing. So this is a cman/corosync issue, not a drbd issue.
>
> While I was setting up the test for the previous paragraph, there was a problem
> with another resource (ocf:heartbeat:exportfs) that couldn't be properly
> monitored on either node. This led to a cycle of fencing where each node would
> successively fence the other because the exportfs resource couldn't run on
> either node. I had to quickly change my configuration to turn off monitoring on
> the resource.

Not being able to run is fine, but not being able to stop would
definitely cause fencing.
Make sure the RA can always stop ;-)
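
One way to check that outside the cluster, assuming ocf-tester from
resource-agents is installed (the parameter values below are only placeholders),
is something like:

# exercise the agent's start/stop/monitor actions repeatedly, outside pacemaker
ocf-tester -n ExportMail \
        -o clientspec="mail" -o directory="/mail" -o fsid="30" \
        /usr/lib/ocf/resource.d/heartbeat/exportfs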

>
> So it seems like cman+corosync is the issue. It's as if I'm "over-fencing."
>
> Any ideas?
>
>>> I bring orestes-tb back up, with the intent of adding it back into the cluster.
>>> I make sure cman, pacemaker, and drbd services were off at system start. On
>>> orestes-tb, I type "service drbd start".
>>>
>>> What I expect to happen is that the drbd resource on orestes-tb is marked
>>> "Outdated" or something like that. Then I'd fix it with "drbdadm
>>> --discard-my-data connect admin" or whatever is appropriate.
>>>
>>> What actually happens is that hypatia-tb is fenced. Since this is the node
>>> running all the resources, this is bad behavior. It's even more puzzling when I
>>> consider that at, the time, there isn't any fencing resource actually running on
>>> orestes-tb; my guess is that DRBD on hypatia-tb is fencing itself.
>>>
>>> Eventually hypatia-tb reboots, and the cluster goes back to normal. But as a
>>> fencing/stability/HA test, this is a failure.
>>>
>>> I've repeated this with a number of variations. In the end, both systems have to
>>> be fenced/rebooted before the cluster is working again.
>>>
>>> Any ideas?
>>>
>>> Versions:
>>>
>>> Scientific Linux 6.2
>>> kernel 2.6.32
>>> cman-3.0.12
>>> corosync-1.4.1
>>> pacemaker-1.1.6
>>> drbd-8.4.1
>>>
>>> /etc/drbd.d/global-common.conf:
>>>
>>> global {
>>>        usage-count yes;
>>> }
>>>
>>> common {
>>>        startup {
>>>                wfc-timeout             60;
>>>                degr-wfc-timeout        60;
>>>                outdated-wfc-timeout    60;
>>>        }
>>> }
>>>
>>> /etc/drbd.d/admin.res:
>>>
>>> resource admin {
>>>
>>>        protocol C;
>>>
>>>        on hypatia-tb.nevis.columbia.edu {
>>>                volume 0 {
>>>                        device          /dev/drbd0;
>>>                        disk            /dev/md2;
>>>                        flexible-meta-disk      internal;
>>>                }
>>>                address         192.168.100.7:7788;
>>>        }
>>>        on orestes-tb.nevis.columbia.edu {
>>>                volume 0 {
>>>                        device          /dev/drbd0;
>>>                        disk            /dev/md2;
>>>                        flexible-meta-disk      internal;
>>>                }
>>>                address         192.168.100.6:7788;
>>>        }
>>>
>>>        startup {
>>>        }
>>>
>>>        net {
>>>                allow-two-primaries     yes;
>>>                after-sb-0pri      discard-zero-changes;
>>>                after-sb-1pri      discard-secondary;
>>>                after-sb-2pri      disconnect;
>>>                sndbuf-size 0;
>>>        }
>>>
>>>        disk {
>>>                resync-rate     100M;
>>>                c-max-rate      100M;
>>>                al-extents      3389;
>>>                fencing resource-only;
>>>        }
>>>
>>> An edited output of "crm configure show":
>>>
>>> node hypatia-tb.nevis.columbia.edu
>>> node orestes-tb.nevis.columbia.edu
>>> primitive StonithHypatia stonith:fence_nut \
>>>   params pcmk_host_check="static-list" \
>>>   pcmk_host_list="hypatia-tb.nevis.columbia.edu" \
>>>   ups="sofia-ups" username="admin" password="XXX"
>>> primitive StonithOrestes stonith:fence_nut \
>>>   params pcmk_host_check="static-list" \
>>>   pcmk_host_list="orestes-tb.nevis.columbia.edu"
>>>   ups="dc-test-stand-ups" username="admin" password="XXX"
>>> location StonithHypatiaLocation StonithHypatia \
>>>   -inf: hypatia-tb.nevis.columbia.edu
>>> location StonithOrestesLocation StonithOrestes \
>>>   -inf: orestes-tb.nevis.columbia.edu
>>>
>>> /etc/cluster/cluster.conf:
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="17" name="Nevis_HA">
>>>  <logging debug="off"/>
>>>  <cman expected_votes="1" two_node="1" />
>>>  <clusternodes>
>>>    <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
>>>      <altname name="hypatia-private.nevis.columbia.edu" port="5405"
>>> mcast="226.94.1.1"/>
>>>      <fence>
>>>        <method name="pcmk-redirect">
>>>          <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
>>>        </method>
>>>      </fence>
>>>    </clusternode>
>>>    <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
>>>      <altname name="orestes-private.nevis.columbia.edu" port="5405"
>>> mcast="226.94.1.1"/>
>>>      <fence>
>>>        <method name="pcmk-redirect">
>>>          <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
>>>        </method>
>>>      </fence>
>>>    </clusternode>
>>>  </clusternodes>
>>>  <fencedevices>
>>>    <fencedevice name="pcmk" agent="fence_pcmk"/>
>>>  </fencedevices>
>>>  <fence_daemon post_join_delay="30" />
>>>  <rm disabled="1" />
>>> </cluster>
>>>
>>>
>>> The log messages on orestes-tb, just before hypatia-tb is fenced (there are no
>>> messages in the hypatia-tb log for this time):
>>>
>>> Feb 15 16:52:27 orestes-tb kernel: drbd: initialized. Version: 8.4.1
>>> (api:1/proto:86-100)
>>> Feb 15 16:52:27 orestes-tb kernel: drbd: GIT-hash:
>>> 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by
>>> root [at] orestes-tb, 2012-02-14 17:05:32
>>> Feb 15 16:52:27 orestes-tb kernel: drbd: registered as block device major 147
>>> Feb 15 16:52:27 orestes-tb kernel: d-con admin: Starting worker thread (from
>>> drbdsetup [2570])
>>> Feb 15 16:52:27 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
>>> Feb 15 16:52:27 orestes-tb kernel: d-con admin: Method to ensure write ordering:
>>> barrier
>>> Feb 15 16:52:27 orestes-tb kernel: block drbd0: max BIO size = 130560
>>> Feb 15 16:52:27 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing
>>> device's (32 -> 768)
>>> Feb 15 16:52:27 orestes-tb kernel: block drbd0: drbd_bm_resize called with
>>> capacity == 5611549368
>>> Feb 15 16:52:27 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671
>>> words=10960058 pages=21407
>>> Feb 15 16:52:27 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
>>> Feb 15 16:52:28 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took
>>> 634 jiffies
>>> Feb 15 16:52:28 orestes-tb kernel: block drbd0: recounting of set bits took
>>> additional 92 jiffies
>>> Feb 15 16:52:28 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync
>>> by on disk bit-map.
>>> Feb 15 16:52:28 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
>>> Feb 15 16:52:28 orestes-tb kernel: block drbd0: attached to UUIDs
>>> F5355FCF6114F218:0000000000000000:8A5519C7090D6BD6:8A5419C7090D6BD6
>>> Feb 15 16:52:28 orestes-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
>>> Feb 15 16:52:28 orestes-tb kernel: d-con admin: Starting receiver thread (from
>>> drbd_w_admin [2572])
>>> Feb 15 16:52:28 orestes-tb kernel: d-con admin: receiver (re)started
>>> Feb 15 16:52:28 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
>>> Feb 15 16:52:29 orestes-tb kernel: d-con admin: Handshake successful: Agreed
>>> network protocol version 100
>>> Feb 15 16:52:29 orestes-tb kernel: d-con admin: conn( WFConnection ->
>>> WFReportParams )
>>> Feb 15 16:52:29 orestes-tb kernel: d-con admin: Starting asender thread (from
>>> drbd_r_admin [2579])
>>> Feb 15 16:52:29 orestes-tb kernel: block drbd0: drbd_sync_handshake:
>>> Feb 15 16:52:29 orestes-tb kernel: block drbd0: self
>>> F5355FCF6114F218:0000000000000000:8A5519C7090D6BD6:8A5419C7090D6BD6 bits:0 flags:0
>>> Feb 15 16:52:29 orestes-tb kernel: block drbd0: peer
>>> 06B93A6C54D6D631:F5355FCF6114F219:8A5519C7090D6BD6:8A5419C7090D6BD6 bits:615 flags:0
>>> Feb 15 16:52:29 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
>>> Feb 15 16:52:29 orestes-tb kernel: block drbd0: peer( Unknown -> Primary ) conn(
>>> WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
>>> Feb 15 16:52:29 orestes-tb kernel: block drbd0: receive bitmap stats
>>> [Bytes(packets)]: plain 0(0), RLE 39(1), total 39; compression: 100.0%
>>> Feb 15 16:52:29 orestes-tb kernel: block drbd0: send bitmap stats
>>> [Bytes(packets)]: plain 0(0), RLE 39(1), total 39; compression: 100.0%
>>> Feb 15 16:52:29 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
>>> Feb 15 16:52:50 orestes-tb kernel: d-con admin: PingAck did not arrive in time.
>>> Feb 15 16:52:50 orestes-tb kernel: d-con admin: peer( Primary -> Unknown ) conn(
>>> WFSyncUUID -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>>> Feb 15 16:52:50 orestes-tb kernel: d-con admin: asender terminated
>>> Feb 15 16:52:50 orestes-tb kernel: d-con admin: Terminating asender thread
>>> Feb 15 16:52:51 orestes-tb kernel: block drbd0: bitmap WRITE of 3 pages took 247
>>> jiffies
>>> Feb 15 16:52:51 orestes-tb kernel: block drbd0: 2460 KB (615 bits) marked
>>> out-of-sync by on disk bit-map.
>>> Feb 15 16:52:51 orestes-tb kernel: d-con admin: Connection closed
>>> Feb 15 16:52:51 orestes-tb kernel: d-con admin: conn( NetworkFailure ->
>>> Unconnected )
>>> Feb 15 16:52:51 orestes-tb kernel: d-con admin: receiver terminated
>>> Feb 15 16:52:51 orestes-tb kernel: d-con admin: Restarting receiver thread
>>> Feb 15 16:52:51 orestes-tb kernel: d-con admin: receiver (re)started
>>> Feb 15 16:52:51 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
>
>
> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
>
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


seligman at nevis

Feb 28, 2012, 3:21 PM

Post #8 of 18 (2348 views)
Permalink
Re: cman+pacemaker+drbd fencing problem [In reply to]

On 2/28/12 5:27 PM, Andrew Beekhof wrote:
> On Wed, Feb 29, 2012 at 5:21 AM, William Seligman
> <seligman [at] nevis> wrote:

>> While I was setting up the test for the previous paragraph, there was a problem
>> with another resource (ocf:heartbeat:exportfs) that couldn't be properly
>> monitored on either node. This led to a cycle of fencing where each node would
>> successively fence the other because the exportfs resource couldn't run on
>> either node. I had to quickly change my configuration to turn off monitoring on
>> the resource.
>
> Not being able to run is fine, but not being able to stop would
> definitely cause fencing.
> Make sure the RA can always stop ;-)

I'm not the one who wrote ocf:heartbeat:exportfs. I've already had my fling with
trying to revise it. I can only hope that the folks who wrote it knew what they
were doing; they certainly know more than I do!
--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
PO Box 137 |
Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/
Attachments: smime.p7s (4.39 KB)


lars.ellenberg at linbit

Feb 28, 2012, 4:26 PM

Post #9 of 18 (2397 views)
Permalink
Re: cman+pacemaker+drbd fencing problem [In reply to]

On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
> On 2/28/12 2:09 PM, Lars Ellenberg wrote:
> > You say "fencing resource-only" in drbd.conf.
> > But you did not show the fencing handler used?
> > Did you specify one at all?
>
> It looks like I "over-edited" when I got rid of the comments before I posted my
> configuration. The relevant sections are:
>
> disk {
> fencing resource-only;
> }
> handlers {
> pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh;
> /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
> pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh;
> /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
> local-io-error "/usr/lib/drbd/notify-io-error.sh;
> /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
> split-brain "/usr/lib/drbd/notify-split-brain.sh
> sysadmin [at] nevis";
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
> }
>
>
> > Besides, for a dual-primary DRBD setup, you must have "fencing
> > resource-and-stonith;", and you should use a DRBD fencing handler
> > that really fences off the peer. It may additionally set constraints.
>
> Do crm-fence-peer.sh or Lon Hohberger's obliterate-peer.sh "really" fence off a
> peer? I suspect your answer will be no, since from what I can tell in a
> cman+pacemaker configuration they both wind up calling stonith_admin.

The "obliterate peer" thing does.
So do the stonith_admin_fence_peer.sh and the rhcs_fence from
http://git.drbd.org/gitweb.cgi?p=drbd-8.3.git;a=tree;f=scripts
They do not set constraints, though, so you are relying
on some more DRBD internals here...

The "crm-fence-peer.sh" simply sets a constraint,
and is in itself not sufficient to shoot the peer.
It is suitable for single primary setups.
For dual primary setups, you may combine it
with the obliterate thingy or similar.

Though it should typically cause the losing side to be "demoted",
which will typically fail if it is primary, in use, and blocked
due to "fencing resource-and-stonith".
That causes a demote failure, then a stop failure, then node-level fencing,
so it may still end up being good enough and cause the peer to be shot.
It just takes a few detours on the way.
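
A rough, untested sketch of such a combination (treat the DRBD_PEER variable and
the exact stonith_admin invocation as assumptions to verify against your versions):

#!/bin/sh
# sketch of a dual-primary fence-peer handler: place the constraint, then shoot the peer
/usr/lib/drbd/crm-fence-peer.sh
if stonith_admin --fence "$DRBD_PEER"; then
        exit 7      # 7 = "peer was fenced", so a blocked Primary may resume I/O
fi
exit 5              # could not confirm that the peer was fenced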

> > Because of the increased complexity, I strongly recommend against dual
> > primary DRBD, unless you have a very good reason to want it.
> >
> > "Because it can be done" does not count as good reason in that context
>
> <off-topic>
> Sigh. I wish that were the reason.
>
> The reason why I'm doing dual-primary is that I've a got a single-primary
> two-node cluster in production that simply doesn't work. One node runs
> resources; the other sits and twiddles its fingers; fine. But when primary goes
> down, secondary has trouble starting up all the resources; when we've actually
> had primary failures (UPS goes haywire, hard drive failure) the secondary often
> winds up in a state in which it runs none of the significant resources.
>
> With the dual-primary setup I have now, both machines are running the resources
> that typically cause problems in my single-primary configuration. If one box
> goes down, the other doesn't have to failover anything; it's already running
> them. (I needed IPaddr2 cloning to work properly for this to work, which is why
> I started that thread... and all the stupider of me for missing that crucial
> page in Clusters From Scratch.)
>
> My only remaining problem with the configuration is restoring a fenced node to
> the cluster. Hence my tests, and the reason why I started this thread.
> </off-topic>

Uhm, I do think that is exactly on topic.

Rather fix your resources to be able to successfully take over,
than add even more complexity.

What resources would that be,
and why are they not taking over?

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


seligman at nevis

Feb 29, 2012, 4:03 PM

Post #10 of 18 (2359 views)
Permalink
Re: cman+pacemaker+drbd fencing problem [In reply to]

On 2/28/12 7:26 PM, Lars Ellenberg wrote:
> On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
>> <off-topic>
>> Sigh. I wish that were the reason.
>>
>> The reason why I'm doing dual-primary is that I've a got a single-primary
>> two-node cluster in production that simply doesn't work. One node runs
>> resources; the other sits and twiddles its fingers; fine. But when primary goes
>> down, secondary has trouble starting up all the resources; when we've actually
>> had primary failures (UPS goes haywire, hard drive failure) the secondary often
>> winds up in a state in which it runs none of the significant resources.
>>
>> With the dual-primary setup I have now, both machines are running the resources
>> that typically cause problems in my single-primary configuration. If one box
>> goes down, the other doesn't have to failover anything; it's already running
>> them. (I needed IPaddr2 cloning to work properly for this to work, which is why
>> I started that thread... and all the stupider of me for missing that crucial
>> page in Clusters From Scratch.)
>>
>> My only remaining problem with the configuration is restoring a fenced node to
>> the cluster. Hence my tests, and the reason why I started this thread.
>> </off-topic>
>
> Uhm, I do think that is exactly on topic.
>
> Rather fix your resources to be able to successfully take over,
> than add even more complexity.
>
> What resources would that be,
> and why are they not taking over?

I can't tell you in detail, because the major snafu happened on a production
system after a power outage a few months ago. My goal was to get the thing
stable as quickly as possible. In the end, that turned out to be a non-HA
configuration: One runs corosync+pacemaker+drbd, while the other just runs drbd.
It works, in the sense that the users get their e-mail. If there's a power
outage, I have to bring things up manually.

So my only reference is the test-bench dual-primary setup I've got now, which is
exhibiting the same kinds of problems even though the OS versions, software
versions, and layout are different. This suggests that the problem lies in the
way I'm setting up the configuration.

The problems I have seem to be in the general category of "the 'good guy' gets
fenced when the 'bad guy' gets into trouble." Examples:

- Assuming I start out with two crashed nodes. If I just start up DRBD and
nothing else, the partitions sync quickly with no problems.

- If the system starts with cman running and I start drbd, it's likely that the
system which is _not_ Outdated will be fenced (rebooted). The same thing happens if
cman+pacemaker is running.

- Cloned ocf:heartbeat:exportfs resources are giving me problems as well (which
is why I tried making changes to that resource script). Assume I start with one
node running cman+pacemaker, and the other stopped. I turn on the stopped
node. This will typically result in the running node being fenced, because it
times out when stopping the exportfs resource.

Falling back to DRBD 8.3.12 didn't change this behavior.

My pacemaker configuration is long, so I'll excerpt what I think are the
relevant pieces in the hope that it will be enough for someone to say "You fool!
This is covered in Pacemaker Explained page 56!" When bringing up a stopped
node, in order to restart AdminClone pacemaker wants to stop ExportsClone, then
Gfs2Clone, then ClvmdClone. As I said, it's the failure to stop ExportMail on
the running node that causes it to be fenced.

primitive AdminDrbd ocf:linbit:drbd \
params drbd_resource="admin" \
op monitor interval="60s" role="Master" \
op monitor interval="59s" role="Slave" \
op stop interval="0" timeout="320" \
op start interval="0" timeout="240"
ms AdminClone AdminDrbd \
meta master-max="2" master-node-max="1" \
clone-max="2" clone-node-max="1" notify="true"

primitive Clvmd lsb:clvmd op monitor interval="30s"
clone ClvmdClone Clvmd
colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start

primitive Gfs2 lsb:gfs2 op monitor interval="30s"
clone Gfs2Clone Gfs2
colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone

primitive ExportMail ocf:heartbeat:exportfs \
op start interval="0" timeout="40" \
op stop interval="0" timeout="45" \
params clientspec="mail" directory="/mail" fsid="30"
clone ExportsClone ExportMail
colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone
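
One variation I have not ruled out (assuming clone interleaving even applies to
this kind of ordering; treat this as an untested sketch) is to mark the dependent
clones interleaved, so each node's instances only follow the local AdminClone
instance rather than restarting when the peer joins:

clone ClvmdClone Clvmd \
        meta interleave="true"
clone Gfs2Clone Gfs2 \
        meta interleave="true"
clone ExportsClone ExportMail \
        meta interleave="true"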

--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
PO Box 137 |
Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/
Attachments: smime.p7s (4.39 KB)


emi2fast at gmail

Mar 1, 2012, 1:15 AM

Post #11 of 18 (2353 views)
Permalink
Re: cman+pacemaker+drbd fencing problem [In reply to]

Can you show me your /etc/cluster/cluster.conf?

Because I think your problem is a fencing loop.

On 1 March 2012 01:03, William Seligman <seligman [at] nevis> wrote:

> On 2/28/12 7:26 PM, Lars Ellenberg wrote:
> > On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
> >> <off-topic>
> >> Sigh. I wish that were the reason.
> >>
> >> The reason why I'm doing dual-primary is that I've a got a
> single-primary
> >> two-node cluster in production that simply doesn't work. One node runs
> >> resources; the other sits and twiddles its fingers; fine. But when
> primary goes
> >> down, secondary has trouble starting up all the resources; when we've
> actually
> >> had primary failures (UPS goes haywire, hard drive failure) the
> secondary often
> >> winds up in a state in which it runs none of the significant resources.
> >>
> >> With the dual-primary setup I have now, both machines are running the
> resources
> >> that typically cause problems in my single-primary configuration. If
> one box
> >> goes down, the other doesn't have to failover anything; it's already
> running
> >> them. (I needed IPaddr2 cloning to work properly for this to work,
> which is why
> >> I started that thread... and all the stupider of me for missing that
> crucial
> >> page in Clusters From Scratch.)
> >>
> >> My only remaining problem with the configuration is restoring a fenced
> node to
> >> the cluster. Hence my tests, and the reason why I started this thread.
> >> </off-topic>
> >
> > Uhm, I do think that is exactly on topic.
> >
> > Rather fix your resources to be able to successfully take over,
> > than add even more complexity.
> >
> > What resources would that be,
> > and why are they not taking over?
>
> I can't tell you in detail, because the major snafu happened on a
> production
> system after a power outage a few months ago. My goal was to get the thing
> stable as quickly as possible. In the end, that turned out to be a non-HA
> configuration: One runs corosync+pacemaker+drbd, while the other just runs
> drbd.
> It works, in the sense that the users get their e-mail. If there's a power
> outage, I have to bring things up manually.
>
> So my only reference is the test-bench dual-primary setup I've got now,
> which is
> exhibiting the same kinds of problems even though the OS versions, software
> versions, and layout are different. This suggests that the problem lies in
> the
> way I'm setting up the configuration.
>
> The problems I have seem to be in the general category of "the 'good guy'
> gets
> fenced when the 'bad guy' gets into trouble." Examples:
>
> - Assuming I start out with two crashed nodes. If I just start up DRBD and
> nothing else, the partitions sync quickly with no problems.
>
> - If the system starts with cman running, and I start drbd, it's likely
> that
> system who is _not_ Outdated will be fenced (rebooted). Same thing if
> cman+pacemaker is running.
>
> - Cloned ocf:heartbeat:exportfs resources are giving me problems as well
> (which
> is why I tried making changes to that resource script). Assume I start
> with one
> node running cman+pacemaker, and the other stopped. I turned on the stopped
> node. This will typically result in the running node being fenced, because
> it
> has it times out when stopping the exportfs resource.
>
> Falling back to DRBD 8.3.12 didn't change this behavior.
>
> My pacemaker configuration is long, so I'll excerpt what I think are the
> relevant pieces in the hope that it will be enough for someone to say "You
> fool!
> This is covered in Pacemaker Explained page 56!" When bringing up a stopped
> node, in order to restart AdminClone pacemaker wants to stop ExportsClone,
> then
> Gfs2Clone, then ClvmdClone. As I said, it's the failure to stop ExportMail
> on
> the running node that causes it to be fenced.
>
> primitive AdminDrbd ocf:linbit:drbd \
> params drbd_resource="admin" \
> op monitor interval="60s" role="Master" \
> op monitor interval="59s" role="Slave" \
> op stop interval="0" timeout="320" \
> op start interval="0" timeout="240"
> ms AdminClone AdminDrbd \
> meta master-max="2" master-node-max="1" \
> clone-max="2" clone-node-max="1" notify="true"
>
> primitive Clvmd lsb:clvmd op monitor interval="30s"
> clone ClvmdClone Clvmd
> colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
> order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
>
> primitive Gfs2 lsb:gfs2 op monitor interval="30s"
> clone Gfs2Clone Gfs2
> colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
> order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
>
> primitive ExportMail ocf:heartbeat:exportfs \
> op start interval="0" timeout="40" \
> op stop interval="0" timeout="45" \
> params clientspec="mail" directory="/mail" fsid="30"
> clone ExportsClone ExportMail
> colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
> order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone
>
> --
> Bill Seligman | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
> PO Box 137 |
> Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/
>
>
>



--
this is my life and I live it as long as God wills
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


seligman at nevis

Mar 1, 2012, 3:28 AM

Post #12 of 18 (2342 views)
Permalink
Re: cman+pacemaker+drbd fencing problem [In reply to]

On 3/1/12 4:15 AM, emmanuel segura wrote:
> Can you show me your /etc/cluster/cluster.conf?
>
> Because I think your problem is a fencing loop.

Here it is:

/etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster config_version="17" name="Nevis_HA">
<logging debug="off"/>
<cman expected_votes="1" two_node="1" />
<clusternodes>
<clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
<altname name="hypatia-private.nevis.columbia.edu" port="5405"
mcast="226.94.1.1"/>
<fence>
<method name="pcmk-redirect">
<device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
</method>
</fence>
</clusternode>
<clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
<altname name="orestes-private.nevis.columbia.edu" port="5405"
mcast="226.94.1.1"/>
<fence>
<method name="pcmk-redirect">
<device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice name="pcmk" agent="fence_pcmk"/>
</fencedevices>
<fence_daemon post_join_delay="30" />
<rm disabled="1" />
</cluster>


> On 1 March 2012 01:03, William Seligman <seligman [at] nevis> wrote:
>
>> On 2/28/12 7:26 PM, Lars Ellenberg wrote:
>>> On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
>>>> <off-topic>
>>>> Sigh. I wish that were the reason.
>>>>
>>>> The reason why I'm doing dual-primary is that I've a got a
>> single-primary
>>>> two-node cluster in production that simply doesn't work. One node runs
>>>> resources; the other sits and twiddles its fingers; fine. But when
>> primary goes
>>>> down, secondary has trouble starting up all the resources; when we've
>> actually
>>>> had primary failures (UPS goes haywire, hard drive failure) the
>> secondary often
>>>> winds up in a state in which it runs none of the significant resources.
>>>>
>>>> With the dual-primary setup I have now, both machines are running the
>> resources
>>>> that typically cause problems in my single-primary configuration. If
>> one box
>>>> goes down, the other doesn't have to failover anything; it's already
>> running
>>>> them. (I needed IPaddr2 cloning to work properly for this to work,
>> which is why
>>>> I started that thread... and all the stupider of me for missing that
>> crucial
>>>> page in Clusters From Scratch.)
>>>>
>>>> My only remaining problem with the configuration is restoring a fenced
>> node to
>>>> the cluster. Hence my tests, and the reason why I started this thread.
>>>> </off-topic>
>>>
>>> Uhm, I do think that is exactly on topic.
>>>
>>> Rather fix your resources to be able to successfully take over,
>>> than add even more complexity.
>>>
>>> What resources would that be,
>>> and why are they not taking over?
>>
>> I can't tell you in detail, because the major snafu happened on a
>> production
>> system after a power outage a few months ago. My goal was to get the thing
>> stable as quickly as possible. In the end, that turned out to be a non-HA
>> configuration: One runs corosync+pacemaker+drbd, while the other just runs
>> drbd.
>> It works, in the sense that the users get their e-mail. If there's a power
>> outage, I have to bring things up manually.
>>
>> So my only reference is the test-bench dual-primary setup I've got now,
>> which is
>> exhibiting the same kinds of problems even though the OS versions, software
>> versions, and layout are different. This suggests that the problem lies in
>> the
>> way I'm setting up the configuration.
>>
>> The problems I have seem to be in the general category of "the 'good guy'
>> gets
>> fenced when the 'bad guy' gets into trouble." Examples:
>>
>> - Assuming I start out with two crashed nodes. If I just start up DRBD and
>> nothing else, the partitions sync quickly with no problems.
>>
>> - If the system starts with cman running, and I start drbd, it's likely
>> that
>> system who is _not_ Outdated will be fenced (rebooted). Same thing if
>> cman+pacemaker is running.
>>
>> - Cloned ocf:heartbeat:exportfs resources are giving me problems as well
>> (which
>> is why I tried making changes to that resource script). Assume I start
>> with one
>> node running cman+pacemaker, and the other stopped. I turned on the stopped
>> node. This will typically result in the running node being fenced, because
>> it
>> has it times out when stopping the exportfs resource.
>>
>> Falling back to DRBD 8.3.12 didn't change this behavior.
>>
>> My pacemaker configuration is long, so I'll excerpt what I think are the
>> relevant pieces in the hope that it will be enough for someone to say "You
>> fool!
>> This is covered in Pacemaker Explained page 56!" When bringing up a stopped
>> node, in order to restart AdminClone pacemaker wants to stop ExportsClone,
>> then
>> Gfs2Clone, then ClvmdClone. As I said, it's the failure to stop ExportMail
>> on
>> the running node that causes it to be fenced.
>>
>> primitive AdminDrbd ocf:linbit:drbd \
>> params drbd_resource="admin" \
>> op monitor interval="60s" role="Master" \
>> op monitor interval="59s" role="Slave" \
>> op stop interval="0" timeout="320" \
>> op start interval="0" timeout="240"
>> ms AdminClone AdminDrbd \
>> meta master-max="2" master-node-max="1" \
>> clone-max="2" clone-node-max="1" notify="true"
>>
>> primitive Clvmd lsb:clvmd op monitor interval="30s"
>> clone ClvmdClone Clvmd
>> colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
>> order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
>>
>> primitive Gfs2 lsb:gfs2 op monitor interval="30s"
>> clone Gfs2Clone Gfs2
>> colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
>> order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
>>
>> primitive ExportMail ocf:heartbeat:exportfs \
>> op start interval="0" timeout="40" \
>> op stop interval="0" timeout="45" \
>> params clientspec="mail" directory="/mail" fsid="30"
>> clone ExportsClone ExportMail
>> colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
>> order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone


--
Bill Seligman | mailto://seligman [at] nevis
Nevis Labs, Columbia Univ | http://www.nevis.columbia.edu/~seligman/
PO Box 137 |
Irvington NY 10533 USA | Phone: (914) 591-2823
Attachments: smime.p7s (4.39 KB)


emi2fast at gmail

Mar 1, 2012, 3:34 AM

Post #13 of 18 (2336 views)
Permalink
Re: cman+pacemaker+drbd fencing problem [In reply to]

Try changing the fence_daemon tag like this:
====================================
<fence_daemon clean_start="1" post_join_delay="30" />
====================================
Bump your cluster config version, then reboot the cluster.
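
For example (the version number is only an example, and ccs_config_validate is
assumed to come with your cman packages):

<!-- /etc/cluster/cluster.conf: raise config_version, adjust fence_daemon -->
<cluster config_version="18" name="Nevis_HA">
  ...
  <fence_daemon clean_start="1" post_join_delay="30" />
</cluster>

# sanity-check the edited file before rebooting
ccs_config_validate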

On 1 March 2012 12:28, William Seligman <seligman [at] nevis> wrote:

> On 3/1/12 4:15 AM, emmanuel segura wrote:
>
>> Can you show me your /etc/cluster/cluster.conf?
>>
>> Because I think your problem is a fencing loop.
>>
>
> Here it is:
>
> /etc/cluster/cluster.conf:
>
> <?xml version="1.0"?>
> <cluster config_version="17" name="Nevis_HA">
> <logging debug="off"/>
> <cman expected_votes="1" two_node="1" />
> <clusternodes>
> <clusternode name="hypatia-tb.nevis.**columbia.edu<http://hypatia-tb.nevis.columbia.edu>"
> nodeid="1">
> <altname name="hypatia-private.nevis.**columbia.edu<http://hypatia-private.nevis.columbia.edu>"
> port="5405"
> mcast="226.94.1.1"/>
> <fence>
> <method name="pcmk-redirect">
> <device name="pcmk" port="hypatia-tb.nevis.**columbia.edu<http://hypatia-tb.nevis.columbia.edu>
> "/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="orestes-tb.nevis.**columbia.edu<http://orestes-tb.nevis.columbia.edu>"
> nodeid="2">
> <altname name="orestes-private.nevis.**columbia.edu<http://orestes-private.nevis.columbia.edu>"
> port="5405"
> mcast="226.94.1.1"/>
> <fence>
> <method name="pcmk-redirect">
> <device name="pcmk" port="orestes-tb.nevis.**columbia.edu<http://orestes-tb.nevis.columbia.edu>
> "/>
> </method>
> </fence>
> </clusternode>
> </clusternodes>
> <fencedevices>
> <fencedevice name="pcmk" agent="fence_pcmk"/>
> </fencedevices>
> <fence_daemon post_join_delay="30" />
> <rm disabled="1" />
> </cluster>
>
>
>
> On 1 March 2012 at 01:03, William Seligman <seligman [at] nevis> wrote:
>>
>> On 2/28/12 7:26 PM, Lars Ellenberg wrote:
>>>
>>>> On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
>>>>
>>>>> <off-topic>
>>>>> Sigh. I wish that were the reason.
>>>>>
>>>>> The reason why I'm doing dual-primary is that I've a got a single-primary
>>>>> two-node cluster in production that simply doesn't work. One node runs
>>>>> resources; the other sits and twiddles its fingers; fine. But when
>>>>> primary goes down, secondary has trouble starting up all the resources;
>>>>> when we've actually had primary failures (UPS goes haywire, hard drive
>>>>> failure) the secondary often winds up in a state in which it runs none
>>>>> of the significant resources.
>>>>>
>>>>> With the dual-primary setup I have now, both machines are running the
>>>>> resources that typically cause problems in my single-primary
>>>>> configuration. If one box goes down, the other doesn't have to failover
>>>>> anything; it's already running them. (I needed IPaddr2 cloning to work
>>>>> properly for this to work, which is why I started that thread... and all
>>>>> the stupider of me for missing that crucial page in Clusters From
>>>>> Scratch.)
>>>>>
>>>>> My only remaining problem with the configuration is restoring a fenced
>>>>> node to the cluster. Hence my tests, and the reason why I started this
>>>>> thread.
>>>>> </off-topic>
>>>>>
>>>>
>>>> Uhm, I do think that is exactly on topic.
>>>>
>>>> Rather fix your resources to be able to successfully take over,
>>>> than add even more complexity.
>>>>
>>>> What resources would that be,
>>>> and why are they not taking over?
>>>>
>>>
>>> I can't tell you in detail, because the major snafu happened on a
>>> production
>>> system after a power outage a few months ago. My goal was to get the
>>> thing
>>> stable as quickly as possible. In the end, that turned out to be a non-HA
>>> configuration: One runs corosync+pacemaker+drbd, while the other just
>>> runs
>>> drbd.
>>> It works, in the sense that the users get their e-mail. If there's a
>>> power
>>> outage, I have to bring things up manually.
>>>
>>> So my only reference is the test-bench dual-primary setup I've got now,
>>> which is
>>> exhibiting the same kinds of problems even though the OS versions,
>>> software
>>> versions, and layout are different. This suggests that the problem lies
>>> in
>>> the
>>> way I'm setting up the configuration.
>>>
>>> The problems I have seem to be in the general category of "the 'good guy'
>>> gets
>>> fenced when the 'bad guy' gets into trouble." Examples:
>>>
>>> - Assuming I start out with two crashed nodes. If I just start up DRBD
>>> and
>>> nothing else, the partitions sync quickly with no problems.
>>>
>>> - If the system starts with cman running, and I start drbd, it's likely
>>> that
>>> system who is _not_ Outdated will be fenced (rebooted). Same thing if
>>> cman+pacemaker is running.
>>>
>>> - Cloned ocf:heartbeat:exportfs resources are giving me problems as well
>>> (which
>>> is why I tried making changes to that resource script). Assume I start
>>> with one
>>> node running cman+pacemaker, and the other stopped. I turned on the
>>> stopped
>>> node. This will typically result in the running node being fenced,
>>> because
>>> it
>>> has it times out when stopping the exportfs resource.
>>>
>>> Falling back to DRBD 8.3.12 didn't change this behavior.
>>>
>>> My pacemaker configuration is long, so I'll excerpt what I think are the
>>> relevant pieces in the hope that it will be enough for someone to say
>>> "You
>>> fool!
>>> This is covered in Pacemaker Explained page 56!" When bringing up a
>>> stopped
>>> node, in order to restart AdminClone pacemaker wants to stop
>>> ExportsClone,
>>> then
>>> Gfs2Clone, then ClvmdClone. As I said, it's the failure to stop
>>> ExportMail
>>> on
>>> the running node that causes it to be fenced.
>>>
>>> primitive AdminDrbd ocf:linbit:drbd \
>>> params drbd_resource="admin" \
>>> op monitor interval="60s" role="Master" \
>>> op monitor interval="59s" role="Slave" \
>>> op stop interval="0" timeout="320" \
>>> op start interval="0" timeout="240"
>>> ms AdminClone AdminDrbd \
>>> meta master-max="2" master-node-max="1" \
>>> clone-max="2" clone-node-max="1" notify="true"
>>>
>>> primitive Clvmd lsb:clvmd op monitor interval="30s"
>>> clone ClvmdClone Clvmd
>>> colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
>>> order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
>>>
>>> primitive Gfs2 lsb:gfs2 op monitor interval="30s"
>>> clone Gfs2Clone Gfs2
>>> colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
>>> order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
>>>
>>> primitive ExportMail ocf:heartbeat:exportfs \
>>> op start interval="0" timeout="40" \
>>> op stop interval="0" timeout="45" \
>>> params clientspec="mail" directory="/mail" fsid="30"
>>> clone ExportsClone ExportMail
>>> colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
>>> order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone
>>>
>>
>
> --
> Bill Seligman | mailto://seligman [at] nevis
> Nevis Labs, Columbia Univ | http://www.nevis.columbia.edu/~seligman/
> PO Box 137 |
> Irvington NY 10533 USA | Phone: (914) 591-2823
>
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA [at] lists
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>



--
esta es mi vida e me la vivo hasta que dios quiera
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


seligman at nevis

Mar 1, 2012, 9:10 AM

Post #14 of 18 (2342 views)
Permalink
Re: cman+pacemaker+drbd fencing problem [In reply to]

On 3/1/12 6:34 AM, emmanuel segura wrote:
> try to change the fence daemon tag like this
> ====================================
> <fence_daemon clean_start="1" post_join_delay="30" />
> ====================================
> change your cluster config version and after reboot the cluster

This did not change the behavior of the cluster. In particular, I'm still
dealing with this:

>>>> - If the system starts with cman running, and I start drbd, it's
>>>> likely that the system which is _not_ Outdated will be fenced (rebooted).

> On 1 March 2012 at 12:28, William Seligman <seligman [at] nevis> wrote:
>
>> On 3/1/12 4:15 AM, emmanuel segura wrote:
>>
>>> can you show me your /etc/cluster/cluster.conf?
>>>
>>> because i think your problem it's a fencing-loop
>>>
>>
>> Here it is:
>>
>> /etc/cluster/cluster.conf:
>>
>> <?xml version="1.0"?>
>> <cluster config_version="17" name="Nevis_HA">
>> <logging debug="off"/>
>> <cman expected_votes="1" two_node="1" />
>> <clusternodes>
>> <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
>> <altname name="hypatia-private.nevis.columbia.edu" port="5405"
>> mcast="226.94.1.1"/>
>> <fence>
>> <method name="pcmk-redirect">
>> <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
>> </method>
>> </fence>
>> </clusternode>
>> <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
>> <altname name="orestes-private.nevis.columbia.edu" port="5405"
>> mcast="226.94.1.1"/>
>> <fence>
>> <method name="pcmk-redirect">
>> <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
>> </method>
>> </fence>
>> </clusternode>
>> </clusternodes>
>> <fencedevices>
>> <fencedevice name="pcmk" agent="fence_pcmk"/>
>> </fencedevices>
>> <fence_daemon post_join_delay="30" />
>> <rm disabled="1" />
>> </cluster>
>>
>>
>>
>> On 1 March 2012 at 01:03, William Seligman <seligman [at] nevis> wrote:
>>>
>>> On 2/28/12 7:26 PM, Lars Ellenberg wrote:
>>>>
>>>>> On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
>>>>>
>>>>>> <off-topic>
>>>>>> Sigh. I wish that were the reason.
>>>>>>
>>>>>> The reason why I'm doing dual-primary is that I've a got a
>>>>>> single-primary two-node cluster in production that simply doesn't
>>>>>> work. One node runs resources; the other sits and twiddles its
>>>>>> fingers; fine. But when primary goes down, secondary has trouble
>>>>>> starting up all the resources; when we've actually had primary
>>>>>> failures (UPS goes haywire, hard drive failure) the secondary often
>>>>>> winds up in a state in which it runs none of the significant
>>>>>> resources.
>>>>>>
>>>>>> With the dual-primary setup I have now, both machines are running
>>>>>> the resources that typically cause problems in my single-primary
>>>>>> configuration. If one box goes down, the other doesn't have to
>>>>>> failover anything; it's already running them. (I needed IPaddr2
>>>>>> cloning to work properly for this to work, which is why I started
>>>>>> that thread... and all the stupider of me for missing that crucial
>>>>>> page in Clusters From Scratch.)
>>>>>>
>>>>>> My only remaining problem with the configuration is restoring a
>>>>>> fenced node to the cluster. Hence my tests, and the reason why I
>>>>>> started this thread.
>>>>>> </off-topic>

>>>>>>
>>>>>
>>>>> Uhm, I do think that is exactly on topic.
>>>>>
>>>>> Rather fix your resources to be able to successfully take over,
>>>>> than add even more complexity.
>>>>>
>>>>> What resources would that be,
>>>>> and why are they not taking over?
>>>>>
>>>>
>>>> I can't tell you in detail, because the major snafu happened on a
>>>> production system after a power outage a few months ago. My goal was to
>>>> get the thing stable as quickly as possible. In the end, that turned
>>>> out to be a non-HA configuration: One runs corosync+pacemaker+drbd,
>>>> while the other just runs drbd. It works, in the sense that the users
>>>> get their e-mail. If there's a power outage, I have to bring things up
>>>> manually.
>>>>
>>>> So my only reference is the test-bench dual-primary setup I've got
>>>> now, which is exhibiting the same kinds of problems even though the OS
>>>> versions, software versions, and layout are different. This suggests
>>>> that the problem lies in the way I'm setting up the configuration.
>>>>
>>>> The problems I have seem to be in the general category of "the 'good
>>>> guy' gets fenced when the 'bad guy' gets into trouble." Examples:
>>>>
>>>> - Assuming I start out with two crashed nodes. If I just start up DRBD
>>>> and nothing else, the partitions sync quickly with no problems.
>>>>
>>>> - If the system starts with cman running, and I start drbd, it's
>>>> likely that system who is _not_ Outdated will be fenced (rebooted).
>>>> Same thing if cman+pacemaker is running.
>>>>
>>>> - Cloned ocf:heartbeat:exportfs resources are giving me problems as
>>>> well (which is why I tried making changes to that resource script).
>>>> Assume I start with one node running cman+pacemaker, and the other
>>>> stopped. I turned on the stopped node. This will typically result in
>>>> the running node being fenced, because it has it times out when
>>>> stopping the exportfs resource.
>>>>
>>>> Falling back to DRBD 8.3.12 didn't change this behavior.
>>>>
>>>> My pacemaker configuration is long, so I'll excerpt what I think are
>>>> the relevant pieces in the hope that it will be enough for someone to
>>>> say "You fool! This is covered in Pacemaker Explained page 56!" When
>>>> bringing up a stopped node, in order to restart AdminClone pacemaker
>>>> wants to stop ExportsClone, then Gfs2Clone, then ClvmdClone. As I said,
>>>> it's the failure to stop ExportMail on the running node that causes it
>>>> to be fenced.
>>>>
>>>> primitive AdminDrbd ocf:linbit:drbd \
>>>> params drbd_resource="admin" \
>>>> op monitor interval="60s" role="Master" \
>>>> op monitor interval="59s" role="Slave" \
>>>> op stop interval="0" timeout="320" \
>>>> op start interval="0" timeout="240"
>>>> ms AdminClone AdminDrbd \
>>>> meta master-max="2" master-node-max="1" \
>>>> clone-max="2" clone-node-max="1" notify="true"
>>>>
>>>> primitive Clvmd lsb:clvmd op monitor interval="30s"
>>>> clone ClvmdClone Clvmd
>>>> colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
>>>> order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
>>>>
>>>> primitive Gfs2 lsb:gfs2 op monitor interval="30s"
>>>> clone Gfs2Clone Gfs2
>>>> colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
>>>> order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
>>>>
>>>> primitive ExportMail ocf:heartbeat:exportfs \
>>>> op start interval="0" timeout="40" \
>>>> op stop interval="0" timeout="45" \
>>>> params clientspec="mail" directory="/mail" fsid="30"
>>>> clone ExportsClone ExportMail
>>>> colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
>>>> order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone


--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
PO Box 137 |
Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/
Attachments: smime.p7s (4.39 KB)


seligman at nevis

Mar 1, 2012, 9:16 AM

Post #15 of 18 (2345 views)
Permalink
Re: cman+pacemaker+drbd fencing problem [In reply to]

On 3/1/12 12:10 PM, William Seligman wrote:
> On 3/1/12 6:34 AM, emmanuel segura wrote:
>> try to change the fence daemon tag like this
>> ====================================
>> <fence_daemon clean_start="1" post_join_delay="30" />
>> ====================================
>> change your cluster config version and after reboot the cluster
>
> This did not change the behavior of the cluster. In particular, I'm still
> dealing with this:
>
>>>>> - If the system starts with cman running, and I start drbd, it's
>>>>> likely that the system which is _not_ Outdated will be fenced (rebooted).

This just happened again. Here's the log from the "bad" node, the one I stopped
and then restarted. cman is running (not pacemaker). I start drbd:

Mar 1 12:03:49 orestes-tb kernel: drbd: initialized. Version: 8.3.12
(api:88/proto:86-96)
Mar 1 12:03:49 orestes-tb kernel: drbd: GIT-hash:
e2a8ef4656be026bbae540305fcb998a5991090f build by
root [at] hypatia-tb, 2012-02-28 18:01:34
Mar 1 12:03:49 orestes-tb kernel: drbd: registered as block device major 147
Mar 1 12:03:49 orestes-tb kernel: drbd: minor_table @ 0xffff88041dbc4b80
Mar 1 12:03:49 orestes-tb kernel: block drbd0: Starting worker thread (from
cqueue [2942])
Mar 1 12:03:49 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
Mar 1 12:03:50 orestes-tb kernel: block drbd0: Found 57 transactions (57 active
extents) in activity log.
Mar 1 12:03:50 orestes-tb kernel: block drbd0: Method to ensure write ordering:
barrier
Mar 1 12:03:50 orestes-tb kernel: block drbd0: max BIO size = 130560
Mar 1 12:03:50 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing
device's (32 -> 768)
Mar 1 12:03:50 orestes-tb kernel: block drbd0: drbd_bm_resize called with
capacity == 5611549368
Mar 1 12:03:50 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671
words=10960058 pages=21407
Mar 1 12:03:50 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
Mar 1 12:03:50 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took
625 jiffies
Mar 1 12:03:50 orestes-tb kernel: block drbd0: recounting of set bits took
additional 86 jiffies
Mar 1 12:03:50 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync
by on disk bit-map.
Mar 1 12:03:50 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
Mar 1 12:03:50 orestes-tb kernel: block drbd0: attached to UUIDs
878999EFCFBE8E08:0000000000000000:494B48826E41A2C2:494A48826E41A2C3
Mar 1 12:03:50 orestes-tb kernel: block drbd0: conn( StandAlone -> Unconnected )
Mar 1 12:03:50 orestes-tb kernel: block drbd0: Starting receiver thread (from
drbd0_worker [2951])
Mar 1 12:03:50 orestes-tb kernel: block drbd0: receiver (re)started
Mar 1 12:03:50 orestes-tb kernel: block drbd0: conn( Unconnected -> WFConnection )
Mar 1 12:03:51 orestes-tb kernel: block drbd0: Handshake successful: Agreed
network protocol version 96
Mar 1 12:03:51 orestes-tb kernel: block drbd0: conn( WFConnection ->
WFReportParams )
Mar 1 12:03:51 orestes-tb kernel: block drbd0: Starting asender thread (from
drbd0_receiver [2965])
Mar 1 12:03:51 orestes-tb kernel: block drbd0: data-integrity-alg: <not-used>
Mar 1 12:03:51 orestes-tb kernel: block drbd0: drbd_sync_handshake:
Mar 1 12:03:51 orestes-tb kernel: block drbd0: self
878999EFCFBE8E08:0000000000000000:494B48826E41A2C2:494A48826E41A2C3 bits:0 flags:0
Mar 1 12:03:51 orestes-tb kernel: block drbd0: peer
D40A1613FAE8F5E9:878999EFCFBE8E09:878899EFCFBE8E09:494B48826E41A2C3 bits:0 flags:0
Mar 1 12:03:51 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
Mar 1 12:03:51 orestes-tb kernel: block drbd0: peer( Unknown -> Primary ) conn(
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Mar 1 12:03:53 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
Mar 1 12:04:01 orestes-tb corosync[2296]: [TOTEM ] A processor failed,
forming new configuration.
Mar 1 12:04:03 orestes-tb corosync[2296]: [QUORUM] Members[1]: 2
Mar 1 12:04:03 orestes-tb corosync[2296]: [TOTEM ] A processor joined or left
the membership and a new membership was formed.
Mar 1 12:04:03 orestes-tb kernel: dlm: closing connection to node 1
Mar 1 12:04:03 orestes-tb corosync[2296]: [CPG ] chosen downlist: sender
r(0) ip(129.236.252.14) r(1) ip(192.168.100.6) ; members(old:2 left:1)
Mar 1 12:04:03 orestes-tb corosync[2296]: [MAIN ] Completed service
synchronization, ready to provide service.
Mar 1 12:04:03 orestes-tb fenced[2350]: fencing node hypatia-tb.nevis.columbia.edu


As near as I can tell, the "bad" node sees that the "good" node is Primary and
UpToDate, goes into WFSyncUUID... and then corosync/cman cheerfully fences the
"good" node.

>> On 1 March 2012 at 12:28, William Seligman <seligman [at] nevis> wrote:
>>
>>> On 3/1/12 4:15 AM, emmanuel segura wrote:
>>>
>>>> can you show me your /etc/cluster/cluster.conf?
>>>>
>>>> because i think your problem it's a fencing-loop
>>>>
>>>
>>> Here it is:
>>>
>>> /etc/cluster/cluster.conf:
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="17" name="Nevis_HA">
>>> <logging debug="off"/>
>>> <cman expected_votes="1" two_node="1" />
>>> <clusternodes>
>>> <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
>>> <altname name="hypatia-private.nevis.columbia.edu" port="5405"
>>> mcast="226.94.1.1"/>
>>> <fence>
>>> <method name="pcmk-redirect">
>>> <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
>>> </method>
>>> </fence>
>>> </clusternode>
>>> <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
>>> <altname name="orestes-private.nevis.columbia.edu" port="5405"
>>> mcast="226.94.1.1"/>
>>> <fence>
>>> <method name="pcmk-redirect">
>>> <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
>>> </method>
>>> </fence>
>>> </clusternode>
>>> </clusternodes>
>>> <fencedevices>
>>> <fencedevice name="pcmk" agent="fence_pcmk"/>
>>> </fencedevices>
>>> <fence_daemon post_join_delay="30" />
>>> <rm disabled="1" />
>>> </cluster>
>>>
>>>
>>>
>>> On 1 March 2012 at 01:03, William Seligman <seligman [at] nevis> wrote:
>>>>
>>>> On 2/28/12 7:26 PM, Lars Ellenberg wrote:
>>>>>
>>>>>> On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
>>>>>>
>>>>>>> <off-topic>
>>>>>>> Sigh. I wish that were the reason.
>>>>>>>
>>>>>>> The reason why I'm doing dual-primary is that I've a got a
>>>>>>> single-primary two-node cluster in production that simply doesn't
>>>>>>> work. One node runs resources; the other sits and twiddles its
>>>>>>> fingers; fine. But when primary goes down, secondary has trouble
>>>>>>> starting up all the resources; when we've actually had primary
>>>>>>> failures (UPS goes haywire, hard drive failure) the secondary often
>>>>>>> winds up in a state in which it runs none of the significant
>>>>>>> resources.
>>>>>>>
>>>>>>> With the dual-primary setup I have now, both machines are running
>>>>>>> the resources that typically cause problems in my single-primary
>>>>>>> configuration. If one box goes down, the other doesn't have to
>>>>>>> failover anything; it's already running them. (I needed IPaddr2
>>>>>>> cloning to work properly for this to work, which is why I started
>>>>>>> that thread... and all the stupider of me for missing that crucial
>>>>>>> page in Clusters From Scratch.)
>>>>>>>
>>>>>>> My only remaining problem with the configuration is restoring a
>>>>>>> fenced node to the cluster. Hence my tests, and the reason why I
>>>>>>> started this thread.
>>>>>>> </off-topic>
>
>>>>>>>
>>>>>>
>>>>>> Uhm, I do think that is exactly on topic.
>>>>>>
>>>>>> Rather fix your resources to be able to successfully take over,
>>>>>> than add even more complexity.
>>>>>>
>>>>>> What resources would that be,
>>>>>> and why are they not taking over?
>>>>>>
>>>>>
>>>>> I can't tell you in detail, because the major snafu happened on a
>>>>> production system after a power outage a few months ago. My goal was to
>>>>> get the thing stable as quickly as possible. In the end, that turned
>>>>> out to be a non-HA configuration: One runs corosync+pacemaker+drbd,
>>>>> while the other just runs drbd. It works, in the sense that the users
>>>>> get their e-mail. If there's a power outage, I have to bring things up
>>>>> manually.
>>>>>
>>>>> So my only reference is the test-bench dual-primary setup I've got
>>>>> now, which is exhibiting the same kinds of problems even though the OS
>>>>> versions, software versions, and layout are different. This suggests
>>>>> that the problem lies in the way I'm setting up the configuration.
>>>>>
>>>>> The problems I have seem to be in the general category of "the 'good
>>>>> guy' gets fenced when the 'bad guy' gets into trouble." Examples:
>>>>>
>>>>> - Assuming I start out with two crashed nodes. If I just start up DRBD
>>>>> and nothing else, the partitions sync quickly with no problems.
>>>>>
>>>>> - If the system starts with cman running, and I start drbd, it's
>>>>> likely that system who is _not_ Outdated will be fenced (rebooted).
>>>>> Same thing if cman+pacemaker is running.
>>>>>
>>>>> - Cloned ocf:heartbeat:exportfs resources are giving me problems as
>>>>> well (which is why I tried making changes to that resource script).
>>>>> Assume I start with one node running cman+pacemaker, and the other
>>>>> stopped. I turned on the stopped node. This will typically result in
>>>>> the running node being fenced, because it has it times out when
>>>>> stopping the exportfs resource.
>>>>>
>>>>> Falling back to DRBD 8.3.12 didn't change this behavior.
>>>>>
>>>>> My pacemaker configuration is long, so I'll excerpt what I think are
>>>>> the relevant pieces in the hope that it will be enough for someone to
>>>>> say "You fool! This is covered in Pacemaker Explained page 56!" When
>>>>> bringing up a stopped node, in order to restart AdminClone pacemaker
>>>>> wants to stop ExportsClone, then Gfs2Clone, then ClvmdClone. As I said,
>>>>> it's the failure to stop ExportMail on the running node that causes it
>>>>> to be fenced.
>>>>>
>>>>> primitive AdminDrbd ocf:linbit:drbd \
>>>>> params drbd_resource="admin" \
>>>>> op monitor interval="60s" role="Master" \
>>>>> op monitor interval="59s" role="Slave" \
>>>>> op stop interval="0" timeout="320" \
>>>>> op start interval="0" timeout="240"
>>>>> ms AdminClone AdminDrbd \
>>>>> meta master-max="2" master-node-max="1" \
>>>>> clone-max="2" clone-node-max="1" notify="true"
>>>>>
>>>>> primitive Clvmd lsb:clvmd op monitor interval="30s"
>>>>> clone ClvmdClone Clvmd
>>>>> colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
>>>>> order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
>>>>>
>>>>> primitive Gfs2 lsb:gfs2 op monitor interval="30s"
>>>>> clone Gfs2Clone Gfs2
>>>>> colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
>>>>> order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
>>>>>
>>>>> primitive ExportMail ocf:heartbeat:exportfs \
>>>>> op start interval="0" timeout="40" \
>>>>> op stop interval="0" timeout="45" \
>>>>> params clientspec="mail" directory="/mail" fsid="30"
>>>>> clone ExportsClone ExportMail
>>>>> colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
>>>>> order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone
>
>


--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
PO Box 137 |
Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/
Attachments: smime.p7s (4.39 KB)


emi2fast at gmail

Mar 1, 2012, 9:27 AM

Post #16 of 18 (2355 views)
Permalink
Re: cman+pacemaker+drbd fencing problem [In reply to]

OK William,

if this isn't the problem, then show me your pacemaker CIB XML:

crm configure show > OUTPUT

On 1 March 2012 at 18:10, William Seligman <seligman [at] nevis> wrote:

> On 3/1/12 6:34 AM, emmanuel segura wrote:
> > try to change the fence daemon tag like this
> > ====================================
> > <fence_daemon clean_start="1" post_join_delay="30" />
> > ====================================
> > change your cluster config version and after reboot the cluster
>
> This did not change the behavior of the cluster. In particular, I'm still
> dealing with this:
>
> >>>> - If the system starts with cman running, and I start drbd, it's
> >>>> likely that system who is _not_ Outdated will be fenced (rebooted).
>
> > On 1 March 2012 at 12:28, William Seligman <seligman [at] nevis> wrote:
> >
> >> On 3/1/12 4:15 AM, emmanuel segura wrote:
> >>
> >>> can you show me your /etc/cluster/cluster.conf?
> >>>
> >>> because i think your problem it's a fencing-loop
> >>>
> >>
> >> Here it is:
> >>
> >> /etc/cluster/cluster.conf:
> >>
> >> <?xml version="1.0"?>
> >> <cluster config_version="17" name="Nevis_HA">
> >> <logging debug="off"/>
> >> <cman expected_votes="1" two_node="1" />
> >> <clusternodes>
> >> <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
> >> <altname name="hypatia-private.nevis.columbia.edu" port="5405"
> >> mcast="226.94.1.1"/>
> >> <fence>
> >> <method name="pcmk-redirect">
> >> <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
> >> </method>
> >> </fence>
> >> </clusternode>
> >> <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
> >> <altname name="orestes-private.nevis.columbia.edu" port="5405"
> >> mcast="226.94.1.1"/>
> >> <fence>
> >> <method name="pcmk-redirect">
> >> <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
> >> </method>
> >> </fence>
> >> </clusternode>
> >> </clusternodes>
> >> <fencedevices>
> >> <fencedevice name="pcmk" agent="fence_pcmk"/>
> >> </fencedevices>
> >> <fence_daemon post_join_delay="30" />
> >> <rm disabled="1" />
> >> </cluster>
> >>
> >>
> >>
> >> On 1 March 2012 at 01:03, William Seligman <seligman [at] nevis> wrote:
> >>>
> >>> On 2/28/12 7:26 PM, Lars Ellenberg wrote:
> >>>>
> >>>>> On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
> >>>>>
> >>>>>> <off-topic>
> >>>>>> Sigh. I wish that were the reason.
> >>>>>>
> >>>>>> The reason why I'm doing dual-primary is that I've a got a
> >>>>>> single-primary two-node cluster in production that simply doesn't
> >>>>>> work. One node runs resources; the other sits and twiddles its
> >>>>>> fingers; fine. But when primary goes down, secondary has trouble
> >>>>>> starting up all the resources; when we've actually had primary
> >>>>>> failures (UPS goes haywire, hard drive failure) the secondary often
> >>>>>> winds up in a state in which it runs none of the significant
> >>>>>> resources.
> >>>>>>
> >>>>>> With the dual-primary setup I have now, both machines are running
> >>>>>> the resources that typically cause problems in my single-primary
> >>>>>> configuration. If one box goes down, the other doesn't have to
> >>>>>> failover anything; it's already running them. (I needed IPaddr2
> >>>>>> cloning to work properly for this to work, which is why I started
> >>>>>> that thread... and all the stupider of me for missing that crucial
> >>>>>> page in Clusters From Scratch.)
> >>>>>>
> >>>>>> My only remaining problem with the configuration is restoring a
> >>>>>> fenced node to the cluster. Hence my tests, and the reason why I
> >>>>>> started this thread.
> >>>>>> </off-topic>
>
> >>>>>>
> >>>>>
> >>>>> Uhm, I do think that is exactly on topic.
> >>>>>
> >>>>> Rather fix your resources to be able to successfully take over,
> >>>>> than add even more complexity.
> >>>>>
> >>>>> What resources would that be,
> >>>>> and why are they not taking over?
> >>>>>
> >>>>
> >>>> I can't tell you in detail, because the major snafu happened on a
> >>>> production system after a power outage a few months ago. My goal was
> to
> >>>> get the thing stable as quickly as possible. In the end, that turned
> >>>> out to be a non-HA configuration: One runs corosync+pacemaker+drbd,
> >>>> while the other just runs drbd. It works, in the sense that the users
> >>>> get their e-mail. If there's a power outage, I have to bring things up
> >>>> manually.
> >>>>
> >>>> So my only reference is the test-bench dual-primary setup I've got
> >>>> now, which is exhibiting the same kinds of problems even though the OS
> >>>> versions, software versions, and layout are different. This suggests
> >>>> that the problem lies in the way I'm setting up the configuration.
> >>>>
> >>>> The problems I have seem to be in the general category of "the 'good
> >>>> guy' gets fenced when the 'bad guy' gets into trouble." Examples:
> >>>>
> >>>> - Assuming I start out with two crashed nodes. If I just start up DRBD
> >>>> and nothing else, the partitions sync quickly with no problems.
> >>>>
> >>>> - If the system starts with cman running, and I start drbd, it's
> >>>> likely that system who is _not_ Outdated will be fenced (rebooted).
> >>>> Same thing if cman+pacemaker is running.
> >>>>
> >>>> - Cloned ocf:heartbeat:exportfs resources are giving me problems as
> >>>> well (which is why I tried making changes to that resource script).
> >>>> Assume I start with one node running cman+pacemaker, and the other
> >>>> stopped. I turned on the stopped node. This will typically result in
> >>>> the running node being fenced, because it has it times out when
> >>>> stopping the exportfs resource.
> >>>>
> >>>> Falling back to DRBD 8.3.12 didn't change this behavior.
> >>>>
> >>>> My pacemaker configuration is long, so I'll excerpt what I think are
> >>>> the relevant pieces in the hope that it will be enough for someone to
> >>>> say "You fool! This is covered in Pacemaker Explained page 56!" When
> >>>> bringing up a stopped node, in order to restart AdminClone pacemaker
> >>>> wants to stop ExportsClone, then Gfs2Clone, then ClvmdClone. As I
> said,
> >>>> it's the failure to stop ExportMail on the running node that causes it
> >>>> to be fenced.
> >>>>
> >>>> primitive AdminDrbd ocf:linbit:drbd \
> >>>> params drbd_resource="admin" \
> >>>> op monitor interval="60s" role="Master" \
> >>>> op monitor interval="59s" role="Slave" \
> >>>> op stop interval="0" timeout="320" \
> >>>> op start interval="0" timeout="240"
> >>>> ms AdminClone AdminDrbd \
> >>>> meta master-max="2" master-node-max="1" \
> >>>> clone-max="2" clone-node-max="1" notify="true"
> >>>>
> >>>> primitive Clvmd lsb:clvmd op monitor interval="30s"
> >>>> clone ClvmdClone Clvmd
> >>>> colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
> >>>> order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
> >>>>
> >>>> primitive Gfs2 lsb:gfs2 op monitor interval="30s"
> >>>> clone Gfs2Clone Gfs2
> >>>> colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
> >>>> order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
> >>>>
> >>>> primitive ExportMail ocf:heartbeat:exportfs \
> >>>> op start interval="0" timeout="40" \
> >>>> op stop interval="0" timeout="45" \
> >>>> params clientspec="mail" directory="/mail" fsid="30"
> >>>> clone ExportsClone ExportMail
> >>>> colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
> >>>> order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone
>
>
> --
> Bill Seligman | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
> PO Box 137 |
> Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/
>
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA [at] lists
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>



--
esta es mi vida e me la vivo hasta que dios quiera
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


lars.ellenberg at linbit

Mar 1, 2012, 9:56 AM

Post #17 of 18 (2332 views)
Permalink
Re: cman+pacemaker+drbd fencing problem [In reply to]

On Thu, Mar 01, 2012 at 12:16:17PM -0500, William Seligman wrote:
> On 3/1/12 12:10 PM, William Seligman wrote:
> > On 3/1/12 6:34 AM, emmanuel segura wrote:
> >> try to change the fence daemon tag like this
> >> ====================================
> >> <fence_daemon clean_start="1" post_join_delay="30" />
> >> ====================================
> >> change your cluster config version and after reboot the cluster
> >
> > This did not change the behavior of the cluster. In particular, I'm still
> > dealing with this:
> >
> >>>>> - If the system starts with cman running, and I start drbd, it's
> >>>>> likely that system who is _not_ Outdated will be fenced (rebooted).
>
> This just happened again. Here's the log from the "bad" node, the one I stopped
> and then restarted. cman is running (not pacemaker). I start drbd:
>
> Mar 1 12:03:49 orestes-tb kernel: drbd: initialized. Version: 8.3.12
> (api:88/proto:86-96)
> Mar 1 12:03:49 orestes-tb kernel: drbd: GIT-hash:
> e2a8ef4656be026bbae540305fcb998a5991090f build by
> root [at] hypatia-tb, 2012-02-28 18:01:34
> Mar 1 12:03:49 orestes-tb kernel: drbd: registered as block device major 147
> Mar 1 12:03:49 orestes-tb kernel: drbd: minor_table @ 0xffff88041dbc4b80
> Mar 1 12:03:49 orestes-tb kernel: block drbd0: Starting worker thread (from
> cqueue [2942])
> Mar 1 12:03:49 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: Found 57 transactions (57 active
> extents) in activity log.
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: Method to ensure write ordering:
> barrier
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: max BIO size = 130560
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing
> device's (32 -> 768)
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: drbd_bm_resize called with
> capacity == 5611549368
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671
> words=10960058 pages=21407
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took
> 625 jiffies
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: recounting of set bits took
> additional 86 jiffies
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync
> by on disk bit-map.
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: attached to UUIDs
> 878999EFCFBE8E08:0000000000000000:494B48826E41A2C2:494A48826E41A2C3
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: conn( StandAlone -> Unconnected )
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: Starting receiver thread (from
> drbd0_worker [2951])
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: receiver (re)started
> Mar 1 12:03:50 orestes-tb kernel: block drbd0: conn( Unconnected -> WFConnection )
> Mar 1 12:03:51 orestes-tb kernel: block drbd0: Handshake successful: Agreed
> network protocol version 96
> Mar 1 12:03:51 orestes-tb kernel: block drbd0: conn( WFConnection ->
> WFReportParams )
> Mar 1 12:03:51 orestes-tb kernel: block drbd0: Starting asender thread (from
> drbd0_receiver [2965])
> Mar 1 12:03:51 orestes-tb kernel: block drbd0: data-integrity-alg: <not-used>
> Mar 1 12:03:51 orestes-tb kernel: block drbd0: drbd_sync_handshake:
> Mar 1 12:03:51 orestes-tb kernel: block drbd0: self
> 878999EFCFBE8E08:0000000000000000:494B48826E41A2C2:494A48826E41A2C3 bits:0 flags:0
> Mar 1 12:03:51 orestes-tb kernel: block drbd0: peer
> D40A1613FAE8F5E9:878999EFCFBE8E09:878899EFCFBE8E09:494B48826E41A2C3 bits:0 flags:0
> Mar 1 12:03:51 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
> Mar 1 12:03:51 orestes-tb kernel: block drbd0: peer( Unknown -> Primary ) conn(
> WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Mar 1 12:03:53 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
> Mar 1 12:04:01 orestes-tb corosync[2296]: [TOTEM ] A processor failed,
> forming new configuration.

some random thoughts...

DRBD Bitmap exchange causes congestion on Network, packet storm, irq
storm, whatever, and UDP cluster comm packets "falling on the floor"?

Can you change your cluster comm to use an (additional?) dedicated link?
Or play with (increase) totem timeouts? Or play with some sysctls to
make it less likely for UDP to "fall on the floor"; if that is what is
happening.

Maybe if you tcpdump the traffic while you start things up, that could
give you some hints as to why corosync thinks that "A processor failed",
and it has to fence that failed processor...
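
To make that concrete (illustrative values only; "eth1" stands in for
whichever interface carries the corosync traffic on this cluster):

# give totem more slack in cluster.conf before it declares a node dead, e.g.
#   <totem token="20000"/>
# and watch the cluster traffic while DRBD does its bitmap exchange:
tcpdump -i eth1 -n udp port 5405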

> Mar 1 12:04:03 orestes-tb corosync[2296]: [QUORUM] Members[1]: 2
> Mar 1 12:04:03 orestes-tb corosync[2296]: [TOTEM ] A processor
> joined or left the membership and a new membership was formed.
> Mar 1 12:04:03 orestes-tb kernel: dlm: closing connection to node 1
> Mar 1 12:04:03 orestes-tb corosync[2296]: [CPG ] chosen downlist: sender
> r(0) ip(129.236.252.14) r(1) ip(192.168.100.6) ; members(old:2 left:1)
> Mar 1 12:04:03 orestes-tb corosync[2296]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Mar 1 12:04:03 orestes-tb fenced[2350]: fencing node hypatia-tb.nevis.columbia.edu
>
>
> As near as I can tell, the "bad" node sees that the "good" node is Primary and
> UpToDate, goes into WFSyncUUID... and then corosync/cman cheerfully fences the
> "good" node.


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD and LINBIT are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


seligman at nevis

Mar 1, 2012, 12:54 PM

Post #18 of 18 (2362 views)
Permalink
Re: cman+pacemaker+drbd fencing problem - SOLVED [In reply to]

On 3/1/12 12:56 PM, Lars Ellenberg wrote:
> On Thu, Mar 01, 2012 at 12:16:17PM -0500, William Seligman wrote:
>> On 3/1/12 12:10 PM, William Seligman wrote:
>>> On 3/1/12 6:34 AM, emmanuel segura wrote:
>>>> try to change the fence daemon tag like this
>>>> ====================================
>>>> <fence_daemon clean_start="1" post_join_delay="30" />
>>>> ====================================
>>>> change your cluster config version and after reboot the cluster
>>>
>>> This did not change the behavior of the cluster. In particular, I'm still
>>> dealing with this:
>>>
>>>>>>> - If the system starts with cman running, and I start drbd, it's
>>>>>>> likely that system who is _not_ Outdated will be fenced (rebooted).
>>
>> This just happened again. Here's the log from the "bad" node, the one I stopped
>> and then restarted. cman is running (not pacemaker). I start drbd:
>>
>> Mar 1 12:03:49 orestes-tb kernel: drbd: initialized. Version: 8.3.12
>> (api:88/proto:86-96)
>> Mar 1 12:03:49 orestes-tb kernel: drbd: GIT-hash:
>> e2a8ef4656be026bbae540305fcb998a5991090f build by
>> root [at] hypatia-tb, 2012-02-28 18:01:34
>> Mar 1 12:03:49 orestes-tb kernel: drbd: registered as block device major 147
>> Mar 1 12:03:49 orestes-tb kernel: drbd: minor_table @ 0xffff88041dbc4b80
>> Mar 1 12:03:49 orestes-tb kernel: block drbd0: Starting worker thread (from
>> cqueue [2942])
>> Mar 1 12:03:49 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: Found 57 transactions (57 active
>> extents) in activity log.
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: Method to ensure write ordering:
>> barrier
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: max BIO size = 130560
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing
>> device's (32 -> 768)
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: drbd_bm_resize called with
>> capacity == 5611549368
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671
>> words=10960058 pages=21407
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took
>> 625 jiffies
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: recounting of set bits took
>> additional 86 jiffies
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync
>> by on disk bit-map.
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: attached to UUIDs
>> 878999EFCFBE8E08:0000000000000000:494B48826E41A2C2:494A48826E41A2C3
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: conn( StandAlone -> Unconnected )
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: Starting receiver thread (from
>> drbd0_worker [2951])
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: receiver (re)started
>> Mar 1 12:03:50 orestes-tb kernel: block drbd0: conn( Unconnected -> WFConnection )
>> Mar 1 12:03:51 orestes-tb kernel: block drbd0: Handshake successful: Agreed
>> network protocol version 96
>> Mar 1 12:03:51 orestes-tb kernel: block drbd0: conn( WFConnection ->
>> WFReportParams )
>> Mar 1 12:03:51 orestes-tb kernel: block drbd0: Starting asender thread (from
>> drbd0_receiver [2965])
>> Mar 1 12:03:51 orestes-tb kernel: block drbd0: data-integrity-alg: <not-used>
>> Mar 1 12:03:51 orestes-tb kernel: block drbd0: drbd_sync_handshake:
>> Mar 1 12:03:51 orestes-tb kernel: block drbd0: self
>> 878999EFCFBE8E08:0000000000000000:494B48826E41A2C2:494A48826E41A2C3 bits:0 flags:0
>> Mar 1 12:03:51 orestes-tb kernel: block drbd0: peer
>> D40A1613FAE8F5E9:878999EFCFBE8E09:878899EFCFBE8E09:494B48826E41A2C3 bits:0 flags:0
>> Mar 1 12:03:51 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
>> Mar 1 12:03:51 orestes-tb kernel: block drbd0: peer( Unknown -> Primary ) conn(
>> WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
>> Mar 1 12:03:53 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
>> Mar 1 12:04:01 orestes-tb corosync[2296]: [TOTEM ] A processor failed,
>> forming new configuration.
>
> some random thoughts...
>
> DRBD Bitmap exchange causes congestion on Network, packet storm, irq
> storm, whatever, and UDP cluster comm packets "falling on the floor"?
>
> Can you change your cluster comm to use an (additional?) dedicated link?
> Or play with (increase) totem timeouts? Or play with some sysctls to
> make it less likely for UDP to "fall on the floor"; if that is what is
> happening.
>
> Maybe if you tcpdump the traffic while you start things up, that could
> give you some hints as to why corosync thinks that "A processor failed",
> and it has to fence that failed processor...
>
>> Mar 1 12:04:03 orestes-tb corosync[2296]: [QUORUM] Members[1]: 2
>> Mar 1 12:04:03 orestes-tb corosync[2296]: [TOTEM ] A processor
>> joined or left the membership and a new membership was formed.
>> Mar 1 12:04:03 orestes-tb kernel: dlm: closing connection to node 1
>> Mar 1 12:04:03 orestes-tb corosync[2296]: [CPG ] chosen downlist: sender
>> r(0) ip(129.236.252.14) r(1) ip(192.168.100.6) ; members(old:2 left:1)
>> Mar 1 12:04:03 orestes-tb corosync[2296]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Mar 1 12:04:03 orestes-tb fenced[2350]: fencing node hypatia-tb.nevis.columbia.edu
>>
>>
>> As near as I can tell, the "bad" node sees that the "good" node is Primary and
>> UpToDate, goes into WFSyncUUID... and then corosync/cman cheerfully fences the
>> "good" node.

I've solved the problem. Yes, it was because I did something silly. It was Lars
Ellenberg who provided the key when he suggested I think about communication
between the nodes.


The problem was in cluster.conf. Here's the one that failed:

<?xml version="1.0"?>
<cluster config_version="17" name="Nevis_HA">
<logging debug="off"/>
<cman expected_votes="1" two_node="1" />
<clusternodes>
<clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
<altname name="hypatia-private.nevis.columbia.edu" port="5405"
mcast="226.94.1.1"/>
<fence>
<method name="pcmk-redirect">
<device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
</method>
</fence>
</clusternode>
<clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
<altname name="orestes-private.nevis.columbia.edu" port="5405"
mcast="226.94.1.1"/>
<fence>
<method name="pcmk-redirect">
<device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice name="pcmk" agent="fence_pcmk"/>
</fencedevices>
<fence_daemon post_join_delay="30" />
<rm disabled="1" />
</cluster>


The problem is that I was using RRP by putting in the altname tags. I thought I
was being cautious; if cman+corosync couldn't talk over our public network, at
least they could talk over the dedicated 1GB link between the two nodes. This is
the same link DRBD uses.

The cluster.conf that works is:

<?xml version="1.0"?>
<cluster config_version="23" name="Nevis_HA">
<logging debug="off"/>
<cman expected_votes="1" two_node="1" />
<clusternodes>
<clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
<fence>
<method name="pcmk-redirect">
<device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
</method>
</fence>
</clusternode>
<clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
<fence>
<method name="pcmk-redirect">
<device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice name="pcmk" agent="fence_pcmk"/>
</fencedevices>
<fence_daemon clean_start="1" post_join_delay="30" />
<rm disabled="1" />
</cluster>


Just take out the altname tags, which turns off RRP.

As a test, since I wasn't sure whether rrp_mode="active" or "passive" was the
default, I tried each of them. Both caused my fencing problem.
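
(If anyone wants to see which rings their own corosync actually ended up
using, and whether either one is marked faulty, this should show it:)

corosync-cfgtool -s     # prints the status of each configured totem ring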

I guess the lesson (which I now recall was stated in documentation somewhere) is
not to let corosync talk over the same link that DRBD is using.

I still wish I could use rrp, but the boxes in my cluster only have two ethernet
ports each. I'll have to see if I can talk my superiors into getting an
expansion card.
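
If I do get that card, the idea would presumably be to point the altname at a
network that DRBD doesn't touch; something like this hypothetical fragment (the
"-sync" hostnames are made up for illustration and would live on the new, third
network):

<clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
<altname name="hypatia-sync.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
...
</clusternode>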
--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
PO Box 137 |
Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman
Attachments: smime.p7s (4.39 KB)

Linux-HA users RSS feed   Index | Next | Previous | View Threaded
 
 

