
Mailing List Archive: DRBD: Users

Promote fails in state = { cs:WFConnection ro:Secondary/Unknown ds:Consistent/DUnknown r--- }

 

 



theoren28 at hotmail

Jan 19, 2012, 3:52 AM

Post #1 of 4
Promote fails in state = { cs:WFConnection ro:Secondary/Unknown ds:Consistent/DUnknown r--- }

Hi everyone,
First, I would like to express my pleasure using DRBD!
Here is my situation:

Two-node setup, using cman and pacemaker, don't care about quorum, no stonith
Master-Slave DRBD resource
Fence resource only (a rough sketch of this kind of setup follows below)
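For reference, the Pacemaker side of a setup like this is commonly created along the following lines with the crm shell. This is only a sketch, not the poster's actual CIB; the primitive and ms names are made up, and the DRBD side (fencing resource-only; plus the crm-fence-peer.sh fence-peer handler that shows up in the log below) lives in the attached DRBD configuration:

# sketch only -- names are assumptions, adapt to your own configuration
crm configure primitive p_drbd_fsroot ocf:linbit:drbd \
        params drbd_resource="fsroot" \
        op monitor interval="29s" role="Master" \
        op monitor interval="31s" role="Slave"
crm configure ms ms_drbd_fsroot p_drbd_fsroot \
        meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" notify="true"
crm configure property stonith-enabled="false" no-quorum-policy="ignore"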
I noticed that under certain conditions (powering the nodes on and off enough times) the secondary node may never be promoted when the primary is shut down.
Here is a sample log (attached)

Jan 18 08:34:52 NODE-1 crmd: [2054]: info: do_lrm_rsc_op: Performing key=7:89911:0:aac20e27-939f-439c-b461-e668262718b3 op=drbd_fsroot:0_promote_0 )
Jan 18 08:34:52 NODE-1 lrmd: [2051]: info: rsc:drbd_fsroot:0:299768: promote
Jan 18 08:34:52 NODE-1 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
Jan 18 08:34:52 NODE-1 corosync[1759]: [TOTEM ] Automatically recovered ring 1
Jan 18 08:34:53 NODE-1 crm-fence-peer.sh[24325]: invoked for fsroot
Jan 18 08:34:53 NODE-1 corosync[1759]: [TOTEM ] Automatically recovered ring 1
Jan 18 08:34:53 NODE-1 crm-fence-peer.sh[24325]: WARNING peer is unreachable, my disk is Consistent: did not place the constraint!
Jan 18 08:34:53 NODE-1 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 5 (0x500)
Jan 18 08:34:53 NODE-1 kernel: block drbd0: fence-peer helper returned 5 (peer unreachable, doing nothing since disk != UpToDate)
Jan 18 08:34:53 NODE-1 kernel: block drbd0: State change failed: Need access to UpToDate data
Jan 18 08:34:53 NODE-1 kernel: block drbd0: state = { cs:WFConnection ro:Secondary/Unknown ds:Consistent/DUnknown r--- }
Jan 18 08:34:53 NODE-1 kernel: block drbd0: wanted = { cs:WFConnection ro:Primary/Unknown ds:Consistent/DUnknown r--- }
Jan 18 08:34:53 NODE-1 lrmd: [2051]: info: RA output: (drbd_fsroot:0:promote:stderr) 0: State change failed: (-2) Need access to UpToDate data
Jan 18 08:34:53 NODE-1 lrmd: [2051]: info: RA output: (drbd_fsroot:0:promote:stderr) Command 'drbdsetup 0 primary' terminated with exit code 17
Jan 18 08:34:53 NODE-1 drbd[24286]: ERROR: fsroot: Called drbdadm -c /etc/drbd.conf primary fsroot
Jan 18 08:34:53 NODE-1 drbd[24286]: ERROR: fsroot: Exit code 17
Jan 18 08:34:53 NODE-1 drbd[24286]: ERROR: fsroot: Command output:
Jan 18 08:34:53 NODE-1 lrmd: [2051]: info: RA output: (drbd_fsroot:0:promote:stdout)
Jan 18 08:34:53 NODE-1 drbd[24286]: CRIT: Refusing to be promoted to Primary without UpToDate data
Jan 18 08:34:53 NODE-1 lrmd: [2051]: WARN: Managed drbd_fsroot:0:promote process 24286 exited with return code 1.
Jan 18 08:34:53 NODE-1 crmd: [2054]: info: process_lrm_event: LRM operation drbd_fsroot:0_promote_0 (call=299768, rc=1, cib-update=209843, confirmed=true) unknown error
Jan 18 08:34:53 NODE-1 crmd: [2054]: WARN: status_from_rc: Action 7 (drbd_fsroot:0_promote_0) on NODE-1 failed (target: 0 vs. rc: 1): Error
Jan 18 08:34:53 NODE-1 crmd: [2054]: WARN: update_failcount: Updating failcount for drbd_fsroot:0 on NODE-1 after failed promote: rc=1 (update=value++, time=1326893693)
Jan 18 08:34:53 NODE-1 attrd: [2052]: info: attrd_local_callback: Expanded fail-count-drbd_fsroot:0=value++ to 29977
Jan 18 08:34:53 NODE-1 attrd: [2052]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-drbd_fsroot:0 (29977)
Jan 18 08:34:53 NODE-1 crmd: [2054]: info: abort_transition_graph: match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=drbd_fsroot:0_last_failure_0, magic=0:1;7:89911:0:aac20e27-939f-439c-b461-e668262718b3, cib=0.6.263577) : Event failed

It seems to me that not promoting/fencing is the worse alternative when the other node really is shut down and no stonith is configured.
As a workaround, changing the following lines in /usr/lib/drbd/crm-fence-peer.sh solves this:
...
try_place_constraint() ...
- unreachable/Consistent/outdated)
+ unreachable/Consistent/outdated|\
+ unreachable/Consistent/unknown)
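For context, these patterns sit in a case statement inside try_place_constraint() that dispatches on a combined "<peer reachability>/<local disk state>/<peer disk state>" string. A sketch of the patched arm, with placeholder variable and action names (only the two pattern lines above are taken from the actual script):

# sketch only -- placeholder names, not the upstream script verbatim
case "$peer_reachable/$local_disk_state/$peer_disk_state" in
unreachable/Consistent/outdated|\
unreachable/Consistent/unknown)
        # peer unreachable and local data merely Consistent: with the workaround,
        # the fencing constraint is placed anyway, so Pacemaker may promote this node
        place_constraint=yes
        ;;
esac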
What say you?

I am using:
Linux 2.6.32-220.2.1.el6.i686 #1 SMP Thu Dec 22 18:50:52 GMT 2011 i686 i686 i386 GNU/Linux
kmod-drbd83-8.3.8-1.el6.i686
drbd83-8.3.8-1.el6.i686
corosync-1.4.1-4.el6.i686
corosynclib-1.4.1-4.el6.i686
pacemaker-1.1.6-3.el6.i686
pacemaker-libs-1.1.6-3.el6.i686
pacemaker-cluster-libs-1.1.6-3.el6.i686
pacemaker-cli-1.1.6-3.el6.i686
cman-3.0.12.1-23.el6.i686

Best,
Oren
Attachments: messages.1.gz (75.6 KB)
  messages.2.gz (46.6 KB)
  fsroot.res (0.44 KB)
  fsglobal_common.conf (2.60 KB)


lars.ellenberg at linbit

Jan 19, 2012, 2:15 PM

Post #2 of 4
Re: Promote fails in state = { cs:WFConnection ro:Secondary/Unknown ds:Consistent/DUnknown r--- } [In reply to]

On Thu, Jan 19, 2012 at 11:52:03AM +0000, Oren Nechushtan wrote:
>
>
>
>
> Hi everyone,
> First, I would like to express my pleasure using DRBD!
> Here is my situation:
>
> Two-node setup, using cman and pacemaker, don't care about quorum, no stonith
> Master-Slave DRBD resource
> Fence resource only
> I noticed that under certain conditions (powering the nodes on and off enough times) the secondary node may never be promoted when the primary is shut down.

I *think* that is intentional, and preventing potential data divergence,
in the following scenario:

* all good, Primary --- connected --- Secondary
* Kill Secondary, Primary continues.
* Powerdown Primary.
* Bring up Secondary only.

What use is fencing, if a fencing loop would cause data divergence anyways.
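Spelled out from the point of view of the surviving node, this is roughly what the sequence looks like (a sketch; the node names are made up, and the resource name "fsroot" and the commands assume the drbd83 userland from the original post):

# node-A Primary <--connected--> node-B Secondary, both UpToDate
# 1. node-B (the Secondary) is powered off; node-A keeps writing, so node-B's copy goes stale
# 2. node-A (the Primary) is powered off as well
# 3. only node-B is booted: its data is Consistent, but not UpToDate, and it cannot
#    know whether node-A wrote anything while it was away
cat /proc/drbd          # on node-B: cs:WFConnection ro:Secondary/Unknown ds:Consistent/DUnknown
drbdadm dstate fsroot   # prints "Consistent/DUnknown"
# Promoting node-B now would publish the possibly stale data; if node-A later
# returns with its newer data, the two histories have diverged.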

> Here is a sample log (attached)
>
> Jan 18 08:34:52 NODE-1 crmd: [2054]: info: do_lrm_rsc_op: Performing key=7:89911:0:aac20e27-939f-439c-b461-e668262718b3 op=drbd_fsroot:0_promote_0 )
> Jan 18 08:34:52 NODE-1 lrmd: [2051]: info: rsc:drbd_fsroot:0:299768: promote
> Jan 18 08:34:52 NODE-1 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
> Jan 18 08:34:52 NODE-1 corosync[1759]: [TOTEM ] Automatically recovered ring 1
> Jan 18 08:34:53 NODE-1 crm-fence-peer.sh[24325]: invoked for fsroot
> Jan 18 08:34:53 NODE-1 corosync[1759]: [TOTEM ] Automatically recovered ring 1

> Jan 18 08:34:53 NODE-1 crm-fence-peer.sh[24325]: WARNING peer is unreachable, my disk is Consistent: did not place the constraint!

This is it.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


theoren28 at hotmail

Jan 19, 2012, 10:55 PM

Post #3 of 4
Re: Promote fails in state = { cs:WFConnection ro:Secondary/Unknown ds:Consistent/DUnknown r--- } [In reply to]

It seems to me that adding a configuration timeout indicating how long to wait before allowing promotion is required, possibly indefinite by default.
I understand why you might want to wait for either the primary to come up again or manual recovery.
However, in an active/standby two-node setup where the system is required to be up ALL the time, there is another approach:
Promote the old secondary after a timeout.
If the old primary was down for a long time, we are up quickly and the old primary should sync when it returns - fine.
If the old primary was down only briefly but beyond the timeout, the split-brain handlers should recover, possibly with manual intervention.
Acceptable, since we couldn't wait forever.

What say you?
Oren

> Date: Thu, 19 Jan 2012 23:15:00 +0100
> From: lars.ellenberg [at] linbit
> To: drbd-user [at] lists
> Subject: Re: [DRBD-user] Promote fails in state = { cs:WFConnection ro:Secondary/Unknown ds:Consistent/DUnknown r--- }
>
> On Thu, Jan 19, 2012 at 11:52:03AM +0000, Oren Nechushtan wrote:
> >
> >
> >
> >
> > Hi everyone,
> > First, I would like to express my pleasure using DRBD!
> > Here is my situation:
> >
> > Two-node setup, using cman and pacemaker, don't care about quorum, no stonith
> > Master-Slave DRBD resource
> > Fence resource only
> > I noticed that under certain conditions (powering the nodes on and off enough times) the secondary node may never be promoted when the primary is shut down.
>
> I *think* that is intentional, and preventing potential data divergence,
> in the following scenario:
>
> * all good, Primary --- connected --- Secondary
> * Kill Secondary, Primary continues.
> * Powerdown Primary.
> * Bring up Secondary only.
>
> What use is fencing, if a fencing loop would cause data divergence anyways.
>
> > Here is a sample log (attached)
> >
> > Jan 18 08:34:52 NODE-1 crmd: [2054]: info: do_lrm_rsc_op: Performing key=7:89911:0:aac20e27-939f-439c-b461-e668262718b3 op=drbd_fsroot:0_promote_0 )
> > Jan 18 08:34:52 NODE-1 lrmd: [2051]: info: rsc:drbd_fsroot:0:299768: promote
> > Jan 18 08:34:52 NODE-1 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
> > Jan 18 08:34:52 NODE-1 corosync[1759]: [TOTEM ] Automatically recovered ring 1
> > Jan 18 08:34:53 NODE-1 crm-fence-peer.sh[24325]: invoked for fsroot
> > Jan 18 08:34:53 NODE-1 corosync[1759]: [TOTEM ] Automatically recovered ring 1
>
> > Jan 18 08:34:53 NODE-1 crm-fence-peer.sh[24325]: WARNING peer is unreachable, my disk is Consistent: did not place the constraint!
>
> This is it.
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> _______________________________________________
> drbd-user mailing list
> drbd-user [at] lists
> http://lists.linbit.com/mailman/listinfo/drbd-user


lars.ellenberg at linbit

Jan 20, 2012, 12:57 AM

Post #4 of 4
Re: Promote fails in state = { cs:WFConnection ro:Secondary/Unknown ds:Consistent/DUnknown r--- } [In reply to]

On Fri, Jan 20, 2012 at 06:55:21AM +0000, Oren Nechushtan wrote:
> It seems to me that adding a configuration timeout indicating how long to wait before allowing promotion is required, possibly indefinite by default.
> I understand why you might want to wait for either the primary to come up again or manual recovery.
> However, in an active/standby two-node setup where the system is required to be up ALL the time, there is another approach:
> Promote the old secondary after a timeout.
> If the old primary was down for a long time, we are up quickly and the old primary should sync when it returns - fine.
> If the old primary was down only briefly but beyond the timeout, the split-brain handlers should recover, possibly with manual intervention.
> Acceptable, since we couldn't wait forever.
>
> What say you?

If you don't care for fencing, don't configure it ;-)

The problem here is that there are many failure scenarios.
We cannot know whether the "old primary" is actually "down" (it is the broken one),
or merely "unreachable" (we are the broken one).
What may seem right for one scenario may be very wrong for another.
If we cannot talk to the peer, we simply don't know which scenario we are in.

Note that we are already talking about multiple failure scenarios here;
for single-failure cases it all works out fine.

How to "best" deal with multiple failures can likely not be solved
generically, as "best" depends very much on the specific deployment and
use-case requirements, and on which multiple-failure scenarios you can think of.
And because there are nearly infinite multiple-failure scenarios ;-)

You are free to implement whatever policy you want.
I would not implement that in the fence-peer handler, though,
but outside of the Pacemaker and DRBD logic.

If you think you want that, I suggest you add it to your monitoring
(I mean strategic monitoring, outside of Pacemaker) and trigger an
automatic "--forced" promotion if whatever policies you come up with
decide that this is a good idea, based on whatever conditions and
parameters, current and previous, your monitoring knows about.
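
One very rough sketch of such an external watchdog follows. Everything in it is an assumption: the resource name, the timeout, running it outside Pacemaker, and using DRBD 8.3's --overwrite-data-of-peer form of a forced promote; adapt and harden it to your own policy before relying on it.

#!/bin/sh
# Hypothetical watchdog, run outside Pacemaker: if the peer stays away longer
# than TIMEOUT while we sit as a merely-Consistent Secondary, force a promotion.
RES=fsroot          # assumed resource name
TIMEOUT=600         # seconds we are willing to wait for the peer
waited=0
while [ "$(drbdadm cstate "$RES")" != "Connected" ]; do
    # only act while we are the stuck Secondary with an unknown peer
    [ "$(drbdadm role "$RES")" = "Secondary/Unknown" ] || exit 0
    if [ "$waited" -ge "$TIMEOUT" ]; then
        logger "drbd-watchdog: peer of $RES gone for ${TIMEOUT}s, forcing promotion"
        drbdadm -- --overwrite-data-of-peer primary "$RES"
        exit 0
    fi
    sleep 10
    waited=$((waited + 10))
done

Whether the forced promote is issued directly like this or by nudging Pacemaker instead is itself a policy decision; the point is that the decision lives outside the fence-peer handler.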

And there will always be another scenario you did not anticipate.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user
