Mailing List Archive: Linux-HA: Dev

Re: Re: [DRBD-user] drbd peer outdater: higher level implementation?

 

 



lars.ellenberg at linbit

Sep 12, 2008, 2:55 PM

Post #1 of 22 (3580 views)

was: In-Reply-To: <20080912202727.GC14037 [at] marowsky-bree>

On Fri, Sep 12, 2008 at 10:27:27PM +0200, Lars Marowsky-Bree wrote:
> > > listen to the notifications we provide, and infer the peer state by that
> > > means ... ;-)
> > yeah. I asked you before,
> > how exactly that would look like,
> > and so far I saw only handwaving.
>
> Hm, I don't think there was hand-waving. Sorry. What was unclear?
>
> You get notifications when the peer starts or goes down (or is fenced,
> which looks the same). This is not yet relayed to drbd internally (just
> the RA gets the notification so far), but we could, for example, call
> "standalone" explicitly to disconnect; we can discuss this mechanism.
>
> When drbd loses the peer internally, but w/o us providing the
> notification, it's either the replication link crashed, or fencing
> failing or loss of quorum; anyway, you'd "outdate" yourself (and freeze
> io) until this notification was provided (which of course needs to be
> persistent across reboots).
>
> Wouldn't that work?

that would prevent normal failover, no?

what we need is,
* on the "Secondary", "slave",
or whatever you want to call it,
* the signal of the peer, that says:
hey, I'm still alive, I'm still Primary,
and continue to modify the data set,
so you better keep out of the way.
then we mark us as outdated.

I don't think that this can be mapped into
multiple negation plus timeout logic effectively.
do you suggest that,
* on the Secondary
* we get no signal that the peer is not dead in no time,
and therefore don't mark ourself as not uptodate?
uh?

sorry, it is late.
can you explain slowly?

situation 1:

primary crash.
secondary has to take over,
so it better not mark itself outdated.

situation 2:

replication link breaks
primary wants to continue serving data.
so secondary must mark itself outdated.
otherwise on a later primary crash heartbeat would try to make
it primary and succeed in going online with stale data.

that DID HAPPEN.
that is why dopd was invented in the first place.

variation:
as it may be a cluster partition.
with stonith, (at least) one of the nodes gets shot.
primary must freeze until peer is confirmed outdated (or shot)
and must unfreeze again as soon as peer is confirmed outdated (or shot)

where and when do what notifications come in,
and how is drbd (the RA) to react on those?


I recently discussed with our Andreas Kurz, that
what _could_ possibly work is a "monitor" action,
(and optionally some daemon)
that periodically gets the "data generation uuids" from drbd
and feed that into the cib (reuse attrd?)

then when we lose the replication link,
primary freezes, the user land callback
on the primary queries the cib.
if the other node is dead
(it should better be shot; we might need a pseudo resource for exactly
that purpose; or would pacemaker shoot a node that does not hold any
resources [that could be started elsewhere]?)
we'd notice, and unfreeze.

if the other node is still alive, it will propagate its uuids.
so will we.

on the secondary, the next monitor action will see the other still alive
with newer UUID, so it would outdate itself, which is just one flag in
the UUIDs anyways, so they would get propagated by the cib to the
primary, which eventually will see the secondary's UUIDs saying it is
outdated.
now we can unfreeze on the Primary.

on the secondary,
if it was a Primary crash, there will be no newer Primary UUID
propagated from the cib, so there will be no self-outdate.
when heartbeat decides to make it primary, we are online again.
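
A minimal sketch of the monitor-based idea described above, under loud assumptions: `PublishedState`, `secondary_monitor` and `primary_may_unfreeze` are hypothetical names, nothing here talks to drbd or the cib, and `generation` is an integer stand-in (real drbd data generation UUIDs are not ordered numbers). It only illustrates the decision each side would take once the replication link is down.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PublishedState:
    """What one node last published to the cib (hypothetical layout)."""
    generation: int          # integer stand-in for the data generation UUIDs
    is_primary: bool
    outdated: bool = False


def secondary_monitor(local: PublishedState,
                      peer: Optional[PublishedState]) -> PublishedState:
    """One monitor pass on the Secondary while the replication link is down.

    peer is None when the cib holds nothing newer from the peer, for
    example because the Primary crashed; then no self-outdate happens.
    """
    if peer is not None and peer.is_primary and peer.generation > local.generation:
        return PublishedState(local.generation, local.is_primary, outdated=True)
    return local


def primary_may_unfreeze(peer_confirmed_dead: bool,
                         peer: Optional[PublishedState]) -> bool:
    """A frozen Primary resumes IO once the peer is dead (shot) or outdated."""
    if peer_confirmed_dead:
        return True
    return peer is not None and peer.outdated
```

With a live Primary at a newer generation the Secondary outdates itself and the Primary may unfreeze; with no newer state published (Primary crash) the Secondary stays eligible for promotion.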

but I don't see where any notification would come in.

reading that again, I was not really able to follow myself,
so I'll try again after I got some sleep.
unless, of course, it is all clear to you.
in which case, please,
would you rephrase my wording so I can understand it? ;)

cheers,


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


lmb at suse

Sep 12, 2008, 6:01 PM

Post #2 of 22 (3456 views)

On 2008-09-12T23:55:53, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:

> > When drbd loses the peer internally, but w/o us providing the
> > notification, it's either the replication link crashed, or fencing
> > failing or loss of quorum; anyway, you'd "outdate" yourself (and freeze
> > io) until this notification was provided (which of course needs to be
> > persistent across reboots).
> >
> > Wouldn't that work?
>
> that would prevent normal failover, no?

No. Normal fail-over will only occur after 'we' have demoted/stopped the
peer. The cluster manager is quite good at enforcing dependencies ;-)

> what we need is,
> * on the "Secondary", "slave",
> or whatever you want to call it,
> * the signal of the peer, that says:
> hey, I'm still alive, I'm still Primary,
> and continue to modify the data set,
> so you better keep out of the way.
> then we mark us as outdated.

Pacemaker/CRM doesn't send signals when nothing changed, so this would
be a weird thing for it to deliver. However, it _will_ tell you when
something changed, ie the logic simply needs to be turned around.

> I don't think that this can be mapped into
> multiple negation plus timeout logic effectively.

I don't think this needs a timeout.

> do you suggest that,
> * on the Secondary
> * we get no signal that the peer is not dead in no time,
> and therefore don't mark ourself as not uptodate?
> uh?

On the secondary, until you get a signal that the peer is dead
(stopped/demoted), consider yourself "not eligible" to be promoted (ie,
outdated).

More generally: on a primary, if the connection to the peer goes away,
set said flag & freeze IO until this signal/notification is delivered.

I believe that covers all of the cases. I may be wrong. We need a
whiteboard. I will make sure we have one in Prague! ;-)

> situation 1:
>
> primary crash.
> secondary has to take over,
> so it better not mark itself outdated.

No problem; we'll deliver a "peer is stopped" notification to the
secondary so it won't be outdated by the time we ask it to promote.

> situation 2:
>
> replication link breaks
> primary wants to continue serving data.
> so secondary must mark itself outdated.
> otherwise on a later primary crash heartbeat would try to make
> it primary and succeed in going online with stale data.

Right. The logic above would protect the data, but if just the
replication link freezes, this would freeze both nodes. Not good,
obviously. Indeed that requires some additional logic.

One possible way is to not freeze IO on the primary; the secondary would
still outdate itself implicitly, and then fail its monitor, and be
stopped (and moved elsewhere, if we could ;-). That seems correct, and
not worse than anything dopd does today; freeze-io probably is an
additional "panic guard".

BTW, when it fails the "monitor", we'll stop it. That could for example
un-freeze the primary. An alternative is to use crm_resource -F as a
call-out when drbd notices the master is gone, which would provide
Pacemaker with an async failure notification and prevent the timeouts
...

> that is why dopd was invented in the first place.

Yes, I know.

> variation:
> as it may be a cluster partition.
> with stonith, (at least) one of the nodes gets shot.
> primary must freeze until peer is confirmed outdated (or shot)
> and must unfreeze again as soon as peer is confirmed outdated (or shot)

We can't confirm it's outdated, but we can tell you when the peer is
shot/stopped.

> where and when do what notifications come in,

That's explained here:
http://wiki.linux-ha.org/v2/Concepts/Clones#head-f9fa0f9ab22e08d82c8f00e15d9724eba47f7576

> and how is drbd (the RA) to react on those?

See above. How to actually provide the signals to drbd (the module ;-)
is of course open to discussion, and I look to you as to understand what
works best.

> I recently discussed with our Andreas Kurz, that
> what _could_ possibly work is a "monitor" action,
> (and optionally some daemon)
> that periodically gets the "data generation uuids" from drbd
> and feed that into the cib (reuse attrd?)

I think that is way too complicated and not needed; I think the
notifications are sufficient, as they provide the peer up/down
promote/demote events. But I may be wrong.

> so I'll try again after I got some sleep.

Good point ;-) I will do the same. And, as I mentioned, bring a
whiteboard to Prague.

If I can explain this so that it works, can I have my floating peers
supported in exchange? ;-)


Regards & good night,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



lars.ellenberg at linbit

Sep 12, 2008, 6:48 PM

Post #3 of 22 (3476 views)

On Sat, Sep 13, 2008 at 03:01:11AM +0200, Lars Marowsky-Bree wrote:
> On 2008-09-12T23:55:53, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:
>
> > > When drbd loses the peer internally, but w/o us providing the
> > > notification, it's either the replication link crashed, or fencing
> > > failing or loss of quorum; anyway, you'd "outdate" yourself (and freeze
> > > io) until this notification was provided (which of course needs to be
> > > persistent across reboots).
> > >
> > > Wouldn't that work?
> >
> > that would prevent normal failover, no?
>
> No. Normal fail-over will only occur after 'we' have demoted/stopped the
> peer. The cluster manager is quite good at enforcing dependencies ;-)

unfortunately it does not know about the "up2date" ness dependency.
that is the point.

> > what we need is,
> > * on the "Secondary", "slave",
> > or whatever you want to call it,
> > * the signal of the peer, that says:
> > hey, I'm still alive, I'm still Primary,
> > and continue to modify the data set,
> > so you better keep out of the way.
> > then we mark us as outdated.
>
> Pacemaker/CRM doesn't send signals when nothing changed, so this would
> be a weird thing for it to deliver. However, it _will_ tell you when
> something changed, ie the logic simply needs to be turned around.

and I don't think that is possible.

> > I don't think that this can be mapped into
> > multiple negation plus timeout logic effectively.
>
> I don't think this needs a timeout.
>
> > do you suggest that,
> > * on the Secondary
> > * we get no signal that the peer is not dead in no time,
> > and therefore don't mark ourself as not uptodate?
> > uh?
>
> On the secondary, until you get a signal that the peer is dead
> (stopped/demoted), consider yourself "not eligible" to be promoted (ie,
> outdated).

once outdated, it is outdated.
there is only one way for outdated, stale, data
to become uptodate again: resync.
sorry.

that does not work.

> More generally: on a primary, if the connection to the peer goes away,
> set said flag & freeze IO until this signal/notification is delivered.
>
> I believe that covers all of the cases.

it does not.
it does cover "Secondary is dead"
it does cover "Primary is dead"

it happens to cover those,
because in both cases no outdating takes place.

so basically it does nothing,
and in situations where there is nothing to do,
that happens to work.

it does not cover "replication link is down".

> > situation 1:
> >
> > primary crash.
> > secondary has to take over,
> > so it better not mark itself outdated.
>
> No problem; we'll deliver a "peer is stopped" notification to the
> secondary so it won't be outdated by the time we ask it to promote.

as I said: I'm NOT interested in the situation where I do NOT need to outdate.

I want to know when I have to.

heartbeat apparently cannot tell me, or can it?

> > situation 2:
> >
> > replication link breaks
> > primary wants to continue serving data.
> > so secondary must mark itself outdated.
> > otherwise on a later primary crash heartbeat would try to make
> > it primary and succeed in going online with stale data.
>
> Right. The logic above would protect the data, but if just the
> replication link freezes, this would freeze both nodes. Not good,
> obviously. Indeed that requires some additional logic.
>
> One possible way is to not freeze IO on the primary; the secondary would
> still outdate itself implicitly,

_when_ does the secondary outdate itself,
based on _what_.
if you implicitly outdate on connection loss,
you prevent normal failover.

you need to outdate on connection loss,
while the primary continues to write.
that cannot happen implicitly.

> and then fail its monitor, and be
> stopped (and moved elsewhere, if we could ;-). That seems correct, and
> not worse than anything dopd does today; freeze-io probably is an
> additional "panic guard".

you can already configure freezing and non-freezing in drbd
by saying "fencing resource-only" or "fencing resource-and-stonith".

> BTW, when it fails the "monitor", we'll stop it. That could for example
> un-freeze the primary. An alternative is to use crm_resource -F as a
> call-out when drbd notices the master is gone, which would provide
> Pacemaker with an async failure notification and prevent the timeouts
> ...

you try to solve the node failure.
but that is already solved.
we don't need any outdate for a node failure.

solve the replication link failure and later primary crash.
solve the problem to not go online with stale data.

> > that is why dopd was invented in the first place.
>
> Yes, I know.
>
> > variation:
> > as it may be a cluster partition.
> > with stonith, (at least) one of the nodes gets shot.
> > primary must freeze until peer is confirmed outdated (or shot)
> > and must unfreeze again as soon as peer is confirmed outdated (or shot)
>
> We can't confirm it's outdated, but we can tell you when the peer is
> shot/stopped.
>
> > where and when do what notifications come in,
>
> That's explained here:
> http://wiki.linux-ha.org/v2/Concepts/Clones#head-f9fa0f9ab22e08d82c8f00e15d9724eba47f7576

see, that is handwaving.

I describe simple situations,
you could comment inline when and which notifications would take place.
unfortunately, apparently it is not that simple,
as there are no notifications taking place in the interesting situation.

you point to some web page (which I already know)
that outlines a neat mechanism.
which does not apply.

none of those notifications would happen for
"replication link down, but Primary still up and eager to continue to write".
or even "... and still writing along"

again:
situation 1 "normal failover":

all healthy.
primary crashes
heartbeat promotes secondary to primary
and goes online with good data.

no outdate takes place.

compare with
situation 2: "outdate needed or data jumps back in time"

replication link breaks
primary keeps writing
(which means secondary has now stale data)
primary crashes
heartbeat promotes secondary to primary
and goes online with stale data.

at which point would the secondary get a notification?
which one?
how could that trigger the outdate mechanism,
and prevent the promotion?
logic in RA script?
trigger on what arguments/parameters/environment variables?

so you think it is sufficient that a secondary without
communication link to the peer refuses to become primary
until heartbeat notifies it that the primary is down?
that is a no-op, as heartbeat will do that always.
and it cannot prevent situation 2.
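
To make the difference between the two situations concrete, here is a small sketch that replays the event sequences. It is a toy model, not drbd or heartbeat behaviour: `replay`, the event names and the `peer_outdater` switch (a stand-in for what dopd achieves) are all made up for illustration.

```python
def replay(events, peer_outdater=False):
    """Replay events on a two-node pair and report how the promotion ends.

    Returns "good", "stale", "refused", or "not promoted".
    """
    primary_writes = 0       # writes acknowledged by the Primary
    replicated_writes = 0    # writes that also reached the Secondary
    link_up = True
    primary_alive = True
    secondary_outdated = False

    for event in events:
        if event == "link_down":
            link_up = False
            # dopd-like step: a live Primary gets its peer marked outdated.
            if peer_outdater and primary_alive:
                secondary_outdated = True
        elif event == "primary_write":
            if primary_alive:
                primary_writes += 1
                if link_up:
                    replicated_writes += 1
        elif event == "primary_crash":
            primary_alive = False
        elif event == "promote_secondary":
            if secondary_outdated:
                return "refused"
            return "good" if replicated_writes == primary_writes else "stale"
        else:
            raise ValueError(f"unknown event: {event}")
    return "not promoted"


situation_1 = ["primary_write", "primary_crash", "promote_secondary"]
situation_2 = ["primary_write", "link_down", "primary_write",
               "primary_crash", "promote_secondary"]
```

In this model situation 1 ends in a good promotion either way, while situation 2 only avoids going online with stale data when something outdates the secondary at the moment the link breaks and the primary keeps writing.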

> > and how is drbd (the RA) to react on those?
>
> See above.

sorry, I don't see.

> How to actually provide the signals to drbd (the module ;-) is of
> course open to discussion,

not at all, that part is solved.

> and I look to you as to understand what works best.

Iff I'd get a signal in the RA with the appropriate meaning
at the appropriate time, I'd just say "drbdadm outdate resource".
that is what dopd does now.

> > I recently discussed with our Andreas Kurz, that
> > what _could_ possibly work is a "monitor" action,
> > (and optionally some daemon)
> > that periodically gets the "data generation uuids" from drbd
> > and feed that into the cib (reuse attrd?)
>
> I think that is way too complicated and not needed; I think the
> notifications are sufficient, as they provide the peer up/down
> promote/demote events. But I may be wrong.
>
> > so I'll try again after I got some sleep.
>
> Good point ;-) I will do the same. And, as I mentioned, bring a
> whiteboard to Prague.
>
> If I can explain this so that it works, can I have my floating peers
> supported in exchange? ;-)

perhaps.
but you have much work to do ;)



lmb at suse

Sep 12, 2008, 6:54 PM

Post #4 of 22 (3460 views)

On 2008-09-13T03:01:11, Lars Marowsky-Bree <lmb [at] suse> wrote:

Ahh, it all just clicked into place!

A slave always considers itself "outdated", until this flag is cleared.
(The flag is persistent in the meta-data.) While outdated, it refuses to
promote.

The "outdated" flag is cleared either by receiving a "stop" notification
for the peer (which pacemaker delivers), or by learning that the peer has
been demoted (which we can deliver using pacemaker, but which drbd of
course also knows via its internal protocol).

A primary, on loss of connection to the peer, freezes IO (sets the
outdated flag?). It invokes a call-out "crm_resource -F" _for the peer_,
which causes pacemaker to stop the peer. (Stopping the peer w/o
connection means that the peer will save its outdated flag to disk, and
be unable to promote.) The primary will then receive a "stop"
notification for the peer (either because it was indeed stopped, or
fenced), and then unfreezes (clears the outdated flag again?).

I think that covers everything for primary/secondary.
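
A rough sketch of the two roles as described above, assuming hypothetical class and method names. The `fail_peer` callback stands in for the "crm_resource -F" call-out and is never executed as a real command here; persistence of the flag in the meta-data is not modelled.

```python
class Slave:
    """Secondary side: outdated by default until a notification clears it."""

    def __init__(self):
        self.outdated = True

    def on_peer_stopped(self):
        # "stop" notification for the peer, delivered by the cluster manager
        self.outdated = False

    def on_peer_demoted(self):
        self.outdated = False

    def can_promote(self):
        return not self.outdated


class Primary:
    """Primary side: freeze IO on connection loss, unfreeze on peer stop."""

    def __init__(self, fail_peer):
        self.io_frozen = False
        self._fail_peer = fail_peer   # stand-in for the call-out

    def on_connection_lost(self):
        self.io_frozen = True
        self._fail_peer()             # ask the cluster manager to stop the peer

    def on_peer_stopped(self):
        self.io_frozen = False
```

The point of the sketch is only the ordering: the primary stays frozen between the call-out and the "stop" notification, and the slave never promotes without having seen a stop or demote of its peer.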


I'm not perfectly sure how to handle primary/primary; I believe the
second paragraph above handles it, but it would cause both sides to
stop; drbd might need some internal mechanism to pre-determine which
side will do that on primary/primary; it doesn't really matter which, I
think - maybe the lowest id?


Regards,
Lars



lmb at suse

Sep 12, 2008, 7:00 PM

Post #5 of 22 (3480 views)

On 2008-09-13T03:48:45, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:

Insomniac too? ;-)

> > No. Normal fail-over will only occur after 'we' have demoted/stopped the
> > peer. The cluster manager is quite good at enforcing dependencies ;-)
> unfortunately it does not know about the "up2date" ness dependency.
> that is the point.

But it doesn't need to. A not-uptodate drbd will simply refuse to
promote, right?

> > On the secondary, until you get a signal that the peer is dead
> > (stopped/demoted), consider yourself "not eligible" to be promoted (ie,
> > outdated).
>
> once outdated, it is outdated.
> there is only one way for outdated, stale, data
> to become uptodate again: resync.

I think implicit in this is indeed that the meaning of "outdated"
changes. Maybe a better phrase would be "eligible to promote". And
indeed, it is only cleared by reconnecting to the other side and
resyncing. Please see my other mail.

> it does not cover "replication link is down".

Please see my other mail. I think I explained it more clearly there.

> I describe simple situations,
> you could comment inline when and which notifications would take place.

I'm sorry, I wasn't aware that that was what you were looking for, and
the web page describes all scenarios when pacemaker delivers a
notification to an RA (basically, whenever the peer changes state).

> Iff I'd get a signal in the RA with the appropriate meaning
> at the appropriate time, I'd just say "drbdadm outdate resource".
> that is what dopd does now.

I think that "outdate" mechanism as it stands today might need some
minor changes, yes. Just as the logic in the RA surely needs to, and
possibly we even need to improve m/s if we find a lack there.

(Though right now I think most is contained to drbd + the RA.)


Regards,
Lars



lmb at suse

Sep 12, 2008, 7:13 PM

Post #6 of 22 (3459 views)

On 2008-09-12T23:55:53, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:

Trying to explain again.

> situation 1:
>
> primary crash.

-> secondary receives "peer is stopped (fenced)" notification, clears
outdated flag

> secondary has to take over,

-> secondary promotes fine

> so it better not mark itself outdated.
>
> situation 2:
>
> replication link breaks

-> Pacemaker doesn't do anything, because it doesn't know ;-)

(Actually, drbd doesn't know whether the link broke or the secondary
is indeed down)

-> Primary marks itself as "outdated" for now, freezes IO
(As you don't like me to say that it is outdated, because this seems
to invoke the current meaning instead of the new behaviour, maybe I
should call it "marks itself as 'in flux'"? I'm open to using
terminology which is more clear.)

> primary wants to continue serving data.

-> primary calls out to mark the peer as failed
-> peer (secondary) is stopped by pacemaker, or fenced (if the machine
hung, crashed, whatever)

> so secondary must mark itself outdated.

-> Secondary is "outdated" by virtue of not having received one of the
signals that cleared the flag

-> Primary receives "peer is stopped" notification, clears flag, and
continues saving data

> otherwise on a later primary crash heartbeat would try to make
> it primary and succeed in going online with stale data.
>
> that DID HAPPEN.
> that is why dopd was invented in the first place.

Right, and I don't think it can happen with this scheme.

>
> variation:
> as it may be a cluster partition.
> with stonith, (at least) one of the nodes gets shot.

That is actually identical to either one of the above scenarios, I
think, depending on which side wins.

Only the surviving side will receive all the right steps to continue
serving data.

> primary must freeze until peer is confirmed outdated (or shot)

It'd still call out to try and fail the peer; but as that is impossible
(peer is unreachable), it'll instead receive the fencing notification.

> and must unfreeze again as soon as peer is confirmed outdated (or shot)

Or the primary is shot; could go either way, but that would look like
scenario 1.


Regards,
Lars



lars.ellenberg at linbit

Sep 13, 2008, 5:52 AM

Post #7 of 22 (3451 views)

On Sat, Sep 13, 2008 at 04:00:34AM +0200, Lars Marowsky-Bree wrote:
> I'm sorry, I wasn't aware that that was what you were looking for, and
> the web page describes all scenarios when pacemaker delivers a
> notification to an RA (basically, whenever the peer changes state).

and none of them is useful for _outdating_.
some of them may be useful for "unfreezing".

> > Iff I'd get a signal in the RA with the appropriate meaning
> > at the appropriate time, I'd just say "drbdadm outdate resource".
> > that is what dopd does now.
>
> I think that "outdate" mechanism as it stands today might need some
> minor changes, yes. Just as the logic in the RA surely needs to, and
> possibly we even need to improve m/s if we find a lack there.

so.
what you are suggesting is

when drbd loses replication link
  primary
    freezes and calls out to userland,
    telling heartbeat that the peer has "failed",
    in which case heartbeat would stop drbd on the secondary.
    either receives "secondary was stopped",
      maybe stores to meta data "_I_ am ahead of peer",
      (useful for cluster wide crash/reboot later)
      and unfreezes
    or is being stopped itself
      (which would result in the node being self fenced, as the fs on
      top of drbd cannot be unmounted as drbd is frozen,...)
    or is even being shot as result of a cluster partition.

    so either primary continues to write,
    or it will soon look like a crashed primary.

  secondary
    sets a flag "primary may be ahead of me",
    then waits for
      either being stopped, in which case
        it would save to meta data "primary _IS_ ahead of me"
      or being told that the Primary was stopped
        when it would clear that flag again,
        maybe store to meta data "_I_ am ahead of peer"
        and then most likely soon after be promoted.

while drbd has the "peer may be ahead of me" flag set, i.e. basically
while drbd is not connected and no "certain" flag is set yet, it
will refuse to be promoted.

Did I get that right?

[.note that drbd has both "certain" flags already implemented,
namely "I am outdated" = peer IS ahead of me, and
"peer is outdated" = _I_ am ahead of peer ]

some questions:
wouldn't that "peer has failed" first trigger a monitor?
wouldn't that mean that on monitor, a not connected secondary would
have to report "failed", as otherwise it would not get stopped?
wouldn't that prevent normal failover?

if not,
wouldn't heartbeat try to restart the "failed" secondary?
what would happen?
what does a secondary do when started, and it finds the
"primary IS ahead of me" flag in meta data?
refuse to start even as slave?
(would prevent it from ever being resync'ed!)
start as slave, but refuse to be promoted?

[.note that typical DRBD cluster deployment
is still 2node, in case that matters]

problem: secondary crash.
secondary reboots,
heartbeat rejoins the cluster.

replication link is still broken.

secondary does not have "primary IS ahead of me" flag in meta data
as because of the crash there was no way to store that.

would heartbeat try to start drbd (slave) here?
what would trigger the "IS ahead of me" flag get stored on disk?

if for some reason policy engine now figures the master should rather
run on the just rejoined node, how can that migration be prevented?


and so on and on.
there are many scenarios.
I'm still not convinced that this method
covers as many as dopd, as well as dopd.
but, at least, it is getting closer...



lmb at suse

Sep 13, 2008, 9:55 AM

Post #8 of 22 (3444 views)

On 2008-09-13T14:52:53, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:

> so.
> what you are suggesting is
>
> when drbd loses replication link
>   primary
>     freezes and calls out to userland,
>     telling heartbeat that the peer has "failed",
>     in which case heartbeat would stop drbd on the secondary.
>     either receives "secondary was stopped",
>       maybe stores to meta data "_I_ am ahead of peer",
>       (useful for cluster wide crash/reboot later)
>       and unfreezes
>     or is being stopped itself
>       (which would result in the node being self fenced, as the fs on
>       top of drbd cannot be unmounted as drbd is frozen,...)
>     or is even being shot as result of a cluster partition.
>
>     so either primary continues to write,
>     or it will soon look like a crashed primary.
>
>   secondary
>     sets a flag "primary may be ahead of me",
>     then waits for
>       either being stopped, in which case
>         it would save to meta data "primary _IS_ ahead of me"
>       or being told that the Primary was stopped
>         when it would clear that flag again,
>         maybe store to meta data "_I_ am ahead of peer"
>         and then most likely soon after be promoted.

Okay, I think I have found a good name for the flag which I mean, and
that should allow me to rephrase more clearly, and possibly simplify
further. Retry:

Unconnected secondary starts up with "peer state dirty", which is
basically identical to "(locally) unsynced/inconsistent", but I think
it's easier to explain when I call it "peer dirty".

After it connects (and possibly resynchronizes), it clears the
"peer dirty" flag locally. (Assume that both sides are secondary at
this point; that'd be the default during a cluster start-up.)

When one side gets promoted, the other side sets the "peer dirty" flag
locally. When it demotes, both sides clear it. Basically, each side
gets to clear it when it notices that the peer is demoted. So far, so
good.

Scenario A - the replication link goes down:

- Primary:
  - Freezes IO.
  - Calls out to user-space to "fail" the peer.
  - Gets confirmation that peer is stopped (via RA notification).
  - Resumes IO.

- Secondary:
  - Simply gets stopped.
  - It'll assume "peer dirty" anyway, until it reconnects and
    resyncs.


Scenario B - primary fails:

- Primary:
  - Is dead. ;-)

- Secondary:
  - Gets confirmation that peer is stopped.
  - Clears inconsistent flag (capable of resuming IO).

Scenario C - secondary fails:

- Primary:
  - Same as A, actually, from the point of view of the primary.

- Secondary:
  - Either gets fenced, or stopped.


(Note that A/B/C could actually work for active/active too, as long as
there's a way to ensure that only one side calls out to fail its peer,
and the other one - for the sake of this scenario - behaves like a
secondary.)
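
The life cycle of the "peer dirty" flag described above can be sketched as a tiny state holder. This is a simplified illustration only: the class and method names are hypothetical, a restart is modelled by creating a fresh object, and nothing is stored on disk.

```python
class Node:
    """Tracks the "peer dirty" flag for one drbd node (illustrative only)."""

    def __init__(self):
        # An unconnected node always starts with the pessimistic default.
        self.peer_dirty = True
        self.connected = False

    def on_connected_and_synced(self):
        self.connected = True
        self.peer_dirty = False

    def on_peer_promoted(self):
        self.peer_dirty = True

    def on_peer_demoted(self):
        self.peer_dirty = False

    def on_link_lost(self):
        self.connected = False    # the flag is left as it was

    def on_peer_stopped(self):
        # Confirmation (stop or fencing) that the peer is gone.
        self.peer_dirty = False

    def may_promote(self):
        return not self.peer_dirty
```

Walking a secondary through the primary-failure case gives: synced, peer promoted (flag set), peer confirmed stopped (flag cleared), promotable. In the link-down case the secondary is stopped and any restart begins with the flag set again.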

> some questions:
> wouldn't that "peer has failed" first trigger a monitor?

No; it'd translate to a direct stop.

> wouldn't that mean that on monitor, a not connected secondary would
> have to report "failed", as otherwise it would not get stopped?
> wouldn't that prevent normal failover?

Monitoring definitions are a slightly different matter. The result of a
monitor is not the same as the ability/preference to become master.
Indeed a failed resource will never get promoted, but a happy resource
needn't call crm_master and thus not become promotable.

I think "monitor" would refer exclusively to local health - local
storage read/writable, drbd running, etc.

> if not,
> wouldn't heartbeat try to restart the "failed" secondary?
> what would happen?

It might try to restart. But if a secondary gets restarted, it'll know
from the environment variables that a peer exists; if it can't connect
to that, it should fail the start - alternatively, it'd be up and
running, but have "outdated/peer is dirty" set anyway, and so never
announce its ability to "promote".

> what does a secondary do when started, and it finds the
> "primary IS ahead of me" flag in meta data?
> refuse to start even as slave?
> (would prevent it from ever being resync'ed!)
> start as slave, but refuse to be promoted?

The latter.

> problem: secondary crash.
> secondary reboots,
> heartbeat rejoins the cluster.
>
> replication link is still broken.
>
> secondary does not have "primary IS ahead of me" flag in meta data
> as because of the crash there was no way to store that.

> would heartbeat try to start drbd (slave) here?
> what would trigger the "IS ahead of me" flag get stored on disk?

See above; it would _always_ come up with the assumption that "peer is
dirty", and thus refuse to promote. No need to store anything on disk;
it is the default assumption.

> if for some reason policy engine now figures the master should rather
> run on the just rejoined node, how can that migration be prevented?

That's a different discussion, but: the ability to become primary (and
the preference for it) is explicitly set by the RA through the call to
"crm_master".

If it is unable to become master, it would call "crm_master -D"; it'll
then _never_ be promoted.
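
A minimal sketch of how an RA might pick the crm_master invocation. The helper `crm_master_argv` is hypothetical and only builds the argument list without running anything. "-D" is the option named above; "-v" for setting a value is an assumption here and should be checked against the crm_master help of the installed version.

```python
def crm_master_argv(can_promote, score=1):
    """Return the crm_master argument list an RA could run (sketch)."""
    if not can_promote:
        # Remove the master preference: the instance is never promoted.
        return ["crm_master", "-D"]
    # Assumption: "-v <score>" sets the preference for promotion.
    return ["crm_master", "-v", str(score)]
```

For example, an instance that still has the flag set would use `crm_master_argv(False)`, and one that has resynced would use `crm_master_argv(True, score=10)`.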

> I'm still not convinced that this method
> covers as many as dopd as good as dopd.

I think so. At least my proposal is becoming more concise, which is good
for review ;-)


Regards,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


lars.ellenberg at linbit

Sep 13, 2008, 1:44 PM

Post #9 of 22 (3443 views)
Re: Re: [DRBD-user] drbd peer outdater: higher level implementation?

On Sat, Sep 13, 2008 at 06:55:05PM +0200, Lars Marowsky-Bree wrote:
> On 2008-09-13T14:52:53, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:
>
> > so.
> > what you are suggeting is
> >
> > when drbd loses replication link
> > primary
> > freezes and calles out to userland,
> > telling heartbeat that the peer has "failed",
> > in which case heartbeat would stop drbd on the secondary.
> > either receives "secondary was stopped",
> > maybe stores to meta data "_I_ am ahead of peer",
> > (useful for cluster wide crash/reboot later)
> > and unfreezes
> > or is being stopped itself
> > (which would result in the node being self fenced, as the fs on
> > top of drbd cannot be unmounted as drbd is freezed,...)
> > or is even being shot as result of a cluster partition.
> >
> > so either primary continues to write,
> > or it will soon look like a crashed primary.
> >
> > secondary
> > sets a flag "primary may be ahead of me",
> > then waits for
> > either being stopped, in which case
> > it would save to meta data "primary _IS_ ahead of me"
> > or being told that the Primary was stopped
> > when it would clear that flag again,
> > maybe store to meta data "_I_ am ahead of peer"
> > and then most likely soon after be promoted.
>
> Okay, I think I have found a good name for the flag which I mean, and
> that should allow me to rephrase more clearly, and possibly simplify
> further Retry:
>
> Unconnected secondary starts up with "peer state dirty", which is
> basically identical to "(locally) unsynced/inconsistent", but I think
> it's either to explain when I call it "peer dirty".

bad choice.
"dirty" already has a meaning.
we have
  uptodate (necessarily consistent),
  consistent (not necessarily uptodate),
  outdated (consistent,
      but we know a more recent version existed at some point),
  inconsistent.

with "dirty" I think of inconsistent.
that is something different than outdated.
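
to make the distinction concrete, a small sketch of the four states and the two properties that matter here. the enum and helper names are made up for illustration; they are not drbd's internal names.

```python
from enum import Enum


class DiskState(Enum):
    UPTODATE = "uptodate"          # necessarily consistent
    CONSISTENT = "consistent"      # not necessarily uptodate
    OUTDATED = "outdated"          # consistent, but a newer version existed
    INCONSISTENT = "inconsistent"  # e.g. in the middle of a resync


def is_consistent(state):
    """Everything except INCONSISTENT holds a usable data set."""
    return state is not DiskState.INCONSISTENT


def is_known_stale(state):
    """Only OUTDATED records the knowledge that newer data existed."""
    return state is DiskState.OUTDATED
```

an outdated disk is consistent and known to be stale; an inconsistent one is not usable at all, which is why the two must not share one name.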

> After it connects (and possibly resynchronizes), it clears the
> "peer dirty" flag locally. (Assume that it both sides are secondary at
> this point; that'd be the default during a cluster start-up.)
>
> When one side gets promoted, the other side sets the "peer dirty" flag
> locally. When it demotes, both sides clear it. Basically, each side
> gets to clear it when it notices that the peer is demoted. So far, so
> good.
>
> Scenario A - the replication link goes down:
>
> - Primary:
> - Freezes IO.
> - Calls out to user-space to "fail" the peer.
> - Gets confirmation that peer is stopped (via RA notification).
> - Resumes IO.
>
> - Secondary:
> - Simply gets stopped.
> - It'll assume "peer dirty" anyway, until it reconnects and
> resyncs.

how can it possibly reconnect and resync,
if it is stopped?

> Scenario A - primary fails:
>
> - Primary:
> - Is dead. ;-)
>
> Secondary:
> - Gets confirmation that peer is stopped.
> - Clears inconsistent flag (capable to resume IO).

you still ignore _my_ scenario 2,
I fail to see why you think you cover it.

right now, without dopd (and this "all new dopd in higher levels with
notifications and stuff" does not exist yet, either),
this is possible:

situation 2: "outdate needed or data jumps back in time"

replication link breaks
primary keeps writing
(which means secondary has now stale data)
primary crashes
heartbeat promotes secondary to primary
and goes online with stale data.

variation: instead of primary crash, cluster crash.
cluster reboot, replication link still broken.
how do we prevent heartbeat from choosing the "wrong" node for promotion?

dopd handles both.
how does your proposal?
by stopping the secondary when the replication link broke?
but that must not happen. how could it then possibly resync, ever?
and it won't work for the variation with the cluster crash.
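
situation 2 as a toy timeline, to show where the data jumps back in time. this is a sketch only: `Copy` and `situation_2` are invented names, and "version" stands in for the state of the data set.

```python
class Copy:
    """One node's copy of the data (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.version = 0
        self.outdated = False


def situation_2(outdate_on_link_loss):
    primary, secondary = Copy("primary"), Copy("secondary")
    # Connected: a write reaches both copies.
    primary.version = secondary.version = 1
    # Replication link breaks. With an outdate mechanism, the secondary
    # is marked outdated before the primary continues.
    if outdate_on_link_loss:
        secondary.outdated = True
    # Primary keeps writing; the secondary's copy is now stale.
    primary.version = 2
    # Primary crashes; the cluster manager tries to promote the secondary.
    if secondary.outdated:
        return "promotion refused"
    return "online with stale data: version %d, latest was %d" % (
        secondary.version, primary.version)
```

`situation_2(False)` goes online with version 1 although version 2 existed; `situation_2(True)` refuses the promotion.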

> Scenario C - secondary fails:
>
> Primary:
> - Same as A, actually, from the point of view of the primary.
>
> Secondary:
> - Either gets fenced, or stopped.
>
>
> (Note that A/B/C could actually work for active/active too, as long as
> there's a way to ensure that only one side calls out to fail its peer,
> and the other one - for the sake of this scenario - behaves like a
> secondary.)
>
> > some questions:
> > wouldn't that "peer has failed" first trigger a monitor?
>
> No; it'd translate to a direct stop.
>
> > wouldn't that mean that on monitor, a not connected secondary would
> > have to report "failed", as otherwise it would not get stopped?
> > wouldn't that prevent normal failover?
>
> Monitoring definitions are a slightly different matter. The result of a
> monitor is not the same as the ability/preference to become master.
> Indeed a failed resource will never get promoted, but a happy resource
> needn't call crm_master and thus not become promotable.
>
> I think "monitor" would refer exclusively to local health - local
> storage read/writable, drbd running, etc.
>
> > if not,
> > wouldn't heartbeat try to restart the "failed" secondary?
> > what would happen?
>
> It might try to restart. But if a secondary gets restarted, it'll know
> from the environment variables that a peer exists; if it can't connect
> to that, it should fail the start - alternatively, it'd be up and
> running, but have "outdated/peer is dirty" set anyway, and so never
> announce it's ability to "promote".
>
> > what does a secondary do when started, and it finds the
> > "primary IS ahead of me" flag in meta data?
> > refuse to start even as slave?
> > (would prevent it from ever being resync'ed!)
> > start as slave, but refuse to be promoted?
>
> The latter.
>
> > problem: secondary crash.
> > secondary reboots,
> > heartbeat rejoins the cluster.
> >
> > replication link is still broken.
> >
> > secondary does not have "primary IS ahead of me" flag in meta data
> > as because of the crash there was no way to store that.
>
> > would heartbeat try to start drbd (slave) here?
> > what would trigger the "IS ahead of me" flag get stored on disk?
>
> See above; it would _always_ come up with the assumption that "peer is
> dirty",

let's call it "peer may be more recent".

> and thus refuse to promote. No need to store anything on disk;
> it is the default assumption.

then you can never go online after cluster crash,
unless all drbd nodes come up _and_ can establish connection.

no availability does match the problem description
"don't go online with stale data."
but it is not exactly what we want.

I need the ability to store on disk that "_I_ am ahead of peer"
if I know for sure, so I can be promoted after crash/reboot.

> > if for some reason policy engine now figures the master should rather
> > run on the just rejoined node, how can that migration be prevented?
>
> That's a different discussion, but: the ability (and preference for) to
> become primary is explicitly set by the RA through the call to
> "crm_master".
>
> If it is unable to become master, it would call "crm_master -D"; it'll
> then _never_ be promoted.

for that, it would need to know first that it is outdated. so it _is_
the same problem to some degree, as it depends on that solution.

> > I'm still not convinced that this method
> > covers as many as dopd as good as dopd.
>
> I think so. At least my proposal is becoming more concise, which is good
> for review ;-)

this time you made a step backwards, as you seem to think that drbd does
not need to store any information about being outdated.

to again point out what problem we are trying to solve:
whenever a secondary is about to be promoted,
it needs to be "reasonably" certain that it has the most recent data,
otherwise it would refuse.
it does not matter whether the promotion attempt happens
right after connection loss,
or three and a half days, two cluster crashes
and some node reboots later.

as it is almost impossible to be certain that you have the most recent
data, but it is very well possible to know that you are outdated (as
that does not change without a resync),
the dopd logic revolves around "outdate".

we thoroughly thought about how to solve it.
the result was dopd.
any solution to replace dopd must at least
cover as many scenarios as well as dopd.

the best way to replace dopd would be to find a more "high level"
mechanism for a surviving Primary to actively signal a surviving
Secondary to outdate itself (or get feedback why that was not possible),
and for a not-connected Secondary which is about to be promoted
to ask its peer to outdate itself, which may be refused as it may be
primary.
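
the two requests described above, as a minimal sketch. `Peer`, `handle_outdate_request` and `try_promote` are hypothetical names, and the behaviour when the peer cannot be reached at all is an assumption of this sketch, not something stated here.

```python
class Peer:
    """A drbd node as seen by the outdate protocol (illustrative only)."""

    def __init__(self, role):
        self.role = role
        self.outdated = False

    def handle_outdate_request(self):
        # The peer asks us to outdate ourselves; a primary refuses.
        if self.role == "primary":
            return False
        self.outdated = True
        return True


def try_promote(node, peer, peer_reachable):
    """A not-connected secondary is about to be promoted."""
    if node.outdated:
        return False  # known to be stale: refuse right away
    if peer_reachable and not peer.handle_outdate_request():
        return False  # peer is alive and primary: keep out of the way
    # Assumption: an unreachable peer does not block the promotion.
    node.role = "primary"
    return True
```

a surviving primary uses the same request in the other direction: it calls `handle_outdate_request()` on the secondary and learns from the return value whether the outdate happened.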

if you want to solve it differently,
it becomes a real mess of fragile complex hackwork and assumptions.

if it is possible to express
"hello crm, something bad has happened,
would you please notify my other
clones/masters/slaves/whatever the term is
that I am still alive, and about to continue to change the data,
and tell me when that is done, so I can unfreeze",
that would take it half way. to fully replace dopd,
we'd need a way to communicate back.

if it is possible for the master to tell the crm that the slave has
failed and should therefore be stopped, then this should not be that
difficult.

alternatively, if we can put some state into the cib
(in current drbd e.g. the "data generation uuids"),
that might work as well.

"failing", i.e. stopping, the slave
just because of a replication link hiccup
is no solution.

--
: Lars Ellenberg
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks
of LINBIT Information Technologies GmbH
__
please don't Cc me, but send to list -- I'm subscribed


lmb at suse

Sep 13, 2008, 3:28 PM

Post #10 of 22 (3436 views)
Re: Re: [DRBD-user] drbd peer outdater: higher level implementation?

On 2008-09-13T22:44:56, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:

> bad choice.
> it has a meaning.
> we have
> uptodate (necessarily consistent),
> consistent (not neccessarily uptodate)
> outdated (consistent,
> but we know a more recent version existed at some point)
> inconsistent.

Ok, ok. I'll call it "the flag" then for the time being ;-)

> > After it connects (and possibly resynchronizes), it clears the
> > "peer dirty" flag locally. (Assume that it both sides are secondary at
> > this point; that'd be the default during a cluster start-up.)
> >
> > When one side gets promoted, the other side sets the "peer dirty" flag
> > locally. When it demotes, both sides clear it. Basically, each side
> > gets to clear it when it notices that the peer is demoted. So far, so
> > good.
> >
> > Scenario A - the replication link goes down:
> >
> > - Primary:
> > - Freezes IO.
> > - Calls out to user-space to "fail" the peer.
> > - Gets confirmation that peer is stopped (via RA notification).
> > - Resumes IO.
> >
> > - Secondary:
> > - Simply gets stopped.
> > - It'll assume "peer dirty" anyway, until it reconnects and
> > resyncs.
>
> how can it possibly reconnect and resync,
> if it is stopped?

I meant "eventually", i.e. at some point the admin is going to fix it and
then it'll be able to reconnect and resync, and clear the flag.

The emphasis is on the first part of the sentence - as the flag is set
by default on start-up anyway, the secondary can "simply" be stopped w/o
needing to write anything to disk.

> you still ignore _my_ scenario 2,
> I fail to see why you think you cover it.

I thought it is covered. I'll try again.

> right now, without dopd, and this "all new dopd in higher levels with
> notifications and stuff" does not exist yet, either
> this is possible:

Well, of course I'm describing the target scenario, not the current one.
I entirely agree that that is possible right now.

> situation 2: "outdate needed or data jumps back in time"
>
> replication link breaks

That's the very first scenario I described!

> primary keeps writing
> (which means secondary has now stale data)

First, it wouldn't keep writing in the described approach, but freeze,
and only resume to write after it has been notified that the peer has
been stopped.

> primary crashes
> heartbeat promotes secondary to primary
> and goes online with stale data.

Second, even if Pacemaker would restart the secondary (which was stopped
due to the failure), the secondary would be unable to promote as "the
flag" would be set by default on start-up.

I really believe that the approach I described covers this.

> variation: instead of primary crash, cluster crash.
> cluster reboot, replication link still broken.
> how do we prevent heartbeat from chosing the "wrong" node for promotion?

This scenario is indeed not perfectly handled by my approach as
described: it does handle that the "wrong" secondary doesn't get
promoted, but it would indeed prevent _both_ sides from being promoted,
which is not good.

First, the theoretical response to this is that replication link down
plus crash of two nodes actually constitutes a triple failure, and thus
not one we claim the cluster protects against. ;-) For some customers,
manual intervention here would be acceptable.


But second, a possible solution is to write a persistent "I was primary"
flag to the meta-data. On start, this would then set crm_master's
preference to a non-zero value (say, 1), which would allow the node to be
promoted. This might be a tunable operation.


> dopd handles both.
> how does your proposal?
> by stopping the secondary when the replication link broke?

That's what I explained, yes.

> but that must not happen. how could it then possibly resync, ever?

Pacemaker can be configured to restart it too, which would attempt a
reconnect (or even attempt the reconnect periodically, if the RA would
fail to start if unable to connect to the peer, but that might not even
be needed - restarting it once and keeping it running is sufficient).

Further, I might wish to actually stop the secondary _to be able to move
it to another node_ (which might be able to reconnect & resync).

> > See above; it would _always_ come up with the assumption that "peer is
> > dirty",
>
> lets call it "peer may be more recent".

OK. I'm still calling it "the flag" because it's easier to type ;-)

> > and thus refuse to promote. No need to store anything on disk;
> > it is the default assumption.
>
> then you can never go online after cluster crash,
> unless all drbd nodes come up _and_ can establish connection.

See above for one possible solution.

Okay, now you're going to propose the following scenario:

- Primary N1 crashes
- Secondary N2 gets promoted
- Cluster crash
- Replication link down
- Both nodes N1+N2 up

With the extension I propose above, both sides would set the same master
preference, while we'd obviously want N2 promoted, not N1. But then,
dopd wouldn't help this. Instead of writing 1 though, they could use one
of the generation counters (primary transitions seen?), which would be
n+1 for N2 and cause N2 to be (correctly) promoted.
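
A small sketch of that idea: each node advertises its counter as master preference and the highest value wins. The function `pick_promotion_candidate` is hypothetical, and treating a tie as "promote nobody" is a choice made for this sketch only.

```python
def pick_promotion_candidate(preferences):
    """preferences maps node name -> advertised master preference.

    Returns the single node with the highest preference, or None when
    there is no node or the highest value is shared (a tie).
    """
    if not preferences:
        return None
    best = max(preferences.values())
    winners = sorted(n for n, v in preferences.items() if v == best)
    return winners[0] if len(winners) == 1 else None
```

With a constant 1 written on both sides, `{"N1": 1, "N2": 1}` gives no clear candidate; with counters, `{"N1": 3, "N2": 4}` selects N2.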

(Of course I can construct a sequence of failures which would break even
that, to which I'd reply that they really should simply use the same
bonded interfaces for both their cluster traffic _and_ the replication,
to completely avoid this problem ;-)

> no availability does match the problem description
> "don't go online with stale data."
> but it is not exactly what we want.

Depends on the scenario, but I think my above scenario works fine.

> I need the ability to store on disk that "_I_ am ahead of peer"
> if I know for sure, so I can be promoted after crash/reboot.

Ok, I see your point, and that is I think what I proposed above.

> > I think so. At least my proposal is becoming more concise, which is good
> > for review ;-)
> this time you made a step backwards, as you seem to think that drbd does
> not need to store any information about being outdated.

I actually still think this is so, yes.

> to again point out what problem we are trying to solve:
> whenever a secondary is about to be promoted,
> it needs to be "reasonably" certain that is has the most recent data,
> otherwise it would refuse.
> it does not matter whether the promotion attempt happens
> right after connection loss,
> or three and a half days, two cluster crashes
> and some node reboots later.

Right, and agreed.

> as it is almost impossible to be certain that you have the most recent
> data, but it is very well possible to know that you are outdated (as
> that does not change without a resync),
> the dopd logic revolves around "outdate".

No disagreement there. I'm not saying dopd doesn't solve the problem.
I'm just trying to find a solution which solves it without needing dopd,
but which can instead leverage that Pacemaker is quite a bit smarter
than heartbeat-v1; hence my proposal above.

> we thoroughly thought about how to solve it.
> the result was dopd.
> any solution to replace dopd must at least
> cover as many scenarios as good as dopd.

Of course. I'm not disagreeing.

> the best way to replace dopd would be to find a more "high level"
> mechanism for a surviving Primary to actively signal a surviving
> Secondary to outdate itself (or get feedback why that was not possible),

Restarting it does that in my proposal (it would possibly come back up
with 'the flag' set by default) - and it does get active feedback that
the peer was stopped.

Indeed, it would NOT get feedback if that was not possible - that is a
new requirement. But that's impossible (okay, okay, "unlikely"), as
failure to stop would trigger the recovery escalation and eventually
stonith the former peer. Of course, that could fail _too_, but then I
think we've arrived at so many failures that "freeze" is an acceptable
response.

> and for a not-connected Secondary which is about to be promoted
> to ask its peer to outdate itself, which may be refused as it may be
> primary.

I don't see the need for this second requirement. First, a not-connected
secondary in my example would never promote (unless it was primary
before, with the extension); second, if a primary exists, the cluster
would never promote a second one (that's protected against).

> if you want to solve it differently,
> it becomes a real mess of fragile complex hackwork and assumptions.

Please don't call something that I thought a lot about "a mess and
hackwork" - that is sort of too easy to take personally. It is sufficient
to point out why it doesn't work ;-) I'm not saying I got it right; I
just _think_ I got it right.

And that doesn't mean that I'm disagreeing that dopd also solves it. But
I thought the intention was to try and get rid of it. My goal is to make
the entire setup less complex, which means cutting out as much as
possible with the intent of making it less fragile and easier to set up.

(And of course, to find out if there are things which are missing in the
m/s concept which might need to be introduced to achieve the former.)

> if it is possible to express
> "hello crm, something bad has happened,
> would you please notify my other
> clones/masters/slaves/whatever the terminus
> that I am still alive, and about to continue to change the data."
> and, tell me when that is done, so I can unfreeze",
> that would take it half way.

That is quite easily doable.

> to fully replace dopd, we'd need a way to communicate back.
>
> if it is possible for the master to tell the crm that the slave has
> failed and should therefore be stopped, then this should not be that
> difficult.

Right, exactly.

> alternatively, if we can put some state into the cib
> (in current drbd e.g. the "data generation uuids"),
> that might work as well.

Yes, that's somewhat what I proposed to do to solve the N1 versus N2
scenario.

> "failing", i.e. stopping, the slave
> just because a replication link hickup
> is no solution.

Depends, as above - it could get restarted (possibly elsewhere) and then
try to reconnect, but the primary would be _sure_ that the secondary
will not be promoted.


To be frank, the _real_ problem we're solving here is that drbd and the
cluster layer are (or at least can be) disconnected, and we are trying to
figure out how much integration is needed and how to achieve it. If all
meta-data communication were taken out of drbd's data channel and
instead routed through user-space and the cluster layer, none of this
could happen, and the in-kernel implementation could probably be
simplified quite a bit. But that strikes me as quite a stretch goal for
the time being.

The easiest way to achieve this in 99% of all real-world cases with the
current code probably is to set up a bonded interface and route both the
cluster traffic and drbd across it. The likelihood of the two diverging
then approaches epsilon. I sometimes wonder if that would not be the
ultimately smarter thing to recommend than trying to implement complex
code. ;-)


Regards,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



lars.ellenberg at linbit

Sep 14, 2008, 7:31 AM

Post #11 of 22 (3430 views)
Re: Re: [DRBD-user] drbd peer outdater: higher level implementation?

On Sun, Sep 14, 2008 at 12:28:51AM +0200, Lars Marowsky-Bree wrote:
> > > After it connects (and possibly resynchronizes), it clears the
> > > "peer dirty" flag locally. (Assume that it both sides are secondary at
> > > this point; that'd be the default during a cluster start-up.)
> > >
> > > When one side gets promoted, the other side sets the "peer dirty" flag
> > > locally. When it demotes, both sides clear it. Basically, each side
> > > gets to clear it when it notices that the peer is demoted. So far, so
> > > good.
> > >
> > > Scenario A - the replication link goes down:
> > >
> > > - Primary:
> > > - Freezes IO.
> > > - Calls out to user-space to "fail" the peer.
> > > - Gets confirmation that peer is stopped (via RA notification).
> > > - Resumes IO.
> > >
> > > - Secondary:
> > > - Simply gets stopped.
> > > - It'll assume "peer dirty" anyway, until it reconnects and
> > > resyncs.
> >
> > how can it possibly reconnect and resync,
> > if it is stopped?
>
> I meant "eventually", ie sometime the admin is going to fix it and then
> it'll be able to reconnect and resync, and clear the flag.

admin intervention required for a network hiccup.
not an option.

> > right now, without dopd, and this "all new dopd in higher levels with
> > notifications and stuff" does not exist yet, either
> > this is possible:
>
> Well, of course I'm describing the target scenario, not the current one.
> I entirely agree that that is possible right now.

sure. but we have dopd. and it covers this.
master/slave notifications alone, as was your original proposal,
certainly cannot, as you meanwhile noticed.
your current, combined proposal involving the notifications for some
part, and calling out to "fail" a node, i.e. stop the secondary because
of a network hiccup, is worse than dopd.

you are trying to convince me to stay with dopd ;)

> > situation 2: "outdate needed or data jumps back in time"
> >
> > replication link breaks
>
> That's the very first scenario I described!
>
> > primary keeps writing
> > (which means secondary has now stale data)
>
> First, it wouldn't keep writing in the described approach, but freeze,
> and only resume to write after it has been notified that the peer has
> been stopped.
>
> > primary crashes
> > heartbeat promotes secondary to primary
> > and goes online with stale data.
>
> Second, even if Pacemaker would restart the secondary (which was stopped
> due to the failure), the secondary would be unable to promote as "the
> flag" would be set by default on start-up.
>
> I really believe that the approach I described covers this.

and needs admin intervention to start the secondary again,
just because some switch reset.

> > variation: instead of primary crash, cluster crash.
> > cluster reboot, replication link still broken.
> > how do we prevent heartbeat from chosing the "wrong" node for promotion?
>
> This scenario is indeed not perfectly handled by my approach as
> described: it does handle that the "wrong" secondary doesn't get
> promoted, but it would indeed prevent _both_ sides from being promoted,
> which is not good.
>
> First, the theoretical response to this is that replication link down
> plus crash of two nodes actually constitutes a triple failure, and thus
> not one we claim the cluster protects against. ;-) For some customers,
> manual intervention here would be acceptable.

dopd handles it, your proposal does not.
dopd is already there.
dopd is simpler.
dopd wins.

> But second, a possible solution is to write a persistent "I was primary"
> flag to the meta-data.

already there.

> On start, this would then set crm_master's
> preference to non-zero value (say, 1), which would allow the node to be
> promoted. This might be a tunable operation.

absolutely not. after a real primary crash, the secondary was promoted.
then cluster crash: both have their "I was primary" flag set.
it's not that simple.

you must not focus on one scenario. you need to keep them all in mind
when designing a replacement for dopd.
it is not that simple, after all, right?

> > dopd handles both.
> > how does your proposal?
> > by stopping the secondary when the replication link broke?
>
> That's what I explained, yes.
>
> > but that must not happen. how could it then possibly resync, ever?
>
> Pacemaker can be configured to restart it too, which would attempt a
> reconnect (or even attempt the reconnect periodically, if the RA would
> fail to start if unable to connect to the peer, but that might not even
> be needed - restarting it once and keeping it running is sufficient).

drbd attempts to reconnect on its own.
dopd does not need to restart it.
dopd wins.

> Further, I might wish to actually stop the secondary _to be able to move
> it to another node_ (which might be able to reconnect & resync).

The typical deployment with DRBD is still
two nodes with direct attached storage each.
if you are arguing floating peers around SANs,
then that is a rare special case.
if you are arguing "cold spare" drbd nodes with DAS
(which is an even more rare special case), do you really think that
getting rid of dopd is worth a full sync every time we have a network
hiccup?

> > > See above; it would _always_ come up with the assumption that "peer is
> > > dirty",
> >
> > lets call it "peer may be more recent".
>
> OK. I'm still calling it "the flag" because it's easier to type ;-)

and, btw, drbd does already do so, and would call out to dopd if you try
to promote an unconnected Secondary. unless it already knows that it
itself is outdated, which makes it refuse right away.

> > > and thus refuse to promote. No need to store anything on disk;
> > > it is the default assumption.
> >
> > then you can never go online after cluster crash,
> > unless all drbd nodes come up _and_ can establish connection.
>
> See above for one possible solution.

I'm just pointing out that you started out claiming
"I actually think that dopd is the real hack, and drbd instead should
listen to the notifications we provide, and infer the peer state by that
means ..."
I accused you of handwaving, and you said no, it is all perfectly clear.

now, when it comes to filling in those dots,
you need to reconsider again and again.
while dopd is already there.
and even if it is a "hack", it does a pretty good job.

> Okay, now you're going to propose the following scenario:
>
> - Primary N1 crashes
> - Secondary N2 gets promoted
> - Cluster crash
> - Replication link down
> - Both nodes N1+N2 up

good.
I see you start to look at the whole picture ;)

> With the extension I propose above, both sides would set the same master
> preference, while we'd obviously want N2 promoted, not N1. But then,
> dopd wouldn't help this. Instead of writing 1 though, they could use one
> of the generation counters (primary transitions seen?), which would be
> n+1 for N2 and cause N2 to be (correctly) promoted.

btw.
there are no generation counters anymore.
there is a reason for that: they are not reliable.
drbd8 does compare a history of generation UUIDs. while still not
perfect, it is much more reliable than generation counters.
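
to give an idea why a history of UUIDs helps, here is a deliberately simplified sketch. this is NOT the comparison drbd8 actually performs; the function name, the list representation (newest first) and the return values are all assumptions made for illustration.

```python
def compare_histories(mine, theirs):
    """Compare two generation UUID histories, newest entry first.

    Returns "same", "ahead", "behind" or "diverged-or-unrelated".
    """
    if not mine or not theirs:
        return "diverged-or-unrelated"
    if mine[0] == theirs[0]:
        return "same"
    # Their current generation is one of my older ones: I moved on.
    if theirs[0] in mine[1:]:
        return "ahead"
    # My current generation is one of their older ones: they moved on.
    if mine[0] in theirs[1:]:
        return "behind"
    # Neither side's current generation is known to the other.
    return "diverged-or-unrelated"
```

unlike two counters that can reach the same number independently, two nodes that both started a new generation on their own end up with different current UUIDs, so the conflict is visible instead of looking like a tie.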

> (Of course I can construct a sequence of failures which would break even
> that, to which I'd reply that they really should simply use the same
> bonded interfaces for both their cluster traffic _and_ the replication,
> to completely avoid this problem ;-)

it does not need to be perfect.
it just needs to be as good as "the hack" dopd.
and preferably as simple.

> > no availability does match the problem description
> > "don't go online with stale data."
> > but it is not exactly what we want.
>
> Depends on the scenario, but I think my above scenario works fine.
>
> > I need the ability to store on disk that "_I_ am ahead of peer"
> > if I know for sure, so I can be promoted after crash/reboot.
>
> Ok, I see your point, and that is I think what I proposed above.
>
> > > I think so. At least my proposal is becoming more concise, which is good
> > > for review ;-)
> > this time you made a step backwards, as you seem to think that drbd does
> > not need to store any information about being outdated.
>
> I actually still think this is so, yes.
>
> > to again point out what problem we are trying to solve:
> > whenever a secondary is about to be promoted,
> > it needs to be "reasonably" certain that is has the most recent data,
> > otherwise it would refuse.
> > it does not matter whether the promotion attempt happens
> > right after connection loss,
> > or three and a half days, two cluster crashes
> > and some node reboots later.
>
> Right, and agreed.
>
> > as it is almost impossible to be certain that you have the most recent
> > data, but it is very well possible to know that you are outdated (as
> > that does not change without a resync),
> > the dopd logic revolves around "outdate".
>
> No disagreement there. I'm not saying dopd doesn't solve the problem.
> I'm just trying to find a solution which solves it without needing dopd,
> but which can instead leverage that Pacemaker is quite a bit smarter
> than heartbeat-v1; hence my proposal above.
>
> > we thoroughly thought about how to solve it.
> > the result was dopd.
> > any solution to replace dopd must at least
> > cover as many scenarios as good as dopd.
>
> Of course. I'm not disagreeing.
>
> > the best way to replace dopd would be to find a more "high level"
> > mechanism for a surviving Primary to actively signal a surviving
> > Secondary to outdate itself (or get feedback why that was not possible),
>
> Restarting it does that in my proposal (it would possibly come back up
> with 'the flag' set by default) - and it does get active feedback that
> the peer was stopped.
>
> Indeed, it would NOT get feedback if that was not possible - that is a
> new requirement. But that's impossible (okay, okay, "unlikely"), as
> failure to stop would trigger the recovery escalation and eventually
> stonith the former peer. Of course, if that fails _too_, but then I
> think we've arrived at so many failures that "freeze" is an acceptable
> response.
>
> > and for a not-connected Secondary which is about to be promoted
> > to ask its peer to outdate itself, which may be refused as it may be
> > primary.
>
> I don't see the need for this second requirement. First, a not-connected
> secondary in my example would never promote (unless it was primary
> before, with the extension); second, if a primary exists, the cluster
> would never promote a second one (that's protected against).

we have two unconnected secondaries.
for this example, let's even assume they are equivalent,
have the very same data generation.
we promote one of them. so the other needs to be outdated.

> > if you want to solve it differently,
> > it becomes a real mess of fragile complex hackwork and assumptions.
>
> Please, don't call something that I thought a lot about is "a mess and
> hackwork" - that is sort of too easy to take personal.

I don't mean it that way, you know that.
but, while at that level,
"you said hack first" ;)

> It is sufficient to point out why it doesn't work ;-) I'm not saying I
> got it right; I just _think_ I got it right.

and you never say what you think ;^)

> And that doesn't mean that I'm disagreeing that dopd also solves it. But
> I thought the intention was to try and get rid of it.

the intention is to replace it with something as effective,
hopefully less dependent on low level infrastructure,
and possibly simpler.
so far, I think dopd is still simpler and more effective.

> My goal is to make the entire setup less complex, which means cutting
> out as much as possible with the intent of making it less fragile and
> easier to setup.

I fully agree to that goal.

> (And of course, to find out if there are things which are missing in the
> m/s concept which might need to be introduced to achieve the former.)
>
> > if it is possible to express
> > "hello crm, something bad has happened,
> > would you please notify my other
> > clones/masters/slaves/whatever the terminus
> > that I am still alive, and about to continue to change the data."
> > and, tell me when that is done, so I can unfreeze",
> > that would take it half way.
>
> That is quite easily doable.
>
> > to fully replace dopd, we'd need a way to communicate back.
> >
> > if it is possible for the master to tell the crm that the slave has
> > failed and should therefore be stopped, then this should not be that
> > difficult.
>
> Right, exactly.
>
> > alternatively, if we can put some state into the cib
> > (in current drbd e.g. the "data generation uuids"),
> > that might work as well.
>
> Yes, that's somewhat what I proposed to do to solve the N1 versus N2
> scenario.

that is something I mentioned as a result of a linbit internal
discussion in my mail Date: Sat, 13 Sep 2008 03:01:11 +0200

> > "failing", i.e. stopping, the slave
> > just because a replication link hickup
> > is no solution.
>
> Depends, as above - it could get restarted (possibly elsewhere) and then
> try to reconnect, but the primary would be _sure_ that the secondary
> will not be promoted.
>
> To be frank, the _real_ problem we're solving here is that drbd and the
> cluster layer are (or at least can be) disconnected, and trying to
> figure out how much of that is needed and how to achieve it.

that is good. that's why I'm here, writing back,
so you can sharpen your proposal on my rejection
until it's sharp enough.

> If all meta-data communication were taken out of drbd's data channel
> but instead routed through user-space and the cluster layer, none of
> this could happen, and the in-kernel implementation probably quite
> simplified. But that strikes me as quite a stretch goal for the time
> being.

you can have that. don't use drbd then. use md.
but there was a reason that you did not.
that's not really suitable for that purpose.
right.

> The easiest way to achieve this in 99% of all real-world cases with the
> current code probably is to setup a bonded interface and route both the
> cluster traffic and drbd across it. The likelihood of the two diverging
> then approaches epsilon. I sometimes wonder if that would not be the
> ultimately smarter thing to recommend than trying to implement complex
> code. ;-)

we do recommend to use a bonded replication link.
and, dopd is simple, and it is implemented.

please, don't give up yet.
for starting out with three dots,
your proposal is amazingly good already.
it just needs to simplify some more,
and maybe get rid of required spurious restarts.
think about it some more, and I'll finally give in.

--
: Lars Ellenberg
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks
of LINBIT Information Technologies GmbH
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


lmb at suse

Sep 14, 2008, 9:10 AM

Post #12 of 22 (3429 views)
Re: Re: [DRBD-user] drbd peer outdater: higher level implementation? [In reply to]

On 2008-09-14T16:31:36, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:

> > I meant "eventually", ie sometime the admin is going to fix it and then
> > it'll be able to reconnect and resync, and clear the flag.
> admin intervention required for a network hickup.
> not an option.

Depends on how the cluster is configured. A restart can happen
automatically, too. And, with floating peers implemented right, possibly
on another node - which some customers might like.

> > Well, of course I'm describing the target scenario, not the current one.
> > I entirely agree that that is possible right now.
> sure. but we have dopd. and it covers this.
> master/slave notifications alone, as was your original proposal,
> certainly cannot, as you meanwhile noticed.
> your current, combined proposal involving the notifications for some
> part, and calling out to "fail" a node, i.e. stop the secondary because
> of a network hickup, is worse than dopd.

I'm not sure it is. It achieves what dopd does w/o needing dopd, but by
interacting with the cluster layer only, and by, I think, simplifying
drbd's logic - I think that's a win.

And yes, for the cluster to notice that something needs to be done about
the replication link going down _requires_ some call-out of some form.
That's internal state of drbd the cluster naturally doesn't have access
to, so it must be communicated somehow. However, dopd still remains
internal and unknown to the cluster; so the cluster's policy system
can't help with the recovery. Exposing this might have some charm,
described further below.

The notifications provide each clone (or m/s) instance with the
cluster's state about the peers and intended/completed state changes, so
I think those are useful.

> you try to convince me to stay with dopd ;)

Well, yes; that is one possible result of exploring the other
alternatives, at least then we'll all agree and understand why that is
the case. Or even to identify that we might have scenarios where dopd is
needed, and others where a different approach is recommended ...

> > Second, even if Pacemaker would restart the secondary (which was stopped
> > due to the failure), the secondary would be unable to promote as "the
> > flag" would be set by default on start-up.
> >
> > I really believe that the approach I described covers this.
>
> and needs admin intervention to start the secondary again,
> just because some switch reset.

How so? In fact, the default response by Pacemaker to a failed resource
is to _restart_ it. No admin intervention required.

But that's tunable; it could be set to restart it only N times within M
seconds, or fail-over to a different node etc - that strikes me as quite
powerful enough.

> > First, the theoretical response to this is that replication link down
> > plus crash of two nodes actually constitutes a triple failure, and thus
> > not one we claim the cluster protects against. ;-) For some customers,
> > manual intervention here would be acceptable.
>
> dopd handles it, your proposal does not.
> dopd is already there.
> dopd is simpler.
> dopd wins.

I disagree. I was merely pointing out the fact that we can always
construct failure sequences which are not satisfactorily solved.

For example, in your scenario, if after the cluster crash only the old
secondary comes back, I am _sure_ there is someone out there who'd
rather continue serving data with possibly a few transactions missing
than not at all, which would require an admin to step in and clear the
outdated flag.

(And no, I have no answer to this case ;-)

> > But second, a possible solution is to write a persistent "I was primary"
> > flag to the meta-data.
> already there.

Perfect ;-) Then nothing further is needed.

> > On start, this would then set crm_master's
> > preference to non-zero value (say, 1), which would allow the node to be
> > promoted. This might be a tunable operation.
> absolutely not. for a real primary crash, the secondary was promoted.
> cluster crash. both have their "I was primary" flag set.
> it's not that simple.

I know that. As I've pointed out later describing _exactly this
scenario_, they could instead use the "number of primary transitions
seen" (acknowledging the shortcomings of g-c's, but which would be quite
reasonable here), which would make Pacemaker prefer the more "recent"
node.

But actually, in this case, neither side would have the "outdated" flag
set either, so we're actually discussing something not quite related to
dopd anyway, aren't we?

> > > but that must not happen. how could it then possibly resync, ever?
> > Pacemaker can be configured to restart it too, which would attempt a
> > reconnect (or even attempt the reconnect periodically, if the RA would
> > fail to start if unable to connect to the peer, but that might not even
> > be needed - restarting it once and keeping it running is sufficient).
>
> drbd attempts to reconnect on its own.

Exactly. Hence why it would do that after a restart.

> dopd does not need to restart it.
> dopd wins.

"restarting it" is merely a way of achieving the dopd functionality w/o
needing dopd. _Of course_ dopd can already do that. If you're going to
critique my proposal on the basis that it doesn't do more than dopd, of
course you're going to be right. ;-)

Anyway, I can actually point out a few cases: the restart might happen
on another node, which dopd can't achieve. The (possibly forced)
restart might clear up some internal state within drbd (or the whole
system) which might allow it to reconnect.

The former is my pet project of floating peers, but the latter is not
that unlikely, either. Many errors are transient and are solved by a
restart (if not even a reboot).

> The typical deployment with DRBD is still
> two nodes with direct attached storage each.

Yes. I understand that. I think it likely accounts for 95% of all our
customer deployments, if not 98%. This is partly because drbd is hard to
extend to other scenarios right now though, not because there would not
be a need for it.

> if you are arguing floating peers around SANs,
> then that is a rare special case.

Yes, of course I know. I'm not sure it would stay as rare as that,
though. But it would enable drbd to be used in rather interesting and
more ... lucrative deployments, too.

> if you are arguing "cold spare" drbd nodes with DAS,
> (which is an even more rare special case) you really think that getting
> rid of dopd was worth a full sync every time we have a network hiccup?

Well, first, this is not that rare in demand, though of course not
easily possible today. Some customers with bladecenters would quite like
this.

Second, no, of course not every time. Pacemaker could be configured to
try up to, say, 3 local restarts within 24 hours before doing a
fail-over to another node - and suddenly the deployment sounds a bit
more attractive ...
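
To make the "N local restarts within M hours, then fail over" policy concrete, here is a minimal sketch. It is only an illustration of the decision described above: the function name, its parameters and the return strings are made up for this example, and it does not claim to reflect how Pacemaker's Policy Engine actually implements or configures this.

```python
from typing import Sequence


def recovery_action(failure_times: Sequence[float],
                    now: float,
                    max_local_restarts: int = 3,
                    window_seconds: float = 24 * 3600) -> str:
    """Decide how to recover a resource that has just failed on this node.

    failure_times holds the timestamps (in seconds) of earlier failures on
    this node; the failure being handled right now is not included.
    All names here are hypothetical, chosen for illustration only.
    """
    # Only failures inside the sliding window count against the local node.
    recent = [t for t in failure_times if now - t < window_seconds]
    if len(recent) < max_local_restarts:
        return "restart-local"
    return "fail-over"
```

With the defaults, the first three failures inside a 24 hour window lead to a local restart and the fourth leads to a fail-over; failures older than the window are forgotten.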

For these last two "rare" scenarios, the fail-over might not just be
caused by the replication link going down, but also by the node losing
its storage or the connection to it, in which case a fail-over is quite
desirable.

> > See above for one possible solution.
>
> I'm just pointing out that you started out claiming
> "I actually think that dopd is the real hack, and drbd instead should
> listen to the notifications we provide, and infer the peer state by that
> means ..."
> I accused you of handwaving, and you said no, it is all perfectly clear.

Well, I admit to having been wrong on the "perfectly clear". I thought
it was clear, and the elaborate discussion is quite helpful.

And calling dopd the real hack might have been offensive, for which I
apologize. But I'd still like to understand if we could do without it, and
possibly even achieve more.

> now, when it comes to fill in those dots,
> you need to reconsider again and again.

Right. That tends to happen during a dialogue - it would make me look
rather silly if I ignored new insights, wouldn't it? ;-)

> while dopd is already there.

Yes, it's there for heartbeat, but it is not there at all for openAIS,
and I don't think it works well with the m/s resources (which I'm also
trying to improve here). So I'm looking at how we could achieve this for
the future.

(Personally, I consider heartbeat as the cluster comm layer as dead as
you think drbd-0.7 to be; it won't be around on SLE11, for example. So
we really need to find out how we could merge the two well.)

This would, as it relies "only" on Pacemaker, continue working on top of
the heartbeat comm-layer of course too.

> and even if is is a "hack", does a pretty good job.

True.

> good.
> I see you start to look at the whole picture ;)

I'm always happy to learn ;-)

> btw.
> there are no generation counters anymore.
> there is a reason for that: they are not reliable.
> drbd8 does compare a history of generation UUIDs. while still not
> perfect, it is much more reliable than generation counters.

Good to know.

But, even if this is somewhat unrelated to the "outdated" discussion,
how would you translate this to the "N1+N2" (ie, two former primaries)
recovery scenario? Compared to heartbeat v1, at least Pacemaker would
allow you to declare your preference for becoming primary, but that
needs to be numeric (higher integer wins). Maybe worth a separate
thread, but I could see them pushing their "UUID" into the CIB on start,
and then (in post-start notification) the side which finds the other
side's UUID not in its history would declare itself unable to become
master. (Similar to what you discussed with Andreas Kurz, but just
applies to "start" and not every monitor.)
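
A rough sketch of that check, under heavy simplification: each node is assumed to publish just one "current" UUID, and a node steps back when the peer's UUID is unknown to it. The function and argument names are invented for this illustration; the real drbd8 comparison looks at more than a single current UUID and a flat history.

```python
from typing import Iterable


def may_become_master(own_current: str,
                      own_history: Iterable[str],
                      peer_current: str) -> bool:
    """Return True if this node has seen the peer's data generation.

    If the peer's current UUID equals ours or appears in our history, the
    peer is at the same generation or behind us, so we may be promoted.
    If we have never seen the peer's UUID, the peer holds data we do not
    know about, and we declare ourselves unable to become master.
    (Hypothetical helper, not drbd's actual UUID comparison.)
    """
    known = {own_current, *own_history}
    return peer_current in known
```

In the N1+N2 scenario, N2 (current "B", history ["A"]) recognises N1's "A" and may be promoted, while N1 (current "A") does not know "B" and steps back. If both sides have diverged, neither recognises the other and neither is promoted automatically.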

> > (Of course I can construct a sequence of failures which would break even
> > that, to which I'd reply that they really should simply use the same
> > bonded interfaces for both their cluster traffic _and_ the replication,
> > to completely avoid this problem ;-)
> it does not need to be perfect.
> it just needs to be as good as "the hack" dopd.
> and preferably as simple.

To be honest, simply using the same links would be simpler.

(Tangent: we have a similar issue for example with OCFS2 and the DLM,
and trying to tell the cluster that A can no longer talk to C is an icky
problem. There's no m/s state as with drbd, but the topology complexity
of N>2 makes up for this :-/)

On the other hand, that wouldn't clear up the cases where the
replication link is down because of some internal state hiccup, which
the approach outlined might help with.

> > I don't see the need for this second requirement. First, a not-connected
> > secondary in my example would never promote (unless it was primary
> > before, with the extension); second, if a primary exists, the cluster
> > would never promote a second one (that's protected against).
> we have to unconnected secondaries.
> for this example, lets even assume they are equivalent,
> have the very same data generation.
> we promote one of them. so the other needs to be outdated.

"No problem" ;-)

Pacemaker will deliver an "I am about to promote your peer" notification,
promote the peer, and then an "I just promoted the peer" notification.
So, the remaining secondary can use that notification to update its
knowledge that the peer is now ahead of it.
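
As a sketch of how a resource agent might act on the post-promote notification: the function below takes the notify environment as a plain mapping and reports whether some other node was just promoted. The helper name is made up, and the `OCF_RESKEY_CRM_meta_notify_*` variable names are given as I understand Pacemaker to pass them to OCF agents; treat them as an assumption and check them against your Pacemaker version.

```python
from typing import Mapping


def peer_now_ahead(env: Mapping[str, str], my_node: str) -> bool:
    """Return True if this notification says another node was just promoted.

    env is the environment of the agent's "notify" action. The variable
    names are assumptions about what Pacemaker provides, not verified here.
    """
    if env.get("OCF_RESKEY_CRM_meta_notify_type") != "post":
        return False
    if env.get("OCF_RESKEY_CRM_meta_notify_operation") != "promote":
        return False
    # Space-separated list of node names that were promoted.
    promoted = env.get("OCF_RESKEY_CRM_meta_notify_promote_uname", "").split()
    return any(node != my_node for node in promoted)
```

A secondary for which this returns True would then record, persistently, that the peer is ahead of it.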

> "you said hack first" ;)

Whom are you calling a hack!?!?!?! ;-)

> > My goal is to make the entire setup less complex, which means cutting
> > out as much as possible with the intent of making it less fragile and
> > easier to setup.
> I fully agree to that goal.

That's good, so at least we can figure it out from here ... And yes,
dopd is simple.

> > To be frank, the _real_ problem we're solving here is that drbd and the
> > cluster layer are (or at least can be) disconnected, and trying to
> > figure out how much of that is needed and how to achieve it.
> that is good. thats why I'm here, writing back,
> so you can sharpen you proposal on my rejection
> until its sharp enough.

It provides a great excuse from more boring work too. ;-)

> > If all meta-data communication were taken out of drbd's data channel
> > but instead routed through user-space and the cluster layer, none of
> > this could happen, and the in-kernel implementation probably quite
> > simplified. But that strikes me as quite a stretch goal for the time
> > being.
> you can have that. don't use drbd then. use md.
> but there was a reason that you did not.
> thats not really suitable for that purpose.
> right.

Right. But I also know you're looking at the future with the new
replication framework etc, and maybe we might want to reconsider this.
After all, we now _do_ have a "standard" for reliable cluster comms,
called openAIS, works between Oracle/RHT/GFS2/OCFS2/Novell/etc, we have
more powerful network block devices (iSCSI ...), so it might make sense
to leverage it, combine it with the knowledge of drbd's state machine
for replication, and make drbd-9 ;-) But yes, that's clearly longer
reach than what we're trying to discuss here. It always helps to look
towards the future though.

> please, don't give up yet.

I've been with this project for almost 9 years. I'm too stubborn to give
up ;-)

> for starting out with three dots,
> your proposal is amazingly good already.
> it just needs to simplify some more,

More simple is always good.

> and maybe get rid of required spurious restarts.

The restart of the secondary is not just "spurious" though. It might
actually help "fix" (or at least "reset") things. Restarts are amazingly
simple and effective.

For example, if the link broke due to some OS/module issue, the stop
might fail, and the node would actually get fenced, and reconnect
"fine". Or the stop might succeed, and the reinitialization on 'start'
is sufficient to clear things up.

This might seem like "voodoo" and hand waving, but Gray&Reuter quote a
ratio between soft/transient to hard errors for software of about 100:1
- that is, restarting solves a _lot_ of problems. (Hence why STONITH
happens to be so effective in practice, while it is crude in theory.)

It also moves the policy decision to the, well, Policy Engine, where a
number of other recovery actions could be triggered - including those
"rare cases".


Regards,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



lars.ellenberg at linbit

Sep 14, 2008, 12:02 PM

Post #13 of 22 (3426 views)
Re: Re: [DRBD-user] drbd peer outdater: higher level implementation? [In reply to]

On Sun, Sep 14, 2008 at 06:10:50PM +0200, Lars Marowsky-Bree wrote:

<snipped some parts where we most likely agree,
or where it is "only opinion"/>

> Anyway, I can actually point out a few cases: the restart might happen
> on another node, which dopd can't achieve. The (possibly forced)
> restart might clear up some internal state within drbd (or the whole
> system) which might allow it to reconnect.
>
> The former is my pet project of floating peers, but the latter is not
> that unlikely, either. Many errors are transient and are solved by a
> restart (if not even a reboot).
>
> > The typical deployment with DRBD is still
> > two nodes with direct attached storage each.
>
> Yes. I understand that. I think it likely accounts for 95% of all our
> customer deployments, if not 98%. This is partly because drbd is hard to
> extend to other scenarios right now though, not because there would not
> be a need for it.
>
> > if you are arguing floating peers around SANs,
> > then that is a rare special case.
>
> Yes, of course I know. I'm not sure it would stay as rare as that,
> though. But it would enable drbd to be used in rather interesting and
> more ... lucrative deployments, too.

while drbd is able to enhance SANs,
it is actually out there to replace them ;)

> > if you are arguing "cold spare" drbd nodes with DAS,
> > (which is an even more rare special case) you really think that getting
> > rid of dopd was worth a full sync every time we have a network hiccup?
>
> Well, first, this is not that rare in demand, though of course not
> easily possible today. Some customers with bladecenters would quite like
> this.
>
> Second, no, of course not every time. Pacemaker could be configured to
> try up to, say, 3 local restarts within 24 hours before doing a
> fail-over to another node - and suddenly the deployment sounds a bit
> more attractive ...
>
> For these last two "rare" scenarios, the fail-over might not just be
> caused by the replication link going down, but also by the node losing
> its storage or the connection to it, in which case a fail-over is quite
> desirable.
>
> > > See above for one possible solution.
> >
> > I'm just pointing out that you started out claiming
> > "I actually think that dopd is the real hack, and drbd instead should
> > listen to the notifications we provide, and infer the peer state by that
> > means ..."
> > I accused you of handwaving, and you said no, it is all perfectly clear.
>
> Well, I admit to having been wrong on the "perfectly clear". I thought
> it was clear, and the elaborate discussion is quite helpful.
>
> And calling dopd the real hack might have been offensive, for which I
> apologize. But I'd still like understand if we could do without it, and
> possibly even achieve more.
>
> > now, when it comes to fill in those dots,
> > you need to reconsider again and again.
>
> Right. That tends to happen during a dialogue - it would make me look
> rather silly if I ignored new insights, wouldn't it? ;-)
>
> > while dopd is already there.
>
> Yes, it's there for heartbeat, but it is not there at all for openAIS,

I have been told someone is working on this, though.
and it's not too difficult.
but yes, as I mentioned early in this thread,
I'm happy to replace the method of communication (dopd)
with something more "high level", like a combination of "crm fail"
commands and notification events, if it does not get too,
how shall I put it, "artificial".
but we are getting closer already.

> and I don't think it works well with the m/s resources (which I'm also
> trying to improve here). So I'm looking how we could achieve this for
> the future.
>
> (Personally, I consider heartbeat as the cluster comm layer as dead as
> you think drbd-0.7 to be; it won't be around on SLE11, for example. So
> we really need to find out how we could merge the two well.)

> This would, as it relies "only" on Pacemaker, continue working on top of
> the heartbeat comm-layer of course too.
>
> > and even if is is a "hack", does a pretty good job.
>
> True.
>
> > good.
> > I see you start to look at the whole picture ;)
>
> I'm always happy to learn ;-)
>
> > btw.
> > there are no generation counters anymore.
> > there is a reason for that: they are not reliable.
> > drbd8 does compare a history of generation UUIDs. while still not
> > perfect, it is much more reliable than generation counters.
>
> Good to know.
>
> But, even if this is somewhat unrelated to the "outdated" discussion,
> how would you translate this to the "N1+N2" (ie, two former primaries)
> recovery scenario?

what we currently do?

- Primary N1 crashes
- Secondary N2 gets promoted
* at which point it knows it is ahead of N1,
and stores that even in meta data *
- Cluster crash
- Replication link down
- Both nodes N1+N2 up

- N1 does know it comes up after primary crash,
so if asked to be promoted,
it first tries (via dopd) to outdate N2
- N2 does know it comes up after primary crash,
and that it has newer data than N1

in addition, because of how wfc-timeout and degr-wfc-timeout work,
the drbd initscript would use wfc-timeout on N1 (which is "forever" by
default), but degr-wfc-timeout on N2 (which is a finite time by default)

so yes, N2 would be promoted, because
* it knows that it is ahead of N1,
* N1 would wait for the connection (much longer) before even continuing
the boot process.
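
The timeout selection described above boils down to something like the following sketch. The function name and the "was degraded before reboot" flag are illustrative only, and None stands for "wait forever"; the actual defaults should be taken from the drbd.conf documentation, not from this example.

```python
from typing import Optional


def boot_wait_timeout(was_degraded_before_reboot: bool,
                      wfc_timeout: Optional[float],
                      degr_wfc_timeout: Optional[float]) -> Optional[float]:
    """Seconds to wait for the peer connection at boot; None means forever.

    A node that was already running without its peer before the reboot
    (N2 here) uses degr_wfc_timeout; a node that went down while still
    connected (N1 here) uses wfc_timeout. Hypothetical helper.
    """
    if was_degraded_before_reboot:
        return degr_wfc_timeout
    return wfc_timeout
```

With wfc_timeout unlimited and a finite degr_wfc_timeout, N2 stops waiting first and is therefore the node that becomes available for promotion.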

in case the wfc-timeouts are both set to "very short",
to correctly deal with this situation,
a node (N2) that knows it is ahead of the other must refuse to be "outdated"
by the known outdated node (N1), even if that node (N1) does not know already,
and even if not currently primary (N2, before being promoted).
we should correctly implement that, but I need to double check.
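
The rule in the last paragraph can be sketched as a small model. This is an illustration of the logic only, with invented names; it ignores the case where the peer cannot be reached at all, and it is not drbd or dopd code.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    name: str
    is_primary: bool = False
    knows_peer_is_behind: bool = False  # persisted "I am ahead of peer" flag
    outdated: bool = False

    def handle_outdate_request(self) -> bool:
        """The peer asks us to outdate ourselves; True if we complied."""
        # Refuse if we are Primary, or if we know we have newer data,
        # even when we are currently only Secondary.
        if self.is_primary or self.knows_peer_is_behind:
            return False
        self.outdated = True
        return True


def try_promote(node: NodeState, peer: NodeState) -> bool:
    """Promote node only if it is not outdated and the peer accepts outdating."""
    if node.outdated:
        return False
    if not peer.handle_outdate_request():
        return False
    node.is_primary = True
    return True
```

With N2 carrying the "ahead of peer" flag, a promotion attempt on N1 is refused regardless of which node is asked first, while N2 can outdate N1 and be promoted.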

> Compared to heartbeat v1, at least Pacemaker would
> allow you to declare your preference for becoming primary, but that
> needs to be numeric (higher integer wins). Maybe worth a separate
> thread, but I could see them pushing their "UUID" into the CIB on start,
> and then (in post-start notification) the side which finds the other
> side's UUID not in its history would declare itself unable to become
> master. (Similar to what you discussed with Andreas Kurz, but just
> applies to "start" and not every monitor.)
>
> > > (Of course I can construct a sequence of failures which would break even
> > > that, to which I'd reply that they really should simply use the same
> > > bonded interfaces for both their cluster traffic _and_ the replication,
> > > to completely avoid this problem ;-)
> > it does not need to be perfect.
> > it just needs to be as good as "the hack" dopd.
> > and preferably as simple.
>
> To be honest, simply using the same links would be simpler.

then we are back to "true" split brain scenarios.
and discussing quorum in a two-node cluster.

sure that would be simpler.
but it would cause either no-availability
or data divergence every time that link breaks.

> (Tangent: we have a similar issue for example with OCFS2 and the DLM,
> and trying to tell the cluster that A can no longer talk to C is an icky
> problem. There's no m/s state as with drbd, but the topology complexity
> of N>2 makes up for this :-/)
>
> On the other hand, that wouldn't clear up the cases where the
> replication link is down because of some internal state hiccup, which
> the approach outlined might help with.
>
> > > I don't see the need for this second requirement. First, a not-connected
> > > secondary in my example would never promote (unless it was primary
> > > before, with the extension); second, if a primary exists, the cluster
> > > would never promote a second one (that's protected against).
> > we have to unconnected secondaries.
> > for this example, lets even assume they are equivalent,
> > have the very same data generation.
> > we promote one of them. so the other needs to be outdated.
>
> "No problem" ;-)
>
> Pacemaker will deliver a "I am about to promote your peer" notification,
> promote the peer, and then a "I just promoted the peer" notification.
> So, it can use that notification to update its knowledge that the peer
> is now ahead of it.

ok.

> > "you said hack first" ;)
>
> Whom are you calling a hack!?!?!?! ;-)
>
> > > My goal is to make the entire setup less complex, which means cutting
> > > out as much as possible with the intent of making it less fragile and
> > > easier to setup.
> > I fully agree to that goal.
>
> That's good, so at least we can figure it out from here ... And yes,
> dopd is simple.
>
> > > To be frank, the _real_ problem we're solving here is that drbd and the
> > > cluster layer are (or at least can be) disconnected, and trying to
> > > figure out how much of that is needed and how to achieve it.
> > that is good. thats why I'm here, writing back,
> > so you can sharpen you proposal on my rejection
> > until its sharp enough.
>
> It provides a great excuse from more boring work too. ;-)

you know, I have a paper to write...
and have kept avoiding that for weeks now.

> > > If all meta-data communication were taken out of drbd's data channel
> > > but instead routed through user-space and the cluster layer, none of
> > > this could happen, and the in-kernel implementation probably quite
> > > simplified. But that strikes me as quite a stretch goal for the time
> > > being.
> > you can have that. don't use drbd then. use md.
> > but there was a reason that you did not.
> > thats not really suitable for that purpose.
> > right.
>
> Right. But I also know you're looking at the future with the new
> replication framework etc, and maybe we might want to reconsider this.
> Afterall, we now _do_ have a "standard" for reliable cluster comms,
> called openAIS, works between Oracle/RHT/GFS2/OCFS2/Novell/etc, we have
> more powerful network block devices (iSCSI ...), so it might make sense
> to leverage it, combine it with the knowledge of drbd's state machine
> for replication, and make drbd-9 ;-) But yes, that's clearly longer
> reach than what we're trying to discuss here. It always helps to look
> towards the future though.

absolutely. I'm going to extend drbd UUIDs to something I call
"monotonic storage time", lacking a better term. as long as that can be
communicated somehow, each node knows exactly whether it lags behind, and
how much. it's all in that unwritten paper ;)

> > please, don't give up yet.
>
> I've been with this project for almost 9 years. I'm too stubborn to give
> up ;-)

I thought so :)

> > for starting out with three dots,
> > your proposal is amazingly good already.
> > it just needs to simplify some more,
>
> More simple is always good.
>
> > and maybe get rid of required spurious restarts.
>
> The restart of the secondary is not just "spurious" though. It might
> actually help "fix" (or at least "reset") things. Restarts are amazingly
> simple and effective.

hmm.

> For example, if the link broke due to some OS/module issue, the stop
> might fail, and the node would actually get fenced, and reconnect
> "fine". Or the stop might succeed, and the reinitialization on 'start'
> is sufficient to clear things up.
>
> This might seem like "voodoo" and hand waving, but Gray&Reuter quote a
> ratio between soft/transient to hard errors for software of about 100:1
> - that is, restarting solves a _lot_ of problems. (Hence why STONITH
> happens to be so effective in practice, while it is crude in theory.)
>
> It also moves the policy decision to the, well, Policy Engine, where a
> number of other recovery actions could be triggered - including those
> "rare cases".

ok, you modify "your" ocf drbd RA as a proof of concept?

according to your proposal,
on the drbd part,
we'd only need to replace the outdate-peer-handler
from "drbd-peer-outdater" to "some other program calling crm fail
appropriately and block until confirmed".

that's just an entry in the config file
(and someone needs to write that script).

later we may make it easier for the script by
extending the logic in the drbd module,
to make it easier for asynchronous confirmation.
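
A minimal sketch of what such a replacement handler could look like, assuming the script only has to (a) ask the cluster to fail the peer's drbd instance and (b) block until the cluster confirms it. The names `fence_peer_via_cluster`, `request_fail` and `is_confirmed` are hypothetical; the thread does not define them, and the actual crm invocation is deliberately left to the caller.

```python
import time
from typing import Callable


def fence_peer_via_cluster(
    request_fail: Callable[[], None],
    is_confirmed: Callable[[], bool],
    timeout_s: float = 60.0,
    poll_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Ask the cluster to fail the peer, then block until confirmed.

    request_fail: hypothetical hook that triggers the "crm fail" of the peer.
    is_confirmed: hypothetical hook that reports whether the cluster has
                  confirmed that the peer was stopped, failed or fenced.
    Returns True on confirmation, False if timeout_s elapses first.
    """
    request_fail()
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_confirmed():
            return True
        sleep(poll_s)
    # One last check so a confirmation arriving at the deadline is not lost.
    return is_confirmed()
```

Polling is used here only because it is the simplest thing to show; the notification-driven variant discussed in the thread would replace the loop with a wait on a signal.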

--
: Lars Ellenberg
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks
of LINBIT Information Technologies GmbH
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


lmb at suse

Sep 14, 2008, 3:56 PM

Post #14 of 22 (3420 views)
Permalink
Re: Re: [DRBD-user] drbd peer outdater: higher level implementation? [In reply to]

On 2008-09-14T21:02:20, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:

> <snipped some parts where we most likely agree,
> or where it is "only opinion"/>

Yes, thanks, good idea, time to shrink the discussion down some.


> > Yes, of course I know. I'm not sure it would stay as rare as that,
> > though. But it would enable drbd to be used in rather interesting and
> > more ... lucrative deployments, too.
> while drbd is able to enhance SANs,
> it is actually out there to replace them ;)

I know ;-) But using it to replicate between SANs (or having standby
systems for the replica with DAS) brings us closer to the point where we
can deliver "split-site" clusters.

> > Yes, it's there for heartbeat, but it is not there at all for openAIS,
> I have been told someone is working on this, though.

Oh?

> and it's not too difficult.

Right, as dopd uses mostly simple message exchange, it wouldn't be.
Good time though to reassess.

> what we currently do?
>
> - Primary N1 crashes
> - Secondary N2 gets promoted
> * at which point it knows it is ahead of N1,
> and stores that even in meta data *
> - Cluster crash
> - Replication link down
> - Both nodes N1+N2 up
>
> - N1 does know it comes up after primary crash,
> so if asked to be promoted,
> it first tries (via dopd) to outdate N2
> - N2 does know it comes up after primary crash,
> and that it has newer data than N1
>
> in addition, because of how wfc-timeout and degr-wfc-timeout work,
> the drbd initscript would use wfc-timeout on N1 (which is "forever" by
> default), but degr-wfc-timeout on N2 (which is a finite time by default)
>
> so yes, N2 would be promoted, because
> * it knows that it is ahead of N1,
> * N1 would wait for the connection (much longer) before even continuing
> the boot process.
>
> in case the wfc-timeouts are both set to "very short",
> to correctly deal with this situation,
> a node (N2) that knows it is ahead of the other must refuse to be "outdated"
> by the known outdated node (N1), even if that node (N1) does not know already,
> and even if not currently primary (N2, before being promoted).
> we should correctly implement that, but I need to double check.

Not exactly trivial either, but then, it is not exactly a trivial
failure sequence. Thanks for the explanation.
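
The rule described in the quoted sequence (a node that knows it is ahead must refuse to be outdated) can be sketched as a toy model. This is an illustration only, not drbd's implementation; the class `Node` and its fields are hypothetical names, and the only persistent state modelled is "I was promoted while the peer was gone".

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    is_primary: bool = False
    knows_peer_is_behind: bool = False  # stands in for the stored meta data
    outdated: bool = False

    def promote_without_peer(self) -> None:
        """Promotion while the peer is gone: remember that we are ahead."""
        self.is_primary = True
        self.knows_peer_is_behind = True

    def crash(self) -> None:
        """The role is lost on a crash; the meta data survives."""
        self.is_primary = False

    def handle_outdate_request(self) -> bool:
        """Return True if this node accepted being marked outdated."""
        if self.is_primary or self.knows_peer_is_behind:
            return False  # refuse: we hold the newer data
        self.outdated = True
        return True


# Replay of the sequence: N1 crashes as Primary, N2 is promoted,
# then the whole cluster crashes and both come up without a link.
n1 = Node("N1", is_primary=True)
n2 = Node("N2")
n1.crash()
n2.promote_without_peer()
n2.crash()

n2_accepted = n2.handle_outdate_request()  # N1 tries to outdate N2
n1_accepted = n1.handle_outdate_request()  # N2 tries to outdate N1
```

In this model N2 refuses even though it is not currently Primary, which is the case the quoted text says still needs double checking in the real code.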

> > To be honest, simply using the same links would be simpler.
>
> then we are back to "true" split brain scenarios.
> and discussing quorum in a two-node cluster.
>
> sure that would be simpler.
> but it would cause either no-availability
> or data divergence every time that link breaks.

Right; note how my proposal works for "true" split-brain too, of course.

> > It provides a great excuse from more boring work too. ;-)
> you know, I have a paper to write...
> and keep avoiding that for weeks now.

I only have a few more paragraphs to write for my part-time studies
today. Anything else then suddenly becomes so much more attractive. I
wonder how many open source projects harness the power of
procrastination.

> absolutely. I'm going to extend drbd UUIDs to something I call
> "monotonic storage time" lacking a better term. as long as that can be
> communicated somehow, each node knows exactly whether it lags behind, and
> how much. It's all in that unwritten paper ;)

Sounds interesting; is that the Linux Kongress paper?

> > The restart of the secondary is not just "spurious" though. It might
> > actually help "fix" (or at least "reset") things. Restarts are amazingly
> > simple and effective.
> hmm.

You've got to admit that it's a valid point ;-)

> ok, you modify "your" ocf drbd RA as a proof of concept?

Yes, I can do that.

> according to your proposal,
> on the drbd part,
> we'd only need to replace the outdate-peer-handler
> from "drbd-peer-outdater" to "some other program calling crm fail
> appropriately and block until confirmed".

Does drbd on the primary side indeed freeze IO until that script
returns?

And I think the need for the secondary to not allow itself to be
promoted as I described might need to be implemented in drbd. Hrm. I
think I could work around this by setting the "outdated" flag if
stopped while disconnected ...

> that's just an entry in the config file
> (and someone needs to write that script).

That script should be easy too; not pretty, but easy ...

> later we may make it easier for the script by
> extending the logic in the drbd module,
> to make it easier for asynchronous confirmation.

I'd probably make the script block and then have the notification signal
it to continue.

Ok. I'll try to get to this this week, but I might not make it until
Wednesday or so. (I'm doing a half-week and thus need to cram a bit.) If
someone else wants to give it a shot before that, be my guest ;-)



Regards,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



lars.ellenberg at linbit

Sep 14, 2008, 11:28 PM

Post #15 of 22 (3422 views)
Permalink
Re: Re: [DRBD-user] drbd peer outdater: higher level implementation? [In reply to]

On Mon, Sep 15, 2008 at 12:56:22AM +0200, Lars Marowsky-Bree wrote:
> > > To be honest, simply using the same links would be simpler.
> >
> > then we are back to "true" split brain scenarios.
> > and discussing quorum in a two-node cluster.
> >
> > sure that would be simpler.
> > but it would cause either no-availability
> > or data divergence every time that link breaks.
>
> Right; note how my proposal works for "true" split-brain too, of course.

how so?

> > > The restart of the secondary is not just "spurious" though. It might
> > > actually help "fix" (or at least "reset") things. Restarts are amazingly
> > > simple and effective.
> > hmm.
>
> You've got to admit that it's a valid point ;-)

that was more a disagreeing grumble.
it may also break things.

> > ok, you modify "your" ocf drbd RA as a proof of concept?
>
> Yes, I can do that.

but.
before you do.


> > according to your proposal,
> > on the drbd part,
> > we'd only need to replace the outdate-peer-handler
> > from "drbd-peer-outdater" to "some other program calling crm fail
> > appropriately and block until confirmed".
>
> Does drbd on the primary side indeed freeze IO until that script
> returns?

if you set "fencing resource-and-stonith",
yes it does.

"freeze" in the sense that it does not accept new IO.
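
To make the "does not accept new IO" semantics concrete, here is a toy model, assuming that the Primary rejects writes while the handler runs and resumes once it returns successfully. The class `Primary` and its methods are hypothetical; what happens when the handler fails, and how the other fencing settings behave, is not stated in this thread and is marked as an assumption in the code.

```python
from typing import Callable, List


class Primary:
    """Toy model of a DRBD Primary that loses its replication link."""

    def __init__(self, fencing: str, handler: Callable[[], bool]) -> None:
        self.fencing = fencing      # value of the "fencing" setting
        self.handler = handler      # stands in for the outdate-peer handler
        self.frozen = False
        self.written: List[str] = []

    def write(self, block: str) -> bool:
        """Accept a write unless IO is frozen; return whether it was accepted."""
        if self.frozen:
            return False
        self.written.append(block)
        return True

    def on_connection_loss(self) -> None:
        if self.fencing == "resource-and-stonith":
            self.frozen = True          # stop accepting new IO
            if self.handler():          # blocks until the script returns
                self.frozen = False     # peer dealt with: resume IO
            # Assumption: if the handler fails, IO stays frozen.
        else:
            # Assumption: other settings are not described in the thread;
            # modelled here as running the handler without freezing IO.
            self.handler()
```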

> And I think the need for the secondary to not allow itself to be
> promoted as I described might need to be implemented in drbd. Hrm. I
> think I could work around this by setting the "outdated" flag if
> stopped while disconnected ...
>
> > that's just an entry in the config file
> > (and someone needs to write that script).
>
> That script should be easy too; not pretty, but easy ...
>
> > later we may make it easier for the script by
> > extending the logic in the drbd module,
> > to make it easier for asynchronous confirmation.
>
> I'd probably make the script block and then have the notification signal
> it to continue.
>
> Ok. I'll try to get to this this week, but I might not make it until
> Wednesday or so. (I'm doing a half-week and thus need to cram a bit.) If
> someone else wants to give it a shot before that, be my guest ;-)

great. but wait.

if we set aside confused admins for the moment,
and assume CRM is the only entity promoting/demoting drbd.

would it not be enough for a Primary on connection loss to
set some constraint pinning the master role on that node/node group?

the DRBD after-resync handler can then remove that constraint again.
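
The pin-and-release idea above can be sketched roughly as follows. This is a toy, in-memory stand-in for the cluster's constraint store, assuming a single "master only on this node" pin per resource; `ConstraintStore`, `on_connection_loss` and `after_resync` are hypothetical names, not CRM or drbd interfaces.

```python
from typing import Dict, Optional


class ConstraintStore:
    """In-memory stand-in for the cluster's constraint section."""

    def __init__(self) -> None:
        self._pins: Dict[str, str] = {}

    def pin_master(self, resource: str, node: str) -> None:
        self._pins[resource] = node

    def unpin_master(self, resource: str) -> None:
        self._pins.pop(resource, None)

    def may_promote(self, resource: str, node: str) -> bool:
        pinned: Optional[str] = self._pins.get(resource)
        return pinned is None or pinned == node


def on_connection_loss(store: ConstraintStore, resource: str, primary: str) -> None:
    """Primary lost its peer: only this node may hold the master role."""
    store.pin_master(resource, primary)


def after_resync(store: ConstraintStore, resource: str) -> None:
    """Peer is up to date again: the pin is no longer needed."""
    store.unpin_master(resource)
```

The sketch assumes both nodes see the same store, which is exactly what does not hold in a true split-brain.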

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed


lmb at suse

Sep 15, 2008, 12:59 AM

Post #16 of 22 (3405 views)
Permalink
Re: Re: [DRBD-user] drbd peer outdater: higher level implementation? [In reply to]

On 2008-09-15T08:28:18, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:

> > > > The restart of the secondary is not just "spurious" though. It might
> > > > actually help "fix" (or at least "reset") things. Restarts are amazingly
> > > > simple and effective.
> > > hmm.
> > You've got to admit that it's a valid point ;-)
> that was more a disagreeing grumble.
> it may also break things.

How so?


> if we set aside confused admins for the moment,
> and assume CRM is the only entity promoting/demoting drbd.
>
> would it not be enough for a Primary on connection loss to
> set some constraint pinning the master role on that node/node group?
>
> the DRBD after-resync handler can then remove that constraint again.

The idea is interesting. A RA modifying its own constraints ...

However, it wouldn't work for a true split-brain. If the primary does
that before being fenced by the secondary (which, given awkward
circumstances for the split-brain, is possible), and the partition
heals, the master would be pinned to the "wrong" node briefly.

Also, given that it is a split-brain and the constraint is only on one
side, the secondary would allow itself to be promoted - okay, so the
cluster never would before the primary has been fenced, but neither
must the master continue before the secondary has been fenced ...

Does that make sense?


Regards,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



beekhof at gmail

Sep 15, 2008, 1:26 AM

Post #17 of 22 (3407 views)
Permalink
Re: Re: [DRBD-user] drbd peer outdater: higher level implementation? [In reply to]

On Mon, Sep 15, 2008 at 08:28, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:
>
> if we set aside confused admins for the moment,
> and assume CRM is the only entity promoting/demoting drbd.

A very important assumption.

Without it, it doesn't matter how much redundancy you add, you'll
always have a split brain/personality.

You can't put the cluster in charge, tell it to go in one direction,
and then start trying to go off in another. That path leads to
madness for the cluster and the admin.


If people don't want Pacemaker to manage DRBD, that's ok.
I think it's great software but clearly I'm going to be biased.

In such cases, one can use is-managed=false for drbd so other
resources can still reference it in colocation and ordering
dependencies. Or avoid the issue altogether and simply don't tell
the cluster about drbd at all. Whatever works for people is fine by
me.


But if you do want the CRM _actively_ involved, then the only sane
approach is for any external entity to act indirectly via the CRM. I
think we expose everything that might be needed to achieve this, but
I'm happy to be corrected.


lars.ellenberg at linbit

Sep 15, 2008, 1:49 AM

Post #18 of 22 (3403 views)
Permalink
Re: Re: [DRBD-user] drbd peer outdater: higher level implementation? [In reply to]

On Mon, Sep 15, 2008 at 09:59:53AM +0200, Lars Marowsky-Bree wrote:
> On 2008-09-15T08:28:18, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:

> > assume CRM is the only entity promoting/demoting drbd.
> >
> > would it not be enough for a Primary on connection loss to
> > set some constraint pinning the master role on that node/node group?
> >
> > the DRBD after-resync handler can then remove that constraint again.
>
> The idea is interesting. A RA modifying its own constraints ...

well, that is the point. that is what we have been talking about the whole
time: constraining the not-up-to-date side to not go live.
why not use the constraint infrastructure for that?

> However, it wouldn't work for a true split-brain. If the primary does
> that before being fenced by the secondary (which, given awkward
> circumstances for the split-brain, is possible), and the partition
> heals, the master would be pinned to the "wrong" node briefly.
>
> Also, given that it is a split-brain and the constraint is only on one
> side, the secondary would allow itself to be promoted - okay, so the
> cluster never would before the primary has been fenced, but neither
> must the master continue before the secondary has been fenced ...

so it can cause a crm fail (restart) of the other clone,
or preferably (in my opinion) of a dummy resource colocated with the
other clone, if that is needed to cause the other node to be fenced if
unreachable.
if it is reachable, no-brainer.

if you have a true split-brain,
and it in fact happens that both partitions pin their node as only
possible location of the master, then later rejoin, we should have
complementary constraints, causing both diverged datasets to be taken
offline. that is even desirable.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed


lars.ellenberg at linbit

Sep 15, 2008, 2:04 AM

Post #19 of 22 (3410 views)
Permalink
Re: Re: [DRBD-user] drbd peer outdater: higher level implementation? [In reply to]

On Mon, Sep 15, 2008 at 10:26:46AM +0200, Andrew Beekhof wrote:
> On Mon, Sep 15, 2008 at 08:28, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:
> >
> > if we set aside confused admins for the moment,
> > and assume CRM is the only entity promoting/demoting drbd.
>
> A very important assumption.
>
> Without it, it doesn't matter how much redundancy you add, you'll
> always have a split brain/personality.
>
> You can't put the cluster in charge, tell it to go in one direction,
> and then start trying to go off in another. That path leads to
> madness for the cluster and the admin.

Absolutely. No disagreement there at all.

The "outdated" flag stored in drbd metadata does help to remind the
confused admin [1] that he meant to type that command in the other
terminal window.

That is all I wanted to say there.
I just meant to point out this "subtle" difference
in _where_ we store the information that this node is out-of-date.

[1] who tries desperately to rescue what he can after a minor
catastrophe, at 3:17 am Saturday morning

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed


lmb at suse

Sep 15, 2008, 2:09 AM

Post #20 of 22 (3402 views)
Permalink
Re: Re: [DRBD-user] drbd peer outdater: higher level implementation? [In reply to]

On 2008-09-15T10:49:15, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:

> if you have a true split-brain,
> and it in fact happens that both partitions pin their node as only
> possible location of the master, then later rejoin, we should have
> complementary constraints, causing both diverged datasets to be taken
> offline. that is even desirable.

No, you're overlooking something: CIBs don't get merged; one will
overwrite the other, and while it is deterministic, I wouldn't want to embed
that knowledge into the RA. "Merging" is too complicated.

And the "I set a constraint on myself" is not in any way identical to "I
did something (stop|fail|fence) to the secondary." But that's really the
event the master must wait for.


Regards,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



lars.ellenberg at linbit

Sep 15, 2008, 2:10 AM

Post #21 of 22 (3397 views)
Permalink
Re: Re: [DRBD-user] drbd peer outdater: higher level implementation? [In reply to]

On Mon, Sep 15, 2008 at 11:04:14AM +0200, Stefan Seifert wrote:
> Just a side note to this (very interesting) discussion:
> Now that we finally have a working dopd in reach (bugs in drbd fixed and new
> heartbeat version available), the discussion starts about replacing it. I
> just hope that the replacement will not take as long.

that's the good thing about replacing things that work.

you can take all the time it needs
for the replacement to work even better ;)

until then, we have dopd.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed


beekhof at gmail

Sep 15, 2008, 3:12 AM

Post #22 of 22 (3401 views)
Permalink
Re: Re: [DRBD-user] drbd peer outdater: higher level implementation? [In reply to]

On Sep 15, 2008, at 11:04 AM, Lars Ellenberg wrote:

> On Mon, Sep 15, 2008 at 10:26:46AM +0200, Andrew Beekhof wrote:
>> On Mon, Sep 15, 2008 at 08:28, Lars Ellenberg <lars.ellenberg [at] linbit> wrote:
>>>
>>> if we set aside confused admins for the moment,
>>> and assume CRM is the only entity promoting/demoting drbd.
>>
>> A very important assumption.
>>
>> Without it, it doesn't matter how much redundancy you add, you'll
>> always have a split brain/personality.
>>
>> You can't put the cluster in charge, tell it to go in one direction,
>> and then start trying to go off in another. That path leads to
>> madness for the cluster and the admin.
>
> Absolutely. No disagreement there at all.

Cool. Just wanted to make sure we're all on the same page... it
wouldn't have been the first time someone went down the "other" path :-)

>
>
> The "outdated" flag stored in drbd metadata does help to remind the
> confused admin [1] that he meant to type that command in the other
> terminal window.
>
> That is all I wanted to say there.
> I just meant to point out this "subtle" difference
> in _where_ we store the information that this node is out-of-date.

Makes sense.

>
>
> [1] who tries desperately to rescue what he can after a minor
> catastrophe, at 3:17 am Saturday morning

ouch