
Mailing List Archive: DRBD: Users

DOPD problem and Heartbeat

 

 



aluno3 at poczta

Apr 18, 2012, 10:55 AM

Post #1 of 5 (766 views)
DOPD problem and Heartbeat

Hello

We are testing the DOPD mechanism and reviewing the dopd source
(http://hg.linux-ha.org/lha-2.1/file/1d5b54f0a2e0/contrib/drbd-outdate-peer/dopd.c).

Is it OK that the loop in check_drbd_peer checks the status of each node
first and, if the node is dead, returns FALSE right away, even if that
node is only a ping node? The next part of the code checks whether the
node is a 'normal' node, but by then it is too late.

Consider this setup:
- a configured ping node,
- timeouts: ping-int 10, deadping 10, deadtime 30

When the replication link and the ping node go down, dopd starts working.
check_drbd_peer finds a node with status dead (the ping node is dead, the
remote/normal node is fine), returns FALSE, and so does not mark the
remote volumes as outdated over the other, auxiliary path.
Unfortunately, we hit this problem during testing.
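
To make the failure concrete, here is a minimal stand-alone sketch of the order of checks described above. It is not the dopd source: struct node_info and check_peer_unpatched are hypothetical names, and a plain array stands in for the heartbeat node walk.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>  /* strcasecmp (POSIX) */

/* Hypothetical stand-in for one entry of the heartbeat node walk. */
struct node_info {
    const char *name;
    const char *type;    /* "normal" or "ping" */
    const char *status;  /* the sketch only tests for "dead" */
};

/* Same order of checks as the unpatched loop in check_drbd_peer. */
bool check_peer_unpatched(const struct node_info *nodes, size_t count,
                          const char *drbd_peer)
{
    for (size_t i = 0; i < count; i++) {
        /* The status test runs first, whatever the node type is. */
        if (strcmp(nodes[i].status, "dead") == 0)
            return false;

        /* Only reached if no dead node has been seen so far. */
        if (strcmp(nodes[i].type, "normal") == 0 &&
            strcasecmp(nodes[i].name, drbd_peer) == 0)
            return true;
    }
    return false;
}
```

With a node list of a dead ping node followed by two live normal nodes, the function returns false for either normal node as the peer, which is the behaviour reported here. Note that the result depends on walk order: if the peer happens to be listed before the dead ping node, it is still found.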

We know that DRBD timeouts have to be lower than heartbeat timeouts, but
when dopd has to mark a lot of remote resources, it cannot do that in
time. It is easy to run into a race.
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


lars.ellenberg at linbit

Apr 18, 2012, 12:00 PM

Post #2 of 5 (726 views)
Re: DOPD problem and Heartbeat [In reply to]

On Wed, Apr 18, 2012 at 07:55:32PM +0200, aluno3 [at] poczta wrote:
> Hello
>
> We are testing the DOPD mechanism and reviewing the dopd source
> (http://hg.linux-ha.org/lha-2.1/file/1d5b54f0a2e0/contrib/drbd-outdate-peer/dopd.c).

Would you please use heartbeat 3
(and pacemaker, unless you use the haresources mode of heartbeat)?

When using pacemaker, use DRBD's crm-fence-peer.sh.
It covers all the cases dopd would cover, and in fact even a couple more
corner cases in multiple-failure scenarios.
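
For readers who have not set this up before, a drbd.conf fragment along the lines of what the DRBD User's Guide shows for Pacemaker clusters might look as follows. The resource name r0 and the script paths are assumptions; they vary with distribution and DRBD version, so check your installed package.

```
resource r0 {
  disk {
    fencing resource-only;
  }
  handlers {
    # assumed install path; some packages place the scripts elsewhere
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```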

>
> Is it OK that the loop in check_drbd_peer checks the status of each node
> first and, if the node is dead, returns FALSE right away, even if that
> node is only a ping node? The next part of the code checks whether the
> node is a 'normal' node, but by then it is too late.

Then I guess we have to fix that.


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed


aluno3 at poczta

Apr 18, 2012, 12:41 PM

Post #3 of 5 (719 views)
Re: DOPD problem and Heartbeat [In reply to]

> Would you please use heartbeat 3
> (and pacemaker, unless you use the haresources mode of heartbeat)?
>
> When using pacemaker, use DRBD's crm-fence-peer.sh.
> It covers all the cases dopd would cover, and in fact even a couple more
> corner cases in multiple-failure scenarios.

We would like to use heartbeat 3 with the newer CRM, but our front end has not been adapted yet.

> Then I guess we have to fix that.

Maybe the fix should look like this:

--- ./heartbeat/contrib/drbd-outdate-peer/dopd.c 2008-08-18 14:32:19.000000000 +0200
+++ ./heartbeat-dopdfix/contrib/drbd-outdate-peer/dopd.c 2012-04-18 20:10:41.000000000 +0200
@@ -226,7 +226,7 @@ check_drbd_peer(const char *drbd_peer)
     }
     while((node = dopd_cluster_conn->llc_ops->nextnode(dopd_cluster_conn)) != NULL) {
         const char *status = dopd_cluster_conn->llc_ops->node_status(dopd_cluster_conn, node);
-        if (!strcmp(status, "dead")) {
+        if (!strcmp(status, "dead") && !strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))) {
             cl_log(LOG_WARNING, "Cluster node: %s: status: %s",
                     node, status);
             return FALSE;
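
A stand-alone sketch of what this one-line change does, under the same simplifying assumptions as before: struct node_info and check_peer_type_guard are hypothetical names, and an array replaces the heartbeat node walk. It suggests that a dead ping node no longer aborts the walk, while any dead 'normal' node listed before the peer still does, even if it is not the DRBD peer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>  /* strcasecmp (POSIX) */

/* Hypothetical stand-in for one entry of the heartbeat node walk. */
struct node_info {
    const char *name;
    const char *type;    /* "normal" or "ping" */
    const char *status;  /* the sketch only tests for "dead" */
};

/* Order of checks with the extra node-type condition on the dead test. */
bool check_peer_type_guard(const struct node_info *nodes, size_t count,
                           const char *drbd_peer)
{
    for (size_t i = 0; i < count; i++) {
        bool is_normal = strcmp(nodes[i].type, "normal") == 0;

        /* Dead ping nodes are skipped; any dead normal node aborts. */
        if (strcmp(nodes[i].status, "dead") == 0 && is_normal)
            return false;

        if (is_normal && strcasecmp(nodes[i].name, drbd_peer) == 0)
            return true;
    }
    return false;
}
```

In a two-node cluster plus ping nodes this is enough. In a larger cluster, a dead third node that is walked before the peer would still make the function return false.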



lars.ellenberg at linbit

Apr 18, 2012, 12:49 PM

Post #4 of 5 (915 views)
Re: DOPD problem and Heartbeat [In reply to]

On Wed, Apr 18, 2012 at 09:41:53PM +0200, aluno3 wrote:
> Maybe the fix should look like this:
>
> --- ./heartbeat/contrib/drbd-outdate-peer/dopd.c 2008-08-18 14:32:19.000000000 +0200
> +++ ./heartbeat-dopdfix/contrib/drbd-outdate-peer/dopd.c 2012-04-18 20:10:41.000000000 +0200
> @@ -226,7 +226,7 @@ check_drbd_peer(const char *drbd_peer)
>      }
>      while((node = dopd_cluster_conn->llc_ops->nextnode(dopd_cluster_conn)) != NULL) {
>          const char *status = dopd_cluster_conn->llc_ops->node_status(dopd_cluster_conn, node);
> -        if (!strcmp(status, "dead")) {
> +        if (!strcmp(status, "dead") && !strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))) {
>              cl_log(LOG_WARNING, "Cluster node: %s: status: %s",
>                      node, status);
>              return FALSE;

I'd say, it should rather look like (against heartbeat 3 source, so it
may or may not directly apply on your tree; probably best to just copy
over all of contrib/drbd-outdate-peer from 3):

diff --git a/contrib/drbd-outdate-peer/dopd.c b/contrib/drbd-outdate-peer/dopd.c
--- a/contrib/drbd-outdate-peer/dopd.c
+++ b/contrib/drbd-outdate-peer/dopd.c
@@ -226,19 +226,26 @@ check_drbd_peer(const char *drbd_peer)
     }
     while((node = dopd_cluster_conn->llc_ops->nextnode(dopd_cluster_conn)) != NULL) {
         const char *status = dopd_cluster_conn->llc_ops->node_status(dopd_cluster_conn, node);
+
+        /* Look for the peer */
+        if (strcasecmp(node, drbd_peer))
+            continue;
+
+        if (strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))) {
+            cl_log(LOG_WARNING, "Cluster node: %s: status: %s is not a normal node",
+                    node, status);
+            break;
+        }
+
         if (!strcmp(status, "dead")) {
             cl_log(LOG_WARNING, "Cluster node: %s: status: %s",
                     node, status);
-            return FALSE;
+            break;
         }

-        /* Look for the peer */
-        if (!strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))
-                && !strcasecmp(node, drbd_peer)) {
-            cl_log(LOG_DEBUG, "node %s found\n", node);
-            found = TRUE;
-            break;
-        }
+        cl_log(LOG_DEBUG, "node %s found with status %s\n", node, status);
+        found = TRUE;
+        break;
     }
     if (dopd_cluster_conn->llc_ops->end_nodewalk(dopd_cluster_conn) != HA_OK) {
         cl_log(LOG_INFO, "Cannot end node walk");


Not even compile tested, but I think this is what it should look like.
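
The control flow of the patch above can be modelled in a few lines of stand-alone C. This is only a sketch under stated assumptions: struct node_info and check_peer_patched are hypothetical names, an array replaces the heartbeat node walk, and logging is left out. The point it illustrates is that every node other than the peer is skipped before any status or type test runs, and that all exits go through break so the caller can still end the node walk.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>  /* strcasecmp (POSIX) */

/* Hypothetical stand-in for one entry of the heartbeat node walk. */
struct node_info {
    const char *name;
    const char *type;    /* "normal" or "ping" */
    const char *status;  /* the sketch only tests for "dead" */
};

/* Order of checks after the patch: name first, then type, then status. */
bool check_peer_patched(const struct node_info *nodes, size_t count,
                        const char *drbd_peer)
{
    bool found = false;

    for (size_t i = 0; i < count; i++) {
        /* Ignore every node that is not the peer, dead or alive. */
        if (strcasecmp(nodes[i].name, drbd_peer) != 0)
            continue;

        /* The peer must be a normal cluster node. */
        if (strcmp(nodes[i].type, "normal") != 0)
            break;

        /* A dead peer cannot be reached over heartbeat. */
        if (strcmp(nodes[i].status, "dead") == 0)
            break;

        found = true;
        break;
    }
    /* In dopd, end_nodewalk would be called at this point. */
    return found;
}
```

With this ordering, a dead ping node or a dead third node no longer affects the result; only the peer's own type and status decide it.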



aluno3 at poczta

Apr 18, 2012, 1:40 PM

Post #5 of 5 (718 views)
Re: DOPD problem and Heartbeat [In reply to]

On 18.04.2012 21:49, Lars Ellenberg wrote:
> Not even compile tested, but I think this is what it should look like.
>
After a quick test, the fix looks like it is working. Thanks for the help.

