
Mailing List Archive: DRBD: Users

Pacemaker - DRBD fails on node every couple hours

 

 



christoph at iway

Feb 27, 2012, 8:15 AM

Post #1 of 11
Pacemaker - DRBD fails on node every couple hours

We use a simple 2-node active-passive cluster with DRBD and NFS services.

Right now the cluster monitor detects a DRBD failure every couple of hours (~
2-40) and will fail over.
syslog shows the following lines just before Pacemaker initiates the
failover:

--------------------------------------
Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
(p_drbd_r0:0:monitor:stderr) <1>error creating netlink socket
Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
(p_drbd_r0:0:monitor:stderr) Could not connect to 'drbd' generic netlink
family
Feb 24 20:55:54 drbdnode1 crmd: [1662]: info: process_lrm_event: LRM
operation p_drbd_r0:0_monitor_15000 (call=26, rc=7, cib-update=32,
confirmed=false) not running
Feb 24 20:55:55 drbdnode1 attrd: [1661]: notice: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-p_drbd_r0:0 (1)

--------------------------------------

Does anyone have a clue why this might happen?
It only seems to happen when DRBD runs as primary on nodeA, though this node is
designed to always be primary as long as it is online...

thanks
Christoph Roethlisberger




SysConfig (both nodes):
---------------------------------
DualXeon E5606, 24GB RAM, Intel 10GbE NICs

Debian Squeeze
kernel: 3.2.0-0.bpo.1-amd64
pacemaker 1.1.6-2~bpo60+1
heartbeat 1:3.0.5-2~bpo60+1
DRBD 8.4.1 (module and userland tools)
---------------------------------


DRBD Config
---------------------------------
resource r0 {
    volume 0 {
        device minor 0;
        disk /dev/vg01/vol01;
        meta-disk internal;
    }
    on drbdnode1 {
        address 192.168.100.1:7789;
    }
    on drbdnode2 {
        address 192.168.100.2:7789;
    }
}
---------------------------------


Pacemaker Config
---------------------------------
node $id="1b6d29da-484e-4b0f-a3ab-c7de3ea8f3ee" drbdnode1 \
        attributes standby="off"
node $id="8bfba0ca-81cf-4fd1-ac89-79b4b4209151" drbdnode2 \
        attributes standby="off"
primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="15s" role="Master" \
        op monitor interval="30s" role="Slave"
primitive p_exportnfs_cgpro ocf:heartbeat:exportfs \
        params fsid="100" directory="/srv/nfs/cgpro" options="rw,sync,no_wdelay,no_root_squash,no_subtree_check,mountpoint" clientspec="172.20.0.0/24" \
        op monitor interval="30s" wait_for_leasetime_on_stop="true" unlock_on_stop="true" rmtab_backup="none" \
        meta target-role="Started"
primitive p_exportnfs_root ocf:heartbeat:exportfs \
        params fsid="0" directory="/srv/nfs" options="rw,sync,no_wdelay,no_root_squash,no_subtree_check,crossmnt" clientspec="172.20.0.0/24" \
        op monitor interval="30s" wait_for_leasetime_on_stop="true" unlock_on_stop="true" rmtab_backup="none"
primitive p_fsmount_cgpro ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/r0/0" directory="/srv/nfs/cgpro" fstype="xfs" options="nobarrier,inode64" \
        op start interval="0s" timeout="60s" \
        op stop interval="0s" timeout="120s" \
        meta is-managed="true"
primitive p_ipv4 ocf:heartbeat:IPaddr2 \
        params ip="172.20.0.3" nic="eth2" \
        op monitor interval="5s"
primitive p_lsb_nfsserver lsb:nfs-kernel-server \
        op monitor interval="30s"
group g_haservices p_ipv4 p_fsmount_cgpro p_exportnfs_cgpro \
        meta target-role="Started"
ms ms_drbd_r0 p_drbd_r0 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
clone cl_exportnfs_root p_exportnfs_root
clone cl_lsb_nfsserver p_lsb_nfsserver
colocation c_ms-drbd-r0_with_haservices inf: g_haservices ms_drbd_r0:Master
order o_exportnfs-root_before_exportnfs_cgpro 0: cl_exportnfs_root p_exportnfs_cgpro
order o_fsmount-cgrpro-before-exportnfs-cgpro inf: p_fsmount_cgpro p_exportnfs_cgpro:start
order o_lsb-nfsserver-before-exportnfs-root inf: cl_lsb_nfsserver cl_exportnfs_root
order o_ms-drbd-r0-before-fsmount-cgpro inf: ms_drbd_r0:promote p_fsmount_cgpro:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1330068747"
rsc_defaults $id="rsc-options" \
        resource-stickiness="200"
---------------------------------

_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


lars.ellenberg at linbit

Feb 27, 2012, 8:40 AM

Post #2 of 11
Re: Pacemaker - DRBD fails on node every couple hours

On Mon, Feb 27, 2012 at 05:15:29PM +0100, Christoph Roethlisberger wrote:
> We use a simple 2-node active-passive cluster with DRBD and NFS services.
>
> Right now the cluster monitor detects a DRBD failure every couple of
> hours (~ 2-40) and will fail over.
> syslog shows the following lines just before Pacemaker initiates the
> failover:
>
> --------------------------------------
> Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> (p_drbd_r0:0:monitor:stderr) <1>error creating netlink socket
> Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> (p_drbd_r0:0:monitor:stderr) Could not connect to 'drbd' generic
> netlink family


Check that you really have loaded the DRBD 8.4.1 kernel module.

My guess is that you have some drbd 8.3 module loaded.

find /lib/modules/`uname -r` -name "drbd.ko"

You probably have more than one.
Make sure you load the one you want.


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com


christoph at iway

Feb 28, 2012, 2:14 AM

Post #3 of 11
Re: Pacemaker - DRBD fails on node every couple hours

A find shows two modules - the one that got installed with the kernel
(kernel/drivers/block/drbd/drbd.ko) and the one that we have compiled from
the sources (updates/drbd.ko)

---------snip------------
#find /lib/modules/`uname -r` -name "drbd.ko"
/lib/modules/3.2.0-0.bpo.1-amd64/updates/drbd.ko
/lib/modules/3.2.0-0.bpo.1-amd64/kernel/drivers/block/drbd/drbd.ko
---------snip------------

but the loaded module seems to be the compiled 8.4.1 (checked on both nodes)

--------------snip-------------
#modinfo drbd
filename: /lib/modules/3.2.0-0.bpo.1-amd64/updates/drbd.ko
alias: block-major-147-*
license: GPL
version: 8.4.1
description: drbd - Distributed Replicated Block Device v8.4.1
author: Philipp Reisner <phil [at] linbit>, Lars Ellenberg
<lars [at] linbit>
srcversion: 4A4FDD6F2ECF22BD2AD5970
depends: libcrc32c
vermagic: 3.2.0-0.bpo.1-amd64 SMP mod_unload modversions
parm: minor_count:Approximate number of drbd devices (1-255)
(uint)
parm: disable_sendpage:bool
parm: allow_oos:DONT USE! (bool)
parm: proc_details:int
parm: enable_faults:int
parm: fault_rate:int
parm: fault_count:int
parm: fault_devs:int
parm: usermode_helper:string
--------------snip-------------

I can of course manually delete the stock drbd module - just to make sure,
but I do not believe that this will change anything.

Christoph Roethlisberger





lars.ellenberg at linbit

Feb 28, 2012, 3:47 PM

Post #4 of 11
Re: Pacemaker - DRBD fails on node every couple hours

On Tue, Feb 28, 2012 at 11:14:33AM +0100, Christoph Roethlisberger wrote:
> A find shows two modules - the one that got installed with the
> kernel (kernel/drivers/block/drbd/drbd.ko) and the one that we have
> compiled from the sources (updates/drbd.ko)
>
> ---------snip------------
> #find /lib/modules/`uname -r` -name "drbd.ko"
> /lib/modules/3.2.0-0.bpo.1-amd64/updates/drbd.ko
> /lib/modules/3.2.0-0.bpo.1-amd64/kernel/drivers/block/drbd/drbd.ko
> ---------snip------------
>
> but the loaded module seems to be the compiled 8.4.1 (checked on both nodes)
>
> --------------snip-------------
> #modinfo drbd
> filename: /lib/modules/3.2.0-0.bpo.1-amd64/updates/drbd.ko
> alias: block-major-147-*
> license: GPL
> version: 8.4.1

Nope.
That tells you what modprobe *would* likely load
if you ran modprobe drbd *now*.
It does not tell you what you currently have loaded.

#cat /proc/drbd
#grep . /sys/module/drbd/*version

#rmmod drbd; modprobe drbd; cat /proc/drbd
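
The same comparison can be scripted. What follows is only a minimal sketch: the function names are made up for illustration, and it assumes that /sys/module/drbd/srcversion exists while the module is loaded and that modinfo resolves "drbd" the same way modprobe would.

```shell
#!/bin/sh
# Sketch only: compare the module that is loaded with the one modprobe
# would pick. All function names are hypothetical.

# srcversion of the module currently loaded in the kernel
# (prints nothing if the module is not loaded).
# $1: sysfs module directory, defaults to /sys/module/drbd
loaded_srcversion() {
    cat "${1:-/sys/module/drbd}/srcversion" 2>/dev/null
}

# srcversion of the drbd.ko that modprobe would load right now.
ondisk_srcversion() {
    modinfo -F srcversion drbd 2>/dev/null
}

# Succeeds only if both values are non-empty and identical.
same_module() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" = "$2" ]
}
```

Usage would be something like: same_module "$(loaded_srcversion)" "$(ondisk_srcversion)" && echo "same module" || echo "different, or drbd not loaded". A mismatch suggests the running kernel holds a different drbd.ko than the one on disk.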

> I can of course manually delete the stock drbd module - just to make
> sure, but I do not believe that this will change anything.

Maybe you need to check your initramfs, too...

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com


lars.ellenberg at linbit

Mar 1, 2012, 2:27 AM

Post #5 of 11
Re: Pacemaker - DRBD fails on node every couple hours

On Mon, Feb 27, 2012 at 05:15:29PM +0100, Christoph Roethlisberger wrote:
> We use a simple 2-node active-passive cluster with DRBD and NFS services.
>
> Right now the cluster monitor detects a DRBD failure every couple of
> hours (~ 2-40) and will fail over.


Oh... I may have missed this context, and focused too much on the error
log below.

So you *do* have a working DRBD,
and only the monitor operation fails "occasionally" (much too often,
still), with the below error log.

Did I understand correctly this time?

> syslog shows the following lines just before Pacemaker initiates the
> failover:
>
> --------------------------------------
> Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> (p_drbd_r0:0:monitor:stderr) <1>error creating netlink socket
> Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> (p_drbd_r0:0:monitor:stderr) Could not connect to 'drbd' generic
> netlink family
> Feb 24 20:55:54 drbdnode1 crmd: [1662]: info: process_lrm_event: LRM
> operation p_drbd_r0:0_monitor_15000 (call=26, rc=7, cib-update=32,
> confirmed=false) not running
> Feb 24 20:55:55 drbdnode1 attrd: [1661]: notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> fail-count-p_drbd_r0:0 (1)
>
> --------------------------------------
>
> Does anyone have a clue why this might happen?
> It only seems to happen when DRBD runs as primary on nodeA, though this
> node is designed to always be primary as long as it is
> online...
>
> thanks
> Christoph Roethlisberger

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com


christoph at iway

Mar 1, 2012, 2:48 AM

Post #6 of 11
Re: Pacemaker - DRBD fails on node every couple hours

first, thank you very much for your time, Lars


>> #cat /proc/drbd
>> #grep . /sys/module/drbd/*version
-------------------------
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root [at] drbdnode,
2012-02-13 16:06:27
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate B r-----
ns:0 nr:8496 dw:8496 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

/sys/module/drbd/srcversion:4A4FDD6F2ECF22BD2AD5970
/sys/module/drbd/version:8.4.1
-------------------------

the initrd file also looks good, it contains the 8.4.1 module only



>> So you *do* have a working DRBD,
>> and only the monitor operation fails "occasionally" (much too often,
>> still), with the below error log.


Yes, this *may* be the case.
I'm not sure if the drbd module really crashes (and gets started again by
pacemaker afterwards) or if it never failed at all.
So far I only really see/know that pacemaker detects "a problem" and
initiates the failover.
After that all services continue to run on the other node and drbd switched
its primary/secondary state. (and I see all these errors/messages in the
log)


if it helps - the crm_mon output changes from:

--------------------------------------
Online: [ drbdnodeA drbdnodeB ]

Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
Masters: [ drbdnodeA ]
Slaves: [ drbdnodeB ]
Resource Group: g_haservices
p_ipv4 (ocf::heartbeat:IPaddr2): Started drbdnodeA
p_fsmount_cgpro (ocf::heartbeat:Filesystem): Started drbdnodeA
p_exportnfs_cgpro (ocf::heartbeat:exportfs): Started drbdnodeA
Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
Started: [ drbdnodeA drbdnodeB ]
Clone Set: cl_exportnfs_root [p_exportnfs_root]
Started: [ drbdnodeA drbdnodeB ]
--------------------------------------

into

--------------------------------------
Online: [ drbdnodeA drbdnodeB ]

Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
Masters: [ drbdnodeB ]
Slaves: [ drbdnodeA ]
Resource Group: g_haservices
p_ipv4 (ocf::heartbeat:IPaddr2): Started drbdnodeB
p_fsmount_cgpro (ocf::heartbeat:Filesystem): Started drbdnodeB
p_exportnfs_cgpro (ocf::heartbeat:exportfs): Started drbdnodeB
Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
Started: [ drbdnodeA drbdnodeB ]
Clone Set: cl_exportnfs_root [p_exportnfs_root]
Started: [ drbdnodeA drbdnodeB ]

Failed actions:
p_drbd_r0:0_monitor_15000 (node=drbdnodeA, call=26, rc=7,
status=complete): not running
--------------------------------------

Christoph Roethlisberger



lars.ellenberg at linbit

Mar 1, 2012, 3:15 AM

Post #7 of 11
Re: Pacemaker - DRBD fails on node every couple hours

On Thu, Mar 01, 2012 at 11:27:01AM +0100, Lars Ellenberg wrote:
> On Mon, Feb 27, 2012 at 05:15:29PM +0100, Christoph Roethlisberger wrote:
> > We use a simple 2-node active-passive cluster with DRBD and NFS services.
> >
> > Right now the cluster monitor detects a DRBD failure every couple of
> > hours (~ 2-40) and will fail over.
>
>
> Oh... I may have missed this context, and focused too much on the error
> log below.
>
> So you *do* have a working DRBD,
> and only the monitor operation fails "occasionally" (much too often,
> still), with the below error log.
>
> Did I understand correctly this time?
>
> > syslog shows the following lines just before Pacemaker initiates the
> > failover:
> >
> > --------------------------------------
> > Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> > (p_drbd_r0:0:monitor:stderr) <1>error creating netlink socket


That error message above is triggered only if
 - calloc(1, sizeof(struct genl_sock)) fails;
   very unlikely: that's a few bytes, and you would be so badly out of
   memory that you would know...
 - s->s_fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_GENERIC);
   fails.
 - any of
   err = setsockopt(s->s_fd, SOL_SOCKET, SO_SNDBUF, &bsz, sizeof(bsz)) ||
         setsockopt(s->s_fd, SOL_SOCKET, SO_RCVBUF, &bsz, sizeof(bsz)) ||
         bind(s->s_fd, (struct sockaddr*) &s->s_local, sizeof(s->s_local));
   fails.

None of these is likely to fail "only occasionally".

You could run a tight loop:
i=0; while drbdsetup 0 dstate >/dev/null ; do let i++; done;
echo "failed after $i calls"

If that fails after some time,
you could repeat that with
i=0; while strace -x -s 1024 -o /tmp/whatever.strace.out drbdsetup 0 dstate >/dev/null ; do let i++; done;
echo "failed after $i calls"

Now you should have an strace of the failed run in that file,
which we could analyse...
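
If an unbounded loop is not wanted on a production node, the same idea can be written with an upper limit. This is just a sketch; run_until_failure is a made-up helper name, and the limit is whatever you choose.

```shell
#!/bin/sh
# run_until_failure MAX CMD [ARGS...]
# Runs CMD repeatedly until it fails or until MAX successful calls were made.
# Prints the number of successful calls on stdout.
# Returns 0 if a failure was seen, 1 if MAX was reached without a failure.
# stdout of CMD is discarded, stderr is left alone so the error stays visible.
run_until_failure() {
    max=$1
    shift
    i=0
    while [ "$i" -lt "$max" ]; do
        "$@" >/dev/null || { echo "$i"; return 0; }
        i=$((i + 1))
    done
    echo "$i"
    return 1
}
```

For example: run_until_failure 100000 drbdsetup 0 dstate, or the same with the strace prefix from above in front of drbdsetup. Because strace rewrites its -o file on every call, the file left behind after a failure belongs to the failing run.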

> > Feb 24 20:55:54 drbdnode1 lrmd: [1659]: info: RA output:
> > (p_drbd_r0:0:monitor:stderr) Could not connect to 'drbd' generic
> > netlink family
> > Feb 24 20:55:54 drbdnode1 crmd: [1662]: info: process_lrm_event: LRM
> > operation p_drbd_r0:0_monitor_15000 (call=26, rc=7, cib-update=32,
> > confirmed=false) not running
> > Feb 24 20:55:55 drbdnode1 attrd: [1661]: notice:
> > attrd_trigger_update: Sending flush op to all hosts for:
> > fail-count-p_drbd_r0:0 (1)
> >
> > --------------------------------------
> >
> > Does anyone have a clue why this might happen?
> > It only seems to happen when DRBD runs as primary on nodeA, though this
> > node is designed to always be primary as long as it is
> > online...
> >
> > thanks
> > Christoph Roethlisberger

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed


christoph at iway

Mar 1, 2012, 4:22 AM

Post #8 of 11
Re: Pacemaker - DRBD fails on node every couple hours

I ran the loop a couple of times and it always "failed" rather soon:

-------------------------------------------------------------
# i=0; while drbdsetup 0 dstate >/dev/null ; do let i++; done; echo "failed
after $i calls"
<1>error creating netlink socket
Could not connect to 'drbd' generic netlink family
failed after 28492 calls

# i=0; while drbdsetup 0 dstate >/dev/null ; do let i++; done; echo "failed
after $i calls"
<1>error creating netlink socket
Could not connect to 'drbd' generic netlink family
failed after 31887 calls

# i=0; while drbdsetup 0 dstate >/dev/null ; do let i++; done; echo "failed
after $i calls"
<1>error creating netlink socket
Could not connect to 'drbd' generic netlink family
failed after 10861 calls
-------------------------------------------------------------


Attached you should find the output from the run with strace.


Christoph Röthlisberger
Attachments: drbdstate.strace.out (6.29 KB)


lars.ellenberg at linbit

Mar 1, 2012, 7:04 AM

Post #9 of 11
Re: Pacemaker - DRBD fails on node every couple hours

On Thu, Mar 01, 2012 at 01:22:22PM +0100, Christoph Roethlisberger wrote:
> I ran the loop a couple of times and it always "failed" rather soon:
>
> -------------------------------------------------------------
> # i=0; while drbdsetup 0 dstate >/dev/null ; do let i++; done; echo
> "failed after $i calls"
> <1>error creating netlink socket
> Could not connect to 'drbd' generic netlink family
> failed after 28492 calls
>
> # i=0; while drbdsetup 0 dstate >/dev/null ; do let i++; done; echo
> "failed after $i calls"
> <1>error creating netlink socket
> Could not connect to 'drbd' generic netlink family
> failed after 31887 calls
>
> # i=0; while drbdsetup 0 dstate >/dev/null ; do let i++; done; echo
> "failed after $i calls"
> <1>error creating netlink socket
> Could not connect to 'drbd' generic netlink family
> failed after 10861 calls
> -------------------------------------------------------------
>
>
> Attached you should find the output from the run with strace.

bind(3, {sa_family=AF_NETLINK, pid=1432, groups=00000000}, 12) = -1
EADDRINUSE (Address already in use)
write(2, "<1>error creating netlink socket\n", 33) = 33

That "should not happen", as the pid (port id) is unique,
because it is set to the pid (process id).
Oh well.
We can ask the kernel to assign that "port id" for us.
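
To see whether some other socket already holds a given port id, the kernel's netlink socket table can be inspected. A rough sketch, under these assumptions: /proc/net/netlink has a header line, the protocol number is in column 2 ("Eth") and the port id in column 3 ("Pid"), and NETLINK_GENERIC is protocol 16. The helper name is made up. Note that the table shows the port id only; it does not say which process owns the socket.

```shell
#!/bin/sh
# netlink_port_holders PORT [TABLE]
# Prints the rows of the netlink socket table that are generic netlink
# (protocol 16) sockets bound to port id PORT.
# TABLE defaults to /proc/net/netlink; the column layout is an assumption.
netlink_port_holders() {
    awk -v port="$1" 'NR > 1 && $2 == 16 && $3 == port' \
        "${2:-/proc/net/netlink}"
}
```

With the values from the strace above, netlink_port_holders 1432 would list any generic netlink socket that is already bound to port id 1432.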

Please try if this preliminary patch improves the situation for you.
Let me know if you need any help with applying/rebuilding this.

diff --git a/user/libgenl.c b/user/libgenl.c
index 0a6ea2e..713d653 100644
--- a/user/libgenl.c
+++ b/user/libgenl.c
@@ -26,15 +26,17 @@ int genl_join_mc_group(struct genl_sock *s, const char *name) {
static struct genl_sock *genl_connect(__u32 nl_groups)
{
struct genl_sock *s = calloc(1, sizeof(*s));
+ int sock_len;
int err;
+ int pid = getpid();
int bsz = 2 << 10;

if (!s)
return NULL;

- /* the netlink port id - use the process id, it is unique,
- * and "everyone else does it". */
- s->s_local.nl_pid = getpid();
+ /* autobind; kernel is responsible to give us something unique
+ * in bind() below. */
+ s->s_local.nl_pid = 0;
s->s_local.nl_family = AF_NETLINK;
/*
* If we want to receive multicast traffic on this socket, kernels
@@ -50,9 +52,15 @@ static struct genl_sock *genl_connect(__u32 nl_groups)
if (s->s_fd == -1)
goto fail;

+ sock_len = sizeof(s->s_local);
err = setsockopt(s->s_fd, SOL_SOCKET, SO_SNDBUF, &bsz, sizeof(bsz)) ||
setsockopt(s->s_fd, SOL_SOCKET, SO_RCVBUF, &bsz, sizeof(bsz)) ||
- bind(s->s_fd, (struct sockaddr*) &s->s_local, sizeof(s->s_local));
+ bind(s->s_fd, (struct sockaddr*) &s->s_local, sizeof(s->s_local)) ||
+ getsockname(s->s_fd, (struct sockaddr*) &s->s_local, &sock_len);
+
+ dbg(pid != s_local.nl_pid ? 1 : 3,
+ "bound socket to nl_pid:%u, my pid:%u, len:%d, sizeof:%u\n",
+ s->s_local.nl_pid, pid, sock_len, sizeof(s->s_local));

if (err)
goto fail;



--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com



ff at mpexnet

Mar 1, 2012, 7:22 AM

Post #10 of 11
Re: Pacemaker - DRBD fails on node every couple hours

On 03/01/2012 04:04 PM, Lars Ellenberg wrote:
> - /* the netlink port id - use the process id, it is unique,
> - * and "everyone else does it". */

Hah :-) I like this comment a lot.


christoph at iway

Mar 1, 2012, 11:49 PM

Post #11 of 11
Re: Pacemaker - DRBD fails on node every couple hours

We've managed to apply your patch and compile the sources again, after
changing one of the new code lines:

-----------------------------------------
- dbg(pid != s_local.nl_pid ? 1 : 3,
+ dbg(pid != s->s_local.nl_pid ? 1 : 3,
-----------------------------------------

As for now it seems to have fixed the problem, as

# i=0; while drbdsetup 0 dstate >/dev/null ; do let i++; done; echo
"failed after $i calls"

no longer breaks - though now it prints out this line of debug? output every
couple seconds

------------------------------
<1>bound socket to nl_pid:4294963073, my pid:1432, len:12, sizeof:12

<1>bound socket to nl_pid:4294963072, my pid:1432, len:12, sizeof:12

<1>bound socket to nl_pid:4294963071, my pid:1432, len:12, sizeof:12

<1>bound socket to nl_pid:4294963070, my pid:1432, len:12, sizeof:12

<1>bound socket to nl_pid:4294963069, my pid:1432, len:12, sizeof:12

<1>bound socket to nl_pid:4294963068, my pid:1432, len:12, sizeof:12
------------------------------

dunno if this is bad or not...
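
The large nl_pid values look less odd when read as signed 32-bit numbers. The patch logs this line at level 1 when nl_pid differs from the process id (and at level 3 otherwise), so each line seems to mark a call where the kernel handed out something other than the pid. As far as I understand it, the kernel's autobind falls back to negative port ids counting down from -4097 when the pid is already taken; that is an assumption about kernel behaviour, not something confirmed in this thread. A small sketch to do the conversion (u32_to_s32 is a made-up helper name, and it assumes the shell does 64-bit arithmetic):

```shell
#!/bin/sh
# u32_to_s32 N
# Reinterprets an unsigned 32-bit value as a signed 32-bit value.
# Assumes the shell's arithmetic is wider than 32 bits.
u32_to_s32() {
    if [ "$1" -ge 2147483648 ]; then
        echo $(( $1 - 4294967296 ))
    else
        echo "$1"
    fi
}

u32_to_s32 4294963073   # prints -4223
```

Read that way, the values above are -4223, -4224, -4225 and so on, one step down per collision.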

Christoph
