Mailing List Archive: DRBD: Users

Problem with stacked resource failing

ron.wells at envision-rx

Mar 1, 2012, 12:37 PM

Post #1 of 10 (511 views)
Problem with stacked resource failing

Hey all, I have a two-node, single-primary configuration with an offsite
disaster recovery (DR) node, using stacked resources, that I'm having weird
issues with. Twice in the last week the primary node stopped responding and
I had to disconnect/reconnect the DR node to get it working again. When it
fails I get the following in the primary node's logs:

kern.err<3>: Feb 29 20:21:20 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294966565

There are no relevant log entries on the DR node.

I see these messages in the logs from time to time, but usually they just
last for a few seconds and it all clears up on its own.

Can anyone give me some idea of what direction to go in to try to figure out
what the issue might be? I've included my global.conf, drbd.conf and more
logs from around the time it last failed. Please let me know if any
additional information would be helpful!

Thanks!

Here is my global.conf file:
global {
    usage-count yes;
    # minor-count dialog-refresh disable-ip-verification
}

common {
    protocol C;

    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
        # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
        # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
    }

    startup {
        # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
    }

    disk {
        # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
        # no-disk-drain no-md-flushes max-bio-bvecs
        on-io-error detach;
        # fencing resource-only;
    }

    net {
        # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
        # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
        # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
        # data-integrity-alg crc32c;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri consensus;
        after-sb-2pri disconnect;
    }

    syncer {
        # rate after al-extents use-rle cpu-mask verify-alg csums-alg
        rate 100M;
        csums-alg crc32c;
        verify-alg crc32c;
        use-rle;
    }
}

drbd.conf file excerpt (I have a total of 12 resources: 6 lower and 6
upper, for meta and data1-data5; all are configured the same as the two
shown here):

include "drbd.d/global_common.conf";
include "drbd.d/*.res";

resource meta_lower {
    disk /dev/backingvg/metabacking;
    device /dev/drbd0;
    meta-disk internal;
    disk {
        fencing resource-only;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    on openfiler1 {
        address 10.50.153.1:7788;
    }
    on openfiler2 {
        address 10.50.153.2:7788;
    }
}

resource data1_lower {
    device /dev/drbd1;
    disk /dev/backingvg/256data1backing;
    meta-disk internal;
    disk {
        fencing resource-only;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    on openfiler1 {
        address 10.50.153.1:7789;
    }
    on openfiler2 {
        address 10.50.153.2:7789;
    }
}
...

resource meta {
    protocol A;
    device /dev/drbd10;
    meta-disk internal;
    syncer {
        rate 1000k;
    }
    stacked-on-top-of meta_lower {
        address 10.50.150.101:7788;
    }
    on openfiler3 {
        disk /dev/backingvg/metabacking;
        address 10.50.250.4:7788;
    }
}

resource data1 {
    protocol A;
    device /dev/drbd11;
    meta-disk internal;
    handlers {
        before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh";
        after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";
    }
    net {
        sndbuf-size 512k;
        on-congestion pull-ahead;
        congestion-fill 500k;
    }
    syncer {
        rate 1000k;
    }
    stacked-on-top-of data1_lower {
        address 10.50.150.101:7789;
    }
    on openfiler3 {
        disk /dev/backingvg/256data1backing;
        address 10.50.250.4:7789;
    }
}
...

Here is the log from right around the time it failed:


kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn(
WFConnection -> WFReportParams )
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Starting
asender thread (from drbd10_receiver [4007])
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10:
data-integrity-alg: <not-used>
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10:
drbd_sync_handshake:
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: self
D64C8B1CF54765C3:6A4AC00929A719C7:BAA2C9167F6DE4B7:BAA1C9167F6DE4B7 bits:0
flags:0
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: peer
6A4AC00929A719C6:0000000000000000:BAA2C9167F6DE4B6:BAA1C9167F6DE4B7 bits:0
flags:0
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10:
uuid_compare()=1 by rule 70
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: peer( Unknown
-> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown ->
Consistent )
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: send bitmap
stats [Bytes(packets)]: plain 0(0), RLE 13(1), total 13; compression: 100.0%
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: receive
bitmap stats [Bytes(packets)]: plain 0(0), RLE 13(1), total 13; compression:
100.0%
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: helper
command: /sbin/drbdadm before-resync-source minor-10
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: helper
command: /sbin/drbdadm before-resync-source minor-10 exit code 0 (0x0)
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn(
WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Began resync
as SyncSource (will sync 0 KB [0 bits set]).
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: updated sync
UUID D64C8B1CF54765C3:6A4BC00929A719C7:6A4AC00929A719C7:BAA2C9167F6DE4B7
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: Resync done
(total 1 sec; paused 0 sec; 0 K/sec)
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: updated UUIDs
D64C8B1CF54765C3:0000000000000000:6A4BC00929A719C7:6A4AC00929A719C7
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: conn(
SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: bitmap WRITE
of 0 pages took 0 jiffies
kern.info<6>: Feb 29 18:27:03 openfiler2 kernel: block drbd10: 0 KB (0 bits)
marked out-of-sync by on disk bit-map.
kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: Resync done
(total 98 sec; paused 0 sec; 0 K/sec)
kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: updated UUIDs
42842875FAED516F:0000000000000000:4ED75BEBA8150A1D:4ED65BEBA8150A1D
kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: conn(
SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: bitmap WRITE
of 0 pages took 0 jiffies
kern.info<6>: Feb 29 18:28:40 openfiler2 kernel: block drbd12: 0 KB (0 bits)
marked out-of-sync by on disk bit-map.
kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: Resync done
(total 117 sec; paused 0 sec; 0 K/sec)
kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: updated UUIDs
226B7A8BEE6FD74D:0000000000000000:FB859CF1270E0AB1:FB849CF1270E0AB1
kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: conn(
SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: bitmap WRITE
of 0 pages took 0 jiffies
kern.info<6>: Feb 29 18:28:59 openfiler2 kernel: block drbd11: 0 KB (0 bits)
marked out-of-sync by on disk bit-map.
kern.err<3>: Feb 29 19:08:20 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967295
kern.err<3>: Feb 29 19:08:26 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967294
kern.err<3>: Feb 29 19:08:32 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967293
kern.err<3>: Feb 29 19:08:38 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967292
kern.err<3>: Feb 29 19:08:44 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967291
kern.err<3>: Feb 29 19:08:50 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967290
kern.err<3>: Feb 29 19:08:56 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967289
kern.err<3>: Feb 29 19:09:02 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967288
kern.err<3>: Feb 29 19:09:08 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967287
kern.err<3>: Feb 29 19:09:14 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967286
kern.err<3>: Feb 29 19:09:20 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967285
kern.err<3>: Feb 29 19:09:26 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967284
kern.err<3>: Feb 29 19:09:32 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967283
kern.err<3>: Feb 29 19:09:38 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967282
kern.err<3>: Feb 29 19:09:44 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967281
kern.err<3>: Feb 29 19:09:50 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967280
kern.err<3>: Feb 29 19:09:56 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967279
kern.err<3>: Feb 29 19:10:02 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967278
kern.err<3>: Feb 29 19:10:08 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967277
kern.err<3>: Feb 29 19:10:14 openfiler2 kernel: block drbd14:
[drbd14_worker/7472] sock_sendmsg time expired, ko = 4294967276


These are the only messages shown until I reset the link between the nodes
by running drbdadm down all on the DR node.
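
A side note on the ko values in the log above: they look like a small
counter that has wrapped around below zero. The sketch below assumes (this
is not confirmed anywhere in this thread) that ko is an unsigned 32-bit
counter that starts at 0, the default when no ko-count is configured, and is
decremented once per send timeout. Under that assumption, the two logged
values and their timestamps give the time per timeout. The year 2012 is
taken from the post date, and the function name is made up for illustration.

```python
from datetime import datetime

UINT32 = 2 ** 32  # assumed width of the logged counter


def timeouts_since_zero(ko_value: int) -> int:
    """Decrements needed to get from 0 down to ko_value, modulo 2**32."""
    return (0 - ko_value) % UINT32


# Values and timestamps copied from the log lines in this post.
first = timeouts_since_zero(4294967295)  # logged at 19:08:20
last = timeouts_since_zero(4294966565)   # logged at 20:21:20

t_first = datetime(2012, 2, 29, 19, 8, 20)
t_last = datetime(2012, 2, 29, 20, 21, 20)

elapsed_seconds = (t_last - t_first).total_seconds()
seconds_per_timeout = elapsed_seconds / (last - first)

print(f"timeouts so far: {last}")
print(f"seconds per timeout: {seconds_per_timeout}")
```

If the assumption holds, the sender had timed out 731 times by 20:21:20, one
timeout every 6 seconds, without ever giving up on the peer.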




--
View this message in context: http://old.nabble.com/Problem-with-stacked-resource-failing-tp33424203p33424203.html
Sent from the DRBD - User mailing list archive at Nabble.com.

_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


ron.wells at envision-rx

Mar 1, 2012, 12:48 PM

Post #2 of 10 (487 views)
Re: Problem with stacked resource failing [In reply to]

Ugh, I guess it would help if you knew the version I'm using.

The servers are 16-core AMD-based Supermicro servers with 16 GB of memory
and a 7 TB RAID 5 array running off of an Adaptec 6805 controller.

I'm using Openfiler 2.99.2 as the basis of the storage servers, although I
don't use the web interface, since I have DRBD and corosync configured and
the web interface is useless for my case.

drbdadm -V
DRBDADM_BUILDTAG=GIT-hash:\ 0de839cee13a4160eed6037c4bddd066645e23c5\ build\
by\ rmake-chroot [at] localhost\,\ 2011-08-12\ 18:38:56
DRBDADM_API_VERSION=88
DRBD_KERNEL_VERSION_CODE=0x08030b
DRBDADM_VERSION_CODE=0x08030b
DRBDADM_VERSION=8.3.11

uname -a
Linux openfiler2 2.6.32-131.17.1.el6-0.11.smp.gcc4.4.x86_64 #1 SMP Sat Nov
19 14:13:16 WET 2011 x86_64 x86_64 x86_64 GNU/Linux




andreas at hastexo

Mar 2, 2012, 4:52 AM

Post #3 of 10 (475 views)
Re: Problem with stacked resource failing [In reply to]

Hello,

On 03/01/2012 09:37 PM, envisionrx wrote:
>
> Hey all, I have a two node single primary with offsite disaster recovery (dr)
> node configuration using stacked resources that I'm having weird issues
> with. Twice in the last week the primary node stopped responding and I had
> to disconnect/reconnect the dr node to get it working again. When it fails
> I get the following in the primary nodes logs:
>
> kern.err<3>: Feb 29 20:21:20 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294966565
>
> There are no relevant log entries on the DR node.
>
> I see these messages in the logs from time to time, but usually they just
> last for a few seconds and it's all cleared up on it's own.
>
> can anyone give me some idea of what direction to go in to try to figure out
> what the issue might be? I've included my global.conf, drbd.conf and more
> logs from around the time it failed last. Please let me know if any
> additional information would be helpful!

Any chance your DR node has a significantly different hardware setup,
especially regarding disk and RAID controller capabilities? If your DR
node is under high I/O load, e.g. because of a backup job, it might be
unable to cope with DRBD replication at the same time because its I/O
stack is completely overloaded. Add something like "ko-count 6;" to the
net section; this will prevent your primary from blocking for too long,
though it will also go into StandAlone mode, which has to be resolved
manually.
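
To put a rough number on "too long": the sketch below assumes the peer is
dropped after ko-count consecutive send timeouts, and that the net timeout
is 60 tenths of a second (6 s), which matches the 6-second spacing of the
sock_sendmsg messages in the original post. The function name is made up
for illustration; check the drbd.conf man page for your version before
relying on the arithmetic.

```python
def seconds_until_peer_dropped(ko_count: int, timeout_tenths: int = 60):
    """Return how long a stalled peer is tolerated, in seconds.

    Assumption: the peer is dropped after `ko_count` consecutive send
    timeouts, each lasting `timeout_tenths` tenths of a second.
    Returns None when ko_count is 0, taken here to mean "never drop".
    """
    if ko_count < 0 or timeout_tenths <= 0:
        raise ValueError("ko_count must be >= 0 and timeout_tenths > 0")
    if ko_count == 0:
        return None  # feature disabled: the primary keeps waiting
    return ko_count * timeout_tenths / 10


print(seconds_until_peer_dropped(6))  # the "ko-count 6;" suggested above
print(seconds_until_peer_dropped(0))  # no ko-count configured
```

With "ko-count 6;" and a 6-second timeout this gives 36 seconds of blocked
I/O before the connection is dropped, instead of blocking indefinitely.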

Regards,
Andreas

--
Need help with DRBD?
http://www.hastexo.com/now



ron.wells at envision-rx

Mar 2, 2012, 7:56 AM

Post #4 of 10 (471 views)
Re: Problem with stacked resource failing [In reply to]

Andreas,
Thank you so much for your response. Actually, the hardware in the DR
cluster is exactly the same, except that the DR RAID array has more hard
disk space. There are currently no other jobs or applications running in
the DR cluster; its only function is to replicate from the primary
cluster.

I'm using the Ahead/Behind feature to deal with the fact that we're
connected over a WAN. I wonder if that is somehow messing up the stack?

Unfortunately this happened again today, and I had to do an ifdown on the DR
eth interface to break the connection in order to get the primary cluster's
DRBD resource to start responding again. :( :( drbdadm disconnect would
just time out.

Thanks,
Ron




ron.wells at envision-rx

Mar 2, 2012, 10:04 AM

Post #5 of 10 (471 views)
Re: Problem with stacked resource failing [In reply to]

Andreas Kurz-3 wrote:
>
> Add something like "ko-count 6;" to the net section
>

I tried adding this, and it appears that after 6 failures the node gets
disconnected, but then goes back into the WFConnection state on both
nodes and, after a short pause, connects again. From the documentation
and your statement I expected the resource to switch to StandAlone, but
it doesn't?



brian at linbit

Mar 2, 2012, 10:31 AM

Post #6 of 10 (471 views)
Re: Problem with stacked resource failing [In reply to]

On 03/01/2012 12:37 PM, envisionrx wrote:
> Hey all, I have a two node single primary with offsite disaster recovery (dr)
> node configuration using stacked resources that I'm having weird issues
> with. Twice in the last week the primary node stopped responding and I had
> to disconnect/reconnect the dr node to get it working again. When it fails
> I get the following in the primary nodes logs:
>
> kern.err<3>: Feb 29 20:21:20 openfiler2 kernel: block drbd14:
> [drbd14_worker/7472] sock_sendmsg time expired, ko = 4294966565
>
> There are no relevant log entries on the DR node.

This may be a situation where DRBD Proxy would help; however, we'd need a
bit more information to determine that. Do the logs on the DR side say
anything with regards to DRBD at all? What is the latency between the
sites? Are you able to trigger this, or do you see a pattern of when it
occurs?



--

: Brian Hellman
: LINBIT | "Your Way to High Availability"
: 1-877-4-LINBIT
: Web: http://www.linbit.com
:
: Twitter: http://www.linbit.com/en/twitter
: Facebook: http://www.linbit.com/en/facebook



ron.wells at envision-rx

Mar 2, 2012, 6:55 PM

Post #7 of 10 (468 views)
Re: Problem with stacked resource failing [In reply to]

Brian R. Hellman wrote:
>
> This may be a situation where DRBD Proxy would help, however we'd need a
> bit more information to determine that.
>
> Do the logs on the DR side say anything with regards to DRBD at all?

The logs have normal DRBD info, but I don't see anything on the DR side
indicating an error condition.

> What is the latency between the sites?

The ping time between the sites is typically around 60 ms. Is that what
you're asking, or is there some other measurement you're looking for?

> Are you able to trigger this, or do you see a pattern of when it occurs?

Well, interestingly, we decided to try to configure an IPsec tunnel
between the sites; until now we've been using an OpenVPN tunnel. We're
having trouble getting the IPsec tunnel to work. With the IPsec tunnel we
have a very strange situation: the tunnel shows as being up, and we can
ping and ssh through it, but if we try to pass any significant amount of
traffic through it, it flakes out. For example, if we try to scp through
the tunnel, the transfer starts and then stalls. If we try to sync DRBD
through the tunnel, we see this sock_sendmsg time expired failure all the
time. So at this point, if we use the IPsec tunnel, we can reproduce it
consistently.


kkovachev at varna

Mar 3, 2012, 3:37 AM

Post #8 of 10 (463 views)
Re: Problem with stacked resource failing [In reply to]

On Fri, 2 Mar 2012 18:55:22 -0800 (PST), envisionrx
<ron.wells [at] envision-rx> wrote:
> Brian R. Hellman wrote:
>> This may be a situation where DRBD Proxy would help, however we'd need
>> a bit more information to determine that.
>>
>> Do the logs on the DR side say anything with regards to DRBD at all?
>
> The logs have normal DRBD info, but I don't see anything on the DR side
> indicating an error condition.
>
>> What is the latency between the sites?
>
> The ping time between sites is typically around 60 ms, is that what
> you're asking, or is there some other measurement you're looking for?
>
>> Are you able to trigger this, or do you see a pattern of when it
>> occurs?
>
> Well, interestingly, we decided to try to configure an ipsec tunnel
> between the sites, we've been using an openvpn tunnel. We're having
> trouble getting the ipsec tunnel to work. When we are working with the
> ipsec tunnel we have a very strange situation where the tunnel shows as
> being up, we can ping through the tunnel, ssh through the tunnel, but
> if we try to pass any significant amount of traffic through the tunnel
> it flakes out. For example if we try to scp through the tunnel it starts
> and then stalls. If we try to sync drbd through the tunnel we see this
> sock_sendmsg time expired failure situation all the time. So at this
> point if we use the ipsec tunnel we can duplicate it consistently.

This looks like an MTU problem. Try pinging with different packet sizes to
determine the link MTU, and adjust accordingly.
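
A minimal sketch of that probing, assuming IPv4 and the Linux iputils ping
(where -M do forbids fragmentation and -s sets the ICMP payload size). It
binary-searches the largest payload that still gets a reply and adds the 28
bytes of IPv4 and ICMP headers. The function names are made up for
illustration, and the search assumes that every size below the limit
succeeds and that a zero-byte ping gets through.

```python
import subprocess

IPV4_ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def ping_fits(host: str, payload: int) -> bool:
    """True if one unfragmented ICMP echo with this payload gets a reply.

    Assumes Linux iputils ping: -M do sets "don't fragment", -s is the
    payload size, -c 1 sends one packet, -W 2 waits up to 2 seconds.
    """
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def largest_payload(fits, low: int = 0, high: int = 1472) -> int:
    """Binary-search the largest payload in [low, high] that fits.

    Assumes fits(low) is True and that success is monotonic in size.
    The default upper bound of 1472 corresponds to a 1500-byte MTU.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if fits(mid):
            low = mid
        else:
            high = mid - 1
    return low


def path_mtu(host: str, probe=ping_fits) -> int:
    """Estimate the path MTU to host from the largest payload that fits."""
    payload = largest_payload(lambda size: probe(host, size))
    return payload + IPV4_ICMP_OVERHEAD


# Example, run from the primary against the DR node's address:
# print(path_mtu("10.50.250.4"))
```

If the result through the tunnel is well below 1500, lowering the MTU on the
tunnel interface (or clamping the TCP MSS) to match is the usual adjustment.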



andreas at hastexo

Mar 5, 2012, 12:49 AM

Post #9 of 10 (436 views)
Re: Problem with stacked resource failing [In reply to]

On 03/02/2012 07:04 PM, envisionrx wrote:
> Andreas Kurz-3 wrote:
>> Add something like "ko-count 6;" to the net section
>
> I tried adding this, and it appears that after 6 failures the node gets
> disconnected, but then goes back into the WFConnection status on both
> nodes, and after a short pause connects again. I expected from the
> documentation and your statement that the resource should switch to
> stand alone, but it doesn't?

Yes, I also expected the resources to go into StandAlone. This does not
look like the correct behavior ... feature or bug?

Regards,
Andreas

--
Need help with DRBD?
http://www.hastexo.com/now


ron.wells at envision-rx

Mar 7, 2012, 10:54 AM

Post #10 of 10 (424 views)
Re: Problem with stacked resource failing [In reply to]

So we got the tunnel fixed and working, and using the "ko-count 6;" option
seems to keep the primary from becoming unresponsive. The only thing is
that if the network is messed up and stays that way, then when the
ko-count is reached the resource reconnects rather than disconnecting as
per the docs. Any idea why that is?
