
Mailing List Archive: DRBD: Users

DRBD crash on two nodes cluster. Some help please?

 

 



theophanis_kontogiannis at yahoo

Oct 20, 2009, 10:31 AM

Post #1 of 4 (856 views)
DRBD crash on two nodes cluster. Some help please?

Hello all,

I finally managed to capture a log during a DRBD crash.

I have a two-node RHEL 5.3 cluster running kernel 2.6.18-164.el5xen and
a self-compiled drbd-8.3.1-3.

Both nodes have a dedicated, back-to-back 1G Ethernet connection over
RTL8169sb/8110sb cards.

Whenever I run applications that constantly read from or write to the
disks (active/active config), DRBD keeps crashing.

Now I have logs showing why:


______________________
ON TWEETY1

Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check FAILED.
Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check FAILED.
Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!
Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
Oct 20 15:46:52 localhost clurgmgrd: [4161]: <info> Executing /etc/init.d/drbd status
Oct 20 15:46:52 localhost clurgmgrd: [4161]: <info> Executing /etc/init.d/drbd status
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed

___________________________

ON TWEETY2


Oct 20 15:46:52 localhost kernel: drbd2: sock was reset by peer
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Oct 20 15:46:52 localhost kernel: drbd2: short read expecting header on sock: r=-104
Oct 20 15:46:52 localhost kernel: drbd2: meta connection shut down by peer.
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed
Oct 20 15:46:52 localhost kernel: drbd2: helper command: /sbin/drbdadm fence-peer minor-2
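
To compare the two nodes' logs side by side, the connection state
changes and digest failures can be pulled out with a short script. This
is only a rough sketch: the function names and the regular expression
are my own assumptions based on the lines shown here, and it expects
each syslog entry on a single line (wrapped lines must be rejoined
first).

```python
import re

# Hypothetical helper, not part of DRBD: matches the "conn( A -> B )"
# part of a DRBD state-change line and captures device, old and new state.
CONN_RE = re.compile(r"(drbd\d+): .*conn\( (\w+) -> (\w+) \)")


def conn_transitions(lines):
    """Return (device, old_state, new_state) for every conn() transition."""
    found = []
    for line in lines:
        match = CONN_RE.search(line)
        if match:
            found.append(match.groups())
    return found


def count_digest_failures(lines):
    """Count lines reporting a failed data-integrity (digest) check."""
    return sum("Digest integrity check FAILED" in line for line in lines)


# Sample lines copied from the logs in this post (unwrapped).
sample = [
    "Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check FAILED.",
    "Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) "
    "conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )",
    "Oct 20 15:46:52 localhost kernel: drbd2: asender terminated",
]

print(conn_transitions(sample))
print(count_digest_failures(sample))
```

Run against /var/log/messages from both nodes, this would show which
side reported the digest failure first and how often it happens.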

____________________


DRBD.CONF


#
# drbd.conf
#

global {
    usage-count yes;
}

common {
    protocol C;

    syncer {
        rate 100M;
        al-extents 257;
    }

    handlers {
        pri-on-incon-degr "echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        outdate-peer "/sbin/obliterate";
        pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root; echo b > /proc/sysrq-trigger ; reboot -f";
        split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
    }

    startup {
        wfc-timeout 60;
        degr-wfc-timeout 60; # 1 minute
        become-primary-on both;
    }

    disk {
        fencing resource-and-stonith;
    }

    net {
        sndbuf-size 512k;

        timeout 60; # 6 seconds (unit = 0.1 seconds)
        connect-int 10; # 10 seconds (unit = 1 second)
        ping-int 10; # 10 seconds (unit = 1 second)
        ping-timeout 50; # 5 seconds (unit = 0.1 seconds)

        max-buffers 2048;
        max-epoch-size 2048;
        ko-count 10;

        allow-two-primaries;

        cram-hmac-alg "sha1";
        shared-secret "*****";

        after-sb-0pri discard-least-changes;
        after-sb-1pri violently-as0p;
        after-sb-2pri violently-as0p;

        rr-conflict call-pri-lost;

        data-integrity-alg "crc32c";
    }
}

resource r0 {
    device /dev/drbd0;
    disk /dev/hda4;
    meta-disk internal;

    on tweety-1 { address 10.254.254.253:7788; }
    on tweety-2 { address 10.254.254.254:7788; }
}

resource r1 {
    device /dev/drbd1;
    disk /dev/hdb4;
    meta-disk internal;

    on tweety-1 { address 10.254.254.253:7789; }
    on tweety-2 { address 10.254.254.254:7789; }
}

resource r2 {
    device /dev/drbd2;
    disk /dev/sda1;
    meta-disk internal;

    on tweety-1 { address 10.254.254.253:7790; }
    on tweety-2 { address 10.254.254.254:7790; }
}

_________

Also available at http://pastebin.ca/1633173


How can I solve this?

Thank you All for your time.


theophanis_kontogiannis at yahoo

Oct 29, 2009, 7:40 AM

Post #2 of 4 (743 views)
Re: DRBD crash on two nodes cluster. Some help please? [In reply to]

Hello all again.

Following up on the issue from my first post: with the integrity check
enabled, I used to get a crash at least once per 24 hours.

Now I have the integrity check disabled, and the cluster has been running
without crashes for the last 9 days.

Could someone kindly provide some hints about the possible reasons for
this behavior?

Off-loading is disabled on both dedicated gigabit NICs.

Also, is the integrity check really needed (I have read the
documentation :) ) if it keeps breaking the cluster?

Thank you All for your time.

Theophanis Kontogiannis




lars.ellenberg at linbit

Oct 29, 2009, 8:53 AM

Post #3 of 4 (762 views)
Re: DRBD crash on two nodes cluster. Some help please? [In reply to]

On Thu, Oct 29, 2009 at 04:40:01PM +0200, Theophanis Kontogiannis wrote:
> Hello all again.
>
> Following up on the issue from my first post: with the integrity check
> enabled, I used to get a crash at least once per 24 hours.

No.
You don't get "crashes".

You configured it to fence its peer on connection loss,
and that is what it does.

> Now I have the integrity check disabled, and the cluster has been running
> without crashes for the last 9 days.
>
> Could someone kindly provide some hints about the possible reasons for
> this behavior?
>
> Off-loading is disabled on both dedicated gigabit NICs.

Either something modifies in-flight buffers,
which may or may not be intentional,
and may or may not be "safe" wrt file system data integrity.

Or you actually _do_ have data corruption.

If drbd detects a checksum mismatch (== data corruption,
or more generally: the data received is not the same as
it was when the checksum was calculated before it was
sent), then rather than knowingly writing diverging data,
drbd disconnects and tries to reconnect,
hoping for the bitmap-based resync to send
"better" data this time.
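
The check described above can be illustrated with a toy model. This is
a sketch under stated assumptions, not DRBD code: the function names are
invented for illustration, and Python's zlib.crc32 (plain CRC-32) stands
in for the crc32c digest named in the posted configuration.

```python
import zlib


def send_block(payload: bytes):
    """Sender side: compute the digest over the buffer, then ship both."""
    return zlib.crc32(payload), payload


def receive_block(digest: int, payload: bytes) -> str:
    """Receiver side: recompute the digest over what actually arrived."""
    if zlib.crc32(payload) != digest:
        # Mismatch: refuse to write diverging data and drop the connection,
        # so that a later resync can send the block again.
        return "disconnect"
    return "write"


block = bytes(512)                    # one all-zero 512-byte block
digest, wire = send_block(block)
print(receive_block(digest, wire))    # buffer unchanged in flight

modified = bytearray(wire)
modified[100] ^= 0x01                 # one bit changes after the digest was taken
print(receive_block(digest, bytes(modified)))
```

The model does not care who changed the buffer. A page modified by the
application while it is in flight and a bit flipped by faulty hardware
look identical to the receiver, which is why both explanations have to
be considered.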

On disconnect, if so configured, a primary will call its
fence-peer handler.

You configured "obliterate" as fence peer handler.

So it "obliterates" its peer.
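
As a simplified model of that chain (a sketch only, with an invented
function name; the role names and the three fencing policy values are
the ones drbd.conf uses, everything else is an assumption for
illustration):

```python
FENCING_POLICIES = {"dont-care", "resource-only", "resource-and-stonith"}


def on_connection_loss(role: str, fencing: str) -> list:
    """List what a node does after losing its peer, in this toy model."""
    if fencing not in FENCING_POLICIES:
        raise ValueError("unknown fencing policy: " + fencing)
    actions = ["try to reconnect"]
    # Only a primary with a fencing policy other than dont-care
    # invokes its fence-peer handler.
    if role == "Primary" and fencing != "dont-care":
        actions.insert(0, "call fence-peer handler")
    return actions


# With allow-two-primaries and "fencing resource-and-stonith", both
# nodes are primary, so each one runs the handler against the other.
for node in ("tweety-1", "tweety-2"):
    print(node, on_connection_loss("Primary", "resource-and-stonith"))
```

In a dual-primary setup this means a single checksum mismatch can end
with both nodes trying to fence each other.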

> Also, is the integrity check really needed (I have read the
> documentation :) ) if it keeps breaking the cluster?

If you rather have silent data corruption :-)

==> Find the cause of the checksum mismatch.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


theophanis_kontogiannis at yahoo

Nov 1, 2009, 5:05 AM

Post #4 of 4 (720 views)
Re: DRBD crash on two nodes cluster. Some help please? [In reply to]

Hello Lars and All,

Please see below.

On Thu, 2009-10-29 at 16:53 +0100, Lars Ellenberg wrote:

> On Thu, Oct 29, 2009 at 04:40:01PM +0200, Theophanis Kontogiannis wrote:
> > Hello all again.
> >
> > Following up on the issue from my first post: with the integrity check
> > enabled, I used to get a crash at least once per 24 hours.
>
> No.
> You don't get "crashes".
>
> You configured it to fence its peer on connection loss,
> and that is what it does.
>

Correct, in strict terminology. I just had in mind that both nodes get
fenced, so I get a "crash" in the sense of having no service.
But yes, what actually happens is that the node gets fenced.


> > Now I have the integrity check disabled, and the cluster has been running
> > without crashes for the last 9 days.
> >
> > Could someone kindly provide some hints about the possible reasons for
> > this behavior?
> >
> > Off-loading is disabled on both dedicated gigabit NICs.
>
> Either something modifies in-flight buffers,
> which may or may not be intentional,
> and may or may not be "safe" wrt file system data integrity.
>
> Or you actually _do_ have data corruption.
>
> If drbd detects a checksum mismatch (== data corruption,
> or more generally: the data received is not the same as
> it was when the checksum was calculated before it was
> sent), then rather than knowingly writing diverging data,
> drbd disconnects and tries to reconnect,
> hoping for the bitmap-based resync to send
> "better" data this time.
>
> On disconnect, if so configured, a primary will call its
> fence-peer handler.
>
> You configured "obliterate" as fence peer handler.
>
> So it "obliterates" its peer.
>
> > Also, is the integrity check really needed (I have read the
> > documentation :) ) if it keeps breaking the cluster?
>
> If you rather have silent data corruption :-)
>
> ==> Find the cause of the checksum mismatch.
>

Is there any way to trace the CRC error at a really low level? Can I
turn on insane debugging in drbd, or something else?
I cannot think of any good way to dig into this at a low level!

Thank you All for your time.
T.K.
