
Mailing List Archive: DRBD: Users

8.3.5 Stalling on sync

 

 



jim at roadtech

Nov 24, 2009, 5:22 AM

Post #1 of 10 (1407 views)
8.3.5 Stalling on sync

Hi List,

Please help. I have installed DRBD 8.3.5 on openSUSE 11.1 (kernel
2.6.27.29-0.1).

I have run drbdadm create-md dbms-test on one node and create-md dbms-test2
on the other node. I then ran drbdadm up all on both nodes. I then ran
drbdadm -- --overwrite-data-of-my-peer primary dbms-test on the first node
and the same with dbms-test2 on the other node. They then run for a short
while before stalling. I have tried older versions without success, and
turning the sync rate down does not make any difference. Taking the
resources down and bringing them back up restarts the sync, but it quickly
stalls again.

I have included /proc/drbd, /etc/drbd.conf and a section from
/var/log/messages below. Any pointers would be greatly appreciated.



version: 8.3.5 (api:88/proto:86-91)
GIT-hash: ded8cdf09b0efa1460e8ce7a72327c60ff2210fb build by root [at] hp-tm-4, 2009-11-24 12:21:46
 0: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent C r----
    ns:160896 nr:0 dw:0 dr:160896 al:0 bm:9 lo:1 pe:0 ua:0 ap:0 ep:1 wo:b oos:926694296
        [>.] sync'ed: 0.1% (905040/905132)M 4972
        stalled
 1: cs:SyncTarget ro:Secondary/Secondary ds:Inconsistent/UpToDate C r----
    ns:0 nr:2173248 dw:2173248 dr:0 al:0 bm:132 lo:0 pe:29878 ua:0 ap:0 ep:1 wo:b oos:777971256
        [>.] sync'ed: 0.3% (759736/761856)M
        Stalled
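
The counters in the /proc/drbd output above can be compared between two readings to confirm a stall. The following is a minimal sketch in Python, assuming the 8.3-style layout shown here, where pe is the number of pending requests and oos the amount still out of sync. The helper names and the stall heuristic are illustrative assumptions, not part of DRBD.

```python
import re

# Sample copied from the /proc/drbd output above (device 1, the SyncTarget).
SAMPLE = """\
 1: cs:SyncTarget ro:Secondary/Secondary ds:Inconsistent/UpToDate C r----
    ns:0 nr:2173248 dw:2173248 dr:0 al:0 bm:132 lo:0 pe:29878 ua:0 ap:0 ep:1 wo:b oos:777971256
"""


def parse_counters(text):
    """Return the numeric key:value counters (ns, nr, pe, oos, ...) of one device as a dict."""
    pairs = re.findall(r"\b([a-z]{2,3}):(\d+)\b", text)
    return {key: int(value) for key, value in pairs}


def looks_stalled(before, after):
    """Hypothetical heuristic: data is still out of sync, but oos did not shrink."""
    return after["oos"] > 0 and after["oos"] >= before["oos"]


counters = parse_counters(SAMPLE)
print("pending:", counters["pe"], "out of sync:", counters["oos"])
```

Calling parse_counters on two readings taken a few seconds apart and passing both to looks_stalled gives a simple yes/no answer. The large pe value on device 1 stands out next to pe:0 on device 0, although the output alone does not say why the requests stay pending.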







drbd.conf

global {
    # minor-count 64;
    # dialog-refresh 5; # 5 seconds
    # disable-ip-verification;
    usage-count no;
}

common {
    handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    }

    startup {
        degr-wfc-timeout 120; # 2 minutes.
    }

    disk {
        on-io-error detach;
        # fencing resource-only;
    }

    net {
        max-buffers 40000;
        unplug-watermark 40000;
        after-sb-0pri disconnect;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;

        rr-conflict disconnect;
    }

    syncer {
        rate 90M;

        al-extents 257;

        verify-alg crc32c;
        cpu-mask 1;
    }
}

resource dbms-test {
    protocol C;

    on hp-tm-40 {
        device /dev/drbd0;
        disk /dev/cciss/c0d1p4;
        address 192.168.95.53:7789;
        meta-disk /dev/cciss/c0d1p1[0];
    }

    on hp-tm-41 {
        device /dev/drbd0;
        disk /dev/cciss/c0d1p4;
        address 192.168.95.54:7789;
        meta-disk /dev/cciss/c0d1p1[0];
    }
}

resource dbms-test2 {
    protocol C;

    on hp-tm-40 {
        device /dev/drbd1;
        disk /dev/cciss/c0d1p3;
        address 192.168.95.53:7788;
        meta-disk /dev/cciss/c0d1p2[0];
    }

    on hp-tm-41 {
        device /dev/drbd1;
        disk /dev/cciss/c0d1p3;
        address 192.168.95.54:7788;
        meta-disk /dev/cciss/c0d1p2[0];
    }
}
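
For a sense of scale, the configured syncer rate can be turned into a rough best-case duration for the initial sync. This is only a sketch in Python: it assumes that "90M" means 90 MiB/s and that the oos counter in /proc/drbd is in KiB, and the function names are made up for illustration. A stalled sync will of course never get near this figure.

```python
def parse_rate_kib(rate):
    """Convert a syncer rate such as '90M' to KiB/s.

    Assumption: K, M and G are powers of 1024 and a bare number is KiB/s.
    """
    units = {"K": 1, "M": 1024, "G": 1024 * 1024}
    suffix = rate[-1].upper()
    if suffix in units:
        return int(rate[:-1]) * units[suffix]
    return int(rate)


def full_sync_seconds(oos_kib, rate):
    """Best-case time to transfer oos_kib at the configured rate."""
    return oos_kib / parse_rate_kib(rate)


# oos value of device 0 from the /proc/drbd output earlier in this post.
seconds = full_sync_seconds(926694296, "90M")
print(f"{seconds / 3600:.1f} hours at the configured rate")
```

Comparing this best case with the observed progress (0.1% and 0.3%) makes it clear the resync is not merely slow.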





Section from /var/log/messages

Nov 24 13:03:43 hp-tm-41 kernel: block drbd0: peer( Secondary -> Unknown ) conn( SyncTarget -> TearDown ) pdsk( UpToDate -> DUnknown )
Nov 24 13:03:43 hp-tm-41 kernel: block drbd0: asender terminated
Nov 24 13:03:43 hp-tm-41 kernel: block drbd0: Terminating asender thread
Nov 24 13:03:43 hp-tm-41 kernel: block drbd0: Connection closed
Nov 24 13:03:43 hp-tm-41 kernel: block drbd0: conn( TearDown -> Unconnected )
Nov 24 13:03:43 hp-tm-41 kernel: block drbd0: receiver terminated
Nov 24 13:03:43 hp-tm-41 kernel: block drbd0: Restarting receiver thread
Nov 24 13:03:43 hp-tm-41 kernel: block drbd0: receiver (re)started
Nov 24 13:03:43 hp-tm-41 kernel: block drbd0: conn( Unconnected -> WFConnection )
Nov 24 13:03:46 hp-tm-41 kernel: block drbd0: conn( WFConnection -> Disconnecting )
Nov 24 13:03:46 hp-tm-41 kernel: block drbd0: Discarding network configuration.
Nov 24 13:03:46 hp-tm-41 kernel: block drbd0: Connection closed
Nov 24 13:03:46 hp-tm-41 kernel: block drbd0: conn( Disconnecting -> StandAlone )
Nov 24 13:03:46 hp-tm-41 kernel: block drbd0: receiver terminated
Nov 24 13:03:46 hp-tm-41 kernel: block drbd0: Terminating receiver thread
Nov 24 13:03:46 hp-tm-41 kernel: block drbd0: disk( Inconsistent -> Diskless )
Nov 24 13:03:46 hp-tm-41 kernel: block drbd0: drbd_bm_resize called with capacity == 0
Nov 24 13:03:46 hp-tm-41 kernel: block drbd0: worker terminated
Nov 24 13:03:46 hp-tm-41 kernel: block drbd0: Terminating worker thread
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: peer( Secondary -> Unknown ) conn( SyncSource -> Disconnecting )
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: meta connection shut down by peer.
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: asender terminated
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: Terminating asender thread
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: drbd_pp_alloc interrupted!
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: alloc_ee: Allocation of a page failed
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: error receiving RSDataRequest, l: 24!
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: Connection closed
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: conn( Disconnecting -> StandAlone )
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: disk( UpToDate -> Diskless ) pdsk( Inconsistent -> DUnknown )
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: net_ee not empty, killed 5000 entries
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: receiver terminated
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: Terminating receiver thread
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: drbd_bm_resize called with capacity == 0
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: worker terminated
Nov 24 13:03:46 hp-tm-41 kernel: block drbd1: Terminating worker thread
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: Starting worker thread (from cqueue [86])
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: disk( Diskless -> Attaching )
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: No usable activity log found.
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: Method to ensure write ordering: barrier
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: max_segment_size ( = BIO size ) = 32768
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: drbd_bm_resize called with capacity == 1887428655
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: resync bitmap: bits=235928582 words=3686385
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: size = 900 GB (943714327 KB)
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: recounting of set bits took additional 6 jiffies
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: 884 GB (231676934 bits) marked out-of-sync by on disk bit-map.
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: disk( Attaching -> Inconsistent )
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: Barriers not supported on meta data device - disabling
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: Starting worker thread (from cqueue [86])
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: disk( Diskless -> Attaching )
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: No usable activity log found.
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: Method to ensure write ordering: barrier
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: max_segment_size ( = BIO size ) = 32768
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: drbd_bm_resize called with capacity == 1887444720
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: resync bitmap: bits=235930590 words=3686416
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: size = 900 GB (943722360 KB)
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: recounting of set bits took additional 6 jiffies
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: 742 GB (194495454 bits) marked out-of-sync by on disk bit-map.
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated )
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: Barriers not supported on meta data device - disabling
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: conn( StandAlone -> Unconnected )
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: Starting receiver thread (from drbd0_worker [6688])
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: receiver (re)started
Nov 24 13:03:50 hp-tm-41 kernel: block drbd0: conn( Unconnected -> WFConnection )
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: conn( StandAlone -> Unconnected )
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: Starting receiver thread (from drbd1_worker [6695])
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: receiver (re)started
Nov 24 13:03:50 hp-tm-41 kernel: block drbd1: conn( Unconnected -> WFConnection )
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: Handshake successful: Agreed network protocol version 91
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: conn( WFConnection -> WFReportParams )
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: Starting asender thread (from drbd0_receiver [6717])
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: data-integrity-alg: <not-used>
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: drbd_sync_handshake:
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: self 88E0ED22FECE2B68:0000000000000000:0000000000000000:0000000000000000 bits:231676934 flags:0
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: peer 5299E3A47E1A3F30:88E0ED22FECE2B69:8810A1CE27BB9808:27DB4B359F02FE48 bits:231676934 flags:0
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: uuid_compare()=-1 by rule 50
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: Becoming sync target due to disk states.
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Nov 24 13:03:53 hp-tm-41 kernel: block drbd1: Handshake successful: Agreed network protocol version 91
Nov 24 13:03:53 hp-tm-41 kernel: block drbd1: conn( WFConnection -> WFReportParams )
Nov 24 13:03:53 hp-tm-41 kernel: block drbd1: Starting asender thread (from drbd1_receiver [6721])
Nov 24 13:03:53 hp-tm-41 kernel: block drbd1: data-integrity-alg: <not-used>
Nov 24 13:03:53 hp-tm-41 kernel: block drbd1: drbd_sync_handshake:
Nov 24 13:03:53 hp-tm-41 kernel: block drbd1: self 12DFCDD264D5E7AE:20C37C56C7437B76:441CA1FB5B900754:4A4B9D0203491EC4 bits:194495454 flags:0
Nov 24 13:03:53 hp-tm-41 kernel: block drbd1: peer 20C37C56C7437B76:0000000000000000:0000000000000000:0000000000000000 bits:194495454 flags:0
Nov 24 13:03:53 hp-tm-41 kernel: block drbd1: uuid_compare()=1 by rule 70
Nov 24 13:03:53 hp-tm-41 kernel: block drbd1: Becoming sync source due to disk states.
Nov 24 13:03:53 hp-tm-41 kernel: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget )
Nov 24 13:03:53 hp-tm-41 kernel: block drbd0: Began resync as SyncTarget (will sync 926707736 KB [231676934 bits set]).
Nov 24 13:03:54 hp-tm-41 kernel: block drbd1: conn( WFBitMapS -> SyncSource )
Nov 24 13:03:54 hp-tm-41 kernel: block drbd1: Began resync as SyncSource (will sync 777981816 KB [194495454 bits set]).
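
The "Began resync" lines report both a bit count and a size in KB, and the two agree if each bitmap bit stands for 4 KiB of storage. The short Python check below is a sketch under that assumption, using only numbers copied from the log above.

```python
KIB_PER_BIT = 4  # assumption: one bitmap bit covers one 4 KiB block


def bits_to_kib(bits):
    """Size in KiB covered by the given number of set bitmap bits."""
    return bits * KIB_PER_BIT


# (bits set, KB reported) pairs taken from the "Began resync" log lines.
reported = {
    "drbd0": (231676934, 926707736),
    "drbd1": (194495454, 777981816),
}

for device, (bits, kib) in reported.items():
    # The computed size should match what the kernel logged.
    assert bits_to_kib(bits) == kib
    print(device, bits_to_kib(bits), "KB to sync")
```

Both devices pass the check, so the bitmap and the reported resync sizes are at least internally consistent.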



Thanks











*************************************************************************
This e-mail is confidential and may be legally privileged. It is intended
solely for the use of the individual(s) to whom it is addressed. Any
content in this message is not necessarily a view or statement from Road
Tech Computer Systems Limited but is that of the individual sender. If
you are not the intended recipient, be advised that you have received
this e-mail in error and that any use, dissemination, forwarding,
printing, or copying of this e-mail is strictly prohibited. We use
reasonable endeavours to virus scan all e-mails leaving the company but
no warranty is given that this e-mail and any attachments are virus free.
You should undertake your own virus checking. The right to monitor e-mail
communications through our networks is reserved by us

Road Tech Computer Systems Ltd. Shenley Hall, Rectory Lane, Shenley,
Radlett, Hertfordshire, WD7 9AN. - VAT Registration No GB 449 3582 17
Registered in England No: 02017435, Registered Address: Charter Court,
Midland Road, Hemel Hempstead, Hertfordshire, HP2 5GE.
*************************************************************************


mike at dev-zero

Nov 24, 2009, 9:48 AM

Post #2 of 10 (1336 views)
Re: 8.3.5 Stalling on sync [In reply to]


what kind of network are you using between the two servers? this is
almost the exact same behavior i had when i was trying to get drbd to
work over 10gig ethernet. turned out to be something in drbd didn't like
something about the 10gig cards i had. i eventually had to change my
network cards. what cards are you using? 1gig? 10gig? have you tried
other cards? that is where i would look.

mike


jim at roadtech

Nov 25, 2009, 1:57 AM

Post #3 of 10 (1346 views)
Re: 8.3.5 Stalling on sync [In reply to]

Hi Mike,

Thanks for the quick response. Yes, you are correct: we are using 10gig fibre
cards. I'm not sure we could change them, though, as the fibre modules used in
them cost over £400 each.

Is there anything I can tweak in the drbd.conf file to get these to work?

James





igor at 3gnt

Nov 25, 2009, 3:50 AM

Post #4 of 10 (1343 views)
Re: 8.3.5 Stalling on sync [In reply to]

Hi,

This is why Linbit should publish, in the documentation, a list of
hardware (NICs) tested and certified as fully working.

I have seen this problem on the list a couple of times in the last
month. It's kind of sad to spend time and a lot of money on good enterprise
network hardware, only to find in the end that things don't work as they
should.

What do you guys think of this?

Cheers,


--
Igor Neves<igor.neves [at] 3gnt>
3GNTW - Tecnologias de Informação, Lda

SIP: igor [at] 3gnt JID: igor [at] 3gnt
ICQ: 249075444 MSN: igor [at] 3gnt
TLM: 00351914503611 PSTN: 00351252377120


florian.haas at linbit

Nov 25, 2009, 4:41 AM

Post #5 of 10 (1330 views)
Re: 8.3.5 Stalling on sync [In reply to]

Hello,

On 2009-11-25 12:50, Igor Neves wrote:
> Hi,
>
> This is why, Linbit should publish in the documentation, a list of
> hardware (NIC's) tested and certified as fully working.
>
> I have seen this problem here in the list a couple of times in the
> last month. It's kind of sad spend time and a lot of money in good
> enterprise network hardware, and in the end things don't work as they
> should.

There is no way we can do that for all the distros and kernels we support.

We can also not assume responsibility for kernel network stack bugs,
driver bugs, firmware bugs, you name it. Sorry, but what you ask for is
plain impossible. And, the underlying assumption that your hardware and
its driver is always right, and it must be DRBD that's wrong, does not
hold up in our experience. Remember broken TCP checksum offloading on
some Intel Gigabit cards?

We will double and triple check customer's hardware as part of a support
contract. We do also pinpoint individual software or hardware issues as
part of support or consultancy engagements. And we try to do what we can
in terms of alerting our user community to known hardware issues. And,
we do have hardware partners who take part in our DRBD Certified
Platform program. And that encompasses just what the name implies,
complete certified hardware platforms. Which you get full cluster stack
support on.

But we won't put up a "certified" NIC list like you are asking for. It
just makes zero sense.

Cheers,
Florian
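
Since broken TCP checksum offloading is given above as an example of a NIC or driver problem, one thing to look at is the offload state reported by `ethtool -k`. The Python sketch below parses that output and builds, without running it, a command to switch the common offloads off. The interface name eth2 and the sample output are illustrative assumptions, and nothing in this thread confirms that offloading is the cause of the stall.

```python
def parse_offload_settings(ethtool_output):
    """Parse the 'name: on|off' lines printed by `ethtool -k <iface>` into booleans."""
    settings = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        words = value.split()
        if sep and words and words[0] in ("on", "off"):
            settings[name.strip()] = words[0] == "on"
    return settings


def disable_offload_command(iface):
    """Build (not run) an ethtool command turning checksum and segmentation offload off."""
    return ["ethtool", "-K", iface, "rx", "off", "tx", "off", "tso", "off", "gso", "off"]


# Illustrative output, not taken from the thread; eth2 is a hypothetical interface.
SAMPLE = """\
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
"""

settings = parse_offload_settings(SAMPLE)
print([name for name, enabled in settings.items() if enabled])
print(" ".join(disable_offload_command("eth2")))
```

Turning offloads off one at a time and retrying the resync after each change would show whether any of them is involved. The exact feature names printed by ethtool vary between versions, so the parser keys on the on/off value rather than on fixed names.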


igor at 3gnt

Nov 25, 2009, 7:29 AM

Post #6 of 10 (1341 views)
Re: 8.3.5 Stalling on sync [In reply to]


Hi,

Florian, of course everything Linbit does for this community is
amazing. I really have nothing to complain about, and I don't know many
companies like Linbit when it comes to open source projects.

You guys are just amazing, from the stability, backporting and bug
fixes to the community support and documentation. I really have
nothing to complain about, only thanks to give.

What I was trying to say is that I was not asking for a list of ALL the
hardware DRBD runs fine on, because it should run fine everywhere.
Imagine that next week I have to deploy a new cluster with DRBD and need
10Gbit NICs because I have a lot of I/O bandwidth.
If there is a list with a couple of 10Gbit NICs that work, I will
simply forward it to my manager and ask him to buy one of those. If
he would rather buy a cheaper one and things go wrong, the blame is not
on me, or even on DRBD or Linbit... :).

It would be nice to have that list, that's all.



mike at dev-zero

Nov 25, 2009, 8:01 AM

Post #7 of 10 (1338 views)
Re: 8.3.5 Stalling on sync [In reply to]

nothing i tried tweaking in drbd.conf worked. the only thing that did
was changing the 10gig interfaces. what cards are you using? i was using
ones with an intel chip. the cards that i did get it to work with were
from chelsio. in my previous thread on the list, someone mentioned that
they had neterion cards working.

mike



jim at roadtech

Nov 25, 2009, 9:19 AM

Post #8 of 10 (1335 views)
Re: 8.3.5 Stalling on sync [In reply to]

Hi Mike,

The cards I'm using are HP NC522SFP Dual Port 10GbE Server Adapters with HP
BLc 10Gb SR SFP+ Fiber Transceivers. I could try running these with 1Gb
fibre cables instead of 10Gb.

James









mike at dev-zero

Nov 25, 2009, 10:45 AM

Post #9 of 10 (1345 views)
Permalink
Re: 8.3.5 Stalling on sync [In reply to]

hrm. i thought i had heard of someone using drbd over 10 gig with netxen
cards. i went looking for a few minutes and didn't find anything though.
my recommendation would be to try newer drivers, either by compiling
the drivers for your existing kernel or by using a newer kernel. i don't
have details on how to do that for your cards because i have never used
any 10 gig cards from hp or netxen. other than that, my only
recommendation is new nics.

good luck

mike



MRoof at admin

Nov 25, 2009, 1:03 PM

Post #10 of 10 (1318 views)
Permalink
Re: 8.3.5 Stalling on sync [In reply to]

Actually I have those exact cards and I'm not seeing your problem but getting those cards to work was a major pain in the rear end. I much prefer the Myricom cards but for this HP server pair I got stuck using the HP cards due to a political issue.

Anyway, some of the things I found out about these cards might be of help to you. We use SuSE here but doing the same for RedHat shouldn't be much of a problem. The biggest issue is that these cards get very hot and can overheat easily if they don't have a good amount of airflow. Once they begin to overheat, packets disappear and things fall apart. Since you are seeing stalls after a bit of a run, I would think that you might be having an overheating issue.

Also, the driver that comes with the Linux kernel doesn't work very well, so you need to get the HP driver and install it. HOWEVER, you absolutely must use the driver version that matches the firmware version. If they differ, things don't work and you can't even run the diagnostic tool. Here I'm running firmware 4.0.516 and driver 4.0.516.
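
One way to check the driver/firmware match without the HP diagnostic tool is to compare the "version" and "firmware-version" lines printed by `ethtool -i <interface>`. The Python below is only a rough sketch of that comparison. It assumes the driver reports both strings in the same format, and the sample text (driver name netxen_nic, bus address) is made up for illustration, not captured from a real card.

```python
def parse_ethtool_info(text):
    """Parse `ethtool -i <iface>` style output into a dict of key -> value."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as a PCI
        # address (which contains colons) stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def driver_matches_firmware(text):
    """Return True when the driver and firmware version strings are identical."""
    info = parse_ethtool_info(text)
    driver = info.get("version")
    firmware = info.get("firmware-version")
    return driver is not None and driver == firmware


# Illustrative sample only: the driver name and bus address are assumptions.
SAMPLE = """\
driver: netxen_nic
version: 4.0.516
firmware-version: 4.0.516
bus-info: 0000:0b:00.0
"""
```

Calling `driver_matches_firmware()` on the captured output of `ethtool -i` for each 10GbE port on both nodes would flag a mismatched pair before any DRBD testing starts.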

When I was trying to get these working I would set up long runs of netperf and iperf to see how hot I could get the cards, and then run the diagnostic tool, which reports the card temperature. I found they start to freak out at about 85C. After playing around with card position they now run under load at 66C and seem to work fine with 27C ambient air temperature.
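
To line up a DRBD stall with the card temperature during a load run like this, it may help to sample /proc/drbd at intervals and note when the oos (out-of-sync, in KiB) counter stops shrinking. The Python below is a minimal sketch of that comparison, not a tested monitoring tool. The function names are hypothetical, and devices are identified by their order in the file, not by minor number.

```python
import re

# Matches the "oos:<number>" field that /proc/drbd prints for each device.
OOS_RE = re.compile(r"\boos:(\d+)")


def out_of_sync(proc_drbd_text):
    """Return the oos counters found in the text, one per device, in file order."""
    return [int(value) for value in OOS_RE.findall(proc_drbd_text)]


def stalled_devices(before, after):
    """Return positions of devices whose oos is nonzero and did not shrink.

    `before` and `after` are two snapshots of /proc/drbd taken some
    seconds apart; a resync that is making progress lowers oos.
    """
    pairs = zip(out_of_sync(before), out_of_sync(after))
    return [i for i, (old, new) in enumerate(pairs) if new > 0 and new >= old]


def read_proc_drbd(path="/proc/drbd"):
    """Read one snapshot of /proc/drbd (only works on a node running DRBD)."""
    with open(path) as handle:
        return handle.read()
```

Taking a snapshot with `read_proc_drbd()`, sleeping for a minute, taking another and passing both to `stalled_devices()` gives a rough stall flag that can be logged next to the temperature reading.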

All in all I'm not very impressed with these cards but I got stuck using them in one place.

Hope the information helps a bit,
Morey


