
Mailing List Archive: DRBD: Users

DRBD full sync is stalled

 

 



groen692 at grosc

Sep 24, 2009, 4:57 AM

Post #1 of 8 (1472 views)
DRBD full sync is stalled

Hello,

I have a problem with full syncs in DRBD: the target machine freezes.
The scenario is simple. Whenever a full sync is started, manually or
automatically, the sync stalls after some time. A few moments after the
sync reaches the stalled state, the target machine freezes entirely.

OpenSuse 11.1
kernel 2.6.27.21-0.1-xen #
drbd 8.3.1

NIC: NetXtreme II BCM5708 Gigabit Ethernet

On the Source Machine:
cat /proc/drbd
version: 8.3.1 (api:88/proto:86-89)
GIT-hash: fd40f4a8f9104941537d1afc8521e584a6d3003c build by
root [at] DefaultNod, 2009-04-27 11:34:17
0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
ns:324524 nr:0 dw:110988 dr:689400 al:263 bm:242 lo:0 pe:2131 ua:978
ap:36 ep:1 wo:b oos:1635880
[==>.................] sync'ed: 16.4% (1635880/1951768)K
stalled
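
The counters in the /proc/drbd output above can be read programmatically. The following is a minimal sketch, assuming Python and the DRBD 8.3 status format shown here; the function names `parse_counters` and `looks_stalled` are hypothetical, and the stall check is only a heuristic: it compares the `oos` (out-of-sync) counter between two samples taken some seconds apart.

```python
import re

# Matches numeric counters such as "pe:2131" or "oos:1635880".
# Non-numeric fields like "wo:b" or "cs:SyncSource" are skipped.
COUNTER = re.compile(r"\b([a-z]{2,3}):(\d+)\b")


def parse_counters(status_text):
    """Return the numeric key:value counters found in /proc/drbd text."""
    return {key: int(value) for key, value in COUNTER.findall(status_text)}


def looks_stalled(previous, current):
    """Heuristic: data is still out of sync, but oos did not shrink."""
    return current["oos"] > 0 and current["oos"] >= previous["oos"]


sample = (
    "ns:324524 nr:0 dw:110988 dr:689400 al:263 bm:242 lo:0 pe:2131 ua:978\n"
    "ap:36 ep:1 wo:b oos:1635880\n"
)
counters = parse_counters(sample)
print(counters["pe"], counters["ua"], counters["oos"])
```

In practice the two samples would come from reading /proc/drbd twice; high `pe` (pending) and `ua` (unacknowledged) values together with an unchanged `oos` suggest that requests are waiting on the peer or the network.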

How can I find out what is happening here (and prevent it in the future)?


Kind regards,

jeroen
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


groen692 at grosc

Sep 25, 2009, 4:10 AM

Post #2 of 8 (1411 views)
Re: DRBD full sync is stalled [In reply to]

Anybody?

The same seems to happen with 8.3.3RC2, although there the failure
varies: either the system freezes, or it disconnects all network
interfaces. Anybody?

Kind regards,

jeroen



lars.ellenberg at linbit

Sep 25, 2009, 4:48 AM

Post #3 of 8 (1419 views)
Re: DRBD full sync is stalled [In reply to]

On Fri, Sep 25, 2009 at 01:10:24PM +0200, Jeroen Groenewegen van der Weyden wrote:
>> How to find out what is happening here?

Serial console?
Netconsole?
Logs?

Network stress tests not using DRBD?
General stress tests?
Memtest?
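
A network stress test that does not involve DRBD can be as simple as pushing a large amount of data through a plain TCP connection and checking that all of it arrives. The following is a minimal sketch, assuming Python; `tcp_stress` and `_sink` are hypothetical names, and for demonstration both ends run on one machine over loopback. To exercise the actual NIC, the receiving side would have to run on the peer node.

```python
import socket
import threading
import time


def _sink(server_sock, result):
    """Accept one connection and count every byte received until EOF."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            total += len(chunk)
    result["received"] = total


def tcp_stress(total_bytes, host="127.0.0.1", chunk_size=65536):
    """Send total_bytes over TCP; return (bytes received, seconds elapsed)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    result = {}
    receiver = threading.Thread(target=_sink, args=(server, result))
    receiver.start()

    payload = b"\0" * chunk_size
    sent = 0
    start = time.monotonic()
    with socket.create_connection((host, port)) as client:
        while sent < total_bytes:
            n = min(chunk_size, total_bytes - sent)
            client.sendall(payload[:n])
            sent += n
    receiver.join()  # wait until the sink has seen EOF
    elapsed = time.monotonic() - start
    server.close()
    return result["received"], elapsed


received, elapsed = tcp_stress(1_000_000)
print(received, "bytes in", round(elapsed, 3), "s")
```

A run that hangs or delivers fewer bytes than were sent under sustained load points at the network path rather than at DRBD.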

>> (and prevent it in the future.)

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed


roccas at gmail

Sep 25, 2009, 6:13 AM

Post #4 of 8 (1410 views)
Re: DRBD full sync is stalled [In reply to]

On Fri, Sep 25, 2009 at 08:10, Jeroen Groenewegen van der Weyden
<groen692 [at] grosc> wrote:
> Hello,
>
> I have a problem when full syncing with drbd the target machine freezes. scenario is simple whenever a full sync is made manual or automaticly the syncing is stalled after some time. after the syncing reaches the stalled states a view moments later the target machine freeze entirely.

I had this problem once, and it was due to a faulty network card.
Limiting the sync speed to 1/10 of the network speed was an interim
(remote) workaround, and replacing the card solved the problem
permanently.
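
The "1/10 of the network speed" figure can be worked out as follows. This is a minimal sketch, assuming Python, a nominal link speed in Mbit/s, and that the `M` suffix of the DRBD 8.3 `rate` setting (in the `syncer` section of drbd.conf) means mebibytes per second; the function name `resync_rate` is hypothetical.

```python
def resync_rate(link_mbit, divisor=10):
    """Return a rate string such as "11M" for 1/divisor of the link speed."""
    bytes_per_second = link_mbit * 1_000_000 // 8  # nominal line rate
    budget = bytes_per_second // divisor           # share given to resync
    mebibytes = max(1, budget // 2**20)            # never go below 1M
    return f"{mebibytes}M"


# Gigabit Ethernet: 125,000,000 bytes/s, a tenth of that is about 11 MiB/s.
print(resync_rate(1000))
```

The result is a conservative value that ignores protocol overhead; it is meant as a temporary throttle while the hardware is being investigated, not as a tuned setting.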

--
Marcelo

"Could it be that this modern life is getting to be more modern than
life?" (Mafalda)


groen692 at grosc

Sep 25, 2009, 7:41 AM

Post #5 of 8 (1401 views)
Re: DRBD full sync is stalled [In reply to]

>Serial console?
>Netconsole?
>Logs?

Which logs are you interested in? This is the first time I am seriously troubleshooting a DRBD problem.
/var/log/messages simply stops receiving messages at the time of the freeze (see the snippet below). Is there a debug level I can increase for DRBD?


>Network stress tests not using DRBD?
>General stress tests?
>Memtest?

The problem happens on the "production LAN" as well as on a 4-port "1Gig staging switch". iperf shows normal values in all cases.
The problem happens on Fujitsu Siemens RX200/RX300 servers; six Fujitsu Siemens servers in total are affected. Other servers I have installed do not have this problem. The Fujitsu Siemens servers have onboard Broadcom interfaces: "NIC: NetXtreme II BCM5708 Gigabit Ethernet".


---------- /var/log/messages on the target machine --------------
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: PingAck did not arrive in time.
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: asender terminated
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: Terminating asender thread
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: short read expecting header on sock: r=-512
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: Connection closed
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: conn( NetworkFailure -> Unconnected )
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: receiver terminated
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: Restarting receiver thread
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: receiver (re)started
Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: conn( Unconnected -> WFConnection )
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: PingAck did not arrive in time.
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: asender terminated
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: Terminating asender thread
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: short read expecting header on sock: r=-512
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: Connection closed
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: conn( NetworkFailure -> Unconnected )
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: receiver terminated
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: Restarting receiver thread
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: receiver (re)started
Sep 25 11:33:19 Cluster3Node1 kernel: block drbd0: conn( Unconnected -> WFConnection )
---------- here it is frozen -------------------------------
---------- /var/log/messages on the target machine --------------
The log stops here until the boot messages of the reboot show up.
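
To see at a glance how each DRBD device lost its connection, the `conn( A -> B )` transitions can be pulled out of such a log. The following is a minimal sketch, assuming Python and syslog lines in the format shown above; the names `unwrap` and `conn_transitions` are hypothetical. Because mail clients tend to wrap long log lines, the sketch first re-joins any line that does not begin with a syslog timestamp.

```python
import re

# A syslog record starts with a timestamp such as "Sep 25 11:33:13 ".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} ")
CONN = re.compile(r"block (drbd\d+): .*conn\( (\w+) -> (\w+) \)")


def unwrap(text):
    """Join wrapped continuation lines onto the record they belong to."""
    records = []
    for line in text.splitlines():
        if TIMESTAMP.match(line) or not records:
            records.append(line.strip())
        else:
            records[-1] += " " + line.strip()
    return records


def conn_transitions(text):
    """Return (device, old_state, new_state) for each conn( A -> B ) record."""
    found = []
    for record in unwrap(text):
        match = CONN.search(record)
        if match:
            found.append(match.groups())
    return found


sample = (
    "Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: peer( Secondary ->\n"
    "Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )\n"
    "Sep 25 11:33:13 Cluster3Node1 kernel: block drbd2: conn( NetworkFailure\n"
    "-> Unconnected )\n"
)
transitions = conn_transitions(sample)
for device, old, new in transitions:
    print(device, old, "->", new)
```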

Kind regards,

jeroen.



lars.ellenberg at linbit

Sep 28, 2009, 1:51 AM

Post #6 of 8 (1355 views)
Re: DRBD full sync is stalled [In reply to]

> The problem happens on Fujitsu Siemens server RX200/RX300. The total
> of Fujitsu Siemens Servers having this problem is 6 in total.
> Other servers I have installed do not have this problem.

Then this is a strong indication that it is _not_ DRBD.
What about fixing your network drivers or hardware, then?

Try a firmware update, a kernel upgrade, a NIC driver module upgrade, etc.

> The Fujitsu Siemens server have onboard Broadcom interfaces "NIC:
> NetXtreme II BCM5708 Gigabit Ethernet".

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com



groen692 at grosc

Sep 28, 2009, 4:03 AM

Post #7 of 8 (1383 views)
Re: DRBD full sync is stalled [In reply to]

Thank you, I hear what you are saying.

One more thing I noticed: when testing with DRBD 8.3.3RC2, the behaviour
is different. The target machine does not stall but simply disconnects
all network interfaces. /var/log/messages shows the output included in
the snippet below.

Given this information, do you think the bnx2 module is faulty?

Kind regards,
jeroen


------------------ snippet -------------------------
Sep 25 11:33:23 Cluster3Node1 kernel: block drbd1: Restarting receiver thread
Sep 25 11:33:23 Cluster3Node1 kernel: block drbd1: receiver (re)started
Sep 25 11:33:23 Cluster3Node1 kernel: block drbd1: conn( Unconnected -> WFConnection )
Sep 25 11:38:14 Cluster3Node1 kernel: ------------[ cut here ]------------
Sep 25 11:38:14 Cluster3Node1 kernel: WARNING: at net/sched/sch_generic.c:219 dev_watchdog+0x139/0x1eb()
Sep 25 11:38:14 Cluster3Node1 kernel: NETDEV WATCHDOG: eth0 (bnx2): transmit timed out
Sep 25 11:38:14 Cluster3Node1 kernel: Modules linked in: drbd(N) joydev ip6t_LOG xt_tcpudp xt_pkttype ipt_LOG xt_limit xt_physdev netbk blkbk blktap xenbus_be binfmt_misc bridge stp ip6t_REJECT nf_conntrack_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack ip_tables ip6table_filter ip6_tables x_tables ipv6 microcode fuse loop dm_mod ppdev 8250_pnp rtc_cmos rtc_core 8250 parport_pc floppy iTCO_wdt rtc_lib parport serial_core pcspkr sr_mod iTCO_vendor_support bnx2 e1000 i2c_i801 container serio_raw i5000_edac i2c_core button edac_core shpchp pci_hotplug sg usbhid hid ff_memless ehci_hcd uhci_hcd usbcore sd_mod crc_t10dif xenblk cdrom xennet megaraid_sas edd ext3 mbcache jbd fan ide_pci_generic piix ide_core ata_generic ata_piix libata scsi_mod dock thermal processor thermal_sys hwmon [last unloaded: drbd]
Sep 25 11:38:14 Cluster3Node1 kernel: Supported: No
Sep 25 11:38:14 Cluster3Node1 kernel: Pid: 0, comm: swapper Tainted: G 2.6.27.21-0.1-xen #1
Sep 25 11:38:14 Cluster3Node1 kernel:
Sep 25 11:38:14 Cluster3Node1 kernel: Call Trace:
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff8020c597>] show_trace_log_lvl+0x41/0x58
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff804635e0>] dump_stack+0x69/0x6f
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff80232991>] warn_slowpath+0xa9/0xd1
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff80406670>] dev_watchdog+0x139/0x1eb
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff8023c2f2>] run_timer_softirq+0x1ba/0x268
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff80238543>] __do_softirq+0xa1/0x148
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff8020c00c>] call_softirq+0x1c/0x28
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff8020d2b3>] do_softirq+0x4b/0xca
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff8020bace>] do_hypervisor_callback+0x1e/0x30
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff802073aa>] 0xffffffff802073aa
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff8020da6b>] xen_safe_halt+0x97/0xac
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff80210702>] xen_idle+0x2e/0x67
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff8020a422>] cpu_idle+0x57/0x93
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff80712a65>] start_kernel+0x3be/0x3ca
Sep 25 11:38:14 Cluster3Node1 kernel: [<ffffffff807121e4>] x86_64_start_kernel+0xb4/0xba
Sep 25 11:38:14 Cluster3Node1 kernel:
Sep 25 11:38:14 Cluster3Node1 kernel: ---[ end trace 4f7e0b535ea49a76 ]---
Sep 25 11:38:14 Cluster3Node1 kernel: bnx2: eth0 NIC Copper Link is Down
Sep 25 11:38:14 Cluster3Node1 kernel: br0: port 1(eth0) entering disabled state
Sep 25 11:58:15 Cluster3Node1 -- MARK --
------------------ snippet -------------------------
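
The key line in this trace is the NETDEV WATCHDOG warning, which names both the interface and the driver whose transmit queue timed out. The following is a minimal sketch, assuming Python and kernel messages worded as in the snippet above, for finding such warnings in a log; the function name `watchdog_timeouts` is hypothetical.

```python
import re

# Captures the interface and driver from lines such as
# "NETDEV WATCHDOG: eth0 (bnx2): transmit timed out".
# \s+ also tolerates a line wrap before "transmit".
WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (\S+) \(([^)]+)\):\s+transmit timed out"
)


def watchdog_timeouts(log_text):
    """Return a list of (interface, driver) pairs with transmit timeouts."""
    return WATCHDOG.findall(log_text)


sample = (
    "Sep 25 11:38:14 Cluster3Node1 kernel: NETDEV WATCHDOG: eth0 (bnx2):\n"
    "transmit timed out\n"
    "Sep 25 11:38:14 Cluster3Node1 kernel: bnx2: eth0 NIC Copper Link is Down\n"
)
timeouts = watchdog_timeouts(sample)
print(timeouts)
```

A hit on the same interface and driver across several machines is a hint to look at that driver or its firmware before suspecting DRBD.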



lars.ellenberg at linbit

Sep 28, 2009, 4:21 AM

Post #8 of 8 (1362 views)
Re: DRBD full sync is stalled [In reply to]

On Mon, Sep 28, 2009 at 01:03:26PM +0200, Jeroen Groenewegen van der Weyden wrote:
> Thank you, I hear what you are saying.
>
> One more thing I noticed when testing with DRBD 8.3.3RC2 the behaviour
> is different.

Probably only by "accident" and some different timing.

> The target machine does not stall but simply disconnects
> all network interfaces. The var/log/messages show the included in the
> snippet below.
>
> Do you think with this information the bnx2 module is faulty?

Well, what do _you_ think,
after you put the

> kernel: ------------[ cut here ]------------
> kernel: WARNING: at net/sched/sch_generic.c:219 dev_watchdog+0x139/0x1eb()
> kernel: NETDEV WATCHDOG: eth0 (bnx2): transmit timed out

> kernel: bnx2: eth0 NIC Copper Link is Down
> kernel: br0: port 1(eth0) entering disabled state

into your favorite search engine or bugzilla search interface,
and browsed through the first few hits?

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

