
Mailing List Archive: DRBD: Users

Kernel Panic occurring when drbd is up & (re)syncing


jfchevrette at iweb

Nov 9, 2009, 7:20 AM

Post #1 of 8
Kernel Panic occurring when drbd is up & (re)syncing

Hello,

We have a two-node setup running CentOS 5.4, Xen 3.0 (CentOS RPMs) and
DRBD 8.3.2 (again a CentOS RPM). Both servers are Dell PowerEdge 1950s
with two quad-core Xeon processors and 32GB of memory. The network card
used by DRBD is an Intel 82571EB Gigabit Ethernet card (e1000 driver;
the module list below shows e1000e loaded). The two nodes are connected
directly with a crossover cable.

DRBD is configured with one resource (drbd0), on top of which I have
configured an LVM volume group that is split into two LVs. Both LVs
are mapped to my Xen VM (PV) as the sda and sdb disks.

Recently, we've had issues where the node that is in Primary state, and
hence running the VM, locks up and throws a kernel panic. This seems to
point to DRBD and/or the network stack, because the problem does not
occur while the DRBD resource is disconnected.

Even worse, the problem occurs very quickly after we connect the DRBD
resource, either during resynchronization after being out of sync for a
while or during normal syncing operations. No errors show up on the
network interface (ifconfig, ethtool).

One thing to note is that the kernel panic seems to complain about
checksum functions, so that might be related (see below).

Here is the relevant information:

# rpm -qa | grep -e xen -e drbd
drbd83-8.3.2-6.el5_3
kmod-drbd83-xen-8.3.2-6.el5_3
xen-3.0.3-94.el5
kernel-xen-2.6.18-164.el5
xen-libs-3.0.3-94.el5

# cat /etc/drbd.conf
global {
    usage-count no;
}

common {
    protocol C;

    syncer {
        rate 33M;
        verify-alg crc32c;
        al-extents 1801;
    }
    net {
        cram-hmac-alg sha1;
        max-epoch-size 8192;
        max-buffers 8192;
    }

    disk {
        on-io-error detach;
        no-disk-flushes;
        no-disk-barrier;
        no-md-flushes;
    }
}

resource drbd0 {
    device /dev/drbd0;
    disk /dev/sda6;
    flexible-meta-disk internal;
    on node1 {
        address 10.11.1.1:7788;
    }
    on node2 {
        address 10.11.1.2:7788;
    }
}

### Kernel Panic ###
Unable to handle kernel paging request at ffff880011e3cc64 RIP:
 [<ffffffff80212bad>] csum_partial+0x56/0x4bc
PGD ed8067 PUD ed9067 PMD f69067 PTE 0
Oops: 0000 [1] SMP

last sysfs file: /class/scsi_host/host0/proc_name
CPU 0

Modules linked in: xt_physdev netconsole drbd(U) netloop netbk blktap blkbk
ipt_MASQUERADE iptable_nat ip_nat bridge ipv6 xfrm_nalgo crypto_api xt_tcpudp
xt_state ip_conntrack_irc xt_conntrack ip_conntrack_ftp xt_mac xt_length
xt_limit xt_multiport ipt_ULOG ipt_TCPMSS ipt_TOS ipt_ttl ipt_owner ipt_REJECT
ipt_ecn ipt_LOG ipt_recent ip_conntrack iptable_mangle iptable_filter ip_tables
nfnetlink x_tables autofs4 dm_mirror dm_multipath scsi_dh video hwmon backlight
sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport joydev
ide_cd e1000e cdrom serial_core i5000_edac edac_mc bnx2 serio_raw pcspkr sg
dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ata_piix libata
shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 12887, comm: drbd0_receiver Tainted: G 2.6.18-128.1.16.el5xen #1
RIP: e030:[<ffffffff80212bad>] [<ffffffff80212bad>] csum_partial+0x56/0x4bc
RSP: e02b:ffff88000c347718 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880010ced500
RDX: 00000000000000e7 RSI: 000000000000039c RDI: ffff880011e3cc64
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000025b85e7c R11: 0000000000000002 R12: 0000000000000028
R13: 0000000000000028 R14: ffff88001c56f7b0 R15: 0000000025b85e7c
FS: 00002b391e123f60(0000) GS:ffffffff805ba000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process drbd0_receiver (pid: 12887, threadinfo ffff88000c346000, task
ffff88001c207820)
Stack: 000000000000039c 00000000000005b4 ffffffff8023d496 ffff88001e7e48d8
 0000001400000000 ffff8800000003c4 ffff88001c56f7b0 ffff88001e7e48d8
 ffff88001e7e48ec ffff88000c3478e8

Call Trace:
[<ffffffff8023d496>] skb_checksum+0x11b/0x260
[<ffffffff80411472>] skb_checksum_help+0x71/0xd0
[<ffffffff8853f33e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
[<ffffffff8853f6cf>] :iptable_nat:ip_nat_local_fn+0x32/0xb7
[<ffffffff8023550c>] nf_iterate+0x41/0x7d
[<ffffffff8042f004>] dst_output+0x0/0xe
[<ffffffff80258b28>] nf_hook_slow+0x58/0xbc
[<ffffffff8042f004>] dst_output+0x0/0xe
[<ffffffff802359ab>] ip_queue_xmit+0x41c/0x48c
[<ffffffff8022c1cb>] local_bh_enable+0x9/0xa5
[<ffffffff8020b6b7>] kmem_cache_alloc+0x62/0x6d
[<ffffffff8023668d>] alloc_skb_from_cache+0x74/0x13c
[<ffffffff80222a0b>] tcp_transmit_skb+0x62f/0x667
[<ffffffff8043903a>] tcp_retransmit_skb+0x53d/0x638
[<ffffffff80439353>] tcp_xmit_retransmit_queue+0x21e/0x2bb
[<ffffffff80225cff>] tcp_ack+0x1705/0x1879
[<ffffffff8021c6b1>] tcp_rcv_established+0x804/0x925
[<ffffffff80263710>] schedule_timeout+0x1e/0xad
[<ffffffff8023cef3>] tcp_v4_do_rcv+0x2a/0x2fa
[<ffffffff8040bbfe>] sk_wait_data+0xac/0xbf
[<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
[<ffffffff80434f71>] tcp_prequeue_process+0x65/0x78
[<ffffffff8021dd39>] tcp_recvmsg+0x492/0xb1f
[<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
[<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
[<ffffffff80231c18>] sock_recvmsg+0x101/0x120
[<ffffffff80231c18>] sock_recvmsg+0x101/0x120
[<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
[<ffffffff80343366>] swiotlb_map_sg+0xf7/0x205
[<ffffffff880b563c>] :megaraid_sas:megasas_make_sgl64+0x78/0xa9
[<ffffffff880b61bc>] :megaraid_sas:megasas_queue_command+0x343/0x3ed
[<ffffffff884e119f>] :drbd:drbd_recv+0x7b/0x109
[<ffffffff884e53b2>] :drbd:receive_DataRequest+0x3b/0x655
[<ffffffff884e1c4b>] :drbd:drbdd+0x77/0x152
[<ffffffff884e4870>] :drbd:drbdd_init+0xea/0x1dc
[<ffffffff884f432a>] :drbd:drbd_thread_setup+0xa2/0x18b
[<ffffffff80260b2c>] child_rip+0xa/0x12
[<ffffffff884f4288>] :drbd:drbd_thread_setup+0x0/0x18b
[<ffffffff80260b22>] child_rip+0x0/0x12


Code: 44 8b 0f ff ca 83 ee 04 48 83 c7 04 4d 01 c8 41 89 d2 41 89
RIP [<ffffffff80212bad>] csum_partial+0x56/0x4bc
RSP <ffff88000c347718>
CR2: ffff880011e3cc64

Kernel panic - not syncing: Fatal exception
#######


Any ideas on how to diagnose this properly and eventually find the culprit?


Regards,
--
Jean-François Chevrette [iWeb]

_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


jfchevrette at iweb

Nov 12, 2009, 9:26 AM

Post #2 of 8
Re: Kernel Panic occurring when drbd is up & (re)syncing

It appears that there is currently a problem with the latest
CentOS/Redhat kernel. We have noticed the same problem when using LVM
snapshots and a backup technology called R1Soft CDP.

Some related info:
http://bugs.centos.org/view.php?id=3869
forum.r1soft.com/showthread.php?t=1158

No sign of a bug at bugzilla.redhat.com.

For now we have reverted to kernel-2.6.18-128.7.1, on which we have not
had any issues for the past 4 hours. Previously, the kernel panic would
occur a few seconds after starting a 'drbdadm verify'.
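
Since the kernel build matters here, it can help to confirm which kernel actually produced a given oops. The sketch below is a hedged illustration in Python: the helper names are made up, and the mapping from package name to release string assumes the RHEL/CentOS 5 convention of appending the flavour (xen) to the release. Applied to the first post, it shows that the oops header reports 2.6.18-128.1.16.el5xen while the installed package list shows kernel-xen-2.6.18-164.el5.

```python
import re

def release_from_oops(oops_text):
    """Extract the running kernel release from the 'Pid: ... comm: ...' line of an oops."""
    m = re.search(r"(?:Not tainted|Tainted: \S+)\s+(\S+)\s+#\d+", oops_text)
    return m.group(1) if m else None

def release_from_rpm(package):
    """Map e.g. 'kernel-xen-2.6.18-164.el5' to '2.6.18-164.el5xen'.

    Assumption: the flavour suffix is appended to the release string,
    as on RHEL/CentOS 5 kernels.
    """
    m = re.fullmatch(r"kernel(?:-(?P<flavour>[a-zA-Z]+))?-(?P<ver>\d.*)", package)
    if m is None:
        return None
    return m.group("ver") + (m.group("flavour") or "")

# Both strings are copied from the first post in this thread.
oops_line = "Pid: 12887, comm: drbd0_receiver Tainted: G 2.6.18-128.1.16.el5xen #1"
installed = "kernel-xen-2.6.18-164.el5"

running = release_from_oops(oops_line)
expected = release_from_rpm(installed)
print(running, expected, running == expected)
```

A mismatch like this only means the trace was captured under a different kernel than the newest one installed; it does not by itself say which build is at fault.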

DRBD devs might want to check it out.

Regards,
--
Jean-François Chevrette [iWeb]




ivars.strazdins at gmail

Nov 16, 2009, 3:29 AM

Post #3 of 8
Re: Kernel Panic occurring when drbd is up & (re)syncing

It looks like I am getting a kernel bug on 64-bit Xen Debian under
similar conditions, i.e., when running drbd-verify.
It happens on both cluster nodes.

Kernel 2.6.26-2-xen-amd64, DRBD 8.3.5 compiled from the Debian unstable
package for 8.3.4.

For anyone interested, here is the stack trace.
BR,
Ivars


Nov 16 03:00:29 ariel kernel: [31375.026193] BUG: unable to handle kernel NULL pointer dereference at 0000000000000016
Nov 16 03:00:29 ariel kernel: [31375.026288] IP: [<ffffffffa02f9169>] :drbd:drbd_connector_callback+0x32/0x181
Nov 16 03:00:29 ariel kernel: [31375.026359] PGD 164c4067 PUD 170d1067 PMD 0
Nov 16 03:00:29 ariel kernel: [31375.026423] Oops: 0000 [1] SMP
Nov 16 03:00:29 ariel kernel: [31375.026474] CPU 0
Nov 16 03:00:29 ariel kernel: [31375.026512] Modules linked in: xt_physdev iptable_filter ip_tables x_tables sha1_generic drbd cn iscsi_trgt crc32c libcrc32c ipv6 bridge xfs w83627ehf lm85 hwmon_vid netconsole configfs xenblktap netloop softdog ipmi_watchdog ipmi_msghandler loop psmouse serio_raw pcspkr i2c_i801 i2c_core button rng_core shpchp pci_hotplug intel_agp evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod ide_cd_mod cdrom ide_disk ide_pci_generic ata_piix piix ide_core ata_generic libata scsi_mod dock skge ehci_hcd uhci_hcd thermal processor fan thermal_sys [last unloaded: scsi_wait_scan]
Nov 16 03:00:29 ariel kernel: [31375.027370] Pid: 3165, comm: cqueue Not tainted 2.6.26-2-xen-amd64 #1
Nov 16 03:00:29 ariel kernel: [31375.027405] RIP: e030:[<ffffffffa02f9169>] [<ffffffffa02f9169>] :drbd:drbd_connector_callback+0x32/0x181
Nov 16 03:00:29 ariel kernel: [31375.027485] RSP: e02b:ffff8800104f3e50 EFLAGS: 00010206
Nov 16 03:00:29 ariel kernel: [31375.027519] RAX: 0000000000000000 RBX: ffff88001648c220 RCX: 0000000000000000
Nov 16 03:00:29 ariel kernel: [31375.027555] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800164c9c10
Nov 16 03:00:29 ariel kernel: [31375.027597] RBP: ffff88001648c1d8 R08: ffff8800104f2000 R09: ffffffff80553e18
Nov 16 03:00:29 ariel kernel: [31375.027633] R10: 0000000000000000 R11: 7fffffffffffffff R12: ffff8800164c9c10
Nov 16 03:00:29 ariel kernel: [31375.027669] R13: ffffffffa02d30c3 R14: ffffffff8057d1c0 R15: 0000000000000000
Nov 16 03:00:29 ariel kernel: [31375.027709] FS: 00007f9ee13c46e0(0000) GS:ffffffff8053a000(0000) knlGS:0000000000000000
Nov 16 03:00:29 ariel kernel: [31375.027761] CS: e033 DS: 0000 ES: 0000
Nov 16 03:00:29 ariel kernel: [31375.027793] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 16 03:00:29 ariel kernel: [31375.027829] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 16 03:00:29 ariel kernel: [31375.027866] Process cqueue (pid: 3165, threadinfo ffff8800104f2000, task ffff8800161e1440)
Nov 16 03:00:29 ariel kernel: [31375.027918] Stack: 0000000000000000 ffff88001648c220 ffff88001648c1d8 ffff88001648c1d0
Nov 16 03:00:29 ariel kernel: [31375.028024] ffffffffa02d30c3 ffffffff8057d1c0 0000000000000000 ffffffffa02d30d8
Nov 16 03:00:29 ariel kernel: [31375.028120] 7fffffffffffffff ffff880016f76840 ffff88001648c1d0 ffffffff8023c34c
Nov 16 03:00:29 ariel kernel: [31375.028185] Call Trace:
Nov 16 03:00:29 ariel kernel: [31375.028250] [<ffffffffa02d30c3>] ? :cn:cn_queue_wrapper+0x0/0x33
Nov 16 03:00:29 ariel kernel: [31375.028393] [<ffffffffa02d30d8>] ? :cn:cn_queue_wrapper+0x15/0x33
Nov 16 03:00:29 ariel kernel: [31375.028439] [<ffffffff8023c34c>] ? run_workqueue+0xbe/0x189
Nov 16 03:00:29 ariel kernel: [31375.028482] [<ffffffff8023cd35>] ? worker_thread+0xd5/0xe0
Nov 16 03:00:29 ariel kernel: [31375.028522] [<ffffffff8023f6c1>] ? autoremove_wake_function+0x0/0x2e
Nov 16 03:00:29 ariel kernel: [31375.028564] [<ffffffff8023cc60>] ? worker_thread+0x0/0xe0
Nov 16 03:00:29 ariel kernel: [31375.028601] [<ffffffff8023f593>] ? kthread+0x47/0x74
Nov 16 03:00:29 ariel kernel: [31375.028637] [<ffffffff802283a8>] ? schedule_tail+0x27/0x5c
Nov 16 03:00:29 ariel kernel: [31375.028677] [<ffffffff8020be28>] ? child_rip+0xa/0x12
Nov 16 03:00:29 ariel kernel: [31375.028722] [<ffffffff8023f54c>] ? kthread+0x0/0x74
Nov 16 03:00:29 ariel kernel: [31375.028760] [<ffffffff8020be1e>] ? child_rip+0x0/0x12
Nov 16 03:00:29 ariel kernel: [31375.028796]
Nov 16 03:00:29 ariel kernel: [31375.028824]
Nov 16 03:00:29 ariel kernel: [31375.028852] Code: 41 55 41 54 49 89 fc 55 53 48 83 ec 08 65 8b 04 25 24 00 00 00 83 3d a6 75 01 00 02 74 1e 89 c0 48 c1 e0 07 48 ff 80 00 09 31 a0 <f6> 42 16 20 be 98 00 00 00 0f 84 20 01 00 00 eb 1a 41 5b 5b 5d
Nov 16 03:00:29 ariel kernel: [31375.029581] RIP [<ffffffffa02f9169>] :drbd:drbd_connector_callback+0x32/0x181
Nov 16 03:00:29 ariel kernel: [31375.029657] RSP <ffff8800104f3e50>
Nov 16 03:00:29 ariel kernel: [31375.029688] CR2: 0000000000000016
Nov 16 03:00:29 ariel kernel: [31375.030762] ---[ end trace 296f6157c8798c56 ]---




biancalana at gmail

Dec 14, 2009, 2:20 PM

Post #4 of 8
Re: Kernel Panic occurring when drbd is up & (re)syncing

Hi list,

Any news about this bug?



biancalana at gmail

Jan 20, 2010, 11:38 AM

Post #5 of 8
Re: Kernel Panic occurring when drbd is up & (re)syncing

Hi list,

Any news about this bug?

Alexandre



lars.ellenberg at linbit

Jan 20, 2010, 12:06 PM

Post #6 of 8 (1391 views)
Permalink
Re: Kernel Panic occuring when drbd is up & (re)syncing [In reply to]

On Wed, Jan 20, 2010 at 05:38:59PM -0200, Alexandre Biancalana wrote:
> Hi list,
>
> Any news about this bug ??

There are two "bugs" in this thread.
(do not hijack threads!)

the first poster had some oops in csum_partial,
and replied to himself providing a redhat bugzilla.
this was not DRBD related.

the other is an oops in :drbd:drbd_connector_callback,
and was caused by building from debian deb-src, which did not run our
"adjust_drbd_config_h.sh" before compiling, resulting in broken builds.

this has been fixed by the debian maintainer recently.

if it still does not work for you,
just build from the tar.gz (available at http://oss.linbit.com/drbd/)
or check out from git.linbit.com
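
The two oopses in this thread can be told apart by the symbol on their RIP
line. A minimal Python sketch of that check follows. The helper name
faulting_symbol is hypothetical, and the sketch assumes each wrapped log line
has been joined back into a single line first.

```python
import re

# Matches the symbol part of an oops "RIP" line, for example
#   ":drbd:drbd_connector_callback+0x32/0x181"  (module-qualified)
#   "csum_partial+0x56/0x4bc"                   (no module prefix)
RIP_SYMBOL = re.compile(
    r"(?::(?P<module>\w+):)?(?P<symbol>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def faulting_symbol(rip_text):
    """Return (module, symbol) for the first symbol in rip_text, or None."""
    match = RIP_SYMBOL.search(rip_text)
    if match is None:
        return None
    # A symbol without a ":module:" prefix belongs to the core kernel image.
    return (match.group("module") or "kernel", match.group("symbol"))


# RIP lines copied from the two reports in this thread (unwrapped).
first_oops = "[<ffffffff80212bad>] csum_partial+0x56/0x4bc"
second_oops = "RIP  [<ffffffffa02f9169>] :drbd:drbd_connector_callback+0x32/0x181"

print(faulting_symbol(first_oops))
print(faulting_symbol(second_oops))
```

A ":drbd:" prefix places the fault inside the DRBD module itself. A bare
symbol such as csum_partial sits in the core kernel, which is consistent
with (though not proof of) the first report not being DRBD related.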

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed


bart.coninckx at telenet

Oct 4, 2010, 9:10 AM

Post #7 of 8 (666 views)
Permalink
Re: Kernel Panic occuring when drbd is up & (re)syncing [In reply to]

On Thursday 12 November 2009 18:26:14 Jean-Francois Chevrette wrote:
> It appears that there is currently a problem with the latest
> CentOS/Redhat kernel. We have noticed the same problem when using LVM
> snapshots and a backup technology called R1Soft CDP.
>
> Some related info:
> http://bugs.centos.org/view.php?id=3869
> forum.r1soft.com/showthread.php?t=1158
>
> No sign of a bug at bugzilla.redhat.com
>
> For now we have reverted to kernel-2.6.18-128.7.1 on which we did not
> have any issues for the past 4 hours. Previously, a few seconds after
> starting a 'drbdadm verify' the kernel panic would occur.
>
> DRBD devs might want to check it out.
>
> Regards,

Jean-Francois,

thank you for this very elaborate and technically rich reply. I will certainly
look into your suggestion about using Broadcom cards. I have one dual-port
Broadcom card in this server, but I was bonding one of its ports with one port
of an Intel e1000 dual-port NIC in balance-rr mode, to have a backup in the
event a NIC goes down. Dual-port NICs usually share one chip between both
ports, so a problem with that chip would take down DRBD replication completely.

Reality shows this might be a bad idea though: a bonnie++ test against the
backend storage (RAID5 on 15K rpm disks) gives me 255 MB/sec write
performance, while the same test on the DRBD device drops to 77 MB/sec, even
with the MTU set to 9000. It would be nice to get as close as possible to the
theoretical maximum, so a lot needs to be done to get there.

Step 1 would be moving everything to the Broadcom NIC. Any other
suggestions?
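
For a rough sense of what the theoretical maximum could be, here is a
back-of-the-envelope sketch in Python. It assumes synchronous replication, so
that sustained writes are bounded by the replication link, and it assumes each
bonded port runs at 1 Gbit/s. The link speed is an assumption, not something
stated above, and protocol overhead is ignored.

```python
# Assumption: each bonded port is a 1 Gbit/s link (not stated in the post).
LINK_GBIT = 1.0
BONDED_LINKS = 2  # one Broadcom port plus one Intel port in the bond


def raw_mb_per_sec(gbit):
    """Raw line rate in MB/s (decimal units), ignoring protocol overhead."""
    return gbit * 1000 / 8


backend = 255.0  # MB/s, bonnie++ write result on the backing RAID5
drbd = 77.0      # MB/s, bonnie++ write result on the DRBD device

single = raw_mb_per_sec(LINK_GBIT)  # ceiling for one link
bonded = single * BONDED_LINKS      # ideal ceiling for the bond

print(f"single link ceiling: {single:.0f} MB/s")
print(f"bonded ceiling:      {bonded:.0f} MB/s")
print(f"DRBD vs backend:     {drbd / backend:.0%}")
print(f"DRBD vs single link: {drbd / single:.0%}")
```

Under these assumptions the measured 77 MB/sec is below even what a single
link could carry, so the bond is not yet adding bandwidth.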

Thanks a lot,

Bart





bart.coninckx at telenet

Oct 4, 2010, 9:12 AM

Post #8 of 8 (660 views)
Permalink
Re: Kernel Panic occuring when drbd is up & (re)syncing [In reply to]

On Monday 04 October 2010 18:10:13 Bart Coninckx wrote:
> [...]

Oops, completely wrong thread. Please disregard ...


B.
