Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: DRBD: Users

Re: Page allocation errors and kernel panics with drbd 8.3.3rc1 and infiniband

 

 

DRBD users RSS feed   Index | Next | Previous | View Threaded


lars.ellenberg at linbit

Sep 22, 2009, 1:34 PM

Post #1 of 7 (1126 views)
Permalink
Re: Page allocation errors and kernel panics with drbd 8.3.3rc1 and infiniband

On Tue, Sep 22, 2009 at 01:31:26PM -0400, Jason McKay wrote:
> Hello all,
>
> We're experiencing page allocation errors when writing to a connected
> drbd device (connected via infiniband using IPoIB) that resulted in a
> kernel panic last night.
>
> When rsyncing data to the primary node, we get a slew of these errors in /var/log/messages:
>
> Sep 21 22:11:21 client-nfs5 kernel: drbd0_worker: page allocation failure. order:5, mode:0x10
> Sep 21 22:11:21 client-nfs5 kernel: Pid: 6114, comm: drbd0_worker Tainted: P 2.6.27.25 #2
> Sep 21 22:11:21 client-nfs5 kernel:
> Sep 21 22:11:21 client-nfs5 kernel: Call Trace:
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8026eec4>] __alloc_pages_internal+0x3a4/0x3c0
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028c61a>] kmem_getpages+0x6b/0x12b
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028cb93>] fallback_alloc+0x11d/0x1b1
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028c816>] kmem_cache_alloc_node+0xa3/0xcf
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff803ecf65>] __alloc_skb+0x64/0x12e
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8041af6f>] sk_stream_alloc_skb+0x2f/0xd5
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8041beb0>] tcp_sendmsg+0x180/0x9d1
> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff803e6d85>] sock_sendmsg+0xe2/0xff
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff80246d06>] autoremove_wake_function+0x0/0x2e
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff80246d06>] autoremove_wake_function+0x0/0x2e
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8027464b>] zone_statistics+0x3a/0x5d
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff803e82f6>] kernel_sendmsg+0x2c/0x3e
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0568d83>] drbd_send+0xb2/0x194 [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569134>] _drbd_send_cmd+0x9c/0x116 [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569285>] send_bitmap_rle_or_plain+0xd7/0x13a [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569475>] _drbd_send_bitmap+0x18d/0x1ae [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056c242>] drbd_send_bitmap+0x39/0x4c [drbd]

I certainly would not expect to _cause_
memory pressure from this call path.
But someone is causing it, and we are affected.


> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056b9a1>] w_bitmap_io+0x45/0x95 [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa05560c4>] drbd_worker+0x230/0x3eb [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056b0d9>] drbd_thread_setup+0x124/0x1ba [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8020cd59>] child_rip+0xa/0x11
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056afb5>] drbd_thread_setup+0x0/0x1ba [drbd]
> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8020cd4f>] child_rip+0x0/0x11

why is the tcp stack trying to allocate order:5 pages?

thats 32 continguous pages, 128 KiB continguous memory. apparently your
memory is so much fragmented that this is no longer available.
but why is the tcp stack ok with order:0 or order:1 pages?
why does it have to be order:5 ?

> These were occurring for hours until a kernel panic:
>
> [Mon Sep 21 22:11:33 2009]INFO: task kjournald:7025 blocked for more than 120 seconds.
> [Mon Sep 21 22:11:33 2009]"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Mon Sep 21 22:11:33 2009] ffff88062e801d30 0000000000000046 ffff88062e801cf8 ffffffffa056b821
> [Mon Sep 21 22:11:33 2009] ffff880637d617c0 ffff88063dd05120 ffff88063e668750 ffff88063dd05478
> [Mon Sep 21 22:11:33 2009] 0000000900000002 000000011a942a2d ffffffffffffffff ffffffffffffffff
> [Mon Sep 21 22:11:33 2009]Call Trace:
> [Mon Sep 21 22:11:33 2009] [<ffffffffa056b821>] drbd_unplug_fn+0x14a/0x1aa [drbd]
> [Mon Sep 21 22:11:33 2009] [<ffffffff802b1fed>] sync_buffer+0x0/0x3f
> [Mon Sep 21 22:11:33 2009] [<ffffffff804903df>] io_schedule+0x5d/0x9f
> [Mon Sep 21 22:11:33 2009] [<ffffffff802b2028>] sync_buffer+0x3b/0x3f
> [Mon Sep 21 22:11:33 2009] [<ffffffff80490658>] __wait_on_bit+0x40/0x6f
> [Mon Sep 21 22:11:33 2009] [<ffffffff802b1fed>] sync_buffer+0x0/0x3f
> [Mon Sep 21 22:11:34 2009] [<ffffffff804906f3>] out_of_line_wait_on_bit+0x6c/0x78
> [Mon Sep 21 22:11:34 2009] [<ffffffff80246d34>] wake_bit_function+0x0/0x23
> [Mon Sep 21 22:11:34 2009] [<ffffffffa0031250>] journal_commit_transaction+0x7cd/0xc4d [jbd]
> [Mon Sep 21 22:11:34 2009] [<ffffffff8023dc9b>] lock_timer_base+0x26/0x4b
> [Mon Sep 21 22:11:34 2009] [<ffffffffa0033d32>] kjournald+0xc1/0x1fb [jbd]
> [Mon Sep 21 22:11:34 2009] [<ffffffff80246d06>] autoremove_wake_function+0x0/0x2e
> [Mon Sep 21 22:11:34 2009] [<ffffffffa0033c71>] kjournald+0x0/0x1fb [jbd]
> [Mon Sep 21 22:11:34 2009] [<ffffffff80246bd8>] kthread+0x47/0x73
> [Mon Sep 21 22:11:34 2009] [<ffffffff8020cd59>] child_rip+0xa/0x11
> [Mon Sep 21 22:11:34 2009] [<ffffffff80246b91>] kthread+0x0/0x73
> [Mon Sep 21 22:11:34 2009] [<ffffffff8020cd4f>] child_rip+0x0/0x11
> [Mon Sep 21 22:11:34 2009]
> [Mon Sep 21 22:11:34 2009]Kernel panic - not syncing: softlockup: blocked tasks
> [-- root [at] localhost attached -- Mon Sep 21 22:17:24 2009]
> [-- root [at] localhost detached -- Mon Sep 21 23:16:22 2009]
> [-- Console down -- Tue Sep 22 04:02:15 2009]

That is unfortunate.

> The two systems are hardware and OS identical:
>
> [root [at] client-nfs log]# uname -a
> Linux client-nfs5 2.6.27.25 #2 SMP Fri Jun 26 00:07:23 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
>
> [root [at] client-nfs log]# grep model\ name /proc/cpuinfo
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>
> [root [at] client-nfs log]# free -m
> total used free shared buffers cached
> Mem: 24110 2050 22060 0 157 1270
> -/+ buffers/cache: 622 23488
> Swap: 4094 0 4094
>
> [root [at] client-nfs log]# cat /proc/partitions
> major minor #blocks name
>
> 8 0 1755310080 sda
> 8 1 1755302031 sda1
> 8 16 4881116160 sdb
> 8 17 4881116126 sdb1
> 8 32 62522712 sdc
> 8 33 58147236 sdc1
> 8 34 4192965 sdc2
> 147 0 1755248424 drbd0
> 147 1 4880967128 drbd1
>
> [root [at] client-nfs log]# modinfo drbd
> filename: /lib/modules/2.6.27.25/kernel/drivers/block/drbd.ko
> alias: block-major-147-*
> license: GPL
> version: 8.3.3rc1

Just in case, please retry with current git.
But it is more likely an issue with badly tuned tcp, or tcp via IPoIB.

> description: drbd - Distributed Replicated Block Device v8.3.3rc1
> author: Philipp Reisner <phil [at] linbit>, Lars Ellenberg <lars [at] linbit>
> srcversion: A06488A2009E5EB94AF0825
> depends:
> vermagic: 2.6.27.25 SMP mod_unload modversions
> parm: minor_count:Maximum number of drbd devices (1-255) (uint)
> parm: disable_sendpage:bool
> parm: allow_oos:DONT USE! (bool)
> parm: cn_idx:uint
> parm: proc_details:int
> parm: enable_faults:int
> parm: fault_rate:int
> parm: fault_count:int
> parm: fault_devs:int
> parm: usermode_helper:string
>
> [root [at] client-nfs log]# cat /etc/drbd.conf
> global {
> usage-count no;
> }
>
> common {
> syncer { rate 800M; } #This is temporary

please reduce, and only slowly increase again, until the cause of the
high memory pressure is identified.
maybe someone is leaking full pages?
with infiniband SDP, that in fact did happen.

though, if there was no current resync going on,
this cannot be the cause.

> protocol C;
>
> handlers {
> }
>
> startup { degr-wfc-timeout 60;
> wfc-timeout 60;
> }
>
> disk {
> on-io-error detach;
> fencing dont-care;
> no-disk-flushes;
> no-md-flushes;
> no-disk-barrier;
> }
>
> net {
> ko-count 2;
> after-sb-2pri disconnect;
> max-buffers 16000;
> max-epoch-size 16000;
> unplug-watermark 16001;
> sndbuf-size 8m;
> rcvbuf-size 8m;
>
> }
> }
>
> resource drbd0 {
> on client-nfs5 {
> device /dev/drbd0;
> disk /dev/sda1;
> address 192.168.16.48:7789;
> flexible-meta-disk internal;
> }
>
> on client-nfs6 {
> device /dev/drbd0;
> disk /dev/sda1;
> address 192.168.16.49:7789;
> flexible-meta-disk internal;
> }
> }
>
> resource drbd1 {
> on client-nfs5 {
> device /dev/drbd1;
> disk /dev/sdb1;
> address 192.168.16.48:7790;
> flexible-meta-disk internal;
> }
>
> on client-nfs6 {
> device /dev/drbd1;
> disk /dev/sdb1;
> address 192.168.16.49:7790;
> flexible-meta-disk internal;
> }
> }
>
> [root [at] client-nfs log]# ibstat
> CA 'mlx4_0'
> CA type: MT26428
> Number of ports: 2
> Firmware version: 2.6.100
> Hardware version: a0
> Node GUID: 0x0002c9030004a79a
> System image GUID: 0x0002c9030004a79d
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 40
> Base lid: 1
> LMC: 0
> SM lid: 1
> Capability mask: 0x0251086a
> Port GUID: 0x0002c9030004a79b
> Port 2:
> State: Active
> Physical state: LinkUp
> Rate: 40
> Base lid: 1
> LMC: 0
> SM lid: 1
> Capability mask: 0x0251086a
> Port GUID: 0x0002c9030004a79c
>
> [root [at] client-nfs log]# cat /etc/sysctl.conf
> # Kernel sysctl configuration file for Red Hat Linux
> #
> # For binary values, 0 is disabled, 1 is enabled. See sysctl(8) and
> # sysctl.conf(5) for more details.
>
> # Controls IP packet forwarding
> net.ipv4.ip_forward = 0
>
> # Controls source route verification
> net.ipv4.conf.default.rp_filter = 1
>
> # Do not accept source routing
> net.ipv4.conf.default.accept_source_route = 0
>
> # Controls the System Request debugging functionality of the kernel
> kernel.sysrq = 0
>
> # Controls whether core dumps will append the PID to the core filename
> # Useful for debugging multi-threaded applications
> kernel.core_uses_pid = 1
>
> # Controls the use of TCP syncookies
> net.ipv4.tcp_syncookies = 1
>
> # Controls the maximum size of a message, in bytes
> kernel.msgmnb = 65536
>
> # Controls the default maxmimum size of a mesage queue
> kernel.msgmax = 65536
>
> # Controls the maximum shared segment size, in bytes
> kernel.shmmax = 68719476736
>
> # Controls the maximum number of shared memory segments, in pages
> kernel.shmall = 4294967296
> net.ipv4.tcp_timestamps=0
> net.ipv4.tcp_sack=0
> net.core.netdev_max_backlog=250000
> net.core.rmem_max=16777216
> net.core.wmem_max=16777216
> net.core.rmem_default=16777216
> net.core.wmem_default=16777216
> net.core.optmem_max=16777216


this:

> net.ipv4.tcp_mem=16777216 16777216 16777216

^^ is nonsense.
tcp_mem is in pages, and has _nothing_ to do with your 16 MiB max buffer
size you apparently want to set.

I know the first hits on google for tcp_mem sysctl suggest otherwise.
They are misguided and wrong.

tcp_mem controlls the watermarks for _global_ tcp stack memory usage,
where no pressure, little pressure, or high pressure is used against
the tcp stack, if it tries to allocate some more.

what you say there is that you do not bother to throttle the tcp stack
memory usage at all, until its total usage would reach about
16777216 * 4k, which is 64GiB, which is more than double the amount of
total ram in your box.

DO NOT DO THAT.

Or your box will run into hard OOM conditions,
invoke the oom-killer, and eventually will panic.

Set these values to e.g. 10%, 15%, 30% of total RAM,
and remember the unit is _pages_ (4k).

> net.ipv4.tcp_rmem=4096 87380 16777216
> net.ipv4.tcp_wmem=4096 65536 16777216

here, however, I suggest to up the lower mark to about 32k or 64k,
which is the "guaranteed" amount of buffer space a tcp buffer is
promissed to be left without throttling it further, even under overall
high memory pressure.

> Since we noticed these errors, we've been searching for them on our
> logservers and found another case where this is occurirng using the
> same drbd version and infiniband. This other case, however, uses a
> stock redhat kernel.
>
> Has anyone else seen these errors? Do the tuning recommendations for
> send/receive buffers need to be revisited? Would switching to rdp vs
> IPoIB be a potential fix here? Thanks in advance for any insights.

I'm not sure about RDP in this context, I know it as
http://en.wikipedia.org/wiki/Reliable_Datagram_Protocol
which is not available for DRBD.

SDP (as in http://en.wikipedia.org/wiki/Sockets_Direct_Protocol)
is available with DRBD 8.3.3 (rc*).
But make sure you use the SDP from OFED 1.4.2, or SDP itself will leak
full pages.

As SDP support is a new feature in DRBD, I'd like to have feedback on
how well it works for you, and how performance as well as CPU usage
compares to IPoIB.

But correcting the tcp_mem setting above
is more likely to fix your symptoms.

--
: Lars Ellenberg
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


jmckay at logicworks

Sep 22, 2009, 2:01 PM

Post #2 of 7 (1041 views)
Permalink
Re: Page allocation errors and kernel panics with drbd 8.3.3rc1 and infiniband [In reply to]

On Sep 22, 2009, at 4:34 PM, Lars Ellenberg wrote:

> On Tue, Sep 22, 2009 at 01:31:26PM -0400, Jason McKay wrote:
>> Hello all,
>>
>> We're experiencing page allocation errors when writing to a connected
>> drbd device (connected via infiniband using IPoIB) that resulted in a
>> kernel panic last night.
>>
>> When rsyncing data to the primary node, we get a slew of these
>> errors in /var/log/messages:
>>
>> Sep 21 22:11:21 client-nfs5 kernel: drbd0_worker: page allocation
>> failure. order:5, mode:0x10
>> Sep 21 22:11:21 client-nfs5 kernel: Pid: 6114, comm: drbd0_worker
>> Tainted: P 2.6.27.25 #2
>> Sep 21 22:11:21 client-nfs5 kernel:
>> Sep 21 22:11:21 client-nfs5 kernel: Call Trace:
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8026eec4>]
>> __alloc_pages_internal+0x3a4/0x3c0
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028c61a>]
>> kmem_getpages+0x6b/0x12b
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028cb93>]
>> fallback_alloc+0x11d/0x1b1
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8028c816>]
>> kmem_cache_alloc_node+0xa3/0xcf
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff803ecf65>]
>> __alloc_skb+0x64/0x12e
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8041af6f>]
>> sk_stream_alloc_skb+0x2f/0xd5
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff8041beb0>]
>> tcp_sendmsg+0x180/0x9d1
>> Sep 21 22:11:21 client-nfs5 kernel: [<ffffffff803e6d85>]
>> sock_sendmsg+0xe2/0xff
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff80246d06>]
>> autoremove_wake_function+0x0/0x2e
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff80246d06>]
>> autoremove_wake_function+0x0/0x2e
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8027464b>]
>> zone_statistics+0x3a/0x5d
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff803e82f6>]
>> kernel_sendmsg+0x2c/0x3e
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0568d83>] drbd_send
>> +0xb2/0x194 [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569134>]
>> _drbd_send_cmd+0x9c/0x116 [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569285>]
>> send_bitmap_rle_or_plain+0xd7/0x13a [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa0569475>]
>> _drbd_send_bitmap+0x18d/0x1ae [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056c242>]
>> drbd_send_bitmap+0x39/0x4c [drbd]
>
> I certainly would not expect to _cause_
> memory pressure from this call path.
> But someone is causing it, and we are affected.
>
>
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056b9a1>]
>> w_bitmap_io+0x45/0x95 [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa05560c4>]
>> drbd_worker+0x230/0x3eb [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056b0d9>]
>> drbd_thread_setup+0x124/0x1ba [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8020cd59>] child_rip
>> +0xa/0x11
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffffa056afb5>]
>> drbd_thread_setup+0x0/0x1ba [drbd]
>> Sep 21 22:11:22 client-nfs5 kernel: [<ffffffff8020cd4f>] child_rip
>> +0x0/0x11
>
> why is the tcp stack trying to allocate order:5 pages?
>
> thats 32 continguous pages, 128 KiB continguous memory. apparently
> your
> memory is so much fragmented that this is no longer available.
> but why is the tcp stack ok with order:0 or order:1 pages?
> why does it have to be order:5 ?
>
>> These were occurring for hours until a kernel panic:
>>
>> [Mon Sep 21 22:11:33 2009]INFO: task kjournald:7025 blocked for
>> more than 120 seconds.
>> [Mon Sep 21 22:11:33 2009]"echo 0 > /proc/sys/kernel/
>> hung_task_timeout_secs" disables this message.
>> [Mon Sep 21 22:11:33 2009] ffff88062e801d30 0000000000000046
>> ffff88062e801cf8 ffffffffa056b821
>> [Mon Sep 21 22:11:33 2009] ffff880637d617c0 ffff88063dd05120
>> ffff88063e668750 ffff88063dd05478
>> [Mon Sep 21 22:11:33 2009] 0000000900000002 000000011a942a2d
>> ffffffffffffffff ffffffffffffffff
>> [Mon Sep 21 22:11:33 2009]Call Trace:
>> [Mon Sep 21 22:11:33 2009] [<ffffffffa056b821>] drbd_unplug_fn
>> +0x14a/0x1aa [drbd]
>> [Mon Sep 21 22:11:33 2009] [<ffffffff802b1fed>] sync_buffer+0x0/0x3f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff804903df>] io_schedule+0x5d/0x9f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff802b2028>] sync_buffer+0x3b/0x3f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff80490658>] __wait_on_bit
>> +0x40/0x6f
>> [Mon Sep 21 22:11:33 2009] [<ffffffff802b1fed>] sync_buffer+0x0/0x3f
>> [Mon Sep 21 22:11:34 2009] [<ffffffff804906f3>]
>> out_of_line_wait_on_bit+0x6c/0x78
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246d34>] wake_bit_function
>> +0x0/0x23
>> [Mon Sep 21 22:11:34 2009] [<ffffffffa0031250>]
>> journal_commit_transaction+0x7cd/0xc4d [jbd]
>> [Mon Sep 21 22:11:34 2009] [<ffffffff8023dc9b>] lock_timer_base
>> +0x26/0x4b
>> [Mon Sep 21 22:11:34 2009] [<ffffffffa0033d32>] kjournald
>> +0xc1/0x1fb [jbd]
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246d06>]
>> autoremove_wake_function+0x0/0x2e
>> [Mon Sep 21 22:11:34 2009] [<ffffffffa0033c71>] kjournald+0x0/0x1fb
>> [jbd]
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246bd8>] kthread+0x47/0x73
>> [Mon Sep 21 22:11:34 2009] [<ffffffff8020cd59>] child_rip+0xa/0x11
>> [Mon Sep 21 22:11:34 2009] [<ffffffff80246b91>] kthread+0x0/0x73
>> [Mon Sep 21 22:11:34 2009] [<ffffffff8020cd4f>] child_rip+0x0/0x11
>> [Mon Sep 21 22:11:34 2009]
>> [Mon Sep 21 22:11:34 2009]Kernel panic - not syncing: softlockup:
>> blocked tasks
>> [-- root [at] localhost attached -- Mon Sep 21 22:17:24 2009]
>> [-- root [at] localhost detached -- Mon Sep 21 23:16:22 2009]
>> [-- Console down -- Tue Sep 22 04:02:15 2009]
>
> That is unfortunate.
>
>> The two systems are hardware and OS identical:
>>
>> [root [at] client-nfs log]# uname -a
>> Linux client-nfs5 2.6.27.25 #2 SMP Fri Jun 26 00:07:23 EDT 2009
>> x86_64 x86_64 x86_64 GNU/Linux
>>
>> [root [at] client-nfs log]# grep model\ name /proc/cpuinfo
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
>>
>> [root [at] client-nfs log]# free -m
>> total used free shared buffers
>> cached
>> Mem: 24110 2050 22060 0
>> 157 1270
>> -/+ buffers/cache: 622 23488
>> Swap: 4094 0 4094
>>
>> [root [at] client-nfs log]# cat /proc/partitions
>> major minor #blocks name
>>
>> 8 0 1755310080 sda
>> 8 1 1755302031 sda1
>> 8 16 4881116160 sdb
>> 8 17 4881116126 sdb1
>> 8 32 62522712 sdc
>> 8 33 58147236 sdc1
>> 8 34 4192965 sdc2
>> 147 0 1755248424 drbd0
>> 147 1 4880967128 drbd1
>>
>> [root [at] client-nfs log]# modinfo drbd
>> filename: /lib/modules/2.6.27.25/kernel/drivers/block/drbd.ko
>> alias: block-major-147-*
>> license: GPL
>> version: 8.3.3rc1
>
> Just in case, please retry with current git.
> But it is more likely an issue with badly tuned tcp, or tcp via IPoIB.
>
>> description: drbd - Distributed Replicated Block Device v8.3.3rc1
>> author: Philipp Reisner <phil [at] linbit>, Lars Ellenberg <lars [at] linbit
>> >
>> srcversion: A06488A2009E5EB94AF0825
>> depends:
>> vermagic: 2.6.27.25 SMP mod_unload modversions
>> parm: minor_count:Maximum number of drbd devices (1-255)
>> (uint)
>> parm: disable_sendpage:bool
>> parm: allow_oos:DONT USE! (bool)
>> parm: cn_idx:uint
>> parm: proc_details:int
>> parm: enable_faults:int
>> parm: fault_rate:int
>> parm: fault_count:int
>> parm: fault_devs:int
>> parm: usermode_helper:string
>>
>> [root [at] client-nfs log]# cat /etc/drbd.conf
>> global {
>> usage-count no;
>> }
>>
>> common {
>> syncer { rate 800M; } #This is temporary
>
> please reduce, and only slowly increase again, until the cause of the
> high memory pressure is identified.
> maybe someone is leaking full pages?
> with infiniband SDP, that in fact did happen.
>
> though, if there was no current resync going on,
> this cannot be the cause.
>
>> protocol C;
>>
>> handlers {
>> }
>>
>> startup { degr-wfc-timeout 60;
>> wfc-timeout 60;
>> }
>>
>> disk {
>> on-io-error detach;
>> fencing dont-care;
>> no-disk-flushes;
>> no-md-flushes;
>> no-disk-barrier;
>> }
>>
>> net {
>> ko-count 2;
>> after-sb-2pri disconnect;
>> max-buffers 16000;
>> max-epoch-size 16000;
>> unplug-watermark 16001;
>> sndbuf-size 8m;
>> rcvbuf-size 8m;
>>
>> }
>> }
>>
>> resource drbd0 {
>> on client-nfs5 {
>> device /dev/drbd0;
>> disk /dev/sda1;
>> address 192.168.16.48:7789;
>> flexible-meta-disk internal;
>> }
>>
>> on client-nfs6 {
>> device /dev/drbd0;
>> disk /dev/sda1;
>> address 192.168.16.49:7789;
>> flexible-meta-disk internal;
>> }
>> }
>>
>> resource drbd1 {
>> on client-nfs5 {
>> device /dev/drbd1;
>> disk /dev/sdb1;
>> address 192.168.16.48:7790;
>> flexible-meta-disk internal;
>> }
>>
>> on client-nfs6 {
>> device /dev/drbd1;
>> disk /dev/sdb1;
>> address 192.168.16.49:7790;
>> flexible-meta-disk internal;
>> }
>> }
>>
>> [root [at] client-nfs log]# ibstat
>> CA 'mlx4_0'
>> CA type: MT26428
>> Number of ports: 2
>> Firmware version: 2.6.100
>> Hardware version: a0
>> Node GUID: 0x0002c9030004a79a
>> System image GUID: 0x0002c9030004a79d
>> Port 1:
>> State: Active
>> Physical state: LinkUp
>> Rate: 40
>> Base lid: 1
>> LMC: 0
>> SM lid: 1
>> Capability mask: 0x0251086a
>> Port GUID: 0x0002c9030004a79b
>> Port 2:
>> State: Active
>> Physical state: LinkUp
>> Rate: 40
>> Base lid: 1
>> LMC: 0
>> SM lid: 1
>> Capability mask: 0x0251086a
>> Port GUID: 0x0002c9030004a79c
>>
>> [root [at] client-nfs log]# cat /etc/sysctl.conf
>> # Kernel sysctl configuration file for Red Hat Linux
>> #
>> # For binary values, 0 is disabled, 1 is enabled. See sysctl(8) and
>> # sysctl.conf(5) for more details.
>>
>> # Controls IP packet forwarding
>> net.ipv4.ip_forward = 0
>>
>> # Controls source route verification
>> net.ipv4.conf.default.rp_filter = 1
>>
>> # Do not accept source routing
>> net.ipv4.conf.default.accept_source_route = 0
>>
>> # Controls the System Request debugging functionality of the kernel
>> kernel.sysrq = 0
>>
>> # Controls whether core dumps will append the PID to the core
>> filename
>> # Useful for debugging multi-threaded applications
>> kernel.core_uses_pid = 1
>>
>> # Controls the use of TCP syncookies
>> net.ipv4.tcp_syncookies = 1
>>
>> # Controls the maximum size of a message, in bytes
>> kernel.msgmnb = 65536
>>
>> # Controls the default maxmimum size of a mesage queue
>> kernel.msgmax = 65536
>>
>> # Controls the maximum shared segment size, in bytes
>> kernel.shmmax = 68719476736
>>
>> # Controls the maximum number of shared memory segments, in pages
>> kernel.shmall = 4294967296
>> net.ipv4.tcp_timestamps=0
>> net.ipv4.tcp_sack=0
>> net.core.netdev_max_backlog=250000
>> net.core.rmem_max=16777216
>> net.core.wmem_max=16777216
>> net.core.rmem_default=16777216
>> net.core.wmem_default=16777216
>> net.core.optmem_max=16777216
>
>
> this:
>
>> net.ipv4.tcp_mem=16777216 16777216 16777216
>
> ^^ is nonsense.
> tcp_mem is in pages, and has _nothing_ to do with your 16 MiB max
> buffer
> size you apparently want to set.
>
> I know the first hits on google for tcp_mem sysctl suggest otherwise.
> They are misguided and wrong.
>
> tcp_mem controlls the watermarks for _global_ tcp stack memory usage,
> where no pressure, little pressure, or high pressure is used against
> the tcp stack, if it tries to allocate some more.
>
> what you say there is that you do not bother to throttle the tcp stack
> memory usage at all, until its total usage would reach about
> 16777216 * 4k, which is 64GiB, which is more than double the amount of
> total ram in your box.
>
> DO NOT DO THAT.
>
> Or your box will run into hard OOM conditions,
> invoke the oom-killer, and eventually will panic.
>
> Set these values to e.g. 10%, 15%, 30% of total RAM,
> and remember the unit is _pages_ (4k).
>
>> net.ipv4.tcp_rmem=4096 87380 16777216
>> net.ipv4.tcp_wmem=4096 65536 16777216
>
> here, however, I suggest to up the lower mark to about 32k or 64k,
> which is the "guaranteed" amount of buffer space a tcp buffer is
> promissed to be left without throttling it further, even under overall
> high memory pressure.

Ok, this makes sense. These settings were taken from the ofed 1.4.2
release notes and are actually set dynamically when openibd is started
(via /sbin/ib_ipoib_sysctl)...

We'll start on these, then.

>
>> Since we noticed these errors, we've been searching for them on our
>> logservers and found another case where this is occurirng using the
>> same drbd version and infiniband. This other case, however, uses a
>> stock redhat kernel.
>>
>> Has anyone else seen these errors? Do the tuning recommendations
>> for
>> send/receive buffers need to be revisited? Would switching to rdp vs
>> IPoIB be a potential fix here? Thanks in advance for any insights.
>
> I'm not sure about RDP in this context, I know it as
> http://en.wikipedia.org/wiki/Reliable_Datagram_Protocol
> which is not available for DRBD.

The context here is my bad typing. I meant SDP.

>
> SDP (as in http://en.wikipedia.org/wiki/Sockets_Direct_Protocol)
> is available with DRBD 8.3.3 (rc*).
> But make sure you use the SDP from OFED 1.4.2, or SDP itself will leak
> full pages.

In our testing, we've been using the SDP from OFED 1.4.2.

>
> As SDP support is a new feature in DRBD, I'd like to have feedback on
> how well it works for you, and how performance as well as CPU usage
> compares to IPoIB.
>
> But correcting the tcp_mem setting above
> is more likely to fix your symptoms.

I suspect it will. We'll test and follow up.

Much thanks for the quick reply.

>
> --
> : Lars Ellenberg
> : LINBIT HA-Solutions GmbH
> : DRBD®/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> __
> please don't Cc me, but send to list -- I'm subscribed
> _______________________________________________
> drbd-user mailing list
> drbd-user [at] lists
> http://lists.linbit.com/mailman/listinfo/drbd-user

--
Jason McKay
Sr. Engineer
Logicworks

_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


lars.ellenberg at linbit

Sep 23, 2009, 12:20 AM

Post #3 of 7 (1045 views)
Permalink
Re: Page allocation errors and kernel panics with drbd 8.3.3rc1 and infiniband [In reply to]

On Tue, Sep 22, 2009 at 05:01:00PM -0400, Jason McKay wrote:
> > But correcting the tcp_mem setting above
> > is more likely to fix your symptoms.
>
> I suspect it will. We'll test and follow up.
>
> Much thanks for the quick reply.

We try to please.
Much more so for partners or feature-sponsors such as you ;)


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


lars.ellenberg at linbit

Oct 4, 2009, 1:14 PM

Post #4 of 7 (957 views)
Permalink
Re: Page allocation errors and kernel panics with drbd 8.3.3rc1 and infiniband [In reply to]

On Sun, Oct 04, 2009 at 03:55:44AM -0400, Gennadiy Nerubayev wrote:
> On Tue, Sep 22, 2009 at 5:01 PM, Jason McKay <jmckay [at] logicworks> wrote:
>
> > On Sep 22, 2009, at 4:34 PM, Lars Ellenberg wrote:
> >
> > > But correcting the tcp_mem setting above
> > > is more likely to fix your symptoms.
> >
> > I suspect it will. We'll test and follow up.
> >
>
> Hi guys,
>
> Unfortunately these are still occurring, even after we've updated to rc3,
> and used the tuning settings from rc3 notes (prior to this % of memory in
> pages were attempted with same results). They are a lot less frequent
> (intervals measured in hours), and have not yet caused a panic, but of
> course the worry is that it may happen regardless. Anything else that we
> could try here to eliminate it completely? Is there any chance that the
> ipoib stack is at fault?

Possibly.
Maybe Vlad knows more?
From http://www.openfabrics.org/txt/documentation/linux/EWG_meeting_minutes/12_01_08.txt:
1419 maj vlad [at] mellano Iperf-2.0.4 fails: page allocation failure. order:5
I guess that means https://bugs.openfabrics.org/show_bug.cgi?id=1419
Not much progress on that bug, though.

--
: Lars Ellenberg
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


lars.ellenberg at linbit

Oct 5, 2009, 1:41 AM

Post #5 of 7 (942 views)
Permalink
Re: Page allocation errors and kernel panics with drbd 8.3.3rc1 and infiniband [In reply to]

On Sun, Oct 04, 2009 at 10:14:22PM +0200, Lars Ellenberg wrote:
> On Sun, Oct 04, 2009 at 03:55:44AM -0400, Gennadiy Nerubayev wrote:
> > On Tue, Sep 22, 2009 at 5:01 PM, Jason McKay <jmckay [at] logicworks> wrote:
> >
> > > On Sep 22, 2009, at 4:34 PM, Lars Ellenberg wrote:
> > >
> > > > But correcting the tcp_mem setting above
> > > > is more likely to fix your symptoms.
> > >
> > > I suspect it will. We'll test and follow up.
> > >
> >
> > Hi guys,
> >
> > Unfortunately these are still occurring, even after we've updated to rc3,
> > and used the tuning settings from rc3 notes (prior to this % of memory in
> > pages were attempted with same results). They are a lot less frequent
> > (intervals measured in hours), and have not yet caused a panic, but of
> > course the worry is that it may happen regardless. Anything else that we
> > could try here to eliminate it completely? Is there any chance that the
> > ipoib stack is at fault?
>
> Possibly.
> Maybe Vlad knows more?
> From http://www.openfabrics.org/txt/documentation/linux/EWG_meeting_minutes/12_01_08.txt:
> 1419 maj vlad [at] mellano Iperf-2.0.4 fails: page allocation failure. order:5
> I guess that means https://bugs.openfabrics.org/show_bug.cgi?id=1419
> Not much progress on that bug, though.

This appears related, as well:
http://bugzilla.kernel.org/show_bug.cgi?id=10890

Though there it was claimed that leaving network sysctls at the defaults
"solved" the issue.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


lars.ellenberg at linbit

Oct 5, 2009, 2:49 AM

Post #6 of 7 (961 views)
Permalink
Re: Page allocation errors and kernel panics with drbd 8.3.3rc1 and infiniband [In reply to]

On Mon, Oct 05, 2009 at 10:41:23AM +0200, Lars Ellenberg wrote:
> On Sun, Oct 04, 2009 at 10:14:22PM +0200, Lars Ellenberg wrote:
> > On Sun, Oct 04, 2009 at 03:55:44AM -0400, Gennadiy Nerubayev wrote:
> > > On Tue, Sep 22, 2009 at 5:01 PM, Jason McKay <jmckay [at] logicworks> wrote:
> > >
> > > > On Sep 22, 2009, at 4:34 PM, Lars Ellenberg wrote:
> > > >
> > > > > But correcting the tcp_mem setting above
> > > > > is more likely to fix your symptoms.
> > > >
> > > > I suspect it will. We'll test and follow up.
> > > >
> > >
> > > Hi guys,
> > >
> > > Unfortunately these are still occurring, even after we've updated to rc3,
> > > and used the tuning settings from rc3 notes (prior to this % of memory in
> > > pages were attempted with same results). They are a lot less frequent
> > > (intervals measured in hours), and have not yet caused a panic, but of
> > > course the worry is that it may happen regardless. Anything else that we
> > > could try here to eliminate it completely? Is there any chance that the
> > > ipoib stack is at fault?
> >
> > Possibly.
> > Maybe Vlad knows more?
> > From http://www.openfabrics.org/txt/documentation/linux/EWG_meeting_minutes/12_01_08.txt:
> > 1419 maj vlad [at] mellano Iperf-2.0.4 fails: page allocation failure. order:5
> > I guess that means https://bugs.openfabrics.org/show_bug.cgi?id=1419
> > Not much progress on that bug, though.
>
> This appears related, as well:
> http://bugzilla.kernel.org/show_bug.cgi?id=10890
>
> Though there it was claimed that leaving network sysctls at the defaults
> "solved" the issue.

And yet one more, where sysctls helped:
http://thread.gmane.org/gmane.linux.nfs/20761/focus=695707

It has different context, but that thread may give you an idea on
how to track it down further: turn on slab debug,
then sample /proc/slabinfo, /proc/slab_allocators, /proc/net/sockstat,
and maybe similar statistics in the infiniband area.

BTW, maybe your netdev_max_backlog is a bit excessive?


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


lars.ellenberg at linbit

Oct 5, 2009, 9:24 AM

Post #7 of 7 (956 views)
Permalink
Re: Page allocation errors and kernel panics with drbd 8.3.3rc1 and infiniband [In reply to]

On Mon, Oct 05, 2009 at 11:49:11AM +0200, Lars Ellenberg wrote:
> It has different context, but that thread may give you an idea on
> how to track it down further: turn on slab debug,
> then sample /proc/slabinfo, /proc/slab_allocators, /proc/net/sockstat,
> and maybe similar statistics in the infiniband area.

And if it turns out to be too hard to track down,
and it is actually "only" failing allocations from interrupt context
(or other "tight" code pathes), try reserve some megabytes for those
code pathes.

You seem to have pleanty of RAM araound (24 GiB iirc),
so why not simply reserve 128 MiB?
echo $[128 << 10] > /proc/sys/vm/min_free_kbytes

;)

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user

DRBD users RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.