
Mailing List Archive: DRBD: Users

"PingAck not received" messages

 

 



matthew at bytemark

May 16, 2012, 1:11 PM

Post #1 of 19
"PingAck not received" messages

I'm trying to understand a symptom for a client who uses drbd to run
sets of virtual machines between three pairs of servers (v1a/v1b,
v2a/v2b, v3a/v3b), and I wanted to understand a bit better how DRBD I/O
is buffered depending on what mode is chosen, and buffer settings.

Firstly, it surprised me that even in replication mode "A", the system
still seemed limited by the bandwidth between nodes. I found this
out when the customer's bonded interface had flipped over to its 100Mb
backup connection, and suddenly they had I/O problems. While I was
investigating this and running tests, I noticed that switching to mode A
didn't help, even when measuring short transfers that I'd expect would
fit into reasonable-sized buffers. What kind of buffer size can I
expect from an "auto-tuned" DRBD? It seems important to be able to
cover bursts without leaning on the network, so I'd like to know whether
that's possible with some special tuning.

The other problem is the "PingAck not received" messages that have been
littering the logs of the v3a/v3b servers for the last couple of weeks,
e.g. this has been happening every few hours for one DRBD or another:

May 14 08:21:45 v3b kernel: [661127.869500] block drbd10: PingAck did not arrive in time.
May 14 08:21:45 v3b kernel: [661127.875553] block drbd10: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
May 14 08:21:45 v3b kernel: [661127.875562] block drbd10: asender terminated
May 14 08:21:45 v3b kernel: [661127.875564] block drbd10: Terminating drbd10_asender
May 14 08:21:45 v3b kernel: [661127.875597] block drbd10: short read expecting header on sock: r=-512
May 14 08:21:45 v3b kernel: [661127.882896] block drbd10: Connection closed
May 14 08:21:45 v3b kernel: [661127.882899] block drbd10: conn( NetworkFailure -> Unconnected )
May 14 08:21:45 v3b kernel: [661127.882904] block drbd10: receiver terminated
May 14 08:21:45 v3b kernel: [661127.882908] block drbd10: Restarting drbd10_receiver
May 14 08:21:45 v3b kernel: [661127.882910] block drbd10: receiver (re)started
May 14 08:21:45 v3b kernel: [661127.882913] block drbd10: conn( Unconnected -> WFConnection )
May 14 08:21:46 v3b kernel: [661129.123506] block drbd10: Handshake successful: Agreed network protocol version 91
May 14 08:21:46 v3b kernel: [661129.123511] block drbd10: conn( WFConnection -> WFReportParams )
May 14 08:21:46 v3b kernel: [661129.123535] block drbd10: Starting asender thread (from drbd10_receiver [31418])
May 14 08:21:46 v3b kernel: [661129.123876] block drbd10: data-integrity-alg: <not-used>
May 14 08:21:46 v3b kernel: [661129.123898] block drbd10: drbd_sync_handshake:
May 14 08:21:46 v3b kernel: [661129.123900] block drbd10: self C5DC68A8AFD5BFEC:0000000000000000:7EB45F3A26B3BD72:2EC9659EFC4BC513 bits:0 flags:0
May 14 08:21:46 v3b kernel: [661129.123903] block drbd10: peer F8BB238D22A7ACFF:C5DC68A8AFD5BFED:7EB45F3A26B3BD72:2EC9659EFC4BC513 bits:0 flags:0
May 14 08:21:46 v3b kernel: [661129.123905] block drbd10: uuid_compare()=-1 by rule 50
May 14 08:21:46 v3b kernel: [661129.123908] block drbd10: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 14 08:21:46 v3b kernel: [661129.138101] block drbd10: conn( WFBitMapT -> WFSyncUUID )
May 14 08:21:46 v3b kernel: [661129.139563] block drbd10: helper command: /sbin/drbdadm before-resync-target minor-10
May 14 08:21:46 v3b kernel: [661129.140282] block drbd10: helper command: /sbin/drbdadm before-resync-target minor-10 exit code 0 (0x0)
May 14 08:21:46 v3b kernel: [661129.140286] block drbd10: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
May 14 08:21:46 v3b kernel: [661129.140292] block drbd10: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
May 14 08:21:47 v3b kernel: [661129.693954] block drbd10: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
May 14 08:21:47 v3b kernel: [661129.693961] block drbd10: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
May 14 08:21:47 v3b kernel: [661129.693969] block drbd10: helper command: /sbin/drbdadm after-resync-target minor-10
May 14 08:21:47 v3b kernel: [661129.694725] block drbd10: helper command: /sbin/drbdadm after-resync-target minor-10 exit code 0 (0x0)
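
To put a number on how long each blip lasts, the "conn( ... )" transitions in a log like the one above can be paired up. This is only a sketch: it assumes one kernel message per line (as in the raw syslog), and the function name `outages` and the regular expression are illustrative, not part of any DRBD tooling.

```python
import re

# Kernel timestamp, device name and a "conn( OLD -> NEW )" state transition.
CONN_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+block (?P<dev>drbd\d+): "
    r".*?conn\( (?P<old>\w+) -> (?P<new>\w+) \)"
)


def outages(lines):
    """Yield (device, seconds) for each Connected -> ... -> Connected cycle."""
    left = {}  # device -> timestamp at which it left the Connected state
    for line in lines:
        m = CONN_RE.search(line)
        if not m:
            continue
        ts, dev = float(m["ts"]), m["dev"]
        if m["old"] == "Connected":
            left[dev] = ts
        elif m["new"] == "Connected" and dev in left:
            yield dev, ts - left.pop(dev)


# Three of the lines from the excerpt above.
SAMPLE = [
    "May 14 08:21:45 v3b kernel: [661127.875553] block drbd10: peer( Primary -> Unknown ) "
    "conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )",
    "May 14 08:21:45 v3b kernel: [661127.882913] block drbd10: conn( Unconnected -> WFConnection )",
    "May 14 08:21:47 v3b kernel: [661129.693961] block drbd10: conn( SyncTarget -> Connected ) "
    "disk( Inconsistent -> UpToDate )",
]

for dev, secs in outages(SAMPLE):
    print(f"{dev}: disconnected for {secs:.3f} s")
```

Using the kernel's monotonic timestamps rather than the syslog wall-clock time gives sub-second resolution for short blips.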

I've not been able to correlate these ping drops and reconnections to
any of:

1) interface capacity issues (a few times we might make a 400Mb spike,
but sometimes there's none at all);

2) loss of connectivity or ARP problems on the two servers' dedicated
DRBD interfaces (i.e. I've got an unbroken log of pings between the two
servers);

3) any kernel grumbles about the network interface, bonding, RAID or
anything remotely hardware-related. Apart from the drbd messages
there's no other chatter from the kernel.

The customer's other two pairs of servers have been running 18 months
and not exhibited this behaviour.

The customer hasn't given me the data to show these blips (which are
anything from 2s-30s) correspond to any real performance problems and I
don't have access to the inside of their VMs to check for myself. So my
questions are - would you expect these disconnections to cause
variations in I/O bandwidth or responsiveness?

And secondly, what should I be doing about it? My unsatisfactory
response to the customer's worry is to reconnect all the drbds with a
longer ping-timeout, and in 10 hours it hasn't reoccurred, which is an
unusually long record. I will be more convinced by the end of the day.

Even if that does solve these messages, I'm curious as to the cause.
We've not hit a network bandwidth ceiling, and so we've definitely not
hit an I/O ceiling (which is 4x146GB 15kRPM discs, RAID10, HP RAID). I
can accept that some VMs will use more bandwidth than others, and so it
wouldn't be surprising that one VM on the machine was the "cause".

But when the disconnections happen, they appear to be completely random.
Counting with grep/uniq -c across the 12 devices on the systems shows no
obvious pattern (and drbd10 is just a test device, getting absolutely
zero I/O).

5 drbd0:
5 drbd1:
11 drbd2:
8 drbd3:
11 drbd4:
4 drbd5:
6 drbd6:
7 drbd7:
5 drbd8:
14 drbd9:
12 drbd10:
7 drbd11:
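
A tally like the one above can also be produced with a few lines of Python instead of grep/uniq -c. This is a sketch only: it assumes the message text is exactly "PingAck did not arrive in time" as in the log excerpt, and the sample lines below are shortened stand-ins rather than real log data.

```python
import re
from collections import Counter

PINGACK_RE = re.compile(r"block (drbd\d+): PingAck did not arrive in time")


def count_pingack_timeouts(lines):
    """Count PingAck timeout messages per DRBD device."""
    counts = Counter()
    for line in lines:
        m = PINGACK_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


# Illustrative input; in practice the lines would come from the kernel log.
SAMPLE = [
    "kernel: [1.0] block drbd10: PingAck did not arrive in time.",
    "kernel: [2.0] block drbd2: PingAck did not arrive in time.",
    "kernel: [3.0] block drbd10: asender terminated",
    "kernel: [4.0] block drbd10: PingAck did not arrive in time.",
]

counts = count_pingack_timeouts(SAMPLE)
# Sort numerically so that drbd10 comes after drbd2.
for dev in sorted(counts, key=lambda d: int(d[4:])):
    print(f"{counts[dev]:3d} {dev}:")
```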

So even if upping the ping time stops the problem, and even if the
effects of the disconnect/reconnect cycles are harmless - why might DRBD
exhibit these symptoms on one pair of servers, but not two other sets?
Is there some I/O pattern that might cause pings to get lost, even over
a lightly-loaded gigabit link?

Thanks for any insights in advance.

--
Matthew
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


lars.ellenberg at linbit

May 18, 2012, 7:04 AM

Post #2 of 19
Re: "PingAck not received" messages

On Wed, May 16, 2012 at 09:11:05PM +0100, Matthew Bloch wrote:
> I'm trying to understand a symptom for a client who uses drbd to run
> sets of virtual machines between three pairs of servers (v1a/v1b,
> v2a/v2b, v3a/v3b), and I wanted to understand a bit better how DRBD I/O
> is buffered depending on what mode is chosen, and buffer settings.
>
> Firstly, it surprised me that even in replication mode "A", the system
> still seemed limited by by the bandwidth between nodes. I found this
> out when the customer's bonded interface had flipped over to its 100Mb
> backup connection, and suddenly they had I/O problems. While I was
> investigating this and running tests, I noticed that switching to mode A
> didn't help, even when measuring short transfers that I'd expect would
> fit into reasonable-sized buffers. What kind of buffer size can I
> expect from an "auto-tuned" DRBD? It seems important to be able to
> cover bursts without leaning on the network, so I'd like to know whether
> that's possible with some special tuning.

Uhm, well,
we have invented the DRBD Proxy specifically for that purpose.

> The other problem is the "PingAck not received" messages that have been
> littering the logs of the v3a/v3b servers for the last couple of weeks,
> e.g. this has been happening every few hours for one DRBD or another:
>
> May 14 08:21:45 v3b kernel: [661127.869500] block drbd10: PingAck did
> not arrive in time.

Increase ping timeout?

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com


matthew at bytemark

May 18, 2012, 9:49 AM

Post #3 of 19
Re: "PingAck not received" messages

On 18/05/12 15:04, Lars Ellenberg wrote:
> On Wed, May 16, 2012 at 09:11:05PM +0100, Matthew Bloch wrote:
>> I'm trying to understand a symptom for a client who uses drbd to run
>> sets of virtual machines between three pairs of servers (v1a/v1b,
>> v2a/v2b, v3a/v3b), and I wanted to understand a bit better how DRBD I/O
>> is buffered depending on what mode is chosen, and buffer settings.
>>
>> Firstly, it surprised me that even in replication mode "A", the system
>> still seemed limited by by the bandwidth between nodes. I found this
>> out when the customer's bonded interface had flipped over to its 100Mb
>> backup connection, and suddenly they had I/O problems. While I was
>> investigating this and running tests, I noticed that switching to mode A
>> didn't help, even when measuring short transfers that I'd expect would
>> fit into reasonable-sized buffers. What kind of buffer size can I
>> expect from an "auto-tuned" DRBD? It seems important to be able to
>> cover bursts without leaning on the network, so I'd like to know whether
>> that's possible with some special tuning.
>
> Uhm, well,
> we have invented the DRBD Proxy specifically for that purpose.

That's useful to know - so the kernel buffering, however it's
configured, isn't really set up for handling longer delays? I don't
think that's my problem, as the ICMP ping time between the servers is <1ms
and doesn't drop out even while DRBD reports it hasn't seen its own
pings. It's gigabit ethernet all the way, and on a private LAN.

>> The other problem is the "PingAck not received" messages that have been
>> littering the logs of the v3a/v3b servers for the last couple of weeks,
>> e.g. this has been happening every few hours for one DRBD or another:
>>
>> May 14 08:21:45 v3b kernel: [661127.869500] block drbd10: PingAck did
>> not arrive in time.
>
> Increase ping timeout?

I did that (now at 3s, from 0.5s) but I still get reconnections.

I set up two pairs of VMs to write 1MB to the DRBD every second, and
time it. On the problematic machines, I saw lots of times where the
write took more than 10s, and a couple of those corresponded with DRBD
reconnections. On the normal machines, only two of the writes took more
than 0.1s!
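
A timed, synced write of this kind might look like the sketch below. It is an assumption about the shape of such a test, not the script actually used: `timed_write` is a hypothetical helper, and the demo writes to a scratch file in a temporary directory. Pointing it at a real /dev/drbdX device would overwrite data there.

```python
import os
import tempfile
import time


def timed_write(path, size=1024 * 1024):
    """Write `size` zero bytes to `path`, sync them, and return elapsed seconds."""
    view = memoryview(bytes(size))
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.monotonic()
        while view:  # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # wait until the data has reached stable storage
        return time.monotonic() - start
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    elapsed = timed_write(os.path.join(tmp, "scratch.img"))
    print(f"1 MiB written and synced in {elapsed:.4f} s")
```

The sync matters: without it the call returns as soon as the data sits in the page cache, and replication stalls would never show up in the timings.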

So I'm still hunting for what might be going wrong, even though the
software versions are the same, the drbd links aren't hitting the
ceiling, and they're doing no more I/O than the "good" pairs. I think the
next step will be to take some packet dumps to see if there is anything
odd going on at the TCP layer.

If nobody else on the list has seen this sort of behaviour, and Linbit
have a day rate :-) please get in touch privately, I'd rather get you
guys to fix this for our customer.

Best wishes,

--
Matthew Bloch Bytemark Hosting
http://www.bytemark.co.uk/
tel: +44 (0) 1904 890890


florian at hastexo

May 20, 2012, 11:14 PM

Post #4 of 19
Re: "PingAck not received" messages

Matthew,

On Wed, May 16, 2012 at 10:11 PM, Matthew Bloch <matthew [at] bytemark> wrote:
> I'm trying to understand a symptom for a client who uses drbd to run
> sets of virtual machines between three pairs of servers (v1a/v1b,
> v2a/v2b, v3a/v3b), and I wanted to understand a bit better how DRBD I/O
> is buffered depending on what mode is chosen, and buffer settings.

When you say virtual machines, how exactly are they being virtualized?
VMware? Libvirt/KVM? Xen?

> Firstly, it surprised me that even in replication mode "A", the system
> still seemed limited by by the bandwidth between nodes.

Not surprising; bandwidth is always a limiting factor. In protocol A,
if your send buffers don't drain quickly enough, all DRBD can do is
block incoming I/O. DRBD Proxy helps you with compression and by
buffering write bursts, but if your system is getting constantly
hammered to the point where even the _average_ throughput is higher
than the bandwidth of the replication link (minus DRBD Proxy
compression, of course), you can still get it to block. Not too many
people actually run into this, but if you're trying to write to a
DRBD device at, say, 100 MB/s, and all you've got between sites is
2Mbps, then that's not going to work. :)
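
The arithmetic behind this can be sketched as follows: a buffer fills at the difference between the incoming write rate and the rate at which the link drains it. The 4 MiB buffer size is a made-up figure for illustration only; the two rates are the ones from the example above.

```python
def seconds_until_blocked(buffer_bytes, write_rate, link_rate):
    """Time until a send buffer fills, with both rates in bytes per second.

    Returns None when the link drains at least as fast as data arrives,
    i.e. the buffer never fills and writes never have to block.
    """
    surplus = write_rate - link_rate
    if surplus <= 0:
        return None
    return buffer_bytes / surplus


write_rate = 100e6    # 100 MB/s of writes
link_rate = 2e6 / 8   # 2 Mbps expressed in bytes per second
buffer_bytes = 4 * 1024 * 1024  # hypothetical 4 MiB of buffering

t = seconds_until_blocked(buffer_bytes, write_rate, link_rate)
print(f"buffer full after about {t * 1000:.0f} ms")
```

A larger buffer only moves the point at which writes start to block; it cannot help once the average write rate exceeds what the link can carry.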


>  I found this
> out when the customer's bonded interface had flipped over to its 100Mb
> backup connection, and suddenly they had I/O problems.  While I was
> investigating this and running tests, I noticed that switching to mode A
> didn't help, even when measuring short transfers that I'd expect would
> fit into reasonable-sized buffers.  What kind of buffer size can I
> expect from an "auto-tuned" DRBD?  It seems important to be able to
> cover bursts without leaning on the network, so I'd like to know whether
> that's possible with some special tuning.

With sndbuf-size 0 and rcvbuf-size 0, DRBD will let the kernel do its
work. Refer to your TCP rmem/wmem sysctls to get an idea of the buffer
sizes you'll be getting.
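
On Linux those sysctls are exposed as /proc/sys/net/ipv4/tcp_rmem and tcp_wmem, each holding three values (minimum, default and maximum, in bytes). A small sketch for reading them; the helper names are illustrative, and the script prints nothing on systems without those files.

```python
import os


def parse_tcp_mem(text):
    """Parse a tcp_rmem/tcp_wmem line into (min, default, max) in bytes."""
    lo, default, hi = (int(field) for field in text.split())
    return lo, default, hi


def read_tcp_buffers(proc_dir="/proc/sys/net/ipv4"):
    """Return {"tcp_rmem": (min, default, max), "tcp_wmem": (...)} if readable."""
    result = {}
    for name in ("tcp_rmem", "tcp_wmem"):
        path = os.path.join(proc_dir, name)
        if os.path.exists(path):
            with open(path) as f:
                result[name] = parse_tcp_mem(f.read())
    return result


for name, (lo, default, hi) in read_tcp_buffers().items():
    print(f"{name}: min={lo} default={default} max={hi} bytes")
```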


> And secondly, what should I be doing about it?  My unsatisfactory
> response to the customer's worry is to reconnect all the drbds with a
> longer ping-timeout, and in 10 hours it hasn't reoccurred, which is an
> unusually long record.  I will be more convinced by the end of the day.

My hunch would be to start troubleshooting the virtualized network
layer. Hence my question above as to what virtualization you're using.

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now


pascal.berton3 at free

May 20, 2012, 11:25 PM

Post #5 of 19
Re: "PingAck not received" messages

Hi Matthew!

I've recently experienced the very same behavior, with two bonded 10GbE
direct links between nodes for replication. The nodes host 4 resources
under DRBD 8.3.11 using protocol B and, just as in your case, the
disconnections are intermittent and can hit any resource, with no obvious
rule for which one is affected.

One of the resources hosts a CIFS share that I use for VMware DataRecovery
backups. Although my SMB config looked good, I found various errors on the
DataRecovery client and the SMB server that led me to dig further. After
spending a lot of time searching for SMB tuning and the like, I finally
observed a link between these errors and the PingAck errors. Since
DataRecovery was complaining, I ran various experiments to force it to
issue a complete recatalog of the restore points, an operation that issues
lots of CIFS I/Os. Each time, the first 30 minutes or so are OK, then the
first errors occur, only occasionally at the beginning and more often as
time goes on. When it enters the "error phase", I see high I/O wait,
typically 70% or more, meaning long-lasting I/Os, and it is during these
phases that the PingAck errors occur.

I'm still unsure whether it's network related or disk related. It feels
like there's a buffer somewhere that fills up progressively, which would
explain the "correct" first 30 minutes: once this buffer gets full, I/O
delays begin to rise, and then come the CIFS errors, the PingAck errors
(which do not help much), and so on. Thus, it looks like these PingAck
errors occur because of the rising wait times.

The next step will be to (try to) identify whether it's a network or a disk
buffer that fills up, and whether disk activity or network activity is the
real problem; I don't clearly know yet how I will do that. If it's disk
related, maybe a caching solution (dm-cache or whatever) would help.
Otherwise, I'm afraid that only DRBD Proxy would effectively fix it.

If you find other clues, I'm interested!
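
One way to line up I/O wait with the PingAck errors is to sample the aggregate "cpu" line of /proc/stat at intervals and log the iowait share between samples. A sketch, using two made-up snapshots instead of live readings; the fifth counter on that line is iowait.

```python
def cpu_times(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat content as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two snapshots, in percent."""
    # Only user..steal (the first 8 counters); guest time is already
    # included in user/nice, so adding it again would double count.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


# Two illustrative snapshots; real ones would be read from /proc/stat
# a few seconds apart.
FIRST = "cpu  100 0 50 800 50 0 0 0 0 0\ncpu0 100 0 50 800 50 0 0 0 0 0\n"
SECOND = "cpu  110 0 60 810 120 0 0 0 0 0\ncpu0 110 0 60 810 120 0 0 0 0 0\n"

pct = iowait_percent(cpu_times(FIRST), cpu_times(SECOND))
print(f"iowait: {pct:.1f}%")
```

Logged with timestamps next to the kernel log, this would show whether the iowait spikes start before or after the first PingAck message.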

Best regards,

Pascal.



ff at mpexnet

May 21, 2012, 1:03 AM

Post #6 of 19
Re: "PingAck not received" messages

Hi,

On 05/18/2012 06:49 PM, Matthew Bloch wrote:
> I set up a two pairs of VMs to write 1MB to the DRBD every second, and
> time it. On the problematic machines, I saw lots of times where the
> write took more than 10s, and a couple of those corresponded with DRBD
> reconnections. On the normal machines, only two of the writes took more
> than 0.1s!
>
> So I'm still hunting for what might be going wrong, even though the
> software versions are the same, the drbd links aren't hitting the
> ceiling, they're doing no more I/O than the "good" pairs. I think next
> will be to take some packet dumps to see if there is anything odd going
> on at the TCP layer.
>
> If nobody else on the list has seen this sort of behaviour, and Linbit
> have a day rate :-) please get in touch privately, I'd rather get you
> guys to fix this for our customer.

I did have a couple of VMs with severe network problems. They were based
on the 2.6.33-ish KVM with userland as found in Debian Squeeze.

I'd find more or less frequent lost pings and reconnects in the DRBD
logs. Once every couple of weeks the network stack on the Primary would
completely stop receiving packets (funnily enough, it would still send,
so HA didn't kick in, but that may be a different story).

Switching from virtio to e1000 solved this for me.

Cheers,
Felix


matthew at bytemark

May 21, 2012, 9:56 AM

Post #7 of 19
Re: "PingAck not received" messages

I will coalesce a couple of replies, thanks everyone :)

On 21/05/12 07:14, Florian Haas wrote:
> Matthew,
>
> On Wed, May 16, 2012 at 10:11 PM, Matthew Bloch<matthew [at] bytemark> wrote:
>> I'm trying to understand a symptom for a client who uses drbd to run
>> sets of virtual machines between three pairs of servers (v1a/v1b,
>> v2a/v2b, v3a/v3b), and I wanted to understand a bit better how DRBD I/O
>> is buffered depending on what mode is chosen, and buffer settings.
>
> When you say virtual machines, how exactly are they being virtualized?
> VMware? Libvirt/KVM? Xen?

These are KVM-based, but the DRBD happens outside the VMs, on the host,
and the /dev/drbdX devices are presented as the VMs' /dev/vda. So I'm not
sure why the VM networking could have anything to do with it, particularly
as the DRBD goes over a separate interface.

I'm replicating the test on the host, just scribbling directly to a new
/dev/drbd to make sure I see the same performance drops while _not_
going via KVM. I can't see how something guest-side could affect it
though, when the problem manifests itself in the hosts' drbd.

Interesting about bandwidth - so DRBD doesn't have any special buffers
of its own, just sits on the usual TCP buffers. That makes sense. As I
said, the interface stats do not show that they are sending any more
DRBD traffic than the pairs of servers that are working fine though.

Thanks for the accounts Pascal and Felix, though Felix I'm pretty
certain Debian/lenny's kernel had a virtio bug that does cause its
network to break and require a "rmmod virtio_net; modprobe virtio_net"
to fix. That's nothing to do with drbd, and your problem may be
entirely separate from that as well :)

--
Matthew Bloch Bytemark Hosting
http://www.bytemark.co.uk/
tel: +44 (0) 1904 890890


ff at mpexnet

May 22, 2012, 12:16 AM

Post #8 of 19
Re: "PingAck not received" messages

On 05/21/2012 06:56 PM, Matthew Bloch wrote:
> Thanks for the accounts Pascal and Felix, though Felix I'm pretty
> certain Debian/lenny's kernel had a virtio bug that does cause its
> network to break and require a "rmmod virtio_net; modprobe virtio_net"
> to fix. That's nothing to do with drbd, and your problem may be
> entirely separate from that as well :)

Right - I agree that guest issues could not conceivably cause DRBD
issues on the host. I had wrongly inferred that you were DRBDing from
inside a guest.


matthew at bytemark

May 22, 2012, 3:45 AM

Post #9 of 19
Re: "PingAck not received" messages

On 22/05/12 08:16, Felix Frank wrote:
> On 05/21/2012 06:56 PM, Matthew Bloch wrote:
>> Thanks for the accounts Pascal and Felix, though Felix I'm pretty
>> certain Debian/lenny's kernel had a virtio bug that does cause its
>> network to break and require a "rmmod virtio_net; modprobe virtio_net"
>> to fix. That's nothing to do with drbd, and your problem may be
>> entirely separate from that as well :)
>
> Right - I agree that guest issues could not conceivably cause DRBD
> issues on the host. I had wrongly inferred that you were DRBDing from
> inside a guest.

Indeed, I started logging this command every second, and e.g. this kind
of event is typical every few hours:

dd if=/dev/zero of=/dev/drbd13 conv=fdatasync bs=1M count=1 2>&1 | \
grep copied

2012-05-22 02:17:00 W 1048576 bytes (1.0 MB) copied, 0.0115253 s, 91.0 MB/s
2012-05-22 02:17:01 W 1048576 bytes (1.0 MB) copied, 0.011519 s, 91.0 MB/s
2012-05-22 02:17:02 W 1048576 bytes (1.0 MB) copied, 0.0116563 s, 90.0 MB/s
2012-05-22 02:17:03 W 1048576 bytes (1.0 MB) copied, 1.1898 s, 881 kB/s
2012-05-22 02:17:05 W 1048576 bytes (1.0 MB) copied, 28.3202 s, 37.0 kB/s
2012-05-22 02:17:35 W 1048576 bytes (1.0 MB) copied, 0.0127468 s, 82.3 MB/s
2012-05-22 02:17:36 W 1048576 bytes (1.0 MB) copied, 0.0113499 s, 92.4 MB/s
2012-05-22 02:17:37 W 1048576 bytes (1.0 MB) copied, 0.0112707 s, 93.0 MB/s

And in the kernel log:

May 22 02:17:11 v3a kernel: [1341064.126449] block drbd13: [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967295
May 22 02:17:17 v3a kernel: [1341070.129829] block drbd13: [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967294
May 22 02:17:23 v3a kernel: [1341076.133170] block drbd13: [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967293
May 22 02:17:29 v3a kernel: [1341082.133592] block drbd13: [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967292

Curiously, the "v3a" host (on which I'm running this test) just shows
these disconnects; it's the "v3b" host that gives the "PingAck not
received" messages. But not in this instance.
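
To pick the stalls out of a long run of dd timings like the one above, the elapsed time can be parsed from each line and compared with a threshold. A sketch only: the line format (timestamp, "W", then dd's summary) is assumed to be exactly as shown in this message, and the 0.1 s threshold is just an illustrative default.

```python
import re

# "<date> <time> W <n> bytes (...) copied, <seconds> s, <rate>"
DD_RE = re.compile(
    r"^(?P<when>\S+ \S+) W \d+ bytes .*? copied, (?P<secs>[0-9.eE+-]+) s,"
)


def slow_writes(lines, threshold=0.1):
    """Return [(timestamp, seconds)] for writes slower than `threshold` seconds."""
    slow = []
    for line in lines:
        m = DD_RE.match(line)
        if m and float(m["secs"]) > threshold:
            slow.append((m["when"], float(m["secs"])))
    return slow


SAMPLE = [
    "2012-05-22 02:17:02 W 1048576 bytes (1.0 MB) copied, 0.0116563 s, 90.0 MB/s",
    "2012-05-22 02:17:03 W 1048576 bytes (1.0 MB) copied, 1.1898 s, 881 kB/s",
    "2012-05-22 02:17:05 W 1048576 bytes (1.0 MB) copied, 28.3202 s, 37.0 kB/s",
    "2012-05-22 02:17:35 W 1048576 bytes (1.0 MB) copied, 0.0127468 s, 82.3 MB/s",
]

for when, secs in slow_writes(SAMPLE):
    print(f"{when}: write took {secs:.2f} s")
```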

--
Matthew


florian at hastexo

May 22, 2012, 5:05 AM

Post #10 of 19
Re: "PingAck not received" messages

On Tue, May 22, 2012 at 12:45 PM, Matthew Bloch <matthew [at] bytemark> wrote:
> On 22/05/12 08:16, Felix Frank wrote:
>> On 05/21/2012 06:56 PM, Matthew Bloch wrote:
>>> Thanks for the accounts Pascal and Felix, though Felix I'm pretty
>>> certain Debian/lenny's kernel had a virtio bug that does cause its
>>> network to break and require a "rmmod virtio_net; modprobe virtio_net"
>>> to fix.  That's nothing to do with drbd, and your problem may be
>>> entirely separate from that as well :)
>>
>> Right - I agree that guest issues could not conceivably cause DRBD
>> issues on the host. I had wrongly inferred that you were DRBDing from
>> inside a guest.

So was I, originally.

> Indeed, I started logging this command every second, and e.g. this kind
> of event is typical every few hours:
>
>  dd if=/dev/zero of=/dev/drbd13 conv=fdatasync bs=1M count=1 2>&1 | \
>    grep copied
>
> 2012-05-22 02:17:00 W 1048576 bytes (1.0 MB) copied, 0.0115253 s, 91.0 MB/s
> 2012-05-22 02:17:01 W 1048576 bytes (1.0 MB) copied, 0.011519 s, 91.0 MB/s
> 2012-05-22 02:17:02 W 1048576 bytes (1.0 MB) copied, 0.0116563 s, 90.0 MB/s
> 2012-05-22 02:17:03 W 1048576 bytes (1.0 MB) copied, 1.1898 s, 881 kB/s
> 2012-05-22 02:17:05 W 1048576 bytes (1.0 MB) copied, 28.3202 s, 37.0 kB/s
> 2012-05-22 02:17:35 W 1048576 bytes (1.0 MB) copied, 0.0127468 s, 82.3 MB/s
> 2012-05-22 02:17:36 W 1048576 bytes (1.0 MB) copied, 0.0113499 s, 92.4 MB/s
> 2012-05-22 02:17:37 W 1048576 bytes (1.0 MB) copied, 0.0112707 s, 93.0 MB/s
>
> And in the kernel log:
>
> May 22 02:17:11 v3a kernel: [1341064.126449] block drbd13:
> [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967295
> May 22 02:17:17 v3a kernel: [1341070.129829] block drbd13:
> [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967294
> May 22 02:17:23 v3a kernel: [1341076.133170] block drbd13:
> [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967293
> May 22 02:17:29 v3a kernel: [1341082.133592] block drbd13:
> [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967292
>
> Curiously, the "v3a" host (on which I'm running this test) just shows
> these disconnects, it's the "v3b" host that gives the "PingAck not
> received" messages.  But not in this instance.

And you're absolutely certain that "ip -s link show dev <nic>" shows
no drops or errors (for <nic> replaced with your DRBD replication
device name)? Also, if you're replicating through a switch as opposed
to via a back-to-back connection, did you check the switch port
statistics for errors as well?
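
The same error and drop counters are also available under /sys/class/net/<nic>/statistics/ on Linux, which makes them easy to poll from a script and compare before and after a blip. A sketch; the helper names are illustrative, and the demo only reads the loopback interface if it is present.

```python
import os

COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped")


def nic_error_counters(nic, sysfs_root="/sys/class/net"):
    """Read error and drop counters for one interface from sysfs."""
    stats_dir = os.path.join(sysfs_root, nic, "statistics")
    counters = {}
    for name in COUNTERS:
        with open(os.path.join(stats_dir, name)) as f:
            counters[name] = int(f.read())
    return counters


def has_problems(counters):
    """True if any error or drop counter is non-zero."""
    return any(value > 0 for value in counters.values())


if os.path.isdir("/sys/class/net/lo/statistics"):
    print("lo:", nic_error_counters("lo"))
```

The counters are cumulative since the interface came up, so a non-zero value alone proves little; what matters is whether they increase across a disconnect.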

I will observe that up to this point you haven't shared your DRBD
config, so if you pastebinned that and shared the URL here, people
could look into it and point to possible issues.

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now


matthew at bytemark

May 22, 2012, 8:05 AM

Post #11 of 19
Re: "PingAck not received" messages

On 22/05/12 13:05, Florian Haas wrote:
>> Indeed, I started logging this command every second, and e.g. this kind
>> of event is typical every few hours:
>>
>> dd if=/dev/zero of=/dev/drbd13 conv=fdatasync bs=1M count=1 2>&1 | \
>> grep copied
>>
>> 2012-05-22 02:17:00 W 1048576 bytes (1.0 MB) copied, 0.0115253 s, 91.0 MB/s
>> 2012-05-22 02:17:01 W 1048576 bytes (1.0 MB) copied, 0.011519 s, 91.0 MB/s
>> 2012-05-22 02:17:02 W 1048576 bytes (1.0 MB) copied, 0.0116563 s, 90.0 MB/s
>> 2012-05-22 02:17:03 W 1048576 bytes (1.0 MB) copied, 1.1898 s, 881 kB/s
>> 2012-05-22 02:17:05 W 1048576 bytes (1.0 MB) copied, 28.3202 s, 37.0 kB/s
>> 2012-05-22 02:17:35 W 1048576 bytes (1.0 MB) copied, 0.0127468 s, 82.3 MB/s
>> 2012-05-22 02:17:36 W 1048576 bytes (1.0 MB) copied, 0.0113499 s, 92.4 MB/s
>> 2012-05-22 02:17:37 W 1048576 bytes (1.0 MB) copied, 0.0112707 s, 93.0 MB/s
>>
>> And in the kernel log:
>>
>> May 22 02:17:11 v3a kernel: [1341064.126449] block drbd13:
>> [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967295
>> May 22 02:17:17 v3a kernel: [1341070.129829] block drbd13:
>> [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967294
>> May 22 02:17:23 v3a kernel: [1341076.133170] block drbd13:
>> [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967293
>> May 22 02:17:29 v3a kernel: [1341082.133592] block drbd13:
>> [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967292
>>
>> Curiously, the "v3a" host (on which I'm running this test) just shows
>> these disconnects, it's the "v3b" host that gives the "PingAck not
>> received" messages. But not in this instance.
>
> And you're absolutely certain that "ip -s link show dev <nic>" shows
> no drops or errors (for <nic> replaced with your DRBD replication
> device name)? Also, if you're replicating through a switch as opposed
> to via a back-to-back connection, did you check the switch port
> statistics for errors as well?

I hadn't spotted any but I've asked for a third opinion, because I can't
see any other explanation. I've also set up a continuous TCP connection
on the same interfaces between a pair of scripts that will report if
there is any break in TCP connectivity, as it appears drbd is seeing.
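
For anyone wanting to run a similar check, here is a minimal sketch of a heartbeat over one long-lived connection. It is not the pair of scripts mentioned above (their contents are not shown in this thread); the names heartbeat, watch and echo_loop, the "ping N" line format and the deadlines are all assumptions made for illustration.

```ruby
require "socket"

def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Send one heartbeat line on an already-connected socket and wait for the
# peer to echo it back. Returns the round-trip time in seconds, or nil if
# the matching reply did not arrive within `deadline` seconds.
def heartbeat(sock, seq, deadline)
  started = monotonic_now
  sock.write("ping #{seq}\n")
  loop do
    remaining = deadline - (monotonic_now - started)
    return nil if remaining <= 0
    return nil unless IO.select([sock], nil, nil, remaining)
    reply = sock.gets
    return nil if reply.nil?                       # peer closed the connection
    return monotonic_now - started if reply.chomp == "ping #{seq}"
    # Anything else is a late reply to an earlier heartbeat: keep waiting.
  end
rescue SystemCallError
  nil                                              # e.g. EPIPE or ECONNRESET
end

# Run `count` heartbeats, `interval` seconds apart, and return the sequence
# numbers that missed the deadline.
def watch(sock, count:, interval:, deadline:)
  missed = []
  count.times do |seq|
    if heartbeat(sock, seq, deadline).nil?
      missed << seq
      warn "#{Time.now} heartbeat #{seq} missed (deadline #{deadline}s)"
    end
    sleep interval
  end
  missed
end

# Peer side: echo every line straight back until the connection closes.
def echo_loop(sock)
  while (line = sock.gets)
    sock.write(line)
  end
end
```

On real hosts, one side would run echo_loop on a connection accepted from a TCPServer bound to the replication address, and the other would pass a TCPSocket to watch. A deadline close to the 5 second ping-timeout would make the result comparable with what DRBD reports.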

> I will observe that up to this point you haven't shared your DRBD
> config, so if you pastebinned that and shared the URL here, people
> could look into it and point to possible issues.

I'm not using drbdadm and the helper, my "pairvm" script manages DRBD
for VMs using this command to attach the disc:

drbd.setup("disk", drbd_backing_device, drbd_meta_device, 0)

and these commands to connect the network, pretty unambitious stuff I
assume:

drbd.setup("net", *(drbd_net_args + extra_options))
drbd.setup("syncer", "-r", "20M")

def drbd_net_args
  [
    "#{Global.ips['here_drbd']}:#{drbd_port}",
    "#{Global.ips['there_drbd']}:#{drbd_port}",
    "B",
    "--after-sb-0pri", "discard-zero-changes",
    "--after-sb-1pri", "consensus",
    "--after-sb-2pri", "disconnect",
    "--ping-timeout", "50"
  ]
end

NB these just translate to "drbdsetup /dev/drbdX net ..." with the IPs
and ports automatically assigned by the script.
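
To make that translation concrete, here is a rough sketch of how such an argument list becomes a command line. The helper drbdsetup_command is hypothetical (the real drbd.setup wrapper is not shown in this thread), and the addresses, port and minor number are placeholders rather than the real values.

```ruby
require "shellwords"

# Hypothetical stand-in for the script's wrapper: it returns the command
# line as a string instead of executing it.
def drbdsetup_command(minor, subcommand, *args)
  ["drbdsetup", "/dev/drbd#{minor}", subcommand, *args.map(&:to_s)].shelljoin
end

# Same shape as drbd_net_args, with placeholder addresses standing in for
# Global.ips['here_drbd'], Global.ips['there_drbd'] and drbd_port.
NET_ARGS = [
  "192.0.2.1:45000",
  "192.0.2.2:45000",
  "B",
  "--after-sb-0pri", "discard-zero-changes",
  "--after-sb-1pri", "consensus",
  "--after-sb-2pri", "disconnect",
  "--ping-timeout", "50"
].freeze

puts drbdsetup_command(0, "net", *NET_ARGS)
puts drbdsetup_command(0, "syncer", "-r", "20M")
```

Printing the command instead of running it makes it easy to compare against what "drbdsetup /dev/drbdX show" reports afterwards.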

--
Matthew


florian at hastexo

May 23, 2012, 12:45 PM

Post #12 of 19 (879 views)
Permalink
Re: "PingAck not received" messages [In reply to]

On Tue, May 22, 2012 at 5:05 PM, Matthew Bloch <matthew [at] bytemark> wrote:
> I'm not using drbdadm and the helper, my "pairvm" script manages DRBD
> for VMs using this command to attach the disc:
>
>      drbd.setup("disk", drbd_backing_device, drbd_meta_device, 0)
>
> and these commands to connect the network, pretty unambitious stuff I
> assume:
>
>      drbd.setup("net", *(drbd_net_args + extra_options))
>      drbd.setup("syncer", "-r", "20M")
>      def drbd_net_args
>        [
>        "#{Global.ips['here_drbd']}:#{drbd_port}",
>        "#{Global.ips['there_drbd']}:#{drbd_port}",
>          "B",
>          "--after-sb-0pri", "discard-zero-changes",
>          "--after-sb-1pri", "consensus",
>          "--after-sb-2pri", "disconnect",
>          "--ping-timeout", "50"
>        ]
>      end
>
> NB these just translate to "drbdsetup /dev/drbdX net ..." with the IPs
> and ports automatically assigned by the script.

"drbdsetup /dev/drbdX show" to a pastebin please?

Cheers,
Florian


--
Need help with High Availability?
http://www.hastexo.com/now
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


matthew at bytemark

May 23, 2012, 12:57 PM

Post #13 of 19 (885 views)
Permalink
Re: "PingAck not received" messages [In reply to]

On 23/05/12 20:45, Florian Haas wrote:
> On Tue, May 22, 2012 at 5:05 PM, Matthew Bloch <matthew [at] bytemark> wrote:
>> I'm not using drbdadm and the helper, my "pairvm" script manages DRBD
>> for VMs using this command to attach the disc:
>>
>> drbd.setup("disk", drbd_backing_device, drbd_meta_device, 0)
>>
>> and these commands to connect the network, pretty unambitious stuff I
>> assume:
>>
>> drbd.setup("net", *(drbd_net_args + extra_options))
>> drbd.setup("syncer", "-r", "20M")
>>
>> def drbd_net_args
>>   [
>>     "#{Global.ips['here_drbd']}:#{drbd_port}",
>>     "#{Global.ips['there_drbd']}:#{drbd_port}",
>>     "B",
>>     "--after-sb-0pri", "discard-zero-changes",
>>     "--after-sb-1pri", "consensus",
>>     "--after-sb-2pri", "disconnect",
>>     "--ping-timeout", "50"
>>   ]
>> end
>>
>> NB these just translate to "drbdsetup /dev/drbdX net ..." with the IPs
>> and ports automatically assigned by the script.
>
> "drbdsetup /dev/drbdX show" to a pastebin please?

It's not that long, here's one:

$ sudo drbdsetup /dev/drbd0 show
disk {
    size 0s _is_default; # bytes
    on-io-error pass_on _is_default;
    fencing dont-care _is_default;
    max-bio-bvecs 0 _is_default;
}
net {
    timeout 60 _is_default; # 1/10 seconds
    max-epoch-size 2048 _is_default;
    max-buffers 2048 _is_default;
    unplug-watermark 128 _is_default;
    connect-int 10 _is_default; # seconds
    ping-int 10 _is_default; # seconds
    sndbuf-size 0 _is_default; # bytes
    rcvbuf-size 0 _is_default; # bytes
    ko-count 0 _is_default;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri consensus;
    after-sb-2pri disconnect _is_default;
    rr-conflict disconnect _is_default;
    ping-timeout 50; # 1/10 seconds
}
syncer {
    rate 20480k; # bytes/second
    after -1 _is_default;
    al-extents 257;
}
protocol B;
_this_host {
    device minor 0;
    disk "/dev/here/customer";
    meta-disk "/dev/here/customer_meta" [ 0 ];
    address ipv4 v3a:45000;
}
_remote_host {
    address ipv4 v3b:45000;
}

(disc names & IP addresses changed slightly)
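
As a sanity check on the units, here is a small worked example that lines these settings up against the "sock_sendmsg time expired" lines quoted earlier in the thread. It assumes drbd13 carries the same net settings as the drbd0 output above (the same script sets up both, but that is not confirmed here). The reading of the ko values as a 32-bit counter wrapping around from 0 is an inference from the arithmetic, not something stated in this thread.

```ruby
# Net settings copied from the "drbdsetup show" output above. Per its unit
# comments, timeout and ping-timeout are in tenths of a second.
NET = { "timeout" => 60, "ping-timeout" => 50, "ping-int" => 10, "ko-count" => 0 }.freeze

# Kernel timestamps and ko values of the four "sock_sendmsg time expired"
# lines for drbd13 quoted earlier in the thread.
EXPIRY_STAMPS = [1341064.126449, 1341070.129829, 1341076.133170, 1341082.133592].freeze
KO_VALUES     = [4294967295, 4294967294, 4294967293, 4294967292].freeze

def tenths_to_seconds(value)
  value / 10.0
end

# Seconds between consecutive expiry messages, rounded to 2 decimals.
def expiry_gaps(stamps)
  stamps.each_cons(2).map { |a, b| (b - a).round(2) }
end

# What an unsigned 32-bit counter shows after 1..n decrements from `start`.
def wrapped_counter(start, n)
  (1..n).map { |i| (start - i) % 2**32 }
end

puts "timeout       = #{tenths_to_seconds(NET['timeout'])} s"
puts "ping-timeout  = #{tenths_to_seconds(NET['ping-timeout'])} s"
puts "ping-int      = #{NET['ping-int']} s"
puts "expiry gaps   = #{expiry_gaps(EXPIRY_STAMPS).inspect} s"
puts "ko wraps from ko-count: #{wrapped_counter(NET['ko-count'], 4) == KO_VALUES}"
```

If the assumption holds, the messages are spaced one "timeout" apart, and ping-timeout 50 means a PingAck has 5 seconds to arrive before the connection is declared failed.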


--
Matthew



matthew at bytemark

May 24, 2012, 4:53 AM

Post #14 of 19 (865 views)
Permalink
Re: "PingAck not received" messages [In reply to]

On 24/05/12 12:30, Florian Haas wrote:
> On Wed, May 23, 2012 at 9:57 PM, Matthew Bloch <matthew [at] bytemark> wrote:
>>> "drbdsetup /dev/drbdX show" to a pastebin please?
>>
>> It's not that long, here's one:
>>
>> $ sudo drbdsetup /dev/drbd0 show
>> disk {
>> size 0s _is_default; # bytes
>> on-io-error pass_on _is_default;
>
> You want to change that to "detach". Unrelated to your PingAck problem, though.
>
>> fencing dont-care _is_default;
>
> This is a bad idea too, but since you're evidently not using a cluster
> manager at all (which happens to be a bad idea as well), it probably
> doesn't make that much of a difference. Again, unrelated to PingAck
> issues.

Hmm, thanks. Unrelated to any of this, the v3a kernel (Debian 2.6.32-4)
crashed pretty badly 48hrs ago. Since it has been rebooted - there have
been no "PingAck not received" messages.

So assuming we get a week free of these messages, I'm guessing there was
a drbd bug of some kind but the reboot cleared it up.

We are preparing to jump to a 2.6.32 sourced from CentOS because this
Debian kernel seems to crash with one bug or another every few months.

The reason we're using external meta-devices is for backup: without the
metadata at the end, the underlying disk image represents exactly what
the VMs see. We can then snapshot this and take a reasonably consistent
backup without bothering DRBD. We later verify this backup by booting
it back up, disconnected, and taking a snapshot of the VNC console!

The reason I picked protocol B is because LVM snapshots kill the local
DRBD performance if we snapshot the LVM device underlying the DRBD
Primary. If we snapshot the Secondary and used protocol B where we
weren't dependent on local write speeds, my working theory was that the
performance hit wouldn't be as noticeable, and the customer seemed to
concur (previously we were using C).

--
Matthew


florian at hastexo

May 24, 2012, 5:55 AM

Post #15 of 19 (865 views)
Permalink
Re: "PingAck not received" messages [In reply to]

On Thu, May 24, 2012 at 1:53 PM, Matthew Bloch <matthew [at] bytemark> wrote:
> Hmm, thanks. Unrelated to any of this, the v3a kernel (Debian 2.6.32-4)
> crashed pretty badly 48hrs ago. Since it has been rebooted - there have
> been no "PingAck not received" messages.

Sure, if you have kernel-induced network problems on one of your
nodes, that would definitely explain the issues you're seeing. But you
insisted from the start that there were no network issues. :)

> So assuming we get a week free of these messages, I'm guessing there was a
> drbd bug of some kind but the reboot cleared it up.

Might as well not be a DRBD bug at all, just DRBD trying to do the
right thing in the face of a flaky network stack.

> We are preparing to jump to a 2.6.32 sourced from CentOS because this Debian
> kernel seems to crash with one bug or another every few months.

That would seem like an odd thing to do. FWIW, we've been running
happily on squeeze kernels for months.

> The reason we're using external meta-devices is for backup: without the
> metadata at the end, the underlying disk image represents exactly what the
> VMs see. We can then snapshot this and take a reasonably consistent backup
> without bothering DRBD. We later verify this backup by booting it back up,
> disconnected, and taking a snapshot of the VNC console!

You can always do that from a device with metadata as well. kpartx is
your friend.

> The reason I picked protocol B is because LVM snapshots kill the local DRBD
> performance if we snapshot the LVM device underlying the DRBD Primary. If
> we snapshot the Secondary and used protocol B where we weren't dependent on
> local write speeds, my working theory was that the performance hit wouldn't
> be as noticeable, and the customer seemed to concur (previously we were
> using C).

That's a fair point, but realistically, how long does it take you to
take the backup off your snapshot? And does this normally coincide
with the DRBD device getting hammered, which is pretty much the only
situation in which a downstream client would likely feel any
disruption?

Just my two cents. Or pence. :)

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now


matthew at bytemark

May 24, 2012, 6:09 AM

Post #16 of 19 (867 views)
Permalink
Re: "PingAck not received" messages [In reply to]

On 24/05/12 13:54, Florian Haas wrote:
> On Thu, May 24, 2012 at 1:53 PM, Matthew Bloch <matthew [at] bytemark> wrote:
>> Hmm, thanks. Unrelated to any of this, the v3a kernel (Debian 2.6.32-4)
>> crashed pretty badly 48hrs ago. Since it has been rebooted - there have
>> been no "PingAck not received" messages.
>
> Sure, if you have kernel-induced network problems on one of your
> nodes, that would definitely explain the issues you're seeing. But you
> insisted from the start that there were no network issues. :)

No indeed, nothing external that we could detect after hours of layer 2
tracing, and no messages that would indicate a malfunction on either of
the hosts. But this network problem was only visible via DRBD's
messages, and if it's gone it's hard to reason about it any further (not
that I miss it). As I said I couldn't see any symptoms via ICMP or
TCP-based tests between the hosts.

>> We are preparing to jump to a 2.6.32 sourced from CentOS because this Debian
>> kernel seems to crash with one bug or another every few months.
>
> That would seem like an odd thing to do. FWIW, we've been running
> happily on squeeze kernels for months.

Then you've not hit the "scheduler divide by zero" bug or the "I/O
frozen for 120s for no reason" bug or the "CPU#x stuck for 9999999s"
bug? These are all things that are filed vaguely on the Redhat bug
trackers, as far as I know, and usually closed a few kernel versions
later with "well I haven't seen it for a few kernel versions so it's
probably OK"!

These are relatively rare bugs, except for some of our customers, when
they're not at all rare and we haul them up to e.g. whatever wheezy has.
Except in this case they broke the bridging code in 3.2.0 which is
going to cause a virtualising customer some problems :-)

>> The reason we're using external meta-devices is for backup: without the
>> metadata at the end, the underlying disk image represents exactly what the
>> VMs see. We can then snapshot this and take a reasonably consistent backup
>> without bothering DRBD. We later verify this backup by booting it back up,
>> disconnected, and taking a snapshot of the VNC console!
>
> You can always do that from a device with metadata as well. kpartx is
> your friend.

Sure, but neither do we pay a penalty for doing it externally. It's all
on LVM and proper battery-backed RAID.

>> The reason I picked protocol B is because LVM snapshots kill the local DRBD
>> performance if we snapshot the LVM device underlying the DRBD Primary. If
>> we snapshot the Secondary and used protocol B where we weren't dependent on
>> local write speeds, my working theory was that the performance hit wouldn't
>> be as noticeable, and the customer seemed to concur (previously we were
>> using C).
>
> That's a fair point, but realistically, how long does it take you to
> take the backup off your snapshot?

10-60 minutes per system. Long enough that the I/O sensitive VMs
notice. And the customer has customers who are up 24 hours a day, so
there is no reliable "quiet time" when we can reduce their I/O bandwidth
and not have it commented on.

> And does this normally coincide
> with the DRBD device getting hammered, which is pretty much the only
> situation in which a downstream client would likely feel any
> disruption?

The DRBDs don't really get hammered at any one time - the backups happen
direct from LVs on the host, and go over the main (not replication)
interface. So the host system's I/O is stressed, sure.

Previously the disconnects happened several times a day, not just when
the backups ran - this is a separate issue from the one I asked about
while still being relevant to the list.

Arguably a customer running a heavily interactive system to very remote
destinations shouldn't be using such a complex I/O stack and should use
dedicated hardware. This is a pragmatic, expensive, unambitious
argument :-) But drbd+LVM has worked very well for them for 18 months,
and the peace of mind of being able to start their customers' VMs in one
of two places makes diagnosing this properly worth the effort.

--
Matthew



florian at hastexo

May 24, 2012, 6:32 AM

Post #17 of 19 (864 views)
Permalink
Re: "PingAck not received" messages [In reply to]

On Thu, May 24, 2012 at 3:09 PM, Matthew Bloch <matthew [at] bytemark> wrote:
>>> We are preparing to jump to a 2.6.32 sourced from CentOS because this Debian
>>> kernel seems to crash with one bug or another every few months.
>>
>> That would seem like an odd thing to do. FWIW, we've been running
>> happily on squeeze kernels for months.
>
> Then you've not hit the "scheduler divide by zero" bug or the "I/O frozen
> for 120s for no reason" bug or the "CPU#x stuck for 9999999s" bug?  These
> are all things that are filed vaguely on the Redhat bug trackers, as far as
> I know, and usually closed a few kernel versions later with "well I haven't
> seen it for a few kernel versions so it's probably OK"!

I've seen the "I/O frozen for no reason" problem which seems to be an
upstream XFS issue, which Debian is hardly to blame for. The others I
personally haven't encountered. Just for clarification, what seemed
odd to me was not that you would update off the Debian stock squeeze
kernel, but that you'd consider pulling a CentOS kernel, of ostensibly
the same kernel version, into a Debian system. I'd just go to the
current Debian backports kernel. But we're going off topic. :)

> These are relatively rare bugs, except for some of our customers, when
> they're not at all rare and we haul them up to e.g. whatever wheezy has.
>  Except in this case they broke the bridging code in 3.2.0 which is going to
> cause a virtualising customer some problems :-)

Indeed.

> Previously the disconnects happened several times a day, not just when the
> backups ran - this is a separate issue from the one I asked about while
> still being relevant to the list.
>
> Arguably a customer running a heavily interactive system to very remote
> destinations shouldn't be using such a complex I/O stack and should use
> dedicated hardware.  This is a pragmatic, expensive, unambitious argument
> :-)  But drbd+LVM has worked very well for them for 18 months, and the peace
> of mind of being able to start their customers' VMs in one of two places
> makes diagnosing this properly worth the effort.

It would still be interesting to find out whether you ever saw these
random disconnects with protocol C, or whether this appears to be a
B-only issue.

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now


matthew at bytemark

May 24, 2012, 5:10 PM

Post #18 of 19 (859 views)
Permalink
Re: "PingAck not received" messages [In reply to]

On 24/05/12 14:32, Florian Haas wrote:
>> Arguably a customer running a heavily interactive system to very remote
>> destinations shouldn't be using such a complex I/O stack and should use
>> dedicated hardware. This is a pragmatic, expensive, unambitious argument
>> :-) But drbd+LVM has worked very well for them for 18 months, and the peace
>> of mind of being able to start their customers' VMs in one of two places
>> makes diagnosing this properly worth the effort.
>
> It would still be interesting to find out whether you ever saw these
> random disconnects with protocol C, or whether this appears to be a
> B-only issue.

I might make that change if it happens again - sort of annoying to lose a
difficult bug as I'm pinning it down but I'll take the quieter life for
now :)

--
Matthew


rpuglisi at regiscope

May 30, 2012, 1:08 PM

Post #19 of 19 (760 views)
Permalink
Re: "PingAck not received" messages [In reply to]

FYI - I just wanted to say that I am running DRBD version: 8.3.10 on Debian
Squeeze (Proxmox Version 2.1-8) and have been getting these messages for a
very long time now at server shutdown (and only at shutdown). It doesn't
seem to cause any problems though.
Rich Puglisi


Matthew Bloch wrote:
>
> The other problem is the "PingAck not received" messages that have been
> littering the logs of the v3a/v3b servers for the last couple of weeks,
> e.g. this has been happening every few hours for one DRBD or another:
>
> May 14 08:21:45 v3b kernel: [661127.869500] block drbd10: PingAck did
> not arrive in time.
>
> --
> Matthew

--
View this message in context: http://old.nabble.com/%22PingAck-not-received%22-messages-tp33861061p33934037.html
Sent from the DRBD - User mailing list archive at Nabble.com.

