Mailing List Archive: DRBD: Users

strange checksum error

 

 



akos.csurai at ericsson

Aug 1, 2012, 1:02 AM

strange checksum error

Hi,

We have experienced a strange replication problem since we started using protocol B.
The scenario is as follows:

Some binary files are saved to the replicated IO pair (kernel 3.0.13,
drbd-8.3.12, protocol B, ext3).
Later they are copied to another (also replicated) directory.
They are still consistent, and there is no problem until io1 (the
current Primary) is rebooted.
Strangely, it takes a reboot; a forced role change does not trigger the
symptom.
io2 takes over the Primary role, and when the cluster starts using the binary
files they show checksum errors.
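
One way to pin down when the files go bad is to record a digest of each file while io1 is still Primary and compare it on io2 after the failover. A minimal Python sketch, assuming SHA-256 (the thread does not say which checksum the cluster uses); the function names are hypothetical:

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(path_a, path_b):
    """True if both files have the same SHA-256 digest."""
    return file_sha256(path_a) == file_sha256(path_b)
```

A comparison made on the node that wrote the files may be served from the page cache, so the digest taken on io2 after it becomes Primary is the one that reflects what actually reached the peer's disk.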

We turned off the write cache on the SAS disks (sdparam --set WCE=0
/dev/sda)
and the symptom seemed to disappear, but later it surfaced again.
The corrupted binary files have a hole of about 40 kbytes filled with zeros.
Yes, it could be a HW issue, but we did not see it with protocol C
(which unfortunately is extremely slow on our system).
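
To locate such holes without comparing against a known-good copy, a file can be scanned for long runs of all-zero blocks. The following is only a sketch: the 4096-byte block size and the 32 KiB threshold are assumptions chosen to catch a hole of roughly 40 kbytes, and `find_zero_runs` is a hypothetical helper name.

```python
def find_zero_runs(path, block_size=4096, min_bytes=32 * 1024):
    """Return (offset, length) for each run of all-zero blocks >= min_bytes."""
    runs = []
    run_start = None  # offset where the current zero run began, if any
    offset = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            if block.count(0) == len(block):
                # Block is entirely zero bytes: start or extend a run.
                if run_start is None:
                    run_start = offset
            else:
                if run_start is not None and offset - run_start >= min_bytes:
                    runs.append((run_start, offset - run_start))
                run_start = None
            offset += len(block)
    # A run that reaches the end of the file is reported as well.
    if run_start is not None and offset - run_start >= min_bytes:
        runs.append((run_start, offset - run_start))
    return runs
```

Binary files can legitimately contain zero padding, so a hit is a hint to inspect that offset, not proof of corruption.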

Has anyone seen something similar?

Thanks,
Akos



_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


ff at mpexnet

Aug 1, 2012, 1:22 AM

Re: strange checksum error [In reply to]

Hi,

On 08/01/2012 10:02 AM, Csurai Akos wrote:
> The corrupted binary files have a hole of about 40 kbytes filled with zeros.
> Yes, it could be a HW issue, but we did not see it with protocol C
> (which unfortunately is extremely slow on our system).
>
> Has anyone seen something similar?

No. Very strange, seeing as sync protocols should have nothing to do
with it.

Why is C slower than B in your setup? Does your secondary have an
inferior I/O stack? I'd advise to match the peers' hardware and switch
back to protocol C in that case. Otherwise, you should really try to
find out what's making it slow, because it shouldn't be.

Cheers,
Felix


akos.csurai at ericsson

Aug 1, 2012, 1:46 AM

Re: strange checksum error [In reply to]

On 08/01/12 10:22, Felix Frank wrote:
> Hi,
>
> On 08/01/2012 10:02 AM, Csurai Akos wrote:
>> The corrupted binary files have a hole of about 40 kbytes filled with zeros.
>> Yes, it could be a HW issue, but we did not see it with protocol C
>> (which unfortunately is extremely slow on our system).
>>
>> Has anyone seen something similar?
> No. Very strange, seeing as sync protocols should have nothing to do
> with it.
>
> Why is C slower than B in your setup?
:-\ As far as I understand, our cluster uses a special version of the NFSv3
client code
that works like a "latency test" from DRBD's point of view:
it writes everything in small pieces and commits each one immediately.
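
The access pattern described here (a small write followed by an immediate commit) can be approximated with a write-plus-fsync loop to see how much each commit costs on a given device. This is a rough sketch and not the cluster's NFSv3 client; the function name, the 512-byte payload and the default count are assumptions.

```python
import os
import time


def timed_sync_writes(path, count=100, size=512):
    """Write `count` small chunks, fsync after each, return per-write seconds."""
    payload = b"x" * size
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to stable storage before continuing
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies
```

Running it against a file on the DRBD-backed filesystem under protocol B and then under protocol C would show the per-commit latency difference directly, since every fsync has to wait for whatever acknowledgement the protocol requires.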

> Does your secondary have an
> inferior I/O stack? I'd advise to match the peers' hardware and switch
Yes, the peers have the same HW.
> back to protocol C in that case. Otherwise, you should really try to
> find out what's making it slow, because it shouldn't be.
Agreed, but we need to make sure that protocol C really solves the problem and
does not just hide it or reduce its probability.

> Cheers,
> Felix
>
Akos




ff at mpexnet

Aug 1, 2012, 2:00 AM

Re: strange checksum error [In reply to]

Hi,

On 08/01/2012 10:46 AM, Csurai Akos wrote:
>> Why is C slower than B in your setup?
> :-\ As far as I understand, our cluster uses a special version of the NFSv3
> client code
> that works like a "latency test" from DRBD's point of view:
> it writes everything in small pieces and commits each one immediately.

> Yes, the peers have the same HW.

Hmm, then I suppose network latency is your bottleneck. I guess my
earlier statement (C should not be slower than B with equal hardware)
was false. B protocol can hide some network latency because it allows
for parallel network communication and disk sync.
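
A back-of-the-envelope model shows the effect. It assumes that a write under protocol B completes once the local disk write is done and the peer has acknowledged receiving the data, while under protocol C the peer acknowledges only after its own disk write. The function name and all timings below are hypothetical, and real behaviour depends on queueing, barriers and batching that this sketch ignores.

```python
def write_latency_ms(protocol, local_disk_ms, one_way_net_ms, remote_disk_ms):
    """Estimate the completion time of one synchronous write, in milliseconds."""
    round_trip = 2 * one_way_net_ms
    if protocol == "B":
        # Peer acknowledges on receipt; its disk write is not waited for.
        remote_ack = round_trip
    elif protocol == "C":
        # Peer acknowledges only after its disk write has completed.
        remote_ack = round_trip + remote_disk_ms
    else:
        raise ValueError("protocol must be 'B' or 'C'")
    # The local disk write proceeds in parallel with the replication path.
    return max(local_disk_ms, remote_ack)


# Hypothetical numbers: 1 ms disk writes on both nodes, 0.25 ms one-way network.
latency_b = write_latency_ms("B", 1.0, 0.25, 1.0)  # 1.0 ms
latency_c = write_latency_ms("C", 1.0, 0.25, 1.0)  # 1.5 ms
```

For a workload that issues one small write and waits for it before issuing the next, throughput is roughly the inverse of this latency, so under this model the network round trip is paid in full on every write with protocol C.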

Maybe you can optimize your network performance somehow?

You're using SLES11SP2? What network hardware is this? For example,
we've found that the in-tree e1000e driver is sort of old, critically so
in 2.6.32, but even 3.2 is not really up-to-date, 3.0 presumably less so.

> Agreed, but we need to make sure that protocol C really
> solves the problem and
> does not just hide it or reduce its probability.

Good point.

Best,
Felix
