
Mailing List Archive: DRBD: Users

Drbd : PingAsk timeout, about 10 mins.

 

 



litao5 at hisense

Aug 17, 2012, 6:37 PM

Post #1 of 14 (1332 views)
Permalink
Drbd : PingAsk timeout, about 10 mins.

Hi all,



I used DRBD 8.3.7 with an HA stack. When the Master host dies and HA switches
from Master to Slave, DRBD cannot fail over because it takes 10 minutes to
mount its partition, and that exceeds HA's limit (in HA, the default timeout
is 2 minutes).



Why does DRBD take that long?



The log is:

Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739458] block drbd1: peer(
Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate ->
DUnknown )

Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739468] block drbd1: asender
terminated

Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739470] block drbd1: Terminating
asender thread

Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739526] block drbd1: short read
expecting header on sock: r=-512

Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739666] block drbd1: Connection
closed

Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739672] block drbd1: conn(
NetworkFailure -> Unconnected )

Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739678] block drbd1: receiver
terminated

Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739680] block drbd1: Restarting
receiver thread

Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739683] block drbd1: receiver
(re)started

Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739687] block drbd1: conn(
Unconnected -> WFConnection )

Jul 22 21:06:39 QD-CS-MDC-B pengine: [17776]: info: crm_log_init: Changed
active directory to /usr/var/lib/heartbeat/cores/root

Jul 22 21:06:47 QD-CS-MDC-B kernel: [325573.727331] NET: Registered protocol
family 17

Jul 22 21:06:47 QD-CS-MDC-B kernel: [325573.768912] block drbd0: role(
Secondary -> Primary )

Jul 22 21:06:47 QD-CS-MDC-B kernel: [325573.772742] block drbd1: role(
Secondary -> Primary )

Jul 22 21:06:47 QD-CS-MDC-B kernel: [325573.772997] block drbd1: Creating
new current UUID

Jul 22 21:08:47 QD-CS-MDC-B su: (to hitv) root on none

Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032485] block drbd0: PingAck did
not arrive in time.

Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032493] block drbd0: peer(
Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate ->
DUnknown )

Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032503] block drbd0: asender
terminated

Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032506] block drbd0: Terminating
asender thread

Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032514] block drbd0: Creating
new current UUID

Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032567] block drbd0: short read
expecting header on sock: r=-512

Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032868] block drbd0: Connection
closed

Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032875] block drbd0: conn(
NetworkFailure -> Unconnected )

Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032879] block drbd0: receiver
terminated

Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032881] block drbd0: Restarting
receiver thread

Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032884] block drbd0: receiver
(re)started

Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032888] block drbd0: conn(
Unconnected -> WFConnection )

Jul 22 21:16:48 QD-CS-MDC-B kernel: [326174.600888] kjournald starting.
Commit interval 15 seconds

Jul 22 21:16:48 QD-CS-MDC-B kernel: [326174.600956] EXT3-fs warning: maximal
mount count reached, running e2fsck is recommended

Jul 22 21:16:48 QD-CS-MDC-B kernel: [326174.601330] EXT3 FS on drbd0,
internal journal

Jul 22 21:16:48 QD-CS-MDC-B kernel: [326174.601334] EXT3-fs: recovery
complete.

Jul 22 21:16:48 QD-CS-MDC-B kernel: [326174.601392] EXT3-fs: mounted
filesystem with ordered data mode.



According to the log, the timeout is in the PingAck operation.





Thanks for your help.




simon


pascal.berton3 at free

Aug 18, 2012, 3:46 AM

Post #2 of 14 (1286 views)
Permalink
Re: Drbd : PingAsk timeout, about 10 mins. [In reply to]

Hi Simon.



AFAIK, the Ping Ack error means your replication network links are either
down or subject to enough errors to prevent the nodes from reaching each
other in a timely manner. I have experienced such behavior because of bad
optical fibers, for instance, generating huge numbers of network errors. You
also have "network failure" messages in your logs and it's "Waiting for
connection". In your case I'd say the first thing to do is to test this
network: Can both nodes ping each other's address on this network? Does an
ifconfig of each interface report errors? Etc. I bet when your replication
network is up again, your cluster will run fine.



Pascal.
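[Editor's note] Pascal's two checks (ping the peer, inspect interface error
counters) can be scripted. Below is a small illustrative Python helper, a
sketch assuming the usual `ip -s link` RX:/TX: stanza layout; the sample
output and the interface name are made up for the example.

```python
import re

def error_counts(ip_s_link_output):
    """Parse 'ip -s link' output; return {iface: (rx_errors, tx_errors)}.

    Relies on the layout where the line after 'RX:'/'TX:' holds the
    counters:  bytes  packets  errors  dropped ...
    """
    counts = {}
    iface = None
    rx_err = tx_err = 0
    lines = ip_s_link_output.splitlines()
    for i, line in enumerate(lines):
        m = re.match(r'\d+:\s+(\S+?):', line)
        if m:
            iface = m.group(1)
        elif iface and line.strip().startswith('RX:'):
            rx_err = int(lines[i + 1].split()[2])
        elif iface and line.strip().startswith('TX:'):
            tx_err = int(lines[i + 1].split()[2])
            counts[iface] = (rx_err, tx_err)
    return counts

# Hypothetical sample, modeled on a flaky replication link:
SAMPLE = """\
2: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
    RX: bytes  packets  errors  dropped overrun mcast
    991273466  812731   5123    0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    481234987  603211   0       0       0       0
"""

print(error_counts(SAMPLE))  # {'eth1': (5123, 0)}
```

A steadily growing errors counter on the replication interface of either node
would point to the kind of bad-link problem Pascal describes.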





litao5 at hisense

Aug 18, 2012, 7:24 AM

Post #3 of 14 (1285 views)
Permalink
Re: Drbd : PingAsk timeout, about 10 mins. [In reply to]

Hi Pascal,

Thanks for your reply.

Yes, the network was bad. The Master host was dead, so the Slave host took over its work and mounted the DRBD partition. During the mount, the timeout occurred. But the default network timeout of DRBD is 6 seconds (it can be set in drbd.conf), and it failed to take effect. Why?

Do you have a good idea to make it switch immediately under these conditions?

Thanks.

Simon



roberto.fastec at gmail

Aug 19, 2012, 12:01 AM

Post #4 of 14 (1289 views)
Permalink
Re: Drbd : PingAsk timeout, about 10 mins. [In reply to]

Dear friends,

I would like to suggest that you edit your messages before sending them to the list.

Nowadays emails are often read on mobile devices, such as smartphones.

The editing should focus on removing (as in this thread) kilometre-long log text: what is the point of keeping multiple repetitions of it in ALL the replies?

Thank you for understanding my criticism, which is meant to be as constructive as possible.

Kind regards, and thank you very much for sharing your experiences.

Robert




litao5 at hisense

Aug 19, 2012, 6:04 PM

Post #5 of 14 (1269 views)
Permalink
Re: Drbd : PingAsk timeout, about 10 mins. [In reply to]

Hi Pascal,

Thanks for your reply.

Yes, the network was bad. The Master host was dead, so the Slave host took
over its work and mounted the DRBD partition. During the mount, the timeout
occurred. But the default network timeout of DRBD is 6 seconds (it can be
set in drbd.conf), and it failed to take effect. Why?

Do you have a good idea to make it switch from Master to Slave immediately
during such a network anomaly?

Thanks.

Simon


pascal.berton3 at free

Aug 20, 2012, 1:58 PM

Post #6 of 14 (1269 views)
Permalink
Re: Drbd : PingAsk timeout, about 10 mins. [In reply to]

Hi Simon !



Sorry for the delay, first day back at work, busy day...

Eeh yup, if the former master is down, that's effectively a fairly good
reason for having a down replication network :-) Just to better understand
your specific context, could you please let me know what "cat /proc/drbd"
reports, especially during this 10-minute blackout period if you can
reproduce it, and also "drbdsetup 0 show" and "crm configure show", just to
have the big picture of your configuration.



Regards,



Pascal.





litao5 at hisense

Aug 21, 2012, 12:40 AM

Post #7 of 14 (1252 views)
Permalink
Re: Drbd : PingAsk timeout, about 10 mins. [In reply to]

Hi Pascal,



I can't reproduce the error because the condition that triggers it is very unusual. The Master host was in a "not really dead" state (I suspect a Linux kernel panic). The TCP stack on the Master host may have been broken. For now I am not trying to avoid the condition, since I can't reproduce it; I only want the switch from Master to Slave to succeed so that my service stays available. But the switchover fails because of DRBD's 10-minute delay.



I ran "drbdsetup 0 show" on my host; it shows the following:



disk {
    size                0s _is_default; # bytes
    on-io-error         detach;
    fencing             dont-care _is_default;
    max-bio-bvecs       0 _is_default;
}
net {
    timeout             60 _is_default; # 1/10 seconds
    max-epoch-size      2048 _is_default;
    max-buffers         2048 _is_default;
    unplug-watermark    128 _is_default;
    connect-int         10 _is_default; # seconds
    ping-int            10 _is_default; # seconds
    sndbuf-size         0 _is_default; # bytes
    rcvbuf-size         0 _is_default; # bytes
    ko-count            0 _is_default;
    allow-two-primaries;
    after-sb-0pri       discard-least-changes;
    after-sb-1pri       discard-secondary;
    after-sb-2pri       disconnect _is_default;
    rr-conflict         disconnect _is_default;
    ping-timeout        5 _is_default; # 1/10 seconds
}
syncer {
    rate        102400k; # bytes/second
    after       -1 _is_default;
    al-extents  257;
}
protocol C;
_this_host {
    device      minor 0;
    disk        "/dev/cciss/c0d0p7";
    meta-disk   internal;
    address     ipv4 172.17.5.152:7900;
}
_remote_host {
    address     ipv4 172.17.5.151:7900;
}





In that listing, note "timeout 60 _is_default; # 1/10 seconds".
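[Editor's note] A side note on units, since they are easy to misread: in
DRBD's net section, "timeout" and "ping-timeout" are expressed in tenths of
a second, while "ping-int" is in whole seconds. The arithmetic below is a
sketch using the defaults from the drbdsetup listing; it shows why
"timeout 60" is the 6 seconds Simon refers to, not 60.

```python
# DRBD net options mix units: 'timeout' and 'ping-timeout' are counted in
# tenths of a second; 'ping-int' and 'connect-int' are counted in seconds.
TENTHS = 0.1

timeout      = 60 * TENTHS  # 6.0 s  -- the 6-second default Simon refers to
ping_int     = 10           # 10 s   -- interval between keep-alive pings
ping_timeout = 5 * TENTHS   # 0.5 s  -- how long to wait for the PingAck

# Rough worst case to notice a silent peer on an otherwise idle link:
# wait out the current ping interval, then the ping-timeout.
worst_case_detect = ping_int + ping_timeout
print(worst_case_detect)  # 10.5
```

None of these defaults adds up to 10 minutes, which is consistent with
Lars's later point in the thread that the delay came from the half-dead
peer holding the connection formally alive, not from these timers.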





Thanks.



Simon





lars.ellenberg at linbit

Aug 21, 2012, 3:50 AM

Post #8 of 14 (1252 views)
Permalink
Re: Drbd : PingAsk timeout, about 10 mins. [In reply to]

On Tue, Aug 21, 2012 at 03:40:34PM +0800, simon wrote:
> Hi Pascal,
>
>
>
> I can’t reproduce the error because the condition that triggers it is
> very special. The Master host is in a “not really dead” state
> (I suspect it is a Linux panic). The TCP stack may be bad on the Master
> host. I don’t want to work around it, because I can’t reproduce it. I
> only want to switch from Master to Slave successfully so that my
> service can continue normally. But I can’t switch properly because
> of the 10-minute delay in DRBD.

Well. If it was "not real dead", then I'd suspect that the DRBD
connection was still "sort of up", and thus DRBD saw the other node as
Primary still, and correctly refused to be promoted locally.


To have your cluster recover from a "almost but not quite dead node"
scenario, you need to add stonith aka node level fencing to your
cluster stack.
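[As a rough sketch of what node-level fencing (stonith) can look like in a Pacemaker cluster; the device type, host names, and credentials below are hypothetical placeholders, and the right stonith plugin depends on your hardware:

```
# crm configure fragment -- illustrative only
primitive st-node-a stonith:external/ipmi \
    params hostname="node-a" ipaddr="172.17.5.150" userid="admin" passwd="secret"
location st-node-a-not-on-a st-node-a -inf: node-a
property stonith-enabled="true"
```

With something like this in place, the surviving node can power off an "almost dead" peer before promoting DRBD.]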


> I run “drbdsetup 0 show” on my host, it shows as following,
>
> disk {
> size 0s _is_default; # bytes
> on-io-error detach;
> fencing dont-care _is_default;
> max-bio-bvecs 0 _is_default;
> }
>
> net {
> timeout 60 _is_default; # 1/10 seconds
> max-epoch-size 2048 _is_default;
> max-buffers 2048 _is_default;
> unplug-watermark 128 _is_default;
> connect-int 10 _is_default; # seconds
> ping-int 10 _is_default; # seconds
> sndbuf-size 0 _is_default; # bytes
> rcvbuf-size 0 _is_default; # bytes
> ko-count 0 _is_default;
> allow-two-primaries;


Uh. You are sure about that?

Two primaries, and dont-care for fencing?

You are aware that you just subscribed to data corruption, right?

If you want two primaries, you MUST have proper fencing,
on both the cluster level (stonith) and the drbd level (fencing
resource-and-stonith; fence-peer handler: e.g. crm-fence-peer.sh).
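[On the DRBD side, that advice translates to roughly the following drbd.conf fragment; the handler paths are those commonly shipped with DRBD 8.3 for Pacemaker integration, so adjust them to your installation:

```
resource r0 {
  disk {
    fencing resource-and-stonith;   # freeze I/O until the peer is fenced
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```
]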

> after-sb-0pri discard-least-changes;
> after-sb-1pri discard-secondary;

And here you configure automatic data loss.
Which is ok, as long as you are aware of that and actually mean it...


>
> after-sb-2pri disconnect _is_default;
> rr-conflict disconnect _is_default;
> ping-timeout 5 _is_default; # 1/10 seconds
> }
>
> syncer {
> rate 102400k; # bytes/second
> after -1 _is_default;
> al-extents 257;
> }
>
> protocol C;
> _this_host {
> device minor 0;
> disk "/dev/cciss/c0d0p7";
> meta-disk internal;
> address ipv4 172.17.5.152:7900;
> }
>
> _remote_host {
> address ipv4 172.17.5.151:7900;
> }
>
>
>
>
>
> In the list , there is “timeout 60 _is_default; # 1/10 seconds”.

Then guess what, maybe the timeout did not trigger,
because the peer was still "sort of" responsive?
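[One way to see what DRBD itself currently believes about the peer is the connection state (cs:) in /proc/drbd. A small shell sketch that parses the cs: field; it runs against a sample 8.3-style status line here, whereas on a live node you would read /proc/drbd directly:

```shell
# Extract the connection state (cs:) from a /proc/drbd style status line.
# On a live node: grep 'cs:' /proc/drbd
sample=' 1: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----'
cs=$(printf '%s\n' "$sample" | sed -n 's/.*cs:\([A-Za-z]*\).*/\1/p')
echo "connection state: $cs"
```

If this still shows Connected while the peer is supposedly dead, the timeouts have indeed not fired.]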


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


litao5 at hisense

Aug 22, 2012, 6:45 PM

Post #9 of 14 (1235 views)
Permalink
Re: Drbd : PingAsk timeout, about 10 mins. [In reply to]

Hi Lars Ellenberg,

The Master host has two network cards, eth0 and eth1. DRBD uses eth0. "Not
real dead" means eth0 is dead (this can be seen in the HA log). Eth1 responds
to ping, but I can't log in by SSH.
So I think maybe Linux has panicked.

Eth0 is dead, but DRBD can't detect it and return immediately. Why?

Thanks.





lars.ellenberg at linbit

Aug 23, 2012, 3:19 AM

Post #10 of 14 (1237 views)
Permalink
Re: Drbd : PingAsk timeout, about 10 mins. [In reply to]

On Thu, Aug 23, 2012 at 09:45:21AM +0800, simon wrote:
> Hi Lars Ellenberg,
>
> The Master host has two network cards, eth0 and eth1. DRBD uses eth0. "Not
> real dead" means eth0 is dead (this can be seen in the HA log). Eth1 responds
> to ping, but I can't log in by SSH.
> So I think maybe Linux has panicked.
>
> Eth0 is dead, but DRBD can't detect it and return immediately. Why?

As I said, most likely because eth0 was still not as dead as you thought it was.

And read again what I said about fencing and stonith.

> Thanks.

Cheers.



--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed


litao5 at hisense

Aug 23, 2012, 5:53 PM

Post #11 of 14 (1223 views)
Permalink
Re: Drbd : PingAsk timeout, about 10 mins. [In reply to]


Hi Lars,

I can confirm that eth0 is disconnected, because the hosts can't ping each other.

Now I only want to know how to switch over immediately, and why the timeout
option doesn't take effect.

Can you tell me some implementation details of the DRBD code?
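[For what it's worth, the knob aimed at exactly this "link is up but the peer never answers" case is ko-count, which was 0 (disabled) in the configuration shown earlier in the thread. A hedged drbd.conf sketch with illustrative values:

```
resource r0 {
  net {
    timeout  60;  # 1/10 seconds, i.e. 6 s per request
    ko-count  3;  # after 3 consecutive timed-out requests, drop the peer
  }
}
```

This is a sketch of the mechanism, not a recommendation for these exact values.]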

Thanks

simon




pascal.berton3 at free

Aug 24, 2012, 1:17 AM

Post #12 of 14 (1219 views)
Permalink
Re: Drbd : PingAsk timeout, about 10 mins. [In reply to]

Simon,

You have only sent me the results of "drbdsetup 0 show"; you forgot to send the result of "crm configure show" that I had also asked for. Could you please send it too? I would also like to see what "ifconfig" reports for your 2 replication interfaces (the ones on subnet 172.17). Please send it along.
And finally, what type of application do you host on this cluster? What kind of filesystem do you have on your DRBD resources?

Apart from that, Lars has spotted a couple of issues in your DRBD configuration. Have you addressed them? Namely, the dual-primary configuration and the rest.

Please send the above information to help us understand your whole setup more clearly.

Regards,

Pascal.





litao5 at hisense

Aug 26, 2012, 8:11 PM

Post #13 of 14 (1207 views)
Permalink
Re: Drbd : PingAsk timeout, about 10 mins. [In reply to]

Hi Pascal,

Sorry for my late reply; there has been too much work to do recently.

I am sending you the results of 'drbdsetup 0 show' again:

disk {
size 0s _is_default; # bytes
on-io-error detach;
fencing dont-care _is_default;
max-bio-bvecs 0 _is_default;
}
net {
timeout 60 _is_default; # 1/10 seconds
max-epoch-size 2048 _is_default;
max-buffers 2048 _is_default;
unplug-watermark 128 _is_default;
connect-int 10 _is_default; # seconds
ping-int 10 _is_default; # seconds
sndbuf-size 0 _is_default; # bytes
rcvbuf-size 0 _is_default; # bytes
ko-count 0 _is_default;
allow-two-primaries;
after-sb-0pri discard-least-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect _is_default;
rr-conflict disconnect _is_default;
ping-timeout 5 _is_default; # 1/10 seconds
}
syncer {
rate 102400k; # bytes/second
after -1 _is_default;
al-extents 257;
}
protocol C;
_this_host {
device minor 0;
disk "/dev/cciss/c0d0p7";
meta-disk internal;
address ipv4 192.168.1.2:7900;
}
_remote_host {
address ipv4 192.168.1.1:7900;
}

"crm configure show" isn't excused on my computer because I didn't install Pacemaker.

"fconfig" is :

eth0 Link encap:Ethernet HWaddr 3C:D9:2B:07:8A:42
inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:329137 errors:0 dropped:0 overruns:0 frame:0
TX packets:115697 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:463432396 (441.9 Mb) TX bytes:13923644 (13.2 Mb)
Interrupt:16 Memory:f4000000-f4012800

eth1 Link encap:Ethernet HWaddr 3C:D9:2B:07:8A:44
inet addr:172.17.5.152 Bcast:172.17.5.255 Mask:255.255.255.128
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:5530 errors:0 dropped:0 overruns:0 frame:0
TX packets:3375 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:677856 (661.9 Kb) TX bytes:645750 (630.6 Kb)
Interrupt:17 Memory:f2000000-f2012800

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:716 errors:0 dropped:0 overruns:0 frame:0
TX packets:716 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:142793 (139.4 Kb) TX bytes:142793 (139.4 Kb)


My file system on DRBD partition is EXT3.

Thanks.

simon








ff at mpexnet

Aug 28, 2012, 2:41 AM

Post #14 of 14 (1181 views)
Permalink
Re: Drbd : PingAsk timeout, about 10 mins. [In reply to]

Hi,

On 08/19/2012 09:01 AM, roberto.fastec [at] gmail wrote:
> The editing phase should focus on removing (for example, in this thread) such kilometric log text:
> what is the sense of keeping multiple repetitions of it in ALL the replies?

while I more or less sympathize, it makes me giggle that on my PC (with a
rather large screen, even), your mail is 12 pages long, including all
those quotes you're criticizing.
