
Mailing List Archive: DRBD: Users

the timing of restarting thread

 

 



tsukishima.ha at gmail

Jul 23, 2010, 2:15 AM

Post #1 of 8 (801 views)
the timing of restarting thread

Hi,

I'm trying the following test.

(1) start DRBD.
node01 is "Primary" and node02 is "Secondary".
(2) block the replication port on node02.
# iptables -A INPUT -i bond0 -p tcp --dport 7790 -j DROP

The result was:

* protocol B,C
DRBD did nothing.

* protocol A
It seems that DRBD restarted its threads.

Q1, protocol A is only able to restart the threads, right?
If so, which parameter controls the timing of the restart: connect-int in
drbd.conf?

Q2, Will both the receiver and asender threads restart with new PIDs?
syslog said;

Terminating asender thread
Restarting receiver thread
Starting asender thread (from drbd0_receiver [27363])


--- syslog on node2 ---
Jul  9 15:36:50 dl380g5d kernel: block drbd0: PingAck did not arrive in time.
Jul  9 15:36:50 dl380g5d kernel: block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jul  9 15:36:50 dl380g5d kernel: block drbd0: asender terminated
Jul  9 15:36:50 dl380g5d kernel: block drbd0: Terminating asender thread
Jul  9 15:36:50 dl380g5d kernel: block drbd0: sock was shut down by peer
Jul  9 15:36:50 dl380g5d kernel: block drbd0: short read expecting header on sock: r=0
Jul  9 15:36:50 dl380g5d kernel: block drbd0: Connection closed
Jul  9 15:36:50 dl380g5d kernel: block drbd0: conn( NetworkFailure -> Unconnected )
Jul  9 15:36:50 dl380g5d kernel: block drbd0: receiver terminated
Jul  9 15:36:50 dl380g5d kernel: block drbd0: Restarting receiver thread
Jul  9 15:36:50 dl380g5d kernel: block drbd0: receiver (re)started
Jul  9 15:36:50 dl380g5d kernel: block drbd0: conn( Unconnected -> WFConnection )
Jul  9 15:37:03 dl380g5d kernel: block drbd0: Handshake successful: Agreed network protocol version 94
Jul  9 15:37:03 dl380g5d kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Jul  9 15:37:03 dl380g5d kernel: block drbd0: conn( WFConnection -> WFReportParams )
Jul  9 15:37:03 dl380g5d kernel: block drbd0: Starting asender thread (from drbd0_receiver [27363])
Jul  9 15:37:03 dl380g5d kernel: block drbd0: data-integrity-alg: <not-used>
Jul  9 15:37:03 dl380g5d kernel: block drbd0: drbd_sync_handshake:
Jul  9 15:37:03 dl380g5d kernel: block drbd0: self 685D700FC6364C62:0000000000000000:F4D1EC9C726CF3F4:0E41BFAE2CA8CCD1 bits:0 flags:0
Jul  9 15:37:03 dl380g5d kernel: block drbd0: peer 0A6B6BF917641AF1:685D700FC6364C63:F4D1EC9C726CF3F4:0E41BFAE2CA8CCD1 bits:0 flags:0
Jul  9 15:37:03 dl380g5d kernel: block drbd0: uuid_compare()=-1 by rule 50
Jul  9 15:37:03 dl380g5d kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Jul  9 15:37:03 dl380g5d kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
Jul  9 15:37:03 dl380g5d kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Jul  9 15:37:03 dl380g5d kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Jul  9 15:37:03 dl380g5d kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Jul  9 15:37:03 dl380g5d kernel: block drbd0: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
Jul  9 15:37:03 dl380g5d kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Jul  9 15:37:03 dl380g5d kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Jul  9 15:37:03 dl380g5d kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
Jul  9 15:37:03 dl380g5d kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
Jul  9 15:37:03 dl380g5d kernel: block drbd0: Connected in w_make_resync_request
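
The sequence of connection states in a log like this is easier to follow when only the "conn( Old -> New )" parts are pulled out. The following is a minimal sketch, not part of DRBD; the function name conn_transitions is made up for illustration, and it assumes each kernel message sits on a single line, as syslog writes it (mail-wrapped lines would have to be joined first).

```shell
#!/bin/sh
# Print every DRBD connection-state transition found in kernel log text
# read from stdin, one "Old -> New" pair per line.
# Assumption: messages use the "conn( Old -> New )" format seen in this
# thread and are not wrapped across lines.
conn_transitions() {
    # grep -o keeps only the matching part of each line;
    # sed then strips the "conn( " prefix and " )" suffix.
    grep -o 'conn( [A-Za-z]* -> [A-Za-z]* )' |
        sed -e 's/^conn( //' -e 's/ )$//'
}

# Example (the log path is an assumption, it differs between distributions):
#   grep 'block drbd0' /var/log/messages | conn_transitions
```

For the log in this post, the output would start with "Connected -> NetworkFailure" and end with "SyncTarget -> Connected".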

--- netstat on node2---

# date; iptables -A INPUT -i bond1 -p tcp --dport 7790 -j DROP
15:36:48 JST

# date; netstat -an | grep 7790
15:36:48 JST
tcp        0      0 192.168.101.44:64825        192.168.101.43:7790         ESTABLISHED
tcp        0      0 192.168.101.44:7790         192.168.101.43:41946        ESTABLISHED

# date; netstat -an | grep 7790
15:36:50 JST
tcp        0      0 192.168.101.44:7790         0.0.0.0:*                   LISTEN
tcp        0      9 192.168.101.44:7790         192.168.101.43:41946        FIN_WAIT1

# netstat -an | grep 7790; date
15:36:57 JST
tcp        0      0 192.168.101.44:7790         0.0.0.0:*                   LISTEN
tcp        0      9 192.168.101.44:7790         192.168.101.43:41946        FIN_WAIT1
tcp        0      0 192.168.101.44:38648        192.168.101.43:7790         ESTABLISHED

# netstat -an | grep 7790; date
15:37:04 JST
tcp        0      0 192.168.101.44:58916        192.168.101.43:7790         ESTABLISHED
tcp        0      9 192.168.101.44:7790         192.168.101.43:41946        FIN_WAIT1
tcp        0      0 192.168.101.44:38648        192.168.101.43:7790         ESTABLISHED



Thanks,
Junko IKEDA

NTT DATA INTELLILINK CORPORATION
_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


lars.ellenberg at linbit

Jul 23, 2010, 6:17 AM

Post #2 of 8 (761 views)
Permalink
Re: the timing of restarting thread [In reply to]

On Fri, Jul 23, 2010 at 06:15:05PM +0900, Junko IKEDA wrote:
> Hi,
>
> I'm trying the following test.
>
> (1) start DRBD.
> node01 is "Primary" and node02 is "Secondary".
> (2) block the replication port on node02.
> # iptables -A INPUT -i bond0 -p tcp --dport 7790 -j DROP

insufficient.
you have to block OUTPUT as well.

DRBD has _two_ tcp sessions per device,
one end will have a "random high port",
the other end the configured port.
There is nothing that guarantees which node ends up with which end,
typically both have one high port, one configured, but that is
by no means necessary, it can just as well end up with
one node having both configured ports, the other both high ports.
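
To see which end of each session holds the configured port on a given node, the netstat output can be summarised as below. This is only a sketch: the function name drbd_sessions is invented here, it assumes Linux "netstat -an" column layout with one connection per line, and it cannot tell the "data" socket from the "meta" socket.

```shell
#!/bin/sh
# Summarise established TCP sessions that involve the configured DRBD port.
# Reads "netstat -an" style output on stdin; $1 is the configured port.
# Assumption: columns are proto, recv-q, send-q, local, remote, state.
drbd_sessions() {
    awk -v port="$1" '
        $1 == "tcp" && $6 == "ESTABLISHED" {
            # The port is the last colon-separated field of each address.
            n = split($4, l, ":"); lport = l[n]
            m = split($5, r, ":"); rport = r[m]
            if (lport == port)
                print "local=configured remote=high", $4, $5
            else if (rport == port)
                print "local=high remote=configured", $4, $5
        }'
}

# Example, using the port from this thread:
#   netstat -an | drbd_sessions 7790
```

A firewall rule that matches only "--dport 7790" on INPUT catches just the session whose local end is the configured port, which is why a rule for one direction is not enough.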

> the result is;
>
> * protocol B,C
> DRBD did nothing.

You _by chance_ only blocked the "data" socket.

> * protocol A
> It seems that DRBD restarted its threads.

You _by chance_ happened to block the "meta" socket.

> Q1, protocol A is only able to restart the threads, right?

wrong question, no answer.

> if so, which parameter handles the timing of restarting, connect-int in drbd.conf?

man drbdsetup.
online: http://www.drbd.org/users-guide/re-drbdsetup.html


> Q2, Both of receiver and asender thread will restart with new PID?
> syslog said;
>
> Terminating asender thread
> Restarting receiver thread
> Starting asender thread (from drbd0_receiver [27363])

irrelevant.

> --- netstat on node2---
>
> # date; iptables -A INPUT -i bond1 -p tcp --dport 7790 -j DROP
> 15:36:48 JST

I suggest preparing it like this:
iptables -N simulbreak
for c in INPUT OUTPUT ; do
    for d in sport dport ; do
        iptables -I $c -p tcp --$d 7790 -j simulbreak
    done
done

then break it with "iptables -I simulbreak -j DROP",
heal it with "iptables -I simulbreak -j ACCEPT".
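
The same preparation can be wrapped in a small script that prints the commands first, so they can be reviewed before anything touches the firewall. This is a sketch only: the function names and the RUN and PORT variables are made up for illustration, and 7790 is simply the port used in this thread.

```shell
#!/bin/sh
# Dry-run wrapper around the "simulbreak" chain idea.
# Commands are only printed unless RUN=1 is set (which needs root).
PORT=${PORT:-7790}   # replication port from this thread; adjust as needed

run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# Create the chain and send all TCP traffic for PORT through it,
# in both directions and for both source and destination port.
simulbreak_setup() {
    run iptables -N simulbreak
    for chain in INPUT OUTPUT; do
        for dir in sport dport; do
            run iptables -I "$chain" -p tcp "--$dir" "$PORT" -j simulbreak
        done
    done
}

# Break and heal the link by inserting a rule at the top of the chain.
simulbreak_break() { run iptables -I simulbreak -j DROP; }
simulbreak_heal()  { run iptables -I simulbreak -j ACCEPT; }

# Example:
#   simulbreak_setup          # prints the five iptables commands
#   RUN=1 simulbreak_setup    # actually applies them
```

Because both break and heal insert at the top of the chain, repeated cycles leave old rules behind; flushing the chain with "iptables -F simulbreak" between tests keeps it tidy.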

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed


tsukishima.ha at gmail

Jul 23, 2010, 8:13 AM

Post #3 of 8 (751 views)
Re: the timing of restarting thread [In reply to]

Hi,

>> (1) start DRBD.
>>      node01 is "Primary" and node02 is "Secondary".
>> (2) block the replication port on node02.
>>      # iptables -A INPUT -i bond0 -p tcp --dport 7790 -j DROP
>
> insufficient.
> you have to block OUTPUT as well.

Blocking both INPUT and OUTPUT goes to split brain, doesn't it?

> DRBD has _two_ tcp sessions per device,
> one end will have a "random high port",
> the other end the configured port.

Are these two sessions for the "data" and "meta" sockets that you mention below?
I think I want to simulate blocking of the "meta" socket.

DRBD cannot replicate data if the "data" socket is blocked,
and DRBD opens a new socket if the "meta" socket is blocked.
Is that right?

>> if so, which parameter handles the timing of restarting, connect-int in drbd.conf?
>
> man drbdsetup.
> online: http://www.drbd.org/users-guide/re-drbdsetup.html

It seems that connect-int has some effect,
but I could not find the right parameter...



lars.ellenberg at linbit

Jul 24, 2010, 3:32 AM

Post #4 of 8 (738 views)
Re: the timing of restarting thread [In reply to]

On Sat, Jul 24, 2010 at 12:13:01AM +0900, Junko IKEDA wrote:
> Hi,
>
> >> (1) start DRBD.
> >>      node01 is "Primary" and node02 is "Secondary".
> >> (2) block the replication port on node02.
> >>      # iptables -A INPUT -i bond0 -p tcp --dport 7790 -j DROP
> >
> > insufficient.
> > you have to block OUTPUT as well.
>
> Blocking both INPUT and OUTPUT goes to split brain, doesn't it?

What does split brain have to do with it?
You seem to be trying to provoke replication link breakage.
Unless you also break all the comm links of your cluster management, or
you run drbd in dual-primary with unfortunately chosen settings, that
will have nothing to do with "split brain".

> > DRBD has _two_ tcp sessions per device,
> > one end will have a "random high port",
> > the other end the configured port.
>
> Are these two sessions for "data" and "meta" socket as you mentioned below?
> I think I want to simulate the blocking of "meta" socket.

Ah. Why?
Please step back a bit and suggest which _real world_ scenario
you have in mind. What is it that you are trying to prove or analyse?

Apart from sniffing the traffic, there is no easy way to
determine which is which just from looking at it.

> DRBD can not replicate the data if "data" socket is blocked
> and DRBD reopen the new socket if "meta" socket is blocked,
> Is that right?

No.
If one of the sockets is detected to not work,
both are dropped, and eventually reestablished.

> >> if so, which parameter handles the timing of restarting, connect-int in drbd.conf?
> >
> > man drbdsetup.
> > online: http://www.drbd.org/users-guide/re-drbdsetup.html
>
> It seems that connect-int has some effect,
> but I could not find the right parameter...

There is no right parameter.
There are quite a few parameters that all have influence on
when a connection loss may be _detected_, also depending on current
replication traffic and mode of connection failure.
timeout, ping-timeo, ping-int, ko-count,
maybe more that I forget right now.
connect-int influences how often drbd changes between listen() and
connect() when trying to establish a connection.
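
As a rough illustration of how these settings combine, the arithmetic below estimates detection times. It is a sketch under stated assumptions, not a description of DRBD internals: the function names are invented, the units are taken from the DRBD 8.3 drbd.conf man page as I understand it (timeout and ping-timeout in tenths of a second, ping-int in seconds), and the real detection time also depends on traffic and on how the link fails.

```shell
#!/bin/sh
# Back-of-the-envelope detection-time estimates from net-section settings.
# Assumed units (check the man page of your DRBD version):
#   timeout, ping-timeout: tenths of a second;  ping-int: seconds.

# Convert tenths of a second to a "S.T" string.
tenths_to_s() { echo "$(( $1 / 10 )).$(( $1 % 10 ))"; }

# Idle link that silently drops packets: the next keep-alive may be up to
# ping-int away, then the PingAck has ping-timeout to arrive.
# args: ping-int (s), ping-timeout (0.1 s)
idle_detect() { tenths_to_s $(( $1 * 10 + $2 )); }

# Data socket stalled while pings are still answered: the peer is given
# up on after ko-count times timeout; ko-count 0 disables this check.
# args: ko-count, timeout (0.1 s)
stall_detect() {
    if [ "$1" -gt 0 ]; then tenths_to_s $(( $1 * $2 )); else echo "never"; fi
}

# Example, with values assumed to be the 8.3 defaults
# (ping-int 10, ping-timeout 5, timeout 60, ko-count 0):
#   idle_detect 10 5     # seconds until a dead idle link is noticed
#   stall_detect 0 60    # prints "never"
```

With ko-count set to 1 and timeout left at 60, stall_detect gives 6.0 seconds, but only while there are outstanding writes for the timeout to apply to.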



tsukishima.ha at gmail

Jul 24, 2010, 9:38 AM

Post #5 of 8 (743 views)
Re: the timing of restarting thread [In reply to]

Hi,

>> > DRBD has _two_ tcp sessions per device,
>> > one end will have a "random high port",
>> > the other end the configured port.
>>
>> Are these two sessions for "data" and "meta" socket as you mentioned below?
>> I think I want to simulate the blocking of "meta" socket.
>
> Ah.  Why?
> Please step back a bit and suggest which _real world_ scenario
> you have in mind. What is it that you are trying to prove or analyse?
>
> Apart from sniffing the traffic, there is no easy way to
> determine which is which just from looking at it.

I want to reproduce the following situation:
the Primary can send "data" to the Secondary,
but only the "meta data" is dropped.
It might be an unrealistic worry...

>> DRBD can not replicate the data if "data" socket is blocked
>> and DRBD reopen the new socket if "meta" socket is blocked,
>> Is that right?
>
> No.
> If one of the sockets is detected to not work,
> both are dropped, and eventually reestablished.

OK, that means that
in my previous test, where I _by chance_ blocked the "data" socket,
the socket should eventually be reestablished.
Is there any special delay for only the "data" socket?
Does "delay_prove" have any relation to this?
http://kerneltrap.org/mailarchive/git-commits-head/2010/5/22/38405

I have to try again anyway.


by the way, sorry for the direct reply to your address...

Thanks,
Junko


lars.ellenberg at linbit

Jul 25, 2010, 7:41 AM

Post #6 of 8 (726 views)
Re: the timing of restarting thread [In reply to]

On Sun, Jul 25, 2010 at 01:38:56AM +0900, Junko IKEDA wrote:
> Hi,
>
> >> > DRBD has _two_ tcp sessions per device,
> >> > one end will have a "random high port",
> >> > the other end the configured port.
> >>
> >> Are these two sessions for "data" and "meta" socket as you mentioned below?
> >> I think I want to simulate the blocking of "meta" socket.
> >
> > Ah.  Why?
> > Please step back a bit and suggest which _real world_ scenario
> > you have in mind. What is it that you are trying to prove or analyse?
> >
> > Apart from sniffing the traffic, there is no easy way to
> > determine which is which just from looking at it.
>
> I want to reproduce the following situation.
> Primary can send "data" to Secondary,
> but only "meta data" is dropped unfortunately.

Again, please suggest a real world failure scenario.
Did you experience any strange replication problems,
or are you just "fantasizing" about esoteric failure modes?

> It might be an unrealistic worry...

If sockets fail in some detectable (by tcp) fashion,
(RST, icmp unreachable or similar),
both sockets are dropped.

Replication is only reestablished once both sockets
have successfully been reestablished.

If sockets fail in some "strange" way (no RST, no icmp,
just a black hole), periodic in-protocol DRBD Ping packets
(on the meta socket) would no longer be answered,
again both sockets are dropped.

If the meta data socket is still ok (DRBD Pings are still answered in a
timely fashion), but there is no progress on the data socket,
read about ko-count.

> >> DRBD can not replicate the data if "data" socket is blocked
> >> and DRBD reopen the new socket if "meta" socket is blocked,
> >> Is that right?
> >
> > No.
> > If one of the sockets is detected to not work,
> > both are dropped, and eventually reestablished.
>
> ok, that means,
> in my previous test that I could _by chance_ blocked the "data" socket,
> the socket should be eventually reestablished.
>
> Is there any special delay for only "data" socket?

read about ko-count.

> Does "delay_prove" have some relation?

it was probe, not prove.
and it is totally unrelated.

That was an attempt to do auto-throttling of the resyncer
to have less impact on application IO
while utilizing as much as possible of "idle bandwidth".

This has been reverted since, as it did not meet expectations.
The "auto-throttling" feature is being implemented differently,
and expected to be released with 8.3.9.

It has absolutely nothing to do with connection problems.




tsukishima.ha at gmail

Jul 26, 2010, 5:55 AM

Post #7 of 8 (705 views)
Re: the timing of restarting thread [In reply to]

Hi,

> If sockets fail in some detectable (by tcp) fashion,
> (RST, icmp unreachable or similar),
> both sockets are dropped.
>
> Replication is only reestablished once both sockets
> have successfully been reestablished.
>
> If sockets fail in some "strange" way (no RST, no icmp,
> just a black hole), periodic in-protocol DRBD Ping packets
> (on the meta socket) would no longer be answered,
> again both sockets are dropped.
>
> If meta data socket is still ok (DRBD Pings are still answered in a
> timely fashion), but there is no progress on the data socket,
> read about ko-count.

Thanks for the classification, it helped me clarify what I am after.
The case I want to investigate is the third one!

I set ko-count to 1, but it didn't work: DRBD did nothing,
and there was no syslog message about ko-count...
I think my setup might still be wrong, so I will try again.

Thanks,
Junko


lars.ellenberg at linbit

Jul 26, 2010, 7:05 AM

Post #8 of 8 (701 views)
Re: the timing of restarting thread [In reply to]

On Mon, Jul 26, 2010 at 09:55:53PM +0900, Junko IKEDA wrote:
> Hi,
>
> > If sockets fail in some detectable (by tcp) fashion,
> > (RST, icmp unreachable or similar),
> > both sockets are dropped.
> >
> > Replication is only reestablished once both sockets
> > have successfully been reestablished.
> >
> > If sockets fail in some "strange" way (no RST, no icmp,
> > just a black hole), periodic in-protocol DRBD Ping packets
> > (on the meta socket) would no longer be answered,
> > again both sockets are dropped.
> >
> > If meta data socket is still ok (DRBD Pings are still answered in a
> > timely fashion), but there is no progress on the data socket,
> > read about ko-count.
>
> Thanks for your classification, I could make myself clear.
> What I want to find out is the third case!
>
> I set ko-count to 1, but it didn't work, it means DRBD did nothing,
> there was no syslog message about ko-count...
> I think my setup might be still wrong so I will try again.

You need to have writes on the drbd device. If it is idle, it has no
means to detect that something strange is going on on the network.

And you need to have enough writes. As long as the outstanding data fits
into the socket buffers, ko-count does not catch it.
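
One way to produce such a write load during the test is a plain dd run against a file on a filesystem that lives on the DRBD device. This is a minimal sketch: the function name gen_writes and the example path are hypothetical, the default size is a guess that should be raised well above the socket buffer sizes in use, and conv=fsync assumes GNU dd.

```shell
#!/bin/sh
# Write a given number of MiB of zeros to a target file and flush them,
# so that the writes actually reach the block device underneath.
# args: target file, size in MiB (default 64, an arbitrary guess)
gen_writes() {
    target=$1
    mib=${2:-64}
    # bs is given in bytes (1 MiB); conv=fsync makes dd flush the data
    # to the device before it exits.
    dd if=/dev/zero of="$target" bs=1048576 count="$mib" conv=fsync 2>/dev/null
}

# Example (hypothetical mount point of the DRBD-backed filesystem):
#   gen_writes /mnt/drbd0/ko-count-test.bin 256
```

Writing to a scratch file rather than to /dev/drbd0 directly avoids overwriting the data on the device.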
