
Mailing List Archive: DRBD: Users

Recovering from erroneous sync state

 

 



zweiss at scout

May 23, 2012, 1:14 PM

Post #1 of 13 (887 views)
Permalink
Recovering from erroneous sync state

Hi,

I'm running DRBD 8.3.12, and recently hit what looks to me like a bug that was listed as fixed in 8.3.13 -- getting into a state where both nodes are in SyncSource (it's just stuck like that, going nowhere). Luckily this happened on a test resource and not a live one, so it's not a big problem, but I was wondering if there were any known ways of recovering it without doing anything disruptive to the other resources (e.g. rebooting or unloading the kernel module).

I've tried 'drbdadm down', but it just hangs -- anyone have any other suggestions? It doesn't really matter to me if it wipes the resource or anything, I'd just like to have my test device back in a working state without disturbing anything else.

Thanks,
Zev Weiss

_______________________________________________
drbd-user mailing list
drbd-user [at] lists
http://lists.linbit.com/mailman/listinfo/drbd-user


florian at hastexo

May 23, 2012, 1:22 PM

Post #2 of 13 (848 views)
Permalink
Re: Recovering from erroneous sync state [In reply to]

On Wed, May 23, 2012 at 10:14 PM, Zev Weiss <zweiss [at] scout> wrote:
> Hi,
>
> I'm running DRBD 8.3.12, and recently hit what looks to me like a bug that was listed as fixed in 8.3.13 -- getting into a state where both nodes are in SyncSource (it's just stuck like that, going nowhere).  Luckily this happened on a test resource and not a live one, so it's not a big problem, but I was wondering if there were any known ways of recovering it without doing anything disruptive to the other resources (e.g. rebooting or unloading the kernel module).
>
> I've tried 'drbdadm down', but it just hangs -- anyone have any other suggestions?  It doesn't really matter to me if it wipes the resource or anything, I'd just like to have my test device back in a working state without disturbing anything else.

Can you post /proc/drbd contents from both nodes here?

Also, if it still returns a meaningful result, you could add "drbdadm
get-gi <resource>" for the affected resource.

That should enable people to come up with suggestions.

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now


zweiss at scout

May 23, 2012, 1:34 PM

Post #3 of 13 (846 views)
Permalink
Re: Recovering from erroneous sync state [In reply to]

On May 23, 2012, at 3:22 PM, Florian Haas wrote:

> On Wed, May 23, 2012 at 10:14 PM, Zev Weiss <zweiss [at] scout> wrote:
>> Hi,
>>
>> I'm running DRBD 8.3.12, and recently hit what looks to me like a bug that was listed as fixed in 8.3.13 -- getting into a state where both nodes are in SyncSource (it's just stuck like that, going nowhere). Luckily this happened on a test resource and not a live one, so it's not a big problem, but I was wondering if there were any known ways of recovering it without doing anything disruptive to the other resources (e.g. rebooting or unloading the kernel module).
>>
>> I've tried 'drbdadm down', but it just hangs -- anyone have any other suggestions? It doesn't really matter to me if it wipes the resource or anything, I'd just like to have my test device back in a working state without disturbing anything else.
>
> Can you post /proc/drbd contents from both nodes here?
>

Sure -- here's one node:

version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss [at] mydomai, 2012-03-14 19:52:38

<snip other resources>
 9: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
        [>...................] sync'ed:  5.9% (65536/65536)K
        finish: 19046:04:53 speed: 0 (0 -- 0) K/sec (stalled)
          0% sector pos: 0/10698352
        resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0


And here's the other:

version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss [at] fromage, 2012-03-14 19:52:38

<snip other resources>
 9: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:664 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
        [>...................] sync'ed:  5.9% (65536/65536)K
        finish: 18987:55:05 speed: 0 (0 -- 0) K/sec (stalled)
          0% sector pos: 0/10698352
        resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0


> Also, if it still returns a meaningful result, you could add "drbdadm
> get-gi <resource>" for the affected resource.

That just hangs, unfortunately (on both nodes).


Thanks,
Zev
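
For anyone wanting to check for this condition without reading /proc/drbd by eye, here is a minimal sketch that pulls the cs: field for one minor out of /proc/drbd-style text and tests whether both nodes report SyncSource. The function names drbd_conn_state and both_sync_source are made up for illustration and are not part of the DRBD tools; the field layout is assumed to match the 8.3 output shown above.

```shell
#!/bin/sh
# drbd_conn_state MINOR [FILE]
# Print the connection state (the cs: field) of one DRBD minor, reading
# /proc/drbd-style text from FILE (default: /proc/drbd).
# Hypothetical helper, not part of the DRBD userland.
drbd_conn_state() {
    minor=$1
    file=${2:-/proc/drbd}
    awk -v m="$minor" '
        # Resource lines look like " 9: cs:SyncSource ro:... ds:..."
        $1 == (m ":") {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^cs:/) { sub(/^cs:/, "", $i); print $i }
        }
    ' "$file"
}

# both_sync_source STATE_A STATE_B
# Succeed when both nodes report SyncSource, the stuck state in this thread.
both_sync_source() {
    [ "$1" = "SyncSource" ] && [ "$2" = "SyncSource" ]
}
```

Run drbd_conn_state 9 on each node (or against a saved copy of each node's /proc/drbd) and pass the two results to both_sync_source. Reading /proc/drbd does not go through drbdsetup, so it should still work when 'drbdadm' commands hang, as they do here.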



lars.ellenberg at linbit

May 23, 2012, 1:45 PM

Post #4 of 13 (852 views)
Permalink
Re: Recovering from erroneous sync state [In reply to]

On Wed, May 23, 2012 at 03:34:27PM -0500, Zev Weiss wrote:
>
> On May 23, 2012, at 3:22 PM, Florian Haas wrote:
>
> > On Wed, May 23, 2012 at 10:14 PM, Zev Weiss <zweiss [at] scout> wrote:
> >> Hi,
> >>
> >> I'm running DRBD 8.3.12, and recently hit what looks to me like a bug that was listed as fixed in 8.3.13 -- getting into a state where both nodes are in SyncSource (it's just stuck like that, going nowhere). Luckily this happened on a test resource and not a live one, so it's not a big problem, but I was wondering if there were any known ways of recovering it without doing anything disruptive to the other resources (e.g. rebooting or unloading the kernel module).
> >>
> >> I've tried 'drbdadm down', but it just hangs -- anyone have any other suggestions? It doesn't really matter to me if it wipes the resource or anything, I'd just like to have my test device back in a working state without disturbing anything else.
> >
> > Can you post /proc/drbd contents from both nodes here?
> >
>
> Sure -- here's one node:
>
> version: 8.3.12 (api:88/proto:86-96)
> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss [at] mydomai, 2012-03-14 19:52:38
>
> <snip other resources>
> 9: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r-----
> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
> [>...................] sync'ed: 5.9% (65536/65536)K
> finish: 19046:04:53 speed: 0 (0 -- 0) K/sec (stalled)
> 0% sector pos: 0/10698352
> resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
> act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0

drbdsetup 9 disconnect --force
may work,
if you did not try a non-forced disconnect or similar before,
that is to say, if the drbd worker thread is not blocked yet.

You can always cut the tcp connection using iptables,
which should at least get the worker into a responsive state again.

> And here's the other:
>
> version: 8.3.12 (api:88/proto:86-96)
> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss [at] fromage, 2012-03-14 19:52:38
>
> <snip other resources>
> 9: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent C r-----
> ns:0 nr:0 dw:0 dr:664 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
> [>...................] sync'ed: 5.9% (65536/65536)K
> finish: 18987:55:05 speed: 0 (0 -- 0) K/sec (stalled)
> 0% sector pos: 0/10698352
> resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
> act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
>
>
> > Also, if it still returns a meaningful result, you could add "drbdadm
> > get-gi <resource>" for the affected resource.
>
> That just hangs, unfortunately (on both nodes).

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed


florian at hastexo

May 23, 2012, 1:47 PM

Post #5 of 13 (846 views)
Permalink
Re: Recovering from erroneous sync state [In reply to]

On Wed, May 23, 2012 at 10:34 PM, Zev Weiss <zweiss [at] scout> wrote:
>
> On May 23, 2012, at 3:22 PM, Florian Haas wrote:
>
>> On Wed, May 23, 2012 at 10:14 PM, Zev Weiss <zweiss [at] scout> wrote:
>>> Hi,
>>>
>>> I'm running DRBD 8.3.12, and recently hit what looks to me like a bug that was listed as fixed in 8.3.13 -- getting into a state where both nodes are in SyncSource (it's just stuck like that, going nowhere).  Luckily this happened on a test resource and not a live one, so it's not a big problem, but I was wondering if there were any known ways of recovering it without doing anything disruptive to the other resources (e.g. rebooting or unloading the kernel module).
>>>
>>> I've tried 'drbdadm down', but it just hangs -- anyone have any other suggestions?  It doesn't really matter to me if it wipes the resource or anything, I'd just like to have my test device back in a working state without disturbing anything else.
>>
>> Can you post /proc/drbd contents from both nodes here?
>>
>
> Sure -- here's one node:
>
> version: 8.3.12 (api:88/proto:86-96)
> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss [at] mydomai, 2012-03-14 19:52:38
>
> <snip other resources>
>  9: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r-----
>    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
>        [>...................] sync'ed:  5.9% (65536/65536)K
>        finish: 19046:04:53 speed: 0 (0 -- 0) K/sec (stalled)
>          0% sector pos: 0/10698352
>        resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
>        act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
>
>
> And here's the other:
>
> version: 8.3.12 (api:88/proto:86-96)
> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss [at] fromage, 2012-03-14 19:52:38
>
> <snip other resources>
>  9: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent C r-----
>    ns:0 nr:0 dw:0 dr:664 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
>        [>...................] sync'ed:  5.9% (65536/65536)K
>        finish: 18987:55:05 speed: 0 (0 -- 0) K/sec (stalled)
>          0% sector pos: 0/10698352
>        resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
>        act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0

Ugh. Can you force the device into the WFConnection state by injecting
a couple of iptables rules blocking the replication port, and then
"down" the resource?

Also, Lars, can you shed a little more light on the bug, and its
8.3.13 fix? I had thought the fix was in commit 305dce2c, but it
apparently fixes c19050f4 (which as per git describe was some thirty
commits after 8.3.12, so it shouldn't affect an 8.3.12 user).

Cheers,
Florian



zweiss at scout

May 23, 2012, 2:07 PM

Post #6 of 13 (866 views)
Permalink
Re: Recovering from erroneous sync state [In reply to]

On May 23, 2012, at 3:47 PM, Florian Haas wrote:

> On Wed, May 23, 2012 at 10:34 PM, Zev Weiss <zweiss [at] scout> wrote:
>>
>> On May 23, 2012, at 3:22 PM, Florian Haas wrote:
>>
>>> On Wed, May 23, 2012 at 10:14 PM, Zev Weiss <zweiss [at] scout> wrote:
>>>> Hi,
>>>>
>>>> I'm running DRBD 8.3.12, and recently hit what looks to me like a bug that was listed as fixed in 8.3.13 -- getting into a state where both nodes are in SyncSource (it's just stuck like that, going nowhere). Luckily this happened on a test resource and not a live one, so it's not a big problem, but I was wondering if there were any known ways of recovering it without doing anything disruptive to the other resources (e.g. rebooting or unloading the kernel module).
>>>>
>>>> I've tried 'drbdadm down', but it just hangs -- anyone have any other suggestions? It doesn't really matter to me if it wipes the resource or anything, I'd just like to have my test device back in a working state without disturbing anything else.
>>>
>>> Can you post /proc/drbd contents from both nodes here?
>>>
>>
>> Sure -- here's one node:
>>
>> version: 8.3.12 (api:88/proto:86-96)
>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss [at] mydomai, 2012-03-14 19:52:38
>>
>> <snip other resources>
>> 9: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r-----
>> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
>> [>...................] sync'ed: 5.9% (65536/65536)K
>> finish: 19046:04:53 speed: 0 (0 -- 0) K/sec (stalled)
>> 0% sector pos: 0/10698352
>> resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
>> act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
>>
>>
>> And here's the other:
>>
>> version: 8.3.12 (api:88/proto:86-96)
>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss [at] fromage, 2012-03-14 19:52:38
>>
>> <snip other resources>
>> 9: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent C r-----
>> ns:0 nr:0 dw:0 dr:664 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
>> [>...................] sync'ed: 5.9% (65536/65536)K
>> finish: 18987:55:05 speed: 0 (0 -- 0) K/sec (stalled)
>> 0% sector pos: 0/10698352
>> resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
>> act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
>
> Ugh. Can you force the device into the WFConnection state by injecting
> a couple of iptables rules blocking the replication port, and then
> "down" the resource?

I've now inserted iptables rules on both sides for the relevant replication port (reject-with icmp-port-unreachable), but no transition to WFConnection -- it's still stuck in the same state (and unsurprisingly, 'down' still just hangs).


Zev



zweiss at scout

May 23, 2012, 2:12 PM

Post #7 of 13 (847 views)
Permalink
Re: Recovering from erroneous sync state [In reply to]

On May 23, 2012, at 3:45 PM, Lars Ellenberg wrote:

> On Wed, May 23, 2012 at 03:34:27PM -0500, Zev Weiss wrote:
>>
>> On May 23, 2012, at 3:22 PM, Florian Haas wrote:
>>
>>> On Wed, May 23, 2012 at 10:14 PM, Zev Weiss <zweiss [at] scout> wrote:
>>>> Hi,
>>>>
>>>> I'm running DRBD 8.3.12, and recently hit what looks to me like a bug that was listed as fixed in 8.3.13 -- getting into a state where both nodes are in SyncSource (it's just stuck like that, going nowhere). Luckily this happened on a test resource and not a live one, so it's not a big problem, but I was wondering if there were any known ways of recovering it without doing anything disruptive to the other resources (e.g. rebooting or unloading the kernel module).
>>>>
>>>> I've tried 'drbdadm down', but it just hangs -- anyone have any other suggestions? It doesn't really matter to me if it wipes the resource or anything, I'd just like to have my test device back in a working state without disturbing anything else.
>>>
>>> Can you post /proc/drbd contents from both nodes here?
>>>
>>
>> Sure -- here's one node:
>>
>> version: 8.3.12 (api:88/proto:86-96)
>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss [at] mydomai, 2012-03-14 19:52:38
>>
>> <snip other resources>
>> 9: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r-----
>> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
>> [>...................] sync'ed: 5.9% (65536/65536)K
>> finish: 19046:04:53 speed: 0 (0 -- 0) K/sec (stalled)
>> 0% sector pos: 0/10698352
>> resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
>> act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
>
> drbdsetup 9 disconnect --force
> may work,
> if you did not try a non-forced disconnect or similar before,
> that is to say, if the drbd worker thread is not blocked yet.
>

I think I had tried a non-forced disconnect previously (and perhaps also implicitly as part of a 'down' attempt, though I'm not sure whether it would have gotten to that step if the disconnect operation didn't complete), but 'drbdsetup 9 disconnect --force' also just hangs.

> You can always cut the tcp connection using iptables,
> which should at least get the worker into a responsive state again.
>

As mentioned in another message in response to Florian, blocking the replication port via iptables doesn't seem to have had any effect.


Thanks,
Zev


lars.ellenberg at linbit

May 23, 2012, 2:18 PM

Post #8 of 13 (845 views)
Permalink
Re: Recovering from erroneous sync state [In reply to]

On Wed, May 23, 2012 at 04:12:23PM -0500, Zev Weiss wrote:
>
> On May 23, 2012, at 3:45 PM, Lars Ellenberg wrote:
>
> > On Wed, May 23, 2012 at 03:34:27PM -0500, Zev Weiss wrote:
> >>
> >> On May 23, 2012, at 3:22 PM, Florian Haas wrote:
> >>
> >>> On Wed, May 23, 2012 at 10:14 PM, Zev Weiss <zweiss [at] scout> wrote:
> >>>> Hi,
> >>>>
> >>>> I'm running DRBD 8.3.12, and recently hit what looks to me like a bug that was listed as fixed in 8.3.13 -- getting into a state where both nodes are in SyncSource (it's just stuck like that, going nowhere). Luckily this happened on a test resource and not a live one, so it's not a big problem, but I was wondering if there were any known ways of recovering it without doing anything disruptive to the other resources (e.g. rebooting or unloading the kernel module).
> >>>>
> >>>> I've tried 'drbdadm down', but it just hangs -- anyone have any other suggestions? It doesn't really matter to me if it wipes the resource or anything, I'd just like to have my test device back in a working state without disturbing anything else.
> >>>
> >>> Can you post /proc/drbd contents from both nodes here?
> >>>
> >>
> >> Sure -- here's one node:
> >>
> >> version: 8.3.12 (api:88/proto:86-96)
> >> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss [at] mydomai, 2012-03-14 19:52:38
> >>
> >> <snip other resources>
> >> 9: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r-----
> >> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
> >> [>...................] sync'ed: 5.9% (65536/65536)K
> >> finish: 19046:04:53 speed: 0 (0 -- 0) K/sec (stalled)
> >> 0% sector pos: 0/10698352
> >> resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
> >> act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
> >
> > drbdsetup 9 disconnect --force
> > may work,
> > if you did not try a non-forced disconnect or similar before,
> > that is to say, if the drbd worker thread is not blocked yet.
> >
>
> I think I had tried a non-forced disconnect previously (and perhaps
> also implicitly as part of a 'down' attempt, though I'm not sure
> whether it would have gotten to that step if the disconnect operation
> didn't complete), but 'drbdsetup 9 disconnect --force' also just
> hangs.
>
> > You can always cut the tcp connection using iptables,
> > which should at least get the worker into a responsive state again.
> >
>
> As mentioned in another message in response to Florian, blocking the replication port via iptables doesn't seem to have had any effect.

You would not need to "block" as in DROP, but to REJECT with tcp reset.

You could also try "ifdown", sleep a while, and "ifup" again
(which obviously will impact the other resources, and everything going
via that interface).





smccreadie at CanyonPartners

May 23, 2012, 2:20 PM

Post #9 of 13 (845 views)
Permalink
Recovering from erroneous sync state [In reply to]



zweiss at scout

May 23, 2012, 2:34 PM

Post #10 of 13 (846 views)
Permalink
Re: Recovering from erroneous sync state [In reply to]

On May 23, 2012, at 4:18 PM, Lars Ellenberg wrote:

> On Wed, May 23, 2012 at 04:12:23PM -0500, Zev Weiss wrote:
>>
>> On May 23, 2012, at 3:45 PM, Lars Ellenberg wrote:
>>
>>> On Wed, May 23, 2012 at 03:34:27PM -0500, Zev Weiss wrote:
>>>>
>>>> On May 23, 2012, at 3:22 PM, Florian Haas wrote:
>>>>
>>>>> On Wed, May 23, 2012 at 10:14 PM, Zev Weiss <zweiss [at] scout> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm running DRBD 8.3.12, and recently hit what looks to me like a bug that was listed as fixed in 8.3.13 -- getting into a state where both nodes are in SyncSource (it's just stuck like that, going nowhere). Luckily this happened on a test resource and not a live one, so it's not a big problem, but I was wondering if there were any known ways of recovering it without doing anything disruptive to the other resources (e.g. rebooting or unloading the kernel module).
>>>>>>
>>>>>> I've tried 'drbdadm down', but it just hangs -- anyone have any other suggestions? It doesn't really matter to me if it wipes the resource or anything, I'd just like to have my test device back in a working state without disturbing anything else.
>>>>>
>>>>> Can you post /proc/drbd contents from both nodes here?
>>>>>
>>>>
>>>> Sure -- here's one node:
>>>>
>>>> version: 8.3.12 (api:88/proto:86-96)
>>>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss [at] mydomai, 2012-03-14 19:52:38
>>>>
>>>> <snip other resources>
>>>> 9: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r-----
>>>> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
>>>> [>...................] sync'ed: 5.9% (65536/65536)K
>>>> finish: 19046:04:53 speed: 0 (0 -- 0) K/sec (stalled)
>>>> 0% sector pos: 0/10698352
>>>> resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
>>>> act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
>>>
>>> drbdsetup 9 disconnect --force
>>> may work,
>>> if you did not try a non-forced disconnect or similar before,
>>> that is to say, if the drbd worker thread is not blocked yet.
>>>
>>
>> I think I had tried a non-forced disconnect previously (and perhaps
>> also implicitly as part of a 'down' attempt, though I'm not sure
>> whether it would have gotten to that step if the disconnect operation
>> didn't complete), but 'drbdsetup 9 disconnect --force' also just
>> hangs.
>>
>>> You can always cut the tcp connection using iptables,
>>> which should at least get the worker into a responsive state again.
>>>
>>
>> As mentioned in another message in response to Florian, blocking the replication port via iptables doesn't seem to have had any effect.
>
> You would not need to "block" as in DROP, but to REJECT with tcp reset.
>

Sorry, should have worded that more carefully -- it wasn't strictly a DROP, but a REJECT, with icmp-port-unreachable. I've since tweaked it to reject with tcp-reset instead. No changes in DRBD state on either side as far as I can see, though both nodes still seem to have (or at least think they have) a tcp connection or two involving that port, despite my efforts with iptables:

[root [at] node ~]# netstat -tn | fgrep 7789
tcp      156      0 192.168.1.2:37324       192.168.1.1:7789        ESTABLISHED
tcp        0      0 192.168.1.2:7789        192.168.1.1:47548       ESTABLISHED

[root [at] node ~]# netstat -tn | fgrep 7789
tcp        0      0 192.168.1.1:47548       192.168.1.2:7789        ESTABLISHED


For what it's worth, node1 is the one that thinks the resource is Secondary/Secondary, node2 is the one that shows it as Secondary/Primary.
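
To make the "is the connection really gone?" check a little stricter than fgrep, a small sketch follows. It counts ESTABLISHED TCP connections whose local or remote port is exactly the replication port, so an unrelated port such as 17789 would not match. The function name count_established is an assumption for illustration, and the column layout is assumed to be that of `netstat -tn` as shown above.

```shell
#!/bin/sh
# count_established PORT
# Read `netstat -tn` output on stdin and print how many ESTABLISHED TCP
# connections have PORT as their local or remote port.
# Hypothetical helper; column positions follow the netstat output above.
count_established() {
    awk -v p="$1" '
        $1 ~ /^tcp/ && $6 == "ESTABLISHED" {
            # Columns 4 and 5 are the local and foreign address as ADDR:PORT.
            nl = split($4, l, ":"); nr = split($5, r, ":")
            if (l[nl] == p || r[nr] == p) n++
        }
        END { print n + 0 }
    '
}
```

Usage would be `netstat -tn | count_established 7789` on each node; a result of 0 on both sides suggests the replication connections have actually been torn down.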

> You could also try "ifdown" sleep a while and ifup again.
> (which obviously will impact the other resources, and everything going
> via that interface).
>

Right, but I don't really want to disrupt replication on all the other (production) resources, so just leaving it as-is until my next scheduled maintenance reboot (at which point I plan on would be preferable.


Thanks,
Zev



zweiss at scout

May 23, 2012, 2:39 PM

Post #11 of 13 (849 views)
Permalink
Re: Recovering from erroneous sync state [In reply to]

On May 23, 2012, at 4:34 PM, Zev Weiss wrote:

>
> On May 23, 2012, at 4:18 PM, Lars Ellenberg wrote:
>
>> You could also try "ifdown" sleep a while and ifup again.
>> (which obviously will impact the other resources, and everything going
>> via that interface).
>>
>
> Right, but I don't really want to disrupt replication on all the other (production) resources, so just leaving it as-is until my next scheduled maintenance reboot (at which point I plan on would be preferable.
>

"(at which point I plan on updating to 8.3.13 anyway)", was what that unfinished parenthetical was supposed to say there.

Zev


lars.ellenberg at linbit

May 23, 2012, 2:48 PM

Post #12 of 13 (844 views)
Permalink
Re: Recovering from erroneous sync state [In reply to]

On Wed, May 23, 2012 at 04:34:28PM -0500, Zev Weiss wrote:
> >>>>>> I'm running DRBD 8.3.12, and recently hit what looks to me like
> >>>>>> a bug that was listed as fixed in 8.3.13 -- getting into a
> >>>>>> state where both nodes are in SyncSource (it's just stuck like
> >>>>>> that, going nowhere). Luckily this happened on a test resource
> >>>>>> and not a live one, so it's not a big problem, but I was
> >>>>>> wondering if there were any known ways of recovering it without
> >>>>>> doing anything disruptive to the other resources (e.g.
> >>>>>> rebooting or unloading the kernel module).
> >>>>>>
> >>>>>> I've tried 'drbdadm down', but it just hangs -- anyone have any
> >>>>>> other suggestions? It doesn't really matter to me if it wipes
> >>>>>> the resource or anything, I'd just like to have my test device
> >>>>>> back in a working state without disturbing anything else.
> >>>>>
> >>>>> Can you post /proc/drbd contents from both nodes here?
> >>>>>
> >>>>
> >>>> Sure -- here's one node:
> >>>>
> >>>> version: 8.3.12 (api:88/proto:86-96)
> >>>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss [at] mydomai, 2012-03-14 19:52:38
> >>>>
> >>>> <snip other resources>
> >>>> 9: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r-----
> >>>> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
> >>>> [>...................] sync'ed: 5.9% (65536/65536)K
> >>>> finish: 19046:04:53 speed: 0 (0 -- 0) K/sec (stalled)
> >>>> 0% sector pos: 0/10698352
> >>>> resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
> >>>> act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
> >>>
> >>> drbdsetup 9 disconnect --force
> >>> may work,
> >>> if you did not try a non-forced disconnect or similar before,
> >>> that is to say, if the drbd worker thread is not blocked yet.
> >>>
> >>
> >> I think I had tried a non-forced disconnect previously (and perhaps
> >> also implicitly as part of a 'down' attempt, though I'm not sure
> >> whether it would have gotten to that step if the disconnect operation
> >> didn't complete), but 'drbdsetup 9 disconnect --force' also just
> >> hangs.
> >>
> >>> You can always cut the tcp connection using iptables,
> >>> which should at least get the worker into a responsive state again.
> >>>
> >>
> >> As mentioned in another message in response to Florian, blocking
> >> the replication port via iptables doesn't seem to have had any
> >> effect.
> >
> > You would not need to "block" as in DROP, but to REJECT with tcp reset.
> >
>
> Sorry, should have worded that more carefully -- it wasn't strictly a
> DROP, but a REJECT, with icmp-port-unreachable. I've since tweaked it
> to reject with tcp-reset instead. No changes in DRBD state on either
> side as far as I can see, though both nodes still seem to have (or at
> least think they have) a tcp connection or two involving that port,
> despite my efforts with iptables:
>
> [root [at] node ~]# netstat -tn | fgrep 7789
> tcp 156 0 192.168.1.2:37324 192.168.1.1:7789 ESTABLISHED
> tcp 0 0 192.168.1.2:7789 192.168.1.1:47548 ESTABLISHED
>
> [root [at] node ~]# netstat -tn | fgrep 7789
> tcp 0 0 192.168.1.1:47548 192.168.1.2:7789 ESTABLISHED

Huh? Shouldn't there be two on that node as well?

anyways:
# Reset DRBD's TCP connections on the replication port, in both directions.
port=7789
for chain in INPUT OUTPUT; do
    for direction in dport sport; do
        iptables -I "$chain" -p tcp --"$direction" "$port" -j REJECT --reject-with tcp-reset
    done
done

should do it usually.
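
Since these rules stay in place until removed, and replication cannot reconnect while they are active, it may help to generate the matching delete commands from the same loop. The sketch below only prints the iptables commands; the function name reject_rules is an assumption for illustration, and the rule specification is the one given above with -I swapped for -D.

```shell
#!/bin/sh
# reject_rules ACTION PORT
# Print (not run) the four iptables commands for PORT.
# ACTION is -I to insert the rules or -D to delete them again.
# Hypothetical helper wrapping the loop from this post.
reject_rules() {
    action=$1
    port=$2
    for chain in INPUT OUTPUT; do
        for direction in dport sport; do
            echo "iptables $action $chain -p tcp --$direction $port -j REJECT --reject-with tcp-reset"
        done
    done
}
```

Review the output first, then pipe it to a root shell: `reject_rules -I 7789 | sh` to cut the connection, and `reject_rules -D 7789 | sh` once DRBD has left the stuck state. Note that this affects every resource using that port, so it assumes the stuck resource has a port of its own.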

> For what it's worth, node1 is the one that thinks the resource is
> Secondary/Secondary, node2 is the one that shows it as
> Secondary/Primary.
>
> > You could also try "ifdown" sleep a while and ifup again.
> > (which obviously will impact the other resources, and everything going
> > via that interface).
> >
>
> Right, but I don't really want to disrupt replication on all the other
> (production) resources, so just leaving it as-is until my next
> scheduled maintenance reboot (at which point I plan on would be
> preferable.





zweiss at scout

May 24, 2012, 1:03 PM

Post #13 of 13 (822 views)
Permalink
Re: Recovering from erroneous sync state [In reply to]

On May 23, 2012, at 4:48 PM, Lars Ellenberg wrote:

> On Wed, May 23, 2012 at 04:34:28PM -0500, Zev Weiss wrote:
>>>>>>>> I'm running DRBD 8.3.12, and recently hit what looks to me like
>>>>>>>> a bug that was listed as fixed in 8.3.13 -- getting into a
>>>>>>>> state where both nodes are in SyncSource (it's just stuck like
>>>>>>>> that, going nowhere). Luckily this happened on a test resource
>>>>>>>> and not a live one, so it's not a big problem, but I was
>>>>>>>> wondering if there were any known ways of recovering it without
>>>>>>>> doing anything disruptive to the other resources (e.g.
>>>>>>>> rebooting or unloading the kernel module).
>>>>>>>>
>>>>>>>> I've tried 'drbdadm down', but it just hangs -- anyone have any
>>>>>>>> other suggestions? It doesn't really matter to me if it wipes
>>>>>>>> the resource or anything, I'd just like to have my test device
>>>>>>>> back in a working state without disturbing anything else.
>>>>>>>
>>>>>>> Can you post /proc/drbd contents from both nodes here?
>>>>>>>
>>>>>>
>>>>>> Sure -- here's one node:
>>>>>>
>>>>>> version: 8.3.12 (api:88/proto:86-96)
>>>>>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by zweiss [at] mydomai, 2012-03-14 19:52:38
>>>>>>
>>>>>> <snip other resources>
>>>>>> 9: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r-----
>>>>>> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:65536
>>>>>> [>...................] sync'ed: 5.9% (65536/65536)K
>>>>>> finish: 19046:04:53 speed: 0 (0 -- 0) K/sec (stalled)
>>>>>> 0% sector pos: 0/10698352
>>>>>> resync: used:0/61 hits:0 misses:0 starving:0 dirty:0 changed:0
>>>>>> act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
>>>>>
>>>>> drbdsetup 9 disconnect --force
>>>>> may work,
>>>>> if you did not try a non-forced disconnect or similar before,
>>>>> that is to say, if the drbd worker thread is not blocked yet.
>>>>>
>>>>
>>>> I think I had tried a non-forced disconnect previously (and perhaps
>>>> also implicitly as part of a 'down' attempt, though I'm not sure
>>>> whether it would have gotten to that step if the disconnect operation
>>>> didn't complete), but 'drbdsetup 9 disconnect --force' also just
>>>> hangs.
>>>>
>>>>> You can always cut the tcp connection using iptables,
>>>>> which should at least get the worker into a responsive state again.
>>>>>
>>>>
>>>> As mentioned in another message in response to Florian, blocking
>>>> the replication port via iptables doesn't seem to have had any
>>>> effect.
>>>
>>> You would not need to "block" as in DROP, but to REJECT with tcp reset.
>>>
>>
>> Sorry, should have worded that more carefully -- it wasn't strictly a
>> DROP, but a REJECT, with icmp-port-unreachable. I've since tweaked it
>> to reject with tcp-reset instead. No changes in DRBD state on either
>> side as far as I can see, though both nodes still seem to have (or at
>> least think they have) a tcp connection or two involving that port,
>> despite my efforts with iptables:
>>
>> [root [at] node ~]# netstat -tn | fgrep 7789
>> tcp 156 0 192.168.1.2:37324 192.168.1.1:7789 ESTABLISHED
>> tcp 0 0 192.168.1.2:7789 192.168.1.1:47548 ESTABLISHED
>>
>> [root [at] node ~]# netstat -tn | fgrep 7789
>> tcp 0 0 192.168.1.1:47548 192.168.1.2:7789 ESTABLISHED
>
> Huh? Shouldn't there be two on that node as well?
>

One would think, yes...but that's all it shows. (No idea why.)

> anyways:
> port=7789;
> for chain in INPUT OUTPUT ; do
> for direction in dport sport ; do
> iptables -I $chain -p tcp --$direction $port -j REJECT --reject-with tcp-reset
> done
> done
>
> should do it usually.

Ah, hadn't thought to reject outgoing packets as well -- and doing so seems to have finally had the intended effect (got both sides into StandAlone, then able to force one back to primary and reattach). All back to normal now.

Thanks!

Zev
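
The exact commands used for the last step are not given in the thread. As a rough sketch only, one plausible sequence with the 8.3 tools, once both sides are StandAlone, is to take the resource down and up on both nodes and then force one node to Primary. The function name recovery_plan and the resource name r9 are assumptions for illustration, and --overwrite-data-of-peer discards the peer's copy, which is only acceptable here because this is a test resource whose data does not matter.

```shell
#!/bin/sh
# recovery_plan RESOURCE
# Print (not run) one possible command sequence for a resource whose nodes
# have both reached StandAlone. RESOURCE is the name from drbd.conf.
# Hypothetical helper; the sequence is an assumption, not taken from the thread.
recovery_plan() {
    res=$1
    echo "# on both nodes: tear the resource down and bring it back up"
    echo "drbdadm down $res"
    echo "drbdadm up $res"
    echo "# on the one node whose data should win (8.3 syntax):"
    echo "drbdadm -- --overwrite-data-of-peer primary $res"
}
```

Calling `recovery_plan r9` prints the steps for review. Remember to remove the iptables REJECT rules before expecting the nodes to reconnect.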

