
Mailing List Archive: Netapp: toasters

RE: Oddball SnapMirror issue - Status: Pending with restart checkpoint

 

 



kheal at hotmail

May 4, 2008, 11:48 AM

Post #1 of 8 (530 views)
Permalink
RE: Oddball SnapMirror issue - Status: Pending with restart checkpoint

Hi all

I don't see a bug that precisely matches this, but both scenarios were on 7.0.x releases, and a fair few SnapMirror bugs have been fixed in 7.2.4. So I am wondering whether, in either scenario, it is possible to move both filers to 7.2.4 (I semi-fear it isn't, especially for the source filers concerned), and/or whether anyone has seen this on a 7.2.x release.

cheers
Kenneth


http://now.netapp.com/NOW/cgi-bin/relcmp.on?&rrel=7.0.6&rrel=7.2.4&what=fix
> Subject: RE: Oddball SnapMirror issue
> Date: Sun, 4 May 2008 13:24:05 -0500
> From: mpartyka[at]acmn.com
> To: tmacmd[at]gmail.com; owner-toasters[at]mathworks.com; phigmov[at]gmail.com; toasters[at]mathworks.com
>
> Is there any reason to prefer wafliron over WAFL_check? It sounds like
> they do the same thing, but with WAFL_check you have the option to only
> check rather than automatically fix.
>
> -Mike
>
> -----Original Message-----
> From: tmacmd[at]gmail.com [mailto:tmacmd[at]gmail.com]
> Sent: Sunday, May 04, 2008 12:59 PM
> To: Mike Partyka; owner-toasters[at]mathworks.com; Raj Patel; NetApp
> Toasters List
> Subject: Re: Oddball SnapMirror issue
>
> I would try a wafliron on the source volume/aggr
>
> Just because you do not see any filesystem problems does not mean there
> are not any.
>
> --tmac
>
> Sent from my Verizon Wireless BlackBerry
>
> -----Original Message-----
> From: "Mike Partyka" <mpartyka[at]acmn.com>
>
> Date: Sun, 4 May 2008 09:28:18
> To:"Raj Patel" <phigmov[at]gmail.com>, <toasters[at]mathworks.com>
> Subject: RE: Oddball SnapMirror issue
>
>
> I'm having a similar experience trying to set up a SnapMirror between a
> pair of filers in the same datacenter (not separated by a firewall). The
> source is a 3050 running DOT 7.0.5 and the destination is a 270 running
> 7.0.6. The volume is a 420G volume serving unstructured CIFS data. When
> I start the initialize, everything works fine until it gets to about 82
> or 83G, then the initialize aborts. The log contains some very
> non-specific messages; here is the current snapmirror log:
>
> sys Sat May 3 09:12:55 CDT SnapMirror_off (shutdown)
> log Sat May 3 09:15:31 CDT FILER_REBOOTED
> sys Sat May 3 09:15:34 CDT SnapMirror_on (registry)
> dst Sat May 3 10:09:36 CDT 10.0.10.238:data hci2:rcv_data Request (Initialize)
> dst Sat May 3 10:09:42 CDT 10.0.10.238:data hci2:rcv_data Start
> dst Sat May 3 11:51:24 CDT 10.0.10.238:data hci2:rcv_data Abort (snapmirror transfer failed to complete)
>
> Just as Raj says, when it fails to initialize, the destination volume
> is left in limbo; you can't online it because of the failed initialize.
> Here is the error:
>
> vol online: Volume 'rcv_data' was left in an inconsistent state by an
> aborted vol copy or an aborted snapmirror initial (level 0) transfer.
> In order to bring it online, you must either destroy and re-create
> the volume, or complete an initial snapmirror transfer or vol copy.
>
> I have considered running WAFL_check, but WAFL isn't reporting an
> inconsistent state, so I'm not sure that would be very effective.
> Yesterday I upgraded both filers to DOT 7.2.4 and updated all firmware,
> then retried with the exact same results.
>
> The only thing I can think of doing now is running a packet capture on
> the filer while the transfer runs and seeing what that tells me.
>
> -Mike
>
> -----Original Message-----
> From: owner-toasters[at]mathworks.com [mailto:owner-toasters[at]mathworks.com]
> On Behalf Of Raj Patel
> Sent: Sunday, May 04, 2008 1:29 AM
> To: George T Chen
> Cc: toasters[at]mathworks.com
> Subject: Re: Oddball SnapMirror issue
>
> Hi George,
>
> The working transfers do just update 10 to 20Mb - very small turnover.
>
> Unfortunately the two I need to mirror are from scratch - no baseline
> snapshot. The checkpoint restart is occurring during the initialisation
> phase. Once the initialisation phase stalls, further updates fail as
> the volume is not online (obviously because the init failed).
>
> I tried setting a once-a-day schedule at a particular time so it
> wouldn't trip over itself or other snapmirror operations to no avail.
>
> As other volumes are updating with small updates, it made me wonder
> whether the router IPsec tunnel or firewall was prematurely closing a
> connection for a large baseline transfer.
>
> I'll attach the log & config when I get back into work.
>
> Cheers,
> Raj.
>
> On Sun, May 4, 2008 at 4:36 PM, George T Chen <gtchen[at]yahoo-inc.com> wrote:
> > Since you have one volume already transferring, there's no network
> > or firewall issue--any problem at that level would affect all
> > volumes, not just a few.
> >
> > A "Pending with restart checkpoint" status appears when you abort an
> > ongoing transfer. Checkpoints occur every ?? megabytes and give ONTAP
> > a place to restart from instead of starting from scratch. It's hard
> > to debug without more info, but I would start by:
> >
> > 1) doing a snapmirror break on the volume (not just an abort)
> > 2) verifying that there is a common baseline snapshot on both source
> > and destination
> > 3) restarting with a snapmirror resync command
> >
> > Depending on step 2, you may be required to go to a snapmirror
> > initialize.
> >
> > What do the /etc/log/snapmirror and /etc/messages files say?
> >
> > -gtchen
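
The choice George describes in steps 2 and 3 can be sketched in Python. This is only an illustration, not a NetApp tool: the function name and the snapshot names are made up, and on a real filer the two lists would have to come from listing the snapshots on the source and destination volumes.

```python
def next_snapmirror_step(source_snapshots, destination_snapshots):
    """Return "resync" if both sides share a snapshot, else "initialize"."""
    # A resync needs a baseline snapshot that exists on both volumes.
    common = set(source_snapshots) & set(destination_snapshots)
    return "resync" if common else "initialize"


# Hypothetical snapshot names, for illustration only.
src = ["nightly.0", "mirror_base.1"]
print(next_snapmirror_step(src, ["mirror_base.1"]))  # resync
print(next_snapmirror_step(src, []))                 # initialize
```

In Raj's case the destination never completed a baseline, so the second branch applies and a fresh initialize is the only option.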
> >
> >
> >
> > > -----Original Message-----
> > > From: owner-toasters[at]mathworks.com [mailto:owner-toasters[at]mathworks.com]
> > > On Behalf Of Raj Patel
> > > Sent: Saturday, May 03, 2008 2:00 AM
> > > To: toasters[at]mathworks.com
> > > Subject: Oddball SnapMirror issue
> > >
> > > We've got two FAS 270's in different cities. They're connected by a
> > > 10mb pipe with routers (running IPsec) & firewalls (Check Point
> > > SPLAT) separating each datacenter.
> > >
> > > The primary SAN is fine and runs all our prod volumes (7.0.5), which
> > > are mirrored to our secondary SAN (7.0.6).
> > >
> > > Recently I had to recreate the mirror relationship for some volumes
> > > as they'd fallen far out of sync due to some firewall work.
> > >
> > > What I am seeing is that one volume is syncing fine, one has a small
> > > lag, and two are stuck with a status of 'Pending with restart
> > > checkpoint' after I re-initialised the transfer.
> > >
> > > snapmirror status -l shows this for one of the two that just don't
> > > get properly initialised:
> > >
> > > Source: 10.1.45.7:sqlprod01
> > > Destination: adcsan1:sqlprod01_mirror
> > > Status: Pending with restart checkpoint
> > > Progress: 38376 KB
> > > State: Unknown
> > > Lag: -
> > > Mirror Timestamp: -
> > > Base Snapshot: -
> > > Current Transfer Type: Retry
> > > Current Transfer Error: volume is not online; cannot execute operation
> > > Contents: -
> > > Last Transfer Type: -
> > > Last Transfer Size: -
> > > Last Transfer Duration: -
> > > Last Transfer From: -
> > >
> > > Our firewall rules have been relaxed to allow free flow between
> > > these devices (instead of just the SnapMirror ports), and the routers
> > > and circuit haven't changed at all between it working fine and not
> > > working now. The volume that is mirroring OK seems fine and still
> > > syncs fine - granted, the updates are small, whereas the three
> > > non-working volumes have to sync quite a lot of data.
> > >
> > > I've tried deleting the mirrored volumes, recreating them, setting
> > > up the mirror relationship again (with a variety of scheduling and
> > > bandwidth throttling options) and doing a destination SAN reboot.
> > >
> > > What are the best options to troubleshoot this or ensure a
> > > successful mirror? Has anyone had issues with dropped or stalled
> > > SnapMirror baseline transfers via an IPsec tunnel or firewall?
> > >
> > > Thanks in advance,
> > > Raj.
> > >
> > > PS As an addendum, it looks like it starts a transfer, stalls, and
> > > from then on subsequent mirrors fail because it's not online (i.e.
> > > the initialisation fails?)
> > >
> > > What I don't understand is why it just can't carry on with the
> > > initialisation regardless of the interruption by resuming the
> > > mirror operation?
> >
>
>
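
For anyone who wants to watch for this state across several relationships, here is a minimal Python sketch that parses one record of the `snapmirror status -l` output quoted in Raj's original message. The helper names and the "stuck" test (status is "Pending with restart checkpoint" and no base snapshot yet) are assumptions made for illustration, not anything ONTAP provides.

```python
# Fields copied from the status output quoted earlier in the thread.
STATUS = """\
Source: 10.1.45.7:sqlprod01
Destination: adcsan1:sqlprod01_mirror
Status: Pending with restart checkpoint
Progress: 38376 KB
State: Unknown
Base Snapshot: -
Current Transfer Type: Retry
Current Transfer Error: volume is not online; cannot execute operation
"""


def parse_status(text):
    """Turn "Field: value" lines into a dict; split on the first colon only."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    return record


def baseline_incomplete(record):
    """Heuristic (an assumption): pending checkpoint and no base snapshot."""
    return (record.get("Status") == "Pending with restart checkpoint"
            and record.get("Base Snapshot") == "-")


record = parse_status(STATUS)
print(record["Destination"], baseline_incomplete(record))
```

Splitting on the first colon only matters here because the source and destination values themselves contain a colon (`filer:volume`).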



mpartyka at acmn

May 4, 2008, 11:56 AM

Post #2 of 8 (514 views)
Permalink
RE: Oddball SnapMirror issue - Status: Pending with restart checkpoint [In reply to]

After failing to get the initialization going on the 3050 and 270
(running 7.0.5 and 7.0.6 respectively) yesterday morning, we upgraded
both filers (src and dst) to 7.2.4. Immediately afterwards I tried the
mirror again, but no dice: the error occurs around the same place/time in
the initialization.
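
To put a number on "the same place/time", the snapmirror log lines quoted earlier in the thread can be turned into an elapsed time and an average transfer rate. This is a rough Python sketch: the year 2008 is taken from the post dates, and treating "82G" as 82 GiB is an assumption.

```python
from datetime import datetime

# "dst" lines from the snapmirror log quoted earlier in the thread.
LOG = """\
dst Sat May 3 10:09:36 CDT 10.0.10.238:data hci2:rcv_data Request (Initialize)
dst Sat May 3 10:09:42 CDT 10.0.10.238:data hci2:rcv_data Start
dst Sat May 3 11:51:24 CDT 10.0.10.238:data hci2:rcv_data Abort (snapmirror transfer failed to complete)
"""


def event_times(log_text, year=2008):
    """Map each action (Request, Start, Abort) to its timestamp."""
    times = {}
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 9 or fields[0] != "dst":
            continue
        # fields: dst, weekday, month, day, time, zone, source, dest, action
        stamp = f"{year} {fields[2]} {fields[3]} {fields[4]}"
        times[fields[8]] = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
    return times


def elapsed_seconds(log_text):
    """Seconds between the Start and Abort entries."""
    times = event_times(log_text)
    return int((times["Abort"] - times["Start"]).total_seconds())


def average_rate_mib_per_s(gib_transferred, seconds):
    """Average rate in MiB/s for a transfer of the given size in GiB."""
    return gib_transferred * 1024 / seconds


seconds = elapsed_seconds(LOG)
print(seconds, round(average_rate_mib_per_s(82, seconds), 1))
```

The transfer ran for about 1 hour 42 minutes before aborting, which at 82G works out to roughly 14 MiB/s on average.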

I did miss the following error in the /etc/messages file:

Sat May 3 11:51:23 CDT [worker_thread_98:notice]: snapmirror: Message from Read Socket : Connection
Sat May 3 11:51:23 CDT [snapmirror.dst.err:error]: SnapMirror destination transfer from 10.0.10.238data : snapmirror transfer failed to complete.
Sat May 3 11:51:24 CDT [snapmirror.dst.err:error]: SnapMirror destination transfer from 10.0.10.238data : snapmirror transfer failed to complete.

I understand this might mean the snapmirror.window_size is too large, but
it's set to 32768, which is pretty small already. Usually you increase
this value to improve performance, and I don't think I want to go much
smaller than this.
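
As a rough check on why a small window costs performance, the usual TCP rule of thumb is that throughput cannot exceed window size divided by round-trip time. The Python sketch below applies it to the 32768 value, assuming the option is expressed in bytes. The round-trip times are invented for illustration, and the 10 Mbit/s figure is one reading of the "10mb pipe" in Raj's original post.

```python
def max_throughput_bytes_per_s(window_bytes, rtt_ms):
    """Upper bound on throughput: one full window per round trip."""
    return window_bytes * 1000 / rtt_ms


def bandwidth_delay_product_bytes(link_bits_per_s, rtt_ms):
    """Window size needed to keep a link of this speed full."""
    return link_bits_per_s * rtt_ms / 8 / 1000


WINDOW = 32768  # bytes, assumed unit for snapmirror.window_size

# Hypothetical 50 ms WAN round trip.
print(max_throughput_bytes_per_s(WINDOW, 50))         # 655360.0
# Window needed to fill a 10 Mbit/s link at the same round trip.
print(bandwidth_delay_product_bytes(10_000_000, 50))  # 62500.0
```

On a same-datacenter link with a sub-millisecond round trip the same window allows tens of megabytes per second, so on these assumptions the 32768 setting would not be the limiting factor in my case.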





kheal at hotmail

May 4, 2008, 12:06 PM

Post #3 of 8 (516 views)
Permalink
RE: Oddball SnapMirror issue - Status: Pending with restart checkpoint [In reply to]

Hi Mike,

Thx for the quick reply. That does indeed shoot my theory/hope out of the water. I am inclined to agree that going lower on the window size is not likely to help, especially as both your boxes are in the same datacentre without any nasty firewalls or WAN links between them. This is also the window size recommended in the KB for such problems.

At this point I would be inclined to take a packet trace, fire off ASUPs, open a support case, and upload a gzipped copy of the pktt trace. I have to admit I'm beaten on this one... though I would be keen to know what the eventual resolution is.

cheers, Kenneth

https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb17202


mpartyka at acmn

May 4, 2008, 12:11 PM

Post #4 of 8 (514 views)
Permalink
RE: Oddball SnapMirror issue - Status: Pending with restart checkpoint [In reply to]

Yeah, I was thinking the same thing, a packet trace, but I am waiting for
support to come to the same conclusion. After the upgrade yesterday
morning I decided I was stumped and opened a ticket this morning. They
are currently looking into the problem. Hopefully I'll hear back sometime
today, and I will share with the list what the eventual resolution is.



Regards

Mike



From: Kenneth Heal [mailto:kheal[at]hotmail.com]
Sent: Sunday, May 04, 2008 2:07 PM
To: Mike Partyka; tmacmd[at]gmail.com; owner-toasters[at]mathworks.com; Raj
Patel; NetApp Toasters List
Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
checkpoint



Hi Mike,

Thx for the quick reply. That does indeed shoot my theory/hope out the
water. And I am inclined to agree that going lower on the window size
is not likely to help, especially as both your boxes are in the same
datacentre without any nasty firewalls or WAN links in between them.
This is also the window size recommended in the kb for such problems.


At this I would be inclined to take a packet trace, fire off ASUPs, open
a support case and upload a gzipped copy of the pktt trace. Have to
give myself beat on this one... though I would be keen to know what the
eventual resolution is.

cheers, Kenneth

https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb17202

________________________________

Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
checkpoint
Date: Sun, 4 May 2008 13:56:45 -0500
From: mpartyka[at]acmn.com
To: kheal[at]hotmail.com; tmacmd[at]gmail.com; owner-toasters[at]mathworks.com;
phigmov[at]gmail.com; toasters[at]mathworks.com

After failing to get the initialization going on the 270 and 3050
(running 7.0.5 and 7.0.6 respectively) yesterday morning we upgraded
both the filers (src and dst) to 7.2.4. I immediately after tried the
mirror again but no dice the error occurs around the same place/time in
the initialization.



I did miss the following error in the /etc/messages file:



Sat May 3 11:51:23 CDT [worker_thread_98:notice]: snapmirror: Message
from Read Socket : Connection

Sat May 3 11:51:23 CDT [snapmirror.dst.err:error]: SnapMirror
destination transfer from 10.0.10.238data : snapmirror transfer failed
to complete.

Sat May 3 11:51:24 CDT [snapmirror.dst.err:error]: SnapMirror
destination transfer from 10.0.10.238data : snapmirror transfer failed
to complete.



I understand this might mean the snapmirror.window_size is too large, but
it's set to 32768, which is pretty small already. Usually you increase this
value to increase performance, but I don't think I want to go much
smaller than this.
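
For a rough sanity check on the window size, the usual TCP rule of thumb is
window = bandwidth x round-trip time. The sketch below is only an
illustration under assumed numbers: the link speeds and round-trip times are
guesses, not measurements from either site, and window_size_bytes is a
hypothetical helper, not a Data ONTAP command.

```python
def window_size_bytes(link_bits_per_sec: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes, using integer arithmetic."""
    # bits/s * ms / 1000 gives bits in flight; dividing by 8 gives bytes.
    return link_bits_per_sec * rtt_ms // 8000


CONFIGURED_WINDOW = 32768  # snapmirror.window_size value quoted in this thread

# Assumed figures, for illustration only (not measured on these filers).
scenarios = {
    "10 Mbit/s WAN, 30 ms RTT (assumed)": (10_000_000, 30),
    "1 Gbit/s LAN, 1 ms RTT (assumed)": (1_000_000_000, 1),
}

for name, (bps, rtt) in scenarios.items():
    bdp = window_size_bytes(bps, rtt)
    verdict = "above" if CONFIGURED_WINDOW > bdp else "at or below"
    print(f"{name}: BDP = {bdp} bytes; configured window is {verdict} it")
```

If the configured window sits at or below the bandwidth-delay product, the
window is more likely to cap throughput than to overrun the link, which fits
the view that 32768 is already on the small side.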



From: Kenneth Heal [mailto:kheal[at]hotmail.com]
Sent: Sunday, May 04, 2008 1:48 PM
To: Mike Partyka; tmacmd[at]gmail.com; owner-toasters[at]mathworks.com; Raj
Patel; NetApp Toasters List
Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
checkpoint



Hi all

I don't see a bug which is a precise match to this, but I do see that
both scenarios were using 7.0.x releases, and I see a fair few
SnapMirror bugs have been fixed in 7.2.4; so I am wondering if in either
of the scenarios it is possible to move both filers to 7.2.4 (I
semi-fear it isn't especially for the source filers concerned) and/or if
anyone has seen this on a 7.2.x release.

cheers
Kenneth



http://now.netapp.com/NOW/cgi-bin/relcmp.on?&rrel=7.0.6&rrel=7.2.4&what=fix

________________________________

> Subject: RE: Oddball SnapMirror issue
> Date: Sun, 4 May 2008 13:24:05 -0500
> From: mpartyka[at]acmn.com
> To: tmacmd[at]gmail.com; owner-toasters[at]mathworks.com; phigmov[at]gmail.com;
> toasters[at]mathworks.com
>
> Is there any reason to prefer wafliron over WAFL_check? Sounds like they
> do the same thing but you have the option to only check not
> automatically fix with WAFL_check.
>
> -Mike
>
> -----Original Message-----
> From: tmacmd[at]gmail.com [mailto:tmacmd[at]gmail.com]
> Sent: Sunday, May 04, 2008 12:59 PM
> To: Mike Partyka; owner-toasters[at]mathworks.com; Raj Patel; NetApp
> Toasters List
> Subject: Re: Oddball SnapMirror issue
>
> I would try a wafl iron on the source volume/aggr
>
> Just because you do not see any filesystem problems, does not mean
> there are not any.
>
> --tmac
>
> Sent from my Verizon Wireless BlackBerry
>
> -----Original Message-----
> From: "Mike Partyka" <mpartyka[at]acmn.com>
>
> Date: Sun, 4 May 2008 09:28:18
> To:"Raj Patel" <phigmov[at]gmail.com>, <toasters[at]mathworks.com>
> Subject: RE: Oddball SnapMirror issue
>
>
> I'm having a similar experience trying to set up a SnapMirror between a
> pair of filers in the same datacenter (not separated by a firewall). The
> source is a 3050 running DOT 7.0.5 and the destination is a 270 running
> 7.0.6. The volume is a 420G volume serving unstructured CIFS data. When
> I start the initialize everything works fine until it gets to about 82
> or 83G, then the initialize aborts. The log contains some very
> non-specific messages; here is the current snapmirror log:
>
> sys Sat May 3 09:12:55 CDT SnapMirror_off (shutdown)
> log Sat May 3 09:15:31 CDT FILER_REBOOTED
> sys Sat May 3 09:15:34 CDT SnapMirror_on (registry)
> dst Sat May 3 10:09:36 CDT 10.0.10.238:data hci2:rcv_data Request (Initialize)
> dst Sat May 3 10:09:42 CDT 10.0.10.238:data hci2:rcv_data Start
> dst Sat May 3 11:51:24 CDT 10.0.10.238:data hci2:rcv_data Abort (snapmirror transfer failed to complete)
>
> Just as Raj says, when it fails to initialize the destination volume
> is in limbo; you can't online it due to the failed initialize. Here is
> the error:
>
> vol online: Volume 'rcv_data' was left in an inconsistent state by an
> aborted vol copy or an aborted snapmirror initial (level 0) transfer.
> In order to bring it online, you must either destroy and re-create
> the volume, or complete an initial snapmirror transfer or vol copy.
>
> I have considered running WAFL_check but WAFL isn't reporting an
> inconsistent state so I'm not sure that would be very effective.
> Yesterday I upgraded both filers to DOT 7.2.4 and updated all firmware,
> then retried with the exact same results.
>
> The only thing I can think of doing now is running a packet capture on
> the filer while it runs and see what that tells me.
>
> -Mike
>
> -----Original Message-----
> From: owner-toasters[at]mathworks.com [mailto:owner-toasters[at]mathworks.com]
> On Behalf Of Raj Patel
> Sent: Sunday, May 04, 2008 1:29 AM
> To: George T Chen
> Cc: toasters[at]mathworks.com
> Subject: Re: Oddball SnapMirror issue
>
> Hi George,
>
> The working transfers do just update 10 to 20Mb - very small turnover.
>
> Unfortunately the two I need to mirror are from scratch - no baseline
> snapshot. The checkpoint restart occurs during the initialisation
> phase. Once the initialisation phase stalls, further updates fail as
> the volume is not online (obviously because the init failed).
>
> I tried setting a once-a-day schedule at a particular time so it
> wouldn't trip over itself or other snapmirror operations to no avail.
>
> As other volumes are updating with small updates it made me wonder if
> it wasn't the router ipsec tunnel or firewall prematurely closing a
> connection for a large baseline transfer.
>
> I'll attach the log & config when I get back into work.
>
> Cheers,
> Raj.
>
> On Sun, May 4, 2008 at 4:36 PM, George T Chen <gtchen[at]yahoo-inc.com>
> wrote:
> > Since you have one volume already transferring, then there's no network
> > or firewall issue--any problem at that level would affect all volumes,
> > not just a few.
> >
> > A "Pending with restart checkpoint" appears when you abort an ongoing
> > transfer. Checkpoints occur every ?? megabytes and give Ontap a place
> > to restart instead of from scratch. It's hard to debug without more
> > info, but I would start by:
> >
> > 1) doing a snapmirror break on the volume (not just an abort)
> > 2) verifying that there is a common baseline snapshot on both source and
> > destination
> > 3) restarting with a snapmirror resync command
> >
> > Depending on step 2, you may be required to go to a snapmirror
> > initialize.
> >
> > What do the /etc/log/snapmirror and /etc/messages file say?
> >
> > -gtchen
> >
> >
> >
> > > -----Original Message-----
> > > From: owner-toasters[at]mathworks.com
> > > [mailto:owner-toasters[at]mathworks.com]
> > > On Behalf Of Raj Patel
> > > Sent: Saturday, May 03, 2008 2:00 AM
> > > To: toasters[at]mathworks.com
> > > Subject: Oddball SnapMirror issue
> > >
> > > We've got two FAS 270's in different cities. They're connected by a
> > > 10mb pipe with routers (running ipsec) & firewalls (checkpoint splat)
> > > separating each datacenter.
> > >
> > > The primary san is fine and runs all our prod volumes (7.0.5) which
> > > are mirrored to our secondary san (7.0.6).
> > >
> > > Recently I had to recreate the mirror relationship for some volumes as
> > > they'd fallen far out of sync due to some firewall work.
> > >
> > > What I am seeing is one volume is syncing fine, one has a small lag
> > > and two are stuck with a status of 'Pending with restart checkpoint'
> > > after I re-initialised the transfer.
> > >
> > > snapmirror status -l shows this for one of the two that just don't get
> > > properly initialised
> > >
> > > Source: 10.1.45.7:sqlprod01
> > > Destination: adcsan1:sqlprod01_mirror
> > > Status: Pending with restart checkpoint
> > > Progress: 38376 KB
> > > State: Unknown
> > > Lag: -
> > > Mirror Timestamp: -
> > > Base Snapshot: -
> > > Current Transfer Type: Retry
> > > Current Transfer Error: volume is not online; cannot execute operation
> > > Contents: -
> > > Last Transfer Type: -
> > > Last Transfer Size: -
> > > Last Transfer Duration: -
> > > Last Transfer From: -
> > >
> > > Our firewall rules have been relaxed to allow free-flow between these
> > > devices (instead of just the SnapMirror ports) and the routers and
> > > circuit haven't changed at all between it working fine and not working
> > > now. The volume that is mirroring OK seems fine and still syncs fine -
> > > granted the updates are small whereas the three non-working volumes
> > > have to sync quite a lot of data.
> > >
> > > I've tried deleting the mirrored volumes, recreating them, setting up
> > > the mirror relationship again (with a variety of scheduling and
> > > bandwidth throttling options) and doing a destination SAN reboot.
> > >
> > > What are the best options for troubleshooting this or ensuring a
> > > successful mirror ? Has anyone had issues with dropped or stalled
> > > SnapMirror baseline transfers via an IPSec tunnel or Firewall ?
> > >
> > > Thanks in advance,
> > > Raj.
> > >
> > > PS As an addendum it looks like it starts a transfer, stalls and from
> > > then on subsequent mirrors fail because it's not online (ie the
> > > initialisation fails ?)
> > >
> > > What I don't understand is why it just can't carry on with the
> > > initialisation regardless of the interruption by resuming the mirror
> > > operation ?
> >
>
>


phigmov at gmail

May 4, 2008, 1:40 PM

Post #5 of 8 (518 views)
Permalink
Re: Oddball SnapMirror issue - Status: Pending with restart checkpoint [In reply to]

Bill Holland pointed me to this link which might be of use to you

http://now.netapp.com/NOW/knowledge/docs/ontap/rel724/html/ontap/onlinebk/4mirror3.htm

In my case I've staggered the mirrors several hours apart so they
shouldn't kick off simultaneously - I was actually reasonably surprised
(I guess I shouldn't have been) that there was a limit at all.

The other thread mentioned running a wafl_iron type command to check
the source - is there anything else on the source that could affect
establishing a new mirror ? Old snaps ? Old mirrors ? Snap schedules
etc ?

Don't suppose anyone has a definitive way of re-establishing a mirror
over a suspect connection (surely if I throttle the bandwidth it
should just take its time to establish a baseline) ?
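
On the throttling idea, a back-of-the-envelope estimate of how long a
throttled baseline would take can at least set expectations. This is a
minimal sketch: the volume size and throttle values are made-up examples
(the sizes of the stuck volumes are not given in this thread), and
baseline_hours is a hypothetical helper, not a filer command.

```python
def baseline_hours(volume_bytes: int, throttle_kib_per_sec: int) -> float:
    """Hours needed to move volume_bytes at a fixed throttle in KiB/s."""
    if throttle_kib_per_sec <= 0:
        raise ValueError("throttle must be a positive number of KiB/s")
    seconds = volume_bytes / (throttle_kib_per_sec * 1024)
    return seconds / 3600


GIB = 1024 ** 3
EXAMPLE_VOLUME = 100 * GIB  # assumed size, for illustration only

# A 10 Mbit/s pipe carries at most roughly 1220 KiB/s before any overhead,
# so the throttles below are all within what such a link could sustain.
for throttle in (1000, 500, 250):
    hours = baseline_hours(EXAMPLE_VOLUME, throttle)
    print(f"{throttle} KiB/s -> about {hours:.1f} hours")
```

The point of the estimate is only that a baseline over a slow link can run
for a day or more, so any firewall or tunnel idle/session timeout shorter
than that is worth ruling out.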

Cheers,
Raj.

On Mon, May 5, 2008 at 7:11 AM, Mike Partyka <mpartyka[at]acmn.com> wrote:
>
>
>
>
> Yeah, I was thinking the same thing, a packet trace, but I am waiting for
> support to come to the same conclusion. After the upgrade yesterday morning
> I decided I was stumped and opened a ticket this morning. They are
> currently looking into the problem. Hopefully I'll hear back today sometime
> and I will share with the list what the eventual resolution is.
>
>
>
> Regards
>
> Mike
>
>
>
>
>
> From: Kenneth Heal [mailto:kheal[at]hotmail.com]
> Sent: Sunday, May 04, 2008 2:07 PM
>
>
> To: Mike Partyka; tmacmd[at]gmail.com; owner-toasters[at]mathworks.com; Raj
> Patel; NetApp Toasters List
> Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
> checkpoint
>
>
>
>
>
> Hi Mike,
>
> Thx for the quick reply. That does indeed shoot my theory/hope out the
> water. And I am inclined to agree that going lower on the window size is
> not likely to help, especially as both your boxes are in the same datacentre
> without any nasty firewalls or WAN links in between them. This is also the
> window size recommended in the kb for such problems.
>
>
> At this I would be inclined to take a packet trace, fire off ASUPs, open a
> support case and upload a gzipped copy of the pktt trace. Have to give
> myself beat on this one... though I would be keen to know what the eventual
> resolution is.
>
> cheers, Kenneth
>
> https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb17202
> ________________________________
>
>
> Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
> checkpoint
> Date: Sun, 4 May 2008 13:56:45 -0500
> From: mpartyka[at]acmn.com
> To: kheal[at]hotmail.com; tmacmd[at]gmail.com; owner-toasters[at]mathworks.com;
> phigmov[at]gmail.com; toasters[at]mathworks.com
>
>
> After failing to get the initialization going on the 270 and 3050 (running
> 7.0.5 and 7.0.6 respectively) yesterday morning we upgraded both the filers
> (src and dst) to 7.2.4. I immediately after tried the mirror again but no
> dice the error occurs around the same place/time in the initialization.
>
>
>
> I did miss the following error in the /etc/messages file:
>
>
>
> Sat May 3 11:51:23 CDT [worker_thread_98:notice]: snapmirror: Message from
> Read Socket : Connection
>
> Sat May 3 11:51:23 CDT [snapmirror.dst.err:error]: SnapMirror destination
> transfer from 10.0.10.238data : snapmirror transfer failed to complete.
>
> Sat May 3 11:51:24 CDT [snapmirror.dst.err:error]: SnapMirror destination
> transfer from 10.0.10.238data : snapmirror transfer failed to complete.
>
>
>
> I understand this might mean the snapmirror.window_size is too large but
> it's set 32768 which is pretty small already. Usually you increase this
> value to increase performance but I don't think I want to go much smaller
> than this.
>
>
>
>
>
> From: Kenneth Heal [mailto:kheal[at]hotmail.com]
> Sent: Sunday, May 04, 2008 1:48 PM
> To: Mike Partyka; tmacmd[at]gmail.com; owner-toasters[at]mathworks.com; Raj
> Patel; NetApp Toasters List
> Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
> checkpoint
>
>
>
> Hi all
>
> I don't see a bug which is a precise match to this, but I do see that both
> scenarios were using 7.0.x releases, and I see a fair few SnapMirror bugs
> have been fixed in 7.2.4; so I am wondering if in either of the scenarios it
> is possible to move both filers to 7.2.4 (I semi-fear it isn't especially
> for the source filers concerned) and/or if anyone has seen this on a 7.2.x
> release.
>
> cheers
> Kenneth
>
>
>
> http://now.netapp.com/NOW/cgi-bin/relcmp.on?&rrel=7.0.6&rrel=7.2.4&what=fix
> ________________________________
>
>
> > Subject: RE: Oddball SnapMirror issue
> > Date: Sun, 4 May 2008 13:24:05 -0500
> > From: mpartyka[at]acmn.com
> > To: tmacmd[at]gmail.com; owner-toasters[at]mathworks.com; phigmov[at]gmail.com;
> toasters[at]mathworks.com
> >
> > Is there any reason to prefer wafliron over WAFL_check? Sounds like they
> > do the same thing but you have the option to only check not
> > automatically fix with WAFL_check.
> >
> > -Mike
> >
> > -----Original Message-----
> > From: tmacmd[at]gmail.com [mailto:tmacmd[at]gmail.com]
> > Sent: Sunday, May 04, 2008 12:59 PM
> > To: Mike Partyka; owner-toasters[at]mathworks.com; Raj Patel; NetApp
> > Toasters List
> > Subject: Re: Oddball SnapMirror issue
> >
> > I would try a wafl iron on the source volume/aggr
> >
> > Just because you do not see any filesystem problems, does not mean ther
> > are not any.
> >
> > --tmac
> >
> > Sent from my Verizon Wireless BlackBerry
> >
> > -----Original Message-----
> > From: "Mike Partyka" <mpartyka[at]acmn.com>
> >
> > Date: Sun, 4 May 2008 09:28:18
> > To:"Raj Patel" <phigmov[at]gmail.com>, <toasters[at]mathworks.com>
> > Subject: RE: Oddball SnapMirror issue
> >
> >
> > I'm having a similar experience trying to setup a Snapmirror between a
> > pair of filers in the same datacenter (Not separated by a firewall). The
> > source is a 3050 running DOT 7.0.5 and the destination is a 270 running
> > 7.0.6. The volume is a 420G volume serving unstructured CIFS data. When
> > I start the initialize everything works fine until it gets to about 82
> > or 83G, then the initialize aborts. The log contains some very
> > non-specific messages, here is the current snapmirror log:
> >
> > sys Sat May 3 09:12:55 CDT SnapMirror_off (shutdown)
> > log Sat May 3 09:15:31 CDT FILER_REBOOTED
> > sys Sat May 3 09:15:34 CDT SnapMirror_on (registry)
> > dst Sat May 3 10:09:36 CDT 10.0.10.238:data hci2:rcv_data Request
> > (Initialize)
> > dst Sat May 3 10:09:42 CDT 10.0.10.238:data hci2:rcv_data Start
> > dst Sat May 3 11:51:24 CDT 10.0.10.238:data hci2:rcv_data Abort
> > (snapmirror transfer failed to complete)
> >
> > Just as the Raj says when it fails to initialize the destination volume
> > is in limbo, you can't online it due to the failed initialize. Here is
> > the error:
> >
> > vol online: Volume 'rcv_data' was left in an inconsistent state by an
> > aborted vol copy or an aborted snapmirror initial (level 0) transfer.
> > In order to bring it online, you must either destroy and re-create
> > the volume, or complete an initial snapmirror transfer or vol copy.
> >
> > I have considered running WAFL_check but WAFL isn't reporting an
> > inconsistent state so i'm not sure that would be very effective.
> > Yesterday I upgraded both filers to DOT 7.2.4 and updated all firmware
> > then retried with the exact same results.
> >
> > The only thing I can think of doing now is running a packet capture on
> > the filer while it runs and see what that tells me.
> >
> > -Mike
> >
> > -----Original Message-----
> > From: owner-toasters[at]mathworks.com [mailto:owner-toasters[at]mathworks.com]
> > On Behalf Of Raj Patel
> > Sent: Sunday, May 04, 2008 1:29 AM
> > To: George T Chen
> > Cc: toasters[at]mathworks.com
> > Subject: Re: Oddball SnapMirror issue
> >
> > Hi George,
> >
> > The working transfers do just update 10 to 20Mb - very small turnover.
> >
> > Unfortunately the two I need to mirror are from scratch - no baseline
> > snapshot. The checkpoint restart occurring during the initialisation
> > phase. Once the initialisation phase stalls further updates fail as
> > the volume is not online (obviusly because the init failed).
> >
> > I tried setting a once-a-day schedule at a particular time so it
> > wouldn't trip over itself or other snapmirror operations to no avail.
> >
> > As other volumes are updating with small update it made me wonder if
> > it wasn't the router ipsec tunnel or firewall prematurely closing a
> > connection for a large baseline transfer.
> >
> > I'll attach the log & config when I get back into work.
> >
> > Cheers,
> > Raj.
> >
> > On Sun, May 4, 2008 at 4:36 PM, George T Chen <gtchen[at]yahoo-inc.com>
> > wrote:
> > > Since you have one volume already transferring, then there's no
> > network
> > > or firewall issue--any problem at that level would affect all
> > volumes,
> > > not just a few.
> > >
> > > A "Pending with restart checkpoint" appears you abort an ongoing
> > > transfer. Checkpoint occur every ?? megabytes and gives Ontap a
> > place
> > > to restart instead of from scratch. It's hard to debug without more
> > > info, but I would start by:
> > >
> > > 1) doing a snapmirror break on the volume (not just an abort)
> > > 2) verify that there is a common baseline snapshot on both source and
> > > destination
> > > 3) restart with a snapmirror resync command
> > >
> > > Depending on step 2, you may be required to go to a snapmirror
> > > initialize.
> > >
> > > What do the /etc/log/snapmirror and /etc/messages file say?
> > >
> > > -gtchen
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: owner-toasters[at]mathworks.com
> > > [mailto:owner-toasters[at]mathworks.com]
> > > > On Behalf Of Raj Patel
> > > > Sent: Saturday, May 03, 2008 2:00 AM
> > > > To: toasters[at]mathworks.com
> > > > Subject: Oddball SnapMirror issue
> > > >
> > > > We've got two FAS 270's in different cities. They're connected by a
> > > > 10mb pipe with routers (running ipsec) & firewalls (checkpoint
> > splat)
> > > > seperating each datacenter.
> > > >
> > > > The primary san is fine and runs all our prod volumes (7.0.5) which
> > > > are mirrored to our secondary san (7.0.6).
> > > >
> > > > Recently I had to recreate the mirror relationship for some volumes
> > as
> > > > they'd fallen far out of sync due to some firewall work.
> > > >
> > > > What I am seeing is one volume is syncing fine, one has a small lag
> > > > and two are stuck with a status of 'Pending with restart
> > checkpoint'
> > > > after I re-initialised the transfer.
> > > >
> > > > snapmirror status -l shows this for one of the two that just don't
> > get
> > > > properly initialised
> > > >
> > > > Source: 10.1.45.7:sqlprod01
> > > > Destination: adcsan1:sqlprod01_mirror
> > > > Status: Pending with restart checkpoint
> > > > Progress: 38376 KB
> > > > State: Unknown
> > > > Lag: -
> > > > Mirror Timestamp: -
> > > > Base Snapshot: -
> > > > Current Transfer Type: Retry
> > > > Current Transfer Error: volume is not online; cannot execute
> > operation
> > > > Contents: -
> > > > Last Transfer Type: -
> > > > Last Transfer Size: -
> > > > Last Transfer Duration: -
> > > > Last Transfer From: -
> > > >
> > > > Our firewalls rules have been relaxed to allow free-flow between
> > these
> > > > devices (instead of just the SnapMirror ports) and the routers and
> > > > circuit haven't changed at all between it working fine and not
> > working
> > > > now. The volume that is mirroring OK seems fine and still syncs
> > fine -
> > > > granted the updates are small whereas the three non-working volumes
> > > > have to sync quite a lot of data.
> > > >
> > > > I've tried deleting the mirrored volumes, recreating them, setting
> > up
> > > > the mirror relationship again (with a variety of scheduling and
> > > > bandwidth throttling options) and doing a destination SAN reboot.
> > > >
> > > > What are the best options to troubleshoot this or insuring a
> > > > successful mirror ? Has anyone had issues with dropped or stalled
> > > > SnapMirror baseline transfers via an IPSec tunnel or Firewall ?
> > > >
> > > > Thanks in advance,
> > > > Raj.
> > > >
> > > > PS As an addendum it looks like it starts a transfer, stalls and
> > from
> > > > then on subsequent mirrors fail because its not online (ie the
> > > > initialisation fails ?)
> > > >
> > > > What I don't understand is why it just can't carry on with the
> > > > initialisation regardless of the interruption by resuming the
> > mirror
> > > > operation ?
> > >
> >
> >


mpartyka at acmn

May 4, 2008, 1:58 PM

Post #6 of 8 (514 views)
Permalink
RE: Oddball SnapMirror issue - Status: Pending with restart checkpoint [In reply to]

Thanks for the link. I was aware of the simultaneous snapmirror limits,
mainly because we use snapmirror to migrate data and sometimes there are
more volumes than the simultaneous snapmirror limit of the controller.
Occasionally we use the nearstore_personality license, which pushes up
the maximum simultaneous data streams on a controller and so allows you
to do more simultaneous snapmirrors. That way we can just fire off all
the migration snapmirrors from a script and not worry about staggering
them.
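
Without that license, a script could still submit the migrations in groups
no larger than the limit. A minimal sketch, assuming a hypothetical limit of
4 and made-up volume names; the real limit depends on the platform and
release, and the print call stands in for whatever actually starts a
transfer.

```python
from typing import Iterator, List, Sequence


def batches(volumes: Sequence[str], max_concurrent: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most max_concurrent volumes."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    for start in range(0, len(volumes), max_concurrent):
        yield list(volumes[start:start + max_concurrent])


# Hypothetical values: neither the limit nor the names come from this thread.
ASSUMED_LIMIT = 4
volumes_to_migrate = [f"vol{n}" for n in range(1, 11)]

for number, group in enumerate(batches(volumes_to_migrate, ASSUMED_LIMIT), 1):
    # A real script would start these transfers and wait for them to finish
    # before moving on to the next group.
    print(f"batch {number}: {', '.join(group)}")
```

Grouping this way trades some wall-clock time for never asking the
controller for more simultaneous transfers than it will accept.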

My setup is really simple; even the source 3050 doesn't have more
than three flexvols on it. The 270 has nothing but a vol0. The source
volume has only about 8 scheduled snapshots on it, nothing exotic like
ndmp backups or old snapmirror snapshots. The snap schedule is: Volume
data: 1 7 0

Good things to check certainly but I don't see anything obvious that
should be tripping me up.

It seems either a WAFL_check and/or a packet trace are the next logical
steps.

Thanks again for everyone's input.

-----Original Message-----
From: Raj Patel [mailto:phigmov[at]gmail.com]
Sent: Sunday, May 04, 2008 3:40 PM
To: Mike Partyka
Cc: Kenneth Heal; tmacmd[at]gmail.com; owner-toasters[at]mathworks.com; NetApp
Toasters List
Subject: Re: Oddball SnapMirror issue - Status: Pending with restart
checkpoint

Bill Holland pointed me to this link which might be of use to you

http://now.netapp.com/NOW/knowledge/docs/ontap/rel724/html/ontap/onlinebk/4mirror3.htm

In my case I've staggered the mirror several hours apart so they
shouldn't kick off simultaneously - I was actually reasonably suprised
(I guess I shouldn't have been) that there was a limit at all.

The other thread mentioned running a wafl_iron type command to check
the source - is there anything else on the source that could affect
establishing a new mirror ? Old snaps ? Old mirrors ? Snap schedules
etc ?

Don't suppose anyone has a definitive way of re-establishing a mirror
over a suspect connection (surely if I throttle the bandwidth it
should just take its time to establish a baseline) ?

Cheers,
Raj.

On Mon, May 5, 2008 at 7:11 AM, Mike Partyka <mpartyka[at]acmn.com> wrote:
>
>
>
>
> Yeah, I was thinking the same thing, a packet trace, but I am waiting for
> support to come to the same conclusion. After the upgrade yesterday morning
> I decided I was stumped and opened a ticket this morning. They are
> currently looking into the problem. Hopefully I'll hear back today sometime
> and I will share with the list what the eventual resolution is.
>
>
>
> Regards
>
> Mike
>
>
>
>
>
> From: Kenneth Heal [mailto:kheal[at]hotmail.com]
> Sent: Sunday, May 04, 2008 2:07 PM
>
>
> To: Mike Partyka; tmacmd[at]gmail.com; owner-toasters[at]mathworks.com; Raj
> Patel; NetApp Toasters List
> Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
> checkpoint
>
>
>
>
>
> Hi Mike,
>
> Thx for the quick reply. That does indeed shoot my theory/hope out
the
> water. And I am inclined to agree that going lower on the window size
is
> not likely to help, especially as both your boxes are in the same
datacentre
> without any nasty firewalls or WAN links in between them. This is
also the
> window size recommended in the kb for such problems.
>
>
> At this I would be inclined to take a packet trace, fire off ASUPs,
open a
> support case and upload a gzipped copy of the pktt trace. Have to
give
> myself beat on this one... though I would be keen to know what the
eventual
> resolution is.
>
> cheers, Kenneth
>
> https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb17202
> ________________________________
>
>
> Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
> checkpoint
> Date: Sun, 4 May 2008 13:56:45 -0500
> From: mpartyka[at]acmn.com
> To: kheal[at]hotmail.com; tmacmd[at]gmail.com;
owner-toasters[at]mathworks.com;
> phigmov[at]gmail.com; toasters[at]mathworks.com
>
>
> After failing to get the initialization going on the 270 and 3050
(running
> 7.0.5 and 7.0.6 respectively) yesterday morning we upgraded both the
filers
> (src and dst) to 7.2.4. I immediately after tried the mirror again but
no
> dice the error occurs around the same place/time in the
initialization.
>
>
>
> I did miss the following error in the /etc/messages file:
>
>
>
> Sat May 3 11:51:23 CDT [worker_thread_98:notice]: snapmirror: Message
from
> Read Socket : Connection
>
> Sat May 3 11:51:23 CDT [snapmirror.dst.err:error]: SnapMirror
destination
> transfer from 10.0.10.238data : snapmirror transfer failed to
complete.
>
> Sat May 3 11:51:24 CDT [snapmirror.dst.err:error]: SnapMirror
destination
> transfer from 10.0.10.238data : snapmirror transfer failed to
complete.
>
>
>
> I understand this might mean the snapmirror.window_size is too large
but
> it's set 32768 which is pretty small already. Usually you increase
this
> value to increase performance but I don't think I want to go much
smaller
> than this.
>
>
>
>
>
> From: Kenneth Heal [mailto:kheal[at]hotmail.com]
> Sent: Sunday, May 04, 2008 1:48 PM
> To: Mike Partyka; tmacmd[at]gmail.com; owner-toasters[at]mathworks.com; Raj
> Patel; NetApp Toasters List
> Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
> checkpoint
>
>
>
> Hi all
>
> I don't see a bug which is a precise match to this, but I do see that
both
> scenarios were using 7.0.x releases, and I see a fair few SnapMirror
bugs
> have been fixed in 7.2.4; so I am wondering if in either of the
scenarios it
> is possible to move both filers to 7.2.4 (I semi-fear it isn't
especially
> for the source filers concerned) and/or if anyone has seen this on a
7.2.x
> release.
>
> cheers
> Kenneth
>
>
>
> http://now.netapp.com/NOW/cgi-bin/relcmp.on?&rrel=7.0.6&rrel=7.2.4&what=fix
> ________________________________
>
>
> > Subject: RE: Oddball SnapMirror issue
> > Date: Sun, 4 May 2008 13:24:05 -0500
> > From: mpartyka[at]acmn.com
> > To: tmacmd[at]gmail.com; owner-toasters[at]mathworks.com;
phigmov[at]gmail.com;
> toasters[at]mathworks.com
> >
> > Is there any reason to prefer wafliron over WAFL_check? Sounds like
they
> > do the same thing but you have the option to only check not
> > automatically fix with WAFL_check.
> >
> > -Mike
> >
> > -----Original Message-----
> > From: tmacmd[at]gmail.com [mailto:tmacmd[at]gmail.com]
> > Sent: Sunday, May 04, 2008 12:59 PM
> > To: Mike Partyka; owner-toasters[at]mathworks.com; Raj Patel; NetApp
> > Toasters List
> > Subject: Re: Oddball SnapMirror issue
> >
> > I would try a wafl iron on the source volume/aggr
> >
> > Just because you do not see any filesystem problems, does not mean
ther
> > are not any.
> >
> > --tmac
> >
> > Sent from my Verizon Wireless BlackBerry
> >
> > -----Original Message-----
> > From: "Mike Partyka" <mpartyka[at]acmn.com>
> >
> > Date: Sun, 4 May 2008 09:28:18
> > To:"Raj Patel" <phigmov[at]gmail.com>, <toasters[at]mathworks.com>
> > Subject: RE: Oddball SnapMirror issue
> >
> >
> > I'm having a similar experience trying to setup a Snapmirror
between a
> > pair of filers in the same datacenter (Not separated by a
firewall). The
> > source is a 3050 running DOT 7.0.5 and the destination is a 270
running
> > 7.0.6. The volume is a 420G volume serving unstructured CIFS data.
When
> > I start the initialize everything works fine until it gets to about
82
> > or 83G, then the initialize aborts. The log contains some very
> > non-specific messages, here is the current snapmirror log:
> >
> > sys Sat May 3 09:12:55 CDT SnapMirror_off (shutdown)
> > log Sat May 3 09:15:31 CDT FILER_REBOOTED
> > sys Sat May 3 09:15:34 CDT SnapMirror_on (registry)
> > dst Sat May 3 10:09:36 CDT 10.0.10.238:data hci2:rcv_data Request
> > (Initialize)
> > dst Sat May 3 10:09:42 CDT 10.0.10.238:data hci2:rcv_data Start
> > dst Sat May 3 11:51:24 CDT 10.0.10.238:data hci2:rcv_data Abort
> > (snapmirror transfer failed to complete)
> >
> > Just as the Raj says when it fails to initialize the destination
volume
> > is in limbo, you can't online it due to the failed initialize. Here
is
> > the error:
> >
> > vol online: Volume 'rcv_data' was left in an inconsistent state by
an
> > aborted vol copy or an aborted snapmirror initial (level 0)
transfer.
> > In order to bring it online, you must either destroy and re-create
> > the volume, or complete an initial snapmirror transfer or vol copy.
> >
> > I have considered running WAFL_check but WAFL isn't reporting an
> > inconsistent state so i'm not sure that would be very effective.
> > Yesterday I upgraded both filers to DOT 7.2.4 and updated all
firmware
> > then retried with the exact same results.
> >
> > The only thing I can think of doing now is running a packet capture
on
> > the filer while it runs and see what that tells me.
> >
> > -Mike
> >
> > -----Original Message-----
> > From: owner-toasters[at]mathworks.com
[mailto:owner-toasters[at]mathworks.com]
> > On Behalf Of Raj Patel
> > Sent: Sunday, May 04, 2008 1:29 AM
> > To: George T Chen
> > Cc: toasters[at]mathworks.com
> > Subject: Re: Oddball SnapMirror issue
> >
> > Hi George,
> >
> > The working transfers do just update 10 to 20Mb - very small
turnover.
> >
> > Unfo