
phigmov at gmail
May 4, 2008, 1:40 PM
Post #5 of 8
Re: Oddball SnapMirror issue - Status: Pending with restart checkpoint
Bill Holland pointed me to this link, which might be of use to you:

http://now.netapp.com/NOW/knowledge/docs/ontap/rel724/html/ontap/onlinebk/4mirror3.htm

In my case I've staggered the mirrors several hours apart so they shouldn't kick off simultaneously - I was actually reasonably surprised (I guess I shouldn't have been) that there was a limit at all.

The other thread mentioned running a wafl_iron type command to check the source - is there anything else on the source that could affect establishing a new mirror? Old snaps? Old mirrors? Snap schedules etc?

Don't suppose anyone has a definitive way of re-establishing a mirror over a suspect connection (surely if I throttle the bandwidth it should just take its time to establish a baseline)?

Cheers,
Raj.

On Mon, May 5, 2008 at 7:11 AM, Mike Partyka <mpartyka[at]acmn.com> wrote:
>
> Yeah, I was thinking the same thing, a packet trace, but I am waiting for
> support to come to the same conclusion. After the upgrade yesterday morning
> I decided I was stumped and opened a ticket this morning. They are
> currently looking into the problem. Hopefully I'll hear back today sometime
> and I will share with the list what the eventual resolution is.
>
> Regards
>
> Mike
>
> From: Kenneth Heal [mailto:kheal[at]hotmail.com]
> Sent: Sunday, May 04, 2008 2:07 PM
> To: Mike Partyka; tmacmd[at]gmail.com; owner-toasters[at]mathworks.com; Raj
> Patel; NetApp Toasters List
> Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
> checkpoint
>
> Hi Mike,
>
> Thx for the quick reply. That does indeed shoot my theory/hope out of the
> water. And I am inclined to agree that going lower on the window size is
> not likely to help, especially as both your boxes are in the same datacentre
> without any nasty firewalls or WAN links in between them. This is also the
> window size recommended in the kb for such problems.
>
> At this point I would be inclined to take a packet trace, fire off ASUPs,
> open a support case and upload a gzipped copy of the pktt trace. Have to
> admit I'm beat on this one... though I would be keen to know what the
> eventual resolution is.
>
> cheers,
> Kenneth
>
> https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb17202
> ________________________________
>
> Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
> checkpoint
> Date: Sun, 4 May 2008 13:56:45 -0500
> From: mpartyka[at]acmn.com
> To: kheal[at]hotmail.com; tmacmd[at]gmail.com; owner-toasters[at]mathworks.com;
> phigmov[at]gmail.com; toasters[at]mathworks.com
>
> After failing to get the initialization going on the 270 and 3050 (running
> 7.0.5 and 7.0.6 respectively) yesterday morning, we upgraded both filers
> (src and dst) to 7.2.4. Immediately afterwards I tried the mirror again, but
> no dice: the error occurs around the same place/time in the initialization.
>
> I did miss the following error in the /etc/messages file:
>
> Sat May 3 11:51:23 CDT [worker_thread_98:notice]: snapmirror: Message from Read Socket : Connection
> Sat May 3 11:51:23 CDT [snapmirror.dst.err:error]: SnapMirror destination transfer from 10.0.10.238data : snapmirror transfer failed to complete.
> Sat May 3 11:51:24 CDT [snapmirror.dst.err:error]: SnapMirror destination transfer from 10.0.10.238data : snapmirror transfer failed to complete.
>
> I understand this might mean the snapmirror.window_size is too large, but
> it's set to 32768, which is pretty small already. Usually you increase this
> value to increase performance, but I don't think I want to go much smaller
> than this.
>
> From: Kenneth Heal [mailto:kheal[at]hotmail.com]
> Sent: Sunday, May 04, 2008 1:48 PM
> To: Mike Partyka; tmacmd[at]gmail.com; owner-toasters[at]mathworks.com; Raj
> Patel; NetApp Toasters List
> Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
> checkpoint
>
> Hi all
>
> I don't see a bug which is a precise match to this, but I do see that both
> scenarios were using 7.0.x releases, and I see a fair few SnapMirror bugs
> have been fixed in 7.2.4; so I am wondering if in either of the scenarios it
> is possible to move both filers to 7.2.4 (I semi-fear it isn't, especially
> for the source filers concerned) and/or if anyone has seen this on a 7.2.x
> release.
>
> cheers
> Kenneth
>
> http://now.netapp.com/NOW/cgi-bin/relcmp.on?&rrel=7.0.6&rrel=7.2.4&what=fix
> ________________________________
>
> > Subject: RE: Oddball SnapMirror issue
> > Date: Sun, 4 May 2008 13:24:05 -0500
> > From: mpartyka[at]acmn.com
> > To: tmacmd[at]gmail.com; owner-toasters[at]mathworks.com; phigmov[at]gmail.com;
> > toasters[at]mathworks.com
> >
> > Is there any reason to prefer wafliron over WAFL_check? Sounds like they
> > do the same thing, but with WAFL_check you have the option to only check
> > rather than automatically fix.
> >
> > -Mike
> >
> > -----Original Message-----
> > From: tmacmd[at]gmail.com [mailto:tmacmd[at]gmail.com]
> > Sent: Sunday, May 04, 2008 12:59 PM
> > To: Mike Partyka; owner-toasters[at]mathworks.com; Raj Patel; NetApp
> > Toasters List
> > Subject: Re: Oddball SnapMirror issue
> >
> > I would try a wafl iron on the source volume/aggr
> >
> > Just because you do not see any filesystem problems does not mean there
> > are not any.
> >
> > --tmac
> >
> > Sent from my Verizon Wireless BlackBerry
> >
> > -----Original Message-----
> > From: "Mike Partyka" <mpartyka[at]acmn.com>
> > Date: Sun, 4 May 2008 09:28:18
> > To: "Raj Patel" <phigmov[at]gmail.com>, <toasters[at]mathworks.com>
> > Subject: RE: Oddball SnapMirror issue
> >
> > I'm having a similar experience trying to set up a SnapMirror between a
> > pair of filers in the same datacenter (not separated by a firewall). The
> > source is a 3050 running DOT 7.0.5 and the destination is a 270 running
> > 7.0.6. The volume is a 420G volume serving unstructured CIFS data. When
> > I start the initialize everything works fine until it gets to about 82
> > or 83G, then the initialize aborts. The log contains some very
> > non-specific messages; here is the current snapmirror log:
> >
> > sys Sat May 3 09:12:55 CDT SnapMirror_off (shutdown)
> > log Sat May 3 09:15:31 CDT FILER_REBOOTED
> > sys Sat May 3 09:15:34 CDT SnapMirror_on (registry)
> > dst Sat May 3 10:09:36 CDT 10.0.10.238:data hci2:rcv_data Request (Initialize)
> > dst Sat May 3 10:09:42 CDT 10.0.10.238:data hci2:rcv_data Start
> > dst Sat May 3 11:51:24 CDT 10.0.10.238:data hci2:rcv_data Abort (snapmirror transfer failed to complete)
> >
> > Just as Raj says, when it fails to initialize the destination volume
> > is in limbo: you can't online it due to the failed initialize. Here is
> > the error:
> >
> > vol online: Volume 'rcv_data' was left in an inconsistent state by an
> > aborted vol copy or an aborted snapmirror initial (level 0) transfer.
> > In order to bring it online, you must either destroy and re-create
> > the volume, or complete an initial snapmirror transfer or vol copy.
> >
> > I have considered running WAFL_check, but WAFL isn't reporting an
> > inconsistent state so I'm not sure that would be very effective.
> > Yesterday I upgraded both filers to DOT 7.2.4 and updated all firmware,
> > then retried with the exact same results.
> >
> > The only thing I can think of doing now is running a packet capture on
> > the filer while it runs and seeing what that tells me.
> >
> > -Mike
> >
> > -----Original Message-----
> > From: owner-toasters[at]mathworks.com [mailto:owner-toasters[at]mathworks.com]
> > On Behalf Of Raj Patel
> > Sent: Sunday, May 04, 2008 1:29 AM
> > To: George T Chen
> > Cc: toasters[at]mathworks.com
> > Subject: Re: Oddball SnapMirror issue
> >
> > Hi George,
> >
> > The working transfers do just update 10 to 20Mb - very small turnover.
> >
> > Unfortunately the two I need to mirror are from scratch - no baseline
> > snapshot. The checkpoint restart is occurring during the initialisation
> > phase. Once the initialisation phase stalls, further updates fail as
> > the volume is not online (obviously because the init failed).
> >
> > I tried setting a once-a-day schedule at a particular time so it
> > wouldn't trip over itself or other snapmirror operations, to no avail.
> >
> > As other volumes are updating with small updates, it made me wonder if
> > it wasn't the router ipsec tunnel or firewall prematurely closing a
> > connection for a large baseline transfer.
> >
> > I'll attach the log & config when I get back into work.
> >
> > Cheers,
> > Raj.
> >
> > On Sun, May 4, 2008 at 4:36 PM, George T Chen <gtchen[at]yahoo-inc.com>
> > wrote:
> > > Since you have one volume already transferring, then there's no network
> > > or firewall issue--any problem at that level would affect all volumes,
> > > not just a few.
> > >
> > > A "Pending with restart checkpoint" appears when you abort an ongoing
> > > transfer. Checkpoints occur every ?? megabytes and give Ontap a place
> > > to restart instead of from scratch.
> > > It's hard to debug without more info, but I would start by:
> > >
> > > 1) doing a snapmirror break on the volume (not just an abort)
> > > 2) verifying that there is a common baseline snapshot on both source
> > >    and destination
> > > 3) restarting with a snapmirror resync command
> > >
> > > Depending on step 2, you may be required to go to a snapmirror
> > > initialize.
> > >
> > > What do the /etc/log/snapmirror and /etc/messages files say?
> > >
> > > -gtchen
> > >
> > > > -----Original Message-----
> > > > From: owner-toasters[at]mathworks.com
> > > > [mailto:owner-toasters[at]mathworks.com]
> > > > On Behalf Of Raj Patel
> > > > Sent: Saturday, May 03, 2008 2:00 AM
> > > > To: toasters[at]mathworks.com
> > > > Subject: Oddball SnapMirror issue
> > > >
> > > > We've got two FAS 270's in different cities. They're connected by a
> > > > 10mb pipe with routers (running ipsec) & firewalls (checkpoint splat)
> > > > separating each datacenter.
> > > >
> > > > The primary san is fine and runs all our prod volumes (7.0.5), which
> > > > are mirrored to our secondary san (7.0.6).
> > > >
> > > > Recently I had to recreate the mirror relationship for some volumes as
> > > > they'd fallen far out of sync due to some firewall work.
> > > >
> > > > What I am seeing is one volume is syncing fine, one has a small lag
> > > > and two are stuck with a status of 'Pending with restart checkpoint'
> > > > after I re-initialised the transfer.
> > > >
> > > > snapmirror status -l shows this for one of the two that just don't get
> > > > properly initialised:
> > > >
> > > > Source: 10.1.45.7:sqlprod01
> > > > Destination: adcsan1:sqlprod01_mirror
> > > > Status: Pending with restart checkpoint
> > > > Progress: 38376 KB
> > > > State: Unknown
> > > > Lag: -
> > > > Mirror Timestamp: -
> > > > Base Snapshot: -
> > > > Current Transfer Type: Retry
> > > > Current Transfer Error: volume is not online; cannot execute operation
> > > > Contents: -
> > > > Last Transfer Type: -
> > > > Last Transfer Size: -
> > > > Last Transfer Duration: -
> > > > Last Transfer From: -
> > > >
> > > > Our firewall rules have been relaxed to allow free-flow between these
> > > > devices (instead of just the SnapMirror ports), and the routers and
> > > > circuit haven't changed at all between it working fine and not working
> > > > now. The volume that is mirroring OK seems fine and still syncs fine -
> > > > granted the updates are small, whereas the three non-working volumes
> > > > have to sync quite a lot of data.
> > > >
> > > > I've tried deleting the mirrored volumes, recreating them, setting up
> > > > the mirror relationship again (with a variety of scheduling and
> > > > bandwidth throttling options) and doing a destination SAN reboot.
> > > >
> > > > What are the best options for troubleshooting this or ensuring a
> > > > successful mirror? Has anyone had issues with dropped or stalled
> > > > SnapMirror baseline transfers via an IPSec tunnel or firewall?
> > > >
> > > > Thanks in advance,
> > > > Raj.
> > > >
> > > > PS As an addendum, it looks like it starts a transfer, stalls, and from
> > > > then on subsequent mirrors fail because it's not online (ie the
> > > > initialisation fails?)
> > > >
> > > > What I don't understand is why it just can't carry on with the
> > > > initialisation regardless of the interruption by resuming the mirror
> > > > operation?
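
One rough way to spot relationships in this state across many volumes is to parse the text of snapmirror status -l. The sketch below is plain Python, not anything that ships with ONTAP. It assumes the "Field: value" layout seen in the output quoted above, with each relationship starting at a "Source:" line; the names parse_status and stuck are made up for illustration.

```python
from typing import Dict, List

# Sample text copied from the status output quoted in this thread.
SAMPLE = """\
Source: 10.1.45.7:sqlprod01
Destination: adcsan1:sqlprod01_mirror
Status: Pending with restart checkpoint
Progress: 38376 KB
State: Unknown
Lag: -
Current Transfer Type: Retry
Current Transfer Error: volume is not online; cannot execute operation
"""


def parse_status(text: str) -> List[Dict[str, str]]:
    """Split 'Field: value' text into one dict per relationship.

    Assumption: every relationship block begins with a 'Source:' line.
    """
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only, so '10.1.45.7:sqlprod01' stays whole.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Source" and current:
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


def stuck(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return relationships that are pending and also report a transfer error."""
    return [
        r for r in records
        if r.get("Status", "").startswith("Pending")
        and r.get("Current Transfer Error", "-") != "-"
    ]


if __name__ == "__main__":
    for rel in stuck(parse_status(SAMPLE)):
        print(rel["Destination"], "->", rel["Current Transfer Error"])
```

Feeding it the saved output from each destination filer would give a quick list of which mirrors never finished their baseline.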
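
Since Mike's initialize dies at about the same point each time, it may help to work out how long each attempt ran before the abort. The sketch below is plain Python and only illustrative: it assumes the column layout of the /etc/log/snapmirror lines quoted earlier in the thread (type, weekday, month, day, time, zone, source, destination, action), and that Start and Abort fall in the same year, because the log lines carry no year. The function names are hypothetical.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# Log lines copied from the snapmirror log quoted in this thread.
LOG = """\
dst Sat May 3 10:09:36 CDT 10.0.10.238:data hci2:rcv_data Request (Initialize)
dst Sat May 3 10:09:42 CDT 10.0.10.238:data hci2:rcv_data Start
dst Sat May 3 11:51:24 CDT 10.0.10.238:data hci2:rcv_data Abort (snapmirror transfer failed to complete)
"""


def parse_events(text: str) -> List[Tuple[datetime, str]]:
    """Return (timestamp, action) pairs for destination-side log entries."""
    events = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 9 or parts[0] != "dst":
            continue
        # Month, day and time only; weekday and time zone are ignored.
        stamp = datetime.strptime(" ".join(parts[2:5]), "%b %d %H:%M:%S")
        events.append((stamp, parts[8]))
    return events


def seconds_until_abort(events: List[Tuple[datetime, str]]) -> Optional[float]:
    """Seconds between the most recent Start and the first Abort after it."""
    start = None
    for stamp, action in events:
        if action == "Start":
            start = stamp
        elif action == "Abort" and start is not None:
            return (stamp - start).total_seconds()
    return None


if __name__ == "__main__":
    elapsed = seconds_until_abort(parse_events(LOG))
    print("seconds before abort:", elapsed)
    # Assumption: roughly 82 GB had been transferred, per Mike's description.
    if elapsed:
        print("approx MB/s:", round(82 * 1024 / elapsed, 1))
```

Comparing the elapsed time and the amount transferred across several attempts would show whether the failure tracks a fixed amount of data or a fixed amount of time, which points at the source volume in the first case and at the network path in the second.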