
Mailing List Archive: Netapp: toasters

Oddball SnapMirror issue

 

 


phigmov at gmail

May 3, 2008, 1:59 AM

Post #1 of 10 (438 views)
Oddball SnapMirror issue

We've got two FAS270s in different cities. They're connected by a
10Mb pipe with routers (running IPsec) and firewalls (Check Point SPLAT)
separating each datacenter.

The primary san is fine and runs all our prod volumes (7.0.5) which
are mirrored to our secondary san (7.0.6).

Recently I had to recreate the mirror relationship for some volumes as
they'd fallen far out of sync due to some firewall work.

What I am seeing is one volume is syncing fine, one has a small lag
and two are stuck with a status of 'Pending with restart checkpoint'
after I re-initialised the transfer.

snapmirror status -l shows this for one of the two that just don't get
properly initialised

Source: 10.1.45.7:sqlprod01
Destination: adcsan1:sqlprod01_mirror
Status: Pending with restart checkpoint
Progress: 38376 KB
State: Unknown
Lag: -
Mirror Timestamp: -
Base Snapshot: -
Current Transfer Type: Retry
Current Transfer Error: volume is not online; cannot execute operation
Contents: -
Last Transfer Type: -
Last Transfer Size: -
Last Transfer Duration: -
Last Transfer From: -
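
A status block like the one above can be checked mechanically when several relationships need watching. The following is a minimal Python sketch, assuming the output of snapmirror status -l has been captured as text; the function names are hypothetical, and treating "State: Unknown" together with an empty base snapshot as "baseline never completed" is a heuristic for illustration, not NetApp's definition.

```python
def parse_status(text):
    """Split 'Key: value' lines from a captured status block into a dict."""
    fields = {}
    for line in text.splitlines():
        # Partition on the first colon only, so 'Source: 10.1.45.7:sqlprod01'
        # keeps the host:volume pair intact in the value.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def looks_uninitialised(fields):
    """Heuristic (an assumption): no base snapshot and an Unknown state."""
    return fields.get("Base Snapshot") == "-" and fields.get("State") == "Unknown"


SAMPLE = """\
Source: 10.1.45.7:sqlprod01
Destination: adcsan1:sqlprod01_mirror
Status: Pending with restart checkpoint
Progress: 38376 KB
State: Unknown
Lag: -
Base Snapshot: -
Current Transfer Type: Retry
Current Transfer Error: volume is not online; cannot execute operation
"""

if __name__ == "__main__":
    status = parse_status(SAMPLE)
    print(status["Destination"], "looks uninitialised:", looks_uninitialised(status))
```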

Our firewall rules have been relaxed to allow free flow between these
devices (instead of just the SnapMirror ports), and the routers and
circuit haven't changed at all between it working fine and not working
now. The volume that is mirroring OK still syncs fine -
granted, its updates are small, whereas the three non-working volumes
have to sync quite a lot of data.

I've tried deleting the mirrored volumes, recreating them, setting up
the mirror relationship again (with a variety of scheduling and
bandwidth throttling options) and rebooting the destination SAN.

What are the best options for troubleshooting this or ensuring a
successful mirror? Has anyone had issues with dropped or stalled
SnapMirror baseline transfers via an IPsec tunnel or firewall?

Thanks in advance,
Raj.

PS As an addendum, it looks like it starts a transfer, stalls, and from
then on subsequent mirrors fail because it's not online (i.e. the
initialisation fails?).

What I don't understand is why it can't just carry on with the
initialisation regardless of the interruption by resuming the mirror
operation.


gtchen at yahoo-inc

May 3, 2008, 9:36 PM

Post #2 of 10 (416 views)
RE: Oddball SnapMirror issue [In reply to]

Since you have one volume already transferring, then there's no network
or firewall issue--any problem at that level would affect all volumes,
not just a few.

A "Pending with restart checkpoint" status appears when you abort an
ongoing transfer. Checkpoints occur every ?? megabytes and give ONTAP a
place to restart from instead of starting from scratch. It's hard to
debug without more info, but I would start by:

1) doing a snapmirror break on the volume (not just an abort)
2) verifying that there is a common baseline snapshot on both source and
destination
3) restarting with a snapmirror resync command

Depending on step 2, you may be required to go to a snapmirror
initialize.

What do the /etc/log/snapmirror and /etc/messages files say?
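
Step 2 above comes down to a set intersection: a resync is only possible if at least one snapshot exists on both sides. Here is a small Python sketch of that decision, assuming the snapshot names from each filer have already been collected into lists by hand; the function names and the example snapshot names are hypothetical.

```python
def common_snapshots(source, destination):
    """Return snapshot names present on both sides, in source order."""
    on_destination = set(destination)
    return [name for name in source if name in on_destination]


def next_step(source, destination):
    """Suggest resync when a common snapshot exists, otherwise initialize."""
    if common_snapshots(source, destination):
        return "snapmirror resync"
    return "snapmirror initialize"


if __name__ == "__main__":
    # Hypothetical snapshot names, for illustration only.
    src = ["nightly.0", "nightly.1", "base.1"]
    dst = ["base.1"]
    print(common_snapshots(src, dst), "->", next_step(src, dst))
    print(common_snapshots(src, []), "->", next_step(src, []))
```

With an empty destination list, as in a baseline that never completed, the sketch falls through to an initialize.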

-gtchen


phigmov at gmail

May 3, 2008, 11:28 PM

Post #3 of 10 (416 views)
Re: Oddball SnapMirror issue [In reply to]

Hi George,

The working transfers do just update 10 to 20Mb - very small turnover.

Unfortunately the two I need to mirror are from scratch - no baseline
snapshot. The checkpoint restart is occurring during the initialisation
phase. Once the initialisation phase stalls, further updates fail as
the volume is not online (obviously because the init failed).

I tried setting a once-a-day schedule at a particular time so it
wouldn't trip over itself or other snapmirror operations, to no avail.

As other volumes are updating with small updates, it made me wonder if
it wasn't the router IPsec tunnel or firewall prematurely closing a
connection for a large baseline transfer.

I'll attach the log & config when I get back into work.

Cheers,
Raj.



mpartyka at acmn

May 4, 2008, 7:28 AM

Post #4 of 10 (412 views)
RE: Oddball SnapMirror issue [In reply to]

I'm having a similar experience trying to set up a SnapMirror between a
pair of filers in the same datacenter (not separated by a firewall). The
source is a 3050 running DOT 7.0.5 and the destination is a 270 running
7.0.6. The volume is a 420G volume serving unstructured CIFS data. When
I start the initialize, everything works fine until it gets to about 82
or 83G, then the initialize aborts. The log contains some very
non-specific messages; here is the current snapmirror log:

sys Sat May 3 09:12:55 CDT SnapMirror_off (shutdown)
log Sat May 3 09:15:31 CDT FILER_REBOOTED
sys Sat May 3 09:15:34 CDT SnapMirror_on (registry)
dst Sat May 3 10:09:36 CDT 10.0.10.238:data hci2:rcv_data Request (Initialize)
dst Sat May 3 10:09:42 CDT 10.0.10.238:data hci2:rcv_data Start
dst Sat May 3 11:51:24 CDT 10.0.10.238:data hci2:rcv_data Abort (snapmirror transfer failed to complete)
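
The Start and Abort timestamps in the log above give the length of the failed transfer, and combined with the roughly 82G moved, an average rate. This Python sketch works through that arithmetic; the year 2008 is taken from the post date because the log lines carry none, and reading "82G" as 82 x 1024 MB is an assumption. The function names are hypothetical.

```python
from datetime import datetime

LOG_YEAR = 2008  # assumption: log lines have no year; taken from the post date


def parse_stamp(stamp, year=LOG_YEAR):
    """Parse a 'May 3 10:09:42' style stamp (weekday and zone already stripped)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def elapsed_seconds(start, end):
    """Seconds between two log stamps."""
    return (parse_stamp(end) - parse_stamp(start)).total_seconds()


def average_rate_mb_per_s(gigabytes, seconds):
    """Average rate in MB/s, treating 1G as 1024 MB (an assumption)."""
    return gigabytes * 1024 / seconds


if __name__ == "__main__":
    secs = elapsed_seconds("May 3 10:09:42", "May 3 11:51:24")
    print(f"transfer ran for {secs:.0f} s before the abort")
    print(f"average rate: {average_rate_mb_per_s(82, secs):.1f} MB/s")
```

The transfer ran for about 1 hour 42 minutes at a steady rate before aborting, which suggests it was not failing from the outset.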

Just as Raj says, when it fails to initialize, the destination volume
is left in limbo; you can't online it due to the failed initialize. Here
is the error:

vol online: Volume 'rcv_data' was left in an inconsistent state by an
aborted vol copy or an aborted snapmirror initial (level 0) transfer.
In order to bring it online, you must either destroy and re-create
the volume, or complete an initial snapmirror transfer or vol copy.

I have considered running WAFL_check, but WAFL isn't reporting an
inconsistent state so I'm not sure that would be very effective.
Yesterday I upgraded both filers to DOT 7.2.4 and updated all firmware,
then retried with the exact same results.

The only thing I can think of doing now is running a packet capture on
the filer while it runs and seeing what that tells me.

-Mike



tmacmd at gmail

May 4, 2008, 10:58 AM

Post #5 of 10 (408 views)
Re: Oddball SnapMirror issue [In reply to]

I would try a wafliron on the source volume/aggr.

Just because you do not see any filesystem problems does not mean there are not any.

--tmac



mpartyka at acmn

May 4, 2008, 11:24 AM

Post #6 of 10 (407 views)
RE: Oddball SnapMirror issue [In reply to]

Is there any reason to prefer wafliron over WAFL_check? It sounds like
they do the same thing, but with WAFL_check you have the option to only
check rather than automatically fix.

-Mike



phigmov at gmail

May 16, 2008, 4:32 PM

Post #7 of 10 (315 views)
Re: Oddball SnapMirror issue [In reply to]

Solved (kind of) -

It looks like disabling the IPSec on the router interfaces and running
the connection unencrypted between the data-centers fixed the problem.

Further analysis revealed all the mirrors had connection issues - the
ones that carried on working just had much smaller amounts of data to
update so they eventually succeeded whereas the large baseline
transfers for the new volumes just wouldn't complete.

So my question then becomes -

Is there anything in the Cisco IPsec config or NetApp ONTAP config
that can be tweaked to ensure trouble-free mirroring over a secured
link?

Cheers,
Raj.



phigmov at gmail

May 18, 2008, 12:54 PM

Post #8 of 10 (291 views)
Permalink
Re: Oddball SnapMirror issue [In reply to]

> Try backing off SnapMirror's TCP window (options snapmirror.window_size)
> to lower the effective throughput, or nail down 'kbs' options in your
> snapmirror.conf.

Yup - tried the "options snapmirror.window_size 32768" and throttling
the mirror speed down.
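
As a rough sketch of why shrinking the window lowers throughput: a single TCP connection can have at most one window of data in flight per round trip, so its ceiling is roughly window / RTT. The round-trip times below are assumptions (the real inter-site latency isn't given in this thread); the 32768-byte window and the 10 Mbit/s link are the figures mentioned above.

```python
# Rough ceiling on single-stream TCP throughput: one window per round trip.
# The RTT values are assumed for illustration; measure the real one with ping.

LINK_BITS_PER_SEC = 10_000_000   # the 10 Mbit/s inter-site link
WINDOW_BYTES = 32768             # value tried with options snapmirror.window_size


def window_limited_kbytes_per_sec(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on throughput in KB/s (1 KB = 1024 bytes)."""
    return window_bytes / (rtt_ms / 1000.0) / 1024.0


def link_kbytes_per_sec(link_bits_per_sec: int) -> float:
    """Raw link capacity in KB/s, ignoring IPsec and TCP/IP overhead."""
    return link_bits_per_sec / 8 / 1024.0


if __name__ == "__main__":
    link = link_kbytes_per_sec(LINK_BITS_PER_SEC)
    for rtt_ms in (10, 25, 50, 100):   # assumed round-trip times
        ceiling = window_limited_kbytes_per_sec(WINDOW_BYTES, rtt_ms)
        limiter = "window" if ceiling < link else "link"
        print(f"RTT {rtt_ms:>3} ms: ceiling {ceiling:7.0f} KB/s, "
              f"link {link:.0f} KB/s, limited by {limiter}")
```

The results are in kilobytes per second, the same unit the 'kbs' argument in snapmirror.conf uses, so they can be compared directly with whatever throttle is set there.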

I still don't understand why, if the connection was ropey, it didn't
just keep trying to establish the baseline rather than just giving up
and failing to initialise.

I'll have to see what our networking guys come up with on the IPsec side.

Cheers,
Raj.


Stetson.Webster at netapp

May 18, 2008, 4:25 PM

Post #9 of 10 (291 views)
Permalink
Re: Oddball SnapMirror issue [In reply to]

Volume SnapMirror or qtree SnapMirror?



phigmov at gmail

May 18, 2008, 6:33 PM

Post #10 of 10 (290 views)
Permalink
Re: Oddball SnapMirror issue [In reply to]

Volume SnapMirror.

