
Mailing List Archive: Netapp: toasters

How long can Windows survive loss of storage?

 

 



skendric at fhcrc

Jan 24, 2012, 5:55 AM

Post #1 of 8 (1816 views)
How long can Windows survive loss of storage?

Hi folks,

We have MS SQL Server (a handful of instances, riding on top of MS
Cluster Services) attached via Fibre Channel to a filer.

Whenever we perform a takeover, SQL Server crumps. The DBAs shut it
down, restart, repair/restore their databases (unhappy times), and life
returns to normal. We've replicated this behavior a handful of times:
it happens every single time we experience a takeover, whether
operator-initiated or Murphy-induced.

I'd like to think that we can do better than this.

This particular filer services five clients: the two MS SQL Servers via
Fibre Channel and three Exchange servers via iSCSI. The iSCSI clients
notice the takeover, but their initiators do the multipath thing and
ride through the event fine.

In the log extract below (heavily edited), we can see the iSCSI and FCP
services going down at 21:47:25 and coming back online at 21:47:40
(FCP) and 21:47:49 (iSCSI). The iSCSI client 'hamlet' re-establishes
its session at 21:47:51. The rest of the log extract records the
downward spiral of the Windows server 'custard-1'.

To my way of thinking, iSCSI was unavailable for 24 seconds, FCP for 15
seconds.
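
As a quick check of that arithmetic, here is a minimal Python sketch. The timestamps are copied from the log extract; the function and variable names are only illustrative.

```python
from datetime import datetime

FMT = "%H:%M:%S"

def seconds_between(start: str, end: str) -> int:
    """Return whole seconds elapsed between two same-day HH:MM:SS stamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())

# Timestamps taken from the filer log extract.
services_down = "21:47:25"  # iscsi.service.shutdown / fcp.service.shutdown
fcp_up = "21:47:40"         # fcp.service.startup
iscsi_up = "21:47:49"       # iscsi.service.startup

fcp_outage = seconds_between(services_down, fcp_up)
iscsi_outage = seconds_between(services_down, iscsi_up)
print(f"FCP unavailable for {fcp_outage} s, iSCSI for {iscsi_outage} s")
```

This only measures the gap between the logged shutdown and startup messages; whether clients can actually do I/O at the moment of the startup message is exactly what question (1) asks.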

(1) Are the iSCSI and FCP services really and truly available once the
'fcp.service.startup' and 'iscsi.service.startup' messages are logged,
or might there be a delay before storage is actually available to clients?
(2) How could we shrink takeover times?
(3) Why does MS Windows over FC have such trouble surviving a 15-second
disruption to storage?
(4) How could we increase MS Windows timeouts?
(5) How long can /your/ MS Windows FCP clients survive an interruption
to storage?

Toaster: FAS3020 running 7.2.4
SQL Server Clients: Windows 2008 running SnapDrive & SnapManager for
SQL Server
Exchange Clients: Windows 2003 running SnapDrive & SnapManager for Exchange

Log extract below

--sk

Stuart Kendrick
FHCRC

Feb 7 21:47:22 toaster-b-svif2 [toaster-b:
cf.misc.operatorTakeover:warning]: Cluster monitor: takeover initiated
by operator
Feb 7 21:47:22 toaster-b-svif2 [toaster-b:
cf.fsm.nfo.acceptTakeoverReq:warning]: Negotiated failover: accepting
takeover request by partner, reason: operator initiated cf takeover.
Asking partner to shutdown gracefully; will takeover in at most 180 seconds.
Feb 7 21:47:25 toaster-a-svif2 [toaster-a:
cf.fsm.nfo.startingGracefulShutdown:warning]: Negotiated failover:
starting graceful shutdown.
Feb 7 21:47:25 toaster-a-svif2 [toaster-a: kern.shutdown:notice]:
System shut down because : "reboot".
Feb 7 21:47:25 toaster-a-svif2 [toaster-a:
iscsi.service.shutdown:info]: iSCSI service shutdown
Feb 7 21:47:25 toaster-a-svif2 [toaster-a: fcp.service.shutdown:info]:
FCP service shutdown
Feb 7 21:47:32 toaster-b-svif2 [toaster-b: cf.fsm.firmwareStatus:info]:
Cluster monitor: partner rebooting
Feb 7 21:47:32 toaster-b-svif2 [toaster-b:
cf.fsm.nfo.partnerShutdown:warning]: Negotiated failover: partner has
shutdown
Feb 7 21:47:32 toaster-b-svif2 [toaster-b: cf.fsm.takeover.nfo:info]:
Cluster monitor: takeover attempted after 'cf takeover'. command
Feb 7 21:47:32 toaster-b-svif2 [toaster-b:
cf.fsm.stateTransit:warning]: Cluster monitor: UP --> TAKEOVER
Feb 7 21:47:32 toaster-b-svif2 [toaster-b:
cf.fm.takeoverStarted:warning]: Cluster monitor: takeover started
Feb 7 21:47:34 toaster-b-svif2 [toaster-b: cf_takeover:info]: NVRAM
takeover: partner nvram is disabled
Feb 7 21:47:39 toaster-b-svif2 [toaster-a/toaster-b:
wafl.takeover.nvram.missing:error]: WAFL takeover: no partner area found
during wafl replay
Feb 7 21:47:40 toaster-b-svif2 [toaster-a/toaster-b:
wafl.replay.done:info]: WAFL log replay completed, 0 seconds
Feb 7 21:47:40 toaster-b-svif2 [toaster-a/toaster-b:
fcp.service.startup:info]: FCP service startup

1 sys Error custard-1 2010-02-07 21:47:44 61203
ontapdsm DSM ID 0400010d has transitioned to the failed state.
1 sys Error custard-1 2010-02-07 21:47:44 61203
ontapdsm DSM ID 04000107 has transitioned to the failed state.
1 sys Error custard-1 2010-02-07 21:47:44 61203
ontapdsm DSM ID 0400010b has transitioned to the failed state.
1 sys Error custard-1 2010-02-07 21:47:44 61203
ontapdsm DSM ID 0300010b has transitioned to the failed state.
[...]
1 sys Error custard-1 2010-02-07 21:47:44 61142
ontapdsm Nexus ID 03000101 has failed.

Feb 7 21:47:48 toaster-b-svif2 [toaster-a/toaster-b:
cf.fm.takeoverDetectionSeconds.Default:warning]: option
cf.takeover.detection.seconds is set to 10 seconds which is below the
NetApp advised value of 15 seconds. False takeovers and/or takeovers
without diagnostic core-dumps might occur.
Feb 7 21:47:49 toaster-b-svif2 [toaster-a/toaster-b:
iscsi.service.startup:info]: iSCSI service startup
Feb 7 21:47:50 toaster-b-svif2 [toaster-b (takeover):
cf.rsrc.transitTime:notice]: Top Takeover transit times
wafl_restart=7400 {vdisk=7214, restarters=186}, wafl=5049,
registry_prerc=613, registry_postrc_phase2=555, rc=553 {ifconfig=99,
ifconfig=93, hostname=57, options=41, options=27, options=26, route=2,
vif=1, vif=1, vif=1}, raid=525, registry_postrc_phase1=403,
wafl_sync=241, raid_replay=221, cifs=216
Feb 7 21:47:50 toaster-b-svif2 [toaster-b (takeover):
cf.fm.takeoverComplete:warning]: Cluster monitor: takeover completed
Feb 7 21:47:50 toaster-b-svif2 [toaster-b (takeover):
cf.fm.takeoverDuration:warning]: Cluster monitor: takeover duration time
is 18 seconds
Feb 7 21:47:51 toaster-b-svif2 [toaster-a/toaster-b:
iscsi.notice:notice]: ISCSI: New session from initiator
iqn.1990-04.org.fhcrc:hamlet.fhcrc.org at IP addr 10.111.152.44
Feb 7 21:47:51 toaster-a-svif1 [toaster-a/toaster-b:
iscsi.notice:notice]: ISCSI: New session from initiator
iqn.1990-04.org.fhcrc:hamlet.fhcrc.org at IP addr 10.111.152.44

1 sys Warning custard-1 2010-02-07 21:47:57 61051
ontapdsm DSM ID 03000100 failed path verification.
1 sys Warning custard-1 2010-02-07 21:47:57 61051
ontapdsm DSM ID 03000101 failed path verification.
1 sys Warning custard-1 2010-02-07 21:47:57 61051
ontapdsm DSM ID 03000102 failed path verification.
1 sys Warning custard-1 2010-02-07 21:47:57 61077
ontapdsm DSM ID 03000111 has initiated a fail-over.
[...]
1 sys Error custard-1 2010-02-07 21:47:57 61124
ontapdsm The port servicing DSM ID 03000111 reported the logical
unit did not respond to selection.
[...]

Feb 7 21:48:01 toaster-b-svif2 [toaster-b (takeover):
monitor.globalStatus.critical:CRITICAL]: This node has taken over
toaster-a.
Feb 7 21:48:01 toaster-b-svif2 [toaster-a/toaster-b:
monitor.globalStatus.critical:CRITICAL]: toaster-b has taken over this
node.
Feb 7 21:48:01 toaster-a-svif1 [toaster-a/toaster-b:
monitor.globalStatus.critical:CRITICAL]: toaster-b has taken over this
node.

1 sys Warning custard-1 2010-02-07 21:48:04 61204
ontapdsm DSM ID 0400010f is in the process of being removed.
1 sys Warning custard-1 2010-02-07 21:48:04 61205
ontapdsm DSM ID 0400010f was removed.
1 sys Warning custard-1 2010-02-07 21:48:04 61205
ontapdsm DSM ID 0400010e was removed.
[...]
1 sys Error custard-1 2010-02-07 21:48:04 16 mpio
A fail-over on \Device\MPIODisk108 occurred.
[...]
1 sys Warning custard-1 2010-02-07 21:48:04 17 mpio
\Device\MPIODisk108 is currently in a degraded state. One or more
paths have failed though the process is now complete.

Feb 7 21:48:09 toaster-b-svif2 [toaster-a/toaster-b: syslogd:info]:
syslogd: restarted
Feb 7 21:48:09 toaster-a-svif1 [toaster-a/toaster-b: syslogd:info]:
syslogd: restarted
Feb 7 21:48:11 toaster-b-svif2 [toaster-a/toaster-b:
nbt.nbns.registrationComplete:info]: NBT: All CIFS name registrations
have completed for the partner server.
Feb 7 21:48:11 toaster-a-svif1 [toaster-a/toaster-b:
nbt.nbns.registrationComplete:info]: NBT: All CIFS name registrations
have completed for the partner server.

1 sys Warning custard-1 2010-02-07 21:48:24 129
ql2300 Reset to device \Device\RaidPort1 was issued.
1 sys Warning custard-1 2010-02-07 21:48:28 1123
ClusSvc The node lost communication with cluster node 'custard-2'
on network 'Public (Team)'.
1 sys Warning custard-1 2010-02-07 21:48:28 1123
ClusSvc The node lost communication with cluster node 'custard-2'
on network 'Private 172.16 (PCI)'.
1 sys Warning custard-1 2010-02-07 21:48:28 1123
ClusSvc The node lost communication with cluster node 'custard-2'
on network 'Private 192.168 (Onboard)'.
1 sys Error custard-1 2010-02-07 21:48:43 1209
ClusDisk Cluster service is requesting a bus reset for device
\Device\ClusDisk0.

Feb 7 21:48:46 toaster-b-svif2 [toaster-b (takeover):
scsitarget.ispfct.targetReset:notice]: FCP Target 0b: Target was Reset
by the Initiator at Port Id: 0x10000 (WWPN 2100001b3201ba33)

1 sys Error custard-1 2010-02-07 21:48:51 1209
ClusDisk Cluster service is requesting a bus reset for device
\Device\ClusDisk0.
1 sys Error custard-1 2010-02-07 21:48:53 1118
ClusNet Cluster service was terminated as requested by Node 2.
1 sys Error custard-1 2010-02-07 21:48:53 7031 Service
Control Manager The Cluster Service service terminated
unexpectedly. It has done this 1 time(s). The following corrective
action will be taken in 60000 milliseconds: Restart the service.
5 sys Error custard-1 2010-02-07 21:48:54 1036
ClusSvc Cluster disk resource '' did not respond to a SCSI
maintenance command.
1 sys Error custard-1 2010-02-07 21:48:54 1215
ClusSvc Cluster Network Name custard is no longer registered with
its hosting system. The associated resource name is ''.
1 sys Error custard-1 2010-02-07 21:48:54 1077
ClusSvc The TCP/IP interface for Cluster IP Address '' has failed.
1 sys Warning custard-1 2010-02-07 21:48:56 50 Ntfs
{Delayed Write Failed} Windows was unable to save all the data for
the file . The data has been lost. This error may be caused by a failure
of your computer hardware or network connection. Please try to save this
file elsewhere.
1 sys Warning custard-1 2010-02-07 21:48:57 57
Ftdisk The system failed to flush data to the transaction log.
Corruption may occur.
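
To make the sequence easier to follow, here is a rough Python sketch that lists a few of the events above as offsets from the FCP service shutdown at 21:47:25. It assumes the filer and host clocks are in sync, which the logs do not confirm, and the event descriptions are my own shorthand.

```python
from datetime import datetime

FMT = "%H:%M:%S"
FCP_SHUTDOWN = datetime.strptime("21:47:25", FMT)

# (timestamp, source, event) tuples copied from the log extract.
EVENTS = [
    ("21:47:40", "filer", "fcp.service.startup"),
    ("21:47:44", "custard-1", "ontapdsm paths transition to failed state"),
    ("21:47:57", "custard-1", "ontapdsm path verification fails"),
    ("21:48:04", "custard-1", "mpio fail-over, disk degraded"),
    ("21:48:43", "custard-1", "ClusDisk requests bus reset"),
    ("21:48:53", "custard-1", "Cluster Service terminates"),
    ("21:48:56", "custard-1", "Ntfs delayed write failed"),
]

def offsets(events):
    """Return (seconds since FCP shutdown, source, event) for each entry."""
    out = []
    for stamp, source, event in events:
        elapsed = datetime.strptime(stamp, FMT) - FCP_SHUTDOWN
        out.append((int(elapsed.total_seconds()), source, event))
    return out

timeline = offsets(EVENTS)
for elapsed, source, event in timeline:
    print(f"+{elapsed:3d}s  {source:10s} {event}")
```

Read this way, the host-side errors start after the filer logs FCP startup and continue for more than a minute afterwards.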

_______________________________________________
Toasters mailing list
Toasters [at] teaparty
http://www.teaparty.net/mailman/listinfo/toasters


andrey.borzenkov at ts

Jan 24, 2012, 6:11 AM

Post #2 of 8 (1719 views)
RE: How long can Windows survive loss of storage? [In reply to]

Are you using the ONTAP DSM or the native Windows DSM/iSCSI MPIO? Did you use NetApp Host Utilities when setting up the configuration?



---
With best regards

Andrey Borzenkov
Senior system engineer
Service operations




jeremy.page at gilbarco

Jan 24, 2012, 6:13 AM

Post #3 of 8 (1686 views)
RE: How long can Windows survive loss of storage? [In reply to]

I can't tell you exactly, but there is a registry setting you can use to increase the Windows disk timeout. This does not mean that SQL will be happy, though.
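
Assuming the setting meant here is the disk class driver's `TimeOutValue` (a REG_DWORD, in seconds) under `HKLM\SYSTEM\CurrentControlSet\Services\Disk`, a minimal Python sketch for reading it and comparing it with an outage length might look like this. The helper names are made up for illustration, the 15-second figure comes from the original post, and the registry read only works on Windows.

```python
import sys

DISK_KEY = r"SYSTEM\CurrentControlSet\Services\Disk"
VALUE_NAME = "TimeOutValue"  # REG_DWORD, seconds

def read_disk_timeout() -> int:
    """Read the disk I/O timeout from the registry (Windows only).

    Raises FileNotFoundError if the value has not been set.
    """
    import winreg  # imported lazily; the module exists only on Windows
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, DISK_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    return int(value)

def covers_outage(timeout_seconds: int, outage_seconds: int) -> bool:
    """True if the timeout is strictly longer than the observed outage."""
    return timeout_seconds > outage_seconds

if sys.platform == "win32":
    timeout = read_disk_timeout()
    print(f"{VALUE_NAME} = {timeout} s; "
          f"covers a 15 s outage: {covers_outage(timeout, 15)}")
```

A timeout longer than the outage is necessary but may not be sufficient: the multipath driver, the HBA driver, and the cluster service each have their own timers.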

We run both MSSQL (NFS VMs) and Oracle 11g (FC) on our 3070 cluster and have never had systems go down during a failover. I think you may want to contact NetApp to see if there is a config error. I see a mention in the logs that the heads can't see each other's NVRAM, which may be what's causing SQL to be inconsistent. FC may be dealt with differently, but for IP-based stuff all your writes are ack'ed to clients as soon as they hit the filer because they are written to the NVRAM. When you fail over, the NVRAM for the failed head is applied. This would include outstanding writes to SQL and its logs if it were IP. Once again, I am not certain that FC writes go through the NVRAM, but one of the NetApp guys on the list should be able to answer.

The short answer is that this should NOT be happening. Sorry I can't tell you what the exact problem is, but even our (reasonably busy) FC-attached Oracle can fail over without any problems.

________________________________________
From: toasters-bounces [at] teaparty [toasters-bounces [at] teaparty] on behalf of Stuart Kendrick [skendric [at] fhcrc]
Sent: Tuesday, January 24, 2012 8:55 AM
To: toasters [at] teaparty
Subject: How long can Windows survive loss of storage?

Hi folks,

We have MS SQL Server (a handful of instances, riding on top of MS
Cluster Services) attached via Fibre Channel to a filer.

Whenever we perform a takeover, SQL Server crumps. The DBAs shut it
down, restart, repair/restore their databases (unhappy times), life
returns to normal. We've replicated this behavior a handful of times:
every single time we experience a takeover, whether administratively or
Murphy-induced.

I'd like to think that we can do better than this.

This particular filer services five clients: the two MS SQL Servers via
Fibre Channel and three Exchange servers via iSCSI. The iSCSI clients
notice the takeover, but their initiators do the multipath thing and
ride through the event fine.

In the log extract below (heavily edited), we can see the iSCSI and FCP
services going down at 21:47:25. And coming back on-line at 21:47:40
(FCP) and 21:47:49 (iSCSI). The iSCSI client 'hamlet' re-establishes
its session at 21:47:51. The rest of the log extract records the
downward spiral of the Windows server 'custard-1'.

To my way of thinking, iSCSI was unavailable for 24 seconds, FCP for 15
seconds.

(1) Are the iSCSI and FCP services really and truly available once the
'fcp.service.startup' and 'iscsi.service.startup' messages are logged,
or might there be a delay before storage is actually available to clients?
(2) How could we shrink takeover times?
(3) Why does MS Windows over FC have such trouble surviving a 15 second
disruption to storage?
(4) How could we increase MS Windows timeouts?
(5) How long can /your/ MS Windows FCP clients survive an interruption
to storage?

Toaster: FAS3020 running 7.2.4
SQL Server Clients: Windows 2008 running SnapDrive & SnapManager for
SQL Server
Exchange Clients: Windows 2003 running SnapDrive & SnapManager for Exchange

Log extract below

--sk

Stuart Kendrick
FHCRC

Feb 7 21:47:22 toaster-b-svif2 [toaster-b:
cf.misc.operatorTakeover:warning]: Cluster monitor: takeover initiated
by operator
Feb 7 21:47:22 toaster-b-svif2 [toaster-b:
cf.fsm.nfo.acceptTakeoverReq:warning]: Negotiated failover: accepting
takeover request by partner, reason: operator initiated cf takeover.
Asking partner to shutdown gracefully; will takeover in at most 180 seconds.
Feb 7 21:47:25 toaster-a-svif2 [toaster-a:
cf.fsm.nfo.startingGracefulShutdown:warning]: Negotiated failover:
starting graceful shutdown.
Feb 7 21:47:25 toaster-a-svif2 [toaster-a: kern.shutdown:notice]:
System shut down because : "reboot".
Feb 7 21:47:25 toaster-a-svif2 [toaster-a:
iscsi.service.shutdown:info]: iSCSI service shutdown
Feb 7 21:47:25 toaster-a-svif2 [toaster-a: fcp.service.shutdown:info]:
FCP service shutdown
Feb 7 21:47:32 toaster-b-svif2 [toaster-b: cf.fsm.firmwareStatus:info]:
Cluster monitor: partner rebooting
Feb 7 21:47:32 toaster-b-svif2 [toaster-b:
cf.fsm.nfo.partnerShutdown:warning]: Negotiated failover: partner has
shutdown
Feb 7 21:47:32 toaster-b-svif2 [toaster-b: cf.fsm.takeover.nfo:info]:
Cluster monitor: takeover attempted after 'cf takeover'. command
Feb 7 21:47:32 toaster-b-svif2 [toaster-b:
cf.fsm.stateTransit:warning]: Cluster monitor: UP --> TAKEOVER
Feb 7 21:47:32 toaster-b-svif2 [toaster-b:
cf.fm.takeoverStarted:warning]: Cluster monitor: takeover started
Feb 7 21:47:34 toaster-b-svif2 [toaster-b: cf_takeover:info]: NVRAM
takeover: partner nvram is disabled
Feb 7 21:47:39 toaster-b-svif2 [toaster-a/toaster-b:
wafl.takeover.nvram.missing:error]: WAFL takeover: no partner area found
during wafl replay
Feb 7 21:47:40 toaster-b-svif2 [toaster-a/toaster-b:
wafl.replay.done:info]: WAFL log replay completed, 0 seconds
Feb 7 21:47:40 toaster-b-svif2 [toaster-a/toaster-b:
fcp.service.startup:info]: FCP service startup

1 sys Error custard-1 2010-02-07 21:47:44 61203
ontapdsm DSM ID 0400010d has transitioned to the failed state.
1 sys Error custard-1 2010-02-07 21:47:44 61203
ontapdsm DSM ID 04000107 has transitioned to the failed state.
1 sys Error custard-1 2010-02-07 21:47:44 61203
ontapdsm DSM ID 0400010b has transitioned to the failed state.
1 sys Error custard-1 2010-02-07 21:47:44 61203
ontapdsm DSM ID 0300010b has transitioned to the failed state.
[...]
1 sys Error custard-1 2010-02-07 21:47:44 61142
ontapdsm Nexus ID 03000101 has failed.

Feb 7 21:47:48 toaster-b-svif2 [toaster-a/toaster-b:
cf.fm.takeoverDetectionSeconds.Default:warning]: option
cf.takeover.detection.seconds is set to 10 seconds which is below the
NetApp advised value of 15 seconds. False takeovers and/or takeovers
without diagnostic core-dumps might occur.
Feb 7 21:47:49 toaster-b-svif2 [toaster-a/toaster-b:
iscsi.service.startup:info]: iSCSI service startup
Feb 7 21:47:50 toaster-b-svif2 [toaster-b (takeover):
cf.rsrc.transitTime:notice]: Top Takeover transit times
wafl_restart=7400 {vdisk=7214, restarters=186}, wafl=5049,
registry_prerc=613, registry_postrc_phase2=555, rc=553 {ifconfig=99,
ifconfig=93, hostname=57, options=41, options=27, options=26, route=2,
vif=1, vif=1, vif=1}, raid=525, registry_postrc_phase1=403,
wafl_sync=241, raid_replay=221, cifs=216
Feb 7 21:47:50 toaster-b-svif2 [toaster-b (takeover):
cf.fm.takeoverComplete:warning]: Cluster monitor: takeover completed
Feb 7 21:47:50 toaster-b-svif2 [toaster-b (takeover):
cf.fm.takeoverDuration:warning]: Cluster monitor: takeover duration time
is 18 seconds
Feb 7 21:47:51 toaster-b-svif2 [toaster-a/toaster-b:
iscsi.notice:notice]: ISCSI: New session from initiator
iqn.1990-04.org.fhcrc:hamlet.fhcrc.org at IP addr 10.111.152.44
Feb 7 21:47:51 toaster-a-svif1 [toaster-a/toaster-b:
iscsi.notice:notice]: ISCSI: New session from initiator
iqn.1990-04.org.fhcrc:hamlet.fhcrc.org at IP addr 10.111.152.44

1 sys Warning custard-1 2010-02-07 21:47:57 61051
ontapdsm DSM ID 03000100 failed path verification.
1 sys Warning custard-1 2010-02-07 21:47:57 61051
ontapdsm DSM ID 03000101 failed path verification.
1 sys Warning custard-1 2010-02-07 21:47:57 61051
ontapdsm DSM ID 03000102 failed path verification.
1 sys Warning custard-1 2010-02-07 21:47:57 61077
ontapdsm DSM ID 03000111 has initiated a fail-over.
[...]
1 sys Error custard-1 2010-02-07 21:47:57 61124
ontapdsm The port servicing DSM ID 03000111 reported the logical
unit did not respond to selection.
[...]

Feb 7 21:48:01 toaster-b-svif2 [toaster-b (takeover):
monitor.globalStatus.critical:CRITICAL]: This node has taken over
toaster-a.
Feb 7 21:48:01 toaster-b-svif2 [toaster-a/toaster-b:
monitor.globalStatus.critical:CRITICAL]: toaster-b has taken over this
node.
Feb 7 21:48:01 toaster-a-svif1 [toaster-a/toaster-b:
monitor.globalStatus.critical:CRITICAL]: toaster-b has taken over this
node.

1 sys Warning custard-1 2010-02-07 21:48:04 61204
ontapdsm DSM ID 0400010f is in the process of being removed.
1 sys Warning custard-1 2010-02-07 21:48:04 61205
ontapdsm DSM ID 0400010f was removed.
1 sys Warning custard-1 2010-02-07 21:48:04 61205
ontapdsm DSM ID 0400010e was removed.
[...]
1 sys Error custard-1 2010-02-07 21:48:04 16 mpio
A fail-over on DeviceMPIODisk108 occurred.
[...]
1 sys Warning custard-1 2010-02-07 21:48:04 17 mpio
DeviceMPIODisk108 is currently in a degraded state. One or more
paths have failed though the process is now complete.

Feb 7 21:48:09 toaster-b-svif2 [toaster-a/toaster-b: syslogd:info]:
syslogd: restarted
Feb 7 21:48:09 toaster-a-svif1 [toaster-a/toaster-b: syslogd:info]:
syslogd: restarted
Feb 7 21:48:11 toaster-b-svif2 [toaster-a/toaster-b:

_______________________________________________
Toasters mailing list
Toasters [at] teaparty
http://www.teaparty.net/mailman/listinfo/toasters




fredgrieco at yahoo

Jan 24, 2012, 6:30 AM

Post #4 of 8 (1703 views)
Permalink
Re: How long can Windows survive loss of storage? [In reply to]

Stuart,

I would make sure your system is running single image mode clustering. The command "fcp show cfmode" will show you the clustering type for FCP clients. If it's single image mode, this will return "single_image". I'm fairly certain that single_image mode was available in 7.2.4.

FCP clustering is completely active-active in single image mode. Each client should see two paths on one side and two on the other. Losing one side just means losing two paths. If FCP is completely out for 15 seconds on all paths, something's not right.

Fred
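
Fred's point about paths can be put as a small model. The sketch below is only an illustration, assuming four paths (two per head); the port labels are made up and nothing here queries a real filer or host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    controller: str   # filer head that owns the target port
    target_port: str  # hypothetical port label, for illustration only


def surviving_paths(paths, failed_controller):
    """Return the paths that do not terminate on the failed head."""
    return [p for p in paths if p.controller != failed_controller]


# Two paths per head, as described for single image mode.
PATHS = [
    Path("toaster-a", "0a"),
    Path("toaster-a", "0b"),
    Path("toaster-b", "0a"),
    Path("toaster-b", "0b"),
]

if __name__ == "__main__":
    left = surviving_paths(PATHS, "toaster-a")
    print(f"{len(left)} of {len(PATHS)} paths remain after losing toaster-a")
```

If a host reports zero usable paths during a takeover, that suggests its paths were not spread across both heads the way this model assumes.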






jack1729 at gmail

Jan 24, 2012, 8:01 AM

Post #5 of 8 (1710 views)
Permalink
Re: How long can Windows survive loss of storage? [In reply to]

I agree. We have FC and iSCSI Windows clusters (MSSQL and others) that don't have problems during failover. I know the host utilities adjust the disk timeouts when they are installed; I would be interested to know whether they are installed.

Jack
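
Whether a disk timeout is long enough depends on how long storage was actually gone. As a rough check, the outage windows quoted in the original post can be recomputed from the filer log timestamps. This is a minimal sketch: the timestamps are copied from that log, while the dictionary layout and function names are illustrative.

```python
from datetime import datetime

# Timestamps copied from the filer log in the original post (Feb 7).
EVENTS = {
    "fcp.service.shutdown": "21:47:25",
    "iscsi.service.shutdown": "21:47:25",
    "fcp.service.startup": "21:47:40",
    "iscsi.service.startup": "21:47:49",
}


def seconds_between(start, end):
    """Whole seconds between two HH:MM:SS stamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def outage_seconds(service):
    """Gap between a service's shutdown and startup messages."""
    return seconds_between(
        EVENTS[f"{service}.service.shutdown"],
        EVENTS[f"{service}.service.startup"],
    )


if __name__ == "__main__":
    for service in ("fcp", "iscsi"):
        print(service, outage_seconds(service), "seconds")
```

These are gaps between log messages on the filer, so they only bound the outage as the filer saw it; the host may see a longer interruption.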

-----Original Message-----
From: "Page, Jeremy" <jeremy.page [at] gilbarco>
Sender: toasters-bounces [at] teaparty
Date: Tue, 24 Jan 2012 14:13:25
To: Stuart Kendrick<skendric [at] fhcrc>; toasters [at] teaparty<toasters [at] teaparty>
Subject: RE: How long can Windows survive loss of storage?

I can't tell you exactly, but there is a registry setting you can use to increase the Windows timeout. That does not mean SQL will be happy, though.

We run both MSSQL (NFS VMs) and Oracle 11g (FC) on our 3070 cluster and have never had systems go down during a failover. I think you may want to contact NetApp to see if there is a config error. I see a mention in the logs that the heads can't see each other's NVRAM, which may be what's causing SQL to be inconsistent. FC may be dealt with differently, but for IP-based stuff all your writes are ack'ed to clients as soon as they hit the filer because they are written to the NVRAM. When you fail over, the NVRAM for the failed head is applied. This would include outstanding writes to SQL and its logs if it were IP. Once again, I am not certain that FC writes go through the NVRAM, but one of the NetApp guys on the list should be able to answer.

The short answer is that this should NOT be happening. Sorry I can't tell you what the exact problem is, but even our (reasonably busy) FC-attached Oracle can fail over without any problems.
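
The registry setting referred to above is most likely the disk class driver's TimeOutValue (a REG_DWORD, in seconds) under HKLM\SYSTEM\CurrentControlSet\Services\Disk; the message itself does not name it, so treat that as an assumption. The sketch below only builds the `reg add` command line and refuses to run it on anything but Windows. The function names and the 120-second figure are placeholders, not a recommendation from this thread.

```python
import subprocess
import sys

# Assumed location of the Windows disk I/O timeout (value is in seconds).
DISK_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Disk"


def build_timeout_command(seconds):
    """Return the `reg add` argument list that would set TimeOutValue."""
    if not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("timeout must be a positive whole number of seconds")
    return [
        "reg", "add", DISK_KEY,
        "/v", "TimeOutValue",
        "/t", "REG_DWORD",
        "/d", str(seconds),
        "/f",  # overwrite an existing value without prompting
    ]


def apply_timeout(seconds):
    """Run the command; only meaningful on a Windows host with admin rights."""
    if sys.platform != "win32":
        raise RuntimeError("this setting exists only on Windows")
    subprocess.run(build_timeout_command(seconds), check=True)


if __name__ == "__main__":
    # 120 is an illustrative value only.
    print(" ".join(build_timeout_command(120)))
```

Check the value the host utilities have already written before changing anything, since they manage this setting themselves.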



sklise at hotmail

Jan 24, 2012, 8:13 AM

Post #6 of 8 (1673 views)
Permalink
RE: How long can Windows survive loss of storage? [In reply to]

To add insult to injury, you may want to check hardware compatibility. I would usually downgrade my HBA drivers to a known-good (and older) version. The host utility usually detects whether it's an MS cluster or not and sets the disk timeouts.




Date: Tue, 24 Jan 2012 06:30:54 -0800
From: fredgrieco [at] yahoo
Subject: Re: How long can Windows survive loss of storage?
To: skendric [at] fhcrc; toasters [at] teaparty



Stuart,


I would make sure your system is running single image mode clustering. The command "fcp show cfmode" will show you the clustering type for fcp clients. If it's single image mode, this will return "single_image." I'm fairly certain that single_image mode was available in 7.2.4.


FCP clustering is completely active-active in single image mode. Each client should see two paths on one side and two from the other. Losing one side just means losing two paths. If you are seeing FCP as out completely for 15 seconds on all paths, something's not right.

Fred









From: Stuart Kendrick <skendric [at] fhcrc>
To: "toasters [at] teaparty" <toasters [at] teaparty>
Sent: Tuesday, January 24, 2012 8:55 AM
Subject: How long can Windows survive loss of storage?

Hi folks,

We have MS SQL Server (a handful of instances, riding on top of MS
Cluster Services) attached via Fibre Channel to a filer.

Whenever we perform a takeover, SQL Server crumps. The DBAs shut it
down, restart, repair/restore their databases (unhappy times), life
returns to normal. We've replicated this behavior a handful of times:
every single time we experience a takeover, whether administratively or
Murphy-induced.

I'd like to think that we can do better than this.

This particular filer services five clients: the two MS SQL Servers via
Fibre Channel and three Exchange servers via iSCSI. The iSCSI clients
notice the takeover, but their initiators do the multipath thing and
ride through the event fine.

In the log extract below (heavily edited), we can see the iSCSI and FCP
services going down at 21:47:25. And coming back on-line at 21:47:40
(FCP) and 21:47:49 (iSCSI). The iSCSI client 'hamlet' re-establishes
its session at 21:47:51. The rest of the log extract records the
downward spiral of the Windows server 'custard-1'.

To my way of thinking, iSCSI was unavailable for 24 seconds, FCP for 15
seconds.

(1) Are the iSCSI and FCP services really and truly available once the
'fcp.service.startup' and 'iscsi.service.startup' messages are logged,
or might there be a delay before storage is actually available to clients?
(2) How could we shrink takeover times?
(3) Why does MS Windows over FC have such trouble surviving a 15 second
disruption to storage?
(4) How could we increase MS Windows timeouts?
(5) How long can /your/ MS Windows FCP clients survive an interruption
to storage?

Toaster: FAS3020 running 7.2.4
SQL Server Clients: Windows 2008 running SnapDrive & SnapManager for
SQL Server
Exchange Clients: Windows 2003 running SnapDrive & SnapManager for Exchange

Log extract below

--sk

Stuart Kendrick
FHCRC

Feb 7 21:47:22 toaster-b-svif2 [toaster-b:
cf.misc.operatorTakeover:warning]: Cluster monitor: takeover initiated
by operator
Feb 7 21:47:22 toaster-b-svif2 [toaster-b:
cf.fsm.nfo.acceptTakeoverReq:warning]: Negotiated failover: accepting
takeover request by partner, reason: operator initiated cf takeover.
Asking partner to shutdown gracefully; will takeover in at most 180 seconds.
Feb 7 21:47:25 toaster-a-svif2 [toaster-a:
cf.fsm.nfo.startingGracefulShutdown:warning]: Negotiated failover:
starting graceful shutdown.
Feb 7 21:47:25 toaster-a-svif2 [toaster-a: kern.shutdown:notice]:
System shut down because : "reboot".
Feb 7 21:47:25 toaster-a-svif2 [toaster-a:
iscsi.service.shutdown:info]: iSCSI service shutdown
Feb 7 21:47:25 toaster-a-svif2 [toaster-a: fcp.service.shutdown:info]:
FCP service shutdown
Feb 7 21:47:32 toaster-b-svif2 [toaster-b: cf.fsm.firmwareStatus:info]:
Cluster monitor: partner rebooting
Feb 7 21:47:32 toaster-b-svif2 [toaster-b:
cf.fsm.nfo.partnerShutdown:warning]: Negotiated failover: partner has
shutdown
Feb 7 21:47:32 toaster-b-svif2 [toaster-b: cf.fsm.takeover.nfo:info]:
Cluster monitor: takeover attempted after 'cf takeover'. command
Feb 7 21:47:32 toaster-b-svif2 [toaster-b:
cf.fsm.stateTransit:warning]: Cluster monitor: UP --> TAKEOVER
Feb 7 21:47:32 toaster-b-svif2 [toaster-b:
cf.fm.takeoverStarted:warning]: Cluster monitor: takeover started
Feb 7 21:47:34 toaster-b-svif2 [toaster-b: cf_takeover:info]: NVRAM
takeover: partner nvram is disabled
Feb 7 21:47:39 toaster-b-svif2 [toaster-a/toaster-b:
wafl.takeover.nvram.missing:error]: WAFL takeover: no partner area found
during wafl replay
Feb 7 21:47:40 toaster-b-svif2 [toaster-a/toaster-b:
wafl.replay.done:info]: WAFL log replay completed, 0 seconds
Feb 7 21:47:40 toaster-b-svif2 [toaster-a/toaster-b:
fcp.service.startup:info]: FCP service startup

1 sys Error custard-1 2010-02-07 21:47:44 61203
ontapdsm DSM ID 0400010d has transitioned to the failed state.
1 sys Error custard-1 2010-02-07 21:47:44 61203
ontapdsm DSM ID 04000107 has transitioned to the failed state.
1 sys Error custard-1 2010-02-07 21:47:44 61203
ontapdsm DSM ID 0400010b has transitioned to the failed state.
1 sys Error custard-1 2010-02-07 21:47:44 61203
ontapdsm DSM ID 0300010b has transitioned to the failed state.
[...]
1 sys Error custard-1 2010-02-07 21:47:44 61142
ontapdsm Nexus ID 03000101 has failed.

Feb 7 21:47:48 toaster-b-svif2 [toaster-a/toaster-b:
cf.fm.takeoverDetectionSeconds.Default:warning]: option
cf.takeover.detection.seconds is set to 10 seconds which is below the
NetApp advised value of 15 seconds. False takeovers and/or takeovers
without diagnostic core-dumps might occur.
Feb 7 21:47:49 toaster-b-svif2 [toaster-a/toaster-b:
iscsi.service.startup:info]: iSCSI service startup
Feb 7 21:47:50 toaster-b-svif2 [toaster-b (takeover):
cf.rsrc.transitTime:notice]: Top Takeover transit times
wafl_restart=7400 {vdisk=7214, restarters=186}, wafl=5049,
registry_prerc=613, registry_postrc_phase2=555, rc=553 {ifconfig=99,
ifconfig=93, hostname=57, options=41, options=27, options=26, route=2,
vif=1, vif=1, vif=1}, raid=525, registry_postrc_phase1=403,
wafl_sync=241, raid_replay=221, cifs=216
Feb 7 21:47:50 toaster-b-svif2 [toaster-b (takeover):
cf.fm.takeoverComplete:warning]: Cluster monitor: takeover completed
Feb 7 21:47:50 toaster-b-svif2 [toaster-b (takeover):
cf.fm.takeoverDuration:warning]: Cluster monitor: takeover duration time
is 18 seconds
Feb 7 21:47:51 toaster-b-svif2 [toaster-a/toaster-b:
iscsi.notice:notice]: ISCSI: New session from initiator
iqn.1990-04.org.fhcrc:hamlet.fhcrc.org at IP addr 10.111.152.44
Feb 7 21:47:51 toaster-a-svif1 [toaster-a/toaster-b:
iscsi.notice:notice]: ISCSI: New session from initiator
iqn.1990-04.org.fhcrc:hamlet.fhcrc.org at IP addr 10.111.152.44

1 sys Warning custard-1 2010-02-07 21:47:57 61051
ontapdsm DSM ID 03000100 failed path verification.
1 sys Warning custard-1 2010-02-07 21:47:57 61051
ontapdsm DSM ID 03000101 failed path verification.
1 sys Warning custard-1 2010-02-07 21:47:57 61051
ontapdsm DSM ID 03000102 failed path verification.
1 sys Warning custard-1 2010-02-07 21:47:57 61077
ontapdsm DSM ID 03000111 has initiated a fail-over.
[...]
1 sys Error custard-1 2010-02-07 21:47:57 61124
ontapdsm The port servicing DSM ID 03000111 reported the logical
unit did not respond to selection.
[...]

Feb 7 21:48:01 toaster-b-svif2 [toaster-b (takeover):
monitor.globalStatus.critical:CRITICAL]: This node has taken over
toaster-a.
Feb 7 21:48:01 toaster-b-svif2 [toaster-a/toaster-b:
monitor.globalStatus.critical:CRITICAL]: toaster-b has taken over this
node.
Feb 7 21:48:01 toaster-a-svif1 [toaster-a/toaster-b:
monitor.globalStatus.critical:CRITICAL]: toaster-b has taken over this
node.

1 sys Warning custard-1 2010-02-07 21:48:04 61204
ontapdsm DSM ID 0400010f is in the process of being removed.
1 sys Warning custard-1 2010-02-07 21:48:04 61205
ontapdsm DSM ID 0400010f was removed.
1 sys Warning custard-1 2010-02-07 21:48:04 61205
ontapdsm DSM ID 0400010e was removed.
[...]
1 sys Error custard-1 2010-02-07 21:48:04 16 mpio
A fail-over on \Device\MPIODisk108 occurred.
[...]
1 sys Warning custard-1 2010-02-07 21:48:04 17 mpio
\Device\MPIODisk108 is currently in a degraded state. One or more
paths have failed though the process is now complete.

Feb 7 21:48:09 toaster-b-svif2 [toaster-a/toaster-b: syslogd:info]:
syslogd: restarted
Feb 7 21:48:09 toaster-a-svif1 [toaster-a/toaster-b: syslogd:info]:
syslogd: restarted
Feb 7 21:48:11 toaster-b-svif2 [toaster-a/toaster-b:
nbt.nbns.registrationComplete:info]: NBT: All CIFS name registrations
have completed for the partner server.
Feb 7 21:48:11 toaster-a-svif1 [toaster-a/toaster-b:
nbt.nbns.registrationComplete:info]: NBT: All CIFS name registrations
have completed for the partner server.

1 sys Warning custard-1 2010-02-07 21:48:24 129
ql2300 Reset to device \Device\RaidPort1 was issued.
1 sys Warning custard-1 2010-02-07 21:48:28 1123
ClusSvc The node lost communication with cluster node 'custard-2'
on network 'Public (Team)'.
1 sys Warning custard-1 2010-02-07 21:48:28 1123
ClusSvc The node lost communication with cluster node 'custard-2'
on network 'Private 172.16 (PCI)'.
1 sys Warning custard-1 2010-02-07 21:48:28 1123
ClusSvc The node lost communication with cluster node 'custard-2'
on network 'Private 192.168 (Onboard)'.
1 sys Error custard-1 2010-02-07 21:48:43 1209
ClusDisk Cluster service is requesting a bus reset for device
\Device\ClusDisk0.

Feb 7 21:48:46 toaster-b-svif2 [toaster-b (takeover):
scsitarget.ispfct.targetReset:notice]: FCP Target 0b: Target was Reset
by the Initiator at Port Id: 0x10000 (WWPN 2100001b3201ba33)

1 sys Error custard-1 2010-02-07 21:48:51 1209
ClusDisk Cluster service is requesting a bus reset for device
\Device\ClusDisk0.
1 sys Error custard-1 2010-02-07 21:48:53 1118
ClusNet Cluster service was terminated as requested by Node 2.
1 sys Error custard-1 2010-02-07 21:48:53 7031 Service
Control Manager The Cluster Service service terminated
unexpectedly. It has done this 1 time(s). The following corrective
action will be taken in 60000 milliseconds: Restart the service.
5 sys Error custard-1 2010-02-07 21:48:54 1036
ClusSvc Cluster disk resource '' did not respond to a SCSI
maintenance command.
1 sys Error custard-1 2010-02-07 21:48:54 1215
ClusSvc Cluster Network Name custard is no longer registered with
its hosting system. The associated resource name is ''.
1 sys Error custard-1 2010-02-07 21:48:54 1077
ClusSvc The TCP/IP interface for Cluster IP Address '' has failed.
1 sys Warning custard-1 2010-02-07 21:48:56 50 Ntfs
{Delayed Write Failed} Windows was unable to save all the data for
the file . The data has been lost. This error may be caused by a failure
of your computer hardware or network connection. Please try to save this
file elsewhere.
1 sys Warning custard-1 2010-02-07 21:48:57 57
Ftdisk The system failed to flush data to the transaction log.
Corruption may occur.
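
To make the timeline in the extract above easier to read, here is a rough Python sketch that measures each milestone from the FCP shutdown logged at 21:47:25 in the original post. The event labels are my own shorthand, and the arithmetic assumes the filer and Windows clocks were in sync, which the thread does not confirm.

```python
from datetime import datetime

# Timestamps copied from the log extract; the labels are shorthand for this sketch.
FCP_SHUTDOWN = "21:47:25"
EVENTS = [
    ("filer logs fcp.service.startup", "21:47:40"),
    ("ontapdsm paths transition to failed", "21:47:44"),
    ("ontapdsm path verification fails", "21:47:57"),
    ("ClusDisk requests first bus reset", "21:48:43"),
    ("Ntfs reports Delayed Write Failed", "21:48:56"),
]


def seconds_since(start, end):
    """Whole seconds between two HH:MM:SS stamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def timeline(start=FCP_SHUTDOWN, events=EVENTS):
    """Return (label, seconds after start) for every event."""
    return [(label, seconds_since(start, stamp)) for label, stamp in events]


if __name__ == "__main__":
    for label, elapsed in timeline():
        print(f"+{elapsed:3d}s  {label}")
```

Read this way, the host is still failing paths and resetting the bus more than a minute after the filer reports FCP as started, which is why question (1) matters.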

_______________________________________________
Toasters mailing list
Toasters [at] teaparty
http://www.teaparty.net/mailman/listinfo/toasters





skendric at fhcrc

Jan 26, 2012, 5:35 AM

Post #7 of 8 (1720 views)
Permalink
Re: How long can Windows survive loss of storage? [In reply to]

Took a while for me to find the folks who had done the original install ...

ONTAP DSM and yes, we used the NetApp Host Utilities (Windows Host
Utilities) when setting them up.

BTW: turns out that all five of these Windows clients are running
Win2003, not Win2008.

I've heard the rumor that Microsoft patches can undo some of the changes
which the Host Utilities Kit makes, and that we should re-run HUK. And
perhaps upgrade the version of ONTAP DSM. I'm confident that we haven't
touched either since the original install, many years ago. What do you
think?

--sk


On 1/24/2012 6:11 AM, Borzenkov, Andrey wrote:
> Are you using ONTAP DSM or native Windows DSM/iSCSI MPIO? Did you use NetApp Host Utilities when setting up configuration?


andrey.borzenkov at ts

Jan 26, 2012, 6:04 AM

Post #8 of 8 (1836 views)
Permalink
RE: How long can Windows survive loss of storage? [In reply to]

That's correct (about undoing changes). From the Host Utilities setup guide:

Installing the cluster service on Windows 2003 changes the disk TimeOutValue.
Upgrading the Emulex or QLogic HBA driver software also changes TimeOutValue. If cluster
service is installed or the HBA driver is upgraded after you install this software, use the Repair
option of the installation program to change the disk TimeOutValue back to the supported value.

I do not remember exactly which value MSCS uses, but it is pretty low, 10 or 20 seconds.
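
A quick way to see what a host is currently using is to read the value straight from the registry. The sketch below is Python and only an illustration: it assumes the standard disk class key (HKLM\SYSTEM\CurrentControlSet\Services\Disk, value TimeOutValue, in seconds), and the helper names and the comparison against an outage length are my own. It is a crude first check, since HBA driver and MPIO timers also play a part and are not modelled here.

```python
import sys

# Standard location of the Windows disk class timeout (REG_DWORD, seconds).
DISK_KEY = r"SYSTEM\CurrentControlSet\Services\Disk"
VALUE_NAME = "TimeOutValue"


def read_disk_timeout():
    """Return the disk TimeOutValue in seconds, or None if unset. Windows only."""
    import winreg  # imported lazily so the helper below runs on any platform

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, DISK_KEY) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
        except FileNotFoundError:
            return None
    return int(value)


def timeout_covers_outage(timeout_seconds, outage_seconds):
    """Hypothetical helper: True if the timeout is longer than the outage."""
    return timeout_seconds is not None and timeout_seconds > outage_seconds


if __name__ == "__main__" and sys.platform == "win32":
    current = read_disk_timeout()
    # 15 seconds is the FCP outage Stuart measured in this thread.
    print("TimeOutValue:", current)
    print("Longer than a 15 s outage:", timeout_covers_outage(current, 15))
```

If the value comes back at 10 or 20, that would fit the MSCS or HBA driver overwrite described in the guide, and running the Repair option is the documented fix.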



---
With best regards

Andrey Borzenkov
Senior system engineer
Service operations

