
Mailing List Archive: Netapp: toasters

Loop A failure not triggering failover

 

 



SRajagopalan at williamoneil

Jan 21, 2010, 10:25 PM

Post #1 of 4 (2232 views)
Loop A failure not triggering failover

We have an active/active setup on our filers, with standard loop A/loop B
cabling (no multipath HA).



We had a recent event with our filers where an intermittent failure of
loop A did not trigger a failover to the partner. I'd like to know why.
According to the NetApp failover cause-and-effect document at

http://now.netapp.com/NOW/knowledge/docs/ontap/rel727/html/ontap/cluster/failing_over/reference/r_oc_fo_failover-events.html

this event should have caused a failover.



The log message from the filer on loop A was:

Sun Jan 17 15:41:56 PST [netapp1: fci.link.break:error]: Link break detected on Fibre Channel adapter 0e.
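
For an intermittent fault it can help to count how often the link dropped. The following is a minimal sketch, assuming each log entry sits on a single line in the format shown above. The function name `count_link_breaks` and the regular expressions are illustrative, not part of any NetApp tool.

```python
import re
from collections import Counter

# Assumed layout, based only on the single message quoted above:
#   <timestamp> [<host>: <event>:<severity>]: <message>
LOG_LINE = re.compile(
    r"^(?P<timestamp>\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \w+) "
    r"\[(?P<host>[^:\]]+): (?P<event>[\w.]+):(?P<severity>\w+)\]: "
    r"(?P<message>.*)$"
)
ADAPTER = re.compile(r"adapter (?P<adapter>\w+)")


def count_link_breaks(lines):
    """Count fci.link.break events per (host, adapter) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        # Skip lines that do not parse or that report a different event.
        if not match or match["event"] != "fci.link.break":
            continue
        adapter = ADAPTER.search(match["message"])
        if adapter:
            counts[(match["host"], adapter["adapter"])] += 1
    return counts
```

Feeding it the lines of a saved messages file gives a per-adapter tally, which shows whether the loop flapped once or many times.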



Is there an option or timeout setting that would make the failover happen?



Thanks

Suresh


lohit.b at gmail

Jan 21, 2010, 11:09 PM

Post #2 of 4 (2103 views)
Re: Loop A failure not triggering failover

Hi Suresh,

I think a takeover should have happened when the loop failed. The
following is taken from the ONTAP docs:

How disk shelf comparison takeover works

Describes the way a node uses disk shelf comparison with its partner node to
determine if it is impaired.

When communication between nodes is first established through the cluster
interconnect adapters, the nodes exchange a list of disk shelves that are
visible on the A and B loops of each node. If, later, a system sees that the
B loop disk shelf count on its partner is greater than its local A loop disk
shelf count, the system concludes that it is impaired and prompts its
partner to initiate a takeover.
Note: Disk shelf comparison does not function for active/active
configurations using software-based disk ownership, or for fabric-attached
MetroClusters.
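
The comparison rule quoted above can be sketched as follows. This is only an illustration of the documented rule, not Data ONTAP code: the `ShelfView` structure, the function names, and the two boolean flags are hypothetical. The `miscompare_enabled` flag stands in for the cf.takeover.on_disk_shelf_miscompare option mentioned later in this thread.

```python
from dataclasses import dataclass


@dataclass
class ShelfView:
    """Shelf counts a node reports to its partner (hypothetical structure)."""
    a_loop_shelves: int
    b_loop_shelves: int


def is_impaired(local: ShelfView, partner: ShelfView) -> bool:
    """A node is impaired when the partner's B loop sees more shelves
    than the node's own A loop."""
    return partner.b_loop_shelves > local.a_loop_shelves


def should_request_takeover(local: ShelfView, partner: ShelfView, *,
                            miscompare_enabled: bool,
                            software_ownership: bool) -> bool:
    """Combine the rule with the two conditions raised in this thread."""
    if software_ownership:
        # Per the note above, shelf comparison does not function with
        # software-based disk ownership.
        return False
    if not miscompare_enabled:
        # The takeover-on-miscompare option must be turned on.
        return False
    return is_impaired(local, partner)
```

Under these assumptions, a node that has lost sight of shelves on its A loop asks for a takeover only when the option is on and hardware-based ownership is in use.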


There is also the following option, though I think it affects only
cluster interconnect timeouts, not loop failures:

options cf.takeover.detection.seconds number_of_seconds

The valid values for number_of_seconds are 10 through 180; the default is
15.

Attention: If the specified time is less than 15 seconds, unnecessary
takeovers can occur, and a core might not be generated for some system
panics. Use caution when assigning a takeover time of less than 15 seconds.
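
The limits quoted above (valid range 10 through 180, default 15, caution below 15) can be expressed as a small pre-check before changing the option. This is a sketch only. The function `check_detection_seconds` is hypothetical and does not talk to a filer.

```python
MIN_SECONDS = 10       # lowest valid value per the quoted docs
MAX_SECONDS = 180      # highest valid value per the quoted docs
DEFAULT_SECONDS = 15   # documented default


def check_detection_seconds(value: int) -> str:
    """Validate a proposed cf.takeover.detection.seconds value."""
    if not MIN_SECONDS <= value <= MAX_SECONDS:
        raise ValueError(
            f"{value} is outside the valid range "
            f"{MIN_SECONDS}-{MAX_SECONDS}"
        )
    if value < DEFAULT_SECONDS:
        # The docs warn about unnecessary takeovers and missing cores
        # when the value is below 15 seconds.
        return "warning: values below 15 can cause unnecessary takeovers"
    return "ok"
```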






--
LOhit


Sue.Coatney at netapp

Jan 22, 2010, 2:01 PM

Post #3 of 4 (2084 views)
RE: Loop A failure not triggering failover

The cf.takeover.on_disk_shelf_miscompare option needs to be turned on for takeover to happen when a disk shelf mis-compare happens.

Sue Coatney
High Availability Team
NetApp



SRajagopalan at williamoneil

Jan 22, 2010, 5:34 PM

Post #4 of 4 (2086 views)
RE: Loop A failure not triggering failover

We use a 6030 series with software disk ownership. According to the
documentation LOhit quoted (the note on disk shelf comparison), this
option does not apply to us. Is that right?



Please note that this was an intermittent loop A failure, not a
complete failure. Loop A kept going up and down, and we had no failover
during this period.



Suresh





