
Mailing List Archive: Cisco: NSP

3550-12 interrupts out of control, possibly hardware?

 

 



andy at xecu

Aug 16, 2012, 9:36 AM

Post #1 of 6
3550-12 interrupts out of control, possibly hardware?

I've got a customer with a weird situation.

They have a pretty straightforward setup: two 7200s fronting two Cisco
3550-12s, distributing to a series of 48-port 3550s. It's a bit dated, but
works very well for their needs.

They have one special network attached to (only) one of the copper GigE
ports on (one of) the 3550-12s, which gets a decent amount of traffic
(~100 Mbps or so). It's a layer 3 connection.

Well, one of their 3550-12s died, taking down that network. They moved the
IP configuration of the port and moved the cable immediately, restoring
service, and racked/configured a replacement switch, but left that network
on the second 3550-12, as it seemed fine.

However, once it began to come under load this morning, the CPU pegged
(80-99%, normally at 1-2%), causing packet drops and latency.

At that point I got involved, and for the life of me I can't figure out
why this happened. Clearly it's interrupts, as no process in "sh proc cpu"
had more than 1% of CPU. However, CEF was working fine, and everything
looked normal in terms of the traditional interrupt-based
troubleshooting.
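
For readers following along, here is a minimal sketch of how the interrupt share can be read off the summary line of "sh proc cpu". It assumes the usual IOS summary format, where the first five-second figure is total CPU and the figure after the slash is time spent at interrupt level. The sample line and the function name split_cpu are illustrative only, not output captured from the switches in this thread.

```python
import re

# Summary line printed at the top of "show processes cpu" on IOS.
# The first five-second figure is total CPU, the second is interrupt level.
CPU_LINE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)


def split_cpu(line):
    """Return total, interrupt and process-level CPU from the summary line."""
    match = CPU_LINE.search(line)
    if match is None:
        raise ValueError("not a 'show processes cpu' summary line")
    total, interrupt, one_min, five_min = (int(g) for g in match.groups())
    return {
        "total": total,
        "interrupt": interrupt,
        # Whatever is not interrupt time is spent in scheduled processes.
        "process": total - interrupt,
        "one_minute": one_min,
        "five_minutes": five_min,
    }


# Hypothetical sample line, chosen to resemble the symptom described above.
sample = "CPU utilization for five seconds: 92%/88%; one minute: 85%; five minutes: 80%"
print(split_cpu(sample))
```

A large interrupt figure with a small process remainder matches the picture described above, where no single process showed more than 1%.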

So, after scratching our heads for a bit, I had them move the connection
back to the original, newly-replaced switch. Note that these switches are
configured 100% identically with the exception of IP address and hostname.
Same IOS versions. I mean literally, if you diff the two in rancid, those
are the only config changes.

Zero problems from the point they moved the connection off of the switch
in question, both switches now have 1-2% CPU and things are humming along
fine.

So, my question is: What could be the possible causes of this? Could this
be a symptom of failing hardware, perhaps some bad memory requiring
constant CPU corrections?

Thanks,
Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---
_______________________________________________
cisco-nsp mailing list cisco-nsp [at] puck
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/


andy at xecu

Aug 16, 2012, 11:22 AM

Post #2 of 6
Re: 3550-12 interrupts out of control, possibly hardware?

In doing further investigation, looking at traffic graphs, I see that once
they moved the network to the other switch, all of a sudden "vlan1"
started seeing all of the traffic that was being routed to that network.
Typically, the only traffic the switches see on vlan1 is traffic actually
destined for the switch itself (config, ICMP, etc). The switch they are
currently on does not see any such traffic on vlan1, and since I had them
move the connection, neither switch sees the big traffic spike on vlan1
any longer.

This is quite odd because, as I mentioned, the two switches are configured
the same. Can anybody suggest an explanation, or a course of action for
determining why the traffic that was being routed to that network showed
up on vlan1?

I'm wondering if it's some odd software bug relating to them enabling "ip
routing" on that second switch last night, but not booting fresh after
doing so. I just can't puzzle out in my head why traffic destined for an
L3 port would transit the VLAN like that.
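
As a rough illustration of the check described above, the sketch below turns two samples of an interface octet counter into a bit rate and flags vlan1 when it carries far more than management traffic. It assumes a 32-bit counter that may wrap between samples (as a graphing tool polling an SNMP octet counter would see), and the 1 Mbps baseline is a made-up threshold, not a figure from this network.

```python
def rate_bps(octets_then, octets_now, seconds, counter_bits=32):
    """Bits per second between two samples of a wrapping octet counter."""
    if seconds <= 0:
        raise ValueError("sample interval must be positive")
    # The modulo handles a single counter wrap between the two samples.
    delta = (octets_now - octets_then) % (1 << counter_bits)
    return delta * 8 / seconds


def looks_like_transit(vlan1_bps, baseline_bps=1_000_000):
    """True when vlan1 carries more than the assumed management baseline."""
    return vlan1_bps > baseline_bps


# Hypothetical samples taken 300 seconds apart on the vlan1 interface.
observed = rate_bps(0, 3_750_000_000, 300)
print(observed, looks_like_transit(observed))
```

With the sample values the rate works out to 100 Mbps, which is the order of magnitude of the routed traffic mentioned earlier in the thread and far above what config and ICMP traffic to the switch would produce.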

Thanks,
Andy




diosbejgli at gmail

Aug 16, 2012, 3:07 PM

Post #3 of 6
Re: 3550-12 interrupts out of control, possibly hardware?

Hi Andy,

One idea is that different SDM templates are in use. The SDM template
does not show up in running-config, and changing it requires a reload as
well. I would compare them with the 'sh sdm prefer' command. You might be
running out of room for IPv4 routes, which causes the remaining routes to
be applied in software, so packets are software switched by the CPU, which
can cause high utilization.

http://www.cisco.com/en/US/products/hw/switches/ps646/products_tech_note09186a0080094bc6.shtml

http://www.cisco.com/en/US/docs/switches/lan/catalyst3550/software/release/12.2_44_se/configuration/guide/swadmin.html#wp1235565
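
Since the SDM template is invisible to a running-config diff, one option is to save the 'sh sdm prefer' output from each switch and diff the two captures. The sketch below does that with Python's difflib. The function name capture_diff and the placeholder text are hypothetical, and the placeholder lines are deliberately not real switch output.

```python
import difflib


def capture_diff(name_a, text_a, name_b, text_b):
    """Unified diff of two saved command captures, ignoring trailing spaces."""
    lines_a = [line.rstrip() for line in text_a.strip().splitlines()]
    lines_b = [line.rstrip() for line in text_b.strip().splitlines()]
    return list(
        difflib.unified_diff(
            lines_a, lines_b, fromfile=name_a, tofile=name_b, lineterm=""
        )
    )


# Placeholder captures: paste the real 'sh sdm prefer' output from each switch.
switch_a = """
placeholder line shared by both captures
placeholder line that differs on switch A
"""
switch_b = """
placeholder line shared by both captures
placeholder line that differs on switch B
"""

for line in capture_diff("switch-a", switch_a, "switch-b", switch_b):
    print(line)
```

An empty result means the two captures are identical line for line, so the SDM template can be ruled out as the difference between the switches.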

Best regards,
Andras



andy at xecu

Aug 16, 2012, 3:23 PM

Post #4 of 6
Re: 3550-12 interrupts out of control, possibly hardware?

Thanks, I appreciate those suggestions. I verified both the SDM and VTP
configs are identical.

Did you see my followup from earlier? I identified that for some reason
unknown to me, the traffic was hitting the vlan1 interface before exiting
via the L3 interface facing that network, which was forcing all of the
traffic to get process switched. I have no idea why, though, and would
love suggestions.

My best guess is that because they configured the port for L3 mode before
they enabled ip routing on the failover 3550-12, something didn't happen
right and perhaps a reload would have fixed it. I do know that in the past
when I have done "ip routing" on a live 3550, it goes unresponsive for
about 10-15 seconds, so I have to assume a lot goes on behind the scenes.
And I do know from the transcript of their changes that they configured
the port for L3 mode before realizing ip routing had never been enabled on
that switch. Given the "illogical" (in quotes because perhaps there
is some logic that is escaping me) nature of the behavior observed, I have
to assume it was some sort of quirk or bug like this. For what it's worth,
they're both running c3550-ipservices-mz.122-44.SE6.

Thanks,
Andy




diosbejgli at gmail

Aug 16, 2012, 4:06 PM

Post #5 of 6
Re: 3550-12 interrupts out of control, possibly hardware?

Hi Andy,

The only thing that comes to mind when you mention vlan 1 is that
it is untagged by default. You might therefore have some misconfiguration
or miscabling, or routing not enabled, or something similar that causes
traffic from or to a routed port to arrive untagged, which the other side
then interprets as vlan 1 traffic. Perhaps the switch is not running IP
routing properly, as you say.

High CPU usage is strange; it most often suggests either an L2 loop or
software switching. The latter can be caused by a lack of resources with
an incorrect SDM template.

Some additional details would be needed to understand this better: an
exact topology diagram, the config of the devices, etc. 12.2(44)SE6 is the
latest for the 3550, though. Have you seen anything strange in the logs
during the issue?

Best regards,
Andras



andy at xecu

Aug 18, 2012, 9:31 AM

Post #6 of 6
Re: 3550-12 interrupts out of control, possibly hardware?

Just to follow up on this on the off chance somebody runs into this in the
future...a reload of the switch fixed the issue, and the traffic for the
L3 port stopped hitting vlan 1.

Andy



