Mailing List Archive: Cisco: NSP

BGP Hold time expired/ospf dropping 6500 Sup720-3BXL

 

 



drew.weaver at thenap

Dec 11, 2009, 8:22 AM

Post #1 of 8 (2958 views)
BGP Hold time expired/ospf dropping 6500 Sup720-3BXL

Howdy all,

Last night I had an interesting encounter on one of my 6509s w/ Sup720-3BXL.

This switch has 3x iBGP sessions with full internet tables and is also running OSPF.

Two of the three iBGP sessions randomly dropped with:

%BGP-3-NOTIFICATION: sent to neighbor x.x.x.3 4/0 (hold time expired) 0 bytes

I also noticed that during this period OSPF dropped with "Neighbor Down: Dead timer expired", then re-established, then failed again, then re-established, and so on.

I checked the physical interfaces between this 6500 and the two GSR 12000s it peers with and there were no errors. There was also no obvious spike in traffic that would account for latency that might cause the hold timers to expire. I remember that when this system first came online it took a really long time to download the full internet tables from the upstream GSRs, and during that time a lot of CPU time was being eaten up. I am wondering if the first session failing caused a sort of 'performance' domino effect which then caused everything else to fail. The issue eventually corrected itself and stabilized.
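
For reference, the 4/0 in the notification is BGP error code 4 (Hold Timer Expired): the router heard no KEEPALIVE or UPDATE from the peer for a full hold time. A minimal sketch of that check in Python follows. The 180-second hold time is the usual Cisco default, and the function name and timestamps are made up for illustration, not taken from the router.

```python
def hold_timer_expired(arrival_times, hold_time=180.0, now=None):
    """Return the time at which the hold timer would fire, or None.

    arrival_times: sorted timestamps (seconds) at which a KEEPALIVE or
    UPDATE arrived from the peer. The first entry is treated as the
    moment the session came up.
    now: optional current time, to catch a gap that is still open.
    """
    if not arrival_times:
        raise ValueError("need at least the session start time")
    last = arrival_times[0]
    for t in arrival_times[1:]:
        # The timer restarts on every message; it fires only if the
        # gap since the previous message exceeds the hold time.
        if t - last > hold_time:
            return last + hold_time
        last = t
    if now is not None and now - last > hold_time:
        return last + hold_time
    return None
```

With keepalives every 60 seconds the function returns None. If messages are delayed or dropped for more than 180 seconds (for example because the receiving CPU is too busy to process them), it returns the moment the session would be torn down.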

This particular box is running 12.2(18)SXF17 so I am less likely to believe it is a software bug.

Does anyone have any tips on how I can avoid the hold timer issue altogether, and also on how I can make it so that if a session does go down and re-establish it doesn't totally nail the CPU while it's re-establishing and downloading the routes? A long time ago I also read that increasing the MTU on both ends of a circuit can make BGP tables download faster. I don't know if that's true or not; has anyone else found that?

thanks,
-Drew


_______________________________________________
cisco-nsp mailing list cisco-nsp [at] puck
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/


dave.kruger at mtnbusiness

Dec 15, 2009, 12:27 AM

Post #2 of 8 (2843 views)
Re: BGP Hold time expired/ospf dropping 6500 Sup720-3BXL [In reply to]

Drew Weaver wrote:
> Does anyone have any tips on both how I can avoid the hold timer issue altogether

I don't think your issue is BGP and its hold time: if the OSPF session
drops then so will the BGP session. Are you sure your upstream GSRs did not
fail over? If so, NSF might help you:
http://www.cisco.com/en/US/partner/docs/ios/iproute/configuration/guide/irp_bgp_adv_features_ps6350_TSD_Products_Configuration_Guide_Chapter.html#wp1056241

If you have an unstable IGP, try to figure out why; if you can't, dampen. If
that doesn't help, disable next-hop address tracking:
http://www.cisco.com/en/US/partner/docs/ios/iproute/configuration/guide/irp_bgp_adv_features_ps6350_TSD_Products_Configuration_Guide_Chapter.html#wp1056441

Regards
Dave



globichen at gmail

Jan 21, 2010, 4:28 PM

Post #3 of 8 (2719 views)
Re: BGP Hold time expired/ospf dropping 6500 Sup720-3BXL [In reply to]

Hi,

I just came across this thread while doing a little research to solve a
similar situation.

Hardware:

- 6509 with SUP720-3BXL on both ends
- SXF15a
- Uptime: 46 weeks

Problem:

- OSPF (for the loopback between cores) and BGP (mostly customers whom
we send the full table) going up and down all the time:

%OSPF-5-ADJCHG: Process 1, Nbr x.x.x.130 on TenGigabitEthernet4/1 from
FULL to DOWN, Neighbor Down: Dead timer expired
%OSPF-5-ADJCHG: Process 1, Nbr x.x.x.131 on TenGigabitEthernet9/1 from
LOADING to FULL, Loading Done
%BGP-5-ADJCHANGE: neighbor y.y.y.14 Down BGP Notification sent
%BGP-3-NOTIFICATION: sent to neighbor y.y.y.14 4/0 (hold time expired) 0 bytes
%BGP-5-ADJCHANGE: neighbor y.y.y.14 Up

This keeps going on for several hours, and suddenly it stabilizes itself.
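
To put numbers on how often each neighbor flaps, the two message types above can be counted from a saved syslog file. The following is only a sketch: it assumes each message sits on a single line (they are wrapped by the mail client here), and the function name `count_flaps` is hypothetical.

```python
import re
from collections import Counter

# Patterns for the two "session lost" messages quoted above.
OSPF_DOWN = re.compile(
    r"%OSPF-5-ADJCHG: Process \d+, Nbr (\S+) on (\S+) from \S+ to DOWN"
)
BGP_HOLD = re.compile(
    r"%BGP-3-NOTIFICATION: sent to neighbor (\S+) 4/0 \(hold time expired\)"
)

def count_flaps(lines):
    """Count OSPF adjacency drops and BGP hold-time expiries per neighbor."""
    flaps = Counter()
    for line in lines:
        m = OSPF_DOWN.search(line)
        if m:
            flaps[("ospf", m.group(1))] += 1
            continue
        m = BGP_HOLD.search(line)
        if m:
            flaps[("bgp", m.group(1))] += 1
    return flaps
```

Feeding it the log file line by line (for example `count_flaps(open("router.log"))`, where the file name is a placeholder) gives a per-neighbor tally that can be compared against the traffic graphs.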

Furthermore, I use Cacti to generate graphs from the core router via
SNMP. I have one VLAN that carries around 15 Gbps of traffic at peak times,
and as soon as I hit more than 15 Gbps, no more graphs are drawn, the core
router console becomes rather unresponsive and OSPF starts to behave
strangely.

What I can rule out is the fiber capacity. I have multiple circuits
and different paths and operators. The OSPF issue happens on all
circuits, not just a specific one. No 10 GE link is used more than
60%. In fact, traffic from inside my backbone to any place outside
remains unaffected (thank God), but the core router itself is pretty
useless. Pinging the core's loopback or any IP configured on that box
results in 40-60% packet loss.

CPU usage is not high, it's stable. No unusual processes, just IP
Input and BGP Scanner. More than 50% memory is still free at that
time.

I've had this many times recently, but it really only happens when my
core goes beyond about 15 Gbps of traffic (outbound). We've been below 15
Gbps for 2 years and it never happened during that time. Now all this mess
happens almost daily, rendering important billing graphs useless and
annoying full-table BGP customers.

Is this a memory issue, due to the router's long uptime? Would
reloading the router help in this case? That's the last thing I would
want to do, but if it helps...

Cheers,

Andy



jasonleblanc at gmail

Jan 21, 2010, 4:53 PM

Post #4 of 8 (2708 views)
Re: BGP Hold time expired/ospf dropping 6500 Sup720-3BXL [In reply to]

Can you send your <snipped> OSPF config?



globichen at gmail

Jan 21, 2010, 5:06 PM

Post #5 of 8 (2763 views)
Re: BGP Hold time expired/ospf dropping 6500 Sup720-3BXL [In reply to]

Hi,

here we go:

Core router that is causing headaches:

interface Loopback0
 ip address x.x.x.130 255.255.255.255

interface TenGigabitEthernet9/1
 ip address y.y.y.1 255.255.255.252
 no ip redirects
 no ip proxy-arp
 no cdp enable

router ospf 1
 router-id x.x.x.130
 log-adjacency-changes
 redistribute connected subnets
 redistribute static subnets
 passive-interface default
 no passive-interface TenGigabitEthernet8/1
 no passive-interface TenGigabitEthernet9/1
 no passive-interface TenGigabitEthernet9/2
 network y.y.y.0 0.0.0.3 area 0
 network y.y.y.4 0.0.0.3 area 0
 network y.y.y.8 0.0.0.3 area 0


Adjacent router (one of them):

interface Loopback0
 ip address x.x.x.131 255.255.255.255

interface TenGigabitEthernet4/1
 ip address y.y.y.2 255.255.255.252
 no ip redirects
 no ip proxy-arp

router ospf 1
 router-id x.x.x.131
 log-adjacency-changes
 redistribute connected subnets
 redistribute static subnets
 passive-interface default
 no passive-interface TenGigabitEthernet4/1
 network y.y.y.0 0.0.0.3 area 0
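
A quick way to sanity-check that each `network` statement really covers the interface address on the link is to redo the mask arithmetic offline. The sketch below uses Python's standard `ipaddress` module. Because the real addresses are masked as y.y.y.N, the 192.0.2.x documentation range stands in for them, and both helper names are hypothetical.

```python
import ipaddress

def wildcard_for(netmask):
    """Convert a dotted netmask to the wildcard used in OSPF network statements."""
    return str(ipaddress.ip_network(f"0.0.0.0/{netmask}").hostmask)

def covered_by(interface_ip, network, wildcard):
    """True if interface_ip is matched by 'network <network> <wildcard>'."""
    # Invert the wildcard to get an ordinary netmask.
    mask_int = int(ipaddress.IPv4Address(wildcard)) ^ 0xFFFFFFFF
    netmask = ipaddress.IPv4Address(mask_int)
    # strict=False ignores host bits set in the network argument
    # instead of raising an error.
    net = ipaddress.ip_network(f"{network}/{netmask}", strict=False)
    return ipaddress.ip_address(interface_ip) in net
```

For the /30 links shown above, `wildcard_for("255.255.255.252")` gives `0.0.0.3`, and both ends of a link (the .1 and .2 addresses) should be covered by the same statement.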


I hope this helps...

Andy




skoal at skoal

Jan 22, 2010, 12:07 AM

Post #6 of 8 (2713 views)
Re: BGP Hold time expired/ospf dropping 6500 Sup720-3BXL [In reply to]

Just a thought:

sh ip bgp neighbors | i Datagrams

Maybe one router is negotiating the session with a low datagram size
and the update storm floods the connection.
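
To see why the segment size matters, compare how many TCP segments the same table transfer needs at different sizes. This is only back-of-the-envelope arithmetic: the 10 MB transfer size is an assumed round figure, not a measured one, and 536 and 1460 bytes are the commonly seen values with and without a usable 1500-byte path MTU.

```python
def segments_needed(total_bytes, mss):
    """Number of TCP segments needed to carry total_bytes at a given MSS."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    # Ceiling division using integers only.
    return -(-total_bytes // mss)

ASSUMED_TABLE_BYTES = 10_000_000  # hypothetical size of a full-table transfer

small = segments_needed(ASSUMED_TABLE_BYTES, 536)
large = segments_needed(ASSUMED_TABLE_BYTES, 1460)
print(small, large)
```

Under these assumptions the 536-byte session needs roughly 2.7 times as many packets, each of which has to be processed by the route processor on both ends.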




bandwidth.user at gmail

Jan 22, 2010, 2:00 AM

Post #7 of 8 (2685 views)
Re: BGP Hold time expired/ospf dropping 6500 Sup720-3BXL [In reply to]

We had a somewhat similar problem with OSPF/BGP, which was eventually
resolved by making the link MTU uniform on both ends of every link. Let me
know if this helps.
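
If the interface MTUs have been collected from each router (for example from `show interface` output), mismatched links can be picked out with a few lines of Python. This is a sketch only; the link names and MTU values in the usage note are invented for illustration.

```python
def mtu_mismatches(links):
    """Return the links whose two ends report different MTUs.

    links: iterable of (link_name, mtu_end_a, mtu_end_b) tuples.
    """
    return [
        (name, mtu_a, mtu_b)
        for name, mtu_a, mtu_b in links
        if mtu_a != mtu_b
    ]
```

For example, `mtu_mismatches([("core1-core2", 1500, 1500), ("core1-core3", 9216, 1500)])` returns only the second link. OSPF advertises the interface MTU in its Database Description packets, so a mismatch like that typically shows up as an adjacency that never gets past EXSTART/EXCHANGE.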

On Friday, 22 January, 2010 04:07 PM, Gergely Antal wrote:
>
> just a thought :
> sh ip bgp neighbors | i Datagrams
>
> maybe one router tries to negotiate the session with low datagram size
> and the update storm floods the connection.
>
>
> On Fri, 22 Jan 2010 02:06:53 +0100
> "Andy B."<globichen [at] gmail> wrote:
>
>> Hi,
>>
>> here we go:
>>
>> Core router that is causing headaches:
>>
>> interface Loopback0
>> ip address x.x.x.130 255.255.255.255
>>
>> interface TenGigabitEthernet9/1
>> ip address y.y.y.1 255.255.255.252
>> no ip redirects
>> no ip proxy-arp
>> no cdp enable
>>
>> router ospf 1
>> router-id x.x.x.130
>> log-adjacency-changes
>> redistribute connected subnets
>> redistribute static subnets
>> passive-interface default
>> no passive-interface TenGigabitEthernet8/1
>> no passive-interface TenGigabitEthernet9/1
>> no passive-interface TenGigabitEthernet9/2
>> network y.y.y.0 0.0.0.3 area 0
>> network y.y.y.4 0.0.0.3 area 0
>> network y.y.y.8 0.0.0.3 area 0
>>
>>
>> Adjacent router (one of them):
>>
>> interface Loopback0
>> ip address x.x.x.131 255.255.255.255
>>
>> interface TenGigabitEthernet4/1
>> ip address y.y.y.2 255.255.255.252
>> no ip redirects
>> no ip proxy-arp
>>
>> router ospf 1
>> router-id x.x.x.131
>> log-adjacency-changes
>> redistribute connected subnets
>> redistribute static subnets
>> passive-interface default
>> no passive-interface TenGigabitEthernet4/1
>> network y.y.y.0 0.0.0.3 area 0
>>
>>
>> I hope this helps...
>>
>> Andy
>>
>>
>> On Fri, Jan 22, 2010 at 1:53 AM, Jason LeBlanc
>> <jasonleblanc [at] gmail> wrote:
>>> Can you send your<snipped> OSPF config?
>>>
>>> On Jan 21, 2010, at 5:28 PM, Andy B. wrote:
>>>
>>>> Hi,
>>>>
>>>> I just fell over this thread while doing a little reseach to solve a
>>>> similar situation.
>>>>
>>>> Hardware:
>>>>
>>>> - 6509 with SUP720-3BXL on both ends
>>>> - SXF15a
>>>> - Uptime: 46 weeks
>>>>
>>>> Problem:
>>>>
>>>> - OSPF (for the loopback between cores) and BGP (mostly customers
>>>> whom we send the full table) going up and down all the time:
>>>>
>>>> %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.130 on TenGigabitEthernet4/1
>>>> from FULL to DOWN, Neighbor Down: Dead timer expired
>>>> %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.131 on TenGigabitEthernet9/1
>>>> from LOADING to FULL, Loading Done
>>>> %BGP-5-ADJCHANGE: neighbor y.y.y.14 Down BGP Notification sent
>>>> %BGP-3-NOTIFICATION: sent to neighbor y.y.y.14 4/0 (hold time
>>>> expired) 0 bytes %BGP-5-ADJCHANGE: neighbor y.y.y.14 Up
>>>>
>>>> This keeps going on for several hours, and suddenly it stabilizes
>>>> itself.
>>>>
>>>> Furthermore I use cacti to generate graphs from the core router via
>>>> SNMP. I have one VLAN that has around 15 GBPS traffic at peak times,
>>>> and as soon as I hit more than 15 GBPS, no more graphs are drawn,
>>>> core router console becomes rather unresponsive and OSPF starts to
>>>> behave strangely.
>>>>
>>>> What I can rule out is the fiber capacity. I have multiple circuits
>>>> and different paths and operators. The OSPF issue happens on all
>>>> circuits, not just a specific one. No 10 GE link is used more than
>>>> 60%. In fact, traffic from inside my backbone to any place outside
>>>> remains unaffected (thank God), but the core router itself is pretty
>>>> useless. Pinging the core's loopback or any ip loaded on that box
>>>> results in a 40-60% packet loss.
>>>>
>>>> CPU usage is not high, it's stable. No unusual processes, just IP
>>>> Input and BGP Scanner. More than 50% memory is still free at that
>>>> time.
>>>>
>>>> I've had this many times recently, but it really just happens when
>>>> my core goes beyond +- 15 GBPS of traffic (outbound). We've been
>>>> below 15 GBPS for 2 years and it never happaned at that time. Now
>>>> all this mess happens almost daily, rendering important billing
>>>> graphs useless and annoying full table BGP customers.
>>>>
>>>> Is this a memory issue, due to the router's long uptime? Would
>>>> reloading the router help in this case? That's the last thing I
>>>> would want to do, but if it helps...
>>>>
>>>> Cheers,
>>>>
>>>> Andy
>>>>

_______________________________________________
cisco-nsp mailing list cisco-nsp [at] puck
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/


globichen at gmail

Jan 22, 2010, 2:26 AM

Post #8 of 8 (2690 views)
Re: BGP Hold time expired/ospf dropping 6500 Sup720-3BXL [In reply to]

MTU is 1500 on all links:

Core 1:

#sh int te9/1 | i MTU
MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,

#sh int te9/2 | i MTU
MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,

#sh int te8/1 | i MTU
MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,

Core 2:

#sh int te4/1 | i MTU
MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,

Core 3:

#sh int te4/1 | i MTU
MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,

Core 4:

#sh int te4/1 | i MTU
MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,
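
For anyone who wants to repeat this check across more boxes, here is a minimal Python sketch (not something that was run in this thread). It pulls the MTU out of "sh int ... | i MTU" lines like the ones above and flags any interface that differs from the most common value. The dictionary keys are labels made up for this example; the output lines are the ones pasted above.

```python
import re
from collections import Counter

# Output lines as pasted above; the keys are hypothetical labels for this sketch.
SHOW_INT_OUTPUT = {
    "core1 te9/1": "MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,",
    "core1 te9/2": "MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,",
    "core1 te8/1": "MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,",
    "core2 te4/1": "MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,",
    "core3 te4/1": "MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,",
    "core4 te4/1": "MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,",
}

MTU_PATTERN = re.compile(r"MTU (\d+) bytes")


def parse_mtu(line):
    """Extract the MTU in bytes from one 'sh int | i MTU' output line."""
    match = MTU_PATTERN.search(line)
    if match is None:
        raise ValueError(f"no MTU found in: {line!r}")
    return int(match.group(1))


def mtu_mismatches(outputs):
    """Return {label: mtu} for interfaces that differ from the most common MTU."""
    mtus = {label: parse_mtu(line) for label, line in outputs.items()}
    if not mtus:
        return {}
    common_mtu, _ = Counter(mtus.values()).most_common(1)[0]
    return {label: mtu for label, mtu in mtus.items() if mtu != common_mtu}


if __name__ == "__main__":
    odd = mtu_mismatches(SHOW_INT_OUTPUT)
    print("all MTUs match" if not odd else f"mismatched: {odd}")
```

This only compares the interface MTU that the command reports. It says nothing about the TCP segment size of the BGP sessions, which is what the "sh ip bgp neighbors | i Datagrams" suggestion is after.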

Core 1 is physically connected to Cores 2, 3 and 4 (star topology).

BGP is fully meshed - no route reflector.
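
To put a number on the full mesh: with n routers and no route reflector, every router peers with every other one, so there are n(n-1)/2 iBGP sessions in total and n-1 on each router. A small Python sketch of that arithmetic, assuming only the four cores listed above are in the mesh (customer sessions are not counted):

```python
def full_mesh_sessions(routers):
    """Total iBGP sessions in a full mesh of `routers` speakers: n(n-1)/2."""
    if routers < 0:
        raise ValueError("router count cannot be negative")
    return routers * (routers - 1) // 2


def sessions_per_router(routers):
    """iBGP sessions each router holds in a full mesh: n-1."""
    if routers < 0:
        raise ValueError("router count cannot be negative")
    return max(routers - 1, 0)


if __name__ == "__main__":
    cores = 4  # assumption: only Core 1 to Core 4 take part in the mesh
    print(full_mesh_sessions(cores), "sessions in total,",
          sessions_per_router(cores), "per router")
```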

Andy

On Fri, Jan 22, 2010 at 11:00 AM, roy <bandwidth.user [at] gmail> wrote:
> We had a somewhat similar problem with OSPF/BGP, which was eventually
> resolved by making the MTU uniform across all links. Let me know if
> this helps.
>
> On Friday, 22 January, 2010 04:07 PM, Gergely Antal wrote:
>>
>> Just a thought:
>> sh ip bgp neighbors | i Datagrams
>>
>> Maybe one router negotiates the session with a small datagram size
>> and the update storm then floods the connection.
>>
>>
>> On Fri, 22 Jan 2010 02:06:53 +0100
>> "Andy B."<globichen [at] gmail> wrote:
>>
>>> Hi,
>>>
>>> here we go:
>>>
>>> Core router that is causing headaches:
>>>
>>> interface Loopback0
>>>  ip address x.x.x.130 255.255.255.255
>>>
>>> interface TenGigabitEthernet9/1
>>>  ip address y.y.y.1 255.255.255.252
>>>  no ip redirects
>>>  no ip proxy-arp
>>>  no cdp enable
>>>
>>> router ospf 1
>>>  router-id x.x.x.130
>>>  log-adjacency-changes
>>>  redistribute connected subnets
>>>  redistribute static subnets
>>>  passive-interface default
>>>  no passive-interface TenGigabitEthernet8/1
>>>  no passive-interface TenGigabitEthernet9/1
>>>  no passive-interface TenGigabitEthernet9/2
>>>  network y.y.y.0 0.0.0.3 area 0
>>>  network y.y.y.4 0.0.0.3 area 0
>>>  network y.y.y.8 0.0.0.3 area 0
>>>
>>>
>>> Adjacent router (one of them):
>>>
>>> interface Loopback0
>>>  ip address x.x.x.131 255.255.255.255
>>>
>>> interface TenGigabitEthernet4/1
>>>  ip address y.y.y.2 255.255.255.252
>>>  no ip redirects
>>>  no ip proxy-arp
>>>
>>> router ospf 1
>>>  router-id x.x.x.131
>>>  log-adjacency-changes
>>>  redistribute connected subnets
>>>  redistribute static subnets
>>>  passive-interface default
>>>  no passive-interface TenGigabitEthernet4/1
>>>  network y.y.y.0 0.0.0.3 area 0
>>>
>>>
>>> I hope this helps...
>>>
>>> Andy
>>>
>>>
>>> On Fri, Jan 22, 2010 at 1:53 AM, Jason LeBlanc
>>> <jasonleblanc [at] gmail> wrote:
>>>>
>>>> Can you send your <snipped> OSPF config?
>>>>
>>>> On Jan 21, 2010, at 5:28 PM, Andy B. wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I just stumbled on this thread while doing a little research to
>>>>> solve a similar problem.
>>>>>
>>>>> Hardware:
>>>>>
>>>>> - 6509 with SUP720-3BXL on both ends
>>>>> - SXF15a
>>>>> - Uptime: 46 weeks
>>>>>
>>>>> Problem:
>>>>>
>>>>> - OSPF (for the loopback between cores) and BGP (mostly customers
>>>>> to whom we send the full table) keep going up and down:
>>>>>
>>>>> %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.130 on TenGigabitEthernet4/1 from FULL to DOWN, Neighbor Down: Dead timer expired
>>>>> %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.131 on TenGigabitEthernet9/1 from LOADING to FULL, Loading Done
>>>>> %BGP-5-ADJCHANGE: neighbor y.y.y.14 Down BGP Notification sent
>>>>> %BGP-3-NOTIFICATION: sent to neighbor y.y.y.14 4/0 (hold time expired) 0 bytes
>>>>> %BGP-5-ADJCHANGE: neighbor y.y.y.14 Up
>>>>>
>>>>> This keeps going on for several hours, and suddenly it stabilizes
>>>>> itself.
>>>>>
>>>>> Furthermore, I use Cacti to generate graphs from the core router via
>>>>> SNMP. I have one VLAN that carries around 15 Gbps of traffic at peak
>>>>> times, and as soon as I exceed 15 Gbps, no more graphs are drawn,
>>>>> the core router console becomes rather unresponsive and OSPF starts
>>>>> to behave strangely.
>>>>>
>>>>> What I can rule out is the fiber capacity. I have multiple circuits
>>>>> and different paths and operators. The OSPF issue happens on all
>>>>> circuits, not just a specific one. No 10 GE link is used more than
>>>>> 60%. In fact, traffic from inside my backbone to any place outside
>>>>> remains unaffected (thank God), but the core router itself is pretty
>>>>> useless. Pinging the core's loopback or any IP configured on that box
>>>>> results in 40-60% packet loss.
>>>>>
>>>>> CPU usage is not high, it's stable. No unusual processes, just IP
>>>>> Input and BGP Scanner. More than 50% memory is still free at that
>>>>> time.
>>>>>
>>>>> I've had this many times recently, but it only happens when my core
>>>>> goes beyond roughly 15 Gbps of outbound traffic. We were below
>>>>> 15 Gbps for 2 years and it never happened in that time. Now all
>>>>> this mess happens almost daily, rendering important billing graphs
>>>>> useless and annoying full-table BGP customers.
>>>>>
>>>>> Is this a memory issue, due to the router's long uptime? Would
>>>>> reloading the router help in this case? That's the last thing I
>>>>> would want to do, but if it helps...
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Andy
>>>>>
