Mailing List Archive: Wikipedia: Wikitech

potential network issue due to packet losses

 

 



ctwoo at wikimedia

Jul 2, 2012, 3:03 PM

Post #1 of 4 (552 views)
potential network issue due to packet losses

All,

The Technical Operations team noticed abnormal network packet loss
sometime after yesterday's 'leap second' switch (midnight UTC). While it
does not seem to impact site availability at this moment, it is a
concern. We are not yet sure whether it is even related to the 'leap
second' switch.

Leslie has opened a ticket with our network equipment provider, and she
and Mark have been working with them since this morning to pinpoint the
problem. The troubleshooting process itself might introduce some latency
or other issues.

If you do experience anything abnormal, please let us know (email to
ops [at] wikimedia or find us at the #wikimedia-operations IRC channel).

Thanks,
CT
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


lcarr at wikimedia

Jul 2, 2012, 4:55 PM

Post #2 of 4 (546 views)
Re: potential network issue due to packet losses [In reply to]

Hi Everyone -

This issue appears to be patched up. Please let me know immediately if
you see any more network issues.

Longer explanation - the root cause of issues we saw today was a
"fixed" router bug (our code version should not have been affected).
When in a firewall filter, packets are rejected (which sends an ICMP
rejected notice), the routing engine can receive too many of these
requests, causing the routing engine to "choke" on its backlog of
requests. This backlog caused packets destined to the routing engine to
be dropped, which in turn caused several issues, as VRRP, BFD, and BGP
all stopped processing. For a currently unknown reason, OSPF was
unaffected.
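
As a rough illustration of the failure mode described above, here is a
toy Python model. It is not the router's actual implementation: the
single shared bounded queue, the function name `simulate`, and all of
the rates and limits are assumptions made up for this sketch. It only
shows how a flood of ICMP-generation requests can crowd protocol
keepalives out of a shared backlog.

```python
from collections import deque


def simulate(reject_rate, keepalives_per_tick=3, service_rate=10,
             queue_limit=50, ticks=100):
    """Return the fraction of keepalives dropped in a toy shared-queue model.

    Assumptions (hypothetical, for illustration only):
      - one bounded queue sits in front of the routing engine;
      - each tick, `reject_rate` ICMP-generation requests arrive first,
        followed by `keepalives_per_tick` control-protocol packets;
      - the engine then services up to `service_rate` queued items.
    """
    queue = deque()
    sent = 0
    dropped = 0
    for _ in range(ticks):
        arrivals = ["icmp"] * reject_rate + ["keepalive"] * keepalives_per_tick
        for item in arrivals:
            if item == "keepalive":
                sent += 1
            if len(queue) >= queue_limit:
                # Queue is full: the arriving item is tail-dropped.
                if item == "keepalive":
                    dropped += 1
                continue
            queue.append(item)
        # The engine drains a fixed amount of work per tick.
        for _ in range(min(service_rate, len(queue))):
            queue.popleft()
    return dropped / sent


if __name__ == "__main__":
    print("light reject load:", simulate(reject_rate=2))
    print("heavy reject load:", simulate(reject_rate=200))
```

In this model a light reject load drops no keepalives, while a heavy one
drops all of them, which is the kind of starvation that would make
protocols such as VRRP, BFD, and BGP time out.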

After correcting this, for an unknown reason, one vlan was not
processing packets destined to the routing engine, while the other
vlans were properly processing these packets. This caused both of our
main routers on that vlan to claim VRRP mastership - basically causing
two routers to claim to be the default gateway for the subnet which
contains the LVS servers. After disabling VRRP, the router still was
not passing traffic destined to this vlan. Turning down the vlan and
then turning it back up and adding and removing an arp policer (yes,
turning it off and on again) fixed this situation. This vlan issue
caused a public facing outage.
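
To make the dual-mastership symptom concrete, here is a minimal Python
sketch of a VRRP-style election. The master-down timer follows the
VRRPv2 formula (3 x advertisement interval + skew, with skew =
(256 - priority) / 256); everything else is an assumption for
illustration: the router names, the priorities 200 and 100, and the
simplification that a master advertises on every simulation step. It is
not the configuration of the routers discussed in this thread.

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    priority: int
    adv_interval: float = 1.0   # seconds between advertisements
    state: str = "backup"
    last_heard: float = 0.0     # time the last advertisement was received

    def master_down_interval(self):
        # VRRPv2: 3 * Advertisement_Interval + Skew_Time
        skew = (256 - self.priority) / 256
        return 3 * self.adv_interval + skew


def make_pair():
    # Hypothetical names and priorities, chosen only for this sketch.
    return [Router("router-a", 200), Router("router-b", 100)]


def run(routers, link_up, duration=10.0, step=0.1):
    """Return the sorted names of routers in master state after `duration`."""
    steps = int(round(duration / step))
    for i in range(1, steps + 1):
        now = i * step
        # A backup that has heard nothing for too long promotes itself.
        for r in routers:
            if r.state == "backup" and now - r.last_heard >= r.master_down_interval():
                r.state = "master"
        # Advertisements are delivered only if the vlan passes them.
        if link_up:
            for sender in [r for r in routers if r.state == "master"]:
                for r in routers:
                    if r is not sender and sender.priority > r.priority:
                        r.state = "backup"
                        r.last_heard = now
    return sorted(r.name for r in routers if r.state == "master")


if __name__ == "__main__":
    print("advertisements delivered:", run(make_pair(), link_up=True))
    print("advertisements lost:     ", run(make_pair(), link_up=False))
```

When advertisements are delivered, only the higher-priority router ends
up as master. When they are lost, both timers expire and both routers
claim mastership, which matches the two-default-gateways situation
described above.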

The current status is that everything is working and cr2-pmtpa is the
VRRP master for all of Tampa. We were lucky that this bug hit
cr1-sdtpa much harder than cr2-pmtpa. Eqiad was not affected, and
while we cannot yet say definitively, I believe it is due to the more
powerful routing engines and more robust network design of the eqiad
datacenter and routers. Software upgrades and configuration changes
should fix this issue in Tampa. Another possible fix would be hardware
upgrades of the core routers; however, that may be prohibitively
expensive and would require some downtime for important machines in
pmtpa.


Leslie

--
Leslie Carr
Wikimedia Foundation
AS 14907, 43821
http://as14907.peeringdb.com/


saper at saper

Jul 3, 2012, 2:05 PM

Post #3 of 4 (541 views)
Re: potential network issue due to packet losses [In reply to]

>> Leslie Carr <lcarr [at] wikimedia> wrote:
> When in a firewall filter, packets are rejected (which sends an ICMP
> rejected notice), the routing engine can receive too many of these
> requests, causing the routing engine to "choke" on its backlog of
> requests.

Leslie, thanks for the excellent update! Was it something similar to an
ICMP storm caused by unreachables (similar to the problems caused by
subnet-directed packets in the old days) that even ICMP rate limiting
didn't help?

//Saper




lcarr at wikimedia

Jul 3, 2012, 2:22 PM

Post #4 of 4 (537 views)
Re: potential network issue due to packet losses [In reply to]

On Tue, Jul 3, 2012 at 2:05 PM, Marcin Cieslak <saper [at] saper> wrote:
>>> Leslie Carr <lcarr [at] wikimedia> wrote:
>> When in a firewall filter, packets are rejected (which sends an ICMP
>> rejected notice), the routing engine can receive too many of these
>> requests, causing the routing engine to "choke" on its backlog of
>> requests.
>
> Leslie, thanks for the excellent update! Was it something similar to an
> ICMP storm caused by unreachables (similar to the problems caused by
> subnet-directed packets in the old days) that even ICMP rate limiting
> didn't help?
>

Sadly, ICMP rate limiting only applies to ICMP packets incoming to the
RE; outgoing packets are created and processed before any filters kick in.
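
A small Python sketch of why an ingress-only limiter does not help
here. This is a toy model, not router code: the `TokenBucket` class, the
`engine_work` function, and the numbers are all hypothetical. The
limiter caps inbound ICMP, but the work of generating outbound ICMP for
rejected packets never passes through it.

```python
class TokenBucket:
    """Plain token-bucket rate limiter (illustrative only)."""

    def __init__(self, rate, burst):
        self.rate = rate            # tokens added per second
        self.burst = burst          # maximum tokens held
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def engine_work(inbound_icmp, rejected_packets, limiter, now=0.0):
    """Return (admitted inbound ICMP, outbound ICMP generated).

    Assumption for this sketch: only inbound ICMP is rate limited; each
    rejected packet still costs the engine one ICMP generation.
    """
    admitted = sum(1 for _ in range(inbound_icmp) if limiter.allow(now))
    generated = rejected_packets    # never passes through the limiter
    return admitted, generated


if __name__ == "__main__":
    limiter = TokenBucket(rate=10, burst=5)
    print(engine_work(inbound_icmp=1000, rejected_packets=1000, limiter=limiter))
```

In this model the inbound side is capped at the burst size, while the
generated side still grows one-for-one with the number of rejected
packets.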




--
Leslie Carr
Wikimedia Foundation
AS 14907, 43821
http://as14907.peeringdb.com/
