
Mailing List Archive: NANOG: users

Thoughts on increasing MTUs on the internet

 

 



adrian at creative

Apr 13, 2007, 9:24 AM

Post #51 of 64 (5057 views)
Re: Thoughts on increasing MTUs on the internet [In reply to]

On Fri, Apr 13, 2007, Steve Meuse wrote:
> On 4/13/07, Valdis.Kletnieks [at] vt <Valdis.Kletnieks [at] vt> wrote:
> >
> >
> >For that matter, what releases of Windows support setting a 9K MTU?
> >That's probably the *real* uptake limiter.
>
> Most, if not all. I have an XP box that has a GigE with 9k MTU.

Lucky you. The definition of "large frames" varies depending entirely
upon the driver. I came up against this when a client nicely asked about
jumbo frames on his shiny new Cisco 3560 switch - and none of his
computers could agree on anything greater than 4k. And, to make things
worse - a few of the drivers wanted to enforce certain values rather
than any value between 1500 and an upper limit - making the whole
feat impossible.

Yay for non-clear specifications. The skeptic in me says "ain't going
to happen." The believer in me says "Ah, that'd be cool, wouldn't it?"
The realist in me says "probably best to mandate that kind of stuff
with the next revision of the ipv6-internet with the first few bits
set to 010 instead of 001." :)

The real uptake limiter is the disagreement on implementation.
Some of you have to remember how this whole internet thing started
and grew (I've only read about the collaboration in books.)



Adrian


stephen at sprunk

Apr 13, 2007, 10:31 AM

Post #52 of 64 (5055 views)
Re: Thoughts on increasing MTUs on the internet [In reply to]

Thus spake "Mikael Abrahamsson" <swmike [at] swm>
> The internet is a very diverse and complicated beast and if end
> systems can properly detect PMTU by doing discovery of this, it
> might work. ... Make sure they can properly detect PMTU by
> use of nothing more than "is this packet size getting thru" (ie
> no ICMP-NEED-TO-FRAG) or alike, then we might see partial
> adoption of larger MTU in some parts and if this becomes a
> major customer requirement then it might spread.

PMTU Black Hole Detection works well in my experience, but unfortunately MS
doesn't turn it on by default, which is where all of the L2VPN with <1500
MTU issues come from; turn BHD on and the problems just go away... (And, as
others have noted, there's better PMTUD algorithms that are designed to work
_with_ black holes, but IME they're not really needed)

Still, we have a (mostly) working solution for wide-area use; what's missing
is the critical step in getting varying MTUs working on a single subnet.
All the solutions so far have required setting a higher, but still fixed,
MTU for every device and that isn't realistic on the edge except in tightly
controlled environments like HPC or internal datacenters.

Perry Lorier's solution is rather clever; perhaps we don't even need a
protocol sanctioned by the IEEE or IETF?

S

Stephen Sprunk, CCIE #3723, K5SSS
"Those people who think they know everything are a great annoyance
to those of us who do." --Isaac Asimov


DLasher at newedgenetworks

Apr 13, 2007, 12:18 PM

Post #53 of 64 (5044 views)
RE: Thoughts on increasing MTUs on the internet [In reply to]

-----Original Message-----
From: owner-nanog [at] merit [mailto:owner-nanog [at] merit] On Behalf Of
Stephen Sprunk
Sent: Friday, April 13, 2007 10:32 AM
To: Mikael Abrahamsson
Cc: North American Noise and Off-topic Gripes
Subject: Re: Thoughts on increasing MTUs on the internet

> PMTU Black Hole Detection works well in my experience, but unfortunately MS
> doesn't turn it on by default, which is where all of the L2VPN with <1500
> MTU issues come from; turn BHD on and the problems just go away... (And, as
> others have noted, there's better PMTUD algorithms that are designed to work
> _with_ black holes, but IME they're not really needed)

I wish I'd had your experience. PMTU _can_ work well, but on the internet as
a whole, far too many ignorant paranoid admins block PMTU, mostly by
accident, causing all sorts of unpleasantness. Clearing DF only takes you so
far. Unless both ends are aware, and respond appropriately to the squeeze
in the middle, you're back to square one.

Unless there were some other method of MTU Discovery implemented, depending
on something like PMTU discovery may fail just as dramatically on larger
packets as it does on 1500-byte packets now.


stephen at sprunk

Apr 13, 2007, 2:15 PM

Post #54 of 64 (5045 views)
Re: Thoughts on increasing MTUs on the internet [In reply to]

Thus spake "Lasher, Donn" <DLasher [at] newedgenetworks>
>> PMTU Black Hole Detection works well in my experience, but unfortunately
>> MS doesn't turn it on by default, which is where all of the L2VPN with
>> <1500 MTU issues come from; turn BHD on and the problems just go away...
>> (And, as others have noted, there's better PMTUD algorithms that are
>> designed to work _with_ black holes, but IME they're not really needed)
>
> I wish I'd had your experience. PMTU _can_ work well, but on the internet
> as a whole, far too many ignorant paranoid admins block PMTU, mostly by
> accident, causing all sorts of unpleasantness.

You can't block PMTUD per se, just the ICMP messages that dumber
implementations rely on. And, as I noted, MS's implementation is dumb by
default, which leads to the problems we're all familiar with. "PMTU Black
Hole Detection" is appropriately named; one registry change* and a reboot is
all you need to solve the problem. Of course, that's non-trivial to
implement when there's hundreds of millions of boxes with the wrong
setting...

> Clearing DF only takes you so far. Unless both ends are aware, and respond
> appropriately to the squeeze in the middle, you're back to square one.

Smarter implementations still set DF. The difference is that when they get
neither an ACK nor an ICMP, they try progressively smaller sizes until they
do get a response of some kind. They make a note of what works and continue
on with that, with the occasional larger probe in case the problem was
transient.
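The fallback behaviour described above can be sketched roughly as follows. This is a toy simulation and not any particular stack's code: the `delivered` callback stands in for "send a DF packet of this size and see whether an ACK comes back", and the table of candidate sizes is an illustrative assumption.

```python
from typing import Callable, Sequence

# Candidate sizes to fall back through, largest first. These particular
# values are illustrative assumptions, not taken from any specific stack.
PLATEAUS: Sequence[int] = (9000, 4470, 1500, 1492, 1400, 1280, 576)


def probe_path_mtu(delivered: Callable[[int], bool],
                   start: int,
                   plateaus: Sequence[int] = PLATEAUS) -> int:
    """Return the largest candidate size <= start that gets a response.

    `delivered(size)` models sending a DF packet of that size; False
    means silence (neither an ACK nor an ICMP error came back).
    """
    candidates = [start] + [p for p in plateaus if p < start]
    for size in candidates:
        if delivered(size):
            return size
    raise RuntimeError("no candidate size got through")


def make_path(bottleneck: int) -> Callable[[int], bool]:
    """Simulate a black-hole path: oversize packets vanish silently."""
    return lambda size: size <= bottleneck


if __name__ == "__main__":
    path = make_path(bottleneck=1476)   # e.g. a tunnel in the middle
    print(probe_path_mtu(path, start=9000))
```

The "occasional larger probe" would amount to calling `probe_path_mtu` again later with `start` set back to the interface MTU, in case the squeeze was transient.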

In fact, one could consider Lorier's "mtud" to be roughly the same idea;
it's only needed because the stack's own PMTUD code is typically bypassed
for on-subnet destinations and/or not as smart as it should be.

S

* HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\EnablePMTUBHDetect=1
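For anyone who has to push that setting to more than one box, a minimal sketch of scripting it with the stock `reg add` command follows. It assumes the value is a REG_DWORD; the `--apply` switch is a hypothetical flag of this script only, and the reboot mentioned above is still needed afterwards.

```python
import subprocess
import sys

# Key named in the footnote above, written with the HKLM abbreviation.
KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def bhd_command(enable: bool = True) -> list[str]:
    """Build the `reg add` invocation that sets EnablePMTUBHDetect."""
    return ["reg", "add", KEY,
            "/v", "EnablePMTUBHDetect",
            "/t", "REG_DWORD",          # assumed value type
            "/d", "1" if enable else "0",
            "/f"]                       # overwrite without prompting


if __name__ == "__main__":
    cmd = bhd_command()
    print(" ".join(cmd))
    # Only touch the registry when explicitly asked to, and only on Windows.
    if sys.platform == "win32" and "--apply" in sys.argv:
        subprocess.run(cmd, check=True)  # needs an elevated prompt
```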

Stephen Sprunk, CCIE #3723, K5SSS
"Those people who think they know everything are a great annoyance
to those of us who do." --Isaac Asimov


simon at limmat

Apr 13, 2007, 4:39 PM

Post #55 of 64 (5041 views)
Re: Thoughts on increasing MTUs on the internet [In reply to]

Ah, large MTUs. Like many other "academic" backbones, we implemented
large (9192 bytes) MTUs on our backbone and 9000 bytes on some hosts.
See [1] for an illustration. Here are *my* current thoughts on
increasing the Internet MTU beyond its current value, 1500. (On the
topic, see also [2] - a wiki page which is actually served on a
9000-byte MTU server :-)

Benefits of >1500-byte MTUs:

Several benefits of moving to larger MTUs, say in the 9000-byte range,
were cited. I don't find them too convincing anymore.

1. Fewer packets reduce work for routers and hosts.

Routers:

Most backbones seem to size their routers to sustain (near-)
line-rate traffic even with small (64-byte) packets. That's a good
thing, because if networks were dimensioned to just work at average
packet sizes, they would be pretty easy to DoS by sending floods of
small packets. So I don't see how raising the MTU helps much
unless you also raise the minimum packet size - which might be
interesting, but I haven't heard anybody suggest that.

This should be true for routers and middleboxes in general,
although there are certainly many places (especially firewalls)
where pps limitations ARE an issue. But again, raising the MTU
doesn't help if you're worried about the worst case. And I would
like to see examples where it would help significantly even in the
normal case. In our network it certainly doesn't - we have Mpps to
spare.

Hosts:

For hosts, filling high-speed links at 1500-byte MTU has often been
difficult at certain times (with Fast Ethernet in the nineties,
GigE 4-5 years ago, 10GE today), due to the high rate of
interrupts/context switches and internal bus crossings.
Fortunately tricks like polling-instead-of-interrupts (Saku Ytti
mentioned this), Interrupt Coalescence and Large-Send Offload have
become commonplace these days. These give most of the end-system
performance benefits of large packets without requiring any support
from the network.

2. Fewer bytes (saved header overhead) free up bandwidth.

TCP segments over Ethernet with 1500 byte MTU are "only" 94.2%
efficient, while with 9000 byte MTU they would be 99.?% efficient.
While an improvement would certainly be nice, 94% already seems
"good enough" to me. (I'm ignoring the byte savings due to fewer
ACKs. On the other hand not all packets will be able to grow
sixfold - some transfers are small.)
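The efficiency figures can be roughly reproduced with a small calculation. The sketch below assumes IPv4 without options, TCP with the timestamp option enabled, and 38 bytes of per-frame Ethernet overhead (preamble, header, FCS and inter-frame gap); other assumptions shift the result by a fraction of a percent.

```python
ETH_OVERHEAD = 38     # preamble+SFD 8, header 14, FCS 4, inter-frame gap 12
IP_HEADER = 20        # IPv4, no options
TCP_HEADER = 20       # base TCP header
TCP_TIMESTAMPS = 12   # timestamp option incl. padding (assumed enabled)


def tcp_efficiency(mtu: int, timestamps: bool = True) -> float:
    """Fraction of on-the-wire bytes that carry TCP payload."""
    headers = IP_HEADER + TCP_HEADER + (TCP_TIMESTAMPS if timestamps else 0)
    payload = mtu - headers
    return payload / (mtu + ETH_OVERHEAD)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: {tcp_efficiency(mtu):.1%}")
```

Under these assumptions the 1500-byte case comes out at roughly 94% and the 9000-byte case at roughly 99%, so the gain is about five percentage points.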

3. TCP runs faster.

This boils down to two aspects (besides the effects of (1) and (2)):

a) TCP reaches its "cruising speed" faster.

Especially with LFNs (Long Fat Networks, i.e. paths with a large
bandwidth*RTT product), it can take quite a long time until TCP
slow-start has increased the window so that the maximum
achievable rate is reached. Since the window increase happens
in units of MSS (~MTU), TCPs with larger packets reach this
point proportionally faster.

This is significant, but there are alternative proposals to
solve this issue of slow ramp-up, for example HighSpeed TCP [3].
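To put rough numbers on the ramp-up, here is a simplified model (no losses, no delayed ACKs; the initial window of two segments and the example path are assumptions). It counts round trips both for exponential slow start and for the linear one-MSS-per-RTT growth of congestion avoidance. As far as this model goes, the "proportionally faster" effect shows up in the linear phase, while in slow start proper the larger MSS saves only a few round trips.

```python
import math


def rtts_slow_start(target_bytes: int, mss: int,
                    initial_segments: int = 2) -> int:
    """Round trips for a window doubling each RTT to reach target_bytes."""
    cwnd = initial_segments * mss
    rtts = 0
    while cwnd < target_bytes:
        cwnd *= 2
        rtts += 1
    return rtts


def rtts_additive(start_bytes: int, target_bytes: int, mss: int) -> int:
    """Round trips to grow from start to target at one MSS per RTT."""
    return max(0, math.ceil((target_bytes - start_bytes) / mss))


if __name__ == "__main__":
    bdp = 12_500_000  # 1 Gb/s x 100 ms, in bytes (example LFN)
    for mss in (1460, 8960):
        print(mss,
              rtts_slow_start(bdp, mss),
              rtts_additive(bdp // 2, bdp, mss))
```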

b) You get a larger share of a congested link.

I think this is true when a TCP-with-large-packets shares a
congested link with TCPs-with-small-packets, and the packet loss
probability isn't proportional to the size of the packet. In
fact the large-packet connection can get a MUCH larger share
(sixfold for 9K vs. 1500) if the loss probability is the same
for everybody (which it often will be, approximately). Some
people consider this a fairness issue, others think it's a good
incentive for people to upgrade their MTUs.
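The sixfold figure falls out of the well-known Mathis et al. approximation for steady-state TCP throughput, rate ~ (MSS/RTT) * C/sqrt(p). A minimal sketch, assuming both flows see the same RTT and the same per-packet loss probability (the example values are arbitrary):

```python
import math


def mathis_rate(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Approximate steady-state TCP throughput in bytes per second.

    rate ~= (MSS / RTT) * C / sqrt(p), with C = sqrt(3/2).
    Assumes random loss with the same probability for every packet.
    """
    return (mss_bytes / rtt_s) * math.sqrt(1.5) / math.sqrt(loss)


if __name__ == "__main__":
    rtt, loss = 0.1, 1e-4          # example values only
    small = mathis_rate(1460, rtt, loss)
    large = mathis_rate(8960, rtt, loss)
    print(f"large/small throughput ratio: {large / small:.2f}")
```

With RTT and loss held equal, everything but the MSS cancels, so the ratio is simply 8960/1460, a bit over six.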

About the issues:

* Current Path MTU Discovery doesn't work reliably.

Path MTU Discovery as specified in RFC 1191/1981 relies on ICMP
messages to discover when a smaller MTU has to be used. When these
ICMP messages fail to arrive (or be sent), the sender will happily
continue to send too-large packets into the blackhole. This problem
is very real. As an experiment, try configuring an MTU < 1500 on a
backbone link which has Ethernet-connected customers behind it.
I bet that you'll receive LOUD complaints before long.

Some other people mention that Path MTU Discovery has been refined
with "blackhole detection" methods in some systems. This is widely
implemented, but not enabled by default (although it probably could
be with a "Service Pack").

Note that a new Path MTU Discovery proposal was just published as
RFC 4821 [4]. This is also supposed to solve the problem of relying
on ICMP messages.

Please, let's wait for these more robust PMTUD mechanisms to be
universally deployed before trying to increase the Internet MTU.

* IP assumes a consistent MTU within a logical subnet.

This seems to be a pretty fundamental assumption, and Iljitsch's
original mail suggests that we "fix" this. Umm, ok, I hope we don't
miss anything important that makes use of this assumption.

Seriously, I think it's illusory to try to change this for
general networks, in particular large LANs. It might work for
exchange points or other controlled cases where the set of protocols
is fairly well defined, but then exchange points have other options
such as separate "jumbo" VLANs.

For campus/datacenter networks, I agree that the consistent-MTU
requirement is a big problem for deploying larger MTUs. This is
true within my organization - most servers that could use larger
MTUs (NNTP servers for example) live on the same subnet with servers
that will never bother to be upgraded. The obvious solution is to
build smaller subnets - for our test servers I usually configure a
separate point-to-point subnet for each of their Ethernet interfaces
(I don't trust this bridging-magic anyway :-).

* Most edges will not upgrade anyway.

On the slow edges of the network (residual modem users, exotic
places, cellular data users etc.), people will NOT upgrade their MTU
to 9000 byte, because a single such packet would totally kill the
VoIP experience. For medium-fast networks, large MTUs don't cause
problems, but they don't help either. So only a few super-fast
edges have an incentive to do this at all.
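The VoIP point is a matter of serialization delay: while one big packet is being clocked onto a slow link, nothing else can go out. A quick sketch (the link speeds are example values, not figures from the thread):

```python
def serialization_delay_ms(packet_bytes: int, link_bps: float) -> float:
    """Time to clock one packet onto the wire, in milliseconds."""
    return packet_bytes * 8 / link_bps * 1000.0


if __name__ == "__main__":
    links = {"56k modem": 56e3, "1 Mb/s DSL": 1e6, "100 Mb/s": 100e6}
    for name, bps in links.items():
        small = serialization_delay_ms(1500, bps)
        jumbo = serialization_delay_ms(9000, bps)
        print(f"{name:12s} 1500B: {small:8.2f} ms   9000B: {jumbo:8.2f} ms")
```

At 1 Mb/s a single 9000-byte packet occupies the link for about 72 ms, and on a 56k modem for well over a second, which is far more jitter than a voice stream tolerates.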

For the core networks that support large MTUs (like we do), this is
frustrating because all our routers now probably carve their
internal buffers for 9000-byte packets that never arrive.
Maybe we're wasting lots of expensive linecard memory this way?

* Chicken/egg

As long as only a small minority of hosts supports >1500-byte MTUs,
there is no incentive for anyone important to start supporting them.
A public server supporting 9000-byte MTUs will be frustrated when it
tries to use them. The overhead (from attempted large packets that
don't make it) and potential trouble will just not be worth it.
This is a little similar to IPv6.

So I don't see large MTUs coming to the Internet at large soon. They
probably make sense in special cases, maybe for "land-speed records"
and dumb high-speed video equipment, or for server-to-server stuff
such as USENET news.

(And if anybody out there manages to access [2] or http://ndt.switch.ch/
with 9000-byte MTUs, I'd like to hear about it :-)
--
Simon.

[1] Here are a few tracepaths (more or less traceroute with integrated
PMTU discovery) from a host on our network in Switzerland.
9000-byte packets make it across our national backbone (SWITCH),
the European academic backbone (GEANT2), Abilene and CENIC in the
US, as well as through AARnet in Australia (even over IPv6). But
the link from the last wide-area backbone to the receiving site
inevitably has a 1500-byte MTU ("pmtu 1500").

: leinen [at] mamp[leinen]; tracepath www.caida.org
1: mamp1-eth2.switch.ch (130.59.35.78) 0.110ms pmtu 9000
1: swiMA1-G2-6.switch.ch (130.59.35.77) 1.029ms
2: swiMA2-G2-5.switch.ch (130.59.36.194) 1.141ms
3: swiEL2-10GE-1-4.switch.ch (130.59.37.77) 4.127ms
4: swiCE3-10GE-1-3.switch.ch (130.59.37.65) 4.726ms
5: swiCE2-10GE-1-4.switch.ch (130.59.36.209) 4.901ms
6: switch.rt1.gen.ch.geant2.net (62.40.124.21) asymm 7 4.429ms
7: so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22) asymm 8 12.551ms
8: abilene-wash-gw.rt1.fra.de.geant2.net (62.40.125.18) asymm 9 105.099ms
9: 64.57.28.12 (64.57.28.12) asymm 10 121.619ms
10: kscyng-iplsng.abilene.ucaid.edu (198.32.8.81) asymm 11 153.796ms
11: dnvrng-kscyng.abilene.ucaid.edu (198.32.8.13) asymm 12 158.520ms
12: snvang-dnvrng.abilene.ucaid.edu (198.32.8.1) asymm 13 180.784ms
13: losang-snvang.abilene.ucaid.edu (198.32.8.94) asymm 14 177.487ms
14: hpr-lax-gsr1--abilene-LA-10ge.cenic.net (137.164.25.2) asymm 20 179.106ms
15: riv-hpr--lax-hpr-10ge.cenic.net (137.164.25.5) asymm 21 185.183ms
16: hpr-sdsc-sdsc2--riv-hpr-ge.cenic.net (137.164.27.54) asymm 18 186.368ms
17: hpr-sdsc-sdsc2--riv-hpr-ge.cenic.net (137.164.27.54) asymm 18 185.861ms pmtu 1500
18: cider.caida.org (192.172.226.123) asymm 19 186.264ms reached
Resume: pmtu 1500 hops 18 back 19
: leinen [at] mamp[leinen]; tracepath www.aarnet.edu.au
1: mamp1-eth2.switch.ch (130.59.35.78) 0.095ms pmtu 9000
1: swiMA1-G2-6.switch.ch (130.59.35.77) 1.024ms
2: swiMA2-G2-5.switch.ch (130.59.36.194) 1.115ms
3: swiEL2-10GE-1-4.switch.ch (130.59.37.77) 3.989ms
4: swiCE3-10GE-1-3.switch.ch (130.59.37.65) 4.731ms
5: swiCE2-10GE-1-4.switch.ch (130.59.36.209) 4.771ms
6: switch.rt1.gen.ch.geant2.net (62.40.124.21) asymm 7 4.424ms
7: so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22) asymm 8 12.536ms
8: ge-3-3-0.bb1.a.fra.aarnet.net.au (202.158.204.249) asymm 9 13.207ms
9: so-0-1-0.bb1.a.sin.aarnet.net.au (202.158.194.145) asymm 10 217.846ms
10: so-3-3-0.bb1.a.per.aarnet.net.au (202.158.194.129) asymm 11 275.651ms
11: so-0-1-0.bb1.a.adl.aarnet.net.au (202.158.194.6) asymm 12 293.854ms
12: so-0-1-0.bb1.a.adl.aarnet.net.au (202.158.194.6) 297.989ms pmtu 1500
13: tiny-teddy.aarnet.edu.au (203.21.37.30) asymm 12 297.462ms reached
Resume: pmtu 1500 hops 13 back 12
: leinen [at] mamp[leinen]; tracepath6 www.aarnet.edu.au
1?: [LOCALHOST] pmtu 9000
1: swiMA1-G2-6.switch.ch 1.328ms
2: swiMA2-G2-5.switch.ch 1.703ms
3: swiEL2-10GE-1-4.switch.ch 4.529ms
4: swiCE3-10GE-1-3.switch.ch 5.278ms
5: swiCE2-10GE-1-4.switch.ch 5.493ms
6: switch.rt1.gen.ch.geant2.net asymm 7 5. 99ms
7: so-7-2-0.rt1.fra.de.geant2.net asymm 8 13.239ms
8: ge-3-3-0.bb1.a.fra.aarnet.net.au asymm 9 13.970ms
9: so-0-1-0.bb1.a.sin.aarnet.net.au asymm 10 218.718ms
10: so-3-3-0.bb1.a.per.aarnet.net.au asymm 11 267.225ms
11: so-0-1-0.bb1.a.adl.aarnet.net.au asymm 12 299. 78ms
12: so-0-1-0.bb1.a.adl.aarnet.net.au 298.473ms pmtu 1500
12: www.ipv6.aarnet.edu.au 292.893ms reached
Resume: pmtu 1500 hops 12 back 12

[2] PERT Knowledgebase article: http://kb.pert.geant2.net/PERTKB/JumboMTU

[3] RFC 3649, HighSpeed TCP for Large Congestion Windows, S. Floyd,
December 2003

[4] RFC 4821, Packetization Layer Path MTU Discovery. M. Mathis,
J. Heffner, March 2007


fred at cisco

Apr 13, 2007, 4:55 PM

Post #56 of 64 (5062 views)
Re: Thoughts on increasing MTUs on the internet [In reply to]

I agree with many of your thoughts. This is essentially the same
discussion we had upgrading from the 576 byte common MTU of the
ARPANET to the 1500 byte MTU of Ethernet-based networks. Larger MTUs
are a good thing, but are not a panacea. The biggest value in real
practice is IMHO that the end systems deal with a lower interrupt
rate when moving the same amount of data. That said, some who are
asking about larger MTUs are asking for values so large that CRC
schemes lose their value in error detection, and they find themselves
looking at higher layer FEC technologies to make up for the issue.
Given that there is an equipment cost related to larger MTUs, I
believe that there is such a thing as an MTU that is impractical.

1500 byte MTUs in fact work. I'm all for 9K MTUs, and would recommend
them. I don't see the point of 65K MTUs.



jgreco at ns

Apr 13, 2007, 10:18 PM

Post #57 of 64 (5052 views)
Re: Thoughts on increasing MTUs on the internet [In reply to]

> As long as only a small minority of hosts supports >1500-byte MTUs,
> there is no incentive for anyone important to start supporting them.
> A public server supporting 9000-byte MTUs will be frustrated when it
> tries to use them. The overhead (from attempted large packets that
> don't make it) and potential trouble will just not be worth it.
> This is a little similar to IPv6.
>
> So I don't see large MTUs coming to the Internet at large soon. They
> probably make sense in special cases, maybe for "land-speed records"
> and dumb high-speed video equipment, or for server-to-server stuff
> such as USENET news.

It is *certainly* helpful for USENET news.

So perhaps it is time to chuck the whole thing out and start over. There
seem to be enough projects out there (cleanslate.stanford.edu, etc) that
are looking at just that topic... maybe it is time for a new network
design with IPv6, flexible MTU's, etc.

The existing MTU 1500 situation made sense on ten megabit ethernet, of
course, and at the time, the overall design of the Internet, and the
capabilities of the underlying network hardware were such that it
wasn't that reasonable or practical to consider trying to make it
negotiable.

There is no valid technical reason for that situation with modern
hardware. The reasons people argue against larger MTU all appear to
have to do with hysterical raisins.

1500 was okay at 10 megabits. That could imply 15000 for 100 megabits,
and 150000 for 1 gigabit. There probably isn't a huge number of
applications for such large MTU's, and certainly universal support is
not likely to happen, but we have to realize that the speeds of networks
will continue to increase, and in five years we'll probably be running
terabit networks everywhere. I could picture 150K MTU's being useful
at those speeds.
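The scaling argument above keeps the time one full-size packet spends on the wire constant (1.2 ms for 1500 bytes at 10 Mb/s). A small sketch of that arithmetic, taking 1500 bytes at 10 Mb/s as the reference point:

```python
def mtu_for_same_wire_time(link_bps: float,
                           ref_mtu: int = 1500,
                           ref_bps: float = 10e6) -> int:
    """MTU that takes as long to transmit as ref_mtu does at ref_bps."""
    return round(ref_mtu * link_bps / ref_bps)


if __name__ == "__main__":
    for label, bps in (("100 Mb/s", 100e6), ("1 Gb/s", 1e9), ("10 Gb/s", 10e9)):
        print(f"{label:9s} -> {mtu_for_same_wire_time(bps)} bytes")
```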

The goal shouldn't really be to simply allow for some fixed higher MTU.
If any of these "redesign the Internet" programs succeed, we should be
very certain that MTU flexibility is a core feature.

... JG
--
Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
"We call it the 'one bite at the apple' rule. Give me one chance [and] then I
won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN)
With 24 million small businesses in the US alone, that's way too many apples.


dotis at mail-abuse

Apr 14, 2007, 10:22 AM

Post #58 of 64 (5041 views)
Re: Thoughts on increasing MTUs on the internet [In reply to]

On Apr 13, 2007, at 4:55 PM, Fred Baker wrote:

> The biggest value in real practice is IMHO that the end systems
> deal with a lower interrupt rate when moving the same amount of
> data. That said, some who are asking about larger MTUs are asking
> for values so large that CRC schemes lose their value in error
> detection, and they find themselves looking at higher layer FEC
> technologies to make up for the issue. Given that there is an
> equipment cost related to larger MTUs, I believe that there is such
> a thing as an MTU that is impractical.
>
> 1500 byte MTUs in fact work. I'm all for 9K MTUs, and would
> recommend them. I don't see the point of 65K MTUs.

Keep in mind that a 9KB MTU still reduces the Ethernet CRC
effectiveness by a fair amount. CRC32c, as adopted by SCTP and iSCSI,
has a larger Hamming distance, which restores the detection rates for
jumbo packets.
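
For readers who want to see the two checks side by side: the Ethernet FCS
and CRC32c are the same kind of 32-bit CRC and differ only in the generator
polynomial. A minimal bit-at-a-time sketch in Python follows. The function
name and structure are chosen for illustration; real implementations use
table lookups or hardware support. The sketch does not demonstrate the
Hamming distance difference itself, only that the polynomial is the one
thing that changes.

```python
import zlib

ETHERNET_POLY = 0xEDB88320    # IEEE 802.3 CRC-32, reflected form
CASTAGNOLI_POLY = 0x82F63B78  # CRC32c (used by SCTP and iSCSI), reflected form


def crc32_bitwise(data: bytes, poly: int) -> int:
    """Reflected 32-bit CRC with initial value and final XOR of 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (poly if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    msg = b"123456789"  # conventional CRC test string
    print(hex(crc32_bitwise(msg, ETHERNET_POLY)))
    print(hex(crc32_bitwise(msg, CASTAGNOLI_POLY)))
    # Cross-check the Ethernet polynomial against the standard library.
    print(crc32_bitwise(msg, ETHERNET_POLY) == zlib.crc32(msg))
```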

-Doug


iljitsch at muada

Apr 14, 2007, 1:10 PM

Post #59 of 64 (5059 views)
Re: Thoughts on increasing MTUs on the internet [In reply to]

On 14-apr-2007, at 19:22, Douglas Otis wrote:

>> 1500 byte MTUs in fact work. I'm all for 9K MTUs, and would
>> recommend them. I don't see the point of 65K MTUs.

> Keep in mind that a 9KB MTU still reduces the Ethernet CRC
> effectiveness by a fair amount.

In the article "Error Characteristics of FDDI" by Raj Jain (see
http://citeseer.ist.psu.edu/341988.html ) table VII says:

    Hamming Distance of FCS Polynomial

    Hamming    Max Frame Size
    Weight     (Octets)
      3        11454
      4          375
      5           37

Of course a 9000 byte packet has 6 times the number of bits in it,
so the chance of having a number of bit errors in the packet that
exceeds the Hamming distance is ~ 6 times greater.

I can't find bit error rate specs for various types of Ethernet real
quick, but if you assume 10^-9 that means that ~ 1 in 10000 11454
byte packets has one bit error, so around 1 in 10^12 has four bit
errors and has a _chance_ to defeat the CRC32. The naive assumption
that only 1 in 2^32 of those packets with 3 flipped bits will have a
valid CRC32 is probably incorrect, but the CRC should still catch
most of those packets for a fairly large value of "most".

For 1500 byte packets the fraction of packets with three bits flipped
would be around 1 : 10^15, correcting for the larger number of
packets per given amount of data, that's a difference of about 1 :
100. That seems like a lot, but getting better quality fiber easily
compensates for this. Expressed differently, the average amount of
data transmitted where you see one packet with three flipped bits is
around 10 petabytes for 11454 byte packets and some 1.3 exabytes for
1500 byte packets. For the large packets that would be one packet in
three years at 1 Gbps, for the small ones one packet in 380 years.
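
The back-of-envelope figures above can be reproduced with a few lines of
Python. This is only a sketch under the same assumptions as the text
(independent bit errors at a bit error rate of 10^-9, which is a guess and
not a spec value); the helper names are invented for the example. The rough
estimate simply cubes the per-packet error probability, while the Poisson
function adds the 1/k! term that the rough estimate leaves out, so the two
differ by a small constant factor.

```python
import math

BER = 1e-9  # assumed bit error rate; not taken from any Ethernet spec


def expected_bit_errors(packet_bytes: int, ber: float = BER) -> float:
    """Mean number of flipped bits per packet, assuming independent errors."""
    return packet_bytes * 8 * ber


def rough_p_k_errors(packet_bytes: int, k: int, ber: float = BER) -> float:
    """Back-of-envelope estimate: (probability of one error) ** k."""
    return expected_bit_errors(packet_bytes, ber) ** k


def poisson_p_k_errors(packet_bytes: int, k: int, ber: float = BER) -> float:
    """Poisson probability of exactly k flipped bits in one packet."""
    lam = expected_bit_errors(packet_bytes, ber)
    return math.exp(-lam) * lam ** k / math.factorial(k)


if __name__ == "__main__":
    seconds_per_year = 365 * 24 * 3600
    for size in (1500, 9000, 11454):
        rough = rough_p_k_errors(size, 3)
        # Average amount of data sent per packet that has 3 flipped bits.
        bytes_between = size / rough
        years_at_1g = bytes_between * 8 / 1e9 / seconds_per_year
        print(f"{size:>6} bytes: rough P(3 errors) = {rough:.2e}, "
              f"Poisson = {poisson_p_k_errors(size, 3):.2e}, "
              f"about {bytes_between:.2e} bytes or {years_at_1g:.0f} years "
              f"at 1 Gbit/s between such packets")
```

The output should land in the same order of magnitude as the figures quoted
above rather than on the exact values, since those were rounded along the
way.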


nonobvious at gmail

Apr 14, 2007, 4:13 PM

Post #60 of 64 (5044 views)
Re: Thoughts on increasing MTUs on the internet [In reply to]

One of my customers comments that he doesn't care about jumbograms of
9K or 4K - what he really wants is to be sure the networks support
MTUs of at least 1600-1700 bytes, so that various combinations of
IPSEC, UDP-padding, PPPoE, etc. don't break the real 1500-byte packets
underneath.


randy at psg

Apr 14, 2007, 4:35 PM

Post #61 of 64 (5035 views)
Re: Thoughts on increasing MTUs on the internet [In reply to]

> One of my customers comments that he doesn't care about jumbograms of
> 9K or 4K - what he really wants is to be sure the networks support
> MTUs of at least 1600-1700 bytes, so that various combinations of
> IPSEC, UDP-padding, PPPoE, etc. don't break the real 1500-byte packets
> underneath.

nice to have smart customers!


stephen at sprunk

Apr 14, 2007, 6:13 PM

Post #62 of 64 (5052 views)
Re: Thoughts on increasing MTUs on the internet [In reply to]

Thus spake "Bill Stewart" <nonobvious [at] gmail>
> One of my customers comments that he doesn't care about
> jumbograms of 9K or 4K - what he really wants is to be sure the
> networks support MTUs of at least 1600-1700 bytes, so that
> various combinations of IPSEC, UDP-padding, PPPoE, etc.
> don't break the real 1500-byte packets underneath.

This is a more realistic case, and support for "baby jumbos" of 2kB to 3kB
is almost universal even on mid-range networking gear. However, the
problems of getting it deployed are mostly the same, except one can take the
end nodes out of the picture in the simplest case.
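
To make the "baby jumbo" arithmetic concrete, here is a small Python sketch
that adds up per-layer encapsulation overhead and reports the link MTU
needed to carry a 1500-byte inner packet unfragmented. The PPPoE figure
(6-byte PPPoE header plus 2-byte PPP protocol field) is standard; the IPsec
and UDP-encapsulation figures are illustrative assumptions only, since real
ESP overhead depends on cipher, padding and mode. All names are
hypothetical.

```python
from typing import Iterable

# Per-layer overhead in bytes. Only the PPPoE and UDP header sizes are fixed
# by the protocols; the ESP value is an assumed, illustrative number.
OVERHEAD_BYTES = {
    "pppoe": 8,              # 6-byte PPPoE header + 2-byte PPP protocol ID
    "udp_encap": 8,          # UDP header, assuming UDP encapsulation is in use
    "ipsec_esp_tunnel": 73,  # assumed: 20 outer IP + 8 ESP + 16 IV
                             #          + up to 17 padding/trailer + 12 ICV
}


def required_link_mtu(inner_mtu: int, layers: Iterable[str]) -> int:
    """Link MTU needed so an inner packet of inner_mtu bytes is not fragmented."""
    return inner_mtu + sum(OVERHEAD_BYTES[name] for name in layers)


if __name__ == "__main__":
    stack = ["ipsec_esp_tunnel", "udp_encap", "pppoe"]
    print(required_link_mtu(1500, stack))
```

With these assumed numbers the total stays under 1600 bytes, which is
consistent with the 1600-1700 byte headroom asked for above.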

OTOH, if we had a viable solution to the variable-MTU mess in the first
place, you could just upgrade every network to the largest MTU possible and
hosts would figure out what the PMTU was and nobody would be sending
1500-byte packets; they'd be either something like 1400 bytes or 9000 bytes,
depending on whether the path included segments that hadn't been upgraded
yet...
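
The idea in the paragraph above reduces to "the usable packet size is the
smallest MTU along the path". A toy Python sketch, with made-up hop MTUs and
a hypothetical function name, shows the two outcomes described. It says
nothing about how a host actually learns these values, which is the hard
part.

```python
from typing import Iterable


def path_mtu(hop_mtus: Iterable[int]) -> int:
    """Largest packet that crosses every hop unfragmented: the minimum hop MTU."""
    mtus = list(hop_mtus)
    if not mtus:
        raise ValueError("path must contain at least one hop")
    return min(mtus)


if __name__ == "__main__":
    upgraded_path = [9000, 9000, 9000]     # every segment upgraded (assumed)
    mixed_path = [9000, 1500, 1400, 9000]  # one tunnelled, non-upgraded segment
    print(path_mtu(upgraded_path))
    print(path_mtu(mixed_path))
```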

S

Stephen Sprunk "Those people who think they know everything
CCIE #3723 are a great annoyance to those of us who do."
K5SSS --Isaac Asimov


jmaimon at ttec

Apr 14, 2007, 7:06 PM

Post #63 of 64 (5045 views)
Re: Thoughts on increasing MTUs on the internet [In reply to]

Simon Leinen wrote:


>
> * Current Path MTU Discovery doesn't work reliably.
>
> Please, let's wait for these more robust PMTUD mechanisms to be
> universally deployed before trying to increase the Internet MTU.

I think this is the proper summary of where we are at: Trying to restore
one of the original design goals of ipv4 -- reliable internetworking of
different MTU sized networks.

But the waiting game doesn't work; act local and think global.

>
> * IP assumes a consistent MTU within a logical subnet.
>
> This seems to be a pretty fundamental assumption, and Iljitsch's
> original mail suggests that we "fix" this.

This is an implementation detail, since local IP nodes have no
conception of remote IP nodes' subnet details.


dotis at mail-abuse

Apr 14, 2007, 8:46 PM

Post #64 of 64 (5040 views)
Re: Thoughts on increasing MTUs on the internet [In reply to]

On Apr 14, 2007, at 1:10 PM, Iljitsch van Beijnum wrote:
> On 14-apr-2007, at 19:22, Douglas Otis wrote:
>>>
>>> 1500 byte MTUs in fact work. I'm all for 9K MTUs, and would
>>> recommend them. I don't see the point of 65K MTUs.
>>
>> Keep in mind that a 9KB MTU still reduces the Ethernet CRC
>> effectiveness by a fair amount.
>
> I can't find bit error rate specs for various types of Ethernet
> real quick, but if you assume 10^-9 that means that ~ 1 in 10000
> 11454 byte packets has one bit error, so around 1 in 10^12 has four
> bit errors and has a _chance_ to defeat the CRC32. The naive
> assumption that only 1 in 2^32 of those packets with 3 flipped bits
> will have a valid CRC32 is probably incorrect, but the CRC should
> still catch most of those packets for a fairly large value of "most".

http://www.ietf.org/rfc/rfc3385.txt
http://citeseer.ist.psu.edu/koopman02bit.html


> For 1500 byte packets the fraction of packets with three bits
> flipped would be around 1 : 10^15, correcting for the larger number
> of packets per given amount of data, that's a difference of about
> 1 : 100.
>

Quoting from "When The CRC and TCP Checksum Disagree" by Jonathan
Stone and Craig Partridge:

http://citeseer.ist.psu.edu/cache/papers/cs/21401/http:zSzzSzsigcomm.it.uu.sezSzconfzSzpaperzSzsigcomm2000-9-1.pdf/stone00when.pdf

"Traces of Internet packets from the past two years show that between
1 packet in 1,100 and 1 packet in 32,000 fails the TCP checksum, even
on links where link-level CRCs should catch all but 1 in 4 billion
errors. For certain situations, the rate of checksum failures can be
even higher: in one hour-long test we observed a checksum failure of
1 packet in 400. We investigate why so many errors are observed,
when link-level CRCs should catch nearly all of them.

We have collected nearly 500,000 packets which failed the TCP or UDP
or IP checksum. This dataset shows the Internet has a wide variety of
error sources which can not be detected by link-level checks. We
describe analysis tools that have identified nearly 100 different
error patterns. Categorizing packet errors, we can infer likely
causes which explain roughly half the observed errors. The causes
span the entire spectrum of a network stack, from memory errors to
bugs in TCP.

After an analysis we conclude that the checksum will fail to detect
errors for roughly 1 in 16 million to 10 billion packets. From our
analysis of the cause of errors, we propose simple changes to several
protocols which will decrease the rate of undetected error. Even so,
the highly non-random distribution of errors strongly suggests some
applications should employ application-level checksums or equivalents."

Hardware weaknesses within DSLAMs or various memory arrays, such as a
weak driver on some internal interface, can generate high levels of
multi-bit errors not detected by TCP checksums. When the errors affect
the same bit within an interface, more than 1 out of 100 may go
undetected.
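
One way to see why same-bit errors can slip past the TCP checksum: the
checksum is a 16-bit ones' complement sum, so a 0-to-1 flip in one 16-bit
word and a 1-to-0 flip at the same bit position in another word cancel out.
The Python sketch below follows the RFC 1071 algorithm; the sample payload
and the choice of flipped bits are invented for the demonstration.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    good = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC])
    bad = bytearray(good)
    bad[1] ^= 0x08  # word 0, bit 3: 0x34 -> 0x3C (0 flipped to 1)
    bad[3] ^= 0x08  # word 1, bit 3: 0x78 -> 0x70 (1 flipped to 0)
    print(hex(internet_checksum(good)), hex(internet_checksum(bytes(bad))))
    print(internet_checksum(good) == internet_checksum(bytes(bad)))
```

Both checksums come out equal even though two bits differ, while flipping
only one of the two bits does change the result.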


> That seems like a lot, but getting better quality fiber easily
> compensates for this. Expressed differently, the average amount of
> data transmitted where you see one packet with three flipped bits
> is around 10 petabytes for 11454 byte packets and some 1.3 exabytes
> for 1500 byte packets. For the large packets that would be one
> packet in three years at 1 Gbps, for the small ones one packet in
> 380 years.

Consider that the CRC is not always carried with the packet between
interfaces.

-Doug
