Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Linux Virtual Server: Users

[lvs-users] lvs tun and ipip fragments

 

 

Linux Virtual Server users RSS feed   Index | Next | Previous | View Threaded


kgc at corp

May 9, 2012, 5:36 PM

Post #1 of 8 (735 views)
Permalink
[lvs-users] lvs tun and ipip fragments

Issues with LVS-Tun, PMTUD and MSS fixup seem to come up periodically.
We want to use LVS-Tun but do not want to end up in a situation where
we're relying on functional PMTUD or selective MSS fixup on the real
servers. The main issue being that if either of these fail to function
a client will end up in a situation where their sessions may hang in
such a way nothing short of a co-op testing with tpcdumps would reveal
the cause of the problem.

It seems that a more obvious solution to this is to allow the kernel to
frag the IPIP as needed by clearing the DF bit on the packet and
skipping the MTU exceeded check. This is a technical violation of RFC
2003 but under some circumstances it is advantageous to just let it
fragment. Any additional overhead of handling the frags is relatively
insignificant and we end up able to handle ~100mbits of traffic inbound
per real server before there is likely to be a collision in
fragmentation reassembly, and even then, only if packets arrive at the
real server out of order.

The patch to hack this into the existing code is only two lines long and
appears to work correctly in limited testing. A sysctl variable to
control the behavior would be easy enough.

Thoughts?

--- /root/rpmbuild/SOURCES/linux-2.6.32-220.13.1.el6/net/netfilter/ipvs/ip_vs_xmit.c 2009-12-02 19:51:21.000000000 -0800
+++ net/netfilter/ipvs/ip_vs_xmit.c 2012-05-09 17:24:05.180140929 -0700
@@ -559,6 +559,9 @@
if (skb_dst(skb))
skb_dst(skb)->ops->update_pmtu(skb_dst(skb), mtu);

+ //clear the DF bit so the kernel will frag the packet
+ old_iph->frag_off = 0;
+
df |= (old_iph->frag_off & htons(IP_DF));

if ((old_iph->frag_off & htons(IP_DF))
@@ -608,7 +611,7 @@
iph = ip_hdr(skb);
iph->version = 4;
iph->ihl = sizeof(struct iphdr)>>2;
- iph->frag_off = df;
+ iph->frag_off = 0;
iph->protocol = IPPROTO_IPIP;
iph->tos = tos;
iph->daddr = rt->rt_dst;

--
Kelsey Cummings - kgc [at] corp sonic.net, inc.
System Architect 2260 Apollo Way
707.522.1000 Santa Rosa, CA 95407

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users [at] LinuxVirtualServer
Send requests to lvs-users-request [at] LinuxVirtualServer
or go to http://lists.graemef.net/mailman/listinfo/lvs-users


kgc at corp

May 15, 2012, 11:38 AM

Post #2 of 8 (673 views)
Permalink
Re: [lvs-users] lvs tun and ipip fragments [In reply to]

On Wed, May 09, 2012 at 05:36:24PM -0700, Kelsey Cummings wrote:
> Thoughts?

Anyone? Am I worrying too much about potential problems with MSS fixup
and PMTPD not working correctly and/or being prone to failure through
operator error (which is what I'm actually the most concerned about.)
Or maybe there are issues with fragging the tunneled packets that I'm
not aware of?

Better to take this to the kernel list?

--
Kelsey Cummings - kgc [at] corp sonic.net, inc.
System Architect 2260 Apollo Way
707.522.1000 Santa Rosa, CA 95407

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users [at] LinuxVirtualServer
Send requests to lvs-users-request [at] LinuxVirtualServer
or go to http://lists.graemef.net/mailman/listinfo/lvs-users


ja at ssi

May 17, 2012, 3:14 PM

Post #3 of 8 (656 views)
Permalink
Re: [lvs-users] lvs tun and ipip fragments [In reply to]

Hello,

On Wed, 9 May 2012, Kelsey Cummings wrote:

> Issues with LVS-Tun, PMTUD and MSS fixup seem to come up periodically.
> We want to use LVS-Tun but do not want to end up in a situation where
> we're relying on functional PMTUD or selective MSS fixup on the real
> servers. The main issue being that if either of these fail to function
> a client will end up in a situation where their sessions may hang in
> such a way nothing short of a co-op testing with tpcdumps would reveal
> the cause of the problem.
>
> It seems that a more obvious solution to this is to allow the kernel to
> frag the IPIP as needed by clearing the DF bit on the packet and
> skipping the MTU exceeded check. This is a technical violation of RFC
> 2003 but under some circumstances it is advantageous to just let it
> fragment. Any additional overhead of handling the frags is relatively
> insignificant and we end up able to handle ~100mbits of traffic inbound
> per real server before there is likely to be a collision in
> fragmentation reassembly, and even then, only if packets arrive at the
> real server out of order.
>
> The patch to hack this into the existing code is only two lines long and
> appears to work correctly in limited testing. A sysctl variable to
> control the behavior would be easy enough.

ipip_tunnel_xmit() uses such logic to disable
the PMTUD with nopmtudisc flag in the ip tool. May be we
can do the same, the question is how to provide this flag,
may be with IPVS_DEST_ATTR_NO_PMTUD attribute in
include/linux/ip_vs.h

May be it should come as some new field in
struct ip_vs_dest, eg. unsigned int mode_flags; and
#define IP_VS_DEST_MODE_NO_PMTUD 0x0001

Then we can check dest->mode_flags & IP_VS_DEST_MODE_NO_PMTUD
if we have to clear DF.

This should be coordinated with Simon for
corresponding ipvsadm changes for the IPVS_DEST_ATTR_NO_PMTUD
attribute.

> Thoughts?
>
> --- /root/rpmbuild/SOURCES/linux-2.6.32-220.13.1.el6/net/netfilter/ipvs/ip_vs_xmit.c 2009-12-02 19:51:21.000000000 -0800
> +++ net/netfilter/ipvs/ip_vs_xmit.c 2012-05-09 17:24:05.180140929 -0700
> @@ -559,6 +559,9 @@
> if (skb_dst(skb))
> skb_dst(skb)->ops->update_pmtu(skb_dst(skb), mtu);
>
> + //clear the DF bit so the kernel will frag the packet
> + old_iph->frag_off = 0;
> +
> df |= (old_iph->frag_off & htons(IP_DF));
>
> if ((old_iph->frag_off & htons(IP_DF))
> @@ -608,7 +611,7 @@
> iph = ip_hdr(skb);
> iph->version = 4;
> iph->ihl = sizeof(struct iphdr)>>2;
> - iph->frag_off = df;
> + iph->frag_off = 0;
> iph->protocol = IPPROTO_IPIP;
> iph->tos = tos;
> iph->daddr = rt->rt_dst;
>
> --
> Kelsey Cummings - kgc [at] corp sonic.net, inc.
> System Architect 2260 Apollo Way
> 707.522.1000 Santa Rosa, CA 95407

Regards

--
Julian Anastasov <ja [at] ssi>

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users [at] LinuxVirtualServer
Send requests to lvs-users-request [at] LinuxVirtualServer
or go to http://lists.graemef.net/mailman/listinfo/lvs-users


kgc at corp

May 21, 2012, 5:24 PM

Post #4 of 8 (646 views)
Permalink
Re: [lvs-users] lvs tun and ipip fragments [In reply to]

On Fri, May 18, 2012 at 01:14:35AM +0300, Julian Anastasov wrote:
> ipip_tunnel_xmit() uses such logic to disable
> the PMTUD with nopmtudisc flag in the ip tool. May be we

Exactly!

> can do the same, the question is how to provide this flag,
> may be with IPVS_DEST_ATTR_NO_PMTUD attribute in
> include/linux/ip_vs.h
>
> May be it should come as some new field in
> struct ip_vs_dest, eg. unsigned int mode_flags; and
> #define IP_VS_DEST_MODE_NO_PMTUD 0x0001

The other way would be do define a new packet forwarding method like
'ipip with frags' perhaps with -I to ipvsadm. I think it really
comes down to what you guys think is most consistent with the reset
of the system and configuration options and how easy it is to add.

--
Kelsey Cummings - kgc [at] corp sonic.net, inc.
System Architect 2260 Apollo Way
707.522.1000 Santa Rosa, CA 95407

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users [at] LinuxVirtualServer
Send requests to lvs-users-request [at] LinuxVirtualServer
or go to http://lists.graemef.net/mailman/listinfo/lvs-users


kgc at corp

Jun 27, 2012, 1:08 PM

Post #5 of 8 (558 views)
Permalink
Re: [lvs-users] lvs tun and ipip fragments [In reply to]

On Mon, May 21, 2012 at 05:24:33PM -0700, Kelsey Cummings wrote:
> On Fri, May 18, 2012 at 01:14:35AM +0300, Julian Anastasov wrote:
> > ipip_tunnel_xmit() uses such logic to disable
> > the PMTUD with nopmtudisc flag in the ip tool. May be we
>
> Exactly!
>
> > can do the same, the question is how to provide this flag,
> > may be with IPVS_DEST_ATTR_NO_PMTUD attribute in
> > include/linux/ip_vs.h
> >
> > May be it should come as some new field in
> > struct ip_vs_dest, eg. unsigned int mode_flags; and
> > #define IP_VS_DEST_MODE_NO_PMTUD 0x0001
>
> The other way would be do define a new packet forwarding method like
> 'ipip with frags' perhaps with -I to ipvsadm. I think it really
> comes down to what you guys think is most consistent with the reset
> of the system and configuration options and how easy it is to add.

Any thought's on this Julian? I think this would be a good enhancement
for LVS.

--
Kelsey Cummings - kgc [at] corp sonic.net, inc.
System Architect 2260 Apollo Way
707.522.1000 Santa Rosa, CA 95407

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users [at] LinuxVirtualServer
Send requests to lvs-users-request [at] LinuxVirtualServer
or go to http://lists.graemef.net/mailman/listinfo/lvs-users


ja at ssi

Jun 27, 2012, 3:19 PM

Post #6 of 8 (556 views)
Permalink
Re: [lvs-users] lvs tun and ipip fragments [In reply to]

Hello,

On Wed, 27 Jun 2012, Kelsey Cummings wrote:

> On Mon, May 21, 2012 at 05:24:33PM -0700, Kelsey Cummings wrote:
> > On Fri, May 18, 2012 at 01:14:35AM +0300, Julian Anastasov wrote:
> > > ipip_tunnel_xmit() uses such logic to disable
> > > the PMTUD with nopmtudisc flag in the ip tool. May be we
> >
> > Exactly!
> >
> > > can do the same, the question is how to provide this flag,
> > > may be with IPVS_DEST_ATTR_NO_PMTUD attribute in
> > > include/linux/ip_vs.h
> > >
> > > May be it should come as some new field in
> > > struct ip_vs_dest, eg. unsigned int mode_flags; and
> > > #define IP_VS_DEST_MODE_NO_PMTUD 0x0001
> >
> > The other way would be do define a new packet forwarding method like
> > 'ipip with frags' perhaps with -I to ipvsadm. I think it really
> > comes down to what you guys think is most consistent with the reset
> > of the system and configuration options and how easy it is to add.
>
> Any thought's on this Julian? I think this would be a good enhancement
> for LVS.

I prefer the solution with dest mode flag. The
forwarding method is restricted to 3 bits (IP_VS_CONN_F_FWD_MASK)
and I don't want to allocate new value there. IIRC, these bits
are already used in packet capturing programs, so better
not to add many changes in the format used for sync.

May be I'll not have enough time in the following
2-3 weeks, so if someone wants to code this idea, please
let me know.

> --
> Kelsey Cummings - kgc [at] corp sonic.net, inc.
> System Architect 2260 Apollo Way
> 707.522.1000 Santa Rosa, CA 95407

Regards

--
Julian Anastasov <ja [at] ssi>

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users [at] LinuxVirtualServer
Send requests to lvs-users-request [at] LinuxVirtualServer
or go to http://lists.graemef.net/mailman/listinfo/lvs-users


ja at ssi

Jul 9, 2012, 12:49 AM

Post #7 of 8 (529 views)
Permalink
Re: [lvs-users] lvs tun and ipip fragments [In reply to]

Hello,

On Wed, 9 May 2012, Kelsey Cummings wrote:

> Issues with LVS-Tun, PMTUD and MSS fixup seem to come up periodically.
> We want to use LVS-Tun but do not want to end up in a situation where
> we're relying on functional PMTUD or selective MSS fixup on the real
> servers. The main issue being that if either of these fail to function
> a client will end up in a situation where their sessions may hang in
> such a way nothing short of a co-op testing with tpcdumps would reveal
> the cause of the problem.
>
> It seems that a more obvious solution to this is to allow the kernel to
> frag the IPIP as needed by clearing the DF bit on the packet and
> skipping the MTU exceeded check. This is a technical violation of RFC
> 2003 but under some circumstances it is advantageous to just let it
> fragment. Any additional overhead of handling the frags is relatively
> insignificant and we end up able to handle ~100mbits of traffic inbound
> per real server before there is likely to be a collision in
> fragmentation reassembly, and even then, only if packets arrive at the
> real server out of order.
>
> The patch to hack this into the existing code is only two lines long and
> appears to work correctly in limited testing. A sysctl variable to
> control the behavior would be easy enough.
>
> Thoughts?

I read RFC 2003 and the first result of this is the
appended patch. I will keep it for a while also here:

http://www.ssi.bg/~ja/tmp/0001-ipvs-implement-passive-PMTUD-for-IPIP-packets.txt

I'll submit it for inclusion after comments.
I think, it is a shame to add option to disable PMTUD
without first implementing it for IPIP packets.

> --- /root/rpmbuild/SOURCES/linux-2.6.32-220.13.1.el6/net/netfilter/ipvs/ip_vs_xmit.c 2009-12-02 19:51:21.000000000 -0800
> +++ net/netfilter/ipvs/ip_vs_xmit.c 2012-05-09 17:24:05.180140929 -0700
> @@ -559,6 +559,9 @@
> if (skb_dst(skb))
> skb_dst(skb)->ops->update_pmtu(skb_dst(skb), mtu);
>
> + //clear the DF bit so the kernel will frag the packet
> + old_iph->frag_off = 0;
> +

Can you identify which of your both changes helps in
your case. I guess the above change does not play at all,
it clears DF and damages MF and fragment offset causing
packets fragmented between client and IPVS box to be
damaged after the IPVS box. May be it works only
because IPVS is at LOCAL_IN hook and ip_defrag is
called in ip_local_deliver. As result, we do not see
any MF or fragment offset > 0, we are after defragmentation.

> df |= (old_iph->frag_off & htons(IP_DF));
>
> if ((old_iph->frag_off & htons(IP_DF))
> @@ -608,7 +611,7 @@
> iph = ip_hdr(skb);
> iph->version = 4;
> iph->ihl = sizeof(struct iphdr)>>2;
> - iph->frag_off = df;
> + iph->frag_off = 0;

First, I decided to implement the config option
mentioned in previous emails but later found that it is
a good idea to give chance for PMTUD to work.

Can you clarify details for your setup, do you
have lower MTU in the path to your real server?
Perhaps,

IPVS_BOX# tracepath -n RIP

can show places with MTU below 1480 (1520-20)?

I know that problem can be also in the ICMPs from
IPVS box to clients. Lets say it in this way, if leg 1 is
the path from client to IPVS box and leg 2 is the
path from IPVS box to real server, is MTU further
reduced in leg 2? Here are the variants I see:

1. there is no lower MTU in leg 2, i.e. IPVS box has
1500 and we send FRAG_NEEDED for 1480. In this case
only some ICMP filtering can cause client's PMTUD
to fail.

2. there is MTU in leg 2 that is lower, i.e. below 1500.
In this case, without the appended patch we can not
update our PMTU in routing if other traffic does not
do it. And only with the patch we are able to forward
the new PMTU to client.

I understand that for case 1 it is useful to
have option that disables PMTUD for the IPVS tunnel.
But sending 2 packets for every received packet is bad
idea if we disable PMTUD by default, without any config
option.

I'll feel more comfortable if we know that my
patch has effect for your setup. Let me know if it is
difficult/dangerous for you to do such tests. If you need
my patch for some old kernel, let me know, the patch is for
net-next from last days, so it can contain some new stuff.
May be such tests are not needed if you know that the
MTU on leg 2 is 1500 and is not reduced.

As a next step may be we can add global sysctl var
to force Disable for IPVS-TUN PMTUD (your second change),
it will take effect for all real servers. It will be
needed if problem is caused by ICMP filtering in
leg 1 (above case 1). Not sure if there is some netfilter
mangling feature that can clear the outer DF for our
IPIP packets in OUTPUT hook.

> iph->protocol = IPPROTO_IPIP;
> iph->tos = tos;
> iph->daddr = rt->rt_dst;
>
> --
> Kelsey Cummings - kgc [at] corp sonic.net, inc.
> System Architect 2260 Apollo Way
> 707.522.1000 Santa Rosa, CA 95407

>From baff2600afca847e6e3b81a033050ffdb28c3fe7 Mon Sep 17 00:00:00 2001
From: Julian Anastasov <ja [at] ssi>
Date: Mon, 9 Jul 2012 02:26:09 +0300
Subject: [PATCH] ipvs: implement passive PMTUD for IPIP packets

IPVS is missing the logic to update PMTU in routing
for its IPIP packets. We monitor the dst_mtu and can return
FRAG_NEEDED messages but if the tunneled packets get ICMP
error we can not rely on other traffic to save the lowest
MTU.

The following patch adds ICMP handling for IPIP
packets in incoming direction, from some remote host to
our local IP used as saddr in the outer header. By this
way we can forward any related ICMP traffic if it is for IPVS
TUN connection. For the special case of PMTUD we update the
routing and if client requested DF we can forward the
error.

To properly update the routing we have to bind
the cached route (dest->dst_cache) to the selected saddr
because ipv4_update_pmtu uses saddr for dst lookup.
Add IP_VS_RT_MODE_CONNECT flag to force such binding with
second route.

Update ip_vs_tunnel_xmit to provide IP_VS_RT_MODE_CONNECT
and change the code to copy DF. For now we prefer not to
force PMTU discovery (outer DF=1) because we don't have
configuration option to enable or disable PMTUD. As we
do not keep any packets to resend, we prefer not to
play games with packets without DF bit because the sender
is not informed when they are rejected.

Also, change ops->update_pmtu to be called only
for local clients because there is no point to update
MTU for input routes, in our case skb->dst->dev is lo.
It seems the code is copied from ipip.c where the skb
dst points to tunnel device.

Signed-off-by: Julian Anastasov <ja [at] ssi>
---
net/netfilter/ipvs/ip_vs_core.c | 74 +++++++++++++++++++++++++++++++++++++--
net/netfilter/ipvs/ip_vs_xmit.c | 36 +++++++++++++++++--
2 files changed, 103 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index b54ecce..6ee17c2 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -1303,7 +1303,8 @@ ip_vs_in_icmp(struct sk_buff *skb, int *related, unsigned int hooknum)
struct ip_vs_conn *cp;
struct ip_vs_protocol *pp;
struct ip_vs_proto_data *pd;
- unsigned int offset, ihl, verdict;
+ unsigned int offset, offset2, ihl, verdict;
+ bool ipip;

*related = 1;

@@ -1345,6 +1346,21 @@ ip_vs_in_icmp(struct sk_buff *skb, int *related, unsigned int hooknum)

net = skb_net(skb);

+ /* Special case for errors for IPIP packets */
+ ipip = false;
+ if (cih->protocol == IPPROTO_IPIP) {
+ if (unlikely(cih->frag_off & htons(IP_OFFSET)))
+ return NF_ACCEPT;
+ /* Error for our IPIP must arrive at LOCAL_IN */
+ if (!(skb_rtable(skb)->rt_flags & RTCF_LOCAL))
+ return NF_ACCEPT;
+ offset += cih->ihl * 4;
+ cih = skb_header_pointer(skb, offset, sizeof(_ciph), &_ciph);
+ if (cih == NULL)
+ return NF_ACCEPT; /* The packet looks wrong, ignore */
+ ipip = true;
+ }
+
pd = ip_vs_proto_data_get(net, cih->protocol);
if (!pd)
return NF_ACCEPT;
@@ -1358,11 +1374,14 @@ ip_vs_in_icmp(struct sk_buff *skb, int *related, unsigned int hooknum)
IP_VS_DBG_PKT(11, AF_INET, pp, skb, offset,
"Checking incoming ICMP for");

+ offset2 = offset;
offset += cih->ihl * 4;

ip_vs_fill_iphdr(AF_INET, cih, &ciph);
- /* The embedded headers contain source and dest in reverse order */
- cp = pp->conn_in_get(AF_INET, skb, &ciph, offset, 1);
+ /* The embedded headers contain source and dest in reverse order.
+ * For IPIP this is error for request, not for reply.
+ */
+ cp = pp->conn_in_get(AF_INET, skb, &ciph, offset, ipip ? 0 : 1);
if (!cp)
return NF_ACCEPT;

@@ -1376,6 +1395,55 @@ ip_vs_in_icmp(struct sk_buff *skb, int *related, unsigned int hooknum)
goto out;
}

+ if (ipip) {
+ __be32 info = ic->un.gateway;
+
+ /* Update the MTU */
+ if (ic->type == ICMP_DEST_UNREACH &&
+ ic->code == ICMP_FRAG_NEEDED) {
+ struct ip_vs_dest *dest = cp->dest;
+ u32 mtu = ntohs(ic->un.frag.mtu);
+
+ /* Strip outer IP and ICMP, go to IPIP header */
+ __skb_pull(skb, ihl + sizeof(_icmph));
+ offset2 -= ihl + sizeof(_icmph);
+ skb_reset_network_header(skb);
+ IP_VS_DBG(12, "ICMP for IPIP %pI4->%pI4: mtu=%u\n",
+ &ip_hdr(skb)->saddr, &ip_hdr(skb)->daddr, mtu);
+ ipv4_update_pmtu(skb, dev_net(skb->dev),
+ mtu, 0, 0, 0, 0);
+ /* Client uses PMTUD? */
+ if (!(cih->frag_off & htons(IP_DF)))
+ goto ignore_ipip;
+ /* Prefer the resulting PMTU */
+ if (dest) {
+ spin_lock(&dest->dst_lock);
+ if (dest->dst_cache)
+ mtu = dst_mtu(dest->dst_cache);
+ spin_unlock(&dest->dst_lock);
+ }
+ if (mtu > 68 + sizeof(struct iphdr))
+ mtu -= sizeof(struct iphdr);
+ info = htonl(mtu);
+ }
+ /* Strip outer IP, ICMP and IPIP, go to IP header of
+ * original request.
+ */
+ __skb_pull(skb, offset2);
+ skb_reset_network_header(skb);
+ IP_VS_DBG(12, "Sending ICMP for %pI4->%pI4: t=%u, c=%u, i=%u\n",
+ &ip_hdr(skb)->saddr, &ip_hdr(skb)->daddr,
+ ic->type, ic->code, ntohl(info));
+ icmp_send(skb, ic->type, ic->code, info);
+ /* ICMP can be shorter but anyways, account it */
+ ip_vs_out_stats(cp, skb);
+
+ignore_ipip:
+ consume_skb(skb);
+ verdict = NF_STOLEN;
+ goto out;
+ }
+
/* do the statistics and put it back */
ip_vs_in_stats(cp, skb);
if (IPPROTO_TCP == cih->protocol || IPPROTO_UDP == cih->protocol)
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 71d6ecb..6f18395 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -49,6 +49,7 @@ enum {
IP_VS_RT_MODE_RDR = 4, /* Allow redirect from remote daddr to
* local
*/
+ IP_VS_RT_MODE_CONNECT = 8, /* Always bind route to saddr */
};

/*
@@ -99,16 +100,40 @@ __ip_vs_get_out_rt(struct sk_buff *skb, struct ip_vs_dest *dest,
if (!(rt = (struct rtable *)
__ip_vs_dst_check(dest, rtos))) {
struct flowi4 fl4;
+ int loop = 0;

memset(&fl4, 0, sizeof(fl4));
fl4.daddr = dest->addr.ip;
+ fl4.saddr = (rt_mode & IP_VS_RT_MODE_CONNECT) ?
+ dest->dst_saddr.ip : 0;
fl4.flowi4_tos = rtos;
+
+retry:
rt = ip_route_output_key(net, &fl4);
if (IS_ERR(rt)) {
+ /* Invalid saddr ? */
+ if (PTR_ERR(rt) == -EINVAL &&
+ dest->dst_saddr.ip &&
+ rt_mode & IP_VS_RT_MODE_CONNECT &&
+ !loop) {
+ dest->dst_saddr.ip = 0;
+ flowi4_update_output(&fl4, 0, rtos,
+ dest->addr.ip, 0);
+ goto retry;
+ }
spin_unlock(&dest->dst_lock);
IP_VS_DBG_RL("ip_route_output error, dest: %pI4\n",
&dest->addr.ip);
return NULL;
+ } else if (!dest->dst_saddr.ip &&
+ rt_mode & IP_VS_RT_MODE_CONNECT &&
+ fl4.saddr) {
+ ip_rt_put(rt);
+ dest->dst_saddr.ip = fl4.saddr;
+ flowi4_update_output(&fl4, 0, rtos,
+ dest->addr.ip, fl4.saddr);
+ loop++;
+ goto retry;
}
__ip_vs_dst_set(dest, rtos, dst_clone(&rt->dst), 0);
dest->dst_saddr.ip = fl4.saddr;
@@ -331,6 +356,7 @@ ip_vs_dst_reset(struct ip_vs_dest *dest)
old_dst = dest->dst_cache;
dest->dst_cache = NULL;
dst_release(old_dst);
+ dest->dst_saddr.ip = 0;
}

#define IP_VS_XMIT_TUNNEL(skb, cp) \
@@ -771,7 +797,7 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
struct net_device *tdev; /* Device to other host */
struct iphdr *old_iph = ip_hdr(skb);
u8 tos = old_iph->tos;
- __be16 df = old_iph->frag_off;
+ __be16 df;
struct iphdr *iph; /* Our new IP header */
unsigned int max_headroom; /* The extra header space needed */
int mtu;
@@ -781,7 +807,8 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,

if (!(rt = __ip_vs_get_out_rt(skb, cp->dest, cp->daddr.ip,
RT_TOS(tos), IP_VS_RT_MODE_LOCAL |
- IP_VS_RT_MODE_NON_LOCAL,
+ IP_VS_RT_MODE_NON_LOCAL |
+ IP_VS_RT_MODE_CONNECT,
&saddr)))
goto tx_error_icmp;
if (rt->rt_flags & RTCF_LOCAL) {
@@ -796,10 +823,11 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
IP_VS_DBG_RL("%s(): mtu less than 68\n", __func__);
goto tx_error_put;
}
- if (skb_dst(skb))
+ if (rt_is_output_route(skb_rtable(skb)))
skb_dst(skb)->ops->update_pmtu(skb_dst(skb), mtu);

- df |= (old_iph->frag_off & htons(IP_DF));
+ /* Copy DF, reset fragment offset and MF */
+ df = old_iph->frag_off & htons(IP_DF);

if ((old_iph->frag_off & htons(IP_DF) &&
mtu < ntohs(old_iph->tot_len) && !skb_is_gso(skb))) {
--
1.7.3.4


_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users [at] LinuxVirtualServer
Send requests to lvs-users-request [at] LinuxVirtualServer
or go to http://lists.graemef.net/mailman/listinfo/lvs-users


kgc at corp

Jul 16, 2012, 12:18 PM

Post #8 of 8 (491 views)
Permalink
Re: [lvs-users] lvs tun and ipip fragments [In reply to]

Julian, I haven't had time to test the patches yet but wanted to at
answer your questions.


On Mon, Jul 09, 2012 at 10:49:24AM +0300, Julian Anastasov wrote:
> > + //clear the DF bit so the kernel will frag the packet
> > + old_iph->frag_off = 0;
>
> Can you identify which of your both changes helps in
> your case. I guess the above change does not play at all,

I'm pretty sure the only reason I set 'old_iph->frag_off = 0' was to
cause the following "if" statement to evalutate as false in order to
prevent the icmp dest unreach/need frag packet from being transmitted.

> > - iph->frag_off = df;
> > + iph->frag_off = 0;

And that's what gets the kernel to frag it once the skb is sent.

> Can you clarify details for your setup, do you
> have lower MTU in the path to your real server?

No, everything is 1500 ethernet in our current use case. This goal is
to have a lvs-tun config which allows a flexible network design without
having to rely on selective MSS fixup on the RIPs or that the ICMP frag
needed packets will actually reach the client.

Let me see what I can do about testing your patches, although I a global
sysctl variable to is probably the easiest solution.

> As a next step may be we can add global sysctl var
> to force Disable for IPVS-TUN PMTUD (your second change),
> it will take effect for all real servers. It will be
> needed if problem is caused by ICMP filtering in
> leg 1 (above case 1). Not sure if there is some netfilter
> mangling feature that can clear the outer DF for our
> IPIP packets in OUTPUT hook.

I wasn't able to find any, as this was another possible solution to the
problem and could be generally useful in other circumstances as well.
Cisco supports this in combination with ipsec tunnels to allow the
router to frag the packet regardless of the original DF bit setting.

--
Kelsey Cummings - kgc [at] corp sonic.net, inc.
System Architect 2260 Apollo Way
707.522.1000 Santa Rosa, CA 95407

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users [at] LinuxVirtualServer
Send requests to lvs-users-request [at] LinuxVirtualServer
or go to http://lists.graemef.net/mailman/listinfo/lvs-users

Linux Virtual Server users RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.