
use iptables for proxying instead of userspace #3760

Closed
thockin opened this issue Jan 23, 2015 · 187 comments
Labels
priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
Comments

@thockin
Member

thockin commented Jan 23, 2015

I was playing with iptables yesterday, and I prototyped (well, copied from Google hits and mutated) a set of iptables rules that essentially do all the proxying for us without help from userspace. It's not urgent, but I want to file my notes before I lose them.

This has the additional nice side-effect (as far as I can tell) of preserving the source IP and being a large net simplification. Now kube-proxy would just need to sync Services -> iptables. This has the downside of not being compatible with older iptables and kernels. We had a problem with this before - at some point we need to decide just how far back we care to support.

This can probably be optimized further, but in basic testing, I see sticky sessions working and if I comment that part out I see ~equal probability of hitting each backend. I was not able to get deterministic round-robin working properly (with --nth instead of --probability) but we could come back to that if we want.

This sets up a service portal with the backends listed below

iptables -t nat -N TESTSVC
iptables -t nat -F TESTSVC
iptables -t nat -N TESTSVC_A
iptables -t nat -F TESTSVC_A
iptables -t nat -N TESTSVC_B
iptables -t nat -F TESTSVC_B
iptables -t nat -N TESTSVC_C
iptables -t nat -F TESTSVC_C
iptables -t nat -A TESTSVC -m recent --name hostA --rcheck --seconds 1 --reap -j TESTSVC_A
iptables -t nat -A TESTSVC -m recent --name hostB --rcheck --seconds 1 --reap -j TESTSVC_B
iptables -t nat -A TESTSVC -m recent --name hostC --rcheck --seconds 1 --reap -j TESTSVC_C
iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.333 -j TESTSVC_A
iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.500 -j TESTSVC_B
iptables -t nat -A TESTSVC -m statistic --mode random --probability 1.000 -j TESTSVC_C

iptables -t nat -A TESTSVC_A -m recent --name hostA --set -j DNAT -p tcp --to-destination 10.244.4.6:9376
iptables -t nat -A TESTSVC_B -m recent --name hostB --set -j DNAT -p tcp --to-destination 10.244.1.15:9376
iptables -t nat -A TESTSVC_C -m recent --name hostC --set -j DNAT -p tcp --to-destination 10.244.4.7:9376

iptables -t nat -F KUBE-PORTALS-HOST
iptables -t nat -A KUBE-PORTALS-HOST -d 10.0.0.93/32 -m state --state NEW -p tcp -m tcp --dport 80 -j TESTSVC
iptables -t nat -F KUBE-PORTALS-CONTAINER
iptables -t nat -A KUBE-PORTALS-CONTAINER -d 10.0.0.93/32 -m state --state NEW -p tcp -m tcp --dport 80 -j TESTSVC
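
A note on the --probability values above (the arithmetic isn't spelled out in the rules themselves): the statistic rules in TESTSVC are evaluated in order, so each probability applies only to traffic that fell through the rules before it. A minimal sketch of how that yields an even 1/3 split:

# Rules in TESTSVC are tried top to bottom; each --probability applies to
# whatever traffic the earlier rules did not already jump away.
#   TESTSVC_A: 0.333 of everything            ~= 1/3
#   TESTSVC_B: 0.500 of the remaining 2/3      = 1/3
#   TESTSVC_C: 1.000 of the remaining 1/3      = 1/3
# In general, with n backends the i-th rule uses probability 1/(n-i+1).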
@brendandburns
Contributor

Cool! I think we should definitely get this merged in. On a separate note, I was seeing the proxy eat ~30% of a core under heavy load; I have to believe that iptables will give us better performance than that.

@thockin
Member Author

thockin commented Jan 26, 2015

We have to prioritize this - it's almost a total rewrite of kube-proxy and
all the tests thereof. It also has back-compat problems (will not work on
older kernels or older iptables binaries).


@brendandburns
Contributor

Maybe implementing it as a parallel option and slowly migrating makes sense?


@thockin
Member Author

thockin commented Jan 27, 2015

I'm trying to coax someone else who doesn't know this code well to learn it and take it on. I really want to tackle it, but it would be better if someone else learned this space (not you! :)

That said, you also sent (good) email about the massive P1 list - and I don't think this is on that list yet.


@bgrant0607 bgrant0607 added sig/network Categorizes an issue or PR as relevant to SIG Network. kind/enhancement labels Jan 27, 2015
@goltermann goltermann added priority/backlog Higher priority than priority/awaiting-more-evidence. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Jan 28, 2015
@satnam6502
Contributor

Is this a P2? Might it be worth making it a P3 for now?

@thockin
Member Author

thockin commented Feb 11, 2015

I'm hoping to make it work, but we may yet demote it


@roberthbailey
Contributor

Doesn't "hope" equate to a P3 that we'll get to if we can?

@bgrant0607
Member

From discussion with @thockin: This is a requirement in order to support service port ranges, which aren't required for 1.0, but we would like to support eventually.

@bgrant0607 bgrant0607 added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Feb 23, 2015
@sidharta-s
Contributor

@thockin "This has the downside of not being compatible with older iptables and kernels." How 'new' would the kernel have to be?

@thockin
Member Author

thockin commented Feb 24, 2015

Not TOO new, but we have some users who REALLY want iptables from 2012 to
work.


@sidharta-s
Contributor

@thockin thanks. We are using/testing with RHEL/CentOS 6, for example - so it would be nice if we didn't have a hard dependency on recent 3.x kernels.

@pmorie
Member

pmorie commented Feb 24, 2015

@pweil- we were discussing this the other day

@thockin
Member Author

thockin commented Feb 24, 2015

Well, you do need Docker to run, and at some point we have to cut it off.
The back-rev iptables support will not stop me from (eventually) making
this change, and it's going to sting for some people.


@ArtfulCoder
Contributor

With @thockin's help, we tried the same with UDP.

We created a GCE Kubernetes cluster with 3 sky-dns replication controllers.
On the kubernetes-master, we set the following in iptables:
The dns service ip was 10.0.0.10, and the pod endpoints running dns were 10.244.0.5:53, 10.244.3.6:53, 10.244.0.6:53

iptables -t nat -N TESTSVC
iptables -t nat -F TESTSVC
iptables -t nat -N TESTSVC_A
iptables -t nat -F TESTSVC_A
iptables -t nat -N TESTSVC_B
iptables -t nat -F TESTSVC_B
iptables -t nat -N TESTSVC_C
iptables -t nat -F TESTSVC_C
iptables -t nat -N KUBE-PORTALS-HOST
iptables -t nat -F KUBE-PORTALS-HOST

iptables -t nat -A TESTSVC -m recent --name hostA --rcheck --seconds 1 --reap -j TESTSVC_A
iptables -t nat -A TESTSVC -m recent --name hostB --rcheck --seconds 1 --reap -j TESTSVC_B
iptables -t nat -A TESTSVC -m recent --name hostC --rcheck --seconds 1 --reap -j TESTSVC_C

iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.333 -j TESTSVC_A
iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.5 -j TESTSVC_B
iptables -t nat -A TESTSVC -m statistic --mode random --probability 1.000 -j TESTSVC_C

iptables -t nat -A TESTSVC_A -m recent --name hostA --set -j DNAT -p udp --to-destination 10.244.0.5:53
iptables -t nat -A TESTSVC_B -m recent --name hostB --set -j DNAT -p udp --to-destination 10.244.3.6:53
iptables -t nat -A TESTSVC_C -m recent --name hostC --set -j DNAT -p udp --to-destination 10.244.0.6:53
iptables -t nat -A KUBE-PORTALS-HOST -d 10.0.0.10/32 -p udp -m udp --dport 53 -j TESTSVC
iptables -t nat -A OUTPUT -j KUBE-PORTALS-HOST


kubernetes-master>nslookup kubernetes.default.kubernetes.local 10.0.0.10

We get a response back!
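
One detail worth calling out from the rules above: the chain is hooked into OUTPUT because the nslookup was run on the master itself, and locally generated packets bypass PREROUTING. Traffic arriving from pods or other hosts would need a PREROUTING hook instead, roughly like this sketch (mirroring the KUBE-PORTALS-CONTAINER chain from the TCP example earlier in the thread):

# Locally generated traffic traverses nat/OUTPUT; forwarded traffic
# (from pods or other nodes) traverses nat/PREROUTING instead.
iptables -t nat -A PREROUTING -j KUBE-PORTALS-CONTAINER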

@justinsb
Member

Great stuff! Just FYI (confirming from our face-to-face conversation), it's not safe to run multiple concurrent iptables commands in general (different chains sound like they might be OK). iptables is a wrapper around libiptc; see the comment on iptc_commit: http://www.tldp.org/HOWTO/Querying-libiptc-HOWTO/mfunction.html

This was apparently fixed in 2013, but maybe only if you pass --wait (?): http://git.netfilter.org/iptables/commit/?id=93587a04d0f2511e108bbc4d87a8b9d28a5c5dd8

The root cause of this is that iptables effectively calls iptables-save / iptables-restore (at least per chain); I've seen a lot of code which just therefore calls iptables-save & iptables-restore rather than doing things through adds and deletes. I may even have some code to do that I could dig up if that is helpful.
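
For illustration, a minimal sketch of the save/restore approach Justin describes, reusing the TESTSVC example from above; note that restoring the nat table this way replaces every rule in it, which is exactly why it behaves as one atomic commit (and why --noflush exists when you don't want a full replace):

# Feed a complete desired nat ruleset to iptables-restore; the kernel
# applies it as a single commit instead of one rule at a time.
iptables-restore <<'EOF'
*nat
:PREROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:TESTSVC - [0:0]
-A OUTPUT -d 10.0.0.93/32 -p tcp --dport 80 -j TESTSVC
-A TESTSVC -m statistic --mode random --probability 0.5 -p tcp -j DNAT --to-destination 10.244.4.6:9376
-A TESTSVC -p tcp -j DNAT --to-destination 10.244.1.15:9376
COMMIT
EOF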

@thockin
Member Author

thockin commented Feb 26, 2015

It boggles my mind that there's no way to do CAS or LL/SC sorts of ops.

We should add support for --wait, though it is recent enough that GCE's
debian-backports doesn't have it.

Maybe we should do our own locking inside our code to at least prevent us
from stepping on ourselves.
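
A rough sketch of the "do our own locking" idea, purely illustrative (the lock file name is made up): wrap our iptables invocations in flock so concurrent syncs within our own code serialize. This doesn't protect against other programs calling iptables, which is what --wait is for.

# Serialize our own iptables calls behind a file lock (illustrative only).
flock /var/lock/kube-iptables.lock iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.333 -j TESTSVC_A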


@bgrant0607
Member

What happens in the case of failures in the middle of creating a bunch of rules?

@thockin
Member Author

thockin commented Feb 27, 2015

Fair question - we should probably think really hard about what it means to encounter an error in the middle of this.


@larsks

larsks commented Feb 27, 2015

@thockin From irc today:

The net.ipv4.conf.all.route_localnet sysctl permits 127.0.0.1 to be the target of DNAT rules. From the docs:

route_localnet - BOOLEAN

Do not consider loopback addresses as martian source or destination
while routing. This enables the use of 127/8 for local routing purposes.
default FALSE
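
For anyone following along, turning that on is a one-liner (persist it under /etc/sysctl.d/ if you need it to survive reboots):

# Allow 127/8 addresses as DNAT targets / routing endpoints (default is off).
sysctl -w net.ipv4.conf.all.route_localnet=1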

@thockin
Member Author

thockin commented Jan 20, 2016

It dawns on me (via Abhishek) that even if this works, we STILL have to
track flows somewhere, so it's not stateless in the end anyway.

On Mon, Jan 18, 2016 at 9:50 PM, Tim Hockin thockin@google.com wrote:

I got a bit farther, but something I should have predicted happened.

I set up a pod with 10.244.2.8/25 as its main interface and 10.244.2.250/25 as its "in-a-service" interface. I was hoping that I could send UDP to .250 and detect responses, to SNAT them. But of course, if the client is not in the same /25 (which it cannot be) the default route kicks in, which comes from the .8 address. tcpdump confirms that responses come from .8 when using UDP.

I am again at a place where I am not sure how to make it work. Will think more on it.

On Mon, Jan 18, 2016 at 2:59 AM, Mikaël Cluseau notifications@github.com
wrote:

I'm trying on my side to get something (with pure netns on my local host for now).

I'm trying an approach where I assign service IP ranges to hosts to reduce the number of routing entries:

cli -- elb -- h1 -- c1
        |      `--- c2
        `---- h2 -- c2

h1_ep_ip_ranges=( 10.1.1.0/24 10.1.2.0/24 )
h2_ep_ip_ranges=( 10.1.3.0/24 )

No ping ATM (packets not going through the PREROUTING chain...), and need to sleep. More on this tomorrow ;)

On 01/18/2016 06:28 PM, Tim Hockin wrote:

I can't make this work on GCE, and I am not sure about AWS - there is a limited number of static routes available.

I wonder if I could do it by piggybacking 2 IP ranges together in a single route. It's a lot of IPs to spend, but I guess it only matters for UDP. I'll have to try it out.

Edit: I tried it out and couldn't make it work, but I'm missing something. We'll have to add/remove IPs in containers in response to services coming and going, but I could not make "extra" IPs in a container work (it could ping but not TCP or UDP, not sure why).

I'll have to try again sometime.



@mcluseau
Contributor

That's unfortunate :-( I'm not sure why, by the way. I'll try something with MPLS then; I want to learn it anyway.

@thockin
Member Author

thockin commented Jan 20, 2016

If you have 2 backends for a Service and you want to send more than a single packet, you need to track flows in some way, don't you? Or are you assuming it is safe to spray packets at different backends?


@mcluseau
Contributor

I kind of assumed that for UDP workloads, yes. Going stateless could also be made optional even for UDP. @qoke any comment on this?

@mcluseau
Contributor

Also, we could use things like client IP hashing to make the flow more stable while still balanced (I don't know if we can call that "some kind of tracking" :-)).

@qoke

qoke commented Feb 1, 2016

@MikaelCluseau we use the default IPVS behaviour, which does some very light-weight UDP "stickiness"...

For scheduling UDP datagrams, IPVS load balancer records UDP datagram scheduling with configurable timeout, and the default UDP timeout is 300 seconds. Before UDP connection timeouts, all UDP datagrams from the same socket (protocol, ip address and port) will be directed to the same server.

-- Quoted from http://kb.linuxvirtualserver.org/wiki/IPVS

Of course, this only works if you have many clients talking to a single service, or a single client with varying source ports. If you have a single high-volume client, all sending traffic from the same source port, and you want to load balance this over multiple backends, then you may prefer to use a stateless/spray-and-pray approach.

We load balance a lot of DNS and RADIUS traffic - DNS typically falls into the first category (lots of clients, or clients with lots of source ports), and RADIUS typically falls into the latter category (few clients, lots of packets all from the same IP/port). Rather than using a stateless hash for RADIUS we instead decided to randomize source ports to get an even spread.
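
For reference, the IPVS timeout qoke quotes is tunable with ipvsadm; the values below are just an example (TCP, TCP-FIN, and UDP timeouts in seconds, 300 being the UDP default mentioned above):

# Set IPVS connection timeouts: tcp tcpfin udp (seconds).
ipvsadm --set 900 120 300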

@javiercr

After reading the whole thread I still can't figure out whether activating the iptables mode for kube-proxy should fix the problem of external IPs being hidden (#10921) or not. We did enable the iptables mode with v1.1 as suggested here, but we're still seeing the IPs from the cluster, not the real ones from the users.

Our cluster is in GCE and we just need a load balancer with HTTPS support before we go live. As GCE doesn't support v1.2 alpha we cannot use the new Ingress (which AFAIK supports HTTPS load balancers), so the Network Load Balancer is our only option. But obviously we cannot go live without the ability to log the real IPs of our users.

Some clarification for new users on this would be appreciated. Supporting HTTPS is mandatory for many of us. Thanks!

@maclof
Contributor

maclof commented Feb 19, 2016

I have been using the iptables proxy on and off for quite some time and can confirm that the external IPs of clients are still hidden/shown as cluster IPs.

We've gotten around this so far by running our frontend HTTP/HTTPS proxy in host network mode so that it sees the source IP address.

@javiercr

@maclof thanks for the feedback. Could you share more info about your workaround? What do you mean by running your HTTP/HTTPS proxy in host network mode?

@maclof
Contributor

maclof commented Feb 19, 2016

@javiercr we use a pod spec something like this: http://pastie.org/private/zpdelblsob654zif7xus5g

Using host network means that the pod runs in the host machine's network, instead of being assigned a cluster IP.

That means when our nginx server binds to port 80/443 it will listen on a host IP and will see the source IP addresses.
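
In case the pastie link rots, a minimal sketch of what such a hostNetwork pod spec looks like; the names and image here are placeholders, not maclof's actual configuration:

# Placeholder example: an nginx pod bound to the host's network namespace.
cat <<'EOF' | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: frontend-proxy
spec:
  hostNetwork: true
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    - containerPort: 443
EOF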

@mcluseau
Contributor

I'm using kubernetes 1.1, /opt/bin/kube-proxy ... --proxy-mode=iptables --masquerade-all=false, and routing the cluster IP network through a host running kube-proxy. In this setup, my services see the external IP. I use a highly available network namespace which has an external IP and a route to the hosts:

I0221 01:20:32.695440       1 main.go:224] <A6GSXEKN> Connection from 202.22.xxx.yyy:51954 closed.

@yoshiwaan

I've learned a lot reading this thread!

As an FYI this doc states that AWS ELB uses round-robin for TCP connections and least connections for http/https: http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/how-elb-works.html#request-routing

I agree that focusing on getting requests only to nodes that run pods, and trying to serve from local pods, is the best way to go about it. The nice side benefit of this is there'll be less node-to-node traffic within the cluster, and I suppose a latency improvement by always serving requests from a local pod (which I guess is of even more benefit if you have nodes in multiple availability zones in the same cluster).

In terms of working with a load balancer that doesn't support weighting, you could solve this with your replication controller by trying to always keep the same number of pods on a node (if there's more than 1 per node) and then distributing evenly between them, even if this means having to move pods off of a node in certain situations and allowing only certain replica counts. E.g. for a 4-node cluster with a service connected to a load balancer, the only acceptable pod replica counts would be 1, 2, 3, 4, 6, 8, 9, 12, 16, 20, etc.

@emaildanwilson
Contributor

We're also looking to route traffic to local pods only. I'd be fine with the nodeport going away on a node at times when no pods are present locally for a service. That way a simple load balancer TCP health check would prevent requests from going to those nodes. I think if we can at least solve the iptables/kube-proxy portion of this, then we'll find out what the load balancing implications are when the pods are not balanced across the cluster. I think there are ways to solve that on load balancers without having to set a weight for each node with an API call.

Load balancers already deal with this using other dynamic methods. Also, depending on what the service is actually doing inside that container for each API call, it may not be able to handle 2x the traffic just because there are 2 pods on a node instead of one. Whether Kubernetes limits are set, and whether usage on a pod/node is approaching its maximum, could play into this as well, which adds yet another layer of complexity to finding the right weight setting on the external load balancer.

I'd say, stay away from that level of complexity and not try to set load balancer weight from kubernetes.

@justinsb
Member

justinsb commented Apr 1, 2016

@yoshiwaan Can I suggest opening a new issue for the inter-node traffic suggestion, as this issue is now closed. Personally I think a good first step would be to ensure that if a pod is running on the local node, that we route to the local pod. I suspect that this will be sufficient, because you can then scale your RC so that there are pods on every node.

@paralin
Contributor

paralin commented Apr 14, 2016

@justinsb +1, also we're running into a problem now where we need to see client IPs and it's basically impossible with the current setup.

@dalanlan
Contributor

This could be way too naive, but I was wondering: what's the difference between userspace mode and iptables mode? I can't really tell from the user doc.

@mcluseau
Contributor

mcluseau commented Apr 19, 2016

Userland mode means kube-proxy handles the connections itself: it accepts the connection request from the client and opens a socket to the server, which (1) consumes much more CPU and memory and (2) is limited to the number of ports a single process can open (<65k). The iptables mode works at a lower level, in the kernel, and uses connection tracking instead, so it's much lighter and handles a lot more connections*.

(edit) (*) As long as you don't SNAT packets going through, which in turn requires a setup where you are sure packets will cross the connection tracking rules associated with them. For instance, using a routed access design allows you to avoid SNAT, which means the endpoint of the service will see the real client's IP.

@dalanlan
Contributor

dalanlan commented Apr 19, 2016

@MikaelCluseau
meaning kube-proxy is only responsible for setting up and maintaining iptables rules and we no longer get a random local port for each service in iptables mode, right?

@mcluseau
Contributor


Yes.

@dalanlan
Contributor

Sorry but I absolutely missed this earlier.

(edit) (*) As long as you don't SNAT packets going through, which in turn requires a setup where you are sure packets will cross the connection tracking rules associated with them. For instance, using a routed access design allows you to avoid SNAT, which means the endpoint of the service will see the real client's IP.

@MikaelCluseau I was thinking the iptables mode uses SNAT and DNAT, which is not the case according to you. Could you please clarify this for me?

@mcluseau
Contributor


It's the tricky part.

(1) Using service/external IPs requires DNAT.
(2) If you are sure reply packets will go through the same conntrack rule (i.e., the same network stack or a replicated conntrack table), you can skip the SNAT part (i.e., MASQUERADE rules).

The condition of (2) is usually OK in routed access network designs (which is the simplest design I can think of).

For instance, given

  • a client 1.0.1.1,
  • a service 1.0.2.1,
  • a pod implementing the service 1.0.3.1.

Then,

  1. your router/firewall/loadbalancer/host/whatever receives a packet for the service, so it sees a packet "1.0.1.1 -> 1.0.2.1";
  2. it DNATs it to the endpoint (pod), so the packet will be "1.0.1.1 -> 1.0.3.1" in the cluster network;
  3. the pod replies with a packet "1.0.3.1 -> 1.0.1.1";
  4. the packet goes through a router/firewall/loadbalancer/host/whatever holding the conntrack entry, and the conntrack system rewrites the packet to "1.0.2.1 -> 1.0.1.1" before sending it back to the client.

If the condition of (2) cannot be met, you have to use SNAT/MASQUERADING to be sure that the packet will go back through the router/firewall/loadbalancer/host/whatever's conntrack.
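
Expressed as iptables rules on the in-path host, that example would look roughly like the sketch below; the port (80/TCP) is arbitrary since the example doesn't name one, and the MASQUERADE rule is only needed when condition (2) cannot be met:

# (1) DNAT the service IP to the pod endpoint.
iptables -t nat -A PREROUTING -d 1.0.2.1/32 -p tcp --dport 80 -j DNAT --to-destination 1.0.3.1:80
# Only if replies would not come back through this host's conntrack:
# SNAT/MASQUERADE so the pod's replies are forced back through us.
iptables -t nat -A POSTROUTING -d 1.0.3.1/32 -p tcp --dport 80 -j MASQUERADE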

@thockin
Member Author

thockin commented Apr 20, 2016

@MikaelCluseau - drop me an email at my github name at google.com - I have
something for you

@mrkurt

mrkurt commented Jan 5, 2017

@justinsb @yoshiwaan did anyone ever create an issue for this? My search fu is failing me, and I have a similar need.

Can I suggest opening a new issue for the inter-node traffic suggestion, as this issue is now closed. Personally I think a good first step would be to ensure that if a pod is running on the local node, that we route to the local pod. I suspect that this will be sufficient, because you can then scale your RC so that there are pods on every node.

@yoshiwaan

I didn't raise it myself

@mrkurt

mrkurt commented Jan 6, 2017

Ahhhhh, I think I found it, this appears to be the feature/fix: kubernetes/enhancements#27

Seems to be beta as of 1.5.x.
