Merge pull request #252 from cybozu-go/sport-auto-egress
Support automatic source port selection in UDP encapsulation
ysksuzuki authored Oct 2, 2023
2 parents e964dfd + ba902a3 commit f47cd59
Showing 19 changed files with 367 additions and 67 deletions.
1 change: 1 addition & 0 deletions docs/cmd-coil-egress.md
@@ -20,6 +20,7 @@ It watches client Pods and creates or deletes Foo-over-UDP tunnels.
```
Flags:
--fou-port int port number for foo-over-udp tunnels (default 5555)
--enable-sport-auto enable automatic source port assignment (default false)
--health-addr string bind address of health/readiness probes (default ":8081")
-h, --help help for coil-egress
--metrics-addr string bind address of metrics endpoint (default ":8080")
28 changes: 23 additions & 5 deletions docs/design.md
@@ -203,7 +203,7 @@ This can be configured by 1) creating IPIP tunnel device with FoU encapsulation
```console
$ sudo ip link add name tun1 type ipip ttl 225 \
remote 1.2.3.4 local 5.6.7.8 \
encap fou encap-sport 5555 encap-dport 5555
encap fou encap-sport auto encap-dport 5555

$ sudo ip fou add port 5555 ipproto 4 # 4 means IPIP protocol
```
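As a rough illustration of the difference between the two `encap-sport` settings, the following Go sketch builds the `ip link add` argument list shown above. The function name `ipipFoUArgs` and its parameters are hypothetical and exist only for this example; this is not how Coil itself creates the device.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ipipFoUArgs builds the arguments for the `ip link add` command shown above.
// The name and parameters are illustrative assumptions, not part of Coil.
func ipipFoUArgs(name, remote, local string, dport int, sportAuto bool) []string {
	sport := strconv.Itoa(dport) // fixed source port: reuse the FoU port
	if sportAuto {
		sport = "auto" // let the kernel choose the source port
	}
	return []string{
		"link", "add", "name", name, "type", "ipip", "ttl", "225",
		"remote", remote, "local", local,
		"encap", "fou", "encap-sport", sport, "encap-dport", strconv.Itoa(dport),
	}
}

func main() {
	auto := ipipFoUArgs("tun1", "1.2.3.4", "5.6.7.8", 5555, true)
	fixed := ipipFoUArgs("tun1", "1.2.3.4", "5.6.7.8", 5555, false)
	fmt.Println("ip", strings.Join(auto, " "))
	fmt.Println("ip", strings.Join(fixed, " "))
}
```

Only the value after `encap-sport` differs between the two invocations; the destination port stays at the FoU port in both cases.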
@@ -225,13 +225,27 @@ The transmission between client pods and the SNAT router needs to be bidirection

If the SNAT routers are behind Kubernetes Service, the IPIP tunnel on the client pod is configured to send packets to the Service's ClusterIP. Therefore, the FoU encapsulated packet will have the ClusterIP as the destination address.

Remember we need bidirectional tunneling. If the returning packet has the SNAT router's IP address as the source address, the packet does not match the IPIP tunnel configured for the Service's ClusterIP. So, the returning packet *must* have the ClusterIP as the source address.
Remember we need bidirectional tunneling. If the returning packet has the SNAT router's IP address as the source address, the packet does not match the IPIP tunnel configured for the Service's ClusterIP.
We set up a flow-based IPIP tunnel device to receive such returning packets, in addition to the IPIP tunnel device with the FoU encapsulation option. Otherwise, clients will return ICMP destination unreachable packets.
These flow-based IPIP tunnel devices work as catch-all fallback interfaces for the IPIP decapsulation stack.

To resolve this, we need to understand how `kube-proxy` works for ClusterIP. `kube-proxy` rewrites outgoing packets' destination addresses if they are ClusterIP. So, it works as a destination NAT (DNAT) service.
For example, a NAT client (`10.64.0.65:49944`) sends an encapsulated packet to the ClusterIP `10.68.114.217:5555`, and a return packet comes from a router Pod (`10.72.49.1:59203`) to the client.
The outgoing packet is encapsulated by the IPIP tunnel device with the FoU encapsulation option, and the incoming packet is received and decapsulated by the flow-based IPIP tunnel device.

Moreover, it rewrites the incoming packet's source addresses if the packet seems like a response returned from one of the destination servers of Service. To be more precise, the incoming packet will be handled by `kube-proxy` if and only if its destination address/port was the source address/port of the outgoing packet and its source address/port was the destination address/port.
```
10.64.0.65.49944 > 10.68.114.217.5555: UDP, length 60
10.72.49.1.59203 > 10.64.0.65.5555: UDP, length 60
```

Before coil v2.4.0, we configured a fixed source port 5555 for FoU encapsulation devices so that `kube-proxy` or `Cilium kube-proxy replacement` can do the reverse SNAT handling.
The transmit and receive sides are now separated, so the communication can be asymmetric, as the example above shows. Previously, we relied on the fixed source port to handle the reverse SNAT.
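The following Go sketch is a simplified model of why the fixed port mattered: a connection-tracking NAT only rewrites a return packet when it is the exact reverse of the recorded outgoing flow. The types and the `isReplyTo` helper are assumptions made for illustration; they do not mirror the internals of `kube-proxy` or `Cilium`.

```go
package main

import "fmt"

// endpoint is an IP address and UDP port pair.
type endpoint struct {
	ip   string
	port int
}

// flow is a one-way UDP flow, as a connection-tracking NAT might record it.
type flow struct {
	src, dst endpoint
}

// isReplyTo reports whether in is the exact reverse of out. This is the
// simplified condition under which a reverse NAT rewrite would apply.
func isReplyTo(in, out flow) bool {
	return in.src == out.dst && in.dst == out.src
}

func main() {
	client, router := "10.64.0.65", "10.72.49.1"

	// Outgoing flows after the Service address was translated to a router Pod.
	outFixed := flow{endpoint{client, 5555}, endpoint{router, 5555}}
	outAuto := flow{endpoint{client, 49944}, endpoint{router, 5555}}

	// Return packets: fixed port 5555 on both sides vs. the example above.
	inFixed := flow{endpoint{router, 5555}, endpoint{client, 5555}}
	inAuto := flow{endpoint{router, 59203}, endpoint{client, 5555}}

	fmt.Println(isReplyTo(inFixed, outFixed)) // true
	fmt.Println(isReplyTo(inAuto, outAuto))   // false
}
```

In the second case the return packet is not recognized as a reply, so it keeps the router Pod's address as its source and has to be received by the flow-based tunnel device.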

To satisfy this condition, we use the port number 5555 for FoU on both client pods and SNAT router pods.
This fixed source port approach causes the following problems:

- Traffic from NAT clients to router Pods can't be distributed when users use Coil with a proxier that selects a backend based on the flow hash, such as `Cilium`.
- When a router Pod is terminating, traffic from NAT clients to the router Pod can't be switched until the Pod is finally removed. This problem happens with the graceful termination of `Cilium kube-proxy replacement`.

We encourage users to set `fouSourcePortAuto: true` to avoid these problems.
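The first problem can be illustrated with a small Go sketch that models a proxier choosing a backend from a hash of the outer UDP flow. FNV-1a and the backend names are stand-ins chosen for this example; an actual proxier uses its own hash and backend table.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickBackend models a proxier that selects a backend from a hash of the
// outer UDP flow. FNV-1a is an assumption made for this sketch.
func pickBackend(srcIP string, srcPort int, dstIP string, dstPort int, backends []string) string {
	h := fnv.New32a()
	fmt.Fprintf(h, "%s:%d>%s:%d", srcIP, srcPort, dstIP, dstPort)
	return backends[h.Sum32()%uint32(len(backends))]
}

// distinctBackends counts how many backends one client reaches when its
// encapsulated packets use the given source ports.
func distinctBackends(ports []int, backends []string) int {
	seen := map[string]bool{}
	for _, p := range ports {
		seen[pickBackend("10.64.0.65", p, "10.68.114.217", 5555, backends)] = true
	}
	return len(seen)
}

func main() {
	backends := []string{"router-0", "router-1"}
	fixed := []int{5555, 5555, 5555, 5555}
	auto := []int{49944, 49945, 51872, 60021}
	fmt.Println("fixed source port reaches", distinctBackends(fixed, backends), "backend(s)")
	fmt.Println("auto source port reaches", distinctBackends(auto, backends), "backend(s)")
}
```

With a fixed source port, every encapsulated packet from one client has the same outer flow, so the hash can only ever select a single backend. Varying source ports give the proxier different flows to distribute.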

### Session persistence

@@ -240,6 +254,10 @@ This can be achieved by setting Service's [`spec.sessionAffinity`](https://kuber

Therefore, Coil creates a Service with `spec.sessionAffinity=ClientIP` for each NAT gateway.

Note also that session persistence is not required if you use this feature in conjunction with `Cilium kube-proxy replacement`.
`Cilium` selects a backend for the service based on the flow hash, and the kernel picks source ports based on the flow hash of the encapsulated packet.
This means that traffic belonging to the same TCP connection from a NAT client to a router service is always sent to the same Pod.
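A minimal Go sketch of this property, assuming a hash-derived source port: the same inner connection always maps to the same outer source port, so a proxier that hashes the outer flow keeps sending it to the same backend. The `innerFlow` type, the FNV-1a hash, and the port range are illustrative assumptions and do not reproduce the kernel's actual algorithm.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// innerFlow identifies one connection carried inside the tunnel.
type innerFlow struct {
	srcIP, dstIP     string
	srcPort, dstPort int
	proto            string
}

// sourcePort maps a hash of the inner flow into the range [lo, hi).
// It mimics the idea of a hash-derived UDP source port only.
func sourcePort(f innerFlow, lo, hi uint32) uint16 {
	h := fnv.New32a()
	fmt.Fprintf(h, "%s|%s|%d|%d|%s", f.srcIP, f.dstIP, f.srcPort, f.dstPort, f.proto)
	return uint16(lo + h.Sum32()%(hi-lo))
}

func main() {
	// 192.0.2.10 is a documentation address used as a placeholder destination.
	conn := innerFlow{"10.64.0.65", "192.0.2.10", 40000, 443, "tcp"}
	first := sourcePort(conn, 32768, 61000)
	second := sourcePort(conn, 32768, 61000)
	fmt.Println(first == second) // true: one connection keeps one source port
}
```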

### Auto-scaling with HPA

To enable auto-scaling with horizontal pod autoscaler (HPA), `Egress` implements `scale` subresource.
7 changes: 7 additions & 0 deletions v2/api/v2/egress_types.go
@@ -53,6 +53,13 @@ type EgressSpec struct {
// Ref. https://pkg.go.dev/k8s.io/api/core/v1?tab=doc#ServiceSpec
// +optional
SessionAffinityConfig *corev1.SessionAffinityConfig `json:"sessionAffinityConfig,omitempty"`

// FouSourcePortAuto indicates that the source port number in foo-over-udp encapsulation
// should be chosen automatically.
// If set to true, the kernel picks a source port based on the flow hash of the encapsulated packet.
// The default is false.
// +optional
FouSourcePortAuto bool `json:"fouSourcePortAuto,omitempty"`
}

// EgressPodTemplate defines pod template for Egress
10 changes: 6 additions & 4 deletions v2/cmd/coil-egress/sub/root.go
@@ -12,10 +12,11 @@ import (
)

var config struct {
metricsAddr string
healthAddr string
port int
zapOpts zap.Options
metricsAddr string
healthAddr string
port int
enableSportAuto bool
zapOpts zap.Options
}

var rootCmd = &cobra.Command{
@@ -43,6 +44,7 @@ func init() {
pf.StringVar(&config.metricsAddr, "metrics-addr", ":8080", "bind address of metrics endpoint")
pf.StringVar(&config.healthAddr, "health-addr", ":8081", "bind address of health/readiness probes")
pf.IntVar(&config.port, "fou-port", 5555, "port number for foo-over-udp tunnels")
pf.BoolVar(&config.enableSportAuto, "enable-sport-auto", false, "enable automatic source port assignment")

goflags := flag.NewFlagSet("klog", flag.ExitOnError)
klog.InitFlags(goflags)
2 changes: 1 addition & 1 deletion v2/cmd/coil-egress/sub/run.go
@@ -94,7 +94,7 @@ func subMain() error {
return err
}

if err := controllers.SetupPodWatcher(mgr, myNS, myName, ft, eg); err != nil {
if err := controllers.SetupPodWatcher(mgr, myNS, myName, ft, config.enableSportAuto, eg); err != nil {
return err
}

6 changes: 6 additions & 0 deletions v2/config/crd/bases/coil.cybozu.com_egresses.yaml
@@ -40,6 +40,12 @@ spec:
type: string
minItems: 1
type: array
fouSourcePortAuto:
description: FouSourcePortAuto indicates that the source port number
in foo-over-udp encapsulation should be chosen automatically. If
set to true, the kernel picks a source port based on the flow hash of the
encapsulated packet. The default is false.
type: boolean
replicas:
default: 1
description: Replicas is the desired number of egress (SNAT) pods.
3 changes: 3 additions & 0 deletions v2/controllers/egress_controller.go
@@ -157,6 +157,9 @@ func (r *EgressReconciler) reconcilePodTemplate(eg *coilv2.Egress, depl *appsv1.
}
if len(egressContainer.Args) == 0 {
egressContainer.Args = []string{"--zap-stacktrace-level=panic"}
if eg.Spec.FouSourcePortAuto {
egressContainer.Args = append(egressContainer.Args, "--enable-sport-auto=true")
}
}
egressContainer.Env = append(egressContainer.Env,
corev1.EnvVar{
6 changes: 3 additions & 3 deletions v2/controllers/mock_test.go
@@ -149,11 +149,11 @@ func (t *mockFoUTunnel) Init() error {
panic("not implemented")
}

func (t *mockFoUTunnel) AddPeer(ip net.IP) (netlink.Link, error) {
func (t *mockFoUTunnel) AddPeer(ip net.IP, sportAuto bool) (netlink.Link, error) {
t.mu.Lock()
defer t.mu.Unlock()

t.peers[ip.String()] = true
t.peers[ip.String()] = sportAuto
return nil, nil
}

@@ -172,7 +172,7 @@ func (t *mockFoUTunnel) GetPeers() map[string]bool {
defer t.mu.Unlock()

for k := range t.peers {
m[k] = true
m[k] = t.peers[k]
}
return m
}
34 changes: 18 additions & 16 deletions v2/controllers/pod_watcher.go
@@ -40,18 +40,19 @@ func init() {
// +kubebuilder:rbac:groups="",resources=pods,verbs=get;list;watch

// SetupPodWatcher registers pod watching reconciler to mgr.
func SetupPodWatcher(mgr ctrl.Manager, ns, name string, ft founat.FoUTunnel, eg founat.Egress) error {
func SetupPodWatcher(mgr ctrl.Manager, ns, name string, ft founat.FoUTunnel, encapSportAuto bool, eg founat.Egress) error {
clientPods.Reset()

r := &podWatcher{
client: mgr.GetClient(),
myNS: ns,
myName: name,
ft: ft,
eg: eg,
metric: clientPods.WithLabelValues(ns, name),
podAddrs: make(map[string][]net.IP),
peers: make(map[string]map[string]struct{}),
client: mgr.GetClient(),
myNS: ns,
myName: name,
ft: ft,
encapSportAuto: encapSportAuto,
eg: eg,
metric: clientPods.WithLabelValues(ns, name),
podAddrs: make(map[string][]net.IP),
peers: make(map[string]map[string]struct{}),
}

return ctrl.NewControllerManagedBy(mgr).
@@ -65,12 +66,13 @@ func SetupPodWatcher(mgr ctrl.Manager, ns, name string, ft founat.FoUTunnel, eg
// this implementation can leave some tunnels as garbage. Such garbage tunnels
// do no harm, though.
type podWatcher struct {
client client.Client
myNS string
myName string
ft founat.FoUTunnel
eg founat.Egress
metric prometheus.Gauge
client client.Client
myNS string
myName string
ft founat.FoUTunnel
encapSportAuto bool
eg founat.Egress
metric prometheus.Gauge

mu sync.Mutex
podAddrs map[string][]net.IP
@@ -166,7 +168,7 @@ OUTER:
}
}

link, err := r.ft.AddPeer(ip)
link, err := r.ft.AddPeer(ip, r.encapSportAuto)
if errors.Is(err, founat.ErrIPFamilyMismatch) {
logger.Info("skipping unsupported pod IP", "pod", pod.Namespace+"/"+pod.Name, "ip", ip.String())
continue
2 changes: 1 addition & 1 deletion v2/controllers/pod_watcher_test.go
@@ -103,7 +103,7 @@ var _ = Describe("Pod watcher", func() {
})
Expect(err).ToNot(HaveOccurred())

err = SetupPodWatcher(mgr, "internet", "egress2", ft, eg)
err = SetupPodWatcher(mgr, "internet", "egress2", ft, true, eg)
Expect(err).ToNot(HaveOccurred())

go func() {
45 changes: 45 additions & 0 deletions v2/e2e/coil_test.go
@@ -258,6 +258,19 @@ var _ = Describe("Coil", func() {
}
return int(depl.Status.ReadyReplicas)
}).Should(Equal(2))

By("defining Egress with fouSourcePortAuto in the internet namespace")
kubectlSafe(nil, "apply", "-f", "manifests/egress-sport-auto.yaml")

By("checking pod deployments for fouSourcePortAuto")
Eventually(func() int {
depl := &appsv1.Deployment{}
err := getResource("internet", "deployments", "egress-sport-auto", "", depl)
if err != nil {
return 0
}
return int(depl.Status.ReadyReplicas)
}).Should(Equal(2))
})

It("should be able to run NAT client pods", func() {
@@ -280,6 +293,26 @@ var _ = Describe("Coil", func() {
}
return nil
}).Should(Succeed())

By("creating a NAT client pod for fouSourcePortAuto")
kubectlSafe(nil, "apply", "-f", "manifests/nat-client-sport-auto.yaml")

By("checking the pod status for fouSourcePortAuto")
Eventually(func() error {
pod := &corev1.Pod{}
err := getResource("default", "pods", "nat-client-sport-auto", "", pod)
if err != nil {
return err
}
if len(pod.Status.ContainerStatuses) == 0 {
return errors.New("no container status")
}
cst := pod.Status.ContainerStatuses[0]
if !cst.Ready {
return errors.New("container is not ready")
}
return nil
}).Should(Succeed())
})

It("should allow NAT traffic over foo-over-udp tunnel", func() {
@@ -319,5 +352,17 @@ var _ = Describe("Coil", func() {
resp := kubectlSafe(data, "exec", "-i", "nat-client", "--", "curl", "-sf", "-T", "-", fakeURL)
Expect(resp).To(HaveLen(1 << 20))
}

By("sending and receiving HTTP request from nat-client-sport-auto")
data = make([]byte, 1<<20) // 1 MiB
resp = kubectlSafe(data, "exec", "-i", "nat-client-sport-auto", "--", "curl", "-sf", "-T", "-", fakeURL)
Expect(resp).To(HaveLen(1 << 20))

By("running the same test 100 times with nat-client-sport-auto")
for i := 0; i < 100; i++ {
time.Sleep(1 * time.Millisecond)
resp := kubectlSafe(data, "exec", "-i", "nat-client-sport-auto", "--", "curl", "-sf", "-T", "-", fakeURL)
Expect(resp).To(HaveLen(1 << 20))
}
})
})
20 changes: 20 additions & 0 deletions v2/e2e/manifests/egress-sport-auto.yaml
@@ -0,0 +1,20 @@
apiVersion: coil.cybozu.com/v2
kind: Egress
metadata:
name: egress-sport-auto
namespace: internet
spec:
replicas: 2
destinations:
- 0.0.0.0/0
- ::/0
fouSourcePortAuto: true
template:
spec:
nodeSelector:
kubernetes.io/hostname: coil-control-plane
tolerations:
- effect: NoSchedule
operator: Exists
containers:
- name: egress
17 changes: 17 additions & 0 deletions v2/e2e/manifests/nat-client-sport-auto.yaml
@@ -0,0 +1,17 @@
apiVersion: v1
kind: Pod
metadata:
name: nat-client-sport-auto
namespace: default
annotations:
egress.coil.cybozu.com/internet: egress-sport-auto
spec:
tolerations:
- key: test
operator: Exists
nodeSelector:
test: coil
containers:
- name: ubuntu
image: quay.io/cybozu/ubuntu:22.04
command: ["pause"]
4 changes: 2 additions & 2 deletions v2/go.mod
@@ -20,9 +20,9 @@ require (
github.com/prometheus/common v0.42.0
github.com/spf13/cobra v1.7.0
github.com/spf13/viper v1.15.0
github.com/vishvananda/netlink v1.2.1-beta.2
github.com/vishvananda/netlink v1.2.1-beta.2.0.20230807190133-6afddb37c1f0
go.uber.org/zap v1.24.0
golang.org/x/sys v0.7.0
golang.org/x/sys v0.10.0
google.golang.org/grpc v1.54.0
google.golang.org/protobuf v1.30.0
k8s.io/api v0.26.4
9 changes: 4 additions & 5 deletions v2/go.sum
@@ -308,8 +308,8 @@ github.com/stretchr/testify v1.8.1 h1:w7B6lhMri9wdJUVmEZPGGhZzrYTPvgJArz7wNPgYKs
github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4=
github.com/subosito/gotenv v1.4.2 h1:X1TuBLAMDFbaTAChgCBLu3DU3UPyELpnF2jjJ2cz/S8=
github.com/subosito/gotenv v1.4.2/go.mod h1:ayKnFf/c6rvx/2iiLrJUk1e6plDbT3edrFNGqEflhK0=
github.com/vishvananda/netlink v1.2.1-beta.2 h1:Llsql0lnQEbHj0I1OuKyp8otXp0r3q0mPkuhwHfStVs=
github.com/vishvananda/netlink v1.2.1-beta.2/go.mod h1:twkDnbuQxJYemMlGd4JFIcuhgX83tXhKS2B/PRMpOho=
github.com/vishvananda/netlink v1.2.1-beta.2.0.20230807190133-6afddb37c1f0 h1:CLsXiDYQjYqJVntHkQZL2AW0R8BrvJu1K/hbs+2Q+EQ=
github.com/vishvananda/netlink v1.2.1-beta.2.0.20230807190133-6afddb37c1f0/go.mod h1:whJevzBpTrid75eZy99s3DqCmy05NfibNaF2Ol5Ox5A=
github.com/vishvananda/netns v0.0.0-20200728191858-db3c7e526aae/go.mod h1:DD4vA1DwXk04H54A1oHXtwZmA0grkVMdPxx/VGLCah0=
github.com/vishvananda/netns v0.0.3 h1:WxY6MpgIdDMQX50UJ7bPIRJdBCOeUV6XtW8dZZja988=
github.com/vishvananda/netns v0.0.3/go.mod h1:SpkAiCQRtJ6TvvxPnOSyH3BMl6unz3xZlaprSwhNNJM=
@@ -468,7 +468,6 @@ golang.org/x/sys v0.0.0-20200501052902-10377860bb8e/go.mod h1:h1NjWce9XRLGQEsW7w
golang.org/x/sys v0.0.0-20200511232937-7e40ca221e25/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200515095857-1151b9dac4a9/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200523222454-059865788121/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200728102440-3e129f6d46b1/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200803210538-64077c9b5642/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200905004654-be1d3432aa8f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
@@ -486,8 +485,8 @@ golang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBc
golang.org/x/sys v0.0.0-20210630005230-0f9fa26af87c/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20211025201205-69cdffdb9359/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20220908164124-27713097b956/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.7.0 h1:3jlCCIQZPdOYu1h8BkNvLz8Kgwtae2cagcG/VamtZRU=
golang.org/x/sys v0.7.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.10.0 h1:SqMFp9UcQJZa+pmYuAKjd9xq1f0j5rLcDIk0mj4qAsA=
golang.org/x/sys v0.10.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=
golang.org/x/term v0.6.0 h1:clScbb1cHjoCkyRbWwBEUZ5H/tIFu5TAXIqaZD0Gcjw=
golang.org/x/term v0.6.0/go.mod h1:m6U89DPEgQRMq3DNkDClhWw02AUbt2daBVO4cn4Hv9U=