Pods with failed status IP address reused on new pods, but traffic still going to old pods across namespaces. #109414

Closed
declangallagher opened this issue Apr 11, 2022 · 11 comments · Fixed by #110255
Assignees
Labels
kind/bug · sig/network · triage/accepted

Comments

@declangallagher

declangallagher commented Apr 11, 2022

What happened?

Since upgrading to v1.22.7-gke.1500 in our GKE cluster, we have noticed that pods marked as OutOfmemory, Terminated, ContainerStatusUnknown, OOMKilled, or OutOfcpu are still showing as "ready" in the endpoints list, causing traffic to route to them.

What did you expect to happen?

For the pods to be deleted, and if their IP address is reused, for traffic to that IP address not to be routed to the old pod anymore.

How can we reproduce it (as minimally and precisely as possible)?

Start a GKE cluster using Container-Optimized OS, on version v1.22.7-gke.1500.

Achieve one of the following statuses on a pod:
OutOfmemory
Terminated
ContainerStatusUnknown
OOMKilled
OutOfcpu

Observe that a new pod in the cluster is assigned the same IP address as the pod with one of the above failed statuses, and that the old pod is still serving requests on the now-reused IP address.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
v1.22.7-gke.1500

Cloud provider

GKE

OS version

# On Linux:
$ cat /etc/os-release
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
KERNEL_COMMIT_ID=ccbab0481cec29d7f07947bcb6255f325b88513f
GOOGLE_CRASH_ID=Lakitu
GOOGLE_METRICS_PRODUCT_ID=26
VERSION=93
VERSION_ID=93
BUILD_ID=16623.102.23
$ uname -a
Linux gke-cf-europe-west2-cluster-ego-c2-pm-08b6d578-8kz5 5.10.90+ #1 SMP Sat Mar 5 10:09:49 UTC 2022 x86_64 Intel(R) Xeon(R) CPU @ 3.10GHz GenuineIntel GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@declangallagher added the kind/bug label Apr 11, 2022
@k8s-ci-robot added the needs-sig and needs-triage labels Apr 11, 2022
@declangallagher
Author

/sig network

@k8s-ci-robot added the sig/network label and removed the needs-sig label Apr 11, 2022
@khenidak
Contributor

Hi @declangallagher, a few questions to get a better picture of the issue:

  1. Is the client pod accessing the server pod (whose IP has been given to a different pod) via a Service or directly?
  2. What was the health of the controller manager, specifically the endpoint and endpointSlice controllers?

@thockin
Member

thockin commented Apr 14, 2022

Are you saying you have two running pods with the same IP? Can you show us something like kubectl get pods -o json | jq '.items[] | .metadata.name, .status.podIP, .status.phase'

Is this on containerd or dockershim?

@aojea
Member

aojea commented Apr 18, 2022

I've seen something similar with a CNI whose IPAM was recycling IPs. For example, podA was assigned 10.1.1.11, it was deleted, and podB was created at the same time, reusing the previously assigned IP 10.1.1.11 ... just sharing

@thockin
Member

thockin commented Apr 28, 2022

@declangallagher ICYMI we are waiting on you for more info.

@breunigs
Contributor

We're experiencing similar symptoms after upgrading from v1.21.5 to v1.22.8. Bisecting leads to 3eadd1a (#102344) as the first broken commit. The issue is still present on HEAD at the time of writing (= 076168b). I'll attach the reproducer at the bottom.

We observe this when the kubelet evicts a Pod, which then goes into either ContainerStatusUnknown or Error, depending on which exact version is checked out. While bisecting, I also observed some of the already-fixed issues like #105462. The issue does not appear when I manually evict through the Eviction API.

Prior to 3eadd1a, the Pod would not retain its PodIP after eviction and/or container termination/cleanup. After this PR it does, so the IPs still show up in the Endpoints object under notReadyAddresses. Since the IP eventually gets reused by unrelated pods, we run into issues similar to the ones described in this Prometheus ticket: prometheus/prometheus#10257 (comment).

The containers on the node are cleaned up properly.

I am not sure which component is at fault, i.e. should the Pod lose its PodIP upon/after eviction; should it just not be present in the Endpoints/EndpointSlice at all; or should the IP not be reused until the Pod has been (manually) cleaned up?

My reproducer is similar to the ones in other tickets; essentially, spawn a Pod that gets evicted due to ephemeral storage/disk pressure. I used resources.limits.ephemeral-storage here because that was easiest to change (and I did not want to exhaust my actual disk). On our non-kind cluster it works the same way through normal node disk-pressure eviction, though.

# setup
kubetest2 kind -v 2 --kube-root . --build --up
kubectl apply -f evictme.yaml
# observe through either; need to wait for the eviction to happen of course
kubectl get pods -owide
kubectl get endpoints -ojson evictme | jq '.subsets[0].notReadyAddresses'
# teardown
kubetest2 kind -v 2 --kube-root . --down
# evictme.yaml
kind: Pod
apiVersion: v1
metadata:
  name: evictme
  labels:
    pod: evictme
spec:
  terminationGracePeriodSeconds: 10
  containers:
  - name: busybox
    image: k8s.gcr.io/e2e-test-images/nginx:1.14-2
    # the first sleep is meant to curb races, as that's not the main issue.
    # I didn't check if it's actually necessary.
    command: ["sh", "-c", "sleep 10; fallocate -l 10M file; sleep 100000"]
    ports:
    - containerPort: 80
    resources:
      limits:
        ephemeral-storage: 5Mi
  tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"
---
apiVersion: v1
kind: Service
metadata:
  name: evictme
spec:
  ports:
  - name: http
    port: 80
  selector:
    pod: "evictme"
  type: ClusterIP

@aojea
Member

aojea commented May 12, 2022

/triage accepted

This is legit, thanks for the reproducer, and there are more related issues like #109718.

In 1.22 there was a change in the kubelet (#106884 (comment)); most of the issues were solved, but it seems the surface was bigger than expected and the kubelet is not deleting the pod.Status.PodIPs field. cc: @smarterclayton
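For illustration (not part of the original report), a minimal client-go check, assuming a kubeconfig at the default location, that lists terminal-phase pods which still report IPs in their status, i.e. the condition described above:

package main

import (
    "context"
    "fmt"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the kubeconfig from the default location (~/.kube/config).
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    // List all pods in all namespaces.
    pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, pod := range pods.Items {
        terminal := pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded
        // A terminal pod that still carries IPs in its status is the symptom
        // discussed here: those IPs keep showing up in Endpoints/EndpointSlices.
        if terminal && len(pod.Status.PodIPs) > 0 {
            fmt.Printf("%s/%s phase=%s still reports IPs %v\n",
                pod.Namespace, pod.Name, pod.Status.Phase, pod.Status.PodIPs)
        }
    }
}

The same condition can also be spotted with kubectl by inspecting status.podIPs of pods in the Failed or Succeeded phase.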

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label May 12, 2022
@smarterclayton
Contributor

Are CNI plug-ins depending on pod status ips as the authoritative record of allocation? Also, do we formally define CNI destroy as happening before that release?

The answer to those two questions is required to correctly determine where in pod shutdown the logic needs to be added (and this is another kubelet e2e test we need to add). I.e. if the second is no, we can clear the status podIPs once the pod containers are confirmed shut down. If the second is yes, we have to defer the final pod status update and the clear until after CNI destroy is guaranteed to succeed (that has other safety implications).
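Purely as an illustration of the first option (clearing the status IPs once all containers are confirmed shut down), a hypothetical helper, not actual kubelet code:

package kubeletsketch

import v1 "k8s.io/api/core/v1"

// clearPodIPsIfShutDown is a hypothetical sketch: it drops the reported IPs
// only once every container in the pod has a Terminated state. Whether this
// is safe depends on the answers to the two questions above.
func clearPodIPsIfShutDown(status *v1.PodStatus) {
    if len(status.ContainerStatuses) == 0 {
        // No container information yet; leave the status untouched.
        return
    }
    for _, cs := range status.ContainerStatuses {
        if cs.State.Terminated == nil {
            // At least one container is not confirmed shut down.
            return
        }
    }
    status.PodIP = ""
    status.PodIPs = nil
}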

@aojea
Member

aojea commented May 12, 2022

/assign

@aojea
Member

aojea commented May 12, 2022

Are CNI plug-ins depending on pod status ips as the authoritative record of allocation?

The problem is the other way around: pod.status.PodIPs is authoritative for endpoints and endpointslices. With the reproducer I can see the CNI works fine; the pod and its network disappear.

Also, do we formally define CNI destroy as happening before that release?

After StopPodSandbox there should not be any IP assigned to that Pod:

// StopPodSandbox stops any running process that is part of the sandbox and
// reclaims network resources (e.g., IP addresses) allocated to the sandbox.
// If there are any running containers in the sandbox, they must be forcibly
// terminated.
// This call is idempotent, and must not return an error if all relevant
// resources have already been reclaimed. kubelet will call StopPodSandbox
// at least once before calling RemovePodSandbox. It will also attempt to
// reclaim resources eagerly, as soon as a sandbox is not needed. Hence,
// multiple StopPodSandbox calls are expected.
rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}

@aojea
Member

aojea commented May 13, 2022

The main difference in behavior is that evicted pods, since 1.22, show the IPs in their status.

Evict

1.23
$ kubectl get pods -o wide
NAME       READY   STATUS       RESTARTS   AGE     IP           NODE                 NOMINATED NODE   READINESS GATES
busybox    0/1     Completed    0          22m     10.244.0.5   kind-control-plane   <none>           <none>
busybox3   0/1     StartError   0          5m36s   10.244.0.6   kind-control-plane   <none>           <none>
evictme    0/1     Error        0          67s     10.244.0.7   kind-control-plane   <none>           <none>

---

1.21

$ kubectl get pods -o wide
NAME       READY   STATUS       RESTARTS   AGE     IP           NODE                NOMINATED NODE   READINESS GATES
busybox    0/1     Completed    0          15m     10.244.0.2   121-control-plane   <none>           <none>
busybox3   0/1     StartError   0          3m56s   10.244.0.6   121-control-plane   <none>           <none>
evictme    0/1     Evicted      0          63s     <none>       121-control-plane   <none>           <none>

The logic for endpoints and endpointslices is very simplistic and does not even consider this situation. It worked before because evicted pods weren't reporting IPs, but the current behavior of pods seems more correct and consistent with other states.

// ShouldPodBeInEndpointSlice returns true if a specified pod should be in an EndpointSlice object.
// Terminating pods are only included if includeTerminating is true
func ShouldPodBeInEndpointSlice(pod *v1.Pod, includeTerminating bool) bool {
    if len(pod.Status.PodIP) == 0 && len(pod.Status.PodIPs) == 0 {
        return false
    }
    if !includeTerminating && pod.DeletionTimestamp != nil {
        return false
    }
    if pod.Spec.RestartPolicy == v1.RestartPolicyNever {
        return pod.Status.Phase != v1.PodFailed && pod.Status.Phase != v1.PodSucceeded
    }
    if pod.Spec.RestartPolicy == v1.RestartPolicyOnFailure {
        return pod.Status.Phase != v1.PodSucceeded
    }
    return true
}

    if tolerateUnreadyEndpoints || podutil.IsPodReady(pod) {
        subsets = append(subsets, v1.EndpointSubset{
            Addresses: []v1.EndpointAddress{epa},
            Ports:     ports,
        })
        readyEps++
    } else if shouldPodBeInEndpoints(pod) {
        klog.V(5).Infof("Pod is out of service: %s/%s", pod.Namespace, pod.Name)
        subsets = append(subsets, v1.EndpointSubset{
            NotReadyAddresses: []v1.EndpointAddress{epa},
            Ports:             ports,
        })
        notReadyEps++
    }
    return subsets, readyEps, notReadyEps
}

func shouldPodBeInEndpoints(pod *v1.Pod) bool {
    switch pod.Spec.RestartPolicy {
    case v1.RestartPolicyNever:
        return pod.Status.Phase != v1.PodFailed && pod.Status.Phase != v1.PodSucceeded
    case v1.RestartPolicyOnFailure:
        return pod.Status.Phase != v1.PodSucceeded
    default:
        return true
    }
}

(We should centralize this logic for both controllers too.) cc: @robscott @thockin

On another note, I don't think that tolerateUnreadyEndpoints should mean "add any podIP that I find on the pods". Is there any reason to have terminal pods' IPs on endpoints?
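To illustrate the idea (a hypothetical sketch, not necessarily what the eventual PR does), a centralized helper shared by both controllers could exclude terminal pods regardless of restart policy:

package endpointsketch

import v1 "k8s.io/api/core/v1"

// podEligibleForEndpoints is a hypothetical, centralized version of the two
// checks above. A pod in a terminal phase has (or will have) its sandbox torn
// down and its IP released, possibly to an unrelated pod, so it is never
// published, whatever its restart policy is.
func podEligibleForEndpoints(pod *v1.Pod, includeTerminating bool) bool {
    // Terminal pods: the IP may already be reused by another pod.
    if pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded {
        return false
    }
    // Pods without an IP cannot receive traffic.
    if len(pod.Status.PodIP) == 0 && len(pod.Status.PodIPs) == 0 {
        return false
    }
    // Terminating pods are only included when the caller asks for them.
    if !includeTerminating && pod.DeletionTimestamp != nil {
        return false
    }
    return true
}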

I will send a PR soon
