Pods with failed status IP address reused on new pods, but traffic still going to old pods across namespaces. #109414
/sig network
Hi @declangallagher, a few questions to get a better picture of the issue:

Are you saying you have two running pods with the same IP? Can you show us output demonstrating that? Is this on containerd or dockershim?
I've seen something similar with a CNI whose IPAM was recycling IPs. For example, podA was assigned 10.1.1.11, it was deleted, and podB was created at the same time, reusing the previously assigned IP 10.1.1.11. Just sharing.
@declangallagher ICYMI, we are waiting on you for more info.
We're experiencing similar symptoms after upgrading from v1.21.5 to v1.22.8. Bisecting leads to 3eadd1a (#102344) as the first broken commit. The issue is still present.

We observe this when kubelet evicts a Pod, which then ends up in a failed/terminated state. Prior to 3eadd1a, the Pod would not retain its PodIP after eviction and/or container termination/cleanup. After this PR it does, so the IPs still show up in the Endpoints/EndpointSlices.

The containers on the node are cleaned up properly. I am not sure in which component the fault is, i.e. should the Pod lose its PodIP upon/after eviction; should it just not be present in the Endpoints/EndpointSlice at all; or should the IP not be reused until the Pod has been (manually) cleaned up?

My reproducer is similar to the ones in other tickets; essentially spawn a Pod that gets evicted due to ephemeral storage/disk pressure. I used:

```shell
# setup
kubetest2 kind -v 2 --kube-root . --build --up
kubectl apply -f evictme.yaml

# observe through either; need to wait for the eviction to happen of course
kubectl get pods -owide
kubectl get endpoints -ojson evictme | jq '.subsets[0].notReadyAddresses'

# teardown
kubetest2 kind -v 2 --kube-root . --down
```

evictme.yaml:
```yaml
kind: Pod
apiVersion: v1
metadata:
  name: evictme
  labels:
    pod: evictme
spec:
  terminationGracePeriodSeconds: 10
  containers:
  - name: busybox
    image: k8s.gcr.io/e2e-test-images/nginx:1.14-2
    # the first sleep is meant to curb races, as that's not the main issue.
    # I didn't check if it's actually necessary.
    command: ["sh", "-c", "sleep 10; fallocate -l 10M file; sleep 100000"]
    ports:
    - containerPort: 80
    resources:
      limits:
        ephemeral-storage: 5Mi
  tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"
---
apiVersion: v1
kind: Service
metadata:
  name: evictme
spec:
  ports:
  - name: http
    port: 80
  selector:
    pod: "evictme"
  type: ClusterIP
```
/triage accepted

This is legit, thanks for the reproducer, and there are more related issues like #109718. In 1.22 there was a change in the kubelet, #106884 (comment); most of the issues were solved, but it seems the surface was bigger than expected and the pod.Status.PodIPs field is not being deleted.

cc @smarterclayton
Are CNI plug-ins depending on pod status IPs as the authoritative record of allocation? Also, do we formally define CNI destroy as happening before that release? The answers to those two questions are required to correctly determine where in pod shutdown the logic needs to be added (and this is another kubelet e2e test we need to add). I.e., if the second is no, we can clear the status podIPs once the pod containers are confirmed shut down. If the second is yes, we have to defer the final pod status update and the clear until after CNI destroy is guaranteed to succeed (that has other safety implications).
/assign |
The problem is the opposite: pod.status.PodIPs is authoritative for Endpoints and EndpointSlices. With the reproducer I can see the CNI works fine; the pod and its network disappear after the CRI call referenced here:

kubernetes/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto, lines 41 to 50 in dd1a789
The main difference in behavior is that evicted pods, since 1.22, show the IPs in their status.

The logic for Endpoints and EndpointSlices is very simplistic and does not even consider this situation. It worked before because evicted pods weren't reporting IPs, but the current behavior of pods seems more correct and consistent with other states:

kubernetes/pkg/controller/util/endpoint/controller_utils.go, lines 125 to 145 in 1c603e9
kubernetes/pkg/controller/endpoint/endpoints_controller.go, lines 612 to 638 in 1c603e9

(We should centralize this logic for both controllers too.) cc @robscott @thockin

On another note, I don't think that I will send a PR soon.
What happened?
Since upgrading to v1.22.7-gke.1500 in our GKE cluster, we have noticed that pods marked as OutOfmemory, Terminated, ContainerStatusUnknown, OOMKilled, or OutOfcpu are still showing as "ready" in the endpoints list, causing traffic to route to them.
What did you expect to happen?
For the pods to be deleted, and if their IP address is reused, for the IP address not to be routed to the old pod anymore.
How can we reproduce it (as minimally and precisely as possible)?
Start a GKE cluster using OS: Container-Optimized OS, on version: v1.22.7-gke.1500
Achieve one of the following statuses on a pod:
OutOfmemory
Terminated
ContainerStatusUnknown
OOMKilled
OutOfcpu
Observe that a new pod in the cluster is assigned the same IP address as the pod with one of the above failed statuses, and that the old pod is still serving requests on the now-reused IP address.
Anything else we need to know?
No response
Kubernetes version
Cloud provider
GKE
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)