Pods with failed status IP address reused on new pods, but traffic still going to old pods across namespaces. #109414

Closed
declangallagher opened this issue Apr 11, 2022 · 11 comments · Fixed by #110255
Assignees
Labels
kind/bug · sig/network · triage/accepted

Comments

@declangallagher

declangallagher commented Apr 11, 2022

What happened?

Since upgrading to v1.22.7-gke.1500 in our GKE cluster, we have noticed that pods marked as OutOfmemory, Terminated, ContainerStatusUnknown, OOMKilled, or OutOfcpu are still showing as "ready" in the endpoints list, causing traffic to route to them.

What did you expect to happen?

For the pods to be deleted, and if their IP address is reused, for traffic to that IP address not to be routed to the old pod anymore.

How can we reproduce it (as minimally and precisely as possible)?

Start a GKE cluster using Container-Optimized OS, on version v1.22.7-gke.1500.

Achieve one of the following statuses on a pod:
OutOfmemory
Terminated
ContainerStatusUnknown
OOMKilled
OutOfcpu

Observe that a new pod in the cluster is assigned the same IP address as the pod with one of the above failed statuses, and that the old pod is still serving requests on the now-reused IP address.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
v1.22.7-gke.1500

Cloud provider

GKE

OS version

# On Linux:
$ cat /etc/os-release
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
KERNEL_COMMIT_ID=ccbab0481cec29d7f07947bcb6255f325b88513f
GOOGLE_CRASH_ID=Lakitu
GOOGLE_METRICS_PRODUCT_ID=26
VERSION=93
VERSION_ID=93
BUILD_ID=16623.102.23
$ uname -a
Linux gke-cf-europe-west2-cluster-ego-c2-pm-08b6d578-8kz5 5.10.90+ #1 SMP Sat Mar 5 10:09:49 UTC 2022 x86_64 Intel(R) Xeon(R) CPU @ 3.10GHz GenuineIntel GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@declangallagher added the kind/bug label Apr 11, 2022
@k8s-ci-robot added the needs-sig and needs-triage labels Apr 11, 2022
@declangallagher
Author

/sig network

@k8s-ci-robot added the sig/network label and removed the needs-sig label Apr 11, 2022
@khenidak
Contributor

Hi @declangallagher, a few questions to get a better picture of the issue:

  1. Is the client pod accessing the server pod (whose IP has been given to a different pod) via a Service or directly?
  2. What was the health of the controller manager, specifically the endpoint and endpointSlice controllers?

@thockin
Member

thockin commented Apr 14, 2022

Are you saying you have two running pods with the same IP? Can you show us something like kubectl get pods -o json | jq '.items[] | .metadata.name, .status.podIP, .status.phase'

Is this on containerd or dockershim?

@aojea
Member

aojea commented Apr 18, 2022

I've seen something similar with a CNI whose IPAM was recycling IPs. For example, podA was assigned 10.1.1.11, it was deleted, and podB was created at the same time, reusing the previously assigned IP 10.1.1.11 ... just sharing

@thockin
Member

thockin commented Apr 28, 2022

@declangallagher ICYMI we are waiting on you for more info.

@breunigs
Contributor

We're experiencing similar symptoms after upgrading from v1.21.5 to v1.22.8. Bisecting leads to 3eadd1a (#102344) as the first broken commit. The issue is still present on HEAD at the time of writing (= 076168b). I'll attach the reproducer at the bottom.

We observe this when the kubelet evicts a Pod, which then goes into either ContainerStatusUnknown or Error, depending on which exact version is checked out. While bisecting, I also observed some of the already-fixed issues like #105462. The issue does not appear when I manually evict through the Eviction API.

Prior to 3eadd1a, the Pod would not retain its PodIP after eviction and/or container termination/cleanup. After this PR it does, so the IPs still show up in the Endpoints object under notReadyAddresses. Since the IP eventually gets reused by unrelated pods, we run into issues similar to the ones described in this Prometheus ticket: prometheus/prometheus#10257 (comment).

The containers on the node are cleaned up properly.

I am not sure which component is at fault, i.e. should the Pod lose its PodIP upon/after eviction; should it just not be present in the Endpoints/EndpointSlice at all; or should the IP not be reused until the Pod has been (manually) cleaned up?

My reproducer is similar to the ones in other tickets; essentially, spawn a Pod that gets evicted due to ephemeral storage/disk pressure. I used resources.limits.ephemeral-storage here because that was easiest to change (and I did not want to exhaust my actual disk). On our non-kind cluster it works the same way through normal node disk-pressure eviction, though.

# setup
kubetest2 kind -v 2 --kube-root . --build --up
kubectl apply -f evictme.yaml
# observe through either; need to wait for the eviction to happen of course
kubectl get pods -owide
kubectl get endpoints -ojson evictme | jq '.subsets[0].notReadyAddresses'
# teardown
kubetest2 kind -v 2 --kube-root . --down
# evictme.yaml
kind: Pod
apiVersion: v1
metadata:
  name: evictme
  labels:
    pod: evictme
spec:
  terminationGracePeriodSeconds: 10
  containers:
  - name: busybox
    image: k8s.gcr.io/e2e-test-images/nginx:1.14-2
    # the first sleep is meant to curb races, as that's not the main issue.
    # I didn't check if it's actually necessary.
    command: ["sh", "-c", "sleep 10; fallocate -l 10M file; sleep 100000"]
    ports:
    - containerPort: 80
    resources:
      limits:
        ephemeral-storage: 5Mi
  tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"
---
apiVersion: v1
kind: Service
metadata:
  name: evictme
spec:
  ports:
  - name: http
    port: 80
  selector:
    pod: "evictme"
  type: ClusterIP

@aojea
Member

aojea commented May 12, 2022

/triage accepted

This is legit, thanks for the reproducer, and there are more related issues like #109718.

In 1.22 there was a change in the kubelet (#106884 (comment)); most of the issues were solved, but it seems the surface was bigger than expected and the kubelet is not deleting the pod.Status.PodIPs field. cc: @smarterclayton
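For illustration (not part of the original report), a minimal client-go check, assuming a kubeconfig at the default location, that lists terminal-phase pods which still report IPs in their status, i.e. the condition described above:

package main

import (
    "context"
    "fmt"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the kubeconfig from the default location (~/.kube/config).
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    // List all pods in all namespaces.
    pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, pod := range pods.Items {
        terminal := pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded
        // A terminal pod that still carries IPs in its status is the symptom
        // discussed here: those IPs keep showing up in Endpoints/EndpointSlices.
        if terminal && len(pod.Status.PodIPs) > 0 {
            fmt.Printf("%s/%s phase=%s still reports IPs %v\n",
                pod.Namespace, pod.Name, pod.Status.Phase, pod.Status.PodIPs)
        }
    }
}

The same condition can also be spotted with kubectl by inspecting status.podIPs of pods in the Failed or Succeeded phase.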

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label May 12, 2022
@smarterclayton
Contributor

Are CNI plug-ins depending on pod status ips as the authoritative record of allocation? Also, do we formally define CNI destroy as happening before that release?

The answer to those two questions is required to correctly determine where in pod shutdown the logic needs to be added (and this is another kubelet e2e test we need to add). I.e. if the second is no, we can clear the status podIPs once the pod containers are confirmed shut down. If the second is yes, we have to defer the final pod status update and the clear until after CNI destroy is guaranteed to succeed (that has other safety implications).
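Purely as an illustration of the first option (clearing the status IPs once all containers are confirmed shut down), a hypothetical helper, not actual kubelet code:

package kubeletsketch

import v1 "k8s.io/api/core/v1"

// clearPodIPsIfShutDown is a hypothetical sketch: it drops the reported IPs
// only once every container in the pod has a Terminated state. Whether this
// is safe depends on the answers to the two questions above.
func clearPodIPsIfShutDown(status *v1.PodStatus) {
    if len(status.ContainerStatuses) == 0 {
        // No container information yet; leave the status untouched.
        return
    }
    for _, cs := range status.ContainerStatuses {
        if cs.State.Terminated == nil {
            // At least one container is not confirmed shut down.
            return
        }
    }
    status.PodIP = ""
    status.PodIPs = nil
}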

@aojea
Member

aojea commented May 12, 2022

/assign

@aojea
Member

aojea commented May 12, 2022

Are CNI plug-ins depending on pod status ips as the authoritative record of allocation?

The problem is the other way around: pod.status.PodIPs is authoritative for endpoints and endpointslices. With the reproducer I can see the CNI works fine; the pod and its network disappear.

Also, do we formally define CNI destroy as happening before that release?

After StopPodSandbox there should not be any IP assigned to that Pod:

// StopPodSandbox stops any running process that is part of the sandbox and
// reclaims network resources (e.g., IP addresses) allocated to the sandbox.
// If there are any running containers in the sandbox, they must be forcibly
// terminated.
// This call is idempotent, and must not return an error if all relevant
// resources have already been reclaimed. kubelet will call StopPodSandbox
// at least once before calling RemovePodSandbox. It will also attempt to
// reclaim resources eagerly, as soon as a sandbox is not needed. Hence,
// multiple StopPodSandbox calls are expected.
rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}

@aojea
Member

aojea commented May 13, 2022

The main difference in behavior is that evicted pods, since 1.22, show the IPs in their status.

Evict

1.23
$ kubectl get pods -o wide
NAME       READY   STATUS       RESTARTS   AGE     IP           NODE                 NOMINATED NODE   READINESS GATES
busybox    0/1     Completed    0          22m     10.244.0.5   kind-control-plane   <none>           <none>
busybox3   0/1     StartError   0          5m36s   10.244.0.6   kind-control-plane   <none>           <none>
evictme    0/1     Error        0          67s     10.244.0.7   kind-control-plane   <none>           <none>

---

1.21

$ kubectl get pods -o wide
NAME       READY   STATUS       RESTARTS   AGE     IP           NODE                NOMINATED NODE   READINESS GATES
busybox    0/1     Completed    0          15m     10.244.0.2   121-control-plane   <none>           <none>
busybox3   0/1     StartError   0          3m56s   10.244.0.6   121-control-plane   <none>           <none>
evictme    0/1     Evicted      0          63s     <none>       121-control-plane   <none>           <none>

The logic for endpoints and endpointslices is very simplistic and does not even consider this situation. It worked before because evicted pods weren't reporting IPs, but the current behavior of pods seems more correct and consistent with other states.

// ShouldPodBeInEndpointSlice returns true if a specified pod should be in an EndpointSlice object.
// Terminating pods are only included if includeTerminating is true
func ShouldPodBeInEndpointSlice(pod *v1.Pod, includeTerminating bool) bool {
    if len(pod.Status.PodIP) == 0 && len(pod.Status.PodIPs) == 0 {
        return false
    }
    if !includeTerminating && pod.DeletionTimestamp != nil {
        return false
    }
    if pod.Spec.RestartPolicy == v1.RestartPolicyNever {
        return pod.Status.Phase != v1.PodFailed && pod.Status.Phase != v1.PodSucceeded
    }
    if pod.Spec.RestartPolicy == v1.RestartPolicyOnFailure {
        return pod.Status.Phase != v1.PodSucceeded
    }
    return true
}

    if tolerateUnreadyEndpoints || podutil.IsPodReady(pod) {
        subsets = append(subsets, v1.EndpointSubset{
            Addresses: []v1.EndpointAddress{epa},
            Ports:     ports,
        })
        readyEps++
    } else if shouldPodBeInEndpoints(pod) {
        klog.V(5).Infof("Pod is out of service: %s/%s", pod.Namespace, pod.Name)
        subsets = append(subsets, v1.EndpointSubset{
            NotReadyAddresses: []v1.EndpointAddress{epa},
            Ports:             ports,
        })
        notReadyEps++
    }
    return subsets, readyEps, notReadyEps
}

func shouldPodBeInEndpoints(pod *v1.Pod) bool {
    switch pod.Spec.RestartPolicy {
    case v1.RestartPolicyNever:
        return pod.Status.Phase != v1.PodFailed && pod.Status.Phase != v1.PodSucceeded
    case v1.RestartPolicyOnFailure:
        return pod.Status.Phase != v1.PodSucceeded
    default:
        return true
    }
}

(We should centralize this logic for both controllers too.) cc: @robscott @thockin

On another note, I don't think that tolerateUnreadyEndpoints should mean "add any podIP that I find on the pods". Is there any reason to have terminal pods' IPs on endpoints?
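To illustrate the idea (a hypothetical sketch, not necessarily what the eventual PR does), a centralized helper shared by both controllers could exclude terminal pods regardless of restart policy:

package endpointsketch

import v1 "k8s.io/api/core/v1"

// podEligibleForEndpoints is a hypothetical, centralized version of the two
// checks above. A pod in a terminal phase has (or will have) its sandbox torn
// down and its IP released, possibly to an unrelated pod, so it is never
// published, whatever its restart policy is.
func podEligibleForEndpoints(pod *v1.Pod, includeTerminating bool) bool {
    // Terminal pods: the IP may already be reused by another pod.
    if pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded {
        return false
    }
    // Pods without an IP cannot receive traffic.
    if len(pod.Status.PodIP) == 0 && len(pod.Status.PodIPs) == 0 {
        return false
    }
    // Terminating pods are only included when the caller asks for them.
    if !includeTerminating && pod.DeletionTimestamp != nil {
        return false
    }
    return true
}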

I will send a PR soon
