TLS-bootstrapped kubelet loses client certs after reboot, node stays on NotReady status #53288
Comments
@hubo2005
/sig-autoscaling
@kubernetes/sig-auth-bugs
What is the full invocation of the …?
I have the same issue. /var/run/kubernetes/kubelet-client.crt and /var/run/kubernetes/kubelet-client.key get deleted when I reboot the node. If I only restart the kubelet (without rebooting), it works fine.
Did the restart occur after an upgrade from 1.7.x, or was the kubelet at 1.8 before and after the restart?
Kubelet 1.8 is the only version on my node (before and after the restart).
cc @kubernetes/sig-cluster-lifecycle-bugs as this bug affects kubeadm a bit as well
This is a clean 1.8 installation with kubeadm, not an upgrade from any version. @liggitt
@hubo2005 Thank you. It turns out that there is a bug in a feature kubeadm uses (kubelet client certificate rotation). Meanwhile, can you remove the flag? We might release new debs tomorrow or so removing that flag, as we don't want to enable a feature that does not work. We also hope that we can fix the underlying cert rotation issue in v1.8.1, but that is not clear yet... Thanks again for reporting this bug!
@luxas it's my pleasure.
@hubo2005 I'm working on recreating the scenario as well, but just to be sure, did you run …?
actually, I think this is an issue with the default `--cert-dir` pointing at the transient location `/var/run/kubernetes`, which is erased on reboot on many platforms.

In short, the issue is: TLS bootstrapping writes the client cert/key files to the `--cert-dir` path, and writes a kubeconfig referencing those files to the `--kubeconfig` path. On reboot, the `--cert-dir` contents are removed, while the kubeconfig persists.

That leaves us with the observed state: a persistent kubeconfig referencing cert/key files that no longer exist, and a kubelet that skips bootstrapping (because `--kubeconfig` exists) and logs errors rather than exiting.

In 1.7, this was still occurring with the server cert/key, but the client certificate/key data was inlined in the generated kubeconfig, rather than being placed in cert/key files in the specified `--cert-dir`, so the client credentials survived the reboot.

The most obvious immediate resolution would be to set `--cert-dir` to a non-transient location.

Long-term, I'd expect code fixes such as defaulting `--cert-dir` to a non-transient location.
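A quick way to test a node for this broken state is to check whether the kubeconfig's cert/key entries still point at files that exist. A minimal sketch (the helper name is mine, and the `sed` extraction assumes the plain-path `client-certificate:`/`client-key:` kubeconfig fields rather than inline `-data` fields):

```shell
# Sketch: list cert/key paths referenced by a kubelet kubeconfig that are
# missing on disk. Assumes file-path fields, not inline base64 "-data" fields.
check_kubelet_conf() {
  conf="$1"
  sed -n 's/^ *client-certificate: *//p; s/^ *client-key: *//p' "$conf" |
  while IFS= read -r f; do
    # Report each referenced credential file that no longer exists
    [ -e "$f" ] || echo "missing: $f"
  done
}
```

Running e.g. `check_kubelet_conf /etc/kubernetes/kubelet.conf` and seeing any `missing:` output means the kubelet cannot load its client credentials and will keep logging TLS errors until it is allowed to bootstrap again.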
|
I was seeing this too. I can confirm setting `--cert-dir` to a non-transient dir, deleting `/etc/kubernetes/kubelet.conf` and restarting results in a working node.
@liggitt so setting `--cert-dir` to a non-transient dir is the workaround for now? Thanks for the digging, @liggitt!
Yes.
Opened several PRs:
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Add to 1.8 known issues #53004 (comment) #53288
Automatic merge from submit-queue (batch tested with PRs 53317, 52186). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Change default --cert-dir for kubelet to a non-transient location

The default kubelet `--cert-dir` location is `/var/run/kubernetes`, which is automatically erased on reboot on many platforms. Since, as of 1.8.0, kubelet TLS bootstrapping and cert rotation persist files in `--cert-dir`, it should default to a non-transient location. Default it to the `pki` subfolder of the default `--root-dir`. Fixes #53288

Additionally, since `kubeadm` expects a running (albeit crashlooping) kubelet prior to running `kubeadm init` or `kubeadm join`, and was using the default `--root-dir` of `/var/lib/kubelet`, it should not expect that folder to be empty as a pre-init check. Fixes #53356

```release-note
kubelet: `--cert-dir` now defaults to `/var/lib/kubelet/pki`, in order to ensure bootstrapped and rotated certificates persist beyond a reboot.
```
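The defaulting rule described in that PR (the `pki` subfolder of `--root-dir`) can be sketched as a tiny shell helper; the function name is mine and is only illustrative, not part of the kubelet:

```shell
# Sketch of the new defaulting: --cert-dir falls back to <root-dir>/pki,
# where root-dir itself defaults to /var/lib/kubelet.
default_cert_dir() {
  root_dir="${1:-/var/lib/kubelet}"
  echo "${root_dir}/pki"
}
```

So an untouched kubelet ends up persisting its bootstrapped and rotated certificates under `/var/lib/kubelet/pki`, which survives reboots.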
Using kubeadm here and the proposed solution (referenced after this) hasn't worked for me:

I'm getting … I tried making the kubelet bootstrap itself again using … Any recommendations?

EDIT: Since this was a production server (and today is Sunday) I had to find a way to get it back ASAP.

However, I have the feeling that if this node reboots I'll have to manually do all of this over again.
In case anyone else is facing similar issues on an older kubeadm cluster, I fixed this by doing:
How I wish I had found this issue yesterday.
I just copied `/var/lib/kubelet/pki/kubelet.crt` and `/var/lib/kubelet/pki/kubelet.key` from other nodes.
`kubelet --version`
Updated description:

Problem

The default `--cert-dir` path points to a transient location (`/var/run/kubernetes`) that is removed on reboot. The default `--kubeconfig` location points to a persistent location. This means that after a reboot, a kubelet can be stuck with configuration that is missing credentials needed to reach the API server.

Solution

For `kubeadm` users: new packages (deb `1.8.0-01`, rpm `1.8.0-1`) have been released with a config change that prevents this issue from occurring.

If your `/etc/kubernetes/kubelet.conf` file references `/var/run/kubernetes/kubelet-client.crt`, and the `/var/run/kubernetes/kubelet-client.crt` file does not exist:

1. Delete the `/etc/kubernetes/kubelet.conf` file
2. Set the kubelet `--cert-dir` flag to a non-transient location, like `--cert-dir=/var/lib/kubelet/pki`
3. Restart the kubelet so it bootstraps again
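For kubelets managed by systemd (as with kubeadm installs of that era), the `--cert-dir` setting above could be supplied via a unit drop-in. The drop-in filename and environment variable below are illustrative assumptions; the exact unit layout and how extra args reach `ExecStart` vary by installer:

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/kubelet.service.d/20-cert-dir.conf
# (filename and variable name are assumptions; adapt to your unit's ExecStart)
[Service]
Environment="KUBELET_EXTRA_ARGS=--cert-dir=/var/lib/kubelet/pki"
```

After editing, `systemctl daemon-reload && systemctl restart kubelet` lets the kubelet bootstrap fresh certificates into the persistent directory.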
Details

In 1.8.0, TLS bootstrapping (using `--bootstrap-kubeconfig`) writes certs to the `--cert-dir` path and a kubeconfig referencing those certs to the `--kubeconfig` path. That means that on reboot:

- `--bootstrap-kubeconfig` is ignored, because `--kubeconfig` exists
- the certs referenced by `--kubeconfig` cannot be loaded, which causes the kubelet to log an error (but not exit)

In 1.7, TLS bootstrapping inlined the certificates in the written `--kubeconfig`, which sidestepped the issue with the `--cert-dir` getting removed.

Original description:
/sig-autoscaling
/kind bug
What happened:
I have 1 master (174) and 2 nodes (175, 171). I restarted one of the nodes (171); when it came back, the kubelet on this server did not start, with errors:
When I look into the `/var/run/kubernetes` directory, two files are missing compared to pre-restart:
And after restarting
What you expected to happen:
Isn't the node supposed to come back to "Ready" status automatically?
How to reproduce it (as minimally and precisely as possible):
Just powering off / restarting the node reproduces the issue.
Anything else we need to know?:
Environment:

Kubernetes version (use `kubectl version`):
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"6e937839ac04a38cac63e6a7a306c5d035fe7b0a", GitTreeState:"clean", BuildDate:"2017-09-28T22:57:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"6e937839ac04a38cac63e6a7a306c5d035fe7b0a", GitTreeState:"clean", BuildDate:"2017-09-28T22:46:41Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration:
Bare Metal - Self hosted VMs
OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="16.04.2 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.2 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Kernel (e.g. `uname -a`):
Linux kube171 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
kubeadm, kubectl, etc. Nothing special
Others:
None