
kubeadm 1.8.0 init fails with "/var/lib/kubelet is not empty" #53356

Closed
wjrogers opened this issue Oct 3, 2017 · 43 comments · Fixed by #53317
Labels
area/kubeadm kind/bug Categorizes issue or PR as related to a bug. milestone/needs-approval priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

Comments

@wjrogers commented Oct 3, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
On a fresh Ubuntu 16.04.3 system booted from the official cloud image, kubeadm init fails because /var/lib/kubelet exists.

root@kubemaster:~# kubeadm init
[kubeadm] WARNING: kubeadm is in beta, please do not use it for production clusters.
[init] Using Kubernetes version: v1.8.0
[init] Using Authorization modes: [Node RBAC]
[preflight] Running pre-flight checks
[preflight] Some fatal errors occurred:
        /var/lib/kubelet is not empty
[preflight] If you know what you are doing, you can skip pre-flight checks with `--skip-preflight-checks`

What you expected to happen:
kubeadm successfully initializes the cluster

How to reproduce it (as minimally and precisely as possible):

  1. Boot a new VM from the latest Ubuntu Cloud image
  2. apt-get install -y apt-transport-https docker.io
  3. Follow the kubeadm installation instructions
  4. kubeadm init
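
For reference, on a fresh image the reproduction roughly amounts to the following; the repository lines follow the kubeadm installation docs of the time, so adjust them if they have moved:

```bash
# Fresh Ubuntu 16.04 cloud image
apt-get update
apt-get install -y apt-transport-https docker.io

# Kubernetes apt repository, per the kubeadm installation instructions
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" \
  > /etc/apt/sources.list.d/kubernetes.list
apt-get update
apt-get install -y kubelet kubeadm kubectl

# Fails at preflight with "/var/lib/kubelet is not empty"
kubeadm init
```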

Anything else we need to know?:
Contents of /var/lib/kubelet:

/var/lib/kubelet
/var/lib/kubelet/pki
/var/lib/kubelet/pki/kubelet.crt
/var/lib/kubelet/pki/kubelet.key

Environment:

  • Kubernetes version (use kubectl version):
    root@kubemaster:~# kubectl version
    Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"6e937839ac04a38cac63e6a7a306c5d035fe7b0a", GitTreeState:"clean", BuildDate:"2017-09-28T22:57:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
    The connection to the server localhost:8080 was refused - did you specify the right host or port?
    
    root@kubemaster:~# apt search kube
    Sorting... Done
    Full Text Search... Done
    kubeadm/kubernetes-xenial,now 1.8.0-01 amd64 [installed]
      Kubernetes Cluster Bootstrapping Tool
    
    kubectl/kubernetes-xenial,now 1.8.0-00 amd64 [installed,automatic]
      Kubernetes Command Line Tool
    
    kubelet/kubernetes-xenial,now 1.8.0-00 amd64 [installed,automatic]
      Kubernetes Node Agent
    
    kubernetes-cni/kubernetes-xenial,now 0.5.1-00 amd64 [installed,automatic]
      Kubernetes CNI
    
  • Cloud provider or hardware configuration: Hyper-V generation 1 virtual machine
  • OS (e.g. from /etc/os-release):
    NAME="Ubuntu"
    VERSION="16.04.3 LTS (Xenial Xerus)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 16.04.3 LTS"
    VERSION_ID="16.04"
    HOME_URL="http://www.ubuntu.com/"
    SUPPORT_URL="http://help.ubuntu.com/"
    BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
    VERSION_CODENAME=xenial
    UBUNTU_CODENAME=xenial
    
  • Kernel (e.g. uname -a): Linux kubemaster 4.4.0-96-generic #119-Ubuntu SMP Tue Sep 12 14:59:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: none
  • Others: none
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 3, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 3, 2017
@liggitt (Member) commented Oct 3, 2017

Cause

This is related to the location where the kubelet persists its certificates while running in the background, waiting for config:

Since kubeadm expects there to be a running kubelet prior to kubeadm init being called, it shouldn't expect the kubelet's --root-dir folder to be empty.

Workaround

If you are scripting the bootstrap of a known-clean machine, there are a few possible workarounds until #53317 is released in 1.8.1 (any of the following works around this issue):

  • verify this is the only preflight check failure, then run the init or join command with --skip-preflight-checks=true
  • stop the kubelet service and remove /var/lib/kubelet/pki prior to running the init or join command (sketched below)
  • run kubeadm reset prior to running init or join
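
For scripted bootstraps, here is a minimal shell sketch of the second workaround; it assumes a systemd-managed kubelet and the default --root-dir of /var/lib/kubelet, so adjust the service name and paths if yours differ:

```bash
#!/usr/bin/env bash
# Minimal sketch of the "stop kubelet, remove pki, then init/join" workaround.
# Assumptions: systemd-managed kubelet, default --root-dir of /var/lib/kubelet.
set -euo pipefail

# Stop the crash-looping kubelet so it cannot recreate the pki directory.
systemctl stop kubelet

# Remove the certificates the kubelet wrote while waiting for config.
rm -rf /var/lib/kubelet/pki

# The "/var/lib/kubelet is not empty" preflight check now passes.
kubeadm init   # or: kubeadm join <args>
```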

Resolution

addressed as part of #53317

@liggitt (Member) commented Oct 3, 2017

cc @kubernetes/sig-cluster-lifecycle-bugs @luxas

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. kind/bug Categorizes issue or PR as related to a bug. labels Oct 3, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 3, 2017
@surajssd (Member) commented Oct 3, 2017

@wjrogers I ran kubeadm reset && kubeadm init and it worked for me!

@liggitt (Member) commented Oct 3, 2017

@wjrogers I ran kubeadm reset && kubeadm init and it worked for me!

On a known clean environment, that's an acceptable workaround for the moment, but the reason kubeadm init checks for existing files/folders is to avoid stomping on an existing installation, and running kubeadm reset would negate those checks.

@jpetazzo (Contributor) commented Oct 3, 2017

(If you wonder why the deployment scripts that you wrote yesterday stopped working this morning – this is why! :-))

For the time being, before running kubeadm init or kubeadm join, I'll:

  • check if /etc/kubernetes/kubelet.conf or /etc/kubernetes/admin.conf exists
  • if they don't exist, stop kubelet, and wipe out /var/lib/kubelet/pki

I'd love to know if there are foreseeable pitfalls in this approach.

Thanks! <3

@liggitt (Member) commented Oct 3, 2017

(If you wonder why the deployment scripts that you wrote yesterday stopped working this morning – this is why! :-))

Apologies for the churn. We traded a data-loss issue on reboot for this false-positive fail-safe issue. We are working with @kubernetes/sig-cluster-lifecycle-bugs to determine the best way to resolve this.

@jpetazzo (Contributor) commented Oct 3, 2017

No worries, and thanks for the super-quick turnaround time. Much appreciated!
(I didn't mean to sound snarky or anything; sorry if that's how it came out!)

@liggitt liggitt added this to the v1.8 milestone Oct 3, 2017
@liggitt liggitt added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Oct 3, 2017
@vglisin commented Oct 3, 2017

These guys are posting new versions without properly testing them. Can you imagine a brand-new CentOS 7 installation and then getting: "kubeadm 1.8.0 init fails with "/var/lib/kubelet is not empty""?
How disgusting that was, and what was I supposed to think about responsibility and professionalism here?

@vglisin commented Oct 3, 2017

Everything was done per https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/, but now it gets even better: The HTTP call equal to 'curl -sSL http://localhost:10255/healthz' failed with error: Get http://localhost:10255/healthz: dial tcp [::1]:10255: getsockopt: connection refused
You should remove 1.8 and send it "back to the drawing board".
It is useless even to test this 1.8.

@liggitt (Member) commented Oct 3, 2017

I apologize for the issue. Our top priority was to prevent data loss and issues upgrading existing installations that would disrupt already-running workloads. We are working now to resolve this issue with fresh installations.

@vglisin commented Oct 3, 2017

Wish you luck.

@naseemkullah

Is there an ETA for a fix for this? Should I hard-code the deletion of the contents of /var/lib/kubelet prior to running kubeadm init in my automation scripts, or will simply adding the ignore-preflight-checks argument do the trick?

@liggitt (Member) commented Oct 3, 2017

There are plenty of other pre-flight checks that are valuable and that you don't want to skip; it's just the check for an empty /var/lib/kubelet that is incorrect.

If you are scripting the bootstrap of a known-clean machine, there are a couple of possible workarounds until #53317 is released in 1.8.1:

  • run kubeadm init/kubeadm join skipping preflight checks
  • stop the kubelet service, remove /var/lib/kubelet/pki, then run kubeadm init/kubeadm join

@naseemkullah

OK, just to confirm: moving forward, ideally kubeadm init should not be checking for contents in /var/lib/kubelet (i.e. having contents is totally fine), or should installing the kubelet not populate the contents of /var/lib/kubelet?

@liggitt (Member) commented Oct 3, 2017

OK, just to confirm: moving forward, ideally kubeadm init should not be checking for contents in /var/lib/kubelet (i.e. having contents is totally fine)

Correct. The kubeadm instructions start the kubelet and let it run in a crash loop in the background, waiting for config. In that state, the kubelet is free to write to its state directory (/var/lib/kubelet), so kubeadm should not require that directory to be empty in order to run kubeadm init.

@jpetazzo (Contributor) commented Oct 3, 2017

(Reposting my earlier tip since it worked like a charm for my automated deployment scripts.)

For the time being, before running kubeadm init or kubeadm join, I'll:

  • check if /etc/kubernetes/kubelet.conf or /etc/kubernetes/admin.conf exists
  • if they don't exist, stop kubelet (systemctl stop kubelet), and wipe out /var/lib/kubelet/pki (rm -rf /var/lib/kubelet/pki)

This might or might not be appropriate for your use cases, but for mine it works great (I'm automatically deploying hundreds of k8s clusters for training purposes). Stopping the kubelet is necessary to avoid race conditions where it would recreate the pki directory before you run kubeadm.
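
A rough shell rendering of that approach (a sketch under the same assumptions as the comment above: a systemd-managed kubelet and kubeadm-generated kubeconfigs in /etc/kubernetes; not an official kubeadm recipe):

```bash
#!/usr/bin/env bash
# Only wipe the kubelet's pki directory on machines that have not yet been
# initialized or joined (i.e. no kubeadm-generated kubeconfigs are present).
set -euo pipefail

if [ ! -f /etc/kubernetes/kubelet.conf ] && [ ! -f /etc/kubernetes/admin.conf ]; then
  # Stop the kubelet first to avoid the race where it recreates pki/.
  systemctl stop kubelet
  rm -rf /var/lib/kubelet/pki
fi

# Then run "kubeadm init" on the master, or "kubeadm join ..." on a node.
```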

@liggitt (Member) commented Oct 4, 2017

After that I tried weave; it failed with an error
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
The connection to the server localhost:8080 was refused

@hemaprasad, be sure to follow the instructions to add the admin kubeconfig file so kubectl can communicate with the initialized API server:

To start using your cluster, you need to run (as a regular user):
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

@liggitt (Member) commented Oct 4, 2017

For anyone following this thread, if you encounter issues other than the /var/lib/kubelet is not empty message, please report them in a separate issue (after searching for existing reports), to ensure they are triaged and routed appropriately and resolved as quickly as possible. Thanks.

@hemaprasad

@liggitt, thank you for your quick response.
Now I'm getting:

kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
Unable to connect to the server: Forbidden
W1004 20:34:18.925025 20485 factory_object_mapping.go:423] Failed to download OpenAPI (Get https://192.168.1.3:6443/swagger-2.0.0.pb-v1: Forbidden), falling back to swagger
Unable to connect to the server: Forbidden

@jpetazzo (Contributor) commented Oct 4, 2017

@hemaprasad: you seem to be hitting a totally different problem than the one related to /var/lib/kubelet/pki; can you please open a different issue? Thank you so much!

@hemaprasad

@jpetazzo, I also faced the same '/var/lib/kubelet is not empty' issue. After I used the command "kubeadm reset && kubeadm init" it went into a hung state (if you want, you can see my threads above). To move on to the next step I tried to install Weave, but it failed with:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
The connection to the server localhost:8080 was refused - did you specify the right host or port?
W1004 18:53:33.930337 16144 factory_object_mapping.go:423] Failed to download OpenAPI (Get http://localhost:8080/swagger-2.0.0.pb-v1: dial tcp [::1]:8080: getsockopt: connection refused), falling back to swagger
The connection to the server localhost:8080 was refused - did you specify the right host or port?

Can you please help me solve this issue? Thanks in advance.

k8s-github-robot pushed a commit that referenced this issue Oct 4, 2017
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add /var/lib/kubelet error to known issues

Document #53356 in known issues for 1.8.0
k8s-github-robot pushed a commit that referenced this issue Oct 4, 2017
Automatic merge from submit-queue (batch tested with PRs 53317, 52186). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Change default --cert-dir for kubelet to a non-transient location

The default kubelet `--cert-dir` location is `/var/run/kubernetes`, which is automatically erased on reboot on many platforms. As of 1.8.0, kubelet TLS bootstrapping and cert rotation persist files in `--cert-dir`, so this should default to a non-transient location. Default it to the `pki` subfolder of the default `--root-dir`. Fixes #53288

Additionally, since `kubeadm` expects a running (albeit crashlooping) kubelet prior to running `kubeadm init` or `kubeadm join`, and was using the default `--root-dir` of `/var/lib/kubelet`, it should not expect that folder to be empty as a pre-init check. Fixes #53356

```release-note
kubelet: `--cert-dir` now defaults to `/var/lib/kubelet/pki`, in order to ensure bootstrapped and rotated certificates persist beyond a reboot.
```
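
For anyone who wants to confirm the new behaviour after upgrading, a small illustrative check (not part of the PR itself; it assumes the default --root-dir):

```bash
# With the fix, bootstrapped and rotated kubelet certificates persist under
# the kubelet's --root-dir instead of the transient /var/run/kubernetes.
ls -l /var/lib/kubelet/pki/

# The old location is wiped on reboot on many platforms.
ls -l /var/run/kubernetes/ 2>/dev/null || echo "nothing in /var/run/kubernetes"
```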
@liggitt (Member) commented Oct 4, 2017

The pick to the 1.8 branch is #53448.
Reopening to track this issue until it is resolved in 1.8.x.

@liggitt liggitt reopened this Oct 4, 2017
@jpetazzo (Contributor) commented Oct 4, 2017

@hemaprasad if you get "The connection to the server localhost:8080 was refused", it means that your kubeconfig file is missing. By default, kubectl tries to connect to localhost:8080. But when you deploy a cluster with kubeadm, the API will be listening on port 6443. So you should see an error relative to port 6443, not port 8080. If you see an error related to port 8080, it means that you haven't copied the configuration file (admin.conf) generated by kubeadm. I hope that makes sense.
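
For what it's worth, a quick way to check which endpoint kubectl will use, based on the admin.conf steps quoted earlier in the thread:

```bash
# With no kubeconfig, kubectl falls back to localhost:8080.
# After "kubeadm init", point it at the generated admin.conf instead:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# The server line should now show the master's address on port 6443:
grep server $HOME/.kube/config
kubectl get nodes
```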

My piece of advice would be to destroy the machines that you are currently using and restart from scratch (assuming you are using VMs), making sure that you follow each step carefully. Very often that has solved the problem for me, because I had forgotten a step or deviated from the instructions. Especially if something fails, it is often easier to restart from scratch until you fully understand the technology and each underlying command. Good luck!

@vglisin commented Oct 5, 2017

Hello hemaprasad,
The problem is that when people are experts, it is hard for them to put themselves in the shoes of people who are new to the same area. The explanations given by jpetazzo and liggitt are OK, but they step over some steps which they assume and which we as newbies don't know.
Now, to get Kubernetes operational we should all:
- stop the kubelet service, remove /var/lib/kubelet/pki, then run kubeadm init (without starting the kubelet service first)
The same goes for nodes:
- stop the kubelet service, remove /var/lib/kubelet/pki, then run kubeadm join (with a token saved from kubeadm init)
Now... how do I make the dashboard accessible from my workstation, which is outside the Kubernetes cluster?
I suppose for this I should go to some other thread?

@jpetazzo (Contributor) commented Oct 5, 2017

@vglisin your explanations are correct! As for your dashboard question, I have no idea, but I would recommend opening another GitHub issue for that (since the title of this issue is "kubeadm 1.8.0 init fails with "/var/lib/kubelet is not empty""). Thank you!

@vglisin commented Oct 5, 2017

Thank you for the fast reply. I am also trying these innocent commands from the documentation:
scp root@:/etc/kubernetes/admin.conf .
./kubectl --kubeconfig ./admin.conf get nodes
but I am getting:
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")

My guess is that this is also the reason why the dashboard over the proxy is not working.
Any help?

@jpetazzo (Contributor) commented Oct 5, 2017

@vglisin It looks like you are copying admin.conf from a remote host. If that is the case, you also need to edit admin.conf to point to that remote host (instead of localhost). The tutorials and documentation assume that people will run kubectl directly on the master.
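
One way to do that edit, sketched with kubectl's own config subcommands ("kubernetes" is the default cluster name kubeadm uses, and <master-ip> is a placeholder for your master's address; verify the name with config get-clusters first):

```bash
# Copy the admin kubeconfig from the master (fill in the master's address).
scp root@<master-ip>:/etc/kubernetes/admin.conf .

# Check the cluster name, then point its server entry at the master, not localhost.
kubectl --kubeconfig ./admin.conf config get-clusters
kubectl --kubeconfig ./admin.conf config set-cluster kubernetes \
  --server=https://<master-ip>:6443

kubectl --kubeconfig ./admin.conf get nodes
```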

@vglisin commented Oct 5, 2017

Nice... so why does the same documentation contain this:
"Controlling your cluster from machines other than the master
In order to get kubectl on some other computer (e.g. laptop) to talk to your cluster, you need to copy the administrator kubeconfig file from your master to your workstation like this..."

Never mind, it is obviously a long way to go until this works even in a test environment :(

@vglisin commented Oct 5, 2017

Hello all.
Finally the dashboard is working from the master node. Please do NOT use 1-2 GB of RAM on the master node; it will behave erratically and services which you expect to start normally... won't. Use at least 4 GB and also be patient. http://localhost:8001/ui/ will respond with a strangely formed URL, so please copy-paste this one:
http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/

That's for today. Hope this will help somebody.
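
For reference, a small sketch of the proxy-based access described above (it assumes the dashboard is deployed in kube-system under the service name used in that URL):

```bash
# On the master: start a local proxy to the API server (localhost:8001 by default).
kubectl proxy &

# Then open this full path in a browser rather than the short /ui/ redirect:
echo "http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/"
```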

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue Needs Approval

@liggitt @luxas @mikedanese @pipejakob @wjrogers @kubernetes/sig-cluster-lifecycle-bugs

Action required: This issue must have the status/approved-for-milestone label applied by a SIG maintainer.

Issue Labels
  • sig/cluster-lifecycle: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Help

@dkirrane

@jpetazzo I've tried your workaround

systemctl stop kubelet
rm -Rf /var/lib/kubelet/pki
kubeadm init

But then on kubeadm init I'm hitting this:

[kubelet-check] It seems like the kubelet isn't running or healthy.

@liggitt (Member) commented Oct 12, 2017

This is fixed in v1.8.1.

@liggitt liggitt closed this as completed Oct 12, 2017
@dkirrane

After installing 1.8.1 kubelet won't start:

kubelet: error: unable to load client CA file /etc/kubernetes/pki/ca.crt: open /etc/kubernetes/pki/ca.crt: no such file or directory

@vglisin commented Oct 17, 2017

Not only what you mentioned, dkirrane; there is also a problem with:
https://kubernetes.io/docs/admin/kubeadm/#manage-tokens
They never wrote ANY detail on how to add a new node to a cluster when your initial token (which is valid for ... who knows how long) has expired! Not a word. You can only test and try it yourself. Lovely beta troubleshooting job this is.
The documentation should be much, much better, especially for somebody not working on the Kubernetes dev team.

@vglisin commented Oct 17, 2017

If you try this, you will see that there is no way to use the IP address mentioned:
(Text is from: https://kubernetes.io/docs/admin/kubeadm/#manage-tokens)
"To implement this automation, you must know the IP address that the master will have after it is started.....
kubeadm can generate a token for you:
kubeadm token generate
Start both the master node and the worker nodes concurrently with this token."

Where is the mentioned IP address supposed to be used?
How do you join a node to the cluster in a useful way after the token has expired?
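
For anyone who lands here with the expired-token question, a hedged sketch of what generally works on kubeadm 1.8-era clusters; the exact flags can differ by version, so check kubeadm token --help and kubeadm join --help on your install (<token>, <master-ip> and <hash> are placeholders):

```bash
# On the master: the token created by "kubeadm init" has a TTL, so mint a new one.
kubeadm token create
kubeadm token list

# On the node being added: join with the fresh token and the master's API endpoint.
kubeadm join --token <token> <master-ip>:6443 \
  --discovery-token-ca-cert-hash sha256:<hash>
```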

@vglisin commented Oct 17, 2017

Hello, I solved the token problem.
The documentation really needs to be updated with a newbie's mindset in mind; otherwise it is a source of frustration. Currently the nodes are OK and operational.
If the above helps you... good.

@rajendragosavi

This issue is fixed in the latest version of Kubernetes. Just run kubeadm reset and remove /var/lib/etcd, as kubeadm expects it to be empty.

Then run kubeadm init.
It should work fine.
