
Kubernetes/GCE corrupted PD volume #11231

Closed · rojer opened this issue Jul 14, 2015 · 33 comments

Labels: priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments


rojer commented Jul 14, 2015

Somehow a GCE persistent disk got corrupted while used exclusively by a single pod on Container Engine. While testing, I repeatedly deleted and recreated the replication controller and its pod (a singleton) with 4 containers, all using a single PD in read-write mode.
Eventually the pod just refused to start, and the instance reported filesystem corruption in its logs.
This looks like a failure to sync/unmount the volume properly.
Here's /var/log/messages from the instance (there's nothing else going on in this tiny 2-node cluster, so the pod was being assigned to the same node repeatedly): http://pastebin.com/rH6jfHFH


rojer commented Jul 14, 2015

Well, wonderful: now it just destroyed all of my data. See that mkfs near the end? Well...

Some more log messages with more mkfs attempts: http://pastebin.com/sRS5RT8F

Now I have a consistent but completely empty volume.


thockin commented Jul 14, 2015

@saad-ali, since you're kind of the domain expert on PD now: we do run e2e tests over PD; maybe we can push them harder? Can you inspect for evidence of an improper umount?

@thockin thockin added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jul 14, 2015

thockin commented Jul 14, 2015

Making this P0 to triage and try to repro.

@thockin thockin removed the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 14, 2015
@thockin thockin added this to the v1.0 milestone Jul 14, 2015

thockin commented Jul 14, 2015

@swagiaal we should consider errors in safe_format_and_mount and not try to format... maybe


rojer commented Jul 14, 2015

This is my session on a temp instance with this disk attached: http://pastebin.com/cjCQnhUa


rojer commented Jul 14, 2015

Sorry, some more pastes to hopefully assist in debugging.
Last known successful launch of the pod (describe pod output): http://pastebin.com/FCTH1SSA
And the restart that killed it (note the failed mount at the end): http://pastebin.com/r2Czyrec

Commands in between:

rojer@nbt:~/go/src/cesanta.com/prod/cloud/demo/gce gce$ kubectl delete rc/cloud2
replicationcontrollers/cloud2
rojer@nbt:~/go/src/cesanta.com/prod/cloud/demo/gce gce$ kubectl create -f cloud.yml
replicationcontrollers/cloud2
rojer@nbt:~/go/src/cesanta.com/prod/cloud/demo/gce gce$ kubectl delete rc/cloud2
replicationcontrollers/cloud2
rojer@nbt:~/go/src/cesanta.com/prod/cloud/demo/gce gce$ kubectl create -f cloud.yml
replicationcontrollers/cloud2
rojer@nbt:~/go/src/cesanta.com/prod/cloud/demo/gce gce$ kubectl describe rc cloud2
Name: cloud2
Image(s): docker.cesanta.com:5000/cloud/dispatcher:demo2,docker.cesanta.com:5000/cloud/frontend:demo2,docker.cesanta.com:5000/cloud/registry:demo2,docker.cesanta.com:5000/cloud/metrics_db:demo2
Selector: app=cloud2,version=3
Labels: app=cloud2,version=3
Replicas: 1 current / 1 desired
Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
Tue, 14 Jul 2015 16:06:15 +0100 Tue, 14 Jul 2015 16:06:15 +0100 1 {replication-controller } successfulCreate Created pod: cloud2-f6mzl

@brendandburns

We use GCE's safe_format_and_mount command to mount the PD. If it ran mkfs when there was a legit filesystem there, then that's a bug in that script (not that we shouldn't fix it, just so that we know where to look).

The script is checked in here:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/cluster/saltbase/salt/helpers/safe_format_and_mount
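For reference, the core idea of such a helper is to probe for an existing filesystem and only run mkfs when none is found. The following is only a rough sketch of that idea, not the checked-in script; the arguments and the use of blkid here are assumptions for illustration:

#!/bin/sh
# Rough sketch only, NOT the actual safe_format_and_mount script:
# probe for an existing filesystem and format only when none is found.
DEV="$1"            # block device, e.g. the attached PD
MNT="$2"            # mount point
FSTYPE="${3:-ext4}"

if ! blkid "$DEV" >/dev/null 2>&1; then
  # blkid found no filesystem signature, so formatting should be safe.
  mkfs -t "$FSTYPE" "$DEV"
fi
mount -t "$FSTYPE" "$DEV" "$MNT"

The failure mode Brendan describes would be this kind of check wrongly concluding that there is no filesystem (for example, while the device is mid-detach) and formatting anyway.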


thockin commented Jul 14, 2015

Sure, I @'ed @swagiaal because there's a PR to internalize that logic.


@saad-ali

I'll try to repro this today.

@rojer What Kubernetes version did you experience this issue on? I'd like to see if you had this fix or not: #10169


rojer commented Jul 14, 2015

@saad-ali Whatever version Container Engine is running at the moment.


rojer commented Jul 14, 2015

The instance where all this happened is still up, and I can pull logs from it or even give you access to it.

@saad-ali

Thanks @rojer. Releases 0.21.0 onward all have #10169, so that means this is a separate issue.

Have you been able to repro the issue? Based on what you've said, repro steps should be:

  1. Create a replication controller with 1 replica for a pod with 4 containers all sharing a single GCE PD in RW mode.
  2. Change the replica size of the RC to 0.
  3. Repeat above steps in rapid succession.

And the symptoms are:

  • Pod fails to start because volume mount fails with exit status 32
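In kubectl terms, the repro loop would look roughly like the following sketch (assuming, as in the transcript above, that the RC is named cloud2 and defined in cloud.yml):

for i in 1 2 3 4 5; do
  kubectl create -f cloud.yml    # RC with 1 replica; the pod's 4 containers share one RW PD
  sleep 30                       # give the pod time to schedule and mount the PD
  kubectl delete rc/cloud2       # delete, then immediately recreate on the next iteration
done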

@saad-ali

Also, @rojer, could you please share your kubelet logs with me? They should be under /var/log/.


rojer commented Jul 14, 2015

@saad-ali I'd add some write activity to the PD, so that the FS is dirtied between attaching and detaching. My containers weren't doing much writing, but there was some.

kubelet.log.bz2: https://drive.google.com/file/d/0BxU-yGrhFVkkYzZHR2JNdlZyVzQ/view?usp=sharing

@saad-ali

Thanks @rojer.

The interesting bits from the logs are here: https://gist.github.com/saad-ali/5aa08ff382fba68bb88a

It looks like the PD is attached correctly, but a mount fails with:

E0714 11:30:37.192488    3122 mount_linux.go:103] Mount failed: exit status 32
Mounting arguments: /var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/cloud-disk /var/lib/kubelet/pods/a5ed9534-2a1b-11e5-a630-42010af09c49/volumes/kubernetes.io~gce-pd/data  [remount]
Output: mount: you must specify the filesystem type

Then all subsequent attaches continuously fail.
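That message is what mount prints when it is given no -t and cannot autodetect a filesystem on the device, which suggests the device looked blank at that point. For illustration, one way to provoke the same message on the util-linux versions of that era is to mount an unformatted device without -t (paths and the loop device here are arbitrary; run as root):

truncate -s 64M /tmp/blank.img
LOOP=$(losetup --find --show /tmp/blank.img)   # a block device with no filesystem on it
mkdir -p /mnt/blank
mount "$LOOP" /mnt/blank
# mount: you must specify the filesystem type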


rojer commented Jul 14, 2015

11:30 is not when the disaster happened. I had seen this mount error before but didn't pay attention; delete/create seemed to have solved it. The data loss happened later: at 15:23 +0100 it still worked, and at 16:06 +0100 the mount failed one last time. The machine's clock is UTC, so I think you need to subtract 1 from that.
See this update I posted earlier: #11231 (comment)

@saad-ali

Looks like it happened twice:

E0714 15:06:31.924995    3122 mount_linux.go:103] Mount failed: exit status 32
Mounting arguments: /var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/cloud-disk /var/lib/kubelet/pods/df9e0103-2a39-11e5-a630-42010af09c49/volumes/kubernetes.io~gce-pd/data  [remount]
Output: mount: you must specify the filesystem type

16:06 +0100 aligns with the second occurrence.


rojer commented Jul 14, 2015

Yes, that's the fatal incident. I did not re-create the PD in between; maybe automated fsck saved it. I do remember seeing what I now think was corruption (one of my containers complained about inconsistent on-disk state: one file was missing), but at the time I just blew away that part of the state and moved on.

@dchen1107

cc/ @zmerlynn on containervm image cut for 1.0.

@saad-ali

@rojer Can you re-confirm whether you had a read-write PD or a read-only PD? The logs seem to indicate that the kubelet called mounter.Interface.Mount (which it does for RO PDs) instead of calling safe_format_and_mount (which it does for RW PDs).


rojer commented Jul 15, 2015

RW.

Specifically, I have this in my RC config:

  volumes:
    - name: data
      gcePersistentDisk:
        pdName: cloud-disk
        fsType: ext4

and for each of the 4 containers:

      volumeMounts:
        - name: data
          mountPath: /data

@zmerlynn

@saad-ali: Are you sure about the version number? @rojer, when did you create the cluster? (I ask because we only recently started upgrading masters to 0.21.2, but if you created it last week sometime, you would've gotten 0.21.x as well.)

@zmerlynn

(This was a drive-by comment; I just wanted to make sure everyone understood that GKE may be somewhat lagged, especially on automatic master upgrades.)

@saad-ali

@zmerlynn Yes. The "Terminating GCE PD...detach verification. Another attach/detach call was made for this PD" errors were introduced in #10169 and show up in @rojer's kubelet.log. The fix was in 0.21.0, so 0.21.x would have it.


rojer commented Jul 15, 2015

$ gcloud --project cesanta-2015 beta container --zone europe-west1-c clusters describe cloud
clusterIpv4Cidr: 10.80.0.0/14
createTime: '2015-07-11T19:02:41+00:00'
currentMasterVersion: 0.21.2
currentNodeVersion: 0.21.1
endpoint: 104.155.47.32
initialClusterVersion: 0.21.1
initialNodeCount: 2
...

@zmerlynn

Cool, thanks. Just checking. :)

@saad-ali

The read-only bit should default to false if it is not specified, as in your example. However, I noticed that even when I explicitly set the read-only bit, it seems to be stored incorrectly internally; opened #11283 to track.

Regarding the filesystem type being empty, the fsType is loaded from the volume spec (same as the readOnly bit), so it being empty is pretty strange.

@saad-ali

I ran the PD E2E tests back to back on my cluster overnight. I wasn't able to repro this issue, but I discovered #11321. (The existing PD E2E tests create a PD, attach it to a node, mount it, write to it, unmount it, detach it, attach it to another node, read the data, unmount it, and detach it.)

I will write another E2E test that more closely mirrors @rojer's repro conditions. Specifically (a rough kubectl sketch follows the list):

  1. Create a pod with 4 containers all sharing a single GCE PD in RW mode always scheduled to the same node.
  2. Have one of the containers write data to the shared PD.
  3. Verify that the data is written correctly.
  4. Delete the pod, and immediately repeat above steps in rapid succession.
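A rough kubectl-level sketch of that loop (pod name, container names, and file paths are placeholders, not the actual E2E test code):

for i in $(seq 1 20); do
  kubectl create -f pd-test-pod.yaml                        # pod: 4 containers sharing one RW PD
  until kubectl get pod pd-test | grep -q Running; do sleep 5; done
  kubectl exec pd-test -c writer -- sh -c "echo iteration-$i > /data/marker"
  kubectl exec pd-test -c reader -- cat /data/marker        # expect: iteration-$i
  kubectl delete pod pd-test                                # delete and immediately repeat
done

If the bug reproduces, a later iteration either fails to mount (exit status 32) or comes up with an empty /data.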


rojer commented Jul 15, 2015

Thanks, Saad.
FWIW, and on the off-chance it makes a difference, I used the g1-small instance type for my "cluster".

@saad-ali

Ok, I've got what looks like a repro: PD is empty after pod is deleted and recreated.

I think what's happening here is that rapid deletion and recreation of the pod causes the PD to be unmounted and asynchronously detached while the new attach request comes in. The attach request terminates any further attempts to detach, but the detach request may still be pending on the GCE server side. The attach code sees that the disk is attached, happily declares it attached, and continues to mount it. In the meantime GCE decides to detach the drive, and that causes all sorts of screwy behavior, including the PD appearing empty to the container.

In my test I was able to avoid this behavior by waiting 2 minutes after deleting the pod before recreating it; this gave the PD enough time to detach.
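A rough sketch of that workaround with polling instead of a fixed wait (disk name and zone taken from this thread; relying on the users: field in gcloud's describe output is an assumption):

kubectl delete rc/cloud2
# Wait until GCE no longer lists any instance as a user of the disk, i.e. the detach completed:
while gcloud compute disks describe cloud-disk --zone europe-west1-c | grep -q '^users:'; do
  sleep 10
done
kubectl create -f cloud.yml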

I'll think through a fix.

@justinsb

This sure sounds like #11012; I added a test in #11128 but I haven't reproduced the problem yet.

@saad-ali

@justinsb I based my test off yours in #11128. But instead of switching nodes, I make sure to recreate a pod on the same node (immediately after deletion), and specify multiple containers. I'll send out a PR with the new test once I have a fix.

@dchen1107

Talked to @thockin and @saad-ali offline: this shouldn't be a blocker for the 1.0 cut today. cc/ @davidopp @brendandburns

But here is my concern, especially after reading @justinsb's comment above about AWS. Question: is the PD corrupted, and can users recover their data?
