
Kubernetes/GCE corrupted PD volume #11231

Closed · rojer opened this issue Jul 14, 2015 · 33 comments

Labels: priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments


rojer commented Jul 14, 2015

Somehow a GCE persistent disk got corrupted while used exclusively by a single pod on Container Engine. While testing, I repeatedly deleted and recreated the replication controller and its pod (a singleton) with 4 containers, all using a single PD in read-write mode.
Eventually the pod just refused to start, and the instance reported filesystem corruption in its logs.
This looks like a failure to sync/unmount the volume properly.
Here's /var/log/messages from the instance (there's nothing else going on in this tiny 2-node cluster, so the pod was being assigned to the same node repeatedly): http://pastebin.com/rH6jfHFH


rojer commented Jul 14, 2015

Well, wonderful: now it just destroyed all of my data. See that mkfs near the end? Well...

Some more log messages with more mkfs attempts: http://pastebin.com/sRS5RT8F

Now I have a consistent but completely empty volume.


thockin commented Jul 14, 2015

@saad-ali, since you're kind of the domain expert on PD now: we do run e2e tests over PD; maybe we can push them harder? Can you inspect for evidence of an improper umount?

@thockin thockin added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jul 14, 2015

thockin commented Jul 14, 2015

Making this P0 to triage and try to repro.

@thockin thockin removed the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 14, 2015
@thockin thockin added this to the v1.0 milestone Jul 14, 2015

thockin commented Jul 14, 2015

@swagiaal we should consider errors in safe_format_and_mount and not try to format... maybe


rojer commented Jul 14, 2015

This is my session on a temp instance with this disk attached: http://pastebin.com/cjCQnhUa


rojer commented Jul 14, 2015

Sorry, some more pastes to hopefully assist in debugging.
Last known successful launch of the pod (describe pod output): http://pastebin.com/FCTH1SSA
And the restart that killed it (note the failed mount at the end): http://pastebin.com/r2Czyrec

Commands in between:

rojer@nbt:~/go/src/cesanta.com/prod/cloud/demo/gce gce$ kubectl delete rc/cloud2
replicationcontrollers/cloud2
rojer@nbt:~/go/src/cesanta.com/prod/cloud/demo/gce gce$ kubectl create -f cloud.yml
replicationcontrollers/cloud2
rojer@nbt:~/go/src/cesanta.com/prod/cloud/demo/gce gce$ kubectl delete rc/cloud2
replicationcontrollers/cloud2
rojer@nbt:~/go/src/cesanta.com/prod/cloud/demo/gce gce$ kubectl create -f cloud.yml
replicationcontrollers/cloud2
rojer@nbt:~/go/src/cesanta.com/prod/cloud/demo/gce gce$ kubectl describe rc cloud2
Name: cloud2
Image(s): docker.cesanta.com:5000/cloud/dispatcher:demo2,docker.cesanta.com:5000/cloud/frontend:demo2,docker.cesanta.com:5000/cloud/registry:demo2,docker.cesanta.com:5000/cloud/metrics_db:demo2
Selector: app=cloud2,version=3
Labels: app=cloud2,version=3
Replicas: 1 current / 1 desired
Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
Tue, 14 Jul 2015 16:06:15 +0100 Tue, 14 Jul 2015 16:06:15 +0100 1 {replication-controller } successfulCreate Created pod: cloud2-f6mzl

@brendandburns

We use GCE's safe_format_and_mount command to mount the PD. If it ran mkfs when there was a legit filesystem there, then that's a bug in that script (not that we shouldn't fix it, just so that we know where to look).

The script is checked in here:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/cluster/saltbase/salt/helpers/safe_format_and_mount
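For reference, the core idea of such a helper is to probe for an existing filesystem and only run mkfs when none is found. The following is only a rough sketch of that idea, not the checked-in script; the arguments and the use of blkid here are assumptions for illustration:

#!/bin/sh
# Rough sketch only, NOT the actual safe_format_and_mount script:
# probe for an existing filesystem and format only when none is found.
DEV="$1"            # block device, e.g. the attached PD
MNT="$2"            # mount point
FSTYPE="${3:-ext4}"

if ! blkid "$DEV" >/dev/null 2>&1; then
  # blkid found no filesystem signature, so formatting should be safe.
  mkfs -t "$FSTYPE" "$DEV"
fi
mount -t "$FSTYPE" "$DEV" "$MNT"

The failure mode Brendan describes would be this kind of check wrongly concluding that there is no filesystem (for example, while the device is mid-detach) and formatting anyway.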


thockin commented Jul 14, 2015

Sure, I @'ed @swagiaal because there's a PR to internalize that logic.


@saad-ali

I'll try to repro this today.

@rojer What Kubernetes version did you experience this issue on? I'd like to see if you had this fix or not: #10169


rojer commented Jul 14, 2015

@saad-ali Whatever version Container Engine is running at the moment.


rojer commented Jul 14, 2015

The instance where all this happened is still up, and I can pull logs from it or even give you access to it.

@saad-ali

Thanks @rojer. Releases 0.21.0 onward all have #10169, so that means this is a separate issue.

Have you been able to repro the issue? Based on what you've said, repro steps should be:

  1. Create a replication controller with 1 replica for a pod with 4 containers all sharing a single GCE PD in RW mode.
  2. Change the replica size of the RC to 0.
  3. Repeat above steps in rapid succession.

And the symptoms are:

  • Pod fails to start because volume mount fails with exit status 32
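In kubectl terms, the repro loop would look roughly like the following sketch (assuming, as in the transcript above, that the RC is named cloud2 and defined in cloud.yml):

for i in 1 2 3 4 5; do
  kubectl create -f cloud.yml    # RC with 1 replica; the pod's 4 containers share one RW PD
  sleep 30                       # give the pod time to schedule and mount the PD
  kubectl delete rc/cloud2       # delete, then immediately recreate on the next iteration
done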

@saad-ali

Also, @rojer, could you please share your kubelet logs with me? They should be under /var/log/.


rojer commented Jul 14, 2015

@saad-ali I'd add some write activity to the PD, so that the FS is dirtied between attaching and detaching. My containers weren't doing much writing, but there was some.

kubelet.log.bz2: https://drive.google.com/file/d/0BxU-yGrhFVkkYzZHR2JNdlZyVzQ/view?usp=sharing

@saad-ali

Thanks @rojer.

The interesting bits from the logs are here: https://gist.github.com/saad-ali/5aa08ff382fba68bb88a

It looks like the PD is attached correctly, but a mount fails with:

E0714 11:30:37.192488    3122 mount_linux.go:103] Mount failed: exit status 32
Mounting arguments: /var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/cloud-disk /var/lib/kubelet/pods/a5ed9534-2a1b-11e5-a630-42010af09c49/volumes/kubernetes.io~gce-pd/data  [remount]
Output: mount: you must specify the filesystem type

Then all subsequent attaches continuously fail.
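That message is what mount prints when it is given no -t and cannot autodetect a filesystem on the device, which suggests the device looked blank at that point. For illustration, one way to provoke the same message on the util-linux versions of that era is to mount an unformatted device without -t (paths and the loop device here are arbitrary; run as root):

truncate -s 64M /tmp/blank.img
LOOP=$(losetup --find --show /tmp/blank.img)   # a block device with no filesystem on it
mkdir -p /mnt/blank
mount "$LOOP" /mnt/blank
# mount: you must specify the filesystem type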


rojer commented Jul 14, 2015

11:30 is not when the disaster happened. I had seen this mount error before but didn't pay attention; delete/create seemed to have solved it. The data loss happened later: at 15:23 +0100 it still worked, and at 16:06 +0100 the mount failed one last time. The machine's clock is UTC, so I think you need to subtract 1 from that.
See this update I posted earlier: #11231 (comment)

@saad-ali

Looks like it happened twice:

E0714 15:06:31.924995    3122 mount_linux.go:103] Mount failed: exit status 32
Mounting arguments: /var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/cloud-disk /var/lib/kubelet/pods/df9e0103-2a39-11e5-a630-42010af09c49/volumes/kubernetes.io~gce-pd/data  [remount]
Output: mount: you must specify the filesystem type

16:06 +0100 aligns with the second occurrence.


rojer commented Jul 14, 2015

Yes, that's the fatal incident. I did not re-create the PD in between; maybe automated fsck saved it. I do remember seeing what I now think was corruption (one of my containers complained about inconsistent on-disk state: one file was missing), but at the time I just blew away that part of the state and moved on.

@dchen1107

cc/ @zmerlynn on containervm image cut for 1.0.

@saad-ali

@rojer Can you re-confirm whether you had a read-write PD or a read-only PD? The logs seem to indicate that the kubelet called mounter.Interface.Mount (which it does for RO PDs) instead of calling safe_format_and_mount (which it does for RW PDs).


rojer commented Jul 15, 2015

RW.

Specifically, I have this in my RC config:

  volumes:
    - name: data
      gcePersistentDisk:
        pdName: cloud-disk
        fsType: ext4

and for each of the 4 containers:

      volumeMounts:
        - name: data
          mountPath: /data

@zmerlynn

@saad-ali: Are you sure about the version number? @rojer, when did you create the cluster? (I ask because we only recently started upgrading masters to 0.21.2, but if you created it last week sometime, you would've gotten 0.21.x as well.)

@zmerlynn

(This was a drive-by comment; I just wanted to make sure everyone understood that GKE may be somewhat lagged, especially on automatic master upgrades.)

@saad-ali

@zmerlynn Yes. The "Terminating GCE PD...detach verification. Another attach/detach call was made for this PD" errors were introduced in #10169 and show up in @rojer's kubelet.log. The fix was in 0.21.0, so 0.21.x would have it.


rojer commented Jul 15, 2015

$ gcloud --project cesanta-2015 beta container --zone europe-west1-c clusters describe cloud
clusterIpv4Cidr: 10.80.0.0/14
createTime: '2015-07-11T19:02:41+00:00'
currentMasterVersion: 0.21.2
currentNodeVersion: 0.21.1
endpoint: 104.155.47.32
initialClusterVersion: 0.21.1
initialNodeCount: 2
...

@zmerlynn

Cool, thanks. Just checking. :)

@saad-ali

The read-only bit should default to false if it is not specified, as in your example. However, I noticed that even when I explicitly set the read-only bit, it seems to be stored incorrectly internally; opened #11283 to track.

Regarding the filesystem type being empty, the fsType is loaded from the volume spec (same as the readOnly bit), so it being empty is pretty strange.

@saad-ali

I ran the PD E2E tests back to back on my cluster overnight. I wasn't able to repro this issue, but I discovered #11321. (The existing PD E2E tests create a PD, attach it to a node, mount it, write to it, unmount it, detach it, attach it to another node, read the data, unmount it, and detach it.)

I will write another E2E test that more closely mirrors @rojer's repro conditions. Specifically (a rough kubectl sketch follows the list):

  1. Create a pod with 4 containers all sharing a single GCE PD in RW mode always scheduled to the same node.
  2. Have one of the containers write data to the shared PD.
  3. Verify that the data is written correctly.
  4. Delete the pod, and immediately repeat above steps in rapid succession.
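A rough kubectl-level sketch of that loop (pod name, container names, and file paths are placeholders, not the actual E2E test code):

for i in $(seq 1 20); do
  kubectl create -f pd-test-pod.yaml                        # pod: 4 containers sharing one RW PD
  until kubectl get pod pd-test | grep -q Running; do sleep 5; done
  kubectl exec pd-test -c writer -- sh -c "echo iteration-$i > /data/marker"
  kubectl exec pd-test -c reader -- cat /data/marker        # expect: iteration-$i
  kubectl delete pod pd-test                                # delete and immediately repeat
done

If the bug reproduces, a later iteration either fails to mount (exit status 32) or comes up with an empty /data.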


rojer commented Jul 15, 2015

Thanks, Saad.
FWIW, and on the off-chance it makes a difference, I used the g1-small instance type for my "cluster".

@saad-ali

Ok, I've got what looks like a repro: PD is empty after pod is deleted and recreated.

I think what's happening here is that rapid deletion and recreation of the pod causes the PD to be unmounted and asynchronously detached while the new attach request comes in. The attach request terminates any further attempts to detach, but the detach request may still be pending on the GCE server side. The attach code sees that the disk is attached, happily declares it attached, and continues to mount it. In the meantime GCE decides to detach the drive, and that causes all sorts of screwy behavior, including the PD appearing empty to the container.

In my test I was able to avoid this behavior by waiting 2 minutes after deleting the pod before recreating it; this gave the PD enough time to detach.
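A rough sketch of that workaround with polling instead of a fixed wait (disk name and zone taken from this thread; relying on the users: field in gcloud's describe output is an assumption):

kubectl delete rc/cloud2
# Wait until GCE no longer lists any instance as a user of the disk, i.e. the detach completed:
while gcloud compute disks describe cloud-disk --zone europe-west1-c | grep -q '^users:'; do
  sleep 10
done
kubectl create -f cloud.yml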

I'll think through a fix.

@justinsb

This sure sounds like #11012; I added a test in #11128 but I haven't reproduced the problem yet.

@saad-ali

@justinsb I based my test off yours in #11128. But instead of switching nodes, I make sure to recreate a pod on the same node (immediately after deletion), and specify multiple containers. I'll send out a PR with the new test once I have a fix.

@dchen1107

Talked to @thockin and @saad-ali offline: this shouldn't be a blocker for the 1.0 cut today. cc/ @davidopp @brendandburns

But here is my concern, especially after reading @justinsb's comment above about AWS. Question: is the PD corrupted, and can users recover their data?
