Kubernetes/GCE corrupted PD volume #11231
well, wonderful - now it just destroyed all of my data. see that mkfs near the end? well... here are some more log messages with more "mkfs" attempts: http://pastebin.com/sRS5RT8F. now i have a consistent but completely empty volume. |
@saad-ali since you're kind of the domain expert on PD now. We do run e2e tests over PD - maybe we can push them harder? Can you inspect for evidence of improper umount? |
Making this p0 to triage and try to repro |
@swagiaal we should consider errors in safe_format_and_mount and not try to format... maybe |
this is my session on a temp instance with this disk attached: http://pastebin.com/cjCQnhUa |
sorry, some more pastes to hopefully assist in debugging. commands in between: rojer@nbt: |
We use GCE's safe_format_and_mount command to mount PDs. If it ran mkfs when there was a legit filesystem there, then that's a bug in that script (not that we shouldn't fix it, just so that we know where to look). The script is checked in here: |
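For reference, a minimal sketch of the general format-then-mount pattern such a script follows. This is not the actual GCE script; the device and mount-point names are placeholders:

```bash
#!/bin/bash
# Minimal sketch of a "safe format and mount" pattern (illustrative only,
# not the actual GCE safe_format_and_mount script).
set -e

DEV=/dev/sdb     # placeholder device
MNT=/mnt/pd0     # placeholder mount point

# Format only when no existing filesystem signature is found. Note the hazard
# discussed above: blkid also exits non-zero on probe errors, so an error here
# would be misread as "no filesystem" and fall through to mkfs, wiping data.
if ! blkid "${DEV}" >/dev/null 2>&1; then
    mkfs.ext4 -F "${DEV}"
fi

mkdir -p "${MNT}"
mount "${DEV}" "${MNT}"
```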
Sure, I @'ed @swagiaal because there's a PR to internalize that logic.
|
@saad-ali whatever container engine is running at the moment. |
the instance where all this happened is still up and i can pull logs or even give you access to it |
Thanks @rojer. Releases 0.21.0 onward all have #10169, so that means this is a separate issue. Have you been able to repro the issue? Based on what you've said, repro steps should be:
And the symptoms are:
|
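For concreteness, a rough command-level sketch of that delete/recreate cycle, using placeholder names (test-pd, pod-with-pd.yaml) rather than anything from the original report:

```bash
# Create a PD and a pod that mounts it read-write (names are placeholders).
gcloud compute disks create test-pd --size 10GB --zone europe-west1-c
kubectl create -f pod-with-pd.yaml

# Let the containers write some data to the PD, then delete and quickly
# recreate the pod so it lands on the same node.
kubectl delete -f pod-with-pd.yaml
kubectl create -f pod-with-pd.yaml

# After a few such cycles the mount starts failing, or the PD comes back empty.
```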
Also, @rojer, could you please share your kubelet logs with me? They should be under |
@saad-ali i'd add some write activity to the PD, so that FS is dirtied between attaching and detaching. my containers weren't doing much writing, but there was some. kubelet.log.bz2: https://drive.google.com/file/d/0BxU-yGrhFVkkYzZHR2JNdlZyVzQ/view?usp=sharing |
Thanks @rojer. The interesting bits from the logs are here: https://gist.github.com/saad-ali/5aa08ff382fba68bb88a. It looks like the PD is attached correctly, but a mount fails with:
Then all subsequent attaches continuously fail. |
11:30 is not when the disaster happened. i had seen this mount error before but did not pay attention - delete/create seemed to have solved it. data loss happened later - at 15:23 +0100 it still worked, at 16:06 +0100 mount failed one last time. the machine's clock is UTC, so i think you need to subtract 1 from that. |
Looks like it happened twice:
16:06 +0100 aligns with the second occurrence |
yes, that's the fatal incident. i did not re-create PD in between, maybe automated fsck saved it. i do remember seeing what i now think was corruption (one of my containers complained about inconsistent on-disk state - one file was missing), but at the time i just blew out that part of the state and moved on. |
cc/ @zmerlynn on containervm image cut for 1.0. |
@rojer Can you re-confirm if you had a Read-Write PD or a Read-Only PD? The logs seem to indicate that kubelet called |
RW. specifically, i have this in my RC config:
and for each of the 4 containers:
|
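For illustration, a hypothetical ReplicationController of the shape being described here (a singleton pod with several containers all sharing one read-write GCE PD). The names, image, and disk are placeholders, not the reporter's actual config:

```bash
cat <<'EOF' | kubectl create -f -
apiVersion: v1
kind: ReplicationController
metadata:
  name: example-rc
spec:
  replicas: 1
  selector:
    app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      volumes:
        - name: data
          gcePersistentDisk:
            pdName: test-pd
            fsType: ext4
            readOnly: false   # RW; this is also the default when omitted
      containers:
        - name: app-1
          image: busybox
          command: ["sh", "-c", "while true; do date >> /data/log; sleep 5; done"]
          volumeMounts:
            - name: data
              mountPath: /data
        # ...three more containers mounting the same volume, omitted for brevity
EOF
```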
(This was a drive-by comment; I just wanted to make sure everyone understood that GKE may be somewhat lagged, especially on automatic master upgrades.) |
$ gcloud --project cesanta-2015 beta container --zone europe-west1-c clusters describe cloud |
Cool, thanks. Just checking. :) |
The read-only bit should default to false if it is not specified, as in your example. However, I noticed that even when I explicitly set the read-only bit it seems to be stored incorrectly internally; opened #11283 to track. Regarding the filesystem type being empty, the |
I ran the PD E2E tests back to back on my cluster overnight. I wasn't able to repro this issue, but I discovered #11321. (The existing PD E2E tests create a PD, attach it to a node, mount it, write to it, unmount it, detach it, attach it to another node, read the data, unmount it, and detach it.) I will write another E2E test that more closely mirrors @rojer's repro conditions. Specifically:
|
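Not the actual E2E test (those live in the Go e2e suite), but a shell-level sketch of the repro conditions being described: rapid delete/recreate of a pod with some write activity on the PD in between. Pod and file names are placeholders:

```bash
# Rapidly delete and recreate a pod that writes to the PD, with no wait for
# the asynchronous detach to complete (placeholder names throughout).
for i in $(seq 1 20); do
    kubectl create -f pod-with-pd.yaml
    until kubectl get pod testpod | grep -q Running; do sleep 5; done
    kubectl exec testpod -- sh -c "echo iteration $i >> /data/marker"
    kubectl delete -f pod-with-pd.yaml   # unmount + async detach kicks off here
done

# Recreate once more and check whether the earlier writes survived.
kubectl create -f pod-with-pd.yaml
until kubectl get pod testpod | grep -q Running; do sleep 5; done
kubectl exec testpod -- cat /data/marker
```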
thanks, Saad. |
Ok, I've got what looks like a repro: the PD is empty after the pod is deleted and recreated. I think what's happening here is that the rapid deletion and recreation of the pod causes the PD to be unmounted and asynchronously detached while the new attach request comes in. The attach request terminates any further attempts to detach, but the detach request may still be pending on the GCE server side. The attach code sees that the disk is attached and happily declares it attached and continues to mount it. In the meantime GCE decides to detach the drive, and that causes all sorts of screwy behavior, including the PD appearing empty to the container. In my test I was able to avoid this behavior by waiting 2 minutes after deleting the pod before recreating it; this gave the PD enough time to detach. I'll think through a fix. |
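Given that explanation, one user-side stopgap is to confirm that the disk has actually detached before recreating the pod. A hedged sketch, assuming placeholder names and that the disk's users field in GCE reflects current attachment:

```bash
kubectl delete -f pod-with-pd.yaml

# Poll GCE until the disk is no longer attached to any instance.
while [ -n "$(gcloud compute disks describe test-pd \
        --zone europe-west1-c --format='value(users)')" ]; do
    echo "waiting for test-pd to detach..."
    sleep 10
done

kubectl create -f pod-with-pd.yaml
```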
Talked to @thockin and @saad-ali offline: this shouldn't be the blocker issue for the 1.0 cut today. cc/ @davidopp @brendandburns But here is my concern, especially after reading @justinsb's comment above about AWS. Question: Is the PD corrupted, and can the user recover their data? |
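One way to answer the recovery question is what @rojer did earlier in the thread: once the disk is detached from the node, attach it to a temporary instance and inspect it there. A rough sketch, with instance, disk, snapshot, and device names all illustrative:

```bash
# Take a snapshot first so any repair attempt can be undone.
gcloud compute disks snapshot test-pd --zone europe-west1-c \
    --snapshot-names test-pd-backup

# Attach the suspect PD to a throwaway instance for inspection.
gcloud compute instances attach-disk temp-debug-vm --disk test-pd \
    --zone europe-west1-c --device-name recovery

# On temp-debug-vm: look for a filesystem signature, then run a read-only fsck.
sudo blkid /dev/disk/by-id/google-recovery
sudo fsck.ext4 -n /dev/disk/by-id/google-recovery   # -n: check only, no changes
# Only after reviewing the output, attempt repairs with fsck.ext4 -p (or -y).
```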
Somehow a GCE persistent disk was corrupted when used exclusively by a single pod on Container Engine. While testing, i repeatedly deleted/created the replication controller and pod (a singleton) with 4 containers, all using a single PD in read-write mode.
Eventually the pod just refused to start, and the instance reported filesystem corruption in the logs.
This looks like a failure to sync/unmount the volume properly.
Here's /var/log/messages from the instance (there's nothing else going on in this tiny 2-node cluster, so the pod was being assigned to the same node repeatedly) - http://pastebin.com/rH6jfHFH