
gce: KCM detaches all in-tree volumes during update from K8s 1.20 to 1.21 #109354

Closed
ialidzhikov opened this issue Apr 7, 2022 · 26 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@ialidzhikov
Contributor

ialidzhikov commented Apr 7, 2022

What happened?

KCM wrongly detaches all in-tree volumes during update from K8s 1.20 to 1.21.

What did you expect to happen?

KCM not to detach all in-tree volumes during the update from K8s 1.20 to 1.21.

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a K8s 1.20.13 cluster.

  2. Create in-tree and out-of-tree StorageClasses.

    allowVolumeExpansion: true
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: default-intree
    parameters:
      type: pd-standard
    provisioner: kubernetes.io/gce-pd
    reclaimPolicy: Delete
    volumeBindingMode: Immediate
    ---
    allowVolumeExpansion: true
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      annotations:
        storageclass.kubernetes.io/is-default-class: "true"
      name: default
    parameters:
      type: pd-standard
    provisioner: pd.csi.storage.gke.io
    reclaimPolicy: Delete
    volumeBindingMode: WaitForFirstConsumer
  3. Create 3 StatefulSets with 4 replicas each (one StatefulSet uses the out-of-tree StorageClass, the other two use the in-tree one):

    apiVersion: v1
    kind: Service
    metadata:
      name: app1
      labels:
        app: app1
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app1
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app1
    spec:
      serviceName: app1
      replicas: 4
      selector:
        matchLabels:
          app: app1
      template:
        metadata:
          labels:
            app: app1
        spec:
          containers:
            - name: app1
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app1
                mountPath: /data
              livenessProbe:
                exec:
                  command:
                  - tail
                  - -n 1
                  - /data/out.txt
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app1
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 1Gi
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: app2
      labels:
        app: app2
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app2
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app2
    spec:
      serviceName: app2
      replicas: 4
      selector:
        matchLabels:
          app: app2
      template:
        metadata:
          labels:
            app: app2
        spec:
          containers:
            - name: app2
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app2
                mountPath: /data
              livenessProbe:
                exec:
                  command:
                  - tail
                  - -n 1
                  - /data/out.txt
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app2
        spec:
          storageClassName: default-intree
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 2Gi
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: app3
      labels:
        app: app3
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app3
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app3
    spec:
      serviceName: app3
      replicas: 4
      selector:
        matchLabels:
          app: app3
      template:
        metadata:
          labels:
            app: app3
        spec:
          containers:
            - name: app3
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app3
                mountPath: /data
              livenessProbe:
                exec:
                  command:
                  - tail
                  - -n 1
                  - /data/out.txt
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app3
        spec:
          storageClassName: default-intree
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 3Gi
  4. Update the cluster to K8s 1.21.10.

  5. Observe that kube-controller-manager detaches all in-tree volumes during the update:

5.1 kube-controller-manager marks all in-tree volumes as uncertain.

2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--589d5a8a-cf3a-4428-bd5f-4e03d1615e1e\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--7e6663f2-2446-45ee-bca3-a53771b7226b\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--6b11de8e-2115-42d6-8d26-c52bf97a1076\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--148e95ac-196f-4b59-a547-c01ab7ae3f2d\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--4f4d8963-b336-41c8-81ea-d1bbf25b61b9\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--ec3c3c19-5a62-4b29-b263-713da5a07d6e\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--ad094298-f42a-4f18-b453-9772ce21386b\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--53b113af-aaf2-4306-9c6f-817ca0de62eb\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--9a237184-7f03-451f-aad5-c85b09d6d580\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--26b788f6-03e5-4b39-a5f2-c20ef9a8a884\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}

5.2 6min after marking a lot of volume attachments as uncertain, KCM detaches the in-tree volumes.

2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--2bc75fc8-af96-4327-a9a7-cd648c41ec96\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--148e95ac-196f-4b59-a547-c01ab7ae3f2d\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--8ad228d7-187b-4972-ab4c-74e5a46c6ad8\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--26b788f6-03e5-4b39-a5f2-c20ef9a8a884\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--53b113af-aaf2-4306-9c6f-817ca0de62eb\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--4f4d8963-b336-41c8-81ea-d1bbf25b61b9\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--6b11de8e-2115-42d6-8d26-c52bf97a1076\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--7e6663f2-2446-45ee-bca3-a53771b7226b\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--9a237184-7f03-451f-aad5-c85b09d6d580\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--ad094298-f42a-4f18-b453-9772ce21386b\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--ec3c3c19-5a62-4b29-b263-713da5a07d6e\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--589d5a8a-cf3a-4428-bd5f-4e03d1615e1e\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
  6. After these detachments, the Node turns into an unhealthy state with reason FilesystemIsReadOnly:
  Normal   FilesystemIsReadOnly  48m                     kernel-monitor  Node condition ReadonlyFilesystem is now: True, reason: FilesystemIsReadOnly

The corresponding Pods fail with IO errors as their volumes are detached.

Anything else we need to know?

I see that in the ASW the volume is reported as attached:

I0407 06:37:53.956930       1 actual_state_of_world.go:507] Report volume "kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--44ee54d8-42d9-4ff6-8092-4a20d932c941" as attached to node "worker-1-z1-7b85b-9r2tb"

Then the volume attachment is marked as uncertain:

I0407 06:37:53.957975       1 attach_detach_controller.go:769] Marking volume attachment as uncertain as volume:"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--44ee54d8-42d9-4ff6-8092-4a20d932c941" ("worker-1-z1-7b85b-9r2tb") is not attached (Detached)

Note the diff in the volume name:

-kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--44ee54d8-42d9-4ff6-8092-4a20d932c941
+kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--44ee54d8-42d9-4ff6-8092-4a20d932c941

Kubernetes version

Update from K8s 1.20.13 to 1.21.10.


Cloud provider

OS version


Install tools

https://github.com/gardener/gardener

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

external-provisioner - k8s.gcr.io/sig-storage/csi-provisioner@v2.1.2
external-attacher - k8s.gcr.io/sig-storage/csi-attacher@v3.3.0
gcp-compute-persistent-disk-csi-driver - gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver@v1.3.4-gke.0

@ialidzhikov ialidzhikov added the kind/bug Categorizes issue or PR as related to a bug. label Apr 7, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 7, 2022
@ialidzhikov
Contributor Author

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 7, 2022
@ialidzhikov
Contributor Author

cc @Jiawei0227 @msau42 @jsafrane

@ialidzhikov
Contributor Author

ialidzhikov commented Apr 8, 2022

It seems that we build the ASW based on the Node .status.volumesAttached.

for _, node := range nodes {
	nodeName := types.NodeName(node.Name)
	for _, attachedVolume := range node.Status.VolumesAttached {
		uniqueName := attachedVolume.Name
		// The nil VolumeSpec is safe only in the case the volume is not in use by any pod.
		// In such a case it should be detached in the first reconciliation cycle and the
		// volume spec is not needed to detach a volume. If the volume is used by a pod, it
		// its spec can be: this would happen during in the populateDesiredStateOfWorld which
		// scans the pods and updates their volumes in the ActualStateOfWorld too.
		err = adc.actualStateOfWorld.MarkVolumeAsAttached(uniqueName, nil /* VolumeSpec */, nodeName, attachedVolume.DevicePath)
		if err != nil {
			klog.Errorf("Failed to mark the volume as attached: %v", err)
			continue
		}
		adc.processVolumesInUse(nodeName, node.Status.VolumesInUse)
		adc.addNodeToDswp(node, types.NodeName(node.Name))
	}
}


In K8s 1.20.13 the volumeName for an in-tree PV in the Node status looks like:

  volumesAttached:
  - devicePath: ""
    name: kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--21a16b25-edc8-4287-81dd-d4a71355a359

Note that the zone is UNSPECIFIED.

The corresponding PV topology labels are:

    labels:
      topology.kubernetes.io/region: europe-west1
      topology.kubernetes.io/zone: europe-west1-b

I see that this behaviour comes from external-provisioner@v2.1.2.

Using external-provisioner@v1.6.0 I see that no topology labels are set on the PV.


While in K8s 1.21 #97823 introduces the following change:

if zonesLabel == "" {
zonesLabel = pv.Labels[v1.LabelTopologyZone]
}

This results in a change in the volumeName (as the zone is part of the volumeName). The translated volumeName with 1.21 is:

kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--21a16b25-edc8-4287-81dd-d4a71355a359

Note that the zone is now set and no longer unspecified.

Having one volumeName in the ASW (coming from the Node status) and a different volumeName in the garbage collection logic for dangling VolumeAttachments for the same volume leads KCM to mark the volume as uncertain and to detach it.
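
To make the mismatch concrete, here is a minimal, self-contained sketch (my own reconstruction, not the actual csi-translation-lib code; the format string and label keys are assumptions based on the snippets above) of the two unique volume names that end up being compared:

package main

import "fmt"

// Hypothetical reconstruction of the zone handling described above; the real
// logic lives in the GCE PD translation of k8s.io/csi-translation-lib.
const volIDZonalFmt = "projects/%s/zones/%s/disks/%s" // assumed format

func translatedHandle(pvLabels map[string]string, pdName string, withTopologyFallback bool) string {
	zone := pvLabels["failure-domain.beta.kubernetes.io/zone"]
	if zone == "" && withTopologyFallback {
		// The #97823 change (K8s 1.21): fall back to the GA topology label.
		zone = pvLabels["topology.kubernetes.io/zone"]
	}
	if zone == "" {
		zone = "UNSPECIFIED"
	}
	return fmt.Sprintf(volIDZonalFmt, "UNSPECIFIED", zone, pdName)
}

func main() {
	// A PV provisioned by external-provisioner@v2.1.2 carries only the GA topology label.
	labels := map[string]string{"topology.kubernetes.io/zone": "europe-west1-b"}
	pd := "pv--21a16b25-edc8-4287-81dd-d4a71355a359"

	// Unique volume names have the form "<plugin>^<handle>"; the ASW is keyed by them.
	oldName := "kubernetes.io/csi/pd.csi.storage.gke.io^" + translatedHandle(labels, pd, false) // 1.20 translation, as recorded in Node .status.volumesAttached
	newName := "kubernetes.io/csi/pd.csi.storage.gke.io^" + translatedHandle(labels, pd, true)  // 1.21 translation
	fmt.Println(oldName)
	fmt.Println(newName)
	fmt.Println("match:", oldName == newName) // false -> the volume looks "not attached" and gets marked uncertain
}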


@Jiawei0227, as the author of #97823, I would appreciate it if you could confirm my observations.

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Apr 8, 2022
@xing-yang
Contributor

/assign @Jiawei0227
Jiawei, can you take a look?

@xing-yang
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 8, 2022
@msau42
Member

msau42 commented Apr 18, 2022

cc @mattcary @jingxu97 @gnufied

@Jiawei0227
Contributor

I think @ialidzhikov you might be right, this seems to be a problem...

@Jiawei0227
Contributor

Just to confirm, is csi migration enabled on the cluster?

@msau42
Member

msau42 commented Apr 18, 2022

One thing to investigate is under what circumstances the volume ID zone would be "UNSPECIFIED" on the node before the upgrade?

@ialidzhikov
Contributor Author

Just to confirm, is csi migration enabled on the cluster?

Yes, the CSI migration is enabled.

KCM@1.20.13 flags:

--feature-gates=CSIMigration=true,CSIMigrationGCE=true,CSIMigrationGCEComplete=true

KCM@1.21.10 flags:

--feature-gates=CSIMigration=true,CSIMigrationGCE=true,InTreePluginGCEUnregister=true

One thing to investigate is under what circumstances the volume ID zone would be "UNSPECIFIED" on the node before the upgrade?

I guess for in-tree volumes the corresponding handling is:

zonesLabel := pv.Labels[v1.LabelFailureDomainBetaZone]
zones := strings.Split(zonesLabel, labelMultiZoneDelimiter)
if len(zones) == 1 && len(zones[0]) != 0 {
	// Zonal
	volID = fmt.Sprintf(volIDZonalFmt, UnspecifiedValue, zones[0], pv.Spec.GCEPersistentDisk.PDName)
} else if len(zones) > 1 {
	// Regional
	region, err := getRegionFromZones(zones)
	if err != nil {
		return nil, fmt.Errorf("failed to get region from zones: %v", err)
	}
	volID = fmt.Sprintf(volIDRegionalFmt, UnspecifiedValue, region, pv.Spec.GCEPersistentDisk.PDName)
} else {
	// Unspecified
	volID = fmt.Sprintf(volIDZonalFmt, UnspecifiedValue, UnspecifiedValue, pv.Spec.GCEPersistentDisk.PDName)
}

If in K8s 1.20.13 the in-tree PV is missing the failure-domain.beta.kubernetes.io/zone label, then the zone will be UNSPECIFIED.

  • In-tree PVs created by the in-tree volume plugin have the failure-domain.beta.kubernetes.io/zone label set.
  • In-tree PVs created with external-provisioner@v1.6.0 and K8s 1.20.13 don't have any topology label set and have key topology.gke.io/zone in the PV nodeAffinity.
  • In-tree PVs created with external-provisioner@v2.1.2 and K8s 1.20.13 have the topology.kubernetes.io/zone label set. These are actually the PVs affected by this issue as for them KCM@1.20.13 uses UNSPECIFIED but KCM@1.21.10 properly translates the zone.
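
Before the upgrade, one way to check which PVs fall into that last bucket is to list them together with the label (a rough sketch using plain kubectl; adjust to your setup):

# Show the zone label for every PV (an empty column means the PV is not affected)
kubectl get pv -L topology.kubernetes.io/zone

# List only the PVs that carry the label
kubectl get pv -l topology.kubernetes.io/zone -o name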

@liggitt
Member

liggitt commented Apr 19, 2022

we need to be sure that any fix/change here doesn't regress in the opposite direction (detaching volumes that currently work in 1.21+)

@Jiawei0227
Contributor

@ialidzhikov, can you help me understand this?

So basically the issue only happens if you enable CSI migration in 1.20 and then upgrade to 1.21 because we did the topology fix from 1.20 to 1.21

But for example, if CSI migration is enabled in 1.21 and later, and then when they do the upgrade, it should not get impacted, right? Because at that time the volumesAttached information is correct on that node.

@Jiawei0227
Contributor

Let's first figure out when the issue occurs and what the impact range is, and then we can work on a fix. It might possibly be a fix that lands only on the release-1.21 branch.

@ialidzhikov
Contributor Author

ialidzhikov commented May 6, 2022

So basically the issue only happens if you enable CSI migration in 1.20 and then upgrade to 1.21 because we did the topology fix from 1.20 to 1.21

Yes, the issue happens when you upgrade a 1.20 cluster with CSI migration enabled to 1.21.

But for example, if CSI migration is enabled in 1.21 and later, and then when they do the upgrade, it should not get impacted, right? Because at that time the volumesAttached information is correct on that node.

I haven't checked what happens if you upgrade a 1.20 cluster without CSI migration to a 1.21 cluster that enables it. However, I cannot say we, from Gardener, are much interested in this case. In Gardener, for GCP clusters we enable CSI migration starting from K8s 1.18 - ref https://github.com/gardener/gardener-extension-provider-gcp/blob/v1.22.0/docs/usage-as-end-user.md#csi-volume-provisioners.

Let's first figure out when the issue occurs and what the impact range is, and then we can work on a fix. It might possibly be a fix that lands only on the release-1.21 branch.

When will the issue occur: when you upgrade a 1.20 GCP cluster with CSI migration enabled to 1.21.
What's the impact range: all in-tree PVs created with external-provisioner@v2.1.2 and K8s 1.20.13 (which have the topology.kubernetes.io/zone label) will be wrongly detached by KCM, which leads to unexpected downtime of the workload.


PS. It feels bad - I really put effort into the investigation of this issue. I believe I provided good steps to reproduce and tried to explain the issue. On the other side, after 1 month I have almost no meaningful replies/actions from you, the maintainers. On top of that, a release team member (@palnabarun) states that 1.21.12 will most probably be the last release for 1.21 - ref https://kubernetes.slack.com/archives/CJH2GBF7Y/p1650480910742459?thread_ts=1649751203.028119&cid=CJH2GBF7Y. Will you really leave 1.21 affected by such an issue?

PS2. A similar bad example from history - ref https://kubernetes.slack.com/archives/CJH2GBF7Y/p1623453145304800. #102812 was not accepted and left a range of patch versions for 1.18 affected by #102856. Back then I even created a security incident for this issue, and the "reply" was that the issue was already disclosed because I had filed it against k/k first and that they could not do much about it.

@Jiawei0227
Contributor

@ialidzhikov Hey Ismail, I apologize for the late response here. I have no excuse and we will try to fix this issue as soon as possible.

jingxu97 added a commit to jingxu97/kubernetes that referenced this issue May 11, 2022
Change-Id: I774a429442327a8700db975c625dd681f8577bc8
@Jiawei0227
Contributor

Hey @ialidzhikov, sorry for the late reply. We had a brainstorming meeting to figure out whether there is an obvious solution or a potential workaround/fix that we can go with to solve this; it is all documented in the doc linked below:

https://docs.google.com/document/d/17iIoVj3g02U8Pgt4nC7g7sJG-Jd4TvPCpEXZ5e3oCk4/edit?usp=sharing

Our recommendation is to follow Option 5 for now, as the other solutions either might not work or would cause other problems. We will keep brainstorming solutions for this one. Feel free to comment if you have other ideas.

Sorry again for causing this problem; we also have some action items internally to help improve the stability of the PD driver. We will keep posting on this thread if there are any other updates.

@ialidzhikov
Contributor Author

ialidzhikov commented May 12, 2022

Thank you all! I am impressed by the effort you put in! That's tremendously helpful for us, as we are stuck in the 1.20 to 1.21 upgrade process because of this issue.


In the last weeks I was also looking for mitigations/workarounds that prevent hitting this issue. Here are the options I managed to find:

  1. (Option 1) Restart KCM at a given interval (less than 6min) during the upgrade from 1.20 to 1.21

    After marking a volume as uncertain, KCM force-detaches the volume after 6min. The elapsed time is only preserved in memory. Hence, restarting KCM at short intervals prevents the force-detach logic from being executed.

    // Trigger detach volume which requires verifying safe to detach step
    // If timeout is true, skip verifySafeToDetach check
    klog.V(5).Infof(attachedVolume.GenerateMsgDetailed("Starting attacherDetacher.DetachVolume", ""))
    verifySafeToDetach := !timeout
    err = rc.attacherDetacher.DetachVolume(attachedVolume.AttachedVolume, verifySafeToDetach, rc.actualStateOfWorld)

    With this workaround I upgraded several clusters from 1.20 to 1.21.
    I am able to confirm that with this workaround there were no unexpected detachments from KCM (coming from the dangling VolumeAttachments garbage collection logic) in the form of:

    2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--2bc75fc8-af96-4327-a9a7-cd648c41ec96\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
    

    But during the upgrades with this workaround we had a lot of Nodes that got unhealthy because their FilesystemIsReadOnly was set to true. So the upgrade was not a graceful one, again. In the last days I was looking into the reason for the Nodes getting unhealthy with FilesystemIsReadOnly. In short, the FilesystemIsReadOnly condition is maintained by node-problem-detector, and NPD itself watches for a well-known error pattern in the kernel logs. To my understanding, the most common reason for having this condition set to true is a volume that gets wrongly detached/unmounted while the Pod is still running (or the container is still trying to operate on the volume).
    I suspect that these FilesystemIsReadOnly errors can still be related to this issue. I see the following func verifyVolumeIsSafeToDetach:

    func (og *operationGenerator) verifyVolumeIsSafeToDetach(
    	volumeToDetach AttachedVolume) error {
    	// Fetch current node object
    	node, fetchErr := og.kubeClient.CoreV1().Nodes().Get(context.TODO(), string(volumeToDetach.NodeName), metav1.GetOptions{})
    	if fetchErr != nil {
    		if errors.IsNotFound(fetchErr) {
    			klog.Warningf(volumeToDetach.GenerateMsgDetailed("Node not found on API server. DetachVolume will skip safe to detach check", ""))
    			return nil
    		}
    		// On failure, return error. Caller will log and retry.
    		return volumeToDetach.GenerateErrorDetailed("DetachVolume failed fetching node from API server", fetchErr)
    	}
    	if node == nil {
    		// On failure, return error. Caller will log and retry.
    		return volumeToDetach.GenerateErrorDetailed(
    			"DetachVolume failed fetching node from API server",
    			fmt.Errorf("node object retrieved from API server is nil"))
    	}
    	for _, inUseVolume := range node.Status.VolumesInUse {
    		if inUseVolume == volumeToDetach.VolumeName {
    			return volumeToDetach.GenerateErrorDetailed(
    				"DetachVolume failed",
    				fmt.Errorf("volume is still in use by node, according to Node status"))
    		}
    	}
    	// Volume is not marked as in use by node
    	klog.Infof(volumeToDetach.GenerateMsgDetailed("Verified volume is safe to detach", ""))
    	return nil
    }

    In the context of this issue we know that KCM and kubelet operate with different volumeNames for the same PV. In such a case the above function won't be able to prevent an in-use volume from being detached.

    Also with this option during the upgrade it is not possible to create new Pods that mount PVs on the 1.20 Nodes. The volume mount fails with the following error:

    E0510 11:55:44.516034    3898 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--83cee3be-e2f2-4f94-8557-ef09734e077a podName: nodeName:}" failed. No retries permitted until 2022-05-10 11:55:48.515988403 +0000 UTC m=+1649.019008969 (durationBeforeRetry 4s). Error: "Volume not attached according to node status for volume \"pv--83cee3be-e2f2-4f94-8557-ef09734e077a\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--83cee3be-e2f2-4f94-8557-ef09734e077a\") pod \"app25-0\" (UID: \"65a11be8-d1a8-43a2-903d-f6d99179f856\") "
    

    Again the reason is that KCM and kubelet operate with different volumeName.

    TL;DR: I accept that this is an ugly option. And the upgrade is still not a graceful one.

  2. (Option 2) Remove the topology.kubernetes.io/zone label from the PV

    This option corresponds to Option 5 (Preferred) from your list.

    Back then I gave up early on this option because:

    • I was not sure whether it is a safe approach and whether it can cause short-term or long-term problems.
    • Re-adding the topology.kubernetes.io/zone label after the upgrade is not possible without a downtime.
    • The nodeAffinity field of the PV is immutable. I had the impression that removing the label is a change that would make the nodeAffinity out of sync with the PV labels.

    After reading the document above, it is obvious that giving up early on this option was a wrong decision on my side.
    Back then, and today as well, I am able to confirm that this option works. The upgrade is a graceful one - without unexpected detaches or FilesystemIsReadOnly errors.
    In short, I have to make sure that no new PVs get created with the topology.kubernetes.io/zone label and I remove the label from the existing PVs before the upgrade.
    Now, as you suggest preferring this option, I assume that my initial concerns are not valid and this is the easiest and safest option. It would be great if you can confirm this one more time. If that is the case, I believe we have a good option that will allow us to upgrade our clusters to 1.21. :)
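
    For reference, here is a rough sketch of how the label removal in this option could be done with kubectl (the PV name is just the example from above; run it against your own PVs, and only after making sure the provisioner no longer adds the label):

    # Remove the label from a single PV (the trailing "-" deletes the label)
    kubectl label pv pv--21a16b25-edc8-4287-81dd-d4a71355a359 topology.kubernetes.io/zone-

    # Remove the label from every PV that currently carries it
    kubectl get pv -l topology.kubernetes.io/zone -o name \
      | xargs -r -I{} kubectl label {} topology.kubernetes.io/zone-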

@jingxu97
Contributor

@ialidzhikov thanks for sharing the detailed information.
About the label, I confirmed with @msau42 that it is OK not to reapply it, since it is mainly informational. The controller and the scheduler do not depend on the zone label to make decisions.
Also, as we discussed, in option 5 you can choose not to downgrade the external-provisioner as long as you make sure no new PVs are created during the upgrade.

@Jiawei0227
Contributor

I was not sure whether it is a safe approach and whether it can cause short-term or long-term problems.

I think basically we need to make sure no customers are actually relying on these labels for their logic.

Re-adding the topology.kubernetes.io/zone label after the upgrade is not possible without a downtime.

I think we should not re-add the label unless the volume is being detached from a node.

The nodeAffinity field of the PV is immutable. I had the impression that removing the label is a change that would make the nodeAffinity out of sync with the PV labels.

Label and nodeAffinity are two different things. They should not impact each other.

In short, I have to make sure that no new PVs get created with the topology.kubernetes.io/zone label and I remove the label from the existing PVs before the upgrade.

Yes. And to make sure no newly created PVs have this label, you need to use external-provisioner <= 2.1.0. After the cluster is upgraded to 1.21 (including nodes), this requirement no longer applies, which means you can upgrade the external-provisioner to a higher version.

It would be great if you can confirm this one more time. If that is the case, I believe we have a good option that will allow us to upgrade our clusters to 1.21. :)

Yes, I think this is the safest approach so far to fix the issue. Sorry for all the trouble we have brought with this feature ;)

@ialidzhikov
Contributor Author

/remove-priority critical-urgent
as there is an existing workaround.

Keeping this issue open for the 2 action items listed in https://docs.google.com/document/d/17iIoVj3g02U8Pgt4nC7g7sJG-Jd4TvPCpEXZ5e3oCk4/edit#heading=h.5n431nhh830a.

@k8s-ci-robot k8s-ci-robot removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 22, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 22, 2022
@ialidzhikov
Contributor Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 20, 2022
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024
@ialidzhikov
Contributor Author

Keeping this issue open for the 2 action items listed in https://docs.google.com/document/d/17iIoVj3g02U8Pgt4nC7g7sJG-Jd4TvPCpEXZ5e3oCk4/edit#heading=h.5n431nhh830a.

I am closing this issue. The action items are about making the issue visible in the release notes and about improving the compatibility notes/matrix in the external-provisioner/GCE PD CSI driver.
The corresponding K8s/external-provisioner/GCE PD CSI versions have been out of support for a long time, I assume.

/close

@k8s-ci-robot
Contributor

@ialidzhikov: Closing this issue.

In response to this:

Keeping this issue open for the 2 action items listed in https://docs.google.com/document/d/17iIoVj3g02U8Pgt4nC7g7sJG-Jd4TvPCpEXZ5e3oCk4/edit#heading=h.5n431nhh830a.

I am closing this issue. The action items are about making the issue visible in the release notes and about improving the compatibility notes/matrix in the external-provisioner/GCE PD CSI driver.
The corresponding K8s/external-provisioner/GCE PD CSI versions have been out of support for a long time, I assume.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
