
gce: KCM detaches all in-tree volumes during update from K8s 1.20 to 1.21 #109354

Closed
ialidzhikov opened this issue Apr 7, 2022 · 26 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@ialidzhikov
Contributor

ialidzhikov commented Apr 7, 2022

What happened?

KCM wrongly detaches all in-tree volumes during update from K8s 1.20 to 1.21.

What did you expect to happen?

KCM not to detach all in-tree volumes during the update from K8s 1.20 to 1.21.

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a K8s 1.20.13 cluster.

  2. Create in-tree and out-of-tree StorageClasses.

    allowVolumeExpansion: true
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: default-intree
    parameters:
      type: pd-standard
    provisioner: kubernetes.io/gce-pd
    reclaimPolicy: Delete
    volumeBindingMode: Immediate
    ---
    allowVolumeExpansion: true
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      annotations:
        storageclass.kubernetes.io/is-default-class: "true"
      name: default
    parameters:
      type: pd-standard
    provisioner: pd.csi.storage.gke.io
    reclaimPolicy: Delete
    volumeBindingMode: WaitForFirstConsumer
  3. Create 3 StatefulSets with 4 replicas each (one StatefulSet uses the out-of-tree StorageClass, the other two use the in-tree one):

    apiVersion: v1
    kind: Service
    metadata:
      name: app1
      labels:
        app: app1
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app1
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app1
    spec:
      serviceName: app1
      replicas: 4
      selector:
        matchLabels:
          app: app1
      template:
        metadata:
          labels:
            app: app1
        spec:
          containers:
            - name: app1
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app1
                mountPath: /data
              livenessProbe:
                exec:
                  command:
                  - tail
                  - -n 1
                  - /data/out.txt
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app1
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 1Gi
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: app2
      labels:
        app: app2
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app2
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app2
    spec:
      serviceName: app2
      replicas: 4
      selector:
        matchLabels:
          app: app2
      template:
        metadata:
          labels:
            app: app2
        spec:
          containers:
            - name: app2
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app2
                mountPath: /data
              livenessProbe:
                exec:
                  command:
                  - tail
                  - -n 1
                  - /data/out.txt
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app2
        spec:
          storageClassName: default-intree
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 2Gi
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: app3
      labels:
        app: app3
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app3
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app3
    spec:
      serviceName: app3
      replicas: 4
      selector:
        matchLabels:
          app: app3
      template:
        metadata:
          labels:
            app: app3
        spec:
          containers:
            - name: app3
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app3
                mountPath: /data
              livenessProbe:
                exec:
                  command:
                  - tail
                  - -n 1
                  - /data/out.txt
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app3
        spec:
          storageClassName: default-intree
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 3Gi
  4. Update the cluster to K8s 1.21.10.

  5. Observe that kube-controller-manager detaches all in-tree volumes during the update:

5.1 kube-controller-manager marks all in-tree volumes as uncertain.

2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--589d5a8a-cf3a-4428-bd5f-4e03d1615e1e\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--7e6663f2-2446-45ee-bca3-a53771b7226b\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--6b11de8e-2115-42d6-8d26-c52bf97a1076\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--148e95ac-196f-4b59-a547-c01ab7ae3f2d\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:00	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--4f4d8963-b336-41c8-81ea-d1bbf25b61b9\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--ec3c3c19-5a62-4b29-b263-713da5a07d6e\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--ad094298-f42a-4f18-b453-9772ce21386b\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--53b113af-aaf2-4306-9c6f-817ca0de62eb\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--9a237184-7f03-451f-aad5-c85b09d6d580\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}
2022-04-06 06:33:01	{"log":"Marking volume attachment as uncertain as volume:\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--26b788f6-03e5-4b39-a5f2-c20ef9a8a884\" (\"cpu-worker-etcd-z1-86c78-7nqlq\") is not attached (Detached)","pid":"1","severity":"INFO","source":"attach_detach_controller.go:769"}

5.2 6min after marking a lot of volume attachments as uncertain, KCM detaches the in-tree volumes.

2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--2bc75fc8-af96-4327-a9a7-cd648c41ec96\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--148e95ac-196f-4b59-a547-c01ab7ae3f2d\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--8ad228d7-187b-4972-ab4c-74e5a46c6ad8\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--26b788f6-03e5-4b39-a5f2-c20ef9a8a884\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--53b113af-aaf2-4306-9c6f-817ca0de62eb\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--4f4d8963-b336-41c8-81ea-d1bbf25b61b9\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--6b11de8e-2115-42d6-8d26-c52bf97a1076\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--7e6663f2-2446-45ee-bca3-a53771b7226b\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--9a237184-7f03-451f-aad5-c85b09d6d580\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--ad094298-f42a-4f18-b453-9772ce21386b\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--ec3c3c19-5a62-4b29-b263-713da5a07d6e\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
2022-04-06 06:39:09	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--589d5a8a-cf3a-4428-bd5f-4e03d1615e1e\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
  6. After these detachments, the Node turns into an unhealthy state with reason FilesystemIsReadOnly:
  Normal   FilesystemIsReadOnly  48m                     kernel-monitor  Node condition ReadonlyFilesystem is now: True, reason: FilesystemIsReadOnly

The corresponding Pods fail with IO errors as their volumes are detached.

Anything else we need to know?

I see that in the ASW the volume is reported as attached:

I0407 06:37:53.956930       1 actual_state_of_world.go:507] Report volume "kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--44ee54d8-42d9-4ff6-8092-4a20d932c941" as attached to node "worker-1-z1-7b85b-9r2tb"

Then the volume attachment is marked as uncertain:

I0407 06:37:53.957975       1 attach_detach_controller.go:769] Marking volume attachment as uncertain as volume:"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--44ee54d8-42d9-4ff6-8092-4a20d932c941" ("worker-1-z1-7b85b-9r2tb") is not attached (Detached)

Note the diff in the volume name:

-kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--44ee54d8-42d9-4ff6-8092-4a20d932c941
+kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--44ee54d8-42d9-4ff6-8092-4a20d932c941

Kubernetes version

Update from K8s 1.20.13 to 1.21.10.


Cloud provider

OS version


Install tools

https://github.com/gardener/gardener

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

external-provisioner - k8s.gcr.io/sig-storage/csi-provisioner@v2.1.2
external-attacher - k8s.gcr.io/sig-storage/csi-attacher@v3.3.0
gcp-compute-persistent-disk-csi-driver - gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver@v1.3.4-gke.0

@ialidzhikov ialidzhikov added the kind/bug Categorizes issue or PR as related to a bug. label Apr 7, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 7, 2022
@ialidzhikov
Contributor Author

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 7, 2022
@ialidzhikov
Contributor Author

cc @Jiawei0227 @msau42 @jsafrane

@ialidzhikov
Contributor Author

ialidzhikov commented Apr 8, 2022

It seems that we build the ASW based on the Node .status.volumesAttached.

for _, node := range nodes {
	nodeName := types.NodeName(node.Name)
	for _, attachedVolume := range node.Status.VolumesAttached {
		uniqueName := attachedVolume.Name
		// The nil VolumeSpec is safe only in the case the volume is not in use by any pod.
		// In such a case it should be detached in the first reconciliation cycle and the
		// volume spec is not needed to detach a volume. If the volume is used by a pod, it
		// its spec can be: this would happen during in the populateDesiredStateOfWorld which
		// scans the pods and updates their volumes in the ActualStateOfWorld too.
		err = adc.actualStateOfWorld.MarkVolumeAsAttached(uniqueName, nil /* VolumeSpec */, nodeName, attachedVolume.DevicePath)
		if err != nil {
			klog.Errorf("Failed to mark the volume as attached: %v", err)
			continue
		}
		adc.processVolumesInUse(nodeName, node.Status.VolumesInUse)
		adc.addNodeToDswp(node, types.NodeName(node.Name))
	}
}


In K8s 1.20.13 the volumeName for an in-tree PV in the Node status looks like:

  volumesAttached:
  - devicePath: ""
    name: kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--21a16b25-edc8-4287-81dd-d4a71355a359

Note that the zone is UNSPECIFIED.

The corresponding PV topology labels are:

    labels:
      topology.kubernetes.io/region: europe-west1
      topology.kubernetes.io/zone: europe-west1-b

I see that this behaviour comes from external-provisioner@v2.1.2.

Using external-provisioner@v1.6.0 I see that no topology labels are set on the PV.


While in K8s 1.21 #97823 introduces the following change:

if zonesLabel == "" {
zonesLabel = pv.Labels[v1.LabelTopologyZone]
}

This results in a change in the volumeName (as the zone is part of the volumeName). The translated volumeName with 1.21 is:

kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/europe-west1-b/disks/pv--21a16b25-edc8-4287-81dd-d4a71355a359

Note that the zone is now set and no longer unspecified.

Having one volumeName in the ASW (coming from the Node status) and a different volumeName in the garbage collection logic for dangling VolumeAttachments for the same volume leads KCM to mark the volume as uncertain and to detach it.
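
To make the mismatch concrete, here is a minimal, self-contained sketch (my own reconstruction, not the actual csi-translation-lib code; the format string and label keys are assumptions based on the snippets above) of the two unique volume names that end up being compared:

package main

import "fmt"

// Hypothetical reconstruction of the zone handling described above; the real
// logic lives in the GCE PD translation of k8s.io/csi-translation-lib.
const volIDZonalFmt = "projects/%s/zones/%s/disks/%s" // assumed format

func translatedHandle(pvLabels map[string]string, pdName string, withTopologyFallback bool) string {
	zone := pvLabels["failure-domain.beta.kubernetes.io/zone"]
	if zone == "" && withTopologyFallback {
		// The #97823 change (K8s 1.21): fall back to the GA topology label.
		zone = pvLabels["topology.kubernetes.io/zone"]
	}
	if zone == "" {
		zone = "UNSPECIFIED"
	}
	return fmt.Sprintf(volIDZonalFmt, "UNSPECIFIED", zone, pdName)
}

func main() {
	// A PV provisioned by external-provisioner@v2.1.2 carries only the GA topology label.
	labels := map[string]string{"topology.kubernetes.io/zone": "europe-west1-b"}
	pd := "pv--21a16b25-edc8-4287-81dd-d4a71355a359"

	// Unique volume names have the form "<plugin>^<handle>"; the ASW is keyed by them.
	oldName := "kubernetes.io/csi/pd.csi.storage.gke.io^" + translatedHandle(labels, pd, false) // 1.20 translation, as recorded in Node .status.volumesAttached
	newName := "kubernetes.io/csi/pd.csi.storage.gke.io^" + translatedHandle(labels, pd, true)  // 1.21 translation
	fmt.Println(oldName)
	fmt.Println(newName)
	fmt.Println("match:", oldName == newName) // false -> the volume looks "not attached" and gets marked uncertain
}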


@Jiawei0227, as the author of #97823, I would appreciate it if you could confirm my observations.

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Apr 8, 2022
@xing-yang
Contributor

/assign @Jiawei0227
Jiawei, can you take a look?

@xing-yang
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 8, 2022
@msau42
Member

msau42 commented Apr 18, 2022

cc @mattcary @jingxu97 @gnufied

@Jiawei0227
Contributor

I think @ialidzhikov you might be right, this seems to be a problem...

@Jiawei0227
Contributor

Just to confirm, is csi migration enabled on the cluster?

@msau42
Member

msau42 commented Apr 18, 2022

One thing to investigate is under what circumstances the volume ID zone would be "UNSPECIFIED" on the node before the upgrade?

@ialidzhikov
Contributor Author

Just to confirm, is csi migration enabled on the cluster?

Yes, the CSI migration is enabled.

KCM@1.20.13 flags:

--feature-gates=CSIMigration=true,CSIMigrationGCE=true,CSIMigrationGCEComplete=true

KCM@1.21.10 flags:

--feature-gates=CSIMigration=true,CSIMigrationGCE=true,InTreePluginGCEUnregister=true

One thing to investigate is under what circumstances the volume ID zone would be "UNSPECIFIED" on the node before the upgrade?

I guess for in-tree volumes the corresponding handling is:

zonesLabel := pv.Labels[v1.LabelFailureDomainBetaZone]
zones := strings.Split(zonesLabel, labelMultiZoneDelimiter)
if len(zones) == 1 && len(zones[0]) != 0 {
	// Zonal
	volID = fmt.Sprintf(volIDZonalFmt, UnspecifiedValue, zones[0], pv.Spec.GCEPersistentDisk.PDName)
} else if len(zones) > 1 {
	// Regional
	region, err := getRegionFromZones(zones)
	if err != nil {
		return nil, fmt.Errorf("failed to get region from zones: %v", err)
	}
	volID = fmt.Sprintf(volIDRegionalFmt, UnspecifiedValue, region, pv.Spec.GCEPersistentDisk.PDName)
} else {
	// Unspecified
	volID = fmt.Sprintf(volIDZonalFmt, UnspecifiedValue, UnspecifiedValue, pv.Spec.GCEPersistentDisk.PDName)
}

If in K8s 1.20.13 the in-tree PV is missing the failure-domain.beta.kubernetes.io/zone label, then the zone will be UNSPECIFIED.

  • In-tree PVs created by the in-tree volume plugin have the failure-domain.beta.kubernetes.io/zone label set.
  • In-tree PVs created with external-provisioner@v1.6.0 and K8s 1.20.13 don't have any topology label set and have key topology.gke.io/zone in the PV nodeAffinity.
  • In-tree PVs created with external-provisioner@v2.1.2 and K8s 1.20.13 have the topology.kubernetes.io/zone label set. These are actually the PVs affected by this issue as for them KCM@1.20.13 uses UNSPECIFIED but KCM@1.21.10 properly translates the zone.
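
Before the upgrade, one way to check which PVs fall into that last bucket is to list them together with the label (a rough sketch using plain kubectl; adjust to your setup):

# Show the zone label for every PV (an empty column means the PV is not affected)
kubectl get pv -L topology.kubernetes.io/zone

# List only the PVs that carry the label
kubectl get pv -l topology.kubernetes.io/zone -o name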

@liggitt
Member

liggitt commented Apr 19, 2022

we need to be sure that any fix/change here doesn't regress in the opposite direction (detaching volumes that currently work in 1.21+)

@Jiawei0227
Contributor

@ialidzhikov, can you help me understand this?

So basically the issue only happens if you enable CSI migration in 1.20 and then upgrade to 1.21 because we did the topology fix from 1.20 to 1.21

But for example, if CSI migration is enabled in 1.21 and later, and then when they do the upgrade, it should not get impacted, right? Because at that time the volumesAttached information is correct on that node.

@Jiawei0227
Contributor

Let's first figure out when the issue occurs and what the impact range is, and then we can work on a fix. It might possibly be a fix that lands only on the release-1.21 branch.

@ialidzhikov
Contributor Author

ialidzhikov commented May 6, 2022

So basically the issue only happens if you enable CSI migration in 1.20 and then upgrade to 1.21 because we did the topology fix from 1.20 to 1.21

Yes, the issue happens when you upgrade a 1.20 cluster with CSI migration enabled to 1.21.

But for example, if CSI migration is enabled in 1.21 and later, and then when they do the upgrade, it should not get impacted, right? Because at that time the volumesAttached information is correct on that node.

I haven't checked what happens if you upgrade a 1.20 cluster without CSI migration to a 1.21 cluster that enables it. However, I cannot say we, from Gardener, are much interested in this case. In Gardener, for GCP clusters we enable CSI migration starting from K8s 1.18 - ref https://github.com/gardener/gardener-extension-provider-gcp/blob/v1.22.0/docs/usage-as-end-user.md#csi-volume-provisioners.

Let's first figure out when the issue occurs and what the impact range is, and then we can work on a fix. It might possibly be a fix that lands only on the release-1.21 branch.

When will the issue occur: when you upgrade a 1.20 GCP cluster with CSI migration enabled to 1.21.
What's the impact range: all in-tree PVs created with external-provisioner@v2.1.2 and K8s 1.20.13 (which have the topology.kubernetes.io/zone label) will be wrongly detached by KCM, which leads to unexpected downtime of the workload.


PS. It feels bad - I really put effort into the investigation of this issue. I believe I provided good steps to reproduce and tried to explain the issue. On the other side, after 1 month I have almost no meaningful replies/actions from you, the maintainers. On top of that, a release team member (@palnabarun) states that 1.21.12 will most probably be the last release for 1.21 - ref https://kubernetes.slack.com/archives/CJH2GBF7Y/p1650480910742459?thread_ts=1649751203.028119&cid=CJH2GBF7Y. Will you really leave 1.21 affected by such an issue?

PS2. A similar bad example from history - ref https://kubernetes.slack.com/archives/CJH2GBF7Y/p1623453145304800. #102812 was not accepted and left a range of patch versions for 1.18 affected by #102856. Back then I even created a security incident for this issue, and the "reply" was that the issue was already disclosed because I had filed it against k/k first and that they could not do much about it.

@Jiawei0227
Contributor

@ialidzhikov Hey Ismail, I apologize for the late response here. I have no excuse and we will try to fix this issue as soon as possible.

jingxu97 added a commit to jingxu97/kubernetes that referenced this issue May 11, 2022
Change-Id: I774a429442327a8700db975c625dd681f8577bc8
@Jiawei0227
Contributor

Hey @ialidzhikov, sorry for the late reply. We had a brainstorming meeting to figure out whether there is an obvious solution or a potential workaround/fix that we can go with to solve this; it is all documented in the doc linked below:

https://docs.google.com/document/d/17iIoVj3g02U8Pgt4nC7g7sJG-Jd4TvPCpEXZ5e3oCk4/edit?usp=sharing

Our recommendation is to follow Option 5 for now, as the other solutions either might not work or would cause other problems. We will keep brainstorming solutions for this one. Feel free to comment if you have other ideas.

Sorry again for causing this problem; we also have some action items internally to help improve the stability of the PD driver. We will keep posting on this thread if there are any other updates.

@ialidzhikov
Contributor Author

ialidzhikov commented May 12, 2022

Thank you all! I am impressed by the effort you put in! That's tremendously helpful for us, as we are stuck in the 1.20 to 1.21 upgrade process because of this issue.


In the last weeks I was also looking for mitigations/workarounds that prevent hitting this issue. Here are the options I managed to find:

  1. (Option 1) Restart KCM at a given interval (less than 6min) during the upgrade from 1.20 to 1.21

    After marking a volume as uncertain, KCM force-detaches the volume after 6min. The elapsed time is only preserved in memory. Hence, restarting KCM at short intervals prevents the force-detach logic from being executed.

    // Trigger detach volume which requires verifying safe to detach step
    // If timeout is true, skip verifySafeToDetach check
    klog.V(5).Infof(attachedVolume.GenerateMsgDetailed("Starting attacherDetacher.DetachVolume", ""))
    verifySafeToDetach := !timeout
    err = rc.attacherDetacher.DetachVolume(attachedVolume.AttachedVolume, verifySafeToDetach, rc.actualStateOfWorld)

    With this workaround I upgraded several clusters from 1.20 to 1.21.
    I am able to confirm that with this workaround there were no unexpected detachments from KCM (coming from the dangling VolumeAttachments garbage collection logic) in the form of:

    2022-04-06 06:39:08	{"log":"attacherDetacher.DetachVolume started for volume \"nil\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--2bc75fc8-af96-4327-a9a7-cd648c41ec96\") on node \"cpu-worker-etcd-z1-86c78-7nqlq\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching","pid":"1","severity":"WARN","source":"reconciler.go:224"}
    

    But during the upgrades with this workaround we had a lot of Nodes that got unhealthy because their FilesystemIsReadOnly was set to true. So the upgrade was not a graceful one, again. In the last days I was looking into the reason for the Nodes getting unhealthy with FilesystemIsReadOnly. In short, the FilesystemIsReadOnly condition is maintained by node-problem-detector, and NPD itself watches for a well-known error pattern in the kernel logs. To my understanding, the most common reason for having this condition set to true is a volume that gets wrongly detached/unmounted while the Pod is still running (or the container is still trying to operate on the volume).
    I suspect that these FilesystemIsReadOnly errors can still be related to this issue. I see the following func verifyVolumeIsSafeToDetach:

    func (og *operationGenerator) verifyVolumeIsSafeToDetach(
    	volumeToDetach AttachedVolume) error {
    	// Fetch current node object
    	node, fetchErr := og.kubeClient.CoreV1().Nodes().Get(context.TODO(), string(volumeToDetach.NodeName), metav1.GetOptions{})
    	if fetchErr != nil {
    		if errors.IsNotFound(fetchErr) {
    			klog.Warningf(volumeToDetach.GenerateMsgDetailed("Node not found on API server. DetachVolume will skip safe to detach check", ""))
    			return nil
    		}
    		// On failure, return error. Caller will log and retry.
    		return volumeToDetach.GenerateErrorDetailed("DetachVolume failed fetching node from API server", fetchErr)
    	}
    	if node == nil {
    		// On failure, return error. Caller will log and retry.
    		return volumeToDetach.GenerateErrorDetailed(
    			"DetachVolume failed fetching node from API server",
    			fmt.Errorf("node object retrieved from API server is nil"))
    	}
    	for _, inUseVolume := range node.Status.VolumesInUse {
    		if inUseVolume == volumeToDetach.VolumeName {
    			return volumeToDetach.GenerateErrorDetailed(
    				"DetachVolume failed",
    				fmt.Errorf("volume is still in use by node, according to Node status"))
    		}
    	}
    	// Volume is not marked as in use by node
    	klog.Infof(volumeToDetach.GenerateMsgDetailed("Verified volume is safe to detach", ""))
    	return nil
    }

    In the context of this issue we know that KCM and kubelet operate with different volumeNames for the same PV. In such a case the above function won't be able to prevent an in-use volume from being detached.

    Also with this option during the upgrade it is not possible to create new Pods that mount PVs on the 1.20 Nodes. The volume mount fails with the following error:

    E0510 11:55:44.516034    3898 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--83cee3be-e2f2-4f94-8557-ef09734e077a podName: nodeName:}" failed. No retries permitted until 2022-05-10 11:55:48.515988403 +0000 UTC m=+1649.019008969 (durationBeforeRetry 4s). Error: "Volume not attached according to node status for volume \"pv--83cee3be-e2f2-4f94-8557-ef09734e077a\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/UNSPECIFIED/zones/UNSPECIFIED/disks/pv--83cee3be-e2f2-4f94-8557-ef09734e077a\") pod \"app25-0\" (UID: \"65a11be8-d1a8-43a2-903d-f6d99179f856\") "
    

    Again the reason is that KCM and kubelet operate with different volumeName.

    TL;DR: I accept that this is an ugly option. And the upgrade is still not a graceful one.

  2. (Option 2) Remove the topology.kubernetes.io/zone label from the PV

    This option corresponds to Option 5 (Preferred) from your list.

    Back then I gave up early on this option because:

    • I was not sure whether it is a safe approach and whether it can cause short-term or long-term problems.
    • Re-adding the topology.kubernetes.io/zone label after the upgrade is not possible without a downtime.
    • The nodeAffinity field of the PV is immutable. I had the impression that removing the label is a change that would make the nodeAffinity out of sync with the PV labels.

    After reading the document above, it is obvious that giving up early on this option was a wrong decision on my side.
    Back then, and today as well, I am able to confirm that this option works. The upgrade is a graceful one - without unexpected detaches or FilesystemIsReadOnly errors.
    In short, I have to make sure that no new PVs get created with the topology.kubernetes.io/zone label and I remove the label from the existing PVs before the upgrade.
    Now, as you suggest preferring this option, I assume that my initial concerns are not valid and this is the easiest and safest option. It would be great if you can confirm this one more time. If that is the case, I believe we have a good option that will allow us to upgrade our clusters to 1.21. :)
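
    For reference, here is a rough sketch of how the label removal in this option could be done with kubectl (the PV name is just the example from above; run it against your own PVs, and only after making sure the provisioner no longer adds the label):

    # Remove the label from a single PV (the trailing "-" deletes the label)
    kubectl label pv pv--21a16b25-edc8-4287-81dd-d4a71355a359 topology.kubernetes.io/zone-

    # Remove the label from every PV that currently carries it
    kubectl get pv -l topology.kubernetes.io/zone -o name \
      | xargs -r -I{} kubectl label {} topology.kubernetes.io/zone-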

@jingxu97
Contributor

@ialidzhikov thanks for sharing the detailed information.
About the label, I confirmed with @msau42 that it is OK not to reapply it, since it is mainly informational. The controller and the scheduler do not depend on the zone label to make decisions.
Also, as we discussed, in option 5 you can choose not to downgrade the external-provisioner as long as you make sure no new PVs are created during the upgrade.

@Jiawei0227
Contributor

I was not sure whether it is a safe approach and whether it can cause short-term or long-term problems.

I think basically we need to make sure no customers are actually relying on these labels for their logic.

Re-adding the topology.kubernetes.io/zone label after the upgrade is not possible without a downtime.

I think we should not re-add the label unless the volume is being detached from a node.

The nodeAffinity field of the PV is immutable. I had the impression that removing the label is a change that would make the nodeAffinity out of sync with the PV labels.

Label and nodeAffinity are two different things. They should not impact each other.

In short, I have to make sure that no new PVs get created with the topology.kubernetes.io/zone label and I remove the label from the existing PVs before the upgrade.

Yes. And to make sure no newly created PVs have this label, you need to use external-provisioner <= 2.1.0. After the cluster is upgraded to 1.21 (including nodes), this requirement no longer applies, which means you can upgrade the external-provisioner to a higher version.

It would be great if you can confirm this one more time. If that is the case, I believe we have a good option that will allow us to upgrade our clusters to 1.21. :)

Yes, I think this is the safest approach so far to fix the issue. Sorry for all the trouble we have brought with this feature ;)

@ialidzhikov
Contributor Author

/remove-priority critical-urgent
as there is an existing workaround.

Keeping this issue open for the 2 action items listed in https://docs.google.com/document/d/17iIoVj3g02U8Pgt4nC7g7sJG-Jd4TvPCpEXZ5e3oCk4/edit#heading=h.5n431nhh830a.

@k8s-ci-robot k8s-ci-robot removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 22, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 22, 2022
@ialidzhikov
Contributor Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 20, 2022
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024
@ialidzhikov
Contributor Author

Keeping this issue open for the 2 action items listed in https://docs.google.com/document/d/17iIoVj3g02U8Pgt4nC7g7sJG-Jd4TvPCpEXZ5e3oCk4/edit#heading=h.5n431nhh830a.

I am closing this issue. The action items are about making the issue visible in the release notes and about improving the compatibility notes/matrix in the external-provisioner/GCE PD CSI driver.
The corresponding K8s/external-provisioner/GCE PD CSI versions have been out of support for a long time, I assume.

/close

@k8s-ci-robot
Contributor

@ialidzhikov: Closing this issue.

In response to this:

Keeping this issue open for the 2 action items listed in https://docs.google.com/document/d/17iIoVj3g02U8Pgt4nC7g7sJG-Jd4TvPCpEXZ5e3oCk4/edit#heading=h.5n431nhh830a.

I am closing this issue. The action items are about making the issue visible in the release notes and about improving the compatibility notes/matrix in the external-provisioner/GCE PD CSI driver.
The corresponding K8s/external-provisioner/GCE PD CSI versions have been out of support for a long time, I assume.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
