Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GC uid races and handling of conflicting ownerReferences #92743

Merged
merged 13 commits into from Nov 17, 2020

Conversation

liggitt
Copy link
Member

@liggitt liggitt commented Jul 2, 2020

Addresses issues with race conditions encountered when observing invalid ownerReferences

Fixes #65200

  • address review comments
  • add comments to each commit on the scenario being addresses
  • add unit tests for GC controller and graph builder fixes
  • ensure the coverage of the following testcases are represented:

  • child in namespace A with owner reference to namespaced type in namespace B

    • should be deleted immediately
    • event should be logged in namespace A with involvedObject of bad-child indicating the error
    • test both orders:
      • real parent added to graph, then bad child added to graph
      • bad child added to graph, virtual parent added to graph, then real parent added to graph
  • child that is cluster-scoped with owner reference to namespaced type in namespace B

    • should not be deleted
    • event should be logged in namespace kube-system with involvedObject of bad-child indicating the error
    • test both orders:
      • real parent added to graph, then bad child added to graph
      • bad child added to graph, virtual parent added to graph, then real parent added to graph
  • child pointing at non-preferred still-served apiVersion of parent object (e.g. rbac/v1beta1)

    • should not be deleted prematurely
    • should not repeatedly poll attemptToDelete while waiting
    • should be deleted when the actual parent is deleted
  • child pointing at no-longer-served apiVersion of still-existing parent object (e.g. extensions/v1beta1 deployment)

    • should not be deleted (this is indistinguishable from referencing an unknown kind/version)
    • should not repeatedly poll attemptToDelete once real parent is observed
    • TODO: follow up: should be fixed up by a controller adjusting known ownerReference API versions
  • child pointing at no-longer-served apiVersion of no-longer-existing parent object (e.g. extensions/v1beta1 deployment)

    • should not be deleted (this is indistinguishable from referencing an unknown kind/version)
    • should repeatedly poll attemptToDelete
    • should not block deletion of legitimate children of missing deployment
    • TODO: follow up: should be fixed up by a controller adjusting known ownerReference API versions and once fixed up, should be deleted since parent does not exist
  • child pointing at incorrect apiVersion/kind of still-existing parent object (e.g. core/v1 Secret with uid=123, where an apps/v1 Deployment with uid=123 exists)

    • should be deleted immediately
    • should not trigger deletion of legitimate children of parent

Follow-ups:

/kind bug
/cc @jpbetz @deads2k
/sig api-machinery

Resolves non-deterministic behavior of the garbage collection controller when ownerReferences with incorrect data are encountered. Events with a reason of `OwnerRefInvalidNamespace` are recorded when namespace mismatches between child and owner objects are detected. The [kubectl-check-ownerreferences](https://github.com/kubernetes-sigs/kubectl-check-ownerreferences) tool can be run prior to upgrading to locate existing objects with invalid ownerReferences.
* A namespaced object with an ownerReference referencing a uid of a namespaced kind which does not exist in the same namespace is now consistently treated as though that owner does not exist, and the child object is deleted.
* A cluster-scoped object with an ownerReference referencing a uid of a namespaced kind is now consistently treated as though that owner is not resolvable, and the child object is ignored by the garbage collector.

@k8s-ci-robot k8s-ci-robot requested a review from jpbetz July 2, 2020 06:52
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/test sig/apps Categorizes an issue or PR as relevant to SIG Apps. approved Indicates a PR has been approved by an approver from all required OWNERS files. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jul 2, 2020
@liggitt liggitt force-pushed the gc branch 3 times, most recently from 3235bec to a4b11dd Compare July 2, 2020 08:05
@liggitt
Copy link
Member Author

liggitt commented Jul 2, 2020

/retest

@liggitt
Copy link
Member Author

liggitt commented Jul 2, 2020

/skip

@fedebongio
Copy link
Contributor

/assign @jpbetz
/cc @caesarxuchao

@liggitt liggitt force-pushed the gc branch 3 times, most recently from 28a133e to df55393 Compare July 9, 2020 07:16
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 9, 2020
@liggitt
Copy link
Member Author

liggitt commented Jul 9, 2020

/retest

@mattfarina
Copy link
Contributor

/retest pull-kubernetes-conformance-kind-ga-only-parallel

@k8s-ci-robot

This comment has been minimized.

@mattfarina
Copy link
Contributor

/test pull-kubernetes-conformance-kind-ga-only-parallel

@liggitt liggitt added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 17, 2020
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@liggitt
Copy link
Member Author

liggitt commented Nov 17, 2020

docs update for 1.20 at kubernetes/website#25091

@liggitt liggitt deleted the gc branch November 17, 2020 22:41
@krmayankk
Copy link

curious why did we make the cross namespace reference an invalid case ?

@liggitt
Copy link
Member Author

liggitt commented Nov 18, 2020

why did we make the cross namespace reference an invalid case ?

namespaces are intended to be independent of each other, so cross-namespace references have not been permitted in things like ownerReferences, secret/configmap volume references, etc.

additionally, granting permissions to namespace a is not generally intended to provide visibility or ability to interact with objects from namespace b (or cause system controllers to interact with objects from namespace b).

@krmayankk
Copy link

why did we make the cross namespace reference an invalid case ?

namespaces are intended to be independent of each other, so cross-namespace references have not been permitted in things like ownerReferences, secret/configmap volume references, etc.

additionally, granting permissions to namespace a is not generally intended to provide visibility or ability to interact with objects from namespace b (or cause system controllers to interact with objects from namespace b).

thanks

@little-guy-lxr
Copy link

could this fix merge to 1.19.x ?

@liggitt
Copy link
Member Author

liggitt commented Sep 8, 2021

could this fix merge to 1.19.x ?

No, this is a significant enough rewrite of the garbagecontroller that it is limited to 1.20+

For clusters on previous versions, I would recommend running https://github.com/kubernetes-sigs/kubectl-check-ownerreferences tool is available to identify invalid ownerReferences that could trigger the bug this PR is fixing so they can be eliminated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet