Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using exec auth in kubectl triggers one connection per applied/described object #111911

Closed
liggitt opened this issue Aug 18, 2022 · 22 comments · Fixed by #112017
Closed

Using exec auth in kubectl triggers one connection per applied/described object #111911

liggitt opened this issue Aug 18, 2022 · 22 comments · Fixed by #112017
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@liggitt
Copy link
Member

liggitt commented Aug 18, 2022

What happened?

Switching from the gcp auth provider to the exec auth provider in 1.25 made kubectl apply requests which previously made a single connection to the API server make one connection per applied object.

If enough items are applied, the connections can start to fail because of ephemeral port exhaustion

/kind bug
/milestone v1.25
/sig auth cli api-machinery

What did you expect to happen?

Switching to the exec auth plugin would allow commands that previously worked to continue to work

How can we reproduce it (as minimally and precisely as possible)?

Reproducer integration test at https://github.com/liggitt/kubernetes/commits/exec-auth-reproducer demonstrating one transport (and therefore a new connection) for each item in an applied file when using exec auth

on that branch:

go test ./test/integration/client/ -run TestApplyMultipleItemsWithExecPlugin -v

and observe 10 transport creations to apply 10 items (plus a couple others for discovery):

before apply
creating transport
creating transport
creating transport
creating transport
creating transport
creating transport
creating transport
creating transport
creating transport
creating transport
creating transport
creating transport
after apply

Anything else we need to know?

This happens because exec auth setup unconditionally sets the GetCert field in transport config here (even if the plugin is only going to return token credentials):

That makes the transport config uncacheable here:

if c.TLS.GetCert != nil || c.Dial != nil || c.Proxy != nil {
// cannot determine equality for functions
return tlsCacheKey{}, false, nil
}

Commands like kubectl describe and kubectl apply repeatedly construct transports when dealing with multiple objects. This poor behavior was masked in past configurations by the transports being cached.

Kubernetes version

tried with 1.24 and 1.25, reproduced with both

Cloud provider

n/a

OS version

n/a

Install tools

n/a

Container runtime (CRI) and version (if applicable)

n/a

Related plugins (CNI, CSI, ...) and versions (if applicable)

n/a

@liggitt liggitt added the kind/bug Categorizes issue or PR as related to a bug. label Aug 18, 2022
@k8s-ci-robot k8s-ci-robot added sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. labels Aug 18, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Aug 18, 2022
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 18, 2022
@liggitt
Copy link
Member Author

liggitt commented Aug 18, 2022

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 18, 2022
@liggitt
Copy link
Member Author

liggitt commented Aug 18, 2022

labeling as a regression since a configuration that worked prior to 1.25 no longer has an equivalent that works properly

/kind regression

@k8s-ci-robot k8s-ci-robot added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Aug 18, 2022
@liggitt
Copy link
Member Author

liggitt commented Aug 18, 2022

I see a few directions a fix could go:

  1. exec auth plugin could avoid adding a GetCert hook if the plugin config indicated it would only return tokens. This would require more exec plugin config and wouldn't fix the issue for exec auth plugins that do return cert credentials
  2. client-go transport config/caching could try to enabling caching transports for configs which set a GetCert callback, but we tried to figure out how to do this in the past and couldn't find a way to do it safely
  3. client-go consumers that repeatedly create transports (primarily kubectl commands) could restructure to avoid doing that

All three of those changes seem likely to be invasive and inappropriate to make at the last minute for 1.25.

Option 3 seems like the proper fix, but also the one with the biggest surface area, since it would require checking all the kubectl commands, and likely some significant restructuring in how clients are constructed

@enj
Copy link
Member

enj commented Aug 18, 2022

1. exec auth plugin could avoid adding a GetCert hook if the plugin config indicated it would only return tokens. This would require more exec plugin config and wouldn't fix the issue for exec auth plugins that _do_ return cert credentials

I would like to avoid this approach as projects such as pinniped exclusively use client certificates for end user facing authentication.

@liggitt
Copy link
Member Author

liggitt commented Aug 18, 2022

1. exec auth plugin could avoid adding a GetCert hook if the plugin config indicated it would only return tokens. This would require more exec plugin config and wouldn't fix the issue for exec auth plugins that _do_ return cert credentials

I would like to avoid this approach as projects such as pinniped exclusively use client certificates for end user facing authentication.

me too, that doesn't really seem like a fix

@liggitt
Copy link
Member Author

liggitt commented Aug 18, 2022

it looks like the thing in kubectl that is repeatedly constructing clients is the Builder... there might be a central way to make that use the "construct http.Client once, reuse for multiple clientsets" that @aojea made last release... I'm looking into that now

@liggitt liggitt self-assigned this Aug 18, 2022
@cici37
Copy link
Contributor

cici37 commented Aug 18, 2022

/cc @cici37

@aojea
Copy link
Member

aojea commented Aug 18, 2022

it looks like the thing in kubectl that is repeatedly constructing clients is the Builder... there might be a central way to make that use the "construct http.Client once, reuse for multiple clientsets" that @aojea made last release... I'm looking into that now

/cc @soltysh @atiratree

adding them to the loop, there was also a WIP PR #108459 to fix this

@liggitt
Copy link
Member Author

liggitt commented Aug 18, 2022

it looks like the thing in kubectl that is repeatedly constructing clients is the Builder... there might be a central way to make that use the "construct http.Client once, reuse for multiple clientsets" that @aojea made last release... I'm looking into that now

/cc @soltysh @atiratree

adding them to the loop, there was also a WIP PR #108459 to fix this

oof... I didn't know this was a known issue... I'd have pushed to resolve this before forcing the migration to exec auth :-/

@liggitt
Copy link
Member Author

liggitt commented Aug 18, 2022

expanding on #111911 (comment), I see our options for 1.25 as:

  1. ship 1.25 as-is with this documented as a known kubectl issue
    • removal of gcp and azure credential providers forces migration to exec auth with a known kubectl issue without a reasonable workaround
  2. delay 1.25 until the kubectl issue is fixed or worked around
    • current WIP fix in share transport in kubectl #108459 is not small, and I would characterize as ~low-medium risk (it's hard to reason about the impact on all callers of some of the code paths being changed). I'd estimate the best case at O(days) for the fix to land, and I would expect another release candidate given the scope of the changes as well. this fix also relies on work @aojea did that is only in 1.23+, so it couldn't be backported to 1.22, which is still in support
    • a workaround-style fix inside the exec auth plugin could maybe avoid triggering the cert use case, which would avoid regressing the gcp and azure providers (which use tokens), but would require changing exec-auth plugin logic, which I wouldn't be confident in backporting
    • even if the fix lands in 1.25, it only fixes 1.25 kubectl unless we backport and wait for backport patch releases, so we'd still be forcing migration to exec auth with a known kubectl issue in still-active/supported patch releases without a reasonable workaround
  3. (selected after discussing with @enj and @cici37) restore the removed in-tree auth plugins in 1.25, document the known kubectl issue in 1.25, and fix the kubectl issue in 1.26

As loathe as I am to pick path 3, I think it's the right call for users and the release schedule, and I think we still have a clear path to actually remove the gcp and azure credential providers in 1.26 if we:

@dims
Copy link
Member

dims commented Aug 18, 2022

+1 to the proposed plan (3 bullet points) thanks @liggitt @aojea @enj

@liggitt
Copy link
Member Author

liggitt commented Aug 18, 2022

revert for 1.25 open at #111918 😞

@cici37
Copy link
Contributor

cici37 commented Aug 18, 2022

Thank you for addressing this and raising the revert PR!
Agree on do it right even if it's slow!

@enj
Copy link
Member

enj commented Aug 18, 2022

Reverting this really kills me, but I agree that we should:

  1. Not break users in ways that they cannot reasonably workaround
  2. Not risk the release with large changes on a late timeline

Lets get this fixed correctly early in the 1.26 release so that we can backport the fix to 1.23+

@liggitt
Copy link
Member Author

liggitt commented Aug 18, 2022

revert PR is merged for 1.25 and in the release-1.25 branch

dropping the regression label from this issue now (since it's just an exec auth bug now, not a 1.25 regression) and adding important-soon priority for 1.26. Will work with @soltysh to get #108459 in a backport-safe shape, targeting merge first thing in 1.26

@liggitt liggitt added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed kind/regression Categorizes issue or PR as related to a regression from a prior release. labels Aug 18, 2022
@liggitt liggitt modified the milestones: v1.25, v1.26 Aug 18, 2022
@sftim
Copy link
Contributor

sftim commented Aug 18, 2022

Optimizations on top of cutting down the number of new transports we make:
For plugins that do return a cert, I'd be happy to cache that cert based on its expiry date (notAfter) - perhaps with a max TTL.
For plugins that return tokens, we could provide a way for the plugin to additionally assert a freshness period and / or maximum number of uses for the token.

For some cases it could be useful for tokens to be strictly single use.


We need to update kubernetes/website#35981 to reflect the new situation.

@liggitt
Copy link
Member Author

liggitt commented Aug 18, 2022

Optimizations on top of cutting down the number of new transports we make:
For plugins that do return a cert, I'd be happy to cache that cert based on its expiry date (notAfter) - perhaps with a max TTL.
For plugins that return tokens, we could provide a way for the plugin to additionally assert a freshness period and / or maximum number of uses for the token.

The exec auth plugin can already do both of those things (can return expiration timestamps that avoid additional calls to the plugin to get new credentials). In this case, we weren't actually calling the authenticator multiple times, the same cached credentials were getting used with transports and talking to the server over multiple connections

@liggitt
Copy link
Member Author

liggitt commented Aug 18, 2022

We need to update kubernetes/website#35981 to reflect the new situation.

thanks for pointing out the docs update needed

@cici37
Copy link
Contributor

cici37 commented Aug 19, 2022

We need to update kubernetes/website#35981 to reflect the new situation.

I have left a comment earlier in kubernetes/website#35981 (comment) and the update is done. The CHANGLOG update is covered in kubernetes/sig-release#1995. We still need a release note update. Since it was merged after rc1 cut, the release notes has to be added manually. I am working with release notes team on it: kubernetes/sig-release#1996 (comment)

@atiratree
Copy link
Member

I can try to improve #108459 next week and make it safer for backports

@enj
Copy link
Member

enj commented Aug 24, 2022

/assign

@enj
Copy link
Member

enj commented Aug 24, 2022

I have opened #112017 to fix this bug. I believe the change is small and safe enough to backport.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
8 participants