Using exec auth in kubectl triggers one connection per applied/described object #111911

liggitt · 2022-08-18T14:05:20Z

What happened?

Switching from the gcp auth provider to the exec auth provider in 1.25 made kubectl apply requests which previously made a single connection to the API server make one connection per applied object.

If enough items are applied, the connections can start to fail because of ephemeral port exhaustion

/kind bug
/milestone v1.25
/sig auth cli api-machinery

What did you expect to happen?

Switching to the exec auth plugin would allow commands that previously worked to continue to work

How can we reproduce it (as minimally and precisely as possible)?

Reproducer integration test at https://github.com/liggitt/kubernetes/commits/exec-auth-reproducer demonstrating one transport (and therefore a new connection) for each item in an applied file when using exec auth

on that branch:

go test ./test/integration/client/ -run TestApplyMultipleItemsWithExecPlugin -v

and observe 10 transport creations to apply 10 items (plus a couple others for discovery):

before apply
creating transport
creating transport
creating transport
creating transport
creating transport
creating transport
creating transport
creating transport
creating transport
creating transport
creating transport
creating transport
after apply

Anything else we need to know?

This happens because exec auth setup unconditionally sets the GetCert field in transport config here (even if the plugin is only going to return token credentials):

kubernetes/staging/src/k8s.io/client-go/plugin/pkg/client/auth/exec/exec.go

Line 303 in 58c10aa

c.TLS.GetCert = a.cert

That makes the transport config uncacheable here:

kubernetes/staging/src/k8s.io/client-go/transport/cache.go

Lines 141 to 144 in 58c10aa

    
           if c.TLS.GetCert != nil || c.Dial != nil || c.Proxy != nil { 
        
           	// cannot determine equality for functions 
        
           	return tlsCacheKey{}, false, nil 
        
           }

Commands like kubectl describe and kubectl apply repeatedly construct transports when dealing with multiple objects. This poor behavior was masked in past configurations by the transports being cached.

Kubernetes version

tried with 1.24 and 1.25, reproduced with both

Cloud provider

n/a

OS version

n/a

Install tools

n/a

Container runtime (CRI) and version (if applicable)

n/a

Related plugins (CNI, CSI, ...) and versions (if applicable)

n/a

The text was updated successfully, but these errors were encountered:

liggitt · 2022-08-18T14:35:11Z

/triage accepted

liggitt · 2022-08-18T14:36:15Z

labeling as a regression since a configuration that worked prior to 1.25 no longer has an equivalent that works properly

/kind regression

liggitt · 2022-08-18T14:41:15Z

I see a few directions a fix could go:

exec auth plugin could avoid adding a GetCert hook if the plugin config indicated it would only return tokens. This would require more exec plugin config and wouldn't fix the issue for exec auth plugins that do return cert credentials
client-go transport config/caching could try to enabling caching transports for configs which set a GetCert callback, but we tried to figure out how to do this in the past and couldn't find a way to do it safely
client-go consumers that repeatedly create transports (primarily kubectl commands) could restructure to avoid doing that

All three of those changes seem likely to be invasive and inappropriate to make at the last minute for 1.25.

Option 3 seems like the proper fix, but also the one with the biggest surface area, since it would require checking all the kubectl commands, and likely some significant restructuring in how clients are constructed

enj · 2022-08-18T14:53:34Z

1. exec auth plugin could avoid adding a GetCert hook if the plugin config indicated it would only return tokens. This would require more exec plugin config and wouldn't fix the issue for exec auth plugins that _do_ return cert credentials

I would like to avoid this approach as projects such as pinniped exclusively use client certificates for end user facing authentication.

liggitt · 2022-08-18T15:01:21Z

1. exec auth plugin could avoid adding a GetCert hook if the plugin config indicated it would only return tokens. This would require more exec plugin config and wouldn't fix the issue for exec auth plugins that _do_ return cert credentials
I would like to avoid this approach as projects such as pinniped exclusively use client certificates for end user facing authentication.

me too, that doesn't really seem like a fix

liggitt · 2022-08-18T15:02:06Z

it looks like the thing in kubectl that is repeatedly constructing clients is the Builder... there might be a central way to make that use the "construct http.Client once, reuse for multiple clientsets" that @aojea made last release... I'm looking into that now

cici37 · 2022-08-18T15:26:14Z

/cc @cici37

aojea · 2022-08-18T15:32:58Z

it looks like the thing in kubectl that is repeatedly constructing clients is the Builder... there might be a central way to make that use the "construct http.Client once, reuse for multiple clientsets" that @aojea made last release... I'm looking into that now

/cc @soltysh @atiratree

adding them to the loop, there was also a WIP PR #108459 to fix this

liggitt · 2022-08-18T15:39:48Z

it looks like the thing in kubectl that is repeatedly constructing clients is the Builder... there might be a central way to make that use the "construct http.Client once, reuse for multiple clientsets" that @aojea made last release... I'm looking into that now

/cc @soltysh @atiratree

adding them to the loop, there was also a WIP PR #108459 to fix this

oof... I didn't know this was a known issue... I'd have pushed to resolve this before forcing the migration to exec auth :-/

liggitt · 2022-08-18T18:06:01Z

expanding on #111911 (comment), I see our options for 1.25 as:

ship 1.25 as-is with this documented as a known kubectl issue
- removal of gcp and azure credential providers forces migration to exec auth with a known kubectl issue without a reasonable workaround
delay 1.25 until the kubectl issue is fixed or worked around
- current WIP fix in share transport in kubectl #108459 is not small, and I would characterize as ~low-medium risk (it's hard to reason about the impact on all callers of some of the code paths being changed). I'd estimate the best case at O(days) for the fix to land, and I would expect another release candidate given the scope of the changes as well. this fix also relies on work @aojea did that is only in 1.23+, so it couldn't be backported to 1.22, which is still in support
- a workaround-style fix inside the exec auth plugin could maybe avoid triggering the cert use case, which would avoid regressing the gcp and azure providers (which use tokens), but would require changing exec-auth plugin logic, which I wouldn't be confident in backporting
- even if the fix lands in 1.25, it only fixes 1.25 kubectl unless we backport and wait for backport patch releases, so we'd still be forcing migration to exec auth with a known kubectl issue in still-active/supported patch releases without a reasonable workaround
(selected after discussing with @enj and @cici37) restore the removed in-tree auth plugins in 1.25, document the known kubectl issue in 1.25, and fix the kubectl issue in 1.26

As loathe as I am to pick path 3, I think it's the right call for users and the release schedule, and I think we still have a clear path to actually remove the gcp and azure credential providers in 1.26 if we:

fix the kubectl issue early in 1.26 with a scoped-down/safer version of share transport in kubectl #108459
backport the kubectl fix to 1.23, 1.24, 1.25 patch releases (can't backport further than 1.23 because the fix depends on client-go: share the same transport for generated clientsets #105490, but 1.22 goes out of support before 1.26.0 is released)
force the exec auth migration in 1.26 once all supported minor versions of kubectl have a patch version available that resolves the issue

dims · 2022-08-18T18:08:57Z

+1 to the proposed plan (3 bullet points) thanks @liggitt @aojea @enj

liggitt · 2022-08-18T18:25:00Z

revert for 1.25 open at #111918 😞

cici37 · 2022-08-18T18:30:07Z

Thank you for addressing this and raising the revert PR!
Agree on do it right even if it's slow!

enj · 2022-08-18T19:13:05Z

Reverting this really kills me, but I agree that we should:

Not break users in ways that they cannot reasonably workaround
Not risk the release with large changes on a late timeline

Lets get this fixed correctly early in the 1.26 release so that we can backport the fix to 1.23+

liggitt · 2022-08-18T19:54:43Z

revert PR is merged for 1.25 and in the release-1.25 branch

dropping the regression label from this issue now (since it's just an exec auth bug now, not a 1.25 regression) and adding important-soon priority for 1.26. Will work with @soltysh to get #108459 in a backport-safe shape, targeting merge first thing in 1.26

sftim · 2022-08-18T21:22:44Z

Optimizations on top of cutting down the number of new transports we make:
For plugins that do return a cert, I'd be happy to cache that cert based on its expiry date (notAfter) - perhaps with a max TTL.
For plugins that return tokens, we could provide a way for the plugin to additionally assert a freshness period and / or maximum number of uses for the token.

For some cases it could be useful for tokens to be strictly single use.

We need to update kubernetes/website#35981 to reflect the new situation.

liggitt · 2022-08-18T21:25:53Z

Optimizations on top of cutting down the number of new transports we make:
For plugins that do return a cert, I'd be happy to cache that cert based on its expiry date (notAfter) - perhaps with a max TTL.
For plugins that return tokens, we could provide a way for the plugin to additionally assert a freshness period and / or maximum number of uses for the token.

The exec auth plugin can already do both of those things (can return expiration timestamps that avoid additional calls to the plugin to get new credentials). In this case, we weren't actually calling the authenticator multiple times, the same cached credentials were getting used with transports and talking to the server over multiple connections

liggitt · 2022-08-18T21:26:32Z

We need to update kubernetes/website#35981 to reflect the new situation.

thanks for pointing out the docs update needed

cici37 · 2022-08-19T02:54:53Z

We need to update kubernetes/website#35981 to reflect the new situation.

I have left a comment earlier in kubernetes/website#35981 (comment) and the update is done. The CHANGLOG update is covered in kubernetes/sig-release#1995. We still need a release note update. Since it was merged after rc1 cut, the release notes has to be added manually. I am working with release notes team on it: kubernetes/sig-release#1996 (comment)

atiratree · 2022-08-19T11:30:29Z

I can try to improve #108459 next week and make it safer for backports

enj · 2022-08-24T19:55:36Z

/assign

enj · 2022-08-24T20:05:43Z

I have opened #112017 to fix this bug. I believe the change is small and safe enough to backport.

liggitt added the kind/bug Categorizes issue or PR as related to a bug. label Aug 18, 2022

k8s-ci-robot added sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. labels Aug 18, 2022

k8s-ci-robot added this to the v1.25 milestone Aug 18, 2022

k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 18, 2022

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 18, 2022

k8s-ci-robot added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Aug 18, 2022

liggitt self-assigned this Aug 18, 2022

liggitt mentioned this issue Aug 18, 2022

Restore in-tree credential plugins for 1.25 #111918

Merged

liggitt added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed kind/regression Categorizes issue or PR as related to a regression from a prior release. labels Aug 18, 2022

liggitt modified the milestones: v1.25, v1.26 Aug 18, 2022

liggitt mentioned this issue Aug 18, 2022

share transport in kubectl #108459

Closed

liggitt mentioned this issue Aug 18, 2022

Add release announcement blog article for Kubernetes v1.25 kubernetes/website#35981

Merged

k8s-ci-robot assigned enj Aug 24, 2022

enj mentioned this issue Aug 24, 2022

exec auth: support TLS config caching #112017

Merged

reylejano mentioned this issue Aug 26, 2022

Document setting up cloud provider kubectl auth via plugin (AKS, GKE) kubernetes/website#35896

Closed

aojea mentioned this issue Aug 31, 2022

client-go: make transport cacheable using Interfaces instead of callback functions #112144

Closed

k8s-ci-robot closed this as completed in #112017 Sep 8, 2022

mjhilton mentioned this issue Sep 26, 2022

Update Kubernetes integration to support plugin-based architecture for vendor auth OctopusDeploy/Issues#7621

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using exec auth in kubectl triggers one connection per applied/described object #111911

Using exec auth in kubectl triggers one connection per applied/described object #111911

liggitt commented Aug 18, 2022 •

edited

liggitt commented Aug 18, 2022

liggitt commented Aug 18, 2022

liggitt commented Aug 18, 2022

enj commented Aug 18, 2022

liggitt commented Aug 18, 2022

liggitt commented Aug 18, 2022 •

edited

cici37 commented Aug 18, 2022

aojea commented Aug 18, 2022 •

edited

liggitt commented Aug 18, 2022

liggitt commented Aug 18, 2022 •

edited

dims commented Aug 18, 2022

liggitt commented Aug 18, 2022

cici37 commented Aug 18, 2022

enj commented Aug 18, 2022

liggitt commented Aug 18, 2022 •

edited

sftim commented Aug 18, 2022

liggitt commented Aug 18, 2022 •

edited

liggitt commented Aug 18, 2022

cici37 commented Aug 19, 2022

atiratree commented Aug 19, 2022

enj commented Aug 24, 2022

enj commented Aug 24, 2022

Using exec auth in kubectl triggers one connection per applied/described object #111911

Using exec auth in kubectl triggers one connection per applied/described object #111911

Comments

liggitt commented Aug 18, 2022 • edited

What happened?

What did you expect to happen?

How can we reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

Kubernetes version

Cloud provider

OS version

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

liggitt commented Aug 18, 2022

liggitt commented Aug 18, 2022

liggitt commented Aug 18, 2022

enj commented Aug 18, 2022

liggitt commented Aug 18, 2022

liggitt commented Aug 18, 2022 • edited

cici37 commented Aug 18, 2022

aojea commented Aug 18, 2022 • edited

liggitt commented Aug 18, 2022

liggitt commented Aug 18, 2022 • edited

dims commented Aug 18, 2022

liggitt commented Aug 18, 2022

cici37 commented Aug 18, 2022

enj commented Aug 18, 2022

liggitt commented Aug 18, 2022 • edited

sftim commented Aug 18, 2022

liggitt commented Aug 18, 2022 • edited

liggitt commented Aug 18, 2022

cici37 commented Aug 19, 2022

atiratree commented Aug 19, 2022

enj commented Aug 24, 2022

enj commented Aug 24, 2022

liggitt commented Aug 18, 2022 •

edited

liggitt commented Aug 18, 2022 •

edited

aojea commented Aug 18, 2022 •

edited

liggitt commented Aug 18, 2022 •

edited

liggitt commented Aug 18, 2022 •

edited

liggitt commented Aug 18, 2022 •

edited