
Validation Tightening is not Generally Possible #64841

Open · lavalamp opened this issue Jun 6, 2018 · 37 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. wg/api-expression Categorizes an issue or PR as relevant to WG API Expression.

Comments

@lavalamp (Member) commented Jun 6, 2018

I have made this comment on a few PRs so maybe it's time to document and/or build a mechanism.

You cannot generally make validation tighter, because any existing objects that violate the new rule will no longer be able to be updated, and in many cases cannot even be deleted (because deletions involving finalizers require a sequence of updates).

This is because, to validate updates, we validate BOTH the transition, AND the final state of the object. It's this last part that is problematic.

It should be possible to address this by adding a new class of validation function that ONLY gets called on new objects, but then we have these problems:

  • When is it safe to promote the logic into the main validation function? There's no way to know.
  • This solves the problem for the apiserver, but not for any clients that were sending invalid objects. Automated clients in particular will suddenly, and without recourse, be quite broken.

Even at version boundaries it's hard to tighten validation, since objects have to be round-trippable and there are still old objects in the system.

So, new API authors: start out with restrictive validation; it's easier to relax than to tighten.

Existing API authors: prepare to write defensive clients.

/kind bug
/sig api-machinery

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jun 6, 2018
@liggitt (Member) commented Jun 6, 2018

A methodical way to do ratcheting validation would be ideal (sketched in code after the list):

  • apply tightened validation on create
  • apply tightened validation on update where the old object passed this validation (don't let a valid object become invalid)
  • skip tightened validation on update where the old object fails this validation
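
A minimal sketch of that ratcheting pattern, assuming a hypothetical helper wrapped around a tightened per-object check; the helper, the Widget type, and validateFn are illustrative only, not actual Kubernetes API machinery:

package validation

import "k8s.io/apimachinery/pkg/util/validation/field"

// Widget stands in for any API object; validateFn is a tightened rule.
type Widget struct {
	Name string
}

type validateFn func(obj *Widget, path *field.Path) field.ErrorList

// validateRatcheted applies a tightened rule only where it cannot strand
// existing objects: always on create, and on update only when the old object
// already satisfied the rule, so a valid object can never become invalid and
// an already-invalid object can still be updated (and deleted via finalizer
// updates).
func validateRatcheted(newObj, oldObj *Widget, tightened validateFn, path *field.Path) field.ErrorList {
	if oldObj != nil && len(tightened(oldObj, path)) > 0 {
		// Stored object already violates the tightened rule; skip it.
		return nil
	}
	return tightened(newObj, path)
}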

@lavalamp (Member, Author) commented Jun 6, 2018

Yes, that's the mechanism I had in mind when I wrote the two bullet points in the original post.

(also thanks for the extensive list of cross references)

@jennybuckley commented

/cc @mbohlool

@liggitt (Member) commented Jun 8, 2018

/assign

@liggitt (Member) commented Jun 8, 2018

WIP in #64907

@tallclair (Member) commented

xref #64532

@bgrant0607 (Member) commented

Sorry for not noticing this earlier.

Tightening validation can break clients. I don't see much of a way around that. Special-casing updates protects the apiserver but not clients and users.

It seems like the only recourse would be to downgrade to the prior release until the clients can be fixed. And downgrade is known to not work in the general case.

Maybe we need an advisory mode that also applies to create, so that users can debug problems before they irreversibly hose their clusters?

@bgrant0607 (Member) commented

https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#backward-compatibility-gotchas

Changing any validation rules always has the potential of breaking some client, since it changes the assumptions about part of the API, similar to adding new enum values. Validation rules on spec fields can neither be relaxed nor strengthened. Strengthening cannot be permitted because any requests that previously worked must continue to work. Weakening validation has the potential to break other consumers and generators of the API resource. Status fields whose writers are under our control (e.g., written by non-pluggable controllers), may potentially tighten validation, since that would cause a subset of previously valid values to be observable by clients.

@bgrant0607 (Member) commented

The end of that should be: "to no longer be observable by clients"

@bgrant0607 (Member) commented

The list of cross references was useful.

We should document how to handle common cases, such as duplicate list entries.

It looks like maybe a few of them wouldn't have worked, but with asynchronous failures rather than synchronous ones.

However, it's possible for cases we didn't intend to be supported, such as multiple instances of the same environment variable, to be working for someone.

@bgrant0607 (Member) commented

Other thoughts:

We also need to demand validation tests that fail for every field that doesn't accept all values of the base type.

This doesn't handle cases where contributors introduce new validation checks outside the ratcheting mechanism. What are we trying to defend against?

The cluster admin will want to know about any potential problems. We need to introduce exported variables that surface the affected resource types.
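
As a sketch only, those "exported variables" could be a counter the apiserver increments whenever a tightened rule is skipped for an existing object; the metric name, labels, and registration point below are assumptions for illustration, not anything that exists today:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// ratchetedValidationSkippedTotal counts updates that bypassed a tightened
// validation rule because the stored object already violated it, keyed by
// resource and rule so an admin can see which objects need attention.
var ratchetedValidationSkippedTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "apiserver_ratcheted_validation_skipped_total",
		Help: "Updates that bypassed a tightened validation rule because the existing object already violated it.",
	},
	[]string{"resource", "rule"},
)

func init() {
	prometheus.MustRegister(ratchetedValidationSkippedTotal)
}

// RecordSkippedRule is what a ratcheting helper would call when it skips a rule.
func RecordSkippedRule(resource, rule string) {
	ratchetedValidationSkippedTotal.WithLabelValues(resource, rule).Inc()
}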

@liggitt (Member) commented Jul 17, 2018

Tightening validation can break clients. I don't see much of a way around that. Special-casing updates protects the apiserver but not clients and users.

What are we protecting clients and users from? My intent with these was to only address validation gaps that were allowing in fatally flawed objects.

It seems like the only recourse would be to downgrade to the prior release until the clients can be fixed.

Is it a great loss to block a client that is submitting deployments/pods that cannot succeed? Is the concern that such a client would not back off and would perpetually attempt to create the objects? Doesn't that risk already exist with dynamic reasons for denial (quota, authorization, webhook admission)?

any requests that previously worked must continue to work

What does that mean for an object which, if created, cannot work properly, and will inevitably fail asynchronously (sometimes taking a long time to do so, as in the case of the duplicate PVC volume refs)? The fixes in the linked PR were intended to resolve problems of that variety.

The only one I could see being questionable is the duplicate envvar fix. That seemed valuable to prevent because it is known to cause data loss when using apply to try to remove the extra envvar on an update, but I can pull that one out if we'd rather think of a different way to address it (since a pod with that duplication can still run successfully).

@bgrant0607 (Member) commented

Added to the SIG Arch backlog: https://github.com/kubernetes-sigs/architecture-tracking/projects/3

@bgrant0607 (Member) commented

@liggitt I agree with respect to requests that were fatally flawed. However, we haven't always been correct when trying to identify such cases in the past.

@bgrant0607 (Member) commented

For example, it looks like unknown emptyDir medium values succeed:

	if buf.Type == linuxTmpfsMagic {
		return v1.StorageMediumMemory, !notMnt, nil
	} else if int64(buf.Type) == linuxHugetlbfsMagic {
		return v1.StorageMediumHugePages, !notMnt, nil
	}
	return v1.StorageMediumDefault, !notMnt, nil

@liggitt (Member) commented Jul 17, 2018

It looks like unknown emptyDir medium values succeed

If that's the case, I agree it shouldn't be tightened. I was going off of #52936 (comment)

@bgrant0607 (Member) commented

Sorry, that's not the right check.

So, maybe not:

switch ed.medium {

But, anyway, my point is that sometimes these can unintentionally succeed.

@liggitt (Member) commented Jul 17, 2018

Yeah, point taken. I put a hold on the PR until we can think through whether this approach is a net positive. If we end up moving forward with it, we'll need to be really clear about the types of scenarios where it is an appropriate tool to use.

@bgrant0607 (Member) commented

@liggitt One approach could be to write tests that demonstrate asynchronous failure and commit them before enabling new validation.

I also think we need to start requiring tests for issues such as duplicate entries in associative lists ASAP.

@tallclair (Member) commented

Along those lines, I actually don't think the integer memory requirement should be included. I can successfully start a pod with fractional memory requirements, which just get rounded when necessary.

@apelisse (Member) commented

Would it be possible to implement tighter validation as a validating webhook that the cluster admin can decide to enable or disable? There could even be a way (an annotation?) to ignore specific objects known to be broken.
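
A rough sketch of that idea for one tightened check (duplicate container env var names); the handler wiring, the opt-out annotation name, and the serving setup are illustrative assumptions, not an existing component:

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// serveValidate rejects Pods with duplicate env var names unless the Pod
// carries the (hypothetical) opt-out annotation. A cluster admin turns the
// whole check on or off by (un)registering the ValidatingWebhookConfiguration.
func serveValidate(w http.ResponseWriter, r *http.Request) {
	body, _ := io.ReadAll(r.Body)
	var review admissionv1.AdmissionReview
	if err := json.Unmarshal(body, &review); err != nil || review.Request == nil {
		http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
		return
	}

	resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
	var pod corev1.Pod
	if err := json.Unmarshal(review.Request.Object.Raw, &pod); err == nil &&
		pod.Annotations["example.com/skip-tightened-validation"] != "true" {
		if msg := findDuplicateEnvVar(&pod); msg != "" {
			resp.Allowed = false
			resp.Result = &metav1.Status{Message: msg}
		}
	}

	review.Response = resp
	out, _ := json.Marshal(review)
	w.Header().Set("Content-Type", "application/json")
	w.Write(out)
}

// findDuplicateEnvVar returns a description of the first duplicated env var
// name in any container, or "" if there are none.
func findDuplicateEnvVar(pod *corev1.Pod) string {
	for _, c := range pod.Spec.Containers {
		seen := map[string]bool{}
		for _, e := range c.Env {
			if seen[e.Name] {
				return fmt.Sprintf("container %q declares env var %q more than once", c.Name, e.Name)
			}
			seen[e.Name] = true
		}
	}
	return ""
}

func main() {
	http.HandleFunc("/validate", serveValidate)
	// A real webhook must serve TLS; plain HTTP keeps the sketch short.
	http.ListenAndServe(":8443", nil)
}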

@liggitt (Member) commented Jul 17, 2018

One approach could be to write tests that demonstrate asynchronous failure and commit them before enabling new validation.

Good idea.

I also think we need to start requiring tests for issues such as duplicate entries in associative lists ASAP.

Yes. Validation in general tends to get little attention in new API PRs.

@lavalamp (Member, Author) commented

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md#backward-compatibility-gotchas

There's another way validation changes can break clients: by permitting formerly disallowed field contents, e.g. a new character in a name. Clients that depended on the former limited expressiveness could be broken. This is similar to the problem that arises when one adds an enum value.

Detecting this in a fully general way is going to be hard, if not impossible, because the error happens after the client has done a read. The error may actually be very subtle and not cause a crash (imagine allowing / in some field, which then leads to a client having a path traversal vulnerability).

@liggitt (Member) commented Mar 28, 2019

There's another way validation changes can break clients. By e.g. permitting formerly disallowed field contents, e.g. a new character in a name. Clients that depended on the former limited expressiveness could be broken.

Correct. #63362 (comment) lists the types of questions we need to ask in cases like that.

This is similar to the problem that obtains when one adds an enum value.

At least with enums, we have the option (when defining the field for the first time) to specify what a client should do if it encounters an unknown enum value.

@lavalamp (Member, Author) commented

At least with enums, we have the option (when defining the field for the first time) to specify what a client should do if it encounters an unknown enum value.

Clients which ignore suggestions like that will still exist, so it doesn't seem like a solution to me.

@apelisse (Member) commented

we have the option (when defining the field for the first time) to specify what a client should do if it encounters an unknown enum value.

Yeah, it feels more like it's up to the client to decide what to do with unknown enum values. At least with enums, clients can anticipate that new values may appear in the future and plan accordingly. For value validation, there isn't much they can do.
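
For illustration, this is the kind of defensive handling a client can plan for with an enum-like field but not with general value validation; the field name and values below are generic placeholders:

package main

import "fmt"

// Medium is a string-backed "enum" as typically modeled in Kubernetes-style
// APIs; the server may someday return values this client has never seen.
type Medium string

const (
	MediumDefault Medium = ""
	MediumMemory  Medium = "Memory"
)

// describeMedium handles unknown values explicitly: the default branch keeps
// the client working (perhaps degraded) instead of misbehaving.
func describeMedium(m Medium) string {
	switch m {
	case MediumDefault:
		return "node default storage"
	case MediumMemory:
		return "tmpfs-backed storage"
	default:
		return fmt.Sprintf("unrecognized medium %q, treating conservatively", m)
	}
}

func main() {
	fmt.Println(describeMedium("HugePages")) // a value unknown to this client
}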

@khenidak (Contributor) commented

We don't currently validate that loadBalancerIP, loadBalancerSourceRanges, and externalIPs match the IP family of the cluster, and we do have users with resources that exhibit this broken behavior. This comment is for tracking only; actions will be planned later. @liggitt @thockin

@lavalamp (Member, Author) commented

  • This is still important.
  • Server-side apply in particular is probably going to drive some schema changes that will need this.
  • My comment above (#64841 (comment)), plus the following clarifications, is still how I want to deal with this.
  • Validation relaxation actually needs to go through this process in reverse, to preserve downgradability.

@julianvmodesto (Contributor) commented

/wg apply
