Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

informers: Don't treat relist same as sync #86015

Merged
merged 2 commits into from Jan 24, 2020

Conversation

squeed
Copy link
Contributor

@squeed squeed commented Dec 6, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:

Background:

Before this change, DeltaFIFO emits the Sync DeltaType on Resync() and Replace(). Seperately, the SharedInformer will only pass that event on to handlers that have a ResyncInterval and are due for Resync. This can cause updates to be lost if an object changes as part of the Replace(), as it may be incorrectly discarded if the handler does not want a Resync.

What this change does:

Creates a new DeltaType, Replaced, which is emitted by DeltaFIFO on Replace(). For backwards compatability concerns, the old behavior of always emitting Sync is preserved unless explicity overridden.

As a result, if an object changes (or is added) on Replace(), now all SharedInformer handlers will get a correct Add() or Update() notification.

One additional side-effect is that handlers which do not ever want Resyncs will now see them for all objects that have not changed during the Replace.

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Shared informers are now more reliable in the face of network disruption.

/cc @sttts
/cc @liggitt

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 6, 2019
@squeed
Copy link
Contributor Author

squeed commented Dec 6, 2019

Many thanks to @sttts, who helped me understand exactly where to look in the SharedInformer when faced with an "impossible" sequence of log lines.

@liggitt
Copy link
Member

liggitt commented Dec 6, 2019

Separately, there is a commit that elides re-List updates if RV and UID hasn't changed.

Can that be split to a separate PR? That's separate from the bug fix, correct? We've previously decided against teaching informers about uids, would like to consider that separately and not backport that.

@squeed
Copy link
Contributor Author

squeed commented Dec 9, 2019

Can that be split to a separate PR? That's separate from the bug fix, correct? We've previously decided against teaching informers about uids, would like to consider that separately and not backport that.

It's not separate to this bug fix; @sttts suggested that I split it out since it's not strictly necessary. There are two points for consideration:

  1. The larger bugfix is also changing existing behavior. Assume a handler without ResyncInterval doesn't want to get a Sync-style update (e.g. Update(A1, A1)) and only wants edges. However, if the client internally decides to do a re-List, then the handler will still see no-op updates. Now, I don't think there's any kind of contract that says you'll never get no-op updates, but it it was coded in to some of the tests.

  2. As to UIDs, I wasn't sure of the guarantees for ResourceVersion across delete-and-recreate. It's not explicitly mentioned in the documentation. If a new object can't have the same ResourceVersion (currently impossible with etcd, of course), then we can drop the UID.

Additionally, the elision only takes place for Replace events - so it's not a large change.

@liggitt
Copy link
Member

liggitt commented Dec 9, 2019

2. If a new object can't have the same ResourceVersion (currently impossible with etcd, of course), then we can drop the UID.

resourceVersion is unique per resource type (it has to be, since we use it for list requests across resource instances), so it cannot repeat on a delete/recreate

@liggitt
Copy link
Member

liggitt commented Dec 9, 2019

It's not separate to this bug fix; @sttts suggested that I split it out since it's not strictly necessary.

We should split it to a separate PR, as it needs more discussion, and even if accepted, I would not expect to backport that part.

@squeed
Copy link
Contributor Author

squeed commented Dec 9, 2019

resourceVersion is unique per resource type (it has to be, since we use it for list requests across resource instances), so it cannot repeat on a delete/recreate

Ack. Removed UID check; doesn't affect correctness.

@liggitt
Copy link
Member

liggitt commented Dec 9, 2019

fyi @smarterclayton @lavalamp

@squeed
Copy link
Contributor Author

squeed commented Dec 9, 2019

We should split it to a separate PR, as it needs more discussion, and even if accepted, I would not expect to backport that part.

Understood. The open question to me is: is it OK to send resync updates to handlers that don't want them? If so, then we don't need the last commit. Otherwise we do.

@lavalamp
Copy link
Member

lavalamp commented Dec 9, 2019

I think it is probably good to make the distinction, but this will be disruptive to users, potentially very disruptive.

Therefore, I think you need to wire an option or provide a 2nd constructor or something, so that existing code can continue to work the same with no changes.

@squeed
Copy link
Contributor Author

squeed commented Dec 9, 2019

I think it is probably good to make the distinction, but this will be disruptive to users, potentially very disruptive.

Therefore, I think you need to wire an option or provide a 2nd constructor or something, so that existing code can continue to work the same with no changes.

@lavalamp I don't quite follow - what sort of option were you thinking of?

To be clear here, this PR doesn't disable resyncs or change the standard informer behavior at all. It just ensures we don't miss updates due to a re-List. AFAICT, the question is whether or not we should expose the re-List due to disruption to the handlers.

@lavalamp
Copy link
Member

lavalamp commented Dec 9, 2019

doesn't disable resyncs or change the standard informer behavior at all

There can be direct clients of this.

The fact that we can make corresponding changes in the informer to preserve its behavior makes your life much easier though.

@squeed
Copy link
Contributor Author

squeed commented Dec 9, 2019

As an alternative to adding a new Delta type, we could "fix" this in the SharedInformer by cheating a bit, and treating Sync as Updated if the object differs from the store.

@liggitt
Copy link
Member

liggitt commented Jan 20, 2020

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 20, 2020
@liggitt
Copy link
Member

liggitt commented Jan 20, 2020

unit test failure is legitimate:

$ go test ./vendor/k8s.io/client-go/tools/cache
# k8s.io/client-go/tools/cache [k8s.io/client-go/tools/cache.test]
staging/src/k8s.io/client-go/tools/cache/delta_fifo_test.go:294:3: undefined: keyLookupFunc
staging/src/k8s.io/client-go/tools/cache/delta_fifo_test.go:410:17: undefined: keyLookupFunc
staging/src/k8s.io/client-go/tools/cache/delta_fifo_test.go:440:17: undefined: keyLookupFunc

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 20, 2020
@liggitt
Copy link
Member

liggitt commented Jan 20, 2020

/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 20, 2020
@squeed
Copy link
Contributor Author

squeed commented Jan 20, 2020

whoops, fixing

@squeed
Copy link
Contributor Author

squeed commented Jan 20, 2020

/retest

1 similar comment
@liggitt
Copy link
Member

liggitt commented Jan 21, 2020

/retest

@lavalamp
Copy link
Member

Overall, great change-- I hate to ask for something but the name is unchangeable once we cut a client-go release, so...

Background:

Before this change, DeltaFIFO emits the Sync DeltaType on Resync() and
Replace(). Seperately, the SharedInformer will only pass that event
on to handlers that have a ResyncInterval and are due for Resync. This
can cause updates to be lost if an object changes as part of the Replace(),
as it may be incorrectly discarded if the handler does not want a Resync.

What this change does:

Creates a new DeltaType, Replaced, which is emitted by DeltaFIFO on
Replace(). For backwards compatability concerns, the old behavior of
always emitting Sync is preserved unless explicity overridden.

As a result, if an object changes (or is added) on Replace(), now all
SharedInformer handlers will get a correct Add() or Update()
notification.

One additional side-effect is that handlers which do not ever want
Resyncs will now see them for all objects that have not changed during
the Replace.
@squeed
Copy link
Contributor Author

squeed commented Jan 23, 2020

OK, rename done.

@liggitt
Copy link
Member

liggitt commented Jan 23, 2020

/lgtm
/approve
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 23, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, squeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

2 similar comments
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants