Job controller: Updates might override stale data #105199

Closed
alculquicondor opened this issue Sep 22, 2021 · 2 comments · Fixed by #105214
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@alculquicondor
Member

alculquicondor commented Sep 22, 2021

When investigating #105179, @liggitt and I discovered that the job controller issues a GET request for the Job before every Job status update.

func (jm *Controller) updateJobStatus(job *batch.Job) error {
	jobClient := jm.kubeClient.BatchV1().Jobs(job.Namespace)
	var err error
	for i := 0; i <= statusUpdateRetries; i = i + 1 {
		var newJob *batch.Job
		newJob, err = jobClient.Get(context.TODO(), job.Name, metav1.GetOptions{})
		if err != nil {
			break
		}
		newJob.Status = job.Status
		if _, err = jobClient.UpdateStatus(context.TODO(), newJob, metav1.UpdateOptions{}); err == nil {
			break
		}
	}
	return err
}

This is problematic because it can mask incompatibilities between the state the job sync worked from and the latest state of the Job. In particular, it can leave stale UIDs or counters in the status when tracking job status with finalizers.
It might not have been a problem in the past because the job controller would always recompute status from zero. However, when tracking with finalizers, the existing status is part of the input to the sync.

The solution is to skip the Job GET and let the sync fail in case of conflict. A conflict implies that the Job was updated in the meantime, and that update has already put the Job back in the workqueue.
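To illustrate the difference, here is a minimal sketch that simulates both update patterns against an in-memory store. It does not use client-go: `job`, `fakeAPI`, `updateWithGet`, `updateDirect`, and `runScenario` are hypothetical names invented for this example, `succeeded` stands in for a status counter, and the integer `resourceVersion` check only approximates the API server's optimistic concurrency control.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's 409 Conflict response.
var errConflict = errors.New("conflict: resourceVersion is stale")

// job is a hypothetical, stripped-down stand-in for a batch/v1 Job.
type job struct {
	resourceVersion int
	succeeded       int // stand-in for a status counter
}

// fakeAPI is an in-memory stand-in for the API server (an assumption of this sketch).
type fakeAPI struct {
	stored job
}

func (a *fakeAPI) get() job { return a.stored }

// updateStatus rejects writes made against an outdated resourceVersion.
func (a *fakeAPI) updateStatus(j job) (job, error) {
	if j.resourceVersion != a.stored.resourceVersion {
		return job{}, errConflict
	}
	a.stored.succeeded = j.succeeded
	a.stored.resourceVersion++
	return a.stored, nil
}

// updateWithGet mirrors the pattern quoted in the issue: fetch the latest
// object, overwrite its status, then update. The fresh resourceVersion means
// the write succeeds even though the status was computed from stale input.
func updateWithGet(a *fakeAPI, computed job) error {
	latest := a.get()
	latest.succeeded = computed.succeeded
	_, err := a.updateStatus(latest)
	return err
}

// updateDirect sketches the proposed fix: submit the object the sync worked
// from, so a stale resourceVersion surfaces as a conflict.
func updateDirect(a *fakeAPI, computed job) error {
	_, err := a.updateStatus(computed)
	return err
}

// runScenario lets a concurrent write land between the controller's read and
// its status update, then returns the stored counter and the update error.
func runScenario(update func(*fakeAPI, job) error) (int, error) {
	api := &fakeAPI{stored: job{resourceVersion: 1, succeeded: 1}}
	cached := api.get() // the controller's view at the start of the sync

	concurrent := api.get()
	concurrent.succeeded = 3
	if _, err := api.updateStatus(concurrent); err != nil {
		return 0, fmt.Errorf("unexpected error in concurrent write: %w", err)
	}

	cached.succeeded++ // status computed from the stale view: 1 + 1 = 2
	err := update(api, cached)
	return api.get().succeeded, err
}
```

In this sketch, `runScenario(updateWithGet)` returns no error and leaves the stored counter at 2, silently discarding the concurrent value of 3. `runScenario(updateDirect)` returns `errConflict` and leaves the stored counter at 3, which is the point where a real controller would drop the sync and wait for the requeued Job.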

/sig apps

@alculquicondor alculquicondor added the kind/bug Categorizes issue or PR as related to a bug. label Sep 22, 2021
@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Sep 22, 2021
@k8s-ci-robot
Contributor

@alculquicondor: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 22, 2021
@alculquicondor
Member Author

/assign
