Fix auto-allocated ports conflicting with explicitly specified ports on service create #93922

knight42
Member

@knight42 knight42 commented Aug 12, 2020

Signed-off-by: knight42 anonymousknight96@gmail.com

What type of PR is this?

/kind flake
/kind bug

What this PR does / why we need it:

I added the following debugging:

diff --git pkg/registry/core/service/storage/rest.go pkg/registry/core/service/storage/rest.go
index 0d5379f8c11..8095f386431 100644
--- pkg/registry/core/service/storage/rest.go
+++ pkg/registry/core/service/storage/rest.go
@@ -208,6 +208,8 @@ func (rs *REST) Create(ctx context.Context, obj runtime.Object, createValidation
        // Handle ExternalTraffic related fields during service creation.
        if apiservice.NeedsHealthCheck(service) {
                if err := allocateHealthCheckNodePort(service, nodePortOp); err != nil {
+                       spec := service.Spec
+                       fmt.Printf("NodePort: %d, HealthCheckPort: %d\n", spec.Ports[0].NodePort, spec.HealthCheckNodePort)
                        return nil, errors.NewInternalError(err)
                }
        }

and got:

WARNING: Package "github.com/golang/protobuf/protoc-gen-go/generator" is deprecated.
        A future release of golang/protobuf will delete this package,
        which has long been excluded from the compatibility promise.

I0812 17:52:19.904068   78356 client.go:360] parsed scheme: "endpoint"
I0812 17:52:19.904194   78356 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{unix://localhost:83760927833799529510  <nil> 0 <nil>}]
I0812 17:52:19.908507   78356 client.go:360] parsed scheme: "endpoint"
I0812 17:52:19.908588   78356 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{unix://localhost:83760927833799529510  <nil> 0 <nil>}]
I0812 17:52:19.909254   78356 once.go:66] CPU time info is unavailable on non-linux or appengine environment.
I0812 17:52:19.914674   78356 client.go:360] parsed scheme: "endpoint"
I0812 17:52:19.914761   78356 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{unix://localhost:83760927833799529510  <nil> 0 <nil>}]
I0812 17:52:19.918666   78356 client.go:360] parsed scheme: "endpoint"
I0812 17:52:19.918736   78356 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{unix://localhost:83760927833799529510  <nil> 0 <nil>}]

NodePort: 30773, HealthCheckPort: 30773  <-- Please notice this line

W0812 17:52:19.961603   78356 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix://localhost:83760927833799529510  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix localhost:83760927833799529510: connect: no such file or directory". Reconnecting...
W0812 17:52:19.961638   78356 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix://localhost:83760927833799529510  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix localhost:83760927833799529510: connect: no such file or directory". Reconnecting...
W0812 17:52:19.961971   78356 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix://localhost:83760927833799529510  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix localhost:83760927833799529510: connect: no such file or directory". Reconnecting...
--- FAIL: TestServiceRegistryExternalTrafficHealthCheckNodePortUserAllocation (1.98s)
    rest_test.go:2214: Unexpected failure creating service :Internal error occurred: failed to allocate requested HealthCheck NodePort 30773: provided port is already allocated
FAIL

ERROR: exit status 1

The log shows the root cause: the randomly generated health check NodePort happened to be the same as the Service NodePort allocated later.

The flake also uncovers a bug: when creating a NodePort or LoadBalancer Service, the apiserver could potentially allocate the same port as the user-specified HealthCheck NodePort for the Service NodePort.
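The collision can be reproduced with a toy allocator. This is a minimal sketch, not the apiserver's code: `portAllocator`, `autoFirst`, and `explicitFirst` are hypothetical names, and the allocator hands out the lowest free port instead of a random one so the outcome is deterministic. It assumes the order suggested by the error message above (the Service NodePort is auto-allocated before the requested HealthCheck NodePort is reserved) and shows that reserving explicitly requested ports first avoids the conflict. Whether the PR takes exactly this approach is not stated here.

```go
package main

import (
	"errors"
	"fmt"
)

// portAllocator is a hypothetical, simplified stand-in for a node port allocator.
type portAllocator struct {
	min, max int
	used     map[int]bool
}

func newPortAllocator(min, max int) *portAllocator {
	return &portAllocator{min: min, max: max, used: map[int]bool{}}
}

// allocate reserves a specific port and fails if it is out of range or taken.
func (a *portAllocator) allocate(port int) error {
	if port < a.min || port > a.max {
		return errors.New("port out of range")
	}
	if a.used[port] {
		return errors.New("provided port is already allocated")
	}
	a.used[port] = true
	return nil
}

// allocateNext reserves the lowest free port.
// Assumption: the real allocator picks randomly; lowest-first keeps this demo deterministic.
func (a *portAllocator) allocateNext() (int, error) {
	for p := a.min; p <= a.max; p++ {
		if !a.used[p] {
			a.used[p] = true
			return p, nil
		}
	}
	return 0, errors.New("range is full")
}

// autoFirst auto-allocates the Service NodePort, then tries to reserve the
// user-specified health check port. The second step can collide with the first.
func autoFirst(a *portAllocator, healthCheckPort int) (int, error) {
	nodePort, err := a.allocateNext()
	if err != nil {
		return 0, err
	}
	return nodePort, a.allocate(healthCheckPort)
}

// explicitFirst reserves the user-specified port before auto-allocating,
// so the auto-allocated port can never equal it.
func explicitFirst(a *portAllocator, healthCheckPort int) (int, error) {
	if err := a.allocate(healthCheckPort); err != nil {
		return 0, err
	}
	return a.allocateNext()
}

func main() {
	_, err := autoFirst(newPortAllocator(30000, 30002), 30000)
	fmt.Println("auto first:", err)

	nodePort, err := explicitFirst(newPortAllocator(30000, 30002), 30000)
	fmt.Println("explicit first:", nodePort, err)
}
```

With a random allocator the `autoFirst` ordering fails only occasionally, which matches the flaky behavior seen in the test.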

Which issue(s) this PR fixes:

xref: #93605 (comment)

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fix a bug where the apiserver could potentially allocate the same port as the user-specified HealthCheck NodePort for the Service NodePort when creating a `NodePort` or `LoadBalancer` Service.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 12, 2020
@knight42
Member Author

/cc @liggitt

@liggitt
Member

liggitt commented Aug 12, 2020

The log shows the root cause: the randomly generated health check NodePort happened to be the same as the Service NodePort allocated later.

nice catch

@knight42 knight42 force-pushed the fix/TestServiceRegistryExternalTrafficHealthCheckNodePortUserAllocation branch from 6de5593 to 1709887 Compare August 12, 2020 13:26
@knight42 knight42 requested a review from liggitt August 12, 2020 13:27
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Aug 12, 2020
@knight42
Member Author

@liggitt Now that this PR actually fixes a bug, is it OK to add a release note?

@knight42
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 12, 2020
@liggitt liggitt changed the title test(storage): deflake TestServiceRegistryExternalTrafficHealthCheckNodePortUserAllocation sig-storage: Fix auto-allocated ports conflicting with explicitly specified ports on service create Aug 13, 2020
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 14, 2020
@knight42
Member Author

knight42 commented Aug 14, 2020

@liggitt I think adding tests that describe our expected behavior is a good start, so I added some tests in 88d4926, and it turns out the problem you pointed out at #93922 (comment) actually exists.

@knight42 knight42 force-pushed the fix/TestServiceRegistryExternalTrafficHealthCheckNodePortUserAllocation branch from cd1dbe1 to 704b9d2 Compare August 14, 2020 12:39
@knight42 knight42 requested a review from liggitt August 14, 2020 12:43
@liggitt liggitt changed the title sig-storage: Fix auto-allocated ports conflicting with explicitly specified ports on service create sig-network: Fix auto-allocated ports conflicting with explicitly specified ports on service create Aug 14, 2020
@liggitt
Member

liggitt commented Aug 14, 2020

/assign @thockin @bowei
/sig network

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 4, 2020
@knight42 knight42 force-pushed the fix/TestServiceRegistryExternalTrafficHealthCheckNodePortUserAllocation branch from 9484aea to dfe481b Compare September 4, 2020 04:44
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 4, 2020
@knight42 knight42 force-pushed the fix/TestServiceRegistryExternalTrafficHealthCheckNodePortUserAllocation branch from dfe481b to ef97435 Compare September 4, 2020 04:49
Signed-off-by: knight42 <anonymousknight96@gmail.com>
@knight42 knight42 force-pushed the fix/TestServiceRegistryExternalTrafficHealthCheckNodePortUserAllocation branch from ef97435 to a19fc51 Compare September 4, 2020 04:50
@k8s-ci-robot k8s-ci-robot added area/code-generation area/kubelet kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API area/kubelet area/code-generation sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Sep 4, 2020

const count = 3
// The `PortAllocator` seems to allocate the first port randomly,
// so we run the test multiple times to ensure the NodePort is correctly allocated.
Member

The code change seems good, but I am not sure I understand this random + retry. Can you help me understand?

Member Author

The port allocator picks a port at random from the available ports. Without retries, a single run might happen to allocate a non-conflicting port, and the test would pass even though the bug is still present.
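A rough way to see why repeating the run helps: the sketch below estimates the chance that at least one run hits the conflicting port. It rests on assumptions not stated in the thread, namely that the allocator picks uniformly among `freePorts` candidates, exactly one of them conflicts, and runs are independent. `detectionProbability` is a hypothetical helper, not part of the test suite.

```go
package main

import (
	"fmt"
	"math"
)

// detectionProbability returns the chance that at least one of `runs`
// independent runs picks the single conflicting port, assuming a uniform
// random choice among `freePorts` candidates on every run.
func detectionProbability(freePorts, runs int) float64 {
	if freePorts <= 0 || runs <= 0 {
		return 0
	}
	// Probability that a single run misses the conflicting port.
	missOnce := 1 - 1/float64(freePorts)
	// All runs must miss for the bug to go undetected.
	return 1 - math.Pow(missOnce, float64(runs))
}

func main() {
	for _, runs := range []int{1, 3, 10} {
		fmt.Printf("freePorts=4 runs=%d -> %.3f\n", runs, detectionProbability(4, runs))
	}
}
```

Under these assumptions, more runs raise the chance of catching a regression but never make it certain, so a test that leaves only one free port for auto-allocation would be deterministic instead.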

@thockin thockin changed the title sig-network: Fix auto-allocated ports conflicting with explicitly specified ports on service create Fix auto-allocated ports conflicting with explicitly specified ports on service create Sep 19, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 18, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.


Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
8 participants