
QoS tiers #147

Closed
bgrant0607 opened this issue Jun 18, 2014 · 38 comments
Labels
area/isolation, priority/backlog, sig/scheduling

Comments

@bgrant0607
Member

We have described QoS tiers in several presentations and the necessary underlying mechanisms are supported by lmctfy:
http://www.slideshare.net/dotCloud/eric-brewer-dockercon-keynote
http://www.slideshare.net/vmarmol/containers-google
http://www.linuxplumbersconf.org/2011/ocw/system/presentations/705/original/Control_Loops_In_Userspace.pdf
https://github.com/google/lmctfy

Borg: https://research.google.com/pubs/pub43438.html

We should expose QoS tiers through the Kubernetes pod API.

@vmarmol
Contributor

vmarmol commented Jun 18, 2014

Since we use Docker to start containers, we'll need support there as well. @rjnagal, @vishh, and I have been working toward that. There should already be some support for this, and more will land in the near future.

@smarterclayton
Contributor

Would QoS tiers be specified via labeling, or are they orthogonal to labeling (assuming orthogonal given the presentations, just want to make sure)? Do you see the need for feedback from a minion about its current resource load to the scheduler, and if so, is this a backchannel (a formal event bus) concept that is part of Kubernetes or encapsulated behind the scheduler interface?

@thockin
Member

thockin commented Jun 20, 2014

Caveat, still coming up to speed on everything, but that won't stop me from having an opinion.

QoS and administrative or config information should never be labels. Labels are the domain of the user, not the system.


@bgrant0607
Member Author

@smarterclayton: I agree with @thockin. QoS specifications need to be first-class fields in our API, together with resource specifications (#168).

For one possible example of what this could look like, see lmctfy:
https://github.com/google/lmctfy/blob/master/include/lmctfy.proto

@timothysc
Member

How do QoS tiers relate to resource requests?

@bgrant0607
Member Author

@timothysc

At the node level, QoS would be specified together with resource requests to the execution layer, such as an exec driver over lmctfy (which might eventually be layered on libcontainer).

At the scheduling level, a variety of policies are possible. We could start with the naive approach of not considering QoS in scheduling. Eventually we could support some kind of overcommitment for lower QoS tiers.
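As a rough sketch of the naive scheduling approach described above (all names here are illustrative assumptions, not the actual scheduler API): the fit check counts only requests against node capacity, so anything a container may use above its request is implicit overcommitment absorbed by lower QoS tiers.

```go
// Illustrative sketch only; the types and the fit rule are assumptions,
// not the real Kubernetes scheduler interfaces.
type Resources struct {
	RequestCPU, LimitCPU       int64 // millicores
	RequestMemory, LimitMemory int64 // bytes
}

type Node struct {
	CapacityCPU, CapacityMemory   int64
	AllocatedCPU, AllocatedMemory int64 // sum of requests already scheduled
}

// fits ignores QoS entirely: only requests count against capacity, so a
// container with limit > request overcommits the node by the difference.
func fits(n Node, r Resources) bool {
	return n.AllocatedCPU+r.RequestCPU <= n.CapacityCPU &&
		n.AllocatedMemory+r.RequestMemory <= n.CapacityMemory
}
```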

@bgrant0607 added the area/isolation and sig/scheduling labels Sep 30, 2014
@bgrant0607 added the priority/awaiting-more-evidence label Dec 3, 2014
@bgrant0607
Member Author

Another example where we'd like differentiated QoS:

Pods/containers with specified memory limits should receive preferential treatment compared to "best effort" pods/containers that don't, such as by setting their OOM adjust scores.

/cc @dchen1107
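A minimal sketch of the kind of preferential treatment described above; the specific oom_score_adj values are assumptions for illustration, not a settled kubelet policy:

```go
// oomScoreAdjFor sketches the idea above: containers with an explicit
// memory limit are protected (killed last under memory pressure), while
// best-effort containers become the first OOM-kill candidates.
// The concrete values are illustrative assumptions.
func oomScoreAdjFor(memoryLimitBytes int64) int {
	if memoryLimitBytes > 0 {
		return -900 // has a memory limit: deprioritize for OOM kill
	}
	return 1000 // best effort: prefer to kill first
}
```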

@bgrant0607
Member Author

Current thinking for basic tiers:

  • Support specification of limit (max allowed) and request (min required for scheduling)
  • If limit==request, then it's highest priority
  • If limit > request, then middle priority (main reason to need this is for vertical auto-sizing)
  • If no limit or request, then lowest priority
  • Suggested defaulting is described in resources.md
  • High-priority pods/containers would get reserved resources.
  • Lower-priority pods/containers would be put into cgroups with the resources reserved by the higher-priority ones subtracted, in addition to nested cgroups with their specified limits (using the proposed parent cgroup Docker feature)
  • Containers specifying no cpu would get a tiny number of shares

We could also use core allocation, oom score adjustment, and other mechanisms to strengthen this basic protection.
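A sketch of how a container's tier could be derived from its request and limit under the rules above (the names and the handling of a request without a limit are assumptions; defaulting per resources.md would fill in unspecified values first):

```go
type Tier int

const (
	BestEffort Tier = iota // no limit or request: lowest priority
	Burstable              // limit > request: middle priority
	Guaranteed             // limit == request: highest priority
)

// tierFor applies the rules above to a single resource.
// Zero means "unspecified" in this sketch.
func tierFor(request, limit int64) Tier {
	switch {
	case request == 0 && limit == 0:
		return BestEffort
	case limit == request:
		return Guaranteed
	default:
		return Burstable
	}
}
```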

@erictune
Member

erictune commented Mar 3, 2015

@bgrant0607 @derekwaynecarr would there be quota at each tier?

@bgrant0607
Member Author

@erictune Yes.

@erictune
Member

erictune commented Mar 3, 2015

We will need to name the tiers/priorities so they can be referred to in the quota file.

@derekwaynecarr
Member

Is this post-1.0? I want to know if I need to reserve time to handle quota-related fallout.

@bgrant0607
Member Author

Yes, post-1.0.

@bgrant0607
Member Author

--cgroup_parent is in: moby/moby#11428

@rjnagal
Contributor

rjnagal commented Mar 19, 2015

Shall we go ahead with changes to differentiate between jobs with and without limits? Essentially, introducing tier-2 guarantees for jobs with unspecified limits?

@vishh
Contributor

vishh commented Mar 20, 2015

+1 for having a limit for all jobs without limits. We need to reserve resources for the system daemons as well.


@rjnagal
Contributor

rjnagal commented Mar 26, 2015

Discussed offline. The minimal protection for sys daemons we have now is good enough for v1.0. We'll take up the rest post 1.0.

@AnanyaKumar
Contributor

To clarify, this is a very preliminary proposal that I'm drafting - I wanted to get some feedback. I'll update after syncing with more experienced people :)

From what I understand, we don't support requests right now; we only support limits (basically request == limit). Limits will be guaranteed for top-tier pods and won't be guaranteed for best-effort pods. There will be over-commitment in the new model; however, over-commitment will be invisible to top-tier pods that specify limits.

Points 2 and 3 that you mention are in line with what I'm proposing, but I expect 3 might be controversial for incompressible resources. However, I don't see a better way to do this unless we preempt with numerical priorities (which Brian seems opposed to).

@chakri-nelluri
Contributor

Sounds good. Will wait for the proposal.

@bgrant0607
Member Author

Proposal is in #11713.

@bgrant0607
Member Author

Note that this issue is about performance/resource QoS, NOT availability/reliability/disruption QoS. They are related, but not necessarily coupled.

@davidopp added the priority/backlog label and removed the priority/awaiting-more-evidence label Jul 23, 2015
@erictune
Member

@NELCY @AnanyaKumar
I agree it should be possible to handle the scenario that @NELCY describes.

However, I don't think it is necessary to introduce tiers/priorities as a first-class concept to handle the scenario @NELCY mentions. You could also handle that scenario using vertical pod autoscaling.

With vertical pod autoscaling, you initially create a pod with some request and limit, and then a control loop periodically looks at each pod it is supposed to manage and decides whether to increase, decrease, or leave alone the request and limit of the pod. A typical vertical autoscaling policy would be to look at historical usage, take, say, the 95th percentile, add some padding to that, and call that the new request and limit.
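A sketch of that percentile-plus-padding policy; the 95th percentile comes from the comment above, while the 10% padding and everything else are illustrative assumptions:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// recommend returns a new request/limit from historical usage samples:
// take the 95th percentile and add some padding (10% here, as an
// illustrative choice), per the policy sketched above.
func recommend(usage []float64) float64 {
	if len(usage) == 0 {
		return 0
	}
	sorted := append([]float64(nil), usage...)
	sort.Float64s(sorted)
	idx := int(math.Ceil(0.95*float64(len(sorted)))) - 1
	return sorted[idx] * 1.10
}

func main() {
	history := []float64{120, 150, 180, 200, 400} // observed CPU millicores
	fmt.Printf("new request/limit: %.0f millicores\n", recommend(history))
}
```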

The case @NELCY mentions could be implemented with the following autoscaling rule:

  • if pod has label type=web then try to increase limit and request by 10%
  • keep doing this until all pods with type=web have consumed all free resources on the nodes they occupy.

I think it is better to keep the kubelet's and scheduler's view of resources as simple as possible, and to layer more complex policies as control systems on top of those. This approach has several advantages:

  • it keeps the complexity of the scheduler and kubelet manageable.
  • with less policy in the kubelet, it is able to work well on a wider variety of kernels/platforms.
  • If we hardcode tiers/priorities into a place in the pod spec that everyone uses, it is going to be difficult to evolve, because any change will affect all Kubernetes users when they try to upgrade the kubelet and/or scheduler.
  • However, if priorities/tiers are expressed via labels or annotations, or custom fields in the pod spec, then those can evolve separately. Different clusters can have different definitions of tiers. I am certain different people will want different, incompatible types of tiers.

@chakri-nelluri
Contributor

@AnanyaKumar
Thanks for the proposal. We included burstable pods, and that should cover the use case I was concerned about. I will spend some more time and send you review comments, if any. At first look, the proposal is very good.

@erictune I like the idea of a control loop on top, but it doesn't always give us completely elastic behavior. In my previous experience with control loops, there was always a debate about how frequently to run the control loop, especially for compressible resources. But I agree it has several advantages and gives us greater flexibility. For now, I am happy with support for burstable pods. Thanks.

@bgrant0607
Member Author

Superseded by more recent issues.

@rafis

rafis commented May 15, 2019

Superseded by more recent issues.

Kubernetes v1.12.7. The problem still exists. Pods are randomly dying. I call it the "Happy New Year Tree" effect in a Kubernetes cluster. Which recent issue should I look at?
