
QoS tiers #147

Closed
bgrant0607 opened this issue Jun 18, 2014 · 38 comments
Labels
area/isolation, priority/backlog, sig/scheduling

Comments

@bgrant0607
Member

We have described QoS tiers in several presentations and the necessary underlying mechanisms are supported by lmctfy:
http://www.slideshare.net/dotCloud/eric-brewer-dockercon-keynote
http://www.slideshare.net/vmarmol/containers-google
http://www.linuxplumbersconf.org/2011/ocw/system/presentations/705/original/Control_Loops_In_Userspace.pdf
https://github.com/google/lmctfy

Borg: https://research.google.com/pubs/pub43438.html

We should expose QoS tiers through the Kubernetes pod API.

@vmarmol
Contributor

vmarmol commented Jun 18, 2014

Since we use Docker to start containers, we'll need support there as well. @rjnagal, @vishh, and I have been working toward that. There should already be some support for this, and more will land in the near future.

@smarterclayton
Contributor

Would QoS tiers be specified via labeling, or are they orthogonal to labeling (assuming orthogonal given the presentations, just want to make sure)? Do you see the need for feedback from a minion about its current resource load to the scheduler, and if so, is this a backchannel (a formal event bus) concept that is part of Kubernetes or encapsulated behind the scheduler interface?

@thockin
Member

thockin commented Jun 20, 2014

Caveat, still coming up to speed on everything, but that won't stop me from having an opinion.

QoS and administrative or config information should never be labels. Labels are the domain of the user, not the system.


@bgrant0607
Member Author

@smarterclayton: I agree with @thockin. QoS specifications need to be first-class fields in our API, together with resource specifications (#168).

For one possible example of what this could look like, see lmctfy:
https://github.com/google/lmctfy/blob/master/include/lmctfy.proto

@timothysc
Member

How do QoS tiers relate to resource requests?

@bgrant0607
Member Author

@timothysc

At the node level, QoS would be specified together with resource requests to the execution layer, such as an exec driver over lmctfy (which might eventually be layered on libcontainer).

At the scheduling level, a variety of policies are possible. We could start with the naive approach of not considering QoS in scheduling. Eventually we could support some kind of overcommitment for lower QoS tiers.
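As a rough sketch of the naive scheduling approach described above (all names here are illustrative assumptions, not the actual scheduler API): the fit check counts only requests against node capacity, so anything a container may use above its request is implicit overcommitment absorbed by lower QoS tiers.

```go
// Illustrative sketch only; the types and the fit rule are assumptions,
// not the real Kubernetes scheduler interfaces.
type Resources struct {
	RequestCPU, LimitCPU       int64 // millicores
	RequestMemory, LimitMemory int64 // bytes
}

type Node struct {
	CapacityCPU, CapacityMemory   int64
	AllocatedCPU, AllocatedMemory int64 // sum of requests already scheduled
}

// fits ignores QoS entirely: only requests count against capacity, so a
// container with limit > request overcommits the node by the difference.
func fits(n Node, r Resources) bool {
	return n.AllocatedCPU+r.RequestCPU <= n.CapacityCPU &&
		n.AllocatedMemory+r.RequestMemory <= n.CapacityMemory
}
```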

@bgrant0607 added the area/isolation and sig/scheduling labels Sep 30, 2014
@bgrant0607 added the priority/awaiting-more-evidence label Dec 3, 2014
@bgrant0607
Member Author

Another example where we'd like differentiated QoS:

Pods/containers with specified memory limits should receive preferential treatment compared to "best effort" pods/containers that don't, such as by setting their OOM adjust scores.

/cc @dchen1107
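A minimal sketch of the kind of preferential treatment described above; the specific oom_score_adj values are assumptions for illustration, not a settled kubelet policy:

```go
// oomScoreAdjFor sketches the idea above: containers with an explicit
// memory limit are protected (killed last under memory pressure), while
// best-effort containers become the first OOM-kill candidates.
// The concrete values are illustrative assumptions.
func oomScoreAdjFor(memoryLimitBytes int64) int {
	if memoryLimitBytes > 0 {
		return -900 // has a memory limit: deprioritize for OOM kill
	}
	return 1000 // best effort: prefer to kill first
}
```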

@bgrant0607
Member Author

Current thinking for basic tiers:

  • Support specification of limit (max allowed) and request (min required for scheduling)
  • If limit==request, then it's highest priority
  • If limit > request, then middle priority (main reason to need this is for vertical auto-sizing)
  • If no limit or request, then lowest priority
  • Suggested defaulting is described in resources.md
  • High-priority pods/containers would get reserved resources.
  • Lower-priority pods/containers would be put into cgroups with the resources reserved by the higher-priority ones subtracted, in addition to nested cgroups with their specified limits (using the proposed parent cgroup Docker feature)
  • Containers specifying no cpu would get a tiny number of shares

We could also use core allocation, oom score adjustment, and other mechanisms to strengthen this basic protection.
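A sketch of how a container's tier could be derived from its request and limit under the rules above (the names and the handling of a request without a limit are assumptions; defaulting per resources.md would fill in unspecified values first):

```go
type Tier int

const (
	BestEffort Tier = iota // no limit or request: lowest priority
	Burstable              // limit > request: middle priority
	Guaranteed             // limit == request: highest priority
)

// tierFor applies the rules above to a single resource.
// Zero means "unspecified" in this sketch.
func tierFor(request, limit int64) Tier {
	switch {
	case request == 0 && limit == 0:
		return BestEffort
	case limit == request:
		return Guaranteed
	default:
		return Burstable
	}
}
```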

@erictune
Member

erictune commented Mar 3, 2015

@bgrant0607 @derekwaynecarr would there be quota at each tier?

@bgrant0607
Member Author

@erictune Yes.

@erictune
Member

erictune commented Mar 3, 2015

We will need to name the tiers/priorities so they can be referred to in the quota file.

@derekwaynecarr
Member

Is this post-1.0? I want to know if I need to reserve time to handle quota-related fallout.

@bgrant0607
Member Author

Yes, post-1.0.

@bgrant0607
Member Author

--cgroup_parent is in: moby/moby#11428

@rjnagal
Contributor

rjnagal commented Mar 19, 2015

Shall we go ahead with changes to differentiate between jobs with and without limits? Essentially, introducing tier-2 guarantees for jobs with unspecified limits?

@vishh
Contributor

vishh commented Mar 20, 2015

+1 for having a limit for all jobs without limits. We need to reserve resources for the system daemons as well.


@rjnagal
Contributor

rjnagal commented Mar 26, 2015

Discussed offline. The minimal protection for sys daemons we have now is good enough for v1.0. We'll take up the rest post 1.0.

@AnanyaKumar
Contributor

To clarify, this is a very preliminary proposal that I'm drafting - I wanted to get some feedback. I'll update after syncing with more experienced people :)

From what I understand, we don't support requests right now; we only support limits (basically request == limit). Limits will be guaranteed for top-tier pods and won't be guaranteed for best-effort pods. There will be over-commitment in the new model; however, over-commitment will be invisible to top-tier pods that specify limits.

Points 2 and 3 that you mention are in line with what I'm proposing, but I expect 3 might be controversial for incompressible resources. However, I don't see a better way to do this unless we preempt with numerical priorities (which Brian seems opposed to).

@chakri-nelluri
Contributor

Sounds good. Will wait for the proposal.

@bgrant0607
Member Author

Proposal is in #11713.

@bgrant0607
Member Author

Note that this issue is about performance/resource QoS, NOT availability/reliability/disruption QoS. They are related, but not necessarily coupled.

@davidopp added the priority/backlog label and removed the priority/awaiting-more-evidence label Jul 23, 2015
@erictune
Member

@NELCY @AnanyaKumar
I agree it should be possible to handle the scenario that @NELCY describes.

However, I don't think it is necessary to introduce tiers/priorities as a first-class concept to handle the scenario @NELCY mentions. You could also handle that scenario using vertical pod autoscaling.

With vertical pod autoscaling, you initially create a pod with some request and limit, and then a control loop periodically looks at each pod it is supposed to manage and decides whether to increase, decrease, or leave alone the request and limit of the pod. A typical vertical autoscaling policy would be to look at historical usage, take, say, the 95th percentile, add some padding to that, and call that the new request and limit.
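A sketch of that percentile-plus-padding policy; the 95th percentile comes from the comment above, while the 10% padding and everything else are illustrative assumptions:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// recommend returns a new request/limit from historical usage samples:
// take the 95th percentile and add some padding (10% here, as an
// illustrative choice), per the policy sketched above.
func recommend(usage []float64) float64 {
	if len(usage) == 0 {
		return 0
	}
	sorted := append([]float64(nil), usage...)
	sort.Float64s(sorted)
	idx := int(math.Ceil(0.95*float64(len(sorted)))) - 1
	return sorted[idx] * 1.10
}

func main() {
	history := []float64{120, 150, 180, 200, 400} // observed CPU millicores
	fmt.Printf("new request/limit: %.0f millicores\n", recommend(history))
}
```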

The case @NELCY mentions could be implemented with the following autoscaling rule:

  • if pod has label type=web then try to increase limit and request by 10%
  • keep doing this until all pods with type=web have consumed all free resources on the nodes they occupy.

I think it is better to keep the kubelet's and scheduler's view of resources as simple as possible, and to layer more complex policies as control systems on top of those. This approach has several advantages:

  • it keeps the complexity of the scheduler and kubelet manageable.
  • with less policy in the kubelet, it is able to work well on a wider variety of kernels/platforms.
  • If we hardcode tiers/priorities into a place in the pod spec that everyone uses, it is going to be difficult to evolve, because any change will affect all Kubernetes users when they try to upgrade the kubelet and/or scheduler.
  • However, if priorities/tiers are expressed via labels or annotations, or custom fields in the pod spec, then those can evolve separately. Different clusters can have different definitions of tiers. I am certain different people will want different, incompatible types of tiers.

@chakri-nelluri
Contributor

@AnanyaKumar
Thanks for the proposal. We included burstable pods, and that should cover the use case I was concerned about. I will spend some more time and send you review comments, if any. At first look, the proposal is very good.

@erictune I like the idea of a control loop on top, but it doesn't always give us completely elastic behavior. In my previous experience with control loops, there was always a debate about how frequently to run the control loop, especially for compressible resources. But I agree it has several advantages and gives us greater flexibility. For now, I am happy with support for burstable pods. Thanks.

@bgrant0607
Member Author

Superseded by more recent issues.

@rafis

rafis commented May 15, 2019

Superseded by more recent issues.

Kubernetes v1.12.7. The problem still exists. Pods are randomly dying. I call it the "Happy New Year Tree" effect in a Kubernetes cluster. Which recent issue should I look at?
