
Container and pod resource limits #168

Closed
bgrant0607 opened this issue Jun 19, 2014 · 50 comments
Labels: area/api, area/isolation, priority/backlog, sig/node, sig/scheduling

Comments

@bgrant0607
Member

Before we implement QoS tiers (#147), we need to support basic resource limits for containers and pods. All resource values should be integers.

For inspiration, see lmctfy:
https://github.com/google/lmctfy/blob/master/include/lmctfy.proto

Arguably we should start with pods first, to at least provide isolation between pods. However, that would require the ability to start Docker containers within cgroups. The support we need for individual containers already exists.

We should allow both minimum and maximum resource values to be provided, as lmctfy does. But let's not reuse lmctfy's limit and max_limit terminology. I like "requested" (amount scheduler will use for placement) and "limit" (hard limit beyond which the pod/container is throttled or killed).

Even without limit enforcement, the scheduler could use resource information for placement decisions.
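
A minimal sketch, in Go, of the requested/limit split proposed above; the type and field names are illustrative only, not a settled API:

// Amounts for a single resource (e.g. CPU in milli-cores, memory in bytes).
type ResourceAmounts struct {
    Requested int64 // amount the scheduler uses for placement
    Limit     int64 // hard cap; exceeding it throttles (CPU) or kills (memory)
}

// Per-container settings, one entry per first-class resource.
type ContainerResources struct {
    CPU    ResourceAmounts
    Memory ResourceAmounts
}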

@timothysc
Member

+1.

@bgrant0607
Member Author

We have cpu and memory in the container manifest:
// Optional: Defaults to unlimited.
Memory int `yaml:"memory,omitempty" json:"memory,omitempty"`
// Optional: Defaults to unlimited.
CPU int `yaml:"cpu,omitempty" json:"cpu,omitempty"`

However, AFAICT, we don't do anything with them. Besides, I think we want something more similar to lmctfy's API (request, limit, qos for each resource).

Another consideration: We could make it fairly easy to add new resources. Kubelet needs to understand each individual resource's characteristics, for isolation, QoS, overcommitment, etc. OTOH, the scheduler could deal with resources entirely abstractly. It could get resources and their capacities from the machines. Similarly, we'd need to make it possible to request abstract resources in the container/pod manifest.

@thockin
Member

thockin commented Jul 11, 2014

What we described internally was that "common" resources like CPU, memory, disk, etc. were described as first-class things. Other resources are handled essentially as opaque counters. E.g. a node says "I have 5 resources with ID 12345", a client says "I need 2 resources with ID 12345". The scheduler maps them.
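
A rough sketch, in Go, of that opaque-counter matching; the types and function here are illustrative, not an actual scheduler API:

type ResourceID string

// fits reports whether a node's advertised counts cover a pod's request,
// e.g. a node offering {"12345": 5} satisfies a request for {"12345": 2}.
func fits(nodeCapacity, podRequest map[ResourceID]int64) bool {
    for id, want := range podRequest {
        if nodeCapacity[id] < want {
            return false
        }
    }
    return true
}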


@erictune
Member

Consider that the resource types and units used for pod/container requests could also be used for describing how to subdivide cluster resources (see #442 ). For example, if team A is limited to using 10GB RAM at the cluster level, then team A can run 10 pods x 1GB RAM; or 2 pods x 5GB per pod; or some combination, etc.
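
A back-of-the-envelope sketch, in Go, of that cluster-level check; the function name and units are illustrative only:

// withinQuota reports whether the memory requested by a team's pods
// fits within the team's cluster-level allotment.
func withinQuota(teamQuotaBytes int64, podRequestsBytes []int64) bool {
    var total int64
    for _, r := range podRequestsBytes {
        total += r
    }
    return total <= teamQuotaBytes
}

With a 10GB quota, 10 pods x 1GB fits and 2 pods x 5GB fits, but 3 pods x 4GB would not.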

@adam-mesos

+1 to all of this. Mesos has a very similar model, with the scheduler/allocator able to work with any custom resource, but the slave/containerizer needs to know enough details to map it to an isolator. This would also be the appropriate separation for requested resource vs. resource limits.

@bgrant0607 bgrant0607 added this to the v1.0 milestone Aug 27, 2014
@brendandburns brendandburns modified the milestones: 0.7, v1.0 Sep 24, 2014
@bgrant0607 bgrant0607 added area/isolation sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Sep 25, 2014
@bgrant0607 bgrant0607 modified the milestones: v0.8, v0.7 Sep 26, 2014
@bgrant0607 bgrant0607 added the area/api Indicates an issue on api area. label Oct 2, 2014
@bgrant0607
Member Author

/cc @johnwilkes @davidopp @rjnagal @smarterclayton @brendandburns @thockin

The resource model doc has been created. We should align our API with it. v1beta3 leaves resource requests unchanged, though the ResourceList type was added in order to represent node capacity. We could either add the new fields in a backwards-compatible way, or replace the existing Container Memory and CPU fields in v1beta3 -- if we prefer to do the latter, we should add this issue to #1519.

I propose that we add an optional ResourceSpec struct containing optional Request and Limit ResourceList fields to both PodSpec and Container.

@bgrant0607
Member Author

Clarification: The separation of desired-state fields into a ResourceSpec struct was deliberate, conforming to the careful separation of desired and current state in v1beta3. Usage-related fields would go into a ResourceStatus struct, as would effective settings, such as soft or hard container limits. @johnwilkes agreed this made sense. At some point, we should clarify this in resources.md.
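
A sketch, in Go, of the shape proposed in the two comments above, assuming the v1beta3 spec/status split; the ResourceList shown here is simplified to integer amounts, and the exact names and types were still under discussion:

// ResourceList maps a resource name (e.g. "cpu", "memory") to an integer amount.
type ResourceList map[string]int64

// ResourceSpec holds desired state: what the scheduler reserves and the
// hard cap the kubelet enforces. It would hang off both PodSpec and Container.
type ResourceSpec struct {
    Request ResourceList
    Limit   ResourceList
}

// ResourceStatus holds current state: observed usage and the effective
// (soft or hard) limits actually applied.
type ResourceStatus struct {
    Usage ResourceList
}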

@thockin
Member

thockin commented Oct 2, 2014

I don't think we want pod-level resources yet, or if we do then we accept EITHER pod resources OR container resources, but never both on a single pod. Not yet.


@bgrant0607
Member Author

Fair enough. We can't support pod limits until libcontainer and Docker do, so I'd be fine with omitting that for now.

@vishh
Contributor

vishh commented Dec 28, 2015

FYI: Docker now supports updates to cgroups #15078

@andyxning
Member

Allowing memory limit over-committing may cause unpredictable process killing by triggering the kernel OOM killer.

I ran a program that allocates 50GB of memory in a pod whose memory limit is 118GB, on a node with 64GB. After the program had been running for several seconds it was OOM-killed, and I can see the OOM killer log in /var/log/syslog.

@montanaflynn

Hard limits for CPU are very important for the video transcoding pods we run on Google Container Engine. We need nodes with lots of cores for speed, but we also don't want a single pod greedily using up all the cores. Ideally we would set their limit at 3/4 of the node's total CPU.

We can currently handle this for scheduling with requests, so we don't put two transcoders on a single node, but the lack of hard limits means that when a pod is running it uses all the cores even with limits set. This has led us to run two clusters: one specifically for transcoding large media, and the other for small media and the rest of our services.

@thockin
Member

thockin commented Sep 11, 2016

I thought we used shares for "request" and quota for "limit", thereby providing true hard limits. Did I mis-comprehend?


@smarterclayton
Contributor

smarterclayton commented Sep 12, 2016 via email

@montanaflynn

montanaflynn commented Sep 12, 2016

It seems hard limits came with v1.2, based on the changelog. I remember that when I first started with Kubernetes there was a warning saying that CPU limits were not enforced; maybe it was that my host OS didn't support it. Looking at the compute resources documentation, it looks like Kubernetes does support hard limits by default now.

CPU hardcapping will be enabled by default for containers with CPU limit set, if supported by the kernel. You should either adjust your CPU limit, or set CPU request only, if you want to avoid hardcapping. If the kernel does not support CPU Quota, NodeStatus will contain a warning indicating that CPU Limits cannot be enforced.

@thockin
Member

thockin commented Sep 12, 2016

Note that CPU hard limits can be surprising. All a limit guarantees is that you can use X core-seconds per wall-second. Consider a 16-core machine and a pod that has an 8-core limit. If your app is multi-threaded or multi-process, and the number of runnable threads/processes is larger than 8, you could use up all 8 cores of your limit in less than 1 wall second. If you used all 16 cores for 0.5 seconds, you would leave your pod ineligible to run for the remaining 0.5 seconds (that's a long time!), giving you terrible tail latency.

Now, in reality the time slice is smaller, but it is still in the tens or hundreds of milliseconds. If you're not careful, you really could find yourself with unexpected latency blips of 50 or 100 milliseconds or more.
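
To make the arithmetic above concrete, here is a small Go illustration assuming the kernel's default CFS period of 100ms; the numbers are illustrative, not measurements:

package main

import "fmt"

func main() {
    const periodMs = 100.0       // default CFS period (cpu.cfs_period_us / 1000)
    const limitCores = 8.0       // container CPU limit
    const runnableThreads = 16.0 // threads ready to run on a 16-core node

    quotaMs := limitCores * periodMs     // 800 core-ms of quota per period
    burstMs := quotaMs / runnableThreads // quota exhausted after 50ms of wall time
    throttledMs := periodMs - burstMs    // throttled for the remaining 50ms

    fmt.Printf("quota=%.0f core-ms, exhausted after %.0fms, throttled for %.0fms\n",
        quotaMs, burstMs, throttledMs)
}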


@timothysc
Member

If not tolerating those blips is a hard constraint, then you're likely looking for CPU affinity or cpusets. xref: #10570

@montanaflynn

I'm on Google Cloud's Container Engine and found that the warning I referenced above is still shown while running master and nodes on Kubernetes version 1.3.5:

The warning displayed by kubectl describe nodes is "WARNING: CPU hardcapping unsupported".

$ kubectl describe nodes
Name:           gke-cluster-1-default-pool-777adf16-an5j
Labels:         beta.kubernetes.io/arch=amd64
            beta.kubernetes.io/instance-type=n1-highcpu-8
            beta.kubernetes.io/os=linux
            cloud.google.com/gke-nodepool=default-pool
            failure-domain.beta.kubernetes.io/region=us-central1
            failure-domain.beta.kubernetes.io/zone=us-central1-b
            kubernetes.io/hostname=gke-cluster-1-default-pool-777adf16-an5j
Taints:         <none>
CreationTimestamp:  Wed, 12 Sep 2016 08:14:45 -0700
Phase:
Conditions:
  Type          Status  LastHeartbeatTime           LastTransitionTime          Reason              Message
  ----          ------  -----------------           ------------------          ------              -------
  NetworkUnavailable    False   Mon, 01 Jan 0001 00:00:00 +0000     Wed, 07 Sep 2016 18:15:58 -0700     RouteCreated            RouteController created a route
  OutOfDisk         False   Mon, 12 Sep 2016 14:15:13 -0700     Wed, 07 Sep 2016 18:14:45 -0700     KubeletHasSufficientDisk    kubelet has sufficient disk space available
  MemoryPressure    False   Mon, 12 Sep 2016 14:15:13 -0700     Wed, 07 Sep 2016 18:14:45 -0700     KubeletHasSufficientMemory  kubelet has sufficient memory available
  Ready         True    Mon, 12 Sep 2016 14:15:13 -0700     Wed, 07 Sep 2016 18:15:21 -0700     KubeletReady            kubelet is posting ready status. WARNING: CPU hardcapping unsupported

@vishh
Contributor

vishh commented Sep 12, 2016

@montanaflynn On Google Container Engine, can you switch to GCI as the image type? You can upgrade your node pool to GCI by setting --image-type=gci, or pass that flag while creating a new cluster.
GCI is the newer replacement for the existing Debian 7 based base image on GKE. CPU limits are supported there.

@montanaflynn

@vishh where / how could I set --image-type=gci for an existing cluster?

@vishh
Contributor

vishh commented Sep 12, 2016

@montanaflynn Assuming you have only the default node-pool, run gcloud container clusters upgrade <your_cluster_name> --image-type gci --node-pool default-pool.
This change is disruptive since it restarts existing nodes.
Another option is to create a new node pool for your cluster that uses GCI and then slowly turn down the default node pool: gcloud container node-pools create --cluster <your_cluster_name> --image-type gci

@montanaflynn

Thanks! Will container engine be using that image by default in the future?

@vishh
Contributor

vishh commented Sep 12, 2016

Yes. That might happen as early as v1.4 on GKE.

@timothysc
Member

I think we should move to close this issue. The root topic has been addressed, but there are multiple side-threads on this issue that I believe would be better served by other issues.

@vishh thoughts?

@derekwaynecarr
Member

derekwaynecarr commented Dec 7, 2016 via email

@maicohjf


Schedule a Pod as follows:
 Name: nginx-kusc00101
 Image: nginx
 Node selector: disk=ssd

apiVersion: v1
kind: Pod
metadata:
  name: nginx-kusc00101
spec:
  containers:
  - name: nginx-kusc00101
    image: nginx
  nodeSelector:
    disk: ssd

kubectl create -f nginx-kusc00101.yaml
