
Container and pod resource limits #168

Closed
bgrant0607 opened this issue Jun 19, 2014 · 50 comments
Labels: area/api, area/isolation, priority/backlog, sig/node, sig/scheduling

Comments

@bgrant0607
Member

Before we implement QoS tiers (#147), we need to support basic resource limits for containers and pods. All resource values should be integers.

For inspiration, see lmctfy:
https://github.com/google/lmctfy/blob/master/include/lmctfy.proto

Arguably we should start with pods first, to at least provide isolation between pods. However, that would require the ability to start Docker containers within cgroups. The support we need for individual containers already exists.

We should allow both minimum and maximum resource values to be provided, as lmctfy does. But let's not reuse lmctfy's limit and max_limit terminology. I like "requested" (amount scheduler will use for placement) and "limit" (hard limit beyond which the pod/container is throttled or killed).

Even without limit enforcement, the scheduler could use resource information for placement decisions.
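
A minimal sketch, in Go, of the requested/limit split proposed above; the type and field names are illustrative only, not a settled API:

// Amounts for a single resource (e.g. CPU in milli-cores, memory in bytes).
type ResourceAmounts struct {
    Requested int64 // amount the scheduler uses for placement
    Limit     int64 // hard cap; exceeding it throttles (CPU) or kills (memory)
}

// Per-container settings, one entry per first-class resource.
type ContainerResources struct {
    CPU    ResourceAmounts
    Memory ResourceAmounts
}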

@timothysc
Member

+1.

@bgrant0607
Member Author

We have cpu and memory in the container manifest:
// Optional: Defaults to unlimited.
Memory int `yaml:"memory,omitempty" json:"memory,omitempty"`
// Optional: Defaults to unlimited.
CPU int `yaml:"cpu,omitempty" json:"cpu,omitempty"`

However, AFAICT, we don't do anything with them. Besides, I think we want something more similar to lmctfy's API (request, limit, qos for each resource).

Another consideration: We could make it fairly easy to add new resources. Kubelet needs to understand each individual resource's characteristics, for isolation, QoS, overcommitment, etc. OTOH, the scheduler could deal with resources entirely abstractly. It could get resources and their capacities from the machines. Similarly, we'd need to make it possible to request abstract resources in the container/pod manifest.

@thockin
Member

thockin commented Jul 11, 2014

What we described internally was that "common" resources like CPU, memory, disk, etc. were described as first-class things. Other resources are handled essentially as opaque counters. E.g. a node says "I have 5 resources with ID 12345", a client says "I need 2 resources with ID 12345". The scheduler maps them.
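
A rough sketch, in Go, of that opaque-counter matching; the types and function here are illustrative, not an actual scheduler API:

type ResourceID string

// fits reports whether a node's advertised counts cover a pod's request,
// e.g. a node offering {"12345": 5} satisfies a request for {"12345": 2}.
func fits(nodeCapacity, podRequest map[ResourceID]int64) bool {
    for id, want := range podRequest {
        if nodeCapacity[id] < want {
            return false
        }
    }
    return true
}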


@erictune
Member

Consider that the resource types and units used for pod/container requests could also be used for describing how to subdivide cluster resources (see #442 ). For example, if team A is limited to using 10GB RAM at the cluster level, then team A can run 10 pods x 1GB RAM; or 2 pods x 5GB per pod; or some combination, etc.
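
A back-of-the-envelope sketch, in Go, of that cluster-level check; the function name and units are illustrative only:

// withinQuota reports whether the memory requested by a team's pods
// fits within the team's cluster-level allotment.
func withinQuota(teamQuotaBytes int64, podRequestsBytes []int64) bool {
    var total int64
    for _, r := range podRequestsBytes {
        total += r
    }
    return total <= teamQuotaBytes
}

With a 10GB quota, 10 pods x 1GB fits and 2 pods x 5GB fits, but 3 pods x 4GB would not.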

@adam-mesos

+1 to all of this. Mesos has a very similar model, with the scheduler/allocator able to work with any custom resource, but the slave/containerizer needs to know enough details to map it to an isolator. This would also be the appropriate separation for requested resource vs. resource limits.

@bgrant0607 bgrant0607 added this to the v1.0 milestone Aug 27, 2014
@brendandburns brendandburns modified the milestones: 0.7, v1.0 Sep 24, 2014
@bgrant0607 bgrant0607 added area/isolation sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Sep 25, 2014
@bgrant0607 bgrant0607 modified the milestones: v0.8, v0.7 Sep 26, 2014
@bgrant0607 bgrant0607 added the area/api Indicates an issue on api area. label Oct 2, 2014
@bgrant0607
Member Author

/cc @johnwilkes @davidopp @rjnagal @smarterclayton @brendandburns @thockin

The resource model doc has been created. We should align our API with it. v1beta3 leaves resource requests unchanged, though the ResourceList type was added in order to represent node capacity. We could either add the new fields in a backwards-compatible way, or replace the existing Container Memory and CPU fields in v1beta3 -- if we prefer to do the latter, we should add this issue to #1519.

I propose that we add an optional ResourceSpec struct containing optional Request and Limit ResourceList fields to both PodSpec and Container.

@bgrant0607
Member Author

Clarification: The separation of desired-state fields into a ResourceSpec struct was deliberate, conforming to the careful separation of desired and current state in v1beta3. Usage-related fields would go into a ResourceStatus struct, as would effective settings, such as soft or hard container limits. @johnwilkes agreed this made sense. At some point, we should clarify this in resources.md.
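
A sketch, in Go, of the shape proposed in the two comments above, assuming the v1beta3 spec/status split; the ResourceList shown here is simplified to integer amounts, and the exact names and types were still under discussion:

// ResourceList maps a resource name (e.g. "cpu", "memory") to an integer amount.
type ResourceList map[string]int64

// ResourceSpec holds desired state: what the scheduler reserves and the
// hard cap the kubelet enforces. It would hang off both PodSpec and Container.
type ResourceSpec struct {
    Request ResourceList
    Limit   ResourceList
}

// ResourceStatus holds current state: observed usage and the effective
// (soft or hard) limits actually applied.
type ResourceStatus struct {
    Usage ResourceList
}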

@thockin
Member

thockin commented Oct 2, 2014

I don't think we want pod-level resources yet, or if we do then we accept EITHER pod resources OR container resources, but never both on a single pod. Not yet.


@bgrant0607
Member Author

Fair enough. We can't support pod limits until libcontainer and Docker do, so I'd be fine with omitting that for now.

@vishh
Contributor

vishh commented Dec 28, 2015

FYI: Docker now supports updates to cgroups #15078

@andyxning
Member

Allowing memory limit over-committing may cause unpredictable process killing by triggering the kernel OOM killer.

I ran a program that allocates 50GB of memory in a pod whose memory limit is 118GB, on a node with 64GB. After the program had been running for several seconds it was OOM-killed, and I can see the OOM killer log in /var/log/syslog.

@montanaflynn

Hard limits for CPU are very important for the video transcoding pods we run on Google Container Engine. We need nodes with lots of cores for speed, but we also don't want a single pod greedily using up all the cores. Ideally we would set their limit at 3/4 of the node's total CPU.

We can currently handle this for scheduling with requests, so we don't put two transcoders on a single node, but the lack of hard limits means that when a pod is running it uses all the cores even with limits set. This has led us to run two clusters: one specifically for transcoding large media, and the other for small media and the rest of our services.

@thockin
Member

thockin commented Sep 11, 2016

I thought we used shares for "request" and quota for "limit", thereby providing true hard limits. Did I mis-comprehend?


@smarterclayton
Contributor

smarterclayton commented Sep 12, 2016 via email

@montanaflynn

montanaflynn commented Sep 12, 2016

It seems hard limits came with v1.2, based on the changelog. I remember that when I first started with Kubernetes there was a warning saying that CPU limits were not enforced; maybe it was that my host OS didn't support it. Looking at the compute resources documentation, it looks like Kubernetes does support hard limits by default now.

CPU hardcapping will be enabled by default for containers with CPU limit set, if supported by the kernel. You should either adjust your CPU limit, or set CPU request only, if you want to avoid hardcapping. If the kernel does not support CPU Quota, NodeStatus will contain a warning indicating that CPU Limits cannot be enforced.

@thockin
Member

thockin commented Sep 12, 2016

Note that CPU hard limits can be surprising. All a limit guarantees is that you can use X core-seconds per wall-second. Consider a 16-core machine and a pod that has an 8-core limit. If your app is multi-threaded or multi-process, and the number of runnable threads/processes is larger than 8, you could use up all 8 cores of your limit in less than 1 wall second. If you used all 16 cores for 0.5 seconds, you would leave your pod ineligible to run for the remaining 0.5 seconds (that's a long time!), giving you terrible tail latency.

Now, in reality the time slice is smaller, but it is still in the tens or hundreds of milliseconds. If you're not careful, you really could find yourself with unexpected latency blips of 50 or 100 milliseconds or more.
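
To make the arithmetic above concrete, here is a small Go illustration assuming the kernel's default CFS period of 100ms; the numbers are illustrative, not measurements:

package main

import "fmt"

func main() {
    const periodMs = 100.0       // default CFS period (cpu.cfs_period_us / 1000)
    const limitCores = 8.0       // container CPU limit
    const runnableThreads = 16.0 // threads ready to run on a 16-core node

    quotaMs := limitCores * periodMs     // 800 core-ms of quota per period
    burstMs := quotaMs / runnableThreads // quota exhausted after 50ms of wall time
    throttledMs := periodMs - burstMs    // throttled for the remaining 50ms

    fmt.Printf("quota=%.0f core-ms, exhausted after %.0fms, throttled for %.0fms\n",
        quotaMs, burstMs, throttledMs)
}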


@timothysc
Member

If not tolerating those blips is a hard constraint, then you're likely looking for CPU affinity or cpusets. xref: #10570

@montanaflynn

I'm on Google Cloud's Container Engine and found that the warning I referenced above is still shown while running master and nodes on Kubernetes version 1.3.5:

The warning displayed by kubectl describe nodes is "WARNING: CPU hardcapping unsupported".

$ kubectl describe nodes
Name:           gke-cluster-1-default-pool-777adf16-an5j
Labels:         beta.kubernetes.io/arch=amd64
            beta.kubernetes.io/instance-type=n1-highcpu-8
            beta.kubernetes.io/os=linux
            cloud.google.com/gke-nodepool=default-pool
            failure-domain.beta.kubernetes.io/region=us-central1
            failure-domain.beta.kubernetes.io/zone=us-central1-b
            kubernetes.io/hostname=gke-cluster-1-default-pool-777adf16-an5j
Taints:         <none>
CreationTimestamp:  Wed, 12 Sep 2016 08:14:45 -0700
Phase:
Conditions:
  Type          Status  LastHeartbeatTime           LastTransitionTime          Reason              Message
  ----          ------  -----------------           ------------------          ------              -------
  NetworkUnavailable    False   Mon, 01 Jan 0001 00:00:00 +0000     Wed, 07 Sep 2016 18:15:58 -0700     RouteCreated            RouteController created a route
  OutOfDisk         False   Mon, 12 Sep 2016 14:15:13 -0700     Wed, 07 Sep 2016 18:14:45 -0700     KubeletHasSufficientDisk    kubelet has sufficient disk space available
  MemoryPressure    False   Mon, 12 Sep 2016 14:15:13 -0700     Wed, 07 Sep 2016 18:14:45 -0700     KubeletHasSufficientMemory  kubelet has sufficient memory available
  Ready         True    Mon, 12 Sep 2016 14:15:13 -0700     Wed, 07 Sep 2016 18:15:21 -0700     KubeletReady            kubelet is posting ready status. WARNING: CPU hardcapping unsupported

@vishh
Contributor

vishh commented Sep 12, 2016

@montanaflynn On Google Container Engine, can you switch to GCI as the image type? You can upgrade your node pool to GCI by setting --image-type=gci, or pass that flag while creating a new cluster.
GCI is the newer replacement for the existing Debian 7 based base image on GKE. CPU limits are supported there.

@montanaflynn

@vishh where / how could I set --image-type=gci for an existing cluster?

@vishh
Contributor

vishh commented Sep 12, 2016

@montanaflynn Assuming you have only the default node-pool, run gcloud container clusters upgrade <your_cluster_name> --image-type gci --node-pool default-pool.
This change is disruptive since it restarts existing nodes.
Another option is to create a new node pool for your cluster that uses GCI and then slowly turn down the default node pool: gcloud container node-pools create --cluster <your_cluster_name> --image-type gci

@montanaflynn

Thanks! Will container engine be using that image by default in the future?

@vishh
Contributor

vishh commented Sep 12, 2016

Yes. That might happen as early as v1.4 on GKE.

@timothysc
Member

I think we should move to close this issue. The root topic has been addressed, but there are multiple side-threads on this issue that I believe would be better served by other issues.

@vishh thoughts?

@derekwaynecarr
Member

derekwaynecarr commented Dec 7, 2016 via email

@maicohjf


Schedule a Pod as follows:
 Name: nginx-kusc00101
 Image: nginx
 Node selector: disk=ssd

apiVersion: v1
kind: Pod
metadata:
  name: nginx-kusc00101
spec:
  containers:
  - name: nginx-kusc00101
    image: nginx
  nodeSelector:
    disk: ssd

kubectl create -f nginx-kusc00101.yaml
