
Allow cluster resources to be subdivided #442

Closed
smarterclayton opened this issue Jul 13, 2014 · 15 comments
Assignees
Labels
kind/design Categorizes issue or PR as related to design. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@smarterclayton
Contributor

Currently Kubernetes supports a single global scope for all resources. This works well for single/small deployments where a small group of admins coordinates changes. There are three other important use cases that we would like to solve with Kubernetes:

  • Use case 1 - Tenant visibility: Allow multiple administrative teams to subdivide the resources in the infrastructure for separate applications with visibility isolation to prevent conflicts between distinct uses (service names are unique only within a tenant, tenant 1 cannot see the resources of tenant 2)
  • Use case 2 - Quota: Allow infrastructure to be finely divided among many users, with upper limits on the resources / locations / capabilities they can consume - allow resources to fungibly be divided between small and large teams with hard and soft limits.
  • Use case 3 - Exclusion: Prevent resources from one tenant from being collocated with the resources (pods, services, traffic) of another tenant. A less common scenario.

Ideally, these additional use cases would not complicate the simple scenario (an administrator can start a cluster without having to configure subdivision). In addition, a consumer of the pod/controller/service APIs should be agnostic to the subdivision.

Assumptions about subdivision:

  • Most infrastructure use cases involve one to several related groups of containers/services performing a set of related functions (one, two, or many related "applications")
  • As with most development, the natural unit of coordination on infrastructure is a team - a set of people who have similar roles and are working on related functionality
  • While coordination across teams is not unusual, the relationships between these teams and functions is likely more formal - perhaps a versioned API, or a schema in a shared resource
  • This is modeled in many IaaS and related software as a "project" (or a "tenant") - a group of physical or virtual resources that are managed by a set of human or machine actors with predefined roles (which limit the actions those people/machines may perform), typically devoted to a common goal or set of goals
  • There usually exists an owning relationship to an account/organization that pays/allocates physical resources - an owning entity may have multiple projects, may choose to assign individuals specific roles across those projects, and usually has an elevated level of privilege to those projects that users within the project cannot access (typically the right to set/alter limits on usage)
  • It's rarely necessary to nest projects (which is typically done to subdivide resources further) - instead, being able to easily assign groups of users (a "team") to roles on a project means projects can easily be split.
    • There is still value in being able to categorize / organize large numbers of projects, a need that concepts like labeling can address.
  • As the number of managed resources grows, it becomes valuable to offer individual users the ability to query the resources they have access to across projects. The base design should allow these broader scopes to be introduced at a later point.

Straw-man proposal:

  • Define a project resource (arbitrarily chosen name) which is the sole unit of access control and subdivision of pods, controllers, services, and other infrastructure resources
    • Limits and constraints on the number, size, and placement options available to pods are managed at the project level
  • Allow the Kubernetes API concepts to work scoped to a project
    • In the default case, it should be easy for a solo admin to ignore the existence of the project resource - perhaps through a default singleton project with no imposed limits
  • Limits and constraints must be available to be imposed when a pod is scheduled - the relationship between the scheduled pod, the containing project, and the acting user needs better definition, but either the storage or the scheduler interface (or both) must have the ability to influence / reject placement.
  • Scope and define the following concepts in Kubernetes (Allow actors against the apiserver to be authenticated and actions to be authorized #443):
    • identity - a way of tracking the actors of the API
    • authentication - how the identity of an actor is determined and checked
    • authorization - which actions an identity can take in which context
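
To make the straw man a bit more concrete, here is a minimal sketch of what a project object and its roles might look like, in Go since that is the implementation language here. Every name below is illustrative, not a settled API:

```go
package project

// Project is a hypothetical unit of access control and resource
// subdivision. All field names are illustrative, not a proposed API.
type Project struct {
	Name   string
	Labels map[string]string // for categorizing/organizing many projects

	// Limits are imposed by the owning account/organization; users
	// inside the project cannot raise them.
	Limits ResourceLimits
}

// ResourceLimits caps what a project's pods may consume in aggregate.
type ResourceLimits struct {
	MaxPods        int
	MaxMilliCores  int64
	MaxMemoryBytes int64
}

// Role binds a group of users (a "team") to the actions they may
// perform within a single project.
type Role struct {
	Project string
	Users   []string
	Actions []string // e.g. "create-pod", "update-controller"
}
```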

Thoughts?

@monnand
Contributor

monnand commented Jul 13, 2014

If I understand correctly, would sub-containers (partly) solve this issue? This feature (sub-containers) is implemented in lmctfy, but not in Docker. Not sure if Docker will add this function in the future.

(I'm not saying that sub-containers are the only solution - just that they could be one way.)

@smarterclayton
Contributor Author

This is subdivision of total cluster resources across one or more tenants, vs subdivision of resources in a container.

@smarterclayton smarterclayton changed the title Allow resources to be subdivided Allow cluster resources to be subdivided Jul 13, 2014
@thockin
Member

thockin commented Jul 16, 2014

Clayton,

Not ignoring you, just buried in email. This is an interesting topic.
You've touched on a number of things that we do internally but that I am
not an expert in. Will come back to this topic soon with some expertise in tow.


@erictune
Member

Clayton:
Here is my interpretation of your two use cases, from the top of your original post (I've reversed the order). I'm holding off on commenting on the subsequent paragraphs of your proposal while we discuss the first part.

  • Prevent exhaustion of basic kubelet resources (cores, RAM, disk bytes, NIC bandwidth, etc.). Allow projects to be assured that resources will be available in the future even if not in use at the moment. The analogy is Unix disk quota (with just one filesystem): usage and limit per user. So, I'll call this Quota.
  • Prevent one team's pods from being "near" any other team's pods. I'll call this Exclusion. Reasons include:
    • teams feel a sense of ownership of resources ("We bought these machines that were added to the cluster, so we get to use them")
    • to get isolation from the effects of other teams' pods (e.g. processor cache interference)
    • because company or third-party policies require such a separation
    • to get access to specific hardware, or to machines with specific host OS tuning on them.

Assuming I got those use cases right, here are a few thoughts:

Quota can possibly be done at a very fine granularity (fractions of cores, ~kB of RAM, etc). Exclusion typically happens at a number of coarser granularities (core, VM, physical machine, rack, site, etc).

Since Quota can be fine grained, it is easy to assign just the right amount to a project.
Since exclusion is coarser grained, waste due to rounding errors is more likely.

One project's quota can be represented concisely (e.g. one integer each for ram, compute, disk). This representation has additive properties, so it is convenient for producing accounting reports. It can be traded among teams, so there can be an economy of resources. Exclusion has a more complex representation (list of labels or of physical machine names, etc). An economy of exclusion-resources sometimes has too much friction.

A basic type of exclusion can be implemented by saying "run your own k8s cluster if you want to be separate". You can get a long way with a combination of quota and multiple k8s clusters. This does imply a need to name k8s clusters, but that is useful for other reasons anyway.
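
To make the concise, additive representation above concrete, a hedged sketch; the type and field names here are mine, not from any proposal:

```go
package quota

// Quota is a per-project resource grant expressed as plain numbers,
// which is what makes it additive: grants from many projects can be
// summed for accounting reports or traded between teams.
type Quota struct {
	MilliCores  int64 // fine granularity: fractions of cores
	MemoryBytes int64
	DiskBytes   int64
}

// Add returns the element-wise sum of two quotas.
func (q Quota) Add(o Quota) Quota {
	return Quota{
		MilliCores:  q.MilliCores + o.MilliCores,
		MemoryBytes: q.MemoryBytes + o.MemoryBytes,
		DiskBytes:   q.DiskBytes + o.DiskBytes,
	}
}

// Fits reports whether a measured usage stays within the quota.
func (q Quota) Fits(used Quota) bool {
	return used.MilliCores <= q.MilliCores &&
		used.MemoryBytes <= q.MemoryBytes &&
		used.DiskBytes <= q.DiskBytes
}
```

Exclusion, by contrast, needs a list-valued representation (machine names, labels), which is why it resists this kind of simple arithmetic.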

@bgrant0607
Member

Clayton, at a high level this all sounds very reasonable and compatible with our cloud APIs. We'll need to discuss in more detail how users will authenticate and what authorization policies will look like.

A few other comments:

  • We call resource limits quotas. Aggregate limits work pretty well for the most part, though additional restrictions on maximum pod size and/or shape may be needed to ensure schedulability. If we support multiple quality of service levels (QoS tiers #147), we'll want limits by level.
  • Limits on numbers of objects in the system (pods, etc.) are also needed, and it's good to impose such limits from the beginning rather than try to add them later.
  • Expect someone to eventually want to automate management of these limits.
  • We can use a variant of constraints (More general scheduling constraints #367) to manage placement restrictions.
  • We should do quota-based admission control (i.e., the decision to accept or reject a request) at the apiserver, and resource/constraint-based admission control at the kubelet (see the sketch after this list).
  • How we actually determine resource capacity and consumption is an interesting topic unto itself, especially in the presence of quality of service tiers, SLOs, overcommitment, etc. We will likely need pluggable policies.
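
A hedged sketch of the apiserver-side check described in the admission-control bullet; the function and type names are illustrative, not the eventual API:

```go
package admission

import "errors"

// ErrQuotaExceeded is returned when a request would push a project
// past its aggregate limits.
var ErrQuotaExceeded = errors.New("project quota exceeded")

// PodRequest is the resource footprint of a pod-creation request.
type PodRequest struct {
	MilliCores  int64
	MemoryBytes int64
}

// Usage is what the project has already consumed; Limits is its
// quota. Both are illustrative shapes.
type Usage struct {
	Pods        int
	MilliCores  int64
	MemoryBytes int64
}

type Limits struct {
	MaxPods        int
	MaxMilliCores  int64
	MaxMemoryBytes int64
}

// Admit accepts or rejects a request at the apiserver, before it is
// stored; the kubelet would separately enforce node-local resource
// and constraint checks at placement time.
func Admit(u Usage, l Limits, req PodRequest) error {
	if u.Pods+1 > l.MaxPods ||
		u.MilliCores+req.MilliCores > l.MaxMilliCores ||
		u.MemoryBytes+req.MemoryBytes > l.MaxMemoryBytes {
		return ErrQuotaExceeded
	}
	return nil
}
```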

@smarterclayton
Contributor Author

  • Prevent one team's pods from being "near" any other team's pods. I'll call this Exclusion.

This one is probably a third use case. I'd restate the original use case more precisely as:

  • Prevent one team from being able to interact with another team's pods directly. You might call this Limited Visibility.

For Quota - no disagreement that it is typically fine grained, although I can think of cases where compute impact is more nuanced in how it affects other users (the two sides of CPU scheduling: the percentage of each time interval for which you are allocated a CPU, vs. the maximum contiguous block within that interval during which you may execute without interruption).
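
Those two sides map roughly onto a period/quota model (similar in spirit to the Linux CFS bandwidth controller). A hedged illustration, with a type of my own invention rather than any kubelet interface:

```go
package cpu

import "time"

// Share captures both sides of CPU allocation that affect neighbors
// differently: how much runtime you get per interval, and how long
// you may run uninterrupted within it. Illustrative only.
type Share struct {
	Period   time.Duration // length of each scheduling interval
	Quota    time.Duration // total runtime allowed per period
	MaxBurst time.Duration // longest contiguous run before yielding
}

// Fraction is the long-run share of a CPU this allocation allows;
// two workloads with equal Fraction but different MaxBurst can have
// very different effects on their neighbors' latency.
func (s Share) Fraction() float64 {
	return float64(s.Quota) / float64(s.Period)
}
```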

From a use case perspective, in our area we see Exclusion as less common, and agree that the basic type often works. The next step up is typically at a large granularity (and thus inefficient), and occurs for either "important customer/project" or "production vs development" splits. Fine-grained exclusion seems less common.

So Quota and Limited Visibility drive a lot of our thinking, with Exclusion being less important because you can (as you note) easily say "run your own cluster".

@smarterclayton
Contributor Author

Brian, for admission control do you see that as fundamentally a hard limit, or a soft limit with reconciliation? Since you're drawing from a pool, you need to coordinate the reservation of that resource - if failures occur after you've reserved but before you deploy, you then need to undo that reservation, and doing that is difficult to implement correctly. We struggle with this in practical terms: can you allow eventually consistent behavior at the admission control level (create something, 5s later it gets deleted), and does that compromise the experience? This is getting more into implementation; just fishing for different perspectives here.
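
A hedged sketch of the reserve/deploy/undo sequence in question, to show where it gets fragile; all names here are illustrative:

```go
package reservation

import "fmt"

// Pool is a project's remaining quota, from which requests reserve.
type Pool struct {
	FreeMilliCores int64
}

// Reserve claims capacity before deployment begins.
func (p *Pool) Reserve(n int64) error {
	if n > p.FreeMilliCores {
		return fmt.Errorf("insufficient quota: want %d, have %d", n, p.FreeMilliCores)
	}
	p.FreeMilliCores -= n
	return nil
}

// Release undoes a reservation. This is the fragile step: if the
// caller fails between Reserve and Release, the capacity leaks until
// a reconciler recomputes usage from the pods that actually exist -
// the eventually consistent behavior described above (create
// something, 5s later it gets deleted).
func (p *Pool) Release(n int64) {
	p.FreeMilliCores += n
}

// Deploy shows the reserve-then-undo-on-failure shape.
func Deploy(p *Pool, n int64, start func() error) error {
	if err := p.Reserve(n); err != nil {
		return err
	}
	if err := start(); err != nil {
		p.Release(n) // the undo path; lost if we crash before it runs
		return err
	}
	return nil
}
```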

@erictune
Member

(answering for Brian)
Our assumptions are:

  • Teams with the largest resource usage are typically better staffed, so these teams can devote time to understanding the complexity of hard limits, and they can plan their needs carefully and reserve the right amount of resources ahead of time.
  • Teams with smaller resource usage just want it to work. The cost of wasting their admins' time on resource planning exceeds the value of setting a precise resource quota for them.
  • If you let a big team go, say, 10% over its quota temporarily, it might completely exhaust a cluster resource.
  • If you let a small team go, say, 10% over its quota temporarily, it is very unlikely to exhaust a cluster's resources.
  • There are typically a few big users and lots of small users in a cluster with many users. You can imagine that it may be a power-law distribution, like income. Except that, once a user gets big enough, they have incentives to run their own cluster. But then they end up subdividing the resources in their private cluster...

One way you could approach this is:

  • have hard limits for big teams.
  • put the rest of the resources in an "ALL_OTHER_TEAMS" bucket.
  • sum(big team limits) + ALL_OTHER_TEAMS <= max practical cluster capacity
  • give small teams a soft limit.
  • sum(small team limits) <= OVERSUB_FACTOR * ALL_OTHER_TEAMS
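
Those two inequalities, written out as a check an administrator's tooling might run when handing out limits (a sketch under my own names, not any real interface):

```go
package plan

// Valid checks the two invariants of the hard/soft limit scheme:
// hard limits plus the shared bucket must fit within the cluster,
// and soft limits may oversubscribe the shared bucket only by the
// chosen factor. All names are illustrative.
func Valid(bigTeamLimits, smallTeamLimits []int64, clusterCapacity, allOtherTeams int64, oversubFactor float64) bool {
	var bigSum, smallSum int64
	for _, l := range bigTeamLimits {
		bigSum += l
	}
	for _, l := range smallTeamLimits {
		smallSum += l
	}
	// sum(big team limits) + ALL_OTHER_TEAMS <= max practical cluster capacity
	hard := bigSum+allOtherTeams <= clusterCapacity
	// sum(small team limits) <= OVERSUB_FACTOR * ALL_OTHER_TEAMS
	soft := float64(smallSum) <= oversubFactor*float64(allOtherTeams)
	return hard && soft
}
```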

Thoughts?

@erictune
Member

btw, I used quota and limit interchangeably in that last post, which wasn't very precise.

@thockin
Member

thockin commented Jul 16, 2014

  • Prevent one team from being able to interact with another team's pods directly. You might call this Limited Visibility.

Define "interact" ?


@smarterclayton
Contributor Author

@thockin - Can I (on team A) see the pods you've created (on team B)? Can I change the replication controllers you've created? Can I see the environment variables in the pod templates you've created? Can my pods reach your pods over IPv4 if we're not on the same team?

There are places where you explicitly want distinct teams to coordinate, although you'd prefer they do so through a defined boundary (network host and port, load balancer, DNS name, specific API key) that traditionally sits outside the infrastructure.

The last question is a fairly specific topology request from OpenShift customers - they'd like to drop all outgoing packets from containers except those destined for certain IPs within their project / team's units. It requires a willingness to distribute the graph of connections to endpoints (which imposes its own scale limits), but it isn't terribly hard to do if motivated. That's a separate thread, though.
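
To sketch what "distributing the graph" might mean in practice - purely illustrative, since no such API exists here - each node would need a per-project allow-list of peer IPs, kept current as pods come and go:

```go
package egress

// Pod pairs a pod's IP with the project that owns it.
type Pod struct {
	IP      string
	Project string
}

// AllowedPeers computes, per project, the set of pod IPs its
// containers may send to (only pods in the same project). This is
// the data that would have to be pushed to every node and updated
// on every pod change - which is where the scale limits come from.
func AllowedPeers(pods []Pod) map[string]map[string]bool {
	allowed := make(map[string]map[string]bool)
	for _, p := range pods {
		if allowed[p.Project] == nil {
			allowed[p.Project] = make(map[string]bool)
		}
		allowed[p.Project][p.IP] = true
	}
	return allowed
}
```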

@thockin
Member

thockin commented Jul 16, 2014

That's what I thought you meant. Yeah, that all gets complicated.


@smarterclayton
Contributor Author

Updated the description at top to capture the three scenarios.

@erictune - the approach you described makes a lot of sense - hard limits for the big teams, fungible soft limits for the small ones. There's also benefit in allowing temporary resource usage over the soft limits for small teams (within bounds that they perhaps define).

@bgrant0607
Member

@smarterclayton @derekwaynecarr @erictune What else needs to be done on this issue? Do we have an overview doc that describes this area?

@smarterclayton
Contributor Author

Other than the individual efforts for authz, authn, and identity, which continue elsewhere, the primary objective of this issue is satisfied. I'd be OK with closing this and letting those issues stand on their own.

@bgrant0607 bgrant0607 added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 28, 2015
@erictune erictune self-assigned this Mar 3, 2015