More general scheduling constraints #367

Closed
thockin opened this issue Jul 7, 2014 · 62 comments
Labels
priority/awaiting-more-evidence: Lowest priority. Possibly useful, but not yet enough support to actually get it done.
sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@thockin
Member

thockin commented Jul 7, 2014

There have been a few folks who have asked about machine constraints for scheduling. Let's use this issue as a place to gather ideas and requirements.

@timothysc

@verdverm

verdverm commented Jul 7, 2014

I have noticed the FirstFit (default?) scheduler co-locates pods even when there are open machines available. Each of these machines has a single CPU.

It would be nice to use information about a machine's available CPU and a pod's expected CPU requirements.

sed 's/cpu/other_machine_stat/'
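
This idea is essentially what later became per-pod resource requests. A minimal sketch in the syntax that eventually shipped (the pod name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: cpu-aware-pod
spec:
  containers:
  - name: app
    image: example/app:latest
    resources:
      requests:
        cpu: "500m"      # half a core; the scheduler only binds the pod where this fits
        memory: "256Mi"

With requests like these, a FirstFit-style scheduler can rule out machines whose remaining capacity is below the request instead of blindly co-locating pods.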

@monnand
Contributor

monnand commented Jul 7, 2014

Currently, the kubelet can get stats from cAdvisor, which would be useful for the scheduler. cAdvisor can provide different percentiles of CPU and memory usage for a container (including the root container, i.e. the machine).

@thockin
Member Author

thockin commented Jul 7, 2014

That's just "scheduling", as opposed to machine constraints, though very coarsely they feel similar :)

@timothysc
Member

Labels "seems" like the ideal place to enable a rank & requirements to define constraints. However labels would need to be regularly published by minions.

e.g.
rank = memory
requirements = gpu & clusterXYZ

I have a couple of concerns here:

  1. This treads into the full-scale scheduling world.
  2. Config syntax = ?, DSL? ...
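
The "requirements" half of this maps cleanly onto node labels plus a nodeSelector in the syntax that later shipped, where listing several labels is an implicit AND; the "rank" half has no label analogue and ended up in scheduler priority functions instead. A sketch (label keys and values are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  nodeSelector:          # all listed labels must match: "gpu & clusterXYZ"
    gpu: "true"
    cluster: clusterXYZ
  containers:
  - name: app
    image: example/app:latest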

@thockin
Member Author

thockin commented Jul 7, 2014

Let's worry about semantics before syntax. We have a similar issue open for label selectors in general - we can discuss syntax there.

@timothysc
Member

FWIW, I often view constraints as a SQL query over an NVP store.

SELECT Resources
FROM Pool
WHERE Requirements
ORDER BY Rank
...

The hardest part is defining the 'fields' in an NVP store.

@bgrant0607
Member

Scheduling based on resources and scheduling based on constraints are two significantly different issues.

We have several issues open about resource (and QoS) awareness: #147 , #160 , #168 , #274 , #317.

Constraint syntax/semantics: We should start with the proposed label selector mechanism, #341 .
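
For reference, the #341 selector mechanism grew into the matchLabels/matchExpressions form used across the API today. A sketch (keys and values are hypothetical):

selector:
  matchLabels:
    tier: frontend
  matchExpressions:      # set-based requirements, ANDed with matchLabels
  - key: environment
    operator: In
    values: ["production", "qa"]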

@timothysc
Member

I'm OK with doing the selection from a set of offers/resources from the scheduler, provided the offers have enough NVP information to enable discrimination.

@thockin
Member Author

thockin commented Jul 9, 2014

I don't know about NVP - where can I read more on it?

@bgrant0607
Member

Searching for "NVP SQL" or "name value pair SQL" or "key value pair SQL" comes up with lots of hits. Common arguments against are performance and loss of control over DB schema. But I'm getting the feeling that we're barking up the wrong forest.

@timothysc What are you trying to do? Right now, k8s has essentially no intelligent scheduling. However, that's not a desirable end state. If what you want is a scheduler, we should figure out how to support scheduling plugins and/or layers on top of k8s.

@thockin
Member Author

thockin commented Jul 9, 2014

Name Value Pairs? Now I feel dumb :)

@bgrant0607
Member

Something somewhat different from label selectors is per-attribute limits for spreading. Aurora is one system that supports this model:
https://aurora.incubator.apache.org/documentation/latest/configuration-reference/#specifying-scheduling-constraints

This is more relevant to physical rather than virtual deployments. I'd consider it a distinct mechanism from constraints. @timothysc If you'd like this, we should file a separate issue. However, I'd prefer a new failure-tolerance scheduling policy object that specifies a label selector to identify the set of instances to be spread. We could debate how to describe what kind and/or how much spreading to apply, but I'd initially just leave it entirely up to the infrastructure.
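
Per-attribute spreading of this kind eventually surfaced as topology spread constraints. A sketch in the syntax that later shipped (the app label is hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: spread-me
  labels:
    app: web
spec:
  topologySpreadConstraints:
  - maxSkew: 1                                 # pod counts per zone may differ by at most 1
    topologyKey: topology.kubernetes.io/zone   # the attribute to spread across
    whenUnsatisfiable: DoNotSchedule           # hard constraint; ScheduleAnyway makes it soft
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: app
    image: example/app:latest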

@timothysc
Member

I completely agree it's more relevant to physical rather than virtual deployments.

I was somewhat testing the possibility of enabling capabilities for more general-purpose scheduling, on par with a mini-Condor approach, but it's not a requirement.

Aurora- or Marathon-esque capabilities would fill the gap.
https://github.com/mesosphere/marathon/wiki/Constraints

@bgrant0607
Member

Note that in order to add constraints, we'd need a way to attach labels to minions/nodes.
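
This is what eventually happened: nodes became labelable API objects like everything else. A sketch (the label keys and values are hypothetical):

apiVersion: v1
kind: Node
metadata:
  name: minion-1
  labels:
    gpu: "true"      # published by the node or attached by an admin
    rack: rack-12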

@timothysc
Member

That is what I had alluded to earlier, but it received lukewarm attention. In fact, I believe Wilkes had chimed in on a different thread regarding this topic.

@brendandburns
Contributor

I think we should have labels for worker nodes, but they need to be dynamic, and that's tough without a re-scheduler.

For now, I think we should use resources on nodes, since they are already there, and they are known to be static.

You can add resource requests to pods to achieve appropriate scheduling.

Brendan

@erictune
Member

Replication controllers reschedule pods when the machines they are on are no longer available. It seems like the replication controller could do the same if the machine becomes infeasible for scheduling. A fairly simple loop could recheck predicates as a background task in the scheduler and move pods to the terminated state if they no longer fit.

Questions:

  1. If a running pod is updated so that its requirements no longer match the machine it is bound to, what happens?
    • pod moves to terminated state
    • refuse the update
    • both, and let users control the behavior (yay)

@bgrant0607
Member

The minion/node controller (#1366) should be responsible for killing pods with mismatched label selectors, and then, yes, replication controllers would recreate them.

Re. @erictune's question: Yes, we could support both, for instance, using a URL parameter to select the desired behavior.

@brendandburns
Contributor

Yeah, having the kubelet kill pods that don't match makes the most sense.

--brendan

@timothysc
Member

@davidopp

Constraints evaluate to simple true or false expressions, so it should really be && and ||; this way you can connect them, including simple predicate matching.

GROUPBY (cluster) || rack==funkytown

The more I think about it, the less I want to tread into config-language space, given the cattle idiom on services. Whether they are soft or hard could be denoted via a keyword or some other semantics.

@davidopp
Member

So IIUC, the thing you're proposing would work like this?

GROUPBY expr1 || expr2 => put a virtual label X on all the machines that match expr1 or expr2, and then try to co-locate all the pods of the service on machines with label X
GROUPBY expr1 && expr2 => put a virtual label X on all the machines that match expr1 and expr2, and then try to co-locate all the pods of the service on machines with label X
SPREADBY expr1 || expr2 => put a virtual label X on all the machines that match expr1 or expr2, and then try to spread all the pods of the service across machines with label X
SPREADBY expr1 && expr2 => put a virtual label X on all the machines that match expr1 and expr2, and then try to spread all the pods of the service across machines with label X

It would be good to flesh out some use cases...
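
The GROUPBY flavor roughly corresponds to the inter-pod affinity API that eventually shipped: a label selector identifies the pods of the service, and a topology key defines the domain to co-locate within. A sketch in the later syntax (label names are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: groupby-style-pod
  labels:
    service: my-service
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            service: my-service             # "the pods of the service"
        topologyKey: kubernetes.io/hostname # co-locate on the same machine
  containers:
  - name: app
    image: example/app:latest

Swapping podAffinity for podAntiAffinity gives the SPREADBY flavor.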

@bgrant0607
Member

I agree that flavors of affinity and anti-affinity are the two basic features that would satisfy most use cases.

With respect to As Simple As Possible, specifying just whether to group or spread seems like the simplest possible API. That needs to be associated with some set of pods via a label selector (in which object, TBD). Node groups to concentrate in or spread across could be configured in the scheduler in most cases.

@jcderr

jcderr commented Sep 17, 2015

+1

I deploy some fairly hefty celery tasks in our cluster, and definitely do not ever want more than one running on the same host at the same time. I'd rather some get left unscheduled and have a monitoring task pick them up by scaling my cluster up.
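
This use case maps directly onto the pod anti-affinity API that eventually shipped: a hard anti-affinity rule against the pod's own label on the hostname topology key keeps at most one matching pod per node, and any extras stay Pending. A sketch (names are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: celery-worker
  labels:
    app: celery-worker
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: celery-worker        # do not land next to another worker
        topologyKey: kubernetes.io/hostname
  containers:
  - name: worker
    image: example/celery:latest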

@dchen1107
Member

cc/ @pwittrock

@dchen1107
Member

#13524

@davidopp
Member

davidopp commented Dec 7, 2015

This is part of #18261

@timothysc
Member

@davidopp I think it is reasonable to close this issue in favor of the assorted proposals.

@davidopp
Member

@timothysc Let's wait until we merge the proposal.

@bgrant0607
Member

Affinity/anti-affinity proposals merged and implementations are underway.
