
Explore memory pressure on etcd. #3853

Closed
brendandburns opened this issue Jan 28, 2015 · 17 comments
Labels: area/etcd, area/introspection, priority/important-soon, sig/node

Comments

@brendandburns
Contributor

It appears that our event stream is putting pressure on etcd and leading to crashes. Investigate event and etcd memory performance.

@brendandburns brendandburns added the priority/important-soon label Jan 28, 2015
@dchen1107
Member

There are a couple of ways to do this:

  1. Create cgroups in a cluster init.d-like script, and fork/exec the etcd process inside those cgroups, including the memcg cgroup. Register such a cgroup with cAdvisor, and Heapster will pick up the stats.

  2. Run etcd as a pod. Here is a link on how to run etcd as a Docker container: https://coreos.com/blog/Running-etcd-in-Containers/

With 2), besides resource monitoring and logging, we would also benefit from the Kubelet's RestartPolicy feature, so #3852 is not needed.
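For option 1, here is a minimal sketch (not from this thread) of what the init-script equivalent could do. The cgroup name, memory limit, and etcd binary path are hypothetical, and a cgroup v1 memory hierarchy mounted at /sys/fs/cgroup/memory is assumed:

```go
// Sketch: create a memcg cgroup, set a limit, and exec etcd inside it.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

func main() {
	cg := "/sys/fs/cgroup/memory/etcd" // hypothetical cgroup name
	if err := os.MkdirAll(cg, 0755); err != nil {
		panic(err)
	}
	// Hypothetical 512 MiB limit so the kernel OOM-kills etcd instead of the whole node.
	if err := os.WriteFile(filepath.Join(cg, "memory.limit_in_bytes"), []byte("536870912"), 0644); err != nil {
		panic(err)
	}
	// Move this process into the cgroup; the exec'd etcd inherits membership.
	pid := fmt.Sprintf("%d", os.Getpid())
	if err := os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(pid), 0644); err != nil {
		panic(err)
	}
	// Replace this process with etcd (binary path and flags are placeholders).
	if err := syscall.Exec("/usr/local/bin/etcd", []string{"etcd", "--data-dir=/var/lib/etcd"}, os.Environ()); err != nil {
		panic(err)
	}
}
```

With the cgroup registered, cAdvisor would then expose its memory stats the same way it does for containers; option 2 gets the same visibility for free because Docker already puts the container in its own cgroup.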

@dchen1107
Member

cc/ @ArtfulCoder

@brendandburns
Contributor Author

Yeah, I think that we should do #2, and then we can use cAdvisor/Grafana, etc. to track memory usage over time under varying load.

--brendan


@satnam6502
Contributor

Interested. I think one of the reasons I can't get a functional 500-node cluster is etcd, so any help with debugging would be great.

@dchen1107
Member

Just had offline discussion with @ArtfulCoder. He will work on this as a starter project.

@xiang90
Contributor

xiang90 commented Jan 28, 2015

@brendandburns @dchen1107
FYI: We are working on moving etcd to a disk backed system.

@derekwaynecarr
Member

I don't see how running etcd in a pod helps in the near term without knowing why it was actually crashing for us. I would be curious, if you can reproduce it, how big the etcd snapshot was getting when it crashed, to know whether it was the data size causing the issue or just the traffic.

@xiang90 - moving to a disk backed system +1000
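Regarding the snapshot-size question above: a minimal sketch for getting a ballpark number from a master node. The data-dir layout is an assumption (older and 2.x etcd versions differ), so adjust the glob patterns as needed:

```go
// Sketch: report on-disk snapshot sizes for a local etcd member.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	for _, pattern := range []string{
		"/var/lib/etcd/snap/*.snap",        // hypothetical older data-dir layout
		"/var/lib/etcd/member/snap/*.snap", // hypothetical 2.x data-dir layout
	} {
		matches, _ := filepath.Glob(pattern)
		for _, m := range matches {
			if info, err := os.Stat(m); err == nil {
				fmt.Printf("%s\t%d bytes\n", m, info.Size())
			}
		}
	}
}
```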

@dchen1107
Member

@xiang90 +1 on moving etcd to a disk backed system.

@derekwaynecarr We saw a couple of incidents reported lately where the cluster was down, pods stayed pending, kubectl got errors, etc. After some investigation, all of them turned out to be due to etcd crashing. Investigating further, there is an OOM event recorded in the kernel log. Since etcd runs on the master node as a "free" process today, we have no idea how much memory it was using when the kernel OOM-killed it.

On GCE, every master node already has the kubelet and cAdvisor running. With etcd running in a pod, cAdvisor will report its resource usage. When the etcd Docker container is OOM-killed, the Kubelet will auto-restart it based on the RestartPolicy and log an OOM event (#2853; cAdvisor just added a feature to detect such events). I agree this is just a first step toward exploring and solving the memory pressure. We also already know that we send too many duplicate events to etcd, but that is a different topic and will be addressed separately.
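As a rough illustration of pulling those numbers once etcd is in a pod, a sketch that queries cAdvisor's REST API for a container's recent memory usage. The port (4194), API version path, and container path are assumptions and may differ by setup and cAdvisor version:

```go
// Sketch: fetch recent memory usage for a container from cAdvisor's REST API.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical: cAdvisor on the master node, etcd running under this container path.
	url := "http://localhost:4194/api/v1.3/containers/docker/etcd"
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Decode loosely so we don't depend on exact struct definitions.
	var info map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		panic(err)
	}
	stats, _ := info["stats"].([]interface{})
	for _, s := range stats {
		stat, _ := s.(map[string]interface{})
		mem, _ := stat["memory"].(map[string]interface{})
		fmt.Println("timestamp:", stat["timestamp"], "memory usage (bytes):", mem["usage"])
	}
}
```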

@dchen1107 dchen1107 added the sig/node label Feb 4, 2015
@dchen1107 dchen1107 added this to the v1.0 milestone Feb 18, 2015
@bprashanth
Contributor

I debugged the memory problem when I had some free time, because it was annoying to have my cluster going down every few days. Debug notes:

@bgrant0607
Member

cc @hchoudh

@ghost

ghost commented Apr 6, 2015

/quinton

@xiang90
Contributor

xiang90 commented Apr 12, 2015

@brendandburns @dchen1107

We have done some investigation around this. Typically etcd will use up to 10-20x memory for small key-values. The OOM @dchen1107 observed was caused by increased memory usage during recovery. We have some fixes included in the next etcd release. However, you still need to prepare for up to 40x memory usage while snapshotting/recovering.

My question is: what is the hard requirement for k8s v1? Do you have a data size estimate?
We can try our best to meet the requirement with today's etcd storage backend.

(We are also planning a new backend to completely solve the data scalability issue.)

/cc @philips
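To make those multipliers concrete, a back-of-the-envelope sketch; the logical data size is a hypothetical figure, not an estimate from this thread:

```go
// Back-of-the-envelope: memory headroom needed for a given logical data size,
// using the rough 10-20x steady-state and ~40x snapshot/recovery multipliers above.
package main

import "fmt"

func main() {
	logicalDataMB := 50.0 // hypothetical k8s data size estimate

	fmt.Printf("steady state: %.0f-%.0f MB\n", logicalDataMB*10, logicalDataMB*20)   // 500-1000 MB
	fmt.Printf("snapshot/recovery worst case: ~%.0f MB\n", logicalDataMB*40)          // ~2000 MB
}
```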

@xiang90
Contributor

xiang90 commented Apr 12, 2015

also see etcd-io/etcd#2657 etcd-io/etcd#2658

@xiang90
Contributor

xiang90 commented Apr 13, 2015

also see etcd-io/etcd#2662

@brendandburns
Contributor Author

Moving this out of 1.0, as I believe any related issues are covered by performance metrics elsewhere.

@brendandburns brendandburns removed this from the v1.0 milestone Apr 27, 2015
@dchen1107
Member

Besides performance metrics, a little more status on this:

  1. Done: all master components are running in a pod, including the etcd server.
  2. Done: all master pods' stats are collected by cAdvisor.
  3. Done: cAdvisor now reports system OOM events to the kubelet, which runs on the master node too.
  4. Done: a PR on reporting OOM events to upstream layers is pending.
  5. Pending: @vishh is working on reporting the system OOM cause in ContainerStatus. ContainerStatus also includes last_terminated_reason now.
  6. Done: kubectl get pods now reports all static pods, including the master component pods. Once 5) is done, the user can easily figure out whether etcd_server was recently killed due to a system OOM or for some other reason, and what its current status is (see the sketch at the end of this comment).

Here is what we haven't done yet:

  1. NOT done: we need to configure the master node the same way as the minions, but mark it unschedulable, so that monitoring such as Heapster can collect the stats related to the master pods.
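For item 6, once ContainerStatus carries the termination reason, a check could look like the sketch below. It reads pod JSON (e.g. saved from kubectl get pod ... -o json into a hypothetical file) and looks for an OOM termination; the field names assume the containerStatuses/lastState shape, which may not match the older API versions discussed in this thread:

```go
// Sketch: inspect saved pod JSON and report containers whose last termination
// looks like an OOM kill. Older API versions used different field names
// (e.g. last_terminated_reason), so adjust the lookups accordingly.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

func main() {
	data, err := os.ReadFile("etcd-server-pod.json") // hypothetical output of `kubectl get pod ... -o json`
	if err != nil {
		panic(err)
	}
	var pod map[string]interface{}
	if err := json.Unmarshal(data, &pod); err != nil {
		panic(err)
	}
	status, _ := pod["status"].(map[string]interface{})
	statuses, _ := status["containerStatuses"].([]interface{})
	for _, cs := range statuses {
		c, _ := cs.(map[string]interface{})
		last, _ := c["lastState"].(map[string]interface{})
		term, _ := last["terminated"].(map[string]interface{})
		reason, _ := term["reason"].(string)
		if strings.Contains(strings.ToLower(reason), "oom") {
			fmt.Printf("container %v was last terminated with reason %q\n", c["name"], reason)
		}
	}
}
```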

@mikedanese
Member

All the things Dawn listed are done, so closing. Feel free to reopen. cc @wojtek-t since this is perf-related.
