
Explore memory pressure on etcd. #3853

Closed
brendandburns opened this issue Jan 28, 2015 · 17 comments
Labels: area/etcd, area/introspection, priority/important-soon, sig/node

Comments

@brendandburns
Contributor

It appears that our event stream is putting pressure on etcd and leading to crashes. Investigate event and etcd memory performance.

@brendandburns brendandburns added the priority/important-soon label Jan 28, 2015
@dchen1107
Member

There are a couple of ways to do this:

  1. Create cgroups in a cluster init.d-like script, and fork/exec the etcd process inside those cgroups, including the memcg cgroup. Register such a cgroup with cAdvisor, and Heapster will pick up the stats.

  2. Run etcd as a pod. Here is a link on how to run etcd as a Docker container: https://coreos.com/blog/Running-etcd-in-Containers/

With 2), besides resource monitoring and logging, we would also benefit from the Kubelet's RestartPolicy feature, so #3852 is not needed.
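For option 1, here is a minimal sketch (not from this thread) of what the init-script equivalent could do. The cgroup name, memory limit, and etcd binary path are hypothetical, and a cgroup v1 memory hierarchy mounted at /sys/fs/cgroup/memory is assumed:

```go
// Sketch: create a memcg cgroup, set a limit, and exec etcd inside it.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

func main() {
	cg := "/sys/fs/cgroup/memory/etcd" // hypothetical cgroup name
	if err := os.MkdirAll(cg, 0755); err != nil {
		panic(err)
	}
	// Hypothetical 512 MiB limit so the kernel OOM-kills etcd instead of the whole node.
	if err := os.WriteFile(filepath.Join(cg, "memory.limit_in_bytes"), []byte("536870912"), 0644); err != nil {
		panic(err)
	}
	// Move this process into the cgroup; the exec'd etcd inherits membership.
	pid := fmt.Sprintf("%d", os.Getpid())
	if err := os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(pid), 0644); err != nil {
		panic(err)
	}
	// Replace this process with etcd (binary path and flags are placeholders).
	if err := syscall.Exec("/usr/local/bin/etcd", []string{"etcd", "--data-dir=/var/lib/etcd"}, os.Environ()); err != nil {
		panic(err)
	}
}
```

With the cgroup registered, cAdvisor would then expose its memory stats the same way it does for containers; option 2 gets the same visibility for free because Docker already puts the container in its own cgroup.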

@dchen1107
Member

cc/ @ArtfulCoder

@brendandburns
Contributor Author

Yeah, I think that we should do #2, and then we can use cAdvisor/Grafana, etc. to track memory usage over time under varying load.

--brendan


@satnam6502
Contributor

Interested. I think one of the reasons I can't get a functional 500-node cluster is etcd, so any help with debugging would be great.

@dchen1107
Member

Just had offline discussion with @ArtfulCoder. He will work on this as a starter project.

@xiang90
Contributor

xiang90 commented Jan 28, 2015

@brendandburns @dchen1107
FYI: We are working on moving etcd to a disk backed system.

@derekwaynecarr
Member

I don't see how running etcd in a pod helps in the near term without knowing why it was actually crashing for us. I would be curious, if you can reproduce it, how big the etcd snapshot was getting when it crashed, to know whether it was the data size causing the issue or just the traffic.

@xiang90 - moving to a disk backed system +1000
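Regarding the snapshot-size question above: a minimal sketch for getting a ballpark number from a master node. The data-dir layout is an assumption (older and 2.x etcd versions differ), so adjust the glob patterns as needed:

```go
// Sketch: report on-disk snapshot sizes for a local etcd member.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	for _, pattern := range []string{
		"/var/lib/etcd/snap/*.snap",        // hypothetical older data-dir layout
		"/var/lib/etcd/member/snap/*.snap", // hypothetical 2.x data-dir layout
	} {
		matches, _ := filepath.Glob(pattern)
		for _, m := range matches {
			if info, err := os.Stat(m); err == nil {
				fmt.Printf("%s\t%d bytes\n", m, info.Size())
			}
		}
	}
}
```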

@dchen1107
Member

@xiang90 +1 on moving etcd to a disk backed system.

@derekwaynecarr We saw a couple of incidents reported lately where the cluster was down, pods stayed pending, kubectl got errors, etc. After some investigation, all of them turned out to be due to etcd crashing. Investigating further, there is an OOM event recorded in the kernel log. Since etcd runs on the master node as a "free" process today, we have no idea how much memory it was using when the kernel OOM-killed it.

On GCE, every master node already has the kubelet and cAdvisor running. With etcd running in a pod, cAdvisor will report its resource usage. When the etcd Docker container is OOM-killed, the Kubelet will auto-restart it based on the RestartPolicy and log an OOM event (#2853; cAdvisor just added a feature to detect such events). I agree this is just a first step toward exploring and solving the memory pressure. We also already know that we send too many duplicate events to etcd, but that is a different topic and will be addressed separately.
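As a rough illustration of pulling those numbers once etcd is in a pod, a sketch that queries cAdvisor's REST API for a container's recent memory usage. The port (4194), API version path, and container path are assumptions and may differ by setup and cAdvisor version:

```go
// Sketch: fetch recent memory usage for a container from cAdvisor's REST API.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical: cAdvisor on the master node, etcd running under this container path.
	url := "http://localhost:4194/api/v1.3/containers/docker/etcd"
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Decode loosely so we don't depend on exact struct definitions.
	var info map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		panic(err)
	}
	stats, _ := info["stats"].([]interface{})
	for _, s := range stats {
		stat, _ := s.(map[string]interface{})
		mem, _ := stat["memory"].(map[string]interface{})
		fmt.Println("timestamp:", stat["timestamp"], "memory usage (bytes):", mem["usage"])
	}
}
```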

@dchen1107 dchen1107 added the sig/node label Feb 4, 2015
@dchen1107 dchen1107 added this to the v1.0 milestone Feb 18, 2015
@bprashanth
Contributor

I debugged the memory problem when I had some free time, because it was annoying to have my cluster going down every few days. Debug notes:

@bgrant0607
Member

cc @hchoudh

@ghost

ghost commented Apr 6, 2015

/quinton

@xiang90
Contributor

xiang90 commented Apr 12, 2015

@brendandburns @dchen1107

We have done some investigation around this. Typically etcd will use up to 10-20x memory for small key-values. The OOM @dchen1107 observed was caused by increased memory usage during recovery. We have some fixes included in the next etcd release. However, you still need to prepare for up to 40x memory usage while snapshotting/recovering.

My question is: what is the hard requirement for k8s v1? Do you have a data size estimate?
We can try our best to meet the requirement with today's etcd storage backend.

(We are also planning a new backend to completely solve the data scalability issue.)

/cc @philips
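To make those multipliers concrete, a back-of-the-envelope sketch; the logical data size is a hypothetical figure, not an estimate from this thread:

```go
// Back-of-the-envelope: memory headroom needed for a given logical data size,
// using the rough 10-20x steady-state and ~40x snapshot/recovery multipliers above.
package main

import "fmt"

func main() {
	logicalDataMB := 50.0 // hypothetical k8s data size estimate

	fmt.Printf("steady state: %.0f-%.0f MB\n", logicalDataMB*10, logicalDataMB*20)   // 500-1000 MB
	fmt.Printf("snapshot/recovery worst case: ~%.0f MB\n", logicalDataMB*40)          // ~2000 MB
}
```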

@xiang90
Contributor

xiang90 commented Apr 12, 2015

also see etcd-io/etcd#2657 etcd-io/etcd#2658

@xiang90
Contributor

xiang90 commented Apr 13, 2015

also see etcd-io/etcd#2662

@brendandburns
Contributor Author

Moving this out of 1.0, as I believe any related issues are covered by performance metrics elsewhere.

@brendandburns brendandburns removed this from the v1.0 milestone Apr 27, 2015
@dchen1107
Member

Besides performance metrics, a little more status on this:

  1. Done: all master components are running in a pod, including the etcd server.
  2. Done: all master pods' stats are collected by cAdvisor.
  3. Done: cAdvisor now reports system OOM events to the kubelet, which runs on the master node too.
  4. Done: a PR on reporting OOM events to upstream layers is pending.
  5. Pending: @vishh is working on reporting the system OOM cause in ContainerStatus. ContainerStatus also includes last_terminated_reason now.
  6. Done: kubectl get pods now reports all static pods, including the master component pods. Once 5) is done, the user can easily figure out whether etcd_server was recently killed due to a system OOM or for some other reason, and what its current status is (see the sketch at the end of this comment).

Here is what we haven't done yet:

  1. NOT done: we need to configure the master node the same way as the minions, but mark it unschedulable, so that monitoring such as Heapster can collect the stats related to the master pods.
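For item 6, once ContainerStatus carries the termination reason, a check could look like the sketch below. It reads pod JSON (e.g. saved from kubectl get pod ... -o json into a hypothetical file) and looks for an OOM termination; the field names assume the containerStatuses/lastState shape, which may not match the older API versions discussed in this thread:

```go
// Sketch: inspect saved pod JSON and report containers whose last termination
// looks like an OOM kill. Older API versions used different field names
// (e.g. last_terminated_reason), so adjust the lookups accordingly.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

func main() {
	data, err := os.ReadFile("etcd-server-pod.json") // hypothetical output of `kubectl get pod ... -o json`
	if err != nil {
		panic(err)
	}
	var pod map[string]interface{}
	if err := json.Unmarshal(data, &pod); err != nil {
		panic(err)
	}
	status, _ := pod["status"].(map[string]interface{})
	statuses, _ := status["containerStatuses"].([]interface{})
	for _, cs := range statuses {
		c, _ := cs.(map[string]interface{})
		last, _ := c["lastState"].(map[string]interface{})
		term, _ := last["terminated"].(map[string]interface{})
		reason, _ := term["reason"].(string)
		if strings.Contains(strings.ToLower(reason), "oom") {
			fmt.Printf("container %v was last terminated with reason %q\n", c["name"], reason)
		}
	}
}
```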

@mikedanese
Member

All the things Dawn listed are done, so closing. Feel free to reopen. cc @wojtek-t since this is perf-related.
