
AWS ECS Downtime

[Graph: host count during the outage]

In the previous blog post I spoke too soon about AWS having fixed the issue of Docker images exhausting all of the instance’s disk space.

This morning we had some downtime on a non-critical service, and it left me with a lot of questions!

ECS production questions

  1. Why wasn’t the agent up to date?
  2. Why wasn’t a new instance automatically spawned when one failed? aka an automatic “scaling event”
  3. Can we keep two independent Healthy hosts from going Unhealthy at the same time? Check the graph above! Ideally the service hangs on with at least one healthy host.
  4. How does one scale up to an instance with a bigger disk? ecs-cli scale doesn’t allow you to express the instance type, only ecs-cli up does!?
  5. How do I quickly find what machine type I’m on, and discover which is the next instance size up with a larger disk?
  6. How can I get an alert when disk space is nearly exhausted?
  7. How can I avoid issues like this whilst trying to quickly free space?!

    $ docker rmi efaaf58ff978
    Error response from daemon: devmapper: Error saving transaction metadata: devmapper: Error writing metadata to /var/lib/docker/devicemapper/metadata/.tmp736910121: write /var/lib/docker/devicemapper/metadata/.tmp736910121: no space left on device
    
  8. Why did this happen at an odd time? There was no deployment at 6AM!

Answers

tl;dr is to UPDATE THE AGENT!

I’ll post the answers to the questions below when I find out. Note: We were running AWS ECS Agent 1.11.1. It’s very likely that new versions will fix all the issues logged here.

Before nuking the machine, I did manage to grab the logs, which unsurprisingly took a while because there were 3GB of them! I have commented on the ECS agent issue that they should be using systemd’s journalctl, or at least rotating the logs more aggressively with compression. The bulk of the logs was this line repeated over and over:

    Error retrieving stats for container LONG_HASH: dial unix /var/run/docker.sock: socket: too many open files

This is addressed in https://github.com/aws/amazon-ecs-agent/issues/488, aka agent version 1.13.0!
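Until the agent does this itself, a more aggressive logrotate policy is one way to keep its logs in check. This is only a sketch: /var/log/ecs is the ECS-Optimized AMI default, but the file glob and thresholds are assumptions.

    # Hypothetical logrotate drop-in for the ECS agent logs (glob and limits are assumptions)
    sudo tee /etc/logrotate.d/ecs-agent <<'EOF'
    /var/log/ecs/*.log* {
        daily
        rotate 3
        size 50M
        compress
        missingok
        notifempty
        copytruncate
    }
    EOF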

  1. The ECS agent version is baked right into the AWS ECS-Optimized AMI; the versions are listed here: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/container_agent_versions.html#ecs-optimized-ami-agent-versions

Thus, if you would like the most up to date agent, you can either:

a) Bake the update commands listed here into your instance (see the sketch below): http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-update.html#d0e7941

or

b) Update the underlying CloudFormation template to use the most up-to-date AMI whenever a new version is available.
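For (a), the linked doc boils down to updating the ecs-init package and restarting the agent. A minimal sketch for the ECS-Optimized AMI (Amazon Linux, upstart); treat the exact commands as an assumption and check the doc for your AMI version:

    # Sketch: keep the ECS agent current on the ECS-Optimized AMI
    sudo yum update -y ecs-init       # pulls in the latest packaged agent
    sudo stop ecs && sudo start ecs   # restart the agent so the new version takes over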

  2. Application Auto Scaling is available in Singapore. It’s just that I need to configure it. Hint: You need to manually update the service from the console to see the options.
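The same thing can also be wired up with the AWS CLI instead of the console. A minimal sketch; the cluster name, service name, role ARN and capacities are placeholders, and a scaling policy plus a CloudWatch alarm would still be needed on top:

    # Register the ECS service's DesiredCount as a scalable target (names/ARN are hypothetical)
    aws application-autoscaling register-scalable-target \
        --service-namespace ecs \
        --scalable-dimension ecs:service:DesiredCount \
        --resource-id service/my-cluster/my-service \
        --min-capacity 1 \
        --max-capacity 4 \
        --role-arn arn:aws:iam::123456789012:role/ecsAutoscaleRole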

3/8. ~6AM just so happened to be when the repeating log lines filled the disk on both instances at effectively the same moment. Doh!

  4. I need to change the “Launch configuration” of the “Auto Scaling group” (see the sketch after this list).

  5. Check the launch configuration. I need to figure out how to specify the EBS volume size manually.

  6. Run a script, or rather use Datadog’s disk instrumentation, since we are already customers.

  7. I should have tried forcing it, like so: docker rmi -f <image_id>, and of course nuking the logs first (see the cleanup sequence below).
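For 4/5: the current machine type is quick to check from the instance itself with curl http://169.254.169.254/latest/meta-data/instance-type. For the bigger disk, my understanding is that it means creating a new launch configuration with a larger Docker data volume and pointing the Auto Scaling group at it. A rough sketch with placeholder names, AMI ID and sizes; /dev/xvdcz is, I believe, the Docker data volume on the ECS-Optimized AMI:

    # Sketch: new launch configuration with a bigger Docker data volume (names/sizes are placeholders)
    aws autoscaling create-launch-configuration \
        --launch-configuration-name ecs-cluster-lc-bigdisk \
        --image-id ami-xxxxxxxx \
        --instance-type m4.large \
        --block-device-mappings '[{"DeviceName":"/dev/xvdcz","Ebs":{"VolumeSize":50,"VolumeType":"gp2"}}]'

    # Point the Auto Scaling group at it; only newly launched instances get the bigger disk
    aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name ecs-cluster-asg \
        --launch-configuration-name ecs-cluster-lc-bigdisk

Existing instances keep their old volumes, so they would need to be cycled out for the change to take effect.

For 7: the lesson is to free some space before asking Docker to do anything, then force the removal. A rough sequence, assuming the default ECS-Optimized AMI log paths; the image ID is the one from the error above:

    # Clear stopped containers and old agent logs first, then force-remove the image
    sudo docker rm $(sudo docker ps -aq -f status=exited)
    sudo find /var/log/ecs -name 'ecs-agent.log.*' -delete
    sudo docker rmi -f efaaf58ff978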

Posted 2016-10-25

Devops at Spuul. Any tips or suggestions? Reach out!