How our Security Requirements Turned Us into Accidental Chaos Engineers

Paul Carleton

Stripe

@paulcarletonjr

Abstract This talk will cover a security-focused project that evolved into a chaos injection system.

The system is called “Lifespan Management,” and it enforces a lifespan on a cloud-hosted VM. After the lifespan expires, the host is terminated and a replacement is brought up. This makes it easier to apply fixes for CVEs (a CVE comes out on day X, and we know hosts will age out by day Y) and reduces the value of a compromised machine (“I’ve finally captured a host! It’s being shut down?? No!”).
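As a rough illustration of the idea only, and not Stripe's actual implementation, the lifespan check and the "day X to day Y" reasoning can be sketched as below. The `Host` type, the function names, and the 14-day lifespan in the usage note are all hypothetical; the sketch also assumes replacements are launched from an image that already contains the fix.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Host:
    """Hypothetical record for a cloud VM; not a real provider API type."""
    host_id: str
    launched_at: datetime


def expired_hosts(hosts, max_lifespan, now):
    """Return the hosts whose age has reached the allowed lifespan.

    A real system would then terminate each of these and bring up a
    replacement; that part is deliberately left out of this sketch.
    """
    return [h for h in hosts if now - h.launched_at >= max_lifespan]


def age_out_deadline(fix_available_at, max_lifespan):
    """Latest time by which every host launched before the fix is gone.

    Assumes every host launched after fix_available_at already carries
    the fix, so only older hosts need to age out.
    """
    return fix_available_at + max_lifespan
```

For example, with a hypothetical 14-day lifespan, a fix that lands in the base image on January 3 gives an age-out deadline of January 17, with no per-host patching campaign.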

This seemed simple enough, but the complexity it uncovered made for a fun, year-long adventure in chaos engineering.

In this talk, I’ll cover the evolution of the system and some lessons we learned along the way, such as:

  • Not all termination API calls are created equal
  • Zero failing health checks does not mean a host is healthy
  • Answering “Was this the chaos system?” quickly is essential

I’ll also include anecdotes like how it helped with Spectre/Meltdown mitigations, how it mercilessly killed all our Kubernetes workers, and how it locked us out of our QA environment.

Bio Paul is a software engineer on Stripe’s Cloud team who wants to make systems that are delightful to work with. Outside of work, Paul writes about Linux, cycles, and spends too much time thinking about the moon.
