Abstract Running in multiple regions is better for your users through increased availability and lower latencies, and it won’t cost as much as you think. We’ve turned region resiliency from a driver of cost and complexity into a strategic advantage by understanding human and system dynamics both at a high level and in the nitty-gritty details. Calamity, heartbreak, and inefficiency drove us to refine our approach, and our understanding, as we’ve matured.
Executing a failover used to be an all-hands-on-deck situation that would bring VPs to the table. Now, it’s a matter of routine that usually concludes with a brief “all is well” email.
This talk dives into our experience of operating in multiple regions at scale and the algebraic models, code, and incident management playbooks we’ve developed to tame, refine, and leverage our approach. Once you’ve decided to go multi-region, three major questions arise: How many regions? How should we steer users to regions? How do we actually perform the failover? In addition to the story of how we got to where we are, I’ll present the design considerations and system models we used to make those decisions.
Bio Aaron has been building, breaking, and fixing systems for over a decade, from tiny startups to serving over 125 million users at Netflix. He is presently applying his passion for empiricism and system design to multi-region high-availability architecture and operations on the Traffic team at Netflix. Previously, Aaron co-authored Chaos Engineering (O’Reilly, 2017).