Abstract How confident are you that your prod servers will stay up without your help? Too often in tech, we mistakenly treat three important concepts as interchangeable when describing our socio-technical systems: how resilient they are, how reliable they are in day-to-day work, and how robust they are under duress. Though interrelated, they are not equivalent.
How can we successfully gain insights from post-incident reviews, run chaos engineering experiments, and build scalable infrastructure if we’re misinterpreting our approaches? By separating these core concepts, we can identify better approaches to adapting to unforeseen circumstances. We’ll look at common misconceptions about describing systems as resilient and focus on proven methods for improving our understanding of them.
Bio Will Gallego is a systems engineer with 15+ years of experience in the web development field, currently working as a Senior Engineer at Fastly. Comfortable with several parts of the stack, he now focuses on building scalable, distributed backend systems and tools to help engineers grow. He believes in a free and open internet, blame-aware retrospectives, and pronouncing gif with a soft “G”.