REdeploy | Experimentation. Chaos. Resilience.

John Allspaw

Adaptive Capacity Labs

@allspaw
Thursday, 9:10 AM

In the Center of the Cyclone: Finding Sources of Resilience

Abstract Sustaining the potential to adapt to unforeseen situations (resilience) is a necessary element in complex systems. One could say that all successful endeavors require this. But resilience is (in many ways) both invisible and also difficult to locate in concrete and grounded ways. Understanding complex systems cannot rely on simple approaches, by definition.

“Monitoring,” “observability,” “culture,” “management,” “organizational design” ... none of these terms, concepts, or approaches can singularly help us in this area. We’ll walk through empirically-supported approaches that do.

Bio John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments: biotech, government, online media, social networking, and e-commerce. John’s publications include the books The Art of Capacity Planning and Web Operations as well as the foreword to “The DevOps Handbook.”

His 2009 Velocity talk with Paul Hammond, 10+ Deploys Per Day: Dev and Ops Cooperation, helped start the DevOps movement. John served as SVP of Infrastructure and Operations and then CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.

David Blank-Edelman

Microsoft

@otterbook
Thursday, 11:30 AM

When Will They Ever Learn?

Abstract Did you know there is a feeling that is the opposite of déjà vu called jamais vu? “Jamais vu involves a sense of eeriness and the observer’s impression of seeing the situation for the first time, despite rationally knowing that he or she has been in the situation before.” (Wikipedia).

Our industry is rife with the opportunity for both jamais vu and déjà vu. Cloud, DevOps, SRE, {latest hotness here}, you name it—they all provoke some combination of these feelings, especially if you’ve been in the operations field for a substantial time. The question is: what do we do with these feelings? This leads to further inquiry, like: what’s a productive response to each, how do we express them constructively to others, how do we allow them to push our field forward (in a humane way) rather than hold it back? Oh, and where do all of these curious responses to the major shifts in our field come from, anyway?

Join me for a talk that will be equal parts neuropsychology, group experimentation, and social and technological commentary (not to mention a little help for you when you try to sort things out in the future). Let’s put the pedal to the meta and together change the way we think about the way we think about the most important developments in our field.

Bio David has over thirty years of experience in the SRE/DevOps/sysadmin field in large multiplatform environments and currently works for Microsoft as a Senior Cloud Ops Advocate. He is the author of the O’Reilly Otter book (Automating Systems Administration with Perl) and the editor/curator of “Seeking SRE: Conversations on Running Production Systems at Scale” (published by O’Reilly in August 2018). David is one of the co-founders of the now global set of SREcon conferences.

Aaron Blohowiak

Netflix

@aaronblohowiak
Thursday, 9:55 AM

Availability, Latency and Cost: Withstanding Regional Outages

Abstract Running in multiple regions is better for your users through increased availability and lower latencies, and it won’t cost as much as you think. We’ve turned region resiliency from a driver of cost and complexity into a strategic advantage by understanding human and system dynamics both at a high-level and in the nitty-gritty details. Calamity, heartbreak, and inefficiency drove us to refine our approach — and our understanding — as we’ve matured.

Executing a failover used to be an all-hands-on-deck situation that would bring VPs to the table. Now, it’s a matter of routine that usually concludes with a brief “all is well” email.

This talk dives into the experiences of operating in multiple regions at scale and the algebraic models, code and incident management playbooks we’ve developed to tame, refine and leverage our approach. Once you’ve decided to go multi-region, the three major questions that arise are: How many regions? How should we steer users to regions? How do we actually perform the failover? In addition to the story of how we got to where we are, I’ll present the design considerations and system models we used to make those decisions.

Bio Aaron has been building, breaking, and fixing systems for over a decade, from tiny startups to serving over 125 million users at Netflix. He is presently applying his passion for empiricism and system design to multi-region high-availability architecture and operations on the Traffic team at Netflix. Previously, Aaron co-authored Chaos Engineering (O’Reilly, 2017).

VM (Vicky) Brasseur

Open Source Consultant

@vmbrasseur
Thursday, 3:40 PM

The Human Nature of Failure and Resiliency

Abstract Projects fail in droves. Systems hiccup and hours of downtime follow. Screws fall out all the time; the world is an imperfect place.

We talk a lot about building resilient systems, but all systems are (at least for now) built by humans. Humans who have been making the same types of mistakes for thousands of years.

Just because failure happens doesn’t mean we can’t do our best to prevent it or—at the very least—to minimize the damage when it does. As a matter of fact, embracing failure can be one of the best things you do for your system. Failure is a vital part of evolution. By learning to love failure we learn how to take the next step forward. Ignoring or punishing failure leads to stagnation and wasted potential.

This talk distills 3000 pages of failure research into 40 minutes of knowledge about the human factors of failure, how it can be recognised, and how you can work around it to create more resilient systems.

By the end of this talk the audience will have an awareness of the most common psychological reasons for mistakes and failures and how to develop systems and processes to protect against them.

Bio VM (aka Vicky) spent most of her 20 years in the tech industry leading software development departments and teams, and providing technical management and leadership consulting for small and medium businesses. Now she leverages nearly 30 years of free and open source software experience and a strong business background to advise companies about free/open source, technology, community, business, and the intersections between them.

She is the author of Forge Your Future with Open Source, the first book to detail how to contribute to free and open source software projects. Think of it as the missing manual of open source contributions and community participation. The book is published by The Pragmatic Programmers and is now available in an early release beta version. It’s available at fossforge.com.

Vicky is the proud winner of the Perl White Camel Award (2014) and the O’Reilly Open Source Award (2016). She’s a moderator and author for opensource.com, a Director for the Open Source Initiative, and a frequent and popular speaker at free/open source conferences and events. She blogs about free/open source, business, and technical management at anonymoushash.vmbrasseur.com.

Matt Broberg

Sensu

@mbbroberg
Friday, 2:00 PM

Out of Maintenance Mode: Refactoring a Community

Abstract We are used to slowdowns before a big push forward: the code freeze before every software release and maintenance mode before a deployment. But what happens when you need to take a full step back and reevaluate the direction your initiative is heading in?

This is the story of Sensu, an open source project that was wildly successful without much structure, but then hit a plateau. After several conversations with the top contributors, VP of Community Matt Broberg took drastic measures and announced that the project was going into “Maintenance Mode,” giving all project participants the chance to rethink how they communicated. The Sensu community has since come out of Maintenance Mode with clearer contributor guidelines, better coding practices and a stronger communication structure. But that’s just one story.

This talk digs into why all companies benefit from a Maintenance Mode at times of significant change. It will provide a framework to identify contributor needs and give recommendations on how to set up your initiative for success. Whether you’re the lead of a community or of the software stack of an entire freaking company, this intentional look at what you’re building and how people can communicate about it is relevant to you.

Bio Matt is VP of Community for Sensu Inc., focused on the incredible community around Sensu, the open source monitoring framework. He contributes to infrastructure communities with a focus on open collaboration, especially through GitHub. Matt has spoken at many conferences (OSCON, VMworld, Velocity) and on podcasts (Cloudcast, Speaking in Tech, The Hot Aisle), and co-created the Geek Whisperers podcast. Matt is on the board of the Influence Marketing Council, co-maintains the Evangelist Collective, contributes to the Go Community Outreach Working Group, occasionally blogs on Medium.com and shares code on GitHub. He’s also a fan of tattoos, rock climbing and cats, though remains unsure of Schrödinger’s.

Paul Carleton

Stripe

@paulcarletonjr
Friday, 2:45 PM

How our Security Requirements Turned Us into Accidental Chaos Engineers

Abstract This talk will cover a security focused project that evolved into a chaos injection system.

The system is called “Lifespan Management” and it enforces a lifespan on a cloud-hosted VM. After the lifespan expires, the host is terminated, and a replacement is brought up. It has the benefits of making it easier to apply fixes for CVEs (a CVE comes out on day X, we know hosts will age out by day Y), and reducing the value of a compromised machine (“I’ve finally captured a host! It’s being shut down?? No!”).
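As a rough illustration of the lifespan idea described above, the check can be sketched as a comparison between a host's age and a maximum lifespan. This is a minimal sketch, not Stripe's implementation: the 14-day value, the host names, and the functions `is_expired` and `hosts_to_replace` are all hypothetical, and the actual termination and replacement steps are left out.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lifespan; the abstract does not state the real value.
MAX_LIFESPAN = timedelta(days=14)


def is_expired(launched_at, now, max_lifespan=MAX_LIFESPAN):
    """Return True once a host has lived for at least max_lifespan."""
    return now - launched_at >= max_lifespan


def hosts_to_replace(launch_times, now, max_lifespan=MAX_LIFESPAN):
    """Return the sorted names of hosts whose lifespan has expired."""
    return sorted(
        name
        for name, launched_at in launch_times.items()
        if is_expired(launched_at, now, max_lifespan)
    )


# Example fleet with made-up launch times.
now = datetime(2018, 8, 16, tzinfo=timezone.utc)
fleet = {
    "web-1": now - timedelta(days=20),  # past its lifespan
    "web-2": now - timedelta(days=3),   # still young
}
print(hosts_to_replace(fleet, now))  # ['web-1']
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test and makes it simple to answer "which hosts will age out by day Y?" ahead of time.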

This seemed simple enough, but the complexity it uncovered made for a fun, year-long adventure in chaos engineering.

In this talk, I’ll cover the evolution of the system, and some lessons we learned along the way like:

All termination API calls are not created equal
Zero failing health checks does not mean a host is healthy
Answering “Was this the chaos system?” quickly is essential

I’ll also include anecdotes like how it helped with Spectre/Meltdown mitigations, how it mercilessly killed all our Kubernetes workers, and how it locked us out of our QA environment.

Bio Paul is a software engineer on Stripe’s Cloud team and he wants to make systems that are delightful to work with. Outside of work, Paul writes about Linux, cycles, and spends too much time thinking about the moon.

Cecilia Deng

Amazon

@cicikendiggit
Friday, 9:55 AM

Try Catch Blocks for your Distributed System

Abstract Whether you are trying to introduce change or the world is introducing it for you, your distributed system needs to be able to handle it. You want pipelines that help ensure your expected changes are good, but you also want detectors for both expected and unexpected change in production, as well as preparations within your system to protect against bad change like dependency failures or latency spikes. In this talk I explore catching distributed system problems and handling them with approaches like caching, retries and tactics to take advantage of multi-host redundancy.
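One way to picture the "try/catch" framing with retries and caching is a wrapper that retries a dependency call with exponential backoff and, if every attempt fails, falls back to the last cached value. This is a generic sketch of that pattern rather than anything taken from the talk; `DependencyError`, `call_with_retries`, and all parameter names are assumptions made for illustration.

```python
import time


class DependencyError(Exception):
    """Hypothetical error raised when a downstream call fails."""


def call_with_retries(fetch, cache, key, attempts=3, base_delay=0.1,
                      sleep=time.sleep):
    """Call fetch(key) up to `attempts` times, then fall back to the cache."""
    for attempt in range(attempts):
        try:
            value = fetch(key)
            cache[key] = value  # refresh the cache on every success
            return value
        except DependencyError:
            if attempt < attempts - 1:
                # Exponential backoff between attempts: 0.1s, 0.2s, ...
                sleep(base_delay * (2 ** attempt))
    if key in cache:
        return cache[key]  # stale but available
    raise DependencyError(f"no fresh or cached value for {key!r}")
```

The `sleep` parameter is injectable so that tests can skip the real delay. Serving a stale cached value is a trade-off: it keeps the caller working during a dependency failure at the cost of possibly outdated data.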

Bio Cecilia Deng is a software development engineer turned manager at Amazon Web Services, delivering services powered by the cloud that support some of the biggest online companies today, including Netflix, Spotify, and Pinterest. Cecilia joined AWS over 3 years ago from Vancouver, BC, where she was creating online services for the games industry at Electronic Arts. She graduated from the University of British Columbia with a degree in Computer Science and Math.

Hannah Foxwell

Pivotal

@hannahfoxwell
Thursday, 10:50 AM

Resilient Systems Require Resilient People

Abstract Building resilient systems is what we do and we do it well, but how much time do we spend working on our own personal resilience? In the ever-changing world of technology, how do we ensure we are flexible, adaptable and resilient in the face of challenges and setbacks?

In this talk we’ll look at ways in which we can improve the resilience of our organisations, our teams and ourselves. Because if your team isn’t ready for change, your platform isn’t either.

Bio Hannah Foxwell is Delivery Manager at Pivotal. In both enterprise organisations and startups Hannah has spent her career trying to create awesome working environments for engineers to do their best work, and she continues this at Pivotal today helping teams transform how they deliver software.

Hannah leads the HumanOps community in the UK and organizes DevOpsDays London.

Nora Jones

Netflix

@nora_js
Thursday, 2:45 PM

Chaos Engineering: A Step Towards Resilience

Abstract Chaos Engineering is a helpful tool in understanding your system’s unknowns, but it is not the means to an end for achieving resilience. Instead, it helps to instill higher confidence in the ability to cope and be resilient in the face of inevitable failures.

In this talk, I’ll go over lessons learned and the impact to this confidence that Chaos Engineering has had at Netflix. As John Allspaw has said, "Resilience is the story of the outage that didn’t happen". I’ll share those stories from Chaos vulnerabilities that our team has found, how we follow those vulnerabilities, and how Chaos Engineering is incorporated into our day-to-day culture.

Bio Nora is a Senior Software Engineer at Netflix and a student of Human Factors and Systems Safety at Lund University. She is passionate about resilient software, people, and the intersection of those two worlds.

She recently co-wrote the book on Chaos Engineering and keynoted AWS re:Invent to an audience of over 40,000 people about the benefits and business case behind implementing Chaos Engineering.

Jessica Kerr

Atomist

@jessitron
Friday, 9:10 AM

The Origins of Opera and the Future of Programming

Abstract There’s a story to tell, about musicians, artists, philosophers, scientists, and then programmers.

There’s a truth inside it that leads to a new view of work, that sees beauty in the painful complexity that is software development.

Starting from The Journal of the History of Ideas, Jessica traces the concept of an "invisible college" through music and art and science to programming. She finds the dark truth behind the 10x developer, a real definition of "Senior Developer," and a new word for our work and our teams: symmathesy, a learning system of learning parts, that is both us and our code. We become great together, by learning from each other.

Bio Jessica is a developer of developer automation at Atomist. She’s obsessed with systems, especially the ones with software in them. For output, she tweets at @jessitron, podcasts on >Code, and flies around the world to speak at conferences.

Lee Kussmann

WillowTree, Inc.

@leekussmann
Friday, 11:30 AM

Mindfulness at Work: A Way to Overcome the Overwhelm

Abstract Working adults can rarely be as productive or effective if they feel overwhelmed and overstressed at their jobs. In the technology world, this stress is all too common in our "go go go" society. Overwhelm at work also often spills into personal lives, with feelings of anxiety continuing after work hours. It’s crucial to be mindful of these feelings and to have the right tools to overcome the overwhelm. In this talk, I will discuss tips and tricks for dealing with overwhelm, anxiety, and stress at work. I will also talk about how mindfulness impacts personal wellbeing as well as productivity and team dynamics. You will come away from this talk with a set of tools that you can apply today to deal with overwhelm.

Bio As a Software Test Engineer at WillowTree, Inc., a digital products agency based in Charlottesville, VA, Lee Kussmann knows about the pressures of a fast-paced work environment. She is passionate about understanding the mind-body connection and how state of mind can affect physical, emotional, and mental wellbeing both at work and at home. Her hobbies include spending time outside, horseback riding, doing yoga, and being involved in her local community. She is always willing to talk about mindfulness, health and fitness, and the latest Game of Thrones fan theories.

Laura MD Maguire

Ohio State University

@LauraMDMaguire
Friday, 3:40 PM

Operating at the Edge of the Envelope

Abstract Resilience is everywhere in continuous deployment environments. Without it, the scale and scope of digital operations would not be possible. From Site Reliability Engineers who detect, diagnose and resolve outages, to Architects who build adaptive capacity into their systems, to managers who create conditions for radical collaboration during high-stress outages, resilient performance is a uniquely human capability. But can we actually ‘engineer’ resilience into distributed, at-scale systems?

This talk will explore how, despite 30+ years of complex systems research, the question of how to engineer resilient systems defies an easy answer.

Bio Laura Maguire is a graduate researcher with the Cognitive Systems Engineering Laboratory at the Ohio State University. She spent 15 years working in safety systems design and management in high risk/high consequence industries before returning to school to complete her PhD in Resilience Engineering. An avid alpine climber and backcountry skier, Laura brings both her professional and personal experiences to her research in understanding how expert practitioners use anticipation, adaptation and coordination to manage disruptive events.

Avery Regier

Deere & Company

@averyregier
Thursday, 2:00 PM

Recognizing Zombies, Black Holes & Tribbles Before You Get Eaten: Techniques to Avoid Cascading Failure

Abstract Complex mixes of monoliths, micro-services, databases, data centers, networking, and cloud providers provide a dizzying array of opportunities for your services to fail. No one has perfect failover, so you have to be prepared to play defense.

We will look at three categories of failures, ways to recognize them coming, and ways to avoid spreading the carnage they cause to other services you provide.

Zombies: Long running but abandoned requests that eat up memory and crash the system long after the user who conjured them gave up.

Black Holes: Dependent services that take connections but never give them up, or perform so poorly for a time that all your attention eventually gets focused on that one thing.

Tribbles: Similar requests that you normally invite, but they come too many, too fast for your service to handle as they take your attention and eat up all your resources.

You can expect to see:

Novel concurrent data structures for tracking ongoing work used as a basic building block for recognizing Zombies, Black Holes & Tribbles.
Circuit Breaking: Statistics used for recognizing normal and when to stop using a dependency.
Dealing with the behavior differences of highly vs little used and fast vs slow dependencies.
Sci-fi references and Horror Stories
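For readers unfamiliar with circuit breaking, the general pattern can be sketched as a wrapper that stops calling a dependency after repeated failures and tries again after a cool-down. This is a simple counter-based sketch, not the statistics-based approach the talk describes; the class names, the threshold of three failures, and the 30-second cool-down are all assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable for testing
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency is being skipped")
            # Cool-down elapsed: allow one trial call (half-open).
            # A single failure now reopens the circuit immediately.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of tying up connections on a dependency that is behaving like a black hole.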

Bio Avery has been building and supporting large systems for two decades. His recent focus has been on finding themes for why systems fail, and building practical solutions.

Matty Stratton

PagerDuty

@mattstratton
Friday, 10:50 AM

Fight, Flight, or Freeze — Releasing Organizational Trauma

Abstract When humans are faced with a traumatic experience, our brains kick in with survival mechanisms. These mechanisms are the familiar fight or flight response, but can also include the freeze response - which occurs when we are terrified or feel that there is no chance of escape.

In this talk I will explain the background of fight, flight, and freeze, and how it applies to organizations. Based on my own experiences with post-traumatic stress (PTS), I will give examples and suggestions on how to identify your own organizational trauma and how to help heal it.

Sufferers of post-traumatic stress continue to feel these fight, flight, and freeze responses long after the trauma has passed, because our brains are unable to differentiate between the memory of trauma and an actually occurring event. When activated or triggered, the brain reverts to these behaviors, which are then expressed in the person’s body (through posture, disassociation, muscle tension, etc).

The same can happen to organizations: once an organization has experienced a trauma (a large outage, say), the “memory” of that trauma leads to a dysregulated state whenever it is activated (by similar indicators, such as system alerts, customer issues, and more). The organization will insist on revisiting the same fight, flight, or freeze response that the embedded trauma caused, which, as with a triggered post-traumatic stress sufferer, is a false equivalency.

One of the treatments for post-traumatic stress is Eye Movement Desensitization and Reprocessing (EMDR), in which the patient’s difficult memories are offset with a positive association that is reinforced through external stimuli. The same can be done for organizations: by removing the inaccurate traumatic associations of previous outages and organizational pain through game days and other techniques, we can reduce the “scar tissue” of our organization and move forward in a balanced manner.

Bio Matty Stratton is a DevOps Evangelist at PagerDuty, where he helps dev and ops teams advance the practice of their craft and become more operationally mature. He collaborates with PagerDuty customers and industry thought leaders in the broader devops community, and when he still had a car, his license plate actually said “DevOps.”

Matty has over 20 years of experience in IT operations, ranging from large financial institutions such as JPMorganChase to internet firms, including Apartments.com. He has given presentations at ITSM focused events, ChefConf, DevOpsDays, Interop, PINK, and various local groups within the Chicagoland area. He is the founder and co-host of the popular Arrested DevOps podcast.
