Designing for Failure

Building more Resilient Applications

17 February 2017

Adam Hawkins

SRE Team Lead, Saltside

What is Failure?

When a system cannot perform its intended functionality



Build more reliable applications by:

1. Planning for known failure cases
2. Mitigating risks of unknown failure cases

Reliable: more functions keep working over a longer period of time.

Reliability is not binary. Reliable systems keep operating in degraded states.

Degraded Features Beat No Features

Step 1: Adopt this Mental Model

Systems are generally in one of these three states:

1. Happy path -- :)
2. Degraded paths -- :/
3. Unhappy paths -- :(

Step 2: Enumerate Functionalities/Flows


Step 3: Identify Dependency Chains

Ask yourself these questions about everything in step 2:

Step 4: Identify Failure Modes

Unavailable Dependency

Degraded Dependency
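
A minimal sketch of telling the two failure modes apart, assuming a dependency counts as unavailable when a call raises and as degraded when it answers slower than a threshold. The names `Health`, `classify`, and `slow_after` are hypothetical, and the 0.5 second default is an arbitrary illustration.

```python
import time
from enum import Enum


class Health(Enum):
    HAPPY = "happy"
    DEGRADED = "degraded"
    UNAVAILABLE = "unavailable"


def classify(call, slow_after=0.5):
    """Run `call` once and classify the dependency by outcome and latency.

    Assumption: any exception means "unavailable"; a slow success means
    "degraded". Real systems usually look at many calls, not just one.
    """
    started = time.monotonic()
    try:
        call()
    except Exception:
        return Health.UNAVAILABLE
    elapsed = time.monotonic() - started
    return Health.DEGRADED if elapsed > slow_after else Health.HAPPY
```

The three results map directly onto the happy, degraded, and unhappy paths from step 1.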

Step 5: Become Resilient


Resilient systems handle unavailable or degraded dependencies in ways that still provide value to users.
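
One way to picture this is a fallback: when a dependency fails, serve something less good instead of nothing. The sketch below is illustrative only; `with_fallback`, `fetch_recommendations`, and `popular_items` are hypothetical names, and the failing service is simulated.

```python
def with_fallback(primary, fallback):
    """Return (value, degraded). Uses fallback() when primary() raises."""
    try:
        return primary(), False
    except Exception:
        # Degraded path: the user still gets a response.
        return fallback(), True


def fetch_recommendations():
    # Stand-in for a call to a dependency that is currently failing.
    raise ConnectionError("recommendation service is down")


def popular_items():
    # Cheap, static substitute that does not need the failing dependency.
    return ["item-1", "item-2"]


items, degraded = with_fallback(fetch_recommendations, popular_items)
```

Returning the `degraded` marker alongside the value lets callers log or display that the response came from the degraded path.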

How to become resilient?

Resiliency Patterns

Circuit Breaker

Detects failures and encapsulates the logic that prevents a failure from recurring constantly (during maintenance, temporary external system failures, or unexpected system difficulties).

Aids in determining if an action is safe to continue or should wait to retry

Can signal to other parts of the system happy, degraded, or unavailable state

Different technical implementations and many FOSS solutions
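
A minimal circuit breaker sketch, not any particular library's implementation. It assumes a simple policy: open after `max_failures` consecutive failures, then allow one trial call once `reset_after` seconds have passed. `CircuitBreaker`, `CircuitOpenError`, and all parameter names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so the timing can be tested
        self.failures = 0
        self.opened_at = None

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"  # happy: calls go through
        if self.clock() - self.opened_at >= self.reset_after:
            return "half-open"  # a trial call is allowed
        return "open"  # unavailable: fail fast

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("dependency marked unavailable")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Other parts of the system can read `state` to decide whether to take the happy path or a degraded one.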

Feature Flag/Flipper

Control behavior via flags. Flags can be simple booleans or more complex logic, such as who the end user is.

Launch features with config, not with code. Code is ready, but requires someone to flip the switch

Use cases:

Staged Rollouts
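
A staged rollout can be sketched as a percentage flag. This is an illustration under assumptions, not a specific flag library: `FeatureFlags` and `is_enabled` are hypothetical names, and hashing the flag name with the user ID is one common way to give each user a stable bucket.

```python
import hashlib


class FeatureFlags:
    def __init__(self, rollout_percent):
        # Maps flag name -> percent of users (0-100) who get the feature.
        # In practice this would come from config, not from code.
        self.rollout_percent = dict(rollout_percent)

    def is_enabled(self, flag, user_id):
        percent = self.rollout_percent.get(flag, 0)  # unknown flags are off
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100  # stable bucket in 0..99 per user and flag
        return bucket < percent


flags = FeatureFlags({"new-search": 10})
if flags.is_enabled("new-search", user_id="user-123"):
    pass  # serve the new code path
else:
    pass  # serve the existing code path
```

Raising the percentage in config widens the rollout without a deploy, and setting it to 0 switches the feature off again.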


Read this Book

"Release It!" by Michael T. Nygard

Thank you

Adam Hawkins

SRE Team Lead, Saltside