Designing for Failure
Building more Resilient Applications
17 February 2017
SRE Team Lead, Saltside
What is Failure?
When a system cannot perform its intended functionality
- User not able to login
- User does not receive a notification email
- Website throws a 5xx response from a server side error
- Android application crashes due to an incompatible Android version
- 3rd party provider is unavailable
Build more reliable applications by:
1. Planning for known failure cases
2. Mitigating risks of unknown failure cases
Reliable: more functions functional over a longer period of time.
Reliability is not binary. Reliable systems operate in degraded states.
Degraded Features Beat No Features
Step 1: Adopt this Mental Model
Systems generally fit these three states:
1. Happy path --
2. Degraded paths --
3. Un-happy paths --
Step 2: Enumerate Functionalities/Flows
- Post a feed update
- Contact another user
- Purchase extra services
- Contact support
Step 3: Identify Dependency Chains
Ask yourself these questions about everything in step 2:
- What are the internal dependencies?
- What are the external services?
- Is reduced functionality possible for this flow?
- Do my dependencies match or exceed my SLO (Service Level Objective)?
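The SLO question can be checked with simple arithmetic. A minimal sketch, assuming the flow needs every dependency to succeed and that failures are independent (real systems only approximate this). The dependency names, availability figures, and target are hypothetical, not measurements.

```python
from math import prod


def serial_availability(availabilities):
    """Best-case availability of a flow that needs every dependency to work.

    Assumes independent failures, so the result is the product of the parts.
    """
    return prod(availabilities)


# Hypothetical availabilities for the dependencies of one flow.
dependencies = {
    "database": 0.9995,
    "sms_provider": 0.995,
    "payment_gateway": 0.999,
}

flow_slo = 0.999  # hypothetical target for this flow

achievable = serial_availability(dependencies.values())
meets_slo = achievable >= flow_slo

print(f"achievable={achievable:.4f} target={flow_slo} meets_slo={meets_slo}")
```

If the product falls below the target, the flow cannot meet its SLO on the happy path alone. It needs a degraded path for the weakest dependency, or a more modest target.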
Step 4: Identify Failure Modes
- DNS lookup fails
- Web server throws 5xx
- RPC throws an exception
- Non-technical reasons (e.g. account locked because of lack of payment, legal reasons)
- Upstream errors (e.g. SMS provider not functioning with local provider)
- Intermittent failures
- Timeouts (commonly overlooked!)
- Over Capacity
Step 5: Become Resilient
Resilient systems handle unavailable or degraded dependencies in ways that still provide value to users.
How to become resilient?
- Commit to implementing degraded functionalities
- Enforce timeouts on the caller side. Don't let external dependencies dictate how long your application waits
- Tell users whether to retry an action or come back later
- Communicate via APIs if systems are degraded or unavailable
- Mitigate fires by quarantining subsystems (e.g. feature flippers)
- Test systems under degraded conditions (e.g. increased load or network latency)
- Add health checks for running processes (e.g. restart the process, or remove it from the load balancer)
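Two items from the list above, caller-side timeouts and degraded functionality, can be sketched together. This is one possible approach using only the Python standard library. The helper `call_with_timeout` and the recommendation functions are hypothetical names made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool, so giving up on a slow call never waits for pool shutdown.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s, fallback):
    """Run func, but stop waiting after timeout_s seconds.

    Returns (value, degraded). On timeout or error the fallback is returned
    and degraded is True, so the caller can tell the user what happened.
    """
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s), False
    except FutureTimeout:
        future.cancel()  # best effort: an already running thread is not interrupted
        return fallback, True
    except Exception:
        return fallback, True


def slow_recommendations():
    """Stand-in for an external dependency that responds too slowly."""
    time.sleep(0.3)
    return ["personalised-1", "personalised-2"]


def fast_recommendations():
    """Stand-in for the same dependency on a good day."""
    return ["personalised-1", "personalised-2"]


# The caller decides how long to wait, not the dependency.
items, degraded = call_with_timeout(
    slow_recommendations, timeout_s=0.05, fallback=["popular-1"]
)
print(items, "degraded" if degraded else "happy")
```

The fallback is the degraded path: the user still sees something useful, and the `degraded` flag lets the application say so.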
Detects failures and encapsulates the logic that prevents a failure from constantly recurring (during maintenance, temporary external system failure, or unexpected system difficulties).
Aids in determining if an action is safe to continue or should wait to retry
Can signal to other parts of the system happy, degraded, or unavailable state
Different technical implementations and many FOSS solutions
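The behavior described above is commonly called the circuit breaker pattern. The sketch below is a minimal illustration and does not represent any particular FOSS library. The class name, thresholds, and state names are assumptions, and the code is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so tests do not need to sleep
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        """Report happy, degraded, or unavailable to the rest of the system."""
        if self.opened_at is None:
            return "happy" if self.failures == 0 else "degraded"
        if self.clock() - self.opened_at >= self.reset_after_s:
            return "degraded"  # half-open: a trial call is allowed
        return "unavailable"

    def call(self, func, *args, **kwargs):
        if self.state == "unavailable":
            # Fail fast instead of hitting a dependency known to be down.
            raise CircuitOpenError("dependency unavailable; retry later")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each dependency call in `breaker.call(...)` and treat `CircuitOpenError` as a signal to serve the degraded path or ask the user to retry later.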
Control behavior via flags. Flags can be simple boolean or more complex logic like who the end user is.
Launch features with config, not with code. Code is ready, but requires someone to flip the switch
- Testing functionality that can only be tested in a production environment (e.g. a payment integration)
- Disabling broken functionality for any reason
- Roll out features to a percentage of users at a time
- Requires something like a feature flipper
- Users: Facebook, GitHub, Amazon
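A feature flipper covering the cases above (a simple on/off switch, per-user targeting, and percentage rollout) could look roughly like this. The config shape, flag names, and `FeatureFlags` class are hypothetical. A real deployment would load the config from somewhere it can be changed without a code release.

```python
import hashlib


class FeatureFlags:
    def __init__(self, config):
        # config maps flag name -> {"enabled": bool, "percentage": 0..100,
        #                           "allow_users": [user ids]}
        self.config = config

    def is_enabled(self, flag, user_id=None):
        rule = self.config.get(flag)
        if rule is None or not rule.get("enabled", False):
            return False  # unknown or switched-off flags stay off
        if user_id is not None and user_id in rule.get("allow_users", ()):
            return True  # explicit targeting, e.g. staff testing in production
        percentage = rule.get("percentage", 100)
        if percentage >= 100:
            return True
        if user_id is None:
            return False  # partial rollouts need a user to bucket
        # Stable bucket in [0, 100): the same user gets the same answer every time.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < percentage


flags = FeatureFlags({
    "new_checkout": {"enabled": True, "percentage": 0, "allow_users": ["staff-1"]},
    "sms_login": {"enabled": False},  # broken feature, switched off
    "feed_redesign": {"enabled": True, "percentage": 50},
})

if flags.is_enabled("new_checkout", user_id="staff-1"):
    print("staff-1 sees the new checkout")
```

Hashing the flag name together with the user id keeps rollouts independent: being in the first 50% for one flag says nothing about another flag.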
- Plan for the degraded state together with the product owner
- Prioritize degraded behavior for critical business functions
- Communicate to the user what state the application is in
- Add telemetry for insight into current system state
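One way to communicate state to users and to other systems is a small status payload built from per-feature states. This is a sketch only. The feature names, messages, and payload fields are assumptions, and the state names follow the happy/degraded/unavailable model from step 1.

```python
import json

# Hypothetical user-facing messages for each overall state.
MESSAGES = {
    "happy": "All systems operational.",
    "degraded": "Some features are limited right now. Please retry later.",
    "unavailable": "The service is unavailable. Please come back later.",
}


def overall_state(feature_states):
    """Summarise per-feature states; reliability is not binary."""
    states = set(feature_states.values())
    if states <= {"happy"}:
        return "happy"  # also covers the case of no features reported
    if states == {"unavailable"}:
        return "unavailable"
    return "degraded"


def status_payload(feature_states):
    """Build a payload suitable for a status endpoint or a telemetry event."""
    state = overall_state(feature_states)
    return {
        "state": state,
        "message": MESSAGES[state],
        "features": dict(feature_states),
    }


payload = status_payload({
    "login": "happy",
    "notifications": "unavailable",  # e.g. the SMS provider is down
    "payments": "degraded",
})
print(json.dumps(payload, indent=2))
```

The same payload can back an API response for client applications and a telemetry event for dashboards, so users and operators see a consistent picture.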
Read this Book
"Release It!" by Michael T. Nygard