Reliability
For software, typical expectations for reliability include:
- The application performs the function that the user expects.
- It can tolerate the user making mistakes or using the software in unexpected ways.
- Its performance is good enough for the required use case, under the expected load and data volume.
- The system prevents any unauthorized access and abuse.
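Reliability means continuing to work correctly, even when things go wrong. The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. A fault is not the same as a failure: a fault is one component deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. The goal is to design fault-tolerance mechanisms that prevent faults from causing failures.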
Hardware Faults
Typical examples: hard disks crash, RAM becomes faulty, the power grid has a blackout, or someone unplugs the wrong network cable. Two common ways to cope:
- Redundancy: Disks in RAID, dual power supplies, hot-swappable CPUs.
- Software Tolerance: There is a move toward systems that can tolerate the loss of entire machines, using software fault-tolerance techniques in preference to, or in addition to, hardware redundancy.
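As a rough illustration of software fault tolerance, the sketch below reads a value from several replicas and only gives up when every one of them is down. The `Node`, `NodeDown`, and `read_with_failover` names are hypothetical and invented for this example. Real systems also have to deal with replication lag, timeouts, and partial failures, which are ignored here.

```python
class NodeDown(Exception):
    """Raised by a hypothetical node that has crashed or lost power."""


class Node:
    """A toy in-memory replica; 'alive' simulates whether the machine is up."""

    def __init__(self, name, data, alive=True):
        self.name = name
        self.data = dict(data)
        self.alive = alive

    def get(self, key):
        if not self.alive:
            raise NodeDown(self.name)
        return self.data[key]


def read_with_failover(nodes, key):
    """Try each replica in turn; raise only if every replica is down."""
    down = []
    for node in nodes:
        try:
            return node.get(key)
        except NodeDown as exc:
            down.append(str(exc))  # a fault in one node, not yet a failure
    raise RuntimeError(f"all replicas down: {down}")


# Replica "a" is lost, but the read still succeeds via replica "b".
replicas = [
    Node("a", {"k": 1}, alive=False),
    Node("b", {"k": 1}),
    Node("c", {"k": 1}),
]
result = read_with_failover(replicas, "k")
```

The loss of one machine is a fault that the caller never sees. Only the loss of all replicas turns into a failure of the read.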
Software Errors
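Systematic errors within the software are harder to anticipate than hardware faults. Because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults. Examples: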
- A software bug that causes every instance of an application server to crash.
- A runaway process that uses up shared resources.
- A service that the system depends on slows down or returns corrupted responses.
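- Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults.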
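One common way to keep a misbehaving dependency from dragging down its callers is a circuit breaker. These notes do not name that pattern, so treat the following as an illustrative assumption rather than a prescribed remedy. The `CircuitBreaker`, `CircuitOpen`, and `broken_dependency` names are hypothetical, and the sketch omits the usual "half-open" recovery step for brevity.

```python
class CircuitOpen(Exception):
    """Raised when the breaker refuses to call the dependency at all."""


class CircuitBreaker:
    """Stops calling a dependency after too many consecutive faults."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn, *args):
        if self.failures >= self.max_failures:
            # Fail fast instead of piling more load onto a sick dependency.
            raise CircuitOpen("dependency disabled after repeated faults")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success resets the consecutive-fault count
        return result


calls = []  # records every time the dependency is actually invoked


def broken_dependency():
    calls.append(1)
    raise TimeoutError("dependency too slow")


breaker = CircuitBreaker(max_failures=3)
outcomes = []
for _ in range(5):
    try:
        breaker.call(broken_dependency)
    except CircuitOpen:
        outcomes.append("fast-fail")
    except TimeoutError:
        outcomes.append("fault")
```

After three consecutive faults the breaker stops invoking the dependency, so the remaining two attempts fail immediately without adding load.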
Human Errors
Humans are known to be unreliable. How do we make our systems reliable in spite of unreliable humans?
- Design systems in a way that minimizes opportunities for error (well-designed abstractions, APIs).
- Decouple the places where people make the most mistakes from the places where they can cause failures (sandboxes).
- Test thoroughly at all levels (unit, integration, manual).
- Allow quick and easy recovery from human errors (rollback, recompute).
- Set up detailed and clear monitoring (telemetry).
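The "quick and easy recovery" point can be sketched as a configuration store that keeps every previous version, so a bad change is undone in one step. `VersionedConfig` and the `max_connections` setting are hypothetical names chosen for this example, not part of any particular tool.

```python
class VersionedConfig:
    """Keeps a history of configurations so a bad change can be undone."""

    def __init__(self, initial):
        self._history = [dict(initial)]

    @property
    def current(self):
        # Return a copy so callers cannot edit history by accident.
        return dict(self._history[-1])

    def apply(self, changes):
        """Record a new version with the given changes merged in."""
        self._history.append({**self._history[-1], **changes})
        return self.current

    def rollback(self):
        """Discard the latest version; the initial version is never removed."""
        if len(self._history) > 1:
            self._history.pop()
        return self.current


config = VersionedConfig({"max_connections": 100})
config.apply({"max_connections": 0})  # operator mistake
restored = config.rollback()          # one step back to the last good version
```

Because old versions are never overwritten, recovering from the mistake does not require anyone to remember what the previous value was.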
Knowledge Check
Which of the following best describes the difference between a fault and a failure?
- They are synonyms and can be used interchangeably.
- A fault is a user error, while a failure is a system crash.
- A fault is when one component deviates from its spec; a failure is when the system as a whole stops providing service.
- Failures are caused by hardware, while faults are caused by software.