Fault tolerance is the ability of a system, such as an operating system, to respond to hardware or software failures and keep running. It is essential for business continuity and for keeping applications and systems highly available when problems occur.
How to ensure fault tolerance?
For a system to keep operating without interruption, it must not contain any component whose malfunction would bring down the entire system. Two key aspects of a resilient system are load balancing and the elimination of single points of failure.
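As a rough illustration of how load balancing removes a single point of failure, the sketch below rotates requests across several backends and skips any that have been marked as failed. The `LoadBalancer` class and the backend names `app-1`, `app-2`, and `app-3` are hypothetical and chosen only for this example. A real balancer would also need health checks and would itself have to be made redundant.

```python
from itertools import cycle


class LoadBalancer:
    """Round-robin balancer that skips backends marked unhealthy (illustrative only)."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        # Record a failed backend so it no longer receives traffic.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        # Return a repaired backend to the rotation.
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


lb = LoadBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")                    # simulate one failed server
picks = [lb.pick() for _ in range(4)]    # traffic flows only to the survivors
```

With one of the three backends down, requests are still served by the remaining two. The system only stops when every backend has failed.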
Fault tolerance follows two basic models.
- Normal operation: the fault-tolerant system encounters a malfunction but continues to operate normally, with no change in performance metrics such as throughput or response time.
- Graceful degradation: performance slows down smoothly when problems occur. The impact on performance is proportional to the severity of the failure, so a small failure causes a small slowdown rather than a complete outage.
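The difference between the two models can be sketched with a simple capacity calculation. The function `served_load` and all the numbers are assumptions made for illustration: the cluster is taken to consist of identical nodes, and load is assumed to spread evenly over the surviving ones.

```python
def served_load(demand_rps, per_node_rps, nodes_total, nodes_failed):
    """Requests per second the cluster can still serve after some nodes fail.

    Assumes identical nodes and perfectly even load distribution.
    """
    if not 0 <= nodes_failed <= nodes_total:
        raise ValueError("nodes_failed must be between 0 and nodes_total")
    capacity = per_node_rps * (nodes_total - nodes_failed)
    return min(demand_rps, capacity)


# Model 1 (normal operation): four 100 rps nodes serve a 300 rps demand.
# One node is spare capacity, so losing it changes nothing for users.
with_headroom = served_load(300, 100, nodes_total=4, nodes_failed=1)

# Model 2 (graceful degradation): the same cluster at 400 rps demand has no spare.
# Each failed node removes a proportional share of throughput.
one_down = served_load(400, 100, nodes_total=4, nodes_failed=1)
three_down = served_load(400, 100, nodes_total=4, nodes_failed=3)
```

Under these assumptions, the first case still serves the full 300 rps, while the second drops to 300 rps and then 100 rps as more nodes fail, without ever going down completely.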
Main elements of a fault-tolerant system
Fault-tolerant systems use redundant components that automatically replace failed elements to prevent performance loss.
Hardware systems can be backed up by identical or equivalent systems. A typical example is a server made fault tolerant by deploying an identical server that runs in parallel and mirrors all of its operations. Another example is a redundant array of independent disks (RAID), which combines physical disks to achieve redundancy and improve performance.
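The mirroring idea can be sketched in a few lines. The `MirroredVolume` class below is a hypothetical in-memory model in the spirit of RAID 1, not a real storage driver: every write goes to all working disks, and a read is served by the first disk that is still working.

```python
class MirroredVolume:
    """Toy model of disk mirroring: each block is stored on every working disk."""

    def __init__(self, copies=2):
        if copies < 2:
            raise ValueError("mirroring needs at least two disks")
        self.disks = [dict() for _ in range(copies)]
        self.failed = set()

    def write(self, block, data):
        # Duplicate the write on every disk that is still working.
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                disk[block] = data

    def fail_disk(self, index):
        # Simulate a hardware failure: the disk and its contents are lost.
        self.failed.add(index)
        self.disks[index].clear()

    def read(self, block):
        # Serve the read from the first surviving mirror.
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                return disk[block]
        raise RuntimeError("all mirrors have failed")


volume = MirroredVolume(copies=2)
volume.write(0, b"payload")
volume.fail_disk(0)            # lose one physical disk
recovered = volume.read(0)     # data is still available from the mirror
```

The data survives the loss of one disk because an identical copy exists on the other. The sketch leaves out rebuilding a replaced disk, which a real array would also have to handle.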
Software systems can be made fault tolerant by backing them up with other software instances. A common example is a database holding client data that is continuously replicated to a backup database on another machine. If the primary database fails, operations continue because they are automatically redirected to the backup database, which already holds the replicated data.
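A minimal sketch of this replicate-and-redirect behaviour follows. The `Node` and `FailoverClient` classes, the node names, and the keys are all hypothetical, and replication is assumed to be synchronous for simplicity. Production databases usually replicate through their own built-in mechanisms rather than through the client.

```python
class Node:
    """In-memory stand-in for a database server."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def put(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


class FailoverClient:
    """Writes to the active node, replicates to the standby, redirects on failure."""

    def __init__(self, primary, backup):
        self.active, self.standby = primary, backup

    def put(self, key, value):
        try:
            self.active.put(key, value)
        except ConnectionError:
            self._fail_over()
            self.active.put(key, value)
            return
        # Replicate every successful write to the standby (assumed synchronous).
        if self.standby is not None and self.standby.alive:
            self.standby.put(key, value)

    def get(self, key):
        try:
            return self.active.get(key)
        except ConnectionError:
            self._fail_over()
            return self.active.get(key)

    def _fail_over(self):
        # Promote the standby; after this there is no further spare.
        if self.standby is None:
            raise ConnectionError("no standby left to promote")
        self.active, self.standby = self.standby, None


primary, backup = Node("db-primary"), Node("db-backup")
client = FailoverClient(primary, backup)
client.put("client:42", "Alice")
primary.alive = False                 # simulate a primary failure
value = client.get("client:42")       # transparently served by the backup
client.put("client:43", "Bob")        # new writes now land on the backup
```

The caller never addresses a specific server, so the switch to the backup is invisible to it. Once the backup has been promoted, the system is again running without redundancy until a new standby is added.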
Power supplies can also be made fault tolerant: the system is equipped with one or more standby power supplies that do not need to supply power while the primary source is working normally. If the primary power source fails or malfunctions, it can be taken out of service and replaced by a standby one, which takes over its function and keeps the system running.