Design and Implementation of an Early Timeout-Detection Mechanism for Systematic Fault-Injection Campaigns

Fault injection is a common approach to systematically assess the resilience of a system and the effectiveness of software-based countermeasures. It mimics either the physical causes of single-event upsets (e.g., by exposing the system to heat or radiation) or their effects (by changing logic signals). For fault injection, we use the simulation-based framework FAIL*, which extracts program traces and simulates the representative faults.

In general, every single injection consists of a run-to-completion execution whose behavior is compared with that of a fault-free execution. One possible unexpected behavior is that the faulty execution does not finish and ends in a timeout. The time spent until the fixed timeout limit is reached is wasted, and it adds up over all possible injection points (every cycle and every single bit).
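The cost argument above can be illustrated with a small, self-contained sketch. It does not use FAIL* or its API. The toy 8-bit countdown loop, the names run and campaign, and the limit TIMEOUT_CYCLES are all assumptions made purely for illustration. The sketch flips one bit of the loop counter at one cycle, runs the program to completion or to the fixed timeout limit, compares the result with the fault-free run, and sums the cycles spent on experiments that end in a timeout.

```python
from collections import Counter

# Hypothetical fixed timeout limit in simulated cycles (an assumption, not a FAIL* setting).
TIMEOUT_CYCLES = 200


def run(start, fault=None, max_cycles=None):
    """Simulate a toy loop that decrements an 8-bit counter by 2 until it is 0.

    fault is an optional (cycle, bit) pair: that bit of the counter is
    flipped at the start of that cycle. Returns (outcome, result, cycles).
    """
    counter, acc, cycle = start, 0, 0
    while True:
        if fault is not None and fault[0] == cycle:
            counter ^= 1 << fault[1]          # inject the single bit flip
        if counter == 0:
            return "done", acc, cycle         # program ran to completion
        if max_cycles is not None and cycle >= max_cycles:
            return "timeout", None, cycle     # fixed timeout limit reached
        counter = (counter - 2) & 0xFF        # 8-bit wrap-around arithmetic
        acc += 1
        cycle += 1


def campaign(start):
    """Inject every bit at every cycle of the fault-free run and classify."""
    golden = run(start)                       # fault-free reference execution
    outcomes = Counter()
    wasted = 0                                # cycles burned waiting for timeouts
    for cycle in range(golden[2]):
        for bit in range(8):
            outcome, result, cycles = run(start, (cycle, bit), TIMEOUT_CYCLES)
            if outcome == "timeout":
                outcomes["timeout"] += 1
                wasted += cycles
            elif result == golden[1]:
                outcomes["benign"] += 1
            else:
                outcomes["deviation"] += 1
    return golden, outcomes, wasted


golden, outcomes, wasted = campaign(8)
print(golden, dict(outcomes), wasted)
```

In this toy model, a flip of the lowest counter bit makes the counter odd, so it can never reach zero and the run only ends at the timeout limit. Each such experiment costs the full limit, even though the fault-free run is far shorter. An early timeout detection would aim to end these runs sooner.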

This thesis builds on the thesis "Early Timeout Detection for FI Campaigns", which was exploratory and used machine learning to provide initial insights into this topic. In this thesis, the machine-learning mechanism will be omitted, and although the halting problem has to be addressed, the approach should be as deterministic as possible. Nevertheless, the knowledge gained in the preliminary work should be used to design a concrete mechanism, to implement this mechanism in FAIL*, and to evaluate it on different application types, the resulting behaviors, and possibly their origins.

Further Reading