Design for Safety

"Unfortunately, everyone had forgotten why the branch came off the top of the main and nobody realized that this was important."
    Trevor Kletz, What Went Wrong?

"Before a wise man ventures into a pit, he lowers a ladder so he can climb out."
    Rabbi Samuel Ha-Levi Ben Joseph Ibn Nagrela

Design for Safety
- Software design must enforce the safety constraints.
- It should be possible to trace from requirements to code (and vice versa).
- The design should incorporate basic safety design principles.

Safe Design Precedence
(The original figure annotates this list with arrows labeled "decreasing cost" and "increasing effectiveness" toward the top of the precedence.)

HAZARD ELIMINATION
- Substitution
- Simplification
- Decoupling
- Elimination of human errors
- Reduction of hazardous materials or conditions

HAZARD REDUCTION
- Design for controllability
- Barriers: lockins, lockouts, interlocks
- Failure minimization: safety factors and margins, redundancy

HAZARD CONTROL
- Reducing exposure
- Isolation and containment
- Protection systems and fail-safe design

DAMAGE REDUCTION

Hazard Elimination

SUBSTITUTION
- Use safe or safer materials.
- Simple hardware devices may be safer than using a computer.
- There is no technological imperative that says we MUST use computers to control dangerous devices.
- Introducing new technology introduces unknowns, and even unk-unks (unknown unknowns).

SIMPLIFICATION
Criteria for a simple software design:
1. Testable: the number of states is limited
   - determinism over nondeterminism
   - single tasking over multitasking
   - polling over interrupts
2. Easily understood and readable.
3. Interactions between components are limited and straightforward.
4. The code includes only the minimum features and capability required by the system. It should not contain unnecessary or undocumented features or unused executable code.
5. Worst-case timing is determinable by looking at the code.

SIMPLIFICATION (cont.)
- Reducing and simplifying interfaces will eliminate errors and make designs more testable.
- It is easy to add functions to software, and hard to practice restraint.
- Constructing a simple design requires discipline, creativity, restraint, and time.
- Design so that the structural decomposition matches the functional decomposition.

DECOUPLING
A tightly coupled system is one that is highly interdependent:
- Each part is linked to many other parts, so a failure or unplanned behavior in one can rapidly affect the status of others.
- Processes are time-dependent and cannot wait; there is little slack in the system.
- Sequences are invariant: there is only one way to reach a goal.
System accidents are caused by unplanned interactions. Coupling creates an increased number of interfaces and potential interactions.

DECOUPLING (cont.)
Computers tend to increase system coupling unless we are very careful.
Applying the principles of decoupling to software design (a sketch follows):
- Modularization: how the software is split up is crucial in determining the effects of coupling.
- Firewalls
- Read-only or restricted-write memories
- Eliminate hazardous effects of common hardware failures.
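The sketch below illustrates the restricted-write idea in C, assuming a hypothetical valve-control module: the critical state has no external linkage, anyone may read it, and writes go through a single authority-checked entry point. All names and the key value are illustrative, not from the original slides.

```c
/* valve_ctrl.c -- minimal sketch of decoupling via restricted write access.
 * The critical valve state is private to this module; a fault elsewhere
 * cannot corrupt it directly. */
#include <stdbool.h>

typedef enum { VALVE_CLOSED, VALVE_OPEN } valve_state_t;

static valve_state_t valve_state = VALVE_CLOSED;  /* private: no external linkage */

valve_state_t valve_get_state(void)     /* read-only access for all modules */
{
    return valve_state;
}

bool valve_request_open(unsigned authority_key)  /* the one checked write path */
{
    if (authority_key != 0x5A5Au)       /* illustrative authority limiting */
        return false;
    valve_state = VALVE_OPEN;
    return true;
}
```

Keeping the single write path narrow also makes the module's interactions limited and straightforward, which serves the simplification criteria above.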
ELIMINATION OF HUMAN ERRORS
- Design so there are few opportunities for errors: make errors impossible, or at least possible to detect immediately.
- There are lots of ways to increase the safety of human-machine interaction, e.g., making the status of a component clear and designing the software to be error tolerant (covered separately).
- Programming language design: the language should not only be simple in itself (masterable), but should encourage the production of simple and understandable programs. Some language features have been found to be particularly error-prone.

REDUCTION OF HAZARDOUS MATERIALS OR CONDITIONS
- Software should contain only the code that is absolutely necessary to achieve the required functionality. (This has implications for COTS software.)
- Extra code may lead to hazards and may make software analysis more difficult.
- Memory that is not used should be initialized to a pattern that will revert the system to a safe state, as sketched below.
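A minimal sketch of that last guideline, assuming hypothetical linker-provided symbols for the unused region and a fill value chosen so that a wild jump or bad pointer lands on something that forces a safe state (on a real processor, an opcode that traps or jumps to a safe-state handler):

```c
/* Fill unused memory with a "safe" pattern at startup. The fill value
 * and the region symbols are assumptions for illustration. */
#include <stdint.h>
#include <string.h>

#define SAFE_FILL 0xDEu   /* illustrative: an illegal/trap opcode on the target CPU */

extern uint8_t unused_mem_start[];  /* assumed to come from the linker script */
extern uint8_t unused_mem_end[];

void init_unused_memory(void)
{
    memset(unused_mem_start, SAFE_FILL,
           (size_t)(unused_mem_end - unused_mem_start));
}
```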
Turbine-Generator Example
Safety requirements:
1. Must always be able to close the steam valves within a few hundred milliseconds.
2. Under no circumstances can the steam valves open spuriously, whatever the nature of the internal or external fault.

The design is divided into two parts (decoupled), on separate processors:
1. Non-critical functions, whose loss can neither endanger the turbine nor cause it to shut down: less important governing functions; supervisory, coordination, and management functions.
2. A small number of critical functions.

Turbine-Generator Example (2)
Uses polling: no interrupts except for a fatal store fault (nonmaskable).
- Timing and sequencing are thus defined, and more rigorous and exhaustive testing is possible.
All messages are unidirectional:
- No recovery or contention protocols are required, giving a higher level of predictability.
Self-checks of:
- the sensibility of incoming signals
- whether the processor is functioning correctly
Failure of a self-check leads to reversion to a safe state through fail-safe hardware.
A state table defines:
- the scheduling of tasks
- the self-check criteria appropriate under particular conditions
(A minimal sketch of this structure follows.)
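The sketch below shows the state-table and polling structure just described, with hypothetical task and check names; it is not the actual turbine-governor code, only an illustration of why this shape is analyzable: the schedule is fixed, there are no interrupts, and every slot pairs work with the self-check appropriate to it.

```c
/* Minimal sketch of a state-table-driven polling loop with per-slot
 * self-checks and reversion to a safe state on any failure. */
#include <stdbool.h>

typedef struct {
    void (*task)(void);        /* work to do in this slot */
    bool (*self_check)(void);  /* check appropriate for this condition */
} slot_t;

static void read_sensors(void) { /* poll inputs: no interrupts */ }
static void check_valves(void) { /* critical valve logic */ }
static void govern(void)       { /* non-critical governing */ }
static bool sensors_sane(void) { return true; } /* sensibility of incoming signals */
static bool cpu_sane(void)     { return true; } /* is the processor functioning? */

static const slot_t state_table[] = {  /* fixed, analyzable schedule */
    { read_sensors, sensors_sane },
    { check_valves, cpu_sane     },
    { govern,       sensors_sane },
};

extern void trip_to_safe_state(void);  /* assumed fail-safe hardware path */

void main_loop(void)
{
    unsigned n = sizeof state_table / sizeof state_table[0];
    for (unsigned i = 0; ; i = (i + 1) % n) {
        state_table[i].task();
        if (!state_table[i].self_check())
            trip_to_safe_state();      /* revert to safe state on any failed check */
    }
}
```

Because the loop visits a fixed sequence of slots, worst-case timing can be read off the table, matching the simplification criteria earlier.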
Hazard Reduction
Passive safeguards:
- Maintain safety by their presence; fail into safe states.
Active safeguards:
- Require a hazard or condition to be detected and corrected.
Tradeoffs:
- Passive safeguards rely on physical principles; active ones depend on less reliable detection and recovery mechanisms.
- BUT passive safeguards tend to be more restrictive in terms of design freedom, and are not always feasible to implement.

Design for Controllability
Make the system easier to control, both for humans and for computers:
- Use incremental control: perform critical steps incrementally rather than in one step.
- Provide feedback:
  - to test the validity of the assumptions and models upon which decisions are made
  - to allow taking corrective action before significant damage is done
- Provide various types of fallback or intermediate states.
- Lower time pressures.
- Provide decision aids.
- Use monitoring.

Monitoring
It is difficult to make monitors independent:
- Checks require access to the information being monitored, but that access usually involves the possibility of corrupting the information.
- A monitor depends on assumptions about the structure of the system and about the errors that may or may not occur; these assumptions may be incorrect under certain conditions.
- Common incorrect assumptions may be reflected both in the design of the monitor and in the devices being monitored.

A Hierarchy of Software Checking
[Figure: a hierarchy of checks; an error not detected at one level falls through to the next.]
1. Checks built into the hardware or included in the operating system (e.g., memory-protection violation, divide by zero; checksums). Used to detect hardware failures and individual instruction errors.
2. Self-checks within the code, e.g., range checks, state checks, and reasonableness checks about the expected value of parameters passed to a module. Use assertions: statements (boolean expressions on the system state) about the expected state of a module at different points in its execution. May check the expected timing of modules or processes, the consistency of global data structures, or the data being passed between modules. Can detect coding errors and implementation errors.
3. Independent monitoring by a process separate from the one being checked; such monitors often observe both the controlled system and the controller.
4. External monitors using additional or completely separate hardware, observing the system externally to provide an independent view.

Software Monitoring (Checking)
In general, the farther down the hierarchy a check can be made, the better:
- The error is detected closer to the time it occurred and before the erroneous data are used.
- It is easier to isolate and diagnose the problem.
- It is more likely that the erroneous state can be fixed, rather than having to recover to a safe state.
Writing effective self-checks is very hard, and their number is usually limited by time and memory:
- Limit them to safety-critical states.
- Use hazard analysis to determine the contents and location of the checks.
Added monitoring and checks can cause failures themselves.
(A sketch of such embedded checks follows.)
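A sketch of what level-2 embedded checks might look like in C. The limits, names, and recovery hook are illustrative assumptions, not from the original slides; the point is the placement of range and reasonableness checks on data passed between modules, as close to the source of error as possible.

```c
/* Self-checks on a value passed between modules: a range check, a
 * reasonableness (rate-of-change) check, and an assertion on the
 * caller's obligations. All limits are hypothetical. */
#include <assert.h>

#define MAX_RATE_C_PER_S 5.0   /* assumed physical limit on temperature change */

extern void revert_to_safe_state(void);  /* assumed fail-safe path */

double check_temperature(double t_now, double t_prev, double dt)
{
    /* Range check: the value must be physically possible. */
    if (t_now < -50.0 || t_now > 600.0)
        revert_to_safe_state();

    /* Reasonableness check: the rate of change must be plausible. */
    if (dt > 0.0 && (t_now - t_prev) / dt > MAX_RATE_C_PER_S)
        revert_to_safe_state();

    assert(dt > 0.0);  /* coding-error check: caller violated the interface */
    return t_now;
}
```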
Barriers
LOCKOUTS
Make access to a dangerous state difficult or impossible.
Implications for software:
- Avoiding EMI
- Authority limiting
- Controlling access to, and modification of, critical variables
- Some security techniques can be adapted.

LOCKIN
Make it difficult or impossible to leave a safe state.
- Need to protect the software against environmental conditions, e.g., operator errors, or data arriving in the wrong order or at an unexpected speed.
- Completeness criteria ensure that the specified behavior is robust against mistaken environmental conditions.

INTERLOCK
Used to enforce a sequence of actions or events, e.g., to ensure that:
1. Event A does not occur inadvertently.
2. Event A does not occur while condition C exists.
3. Event A occurs before event D.
Examples: batons, critical sections, synchronization mechanisms. (A sketch follows.)
Remember: the more complex the design, the more likely it is that errors will be introduced by the protection facilities themselves.
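A minimal interlock sketch covering the three conditions above; the events (arm/fire) and the guard condition (maintenance mode) are hypothetical stand-ins.

```c
/* A baton-style interlock: fire() cannot run unless arm() completed
 * first, and neither can proceed while the guard condition holds. */
#include <stdbool.h>

static bool armed = false;            /* event A must precede event D   */
static bool maintenance_mode = false; /* condition C blocks the events  */

bool arm(void)                        /* event A */
{
    if (maintenance_mode) return false;  /* A must not occur while C exists */
    armed = true;
    return true;
}

bool fire(void)                       /* event D */
{
    if (!armed || maintenance_mode) return false;  /* sequence enforced */
    armed = false;  /* one shot: re-arming is required each time */
    /* ... perform the protected action ... */
    return true;
}
```

Note how small this is: the slide's warning applies directly, since every additional interlock is itself code that can fail.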
Example: Nuclear Detonation
Safety depends on the system NOT working.
Three basic techniques (called "positive measures"):
1. Isolation: separate critical elements (barriers).
2. Inoperability: keep the system in an inoperable state, e.g., remove the ignition device or arming pin.
3. Incompatibility: detonation requires that an unambiguous indication of human intent be communicated to the weapon.
Protecting the entire communication system against all credible abnormal environments (including sabotage) is not practical. Instead, use a unique signal of sufficient information complexity that it is unlikely to be generated by an abnormal environment.

Example: Nuclear Detonation (2)
Unique signal discriminators must:
1. Accept the proper unique signal while rejecting spurious inputs.
2. Have rejection logic that is highly immune to abnormal environments.
3. Provide a predictably safe response to abnormal environments.
4. Be analyzable and testable.
Protect the unique signal sources by barriers, with a removable barrier between these sources and the communication channels.

Example: Nuclear Detonation (3)
[Figure, reconstructed from its labels: human intent enters at the unique signal (UQS) source, which is protected by a stored, removable barrier; the signal travels over the communications channel to the exclusion region, which contains the UQS reader (discriminator/driver) and the isolated arming and firing voltages; the isolated element is incompatible with anything but the unique signal and is inoperable in abnormal environments.]

Example: Nuclear Detonation (4)
May require multiple unique signals from different individuals, sent along various communication channels and using different types of signals (energy and information), to ensure proper intent.
[Figure: intended human action at the human-machine interface produces unique signals (e.g., the pattern AABABBB) from a stimuli source, through the communication system, to the arming, fusing, and safing-and-firing systems.]

Failure Minimization
SAFETY FACTORS AND SAFETY MARGINS
Used to cope with uncertainties in engineering:
- Inaccurate calculations or models
- Limitations in knowledge
- Variation in the strength of a specific material due to differences in composition, manufacturing, assembly, handling, environment, or usage
These are ways to minimize the problem, but they cannot eliminate it. Appropriate for continuous and non-action systems.

Safety Margins and Safety Factors
[Figure: three plots of probability of occurrence versus stress, showing expected failure strength and margins:
(a) probability density functions of failure for two parts with the same expected failure strength;
(b) a relatively safe case;
(c) a dangerous overlap between the distributions, even though the safety factor is the same as in (b).]

REDUNDANCY
The goal is to increase reliability and reduce failures. But:
- Common-cause and common-mode failures
- May add so much complexity that it causes failures.
- Redundant systems are more likely to operate spuriously.
- May lead to false confidence (Challenger).
Redundancy is useful in reducing hardware failures. But what about software?
- Design redundancy vs. design diversity
Bottom line: claims that multiple-version software will achieve ultra-high reliability levels are not supported by empirical data or theoretical models.

REDUNDANCY (cont.)
- Standby spares vs. concurrent use of multiple devices (with voting)
- Identical designs or intentionally different ones (diversity)
  - Diversity must be carefully planned to reduce dependencies.
  - Dependencies can also be introduced in maintenance, testing, and repair.
- Redundancy is most effective against random failures, not design errors.

REDUNDANCY (cont.)
Software errors are design errors.
Data redundancy: extra data for detecting errors (see the sketches below), e.g.:
- parity bits and other error-detecting codes
- checksums
- message sequence numbers
- duplicate pointers and other structural information
Algorithmic redundancy:
1. Acceptance tests (hard to write)
2. Multiple versions with voting on the results

Multi- (or N-) Version Programming
Assumptions:
- The probability of correlated failures is very low for independently developed software.
- Software errors occur at random and are unrelated.
Even small probabilities of correlated failures cause a substantial reduction in the expected reliability gains.
Conducted a series of experiments with John Knight:
- Failure independence in N-version programming
- Embedded assertions vs. N-version programming
- Fault tolerance vs. fault elimination
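A sketch of two of the data-redundancy items listed above, a checksum plus a message sequence number, so that corruption or a lost message is detected before the data are used. The message format and the toy additive checksum are assumptions for illustration.

```c
/* Data redundancy on a message: sequence number + checksum. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint16_t seq;        /* detects lost, duplicated, or reordered messages */
    uint16_t payload;
    uint16_t checksum;   /* detects corruption in transit */
} msg_t;

static uint16_t checksum_of(const msg_t *m)
{
    return (uint16_t)(m->seq + m->payload);  /* toy additive checksum */
}

bool accept(const msg_t *m, uint16_t expected_seq)
{
    if (m->checksum != checksum_of(m)) return false;  /* corrupted */
    if (m->seq != expected_seq)        return false;  /* missing or duplicated */
    return true;
}
```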
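And a sketch of the voting half of algorithmic redundancy: a 2-out-of-3 voter over three hypothetical, independently developed versions. Exact comparison is only safe here because the outputs are discrete integers; for finite-precision reals, exact agreement runs into the consistent comparison problem discussed below.

```c
/* 2-out-of-3 voting over independently developed versions. As the
 * experiments that follow show, voting does not guarantee that the
 * versions' failures are independent. */
extern int version_a(int input);   /* hypothetical stand-ins for the */
extern int version_b(int input);   /* three independently developed  */
extern int version_c(int input);   /* versions                       */

/* Returns 0 and stores the result if at least two versions agree;
 * returns -1 (no consensus) so the caller can fail safe. */
int vote(int input, int *out)
{
    int a = version_a(input), b = version_b(input), c = version_c(input);
    if (a == b || a == c) { *out = a; return 0; }
    if (b == c)           { *out = b; return 0; }
    return -1;
}
```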
Failure Independence
Experimental design:
- 27 programs, one requirements specification
- Graduate students and seniors from two universities
- Simulation of a production environment: 1,000,000 input cases
- The individual programs were of high quality.
Results:
- The independence hypothesis was rejected: analysis of reliability gains must include the effect of dependent errors.
- Statistically correlated failures result from the nature of the application and from "hard" cases in the input space.
- Programs with correlated failures were structurally and algorithmically very different.
Conclusion: the correlations are due to the fact that everyone was working on the same problem, not to the tools used, the languages used, or even the algorithms used.

Consistent Comparison Problem
- Arises from the use of finite-precision real numbers (rounding errors).
- Correct versions may arrive at completely different correct outputs, and thus be unable to reach a consensus even when none of the components "fail."
- May cause failures that would not have occurred with single versions.
- There is no general practical solution to the problem.

Self-Checking Software
Experimental design:
- Launch Interceptor Programs (LIP) from the previous study.
- 24 graduate students from UCI and UVA were employed to instrument 8 programs (chosen randomly from the subset of 27 in which we had found errors).
- All were provided with identical training materials.
- Checks were written using the specifications only at first; then participants were given a program to instrument.
- Participants were allowed to make any number or type of check.
- The students treated this as a competition among themselves.

Fault Tolerance vs. Fault Elimination
Techniques compared:
- Run-time assertions (self-checks)
- Multi-version voting
- Functional testing augmented with structural testing
- Code reading by stepwise abstraction
- Static data-flow analysis
Experimental design:
- Combat Simulation Problem (from TRW)
- Programmers were separate from the fault detectors.
- Eight versions produced by two-person teams
- Number of modules ranged from 28 to 75; executable lines of code from 1200 to 2400.
- Attempted to hold resources constant for each technique.

Self-Checking Software (2)
[Table: for each instrumented program version (3a-3c, 6b-6c, 8a-8c, 12a-12c, 14a-14c, 20a-20c, 23a-23c, 25a-25c), the number of already-known errors detected (SP, CD, CR), other errors detected, and errors added by the instrumentation, with totals and percentage of total errors.]

Fault Tolerance vs. Fault Elimination (2)
Results:
- Multi-version programming is not a substitute for testing:
  - It did not tolerate most of the faults detected by the fault-elimination techniques.
  - It was unreliable even in tolerating the faults it was capable of tolerating.
- Testing failed to detect the errors causing coincident failures, casting doubt on the effectiveness of voting as a test oracle.
- Instrumenting the code to examine internal states was much more effective.
- The intersection of the sets of faults found by each method was relatively small.

N-Version Programming (Summary)
This doesn't mean it shouldn't be used, but one should have realistic expectations of the benefits to be gained and the costs involved:
- Costs are very high (more than N times the cost of one version).
- In practice, the designs end up with a lot of similarity (more than in our experiments), due to overspecification and cross-checks. So the safety of the system depends on a quality (diversity) that has been systematically eliminated.
- There is no way to tell how different two software designs are in their failure behavior.
- Requirements flaws are not handled, which is where most safety problems arise anyway.

Recovery
Backward recovery:
- Assumes the error can be detected before it does any damage.
- Assumes the alternative will be more effective.
Forward recovery:
- Robust data structures
- Dynamically altering the flow of control
- Ignoring single-cycle errors
But the real problem is detecting erroneous states.

Hazard Control
LIMITING EXPOSURE
- Start out in a safe state and require a deliberate change to an unsafe state.
- Set critical flags and conditions as close as possible to the code they protect.
- Critical conditions should not be complementary, e.g., the absence of an arm condition should not be used to indicate that the system is unarmed.
ISOLATION AND CONTAINMENT
PROTECTION SYSTEMS AND FAIL-SAFE DESIGN

Protection Systems and Fail-Safe Design
- Depends upon the existence of a safe state and the availability of adequate warning time.
- There may be multiple safe states, depending upon process conditions.
- General rule: hazardous states should be hard to get into, and safe states should be easy to get into.
- Panic button
- Watchdog timer: the software it is protecting should not be responsible for setting it (a sketch follows at the end of this section).
- Sanity checks ("I'm alive" signals)
- A protection system should provide information about its control actions and status to operators or bystanders.
- The easier and faster the return of the system to an operational state, the less likely the protection system is to be purposely bypassed or turned off.

Damage Reduction
- May need to determine a "point of no return," where recovery is no longer possible or likely and the system should just try to minimize damage.
Design Modification and Maintenance
- Need to reanalyze the design after changes.
- Need to record the design rationale.
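As promised above, a sketch of the watchdog-timer guideline: the kick comes from a monitor separate from the software being protected, so a hung control task cannot keep petting the dog, and the hardware timer expires and forces the safe state. The register name and the progress check are hypothetical.

```c
/* Watchdog kicked by an independent health monitor, not by the
 * protected control task itself. */
#include <stdbool.h>
#include <stdint.h>

extern volatile uint32_t WDT_KICK;            /* hypothetical hardware register */
extern bool control_task_made_progress(void); /* e.g., checks a heartbeat counter */

void monitor_task(void)   /* runs separately from the control task */
{
    if (control_task_made_progress())
        WDT_KICK = 0xA5A5A5A5u;  /* kick only when real progress is observed */
    /* otherwise the watchdog expires and hardware forces the safe state */
}
```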