Hazard Log Information
- System, subsystem, unit
- Description
- Cause(s)
- Possible effects, effect on system
- Category (hazard level: probability and severity)
- Design constraints
- Corrective or preventative measures, possible safeguards, recommended action
- Operational phase when hazardous
- Responsible group or person for ensuring safeguards are provided
- Tests (verification) to be undertaken to demonstrate safety
- Other proposed and necessary actions
- Status of the hazard resolution process

Risk and Hazard Level Measurement
- Risk = f(likelihood, severity)
- It is impossible to measure risk accurately. Instead, use risk assessment.
- The accuracy of such assessments is controversial:
  "To avoid paralysis resulting from waiting for definitive data, we assume we have greater knowledge than scientists actually possess and make decisions based on those assumptions." (William Ruckelshaus)
- The probability of very rare events cannot be evaluated directly, so use models of the interaction of events that can lead to an accident.

Risk Modeling
- In practice, models only include events that can be measured.
- Most causal factors involved in major accidents are unmeasurable, and unmeasurable factors tend to be ignored or forgotten.
- Can we measure software? (What does it mean to measure a design?) Human error?
  "Risk assessment data can be like the captured spy: if you torture it long enough, it will tell you anything you want to know." (William Ruckelshaus, Risk in a Free Society)

Misinterpreting Risk

Risk assessments can easily be misinterpreted:

[Figure: probabilities (10^-4, 10^-3) attached to a system boundary and an extended system boundary; the combination shown is 10^-3 x 10^-3 = 10^-6.]

Example of an Unrealistic Risk Assessment Contributing to an Accident

Design: The system design included a relief valve opened by an operator to protect against overpressurization. A secondary valve was installed as a backup in case the primary valve failed. The operator needed to know whether the first valve had opened so that the second valve could be activated if necessary.

Events: The operator commanded the relief valve to open. The open position indicator light and the open indicator light both illuminated. The operator, thinking the primary relief valve had opened, did not activate the secondary relief valve. However, the primary valve was NOT open, and the system exploded.

Causal Factors: Post-accident examination discovered that the indicator light circuit was wired to indicate the presence of power at the valve, not the valve position. Thus the indicator showed only that the activation button had been pushed, not that the valve had actually opened. An extensive quantitative safety analysis of this design had assumed a low probability of simultaneous failure of the two relief valves, but it ignored the possibility of a design error in the electrical wiring; the probability of design error was not quantifiable. No safety evaluation of the electrical wiring was made; instead, confidence was established on the basis of the low probability of coincident failure of the two relief valves.

The Therac-25 is another example where unrealistic risk assessment contributed to the losses.
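The figure's arithmetic rests on an independence assumption, which is exactly what the relief-valve example violates. A worked version (the 10^-3 values are illustrative, taken from the figure):

$$
P(\text{both valves fail}) = p_1 \, p_2 = 10^{-3} \times 10^{-3} = 10^{-6}
\qquad \text{(valid only if the two failures are independent)}
$$

With a common cause such as the wiring design error, the failures are correlated and the independence calculation is meaningless:

$$
P(\text{both valves fail}) \approx P(\text{common cause}) \gg 10^{-6}
$$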
Classic Hazard Level Matrix

                      SEVERITY
LIKELIHOOD       I Catastrophic  II Critical  III Marginal  IV Negligible
A Frequent       I-A             II-A         III-A         IV-A
B Moderate       I-B             II-B         III-B         IV-B
C Occasional     I-C             II-C         III-C         IV-C
D Remote         I-D             II-D         III-D         IV-D
E Unlikely       I-E             II-E         III-E         IV-E
F Impossible     I-F             II-F         III-F         IV-F

Another Example Hazard Level Matrix

[Figure: a matrix over the same severity classes (I Catastrophic, II Critical, III Marginal, IV Negligible) and likelihoods A Frequent, B Probable, C Occasional, D Remote, E Improbable, F Impossible. Each cell is assigned a numerical hazard level from 1 to 12, with a prescribed resolution action for each level in the original figure.]
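A minimal sketch of how such a matrix can be encoded for automated lookup. The category names come from the classic matrix above; the `unacceptable` policy is an illustrative assumption, not part of the original slides.

```python
# Encode the classic hazard level matrix for lookup.
SEVERITIES = ("I", "II", "III", "IV")          # Catastrophic .. Negligible
LIKELIHOODS = ("A", "B", "C", "D", "E", "F")   # Frequent .. Impossible

def hazard_cell(severity: str, likelihood: str) -> str:
    """Return the matrix cell label, e.g. hazard_cell('I', 'A') -> 'I-A'."""
    if severity not in SEVERITIES or likelihood not in LIKELIHOODS:
        raise ValueError("unknown severity or likelihood category")
    return f"{severity}-{likelihood}"

# Illustrative acceptance policy (an assumption): treat catastrophic or
# critical hazards that are at least 'occasional' as unacceptable.
def unacceptable(severity: str, likelihood: str) -> bool:
    return severity in ("I", "II") and likelihood in ("A", "B", "C")
```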
Hazard Level Assessment
- Not feasible for complex, human/computer-controlled systems:
  - There is no way to determine likelihood.
  - Such systems almost always involve new designs and new technology.
- Severity is often adequate (and can be determined) for planning the effort to spend on eliminating or mitigating a hazard.
- It may be possible to establish qualitative criteria to evaluate the potential hazard level and make deployment or technology decisions, but these will depend on the system.

Example of Qualitative Criteria

AATT Safety Criterion: The introduction of AATT tools will not degrade safety from the current level.

Hazard level assessment is based on:
- Severity of the worst possible loss associated with the tool
- Likelihood that introduction of the tool will reduce the current safety level of the ATC system

Example Severity Levels (from a proposed JAA standard)
- Class I: Catastrophic. Unsurvivable accident with hull loss.
- Class II: Critical. Survivable accident with less than full hull loss; fatalities possible.
- Class III: Marginal. Equipment loss with possible injuries and no fatalities.
- Class IV: Negligible. Some loss of efficiency. Procedures are able to compensate, but controller workload is likely to be high until overall system demand is reduced. Reportable incident events such as operational errors, pilot deviations, and surface vehicle deviations.
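The severity classes translate directly into an enumeration. A minimal sketch; the class names and comments follow the proposed JAA standard above, while the `worst` helper is an added convenience:

```python
from enum import IntEnum

class Severity(IntEnum):
    """Severity classes from the proposed JAA standard (lower = worse)."""
    CATASTROPHIC = 1   # Class I: unsurvivable accident with hull loss
    CRITICAL = 2       # Class II: survivable accident, less than full hull loss
    MARGINAL = 3       # Class III: equipment loss, injuries possible, no fatalities
    NEGLIGIBLE = 4     # Class IV: some loss of efficiency

def worst(*severities: Severity) -> Severity:
    """Worst-case severity across several hazards (smallest class number)."""
    return min(severities)
```

For example, `worst(Severity.MARGINAL, Severity.CRITICAL)` is `Severity.CRITICAL`, matching the practice of planning mitigation effort around the worst credible loss.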
Example Likelihood Level

Each factor is rated Low (insignificant or no change), Medium (minor change), or High (significant change):
- User tasks and responsibilities
- Potential for inappropriate human decision making
- Potential for user distraction or disengagement from the primary task

Example Likelihood Level (2)

Rated on the same Low/Medium/High scale:
- Safety margins
- Potential for reducing situation awareness
- Skills currently used and those necessary to back up and monitor the new decision support tools

Factors with their own scales:
- Introduction of new failure modes and hazard causes
  - Low: the new tools have the same functions and failure modes as the system components they are replacing
  - Medium: new failure modes are introduced, but they are well understood and effective mitigation measures can be designed
  - High: new failure modes are introduced and cannot be classified under Medium
- Effect of software on current system hazard mitigation measures
  - Low: cannot render them ineffective
  - High: can render them ineffective
- Need for new system hazard mitigation measures
  - Low: potential software errors will not require new mitigation measures
  - High: potential software errors could require new mitigation measures

Causality

Accident causes are often oversimplified:

"The vessel Baltic Star, registered in Panama, ran aground at full speed on the shore of an island in the Stockholm waters on account of thick fog. One of the boilers had broken down, the steering system reacted only slowly, the compass was maladjusted, the captain had gone down into the ship to telephone, the lookout man on the prow took a coffee break, and the pilot had given an erroneous order in English to the sailor who was tending the rudder. The latter was hard of hearing and understood only Greek." (Le Monde)

And what about the larger organizational and economic factors?

Issues in Causality
- Filtering and subjectivity in accident reports
- Root cause seduction
  - The idea of a singular cause is satisfying to our desire for certainty and control.
  - It leads to fixing symptoms.
- The "fixing" orientation
  - Well-understood causes are given more attention: component failure, operator error.
  - We tend to look for linear cause-effect relationships, which makes it easier to select corrective actions (a "fix").

NASA Procedures and Guidelines: NPG 8621 Draft 1

Root Cause: "Along a chain of events leading to a mishap, the first causal action or failure to act that could have been controlled systematically either by policy/practice/procedure or individual adherence to policy/practice/procedure."

Contributing Cause: "A factor, event, or circumstance that led directly or indirectly to the dominant root cause, or which contributed to the severity of the mishap."
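A small data-structure sketch makes the NPG 8621 definitions concrete; the field and function names are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Cause:
    description: str
    controllable: bool  # could policy/practice/procedure have controlled it?

@dataclass
class MishapReport:
    """Illustrative record of a causal chain, NPG 8621 style."""
    event_chain: list[Cause] = field(default_factory=list)

    def root_cause(self) -> Cause | None:
        # NPG 8621: the FIRST controllable causal action along the chain.
        for cause in self.event_chain:
            if cause.controllable:
                return cause
        return None

    def contributing_causes(self) -> list[Cause]:
        root = self.root_cause()
        return [c for c in self.event_chain if c is not root]
```

Note how the definition privileges a single "first" cause even when the chain (as in the Baltic Star report) contains many: this is the root cause seduction described above.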
Hierarchical Models
- Level 1: Events (the accident mechanism)
- Level 2: Conditions
- Level 3: Systemic factors

Hierarchical Analysis Example

[Figure: hierarchical analysis of the Titan IV/Centaur loss.
Events: the software load tape contains an incorrect filter constant; the IMS sends a zero roll rate to the FC software; the Centaur separates from the Titan IV and becomes unstable; fuel sloshing; low acceleration leads to the wrong time for engine shutdown.
Conditions: inadequate review process; everyone assumes someone else tested using the load tape; QA did not understand the process.
Systemic factors: organizational and communication problems; diffused responsibility and authority.]

Systemic Factors in (Software-Related) Accidents

1. Flaws in the Safety Culture

Safety culture: the general attitude and approach to safety reflected by those who participate in an industry or organization, including management, workers, and government regulators.

- Underestimating or not understanding software risks
- Overconfidence and complacency
- Assuming risk decreases over time
- Ignoring warning signs
- Inadequate emphasis on risk management
- Incorrect prioritization of changes to automation
- Slow understanding of problems in human-automation mismatch
- Overrelying on redundancy and protection systems
- Unrealistic risk assessment

Systemic Factors (cont.)

2. Organizational Structure and Communication
- Diffusion of responsibility and authority
- Limited communication channels and poor information flow

3. Technical Activities
- Flawed review process
- Inadequate specifications and requirements validation
- Flawed or inadequate analysis of software functions
- Violation of basic safety engineering practices in digital components
- Inadequate system engineering
- Lack of defensive programming (see the sketch after the next slide)
- Software reuse without appropriate safety analysis

Systemic Factors (cont.)
- Inadequate system safety engineering
- Unnecessary complexity and software functions
- Test and simulation environment does not match operations
- Deficiencies in safety-related information collection and use
- Operational personnel not understanding the automation
- Inadequate design of feedback to operators
- Inadequate cognitive engineering

Do Operators Cause Most Accidents?
- The data may be biased and incomplete.
- Positive actions are usually not recorded.
- Blame may be based on the premise that operators can overcome every emergency.
- Operators often have to intervene at the limits.
- Hindsight is always 20/20.
- Separating operator error from design error is difficult and perhaps impossible.
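As a concrete illustration of the "lack of defensive programming" factor listed above, here is a minimal, hypothetical sketch (the signal, limits, and function are invented for illustration): safety-related inputs are checked against physical expectations instead of being trusted blindly.

```python
# Hypothetical defensive handling of a safety-related sensor input.
ROLL_RATE_LIMIT = 50.0   # deg/s: assumed physical bound for a plausible reading
MAX_AGE = 0.2            # s: assumed maximum acceptable staleness

def validated_roll_rate(reading: float, age_s: float) -> float:
    """Return the reading only if it is fresh and physically plausible;
    otherwise raise so the caller falls back to a safe state instead of
    silently using a bad value."""
    if age_s > MAX_AGE:
        raise ValueError("stale roll-rate reading")
    if not -ROLL_RATE_LIMIT <= reading <= ROLL_RATE_LIMIT:
        raise ValueError("roll rate outside physical bounds")
    return reading
```

A check like this would not by itself have caught the Centaur's incorrect filter constant (a zero roll rate is within bounds), but the discipline it illustrates, sanity-checking values against independent expectations, is what defensive programming refers to.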
Example accidents from chemical plants:

[Figure: confusing control-panel and plant-labeling examples.
a. Note the reversal of trip/reset positions between two MFPT panels.
b. Another inconsistency: open/close directions reversed between adjacent controls.
c. Heater pressure gauges: the No. 1 feedwater heater supply and outlet header gauges use different scales (600/1000/1400 vs. 300/600/900/1200).
d. A strange way to count: turbine auxiliary feedwater pump level controls numbered 3, 4, 1, 2.
Also: an operator told to fix pump 7 on a row of pumps numbered out of sequence, and an operator told to replace crystallizer A, where NEW/OLD labels conflict with the letter labels.]

For the pressure gauges in (c), a hurried operator under stress might believe the outlet pressure is higher than the supply pressure, even though it is lower.

A-320 Accident While Landing at Warsaw

Blamed on the pilots for landing too fast. Was it that simple?
- The pilots were told to expect windshear. In response, they landed faster than normal to give the aircraft extra stability and lift.
- The meteorological information was out of date; there was no windshear by the time the pilots landed. (The Polish government's meteorologist was supposedly in the toilet at the time of the landing.)
- A thin film of water on the runway had not been cleared. The wheels aquaplaned, skimming the surface without gaining enough rotary speed to tell the computer braking systems that the aircraft was landing. The computers refused to allow the pilots to use the aircraft's braking systems, which therefore did not work until too late.
- Even then, the accident would not have been catastrophic if a high bank had not been built at the end of the runway. The aircraft crashed into the bank and broke up.

Blaming the pilots turns attention away from:
- Why the pilots were given out-of-date weather information
- The design of the computer-based braking system, which ignored the pilots' commands and could not be applied manually; who has final authority? (see the interlock sketch below)
- Why the aircraft was allowed to land with water on the runway
- Why the decision was made to build a bank at the end of the runway

Human Error vs. Computer Error

Automation does not eliminate human error or remove humans from systems. It simply moves them to other functions:
- Design and programming
- High-level supervisory control and decision making
- Maintenance
where increased system complexity and reliance on indirect information make the decision-making process more difficult.

Mixing Humans and Computers
- Automated systems on aircraft have eliminated some types of human error and created some new ones.
- Human skill levels and required knowledge may go up.
- The correct partnership and allocation of tasks is difficult.
- Who has the final authority?
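To make the authority question concrete, here is a minimal, hypothetical interlock sketch (not the actual A-320 logic; all names and thresholds are invented): the automation gates braking on evidence of landing, but an explicit manual override preserves the pilot's final authority.

```python
# Hypothetical braking-authority interlock (illustrative only).
WHEEL_SPINUP_RPM = 800.0  # assumed threshold indicating the wheels are rolling

def braking_allowed(wheel_rpm: float, weight_on_wheels: bool,
                    manual_override: bool) -> bool:
    """Permit braking when the automation believes the aircraft has landed,
    or when the pilot explicitly asserts final authority."""
    automation_believes_landed = weight_on_wheels and wheel_rpm >= WHEEL_SPINUP_RPM
    return automation_believes_landed or manual_override
```

The design question the slides raise is whether such an override should exist at all: it restores human authority in cases like Warsaw, but it also reintroduces the hazard the interlock was meant to prevent (e.g., braking while still airborne).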
Why Not Simply Replace Humans with Computers?
- Not all conditions (or the correct way to deal with them) are foreseeable.
- Even those that can be predicted are programmed by error-prone human beings.

Designers Make Mistakes Too

Many of the same limitations of human operators are characteristic of designers:
- Difficulty in assessing the probabilities of rare events
- Bias against considering side effects
- Tendency to overlook contingencies
- Limited capacity to comprehend complex relationships
- Propensity to control complexity by concentrating on only a few aspects of the system

Advantages of Humans
- Human operators are adaptable and flexible:
  - Able to adapt both goals and the means to achieve them
  - Able to use problem solving and creativity to cope with unusual and unforeseen situations
  - Can exercise judgement
- Humans are unsurpassed at:
  - Recognizing patterns
  - Making associative leaps
  - Operating in ill-structured, ambiguous situations
- Human error is the inevitable side effect of this flexibility and adaptability.

Mental Models

[Figure: relationships among the designer's model, the operator's model, and the actual system. The designer's model comes from the original design spec and deals with ideals or averages, not the constructed system; manufacturing and construction variances separate it from the actual system, which also changes over time through evolution. The operator's model is shaped by training procedures, the operational spec, and operational experience; operators continually test their models against reality.]

The system changes, and so must the operator's model.