Software System Safety Handbook
December 1999
Under the direction and guidance of the
Joint Services Computer Resources Management Group,
U.S. Navy, U.S. Army, and the U.S. Air Force
AUTHORS
David Alberico, Contributing (Former Chairman)
John Bozarth, Contributing
Michael Brown, Contributing (Current Chairman)
Janet Gill, Contributing
Steven Mattern, Contributing and Integrating
Arch McKinlay VI, Contributing
This Handbook represents the cumulative effort of many people. It underwent several reviews by the technical community that resulted in numerous changes to the original draft; the contributors are therefore too numerous to list. However, the Joint Services Software System Safety Committee wishes to acknowledge the contributions of the contributing authors to the Handbook.

Special thanks to Lt. Col. David Alberico, USAF (RET), Air Force Safety Center, Chairperson of the JSSSSC from 1995 to 1998, for his initial guidance and contributions in the development of the Handbook.

The following authors wrote significant portions of the current Handbook:

John Bozarth, CSP, EG&G Technical Services, Dahlgren, VA
Michael Brown, Naval Surface Warfare Center, Dahlgren Division (Chairperson, JSSSSC, 1998 to Present)
Janet Gill, Naval Air Warfare Center, Aircraft Division, Patuxent River, MD
Steven Mattern, Science and Engineering Associates, Albuquerque, NM
Archibald McKinlay, Booz-Allen and Hamilton, St. Louis, MO

Other contributing authors:

Brenda Hyland, Naval Air Warfare Center, Aircraft Division, Patuxent River, MD
Lenny Russo, U.S. Army Communication & Engineering Command, Ft. Monmouth, NJ

The committee would also like to thank the following individuals for their specific contributions:

Edward Kratovil, Naval Ordnance Safety and Security Activity, Indian Head, MD
Craig Schilders, Naval Facilities Command, Washington, DC
Benny Smith, U.S. Coast Guard, Washington, DC
Steve Smith, Federal Aviation Administration, Washington, DC
Lud Sorrentino, Booz-Allen and Hamilton, Dahlgren, VA
Norma Stopyra, Naval Space and Warfare Systems Command, San Diego, CA
Dennis Rilling, Naval Space and Warfare Systems Command, San Diego, CA
Benny White, National Aeronautics and Space Administration, Washington, DC
Martin Sullivan, EG&G Technical Services, Dahlgren, VA

This Handbook is the result of the contributions of the above-mentioned individuals and the extensive review comments from many others. The committee thanks all of the authors and the contributors for their assistance in the development of this Handbook.
TABLE OF CONTENTS
1. EXECUTIVE OVERVIEW ..... 1-1
2. INTRODUCTION TO THE HANDBOOK ..... 2-1
2.1 Introduction ..... 2-1
2.2 Purpose ..... 2-2
2.3 Scope ..... 2-2
2.4 Authority/Standards ..... 2-3
2.4.1 Department of Defense ..... 2-3
2.4.1.1 DODD 5000.1 ..... 2-3
2.4.1.2 DOD 5000.2R ..... 2-4
2.4.1.3 Military Standards ..... 2-4
2.4.2 Other Government Agencies ..... 2-8
2.4.2.1 Department of Transportation ..... 2-8
2.4.2.2 National Aeronautics and Space Administration ..... 2-11
2.4.3 Commercial ..... 2-11
2.4.3.1 Institute of Electrical and Electronic Engineering ..... 2-12
2.4.3.2 Electronic Industries Association ..... 2-12
2.4.3.3 International Electrotechnical Commission ..... 2-12
2.5 International Standards ..... 2-13
2.5.1 Australian Defense Standard DEF(AUST) 5679 ..... 2-13
2.5.2 United Kingdom Defense Standard 00-55 & 00-54 ..... 2-14
2.5.3 United Kingdom Defense Standard 00-56 ..... 2-14
2.6 Handbook Overview ..... 2-15
2.6.1 Historical Background ..... 2-15
2.6.2 Problem Identification ..... 2-15
2.6.2.1 Within System Safety ..... 2-16
2.6.2.2 Within Software Development ..... 2-17
2.6.3 Management Responsibilities ..... 2-18
2.6.4 Introduction to the Systems Approach ..... 2-18
2.6.4.1 The Hardware Development Life Cycle ..... 2-19
2.6.4.2 The Software Development Life Cycle ..... 2-20
2.6.4.3 The Integration of Hardware and Software Life Cycles ..... 2-24
2.6.5 A Team Solution ..... 2-25
2.7 Handbook Organization ..... 2-26
2.7.1 Planning and Management ..... 2-28
2.7.2 Task Implementation ..... 2-28
2.7.3 Software Risk Assessment and Acceptance ..... 2-29
2.7.4 Supplementary Appendices ..... 2-29
3. INTRODUCTION TO RISK MANAGEMENT AND SYSTEM SAFETY ..... 3-1
3.1 Introduction ..... 3-1
3.2 A Discussion of Risk ..... 3-1
3.3 Types of Risk ..... 3-2
3.4 Areas of Program Risk ..... 3-3
3.4.1 Schedule Risk ..... 3-5
3.4.2 Budget Risk ..... 3-6
3.4.3 Sociopolitical Risk ..... 3-7
3.4.4 Technical Risk ..... 3-7
3.5 System Safety Engineering ..... 3-8
3.6 Safety Risk Management ..... 3-11
3.6.1 Initial Safety Risk Assessment ..... 3-12
3.6.1.1 Hazard and Failure Mode Identification ..... 3-12
3.6.1.2 Hazard Severity ..... 3-12
3.6.1.3 Hazard Probability ..... 3-13
3.6.1.4 HRI Matrix ..... 3-14
3.6.2 Safety Order of Precedence ..... 3-15
3.6.3 Elimination or Risk Reduction ..... 3-16
3.6.4 Quantification of Residual Safety Risk ..... 3-17
3.6.5 Managing and Assuming Residual Safety Risk ..... 3-18
4. SOFTWARE SAFETY ENGINEERING ..... 4-1
4.1 Introduction ..... 4-1
4.1.1 Section 4 Format ..... 4-3
4.1.2 Process Charts ..... 4-3
4.1.3 Software Safety Engineering Products ..... 4-5
4.2 Software Safety Planning Management ..... 4-5
4.2.1 Planning ..... 4-6
4.2.1.1 Establish the System Safety Program ..... 4-10
4.2.1.2 Defining Acceptable Levels of Risk ..... 4-11
4.2.1.3 Program Interfaces ..... 4-12
4.2.1.4 Contract Deliverables ..... 4-16
4.2.1.5 Develop Software Hazard Criticality Matrix ..... 4-17
4.2.2 Management ..... 4-21
4.3 Software Safety Task Implementation ..... 4-25
4.3.1 Software Safety Program Milestones ..... 4-26
4.3.1 Preliminary Hazard List Development ..... 4-28
4.3.2 Tailoring Generic Safety-Critical Requirements ..... 4-31
4.3.3 Preliminary Hazard Analysis Development ..... 4-33
4.3.4 Derive System Safety-Critical Software Requirements ..... 4-37
4.3.4.1 Preliminary Software Safety Requirements ..... 4-39
4.3.4.2 Matured Software Safety Requirements ..... 4-40
4.3.4.3 Documenting Software Safety Requirements ..... 4-40
4.3.4.4 Software Analysis Folders ..... 4-41
4.3.5 Preliminary Software Design, Subsystem Hazard Analysis ..... 4-42
4.3.5.1 Module Safety-Criticality Analysis ..... 4-45
4.3.5.2 Program Structure Analysis ..... 4-45
4.3.5.3 Traceability Analysis ..... 4-46
4.3.6 Detailed Software Design, Subsystem Hazard Analysis ..... 4-47
4.3.6.1 Participate in Software Design Maturation ..... 4-48
4.3.6.2 Detailed Design Software Safety Analysis ..... 4-49
4.3.6.3 Detailed Design Analysis Related Sub-processes ..... 4-53
4.3.7 System Hazard Analysis ..... 4-60
4.4 Software Safety Testing & Risk Assessment ..... 4-63
4.4.1 Software Safety Test Planning ..... 4-63
4.4.2 Software Safety Test Analysis ..... 4-65
4.4.3 Software Standards and Criteria Assessment ..... 4-69
4.4.4 Software Safety Residual Risk Assessment ..... 4-71
4.5 Safety Assessment Report ..... 4-73
4.5.1 Safety Assessment Report Table of Contents ..... 4-74
A. DEFINITION OF TERMS
A.1 Acronyms ..... A-1
A.2 Definitions ..... A-5
B. REFERENCES
B.1 Government References ..... B-1
B.2 Commercial References ..... B-1
B.3 Individual References ..... B-2
B.4 Other References ..... B-3
C. HANDBOOK SUPPLEMENTAL INFORMATION
C.1 Proposed Contents of the System Safety Data Library ..... C-1
C.1.1 System Safety Program Plan ..... C-1
C.1.2 Software Safety Program Plan ..... C-2
C.1.3 Preliminary Hazard List ..... C-3
C.1.4 Safety-Critical Functions List ..... C-4
C.1.5 Preliminary Hazard Analysis ..... C-5
C.1.6 Subsystem Hazard Analysis ..... C-6
C.1.7 System Hazard Analysis ..... C-6
C.1.8 Safety Requirements Criteria Analysis ..... C-7
C.1.9 Safety Requirements Verification Report ..... C-8
C.1.10 Safety Assessment Report ..... C-9
C.2 Contractual Documentation ..... C-10
C.2.1 Statement of Operational Need ..... C-10
C.2.2 Request For Proposal ..... C-10
C.2.3 Contract ..... C-11
C.2.4 Statement of Work ..... C-11
C.2.5 System and Product Specification ..... C-13
C.2.6 System and Subsystem Requirements ..... C-14
C.3 Planning Interfaces ..... C-14
C.3.1 Engineering Management ..... C-14
C.3.2 Design Engineering ..... C-14
C.3.3 Systems Engineering ..... C-15
C.3.4 Software Development ..... C-16
C.3.5 Integrated Logistics Support ..... C-16
C.3.6 Other Engineering Support ..... C-17
C.4 Meetings and Reviews ..... C-17
C.4.1 Program Management Reviews ..... C-17
C.4.2 Integrated Product Team Meetings ..... C-18
C.4.3 System Requirements Reviews ..... C-18
C.4.4 System/Subsystem Design Reviews ..... C-19
C.4.5 Preliminary Design Review ..... C-19
C.4.6 Critical Design Review ..... C-20
C.4.7 Test Readiness Review ..... C-21
C.4.8 Functional Configuration Audit ..... C-22
C.4.9 Physical Configuration Audit ..... C-22
C.5 Working Groups ..... C-23
C.5.1 System Safety Working Group ..... C-23
C.5.2 Software System Safety Working Group ..... C-23
C.5.3 Test Integration Working Group/Test Planning Working Group ..... C-25
C.5.4 Computer Resources Working Group ..... C-25
C.5.5 Interface Control Working Group ..... C-25
C.6 Resource Allocation ..... C-26
C.6.1 Safety Personnel ..... C-26
C.6.2 Funding ..... C-27
C.6.3 Safety Schedules and Milestones ..... C-27
C.6.4 Safety Tools and Training ..... C-28
C.6.5 Required Hardware and Software ..... C-28
C.7 Program Plans ..... C-29
C.7.1 Risk Management Plan ..... C-29
C.7.2 Quality Assurance Plan ..... C-30
C.7.3 Reliability Engineering Plan ..... C-30
C.7.4 Software Development Plan ..... C-31
C.7.5 Systems Engineering Management Plan ..... C-32
C.7.6 Test and Evaluation Master Plan ..... C-33
C.7.7 Software Test Plan ..... C-34
C.7.8 Software Installation Plan ..... C-34
C.7.9 Software Transition Plan ..... C-35
C.8 Hardware and Human Interface Requirements ..... C-35
C.8.1 Interface Requirements ..... C-35
C.8.2 Operations and Support Requirements ..... C-36
C.8.3 Safety/Warning Device Requirements ..... C-36
C.8.4 Protective Equipment Requirements ..... C-37
C.8.5 Procedures and Training Requirements ..... C-37
C.9 Managing Change ..... C-37
C.9.1 Software Configuration Control Board ..... C-37
D. COTS AND NDI SOFTWARE
D.1 Introduction ..... D-1
D.2 Related Issues ..... D-2
D.2.1 Managing Change ..... D-2
D.2.2 Configuration Management ..... D-2
D.2.3 Reusable and Legacy Software ..... D-3
D.3 Applications of Non-Developmental Items ..... D-3
D.3.1 Commercial-Off-the-Shelf Software ..... D-3
D.4 Reducing Risks ..... D-5
D.4.1 Applications Software Design ..... D-5
D.4.2 Middleware or Wrappers ..... D-6
D.4.3 Message Protocol ..... D-7
D.4.4 Designing Around It ..... D-7
D.4.5 Analysis and Testing of NDI Software ..... D-8
D.4.6 Eliminating Functionality ..... D-8
D.4.7 Run-Time Versions ..... D-9
D.4.8 Watchdog Timers ..... D-9
D.4.9 Configuration Management ..... D-9
D.4.10 Prototyping ..... D-10
D.4.11 Testing ..... D-10
D.5 Summary ..... D-10
E. GENERIC REQUIREMENTS AND GUIDELINES
E.1 Introduction ..... E-1
E.1.1 Determination of Safety-Critical Computing System Functions ..... E-1
E.2 Design and Development Process Requirements and Guidelines ..... E-2
E.2.1 Configuration Control ..... E-2
E.2.2 Software Quality Assurance Program ..... E-3
E.2.3 Two Person Rule ..... E-3
E.2.4 Program Patch Prohibition ..... E-3
E.2.5 Software Design Verification and Validation ..... E-3
E.3 System Design Requirements and Guidelines ..... E-5
E.3.1 Designed Safe States ..... E-5
E.3.2 Standalone Computer ..... E-5
E.3.3 Ease of Maintenance ..... E-5
E.3.4 Safe State Return ..... E-6
E.3.5 Restoration of Interlocks ..... E-6
E.3.6 Input/Output Registers ..... E-6
E.3.7 External Hardware Failures ..... E-6
E.3.8 Safety Kernel Failure ..... E-6
E.3.9 Circumvent Unsafe Conditions ..... E-6
E.3.10 Fallback and Recovery ..... E-6
E.3.11 Simulators ..... E-6
E.3.12 System Errors Log ..... E-7
E.3.13 Positive Feedback Mechanisms ..... E-7
E.3.14 Peak Load Conditions ..... E-7
E.3.15 Endurance Issues ..... E-7
E.3.16 Error Handling ..... E-8
E.3.17 Redundancy Management ..... E-9
E.3.18 Safe Modes and Recovery ..... E-10
E.3.19 Isolation and Modularity ..... E-10
E.4 Power-Up System Initialization Requirements ..... E-11
E.4.1 Power-Up Initialization ..... E-11
E.4.2 Power Faults ..... E-11
E.4.3 Primary Computer Failure ..... E-12
E.4.4 Maintenance Interlocks ..... E-12
E.4.5 System-Level Check ..... E-12
E.4.6 Control Flow Defects ..... E-12
E.5 Computing System Environment Requirements and Guidelines ..... E-14
E.5.1 Hardware and Hardware/Software Interface Requirements ..... E-14
E.5.2 CPU Selection ..... E-15
E.5.3 Minimum Clock Cycles ..... E-16
E.5.4 Read Only Memories ..... E-16
E.6 Self-Check Design Requirements and Guidelines ..... E-16
E.6.1 Watchdog Timers ..... E-16
E.6.2 Memory Checks ..... E-16
E.6.3 Fault Detection ..... E-16
E.6.4 Operational Checks ..... E-17
E.7 Safety-Critical Computing System Functions Protection Requirements and Guidelines ..... E-17
E.7.1 Safety Degradation ..... E-17
E.7.2 Unauthorized Interaction ..... E-17
E.7.3 Unauthorized Access ..... E-17
E.7.4 Safety Kernel ROM ..... E-17
E.7.5 Safety Kernel Independence ..... E-17
E.7.6 Inadvertent Jumps ..... E-17
E.7.7 Load Data Integrity ..... E-18
E.7.8 Operational Reconfiguration Integrity ..... E-18
E.8 Interface Design Requirements ..... E-18
E.8.1 Feedback Loops ..... E-18
E.8.2 Interface Control ..... E-18
E.8.3 Decision Statements ..... E-18
E.8.4 Inter-CPU Communications ..... E-18
E.8.5 Data Transfer Messages ..... E-18
E.8.6 External Functions ..... E-19
E.8.7 Input Reasonableness Checks ..... E-19
E.8.8 Full Scale Representations ..... E-19
E.9 Human Interface ..... E-19
E.9.1 Operator/Computing System Interface ..... E-19
E.9.2 Processing Cancellation ..... E-20
E.9.3 Hazardous Function Initiation ..... E-20
E.9.4 Safety-Critical Displays ..... E-21
E.9.5 Operator Entry Errors ..... E-21
E.9.6 Safety-Critical Alerts ..... E-21
E.9.7 Unsafe Situation Alerts ..... E-21
E.9.8 Unsafe State Alerts ..... E-21
E.10 Critical Timing and Interrupt Functions ..... E-21
E.10.1 Safety-Critical Timing ..... E-21
E.10.2 Valid Interrupts ..... E-22
E.10.3 Recursive Loops ..... E-22
E.10.4 Time Dependency ..... E-22
E.11 Software Design and Development Requirements and Guidelines ..... E-22
E.11.1 Coding Requirements/Issues ..... E-22
E.11.2 Modular Code ..... E-24
E.11.3 Number of Modules ..... E-24
E.11.4 Execution Path ..... E-24
E.11.5 Halt Instructions ..... E-25
E.11.6 Single Purpose Files ..... E-25
E.11.7 Unnecessary Features ..... E-25
E.11.8 Indirect Addressing Methods ..... E-25
E.11.9 Uninterruptable Code ..... E-25
E.11.10 Safety-Critical Files ..... E-25
E.11.11 Unused Memory ..... E-25
E.11.12 Overlays of Safety-Critical Software Shall All Occupy the Same Amount of Memory ..... E-26
E.11.13 Operating System Functions ..... E-26
E.11.14 Compilers ..... E-26
E.11.15 Flags and Variables ..... E-26
E.11.16 Loop Entry Point ..... E-26
E.11.17 Software Maintenance Design ..... E-26
E.11.18 Variable Declaration ..... E-26
E.11.19 Unused Executable Code ..... E-26
E.11.20 Unreferenced Variables ..... E-26
E.11.21 Assignment Statements ..... E-27
E.11.22 Conditional Statements ..... E-27
E.11.23 Strong Data Typing ..... E-27
E.11.24 Timer Values Annotated ..... E-27
E.11.25 Critical Variable Identification ..... E-27
E.11.26 Global Variables ..... E-27
E.12 Software Maintenance Requirements and Guidelines ..... E-27
E.12.1 Critical Function Changes ..... E-28
E.12.2 Critical Firmware Changes ..... E-28
E.12.3 Software Change Medium ..... E-28
E.12.4 Modification Configuration Control ..... E-28
E.12.5 Version Identification ..... E-28
E.13 Software Analysis and Testing ..... E-28
E.13.1 General Testing Guidelines ..... E-28
E.13.2 Trajectory Testing for Embedded Systems ..... E-30
E.13.3 Formal Test Coverage ..... E-30
E.13.4 Go/No-Go Path Testing ..... E-30
E.13.5 Input Failure Modes ..... E-30
E.13.6 Boundary Test Conditions ..... E-30
E.13.7 Input Data Rates ..... E-30
E.13.8 Zero Value Testing ..... E-31
E.13.9 Regression Testing ..... E-31
E.13.10 Operator Interface Testing ..... E-31
E.13.11 Duration Stress Testing ..... E-31
F. LESSONS LEARNED
F.1 Therac Radiation Therapy Machine Fatalities ..... F-1
F.1.1 Summary ..... F-1
F.1.2 Key Facts ..... F-1
F.1.3 Lessons Learned ..... F-2
F.2 Missile Launch Timing Causes Hangfire ..... F-2
F.2.1 Summary ..... F-2
F.2.2 Key Facts ..... F-2
F.2.3 Lessons Learned ..... F-3
F.3 Reused Software Causes Flight Controls to Shut Down ..... F-3
F.3.1 Summary ..... F-3
F.3.2 Key Facts ..... F-4
F.3.3 Lessons Learned ..... F-4
F.4 Flight Controls Fail at Supersonic Transition ..... F-4
F.4.1 Summary ..... F-4
F.4.2 Key Facts ..... F-5
F.4.3 Lessons Learned ..... F-5
F.5 Incorrect Missile Firing from Invalid Setup Sequence ..... F-5
F.5.1 Summary ..... F-5
F.5.2 Key Facts ..... F-6
F.5.3 Lessons Learned ..... F-6
F.6 Operator's Choice of Weapon Release Overridden by Software ..... F-6
F.6.1 Summary ..... F-6
F.6.2 Key Facts ..... F-7
F.6.3 Lessons Learned ..... F-7
G. PROCESS CHART WORKSHEETS
H. SAMPLE CONTRACTUAL DOCUMENTS
H.1 Sample Request for Proposal ..... H-1
H.2 Sample Statement of Work ..... H-2
H.2.1 System Safety ..... H-2
H.2.2 Software Safety ..... H-3
LIST OF FIGURES
Figure 2-1: Management Commitment to the Integrated Safety Process ..... 2-18
Figure 2-2: Example of Internal System Interfaces ..... 2-19
Figure 2-3: Weapon System Life Cycle ..... 2-20
Figure 2-4: Relationship of Software to the Hardware Development Life Cycle ..... 2-21
Figure 2-5: Grand Design Waterfall Software Acquisition Life Cycle Model ..... 2-22
Figure 2-6: Modified V Software Acquisition Life Cycle Model ..... 2-23
Figure 2-7: Spiral Software Acquisition Life Cycle Model ..... 2-24
Figure 2-8: Integration of Engineering Personnel and Processes ..... 2-26
Figure 2-9: Handbook Layout ..... 2-27
Figure 2-10: Section 4 Format ..... 2-28
Figure 3-1: Types of Risk ..... 3-3
Figure 3-2: Systems Engineering, Risk Management Documentation ..... 3-6
Figure 3-3: Hazard Reduction Order of Precedence ..... 3-16
Figure 4-1: Section 4 Contents ..... 4-1
Figure 4-2: Who is Responsible for SSS? ..... 4-2
Figure 4-3: Example of Initial Process Chart ..... 4-4
Figure 4-4: Software Safety Planning ..... 4-6
Figure 4-5: Software Safety Planning by the Procuring Authority ..... 4-7
Figure 4-6: Software Safety Planning by the Developing Agency ..... 4-8
Figure 4-7: Planning the Safety Criteria Is Important ..... 4-10
Figure 4-8: Software Safety Program Interfaces ..... 4-12
Figure 4-9: Ultimate Safety Responsibility ..... 4-14
Figure 4-10: Proposed SSS Team Membership ..... 4-15
Figure 4-11: Example of Risk Acceptance Matrix ..... 4-17
Figure 4-12: Likelihood of Occurrence Example ..... 4-19
Figure 4-13: Examples of Software Control Capabilities ..... 4-19
Figure 4-14: Software Hazard Criticality Matrix, MIL-STD-882C ..... 4-20
Figure 4-15: Software Safety Program Management ..... 4-21
Figure 4-16: Software Safety Task Implementation ..... 4-25
Figure 4-17: Example POA&M Schedule ..... 4-27
Figure 4-18: Preliminary Hazard List Development ..... 4-29
Figure 4-19: An Example of Safety-Critical Functions ..... 4-31
Figure 4-20: Tailoring the Generic Safety Requirements ..... 4-32
Figure 4-21: Example of a Generic Software Safety Requirements Tracking Worksheet ..... 4-33
Figure 4-22: Preliminary Hazard Analysis ..... 4-34
Figure 4-23: Hazard Analysis Segment ..... 4-35
Figure 4-24: Example of a Preliminary Hazard Analysis ..... 4-37
Figure 4-25: Derive Safety-Specific Software Requirements ..... 4-38
Figure 4-26: Software Safety Requirements Derivation ..... 4-39
Figure 4-27: In-Depth Hazard Cause Analysis ..... 4-40
Figure 4-28: Preliminary Software Design Analysis ..... 4-42
Figure 4-29: Software Safety Requirements Verification Tree ..... 4-44
Figure 4-30: Hierarchy Tree Example ..... 4-46
Figure 4-31: Detailed Software Design Analysis ..... 4-48
Figure 4-32: Verification Methods ..... 4-49
Figure 4-33: Identification of Safety-Related CSUs ..... 4-50
Figure 4-34: Example of a Data Flow Diagram ..... 4-55
Figure 4-35: Flow Chart Examples ..... 4-56
Figure 4-36: System Hazard Analysis ..... 4-60
Figure 4-37: Example of a System Hazard Analysis Interface Analysis ..... 4-61
Figure 4-38: Documentation of Interface Hazards and Safety Requirements ..... 4-62
Figure 4-39: Documenting Evidence of Hazard Mitigation ..... 4-63
Figure 4-40: Software Safety Test Planning ..... 4-64
Figure 4-41: Software Safety Testing and Analysis ..... 4-66
Figure 4-42: Software Requirements Verification ..... 4-70
Figure 4-43: Residual Safety Risk Assessment ..... 4-72
Figure C.1: Contents of a SwSPP - IEEE STD 1228-1994 ..... C-3
Figure C.2: SSHA & SHA Hazard Record Example ..... C-7
Figure C.3: Hazard Requirement Verification Document Example ..... C-9
Figure C.4: Software Safety SOW Paragraphs ..... C-13
Figure C.5: Generic Software Configuration Change Process ..... C-38
LIST OF TABLES
Table 2-1: Survey Response ..... 2-17
Table 3-1: Hazard Severity ..... 3-12
Table 3-2: Hazard Probability ..... 3-13
Table 3-3: HRI Matrix ..... 3-14
Table 4-1: Acquisition Process Trade-off Analyses ..... 4-35
Table 4-2: Example of a Software Safety Requirements Verification Matrix ..... 4-44
Table 4-3: Example of a RTM ..... 4-45
Table 4-4: Safety-Critical Function Matrix ..... 4-45
Table 4-5: Data Item Example ..... 4-54
1. Executive Overview
Since the development of the digital computer, software has played an important and evolving role in the operation and control of hazardous, safety-critical functions. The engineering community's reluctance to relinquish human control of hazardous operations has diminished dramatically in the last 15 years. Today, digital computer systems have autonomous control over safety-critical functions in nearly every major technology, both commercial and government. This revolution is due primarily to the ability of software to perform critical control tasks reliably, and at speeds unmatched by its human counterpart. Other factors influencing this transition are our ever-growing need and desire for increased versatility, greater performance capability, higher efficiency, and decreased life cycle cost. In most instances, properly designed software can meet all of these attributes of system performance. The logic of the software allows decisions to be implemented without emotion, and with speed and accuracy. This has forced the human operator out of the control loop, because he or she can no longer keep pace with the speed, cost effectiveness, and decision-making processes of the system.

There is therefore a critical need to perform system safety engineering tasks on safety-critical systems to reduce the safety risk in all aspects of a program. These tasks include the software system safety (SSS) activities involving the design, code, test, Independent Verification and Validation (IV&V), operation and maintenance, and change control functions of the software engineering development process. The main objective (or definition) of system safety engineering, which includes SSS, is as follows:

The application of engineering and management principles, criteria, and techniques to optimize all aspects of safety within the constraints of operational effectiveness, time, and cost throughout all phases of the system life cycle.

The ultimate responsibility for the development of a safe system rests with program management. The commitment to provide qualified people and an adequate budget and schedule for a software development program begins with the program director or program manager (PM). Top management must be a strong voice of safety advocacy and must communicate this personal commitment to each level of program and technical management. The PM must support the integrated safety process between systems engineering, software engineering, and safety engineering in the design, development, test, and operation of the system software.

Thus, the purpose of this document (hereafter referred to as the Handbook) is as follows:

Provide management and engineering guidelines to achieve a reasonable level of assurance that software will execute within the system context with an acceptable level of safety risk.
As a member of the software development team, the safety engineer is critical to the design, and redesign, of modern systems. Whether a hardware engineer, software engineer, safety specialist, or safety manager, it is his or her responsibility to ensure that an acceptable level of safety is achieved and maintained throughout the life cycle of the system(s) being developed. This Handbook provides a rigorous and pragmatic application of SSS planning and analysis to be used by the safety engineer. SSS, an element of the total system safety and software development program, cannot function independently of the total effort; nor can it be ignored.

Systems, from the simple to highly integrated assemblies of multiple subsystems, are experiencing an extraordinary growth in the use of computers and software to monitor and/or control safety-critical subsystems and functions. A software specification error, design flaw, or lack of initial safety requirements can contribute to or cause a system failure or erroneous human decision. Preventable death, injury, loss of the system, or environmental damage can result. To achieve an acceptable level of safety for software used in critical applications, software safety engineering must be given primary emphasis early in the requirements definition and system conceptual design process. Safety-critical software must then receive continuous emphasis from management, as well as continuing engineering analysis, throughout the development and operational life cycles of the system.

This Handbook is a joint effort. The U.S. Army, Navy, Air Force, and Coast Guard Safety Centers, with cooperation from the Federal Aviation Administration (FAA), the National Aeronautics and Space Administration (NASA), defense industry contractors, and academia, are the primary contributors. This extensive research captures the best practices pertaining to SSS program management and safety-critical software design. The Handbook consolidates these contributions into a single, user-friendly resource and aids the system development team in understanding its SSS responsibilities. By using this Handbook, the user will appreciate the need for all disciplines to work together in identifying, controlling, and managing software-related hazards within the safety-critical components of hardware systems.
To summarize, this Handbook is a how-to guide for understanding SSS and the contribution of each functional discipline to the overall goal. It is applicable to all types of systems (military and commercial), in all types of operational uses.
2.2 Purpose
The purpose of the SSSH is to provide management and engineering guidelines to achieve a reasonable level of assurance that the software will execute within the system context with an acceptable level of safety risk.1
2.3 Scope
This Handbook is both a reference document and management tool for aiding managers and engineers at all levels, in any government or industrial organization, in the development and implementation of an effective SSS process. This process minimizes the likelihood or severity of system hazards caused by software that is poorly specified, designed, developed, or operated in safety-critical applications. The primary responsibility for management of the SSS process lies with the system safety manager/engineer in both the developer's (supplier) and acquirer's (customer) organizations. However, nearly every functional discipline has a vital role and must be intimately involved in the SSS process. The SSS tasks, techniques, and processes outlined in this Handbook are basic enough to be applied to any system that uses software in critical areas. It serves the need for all contributing disciplines to understand and apply qualitative and quantitative analysis techniques to ensure the safety of hardware systems controlled by software.

This Handbook is a guide and is not intended to supersede any Agency policy, standard, or guidance pertaining to system safety (MIL-STD-882C) or software engineering and development (MIL-STD-498). It is written to clarify the SSS requirements and tasks specified in governmental and commercial standards and guideline documents. The Handbook is not a compliance document but a reference document. It provides the system safety manager and the software development manager with sufficient information to perform the following:

Properly scope the SSS effort in the Statement of Work (SOW),
Identify the data items needed to effectively monitor the contractor's compliance with the contract system safety requirements, and
Evaluate contractor performance throughout the development life cycle.
The Handbook is not a tutorial on software engineering. However, it does address some technical aspects of software function and design to assist with understanding software safety. It is an objective of this Handbook to provide each member of the SSS Team with a basic understanding of sound systems and software safety practices, processes, and techniques.

1 The stated purpose of this Handbook closely resembles Nancy Leveson's definition of Software System Safety. The authors would like to provide the appropriate credit for her implicit contribution.
Another objective is to demonstrate the importance of each technical and managerial discipline working hand-in-hand in defining software safety requirements (SSR) for the safety-critical software components of the system. A final objective is to show where safety features can be designed into the software to eliminate or control identified hazards.
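In practice, the linkage between identified hazards and the SSRs that mitigate them is easiest to maintain when it is recorded explicitly. The sketch below is a minimal, hypothetical illustration of such a traceability record; the field names, identifiers, and helper function are invented for this example and are not prescribed by this Handbook or any standard.

```python
# Hypothetical traceability records: every software safety requirement (SSR)
# points at the hazards it controls and at the evidence that verified it.
from dataclasses import dataclass, field

@dataclass
class Hazard:
    hazard_id: str            # e.g., "HAZ-017" (illustrative numbering)
    description: str          # the system-level hazard or failure mode
    severity: str             # e.g., "Catastrophic"

@dataclass
class SoftwareSafetyRequirement:
    ssr_id: str               # e.g., "SSR-042"
    text: str                 # the requirement levied on the software design
    mitigates: list           # hazard_ids this SSR eliminates or controls
    verified_by: list = field(default_factory=list)  # test/analysis IDs

def untraced_hazards(hazards, ssrs):
    """Hazards that no SSR traces to; each is a gap in the safety effort."""
    covered = {hid for ssr in ssrs for hid in ssr.mitigates}
    return [h for h in hazards if h.hazard_id not in covered]
```

A report from a function like untraced_hazards is one simple way to make the "identify, control, and verify" objective auditable across the development life cycle.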
2.4 Authority/Standards
Numerous directives, standards, regulations, and regulatory guides establish the authority for system safety engineering requirements in the acquisition, development, and maintenance of software-based systems. Although the primary focus of this Handbook is targeted toward military systems, much of the authority for the establishment of Department of Defense (DOD) system safety, and software safety programs, is derived from other governmental and commercial standards and guidance. We have documented many of these authoritative standards and guidelines within this Handbook: first, to establish their existence; second, to demonstrate the seriousness that the government places on the reduction of safety risk for software performing safety-critical functions; and finally, to consolidate in one place all authoritative documentation. This allows a PM, safety manager, or safety engineer to clearly demonstrate the mandated requirement and need for a software safety program to their superiors.

2.4.1 Department of Defense

Within the DOD and the acquisition corps of each branch of military service, the primary documents of interest pertaining to system safety and software development include DOD Instruction 5000.1, Defense Acquisition; DOD 5000.2R, Mandatory Procedures for Major Defense Acquisition Programs (MDAPs) and Major Automated Information System (MAIS) Acquisition Programs; MIL-STD-498, Software Development and Documentation; and MIL-STD-882D, Standard Practice for System Safety. The authority of the acquisition professional to establish a software safety program is provided in the following paragraphs. These paragraphs are quoted or summarized from various DOD directives and military standards. They clearly define the mandated requirement for all DOD systems acquisition and development programs to incorporate safety requirements and analysis into the design, development, testing, and support of software being used to perform or control critical system functions. The DOD documents also levy the authority and responsibility for establishing and managing an effective software safety program to the highest level of program authority.

2.4.1.1 DODD 5000.1

DODD 5000.1, Defense Acquisition, March 15, 1996, Paragraph D.1.d, establishes the requirement and need for an aggressive risk management program for acquiring quality products.

d. Risk Assessment and Management. PMs and other acquisition managers shall continually assess program risks. Risks must be well understood, and risk management approaches developed, before decision authorities can authorize a program to proceed into the next phase of the acquisition process. To assess and manage risk, PMs and other acquisition managers shall use a variety of techniques, including technology demonstrations, prototyping, and test and evaluation. Risk management encompasses
identification, mitigation, continuous tracking, and control procedures that feed back through the program assessment process to decision authorities. To ensure an equitable and sensible allocation of risk between government and industry, PMs and other acquisition managers shall develop a contracting approach appropriate to the type of system being acquired.

2.4.1.2 DOD 5000.2R

DOD 5000.2R, Mandatory Procedures for MDAPs and MAIS Acquisition Programs, March 15, 1996, provides the guidance regarding system safety and health.

4.3.7.3 System Safety and Health: The PM shall identify and evaluate system safety and health hazards, define risk levels, and establish a program that manages the probability and severity of all hazards associated with development, use, and disposal of the system. All safety and health hazards shall be managed consistent with mission requirements and shall be cost-effective. Health hazards include conditions that create significant risks of death, injury, or acute or chronic illness, disability, and/or reduced job performance of personnel who produce, test, operate, maintain, or support the system. Each management decision to accept the risks associated with an identified hazard shall be formally documented. The Component Acquisition Executive (CAE) shall be the final approval authority for acceptance of high-risk hazards. All participants in joint programs shall approve acceptance of high-risk hazards. Acceptance of serious risk hazards may be approved at the Program Executive Officer (PEO) level.

2.4.1.3 Military Standards

2.4.1.3.1 MIL-STD-882B, Notice 1

MIL-STD-882B, System Safety Program Requirements, March 30, 1984 (Notice 1, July 1, 1987), remains on numerous government programs which were contracted during the 1980s, prior to the issuance of MIL-STD-882C. The objective of this standard is the establishment of a System Safety Program (SSP) to ensure that safety, consistent with mission requirements, is designed into systems, subsystems, equipment, facilities, and their interfaces. The authors of this standard recognized the safety risk that software presented in safety-critical systems. The standard provides guidance and specific tasks for the development team to address the software, hardware, system, and human interfaces. These include the 300-series tasks. The purpose of each task is as follows:

Task 301, Software Requirements Hazard Analysis: The purpose of Task 301 is to require the contractor to perform and document a Software Requirements Hazard Analysis. The contractor shall examine both system and software requirements as well as design in order to identify unsafe modes for resolution, such as out-of-sequence, wrong event, inappropriate magnitude, inadvertent command, adverse environment, deadlocking, failure-to-command, etc. The analysis shall examine safety-critical computer software components at a gross level to obtain an initial safety evaluation of the software system.
Task 302, Top-level Design Hazard Analysis: The purpose of Task 302 is to require the contractor to perform and document a Top-level Design Hazard Analysis. The contractor shall analyze the top-level design, using the results of the Software Requirements Hazard Analysis if previously accomplished. This analysis shall include the definition and subsequent analysis of safety-critical computer software components, identifying the degree of risk involved, as well as the design and test plan to be implemented. The analysis shall be substantially complete before the software detailed design is started. The results of the analysis shall be presented at the Preliminary Design Review (PDR).

Task 303, Detailed Design Hazard Analysis: The purpose of Task 303 is to require the contractor to perform and document a Detailed Design Hazard Analysis. The contractor shall analyze the software detailed design using the results of the Software Requirements Hazard Analysis and the Top-level Design Hazard Analysis to verify the correct incorporation of safety requirements and to analyze the safety-critical computer software components. This analysis shall be substantially complete before coding of the software is started. The results of the analysis shall be presented at the Critical Design Review (CDR).

Task 304, Code-level Software Hazard Analysis: The purpose of Task 304 is to require the contractor to perform and document a Code-level Software Hazard Analysis. Using the results of the Detailed Design Hazard Analysis, the contractor shall analyze program code and system interfaces for events, faults, and conditions that could cause or contribute to undesired events affecting safety. This analysis shall start when coding begins, and shall be continued throughout the system life cycle.

Task 305, Software Safety Testing: The purpose of Task 305 is to require the contractor to perform and document Software Safety Testing to ensure that all hazards have been eliminated or controlled to an acceptable level of risk.

Task 306, Software/User Interface Analysis: The purpose of Task 306 is to require the contractor to perform and document a Software/User Interface Analysis and the development of software user procedures.

Task 307, Software Change Hazard Analysis: The purpose of Task 307 is to require the contractor to perform and document a Software Change Hazard Analysis. The contractor shall analyze all changes, modifications, and patches made to the software for safety hazards.

2.4.1.3.2 MIL-STD-882C

MIL-STD-882C, System Safety Program Requirements, January 19, 1993, establishes the requirement for detailed system safety engineering and management activities on all system procurements within the DOD. This includes the integration of software safety within the context of the SSP. Although MIL-STD-882B and MIL-STD-882C remain on older contracts within the DOD, MIL-STD-882D is the current system safety standard as of the date of this Handbook.
Paragraph 4, General Requirements, 4.1, System Safety Program: The contractor shall establish and maintain a SSP to support efficient and effective achievement of overall system safety objectives.

Paragraph 4.2, System Safety Objectives: The SSP shall define a systematic approach to make sure that: ...(b.) Hazards associated with each system are identified, tracked, evaluated, and eliminated, or the associated risk reduced to a level acceptable to the Procuring Authority (PA) throughout the entire life cycle of a system.

Paragraph 4.3, System Safety Design Requirements: ...Some general system safety design requirements are: ...(j.) Design software controlled or monitored functions to minimize initiation of hazardous events or mishaps.

Task 202, Preliminary Hazard Analysis (PHA), Section 202.2, Task Description: ...The PHA shall consider the following for identification and evaluation of hazards as a minimum: (b.) Safety related interface considerations among various elements of the system (e.g., material compatibilities, electromagnetic interference, inadvertent activation, fire/explosive initiation and propagation, and hardware and software controls). This shall include consideration of the potential contribution by software (including software developed by other contractors/sources) to subsystem/system mishaps. Safety design criteria to control safety-critical software commands and responses (e.g., inadvertent command, failure to command, untimely command or responses, inappropriate magnitude, or PA-designated undesired events) shall be identified and appropriate actions taken to incorporate them in the software (and related hardware) specifications.

Task 202 is included as a representative description of tasks integrating software safety. The general description is also applicable to all the other tasks specified in MIL-STD-882C. The point is that software safety must be an integral part of system safety and software development.

2.4.1.3.3 MIL-STD-882D

MIL-STD-882D, Standard Practice for System Safety, replaced MIL-STD-882C in September 1999. Although the new standard is radically different from its predecessors, it still captures their basic tenets. It requires that the system developers document the approach to produce the following:

Satisfy the requirements of the standard,
Identify hazards in the system through a systematic analysis approach,
Assess the severity of the hazards,
Identify mitigation techniques,
Reduce the mishap risk to an acceptable level,
Verify and validate the mishap risk reduction, and
Obtain acceptance of the residual mishap risk by the appropriate authority.
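Viewed mechanically, these activities form a closed loop around each identified hazard. The sketch below illustrates that loop only; the class, its fields, and the severity vocabulary are invented for illustration and are not drawn from MIL-STD-882D.

```python
# Illustrative only: a closed-loop hazard record mirroring the activities
# listed above (identify, assess, mitigate, verify, accept). The class,
# fields, and severity vocabulary are invented for this sketch.
SEVERITIES = ("Catastrophic", "Critical", "Marginal", "Negligible")

class HazardRecord:
    def __init__(self, hazard_id, description, severity):
        if severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {severity}")
        self.hazard_id = hazard_id
        self.description = description   # hazard found by systematic analysis
        self.severity = severity         # assessed severity of the hazard
        self.mitigations = []            # identified mitigation techniques
        self.verified = False            # set once risk reduction is verified
        self.accepted_by = None          # residual-risk acceptance authority

    def close(self, authority):
        """Close out the record: mitigation must exist and be verified
        before the residual mishap risk can be formally accepted."""
        if not (self.mitigations and self.verified):
            raise RuntimeError(f"{self.hazard_id}: risk reduction not verified")
        self.accepted_by = authority
```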
This process is identical to the process described in the preceding versions of the standard, without specifying programmatic particulars. The process described in this Handbook meets the requirements and intent of MIL-STD-882D. Succeeding paragraphs in this Handbook describe its relationship to MIL-STDs-882B and 882C, since these invoke specific tasks as part of the system safety analysis process. The tasks, while no longer part of MIL-STD-882D, still reside in the Defense Acquisition Deskbook (DAD). The integration of this Handbook into DAD will include links to the appropriate tasks.

A caveat for those managing contracts: a PM should not blindly accept a developer's proposal to make a no-cost change to replace earlier versions of the 882 series standard with MIL-STD-882D. This could have significant implications in the conduct of the safety program, preventing the PM and his or her safety team from obtaining the specific data required to evaluate the system and its software.

2.4.1.3.4 DOD-STD-2167A

Although MIL-STD-498 replaced DOD-STD-2167A, Military Standard Defense System Software Development, February 29, 1988, it remains on numerous older contracts within the DOD. This standard establishes the uniform requirements for software development that are applicable throughout the system life cycle. The requirements of this standard provide the basis for government insight into a contractor's software development, testing, and evaluation efforts. The specific requirement of the standard, which establishes a system safety interface with the software development process, is as follows:

Paragraph 4.2.3, Safety Analysis: The contractor shall perform the analysis necessary to ensure that the software requirements, design, and operating procedures minimize the potential for hazardous conditions during the operational mission. Any potentially hazardous conditions or operating procedures shall be clearly defined and documented.

2.4.1.3.5 MIL-STD-498

MIL-STD-498,2 Software Development and Documentation, December 5, 1994, Paragraph 4.2.4.1, establishes an interface with system safety engineering and defines the safety activities required for incorporation into software development throughout the acquisition life cycle. This standard merges DOD-STD-2167A and DOD-STD-7935A to define a set of activities and documentation suitable for the development of both weapon systems and automated information systems. Other changes include improved compatibility with incremental and evolutionary development models; improved compatibility with non-hierarchical design methods; improved compatibility with Computer-Aided Software Engineering (CASE) tools; alternatives to, and more flexibility in, preparing documents; clearer requirements for incorporating reusable software; introduction of software management indicators; added emphasis on software support; and improved links to systems engineering. This standard can be applied in any phase of the system life cycle.

2 IEEE 1498, Information Technology - Software Development and Documentation, is the demilitarized version of MIL-STD-498 for use in commercial applications.

Paragraph 4.2.4.1, Safety Assurance: The developer shall identify as safety-critical those Computer Software Configuration Items (CSCI) or portions thereof whose failure could lead to a hazardous system state (one that could result in unintended death, injury, loss of property, or environmental harm). If there is such software, the developer shall develop a safety assurance strategy, including both tests and analyses, to assure that the requirements, design, implementation, and operating procedures for the identified software minimize or eliminate the potential for hazardous conditions. The strategy shall include a software safety program that shall be integrated with the SSP if one exists. The developer shall record the strategy in the software development plan (SDP), implement the strategy, and produce evidence, as part of required software products, that the safety assurance strategy has been carried out.

In the case of reusable software products [this includes Commercial Off-The-Shelf (COTS) software], MIL-STD-498 states:

Appendix B, B.3, Evaluating Reusable Software Products, (b.): General criteria shall be the software product's ability to meet specified requirements and to be cost effective over the life of the system. Non-mandatory examples of specific criteria include, but are not limited to: ...b. Ability to provide required safety, security, and privacy.

2.4.2 Other Government Agencies

Outside the DOD, other governmental agencies are not only interested in the development of safe software, but are aggressively pursuing the development or adoption of new regulations, standards, and guidance for establishing and implementing software SSPs for their developing systems. Those governmental agencies expressing an interest and actively participating in the development of this Handbook are identified below. Also included is the authoritative documentation used by these agencies which establishes the requirement for a SwSSP.

2.4.2.1 Department of Transportation

2.4.2.1.1 Federal Aviation Administration

FAA Order 1810, Acquisition Policy, establishes general policies and the framework for acquisition for all programs that require operational or support needs for the FAA. It implements the Department of Transportation (DOT) Major Acquisition Policy and Procedures (MAPP) in its entirety and consolidates the contents of more than 140 FAA Orders, standards, and other references. FAA Order 8000.70, FAA System Safety Program, requires that the FAA SSP be used, where applicable, to enhance the effectiveness of FAA safety efforts through the uniform approach of system safety management and engineering principles and practices.3
A significant FAA safety document is RTCA/DO-178B, Software Considerations in Airborne Systems and Equipment Certification. Important points from this resource are as follows:

Paragraph 1.1, Purpose: The purpose of this document is to provide guidelines for the production of software for airborne systems and equipment that performs its intended function with a level of confidence in safety that complies with airworthiness requirements.

Paragraph 2.1.1, Information Flow from System Processes to Software Processes: The system safety assessment process determines and categorizes the failure conditions of the system. Within the system safety assessment process, an analysis of the system design defines safety-related requirements that specify the desired immunity from, and system responses to, these failure conditions. These requirements are defined for hardware and software to preclude or limit the effects of faults, and may provide fault detection and fault tolerance. As decisions are being made during the hardware design process and software development processes, the system safety assessment process analyzes the resulting system design to verify that it satisfies the safety-related requirements. The safety-related requirements are inputs to the software life cycle process. To ensure that they are properly implemented, the system requirements typically include or reference:

The system description and hardware definition;
Certification requirements, including Federal Aviation Regulations (United States), Joint Aviation Regulations (Europe), Advisory Circulars (United States), etc.;
System requirements allocated to software, including functional requirements, performance requirements, and safety-related requirements;
Software level(s) and data substantiating their determination, failure conditions, their Hazard Risk Index (HRI) categories, and related functions allocated to software;
Software strategies and design constraints, including design methods, such as partitioning, dissimilarity, redundancy, or safety monitoring; and
If the system is a component of another system, the safety-related requirements and failure conditions for that system.
System life cycle processes may specify requirements for software life cycle processes to aid system verification activities.

2.4.2.1.2 Coast Guard

COMDTINST M41150.2D, Systems Acquisition Manual, December 27, 1994 (the SAM), establishes policy, procedures, and guidance for the administration of Coast Guard major acquisition projects. The SAM implements the DOT MAPP in its entirety. The System Safety Planning section of the SAM requires the use of MIL-STD-882C in all Level I, IIIA, and IV
acquisitions. The SAM also outlines system hardware and software requirements in the Integrated Logistics Support Planning section of the manual. Using MIL-STD-498 as a foundation, the Coast Guard has developed a Software Development and Documentation Standards document (Draft, May 1995) for internal Coast Guard use. The important points from this document are as follows:

Paragraph 1.1, Purpose: The purpose of this standard is to establish Coast Guard software development and documentation requirements to be applied during the acquisition, development, or support of the software system.

Paragraph 1.2, Application: This standard is designed to be contract specific, applying to contractors or any other government agency developing software for the Coast Guard.

Paragraph 1.2.3, Safety Analysis: Safety shall be a principal concern in the design and development of the system and its associated software development products. This standard will require contractors to develop a software safety program, integrating it with the SSP. This standard also requires the contractor to perform safety analysis on software to identify, minimize, or eliminate hazardous conditions that could potentially affect operational mission readiness.

2.4.2.1.3 Aerospace Recommended Practice

The Society of Automotive Engineers provides two standards representing Aerospace Recommended Practice (ARP) to guide the development of complex aircraft systems. ARP4754 presents guidelines for the development of highly integrated or complex aircraft systems, with particular emphasis on electronic systems. While safety is a key concern, the advice covers the complete development process. The standard is designed for use with ARP4761, which contains detailed guidance and examples of safety assessment procedures. These standards could be applied across application domains, but some aspects are avionics specific.4

The avionics risk assessment framework is based on Development Assurance Levels (DAL), which are similar to the Australian Defense Standard Def(Aust) 5679 Safety Integrity Levels (SIL). Each functional failure condition identified under ARP4754 and ARP4761 is assigned a DAL based on the severity of the effects of the failure condition identified in the Functional Hazard Assessment. However, the severity corresponds to levels of aircraft controllability rather than direct levels of harm. As a result, the likelihood of accident sequences is not considered in the initial risk assessment. The DAL of an item in the design may be reduced if the system architecture:
Provides multiple implementations of a function (redundancy),
Isolates potential faults in part of the system (partitioning),
Provides for active (automated) monitoring of the item, or
Provides for human recognition or mitigation of failure conditions.

4 International Standards Survey and Comparison to Def(Aust) 5679, Document ID: CA38809-101, Issue 1.1, dated 12 May 1999, pg 3.
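As a concrete, deliberately simplified illustration of this architecture-based reduction, the sketch below lowers an assigned DAL one level per justified mitigation. The one-level-per-mitigation rule and the Level D floor are assumptions made for this example, not rules taken from ARP4754; the actual criteria and their justification come from the preliminary system safety assessment.

```python
# Simplified sketch of DAL reduction via architecture mitigations.
# The reduction rule below is an assumption for illustration only.
DAL_ORDER = ["A", "B", "C", "D", "E"]   # Level A is the most demanding

ARCH_MITIGATIONS = {"redundancy", "partitioning",
                    "monitoring", "human_recognition"}

def reduced_dal(assigned_dal, justified_mitigations):
    """Lower the DAL one level per justified architecture mitigation,
    never below Level D (and never raising the level) in this sketch."""
    steps = len(ARCH_MITIGATIONS & set(justified_mitigations))
    idx = DAL_ORDER.index(assigned_dal)
    floor = DAL_ORDER.index("D")
    return DAL_ORDER[min(idx + steps, max(idx, floor))]

print(reduced_dal("A", ["redundancy", "partitioning"]))  # prints "C"
```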
Detailed guidance is given on these issues. Justification of the reduction is provided by the preliminary system safety assessment. DALs are provided with equivalent numerical failure rates so that quantitative assessments of risk can be made. However, it is acknowledged that the effectiveness of particular design strategies cannot always be quantified and that qualitative judgments are often required. In particular, no attempt is made to interpret the assurance levels of software in probabilistic terms. Like Def(Aust) 5679, the software assurance levels are used to determine the techniques and measures to be applied in the development processes. When the development is sufficiently mature, actual failure rates of hardware components are estimated and combined by the System Safety Assessment (SSA) to provide an estimate of the functional failure rates. The assessment should determine if the corresponding DAL has been met. To achieve its objectives, the SSA suggests Failure Modes and Effects Analysis and Fault Tree Analysis (FTA), which are described in the appendices of ARP4761.5

2.4.2.2 National Aeronautics and Space Administration

NASA has been developing safety-critical, software-intensive aeronautical and space systems for many years. To support the required planning of software safety activities on these research and operational procurements, NASA published NASA Safety Standard (NSS) 1740.13, Interim, Software Safety Standard, in June 1994. The purpose of this standard is to provide requirements to implement a systematic approach to software safety as an integral part of the overall SSP. It describes the activities necessary to ensure that safety is designed into software that is acquired or developed by NASA and that safety is maintained throughout the software life cycle. Several DOD and military standards, including DOD-STD-2167A, Defense System Software Development, and MIL-STD-882C, System Safety Program Requirements, influenced the development of this NASA standard. The defined purpose of NSS 1740.13 is as follows:

To ensure that software does not cause or contribute to a system reaching a hazardous state,
That it does not fail to detect or take corrective action if the system reaches a hazardous state, and
That it does not fail to mitigate damage if an accident occurs.
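Read together, these three objectives amount to a completeness check on a system's software safety requirements: every hazardous state needs coverage in all three categories. The following is a hypothetical sketch of such a check; the objective categories mirror the purpose statement above, while the data layout and names are invented for the example.

```python
# Hypothetical completeness check against the three objectives above.
OBJECTIVES = {"must_not_cause", "must_detect", "must_mitigate"}

def coverage_gaps(hazardous_states, ssr_coverage):
    """Return (state, missing objectives) pairs for every hazardous state
    whose requirements do not address all three objectives."""
    gaps = []
    for state in hazardous_states:
        missing = OBJECTIVES - ssr_coverage.get(state, set())
        if missing:
            gaps.append((state, missing))
    return gaps

print(coverage_gaps(
    ["inadvertent thruster firing"],                       # invented example
    {"inadvertent thruster firing": {"must_not_cause", "must_detect"}}))
# -> [('inadvertent thruster firing', {'must_mitigate'})]
```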
2.4.3 Commercial

Unlike the historical relationship established between DOD agencies and their contractors, commercial companies are not obligated to a specified, quantifiable level of safety risk
management on the products they produce (unless contractually obligated through a subcontract arrangement with another company or agency). Instead, they are primarily motivated by economic, ethical, and legal liability factors. For those commercial companies that are motivated or compelled to pursue the elimination or control of safety risk in software, several commercial standards are available to provide guidance. This Handbook will reference only a few of the most popular. While these commercial standards are readily accessible, few provide the practitioner with a defined software safety process or the how-to guidance required to implement the process.

2.4.3.1 Institute of Electrical and Electronic Engineering

The Institute of Electrical and Electronics Engineers (IEEE) published IEEE STD 1228-1994, IEEE Standard for Software Safety Plans, for the purpose of describing the minimum acceptable requirements for the content of a software safety plan. This standard contains four clauses. Clause 1 discusses the application of the standard. Clause 2 lists references to other standards. Clause 3 provides a set of definitions and acronyms used in the standard. Clause 4 contains the required content of a software safety plan. An informative annex is included and discusses software safety analyses. IEEE STD 1228-1994 is intended to be wholly voluntary and was written for those who are responsible for defining, planning, implementing, or supporting software safety plans. This standard closely follows the methodology of MIL-STD-882B, Change Notice 1.

2.4.3.2 Electronic Industries Association

The Electronic Industries Association (EIA) G-48 System Safety Committee published Safety Engineering Bulletin No. 6B, System Safety Engineering In Software Development, in 1990. The G-48 System Safety Committee has as its interest the procedures, methodology, and development of criteria for the application of system safety engineering to systems, subsystems, and equipment. The purpose of the document is ...to provide guidelines on how a system safety analysis and evaluation program should be conducted for systems which include computer-controlled or -monitored functions. It addresses the problems and concerns associated with such a program, the processes to be followed, the tasks which must be performed, and some methods which can be used to effectively perform those tasks.

2.4.3.3 International Electrotechnical Commission

The International Electrotechnical Commission (IEC) has submitted a draft International Standard (IEC-61508), December 1997, which is primarily concerned with safety-related control systems incorporating Electrical/Electronic/Programmable Electronic Systems (E/E/PES). It also provides a framework which is applicable to safety-related systems irrespective of the technology on which those systems are based (e.g., mechanical, hydraulic, or pneumatic). Although some parts of the standard are in draft form, it is expected to be approved for use in 1999. The draft International Standard has two concepts which are fundamental to its application, namely a Safety Life Cycle and SILs. The Overall Safety Life Cycle is introduced in Part 1 and forms the
central framework which links together most of the concepts in this draft International Standard.6 The draft International Standard (IEC-61508) consists of seven parts:

Part 1: General Requirements
Part 2: Requirements for E/E/PES
Part 3: Software Requirements
Part 4: Definitions
Part 5: Guidelines on the Application of Part 1
Part 6: Guidelines on the Application of Part 2 and Part 3
Part 7: Bibliography of Techniques

The draft standard addresses all relevant safety life cycle phases when E/E/PES are used to perform safety functions. It has been developed with a rapidly developing technology in mind. The framework in this standard is considered to be sufficiently robust and comprehensive to cater to future developments.
Each LOT defines the desired level of confidence that the corresponding system safety requirement will be met. Next, one of seven SILs is assigned to each Component Safety Requirement (CSR), indicating the level of rigor required to meet the CSR. By default, the SIL of the CSR is the same as the LOT of the system safety requirement corresponding to the CSR. However, the default SIL may be reduced by up to two levels by implementing fault-tolerant measures in the design to reduce the likelihood of the corresponding hazard. As the standard prohibits allocation of probabilities to hazards, this is based on a qualitative argument.8

2.5.2 United Kingdom Defense Standard 00-55 & 00-54

United Kingdom (UK) DEF STAN 00-55 describes requirements and guidelines for procedures and technical practices in the development of safety-related software. The standard applies to all phases of the procurement life cycle. Interim UK DEF STAN 00-54 describes requirements for the procurement of safety-related electronic hardware, with particular emphasis on the procedures required in various phases of the procurement life cycle. Both standards are designed to be used in conjunction with DEF STAN 00-56.9 DEF STANs 00-55 and 00-54 require risk assessment to be conducted in accordance with DEF STAN 00-56. DEF STAN 00-55 explicitly mentions that software diversity may, if justified, reduce the required SIL of the application being developed.10

2.5.3 United Kingdom Defense Standard 00-56

UK DEF STAN 00-56 provides requirements and guidelines for the development of all defense systems. The standard applies to all systems engineering phases of the project life cycle and to all systems, not just computer-based ones.11 In DEF STAN 00-56, accidents are classified as belonging to one of four severity categories and one of six probability categories. The correspondence between probability categories and actual probabilities must be stated and approved by the Independent Safety Auditor (ISA). Using these classifications, a risk class is assigned to each accident using a matrix approved by the ISA before hazard analysis activities begin. For systematic (as opposed to random) failures, the SIL (or actual data if available) determines the minimum failure rate that may be claimed of the function developed according to the SIL; such failure rates must be approved by the ISA. Accidents in the highest risk class (A) are regarded as unacceptable, while probability targets are set for accidents in the next two risk classes (B and C). Accidents in the lowest risk class are regarded as tolerable. Accident probability targets are regarded as having a systematic and a random component.

8 International Standards Survey and Comparison to Def(Aust) 5679, Document ID: CA38809-101, Issue 1.1, dated 12 May 1999, pg 26-27.
9 Ibid., pg 3.
10 Ibid., pg 27.
11 Ibid., pg 3.
The consideration of accident probability targets and accident sequences determines the hazard probability targets, with systematic and random components. These hazard probability targets must be approved by the ISA. DEF STAN 00-56 recommends conducting a Safety Compliance Assessment using techniques such as FTA. If the hazard probability target cannot be met for risk class C, then risk reduction techniques such as redesign, safety or warning features, or special operator procedures must be introduced. If risk reduction is impracticable, then risk class B may be used with the approval of the Project Safety Committee.12
12 International Standards Survey and Comparison to DEF(AUST) 5679, Document ID: CA38809-101, Issue 1.1, dated 12 May 1999, pg 27.
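Mechanically, the classification step described above is a table lookup of severity against probability. The sketch below shows the shape of such a lookup; the category names and the matrix contents are invented placeholders, since DEF STAN 00-56 requires the actual matrix to be approved by the Independent Safety Auditor before hazard analysis begins.

```python
# Shape of a DEF STAN 00-56-style risk classification: four severity
# categories by six probability categories mapping to risk classes A-D.
# All names and assignments below are invented placeholders; the real
# matrix must be approved by the ISA for each project.
SEVERITIES = ["Catastrophic", "Critical", "Marginal", "Negligible"]
PROBABILITIES = ["Frequent", "Probable", "Occasional",
                 "Remote", "Improbable", "Incredible"]

RISK_MATRIX = [  # rows: severity; columns: probability
    ["A", "A", "A", "B", "C", "C"],   # Catastrophic
    ["A", "A", "B", "C", "C", "D"],   # Critical
    ["B", "C", "C", "C", "D", "D"],   # Marginal
    ["C", "C", "D", "D", "D", "D"],   # Negligible
]

def risk_class(severity, probability):
    return RISK_MATRIX[SEVERITIES.index(severity)][
        PROBABILITIES.index(probability)]

# Per the text above: class A is unacceptable, probability targets are
# set for classes B and C, and the lowest class is tolerable.
assert risk_class("Catastrophic", "Remote") == "B"
```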
the problem from a vantage point and perspective from within the confines of their respective area of expertise. In many instances, this view was analogous to the view seen when looking down a tunnel. The responsibilities of, and the interfaces with, other management and engineering functions were often distorted due to individual or organizational biases. Part of the problem is that SSS is still a relatively new discipline, with methodologies, techniques, and processes that are still being researched and evaluated in terms of logic and practicality for software development activities. As with any new discipline, the problem must be adequately defined prior to the application of recommended practices.

2.6.2.1 Within System Safety

From the perspective of most of the system safety community, digital control of safety-critical functions introduced a new and unwanted level of uncertainty to a historically sound hazard analysis methodology for hardware. Many within system safety were unsure of how to integrate software into the system safety processes, techniques, and methods then in use. System safety managers and engineers, educated in the 1950s, 60s, and 70s, had little or no computer- or software-related education or experience. This compounded their reluctance, and in many cases their inability, to address the problem.

In the late 1970s and early 1980s, bold individuals within the safety, software, and research (academia) communities took their first steps in identifying and addressing the safety risks associated with software. Although these individuals may not have been in total lock step and agreement, they did, in fact, lay the necessary foundation for where we are today. It was during this period that MIL-STD-882B was developed and published. This was the first military standard to require that the developing contractor perform SSS engineering and management activities and tasks. However, due to the distinct lack of cooperation and communication between the system safety and software engineering disciplines in defining a workable process for identifying and controlling software-related hazards in developing systems, the majority of system safety professionals waited for academia or the software engineering community to develop a silver bullet analysis methodology or tool. It was their hope that such an analytical technique or verification tool could be applied to finished software code to identify any fault paths to hazard conditions, which could then be quickly corrected prior to delivery. This concept did not include the identification of system hazards and failure modes caused (or influenced) by software inputs, or the identification of safety-specific requirements to mitigate these hazards and failure modes. Note that there is as yet no silver bullet, and there will probably never be one. Even if a silver bullet existed, it would be used too late in the system development life cycle to influence design.

To further obscure the issue, the safety community within DOD finally recognized that contractors developing complex hardware and software systems must perform software safety tasks. As a result, contracts from that point forward included tasks that incorporated software into the system safety process. The contractor was now forced to propose, bid, and perform software safety tasks with relatively little guidance.
Those with software safety tasks on contract were in a desperate search for any tool, technique, or method that would assist them in meeting their contractual requirements. This was demonstrated by a sample population survey conducted in
1988 involving software and safety engineers and managers.13 When these professionals were asked to identify the tools and techniques that they used to perform contractual obligations pertaining to software safety, they provided answers that ranged widely across the analytical spectrum. Of 148 surveyed, 55 provided responses. These answers are provided in Table 2-1. It is interesting to note that of all respondents to the survey, only five percent felt that they had accomplished anything meaningful in terms of reducing the safety risk of the software analyzed.

Table 2-1: Survey Response

Fault Tree Analysis
Software Preliminary Hazard Analysis
Traceability Analysis
Failure Modes & Effects Analysis
Requirements Modeling/Analysis
Source Code Analysis
Test Coverage Analysis
Cross Reference Tools
Code/Module Walkthrough
Sneak Circuit Analysis
Emulation
Subsystem Hazard Analysis
Failure Mode Analysis
Prototyping
Design and Code Inspections
Checklist of Common SW Errors
Data Flow Techniques
Hierarchy Tool
Compare & Certification Tool
System Cross Check Matrices
Top-Down Review of Code
Software Matrices
Thread Analysis
Petri-Net Analysis
Software Hazard List
BIT/FIT Plan
Nuclear Safety Cross-Check Analysis
Mathematical Proof
Software Fault Hazard Analysis
MIL-STD-882B, Series 300 Tasks
Topological Network Trees
Critical Function Flows
Black Magic
The information provided in Table 2-1 demonstrated the lack of any standardized approach for the accomplishment of software safety tasks that were levied contractually. It also appeared as if the safety engineer either tried to accomplish the required tasks using a standard system safety approach, or borrowed the most logical tool available from the software development group. In either case, they remained unconvinced of the utility of their efforts in reducing the safety risk of the software performing in their system.

2.6.2.2 Within Software Development

Historically, the software development and engineering community made about as much progress addressing the software safety issue as did system safety. Although most software development managers recognized the safety risk potential that the software posed within their systems, few possessed credible means or methods for both minimizing the risk potential and verifying that safety specification requirements had been achieved in the design. Most failed to include system safety engineering in software design and development activities, and many did not recognize that this interface was either needed or required.
13 Mattern, Steven F., Software System Safety, Master's Thesis, Department of Computer Resource Management, Webster University, December 1988.
A problem which still exists today is that most educational institutions do not teach students in computer science and software engineering that there is a required interface with safety engineering when software is integrated into a potentially hazardous system. Although the software engineer may implement a combination of fault avoidance, fault removal, and/or fault tolerance techniques in the design, code, or test of software, they usually fail to tie the fault or error potential to a specific system hazard or failure mode. While these efforts most likely increase the overall reliability of the software, many fail to verify that the safety requirements of the system have been implemented to an acceptable level. It is essential that the software development community understand the needed interface with system safety, and that system safety understand its essential interface with software development.

2.6.3 Management Responsibilities

The ultimate responsibility for the development of a safe system rests with program management. The commitment of qualified people and an adequate budget and schedule for a software development program must begin with the program director or PM. Top management must be a strong voice of safety advocacy and must communicate this personal commitment to each level of program and technical management. The PM must be committed to supporting the integrated safety process within systems engineering and software engineering in the design, development, test, and operation of the system software. Figure 2-1 graphically portrays the managerial element for the integrated team.
Figure 2-1: Management Commitment to the Integrated Safety Process (the figure shows the Program Management block above the Software Engineering and Systems Engineering blocks)

2.6.4 Introduction to the Systems Approach

System safety engineering has historically demonstrated the benefits of a systems approach to safety risk analysis and mitigation. When a hazard analysis is conducted on a hardware subsystem as a separate entity, it will produce a set of unique hazards applicable only to that subsystem. However, when that same subsystem is analyzed in the context of its physical,
functional, and zonal interfaces with the rest of the system components, the analysis will likely produce numerous other hazards which were not discovered by the original analysis. Conversely, the results of a system analysis may demonstrate that hazards identified in the subsystem analysis were either reduced or eliminated by other components of the system. Regardless, the identification of critical subsystem interfaces (such as software), with their associated hazards, is a vital aspect of safety risk minimization for the total system.

When analyzing software that performs and/or controls safety-critical functions within a system, a systems approach is also required. The success of a software safety program is predicated on it. Today's software is a very critical component of the safety risk potential of systems being developed and fielded. Not only are the internal interfaces of the system important to safety, but so are the external interfaces. Figure 2-2 depicts specific software internal interfaces within the system block (within the ovals) and also external software interfaces to the system. Each identified software interface may possess safety risk potential to the operators, maintainers, environment, or the system itself. The acquisition and development process must consider these interfaces during the design of both the hardware and software systems. To accomplish this, the hardware and software development life cycles must be fully understood and integrated by the design team.
Figure 2-2: Example of Internal System Interfaces (the figure depicts a computer-based system comprising hardware, software, firmware, procedures, documentation, a database, and people, with inputs to the system, outputs, and the system environment; non-embedded system software includes test program sets, data reduction, crew training simulators, logistics support, maintenance trainers, test equipment, program management, mission planning, scenario analysis, battle management, engineering, and software development)

2.6.4.1 The Hardware Development Life Cycle

The typical hardware development life cycle has been in existence for many years. It is a proven acquisition model which has produced, in most instances, the desired engineering results in the design, development, manufacturing, fabrication, and test activities. It consists of five phases: concept exploration and definition, demonstration and validation, engineering and manufacturing development, production and deployment, and operations and support. Each phase of the life cycle ends, and the next phase begins, with a milestone
decision point (0, I, II, III, and IV). An assessment of the system design and program status is made at each milestone decision point, and plans are made or reviewed for subsequent phases of the life cycle. Specific activities conducted for each milestone decision are covered in numerous system acquisition management courses and documents. Therefore, they will not be discussed in greater detail in this Handbook.

The purpose of introducing the system life cycle in this Handbook is to familiarize the reader with a typical life cycle model. The one shown in Figure 2-3 is used in most DOD procurements. It identifies and establishes defined phases for the development life cycle of a system and can be overlaid on a proposed timetable to establish a milestone schedule. Detailed information regarding milestones and phases of a system life cycle can be obtained from Defense Systems Management College (DSMC) documentation, and from the systems acquisition management course documentation of the individual services.

Figure 2-3: Weapon System Life Cycle (per DoDI 5000.2: Mission Needs Analysis; Phase 0, Concept Exploration & Definition; Phase I, Demonstration & Validation; Phase II, Engineering and Manufacturing Development; Phase III, Production & Deployment; Phase IV, Operations & Support)

2.6.4.2 The Software Development Life Cycle

The system safety team must be fully aware of the software life cycle being used by the development activity. In the past several years, numerous life cycle models have been identified, modified, and used in some capacity on a variety of software development programs. This Handbook will not enter into a discussion of the merits and limitations of different life cycle process models, because the software engineering team must make the decision for or against a model for an individual procurement. The important issue here is for the system safety team to recognize which model is being used, and how to correlate and integrate safety activities with the chosen software development model to achieve the desired outcomes and safety goals. Several different models will be presented to introduce examples to the reader.

Figure 2-4 is a graphical representation of the relationship of the software development life cycle to the system/hardware development life cycle. Note that the software life cycle graphic shown in Figure 2-4 portrays the DOD-STD-2167A software life cycle, which was replaced by MIL-STD-498, dated December 5, 1994. The minor changes made to the software life cycle by MIL-STD-498 are also shown. Notice also that the model is representative of the Waterfall, or Grand Design, life cycle. While this model is still being used on numerous procurements, other models are more representative of current software development schemes, such as the Spiral and Modified V software development life cycles.

It is important to recognize that the software development life cycle does not correlate exactly with the hardware (system) development life cycle. It lags behind the hardware development
at the beginning but finishes before the hardware development is completed. It is also important to realize that specific design reviews for hardware usually lag behind those required for software. The implications will be discussed in Section 4 of this Handbook.
Figure 2-4: Relationship of Software to the Hardware Development Life Cycle (the figure overlays the software activities of system design, preliminary design (PD), detailed design (DD), code and CSU test, integration test, and CSCI test on the DODI 5000.2R phases and milestones (MS 0 through MS 4), together with hardware prototype design, manufacturing, interoperability testing, operational evaluation (OPEVAL), and operations and support activities such as maintenance, product improvement programs (PIPs), technical reviews, obsolescence, copy media recovery and distribution, and system upgrades)

2.6.4.2.1 Grand Design, Waterfall Life Cycle Model14

The Waterfall software acquisition and development life cycle model is the oldest in terms of use by software developers. This strategy usually uses DOD-STD-2167A terminology and ...was conceived during the early 1970s as a remedy to the code-and-fix method of software development. Grand Design places emphasis on up-front documentation during early development phases, but does not support modern development practices such as prototyping and automatic code generation. With each activity a prerequisite for succeeding activities, this strategy is a risky choice for unprecedented systems because it inhibits flexibility. Another limitation of the model is that after a single pass through the model, the system is complete. Therefore, many integration problems are identified much too late in the development process to be corrected without significant cost and schedule impacts. In terms of software safety, interface issues must be identified and rectified as early as possible in the development life cycle to be adequately corrected and verified.

Figure 2-5 is a representation of the Grand Design, or Waterfall, life cycle model. The Waterfall model is not recommended for large, software-intensive systems. This is due to the limitations stated above and the inability to effectively manage program risks, including safety risk, during the software development process. The Grand Design does, however, provide a structured and well-disciplined method for software development.
14 The following descriptions of the software acquisition life cycle models are either quoted or paraphrased from the Guidelines for Successful Acquisition and Management of Software Intensive Systems, Software Technology Support Center (STSC), September 1994, unless otherwise noted.
Figure 2-5: Grand Design Waterfall Software Acquisition Life Cycle Model (the figure shows the sequential flow from design through coding to operation)

2.6.4.2.2 Modified V Life Cycle Model

The Modified V software acquisition life cycle model is another example of a defined method for software development. It is depicted in Figure 2-6. This model is heavily weighted toward the ability to design, code, prototype, and test in increments of design maturity. The left side of the figure identifies the specification, design, and coding activities for developing software. It also indicates when the test specification and test design activities can start. For example, the system/acceptance tests can be specified and designed as soon as software requirements are known. The integration tests can be specified and designed as soon as the software design structures are known. And the unit tests can be specified and designed as soon as the code units are prepared.15 The right side of the figure identifies when the evaluation activities occur that are involved with the execution and testing of the code at its various stages of evolution.
15 Software Test Technologies Report, August 1994, STSC, Hill Air Force Base, UT 84056.
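The pairing just described, in which each left-leg artifact releases a right-leg test activity, can be sketched as a simple lookup. The artifact and test names follow the paragraph above, while the function and data names are invented for the example.

```python
# Pairing of left-leg artifacts to the test activities they enable,
# as described for the Modified V model above. Names are illustrative.
V_PAIRING = {
    "software requirements": "system/acceptance test specification and design",
    "software design structures": "integration test specification and design",
    "code units": "unit test specification and design",
}

def tests_ready(available_artifacts):
    """Return the test activities that may begin, given the artifacts so far."""
    return [test for artifact, test in V_PAIRING.items()
            if artifact in available_artifacts]

print(tests_ready({"software requirements"}))
# -> ['system/acceptance test specification and design']
```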
Figure 2-6: Modified V Software Acquisition Life Cycle Model (the left leg of the V runs from requirements specification through design to code; the right leg covers the corresponding test and evaluation activities)

2.6.4.2.3 Spiral Life Cycle Model

The Spiral acquisition life cycle model provides a risk-reduction approach to the software development process. In the Spiral model, Figure 2-7, the radial distance is a measure of effort expended, while the angular distance represents progress made. It combines features of the Waterfall and the incremental prototype approaches to software development. Spiral development emphasizes evaluation of alternatives and risk assessment. These are addressed more thoroughly than with other strategies. A review at the end of each phase ensures commitment to the next phase or identifies the need to rework a phase if necessary. The advantages of Spiral development are its emphasis on procedures, such as risk analysis, and its adaptability to different development approaches. If Spiral development is employed with demonstrations and Baseline/Configuration Management (CM), you can get continuous user buy-in and a disciplined process.16
16 Guidelines for Successful Acquisition and Management of Software Intensive Systems, STSC, September 1994.
Figure 2-7: Spiral Software Acquisition Life Cycle Model (the spiral passes repeatedly through objectives and limits, risk analysis, prototyping (Prototype #1 through an operational prototype), development steps such as software requirements, software product design, implementation design, and integration test, and customer evaluation)

Within the DOD, an Ada Spiral Model Environment is being considered for some procurements where the Ada language is being used. It combines a model and a tool environment such that it offers the ability to have continual touch-and-feel of the software product (as opposed to paper reports and descriptions). This model represents a demonstration-based process that employs a top-down incremental approach, resulting in early and continuous design and implementation validation. Advantages of this approach are that it is built from the top down; that it supports partial implementation; that the structure is automated, real, and evolved; and that each level of development can be demonstrated. Each build and subsequent demonstration validates the process and the structure against the previous build.

2.6.4.3 The Integration of Hardware and Software Life Cycles

The life cycle process of system development was instituted so managers would not be forced to make snap decisions. A structured life cycle, complete with controls, audits, reviews, and key decision points, provides a basis for sound decision making based on knowledge, experience, and training. It is a logical flow of events representing an orderly progression from a user need to final activation, deployment, and support.

The systems approach to software safety engineering must support a structured, well-disciplined, and adequately documented system acquisition life cycle model that incorporates both the system development model and the software development model. Program plans (as described in Appendix C, Section C.7) must describe in detail how each engineering discipline will interface and perform within the development life cycle. It is recommended that the reader refer back to Figure 2-4 and review the integration of the hardware and software development life
cycle models. Graphical representations of the life cycle model of choice for a given development activity must be provided during the planning processes. This activity will aid in the planning and implementation processes of software safety engineering. It will allow for the integration of safety-related requirements and guidelines into the design and code phases of software development. It will also assist in the timely identification of safety-specific test and verification requirements to prove that original design requirements have been implemented as they were intended. It further allows for the incorporation of safety inputs to the prototyping activities in order to demonstrate safety concepts.

2.6.5 A Team Solution

The system safety engineer (SSE) cannot reduce the safety risk of systems software alone. The software safety process must be an integrated team effort between engineering disciplines. As previously depicted in Figure 2-1, software, safety, and systems engineering are the pivotal players on the team. The management block is analogous to a conductor who provides the necessary motivation, direction, support, and resources for the team to perform as a well-orchestrated unit. It is the intent of the authors of this Handbook to demonstrate that neither the software developers nor the safety engineers can accomplish the necessary tasks to the level of detail required by themselves. This Handbook will focus on the required tasks of the safety engineer, the software engineer, the software safety engineer, the system and design engineers, and the interfaces between each. Regardless of who executes the individual software safety tasks, each engineer must be intimately aware of the duties, responsibilities, and tasks required from each functional discipline. Each must also understand the time (in terms of life cycle schedule), place (in terms of required audits, meetings, reviews, etc.), and functional analysis tasks that must be produced and delivered at any point in the development process. Section 4 will expand on the team approach in detail as the planning, process tasks, products, and risk assessment tasks are presented. Figure 2-8 uses a puzzle analogy to demonstrate that the software safety approach must establish integration between functions and among engineers. Any piece of the puzzle that is missing from the picture will propagate into unfinished or incomplete software safety work. The elements contributing to a credible and successful software safety engineering program include the following:

• A defined and established system safety engineering process,
• A structured and disciplined software development process,
• An established hardware and software systems engineering process,
• An established hardware/software configuration control process, and
• An integrated SSS Team responsible for the identification, implementation, and verification of safety-specific requirements in the design and code of the software.
[Figure 2-8: Puzzle graphic depicting the elements of software safety, including the Software Safety Engineering Team, an established hardware and software configuration control process, and an established HW & SW systems engineering process.]
Section 3 provides an introduction to system safety management for those readers not familiar with the MIL-STD-882 methods and the approach for the establishment and implementation of a SSP. It also provides an introduction to risk management and how safety risk is an integral part of the risk management function. Section 3 also provides an introduction to, and an overview of, the system acquisition, systems engineering, and software development processes, with guidance for the effective integration of these efforts into a comprehensive system safety process. Section 4 provides the how-to of a baseline software safety program. The authors recognize that not all acquisitions and procurements are similar, nor do they possess the same problems, assumptions, and limitations in terms of technology, resources, development life cycles, and personalities. This section provides the basis for the careful planning and forethought required in establishing, tailoring, and implementing a SwSSP. It is intended as guidance for the practitioner, not as a mindless checklist process.
Figure 2-9: Handbook Layout

Section 4, Software Safety Engineering, is formatted logically (see Figure 2-10) to provide the reader with the steps required for planning, task implementation, and risk assessment and
acceptance for a SSS program. Appendices C.9 through C.11 provide information regarding the management of configuration changes and issues pertaining to software reuse and COTS software packages.
Figure 2-10: Section 4 Format

2.7.1 Planning and Management

Section 4 begins with the planning required to establish a SwSSP. It discusses the program interfaces, contractual interfaces and obligations, safety resources, and program planning and plans. This particular section assists and guides the safety manager and engineer in the required steps of software safety program planning. Although there may be subject areas that are not required for individual product procurements, each area should be addressed and considered in the planning process. It is acceptable to determine that a specific activity or deliverable is not appropriate or necessary for your individual program.

2.7.2 Task Implementation

This is the very heart of the Handbook as applied to implementing a credible software safety program. It establishes a step-by-step baseline of best practices for today's approach to reducing the safety risk of software performing safety-critical and safety-significant functions within a system. A caution at this point is to not consider these process steps completely serial in nature. Although they are presented in a near-serial format (for ease of reading and understanding), there are many activities that will require parallel processing and effort from the safety manager and engineer. Activities as complicated and as interface-dependent as a software
development within a systems acquisition process will seldom have required tasks line up where one task is complete before the next one begins. This is clearly demonstrated by the development of a SSS program and milestone schedule (see paragraph 4.3.1). This section of the Handbook describes the tasks associated with contract and deliverable data development (including methods for tailoring), safety-critical function identification, preliminary and detailed hazard analysis, safety-specific requirements identification, implementation, test and verification, and residual risk analysis and acceptance. It also includes participation in trade studies and design alternatives.

2.7.3 Software Risk Assessment and Acceptance

The risk assessment and acceptance portion of Section 4 focuses on the tasks identifying residual safety risk in the design, test, and operation of the system. It includes the evaluation and categorization of hazards remaining in the system and their impact to operations, maintenance, and support functions. It also includes the graduated levels of programmatic sign-off for hazard and failure mode records of the Subsystem, System, and Operations and Support Hazard Analyses. This section includes the tasks required to identify the hazards remaining in the system, assess their safety risk impact in terms of severity, probability, or software control criticality, and determine the residual safety risk.

2.7.4 Supplementary Appendices

The Handbook's appendices include acronyms, definitions of terms, references, supplemental system safety information, generic safety requirements and guidelines, and lessons learned pertaining to the accomplishment of the SSS tasks.
Risk Perspectives. When discussing risk, we must distinguish between three different standpoints, which are as follows:

• Standpoint of an INDIVIDUAL exposed to a hazard.
• Standpoint of SOCIETY. Besides being interested in guaranteeing minimum individual risk for each of its members, society is concerned about the total risk to the general public.
• Standpoint of the INSTITUTION RESPONSIBLE FOR THE ACTIVITY. The institution responsible for an activity can be a private company or a government agency. From their point of view, it is essential to keep individual risks to employees or other persons, and the collective risk, at a minimum. An institution's concern is also to avoid catastrophic accidents.
The system safety effort is an optimizing process that varies in scope and scale over the lifetime of the system. SSP balances are the result of the interplay between system safety and the three very familiar basic program elements: cost, performance, and schedule. Without an acute awareness of the system safety balance, the PM and the system safety manager cannot discuss when, where, and how much they can afford to spend on system safety. We cannot afford mishaps that will prevent the achievement of the primary mission goal, nor can we afford systems that cannot perform because of overstated safety goals.

Safety Management's Risk Review. The SSP examines the interrelationships of all components of a program and its systems with the objective of bringing mishap risk, or risk reduction, into the management review process for automatic consideration in total program perspective. It involves the preparation and implementation of system safety plans; the performance of system safety analyses on both system design and operations; and risk assessments in support of both management and system engineering activities. The system safety activity provides the manager with a means of identifying what the risk of mishap is, where a mishap can be expected to occur, and what alternate designs are appropriate. Most important, it verifies implementation and effectiveness of hazard control. What is generally not recognized in the system safety community is that there are no safety problems in system design. There are only engineering and management problems, which, if left unresolved, can result in a mishap. When a mishap occurs, then it is a safety problem. Identification and control of mishap risk is an engineering and management function. This is particularly true of software safety risk.
Identified Risk is that risk which has been determined through various analytical techniques. The first task of system safety is to make identified risk as large a piece of the overall pie as practical. The time and costs of analytical efforts, the quality of the safety program, and the state of technology impact the amount of risk identified.
Figure 3-1: Types of Risk

Unacceptable Risk is that risk which cannot be tolerated by the managing activity. It is a subset of identified risk that is either eliminated or controlled.

Residual Risk is the risk left over after system safety efforts have been fully employed. It is sometimes erroneously thought of as being the same as acceptable risk. Residual risk is actually the sum of unacceptable risk (uncontrolled), acceptable risk, and unidentified risk. This is the total risk passed on to the user, and it may contain some unacceptable risk.

Acceptable Risk is the part of identified risk that is allowed to persist without further engineering or management action. It is accepted by the managing activity. However, it is the user who is exposed to this risk.

Unidentified Risk is the risk that has not been determined. It is real and it is important, but it cannot be measured. Some unidentified risk is subsequently determined and measured when a mishap occurs. Some risk is never known.
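The arithmetic among these categories is simple enough to state directly. The following minimal sketch (Python, with purely hypothetical quantities and variable names) expresses the relationships described above; it is an illustration only, not part of any standard.

# Hypothetical bookkeeping of the risk categories described above.
# All quantities are notional; only the relationships matter.

identified_controlled = 40.0    # identified risk that was eliminated or controlled
identified_acceptable = 25.0    # identified risk accepted by the managing activity
identified_uncontrolled = 5.0   # identified but unacceptable risk left uncontrolled
unidentified = 30.0             # risk that is real but has not been determined

# Total risk is everything, identified or not.
total_risk = (identified_controlled + identified_acceptable
              + identified_uncontrolled + unidentified)

# Residual risk (what is actually passed on to the user) is the sum of
# unacceptable (uncontrolled) risk, acceptable risk, and unidentified risk.
residual_risk = identified_uncontrolled + identified_acceptable + unidentified

# Equivalently, residual risk is total risk less what was eliminated or controlled.
assert residual_risk == total_risk - identified_controlled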
and inputs of all individuals involved in the program management and engineering design functional activities. In DOD, this risk management group is usually assigned to (or contained within) the systems engineering group. They are responsible for identifying, evaluating, measuring, and resolving risk within the program. This includes recognizing and understanding the warning signals that may indicate that the program, or elements of the program, is off track. This risk management group must also understand the seriousness of the problems identified and then develop and implement plans to reduce the risk. A risk management assessment must be made early in the development life cycle, and the risks must be continually reevaluated throughout the development life cycle. The members of the risk management group and the methods of risk identification and control should be documented in the program's Risk Management Plan. Risk management17 must consist of three activities:

Risk Planning - This is the process to force organized, purposeful thought to the subject of eliminating, minimizing, or containing the effects of undesirable consequences.

Risk Assessment - This is the process of examining a situation and identifying the areas of potential risk. The methods, techniques, or documentation most often used in risk assessment include the following:

• Systems engineering documents
• Operational Requirements Document
• Operational Concepts Document
• Life cycle cost analysis and models
• Schedule analysis and models
• Baseline cost estimates
• Requirements documents
• Lessons learned files and databases
• Trade studies and analyses
• Technical performance measurements and analyses
• Work Breakdown Structures (WBS)
• Project planning documents

Risk Analysis - This is the process of determining the probability of events and the potential consequences associated with those events relative to the program. The purpose of a risk analysis is to discover the cause, effects, and magnitude of the potential risk, and
17 Selected descriptions and definitions regarding risk management are paraphrased from the DSMC, Systems Engineering Management Guide, January 1990.
to develop and examine alternative actions that could reduce or eliminate these risks. Typical tools or models used in risk analysis include the following:

• Schedule Network Model
• Life Cycle Cost Model
• Quick Reaction Rate/Quantity Cost Impact Model
• System Modeling and Optimization

To further the discussion of program risk, short paragraphs are provided to help define schedule, budget, sociopolitical, and technical risk. Although safety, by definition, is a part of technical risk, it can impact all areas of programmatic risk as described in subsequent paragraphs. This is what ties safety risk to technical and programmatic risk.

3.4.1 Schedule Risk

The master systems engineering and software development schedule for a program contains numerous areas of programmatic risk, such as schedules for new technology development, funding allocations, test site availability, critical personnel availability and rotation, etc. Each of these has the potential for delaying the development schedule and can induce unwarranted safety risk to the program. While these examples are by no means the only sources of schedule risk, they are common to most programs. The risk manager must identify, analyze, and control risks to the program schedule by incorporating positive measures into the planning, scheduling, and coordinating activities for the purpose of minimizing their impact to the development program. To help accomplish these tasks, the systems engineering function maintains the Systems Engineering Master Schedule (SEMS) and the Systems Engineering Detailed Schedule (SEDS). Maintaining these schedules helps to guide the interface between the customer and the developer, provides the cornerstone of the technical status and reporting process, and provides a disciplined interface between engineering disciplines and their respective system requirements. An example of the integration, documentation, tracking, and tracing of risk management issues is depicted in Figure 3-2. Note that the SEMS and SEDS schedules and the risk management effort are supported by a risk issue table and risk management database. These tools assist the risk manager in the identification, tracking, categorization, presentation, and resolution of managerial and technical risk (a minimal sketch of such a risk-issue record follows below). Software developers for DOD customers or agencies are said to have maintained a perfect record to date. That is, they have never yet delivered a completed (meets all user/program requirements and specifications) software package on time.18 While this may be arguable, the inference is worthy of consideration. It implies that schedule risk is an important issue on a software development program. The schedule can become the driving factor forcing the delivery of an immature and improperly tested critical software product to the customer. The risk manager, in concert with the safety manager, must ensure that the delivered product does not introduce safety risk to the user, system, maintainer, or the environment that is considered unacceptable. This is
18 Paraphrased from comments made at the National Workshop on Software Logistics, 1989.
accomplished by the implementation of a SwSSP (and safety requirements) early in the software design process. The end result should produce a schedule risk reduction by decreasing the potential for re-design and re-code of software possessing safety deficiencies.
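As a concrete illustration of the risk issue table and risk management database described above, the following minimal sketch (Python) shows what one risk-issue record might carry. The field names, identifiers, and values are hypothetical; they do not represent a defined DOD schema.

from dataclasses import dataclass, field

@dataclass
class RiskIssue:
    """One hypothetical entry in a program risk-issue table."""
    issue_id: str
    category: str              # e.g., "schedule", "budget", "sociopolitical", "technical"
    description: str
    sems_task: str             # trace to the SEMS/SEDS task the issue threatens
    status: str = "open"
    mitigations: list = field(default_factory=list)

# Example: a schedule risk traced to a (hypothetical) SEMS task number.
issue = RiskIssue(
    issue_id="RI-017",
    category="schedule",
    description="Safety-critical software test facility availability",
    sems_task="1.2.3",
)
issue.mitigations.append("Reserve alternate test window; prioritize retest by HRI")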
Figure 3-2: Systems Engineering, Risk Management Documentation

3.4.2 Budget Risk

Almost hand-in-hand with schedule risk comes budget risk. Although they can be mutually exclusive, that is seldom the case. The lack of monetary resources is always a potential risk in a development program. Within the DOD research, acquisition, and development agencies, the potential for budget cuts or congressionally mandated program reductions always seems to be lurking around the next corner. Considering this potential, budgetary planning, cost scheduling, and program funding coordination become paramount to the risk management team. They must ensure that budgetary plans for current and out-years are accurate and reasonable, and that potential limitations or contingencies to funding are identified, analyzed, and incorporated into the program plans. In system safety terms, the development of safety-critical software requires significant program resources: highly skilled engineers, increased training, software development tools, modeling and simulation, and facilities and testing resources. To ensure that this software meets functionality, safety, and reliability goals, these activities become drivers for both the budget and schedule of a program. Therefore, the safety manager must ensure that all safety-specific software development and test functions are prioritized in terms of safety risk potential to the program and to the operation of the software after implementation. The prioritization of safety hazards and failure modes, requirements, specifications, and test activities attributed to software helps to facilitate and support the tasks performed by the risk management team. It will help them
understand, prioritize, and incorporate the activities necessary to minimize the safety risk potential for the program.

3.4.3 Sociopolitical Risk

This is a difficult subject to grasp from a risk management perspective. It is predicated more on public and political perceptions than on basic truth and fact. Examples of this type of risk are often seen during the design, development, test, and fielding of a nuclear weapon system in a geographical area with strong public, and possibly political, resistance. With an example like this in mind, several programmatic areas become important for discussion. First, program design, development, and test results have to be predicated on complete technical fact. This will preclude any public perception of attempts to hide technical shortfalls or safety risk. Second, social and political perceptions can generate programmatic risk that must be considered by the risk managers. This includes the potential for budget cuts, schedule extensions, or program delays due to funding cuts as a result of public outcry and protest and its influence on politicians. Safety plays a significant role in influencing the sensitivities of public or political personages toward a particular program. It must be a primary consideration in assessing risk management alternatives. If an accident (even a minor accident without injury) should occur during a test, it could result in program cancellation. The sociopolitical risk may also change during the life cycle of a system. For example, explosive handling facilities once located in isolated locations often find residential areas encroaching. Protective measures adequate for an isolated facility may not be adequate as residents, especially those not associated with the facility, move closer. While the PM cannot control the later growth in population, he/she must consider this and other factors during the system development process.

3.4.4 Technical Risk

Technical risk is where safety risk is most evident in system development and procurement. It is the risk associated with the implementation of new technologies, or new applications of existing technologies, into the system being developed. These include the hardware, software, human factors interface, and environmental safety issues associated with the design, manufacture, fabrication, test, and deployment of the system. Technical risk results from poor identification and incorporation of the system performance requirements needed to meet the intent of the user and system specifications. The inability to incorporate defined requirements into the design (lack of a technology base, lack of funds, lack of experience, etc.) increases the technical risk potential. Systems engineers are usually tasked with activities associated with risk management. This is due to their assigned responsibility for technical risk within the design engineering function. The systems engineering function performs specification development, functional analysis, requirements allocation, trade studies, design optimization and effectiveness analysis, technical interface compatibility, logistic support analysis, program risk analysis, engineering integration and control, technical performance measurement, and documentation control.
The primary objective of risk management in terms of software safety is to understand that safety risk is a part of the technical risk of a program. All program risks must be identified, analyzed, and either eliminated or controlled. This includes safety risk, and thus, software (safety) risk.
19 Air Force System Safety Handbook, August 1992
20 Leveson, Nancy G., Safeware, System Safety and Computers, 1995, Addison Wesley, page 129
21 Ibid, page 143
The identification of separate software safety tasks in MIL-STD-882B focused engineering attention on the hazard risks associated with the software components of a system and their critical effect on safety. However, the engineering community perceived them as tasks segregated from the overall system safety process, as system safety engineers tried to push the responsibility for performing these tasks onto software engineers. Since software engineers had little understanding of the system safety process and of the overall system safety functional requirements, this was an unworkable process for dealing with SSRs. Therefore, the separate software safety tasks were not included in MIL-STD-882C as separate tasks, but were integrated into the overall system-related safety tasks. In addition, software engineers were given a clear responsibility and a defined role in the SSS process in MIL-STD-498. MIL-STD-882 defines system safety as:

The application of engineering and management principles, criteria, and techniques to optimize all aspects of safety within the constraints of operational effectiveness, time, and cost throughout all phases of the system life cycle.

SSP objectives can be further defined as follows:

a. Safety, consistent with mission requirements, is designed into the system in a timely, cost-effective manner.
b. Hazards associated with systems, subsystems, or equipment are identified, tracked, evaluated, and eliminated, or their associated risk is reduced to a level acceptable to the Managing Authority (MA) by evidence analysis throughout the entire life cycle of a system.
c. Historical safety data, including lessons learned from other systems, are considered and used.
d. Minimum risk consistent with user needs is sought in accepting and using new design technology, materials, production, tests, and techniques. Operational procedures must also be considered.
e. Actions taken to eliminate hazards or reduce risk to a level acceptable to the MA are documented.
f. Retrofit actions required to improve safety are minimized through the timely inclusion of safety design features during research, technology development for, and acquisition of a system.
g. Changes in design, configuration, or mission requirements are accomplished in a manner that maintains a risk level acceptable to the MA.
h. Consideration is given early in the life cycle to safety and ease of disposal [including Explosive Ordnance Disposal (EOD)] and demilitarization of any hazardous materials associated with the system. Actions should be taken to minimize the use of hazardous materials and, therefore, minimize the risks and life cycle costs associated with their use.
i. Significant safety data are documented as lessons learned and are submitted to data banks or as proposed changes to applicable design handbooks and specifications.
j. Safety is maintained and assured after the incorporation and verification of Engineering Change Proposals (ECP) and other system-related changes.

With these definitions and objectives in mind, the system safety manager/engineer is the primary individual responsible for the identification, tracking, elimination, and/or control of hazards or failure modes that exist in the design, development, test, and production of both hardware and software. This includes their interfaces with the user, the maintainer, and the operational environment. System safety engineering is a proven and credible function supporting the design and systems engineering process. The steps in the process for managing, planning, analyzing, and coordinating system safety requirements are well established and, when implemented, successfully meet the above stated objectives. These general SSP requirements are as follows:

a. Eliminate identified hazards or reduce associated risk through design, including material selection or substitution.
b. Isolate hazardous substances, components, and operations from other activities, areas, personnel, and incompatible materials.
c. Locate equipment so that access during operations, servicing, maintenance, repair, or adjustment minimizes personnel exposure to hazards.
d. Minimize risk resulting from excessive environmental conditions (e.g., temperature, pressure, noise, toxicity, acceleration, and vibration).
e. Design to minimize risk created by human error in the operation and support of the system.
f. Consider alternate approaches to minimize risk from hazards that cannot be eliminated. Such approaches include interlocks; redundancy; fail-safe design; fire suppression; and protective clothing, equipment, devices, and procedures.
g. Protect power sources, controls, and critical components of redundant subsystems by separation or shielding.
h. Ensure personnel and equipment protection (when alternate design approaches cannot eliminate the hazard); provide warning and caution notes in assembly, operations, maintenance, and repair instructions as well as distinctive markings on hazardous components and materials, equipment, and facilities. These shall be standardized in accordance with MA requirements.
i. Minimize severity of personnel injury or damage to equipment in the event of a mishap.
j. Design software-controlled or monitored functions to minimize initiation of hazardous events or mishaps.
k. Review design criteria for inadequate or overly restrictive requirements regarding safety. Recommend new design criteria supported by study, analyses, or test data.

A good example of the need for, and the credibility of, a system safety engineering program is the improvement in the Air Force aircraft mishap rate since the establishment of the SSP in the design, test, operations, support, and training processes. In the mid-1950s, aircraft mishap rates were over 10 per 100,000 flight hours. Today, this rate has been reduced to less than 1.25 per 100,000 flight hours. Further information regarding the management and implementation of system safety engineering (and the analyses performed to support the goals and objectives of a SSP) is available through numerous technical resources. It is not the intent of this Handbook to become another technical source book for the subject of system safety, but to address the implementation of SSS within the discipline of system safety engineering. If specific system safety methods, techniques, or concepts remain unclear, please refer to the list of references in Appendix B for supplemental resources relating to the subject matter. With the above information regarding system safety engineering (as a discipline) firmly in hand, a brief discussion must be presented as it applies to hazard and failure mode identification, categorization of safety risk in terms of probability and severity, and the methods of resolution. This concept must be firmly understood as the discussion evolves to the accomplishment of software safety tasks within the system safety engineering discipline.
3.6.1 Initial Safety Risk Assessment

The efforts of the SSE are launched by the performance of the initial safety risk assessment of the system. In the case of most DOD procurements, this takes place with the accomplishment of the Preliminary Hazard List (PHL) and the PHA. These analyses are discussed in detail later in this Handbook. The primary focus here is the assessment and analysis of hazards and failure modes that are evident in the proposed system. This section of the Handbook will focus on the basic principles of system safety and hazard resolution. Specific discussions regarding how software influences, or is related to, hazards will be presented in detail in Section 4.

3.6.1.1 Hazard and Failure Mode Identification

A hazard is defined as follows: a condition that is prerequisite to a mishap. The SSE identifies these conditions, or hazards. The initial hazard analysis, and the Failure Modes and Effects Analysis (FMEA) accomplished by the reliability engineer, provide the safety information required to perform the initial safety risk assessment of identified hazards. Without identified hazards and failure modes, very little can be accomplished to improve the overall safety of a system (remember this fact as software safety is introduced). Identified hazards and failure modes become the basis for the identification and implementation of safety requirements within the design of the system. Once the hazards are identified, they must be categorized in terms of safety risk.

3.6.1.2 Hazard Severity

The first step in classifying safety risk requires the establishment of hazard severity within the context of the system and user environments. This is typically done in two steps: first using the severity of damage, and then applying the number of times that the damage might occur. Table 3-1 provides an example of how severity can be qualified.
Table 3-1: Hazard Severity (For Example Purposes Only)

DESCRIPTION     CATEGORY
Catastrophic    I
Critical        II
Marginal        III
Negligible      IV
Note that this example is typical of the MIL-STD-882-specified format. As you can see, the severity of hazard effect is qualitative and can be modified to meet the special needs of a program. There is an important aspect of this table to remember for any procurement: in order to assess safety severity, a benchmark to measure against is essential. The benchmark allows for the establishment of a qualitative baseline that can be communicated across programmatic and technical interfaces. It must be in a format and language that make sense among individuals and between program interfaces.

3.6.1.3 Hazard Probability

The second half of the equation for the determination of safety risk is the identification of the probability of occurrence. The probability that a hazard or failure mode may occur, given that it is not controlled, can be determined by numerous statistical techniques. These statistical probabilities are usually obtained from reliability analyses pertaining to hardware component failures acquired through qualification programs. Component failure rates from reliability engineering are not always obtainable. This is especially true on advanced technology programs where component qualification programs do not exist and one-of-a-kind items are procured. Thus, the quantification of probability to a desired confidence level is not always possible for a specific hazard. When this occurs, alternative techniques of analysis are required for the qualification or quantification of hazard probability of hardware nodes. Examples of credible alternatives include Sensitivity Analysis, Event Tree Diagrams, and FTAs. An example of the categorization of probability is provided in Table 3-2 and is similar to the format recommended by MIL-STD-882.
Table 3-2: Hazard Probability (For Example Purposes Only)

DESCRIPTION   LEVEL   PROGRAM DESCRIPTION               PROBABILITY
Frequent      A       Will Occur                        1 in 100
Probable      B       Likely To Occur                   1 in 1,000
Occasional    C       Unlikely To Occur, But Possible   1 in 10,000
Remote        D       Very Unlikely To Occur            1 in 100,000
Improbable    E       Assume It Will Not Occur          1 in 1,000,000
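Where a quantitative failure probability is available, mapping it to the levels of Table 3-2 is mechanical. The short sketch below (Python) does this using the table's example thresholds; a real program would substitute its own tailored categories, and the function name is invented for illustration.

# Map an estimated probability of occurrence to the example levels of
# Table 3-2. Thresholds are the table's example values, not a standard.

PROBABILITY_LEVELS = [
    (1e-2, "A", "Frequent"),
    (1e-3, "B", "Probable"),
    (1e-4, "C", "Occasional"),
    (1e-5, "D", "Remote"),
]

def hazard_probability_level(p):
    """Return (level, description) for an estimated probability p of occurrence."""
    for threshold, level, description in PROBABILITY_LEVELS:
        if p >= threshold:
            return level, description
    return "E", "Improbable"    # roughly 1 in 1,000,000 or less

print(hazard_probability_level(3e-4))   # -> ('C', 'Occasional')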
As with the example provided for hazard severity, Table 3-2 can be modified to meet the specification requirements of the user and/or developer. A systems engineering team (to include system safety engineering) may choose to shift the probability numbers an order of magnitude in
either direction, or to include or reduce the number of categories. All of the options are acceptable if the entire team is in agreement. This agreement must definitely include the customer's opinions and specification requirements. Also of importance when considering probability categories is the inclusion of individual units, entire populations, and time intervals (periods) appropriate for the system being analyzed.

3.6.1.4 HRI Matrix

Hazard severity and hazard probability, when integrated into a table format, produce the HRI matrix and the initial HRI risk categorization for hazards before control requirements are implemented. An example of an HRI matrix is provided as Table 3-3. This example was utilized on a research and development activity where little component failure data was available. This matrix is divided into three levels of risk as indicated by the grayscale legend beneath the matrix. HRIs of 1-5 (the darkest gray) are considered unacceptable risk. These risks are considered high and require resolution or acceptance from the Acquisition Executive (AE). HRIs of 6, 7, 8, and 9 (the medium gray) are considered marginal risk, while HRIs of 10 through 20 are minimum risk. Those hazards deemed marginal should be redesigned for elimination and require PM approval or risk acceptance. Those hazards deemed minimum should be redesigned to further minimize their risk and require PM approval or risk acceptance. HRIs of 10-20 were considered lower risk hazards and were put under the cognizance of the lead design engineer and the safety manager.
Table 3-3: HRI Matrix, Hazard Risk Index (For Example Purposes Only)

[Matrix graphic: severity categories I (Catastrophic), II (Critical), III (Marginal), and IV (Negligible) form the columns; probability levels (A) Frequent (1 in 100), (B) Probable (1 in 1,000), (C) Occasional (1 in 10,000), (D) Remote (1 in 100,000), and (E) Improbable (1 in 1,000,000) form the rows. Each cell holds an HRI value from 1 (Catastrophic/Frequent, the highest risk) to 20 (Negligible/Improbable, the lowest risk); the Frequent row, for example, reads 1, 3, 7, and 11 across severity categories I through IV.]

Legend:
Unacceptable Risk - Acquisition Executive Resolution or Risk Acceptance
Marginal Risk - Design To Eliminate; Requires Program Manager Resolution or Risk Acceptance
Minimum Risk - Design To Minimize; Requires Program Manager Resolution or Risk Acceptance
The true benefit of the HRI matrix is the ability and flexibility to prioritize hazards in terms of severity and probability. This prioritization of hazards allows the PM, safety manager, and engineering manager to also prioritize the expenditure and allocation of critical resources. Although it seems simplistic, a hazard with an HRI of 11 should have fewer resources expended in its analysis, design, test, and verification than a hazard with an HRI of 4. Without the availability of the HRI matrix, the allocation of resources becomes more arbitrary.
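The disposition rules of the example matrix reduce to a small lookup. The following sketch (Python) encodes the three example tiers described above (HRI 1-5, 6-9, and 10-20) and sorts a set of hazard records so that resources flow to the lowest HRI (highest risk) first. The hazard identifiers and values are invented for illustration.

def hri_disposition(hri):
    """Return (risk class, required action) under the example three-tier scheme."""
    if not 1 <= hri <= 20:
        raise ValueError("example HRI values run from 1 (highest risk) to 20 (lowest)")
    if hri <= 5:
        return ("Unacceptable", "Acquisition Executive resolution or risk acceptance")
    if hri <= 9:
        return ("Marginal", "Design to eliminate; PM resolution or risk acceptance")
    return ("Minimum", "Design to minimize; PM resolution or risk acceptance")

# Hypothetical hazard records, prioritized for resource allocation:
hazards = {"HAZ-004": 4, "HAZ-011": 11, "HAZ-007": 7}
for hazard_id, hri in sorted(hazards.items(), key=lambda item: item[1]):
    risk_class, action = hri_disposition(hri)
    print(f"{hazard_id}: HRI {hri} -> {risk_class} ({action})")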
Another benefit of the HRI matrix is the accountability and responsibility of program and technical management to the system safety effort. The SwSSP identifies and assigns specific levels of management authority to the appropriate levels of safety hazard severity and probability. The HRI methodology holds program management and technical engineering accountable for the safety risk of the system during design, test, and operation, and for the residual risk upon delivery to the customer. From the perspective of the safety analyst, the HRI is a tool that is used during the entire system safety effort throughout the product life cycle. Note, however, that the HRI as a tool is more complex when applied to the evaluation of system hazards and failure modes influenced by software inputs or software information. Alternatives to the classical HRI are discussed in detail in Section 4.

3.6.2 Safety Order of Precedence

The ability to adequately eliminate or control safety risk is predicated on the ability to accomplish the necessary tasks early in the design phases of the acquisition life cycle. For example, it is more cost effective and technologically efficient to eliminate a known hazard by changing the design (on paper) than by retrofitting a fleet in operational use. Because of this, the system safety engineering methodology employs a safety order of precedence for hazard elimination or control. When incorporated, the design order of precedence further eliminates or reduces the severity and probability of hazard/failure mode initiation and propagation throughout the system. The following is extracted from MIL-STD-882C, subsection 4.4.

a. Design for Minimum Risk - From the first, design to eliminate hazards. If an identified hazard cannot be eliminated, reduce the associated risk to an acceptable level, as defined by the MA, through design selection.
b. Incorporate Safety Devices - If identified hazards cannot be eliminated or their associated risk adequately reduced through design selection, that risk shall be reduced to a level acceptable to the MA through the use of fixed, automatic, or other protective safety design features or devices. Provisions shall be made for periodic functional checks of safety devices when applicable.
c. Provide Warning Devices - When neither design nor safety devices can effectively eliminate identified hazards or adequately reduce associated risk, devices shall be used to detect the condition and to produce an adequate warning signal to alert personnel of the hazard. Warning signals and their application shall be designed to minimize the probability of incorrect personnel reaction to the signals and shall be standardized within like types of systems.
d. Develop Procedures and Training - Where it is impractical to eliminate hazards through design selection or adequately reduce the associated risk with safety and warning devices, procedures and training shall be used. However, without a specific waiver from the MA, no warning, caution, or other form of written advisory shall be used as the only risk reduction method for Category I or II hazards. Procedures may include the use of personal protective equipment. Precautionary notations shall be standardized
as specified by the MA. Tasks and activities judged to be safety-critical by the MA may require certification of personnel proficiency.

3.6.3 Elimination or Risk Reduction

The process of hazard and failure mode elimination or risk reduction is based on the design order of precedence. Once hazards and failure modes are identified by evidence analysis and categorized, specific (or functionally derived) safety requirements must be identified for incorporation into the design for the elimination or control of safety risk. Defined requirements can be applicable to any of the four categories of the defined order of safety precedence. For example, a specific hazard may have several design requirements identified for incorporation into the system design. However, to further minimize the safety risk of the hazard, supplemental requirements may be appropriate for safety devices, warning devices, and operator/maintainer procedures and training. In fact, most hazards have more than one design or risk reduction requirement unless the hazard is completely eliminated through the first (and only) design requirement. Figure 3-3 shows the process required to eliminate or control safety risk via the order of precedence described in Paragraph 3.6.2; a minimal sketch of this decision flow follows the figure.
[Figure 3-3: Hazard elimination and risk reduction decision flow. For each hazard, each level of the safety order of precedence is applied in turn, asking first whether the hazard is eliminated and then whether it is reduced. When no further reduction is practical, a risk assessment package is provided to management and the System Safety Group, and the hazard analysis and risk assessment activities are concluded.]
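The decision flow of Figure 3-3 can be sketched directly. In the Python fragment below, each level of the order of precedence is tried in turn until the hazard is eliminated or all levels are exhausted; the callback structure and outcome strings are hypothetical, standing in for the real engineering activity performed at each level.

ORDER_OF_PRECEDENCE = [
    "design for minimum risk",
    "incorporate safety devices",
    "provide warning devices",
    "develop procedures and training",
]

def resolve_hazard(hazard, apply_level):
    """Walk the safety order of precedence for one hazard.

    `apply_level(hazard, level)` is a hypothetical callback returning
    "eliminated", "reduced", or "none" for the given precedence level.
    """
    applied = []
    for level in ORDER_OF_PRECEDENCE:
        outcome = apply_level(hazard, level)
        if outcome != "none":
            applied.append((level, outcome))
        if outcome == "eliminated":
            break   # nothing left to control; skip the remaining levels
    # In every case, conclude with a risk assessment package for management
    # and the System Safety Group (Figure 3-3).
    return applied

# Example: a hazard that a design change reduces but only a warning device controls.
outcomes = {"design for minimum risk": "reduced", "provide warning devices": "reduced"}
print(resolve_hazard("HAZ-007", lambda h, lvl: outcomes.get(lvl, "none")))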
Identification of safety-specific requirements for the design and implementation portions of the system does not complete the safety task. The safety engineer must verify that the derived requirements have indeed been implemented as intended. Once hazard elimination and control requirements are identified and communicated to the appropriate design engineers, testing requirements must be identified for hazards which have been categorized as safety-critical. The categorization of safety risk in accordance with severity and probability must play a significant role in the depth of testing and the requirements verification methods employed. Very low risk hazards do not require the same rigor of safety testing to verify the incorporation of requirements as those associated with safety-critical hazards. In addition, testing cannot always be accomplished, whereas other verification methods may be appropriate (e.g., designer sign-off on the hazard record, as-built drawing review, inspection of manufactured components, etc.).

3.6.4 Quantification of Residual Safety Risk

After the requirements are implemented (to the extent possible) and appropriately verified in the design, the safety engineer must analyze each identified and documented hazard record to assess and analyze the residual risk that remains within the system during its operations and support activities. This is the same risk assessment process that was performed in the initial analysis described in Paragraph 3.6.1. The difference in the analysis is the amount of design and test data available to support the risk reduction activities. After the incorporation of safety hazard elimination or reduction requirements, the hazard is once again assessed for severity and probability of occurrence, and an HRI is determined. A hazard with an initial HRI of 4 may have been reduced in safety risk to an HRI of 8. However, since in this example the hazard was not completely eliminated and only reduced, there remains a residual safety risk. Granted, it is not as severe or as probable as the original, but the hazard does exist. Remember that the initial HRI of a hazard is determined during the PHA development, prior to the incorporation or implementation of requirements to control or reduce the safety risk, and is often an initial engineering judgment. The final HRI categorizes the hazard after the requirements have been implemented and verified by the developer. If hazards are not reduced sufficiently to meet the safety objectives and goals of the program, they must be reintroduced to safety engineering for further analyses and safety risk reduction. It should be noted that risk is generally reduced within a probability category; risk reduction across severity levels generally requires a hardware design change. In conjunction with the safety analysis and the available engineering data and information, the residual safety risk of the system, subsystems, user, maintainer, and tester interfaces must be quantified. Hazard records with remaining residual risk must be correlated within subsystems, interfaces, and the total system for the purpose of calculating the remaining risk. This risk must be communicated in detail [via the System Safety Working Group (SSWG) and the detailed hazard record system] to the PM, the lead design engineers, the test manager, and the user, and fully documented in the hazard database record. If residual risk in terms of safety is unacceptable to the PM, further direction and resources must be provided to the engineering effort.
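The re-assessment described above reduces to a comparison of the initial and final HRI against a program goal. The short sketch below (Python) uses the text's example of a hazard reduced from an HRI of 4 to an HRI of 8; the goal threshold of 10 and the function name are hypothetical.

def residual_risk_check(initial_hri, final_hri, goal_hri=10):
    """Flag hazards whose verified (final) HRI still misses the program goal.

    Higher HRI numbers mean lower risk in the example scheme, so a hazard
    meets the goal when final_hri >= goal_hri.
    """
    return {
        "risk_reduced": final_hri > initial_hri,      # e.g., 4 -> 8: reduced, not eliminated
        "meets_goal": final_hri >= goal_hri,
        "reintroduce_to_safety_engineering": final_hri < goal_hri,
    }

print(residual_risk_check(4, 8))
# Risk was reduced, but the goal was missed: the hazard goes back to
# safety engineering for further analysis and risk reduction.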
3.6.5 Managing and Assuming Residual Safety Risk

Managing safety risk is another one of those simple processes that usually takes a great deal of time, effort, and resources to accomplish. Referring back to Table 3-3, HRI Matrix, specific categories must be established in the matrix to identify the level of management accountability, responsibility, and risk acceptance. Using Table 3-3 as an example, hazards with an HRI between 1 and 5 are considered unacceptable.23 These hazards, if not reduced to a lower risk level (an HRI above 5), cannot be officially closed without the Acquisition Executive's signature. This forces the accountability for assuming this particular risk to the appropriate level of management (the top manager). However, the PM can officially close hazards with HRIs from 6 through 20. That is, the PM is at the appropriate level of management to assume the safety risk for these hazards or to direct further reduction of the HRI to a lower-risk category. Recognize that Tables 3-1 through 3-3 are for purposes of example only. They provide a graphical representation and depiction of how a program may be set up with three levels of program and technical management. It is ideal to have the PM as the official sign-off for all residual safety risk to maintain safety accountability with that individual. Remember that the PM is responsible for the safety of a product or system at the time of test and deployment. The safety manager must establish an accountability system for the assumption of residual safety risk based upon user inputs, contractual obligations, and negotiations with the PM.
23 Remember that this is for example purposes only. Within DOD programs, HRI 1 through 9 would require the AE's acceptance of risk.
Figure 4-1: Section 4 Contents

Section 4 is applicable to all managerial and technical disciplines. It describes the processes, tools, and techniques to reduce the safety risk of software operating in safety-critical systems. Its primary purposes are as follows:

• Define a recommended software safety engineering process.
• Describe essential tasks to be accomplished by each professional discipline assigned to the SSS Team.
• Identify interface relationships between professional disciplines and the individual tasks assigned to the SSS Team.
• Identify best practices to complete the software safety process and describe each of its individual tasks.
• Recommend tailoring actions to the software safety process to identify specific user requirements.
Section 4 should be reviewed and understood by all systems engineers, system safety engineers, software safety engineers, and software development engineers. It is also appropriate for review by PMs interested in the technical aspects of the SSS processes and the possible process improvement initiatives implemented by their systems engineers, software developers, design engineers, and programmers. This section not only describes the essential tasks required by the
system safety engineers, but also the required tasks that must be accomplished by the software safety engineers, systems engineers, and the software developers. This includes the critical communication interfaces between each functional discipline. It also includes the identification, communication, and implementation of initial SSRs and guidelines. The accomplishment of a software safety management and engineering program requires careful forethought, adequate support from various other disciplines, and timely application of expertise across the entire software development life cycle. Strict attention to planning is required in order to integrate the developer's resources, expertise, and experience with tasks to support contractual obligations established by the customer. This section focuses on the establishment of a software safety program within the system safety engineering and software development processes. It establishes a baseline program that, when properly implemented, will ensure that both initial SSRs and requirements specifically derived from functional hazards analysis are identified, prioritized, categorized, and traced through design, code, test, and acceptance. A goal of this section of the Handbook is to formally identify the software safety duties and responsibilities assigned to the safety engineer, the software safety engineer, and the software engineer, and the managerial and technical interfaces of each, through sound systems engineering methods (Figure 4-2). This section of the Handbook will identify and focus on the logical and practical relationships between the safety, design, and software disciplines. It will also provide the reader with the information necessary for the assignment of software safety responsibilities and the identification of tasks attributed to system safety, software development, and hardware and digital systems engineering.
Figure 4-2: Who is Responsible for SSS?

This Handbook assumes a novice's understanding of software safety engineering within the context of system safety and software engineering. Note that many topics of discussion within this section are considered constructs within basic system safety engineering. This is due to the fact that it is impossible to discuss software safety outside of the domain of system safety
engineering and management, systems engineering, software development, and program management.

4.1.1 Section 4 Format

This section is formatted specifically to present both graphical and textual descriptions of the managerial and technical tasks that must be performed for a successful software safety engineering program. Each managerial process task and technical task, method, or technique will be formatted to provide the following:

• Graphical representation of the process step or technical method
• Introductory and supporting text
• Prerequisite (input) requirements for task initiation
• Activities required to perform the task (including interdisciplinary interfaces)
• Associated subordinate tasks
• Critical task interfaces
• A description of required task output(s) and/or product(s)
This particular format helps to explain the inputs, activities, and outputs required for the successful accomplishment of activities that meet the goals and objectives of the software safety program. For those who desire additional information, Appendices A-G are provided to supplement the information in the main sections of this Handbook. The appendices are intended to provide practitioners with supplemental information and credible examples for guidance purposes. The titles of the appendices are as follows:

Appendix A - Definition of Terms
Appendix B - References
Appendix C - Handbook Supplemental Information
Appendix D - Generic Software Safety Requirements and Guidelines
Appendix E - Lessons Learned
Appendix F - Process Chart Worksheets
Appendix G - Examples of Contractual Language [RFP, SOW/Statement of Objectives (SOO)]

4.1.2 Process Charts

Each software safety engineering task possesses a supporting process chart. Each chart was developed for the purpose of providing the engineer with a detailed and complete "road map" for performing software safety engineering within the context of software design, code, and test
activities. Figure 4-3 provides an example of the depth of information considered for each process task. The depth of information presented in this figure includes processes, methods, documents, and deliverables associated with system safety engineering and management activities. However, for the purposes of this Handbook, these process charts were "trimmed" to contain the information deemed essential for the effective management and implementation of the software safety program under the parent SSP. The in-depth process chart worksheets are provided in Appendix F for those interested in this detailed information.
[Figure 4-3 depicts an example process chart for the Subsystem Hazard Analysis (SSHA), with the following fields:

Preceding Process: Software Detailed Design
Process: Subsystem Hazard Analysis (SSHA)
Next Process: Software Safety Test Planning
Purpose: To analyze subsystem interfaces and interactions, interface design, and system functional and physical requirements for safety hazards, and to assess and classify system risks
Inputs (Suppliers): PHA; draft SSHAs; SSS; S/SDD; IRS; IDD; tailored generic safety-critical software design requirements list; incident reports; threat hazard assessment; life cycle environmental profile; HARs; lessons learned
Outputs (Customers): input to SPRA reviews; updates to PHA; updates to SSHAs; HARs; inputs to software design; inputs to interface design; inputs to test requirements; inputs to test plan; prioritized hazard list; list of causal interrelationships to hazards
Primary Sub-Processes: analyze IRS and IDD to ensure correct implementation of safety design requirements; integrate the results of the SSHAs; identify hazards that cross subsystem boundaries; ensure that hazards are mitigated in interfacing subsystems or external systems; identify unresolved interface safety issues and reflect them back to the SSHAs; examine the causal relationship of multiple failure modes (hardware and software) in creating software system hazards; determine compliance with safety criteria and system and subsystem requirements documents; assess hazard impacts related to interfaces; develop recommendations to minimize hazard effects; develop test recommendations to verify hazard mitigation
Entry Criteria: System Design Review
Players: SSWG, SwSWG]
Figure 4-3: Example of Initial Process Chart

Each process chart presented in this Handbook will contain the following:

• Primary task description
• Task inputs
• Task outputs
• Primary sub-process tasks
• Critical interfaces
4.1.3 Software Safety Engineering Products

The specific products to be produced by the accomplishment of the software safety engineering tasks are difficult to segregate from those developed within the context of the SSP. It is likely, within individual programs, that supplemental software safety documents and products will be produced to support the system safety effort. These may include supplemental analyses, Data Flow Diagrams (DFD), functional flow analyses, software requirements specifications (SRS), and Software Analysis Folders (SAF). This Handbook will identify and describe the documents and products that the software safety tasks will either influence or generate. Specific documents include the following:

a. System Safety Program Plan (SSPP)
b. Software Safety Program Plan (SwSPP)
c. Generic Software Safety Requirements List (GSSRL)
d. Safety-Critical Functions List (SCFL)
e. PHL
f. PHA
g. Subsystem Hazard Analysis (SSHA)
h. Safety Requirements Criteria Analysis (SRCA)
i. System Hazard Analysis (SHA)
j. Safety Assessment Report (SAR)
and document the breadth and depth of the program. Detailed planning ensures the identification of critical program interfaces and support, and establishes formal lines of communication between disciplines and among engineering functions. As depicted in Figure 4-4, the potential for program success increases through sound planning activities that identify and formalize the managerial and technical interfaces of the program. To minimize the depth of the material presented, supporting and supplemental text is provided in Appendix C.
Figure 4-4: Software Safety Planning

This section is applicable to all members of the SSS Team. It assumes a minimal understanding of, and experience with, safety engineering programs. The topics include the following:

a. Identification of managerial and technical program interfaces required by the SSS Team.
b. Definition of user and developer contractual relationships to ensure that the SSS Team implements the tasks, and produces the products, required to meet contractual requirements.
c. Identification of programmatic and technical meetings and reviews normally supported by the SSS Team.
d. Identification and allocation of critical resources to establish a SSS Team and conduct a software safety program.
e. Definition of planning requirements for the execution of an effective program.

The planning for an effective SSP and software safety program requires extensive forethought from both the supplier and the customer. Although they both envision a perfect SSP, there are subtle differences associated with the identification, preparation, and execution of a successful safety program from these two perspectives. The contents of Figures 4-5 and 4-6 represent the primary differences between the agencies, which both must understand before considering the software safety planning and coordinating activities.

4.2.1 Planning

Comprehensive planning for the software safety program requires an initial assessment of the degree of software involvement in the system design and the associated hazards. Unfortunately,
this is difficult because little is usually known about the system, other than its operational requirements, during the early planning stages. Therefore, the safety SOW must encompass all possible designs. This generally results in a fairly generic SOW that requires later tailoring of the SSS program to the system design and implementation. Figure 4-5 represents the basic inputs, outputs, tasks, and critical interfaces associated with the planning requirements of the PA. Frustrations experienced by safety engineers executing system safety tasks can usually be traced back to a lack of adequate planning by the customer. The direct result is normally an under-budgeted, under-staffed safety effort that does not focus on the most critical aspects of the system under development. This, in turn, can usually be traced to the customer failing to ensure that the Request for Proposal (RFP), SOW, and contract contain the correct language, terminology, and/or tasks to implement a safety program with the required or necessary resources. Therefore, the ultimate success of any safety program depends strongly on the planning performed by the customer.
[Figure 4-5 process chart contents. Primary task: Software Safety Program Planning, Procuring Agency (Customer). Inputs: acquisition policy, OCD or MENS, DOP, proposal, safety policy, generic requirements, lessons learned, and the Preliminary Hazard List. Primary sub-tasks (iterative): establish the System Safety Program; define acceptable levels of risk (HRI); establish program interfaces; establish contract deliverables; establish the hazard tracking process; establish resource requirements. Outputs: input to the SOW/SOO, input to the RFP, safety POA&M, System Safety Program Plan with software safety appendix, SSWG charter, and inputs to the SDP, TEMP, SEMP, ILSP, PHL, PHA, and CRLCMP. Critical interfaces: program management; contracts; systems engineering (hardware and software); design engineering (hardware and software); software engineering; support engineering disciplines.]
Figure 4-5: Software Safety Planning by the Procuring Authority

For the PA, software safety program planning begins as soon as the need for the system is identified. The PA must identify points of contact within the organization and define the interfaces between the various engineering disciplines, administrative support organizations, program management, the contracting group, and the Integrated Product Teams (IPT) within the PA to develop the necessary requirements and specifications documents. In the context of acquisition reform, invoking military standards and specifications in DOD procurements is either not permitted or significantly reduced. Therefore, the PA must incorporate the necessary language into the contractual documents to ensure that the system under development will meet the safety goals and objectives. PA safety program planning continues through contract award and may require periodic updating during initial system development and as the development proceeds through its various phases.
However, management of the overall SSP continues through system delivery and acceptance and throughout the system's life cycle. After deployment, the PA must continue to track system hazards and risks and monitor the fielded system for safety concerns identified by the user. The PA must also make provisions for safety program planning and management for any upgrades, product improvements, maintenance, technology refreshment, and other follow-on efforts. The major milestones affecting the PA's safety and software safety program planning include the release of contract requests for proposals or quotes, proposal evaluation, major program milestones, system acceptance testing and evaluation, production contract award, initial operational capability (release to the users), and system upgrades or product improvements.

Although the Developing Agency's (DA) software safety program planning begins after receipt of a contract RFP or request for quotes, the DA can significantly enhance its ability to establish an effective program through prior planning (see Figure 4-6). Prior planning includes establishing effective systems engineering and software engineering processes that fully integrate system and software systems safety. Corporate engineering standards and practices documents that incorporate the tenets of system safety provide a strong baseline from which to build a successful SSP, even when the contract does not contain specific language regarding the safety effort.
[Figure 4-6 process chart contents. Primary task: Software Safety Program Planning, Developing Agency (Supplier). Inputs: Statement of Work, Request for Proposal, OCD or MENS, safety policy, and the Preliminary Hazard List. Primary sub-tasks (iterative): interpret the SOW requirements; determine resource requirements; establish the System Safety Program; develop the Software Safety Program Plan. Outputs: RFP/proposal input, safety POA&M, requirements review, BAFO response, System Safety Program Plan with software safety appendix, SSWG charter, and inputs to the SDP, TEMP, SEMP, ILSP, PHL, PHA, and CRLCMP. Critical interfaces: program management; contracts; systems engineering (hardware and software); design engineering (hardware and software); support engineering disciplines.]
Figure 4-6: Software Safety Planning by the Developing Agency

Acquisition reform recommends that the Government take a more interactive approach to system development without interfering with that development. The interactive aspect is to participate as a member of the DA's IPTs in an advisory role without hindering development. This requires a careful balance on the part of the government participants. From the system safety and SSS perspective, this includes active participation in the appropriate IPTs by providing the government perspective on recommendations and decisions made in those forums. This also
requires the government representative to alert the developer to hazards known to the government but not to the developer. Acquisition reform also requires the DA to warrant the system, thus making the DA liable for any mishaps that occur, even after system acceptance by the PA. Although the courts have yet to fully test that liability, the DA can significantly reduce its liability through this interactive process. Having government representatives present when safety-related decisions are made provides an inherent buy-in by the Government to the residual risks in the system. This has the effect of significantly reducing the DA's liability.24 MIL-STD-882D also implies this reduction in liability.

Where is this discussion leading? Often, contract language is non-specific and does not provide detailed requirements, especially with respect to safety requirements for the system. Therefore, it is the DA's responsibility to define a comprehensive SSP that ensures the delivered system presents an acceptably low level of safety risk to the customer, not only for the customer's benefit, but for the DA's benefit as well. At the same time, the DA must remain competitive and reduce safety program costs to the lowest practical level consistent with delivering a system with the lowest practical risk. Although the preceding discussion focused on the interaction between the Government and the DA, the same tenets apply to any contractual relationship, especially between prime contractors and subcontractors.

DA software safety planning continues after contract award and requires periodic updates as the system proceeds through the various phases of development. These updates should be in concert with the PA's software safety plans. However, management of the overall system and SSS programs continues from contract award through system delivery and acceptance and may extend throughout the system life cycle, depending on the type of contract. If the contract is a Total System Responsibility contract, or requires the DA to perform routine maintenance, technology refreshment, or system upgrades, software safety program management and engineering must continue throughout the system's life cycle. Thus, the DA must make provisions for safety program planning and management for these phases and other follow-on efforts. The major milestones affecting the DA's safety and software safety program planning include receipt of the contract RFP or request for quotes, contract award, major program milestones, system acceptance testing and evaluation, production contract award, release to the customer, system upgrades, and product improvements.

While the software safety planning objectives of the PA and DA may be similar, the planning and coordination required to meet those objectives may come from different angles (in terms of specific tasks and their implementation), but they must be in concert (Figure 4-7). Regardless, both agencies must work together to meet the safety objectives of the program. In terms of planning, this includes the following:
24. This represents a consensus opinion of lawyers and legal experts in the practice of defending government contractors in liability cases.
- Definition of critical program, management, and engineering interfaces
- Definition of contract deliverables
- Development of a Software Hazard Criticality Matrix (SHCM) (see Paragraph 4.2.1.5)
Figure 4-7: Planning the Safety Criteria Is Important

4.2.1.1 Establish the System Safety Program

The PA must establish the safety program as early as practical in the development of the system. The PM should identify a Principal for Safety (PFS, a Navy term) or other safety manager early in the program to serve as the single point of contact for all safety-related matters on the system. This individual will interface with safety review authorities, the DA safety team, PA and DA program management, the safety engineering team, and other groups as required to ensure that the safety program is effective and efficient. The PFS may also establish and chair a Software Systems Safety Working Group (SwSWG) or SSS Team. For large system developments where software is likely to be a major portion of the development, a safety engineer for software may also be identified who reports directly to the overall system PFS.

The size of the safety organization will depend on the complexity of the system under development and the inherent safety risks. Another factor influencing the size of the PM's safety team is the degree of interaction with the customer, the supplier, and the other engineering and program disciplines. If the development approach is a team effort with a high degree of interaction between the organizations, the safety organization may require additional personnel to provide adequate support.

The PA should prepare a System Safety Management Plan (SSMP) describing the overall safety effort within the PA organization and the interface between the PA safety organization and the DA's system safety organization. The SSMP is similar to the SSPP in that it describes the roles and responsibilities of the program office individuals with respect to the overall safety effort. The PFS or safety manager should coordinate the SSMP with the DA's SSPP to ensure that the tasks and responsibilities are complete and will provide the desired risk assessment. The SSMP differs from the SSPP in that it does not describe the details of the safety program, such as
the analysis tasks, which are contained in the SSPP. A special note applies to programs initiated under MIL-STD-882D: MIL-STD-882D does not require or contain a Contract Deliverable Requirements List (CDRL) listing for a SSPP. However, its Section 4.1 requires that the PM and the developer document the agreed-upon system safety process, which is virtually identical to the role of the SSPP. Therefore, the PFS or safety manager coordinates the SSMP with this documented safety process.

The PA must specify a software safety program for programs where software performs or influences safety-critical functions of the system. The PA must establish the team in accordance with contractual requirements, managerial and technical interfaces and agreements, and the results of all planning activities discussed in previous sections of this Handbook. Proper and detailed planning will increase the probability of program success. The tasks and activities associated with the establishment of the SSP are applicable to both the supplier and the customer. Unfortunately, the degree of influence of the software on safety-critical functions in the system is often not known until the design progresses to the point of functional allocation of requirements at the system level.

The PM must predicate the software safety program on the goals and objectives of the system safety and software development disciplines of the proposed program. The safety program must focus on the identification and tracking (through design, code, and test) of both the initial SSRs and guidelines, and those requirements derived from system-specific functional hazard analyses. A common deficiency in software safety programs is the lack of a team approach in addressing both the initial and the functional SSRs of a system. The software development community tends to focus only on the initial SSRs, while the system safety community may focus primarily on the functional SSRs derived through hazard analyses. A sound SSS program traces both sets of requirements through test and requirements verification activities. The ability to identify (in total) all applicable SSRs is essential for any given program and must be adequately addressed.

4.2.1.2 Defining Acceptable Levels of Risk

One of the key elements in safety program planning is the identification of the acceptable level of risk for the system. This process requires both the definition of a HRI and a statement of the goal of the safety program for the system. The former establishes a standardized means of grouping hazards by risk (e.g., unacceptable, undesirable, etc.), while the latter provides a statement of the expected safety quality of the system. The ability to categorize specific hazards within the HRI matrix rests on the ability of the safety engineer to assess hazard severity and likelihood of occurrence. The PA, in concert with the user, must develop a definition of acceptable risk and provide it to the DA. The PA must also provide the developer with guidance on risk acceptance authorities and reporting requirements. DOD 5000.2R requires that high-risk hazards (Unacceptable per MIL-STD-882) obtain component CAE signature for acceptance. Serious risk hazards (Undesirable) require acceptance at the PEO level. The DA must provide the PM supporting documentation for the risk acceptance authority.
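The mapping from risk category to acceptance authority lends itself to a simple lookup in safety tooling. The sketch below is illustrative only; it assumes the MIL-STD-882 example categories and the DOD 5000.2R authorities cited above, and the table and function names are hypothetical.

# Illustrative lookup only: maps the MIL-STD-882 example risk categories to
# the acceptance authorities cited in DOD 5000.2R. Names are hypothetical.
RISK_ACCEPTANCE_AUTHORITY = {
    "UNACCEPTABLE": "Component Acquisition Executive (CAE)",
    "UNDESIRABLE": "Program Executive Officer (PEO)",
    "ALLOWABLE": "Program Manager review",
    "ACCEPTABLE": "No formal review required",
}

def acceptance_authority(risk_category: str) -> str:
    """Return the authority that must formally accept a hazard's residual risk."""
    return RISK_ACCEPTANCE_AUTHORITY[risk_category.upper()]

print(acceptance_authority("Undesirable"))  # Program Executive Officer (PEO)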
4.2.1.3 Program Interfaces

System safety engineering is responsible for the coordination, initiation, and implementation of the software safety engineering program. While this responsibility cannot be delegated to any other engineering discipline within the development team, software safety must assign specific tasks to the engineers with the appropriate expertise. Historically, system safety engineering performs the engineering necessary to identify, assess, and eliminate or reduce the safety risk of hazards associated with complex systems. Now, as software becomes a major aspect of the system, software safety engineering must establish and perform the required tasks and establish the technical interfaces required to fulfill the goals and objectives of the system safety (and software safety) program. However, the SSS Team cannot accomplish this independently, without inter-communication and support from other managerial and technical functions.

Within the DOD acquisition and product development agencies, IPTs have been established to ensure the success of the design, manufacture, fabrication, test, and deployment of weapon systems. These IPTs formally establish the accountability and responsibility between functions and among team members. This accountability and responsibility run both from the top down (management-to-engineer) and from the bottom up (engineer-to-management). The establishment of a credible SSS activity within an organization requires this same rigor in the identification of team members, the definition of program interfaces, and the establishment of lines of communication. Establishing formal, defined interfaces allows program and engineering managers to assign the required expertise to the identified tasks of the software safety engineering process. Figure 4-8 shows the common interfaces necessary to adequately support a SwSSP, including management, technical, and contractual interfaces.
Figure 4-8: Software Safety Program Interfaces

4.2.1.3.1 Management Interfaces

The PM, under the authority of the AE or the PEO:
- Coordinates the activities of each professional discipline for the entire program,
- Allocates program resources,
- Approves the program's planning documents, including the SSPP, and
- Reviews safety analyses; accepts the impact on the system for Critical and higher category hazards (based upon acceptable levels of risk); and submits findings to the PEO for acceptance of unmitigated, unacceptable hazards.
It is the PM's responsibility to ensure that processes are in place within a program that meet not only the programmatic, technical, and safety objectives, but also the functional and system specifications and requirements of the customer. The PM must allocate critical resources within the program to reduce the sociopolitical, managerial, financial, technical, and safety risk of the product. Therefore, management support is essential to the success of the SSS program. The PM ensures that the safety team develops a practical process and implements the necessary tasks required to:
- Identify system hazards,
- Categorize hazards in terms of severity and likelihood,
- Perform causal factor analysis,
- Derive hardware and software design requirements to eliminate and/or control the hazards,
- Provide evidence for the implementation of hardware and software safety design requirements,
- Analyze and assess the residual safety risk of any hazards that remain in the design at the time of system deployment and operation, and
- Report the residual safety risk and hazards associated with the fielded system to the appropriate acceptance authority.
The safety manager and the software engineering manager depend on program management for the allocation of the necessary resources (time, tools, training, money, and personnel) for the successful completion of the required SSS engineering tasks. Within the DOD framework, the AE (Figure 4-9) is ultimately responsible for the acceptance of the residual safety risk at the time of test, initial system operation, and deployment. The AE must certify, at the Test Readiness Review (TRR) and to the Safety Program Review Authority (SPRA) [sometimes referred to as a Safety Review Board (SRB)], that all hazards and failure modes have been eliminated or that their risk has been mitigated or controlled to a level As-Low-As-Reasonably-Possible (ALARP). At this critical time, an accurate assessment of the residual safety risk of a system facilitates informed management and engineering decisions. Under the old acquisition process, without the safety risk assessment provided by a credible system safety process, the AE assumed the personal, professional, programmatic, and political liabilities in the decision-making process. If the PM failed to implement effective system and SSS programs,
he/she may have assumed the liability due to failure to follow DOD directives. The developer now assumes much of that liability under acquisition reform. The ability of the PFS or safety manager to provide an accurate assessment of safety risk depends on the support provided by program management throughout the design and development of the system. Under acquisition reform, the government purchases systems as if they were off-the-shelf products. The developer warrants the system for performance and safety characteristics, thus making the developer liable for any mishaps that occur. However, the AE remains ultimately responsible for the safety of the system and for the assessment and acceptance of the residual risk. The developer's safety team, in coordination with the PA's safety team, must provide the AE with an accurate assessment of the residual risk so that he/she can make informed decisions. Again, this is also implied by MIL-STD-882D.
Figure 4-9: Ultimate Safety Responsibility

4.2.1.3.2 Technical Interfaces

The engineering disciplines associated with system development must also provide technical support to the SSS Team (Figure 4-10). Engineering management, design engineers, systems engineers, software development engineers, Integrated Logistics Support (ILS), and other domain engineers supply this essential engineering support. Other domain engineering includes reliability, human factors, quality assurance (QA), test and evaluation, verification and validation, maintainability, survivability, and supportability. Each member of the engineering team must provide timely support to the defined processes of the SSS Team to accomplish the safety analyses and the specific design influence activities that eliminate, reduce, or control hazard risk. This includes the traceability of SSRs from design to test (and test results), with the associated documented evidence of implementation.

A sure way for the software safety activity to fail is to not secure software engineering's acceptance and support of the software safety process, functions, and implementation tasks. One must recognize that most formal education and training for software engineers and developers does not present, teach, or rationalize system safety. The system safety process of deriving functional SSRs through hazard analyses is foreign to most software developers. In fact, the concept that software can be a causal factor in a hazard is foreign to many software engineers.
[Figure 4-10 contents. System safety membership: PM, Principal for Safety, system safety engineer, and software safety engineer. Software development membership: software engineer, digital systems engineer, software quality assurance, software safety engineer, and software test engineer.]
Figure 4-10: Proposed SSS Team Membership

Without the historical experience of cultivating technical interfaces between software development and system safety engineering, several issues may need resolution. They include:
- Software engineers may feel threatened that system safety has responsibility for activities considered part of the software engineering realm;
- Software developers may be confident enough in their own methods of error detection, correction, and removal that they ignore the system safety inputs to the design process (this normally arises in connection with the initial SSRs); and
- There may be insufficient communication and resource allocation between software development and system safety engineering to identify, analyze, categorize, prioritize, and implement both generic and derived SSRs.

A successful SSS effort requires the establishment of a technical SSS Team approach. The SSP manager, in concert with the systems engineer and the software engineering team leaders, must define the individual tasks and the specific team expertise required, and assign responsibility and accountability for the accomplishment of these tasks. The SwSPP must identify and define the required expertise and tasks in the software safety portion or appendix. The team must identify both the generic SSRs and guidelines and the functional safety design requirements derived from system hazards and failure modes that have specific software input or influence. Once these hazards and failure modes are identified, the team can identify specific safety design requirements through an integrated effort. All SSRs must be traceable to test and must be correct, complete, and testable where possible. The Requirements Traceability Matrix (RTM) within the SRCA documents this traceability. The implemented requirements must eliminate, control, or reduce the safety risk as low as reasonably possible while meeting the user requirements within operational constraints. Appendix C.3 contains supplemental information pertaining to the technical interfaces.
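To make the traceability that the RTM documents concrete, the minimal sketch below links each SSR to the design elements that implement it and the test cases that verify it, and flags gaps. The field names and helper are hypothetical illustrations, not a prescribed SRCA format.

# Hypothetical sketch of Requirements Traceability Matrix (RTM) entries;
# field names are illustrative, not a prescribed SRCA format.
from dataclasses import dataclass, field

@dataclass
class RTMEntry:
    ssr_id: str                                          # safety requirement identifier
    source: str                                          # "generic" or "functional" (hazard-derived)
    design_elements: list = field(default_factory=list)  # CSCIs/CSUs implementing the SSR
    test_cases: list = field(default_factory=list)       # tests verifying the SSR
    verified: bool = False                               # set once test results are reviewed

def untraced(rtm):
    """Flag SSRs lacking a design element or a test case: gaps in traceability."""
    return [e.ssr_id for e in rtm if not e.design_elements or not e.test_cases]

rtm = [RTMEntry("SSR-001", "functional", ["CSCI-ARM"], ["TC-045"], verified=True),
       RTMEntry("SSR-002", "generic")]
print(untraced(rtm))  # ['SSR-002']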
4.2.1.3.3 Contractual Interfaces

Management planning for the SSS function includes the identification of contractual interfaces and obligations. Each program has the potential to present unique challenges to the system safety and software development managers. These may range from an RFP that does not specifically address the safety of the system to contract deliverables that are extremely costly to develop. Regardless of the challenges, the tasks needed to accomplish a SSS program must be planned to meet both the system and user specifications and requirements and the safety goals of the program. The following contractual obligations are deemed most essential for any given contract:
- RFP
- SOW
- Contract
- CDRL
Example templates of a RFP and SOW/SOO are contained in Appendix G.

4.2.1.4 Contract Deliverables

The SOW defines the deliverable documents and products (e.g., CDRLs) desired by the customer. Each CDRL deliverable should be addressed in the SSPP, including the activities and process steps required to produce it. Completion of contract deliverables is normally tied to the acquisition life cycle of the system being produced and to the program milestones identified in the Systems Engineering Management Plan (SEMP). The planning performed by the system safety manager ensures that the system safety and software safety processes provide the necessary data and output for the successful accomplishment of the plans and analyses. The system safety schedule should track closely to the SEMP and be proactive and responsive to both the customer and the design team. Deliverables should be addressed individually on the safety master schedule and within the SSPP, whether they are contractual deliverables or internal documents required to support the development effort.

Because future procurements under acquisition reform will generally invoke few, if any, specific military and DOD standards or deliverables, the PA must ensure that sufficient deliverables are identified and contractually required to meet programmatic and technical objectives. This activity must also specify the content and format of each deliverable item. As existing government standards transition to commercial standards and guidance, the safety manager must ensure that sufficient planning is accomplished to specify the breadth, depth, and timeline of each deliverable [normally defined by Data Item Descriptions (DID)]. The breadth and depth of the deliverable items must provide the audit trail necessary to ensure that acceptable levels of safety risk are achieved (and remain visible) during development, test, support transition, and maintenance in the out-years. The deliverables must also provide the necessary evidence, or audit trail, for the validation and verification of SSRs. The primary method of maintaining a sufficient audit trail is the use of a developer's safety data library (SDL). This library is the repository for all
safety documentation. Appendix C, Section C.1, describes the contractual deliverables that should be contained in the SDL.

4.2.1.5 Develop Software Hazard Criticality Matrix

The criteria described in MIL-STD-882 provide the basis for the HRI (described in Paragraph 3.6.1.4). This example may be used for guidance, or an alternate HRI may be proposed. Whatever HRI methodology a program uses, it must be able to graphically delineate the boundaries between acceptable, allowable, undesirable (i.e., serious), and unacceptable (i.e., high) risk. Figure 4-11 provides a graphical representation of a risk acceptance matrix. In this example, the hazard record database contains 10 hazards that currently remain in the unacceptable categories of safety risk (categories IA, IB, IC, IIA, IIB, and IIIA). The example explicitly states that the hazards in the unacceptable range must be resolved.
HAZARD CATEGORIES

Frequency of Occurrence    I - Catastrophic    II - Critical    III - Marginal    IV - Negligible
A - Frequent                      0                  0                0                 0
B - Probable                      4                  1                0                 0
C - Occasional                    5                 16                0                 0
D - Remote                       24                 25                3                 0
E - Improbable                    1                  1                1                 0

Legend:
- IA, IB, IC, IIA, IIB, IIIA: UNACCEPTABLE; the condition must be resolved. Design action is required to eliminate or control the hazard.
- ID, IIC, IID, IIIB, IIIC: UNDESIRABLE; a Program Manager decision is required. The hazard must be controlled or the hazard probability reduced.
- IE, IIE, IIID, IIIE, IVA, IVB: ALLOWABLE, with Program Manager review. Hazard control is desirable if cost effective.
- IVC, IVD, IVE: ACCEPTABLE without review. Normally not cost effective to control. The hazard is either negligible or can be assumed not to occur.

Figure 4-11: Example of Risk Acceptance Matrix
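The short sketch below, illustrative only, tallies a hazard record database against the matrix above and reproduces the example's count of 10 unresolved hazards in the unacceptable cells.

# Illustrative tally of hazard records against the example matrix above.
UNACCEPTABLE_CELLS = {"IA", "IB", "IC", "IIA", "IIB", "IIIA"}

# Non-zero cell counts taken from the example matrix: {severity+frequency: count}
hazard_counts = {"IB": 4, "IC": 5, "ID": 24, "IE": 1,
                 "IIB": 1, "IIC": 16, "IID": 25, "IIE": 1,
                 "IIID": 3, "IIIE": 1}

unresolved = sum(n for cell, n in hazard_counts.items() if cell in UNACCEPTABLE_CELLS)
print(unresolved)  # 10 hazards remain in the unacceptable categories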
The ability to categorize specific hazards within the above matrix is based upon the ability of the safety engineer to assess their severity and likelihood of occurrence. Historically, the traditional HRI matrix did not include the influence of software on hazard occurrence. The rationale for this is twofold: first, when the HRI matrix was developed, software was generally not used in safety-critical roles; second, applying failure probabilities to software is impractical. The traditional HRI uses the hazard severity and probability of occurrence to assign the HRI, with probabilities defined in terms of mean time between failures, probability of failure per operation, or probability of failure during the life cycle, depending on the nature of the system. This relies heavily on the ability to obtain component reliability information from engineering sources. However, applying probabilities of this nature to software, except in purely qualitative terms, is impractical. Therefore, software requires an alternative methodology. Software does not fail in the same manner as hardware. It does not wear out, break, or have increasing tolerances that
result in failures. Software errors are generally either requirements errors (failure to anticipate a set of conditions that leads to a hazard, or the influence of an external component failure on the software) or implementation errors (coding errors, incorrect interpretation of design requirements). If the conditions occur that cause the software to not perform as expected, a failure occurs. Reliability prediction therefore becomes a prediction of when the specific conditions will occur that cause the software to fail. Without the ability to accurately predict the occurrence of a software error, alternate methods of hazard categorization must be available when the hazard possesses software causal factors.

During the early phases of the safety program, the prioritization and categorization of hazards is essential for the allocation of resources to the functional areas possessing the highest risk potential. This section of the Handbook presents a method of categorizing hazards having software causal factors strictly for the purpose of allocating resources to the SSS program. This methodology does not provide an assessment of the residual risk associated with the software at the completion of development. However, the execution of the safety program, the development and analysis of SSRs, and the verification of their implementation in the final software provide the basis for a qualitative assessment of the residual risk in traditional terms.

4.2.1.5.1 Hazard Severity

Regardless of the hazard causal factors (hardware, software, human error, or environment), the severity of the hazard remains constant. The consequence of a hazard's occurrence remains the same regardless of what actually caused the hazard, unless the design of the system somehow changes the possible consequence. Because the hazard severity is unchanged, the severity table presented in Paragraph 3.6.1.2 (Table 3-1, Hazard Severity) remains an applicable criterion for determining hazard criticality for hazards having software causal factors.

4.2.1.5.2 Hazard Probability

The difficulty of assigning useful probabilities to faults or errors in software requires a supplemental method of determining hazard risk where software causal factors exist. Figure 4-12 demonstrates that, in order to determine a hazard probability, the analyst must assess the software causal factors in conjunction with the causal factors from hardware, human error, and other sources. The determination of hardware and human error causal factor probabilities remains constant in terms of historical best practices (although there is significant disagreement regarding assigning probabilities to human error). Regardless, the risk assessment process must address the contribution of the software to the hazard's cumulative risk.
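As one possible illustration (not a method this Handbook prescribes), a qualitative assessment might carry the worst-case likelihood across the hardware, human error, and software causal factor classes, with the software contribution expressed qualitatively through its control category rather than through a numeric failure rate:

# Illustration only; the Handbook does not prescribe this convention. One
# qualitative approach carries the worst-case (most frequent) likelihood
# among the causal factor classes forward to the hazard.
LIKELIHOOD_ORDER = ["IMPROBABLE", "REMOTE", "OCCASIONAL", "PROBABLE", "FREQUENT"]

def combined_likelihood(causal_factor_likelihoods):
    """Return the most frequent (worst-case) qualitative likelihood."""
    return max(causal_factor_likelihoods, key=LIKELIHOOD_ORDER.index)

# Hardware, human error, and software causal factors assessed qualitatively:
print(combined_likelihood(["REMOTE", "OCCASIONAL", "REMOTE"]))  # OCCASIONAL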
[Figure 4-12 contents: a root hazard with three causal factor branches (software, hardware, and human error), each with examples.]
Figure 4-12: Likelihood of Occurrence Example

There have been numerous methods of determining the software's influence on system-level hazards. Two of the most popular are presented in MIL-STD-882C and RTCA DO-178B (see Figure 4-13). These do not specifically determine software-caused hazard probabilities; instead, they assess the software's control capability within the context of the software causal factors. In doing so, each software causal factor can be labeled with a software control category for the purpose of helping to determine the degree of autonomy that the software has over the hazardous event. The SSS Team must review these lists and tailor them to meet the objectives of the SSP and the software development program.
MIL-STD-882C Software Control Categories:
(I) Software exercises autonomous control over potentially hazardous hardware systems, subsystems, or components without the possibility of intervention to preclude the occurrence of a hazard. Failure of the software, or a failure to prevent an event, leads directly to a hazard's occurrence.
(IIa) Software exercises control over potentially hazardous hardware systems, subsystems, or components, allowing time for intervention by independent safety systems to mitigate the hazard. However, these systems by themselves are not considered adequate.
(IIb) Software item displays information requiring immediate operator action to mitigate a hazard. Software failure will allow, or fail to prevent, the hazard's occurrence.
(IIIa) Software item issues commands over potentially hazardous hardware systems, subsystems, or components, requiring human action to complete the control function. There are several redundant, independent safety measures for each hazardous event.
(IIIb) Software generates information of a safety-critical nature used to make safety-critical decisions. There are several redundant, independent safety measures for each hazardous event.
(IV) Software does not control safety-critical hardware systems, subsystems, or components and does not provide safety-critical information.

RTCA DO-178B Software Levels:
(A) Software whose anomalous behavior, as shown by the system safety assessment process, would cause or contribute to a failure of system function resulting in a catastrophic failure condition for the aircraft.
(B) Software whose anomalous behavior, as shown by the system safety assessment process, would cause or contribute to a failure of system function resulting in a hazardous/severe-major failure condition for the aircraft.
(C) Software whose anomalous behavior, as shown by the system safety assessment process, would cause or contribute to a failure of system function resulting in a major failure condition for the aircraft.
(D) Software whose anomalous behavior, as shown by the system safety assessment process, would cause or contribute to a failure of system function resulting in a minor failure condition for the aircraft.
(E) Software whose anomalous behavior, as shown by the system safety assessment process, would cause or contribute to a failure of function with no effect on aircraft operational capability or pilot workload. Once software has been confirmed as level E by the certification authority, no further guidelines of this document apply.

Figure 4-13: Software Control Categories (MIL-STD-882C) and Software Levels (RTCA DO-178B)
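Control categories of this kind can be carried directly in analysis tooling. The sketch below encodes the MIL-STD-882C categories shown above; the one-line summaries compress the full definitions, and the enum structure itself is illustrative.

# MIL-STD-882C software control categories (summaries compressed from the
# definitions above; the enum representation is illustrative).
from enum import Enum

class ControlCategory(Enum):
    I = "Autonomous control; no possibility of intervention"
    IIa = "Control with time for intervention by independent safety systems"
    IIb = "Displays information requiring immediate operator action"
    IIIa = "Issues commands; human action completes the control function"
    IIIb = "Generates safety-critical information for safety-critical decisions"
    IV = "No control of, and no information about, safety-critical items"

for category in ControlCategory:
    print(f"{category.name}: {category.value}")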
Once again, the concept of labeling software causal factors with control capabilities is foreign to most software developers and programmers. They must be convinced that this activity has utility in the identification and prioritization of software entities that possess safety implications. In most instances, the software development community wants the list to be as simple and short as possible. The most important aspect of the activity must not be lost: the ability to categorize software causal factors for determining hazard likelihood and the design, code, and test activities required to mitigate the potential software cause. Autonomous software with functional links to catastrophic hazards demands more coverage than software that influences low severity hazards.

4.2.1.5.3 Software Hazard Criticality Matrix

The SHCM, shown in Figure 4-14, assists PMs, the SSS Team, and the subsystem and system designers in allocating resources to the software safety effort.
[Figure 4-14 contents: the SHCM arrays the MIL-STD-882C software control categories (I, IIa, IIb, IIIa, IIIb, and IV, as defined in Figure 4-13) against hazard severity. For control category I, the indices shown are: Catastrophic = 1, Critical = 1, Marginal = 3, Negligible = 5. The legend maps the indices to resource guidance: high risk (significant analysis and testing resources); medium risk (requirements and design analysis and in-depth testing required); moderate risk, two index values (high levels of analysis and testing acceptable with Managing Activity approval); and low risk (acceptable).]
Figure 4-14: Software Hazard Criticality Matrix, MIL-STD-882C

The SHCM is not an HRI matrix for software. The higher the Software Hazard Risk Index (SHRI) number, the fewer the resources required to ensure that the software will execute safely in the system context. The software control measure of the SHCM also assists in the prioritization of software design and programming tasks. However, the SHRI's greatest value may be during the functional allocation phase. Using the SHRI, software safety can influence the design to do the following (a resource-allocation sketch follows the list below):
- Reduce the autonomy of the software control of safety-critical aspects of the system,
- Minimize the number of safety-critical functions in the software, and
- Use the software to reduce the risk of other hazards in the system design.
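As an illustration of how the SHCM drives resource allocation, the sketch below maps SHRI values to the resource guidance in the matrix legend. Only the control category I severity row (1, 1, 3, 5) is reproduced in Figure 4-14 as shown here, so the lookup is limited to that row; pairing the legend entries with index values 1 through 5 is an assumption.

# Illustrative SHRI lookup, limited to control category I, the only severity
# row reproduced from Figure 4-14. Guidance strings follow the matrix legend;
# pairing legend entries with index values 1-5 is an assumption.
SHRI_CATEGORY_I = {"CATASTROPHIC": 1, "CRITICAL": 1, "MARGINAL": 3, "NEGLIGIBLE": 5}

SHRI_GUIDANCE = {
    1: "High risk: significant analysis and testing resources",
    2: "Medium risk: requirements and design analysis, in-depth testing required",
    3: "Moderate risk: high levels of analysis and testing, MA approval",
    4: "Moderate risk: high levels of analysis and testing, MA approval",
    5: "Low risk: acceptable",
}

def category_i_guidance(severity: str) -> str:
    """Resource guidance for an autonomous (category I) software causal factor."""
    return SHRI_GUIDANCE[SHRI_CATEGORY_I[severity.upper()]]

print(category_i_guidance("Marginal"))  # Moderate risk: high levels of analysis...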
If the conceptual design (architecture) shows a high degree of autonomy over safety-critical functions, the software safety effort requires significantly more resources; the systems engineering team can therefore consider this factor in the early design phases. By reducing the number of software modules containing safety-critical functions, the developer reduces the portion of the software requiring safety assessment, and thus the resources required for that assessment. The systems engineering team must balance these issues with the required and desired capabilities of the system. Too often, developers rush to use software to control functionality when non-software alternatives would provide the same capabilities. While the safety risk associated with the non-software alternatives must still be assessed, the process is likely to be less costly and resource intensive.

4.2.2 Management

SSS program management (Figure 4-15), like SSP management, begins as soon as the SSP is established and continues throughout the system development. Management of the effort requires a variety of tasks or processes, from establishing the SwSWG to preparing the SAR. Even after a system is placed in service, management of the SSS effort continues, to address modifications and enhancements to the software and the system. Often, changes in the use or application of a system necessitate a reassessment of the safety of the software in the new application.
[Figure 4-15 process chart contents. Primary task: Software Safety Program Management. Inputs: ORD/MENS, Statement of Work, Statement of Objectives, Request for Proposal, and safety policy. Primary sub-tasks (iterative): establish and manage the SwSWG; update safety program plans; monitor the safety program; provide, update, or develop presentations; provide safety management inputs to software test plans. Outputs: input to the SOW, SOO, and RFP; SwSWG membership charter; updates to the SSPP, safety program schedule, SEMP, and TEMP; and input to SPRA reviews. Critical interfaces: program management; system safety program management; customer personnel; supplier personnel.]
Effective management of the safety program is essential to the effective and efficient reduction of system risk. This section discusses the managerial aspects of the software safety tasks and provides guidance in establishing and managing an effective software safety program. Initiation of the SSP is all that is required to begin the activities pertaining to the software safety tasks. Initial management efforts parallel portions of the planning process, since many of the required efforts (such as establishing a hazard tracking system or researching lessons learned) need to begin very early in the safety program. Safety management pertaining to software generally ends with the completion of the program and its associated testing, whether that is a single phase of the development process (e.g., concept exploration) or continues through the development, production, deployment, and maintenance phases. In the context of acquisition reform, this means that management of the effort must continue throughout the system life cycle. From a practical standpoint, management efforts end when the last safety deliverable is completed and accepted by the customer. Management efforts may then revert to a caretaker status in which the PFS or safety manager monitors the use of the system in the field and identifies potential safety deficiencies based on user reports and accident/incident reports. Even if the developer has no responsibility for the system after deployment, the safety program manager can develop a valuable database of lessons learned for future systems by identifying these safety deficiencies.

Establishing a software safety program includes establishing a Software Safety Working Group (SwSWG). This is normally a sub-group of the SSWG and is chaired by the PFS or safety manager. The SwSWG has overall responsibility for the following:
- Monitoring and control of the software safety program,
- Identifying and resolving hazards with software causal factors,
- Interfacing with the other IPTs, and
- Performing the final safety assessment of the system design.
A detailed discussion of the SwSWG is found in the supplemental information in Appendix C, paragraph C.5.2. It is in this phase of the program that the Software Safety Plan of Action and Milestones (POA&M) is developed, based on the overall software development program POA&M and in coordination with the system safety POA&M. Milestones from the software development POA&M, particularly design reviews and transition points (e.g., from unit code and test to integration), determine the milestones required of the software safety program. The SwSWG must ensure that the necessary analyses are complete in time to provide input to the various development efforts and to ensure effective integration of software safety into the overall software development process. The overall Phases, Milestones and Processes Chart, discussed in Paragraph 4.3 below, identifies the major program milestones from MIL-STD-498 and -499 with the associated software safety program events.

One of the most difficult aspects of software safety program management is the identification and allocation of the resources required to adequately assess the safety of the software. In the early planning phases, the configuration of the system and the degree of interaction of the software with the potential hazards in the system are largely unknown. The higher the degree of software
involvement, the greater the resources required to perform the assessment. To a large extent, the software safety program manager can use the early analyses of the design, participation in the functional allocation, and the high-level software design process to ensure that the amount of safety-critical software is minimized. If safety-critical functions are distributed throughout the system and its related software, the software safety program must encompass a much larger portion of the software. However, if the safety-critical functions are confined to as few software modules as practical, the level of effort may be significantly reduced.

Effective planning and integration of the software safety efforts into the other IPTs will significantly reduce the software safety-related tasks that must be performed by the SSS Team. Incorporating the generic SSRs into the plans developed by the other IPTs allows them to assume responsibility for their assessment, performance, and/or evaluation. For example, if the SSS Team provides the quality assurance generic SSRs to the Software Quality Assurance (SQA) IPT, the SQA IPT will perform compliance assessments against requirements, not just for safety, but for all aspects of the software engineering process. In addition, if the SQA IPT buys into the software safety program and its processes, it significantly supplements the efforts of the software safety engineering team, reduces their workload, and avoids duplication of effort. The same is true of the other IPTs, such as CM and Software Test and Evaluation. In identifying and allocating resources to the software safety program, the software safety program manager can perform advance planning, establish the necessary interfaces with the other IPTs, and identify individuals to act as software safety representatives on those IPTs.

Identifying the number of analyses and the level of detail required to adequately assess the software involves a number of processes. Experience with prior programs of a similar nature is the most valuable resource that the software safety program manager has for this task. However, every program development is different and involves different teams of people, PA requirements, and design implementations. The process begins with the identification of the system-level hazards in the PHL. This provides an initial idea of the concerns that must be assessed in the overall safety program. From the system specification review process, the functional allocation of requirements results in a high-level distribution of safety-critical functions and system-level safety requirements to the design architecture. The safety-critical functions and requirements are thus known in general terms. Software functions that have a high safety criticality (e.g., warhead arming and firing) will require a significant analysis effort that may include code-level analysis. Safety's early involvement in the design process can reduce the amount of software that requires analysis; however, the software safety manager must still identify and allocate resources to perform these tasks. Safety requirements that conflict with others (e.g., reliability) require trade-off studies to achieve a balance between desirable attributes. The software control categories discussed in Paragraph 4.2.1.5 provide a useful tool for identifying software that requires high levels of analysis and testing.
Obviously, the more critical the software, the higher the level of effort necessary to analyze, test, and assess the risk associated with it. In the planning activities, the SwSWG identifies the analyses necessary to assess the safety of specific modules of code. The best teacher for determining the level of effort required is experience. These essential analyses do not need to be performed by the software engineering group and may be assigned to another group or person with the specialized expertise necessary. The SwSWG will have to provide the necessary safety-related
guidance and training to the individuals performing the analysis, but only to the extent necessary for them to accomplish the task.

One of the most important aspects of software safety program management is monitoring the activities of the safety program throughout system development to ensure that tasks are on schedule and within cost, and to identify potential problem areas that could impact the safety or software development activities. The software safety manager must:
- Monitor the status and progress of the software and system development effort to ensure that program schedule changes are reflected in the software safety program POA&M.
- Monitor the progress of the various IPTs and ensure that the safety interface to each is working effectively. When problems are detected, either through feedback from the software safety representative or from other sources, the software safety manager must take the necessary action to mitigate the problem.
- Monitor and receive updates regarding the status of analyses, open Hazard Action Reports (HAR), and other safety activities on a weekly basis. Significant HARs should be discussed at each SwSWG meeting and their status updated as required.

A key factor that the software safety program manager must keep in mind is the tendency of many software development efforts to compress the test schedule as slippage occurs in the software development schedule. He or she must ensure that the safety test program is not compromised as a result of the slippage.
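A lightweight status check can support the weekly monitoring described above by flagging open Hazard Action Reports that have not been updated within the review cycle. The sketch below is hypothetical; the field names and the seven-day threshold are assumptions.

# Hypothetical weekly HAR status check; field names and the seven-day
# threshold are assumptions, not a prescribed format.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class HazardActionReport:
    har_id: str
    status: str        # e.g., "OPEN" or "CLOSED"
    last_update: date

def stale_open_hars(hars, as_of, max_age_days=7):
    """Return open HARs not updated within the weekly review cycle."""
    cutoff = as_of - timedelta(days=max_age_days)
    return [h.har_id for h in hars if h.status == "OPEN" and h.last_update < cutoff]

hars = [HazardActionReport("HAR-12", "OPEN", date(1999, 6, 1)),
        HazardActionReport("HAR-13", "CLOSED", date(1999, 6, 14))]
print(stale_open_hars(hars, as_of=date(1999, 6, 15)))  # ['HAR-12']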
SPRA requirements vary with the PA and are often governed by PA directives. The contract will generally identify the review requirements; however, it is the responsibility of the DA to ensure that the software safety program incorporates the appropriate reviews into the SwSPP. The system safety manager must identify the appropriate SPRA and review schedule during the development process. SPRA reviews generally involve significant effort beyond the other software safety tasks. The DA must determine the level of effort required for each review and the support that will be required during the review, and incorporate these into the SwSPP. For complex systems, multiple reviews are generally required to update the SPRA and ensure that all of the PA requirements are achieved. Although SPRA requirements may vary from PA to PA, some require a technical data package and a briefing to a review board. The technical data package may be a SAR or may be considerably more complex. The DA must determine whether it is to provide the technical data package and briefing, or whether that activity is to be performed independently. In either event, safety program personnel may be required to participate in or attend the reviews to answer specific technical questions that may arise. Normally, the presenters require several weeks of preparation for SPRA reviews. Preparation of the technical data package and supporting documentation requires time and resources even when the data package is a draft or final version of the SAR.
4.3 Software Safety Task Implementation

Figure 4-16: Software Safety Task Implementation

The credibility of the software safety engineering activities within the hardware and software development project depends on the credibility of the individuals performing the managerial and technical safety tasks. It also depends on the identification of a logical, practical, and cost-effective process that produces the safety products needed to meet the safety objectives of the program. The primary safety products include the hazard analyses, the initial safety design requirements, the functionally derived safety design requirements (based on hazard causes), the test requirements that produce evidence for the elimination and/or control of safety hazards, and the identification of safety requirements pertaining to operations and support of the product. The managerial and technical interfaces must agree that the software safety tasks defined in this section will provide
the documented evidence for the resolution of identified hazards and failure modes in design, manufacture (code, in the case of software), fabrication, test, deployment, and support activities. The process must also thoroughly define and communicate the residual safety risk to program management at any point during each phase of the development life cycle.

4.3.1 Software Safety Program Milestones
The planning and management of a successful software safety program is supplemented by the safety engineering and management program schedule. The schedule should include near-term and long-term events, milestones, and contractual deliverables. The schedule should also reflect the system safety management and engineering tasks that are required for each life cycle phase of the program and that are required to support DOD milestone decisions. Specific safety data to support special safety boards or safety studies for compliance and certification purposes is also crucial. Examples include FAA certification, U.S. Navy Weapon System Explosives Safety Review Board approval, Defense Nuclear Agency Nuclear Certification, and U.S. Air Force Non-Nuclear Munitions Safety Board approval. The PM must track each event, deliverable, and/or milestone to ensure that safety analysis activities occur early enough in the development process to facilitate cost-effective and technically feasible design solutions. These activities ensure that the SSS program will meet the desired safety specifications of the program and system development activities.

Planning for the SSP must include the allocation of resources to support the travel of safety management and engineers. The contractual obligations of the SOW, in concert with the processes stated in the program plans and the required support of program meetings, dictate the scope of safety involvement. With the limited funds and resources of today's programs, the system safety manager must determine and prioritize the level of support allocated to program meetings and reviews. Planning the budgeted travel allocations for the safety function must assess the number of meetings requiring support, the number of safety personnel required to attend, and the physical locations of the meetings. Likewise, budgets must include adequate funds for support tools, such as database programs for hazard tracking and analysis tools. The resource allocation activity becomes complicated if priorities are not established up front through the determination of the meetings to support, the tools required, and other programmatic activities. Once priorities are established, safety management can alert program management to meetings that cannot be supported due to budget constraints, for concurrence or the reallocation of resources.

Figure 4-17 provides an example milestone schedule for a software safety program. It graphically depicts the relationship of safety-specific activities to the acquisition life cycles of both system and software development. Remember that each procurement is unique and will have subtle differences associated with managerial and technical interfaces, timelines, processes, and milestones. This schedule is an example, with specific activities and time relationships based on typical programs. Program planning must integrate program-specific differences into the schedule and support the practical assumptions and limitations of the program.
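A milestone schedule of the kind shown in Figure 4-17 can also be tracked programmatically. The sketch below pairs example safety products with the technical reviews named in the schedule; the pairings follow the text of this section (e.g., PHA-derived requirements before the PDR, complete safety design requirements before the CDR) but are illustrative, not a mandated schedule.

# Illustrative tracker pairing safety products with program reviews; the
# pairings are examples consistent with this section, not a mandated schedule.
SAFETY_PRODUCTS_DUE = {
    "SRR": ["Preliminary Hazard List"],
    "PDR": ["PHA-derived safety requirements", "Initial safety design requirements"],
    "CDR": ["Safety design requirements (complete)", "SSHA of the detailed design"],
    "TRR": ["Software safety test results", "Draft Safety Assessment Report"],
}

def overdue(milestone, delivered):
    """List safety products not yet delivered at the given review."""
    return [p for p in SAFETY_PRODUCTS_DUE[milestone] if p not in delivered]

print(overdue("PDR", {"PHA-derived safety requirements"}))
# ['Initial safety design requirements']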
[Figure 4-17 contents: an example schedule aligning the DODI 5000.2R acquisition phases and milestones (MS 0 through MS 5, with SPRA reviews at each), the software development activities (preliminary design, detailed design, code and CSU test, CSCI test, interoperability test, system integration test, OPEVAL, maintenance, and product improvement programs), the technical reviews (SRR, SDR, SSR, PDR, CDR, TRR), and the software safety activities: software safety program management; Preliminary Hazard List development; tailoring of the generic software safety requirements; Preliminary Hazard Analysis (PHA); Safety Requirements Criteria Analysis; Subsystem Hazard Analysis (SSHA) of the software preliminary and detailed designs; System Hazard Analysis (SHA); software safety test planning; software safety testing and analysis; and software safety assessment, verifying that the software is developed in accordance with applicable standards and criteria. The safety processes for Phase 4 are an iteration of the processes performed for Phases 0 through 3.]
Figure 4-17: Example POA&M Schedule

As described in Paragraph 4.2.2, the POA&M will also include the various safety reviews, PA reviews, internal reviews, and the SwSWG meetings. The software safety assessment milestones are generally geared to the SPRA reviews, since the technical data package required is, in fact, either the draft or final software-related SAR. Hazard analysis schedules must reflect program milestones where hazard analysis input is required. For example, SSRs resulting from generic requirements tailoring (documented in the SRCA) must be available as early as practical in the design process for integration into design, programmatic, and system safety documents. Specific safety requirements from the PHA and an initial set of safety design requirements must be available prior to the PDR for integration into the design documents. System safety and software safety must participate in the system specification review and provide recommendations during
the functional allocation of system requirements to hardware, software, operation, and maintenance. After functional allocation is complete, the Software Engineering IPT, with the help of the software safety representative, will develop the SRS. At this point, the SSS Team should have the preliminary software safety assessment complete, with hazards identified and initial software-related HRIs assigned. The SwSWG updates the analyses as the system development progresses; however, the safety design requirements (hardware, software, and human interfaces) must be complete prior to the CDR. Requirements added after the CDR can have a major impact on program schedule and cost.

During the development of the SRS, the SSS Team initiates the SSHA and its evaluation of the preliminary software design. This preliminary design analysis assesses the system and software architecture and provides design recommendations to reduce the associated risk. This analysis provides the basis for input to the design of the Computer Software Configuration Items (CSCIs) and the individual software modules. At this point, the software safety engineer (SwSE) must establish a SAF for each CSCI or Computer Software Unit (CSU), depending on the complexity of the design, to document the analysis results generated. As the design progresses and detailed specifications become available, the SSS Team initiates a SSHA that assesses the detailed software design. The team analyzes the design of each module containing safety-critical functions, and the software architecture, in the context of hazard failure pathways, and documents the results in the SAF. For highly safety-critical software, the analysis will extend to the source code to ensure that the intent of the SSRs is properly implemented.

The development of safety test requirements begins with the identification of SSRs. SSRs can be safety design requirements, generic or functional (derived) requirements, or requirements generated from the implementation of hazard controls, which will be discussed in Paragraph 4.3.5. SSRs incorporated into software documentation automatically become a part of the software test program. However, throughout the development, the software safety organization must ensure that the test plans and procedures will provide the desired validation of each SSR, demonstrating that the implementation meets the intent of the requirement. Section 4.4 provides additional guidance on the development of the safety test program. Detailed inputs regarding specific safety tests are derived from the hazard analyses, causal factor analysis, and the definition of software hazard mitigation requirements. Safety-specific test requirements are provided to the test organization for development of specific test procedures to validate the SSRs. The analysis associated with this phase begins as soon as test data from the safety tests is available.

The SHA begins as soon as functional allocation of requirements occurs and continues through the completion of system design. Specific milestones for the SHA include providing safety test requirements for integration testing to the test organization and detailed test requirements for interface testing. The latter will be required before testing of the software with other system components begins.
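As noted above, every SSR incorporated into the software documentation must be covered by planned verification before the relevant test milestones. For example purposes only, the following hypothetical Python sketch illustrates such a coverage check; the record fields and identifiers are illustrative, not a mandated format:

    # Hypothetical sketch: flag SSRs that lack a planned test procedure.
    from dataclasses import dataclass, field

    @dataclass
    class SSR:
        ssr_id: str                  # e.g., "SSR-042" (illustrative identifier)
        text: str                    # requirement statement
        source: str                  # "generic", "functional", or "hazard control"
        test_procedures: list = field(default_factory=list)  # linked test IDs

    def unverified(ssrs):
        """Return the SSRs with no linked test procedure."""
        return [s for s in ssrs if not s.test_procedures]

    ssrs = [
        SSR("SSR-001", "Initialize outputs to a safe state at power-up",
            "generic", ["TP-110"]),
        SSR("SSR-042", "Inhibit stores release unless master arm is set",
            "hazard control"),
    ]

    for s in unverified(ssrs):
        print(f"{s.ssr_id} ({s.source}): no planned verification - raise to SwSWG")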
4.3.2 Preliminary Hazard List Development

The PHL is a contractual deliverable on many programs and is described in Appendix C, paragraph C.1.3. This list is the initial set of hazards associated with the system under development. Development of the PHL requires knowledge of the physical and functional requirements of the system and some foreknowledge of the conceptual system design. The
documentation of the PHL helps to initiate the analyses that must be performed on the system, subsystems, and their interfaces. The PHL is based upon the review of analyses of similar systems, lessons learned, potential kinetic energies associated with the design, design handbooks, and user and systems specifications. The generated list also aids in the development of initial (or preliminary) requirements for the system designers and the identification of programmatic (technical or managerial) risks to the program. Figure 4-18 illustrates the PHL development process.
[Figure 4-18 graphic: the primary task, Preliminary Hazard List Development, with an iterative loop. Inputs: OCD/OR, DOP, statement of objectives, design standards, generic requirements, hazard lists, lessons learned, draft PHL, functional allocations, and safety inputs from domain experts. Primary sub-tasks: establish and manage the SwSWG; update safety program plans; monitor the safety program; provide, update, or develop presentations; and determine functional hazards. Outputs: inputs to the PHL, TEMP, SEMP, SSHA, draft CRLCMP, RHA/SRCA, trade studies, and SPRA reviews; the initial SCFL; design options; and domain-specific functional hazards. Critical interfaces: the System Safety Working Group, the Software Safety Working Group, and domain engineers.]
Figure 4-18: Preliminary Hazard List Development

The development of the PHL is an integrated engineering task that requires cooperation and communication between functional disciplines and among systems, safety, and design engineers. This task is accomplished through the assessment and analysis of all preliminary and current data pertaining to the proposed system. From a documentation perspective, the following should be available for review:

- Preliminary system specification
- Preliminary product specification
- User requirements document
- Lessons learned
- Analysis of similar systems
- Prior safety analyses (if available)
- Design criteria and standards
From the preceding list of documentation and functional specifications, system safety develops a preliminary list of system hazards for further analysis. Although the identified hazards may appear general or immature at this time, this is normal for the early phase of system development. As the hazards are analyzed against system physical and functional requirements, they will mature to become the hazards fully documented in the PHA, SSHA, SHA, and the Operating and Support Hazard Analysis (O&SHA). A preliminary risk assessment of the PHL hazards will help determine whether trade studies or design options must be considered to reduce the potential for unacceptable or unnecessary safety risk in the design.

In addition to assessing the information in preliminary documents and databases, the safety engineer should hold technical discussions with systems engineering to help determine which of the ultimate functions associated with the system are safety-critical. Functions that should be assessed include manufacturing, fabrication, operations, maintenance, and test. Other technical considerations include transportation and handling, software/hardware interfaces, software/human interfaces, hardware/human interfaces, environmental health and safety, explosive and other energetic components, product loss prevention, and nuclear safety considerations.

This effort begins with the safety engineer analyzing the functionality of each segment of the conceptual design. From the gross list of system functions, the analyst must determine the safety ramifications of loss of function, interrupted function, incomplete function, function occurring out of sequence, or function occurring inadvertently. This activity also provides for the initial identification of safety-critical functions. The rationale for the identification of the system's safety-critical functions (list) is addressed in the identification of safety deliverables (Appendix C, Paragraph C.1.4). It should be reiterated that this activity must be performed as a part of the defined software safety process. This process step ensures that the program manager, systems and design engineers, software developers, and software engineers are aware of each function of the design that is considered safety-critical or that has a safety impact. It also ensures that each individual module of code performing these functions is officially labeled as safety-critical and that defined levels of design and code analysis and test activity are mandated. An example of the possible safety-critical functions of a tactical aircraft is provided in Figure 4-19.

There are two benefits to identifying the safety-critical functions of a system. First, the identification assists the SSS Team in the categorization and prioritization of safety requirements for the software architecture early in the design life cycle. If the software performs or influences a safety-critical function, that module of code becomes safety-critical; this eliminates emotional discussions about whether individual modules of code must be designed and tested to specific and extensive criteria. Second, it reduces the level of activity and resource allocation for software code not identified as safety-critical. This benefit is cost avoidance.
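For example purposes only, the functional assessment described above can be sketched as the application of a fixed set of failure-condition guide phrases to each system function; the function names below are hypothetical illustrations (see Figure 4-19 for representative safety-critical functions):

    # Hypothetical sketch: apply failure-condition guide phrases to each
    # system function to seed candidate PHL entries for analyst screening.
    GUIDE_PHRASES = [
        "loss of function",
        "interrupted function",
        "incomplete function",
        "function occurring out of sequence",
        "function occurring inadvertently",
    ]

    functions = [                      # illustrative functions only
        "landing gear extension",
        "stores separation",
        "fuel shut-off to engine",
    ]

    for func in functions:
        for phrase in GUIDE_PHRASES:
            # Each candidate is screened for credibility and safety impact.
            print(f"Candidate hazard: {phrase} - {func}")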
At this phase of the program, specific ties from the PHL to the software design are quite premature. Specific ties to the software are normally made through hazard causal factors, which have yet to be defined at this point in the development. However, there may be identified hazards that have preliminary ties to safety-critical functions, which in turn are functionally linked to the preliminary software design architecture. If this is the case, the functional link should be adequately documented in the safety analysis for further development and analysis. At the same time, there are likely to be specific generic SSRs applicable to the system (see Appendix E).
These requirements are available from multiple sources and must be specifically tailored to the program as they apply to the system design architecture.
SAFETY-CRITICAL FUNCTIONS
** for example purposes only **

- Altitude Indication
- Attitude Indication
- Air Speed Indication
- Engine Control
- Inflight Restart After Flameout
- Engine Monitor and Display
- Bleed Air Leak Detection
- Engine/APU Fire Detection
- Fuel Feed for Main Engines
- Fire Protection/Explosion Suppression
- Flight Control - Level III Flying Qualities
- Flight Control - Air Data
- Flight Control - Pitot Heat
- Flight Control - Fuel System/CG Control
- Flight Control - Cooling
- Flight Control - Electrical
- Flight Control - Hydraulic Power
- Canopy Defog
- Adequate Oxygen Supply to the Pilot
- Stores and/or Expendables Separation
- Safe Gun and Missile Operation
- Armament/Expendables for System Ground Operations
- Emergency Canopy Removal
- Emergency Egress
- Ejection Capability
- Landing Gear Extension
- Ground Deceleration
- Structure Capability to Withstand Flight Loads
- Freedom From Flutter
- Stability in Pitch, Roll and Yaw
- Heading Indication
- Fuel Quantity Indication
- Fuel Shut-off to Engine and APU
- Engine Anti-Ice
- Caution and Warning Indications
Figure 4-19: An Example of Safety-Critical Functions

4.3.3 Tailoring Generic Safety-Critical Requirements

Figure 4-20 depicts the software engineering process for tailoring the generic safety-related software requirements list. Generic SSRs are those design features, design constraints, development processes, "best practices," coding standards and techniques, and other general requirements that are levied on a system containing safety-critical software, regardless of the functionality of the application. The requirements themselves are not safety specific (i.e., not tied to a specific system hazard); in fact, they may just as easily be identified as reliability requirements, good coding practices, and the like. They are, however, based on lessons learned from previous systems in which failures or errors either resulted in a mishap or a potential mishap. The PHL, as described above, may help determine the disposition or applicability of many individual generic requirements.

The software safety analysis must identify the applicable generic SSRs necessary to support the development of the SRS as well as programmatic documents (e.g., the SDP). A tailored list of these requirements should be provided to the software developer for inclusion in the SRS and other documents. Several individuals, agencies, and institutions have published lists of generic safety requirements for consideration. To date, the most thorough is included in Appendix E, Generic Requirements and Guidelines, which includes STANAG 4404 (NATO Standardization Agreement, Safety Design Requirements and Guidelines for Munitions Related Safety-Critical Computing Systems), the Mitre (Ada) list, and other language-specific requirements. These
requirements should be assessed and prioritized according to their applicability to the development effort. Whatever list is used, the analyst must assess each item individually for compliance, noncompliance, or non-applicability. On a particular program, the agreed-upon generic SSRs should be included in the SRCA and in the appropriate high-level system specifications.
[Figure 4-20 graphic: the primary task, Tailoring the Generic Safety-Critical Software Requirements List, with an iterative loop. Inputs: general requirements documents, design standards, generic safety-critical software requirements lists, lessons learned, similar systems' hazard analyses, and mishap data. Primary sub-tasks: obtain existing generic requirements and guidelines; determine applicability of requirements to the system under development; generate additional generic requirements; review the draft SDP, SEMP, and TEMP; and obtain evidence to support compliance. Outputs: inputs to the TEMP, SEMP, PHL, PHA, SSHA, SDP, CRLCMP, SPRA reviews, and software test plans and generic test requirements. Critical interfaces: the System Safety Working Group, the Software Safety Working Group, software quality assurance, domain engineers, and test and evaluation.]
Figure 4-20: Tailoring the Generic Safety Requirements

Figure 4-21 is an example of a worksheet form that may be used to track generic SSR implementation. If the program is complying with the requirement, the physical location of the requirement and the physical location of the evidence of implementation must be cited in the EVIDENCE block. If the program is not complying with the requirement (e.g., too late in the development to impose a safety kernel) or the requirement is not applicable (e.g., an Ada requirement when developing in assembly language), a statement of explanation must be included in the RATIONALE block. An alternative mitigation of the source risk that the requirement addresses should be described if applicable, possibly pointing to another generic requirement on the list.

A caution regarding the blanket approach of levying the entire list of guidelines or requirements on a program: each requirement will cost the program critical resources, including people to assess and implement it, budget for the design, code, and testing activities, and program schedule. Unnecessary requirements will impact these factors and result in a more costly product with little or no benefit. Thus, these requirements should be assessed and prioritized according to their applicability to the development effort. Inappropriate requirements, which have not been adequately assessed, are unacceptable. The analyst must assess each requirement individually and introduce only those that apply to the development program.

Some requirements necessitate only a sampling of evidence to demonstrate implementation (e.g., no conditional GO-TO statements). The lead software developer will often be the appropriate
individual to gather the implementation evidence of the generic SSRs from those who can provide it. The lead software developer may assign SQA, CM, V&V, human factors, software designers, or systems designers to fill out individual worksheets. The entire tailored list of completed forms should be approved by the SSE, submitted to the SDL, and referenced by the SAR. This provides evidence of generic SSR implementation.
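For example purposes only, the completeness rules just described (evidence cited for YES; rationale for NO or N/A) might be captured in a tracking structure such as the following hypothetical Python sketch, which mirrors the worksheet blocks of Figure 4-21 below but is not a required schema:

    # Hypothetical sketch: a tracking record mirroring the worksheet
    # blocks of Figure 4-21, with the completeness rules stated above.
    from dataclasses import dataclass

    @dataclass
    class GenericSSRWorksheet:
        item: str
        compliance: str       # "YES", "NO", or "N/A"
        rationale: str = ""   # required when compliance is NO or N/A
        evidence: str = ""    # required (with its location) when YES
        action: str = ""      # functional area with responsibility

        def is_complete(self) -> bool:
            if self.compliance == "YES":
                return bool(self.evidence)
            return bool(self.rationale)  # NO and N/A must carry a rationale

    w = GenericSSRWorksheet(
        item="Has an analysis of the macros been performed?",
        compliance="N/A",
        rationale="No macros in the design (checklist review, 1/11/96)",
        action="Software Development",
    )
    assert w.is_complete()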
INTENDED COMPLIANCE:  YES [ ]   NO [ ]   N/A [X]
Item: Coding Requirements Issues - Has an analysis (scaling, frequency response, time response, discontinuity, initialization, etc.) of the macros been performed?
Rationale (if NO or N/A, describe the rationale for the decision and resulting risk): There are no macros in the design (discussed at checklist review 1/11/96).
Evidence (if YES, describe the kind of evidence that will be provided; specify sampling percentage per SwSPP, if applicable):
Action (state the functional area with responsibility): Software Development
POC:
Figure 4-21: Example of a Generic Software Safety Requirements Tracking Worksheet

4.3.4 Preliminary Hazard Analysis Development

The PHA is a safety engineering and software safety engineering analysis performed to identify and prioritize hazards and their causal factors in the system under development. Figure 4-22 depicts the safety engineering process for the PHA. There is nothing unique about the software aspects other than the identification of the software causal factors.

Many safety engineering texts provide guidance for developing the PHA; for brevity, this Handbook does not describe those processes. Many techniques provide an effective means of identifying system hazards and determining their causal factors. A note of caution: each methodology focuses on a process that will identify a substantial portion of the hazards, but none of the methodologies is complete. For example, energy-barrier trace analysis is an effective process; however, it may lead the analyst to neglect certain energy control functions. In addition, in applying this technique, the analyst must consider not only the obvious forms of energy (chemical, electrical, mechanical, etc.) but also such energy forms as biological. Many analysts use the life cycle profile of a system as the basis
for the hazard identification and analysis. Unless the analyst is particularly astute, he/she may miss subtle system hazards and, more importantly, causal factors. Appendix B provides a list of references that includes many texts describing the PHA.
[Figure 4-22 graphic: the primary task, Preliminary Hazard Analysis (PHA), with an iterative loop. Inputs: SOW/SOO/RFP, risk assessment criteria, draft SS and S/SDD, lessons learned, similar systems' hazard analyses, mishap data, life cycle environmental profile, the PHL, and tailored requirements lists. Primary sub-tasks: identify system-level causal factors; identify software-level causal factors; apply the HRI and prioritize hazards; apply risk assessment criteria and categorize hazards; link hazard causal factors to requirements; and develop design recommendations. Outputs: input to the RHA/SRCA, PHA updates, input to software design, the SDP, the preliminary software design analysis, and trade studies; safety-specific software design requirements; hazard action records; and a prioritized hazard list. Critical interfaces: the Software Safety Working Group and domain engineers.]
Figure 4-22: Preliminary Hazard Analysis

The PHA becomes the springboard documentation that launches the SSHA and SHA analyses as the design matures and progresses through the development life cycle. Preliminary hazards can be eliminated (or officially closed through the SSWG) if they are deemed inappropriate for the design. Remember that this analysis is preliminary and is used to provide early design considerations that may or may not be derived or matured into design requirements.

Throughout this analysis, the PHA provides input to trade-off studies. Trade-off analyses performed in the acquisition process are listed in Table 4-1 (DSMC, 1990). These analyses offer alternative considerations for performance, producibility, testability, survivability, compatibility, supportability, reliability, and system safety during each phase of the development life cycle. System safety inputs to trade studies include the identification of potential or real safety concerns and the recommendation of credible alternatives that may meet all (or most) of the requirements while reducing overall safety risk.

The entire unabridged list of potential hazards developed in the PHL is the entry point of the PHA. The list should be scrubbed for applicability and reasonability as the system design progresses. The first step is to eliminate from the PHL any hazards not applicable to the system [e.g., if the system uses a titanium penetrator vice a Depleted Uranium (DU) penetrator, eliminate the DU-related hazards]. The next step is to categorize and prioritize the remaining hazards according to the (System) HRI. The categorization provides an initial assessment of system hazard severity and probability of occurrence and, thus, the risk. The probability assessment at this point is usually subjective and qualitative.
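For example purposes only, this categorization and ordering step can be illustrated with a short Python sketch; the hazard entries are hypothetical, and simple ordered scales stand in for the program-specific HRI matrix:

    # Hypothetical sketch: order PHL hazards by severity category and
    # qualitative probability level; the program's HRI matrix then maps
    # each (severity, probability) pair to a risk index.
    SEVERITY = {"I": 1, "II": 2, "III": 3, "IV": 4}         # Catastrophic..Negligible
    PROBABILITY = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}  # Frequent..Improbable

    hazards = [                                # illustrative entries only
        ("Inadvertent stores release", "I", "D"),
        ("Loss of altitude indication", "II", "C"),
        ("Canopy defog failure", "III", "B"),
    ]

    prioritized = sorted(hazards, key=lambda h: (SEVERITY[h[1]], PROBABILITY[h[2]]))
    for name, sev, prob in prioritized:
        print(f"{sev}{prob}: {name}")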
Table 4-1: Acquisition Process Trade-off Analyses

- Mission Area Analysis / Concept Exploration: select technology; reduce alternative configurations to a testable number; select component/part designs; select test methods; select Operational Test & Evaluation quantities
- Production: examine effectiveness of all proposed design changes; perform make-or-buy, process, rate, and location decisions
After the prioritized list of preliminary hazards is determined, the analysis proceeds with determining the hardware, software, and human interface causal factors for each individual hazard, as shown in Figure 4-23.
[Figure 4-23 graphic: a tree segment for the hazard "Inadvertent Stores Release" branching into hardware (H/W), software (S/W), and human factors (H/F) causal factor legs, with "Faulty Latch" shown as a hardware causal factor.]
Figure 4-23: Hazard Analysis Segment

This differentiation of causes assists in the separation and derivation of specific design requirements for implementation in software. For example, as the analysis progresses, the analyst may determine that software or hardware could subsequently contribute to a hardware causal factor. A hardware component failure may cause the software to react in an undesired manner, leading to a hardware-influenced software causal factor. The analyst must consider all paths to ensure coverage of the software safety analysis.

Although this tree diagram can represent the entire system, software safety is particularly concerned with the software causal factors linked to individual hazards, in addition to ensuring that the mitigation of each causal factor is traced from requirements to design and code, and
subsequently tested. These preliminary analyses and the subsequent system and software safety analyses identify when software is a potential cause of, or contributor to, a hazard, or will be used to support the control of a hazard. At this point, trade-offs evolve: it should become apparent whether hardware, software, or human training best mitigates the first-level causal factors of the PHL item (the root event that is undesirable). This causal factor analysis provides insight into the best functional allocation within the software design architecture.

It should be noted that requirements designed to mitigate hazard causal factors do not have to map one-to-one; that is, one software causal factor does not necessarily yield one software control requirement. Safety requirements can be one-to-one, one-to-many, or many-to-one in terms of controlling hazard causal factors to acceptable levels of safety risk. In many instances, designers can use software to compensate for hardware design deficiencies or where hardware alternatives are impractical. As software is considered cheaper to change than hardware, software design requirements may be imposed to control specific hardware causal factors. In other instances, one design requirement (hardware or software) may eliminate or control numerous hazard causal factors (e.g., some generic requirements). This is extremely important to understand, as it illuminates the importance of not accomplishing hardware safety analysis and software safety analysis separately. A system-level or subsystem-level hazard can be caused by a single causal factor or a combination of many causal factors. The safety analyst must consider all aspects of what causes the hazard and what will be required to eliminate or control it. Hardware, software, and human factors usually cannot be segregated from the hazard and cannot be analyzed separately.

The analysis performed at this level is integrated into the trade-off studies to allow the programmatic and technical risks associated with various system architectures to be determined. Both software-initiated causes and human error causes influenced by software input must be adequately communicated to the digital systems engineers and software engineers to identify software design requirements that preclude the initiation of the root hazard identified in the analysis. The software development team may have already been introduced to the applicable generic SSRs. These requirements must address how the system will react safely to operator errors, component failures, functional software faults, hardware/software interface failures, and data transfer errors. As detailed design progresses, however, functionally derived software requirements will be defined and matured to specifically address causal factors and failure pathways to a hazardous condition or event. Communication with the software design team is paramount to ensure adequate coverage in preliminary design, detailed design, and testing.

If a PHL is executed on a system that has progressed past the requirements phase, a list or a tree of identified software safety-critical functions becomes helpful to flesh out the fault tree (or whatever tool is used to represent the hazards and their causal factors). In fact, the fault tree method is one of the most useful tools for identifying specific causal factors in both the hardware and software domains. During the PHA activities, the link from the software causal factors to the system-level requirements must be established.
If there are causal factors that, when inverted descriptively, cannot be linked to a requirement, they must be reported back to the SSWG for additional consideration and for the development and incorporation of additional requirements or implementations into the system-level specifications.
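For example purposes only, the linkage check described above lends itself to simple mechanization in the hazard tracking system; the following hypothetical Python sketch uses illustrative record shapes and identifiers, not a mandated schema:

    # Hypothetical sketch: report software causal factors that cannot be
    # linked to a system-level requirement, for SSWG disposition.
    from dataclasses import dataclass, field

    @dataclass
    class CausalFactor:
        description: str
        domain: str                              # "H/W", "S/W", or "H/F"
        linked_requirements: list = field(default_factory=list)

    @dataclass
    class Hazard:
        title: str
        causal_factors: list = field(default_factory=list)

    hazard = Hazard("Inadvertent stores release", [
        CausalFactor("Release command issued without arm consent", "S/W",
                     linked_requirements=["SSR-042"]),
        CausalFactor("Stale target data accepted after timing fault", "S/W"),
    ])

    for cf in hazard.causal_factors:
        if cf.domain == "S/W" and not cf.linked_requirements:
            print(f"Unlinked software causal factor - report to SSWG: {cf.description}")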
The hazards are formally documented in a hazard tracking database record system. The records include the description of the hazard, its causal factors, the effects of the hazard (possible mishaps), and the preliminary design considerations for hazard control. Controlling causal factors reduces the probability of occurrence of the hazard. Performing the analysis includes assessing hazardous components, safety-related interfaces between subsystems, environmental constraints, operation, test and support activities, emergency procedures, test and support facilities, and safety-related equipment and safeguards. A suggested PHA format (Figure 4-24) is defined by the CDRL and can be included in the hazard tracking database. This is only a summary of the analytical evidence that needs to be progressively included in the SDL to support the final safety and residual risk assessment in the SAR.
PAGE 1
Analysis Phase:
Hazard Description:
Hazard Cause: Hardware / Software / Human Error / Software-Influenced Human Error
Hazard Effect:
Figure 4-24: Example of a Preliminary Hazard Analysis

The PHA becomes the input document and information source for all other hazard analyses performed on the system, including the SSHA, SHA, and the O&SHA.

4.3.5 Derive System Safety-Critical Software Requirements

Safety-critical SSRs are derived from known safety-critical functions, tailored generic SSRs, and hazard causal factors determined from previous activities. Figure 4-25 identifies the software safety engineering process for developing the SRCA.
[Figure 4-25 graphic: the primary task, Develop Software Requirements Criteria Analysis, with an iterative loop. Inputs: draft SS and S/SDD, draft SDP/SQAP/QAPP, draft PHA, draft CRLCMP, the tailored generic safety-specific requirements list, and the initial safety-critical function (SCF) list. Primary sub-tasks: develop safety design requirements; tie design requirements to causal links; recommend design restrictions/limits; identify safety-critical software functions; identify causal links to software; perform path analysis; identify influences on the safety-critical path; and allocate SCFs to identified hazards. Outputs: input to the RHA/SRCA; the safety-specific requirements list; inputs to the SS and S/SDD, the TEMP, the OOD process, SPRA reviews, the reliability, availability, and maintainability plans, and the CRLCMP; and the SCF list. Critical interfaces: the Software Safety Working Group, domain engineers, software quality assurance, software V&V, T&E, and CM, and maintainability.]
Figure 4-25: Derive Safety-Specific Software Requirements

Safety requirement specifications identify the specific requirements and the decisions made, based upon the level of safety risk, the desired level of safety assurance, and the visibility of software safety within the developer organization. Methods for doing so depend upon the quality, breadth, and depth of the initial hazard and failure mode analyses, and on lessons learned and/or derived from similar systems. As stated previously, the generic list of requirements and guidelines establishes the starting point that initiates the system-specific SSR identification process. Identification of system-specific software requirements is the direct result of a complete hazard analysis methodology (see Figure 4-26).

SSRs are derived from four sources: generic lists, analysis of system functionality (safety design requirements), causal factor analysis, and implementation of hazard controls. The analysis of system functionality identifies those functions in the system that, if not properly executed, can result in an identified system hazard; the correct operation of these functions is therefore critical to the safety of the system, making the related safety design requirements safety-critical as well. The software causal factor analysis identifies lower-level design requirements that, based on their relationship to safety-critical functions or their place in the failure pathway of a hazard, are safety-related as well. Finally, design requirements developed to mitigate other system-level hazards (e.g., monitors on safety-critical functions in the hardware) are also SSRs.

The SwSE must present the SSRs to the customer via the SwSWG, prior to their implementation, for concurrence with the assessment that they eliminate or resolve the hazardous condition to acceptable levels of safety risk. For most SSRs, there must be a direct link between the requirement and a system-level hazard. The following paragraphs provide additional guidance on developing SSRs other than the generics.