Space System Failures
TOR-2007(8617)-1
29 June 2007
Prepared by
P. G. CHENG
Risk Assessment and Management Subdivision
Systems Engineering Division
Prepared for
EL SEGUNDO, CALIFORNIA
“It’s always the simple stuff that kills you….It’s not that they are stupid, with all the testing
systems everything looked good.”
James Cantrell, Main Engineer for the Skipper Satellite
Skipper failed because its solar panels were connected
backward (Associated Press 1996).
Failure reports routinely trace the underlying cause of mishaps to “engineering blunders” and
bemoan “inadequate reviewing.” But how can reviewers, in a few hours, find a mistake that
has eluded years of design and quality checks by the contractor and program office?
Over the years the author has published 20 volumes of Space Systems Engineering Lessons
Learned, analyzing past failures and highlighting practices required to avoid recurrences.
Another report, 100 Questions for Technical Review, suggests questions that reviewers can
use to efficiently look for errors. This report focuses on five of the most common lapses and
prescribes ways to catch them. The original “100 Questions” are shown in Appendix A; all
published lessons are available in Appendix B via hyperlinks.
Adequate reviews call for substantial supporting information, which requires reviewers to
coordinate with the contractors beforehand. Better yet, the program office and the contractors
should address each of these five areas of concern on their own.
Five Common Mistakes Reviewers Should Look Out For
Acronyms
CDR Critical design review
EPS Electrical power subsystem
ESD Electrostatic discharge
FR Final review (of an item’s as-flight configuration, encompassing unit selloff,
system selloff, and various pre-flight readiness reviews)
GN&C Guidance, navigation, and control subsystem
IRT Independent review team, a term encompassing the mission assurance team
(MAT), independent readiness review team (IRRT), and other independent
review activity teams
PDR Preliminary design review
SRR System requirements review
TRR Test readiness review
1. Could the Sign Be Wrong?
Background
Sign errors, involving orientation and phasing (polarity) of torque coils, moving mechanical
assemblies, and many other components, are a leading cause of satellite failures (Lessons 43,
53, 60, 80, 93, and 97). A spacecraft recently crashed because the engineers did not realize
an avionics sensor could only detect deceleration from a particular direction. Catastrophic
mistakes have likewise occurred during programming, database development, manufacturing,
and even post-test integration.
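One inexpensive guard against sign errors is an end-to-end polarity check: apply a known positive stimulus to each channel and confirm that the measured response has the expected sign. The following is a minimal sketch in Python, not a flight procedure. The channel names, the expected-sign table, and the threshold are hypothetical placeholders chosen for illustration.

```python
# Hypothetical end-to-end polarity check. Channel names, expected signs,
# and the threshold are illustrative assumptions, not flight values.

def check_polarity(expected_sign, measured, threshold=0.0):
    """Return the channels whose response is missing, too small, or wrong-signed.

    expected_sign: dict mapping channel name to +1 or -1.
    measured: dict mapping channel name to the response observed after a
              known positive stimulus was applied to that channel.
    threshold: responses with magnitude at or below this value are treated
               as indeterminate and reported as failures.
    """
    failures = []
    for channel, sign in expected_sign.items():
        response = measured.get(channel)
        if response is None or abs(response) <= threshold:
            failures.append(channel)      # missing or indeterminate response
        elif (response > 0) != (sign > 0):
            failures.append(channel)      # response has the wrong sign
    return failures


expected = {"torque_coil_x": +1, "torque_coil_y": +1, "torque_coil_z": -1}
observed = {"torque_coil_x": 0.8, "torque_coil_y": -0.7, "torque_coil_z": -0.9}
print(check_polarity(expected, observed, threshold=0.1))
```

In this made-up data set only `torque_coil_y` is reported, because its response is negative where a positive one was expected.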
Recommendations
2. How Will Last-Minute Configuration Changes Be Verified?
Background
Vehicle configurations often change after system testing. For example, placeholder blankets
need to be swapped out, flight connectors have to mate, and database parameters may
change. Some items, such as locking brackets to prevent flight hardware from coming loose
during ascent, can only be introduced at the launch site. Non-flight items have to be removed.
Hasty changes, especially those made in the heat of launch preparation, have caused several
failures (Lessons 3, 25, 29, 43, 61, 63, 64, 70, 79, 97, and 104), in part because late
installations and removals can be difficult to verify.
Recommendations
For more information, call Dana Speece at (310) 336-5021 or Gary Shultz at (310) 336-2342.
3. Can the Vehicle Survive a Computer Crash?
Background
• Revert to the “last known good state,” otherwise even a brief outage may cause
irreversible harm—for example, by incorrectly resetting the guidance system.
• Reboot without being stuck in an endless reset cycle.
• Switch to back-up computers after the primary side fails.
• Protect the memory, communication channels, thrusters, batteries, and thermal
subsystems to buy time for ground diagnosis and rescue. In one case, a prolonged
computer anomaly drained the batteries, but when the solar arrays were eventually
illuminated, an EPS design oversight prevented the solar arrays from charging the
battery, and the “dead bus” could not recover.
• Implement robust fault protection functions that cannot be falsely tripped, can function
after the computer locks up, and permit ground rescue. Avoid complex safemode design.
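The requirement to reboot without entering an endless reset cycle can be illustrated with a reset counter. This is a minimal sketch of one common pattern, not a description of any particular flight computer. It assumes a hypothetical counter of consecutive failed boots that survives resets, and a hypothetical limit of three attempts before the computer holds in a safe mode for ground intervention.

```python
# Hypothetical reset-loop guard. The limit and the mode names are
# illustrative assumptions.

MAX_CONSECUTIVE_RESETS = 3


def select_boot_mode(reset_count, max_resets=MAX_CONSECUTIVE_RESETS):
    """Choose the boot mode from the number of consecutive failed boots."""
    if reset_count < 0:
        raise ValueError("reset_count must be non-negative")
    return "safe" if reset_count >= max_resets else "nominal"


def simulate_boots(boot_outcomes, max_resets=MAX_CONSECUTIVE_RESETS):
    """Replay a series of boot attempts and return the mode chosen each time.

    boot_outcomes: iterable of booleans, True if that nominal boot reached
                   a healthy running state, False if it ended in a reset.
    """
    reset_count = 0
    modes = []
    for boot_ok in boot_outcomes:
        mode = select_boot_mode(reset_count, max_resets)
        modes.append(mode)
        if mode == "safe":
            break                      # stop cycling and wait for the ground
        reset_count = 0 if boot_ok else reset_count + 1
    return modes


print(simulate_boots([False] * 10))
```

With ten failing boots in a row, the simulation makes three nominal attempts and then stays in safe mode instead of cycling indefinitely.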
Recommendations
4. Is the Circuit Overcurrent Protection Adequate?
Most Applicable Review Occasions:
Background
Numerous satellites failed due to improper protection against shorting from foreign objects,
debris, unexpected contact, or plasma arcing (a high-current, low-voltage discharge in
vacuum over metal vapor). Even digital units are not immune. On-board processors
containing relays plated with pure tin have disabled several satellites after arcing triggered by
tin whiskers blew the fuses (Lessons 5, 19, 41, 47, 49, 56, 71, 75, 98, 100, and 104).
Fuses, circuit breakers, thermostats, and other devices protect upstream assets (such as the
power distribution board) from being damaged by a short, but careful design is required to
prevent them from becoming a single-point source of failure.
Recommendations
For more information, call Tom Hecht at (310) 336-1505 or Dave Landis at (310) 336-1585.
1. Zinc and cadmium plating should also be avoided due to contamination hazard.
2. Examples include double insulation, serially connected components, current limiters, etc.
5. Can Pyros Cause Unexpected Damage?
Background
Pyro designers assiduously make sure their devices will fire. However, premature pyrotechnic
firings have caused not only catastrophic mission failures, but also severe personnel
losses. Excessive shocks, flying debris, voltage surges, and post-firing shorts have also
damaged critical equipment (Lessons 7, 68, 77, 89, 95, 98, 100, 109, 110, 111, and 119).
Safe design and handling procedures are thus vital.
Recommendations
1. Conduct in-depth analysis of single-point failures, overcurrent, and sneak circuits on pyro
equipment. In particular, all inhibits (typically the ARM and FIRE relays) must be
independent, that is, not defeated by any single event, including an erroneous software
command (see footnote 3).
2. Follow Range safety requirements and guidelines set forth in MIL-HDBK-83578, including
the use of double-shielded circuits and safe-arm devices to prevent accidental ignition
by electrostatic discharge (ESD).
3. Implement safe testing procedures and mechanisms to retain ejected parts and to prevent
electrical disruption.
4. Verify that pyroshock does not cause damage, especially to nearby ordnance devices.
5. Adopt design and inspection procedures to guard against damaging transients and post-
firing conduction to chassis. Ensure that live tests before launch do not disable the drive
elements.
For more information, call Selma Goldstein at (310) 336-1013 and Ron Williamson at (310)
336-2149.
3. The same concern also applies to nonexplosively actuated mechanisms such as wax heaters.
Appendix A
Section 1: Requirements
Section 2: Heritage and “Qualification by Similarity”
2-1 Have all “heritage equipment” test and flight anomalies been resolved?
• The implication of each anomaly must be carefully addressed.
• Lessons: 41 and 65.
2-2 Have catastrophic failures that involved similar technologies been reviewed?
• Lessons: 87 and 107.
Section 3: Analysis
3-1 Have all critical analyses been placed under configuration control?
• Design changes may invalidate the original analysis.
• Lessons: 83 and 26.
3-8 Was the space environment fully accounted for?
• Examples: damping, radiation, charging, arcing, heat dissipation, refractive
index, and microgravity.
• Ground thermal insulation blankets to prevent space charge buildup.
• Lessons: 41, 42, 10, 75, and 120.
3-10 Has the electrical schematic been independently checked, from end to end?
• Mistakes sometimes occur between drawings.
• Lesson: 68.
3-15 Has a thorough safety analysis been conducted on each pyro event?
• Pyros impart a large and irreversible shock to the system and are involved
in many mission failures.
• Pyro design should be checked against available guidelines.
• The effect of pyro shock on adjacent structures and circuits must be
thoroughly validated.
• If explosive bolt cutters are used, all ejected debris should be contained.
• Lessons: 109, 110, 111, 119, 98, 89, 68, 77, and 7.
3-16 Are deployables readily tested both in 0 g and in 1 g?
• Designs that work in 0 g but not in 1 g are difficult to verify.
• During the performance of 1 g test, avoid imparting force unavailable in
space.
• Lessons: 116, 42, and 20.
3-21 Are the power distribution and grounding schemes, including over-voltage and
under-voltage limits, safe?
• All units should be protected from over- or under-voltage conditions.
• Double-check if a fuse should be installed, and carefully analyze fault
scenarios to size fuses.
• Components such as step motors and pyro circuits that experience sudden
current changes should be isolated from all other current-carrying circuits.
• Lessons: 98, 104, 114, and 117.
3-22 Are all known quirks of field programmable gate arrays (FPGAs)
accounted for?
• FPGAs have demanding electrical design rules and software interface.
• A NASA website, http://www.klabs.org/, describes common design mistakes.
• Lessons: 77 and 100.
Section 4: Failure Modes and Fault Management
4-2 Will the satellite autonomous management system and the ground controller be
provided with correct information?
• Inaccurate situation awareness can lead to wrong disposition.
• Ensure subsystems report true status to the autonomy functions.
• Lessons: 44 and 29.
4-3 Does the fault management design consider all operational possibilities?
• Example: solar array mispointing, engine abort, or eclipse transient.
• Lessons: 36 and 38.
4-7 How will the satellite handle battery undercharging?
• The satellite should be able to automatically shed non-essential loads under
low voltage.
• Even a partially deployed solar array should provide enough current to
sustain the system.
• The power regulator should be energized from the solar array, instead of
being solely dependent on the battery for housekeeping.
• Lessons: 53, 47, 67, 30, and 101.
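The first point above, automatic shedding of non-essential loads under low voltage, can be sketched as follows. This is an illustrative outline only. The load names, the essential flags, and the 24.0 V limit are hypothetical values, and a real design would also need hysteresis and a defined restoration sequence.

```python
# Hypothetical low-voltage load shedding. Load names, flags, and the
# voltage limit are illustrative assumptions.

def loads_to_shed(bus_voltage, low_voltage_limit, loads):
    """Return the names of the loads to switch off.

    bus_voltage: measured bus voltage in volts.
    low_voltage_limit: shedding starts when the bus drops below this value.
    loads: list of (name, essential) tuples; essential loads are never shed.
    """
    if bus_voltage >= low_voltage_limit:
        return []                      # bus is healthy, shed nothing
    return [name for name, essential in loads if not essential]


loads = [
    ("attitude control", True),
    ("command receiver", True),
    ("payload heater", False),
    ("science instrument", False),
]
print(loads_to_shed(23.5, 24.0, loads))
print(loads_to_shed(27.9, 24.0, loads))
```

The essential loads stay powered in both cases; only the two non-essential loads are returned when the bus falls below the limit.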
4-8 Can the fault management system itself survive major anomalies?
• Example: If a computer freezes, will fault correction software execute?
• Lesson: 35.
4-10 Can a problem in a primary unit cause the same failure in its backup?
• If the primary and redundant units share the same current feed, software, or
processor, one flaw in the primary component can cause the backup to fail
in the same way.
• Lessons: 18 and 19.
Section 5: Embedded Software and Database
5-1 Will unexpected inputs cause the software to freeze or loop endlessly?
• Examples: skipped sensor input data, data outside the expected range, or data
that does not compute.
• Software should ignore spurious inputs through filtering or limit checking.
• Consider deliberately ignoring faults if there is no possible recovery.
• Avoid permitting software to reset in response to errors. Consider error
messages in telemetry instead.
• All “IF” branches should provide an “ELSE” for the unexpected input.
• Lesson: 18.
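The limit-checking and "ELSE" guidance above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical sensor reading with known valid limits and a stored last good value. The status strings stand in for error messages that would be reported in telemetry rather than triggering a reset.

```python
import math

# Hypothetical input validation. Limits and status names are illustrative.

def validate_reading(value, low, high, last_good):
    """Return (value_to_use, status) for one sensor sample.

    Any sample that is missing, non-numeric, not a number, or outside
    [low, high] is replaced by last_good, and the status says why.
    """
    if value is None:
        return last_good, "missing"
    elif isinstance(value, bool) or not isinstance(value, (int, float)):
        return last_good, "wrong_type"
    elif math.isnan(value):
        return last_good, "not_a_number"
    elif low <= value <= high:
        return value, "ok"
    else:
        # Explicit ELSE branch: everything unexpected ends up here,
        # including infinities and out-of-range values.
        return last_good, "out_of_range"


for sample in [12.5, None, float("nan"), 9999.0, "12.5"]:
    print(validate_reading(sample, low=0.0, high=100.0, last_good=12.0))
```

Every branch returns a usable value, so a spurious input cannot leave the caller without data or send it into a retry loop.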
5-7 Are command scripts formally controlled?
• A bad command sequence can be fatal.
• Lessons: 29 and 104.
5-10 Has the flight software been tested with high-fidelity hardware in the loop, in
the flight configuration?
• The ground test bed should be configured the same as the flight computer.
At a minimum, the test bed should have the flight processor, flight
memories, flight software, flight cables, flight power management
equipment, and high-fidelity engineering model hardware.
• Test beds should include test points for measuring all signal and control
voltages and currents.
• Lessons: 19, 36, and 53.
5-12 Have all major events been scrubbed for out-of-sequence inputs?
• A signal arriving earlier or later than expected can trigger unintended timing
conflicts.
• Missing data may leave the system in an unknown state.
• Lessons: 12, 25, and 104.
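A review of out-of-sequence inputs can be supported by a simple comparison of the received event order against the expected one. The sketch below is illustrative only. The event names are hypothetical, and the problem labels are assumptions made for this example.

```python
# Hypothetical event-sequence check. Event names and labels are illustrative.

def check_sequence(expected, received):
    """Compare received events with the expected order.

    Returns a list of (index, event, problem) tuples. The index is the
    position in the received list, or None for an event that never
    arrived in its proper place.
    """
    problems = []
    position = 0          # index of the next expected event
    seen = set()          # events already accepted in order
    for index, event in enumerate(received):
        if position < len(expected) and event == expected[position]:
            seen.add(event)
            position += 1
        elif event in seen:
            problems.append((index, event, "duplicate"))
        elif event in expected:
            problems.append((index, event, "out of sequence"))
        else:
            problems.append((index, event, "unexpected"))
    for event in expected[position:]:
        problems.append((None, event, "missing"))
    return problems


expected = ["arm", "fire", "separate"]
received = ["arm", "separate", "fire"]
for problem in check_sequence(expected, received):
    print(problem)
```

Here the early "separate" signal is flagged as out of sequence, and because it never arrives after "fire", it is also reported as missing from its proper place.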
Section 6: Interfaces
6-2 Have potential incompatibilities between interfaces been analyzed early on?
• Independent analysis is often needed to overcome organizational barriers.
• Examples: coupled loads, nutational instability, and EMI.
• Lessons: 2, 11, and 33.
6-3 Are handover procedures between two sources of control well defined?
• Two pieces of equipment vying for control (or each assuming the other is
doing the job) can be dangerous.
• Conduct thorough switching analysis to ensure fail-safe transfers.
• Lessons: 105 and 81.
6-4 Are there items that could resonate with one another?
• Example: Spacecraft can mechanically resonate with the launch vehicle,
causing fatigue damage.
• Lesson: 11.
Section 7: Parts, Materials, and Manufacturing Process
7-3 Does any part, including those subcontracted, contain pure tin-plating or
cadmium?
• Tin whiskers can cause shorts and arcing and have disabled several
satellites.
• Cadmium, commonly used to plate airborne equipment, outgases in space.
• Audit vendor or subcontractor materials lists to ensure completeness.
• Lessons: 49 and 5.
7-4 Are there separable flared fittings (B-nuts) or check valves in fluid lines?
• B-nuts and check valves can leak.
• Lessons: 83 and 15.
7-5 Are cables, connectors, and circuit cards labeled and/or keyed to prevent
mismating?
• Mismating can cause inadvertent shorting during testing, even flight failure.
• Lesson: 63.
7-8 Are there procedures to prevent parts from being mixed up?
• Different parts may look alike.
• Lesson: 51.
7-9 Did a significant accident occur during manufacturing?
• Make sure the MRB thoroughly investigated the anomaly before accepting
the part as-is.
• Lesson: 6.
7-15 Are procedures adequate to prevent non-flight items or debris from being left
inside the hardware?
• Make sure there is a special tracking system for non-flight items since loose
materials have led to numerous reworks or failures.
• Lesson: 90.
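A tracking system for non-flight items can be as simple as a log that records every installation and every removal, then reports what is still on the vehicle. The sketch below assumes a hypothetical chronological log of (item, action) records; the item names are invented for illustration.

```python
# Hypothetical non-flight item log. Item names are illustrative.

def items_still_installed(log):
    """Return the set of non-flight items installed but not yet removed.

    log: chronological list of (item, action) tuples, where action is
         either "install" or "remove".
    """
    installed = set()
    for item, action in log:
        if action == "install":
            installed.add(item)
        elif action == "remove":
            installed.discard(item)
        else:
            raise ValueError(f"unknown action {action!r} for item {item!r}")
    return installed


log = [
    ("placeholder blanket A", "install"),
    ("protective cover 3", "install"),
    ("placeholder blanket A", "remove"),
]
print(sorted(items_still_installed(log)))
```

An empty result before closeout is the condition to verify; in this example one cover is still listed as installed.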
Section 8: Testing and Evaluation
8-5 Has all test data been reviewed for trends, oddities, “out-of-family” values, and
other indicators of anomalies?
• Test sets should collect data and enable automatic trending.
• Excessive current draw during electrical test (suggestive of an impending
short) and high G spikes (indicating intermittent rubbing) during acoustic
testing should receive particular attention.
• Many problems occur during the first temperature cycle. Therefore, the
results after the first cycle should be scrutinized.
• Lessons: 71, 39, and 19.
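Automatic screening for "out-of-family" values can be sketched with a robust statistic. The example below flags readings that lie far from the median, measured in units of the median absolute deviation (MAD). The choice of MAD and the factor of 5 are assumptions made for this illustration, not a method prescribed by the lessons cited, and the current readings are invented.

```python
from statistics import median

# Hypothetical out-of-family screen based on the median absolute deviation.

def out_of_family(values, k=5.0):
    """Return the indices of values more than k MADs away from the median."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)
    if mad == 0:
        # All typical values are identical; anything different stands out.
        return [i for i, d in enumerate(deviations) if d > 0]
    return [i for i, d in enumerate(deviations) if d > k * mad]


# Invented current-draw readings in amperes from six like units.
currents = [1.01, 0.99, 1.00, 1.02, 0.98, 1.60]
print(out_of_family(currents))
```

Only the last unit is flagged; its draw is well outside the spread of the other five even though no single fixed limit was specified.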
8-6 Are all test anomalies fully understood?
• Many flight failures first occur during tests but are mistakenly attributed to
“random failures” or “test set malfunctions.”
• Test equipment should be sufficiently powerful to enable unambiguous
assignment of anomaly causes.
• Lessons: 106, 120, 92, 38, 46, 55, and 56.
8-7 Have the test articles been fully inspected after testing?
• It is particularly important to inspect the hardware after vibration or
acoustic tests, thermal cycling, or live pyro firing.
• Lessons: 100, 66, and 7.
8-12 Does the test equipment allow sneak paths?
• Sneak paths via the test set can mask hardware deficiencies (by providing
gratuitous grounding or power, for example).
• If test equipment temporarily provides certain functions, independently
verify that the hardware can operate on its own.
• Test set sneak paths can also damage hardware.
• Lessons: 58, 72, 109, 111, and 119.
8-13 Have the units demonstrated an ability to start without the need of ground
equipment (plug-out) or manual intervention?
• It is particularly important to check payload, GN&C, and C&DH processors
to prevent endless looping.
• Lessons: 84 and 79.
8-15 Does the system being tested represent the flight configuration?
• Insert enough test points to compensate for items that could not be
live-tested (thrusters and deployment mechanisms, for example).
• Lessons: 85, 53, and 19.
8-16 Does the test inject sufficient off-nominal conditions to ensure the equipment
is robust?
• Examples of off-nominal conditions include current spikes, sluggish
separation wire breakage, and excessive data rate.
• Lessons: 44, 56, 94, 103, and 123.
Appendix B: Space Systems Engineering Lessons 1-124
Space Systems Engineering Lessons Learned
Lesson 1
Honeycomb Structures Should be Vented to Reduce Delamination Risk
The Problem:
Several satellites have been destroyed when their honeycomb structures failed. Examples include:
• A NASA satellite was destroyed at T+103 sec when the payload fairing reached 600°F.
During subsequent ground tests, the witness panels disintegrated (1964).
• A DOD rocket blew up shortly after launch. Later, the fairing's witness panel came apart
when tested on ground (1966).
• Another DOD satellite was severely damaged upon launch. The fairing for the next flight
was subsequently proof tested, whereupon it also burst (1981).
• Two solar array panels on a DOD program failed during qualification (1985).
• The massive hydrogen tank on an experimental reusable launch vehicle delaminated,
eventually causing the program to be cancelled (1999).
The Cause:
Lesson 2
Perform Independent Mass Property, Stability Control, and Structural Load Analyses
on Spacecraft and Launch Vehicles
The Problem:
Mistakes in mass-property and control-stability analyses have caused a large number of
launch failures. Examples include:
• Inappropriate reuse of aerodynamic coefficients (1994).
• Unanticipated structural vibration mode not filtered out (1995).
• Incorrectly simulated weight (1995).
• Underprediction of the load as well as an unexpected resonance due to wind shear (1992
and 1995).
• Unexpected increase in horizontal velocity (1996).
• Unaccounted roll mode caused by air-lit solid rocket motors (1998).
Flawed analysis has also led to numerous on-orbit anomalies.
The Cause:
Launching a satellite calls for extremely complex simulation of the mass, thermo-structural,
fluid-mechanical, propulsion, and control properties (a single subsystem can easily involve
over 100,000 equations). The state of the art in this area is far from robust: subtle
assumptions, insufficiently sophisticated techniques, or human errors can all throw the
results seriously off.
Moreover, when the satellite is integrated with the launcher, each organization must generate
parochial models but each has little insight into each other's analytical process. Costly
problems can easily arise without a clear settling of responsibility, especially with today's
emphasis on proprietary data protection.
Figure: Integrating space vehicle (SV) to launch vehicle (LV) involves complex modeling;
independent analysis is often necessary to overcome organizational barriers. (Diagram labels:
SV Drawing, SV Structural Dynamic Model, Engine Drawings, Fairing Drawings, Tank Drawings,
U/S Drawings, LV Drawings, LV Structural Dynamic Model, Coupled SV/LV Analysis, Response
Recovery Equations.)
Lessons Learned:
• Inaccuracies in mass properties, stability control, and structural loads continue to threaten
mission performance.
• To ensure correct analysis, many programs require an independent analysis. These activities
also help validate operational procedures, support flight anomaly resolution, and
overcome the organizational issues. There have been no catastrophic failures in programs
that abide by this policy, and several failures were averted thanks to independent analysis.
For more technical information, call Ray Skrinska at (310) 336-4001.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 3
Rigorously Manage and Test Software, Including the Database
The Problem:
An expensive military satellite failed to reach the right orbit because a misplaced decimal
point in the avionics database of the upper stage caused the reaction controller to fire exces-
sively, depleting its fuel.
The Cause:
As the program downsized, mission assurance functions were supposed to change from “oversight
to insight.” This transition did not successfully take place, and the problem sneaked
through all QA gates.
After the wrong constant was loaded, launch site personnel saw anomalous readings and tried
to contact the designers. However, the issue was ignored. Even during the day of launch, the
rocket showed a wrong response to the wind and to the rotation of the earth. A simple plot
could have identified the problem and averted the failure.
Lessons Learned:
• One must test actual flight hardware and software.
• The integrity of software databases is no less critical than the source codes.
• The space business is extremely complex and human error cannot be completely eliminated.
The system must be robust enough to catch the inevitable faults.
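A misplaced decimal point changes a constant by a factor of ten, which a simple automated comparison against a trusted reference set can expose. The sketch below is one possible screen, not the check used by any program. The constant names, the values, and the allowed ratio of 3 are hypothetical.

```python
# Hypothetical database screen. Constant names, values, and the ratio
# limit are illustrative assumptions.

def suspicious_constants(reference, candidate, max_ratio=3.0):
    """Flag constants that differ sharply from a trusted reference set.

    reference, candidate: dicts mapping constant name to numeric value.
    Returns a dict mapping each flagged name to a short reason.
    """
    flagged = {}
    for name, ref_value in reference.items():
        if name not in candidate:
            flagged[name] = "missing"
            continue
        new_value = candidate[name]
        if ref_value == 0 or new_value == 0:
            if ref_value != new_value:
                flagged[name] = "zero mismatch"
            continue
        if ref_value * new_value < 0:
            flagged[name] = "sign change"
            continue
        ratio = abs(new_value / ref_value)
        if ratio > max_ratio or ratio < 1.0 / max_ratio:
            flagged[name] = f"magnitude changed by a factor of {ratio:.3g}"
    return flagged


reference = {"gain_a": 1.25, "gain_b": -0.40, "bias_c": 0.002}
candidate = {"gain_a": 0.125, "gain_b": -0.41, "bias_c": 0.002}
print(suspicious_constants(reference, candidate))
```

Only `gain_a` is flagged, since its magnitude fell by a factor of ten. Small, plausible updates such as the change in `gain_b` pass through, and a flipped sign would be reported separately.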
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 4
Document Engineering Requirements As Clearly As Possible
The Problem:
Two very expensive mishaps occurred recently, in part, due to inadequate communications
between the designers and the manufacturing operation:
• The combustion chamber of a rocket engine breached because an unclear requirement
made it possible for a weak joint to pass quality assurance, leading to the loss of a $230M
commercial satellite.
• A DOD satellite was stranded in the wrong orbit because confusing drawing instructions
led technicians to apply thermal protection tape in a way that prevented stage separation.
The Cause:
In the first incident, the seams of the engine are reinforced with many metal strips. The
design requires the strips be brazed "80% per linear inch" (i.e., no big holes, see diagram),
but the drawing only specified "80%".
X-ray photos revealed that some strips were poorly brazed, but they were allowed to pass since
the requirement was thought as "80% coverage averaged over the entire length of the
reinforcement strip." The strips failed in flight.
Figure: Design intention (80% per linear inch) means there can be no big void anywhere.
Actual requirement (80%) implies that a big hole is OK as long as there is 80% coverage over
the entire length. Deleting the "per linear inch" phrase led QA to pass joints with low brazing
coverage. In flight, the defective part caused combustion chamber to breach.
In the second failure, the work instruction stated that the wrapping should be applied "within
0.5 inches of the mounting bracket flange" (instead of saying, e.g., no closer than 0.5
inches). The technicians, not knowing that the parts were to unfasten, applied the tapes as
closely to the flange as possible, making separation impossible.
Figure (as-built versus correct): Thermal tapes were too tightly wrapped over the as-built
connector and inhibited stage separation.
Lesson Learned:
• Engineers must clearly articulate their intentions and determine how the requirements
should be interpreted or could be misconstrued. This is particularly true when making
seemingly minor (Category II) changes.
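The two readings of the brazing requirement in the first incident give different verdicts on the same part. The short sketch below makes the difference concrete, using invented coverage figures for a hypothetical ten-inch strip, with one value per linear inch.

```python
# Invented brazing coverage data: one fraction per linear inch of strip.

def passes_average(per_inch, minimum=0.80):
    """Reading used by QA: coverage averaged over the whole strip."""
    return sum(per_inch) / len(per_inch) >= minimum


def passes_per_inch(per_inch, minimum=0.80):
    """Design intention: every linear inch must meet the minimum."""
    return all(coverage >= minimum for coverage in per_inch)


strip = [1.0, 1.0, 1.0, 1.0, 0.2, 1.0, 1.0, 1.0, 1.0, 1.0]

print(passes_average(strip))    # average coverage is 92%
print(passes_per_inch(strip))   # the fifth inch has only 20% coverage
```

The strip passes the averaged reading and fails the per-inch reading, which is the gap the missing "per linear inch" phrase opened.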
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 5
Avoid Pure Tin Plating
The Problem:
Pure tin plating can grow conductive filaments (whiskers) which have caused many problems.
Examples include:
• In the late 1990s, at least four commercial satellites had problems with their spacecraft
control processors (SCP), reportedly because whiskers grew on the relays and caused the
power-supply fuses to blow. In three cases, both the primary and the redundant SCPs
failed, and the satellites were lost.
• Again in the late 1990s, three DOD programs incurred costly delays: one discovered tin
whiskers in an atomic clock, the second found tin whiskers on ground lugs, and the third
saw tin whiskers forming inside thin-film capacitors.
The Cause:
Tin plating is used commercially because it forms an excellent protective layer that accepts
solder readily. Plating shops prefer pure tin over tin-lead to avoid lead disposal costs.
(Figure: tin whisker shorts. Image credit: NASA.)
However, pure tin is liable to spontaneously form conductive whiskers, which can provide an
unwanted conductive path and degrade hardware by causing shorts and even catastrophic
arcing. The whiskers appear unpredictably, without the need of an applied voltage or moisture
(unlike silver dendrites), even in vacuum. It is impossible to ensure hardware integrity by
inspection or by stress testing—the only way to prevent this problem is to eschew pure tin
plating, fused tin, and alloys with very high (greater than 97%) tin contents.
Lessons Learned:
• Prohibit pure tin plating in both flight hardware and ground equipment but assume tin will
be found.
• Ensure prime contractors flow down unambiguous plating requirements, and perform
appropriate receiving inspections.
• Purge prohibited materials from project stores and standard catalog items, paying particular
attention to the "commercial parts."
• Review subcontractor designs and part specifications to confirm that parts are safe.
• Apply conformal coatings on all exposed conducting surfaces wherever possible to inhibit
shorts and vacuum arcing.
For more technical information, call Katherine Westphal at (310) 336-8794 or Steve Frost at
(310) 336-7131.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 6
Following a Major Repair, Watch Out for Secondary Damage
The Problem:
In two launch failures, the Material Review Board (MRB) allowed repaired hardware to be
used without taking secondary damage into full account. The first incident led to the
destruction of three DOD satellites; the second mishap stranded another DOD satellite in a wrong
orbit.
The Cause:
In the first incident, a large cut was made on a rocket segment during repair, and the slit was
subsequently patched up. The engineers expected the ...
(Figure labels: restrictor, patch, cut.)
Unfortunately, the part was not oven-dried; moisture was trapped in the primary structure
and diffused back to the interface when the part was cured again. Not only was the mechanical
strength lower as a result, but the interfacial adhesion between the primary structure and
the overlap also became seriously degraded. During flight, the nozzle was unable to withstand
the motor pressure and was ejected.
Lesson Learned:
• Ad-hoc repair processes tend to be much less defined and qualified than regular
manufacturing operations. MRB reviews need to be more vigilant, and significant MRBs
should be added to the readiness review process. In particular, the possibility of secondary
damage must be taken into account.
For more technical information, call S. R. Lin at (310) 336-7697.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 7
Perform High-Fidelity System Validation Tests for Pyrotechnics
The Problem:
Explosive devices (pyros) are highly efficient, easily controlled, and can be readily stored.
However, several anomalies occurred when pyros were turned on:
1. A science mission ended during the first orbit when its infrared telescope cover was
unintentionally ejected, causing the loss of all cryogen (1999).
2. Three satellites, one for Earth observation, one for communication, and one for science,
failed due to propulsion-system ruptures induced by pyros. A propulsive valve on a fourth
similarly failed on ground (early 1990s).
3. An interplanetary probe almost fatally failed when the firing of a pyro initiator caused a
voltage surge and induced a latch-up in the redundant memory board. The mission would
have ended if the primary memory board had been affected (1989).
The Cause:
The telescope cover was ejected because a controller chip took a few milliseconds to
warm up, during which a transient was generated ...
(Figure labels: primary pyro, back-up pyro, fuel line.)
Lesson 8
Solar Arrays Must Withstand Extreme Environments
The Problem:
Solar array mishaps have disabled numerous satellites. Examples include:
• Two Earth observation satellites failed due to shorts in the solar-array system, one in 1978
and another in 1993.
• In 1999, a technology demonstration spacecraft experienced excessive solar panel
degradation that ended its mission prematurely.
• In the late 1990s, two commercial satellites suffered serious power losses, reportedly in
solar storms.
The Cause:
Solar arrays contain many fragile elements, and are exposed to wide temperature fluctuations
and other space hazards. They are thus particularly vulnerable to a host of problems that the
designers must guard against. The mishaps above were caused by faulty materials and processes
and by insufficient testing.
In the case of the commercial satellites, the wiring harnesses were squeezed into tight
feed-through holes with sharp kinks and without sufficient ...
(Figure labels: staking adhesive, insulated wire, Gr/Ep facesheet, aluminum core.)
Lesson 9
Excessive Handling Can Destroy Solid Lubricant
The Problem:
Lubricants based on molybdenum disulfide (MoS2) are used in gyros, drives, gimbals, or
other moving mechanical assemblies. Several problems involving this lubricant have been
noted, including:
• A microwave imager on a weather satellite catastrophically failed.
• A degraded sun sensor on another weather satellite caused excessive oscillation.
• The high-gain antenna on an interplanetary probe could not be fully opened.
The Cause:
MoS2 has excellent properties in space, but it oxidizes in the presence of moisture. Hence,
MoS2 is degraded either by improper handling or by prolonged storage.
Unfortunately, ground tests can fail to detect degraded lubrication because materials can
behave differently on the ground than they do in space.
The imager problem occurred because manufacturing and storage exposed the labile lubricants
in the slip-ring assembly to excessive oxidation. Furthermore, the part was stored for more
than 11 years, causing more lubricant loss. The sun sensor problem was also traced to
oxidation and contamination of the slip-ring materials during storage.
Figure: (a) High gain antenna unfurls like an umbrella. Excessive friction developed between
the pin and the socket (inset) due to loss of lubricant. (b, inverse view) The motor could not
overcome the friction and stalled, and the antenna could not open.
The high-gain antenna problem was caused by excessive handling (including vibration testing,
rib pre-loading, and four cross-country trips) that dispersed the lube. Ground testing did
not catch the problem because the vacuum test was not realistic and because the titanium pins
got some lubrication (from the contaminants in the test chamber) not available in space.
Lessons Learned:
• Operation, testing, or storage of mechanisms under nonvacuum conditions must be performed
with caution when MoS2 dry lubricant is involved.
• Follow Aerospace's handling and storage guidelines to safeguard lubricants.
Lesson 10
Design Satellites to Withstand Space Weather, Regardless of Solar Cycles
The Problem:
The space environment has caused hundreds of on-orbit anomalies, including:
• A military satellite lost power to its communications subsystem suddenly (1973).
• A weather satellite lost its primary instrument (1982).
• A foreign weather satellite lost attitude control (1988).
• A foreign communication satellite found its solar cells severely damaged (1991).
• A foreign commercial satellite was disabled for seven months after both reaction wheels
failed (1994).
• A foreign communication satellite lost power (1997).
• A foreign science satellite was abandoned when increased atmospheric drag overpowered
the attitude control system (2000).
The Cause:
The principal space weather hazards involve geomagnetic storms, which are stirred up when
large numbers of solar particles hit the Earth’s magnetic field. Storms can trigger an
electrostatic discharge (ESD) in the spacecraft: all failures cited above except the last one
involved ESDs.
Space weather hazards are often thought of as mainly driven by the 11-year solar cycles. For
example, there was extensive “satellite-killer” hype in the media in 2000 because one cycle
peaked late that year. Conversely, some people associate periods of low solar activity with
minimal weather hazards.
[Figure: Sunspot number versus year, 1970-2000, with solar maxima marked. Markers show
catastrophic failures due to charging and other weather-induced catastrophic failures.
Caption: Space Weather Hazards Can Occur Outside of Solar Max]
This belief is unfounded since space weather hazards and solar activity only marginally
correlate. Geomagnetic storms can occur anytime, not just during the height of the solar cy-
cles. Satellites can thus fail during valleys of solar cycle as easily as during peaks. Moreover,
all storm prediction efforts, including new spacecraft designed to monitor solar activities,
have been unsuccessful so far, and satellite operators cannot count on being forewarned of
weather threats.
Lessons Learned:
• Spacecraft must be designed to withstand worst-case space environments as a matter of
course.
• Satellites should be hardened against ESD, using well-established design guidelines on
structure, materials, shielding, cable interfaces, and circuits.
For more technical information, call Harry Koons at (310) 336-6519.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 11
Carefully Evaluate Satellite-Launcher Interface
The Problem:
An experimental spacecraft fell silent after having been successfully released from the launch
vehicle. This failure was deemed to have occurred because unexpectedly high vibration de-
veloped in the launch vehicle before it was air-dropped, imparting stress in the satellite
beyond its design limit.
[Figure: plot with a curve labeled "ICD Spec"]
The Cause:
This failure was caused primarily by a satellite-
Lesson 12
One Requirement, One Statement
The Problem:
Contact to an interplanetary probe was lost.
The Cause:
As the lander parachuted down, it deployed three legs, each with a sensor designed to
command the engine off upon touchdown lest the lander overturn. Leg deployment shock could
spoof the sensors into thinking the probe had landed. To prevent the confusion, the systems
spec required: “The sensors shall…(commence operation shortly before touchdown). However,
the use of the sensor data shall not begin until...(after the leg deployment completes)….”
[Figure: Landing Sequence. Descent decelerated with engine; legs deployed, sensors tricked;
software read sensors, shut down engine.]
This “However...” phrase was unfortunately not picked up by the software team or by other
subsystems, and was not specifically tested at the system level. During descent, the
deployment shock set off a status flag. When the touchdown sensing logic subsequently ran, it
was misled into thinking landing had already occurred. The descent engine shut itself off
prematurely;
the probe crashed.
The software walkthrough and integration/test did not detect this problem (logic flow dia-
grams could have helped). What’s more, a leg-deployment test failed to detect the fault
because the sensors were improperly wired at first. A rerun of the deployment test, which
might have caught the error, was not performed after rewiring.
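The latching behavior described above can be illustrated with a minimal Python sketch. It is
not the probe's flight software: the class TouchdownMonitor, its gating flag, and the
five-step timeline are hypothetical names and values, chosen only to show how a transient
captured before leg deployment completes can later be read as a landing.

```python
from dataclasses import dataclass


@dataclass
class TouchdownMonitor:
    """Hypothetical touchdown logic (illustrative names, not flight code)."""
    gate_on_leg_deployment: bool  # True implements the "However..." clause
    latched: bool = False

    def sample(self, sensor_high: bool, legs_deployed: bool) -> None:
        # Discard sensor data until leg deployment completes, if gating is on.
        if self.gate_on_leg_deployment and not legs_deployed:
            return
        if sensor_high:
            self.latched = True  # status flag stays set once tripped

    def engine_cutoff(self, sensing_enabled: bool) -> bool:
        # The engine is commanded off only after touchdown sensing is enabled.
        return sensing_enabled and self.latched


# Assumed timeline: (sensor_high, legs_deployed, sensing_enabled) per step.
TIMELINE = [
    (False, False, False),  # 0: early descent
    (True, False, False),   # 1: leg deployment shock trips the sensor
    (False, True, False),   # 2: legs locked, sensor quiet
    (False, True, True),    # 3: touchdown sensing enabled, still descending
    (True, True, True),     # 4: actual touchdown
]


def first_cutoff_step(monitor, timeline):
    """Return the step at which the engine is shut down, or None."""
    for step, (sensor_high, legs_deployed, sensing_enabled) in enumerate(timeline):
        monitor.sample(sensor_high, legs_deployed)
        if monitor.engine_cutoff(sensing_enabled):
            return step
    return None


ungated_step = first_cutoff_step(TouchdownMonitor(gate_on_leg_deployment=False), TIMELINE)
gated_step = first_cutoff_step(TouchdownMonitor(gate_on_leg_deployment=True), TIMELINE)
print(ungated_step, gated_step)
```

In this sketch the ungated monitor shuts the engine down at step 3, one step before the actual
touchdown, while the gated monitor waits until step 4. A system-level test that injects the
deployment transient, as in step 1, would exercise exactly this difference.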
Lessons Learned:
• Do not lump several requirements together—write them out separately so that each can be
tracked individually. Negative statements (e.g., “Sampling shall not begin until…”) may
cause misunderstanding and should be avoided.
• Systems engineers must take ownership of requirements and partition them to the
appropriate subsystem. Whether or not a requirement is the software’s responsibility, for
example, should not be left to the discretion of the software team.
• Systems engineering must ensure thorough end-to-end failure mode testing.
• The software review process should emphasize logic flow. Tests should exercise every
requirement to see if there are conditions that could cause the software to fail.
• Test planning needs to consider transients or spurious signals.
• When important tests are aborted or are known to be flawed, they must be rerun after the
errors are fixed. Repeat the test if any software or hardware involved is changed.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 13
Flexible Solar Arrays Are Susceptible to Thermally Induced Vibrations
The Problem:
Thermally induced vibrations of spacecraft appendages have recurred numerous times. Re-
sultant problems include:
• Two science satellites stopped spinning (early 1960s).
• Two Earth observation satellites showed large disturbances about the roll and yaw axes
whenever the spacecraft entered or exited sunlight (early 1980s).
• A space observatory had to have its solar arrays replaced on-orbit because “jitters” inter-
fered with star pointing (1993).
• A scientific satellite failed due to heating and expansion of the solar panels that damaged
the structure (1997).
The Cause:
Spacecraft equipped with long appendages or solar arrays are susceptible to attitude perturba-
tion upon entering or leaving the Earth's shadow, because large temperature gradients can
develop around the boom. The sun-facing side of the boom or array can bend and create a
torque on the satellite very rapidly, causing a flutter. Satellites with a single solar array are
most susceptible.
The space observatory mentioned above, for example, employed flexible solar arrays with
telescoping booms. A thermal gradient of as much as 25-deg C developed around the boom
circumference within one minute, causing the tip of the spar to deflect by 20 cm.
Lessons Learned:
• Flexible solar arrays and supporting equipment are sensitive to thermal environment.
• Thorough thermomechanical analyses of the solar arrays, particularly on their modal fre-
quencies, should be conducted.
• Control algorithms used to mitigate the effects of solar-array excitations should be refined.
Lesson 14
Look Beyond Specifications in Qualifying Materials by Similarity
The Problem:
Numerous failures have occurred due to deficiencies in substitution materials that were
thought to be similar to those originally specified. Some recent examples include:
• A rocket nozzle failed during test firing because a replacement insulator delaminated.
• The propulsion valves in a rocket broke down just before launch because the oxidizer
reacted with a new cleaning solvent.
• A solar array would not open in space because radiation caused a rubber spacer to become
sticky.
The Cause:
Programs sometimes must replace materials that are no longer available. It is often thought
that if the substitute meets all the specifications, it can be accepted "by similarity." This
approach can be risky; specifications usually only call out rudimentary requirements to fa-
cilitate incoming inspection—key tests used to qualify a material may be cumbersome to
repeat, and are routinely left out of the spec as new materials lots are received.
In the first incident, a supplier problem prompted the contractor to select a replacement resin
for the nozzle skirt. This new material met the applicable specification, had been used on
other programs, and had passed an array of tests in the laboratory. However, test results of the
new material were statistically different from the original material, and test conditions were
not sufficiently flight-like: many properties were measured at room temperature, whereas the
flight temperature approached 3000-deg F. Additionally, certain critical properties were not
measured, and the vital thermal expansion test was performed at too low a heating rate.
In a test firing, the flame burned through the new resin. At the time, two rockets having
nozzles made from the new materials were already being prepared for launch. Potential losses
of the satellites were narrowly averted.
Lesson Learned:
• Substitute materials should be tested under conditions that realistically simulate flight
conditions and give results comparable to those exhibited by the original material.
Lesson 15
Avoid Separable Flared Fittings
The Problem:
Tubular fittings with flared ends, commonly referred to as B-nuts and designated as AN, MS,
and MC types, are sometimes used as separable plumbing joints in rocket engines and space-
craft propulsion subsystems. These connectors are often found to leak during tests, and may
be difficult to fix. Leaky fittings have also been implicated in several in-flight malfunctions,
including the failure of a transfer vehicle.
The Cause:
Standard separable connectors are commonly used in ground systems to facilitate part
replacement. B-nuts work by converting the applied torque into a stress that physically clamps
and deforms the flared end of fitting until it fits tightly over the threaded element.
[Figure: The flared-fitting seal relies on maintaining the clamping force high enough to
deform the flare into a fit on the threaded elements.]
However, just as bolts in furniture can unscrew over time, the flared end of these fittings can
undergo "stress relaxation" and become loose, resulting in a leak. Launch vibration can also pull
[Figure: torque plot, vertical axis from 400 to 1400, annotated "1/3 drop in torque"]
How fast the seals loosen depends on the manufacturing process, storage conditions, and other
Lesson 16
Systematically Monitor and Control Contamination
The Problem:
Contamination has degraded numerous radiators, thermal coatings, solar arrays, sensors,
moving mechanical assemblies, and other components in space. Examples include:
• The sun-viewing bays of an interplanetary probe were 20-deg C hotter than anticipated.
• The radiator of a data-relay satellite became too hot.
• An instrument failed on-orbit when internal outgassing caused arcing.
• The focal plane on an early-warning satellite degraded.
• A satellite lost its orientation accuracy because three star trackers were fouled.
• The solar array output from five navigation satellites decreased more than expected.
• The wide-field planetary camera on a space telescope lost its ultraviolet capability. A
similar camera degraded during thermal vacuum test.
The Cause:
Contamination is a serious risk during all phases of a spacecraft’s life. Particulate can
accumulate during manufacturing, testing, storage, and launch. Volatile materials can be
released during vacuum tests or in space, and condense on critical surfaces. Some molecules
can react with sunlight to deposit tenacious films that darken over time.
Contamination control has historically been performed on a "best effort" basis: all "low
outgassing" materials were deemed acceptable in any application in any quantity, and
manufacturing requirements were rather arbitrary.
[Figure: Solar Absorptance (α) of Silverized Radiator Mirrors versus Years on Orbit (0 to 8),
curves A (MEO), B (HEO), and C through G, annotated "Doubling of α (raising radiator
temperature by up to 20°C)". Caption: Contamination of radiators makes electronics run
hotter. Except for curves A and B, data was obtained from GEO satellites. Satellite C used a
special design to reduce contamination.]
Today’s new sensors, which must be kept extraordinarily clean, require a quantitative con-
tamination budget flowdown throughout the entire spacecraft lifecycle. Sophisticated moni-
tors and models should be used to verify that derived cleanliness requirements are met.
Lessons Learned:
• Recognize the importance of contamination-control engineering during every phase of
development and hardware design.
• Perform contamination budget analysis, using tools derived from experimental data.
• Establish quantitative cleanliness requirements and apply cutting-edge processes to con-
trol particulate and molecular contamination.
Lesson 17
Watch Out for the Return of Leonid Micrometeoroid Storms
The Problem:
When the Earth crosses a comet’s orbit, tiny debris trailing the comet can trigger micromete-
oroid outbursts and damage satellites. For example:
• A scientific spacecraft suffered a hit and lost substantial telescope capability (1991).
• A communication satellite lost its Earth sensor and had to be abandoned, probably due to
a particle strike that triggered a power surge (1993).
The Cause:
[Figure: Flux (particles/m^2-s) versus year, 1960-2040, with peaks marked at 1966, 11/18/01,
11/19/02, and 2031 above a background level. Caption: The next Leonid storms will occur in
November 2001 and 2002. Each may have multiple bursts over approximately 16 hours.
Long-term projections remain imprecise.]
Showers with 1000 or more particles per hour are [...] peak rate approaching 100,000 per
hour. Leonid particles travel at speeds of about 70 km/sec and pose a significant threat to
satellites.
Satellite operators can mitigate risks by:
• Turning telescopes away from incoming particles, adjusting solar panels, and orienting the
satellite to minimize damage to internal hardware.
• Reviewing procedures for rebooting subsystems.
• Making sure experienced personnel are on duty during the storm.
• Turning off equipment sensitive to electrostatic discharge (ESD), and avoiding commanding
the satellite or firing thrusters during storms.
[Figure: two orientations of the spacecraft body, labeled "Fewer hits, more get in" and "More
hits, fewer get in". Caption: One way to reduce storm damage involves orienting the satellite
to face the micrometeoroids at an oblique angle. Although more surface is exposed, particles
will tend to glance off instead of penetrating into the spacecraft.]
These techniques have proved successful. In the widely publicized 1998-2000 Leonid season,
only a few minor anomalies were attributed to possible meteor strikes.
Lessons Learned:
• Awareness of the space environment situation is vital.
• Advance planning in anticipation of the coming storms is essential.
For more technical information, call Dave Desrocher at (719) 638-2280. A monograph from
The Aerospace Press, Dynamics of Meteor Outbursts and Satellite Mitigation Strategies,
discusses this issue at great length.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 18
Make Sure Critical Software Performs in its Intended Environment
The Problem:
The 1996 maiden flight of a launch vehicle ended in a crash.
The Cause:
The launcher’s flight control system, which had derived considerable heritage from the
previous generation, used two identical inertial reference controllers, including a “hot”
stand-by.
One function inherited from the legacy software computed the platform alignment before
launch. This function was no longer needed in the new generation.
The new rocket flew a different trajectory, creating an alignment bias that was too large for
the legacy code to compute. An “operand error exception” occurred.
[Figure: Flight Software Sizes of Major Programs, 1965-2005, vertical axis from 0 to 100000;
programs shown: Phase 1 DSP, DSP, UHF F/O, Milstar, SBIRS-High. Caption: As software takes
over many functions that used to be controlled by hardware, code sizes increase almost
exponentially. Software reliability thus poses a growing challenge and warrants more quality
assurance efforts.]
Such errors are common, and are typically handled by software (for example, by inserting
“likely” values). Unfortunately, although the programmers did identify the alignment bias in-
put as one of the several variables capable of causing operand errors, they chose to leave it
unprotected, probably supposing that there would be large safety margins.
More tragically, the system was designed in the belief that any fault would be due to random
hardware problems, and should be handled by an equipment swap. Thus, when the software
detected the errant and irrelevant exception, it halted the active controller and switched to the
backup. Of course, the backup immediately encountered the same error exception, and also
shut down. The launch vehicle in essence destroyed itself even though both controllers
worked perfectly.
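The common-mode failure described above can be sketched in a few lines of Python. This is an
illustration under stated assumptions, not the launcher's code: it assumes, for the sake of
example, that the operand error arises from converting a value to a 16-bit signed integer
(the text above does not specify the data type), and the function names and input values are
hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # assumed target range, for illustration


class OperandError(Exception):
    """Raised when a value does not fit the assumed 16-bit target type."""


def to_int16_unprotected(value: float) -> int:
    # Legacy-style conversion: out-of-range input raises an exception.
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OperandError(f"{value} does not fit in 16 bits")
    return result


def to_int16_saturating(value: float) -> int:
    # Protected conversion: clamp to the nearest representable value.
    return max(INT16_MIN, min(INT16_MAX, int(value)))


def run_redundant(convert, alignment_bias: float, n_units: int = 2):
    """Return the index of the first unit that survives, or None if all halt.

    Every unit runs the same software on the same input, so an exception
    treated as a hardware fault simply repeats on the backup.
    """
    for unit in range(n_units):
        try:
            convert(alignment_bias)
            return unit
        except OperandError:
            continue  # halt this unit and swap to the next identical one
    return None


print(run_redundant(to_int16_unprotected, 1000.0))    # within legacy range
print(run_redundant(to_int16_unprotected, 50000.0))   # new-trajectory bias
print(run_redundant(to_int16_saturating, 50000.0))    # protected conversion
```

With an in-range input the active unit (index 0) keeps running. With the out-of-range input
both identical units halt and the function returns None, which is the sketch's analogue of
losing both controllers. The saturating variant is one of several possible protections and is
shown only to contrast with the unprotected path.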
Lessons Learned:
• Hardware redundancy does not necessarily protect against software faults.
• Mission-critical software failures should be included in system reliability and fault
analysis.
• Software specifications should always include specific operational scenarios.
• Software reuse should be thoroughly analyzed to ensure suitability in a new environment,
and all associated documentation, especially assumptions, should be reexamined.
• Extensive testing should be performed at every level, from unit through system test, using
realistic operational and exception scenarios.
Lesson 19
Be Sure that the Architecture Isolates Faults
The Problem:
Lesson 20
Thoroughly Analyze and Test Deployables
The Problem:
Troubles associated with deployables have affected numerous satellites. For example:
• A foreign satellite could not open its solar sail, causing attitude-control errors to build up
and the mission to fail (1982).
• A comsat was abandoned after a solar array failed to deploy (1987).
• An interplanetary probe could not unfurl its high-gain antenna (1989).
• Two solar arrays of a comsat jammed, leading to an insurance claim of over $200 million
(1998).
In addition, several potential on-orbit catastrophes have been narrowly averted. Stuck deploy-
ables have been shaken loose by space-walking astronauts or by rocket burns. In 1991, the
antenna on a comsat stuck and disabled the satellite for three months, until repeated on-orbit
maneuvers finally freed it.
The Cause:
Deployables are complex mechanical equipment customized for each mission, and thus lack the
heritage of testing and usage common to electronic devices. With deployables, robust design,
thorough testing, and careful handling are vital.
The design must provide adequate force margins, including thermal and tolerance analyses, to
overcome all resistances. The 1991 anomaly cited above was caused by interference from
thermal blankets. A thermal blanket Velcro pad likewise snagged the magnetometer boom of
another satellite in 1990.
Testing is a major part of the deployment development effort. Special tests and off-loading
fixtures (such as balloons or air bearings) are frequently required to demonstrate
deployability in a zero-gravity environment. Some deployables cannot support their own weight
on Earth, and require special testing accommodations.
[Figure: deployment scheme with 2-hinge, 3-hinge, and 4-hinge lines and articulation fingers.
Caption: Deployable design should not be so complex that it cannot be verified on the ground.
The deployment scheme in the satellite depicted above was too complex to be tested, and The
Aerospace Corporation had to run an in-depth analysis to verify it. Although the deployment
proved successful in space, the contractor learned a lesson and decided to revert to simpler
schemes in the future.]
Lessons Learned:
• Make sure the design can be effectively tested.
• Avoid unconventional designs, especially those involving complex motions.
For more technical information, call Brian Gore at (310) 336-7253.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 21
Prevent Loss of Lubricating Oil and Grease During Storage and Test
The Problem:
Many failures have been caused by mishandling of liquid lubricants (oils and greases), par-
ticularly during prelaunch storage. For example:
1. The reaction wheels on several navigation satellites malfunctioned.
2. Many instruments stopped functioning when their ball-bearing cages ran dry.
3. The focusing system in a space telescope developed high torque and had to be replaced in
space.
4. A gyroscope stopped working during testing.
5. A sensor problem affected eight satellites, and caused an on-orbit failure.
6. A gimbal drive unit developed excessive noise.
The Cause:
Liquid lubricants are susceptible to physical loss and chemical degradation. Physical loss can
occur by evaporation and migration. In the first mishap above, the satellites were stored
longer than originally anticipated, and some oil was lost. Later builds switched to a less
volatile oil, and stored the wheels separately from the satellites, with their spin axes
oriented horizontally to limit migration.
Physical loss can also involve absorption. The second mishap occurred because the hardware
surfaces are porous. Oil was absorbed into them and was no longer available for lubrication.
[Figure: Reaction wheel bearing in two storage orientations, labeled "Oil Retained" and "Oil
Drips Out". Caption: The spin axes of gyros and wheels should be oriented during storage in
such a way as to ensure oil retention.]
Oil and grease can also chemically degrade and lose their ability to lubricate. Unprotected lu-
bricants have been known to polymerize (which caused mishap No. 3), oxidize (No. 4), react
with titanium surfaces (No. 5), or dissolve plastics (No. 6).
Lessons Learned:
• Minimize oil evaporation and migration during hardware storage.
• Use enough oil to sustain storage and operation needs. If porous hardware requires
lubrication, it should be thoroughly cleaned, protected from moisture, and stored in oil.
• Test high-speed moving parts in an inert environment to prevent oxidation.
• Perform materials compatibility analysis to avert chemical reactions.
• Check NASA Mechanisms Handbook (NASA/TP-1999-206988) for guidelines on
mechanical assemblies.
For more technical information, call Steve Didziulis at (310) 336-0460.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 22
Be Aware of Challenges in Silver/Zinc Battery Manufacturing and Deployment
The Problem:
Silver-zinc batteries have supplied power to many launch vehicles and upper stages over the
years. These batteries are susceptible to a variety of problems during development and manu-
facturing. In the field, batteries have splashed operators with caustic chemicals, delayed
launches, and caused a serious malfunction in an upper stage.
The Cause:
[Figure: battery cell diagram with labels Terminal, Vent, and Case]
Launch vehicles rely on primary (non-rechargeable) batteries to power avionics,
pyrotechnics, range safety, and other equipment. Silver/zinc batteries, the most common
Lessons Learned:
• Design, documentation, manufacturing, storage, and field application of batteries require
constant vigilance.
• Materials must be thoroughly screened before being incorporated in batteries.
Lesson 23
Make Sure Requirements Are Developed Correctly
The Problem:
As a planetary probe neared its objective, a potentially crippling flaw was discovered—the
designers had neglected to take the Doppler Effect into account.
The Cause:
After a seven-year journey toward one of the Saturn’s [...] an accompanying orbiter.
As the probe speeds away from the orbiter, the data signal frequency will drop slightly, due
to the Doppler shift. According to the Inquiry Board Report, this unavoidable frequency drop
was overlooked from initial project requirement determination all the way through design
specification of the orbiter’s receiver. Extensive internal and external reviews failed to
discover this oversight, in part due to a proprietary issue. Later, the design flaw escaped the
system-level test because an incorrect frequency was used.
Two and a half years after launch, a check-out of the probe indicated that the signal frequency
was outside the receiver’s bandwidth. Had the problem been unveiled on the ground, it could
have been fixed with a simple software patch. Unfortunately, the software is not accessible
in flight.
[Figure: % Cost Expended (10, 50, 100) versus Program Milestones (Concept, Dem/Val, Engr &
Mfring, Production & Deployment, Operations & Support). Caption: Most of the project’s cost
and performance are established by front-end decisions, but mistakes made there are difficult
to catch. More resources, including the most experienced personnel, should be made available
to ensure the early decisions are made properly.]
[Sidebar: Designers should thoroughly review the history of similar projects. If the probe
designers had analyzed the requirements of other deep space projects, both the importance of
the Doppler shift and the correct way to perform end-to-end test would have become obvious.]
To minimize the Doppler shift, the flight trajectory had to be changed, at considerable ex-
pense in fuel, so that the orbiter will be farther away from the probe as it descends.
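The size of the effect can be estimated with the first-order Doppler relation, shift =
-f × v / c for a transmitter receding at radial speed v. The Python sketch below applies that
relation to a link-budget style check. The carrier frequency, speeds, and receiver
half-bandwidth are hypothetical values picked for illustration; none of them come from the
mission described above.

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum


def doppler_shift_hz(carrier_hz: float, recession_speed_m_s: float) -> float:
    """First-order (non-relativistic) shift; negative when the source recedes."""
    return -carrier_hz * recession_speed_m_s / C_M_PER_S


def within_bandwidth(carrier_hz: float, recession_speed_m_s: float,
                     half_bandwidth_hz: float) -> bool:
    """True if the shifted signal still falls inside the receiver's passband."""
    return abs(doppler_shift_hz(carrier_hz, recession_speed_m_s)) <= half_bandwidth_hz


# Hypothetical numbers: 2 GHz carrier, receiver tolerates +/- 10 kHz.
CARRIER_HZ = 2.0e9
HALF_BANDWIDTH_HZ = 10_000.0

for speed in (1_000.0, 5_000.0):  # relative radial speed, m/s
    shift = doppler_shift_hz(CARRIER_HZ, speed)
    ok = within_bandwidth(CARRIER_HZ, speed, HALF_BANDWIDTH_HZ)
    print(f"{speed:7.0f} m/s -> shift {shift:10.1f} Hz, in band: {ok}")
```

Under these assumed values, the 5 km/s case shifts the signal by roughly 33 kHz and falls
outside the passband, while the 1 km/s case stays inside it. The shift scales with the
relative radial speed between transmitter and receiver, so a check of this kind belongs in the
requirements phase and in the end-to-end test.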
Lessons Learned:
• Formalize requirement development process and capture lessons.
• Provide adequate design margins and operational flexibility, such as the ability to use soft-
ware patches.
• Make sure that the hardware or software a contractor wants to reuse from another program
is indeed applicable and has a satisfactory flight history. Do not be deterred by the excuse
that details are not available because the previous program was proprietary or classified—
there are always ways to get around that hurdle.
Lesson 24
Safeguard Hardware Against Inadvertent Overtesting
The Problem:
A satellite suffered considerable damage during vibration test because worn-out equipment
misled the test operator into applying an excessive force.
The Cause:
Prior to vibrating the spacecraft, the operators first subjected it to a low-level calibration
test to compute how much force should be applied to achieve the specified acceleration.
Unfortunately, the shaker was over 40 years old, and its trunnion bearings had broken. The
slip plate came into contact with the shaker table, resulting in an interference that
attenuated the satellite’s motion.
Unaware of the malfunction, the test engineer thought a much larger force needed to be
applied to achieve the required acceleration. This force overcame the start-up friction, but
overshot the acceleration by tenfold, damaging the spacecraft.
[Figure: shaker setup with labels Shaker Body, Trunnion, Slip Plate, Shaker Base, Area of
Friction, Oil Film, and Granite Table. Caption: Friction during start-up can greatly exceed
that during operation. This problem, known as stiction, frequently causes trouble. For
example, when a tape drive is adjusted, the tape may not move until enough voltage to
overcome the stiction is applied; but then the force is too large, and the tape suddenly runs
wild.]
Lessons Learned:
• Make sure that test facilities are maintained and checked.
• Implement overtest protection (such as over-temperature trip circuits in thermal cham-
bers).
• Take risks of overtesting during vibration tests into account. In particular, large satellites
should typically be acoustically tested instead of vibration-tested to prevent damage.
• Step up vibration tests from one-third to one-half of the full level so that the required force
can be more accurately computed.
• Test procedures, setup, and data should be thoroughly checked to account for operator
mistakes and avoid damage.
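The arithmetic behind the overtest, and the kind of protection recommended above, can be
sketched as follows. This Python example assumes a linear relation between drive level and
acceleration and uses made-up calibration numbers; the function names, the unit gain, and the
1.5x abort ratio are hypothetical and are not taken from the test described in this lesson.

```python
def drive_for_target(cal_drive: float, cal_response_g: float, target_g: float) -> float:
    """Linearly extrapolate the full-level drive from a low-level calibration run."""
    if cal_response_g <= 0:
        raise ValueError("calibration response must be positive")
    return cal_drive * target_g / cal_response_g


def should_abort(measured_g: float, target_g: float, abort_ratio: float = 1.5) -> bool:
    """Overtest protection: trip when the response exceeds an assumed limit."""
    return measured_g > abort_ratio * target_g


TRUE_GAIN_G_PER_UNIT = 1.0  # assumed response of a free-moving setup
TARGET_G = 1.0
CAL_DRIVE = 0.1

healthy_reading_g = 0.1      # calibration response with no interference
attenuated_reading_g = 0.01  # friction hides 90 percent of the motion

healthy_drive = drive_for_target(CAL_DRIVE, healthy_reading_g, TARGET_G)
faulty_drive = drive_for_target(CAL_DRIVE, attenuated_reading_g, TARGET_G)

# Once the larger force breaks the friction, the hardware responds at its true gain.
actual_g = TRUE_GAIN_G_PER_UNIT * faulty_drive

print(f"healthy drive {healthy_drive:.2f}, faulty drive {faulty_drive:.2f}")
print(f"actual response {actual_g:.1f} g, abort: {should_abort(actual_g, TARGET_G)}")
```

With these assumed numbers the attenuated calibration reading inflates the extrapolated drive
by a factor of ten, and the abort check trips on the resulting response. Stepping through
intermediate levels before full level gives additional points at which the extrapolation can
be compared with the measured response.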
Lesson 25
Thoroughly Verify All Software Changes
The Problem:
A launch vehicle failed because part of a command line was left out of a software change.
The Cause:
The launch vehicle had flown successfully several times. This mission, however, had to be
launched at a particular time. Accordingly, the time variable in the software was changed
from Reference Time to Fixed Time.
[Sidebar: The failed launch was rehearsed three times, during which the console operators
could have spotted the open valve but missed it. Graphical displays, summarized telemetry
data, and error checking should be provided to allow operators to identify and diagnose
faults.]
[Figure: a text telemetry display listing on/off states for Valves 1 through 12. Captions:
Cluttered telemetry displays can confuse operators. Well-designed graphical displays allow
operators to quickly identify faults.]
Multiple updates to the ground software were made,
Lesson 26
Make Sure Hardware Analyzed Is Hardware Actually Built
The Problem:
A technology-demonstrator mission was terminated after only eight months because an over-
sight in thermal analysis was unrecognized by two projects.
Lesson 27
Control Propellant Balance
The Problem:
Dynamic instability caused by fluid imbalance has afflicted several satellites during orbit
transfer maneuvers. Examples include:
• A commercial communication satellite was stranded in a low orbit, and had to expend sig-
nificant fuel in hundreds of thruster firings to reach a geosynchronous orbit.
• A foreign satellite failed to reach geostationary orbit.
• A military communication satellite wobbled unexpectedly (but was able to recover).
The Cause:
Propulsion control is a delicate task because many parameters, such as the flow rate of
propellant in space, cannot be precisely modeled or controlled.
Several factors can trigger fluid imbalance:
• Improper fuel-load procedures. (This problem caused the first incident cited above).
• Differences in flow rates or valve responses can cause propellant to be drawn
preferentially from one tank over another. (This problem probably caused the second
mishap).
• If one tank is cooler than the other, propellant will flow into the cooler tank from the
warmer tank, causing imbalance.
[Figure: three panels (1, 2, 3). Caption: As satellites spin during transfer maneuvers, mass
imbalances coupled with centrifugal forces can cause tilting. Severe tilt can divert the
transfer thrust and prevent satellites from reaching their proper orbit.]
[Figure: tank schematic with labels Gas, Fuel, and Thruster. Caption: Feedback loops can be
designed to control gas pressure (n) or fuel flow (o) between the tanks to restore balance.
The latter method is more precise.]
Lessons Learned:
• Make sure tank loads are balanced.
• Use a single tank, if feasible, to avoid propellant migration.
• Ensure that attitude-control algorithms and mechanisms can correct dynamic instability
Lesson 28
Graphite/Epoxy Structures Are Easily Damaged by Processing Changes and Handling
Mishaps
The Problem:
Two failures involving graphite/epoxy pressure vessels occurred recently:
• A launch vehicle crashed when one of its solid boosters ruptured.
• Two solid-rocket segments failed during hydroproof testing.
The Cause:
Graphite/epoxy composites are used for trusses, pressure vessels (such as nickel-hydrogen
batteries and motor cases), and many other applications. Composite technology is relatively
new. Minor variations in fiber, resin, and processing can dramatically affect product
performance. Quality assurance is vital, yet difficult to achieve.
[Figure: impact damage to a composite, with labels Impact, Broken Fibers, and Delamination]
Lessons Learned:
• Protect graphite/epoxy pressure vessels from handling damage.
• Insist on safety margins and quality inspections for composite structures.
• Perform extensive requalification and acceptance tests to guard against subtle processing
changes.
For more technical information, call S. R. Lin at (310) 336-7697.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 29
Validate Changes in Command Script Configuration
The Problem:
Contact with a deep space observatory was lost (control was regained three months later
following a dramatic rescue; see Lesson 30).
The Cause:
The spacecraft used three gyros:
• Gyro A, to control the safe mode;
• Gyro B, to detect faults; and
• Gyro C, for normal attitude control.
The flight software was supposed to turn on the normally off Gyro A when the satellite entered
safe mode. Unfortunately, the engineer making a command procedure change did not know to
implement the enable command. A loose change-control process failed to catch the error.
[Sidebar: The Lagrange Points (figure labels: Sun; L1; L2). There are five Lagrange Points where
gravitational attractions from the Sun and Earth balance each other. The loss of control occurred
at the first Lagrange Point (L1, about 1.5 million kilometers from Earth), from which location the
space observatory monitors solar activities. The L2 point, on the night side, is suitable for
infrared astronomy.]
During a routine operation, Gyro B was accidentally set incorrectly, causing a false reading.
The on-board computer detected B’s error and put the satellite in safe mode. The fault on B
was fixed, but control shifted from C to A.
Sensed rates from Gyro A (despun, reading zero) and B (active with variable readings) soon
diverged, prompting the thruster to fire to try to null the nonexistent roll error. The effort was
futile, and the satellite entered safe mode again two hours later.
The spacecraft was designed to survive in safe mode for at least 48 hours. Nonetheless, the
operators did not pause to analyze why one anomaly followed on the heels of another. Side-
stepping the required telemetry data check that would have indicated that Gyro A was in fact
off, the operators mistook Gyro B’s variable readings as a sign of a fault, and turned it off.
With no functional gyro, control was soon lost.
Lessons Learned:
• Treat command-procedure changes with the same rigor as flight-critical software. This
includes formal configuration management, peer review with knowledgeable technical
personnel, and full command verification with an up-to-date simulator.
• Ensure change implementation timelines are consistent with staff workloads.
• Display spacecraft health and safety information clearly.
• Follow validated operations procedures, including review of all pertinent data.
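
The telemetry check that the operators sidestepped can be sketched in a few lines. This is an
illustration only, not the mission's ground software: the `GyroStatus` record, its field names,
and the 0.5 deg/s disagreement limit are all hypothetical. The point is that power status is
checked before any rate comparison is trusted.

```python
from dataclasses import dataclass


@dataclass
class GyroStatus:
    name: str
    powered: bool       # power telemetry for the gyro (hypothetical field)
    rate_deg_s: float   # sensed roll rate


def diagnose(reference: GyroStatus, monitor: GyroStatus, limit_deg_s: float = 0.5) -> str:
    """Compare two gyros, but consult power telemetry before blaming either one."""
    if not reference.powered:
        return f"{reference.name} is off: enable it before comparing rates"
    if not monitor.powered:
        return f"{monitor.name} is off: enable it before comparing rates"
    if abs(reference.rate_deg_s - monitor.rate_deg_s) > limit_deg_s:
        return "rates disagree: investigate both gyros before switching any off"
    return "rates agree"
```

With Gyro A unpowered and reading zero, the sketch reports that A is off instead of flagging the
variable readings of Gyro B as a fault.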
For more technical information, call Suellen Eslinger at (310) 336-2906.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 30
Maximize On-board Reprogrammability To Enable Fault Recovery
The Event:
An observatory lost in deep space (Lesson 29) was brought back to life following three
months of clever troubleshooting.
The Cause:
The salvage team faced daunting challenges. Following the loss of attitude control, the
satellite’s heaters had shut down, its batteries were drained, and its fuel had frozen.
Insufficient bus power made it impossible to sustain a downlink long enough for the ground
station to lock on, and rescuers were not even sure exactly which communication frequency would
work.
The team hit upon the idea of borrowing the world’s largest radar to transmit to the spacecraft,
and using another big dish to receive return signals. They set up a special wideband analyzer
over the Internet so that the downlink signal could be analyzed instantly.
The shot in the dark paid off: a faint heartbeat was received from the lost satellite. Only the
carrier signal came, however, because the on-board receivers could not lock onto the uplink
signal.
[Figure: Power-Efficient Thawing of the Hydrazine Tank. Plot of power (W, 0 to 800) versus time
(seconds, 0 to 25), showing power from the solar array and power used to thaw the tanks. The fuel
tank had to be warmed up before pipes and thrusters were, lest overpressure burst the lines.
Software changes allowed the battery to discharge current like a thermistor and turn on selective
heaters whenever power became available. Because the flight computer was off during battery
charging, the software patch had to be reloaded each time. After fine-tuning, controllers managed
to thaw the tanks with 48 heaters, using a peak power of over 500 watts!]
Ingenious commands, together with efficient power management, eventually brought the bus
voltage up to 28 V, permitting controllers to monitor spacecraft status and thaw the propulsion
system. An intricate attitude recovery maneuver was devised to allow the satellite to reacquire
the Sun, and normal operations resumed. Remarkably, despite having been alternately exposed
to extremes of -120°C and 100°C, all instruments survived!
Lessons Learned:
• Design into the satellite the flexibility to handle unforeseen emergencies, and provide
emergency reset capability for major components.
• Add emergency protection of a satellite battery system, such as low-battery-voltage cut-
out of nonessential loads.
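
A low-battery-voltage cutout of the kind recommended above might look like the following sketch.
The thresholds (24 V cutout, 26 V restore), the class name, and the load names are assumptions
chosen for illustration, not values from the mission. The gap between the two thresholds provides
hysteresis, so loads do not chatter on and off near the limit.

```python
class LowVoltageCutout:
    """Shed nonessential loads below cutout_v; restore them only above restore_v."""

    def __init__(self, cutout_v=24.0, restore_v=26.0):
        self.cutout_v = cutout_v
        self.restore_v = restore_v
        self.shed = False

    def update(self, bus_v):
        """Update and return the shed state for the latest bus voltage reading."""
        if bus_v < self.cutout_v:
            self.shed = True
        elif bus_v > self.restore_v:
            self.shed = False
        return self.shed

    def active_loads(self, bus_v, loads):
        """loads is a list of (name, essential) pairs; returns names left powered."""
        shed = self.update(bus_v)
        return [name for name, essential in loads if essential or not shed]
```

Essential loads such as the receiver stay on at any voltage; everything else waits until the bus
has recovered past the restore threshold.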
For more technical information, call Julie White at (310) 416-7229.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 31
Oxidation Can Cause Erratic Open Circuits In Solid State Devices
The Problem:
Several photodetector chips developed intermittent open (high resistance) circuits during inte-
gration.
The Cause:
This anomaly baffled experts because the chips, when returned to the foundry, often passed
diagnostic tests. Also, investigators could find no mechanical defects (such as fractures) that
might account for the open circuits.
An in-depth study revealed that the anomaly resulted from oxidation of the titanium diffusion
barrier under the gold signal line. Titanium oxide can “switch” (jumping between conducting and
insulating states), causing the circuits to open erratically.
[Figure: current versus voltage curves. Labels: Good Chip; Oxidized Chip; Self-Repaired. An
applied voltage can sometimes heal the chips temporarily by pushing the oxide layer aside.]
[Figure: Detector Geometry (top view). Labels: Aperture; Pixel; Via; Gold Trace.]
Lessons Learned:
• Protect sensitive metal layers from oxidation (caused by over-etching, for example) during
semiconductor fabrication.
• Use current-voltage profiles as a diagnostic tool—nonlinear high resistance usually indi-
cates oxidation.
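
One simple way to use a current-voltage sweep as a diagnostic is to look at how much the apparent
resistance V/I varies across the sweep. The sketch below is illustrative only; the function
names and the spread threshold of 2.0 are assumptions, and it expects positive voltages and
currents with at least one nonzero current.

```python
def resistance_spread(voltages, currents):
    """Ratio of the largest to the smallest V/I in the sweep (about 1.0 if ohmic)."""
    resistances = [v / i for v, i in zip(voltages, currents) if i != 0]
    return max(resistances) / min(resistances)


def looks_nonlinear(voltages, currents, max_spread=2.0):
    """Flag a contact whose resistance changes strongly with applied voltage."""
    return resistance_spread(voltages, currents) > max_spread
```

A linear contact gives a spread near 1; a contact that conducts poorly at low voltage and then
"heals" at higher voltage gives a much larger spread and deserves a closer look for oxidation.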
Lesson 32
One Operation, One Verification
The Problem:
A prototype reusable rocket crashed because a technician forgot to reconnect a helium line.
The Cause:
The goal of the project was to demonstrate rapid turnarounds between vertical takeoffs and
landings. A streamlined management approach kept paperwork to a minimum. A working vehicle was
built in 18 months; a modified version had already flown three times before the incident.
The flyer was supported with four legs that were actuated by an on-board helium supply. During
preflight preparation, each leg was deployed once so the control center could verify its
deployment monitors. The helium line was then disconnected to vent the actuator, the legs
stowed, and the helium line reconnected. Four technicians repeated this procedure on each leg.
Unfortunately, a technician forgot to reattach one helium line. The error was not detected
because there was no procedure to check the integrity of the system after disconnection and
reconnection. At landing, the leg failed to deploy, whereupon the vehicle toppled and exploded.
[Sidebar: A Similar Incident: Failure Caused by a Loose Screw (figure labels: Stem; Spring; Plug;
Set Screw). The precision regulator in a booster engine control system used a stem screw to
modulate gas inlet. A set screw forced a nylon plug against the stem screw threads and prevented
the stem from rotating. The regulator was reworked to repair leakage during build. The rework
instruction did not explicitly require set screw retorquing and verification. The loose set
screw caused the stem screw to unseat. The launch failed.]
The investigators found that procedures were neither well developed nor rigorously applied.
Operators and technicians used the procedures as guidelines instead of checklists. In fact, a
failure to reconnect had happened once before. Although caught, the incident was not documented.
Lessons Learned:
• Implement a discrete verification step for each critical task.
• Avoid multiple tasks within a procedure (see Lesson 12).
• Ensure a fail-safe process by applying software technology, self-checking indicators, or
positive feedback mechanisms to complex operations vulnerable to human errors.
• Document each near miss and correct its root cause.
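
The "one operation, one verification" rule can be expressed as a procedure runner in which no
step exists without its own check. This is a minimal sketch under assumed names (`run_procedure`
and the step tuples are hypothetical); a real system would read sensors instead of calling the
placeholder callables shown in the usage note.

```python
def run_procedure(steps):
    """Run each operation, then immediately run the verification paired with it.

    steps is a list of (description, operation, verification) tuples, where
    operation and verification are callables and verification returns True or False.
    The procedure stops at the first step whose verification fails.
    """
    for description, operation, verification in steps:
        operation()
        if not verification():
            raise RuntimeError(f"verification failed after: {description}")
    return True
```

For example, a "reconnect helium line" step would carry a verification that reads line pressure
or a connector-mated indicator, so a forgotten reconnection halts the procedure on the pad rather
than surfacing at landing.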
Lesson 33
Check Satellite-Launcher Compatibility As Early As Possible
The Problem:
A technology demonstrator satellite had to be substantially redesigned because the vehicle’s
stability during the orbit-transfer maneuver was not considered early on.
The Cause:
When a satellite spins, its components vibrate at a “nutation frequency” determined by the
moments of inertia and by the spin rate. Flexible parts, such as whip antennas and fluids, will
dissipate the rotational energy, particularly if these parts resonate near the nutation
frequency. Energy dissipation may lead to increased coning angles, even a flat spin.
[Figure caption: The first American satellite, Explorer 1, went into a flat spin because its
flexible antennas triggered nutational growth.]
Nutational growth caused several early satellites to malfunction. Although well under-
Lesson 34
Safeguard Hardware Against Inadvertent Overtesting (II)
The Problem:
A satellite launch had to be postponed by several months because an antenna panel delami-
nated.
The Cause:
[Figure labels: Antenna Element; Heater; Tape; Conformal Coating; Vent Blockage; Honeycomb Core]
The antenna assembly, based on a honeycomb sandwich structure, was undergoing a thermal vacuum
test. An operator set the heater voltage
Lesson 35
Implement Independent Fault Protection
The Problem:
A deep-space mission ended prematurely after excessive thruster firing depleted its fuel.
The Cause:
This spacecraft was developed by a highly motivated group operating under a rigid cost cap and
tight schedule. Flying just 22 months after being funded, it successfully circled the moon and
demonstrated many technologies.
Soon afterward, however, a maneuver triggered a numeric overflow in the processor, causing it to
erroneously fire its thrusters and freeze. A “watchdog timer” algorithm should have stopped the
thrusters from continuously firing, but did not execute because the computer had already
crashed. By the time ground operators regained control, all the fuel was gone.
A hard-wired timer, which would have stopped thruster firing, was not implemented due to the
tight schedule. Time pressure also prevented the software from being fully tested, and many
changes had to be uploaded as faults were discovered.
[Figure: data bus block diagram. Labels: Command Module; Telemetry Module; ACS/RCS Module; Memory;
Sensor (two); Solid State Recorder; Data Handler; Sensor Processor (31,000 Lines); Housekeeping
Processor (34,000 Lines).]
[Sidebar: A Rushed Job. Over 65,000 lines of flight code (only 20% inherited) were developed in
17 increments within one year, leaving little time for thorough testing.]
The overflow error had occurred thousands of times (without causing malfunctions) because
the project had to settle for an inadequate but available processor. Software changes had been
written to correct the problem, but the overstretched staff could not handle operations, anom-
aly analysis, and software repair at the same time, and the change was not loaded.
Four years later, another interplanetary probe encountered a similar anomaly. Fortunately, en-
gineers learned the lessons from the previous incident; the precautions they took allowed them
to successfully complete the mission (see Lesson 36).
Lessons Learned:
• Apply independent fault protection for critical software functions.
• Implement exception handling to protect the flight processor from aborts due to data han-
dling errors (see Lesson 18).
• Do not cut corners in testing critical flight software.
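
The difference between a software watchdog and independent protection is that the independent
timer needs nothing from the flight computer. The sketch below models that idea in software
purely for illustration; the class name and the 10-second limit used in the usage note are
assumptions, and in flight hardware the equivalent function would be a hard-wired circuit.

```python
class ThrusterCutoffTimer:
    """Limits continuous valve-open time using only the valve command and a clock.

    It does not rely on a heartbeat or any other action by the flight software,
    so it keeps working even if the processor has crashed with the valve commanded open.
    """

    def __init__(self, max_on_seconds):
        self.max_on_seconds = max_on_seconds
        self.on_since = None

    def valve_allowed(self, valve_commanded_open, now):
        """Return True if the valve may be open at time `now` (seconds)."""
        if not valve_commanded_open:
            self.on_since = None      # command dropped: timer resets
            return False
        if self.on_since is None:
            self.on_since = now       # start of a new firing
        return (now - self.on_since) <= self.max_on_seconds
```

With a 10-second limit, a command that is stuck open is cut off after 10 seconds and stays cut off
until the command is released and reissued. Resetting on command release is a simplification; a
real design would also need to bound total firing time.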
For more technical information, call Suellen Eslinger at (310) 336-2906.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 36
Implement Independent Fault Protection (II)
The Event:
An interplanetary probe recovered from a major anomaly.
Lesson 37
Aim for Realistic Schedules in Development Projects
The Problem:
A sophisticated instrument was delivered five years behind schedule.
The Cause:
Combining three previously separate sensors and aiming for greater sensitivity, this instrument
densely packed together diverse technologies. The developer contracted for delivery in three
years, even though two heritage systems each took eight years to build.
Soon after program start, the spacecraft prime contractor issued unexpectedly stringent
interface requirements. The preliminary instrument design had to be substantially altered to
meet new weight, volume, and vibration constraints.
More features (such as stiffer structures) had to be added, but design flexibility was limited
due to volume constraints. In compensation, cutting-edge electronics had to be deployed, but the
vendors could not deliver them on schedule due to manufacturing difficulties.
[Figure: Delivery Date Slip (Year), 0 to 5, plotted against program milestones (Award; CDR;
Thermal Vac; Rework/Retest; Delivery!) over about 10 years. Events shown: Faulty Mixers;
Oscillator Failure; RF Parts Delayed; Slip Ring Noise; Digital Engineer’s Death; Deployment
Redesign; EMI/Intermittent Problems; Scan Drive Problems; Chip Underperformance;
Pyroshock/Vibration Failure. Slim margins, unproven technology, tight schedules, and fixed cost
conspired to incrementally push the delivery date. Items marked with arrows each impacted the
schedule by between 9 and 18 months.]
The contractor adopted first-pass-success schedules—the design went into manufacturing
directly, skipping prototyping. Problems surfaced late (such as during thermal vacuum test-
ing), and were discovered sequentially. Despite the contractor’s heroic effort, it took eight
years before the product was delivered.
Lessons Learned:
• Provide a detailed interface specification as early as possible.
• Foster a cooperative working arrangement among contractors and proactively maintain
realistic power, weight, and volume reserves.
• Create engineering models so that problems can be discovered early.
For more technical information, call Alfred Fote at (310) 336-6926.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 38
Do Not Ignore Unexplained Test Anomalies
The Problem:
A power regulator had to be pulled from a spacecraft.
The Cause:
[Figure labels: Stand-by Battery; Solar Array Voltage]
found a circuit instability that induced the glitches.
Moreover, the design flaw would have caused the solar A
Lessons Learned:
• Test under all operating conditions—not only sunlight and eclipse operation, but transi-
tions, safe-hold mode, loadshed mode, and recovery mode.
• Strive to understand implications of test anomalies.
• Ensure perceptive instrumentation, lest test-set glitches cast doubt on results.
• Minor design changes in power supplies can result in disastrous consequences. Double-
check design changes, and perform independent analysis where practical.
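
To make "all operating conditions, including transitions" concrete, the test matrix can be
generated rather than written by hand. The sketch below assumes the five modes named in the first
lesson above and treats every ordered pair as a candidate transition; the function name is
hypothetical, and transitions that cannot physically occur would still need to be struck from the
list by the test engineer.

```python
from itertools import permutations

MODES = ["sunlight", "eclipse", "safe-hold", "loadshed", "recovery"]


def transition_test_cases(modes=MODES):
    """Return every ordered pair of distinct modes, one per candidate transition test."""
    return list(permutations(modes, 2))
```

Five modes yield 20 ordered transitions in addition to the five steady-state cases, which shows
how quickly the matrix outgrows a sunlight-and-eclipse-only test plan.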
For more technical information, call Kasemsan Siri at (310) 336-2931.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 39
Thoroughly Review Test Data for Early Indicators of Anomalies
The Problem:
[Figure labels: Brazed Junction; Fusion Weld; Copper Foil]
Lesson 40
Avoid Radio Frequency Interference
The Problem:
Signals from one program inadvertently interfered with another program.
The Cause:
This project, driven by a unique requirement, provides radio frequency (RF) intersatellite
links among its fleet.
[Figure caption: Emission from crosslinks can reach Earth and interfere with other users.]
Lessons Learned:
• Understand why requirements exist in legacy designs before discarding them.
• Coordinate spectrum planning with authorities (for example, Manager of Spectrum
Allocation at the Space Command), because not all frequency usages are public informa-
tion.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 41
Carefully Consider the Implication of Test Failures Beyond the Narrow Issues at Hand
[Figure caption: Slip rings connect rotating solar arrays to the bus.]
[Figure: panels (a), (b), (c). Labels: Short; Ejection Force; Bridging; Boiling; Arc; Anode;
Metal. Shorting of slip rings is fairly common: improperly lubricated brushes can easily abrade
conductive slivers out of the rings. The voltage gap across adjacent brushes exacerbated shorting
by triggering an arc, which wrecked every anode in its path.]
the program focused virtually exclusively on the payload. The bus in fact had to be extensively
modified; rotating arrays, for example, were put on the aft end of the satellite for the first
time, requiring new array drive electronics. Yet, the program was too firmly set in the idea of
a standard bus to grasp the risks.
• The slip ring design provided practically no internal clearance between adjacent brushes,
making it easy for debris to cause a short. The design was accepted because another project had
flown it.
• The other project, however, had rewired the rings to keep the same polarities next to each
other after encountering a short during launch-simulating vibration tests. Notified of the
change, the first program felt that the change did not apply because its slip rings were
unpowered during launch.
• Slip ring arcing was also observed during ground test of a control moment gyro by the
same contractor working on yet another project. Unaware of this incident, the designers
did not consider shorting in the reliability analysis or in part selection. The program also
deleted thermal vacuum test of the slip rings to save money.
Lessons Learned:
• Thoroughly evaluate the heritage and applicability of using “existing” or “flight-proven”
equipment, especially if modifications have been made.
• Include shorting in analyzing potential failure modes of power systems.
• Apply manufacturing and handling practices that minimize slip ring damage.
Lesson 42
Account for Electrostatic Interaction in Structural Analysis
The Problem:
The performance of a communication satellite significantly degraded.
The Cause:
The satellite deployed a new phased-array antenna, consisting of multiple microstrip elements
made of copper circuits over dielectrics. A large thermal blanket, used for the first time on
this type of antenna, shielded the elements from the Sun.
The sunshield was not adequately supported: too few tensioners were provided to keep the
blanket taut under Earth’s gravity (1 G). The sunshield was installed loosely, often touching
the antenna elements. Nevertheless, no attempt was made to compare antenna performance before
and after blanket installation on ground, because the cover was expected to recover from
drooping once in orbit.
Unfortunately, an electrostatic charge built up in the ungrounded dielectrics of the antenna.
The resulting electrostatic attraction overpowered the insufficiently applied tension, keeping
part of the blanket in contact with the elements. The phased-array’s gain degraded due to
dielectric coupling and shorting to the conductive layer of the sunshield.
[Figure: sunshield configuration, “As Designed” versus “Actual”. Labels: Sunshield; Tensioner;
Antenna Structure; Standoff; Antenna Element (Cu on dielectric). The sunshield curled toward the
antenna due to charges that accumulated in the insulators. Notice that electrostatic attraction
can take place even though one surface (the sunshield in this case) is grounded.]
Lessons Learned:
• Be aware of the propensity of dielectrics to pick up an electrostatic charge in space.
• Thoroughly review the potential impacts of the space environment on flight hardware.
• Whenever possible, design hardware so that its operation in space (0 G) can be verified
under 1 G test conditions.
• Test the entire system in the final flight configuration.
Lesson 43
Do Not Circumvent Processes Designed to Catch Human Errors
The Problem:
A satellite was placed into a moderately degraded orbit.
The Cause:
During launch preparations, operators made final measurements of the spacecraft’s inertial
measurement unit (IMU). The readings, together with factory calibration data, were used to
control the satellite’s orientation during ascent.
Unlike all the other inputs loaded to the satellites, the IMU measurement and calibration data
could not be verified in a testbed because the readings had to be made just before launch.
Therefore, a procedure was set forth to avert mistakes: one operator was required to transcribe
the calibration numbers from the factory printout, another would verify the entries.
An engineer supervising the keyboard operators copied the calibration data from the computer
printout onto a piece of scratch paper, leaving the original printout in his office. He gave the
scratch paper to the operators, telling them that it was suitable. The data were typed in and
verified. Unfortunately, the engineer left out a symbol, and the orbit insertion went awry!
[Sidebar: The First Software-Related Crash (figure: $\dot{R}_n$ versus $\dot{\bar{R}}_n$). An
incorrect formula in the ground software led to the failure of Mariner I in 1962. Ascent control
required velocity smoothing, or “R dot bar n” where R stood for radius from a tracking antenna,
the dot for the first derivative (i.e., the velocity), the bar for averaging, and n for the
increment. The bar was left out of the handwritten equations provided to the programmer, causing
the guidance computer to be coded to process raw velocity instead. Confronted by fluctuating
telemetry, the computer sent erratic correction signals, forcing a smoothly ascending booster to
veer off course.]
Lessons Learned:
• Verify software databases as thoroughly as the source code (see Lesson 3).
• Verify software algorithms and databases on a simulator whenever possible.
• Double-check manually entered data against original sources.
• Automate data transfer and checking whenever possible to minimize human error.
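
An automated comparison against the original source, rather than against a hand copy, could be
as simple as the sketch below. The function name and field names are hypothetical. Values are
compared as exact strings so that a dropped character, for instance a minus sign (an assumed
example; the report does not say which symbol was omitted), is reported instead of being silently
accepted.

```python
def find_mismatches(source, entered):
    """Compare entered values with the original record, field by field.

    Both arguments map field names to the exact text of each value.
    Returns a list of (field, source_value, entered_value) for every difference,
    including fields present in only one of the two records.
    """
    problems = []
    for field in sorted(set(source) | set(entered)):
        if source.get(field) != entered.get(field):
            problems.append((field, source.get(field), entered.get(field)))
    return problems
```

The check only helps if `source` is read from the factory record itself; verifying the keyboard
entry against the scratch paper would have passed with the error intact.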
Lesson 44
Beware of Sneak Paths Through Test Equipment
The Problem:
Two days before launch, a satellite spontaneously tried to deploy.
The Cause:
Baffled engineers found that the separation sensor unexpectedly powered up. Even then, it
should not have turned on. Unexplained internal flaws inside the unit, which had operated
nominally up to that day, threatened to scrub the mission.
Not wanting to spend millions of dollars to return the satellite to the factory, the program
sought help from an outside expert, who found:
• The functional test was unable to detect whether the power relay was open or not.
• The test set inadvertently enabled the sensor, as if the breakwire had opened.
• The sensor could turn on only if powered quickly.
• The anomaly first occurred when the bus was powered up too fast by mistake, but appeared
again after the power was properly reapplied.
[Figure: Simplified Separation Electronics Schematics. Labels: +28 V; Timed On/Off Reset Command;
Spacecraft Separation Sensor; Solar Array Deployment Squib Firing; S-Band Transmitter Turn-On;
Simulator Port; Launcher/Satellite Breakwires; callouts c, d, e, f. A latch in the separation
sensor (powered via relay c) opens after the satellite breaks away from the launcher (d),
deploying the solar array via relay e. Failure of relay c, due to the addition of a filter f,
formed a sneak path (dashed line) via the simulator port, triggering the prelaunch anomaly.
Premature separation in fact could not occur in flight because the port is not used.]
The analyst traced the anomaly to a noise filter added to the input line. The filter caused an
overcurrent, welding the relay shut and powering the sensor up. Welding in fact occurred on a
relay installed in this same spot once before, but no corrective action was taken.
Energizing the bus too fast during ground test created a current strong enough to turn on the
sensor and start the deployment sequence. After an abort, the problem recurred upon a nomi-
nal restart because the sensor timer had not yet reset.
Once understood, the concern vanished—the relay would be closed in flight and the sneak
path would be blocked by the flight plug. The satellite flew successfully.
Lessons Learned:
• Determine and correct the root cause of all failures.
• Trace the flow of power and signals from source to load during troubleshooting.
• Provide a mechanism to independently validate the status of critical components.
• Inject unexpected conditions (such as a closed relay, current surge, and sluggish separa-
tion wire breakage) during reliability analysis to discover lurking failure paths.
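
Injecting a stuck-closed relay can be rehearsed on a simple connectivity model before any
hardware is touched. The sketch below is a generic reachability search, not the actual separation
electronics: the node and switch names in the usage note are invented, and wires are treated as
undirected connections, which ignores diodes and current direction.

```python
from collections import deque


def is_powered(wiring, source, target, closed):
    """Return True if target is connected to source through closed switches.

    wiring is a list of (node_a, node_b, switch) tuples; switch is None for a
    permanent connection, otherwise the name of the switch or relay in that path.
    closed is the set of switch names currently closed (or stuck closed).
    """
    adjacency = {}
    for a, b, switch in wiring:
        if switch is None or switch in closed:
            adjacency.setdefault(a, []).append(b)
            adjacency.setdefault(b, []).append(a)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False
```

As a hypothetical usage, a model with one path through `power_relay` and a second through a
test-set connection shows the sensor unpowered with everything open, but powered when either the
relay is welded shut or the test-set path is enabled. Looping over each switch as "stuck closed"
lists every single failure that creates an unintended path.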
Lesson 45
Guard Against Chloride Contamination Due to Manufacturing Process Changes
The Problem:
Two heat pipes suffered significant performance degradation in system-level test.
The Cause:
Analysis of the failed units revealed particulate materials, hydrogen gas, and internal
etching. Obviously, the ammonia working fluid had reacted with the aluminum tubing, a problem
that had not occurred in recent memory.
[Figure: Constant Conductance Heat Pipe (Degraded). Labels: Heat In; Heat Out; Evaporator;
Condenser; Gaseous Ammonia; Noncondensables; Thermocouple.]
The problem was eventually traced to a minor manufacturing procedure change. After machining,
the vendor previously wrapped the end of the tubing
Lesson 46
Make Sure Test Equipment Is Sufficiently Capable
The Problem:
A power regulation unit underwent five months of acceptance tests due to an inefficient setup.
The Cause:
The unit under test, consisting of eight DC-DC power stages, exhibited major glitches during
vibration. Based on sketchy data, the manufacturer assumed a short had occurred in the output
stage, and replaced all suspected parts.
The same anomalies recurred during a second vibration test. Now the vendor believed that the
first power stage was at fault.
[Figure: (a) Stage 8 through Stage 2 and Stage 1, monitored with a scope; (b) Stage 8 through
Stage 2 and Stage 1, monitored with a data log.]
An independent simulation showed that neither scenario was credible, and it was recommended
that full instrumentation as well as computerized data collection be implemented. The manufac-
Lesson 47
Review Hardware Reusability When Configuration Changes Affect Margins
The Problem:
A satellite failed two weeks after launch when a battery charger shorted.
The Cause:
The short took place between the grounded radiator and the electronics-mounting heatsink that
was at the solar array potential.
[Figure labels: Most Screws Longer Than 1.217 Inches Available; Relay; Heatsink (at Solar Array
Potential); Radiator Plate (at Ground Potential); Anodization]
adhesive and anodization layers only. A tolerance buildup, after repeated temperature
excursions, drove the mounting screws through the anodization,
3. The survival mode software, which could have shed the load and provided time to diag-
nose the problem before the spacecraft batteries were depleted, was not enabled.
Lessons Learned:
• Recognize that workmanship plays a large role in the space hardware, and reliability may
be compromised when undertrained personnel assemble heritage equipment.
• Computerize manufacturability analysis, including interface tolerance buildup, dynamic
interference, and ease of inspection on all packaging designs.
• Provide automatic fault management mechanisms so that a single defect will not bring
down the entire system.
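
An interface tolerance buildup check can be automated with a few lines of arithmetic. The sketch
below compares a worst-case stack (tolerances add linearly) with a root-sum-square (RSS)
estimate. The function names and all dimensions in the usage note are made up for illustration
and are not taken from the failed hardware.

```python
import math


def stack_up(parts):
    """Sum a dimensional stack.

    parts is a list of (nominal, tolerance) pairs in the same units.
    Returns (nominal_total, worst_case_tolerance, rss_tolerance).
    """
    nominal = sum(n for n, _ in parts)
    worst = sum(t for _, t in parts)
    rss = math.sqrt(sum(t * t for _, t in parts))
    return nominal, worst, rss


def fits(parts, available, worst_case=True):
    """True if the stack stays within the available dimension."""
    nominal, worst, rss = stack_up(parts)
    return nominal + (worst if worst_case else rss) <= available
```

For a hypothetical stack of (1.000, 0.005), (0.200, 0.003), and (0.010, 0.004) inches against
1.220 inches available, the RSS estimate fits but the worst case does not. For an insulation
barrier as thin as an anodization layer, the worst-case result is the one to design to.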
For more technical information, call Robert Tsutsui at (310) 336-3273.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 48
Thoroughly Reverify Software When Requirements Change
The Problem:
The Patriot defense system failed to intercept a Scud missile.
[Figure labels: Search; Validate; Track; Range Gate; Scud; Patriot Radar]
The Cause:
As the Patriot detects a threat, its radar beam narrows for better tracking. The fire
controller extrapolates the
Lesson 49
Equipment Intended for Use in Simulated Space Environments Should Be Space-Rated
The Problem:
A flight payload was damaged during thermal vacuum testing.
The Cause:
Lessons Learned:
• Perform formal design reviews on ground-test equipment intended for use in space-like
environments.
• Test radio frequency equipment in vacuum to 6 decibels over the expected input level (to
account for unfavorable signal return) to ensure operational safety.
• Monitor flight hardware during test lest overstressing cause damage.
• Improve interfaces between payload engineers and bus engineers, particularly during
system level tests.
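
The 6-decibel margin in the second lesson corresponds to roughly four times the expected power.
The arithmetic is sketched below; the function names are hypothetical, and the 30 dBm expected
input level in the usage note is an invented example.

```python
def db_to_power_ratio(db):
    """Convert a decibel value to a linear power ratio."""
    return 10 ** (db / 10)


def dbm_to_watts(dbm):
    """Convert a power level in dBm (decibels relative to 1 mW) to watts."""
    return 10 ** ((dbm - 30) / 10)


def vacuum_test_level_dbm(expected_dbm, margin_db=6.0):
    """Test level: the expected input level plus the margin, in dBm."""
    return expected_dbm + margin_db
```

For an expected input of 30 dBm (1 W), the test level is 36 dBm, or about 3.98 W, so the
ground-test equipment must itself be rated for nearly four times the nominal power in vacuum.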
Lesson 50
Virtual Cross-strapping Extends Satellite Life
The Event:
A government satellite, almost deorbited after losing both primary and redundant gimbal con-
trol, was brought back to operational status.
The Cause:
A power supply failure in the A-side caused the payload gimbal control to be switched to the
B-side. Later, the B-side was disabled
impossible.
An engineer who worked on the original gimbal development was brought in to assist
this gimbal’s design, the engineer realized that there was a secondary command path for
[Figure: gimbal control block diagrams, “At Failure” and “New Database” (rerouted), for Side A
and Side B. Labels: CMD; TLM; OBC; Gimbal Controller; Controller Processor; Gimbal Motor Driver;
Gimbal Motor; Position Sensor; Power Supply; Forward Control CMD. Legend: x = Disconnected (gain
set to zero); ? = Inoperative due to malfunction.]
Lesson 51
Review Troubleshooting Process When Encountering Surprising Test Results
The Problem:
An attitude control unit exhibited unrepeatable performance degradation.
The Cause:
In the middle of the acceptance test, a production unit failed. Engineers could not identify
the cause.
Eleven days later, the problem abruptly vanished. An all-out effort, lasting over four months,
failed to recreate the anomaly, driving the contractor to consider tearing the unit apart.
It turned out that the unit, slightly modified from a product designed for another project,
looked identical to the other except for the part number on the nameplate. Both operated on the
same test set and were equipped with identical connectors.
Units for both programs, by chance having the same serial number, were stored in identical
carrying cases and stowed side by side in the same storage cabinet. Apparently, a technician had
removed the wrong unit from the cabinet to test. During the intensive troubleshooting effort,
nobody checked the label of the unit under test!
[Sidebar: A Similar Incident. A thermal vacuum test was delayed because two rolls of Kapton tape
were mixed up. Both rolls of tape came from the same supplier and looked exactly the same.
However, the roll inadvertently used to attach insulation blankets contained an adhesive that was
based on silicone instead of on low-outgassing acrylics. The satellite had to be baked and pumped
for a long time before silicone outgassing subsided.]
Lessons Learned:
• Consider using bar codes in production control.
• Incorporate design features, such as colored cables, to preclude human errors.
• Don’t overlook simple human errors when confronting unexplained problems.
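The bar-code suggestion can be sketched as a check that runs before any test begins. This is only an illustration: the label format `PART_NUMBER|SERIAL_NUMBER`, the part numbers, and the function names are assumptions made for the example, not details from the incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitLabel:
    part_number: str
    serial_number: str


def parse_label(scanned: str) -> UnitLabel:
    """Parse a scanned label of the assumed form 'PART_NUMBER|SERIAL_NUMBER'."""
    part_number, serial_number = scanned.strip().split("|")
    return UnitLabel(part_number, serial_number)


def verify_unit_under_test(scanned: str, expected: UnitLabel) -> list[str]:
    """Return the mismatches between the scanned label and the work order.

    An empty list means the unit on the test set is the one the work order calls for.
    """
    actual = parse_label(scanned)
    problems = []
    if actual.part_number != expected.part_number:
        problems.append(
            f"part number {actual.part_number} != work order {expected.part_number}"
        )
    if actual.serial_number != expected.serial_number:
        problems.append(
            f"serial number {actual.serial_number} != work order {expected.serial_number}"
        )
    return problems


# Hypothetical case: two look-alike units share a serial number but not a part number.
work_order = UnitLabel(part_number="ACU-200", serial_number="0007")
print(verify_unit_under_test("ACU-100|0007", work_order))
```

Comparing both fields matters here because, as in the incident, a matching serial number alone would not have exposed the swap.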
For more technical information, call Tom Fuhrman at (310) 336-6596.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 52
Protect Cryogenic Systems Against Thermal Expansion Mismatch
The Problem:
The instrument used a dewar filled with solid nitrogen to cool the detectors. Between
filling the dewar and launch, cold helium was pumped through coils to keep the nitrogen
from thawing.
Soon after the dewar was attached to the optical […]
[Figure labels: vent line, two-part aluminum light baffle, vacuum jacket for dewar, optical
bench, photodetector, solid nitrogen/aluminum foam at 58 K, circulator for ground-supplied
cold helium gas, multilayered […].]
The unanticipated impact of repeated cooling cycles was not recognized because there was no
prototype testing. During optics installation, an “alarmingly small clearance” was reported,
but neither the designers nor the first investigation team conducted an interference analysis.
Lessons Learned:
• Perform in-depth modeling and thermal cycling tests on cryogenic systems, which are
delicate equipment involving complex physics and material behavior.
• Provide adequate tolerances for thermal expansion mismatch (using flexible links, for
example).
• Be extra vigilant when stretching the state-of-the-art.
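A first-order interference check for thermal expansion mismatch can be sketched as follows. It uses the constant-coefficient relation ΔL = α·L·ΔT, which is a simplification (expansion coefficients fall at cryogenic temperatures, so a real analysis would integrate measured contraction data). The coefficients, length, and clearance below are illustrative assumptions, not values from the instrument; only the 58 K cold temperature appears in the lesson.

```python
def differential_contraction_m(length_m: float, cte_a_per_k: float,
                               cte_b_per_k: float, delta_t_k: float) -> float:
    """Relative length change between two materials cooled by delta_t_k.

    Constant-CTE approximation: dL = alpha * L * dT for each material.
    """
    return abs(cte_a_per_k - cte_b_per_k) * length_m * delta_t_k


def remaining_clearance_m(clearance_m: float, length_m: float, cte_a_per_k: float,
                          cte_b_per_k: float, delta_t_k: float) -> float:
    """Clearance left after cooldown; a negative result indicates interference."""
    return clearance_m - differential_contraction_m(
        length_m, cte_a_per_k, cte_b_per_k, delta_t_k
    )


# Illustrative numbers only (assumptions): an aluminum part (about 23e-6 per K near
# room temperature) next to a low-expansion part (1e-6 per K), 0.5 m long,
# cooled from 293 K to 58 K, with 2 mm of room-temperature clearance.
margin = remaining_clearance_m(0.002, 0.5, 23e-6, 1e-6, 293.0 - 58.0)
print(f"clearance after cooldown: {margin * 1000:.3f} mm")
```

With these assumed inputs the estimate goes negative, which is the kind of result that would justify the flexible links and cycling tests recommended above.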
Lesson 53
Test Hardware and Software Together
The Problem:
A satellite lost power shortly after launch.
Lesson 54
Design and Handle Cryogenic Equipment with Great Care
The Problem:
A cryogenic dewar containing liquid helium exploded on the ground.
[Figure label: absolute pressure relief valve.]
Lesson 55
Do Not Dismiss Test Anomalies as Random Events—Find Out Why (I)
The Problem:
Two commercial satellites failed to deploy during the same Space Shuttle mission.
The Cause:
Both satellites suffered identical mishaps: the carbon/carbon nozzles on their kick motors
came off a few seconds into firing.
Three other nozzles failed in a similar manner during qualification tests. Unfortunately,
these failures were attributed to deficiencies in materials and workmanship. The flight
incident investigation report also blamed the two failures on undetected flaws in the
material used to fabricate the exit cones. The fundamental problem was not diagnosed.
Because the motors were slated for government applications, Congress asked for an
independent investigation. Finally, the root cause was discovered: charring of the unvented
carbon/phenolic insulator created gaseous pressure within the exit cone. Since
permeabilities inside the insulating materials are highly variable, the gas sometimes
became trapped, forcing the exit cone to buckle. The problem could have been avoided simply
by placing vent grooves in the bondlines.
[Figure: nozzle cross-section; the gap area is shaded. Labels: C/C = carbon/carbon, C/P =
carbon/phenolic, Ti, rubber, insulator, exit cone, bondline, heat, volatiles from insulator
pyrolysis.]
Exit Cone Collapse
The independent investigation prompted NASA to conduct its own instrumented firing, which
proved the buckling scenario. Prior to firing, the cone curled toward the left. It became
vertical (Photograph A) and started to curl toward the right (Photograph B). The cone
failed shortly afterwards.
Lessons Learned:
• Exhaustively search for the root cause of failures.
• Conduct fully instrumented tests.
• Provide sufficient thermal and structural margins to allow for material, manufacturing,
and processing fluctuations.
Lesson 56
Do Not Dismiss Test Anomalies as Random Events—Find Out Why (II)
The Problem:
A solar array drive failed soon after deployment.
The Cause:
The problem occurred because of a seemingly […]
[Figure labels: 28V bus, DC/DC converter, controller, primary, redundant, drive motor,
solar array boom.]
Lesson 57
Protect Propulsion System from Contamination
The Problem:
A launch was delayed for many months.
The Cause:
Following a guidance system malfunction, the satellite had to be removed from the launch
vehicle. Off-loading of the toxic propellant caused a problem: the legacy satellite had no
gravity drains, and the thruster valve was not robust. Neither the original valve vendor
nor the system manufacturer was still in business, nor could the build paper be located to
help find a good solution.
The decision was made to pump out most of the fuel, fix the guidance unit, and restack the
satellite. Unfortunately, before refueling could start, a valve failed. Carbon dioxide in
the air leaked in and reacted with hydrazine, forming corrosive carbazic acid and fouling
the line. The entire propulsion system had to be replaced.
[Figure: Fuel System (Simplified). Labels: fill/drain, thruster, pressure gauge, hydrazine
tank.]
The higher location of the fill/drain port in the legacy propulsion system prevents gravity
draining, and the single seat valve is prone to leak. Dual seat valves (right), typically
used in new designs, would have prevented air ingression unless both valves leaked.
A Similar Incident
An ICBM, refurbished to launch satellites, recently suffered a performance degradation
after its turbine seal leaked, allowing ammonia in the exhaust gas to react with the
lubricant, plugging the filter and blocking lubricant circulation.
The problem, chemically similar to the thruster contamination, was addressed in the
follow-on generation of the rockets, but the original units were not retrofitted.
Lessons Learned:
• Consider retrofitting legacy hardware with proven design upgrades. Anticipate
out-of-sequence operations, such as rework, during hardware design.
• Design propulsion systems to accommodate ground handling by including features such as
low point drains to facilitate fuel removal.
• Archive manufacturing documents.
Lesson 58
Guard Against Sneak Paths Through Ground Test Equipment
The Problem:
The primary side of an instrument failed shortly after launch.
The Cause:
The instrument had parallel redundant power pins, but the power plug on the bus had only
single pins for source and return. The flight cable had to be spliced so redundant
conductors could be crimped into the same socket. The circuit opened because of broken
solder joints at the current supply board, loose contacts, or defective crimps.
A subtle test issue hid this single point failure. The instrument needed a long time to
stabilize, and was therefore kept on during ground testing by an external power supply with
battery backup. On the test stand, the instrument operated normally, despite the faulty
cable, by drawing power from the external power supply.
The flaw would likely have been caught if the test equipment provided metering to show the
unit was unexpectedly drawing power from it.
[Figure: Test Setup (Simplified). Labels: defective crimps, soldering, or socket/pin
connection; cable; instrument; current source; flight hardware; power control &
distribution unit; external power supply/battery backup; status indicators should be added.]
Similar Examples
A flight box was not grounded by mistake. The problem was missed because the test
equipment […]
Lessons Learned:
• Independently confirm hardware performance for functions temporarily provided by test
equipment.
• Use a breakout box to check harness connector paths, and the directions and magnitudes of
current flows.
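The metering idea behind these lessons can be sketched as a simple check on logged test data: flag any sample in which the unit draws current from the external supply while flight power is supposed to be carrying the load. The function name, data layout, and 0.05 A threshold are assumptions for illustration only.

```python
def unexpected_external_draw(external_amps, flight_power_on, threshold_amps=0.05):
    """Return sample indices where external-supply current exceeds the threshold
    even though flight power is commanded on (a possible sneak path)."""
    return [
        index
        for index, (amps, powered) in enumerate(zip(external_amps, flight_power_on))
        if powered and abs(amps) > threshold_amps
    ]


# Hypothetical log: the external supply should carry the load only in sample 0.
external_amps = [0.90, 0.85, 0.02]
flight_power_on = [False, True, True]
print(unexpected_external_draw(external_amps, flight_power_on))  # prints [1]
```

A non-empty result does not identify the fault, but it prompts the question that went unasked on the test stand: why is the unit still drawing from the test equipment?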
For more technical information, call Peter Carian at (310) 336-8215.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 59
Lesson from Challenger: Understand Your Data!
The Problem:
Vital O-ring data was ignored before the Shuttle lifted off on a freezing morning.
The Cause:
During a pre-launch telecon, 34 engineers debated for hours over whether to delay the
launch, out of the concern that cold weather might compromise the seals.
Citing O-ring anomalies at both 75 deg F and 53 deg F launches, some engineers argued
against launch. But because damage occurred both hot and cold, managers perceived no
temperature effect. The launch went forward.
The Post-Challenger Investigation Commission found that in presenting the flight history,
the engineers omitted data from 12 flights in which the O-rings remained intact, mistakenly
thinking that successful […] about risk.
If presenters had plotted data from all flights, nobody would have missed the effect of
temperature on the O-rings!
MOTOR     O-ring temperature (deg F)
DM-4      47
DM-2      52
QM-3      48
QM-4      51
SRM-14    53
SRM-22    75
SRM-25    29, 27
Table notes: the DM and QM motors were tested on horizontal platforms in Utah; SRM-14 and
SRM-22 were the only 2 launches (of 24) shown; the SRM-25 entry is the forecasted
temperature for the Challenger.
A table of temperature data presented during pre-launch telecon included irrelevant
information but only selective flight experience. The audience was misled.
[Figure: O-ring Damage History. Two plots of failures (0 to 12) against joint temperature
(°F) at launch (25 to 85): “Only Data Points Presented” and “All Data” (trend more
obvious), each marking the forecasted temperature for 01/27/86.]
Anomalies rarely occurred on warm days, but routinely took place during launches below
65°F.
Lessons Learned:
• Consider all relevant information.
• Develop a coherent explanation of engineering data to help audience analyze risks.
• Display data cogently (see Visual Explanations by E. Tufte, for example).
For more technical information, call Jon Binkley at (310) 336-7787.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
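The effect of leaving out the flights with no damage can be shown with a few lines of code. The flight list below is invented purely for illustration and is not the actual flight record; the 65°F split follows the figure caption above. The sketch compares the fraction of flights showing damage in cold and warm bands, first with all flights and then with only the damaged ones.

```python
def damage_fraction(flights, low_f, high_f):
    """Fraction of flights with joint temperature in [low_f, high_f) showing damage.

    flights is a list of (joint_temperature_deg_f, damage_incident_count) pairs.
    Returns None when no flight falls in the band.
    """
    in_band = [count for temp, count in flights if low_f <= temp < high_f]
    if not in_band:
        return None
    return sum(1 for count in in_band if count > 0) / len(in_band)


# Hypothetical data (assumption): NOT the real flight history.
ALL_FLIGHTS = [
    (53, 3), (57, 1), (58, 1), (63, 1), (66, 0), (67, 0), (67, 0), (68, 0),
    (70, 1), (70, 0), (72, 0), (75, 2), (76, 0), (78, 0), (80, 0), (81, 0),
]
DAMAGED_ONLY = [flight for flight in ALL_FLIGHTS if flight[1] > 0]

for label, data in (("all flights", ALL_FLIGHTS), ("damaged only", DAMAGED_ONLY)):
    cold = damage_fraction(data, 0, 65)
    warm = damage_fraction(data, 65, 200)
    print(f"{label}: below 65 F -> {cold:.2f}, 65 F and above -> {warm:.2f}")
```

When only damaged flights are kept, both bands show damage on every flight, so temperature appears irrelevant; with all flights included, the difference between the bands is hard to miss.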
Lesson 60
Tests Are for Verification, Not for Discovery
The Problem:
A satellite started to tumble shortly after deployment.
The Cause:
[…] meaning of the Earth’s magnetic poles and set the flight software incorrectly. The
error went unnoticed because the coil test had no expected polarity values; the
configuration was determined based on the measured responses.
After separating from the launcher, the satellite began to wobble. Fortunately, the lead
G&C engineer was prepared. Having heard many horror stories about torque rod phase
mistakes, he had spent the previous day making contingency plans. Within half an hour, he
reversed the controller gain, stabilizing the satellite.
The Earth as a Magnet
Opposite magnetic poles attract. The north pole of magnet needles points to the Earth’s
magnetic South Pole, also called the geomagnetic North Pole!
Lessons Learned:
• Expected test results should be established in advance of the test. Deviation from expected
results should raise a flag, and be thoroughly investigated before making any changes.
• Rigorously manage software development, especially on requirements, interfaces, and
configuration control.
• Plan for contingencies, using a top-down fault tree (ask “what happens if the satellite
failed to de-spin?” for example).
• Double-check torquer signs (Lesson 53).
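The first lesson, establishing expected results before the test, can be sketched as a polarity check in which the expected signs are written down from the design documentation beforehand and the measurements are only compared against them. The axis names, sign convention, and readings are hypothetical.

```python
# Expected sign of the measured field change for a positive torquer command on each
# axis. Assumption: these values come from the design documents, fixed BEFORE testing.
EXPECTED_POLARITY = {"X": +1, "Y": +1, "Z": +1}


def sign(value: float) -> int:
    """Return +1, -1, or 0 according to the sign of value."""
    return (value > 0) - (value < 0)


def polarity_discrepancies(measured, expected=EXPECTED_POLARITY):
    """Map each disagreeing axis to (expected_sign, measured_sign).

    A missing measurement counts as sign 0 and is therefore reported.
    """
    return {
        axis: (want, sign(measured.get(axis, 0.0)))
        for axis, want in expected.items()
        if sign(measured.get(axis, 0.0)) != want
    }


# Hypothetical coil-test readings: the Y axis responds with the opposite sign.
print(polarity_discrepancies({"X": 0.4, "Y": -0.3, "Z": 0.5}))  # prints {'Y': (1, -1)}
```

The point of the design is that a discrepancy triggers an investigation; the software configuration is never derived from the measured response.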
Lesson 61
Do Not Assume a Situation Is Acceptable Simply Because Nothing Is Said About It in
Documents
The Problem:
A separation failure sent a launch vehicle tumbling out of control.
The Cause:
Following stage-1 separation, a small interstage ring surrounding the stage-2 nozzle also
had to be jettisoned. Equipped with three guide tracks, this […] engineer that the foam
felt too tight. Seeing no in- […]
[Figure labels: interstage ring, 2nd stage, 1st stage, nozzle.]
Lessons Learned:
• Double-check designs against possible misinstallation.
• Make sure field-assembled hardware can be inspected.
For more technical information, call Andy Shearon at (310) 336-1762 or Brian Gore at (310)
336-7253.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 62
Test as You Fly
The Problem:
A battery exploded on orbit.
[Figure labels: to vacuum, bubbles, electrolyte reservoir.]
The Cause:
In orbit, leakage triggered a violent short. The plastic case ignited, and the battery blew up.
“Ultimately, this anomaly occurred because of a programmatic philosophy to minimize cost,”
said the failure report. “All failure scenarios could have been ruled out if enough testing had
been done.”
Lessons Learned:
• Analyze prior incidents of equipment malfunction.
• Review all aspects of battery application—do not regard batteries as simple plug-and-play
items.
For more technical information, call Doug Chism at (310) 336-6375.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 63
Verify Field Installations of All Single-Point-Failure Items
The Problem:
A suborbital launch failed because the second stage would not start.
The Cause:
After the first-stage burn, two bolt cutters were fired, successfully jettisoning the spent
stage. However, neither the second-stage motor nor its thermal battery ignited upon
command.
The igniter and the thermal battery shared an […] though both sides of the connection were
male, their shell types and pin configuration allowed an unintentional fit.
The error was not caught because, unlike most Air Force programs, an end-to-end test with a
load to verify circuit performance was not performed, nor was a quality assurance checklist
used.
[Figure: Launch Configuration, should-be versus actual. Labels: ground support equipment,
igniter, thermal battery, primary ordnance battery/circuitry, redundant ordnance
battery/circuitry, ground, bolt cutters. Legend: √ = deployed; X = did not deploy.]
Lessons Learned:
• Simplify interfaces, commands, and procedures in prelaunch operations lest the hectic
pace cause errors.
• Verify final assembly operations, particularly on single-point-failure risks. Pay particular
attention to possible connector mismating.
• Do not allow primary and redundant sides of critical circuits to join in a single-point-
failure area.
For more technical information, call Bruce Wendler at (310) 336-5475.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 64
Review Out-Of-Flow Processes to Ensure No Steps Are Bypassed
The Problem:
The temperature of an antenna dropped below expectation in certain conditions.
The Cause:
A legacy antenna had a radiator that was oversized for this mission. Thermal designers
specified that the excess area should be covered with multi-layer insulation (MLI).
A veteran engineer, conducting a walkaround prior to the system-level thermal vacuum test,
discovered that the MLI was missing. The blanket was installed.
After the test was completed, the temporary MLI was removed in preparation for installation
of the flight MLI. Unfortunately, the final integration order still neglected to include
the MLI reinstallation instruction. Meanwhile, the old hand retired. His replacement did
not spot the missing MLI, and the antenna was flown without the blanket.
[Figure: Antenna Without MLI; MLI Installed.]
A Similar Incident
A satellite used active louvers to control the baseplate temperature of an instrument.
The system, including the louvers, underwent thermal vacuum testing, after which the
louvers were removed. They were temporarily reinstalled, without being connected, for fit
check.
The louvers were left in place, without anyone realizing that the connector remained
unattached. Pre-shipment checks did not verify the mate status because the connector was
not accessible.
Running too hot in space, the instrument suffered significant degradation.
Lessons Learned:
• Make sure corrections in engineering drawings or work instructions are back annotated in
all applicable drawings and shop orders (including subsequent builds and units that have
been distributed).
• Conduct final walkthroughs in the presence of the most experienced personnel.
• Keep good records of all “non-flight” installations.
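Record keeping for temporary installations can be sketched as a reconciliation between the required shipment configuration and the logged install/remove events. The item names, the two-state model, and the function name are assumptions for illustration; a real configuration log would carry far more detail (mate status, for instance).

```python
def configuration_discrepancies(required, events):
    """List items whose logged final state differs from the required state at shipment.

    required: dict mapping item name -> "installed" or "removed".
    events:   list of (item name, "install" or "remove") in time order.
    Items with no logged events are treated as removed.
    """
    actual = {item: "removed" for item in required}
    for item, action in events:
        actual[item] = "installed" if action == "install" else "removed"
    return sorted(item for item, want in required.items() if actual[item] != want)


# Hypothetical log modeled on the incident: the temporary blanket comes off after the
# test, but no flight blanket installation is ever recorded.
required_at_shipment = {"antenna MLI": "installed", "test-only cover": "removed"}
event_log = [
    ("antenna MLI", "install"),
    ("antenna MLI", "remove"),
    ("test-only cover", "install"),
    ("test-only cover", "remove"),
]
print(configuration_discrepancies(required_at_shipment, event_log))  # prints ['antenna MLI']
```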
Lesson 65
Perform Thorough Post-Flight Analysis
The Problem:
A launch vehicle lost control.
The Cause:
The investigation board traced the mishap to a solenoid valve in the thrust vector
actuators. Apparently, microscopic metal shavings, created during the assembly and
adjustment and dispersed during ascent, jammed the spool shut for eight seconds, time
enough to ruin the mission.
In a previous launch, this valve stuck open. In another, it seized up twice, once open,
once closed. Minor anomalies occurred two other times, but all previous flights succeeded.
Since a valve that is stuck open is manageable, these earlier troubles were disregarded.
But a sticky valve can as easily fail closed as open. The blockage proved lethal.
“It is recommended that procedures for dealing with flight and ground test anomalies be
reviewed. This recommendation is necessarily […]
A Similar Incident
Misleading instructions on drawings led assemblers to wrap thermal tapes too close to a
separation connector (Lesson 4). The stage jammed (see diagram below), stranding the
satellite.
Eleven previous flights were subsequently reviewed; all showed the same hang-up. Seven, in
fact, were saved only because the floating connectors were jolted apart when they hit the
allowable stops. The mission right before the failure had the narrowest escape.
The warning signs were not pursued.
[Figure: Separation Failure. Labels: Stage 1, Stage 2.]
Lesson 66
Thoroughly Analyze All Environmental Load Paths and Develop a Detailed System
Dynamic Model
The Problem:
A solar array broke on orbit.
The Cause:
Four solar array paddles were attached to the spacecraft with aluminum brackets. Three
brackets were stiffened with gussets, but interference from surrounding components
prevented a gusset from being added to the fourth bracket.
During vibration testing, the flexible hinge channeled most of the force into the release
mechanism at the other end of the paddle, damaging a latching clevis. The problem would
have been recognized had the paddle been instrumented or the component inspected after
test. Unfortunately, the program did not adequately analyze dynamic loads during
environmental testing and launch.
Loads during upper-stage burn exceeded nominal, and the clevis and bracket came loose. The
paddle was left dangling by its cabling. The attitude-controlling magnetometer
malfunctioned, whereupon the satellite turned away from the Sun, draining the battery.
The satellite was rescued later (Lesson 67).
[Figure: Test Configuration (Side View). Labels: solar paddle, attachment beam, release
mechanism, spacecraft, compliant base, flexible hinge, shaker table.]
[Figure: As-Deployed. Labels: restraint cable, magnetometer, damaged hinge, flexible
harness.]
Lessons Learned:
• Provide extra margins to accommodate excessive launch shocks that occasionally occur,
especially with new launch vehicles (Lesson 11).
• Independently review dynamic loads analysis prior to test.
• Adequately instrument the unit, subsystem, and vehicle during environmental tests.
• Check all data and inspect critical parts for damage after tests.
For more technical information, call Julia White at (310) 416-7229.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 67
Provide Design Flexibility to Enable Emergency Recovery
The Event:
Despite a damaged solar array (Lesson 66), a satellite was recovered.
The Cause:
When one of the solar paddles came loose, the magnetometer attached to it was disabled.
Lacking autonomous attitude control, the satellite turned away from the Sun, and the
battery drained. Ground controllers could not contact the satellite.
Fortunately, a video from the launcher […] Earthshine could partially replenish the
battery!
All non-emergency functions were commanded off to allow the batteries to fully charge. With
its torquers manually controlled from the ground, the satellite was reoriented toward the
Sun and spun up nominally. Full operation started three months after launch.
[Figure: As-designed Control Loop versus Implemented in Operation. Labels: dangling paddle,
launch vehicle, paddle detached, ground station, dynamic model, Kalman filter.]
Pointing Information Recovery
Accurate attitude knowledge, especially during orbit night when most of the observations
were made, posed the next challenge: the satellite no longer rotated as a rigid body; even
the spin axis orientation was uncertain.
The program created a non-linear rigid-body model. Using Sun sensor and horizon crossing
indicator data as input, an algorithm incorporating Kalman filters calculated the satellite
attitude to 0.25º accuracy, even during most of the orbit nights when direct sensor
readings were unavailable. Most mission requirements were met.
Lesson Learned:
• Provide as much telemetry as possible on launch vehicles, especially on separation
events. Without knowing how the satellite malfunctioned, controllers would likely have
given up before the downlink was received!
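The role a Kalman filter plays when sensor readings drop out, as during orbit night, can be illustrated with a scalar filter that keeps predicting from a motion model and corrects itself whenever a measurement arrives. This is a deliberately minimal sketch with made-up numbers and a constant-rate model; it is not the program's non-linear attitude algorithm.

```python
def kalman_track(x0, p0, rate, dt, q, r, measurements):
    """Track a scalar angle with a constant-rate model.

    x0, p0: initial estimate and its variance.
    rate, dt: assumed angular rate and time step (model inputs).
    q, r: process and measurement noise variances.
    measurements: sequence of readings; None marks a step with no sensor data.
    Returns a list of (estimate, variance) after each step.
    """
    x, p = x0, p0
    history = []
    for z in measurements:
        # Predict: propagate the model; uncertainty grows by the process noise.
        x = x + rate * dt
        p = p + q
        # Update: only when a measurement is available.
        if z is not None:
            gain = p / (p + r)
            x = x + gain * (z - x)
            p = (1.0 - gain) * p
        history.append((x, p))
    return history


# Hypothetical run: readings at the first and last steps, none in between.
for estimate, variance in kalman_track(0.0, 1.0, 1.0, 1.0, 0.01, 0.25,
                                       [1.0, None, None, 4.0]):
    print(f"estimate={estimate:.3f}  variance={variance:.4f}")
```

The variance rises through the gap and falls again at the next reading, which is the behavior that lets a model-based estimate stay usable while direct readings are unavailable.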
Lesson 68
Insist On End-to-End Ownership to Verify Interfaces
The Problem:
An uncontrolled explosion during the release of a satellite damaged the Space Shuttle.
The mistake was not caught despite hundreds of hours of reviews and tests because the
separate drawings were never put together into a single, end-to-end, schematic. “Even after
the occurrence of the separation system anomaly, detecting the design error through drawing
reviews was difficult,” reported the investigation panel.
Investigators also found that the documentation describing the mechanical and electrical
subsystem interfaces was inadequate. Labeling of the components was “incomplete and
confusing.” Verification tests were flawed: they were designed to ascertain that the
separator was built to the (flawed) design, instead of demonstrating the intended function.
Discrepancies raised during the critical design review were not properly resolved.
Lessons Learned:
• Develop end-to-end diagrams for electrical and mechanical interfaces, including
software-driven interfaces.
• Clearly label each connector to avoid mismating.
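One way to picture an end-to-end schematic is as a single connection graph stitched together from the separate drawings, which can then be traced from each command to the devices it actually reaches. The drawing contents, node names, and wiring error below are hypothetical and chosen only to show the technique.

```python
from collections import defaultdict, deque


def merge_drawings(*drawings):
    """Combine per-drawing lists of (from_node, to_node) into one directed graph."""
    graph = defaultdict(set)
    for connections in drawings:
        for source, target in connections:
            graph[source].add(target)
    return graph


def reachable(graph, start):
    """Return every node reachable from start (breadth-first trace)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Hypothetical drawings, each plausible when reviewed alone.
sequencer_drawing = [("CMD_FIRE_A", "J1-3"), ("CMD_FIRE_B", "J1-4")]
harness_drawing = [("J1-3", "P2-4"), ("J1-4", "P2-3")]  # pins crossed in the harness
device_drawing = [("P2-3", "separator_A"), ("P2-4", "separator_B")]

graph = merge_drawings(sequencer_drawing, harness_drawing, device_drawing)
print(sorted(reachable(graph, "CMD_FIRE_A")))
```

Traced end to end, the A-side command lands on the B-side device, a mismatch that none of the three drawings reveals by itself.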
Lesson 69
Protect Solid Rocket Grain Structure from Destabilizing Gas Flow
The Problem:
A prototype solid rocket motor exploded during prequalification firing.
The Cause:
Mixing of combustion gas streams created a […]
[Figure labels: igniter, case and joint. Caption: Chamfering of the forward grain face (c)
eliminated the chokepoint.]
• Conduct adequate subscale testing.
• Study post-test and post-flight anomaly reports from similar programs.
Lesson 70
Late Modifications Require Careful Revalidation
The Problem:
A jammed tether prevented a satellite from being deployed from the Shuttle.
The Cause:
Post-flight inspection found that a bolt protruded into the path of a traveling ball nut.
[Figure label: tether.]
Lesson 71
Make Sure Ground Support Equipment Cannot Damage Flight Hardware
The Problem:
An oxygen tank on Apollo 13 blew up.
The Cause:
A month before launch, the spaceship was stacked on […]
[Figure labels: supply line, fill tube, thermostat switches.]
Lesson 72
Prevent Failures in Support Equipment from Propagating into Flight Boxes
The Problem:
A transmitter was damaged during test.
The Cause:
The test set incorporated 15 separate power supplies with various voltages. To
automatically record data from each test point, a computer addressed the power supplies via
a bank of relays. The commercial test unit did not isolate each monitor point.
[Figure: test set. Power Supply 1 (5 V), Power Supply 2 (10 V), Power Supply 3 (28 V)
through Power Supply 15 (31 V), each feeding flight hardware and connected through reed
relays to a sequentially scanning monitor. Annotation: should add isolation resistors here.]
Lesson 73
Trace All Software Changes Back to System Requirements and Specifications—Do Not
Simply Modify the Code
The Problem:
[…] thrusters to unload the reaction wheels. Ground controllers planned the burns with a
thruster model, reused from a successful mission.
[Figure: Complex Failure Causes. Labels: UV computation SW, new thruster + unit mix-up,
tracking difficulty, spacecraft telemetry, spacecraft model, trajectory estimation,
navigation failure.]
A thruster change made it necessary to update this model, which specified thruster input in
Newton-sec. The thruster vendor (the same for both missions) used lb-force-sec. In the
original model, engineers correctly added the 4.45 conversion factor to the vendor’s
equation. Overlooking the interface specification and seeing no warning in the code
comments, the follow-on team simply made a substitution.
Labeled as non-mission critical, the ground software, now lacking the conversion factor,
was not rigorously reviewed; the “truth” table, computed manually for acceptance testing,
contained the same mistake. Interface with the navigation function was informally tested
only to ensure that it could move across servers.
Only one, occasionally two, engineers navigated the spacecraft. Two months before orbit
insertion, radar returns projected a path too close to Mars. Unfortunately, as the probe
neared Mars, poor observation geometry from Earth reduced tracking precision. The flight
team, confident with their navigation ability, decided against raising the orbit.
Not until aerobraking, after Martian gravity had captured the probe, was it possible to
calculate the spacecraft’s true position. Only then did the controllers realize the probe
was 100 kilometers off course!
The successful reflight listed both English and metric units on all interface control
documents, adopted a more robust navigation method, and used six full-time navigators.
Lessons Learned:
• Any software that commands a satellite is mission critical, even though it may not be
embedded in the flight vehicle.
• Validate changes in mission-critical software with more vigor than the original
development (Lessons 25, 29, 47). Rigorous formal testing is essential.
• Always specify the units in requirements and interface specifications.
• Generate expected results used in verification tests independently, in accordance with
system requirements.
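The unit lesson can be made concrete by carrying the unit with the number, so that a value cannot cross the interface without an explicit conversion. The class and unit strings below are illustrative assumptions, not the mission software; the conversion constant is the standard value of one pound-force in newtons, which the lesson rounds to 4.45.

```python
from dataclasses import dataclass

# 1 lbf = 4.4482216152605 N, so 1 lbf*s = 4.4482216152605 N*s.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """A thruster impulse tagged with its unit: 'N*s' or 'lbf*s'."""
    value: float
    unit: str

    def to_newton_seconds(self) -> float:
        """Return the impulse in N*s, refusing any unit it does not recognize."""
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown impulse unit: {self.unit!r}")


# Hypothetical vendor figure supplied in lbf*s, consumed by a model that expects N*s.
vendor_value = Impulse(10.0, "lbf*s")
print(f"{vendor_value.to_newton_seconds():.3f} N*s")
```

Rejecting unknown units outright, rather than passing the raw number through, is the design choice that turns a silent factor-of-4.45 error into an immediate failure during testing.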
For more technical information, call Suellen Eslinger at (310) 336-2906.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 74
Understand Why Warning Lights Come On Before Disabling Them
Lesson 75
Protect High-Voltage Equipment from Contamination
The Problem:
A satellite was lost when the tether deploying it was severed by arcing.
The Cause:
Inspection of the recovered tether fragment revealed contamination, pinholes, and other
defects. Debris was also found on the deployment mechanism.
Apparently, the underlayers of tether experienced severe compression loads while wound […]
[Figure labels: polyester core/overwrap, fluorocarbon insulation, Kevlar (providing
strength), copper conductor.]
Lesson 76
Make Sure Someone Takes Responsibility for Each Interface
The Problem:
A space probe was damaged on the launch pad.
The Cause:
The probe, developed by one agency (A), required another agency (B) to provide launch-pad
cooling.
Neither agency bothered to assign interface responsibilities. The requirements were not
spelled out; the design and operational procedures were not placed under configuration
control. Communications faltered.
Agency A faxed agency B a gas-flow value, which it intended as the not-to-exceed limit. The
nominal value was buried in a thick review package.
Seeing only the faxed number, agency B made certain it could be met by making several
procedural changes, such as narrowing the cooling duct, without considering the effect of
too much air. On the pad, excessive air flow tore a hole in the probe’s insulation.
The investigation board found that in five years the two organizations missed catching the
problem 26 times. “The actions taken were logical, based on the knowledge available to the
people taking action. The incident was entirely due to inadequate or imprecise information
exchange,” said the board.
The Importance of Stating TBDs
Agency B’s cooling plan stated that the equipment would be set to “agency A value” or
“desired” flow rate. The two partners reviewed the plan step by step, never realizing that
this number had not been agreed upon.
Stating “set to TBD ± TBD units (agency A value to be supplied)” would have raised a flag
and avoided the misunderstanding.
Lessons Learned:
• Check ground operation procedures and support equipment to avoid damage to flight
hardware.
• Ensure interfaces between two organizations are worked out in detail, agreed to by both
sides, and documented.
• Bound each requirement within a range.
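Bounding a requirement, and making any unresolved TBD visible, can be sketched as a small record that refuses to evaluate a value until the unit, nominal, minimum, and maximum have all been supplied. The field names and flow numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BoundedRequirement:
    """An interface requirement; any field left as None is still TBD."""
    name: str
    unit: Optional[str] = None
    nominal: Optional[float] = None
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def open_tbds(self) -> list[str]:
        """Names of the fields that have not been agreed upon yet."""
        fields = ("unit", "nominal", "minimum", "maximum")
        return [field for field in fields if getattr(self, field) is None]

    def accepts(self, value: float) -> bool:
        """True if value lies within [minimum, maximum]; raises while TBDs remain."""
        missing = self.open_tbds()
        if missing:
            raise ValueError(f"{self.name}: still TBD: {', '.join(missing)}")
        return self.minimum <= value <= self.maximum


# Hypothetical numbers. A lone faxed figure fills in only one of the four fields.
faxed = BoundedRequirement("pad cooling gas flow", maximum=100.0)
print(faxed.open_tbds())  # prints ['unit', 'nominal', 'minimum']

agreed = BoundedRequirement("pad cooling gas flow", "kg/h", 60.0, 40.0, 100.0)
print(agreed.accepts(70.0), agreed.accepts(120.0))  # prints True False
```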
Lesson 77
Make Sure Sequential Safety Devices Operate Independently
The Problem:
A science mission ended during the first orbit.
The Cause:
The aperture cover’s design called for its pyro circuits to be “safed” prior to being
sequentially “armed” and “fired.”
A design feature in the controller chip invalidated all the programming circuits for a few
milliseconds upon powering up. All outputs, including “ARM” and “FIRE”, were momentarily
asserted. The cover blew open prematurely; the cryogen escaped.
The chip would manifest this start-up problem only after having been turned off for several
hours. Although power cycled many times during component testing, it was never unpowered
long enough to reveal the problem.
[Figure: pyro electronics block diagram. Labels: satellite 28V power, regulator, 5V,
power-on reset, clock oscillator, FPGA, on-board computer, ARM and FIRE drivers, relay
driver, arming relay, switches, pyro.]
Timing Issue in the Safety Mechanism
After the bus power is switched to the pyro box via a relay, the controller (a field
programmable gate array, FPGA) should be safed and initialized at the direction of an
oscillator clock.
It took 30 milliseconds for the local voltage to rise and another 25 milliseconds for the
safing clock to start, but only 15 milliseconds for the transient to occur.
The use of a slow, non-flight-like, power supply during unit testing masked the spurious
output: during the transient period there was not enough voltage to close the arming
relays. Later, anomalies repeatedly occurred during system testing. Unfortunately, because
the pyro simulator was very sensitive, a load delay was fitted to the test equipment to
filter out spurious triggers, unintentionally preventing the actual start-up glitch from
being recorded. The warning signs were ignored.
At launch, the chip had been powered down for weeks. Not only did it go awry but, because
power to the pyro box was applied via a fast relay, sufficient voltage had also built up to
complete the arming circuit. The FIRE switch, commanded by the same controller and
therefore not truly independent, set off as well, ending the mission.
This controller chip had caused troubles before, prompting NASA to issue an application
note. However, the contractor and the field engineer from the vendor did not know about it.
“[We need] an information hotline, set up on an industry-wide lessons learned web page,”
suggested the engineers later.
Lesson Learned:
• Beware that many programmable devices do not follow their truth tables at power-on;
see http://www.klabs.org/ for more information.
For more technical information, call Peter Carian at (310) 336-8215.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 78
Thermal Blankets And Tie-down Cables Can Jam Mechanisms
The Problem:
An antenna reflector on a communication satellite could not deploy.
The Cause:
The reflector was tied down to the bus deck during launch with four cables. When the cables
were pyrotechnically cut, the two hinged reflector booms failed to deploy.
Later, ground testing showed that the pocket-shaped thermal blankets covering the tie-down
mechanisms expanded during the ascent, fouling the wrap cable. The spring-loaded hinges did not
have enough force to overcome this interference.
[Figure: Antenna Reflector (Simplified). Labels: reflector; Velcro attachment, reinforced with
Kapton tape; tie-down mechanism, with cable and internal springs (one of four); blanket cover;
cable cutter.]
Lessons Learned:
• Anticipate the errant movement and expansion of flexible materials, such as wires and
blankets.
• Allow thermal blankets to vent whenever possible.
• Avoid protrusions or sharp edges that can snag soft items.
• Indicate the presence of soft goods on top-level assembly drawings to draw attention to
the risks of interference and obstruction problems.
For more technical information, call Robert Postma at (310) 336-7228.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 79
Make Sure Software and Hardware Engineers Communicate with Each Other
The Problem:
An experimental spacecraft lost its computers.
The Cause:
The satellite, hitchhiking on the qualification flight of a launch vehicle, was designed and
built in one year.
The bus software was checked out against the engineering model without incident, but was not
tested against the payloads until the spacecraft was already loaded onto the host vehicle. It was
then discovered that a payload performed very sluggishly.
Sidebar: A “Deadly Embrace” by the Watchdog
The computer uses an independently clocked watchdog function (Lesson 36) to enable switching to
the redundant CPU if the primary side malfunctions (for example, due to radiation damage).
The final software mistakenly set the watchdog counter to 0.1 s, but it took the hardware about a
third of a second to boot. The CPU could not finish booting before being reset, and was stuck in
an endless loop.
Three launch-support engineers worked 14 hours a day for a week to adjust the bus memory-
management functions. They created several software patches, one of which contained a wrong
boot-up parameter. The mistake was not caught because the software developer did not consult
with the processor engineers, nor verify the changes in the engineering model.
The software was loaded into the primary processor, which halted right away. Assuming a
faulty primary memory was the cause, and again not enlisting the CPU expert’s help, the
engineers loaded the same code in the backup computer. It froze, too.
The computer could be physically reset. But by this time it would take several days to remove
other experiments to reach the frozen computer, possibly delaying the flight. The host mission
refused, and the hitchhiking project could only watch the launch, knowing its computers had
already died.
The project manager traced the failure to poor communication between the software and
hardware personnel, because the software team worked in isolation.
Lessons Learned:
• Make sure no single parameter error or single spacecraft malfunction can cause endless
cycling (for example, by enabling the watchdog function to switch to a recovery mode
after a few “try agains”).
• Double-check last-minute code changes (Lesson 43).
• Problems in embedded systems are not always due to random hardware defects. Pause and
think before inflicting the same software flaw on the redundant side (Lesson 18).
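The first lesson above (switch to a recovery mode after a few “try agains”) can be sketched as a simple counter. This is a minimal illustration, not the flight design: the retry limit, the mode names, and the function name are hypothetical, and only the 0.1-second watchdog setting and the roughly one-third-second boot time come from the account above.

```python
def boot_with_watchdog(boot_time_s, watchdog_timeout_s, max_attempts=3):
    """Return (mode, resets) for a boot sequence guarded by a watchdog.

    Each attempt is reset if booting takes longer than the watchdog
    timeout. After max_attempts resets the (hypothetical) policy is to
    stop cycling and fall back to a recovery mode.
    """
    resets = 0
    while resets < max_attempts:
        if boot_time_s <= watchdog_timeout_s:
            return "NORMAL", resets
        resets += 1  # watchdog fired before boot completed
    return "RECOVERY", resets

# Mis-set parameter from the incident: 0.1 s watchdog, ~1/3 s boot.
print(boot_with_watchdog(boot_time_s=1 / 3, watchdog_timeout_s=0.1))
# A timeout longer than the boot time lets the computer start normally.
print(boot_with_watchdog(boot_time_s=1 / 3, watchdog_timeout_s=0.5))
```

Without the attempt limit, the first call would never return, which mirrors the endless reset loop described in the sidebar.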
For more technical information, call Lan Nguyen, at (310) 336-2146.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 80
Check, Double-check, and Triple-check Torquer Phases
The Problem:
A magnetic torquer sign error was caught just one day before launch.
The Cause:
The attitude control engineer who calculated the fields induced by the applied current made an
error in an equation, which reversed the predicted torques.
The engineer left the project, and his successor, misunderstanding the vendor’s drawing notes,
installed all three coils upside down. The second error, which could have been easily discovered
with a compass, was masked by the faulty truth table.
Fortunately, the prime contractor’s president had concerns with a delay in generating solar
power (Lesson 53). As a result, the attitude control components relating to sun acquisition were
thoroughly scrutinized.
Sidebar: Two Other Mistakes on This Mission
1. The calculated moments of inertia, which should have been referenced against the center of
gravity, were instead referenced against the origin point on the drawing. The mistake was caught
by an independent analysis (Lesson 2).
2. The star tracker misbehaved on-orbit because the vendor altered its coordinate convention but
the change notice was not heeded.
To alleviate prelaunch work load, the customer paid to bring back the original attitude control
engineer. Rechecking his own calculations, he spotted the sign error one day before launch.
Lessons Learned:
• Don’t overlook simple tests that can discover problems early.
• Whenever possible, conduct independent analyses.
• Document attitude control coordinate frames early in development to avoid mistakes.
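An independent polarity check of the kind recommended above can be as simple as recomputing the expected torque from first principles and comparing signs with what is measured. The sketch below uses the standard relation torque = dipole × field (a vector cross product). The numerical values, the function names, and the sign-comparison threshold are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def expected_torque(dipole_am2, field_t):
    """Torque (N*m) on a magnetic dipole (A*m^2) in a field (tesla)."""
    return cross(dipole_am2, field_t)

def same_sign(expected, measured, threshold=1e-9):
    """True if every significant expected component matches in sign."""
    for e, m in zip(expected, measured):
        if abs(e) > threshold and e * m <= 0:
            return False
    return True

field = (0.0, 2e-5, 0.0)                          # assumed local field, tesla
good = expected_torque((1.0, 0.0, 0.0), field)    # coil as designed
flipped = expected_torque((-1.0, 0.0, 0.0), field)  # coil installed upside down
print(good, flipped, same_sign(good, flipped))
```

The value of the check comes from its independence: comparing hardware against a truth table that carries the same sign error, as happened here, proves nothing.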
For more technical information, call David Voelkel at (505) 846-8380 or Geoffrey Smit at
(310) 336-1602.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 81
Designate A Responsible Engineer for Complex Equipment
The Problem:
A satellite lost part of its primary structure one minute after liftoff.
The 1200-pound shroud was supposed to fit tightly to the satellite body during ascent and
then extend five inches after reaching orbit. The contractor delegated the development of this
complex hardware to its structures department without putting a project engineer in charge.
[Figure labels: micrometeoroid shield; tiedowns for solar array broke apart.]
Coordination suffered. Not having been told that the shield must fit tightly during launch,
the structural and manufacturing engineers made it light but fragile. Without looking at
Lesson 82
Understand Transient Behavior of Analog Circuits
The Problem:
A pyro device failed to fire on orbit.
The Cause:
The incident stumped engineers because pyro units rarely malfunction, and two certification
units fired successfully.
An outside expert pointed out that when current passed through the bridgewire, ohmic heating
raised its resistance. Because the firing circuit was designed as a constant voltage output,
current and power dropped off (P = V^2/R) just enough to thwart ignition.
[Figure: Current vs. Time (Malfunctioning Unit). Axis labels: spec 3.5 A, 1 A, 0 amp; 0.01 s.]
Most pyro unit outputs are current-limited with series resistors, or energy-limited with ca-
pacitor discharges. Few engineers realize that the bridgewire resistance can change within the
hundredths of a second it takes to heat the bridgewire enough to ignite the charges. In fact, the
initiator specification only stipulated the firing current, not how long the pulse should hold.
The designers, who did not know how pyro circuits typically work, used a constant, low-
voltage approach that turned out to be vulnerable.
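The difference between the two drive approaches follows directly from Ohm's law. The sketch below compares the power delivered to a bridgewire whose resistance rises as it heats, once with a constant-voltage source (P = V^2/R) and once with a constant-current source (P = I^2 R). The resistance values and the 3.5 V source are hypothetical, chosen only so that the cold current equals the 3.5 A specification. They are not taken from the failed unit.

```python
def power_constant_voltage(volts, ohms):
    """Power into the bridgewire from a fixed-voltage source."""
    return volts ** 2 / ohms

def power_constant_current(amps, ohms):
    """Power into the bridgewire from a fixed-current source."""
    return amps ** 2 * ohms

R_COLD = 1.0   # ohms, assumed bridgewire resistance before heating
R_HOT = 1.5    # ohms, assumed resistance after ohmic heating
V_SOURCE = 3.5  # volts, gives 3.5 A into the cold bridgewire
I_SOURCE = 3.5  # amps, the specified firing current

for label, r in (("cold", R_COLD), ("hot", R_HOT)):
    p_v = power_constant_voltage(V_SOURCE, r)
    p_i = power_constant_current(I_SOURCE, r)
    print(f"{label}: constant-voltage {p_v:.2f} W at {V_SOURCE / r:.2f} A, "
          f"constant-current {p_i:.2f} W")
```

With a fixed voltage, both current and power fall as the wire heats; with a fixed current, heating power rises instead, which is one reason current-limited or energy-limited outputs are the usual choice.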
A lack of fidelity in design verification hid this mistake. During simulation tests, a resistor was
used to emulate the initiator, and the current was steady because the resistance did not change.
A fast-blow fuse, which more accurately simulates the load, would have revealed the resis-
tance change.
The design was certified based on only two live firings, during which no current trace was
recorded. In retrospect, the successes were purely a matter of luck—there was just enough
current margin for success 60 percent of the time. If more units had been fired, or if instru-
mentation had been used, the inadequacy would have been found.
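The weakness of a two-firing certification can be put in numbers. The sketch below assumes each firing is independent and succeeds with the 60 percent probability quoted above. It computes the chance that two firings both succeed and how many firings would be needed before an all-success run becomes unlikely. The 5 percent threshold and the function names are hypothetical.

```python
def prob_all_succeed(p_success, n_firings):
    """Probability that n independent firings all succeed."""
    return p_success ** n_firings

def firings_needed(p_success, max_miss_prob):
    """Smallest n for which an all-success run is rarer than max_miss_prob."""
    n = 1
    while prob_all_succeed(p_success, n) > max_miss_prob:
        n += 1
    return n

P_SUCCESS = 0.6  # per-firing success rate from the investigation
print(f"two successes in a row: {prob_all_succeed(P_SUCCESS, 2):.2f}")
print(f"firings needed for <5% chance of hiding the flaw: "
      f"{firings_needed(P_SUCCESS, 0.05)}")
```

Under these assumptions roughly one certification campaign in three would have passed on luck alone, which is why recorded current traces and margin measurements carry more weight than a count of successes.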
Lessons Learned:
• Check time-dependent circuit behavior, and bound transients in specifications.
• Do not qualify a design solely because a unit worked. Measure circuit parameters and ver-
ify that positive margins exist.
• Analyze instrumentation data, which can provide more engineering information such as
postfire conduction (which may drain flight battery).
• Understand how circuits are typically designed and tested before inventing novel
approaches.
• Qualify pyro devices by conducting lot acceptance testing.
• Review the Pyroinitiator User's Guide published by NASA (JSC-28596A).
For more technical information, call Ron Williamson at (310) 336-2149.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 83
Put Critical Analyses Under Configuration Control
allowed the air into the frigid engine, where it
[Figure label: ΔP across check valve (right scale)]
Lesson 84
Check Start-up Circuit Behavior, Particularly at Low Temperatures
The Problem:
The primary side of an onboard computer would not turn on.
The Cause:
The computer received analog housekeeping inputs via numerous multiplexers inside the
[Figure label: Load total at -5°C]
Lessons Learned:
• Use fault-tolerance circuits to protect upstream assets, not load units. Better yet, use dual-
level current limiters to protect load units during ground tests. But for flight, protect only
the source circuits.
• Redesign fault-tolerance circuits when the load units have been substantially altered.
For more technical information, call Peter Carian at (310) 336-8215.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 85
Systems and Software Engineering Should Actively Coordinate
The Problem:
A satellite could not be deployed.
The Cause:
The payload separation system was designed to accommodate two satellites, but only one
satellite flew on this mission.
The mission specification had the separation commands sent to the “forward” position. An
engineer redlined the commands to “aft” to simplify wiring. Unfortunately, this change was not
incorporated in the final mission specification.
Not realizing that the informal redline had fallen through the cracks, the hardware group
designed an incompatible harness. The drawings were released as a new baseline, making it
difficult to detect crucial changes. Several systems engineering departments could have checked
the compatibility of the final design to overall requirements, but none did: the key mission
specification was developed by software engineers and was not placed under systems
engineering’s jurisdiction.
[Figure: Separation Configuration. (Top) For two payloads; (Bottom) For the failed mission.
Legend: BW = Bridge wire; SFC = Squib firing circuit; I/F = Interface connection. Labels: forward
payload, aft payload, payload, mission unique, generic core, software commanded, hard wired.]
The mistake was not discovered on the ground because the generic systems test activated both
positions, allowing the miswired ordnance verification unit to appear working.
Lessons Learned:
• Test the specific configuration that will be flown (Lesson 3).
• Conduct tests and reviews to validate that the requirements are met, rather than that the
drawings are correctly implemented.
• Actively involve systems engineers in software development activities, and formally con-
trol all system (including software) interfaces.
Lesson 86
Hand-Over Logic Tree Must Be Unambiguous
The Problem:
A suborbital launch was inadvertently terminated less than a minute after liftoff.
Lesson 87
Avoid Repeating Other People’s Mistakes
The Problem:
A launcher’s maiden flight failed.
The Cause:
flow into the aft area, the flame damaged an
[Figure: (a) As designed. Labels: actuator and guidance cable; the flame.]
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 88
Verify Each Operation Step
The Problem:
A piece of flight hardware was damaged during its integration to the launch vehicle.
The Cause:
During launch vehicle erection, the Stage III, spin table, and the satellite were contained in a
canister and bolted to the Stage II. After the guidance systems were connected, a technician
had to remove the bolts before the canister could be lifted.
To indicate that he was to start unbolting, the technician put both thumbs up and shouted
“ready.” The crane operator heard “Randy,” his name, and mistakenly interpreted the gesture as
a command to hoist. The shackled stack was raised up; the spin table suffered structural
damage.
Sidebar: A Similar Incident
As a thunderstorm approached a launch pad, workers draped a rain shield over a satellite being
processed in the White Room.
The shield consisted of overlapping strips of waterproof cloth, secured with adhesive tapes. The
installation instructions stated, “ensure both top and bottom sides of seam are taped.”
Nonetheless, the lower side was neglected, nor was there a verification.
Rainwater poured through the building’s leaks. The weak rain shield collapsed, drenching the
satellite. Launch had to be delayed for years.
The error took place because:
1. Not realizing the lift operation could be hazardous, the foreman allowed an uncertified
technician to direct the crane. A properly trained rigger would have avoided making an
ambiguous “thumb-up” sign.
2. The operating procedure did not require anyone to verify that the bolts had indeed been
removed. The crane driver should have been taught to first ask for the restraining pin, for
example.
3. The procedure did not specify communication protocol.
Lessons Learned:
• Implement a discrete verification step for each critical task.
• Require positive confirmation before hazardous commands can be acted upon.
• Do not deviate from written procedures.
• Handle space hardware carefully.
For more technical information, call Norman Lagerquist at (310) 336-2362.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 89
Prevent Hardware Fratricide
[Figure: fairing diagram. Labels: helper spring, circumferential, thruster spring.]
The Problem:
Lesson 90
Account for All Loose Materials
The Problem:
A large engine partially melted during a test firing.
The Cause:
Investigators found that a large piece of sealing tape, routinely used during engine assembly,
blocked the fuel injector and caused the turbopump to overheat.
The investigation board reprimanded the manufacturer for not having a disciplined process to
handle, or account for, loose materials. The processing paperwork was not traceable, making it
difficult to know what work was done on which part.
In this case, the build log supposedly documented tape removal and independent verification.
The Investigation Board discovered, however, that tape reportedly taken out was repeatedly
found during postfire inspection or engine rebuild.
Sidebar: Other “Foreign Object Damage” Incidents
• Debris contamination spoiled five foreign launches between 1990 and 1999, including several
caused by rags clogging propulsion lines.
• Debris such as paper clips left in RF cavities repeatedly caused test failures on a satellite
program. The contractor finally developed an electromagnetic probe to sweep all cavities
before they were sealed.
• A jet engine contractor suffered several failures caused by bolts or tools being left inside
test units. The management subsequently required an inspector to go inside the inlet to check
for debris using a flashlight.
Right after the new procedure was implemented, the engine blew up. The flashlight was left
behind. (From “Augustine’s Laws.”)
Lessons Learned:
• Make sure loose, nonserialized materials (such as wipe cloth) used during assembly are
carefully accounted for.
• Correct the root cause of in-process anomalies (Lesson 32).
• Keep accurate records of all “nonflight” installations.
• Take photos frequently during assembly.
• Design hardware to minimize areas that cannot be easily inspected, and avoid the use of
potential contaminants whenever possible.
• Keep hardware closed when access is not needed.
• Review out-of-flow processes to ensure no steps are bypassed (Lesson 64).
For more technical information, call Dana Speece at (310) 336-5021 or Gary Shultz at (310)
336-2342.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 91
Ensure Critical Systems Are Tolerant of Transient Power Loss
The Problem:
A first-stage engine shut down soon after liftoff.
The Cause:
Immediately before the mishap, the bus current spiked twice. Evidently, a power cable had a
breach in its insulation layer, and momentarily grounded. The engine relay box lost power, and
numerous relays controlling the propulsion valves dropped out, disabling the engine.
By design, the relays lock on their own contacts during flight, which depends on a continuous
supply of electricity to retain their running configuration. If the power is lost, even for an
instant, the relays unlatch with no means to recover.
The vulnerability to a transient short had been recognized by the contractor for years.
Unfortunately, even though many design improvements
[Figure: Cabling defects. Credit: Florida Today.]
Sidebar: A Lesson Not Learned
After this incident, the contractor redesigned the 30-year old control electronics to provide
redundant power and guidance. A sister launch vehicle program, however, did not make a similar
change.
Years later, the second program suffered a failure. Apparently, a defective power cable shorted
intermittently, causing the guidance computer to reset and the inertial measurement unit to lose
reference.
The launcher had miles of wires; forty-four repairs had been made on this particular vehicle
alone. In retrospect, it was clearly impossible to inspect out every wiring defect, and the
decision not to provide redundant power proved costly.
Lessons Learned:
• Ensure the onboard computer retains “most recent state” information so that if a glitch
causes the loss of “present state” data, the vehicle can revert to a survivable configuration.
• Anticipate wiring problems, and provide redundant power sources to critical systems, in-
cluding lock-in power circuits to prevent hardware reset.
• Recognize the need to address weaknesses in nonpropulsive systems.
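The latching behavior described above, and the benefit of a second power source, can be illustrated with a toy model. The sketch below treats a self-holding relay as a state that survives only while its coil supply is present, and models redundant power as a logical OR of two buses. The sample sequences, the function names, and the assumption that the two buses fail independently are all hypothetical.

```python
def run_latched_relay(power_samples):
    """Track a self-holding relay that starts latched.

    The holding current flows through the relay's own contact, so a
    single unpowered sample drops it out and it cannot re-latch.
    """
    latched = True
    states = []
    for powered in power_samples:
        latched = latched and powered
        states.append(latched)
    return states

def redundant_supply(bus_a, bus_b):
    """Coil supply when two buses are OR-ed together (assumed design)."""
    return [a or b for a, b in zip(bus_a, bus_b)]

bus_a = [True, True, False, True, True]   # momentary short on bus A
bus_b = [True, True, True, True, True]    # healthy second bus

print(run_latched_relay(bus_a))
print(run_latched_relay(redundant_supply(bus_a, bus_b)))
```

In the single-bus case the relay stays dropped out after the glitch even though power returns, which is the unrecoverable condition the lesson warns about.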
For more technical information, call Peter Carian at (310) 336-8215.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 92
Rigorously Determine the Root Causes of Test Failures
The Problem:
The primary laser in an instrument failed after a month in space.
The Cause:
The laser pump consisted of several diodes
Lasers have not flown in space often. The design of this laser was derived from a previous
program and was procured commercially. In retrospect, the vendor’s internal processes and
controls were not up to par for space applications. The new design was more vulnerable because
current density in the contaminated bondwires increased by 40 percent, intensifying thermal
loads in the wires. Several years of launch delay made the degradation worse.
[Figure: Laser Array Stacks (Simplified). Labels: laser diode (mounted with tin/lead solder),
array endcap, heat sink, gold wire bonds.]
[Figure: Au/In Reaction in Terrestrial Applications. Labels: original gold wire, Au/In
intermetallics, remaining gold wire; panels at 8 years, 14 years, 25 years. Credit: Lawrence
Livermore Lab.]
During qualification, the bondwires broke several times. The vendor replaced the defective
components and asserted that the failures would not recur. A laboratory analysis, which would
have discovered the root problem, was requested but not carried out.
Lessons Learned:
• New technologies require rigorous qualification, analysis of design changes, and a thor-
ough understanding of failure modes.
• Audit a vendor’s manufacturing process, conduct destructive physical analysis of sample
parts, and ascertain the root causes of all anomalies.
• Review the materials and processes for each new application drawing.
• Guard against known materials incompatibilities (gold/tin intermetallics can embrittle
solder joints, for example).
For more technical information, call Renny Fields at (310) 336-6973.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 93
Always Ascertain the Direction of Current Flow
The Problem:
Contact with a satellite was lost soon after launch.
The Cause:
The satellite consisted of a domestic instrument module and a foreign service module. A design
mistake in the foreign unit caused the solar panels to be connected backwards.
The domestic instrument supplier, in charge of system integration, checked the interface
between the solar panels and battery, but only verified the magnitude of current, not its
direction. Engineers might have become confused as to how the current should flow because the
foreign unit grounded positively but the American unit grounded negatively.
Once in orbit, the battery drained, ruining the mission.
[Figure: Polarity Confusion. Solar array with positive polarity ground, often used in foreign
design; solar array with negative polarity ground, commonly used in US design.]
Sidebar: A Similar Incident (from “Augustine’s Laws”)
A preflight check found two hardware modules wired in the opposite polarity. Both subcontractors
reversed their cables. The launch failed.
"It's always the simple stuff that kills you," lamented the lead engineer.
Lessons Learned:
• Make sure that engineers understand how the system or component should function during
test.
• Thoroughly verify interfaces of subcontracted items, particularly when the suppliers use
different engineering conventions.
• Use an engineering model to verify interfaces early.
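An interface check that looks at direction as well as magnitude needs only one extra comparison. The sketch below is a minimal illustration, assuming a sign convention in which positive current flows from the solar array into the battery. The function name, tolerance, and sample readings are hypothetical.

```python
def check_charge_current(measured_a, expected_a, tol_a):
    """Return (magnitude_ok, direction_ok) for a measured charge current.

    Assumed convention: positive amps flow from the array into the
    battery. A magnitude-only test would pass a reversed connection.
    """
    magnitude_ok = abs(abs(measured_a) - abs(expected_a)) <= tol_a
    direction_ok = measured_a * expected_a > 0
    return magnitude_ok, direction_ok

# Correct wiring: battery charges at about 2 A.
print(check_charge_current(2.05, 2.0, 0.1))
# Reversed panels: same magnitude, opposite direction.
print(check_charge_current(-2.05, 2.0, 0.1))
```

The second reading passes the magnitude test and fails only the direction test, which is the case the integration check in this incident could not see.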
Lesson 94
Provide Debug Features in Flight Software to Assist Anomaly Resolution
The Problem:
An interplanetary probe lost some scientific data due to occasional system resets.
The Cause:
Driven by demanding mission requirements, the designers used a commercial, realtime,
multiple-tasking operating system.
An esoteric “priority inversion” problem took place during science operation and caused some
data loss. This glitch was not caught on the ground because the Earth-pointing antenna
performed better than expected, allowing more frequent downlinks than originally planned.
Fortunately, debugging tools, written during code development, were embedded in the software.
With extensive support of the vendor, the project was able to reproduce the problem in the
laboratory and identify the cause. A quick fix allowed the mission to successfully conclude.
[Figure labels: J_LOW* (other instruments), J_MEDIUM (bus tasks), J_HIGH (data management), 1553
databus, shared memory, J_LOW (from one instrument), processor, watchdog timer.]
Sidebar: The Priority Inversion Problem
Because the bus and instruments share the processor, job allocation is vital. The highest
priority is given to data management, followed by bus tasks and by science activities. If data
management tasks cannot complete within the watchdog’s 125 millisecond cycle, an anomaly is
assumed and the computer is reset.
Data from the bus and payloads flow through a 1553 data bus, but one instrument is processed
directly. That sensor shares a software function with the transaction manager, which is not a
prudent design but normally not a problem. Access to this resource was controlled with a key. If
a data manager job (J_HIGH) starts late in the cycle, it may find a job from this instrument
(J_LOW) still in process. If J_HIGH also requires the shared software function, it must pause
for the key.
When a communication job (J_MEDIUM) initiates during the short interval, however, it preempts
J_LOW, preventing the key’s release. The system watchdog timer starts the next cycle, finds
J_HIGH unfinished, and resets the system.
Turning on “priority inheritance” options for that particular thread (giving high priority to
J_LOW in light of jobs blocked by it) solves this problem. This option is not normally used as
default due to performance concerns.
Lessons Learned:
• Ensure that commercial software, especially the operating system, allows access to internal
information and is compatible with development debug tools.
• Test for off-nominal conditions, both “better” and “worse” than expected (for example, at
higher throughput rate), to see if the system misbehaves.
• Leave debug capabilities embedded in the operational system.
• Shared functions must be thoroughly tested, especially for timing.
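The priority inversion described in this lesson can be reproduced with a toy fixed-priority scheduler. The sketch below is a simplified simulation, not the flight operating system: it advances in 1-millisecond ticks, runs the highest-priority job that is not blocked, and optionally lets the key holder inherit the priority of a job waiting for the key. Only the 125-millisecond watchdog cycle comes from the account above; the release times, durations, and all names are hypothetical.

```python
from dataclasses import dataclass

WATCHDOG_MS = 125  # cycle length quoted in the lesson

@dataclass(eq=False)
class Job:
    name: str
    priority: int      # larger number means more urgent
    release_ms: int    # time the job becomes ready
    work_ms: int       # processor time it needs
    needs_key: bool    # True if it uses the shared function
    finish_ms: int = -1

def make_jobs():
    return [Job("J_LOW", 1, 100, 10, True),
            Job("J_MEDIUM", 2, 106, 30, False),
            Job("J_HIGH", 3, 105, 5, True)]

def simulate(jobs, inheritance, horizon_ms=300):
    """Run the jobs tick by tick and return each job's finish time."""
    remaining = {j.name: j.work_ms for j in jobs}
    key_holder = None
    for now in range(horizon_ms):
        ready = [j for j in jobs
                 if j.release_ms <= now and remaining[j.name] > 0]
        blocked = [j for j in ready
                   if j.needs_key and key_holder not in (None, j)]
        runnable = [j for j in ready if j not in blocked]
        if not runnable:
            continue

        def effective(job):
            # With inheritance, the key holder borrows the priority
            # of the most urgent job waiting for the key.
            if inheritance and job is key_holder and blocked:
                return max([job.priority] + [b.priority for b in blocked])
            return job.priority

        running = max(runnable, key=effective)
        if running.needs_key:
            key_holder = running
        remaining[running.name] -= 1
        if remaining[running.name] == 0:
            running.finish_ms = now + 1
            if key_holder is running:
                key_holder = None
    return {j.name: j.finish_ms for j in jobs}

for flag in (False, True):
    result = simulate(make_jobs(), inheritance=flag)
    verdict = "ok" if result["J_HIGH"] <= WATCHDOG_MS else "watchdog reset"
    print(f"inheritance={flag}: {result} -> {verdict}")
```

In this hypothetical scenario J_HIGH overruns the watchdog cycle without inheritance because J_MEDIUM keeps J_LOW from releasing the key; with inheritance J_LOW finishes first and J_HIGH meets the deadline, at the cost of delaying J_MEDIUM.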
For more technical information, call Suellen Eslinger at (310) 336-2906.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 95
Ensure Heritage Designs Can Operate in the New Application Environment
The Problem:
An interplanetary probe mysteriously failed.
The Cause:
The incident occurred when the vehicle, having completed a year-long flight, pressurized its
propulsion system in preparation for an orbit-insertion burn. The propulsion
Extensive testing could not reproduce the
[Figure: Restrictor exposed to oxidizer vapor: (a) after 30 days; (b) after one year, showing
extensive corrosion. Source: AIAA-2001-3630.]
Lessons Learned:
• Avoid relying on short-term tests (days to months) to confirm long-term reliability.
• Audit vendor material lists to ensure completeness.
• Account for vapor diffusion in propulsion subsystem design.
For more technical information, call Mark Mueller at (310) 336-5081.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 96
Tests Must Independently Verify Development Results
The Problem:
A space telescope was out of focus.
The Cause:
The telescope’s primary mirror was polished with the aid of a “null corrector.” Lights that are
shone on a perfect mirror, when reflected through the corrector, should form straight
interference patterns.
The corrector was set up with a positioning rod capped on one end. A light beam passed through
a small aperture in the cap to focus on the rod’s tip, and a lens was placed at the other end of
the rod.
Unfortunately, a speck of antireflective coating chipped off the rod’s cap, and the focusing
beam was aimed at the cap instead. The lens was misplaced; the mirror was misshapen.
Because the contractor used the corrector not only as a manufacturing tool but also as the sole
referee standard, it could not detect the mistake. In fact, each of two pieces of auxiliary
optics suggested gross errors. However, confident that the new-technology corrector was better,
the engineers ignored the red flags.
[Figure: Mirror Manufacturing Process (Simplified). Missing coating (view a) near the cap
aperture caused the operator to aim the light at the cap instead of at the rod (view b). Labels:
null corrector, point light source, upper mirror, lower mirror, interferometer, light from
interferometer, lens, rod, cap, anti-reflective coating, chipped coating, telescope mirror; top
view, side view; as intended, actual.]
Sidebar: Operators Failing to Call Attention to the Problem
The misfocusing prevented the metering rod from reaching the lens, but the technicians simply
extended the rod by inserting a few washers.
“That in itself should have alerted people…because clearly there should not be a need for any
unexpected washers to be added,” said the investigation board.
Lessons Learned:
• Use simple tools to crosscheck elaborate tests.
• Scrutinize test equipment, analysis, or algorithms reused from design or manufacturing for
possible single-point failure.
For more technical information, call Julie White at (310) 416-7229.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 97
Control Hardware and Software Configurations Before, During, and After Tests
The Problem:
A satellite pointed toward the Sun with the
wrong axis.
The Cause:
As the satellite exited eclipse for the first time, it
should have pointed a vector 35 degrees off the
z-axis toward the Sun. Instead, it wobbled, while
pointing the x-axis to the Sun. Fortunately, one
of the solar wings was illuminated, giving the engineers time to recover.
The next day, an examination of a photo taken at the launch site revealed that two Sun sensors
were mounted ninety degrees off. A software change quickly fixed the problem.
The Sun sensors were mounted on the main access panel in the intended direction during
verification testing, before the panel was attached to the spacecraft. When the panel was being
installed, however, the mechanical engineers found that the sensor cables were too short to
mount the sensors “as hung.” Seeing no control document on the sensor configuration, they
turned the sensors sideways, without informing the guidance and control (G&C) engineers of the
change.
[Figure: Sun Sensor Misorientation. Satellite views: early in integration (a) and final (b);
close-up views (a) and (b), notice wiring direction.]
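A ninety-degree mounting error of this kind can in principle be absorbed in software by rotating the sensor reading back into the intended frame. The source says only that a software change fixed the problem, so the sketch below is a hypothetical illustration: it assumes the sensor was turned +90 degrees about its own boresight (taken here as the z-axis), and the function name and sample vector are invented for the example.

```python
import math

def rotate_about_z(vec, angle_deg):
    """Rotate a 3-vector about the z-axis by angle_deg (right-handed)."""
    a = math.radians(angle_deg)
    x, y, z = vec
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)

MOUNT_ERROR_DEG = 90.0  # assumed rotation of the sensor about its boresight

sun_true = (1.0, 0.0, 0.0)  # Sun direction in the intended sensor frame
# A sensor whose axes are turned by +90 deg reports the vector rotated by -90 deg.
sun_reported = rotate_about_z(sun_true, -MOUNT_ERROR_DEG)
# The software correction applies the mounting rotation to undo the error.
sun_corrected = rotate_about_z(sun_reported, MOUNT_ERROR_DEG)

print("reported :", tuple(round(c, 6) for c in sun_reported))
print("corrected:", tuple(round(c, 6) for c in sun_corrected))
```

A correction like this is only as good as the knowledge of the as-built orientation, which is why the lesson stresses photographs and controlled alignment documents.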
Lessons Learned:
• Always ascertain G&C actuator phasing (Lessons 53, 60, 80).
• Ensure domain engineers own all aspects of their subsystems.
• Conduct end-to-end testing in the flight configuration.
• Take plenty of photographs during assembly.
• Document G&C subsystem-level alignment. See Guideline GD-ED-2211 from NASA
Technical Memorandum 4322A, for example.
For more technical information, call Geoffrey Smit at (310) 336-1602.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 98
Guard Against Post-Firing Conduction of Pyro Initiators
The Problem:
The redundant memory board on a spacecraft failed.
The Cause:
During an orbit insertion maneuver, the satellite fired several explosive bolts to jettison a
solid rocket.
The burning pyro propellant formed a conductive plasma, shorting to the chassis-grounded case.
A voltage surge rippled through the input protection diode in the ...
... the primary memory had latched, the mission could have failed.
[Figure: Post-firing Conductive Mechanism (Simplified Bus Grounding Architecture). Image
credit: NASA. As designed, the pins are insulated from the case, and firing current returns
via an isolated path. When the pyro is shorted, the normal return path is bypassed; ringing and
a voltage spike occur in adjacent circuits due to ground coupling. Labels: arming relays,
firing relay, power supply, pyro, memory circuits, single point, chassis, capacitive coupling,
electromagnetic coupling.]
Lessons Learned:
• Protect firing circuits against sneak currents and line-to-ground shorts. Components such as
step motors and pyro circuits that experience sudden current changes should be isolated from
all other current-carrying circuits including electrical power, electrical control, RF
transmission lines, and monitoring circuitry. For additional information, see Electromagnetic
Interference Analysis of Circuit Transients, NASA Preferred Reliability Practice
No. PD-AP-1308, for example.
• Check circuit designs against Electroexplosive Subsystem Safety Requirements and Test
Methods for Space Systems (MIL-STD-1576), NASA Standard Initiator User's Guide (JSC-28596A),
and Electrical Grounding Architecture for Unmanned Spacecraft (NASA-HDBK-4001).
Other Post-Fire Conduction Conditions
Post-fire plasma shorts can drain batteries. See Journal of Spacecraft and Rockets, 36,
586-590 (1999).
Drive elements can be disabled by residual current, and should be inspected after ground live
tests. In one case, an inspection found a damaged fusing resistor, which would have prevented
in-flight firing.
Between 3% and 5% of firings result in conduction.
For more technical information, call Ron Williamson at (310) 336-2149.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 99
Have the Model’s Originator Check the Analysis
The Problem:
A spacecraft broke up after firing its embedded solid rocket motor.
The Cause:
The accident investigation board subsequently found that the spacecraft would suffer massive
heating from the motor exhaust plume and disintegrate. The motor vendor estimated that
heating would be almost two orders of magnitude higher than expected by the contractor.
Why was the design, qualified by similarity, so far off?
It turned out that the motor in the previous mission was actually more extended
(d_(heritage mission, actual) = 11.03 inches). The distance shown in the conference paper was an error!
The author knew about the mistake but unfortunately did not know the contractor relied on his
publication instead of the model, which did not include this erroneous diagram.
Lessons Learned:
• Double check all analysis models, assumptions, methods, and predictions.
• Develop a rigorous process for using experience as a basis for accepting further designs
and equipment.
• Have the original analyst review final product (Lesson 26).
• Make sure key subcontractors accept how their product is being used.
Lesson 100
Make Sure Safety Mechanisms Are Truly Independent
The Problem:
A satellite suffered a near-catastrophic short.
The Cause:
Following launch, the spacecraft turned on a set of wax heaters for three minutes to activate
the release actuators on the solar arrays. Later, a design error in a field-programmable gate
array (FPGA) inside the power controller caused the primary heaters to be reactivated. After
ten minutes, the overheating primary elements shorted to the secondary elements, and
subsequently to the bus structure. The short circuits drew hundreds of watts, at a current
level several times the power board’s design limit.
Fortunately, the heater traces burned open, saving the power distribution unit from permanent
damage. Otherwise, the mission would have ended.
[Figure: Actuator Diagram (Simplified). Labels: power controller, FPGA, spurious signal, +28V,
primary heater, secondary heater, wax, plunger, wax actuator (4 sets).]
Ensure Independent Safety Mechanisms
The ARM and FIRE relays in Diagram (a) can prematurely close on one FPGA error. Separate
drivers (b) should be used.
[Figure: (a) high level commands and enable feeding a single FPGA that drives both the arm and
fire relays of the pyro initiator, with backup firing circuits; (b) arm driven by FPGA 1 and
fire driven by FPGA 2.]
Lessons Learned:
• Ensure safing mechanisms will prevent one design error from causing a cascade of
irreversible failures (Lesson 77). In this case, one error could have activated all the
heaters, and the solar arrays might have been deployed prematurely.
• Check for failure mechanisms during extended operation even if that is not the intended
application. If prolonged operation leads to catastrophic failure, provide circuit interrupts,
time-out protection, or a graceful degradation mechanism (Lesson 19, 71).
• Review special design requirements for FPGAs (Lesson 77).
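The time-out protection named in the lessons can be illustrated with a small sketch. It assumes
a one-shot actuator, so the guard limits cumulative heater on-time and latches off once the
limit is exceeded. The class name and the 240-second limit are hypothetical, and in hardware
such a guard would have to sit outside the FPGA whose error it protects against.

```python
class HeaterTimeoutGuard:
    """Cut heater power once cumulative on-time exceeds a fixed limit."""

    def __init__(self, max_on_seconds):
        self.max_on_seconds = max_on_seconds
        self.on_seconds = 0.0
        self.tripped = False

    def step(self, commanded_on, dt_seconds):
        """Advance time by dt_seconds; return True only if power may be applied."""
        if self.tripped or not commanded_on:
            return False
        self.on_seconds += dt_seconds
        if self.on_seconds > self.max_on_seconds:
            self.tripped = True  # latch off until deliberately reset
            return False
        return True

# Three-minute nominal activation, then a spurious ten-minute reactivation.
guard = HeaterTimeoutGuard(max_on_seconds=240.0)  # hypothetical limit
nominal = sum(guard.step(True, 1.0) for _ in range(180))
spurious = sum(guard.step(True, 1.0) for _ in range(600))
print(nominal, spurious, guard.tripped)
```

With these assumed numbers the nominal activation runs in full, while the spurious one is cut
off after one more minute instead of ten.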
Lesson 101
Provide Robust Design for Safe Modes
The Problem:
A satellite ceased operation four days after launch.
The Cause:
... down. The satellite entered a flat spin, turning its solar arrays edge-on towards the Sun,
and power was lost. By the time the ground crew returned, the battery was depleted.
Working in a contractor’s branch office, the attitude-control engineers reused a design
without realizing the previous mission had a more stable configuration in safe mode and
without performing a peer review. A similar satellite being developed at the prime’s main
campus in fact used multiple gyros in a similar safe mode, in anticipation of instability.
[Figure: Major-Axis Rule. Nominal and safe-mode attitudes (face-on illumination, nadir) with
axes X (intermediate), Y, and Z (principal axis of inertia).]
• Major axis spin is stable
• Minor axis spin is stable but may be destabilized by energy dissipation
• Intermediate axis spin is unstable
Controllers opted to leave the satellite unattended without realizing that even though stability
in the eventual mission orbit (523 km) had been demonstrated, steadiness in the lower parking
orbit, where atmospheric drag is more severe, had not been validated. Later, simulations
confirmed that attitude control would be lost in a few hours.
Lessons Learned:
• Continuously staff the ground station during spacecraft initialization.
• Analyze the effect of anomalies in all operating modes.
• Incorporate mass property and thruster imbalances in attitude stability simulation, and
avoid thruster-only control modes.
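The major-axis rule quoted in this lesson can be checked quickly once the principal moments of
inertia are known. The sketch below is a minimal illustration for a rigid body; the function
name and the inertia values are hypothetical, and a real analysis would also cover mass
property uncertainties, thruster imbalance, and drag as the lessons above require.

```python
def classify_spin_axes(inertia):
    """Label each principal axis as major, intermediate, or minor.

    inertia maps an axis name to its principal moment of inertia (kg*m^2).
    Returns a dict: axis name -> (role, stability note from the major-axis rule).
    """
    if len(inertia) != 3 or len(set(inertia.values())) != 3:
        raise ValueError("expected three distinct principal moments of inertia")
    minor, intermediate, major = sorted(inertia, key=inertia.get)
    return {
        major: ("major", "stable"),
        minor: ("minor", "stable, but may be destabilized by energy dissipation"),
        intermediate: ("intermediate", "unstable"),
    }

# Hypothetical mass properties in which X is the intermediate axis.
for axis, (role, note) in sorted(classify_spin_axes(
        {"X": 120.0, "Y": 90.0, "Z": 150.0}).items()):
    print(axis, role, note)
```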
Lesson 102
Establish Configuration Control for Ground Support Equipment
The Problem:
A spacecraft sustained heavy damage in the factory.
The Cause:
The contractor had built dozens of these satellites before and was integrating two more.
Satellite 1 was mounted, via an adapter plate, on a turnover cart and tested horizontally. As
the satellite was
[Figure labels: Cart, Satellite 2, Booster Adapter Plate, Turnover Cart Interface Plate.]
Lessons Learned:
• Maintain enough discipline to ensure space equipment is handled carefully, and avoid
allowing familiarity to breed contempt.
• Implement an unambiguous verification step for each critical task (Lesson 32).
For more technical information, call Pat Mak at (310) 336-3529 or Anthony Salvaggio at
(310) 336-3198.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 103
Ensure Adequate User Understanding of Complex Instruments
The Problem:
Results from months of electromagnetic interference (EMI) testing had to be discarded.
The Cause:
The contractor acquired a new test set, factory set to scan in 120 kilohertz (kHz) wide
windows, stepping up in 40 kHz increments. Military test specifications, however, required
1 kHz windows and 0.5 kHz steps. When reprogrammed according to the specifications, the test
equipment
[Figure: Factory Set for Commercial Applications versus Tailored Mil-Std-462 Method (1 kHz
window). Window size is A-B; step is B-B’ (or A-A”). Frequency axis marked at 6950, 7000, and
7050 kHz.]
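The gap between the two settings is easier to appreciate as arithmetic. The sketch below counts
how many window positions each setting needs to sweep a band. The 6950 to 7050 kHz band is an
assumption taken from the marks on the figure axis, and the function name and counting
convention are illustrative only.

```python
import math

def window_positions(band_start_khz, band_stop_khz, window_khz, step_khz):
    """Number of window positions needed for the windows to cover the whole band."""
    span = band_stop_khz - band_start_khz
    if span <= window_khz:
        return 1
    return 1 + math.ceil((span - window_khz) / step_khz)

# Assumed 100 kHz band, swept with the factory and the military settings.
factory = window_positions(6950.0, 7050.0, window_khz=120.0, step_khz=40.0)
military = window_positions(6950.0, 7050.0, window_khz=1.0, step_khz=0.5)
print(factory, military)
```

Under these assumptions the factory setting covers the band in a single window, while the
military setting needs roughly two hundred, so the two configurations are not interchangeable.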
Lesson 104
Ensure Ground Commands Do Not Jeopardize Spacecraft
Lessons Learned:
• Conditional commands (execution of an instruction contingent upon another) must first
verify the completion of the preceding command.
• If multiple commands can cause a mechanical or electrical conflict, code in a prevention
block (i.e., an exclusive OR).
• Make sure flight computers are restarted in a known mode with only appropriate
commands in the queue—always clear pending commands first.
• Double-check if a fuse should be installed, and carefully analyze fault scenarios to size
fuses. For guidelines, see NASA TM-02179, Selection of Wires and Circuit Protective
Devices, and NASA-HDBK-4001, Electrical Grounding Architecture for Unmanned
Spacecraft.
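The first three lessons above can be sketched as a toy command sequencer: a conditional command
is rejected until its prerequisite has completed, conflicting commands are blocked from running
together, and a restart clears anything pending. All class, method, and command names here are
hypothetical, not drawn from any flight system.

```python
class CommandSequencer:
    """Toy command queue illustrating prerequisite, conflict, and restart checks."""

    def __init__(self, conflicts):
        # conflicts: set of frozensets naming command pairs that must never be active together
        self.conflicts = conflicts
        self.queue = []
        self.active = set()
        self.completed = set()

    def restart(self):
        """Come up in a known state: nothing pending, nothing active, no history."""
        self.queue.clear()
        self.active.clear()
        self.completed.clear()

    def submit(self, name, requires=None):
        """Queue a command, optionally contingent on another command's completion."""
        self.queue.append((name, requires))

    def execute_next(self):
        """Start the next queued command or reject it; returns (name, status)."""
        name, requires = self.queue.pop(0)
        if requires is not None and requires not in self.completed:
            return name, "rejected: prerequisite not complete"
        for other in self.active:
            if frozenset((name, other)) in self.conflicts:
                return name, "rejected: conflicts with " + other
        self.active.add(name)
        return name, "started"

    def mark_complete(self, name):
        """Record verified completion of an active command."""
        self.active.discard(name)
        self.completed.add(name)

seq = CommandSequencer({frozenset(("EXTEND_BOOM", "RETRACT_BOOM"))})
seq.submit("OPEN_VALVE", requires="HEAT_LINE")
print(seq.execute_next())
```

Rejecting rather than silently deferring a command is a design choice made for this sketch; a
real system might instead hold the command and alert the operator.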
For more technical information on commanding, call Joseph Anselmi at (310) 336-7326; for
information on fuse selection, call Tom Hecht at (310) 336-1505.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 105
Establish Roles and Responsibilities Ahead of Critical Operations
The Problem:
An air launch went forward despite an abort.
The Cause:
Two agencies, a Launch Vehicle team
[Figure label: Satellite dish]
Lesson 106
Do Not Dismiss Test Anomalies as Random Events—Find Out Why (III)
The Problem:
A vibration isolator jammed in orbit.
The Cause:
The isolator used two dampers to stabilize an antenna mounted on the end of a 60-meter
[Figure labels: Bushing, Plastic seal.]
Lesson 107
Difficult Troubleshooting Calls for System-wide Rethinking
Lesson 108
Create Open Liens on Parts Procured Ahead of Qualification
The Problem:
An extremely time-critical launch was aborted.
The Cause:
The rocket was laterally supported by three retractable arms. Each arm had a crush block to
absorb the shock when the arm was withdrawn.
The preliminary design called for 5.5-inch blocks, 18 of which were delivered to the launch
facilities. The subcontractor attached a temporary part number on the blocks and
[Figure labels: Crush Block, Mounting Plate, ¼”.]
Lessons Learned:
• Never accept deliveries of “flight-like” hardware without creating a link to the inventory
system to control its use.
• Develop 100% effective reachback mechanisms for Failure Review Board (FRB) and
Material Review Board (MRB) actions.
For more technical information, call Gary Shultz at (310) 336-2342.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 109
Ensure Safe Handling of Dangerous Equipment (I)
The Problem:
A rocket accidentally took off, killing a range technician.
The Cause:
The rocket was mounted in a test stand when operators checked the continuity of the motor
igniter with a hand-held meter.
... battery pack delivered a trickle current, enough to verify the connections but not fire
the pyros. Unfortunately, the battery fit well in the meter without the holder, and there was
no warning on the holder.
When word of this accident spread, spokespersons for two other facilities stepped forward
[Figure: continuity meter circuit in three configurations. As intended: test button, battery
pack, pinch resistor, micro-ampere current only (<<1 W) through the 1 Ω bridgewire, which fires
at 1 W. As incident occurred: 1.3 V battery, ~1.3 ampere. A safer design: battery with
redundant resistor.]
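The values in the figure (a 1.3 V battery and a 1 Ω bridgewire that fires at 1 W) make the
hazard easy to check with Ohm's law. The sketch below ignores battery internal resistance, and
the 1 MΩ series resistance used for the as-intended case is a hypothetical value chosen only to
give a micro-ampere current.

```python
def bridgewire_power(battery_volts, bridgewire_ohms, series_ohms=0.0):
    """Return (current in A, power dissipated in the bridgewire in W)."""
    current = battery_volts / (bridgewire_ohms + series_ohms)
    return current, current ** 2 * bridgewire_ohms

FIRE_THRESHOLD_W = 1.0  # from the figure: bridgewire fires at 1 W

# As intended: current limited by a series resistance (value assumed).
i_safe, p_safe = bridgewire_power(1.3, 1.0, series_ohms=1.0e6)
# As the incident occurred: bare battery across the bridgewire.
i_bare, p_bare = bridgewire_power(1.3, 1.0)

print(i_safe, p_safe < FIRE_THRESHOLD_W)
print(i_bare, p_bare > FIRE_THRESHOLD_W)
```

With the limiting resistance the bridgewire sees about a microampere; without it, the ideal
current is about 1.3 A and the dissipated power exceeds the 1 W firing level.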
Lesson 110
Ensure Safe Handling of Dangerous Equipment (II)
The Problem:
A rocket being prepared for launch
... conductive lanyard.
Unfortunately, someone replaced the lanyard with an electrical wire, creating a sneak path.
During handling, brief electrical contact to the squib was made, and a loose safe-to-ground
path failed to shunt the
[Figure labels: Spring Clip (Retains Rocket), Ground, Safety Tape, Spring-loaded J-Hook,
Igniter Pin; panels (a), (b), (d).]
Lesson 111
Ensure Safe Handling of Dangerous Equipment (III)
The Problem:
An ejected pyro part narrowly missed test engineers.
The Cause:
Several refurbished pin pullers from earlier pyroshock testing were reused to verify de-
[Figure labels: Test Equipment Table, Impact Shield, 30 ft, Satellite Under Test.]
Lesson 112
Understand the Subtle Behavior and Limitations of Commercial Off-The-Shelf (COTS)
Software
The Cause:
The Rover collected a large amount of science data for occasional transmission to an overhead
relay spacecraft. A DOS-type utility indexed these files in the random-access memory (RAM),
using more memory than expected.
The amount of data acquired by the Rover burgeoned, but housekeeping telemetry did not report
the memory status. Soon, there were too many files for the RAM to handle, and the computer
turned itself off. Because the file system reload required as much memory as that cached
before the shutdown, the computer could not reboot.
The start-shutdown cycle repeated over sixty times until the batteries ran down. The system
then entered a “crippled mode” wherein the computer rebooted without loading the file indexes.
With the RAM freed up, controllers cleaned up the files and restarted the computer normally.
[Figure: memory map. Labels: Code, Data, Allocated Buffer, Unallocated Memory (Exhausted).]
Too Many Files Exhausting Memory
The file management utility used memory based on the number of files in each subdirectory,
including both deleted files and metafiles on science data. The software engineers failed to
account for this overhead and underestimated memory usage.
Although a design rule prohibited flight software from reaching into the free RAM once
initialization completed, the file-management utility was allowed to violate this rule in part
because the memory usage was mistakenly thought of as small.
Controllers attempted to upload a utility to remove unnecessary folders (merely deleting files
would not suffice). Unfortunately, the link failed and the anomaly occurred before the resend
could complete. Several other esoteric problems subsequently combined to impede autonomous
recovery.
Lessons Learned:
• Provide a way out of endless reboot cycles, and avoid start sequences that require poten-
tially unavailable resources (Lesson 79).
• Track computer resource usage, just as other vehicle consumables.
• Ensure the computer can degrade gracefully, instead of freezing up catastrophically.
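One way to provide a way out of an endless reboot cycle is to count consecutive failed boots
and fall back to a degraded boot that skips the resource-hungry step. The sketch below is a
simplified simulation under that assumption; the threshold, function names, and byte counts are
hypothetical and do not describe the Rover's actual software.

```python
MAX_FAILED_BOOTS = 3  # hypothetical threshold, kept in nonvolatile storage in practice

def choose_boot_mode(failed_boot_count):
    """Pick a boot mode from the number of consecutive failed boots."""
    if failed_boot_count >= MAX_FAILED_BOOTS:
        return "degraded"  # boot without loading file indexes into RAM
    return "normal"

def simulate_boots(index_bytes, free_ram_bytes, attempts):
    """Return the (mode, outcome) history of successive boot attempts."""
    failed = 0
    history = []
    for _ in range(attempts):
        mode = choose_boot_mode(failed)
        needed = 0 if mode == "degraded" else index_bytes
        if needed <= free_ram_bytes:
            history.append((mode, "up"))
            break
        history.append((mode, "reset"))
        failed += 1
    return history

# File indexes need more RAM than is free, so normal boots keep failing.
print(simulate_boots(index_bytes=2_000_000, free_ram_bytes=1_000_000, attempts=10))
```

In this sketch the computer gives up on the normal boot after three resets instead of cycling
until the batteries run down.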
For more technical information, call Joe Anselmi at (310) 336-7326.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 113
Analysis Must Make Room for Source Data Uncertainties
Lesson 114
Identify All Exposed Circuits to Preclude Inadvertent Shorts
Lesson 115
Space Systems Engineering Lessons Learned
Lesson 116
Thoroughly Analyze and Test Deployables (II)
The Problem:
Lesson 117
Carefully Plan for Controlled Mission Shutdown
The Problem:
A satellite lost attitude control for two days.
• Conduct a detailed failure mode and effect analysis for every fuse and relay.
• Make sure that a secondary payload cannot possibly endanger the primary mission. Consider a
host-controlled “kill switch.”
A spacecraft lasted many years beyond its initial short mission because its battery performed
surprisingly well. The satellite’s beacon, which could not be turned off, caused persistent
radio-frequency interference.
For more technical information, call Ron Williamson at (310) 336-2149.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 118
Ensure Subcontracted Tasks are Properly Verified
The Problem:
A satellite launch was delayed by 18 months.
later on, a loose rivnut was found. All Heat Pipe Flange Thermal
Lessons Learned:
• Make sure the intent of critical design requirements is followed on subcontracted tasks,
especially when specialized “tribal knowledge” is involved.
• Conduct sufficient engineering reviews on repairs and reworks.
For more technical information, call Harry Yerondopoulos at (310) 336-3375.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.
Lesson 119
Ensure Safe Handling of Dangerous Equipment (IV)
The Problem:
[Figure labels: Relay, Command Panel.]
Lesson 120
Do Not Dismiss Test Anomalies As Random Events—Find Out Why (IV)
The Problem:
A payload could not separate from the launch vehicle.
The Cause:
The laser-initiated squib circuit consisted of a test current and a separately generated FIRE
current. Apparently, the FIRE controller chip broke a pin because foam protecting the circuit
board from vibration had been removed “to improve producibility.”
Similar chip failures had occurred three times during tests, yet the contractor did not
recognize that, without the foam, the board would excessively deflect and cause harm.
Unfortunately, the pin damage was not caught before launch because the FIRE current was not
checked during or after ground vibration.
[Figure: Simplified Laser Firing Unit Schematics. Labels: Laser Firing Unit, FIRE Command IN,
1.4 A Firing Current, Power Unit, Combined Firing Output, Laser Diodes, Fiber Optics,
Initiators, BIT IN, BIT/Matrix Decode, 0.4 A Test Current, BIT Receiver, BIT OUT.]
The test circuit generated a faint laser glow to verify command logic. But the main firing
unit itself was not tested for fear of setting off the squibs.
Besides building in redundancy, the designers should have incorporated a switch to shunt the
fire current into the “BIT check” loop, so the entire circuit could be verified.
[Figure: FIRE Controller Chip Vibration Response, G²/Hz (log scale). Previous design: board
stiffened with foam. New design: foam removed.]
Lessons Learned:
• Avoid single string designs on critical functions.
Lesson 121
Prevent Nonmagnetic Materials from Becoming Magnetized
The Problem:
A satellite fired its attitude-control thrusters too often, depleting its fuel.
The Cause:
The failure was traced to the propellant tank, which developed a dipole moment that torqued
the satellite to align with the Earth’s magnetic axis, overtaxing the thrusters.
The tank was made of annealed stainless steel, normally thought of as nonmagnetic. Apparently,
the sheet metal became magnetized either while being worked into the hemispheric shape, or
when exposed to an external magnetic field, perhaps by coming into contact with magnetized
equipment.
The tank was supposed to be made of titanium, but a switch to stainless steel had to be made
due to schedule deadlines. Unfortunately, the possibility of magnetization did not occur to
anybody; otherwise a simple degaussing would have averted the failure.
[Figure: momentum (Nms) and magnetic field (nanotesla) versus days under monitoring.]
Pinpointing the Failure Mechanism
The anomaly was at first blamed on leaks in the propulsion line. Later on, the anomalous
torque was found to correlate with variations in the Earth’s magnetic field. Moreover, the
roll error shot up during magnetic storms. The fuel tank’s engineering model was subsequently
tested and found to have a large dipole moment.
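The disturbance described here follows the standard relation for a magnetic dipole in a field:
the torque is the cross product of the dipole moment and the field. The sketch below evaluates
it for assumed numbers. The 10 A·m² dipole moment is hypothetical, and the 100 nanotesla field
is simply the order of magnitude shown on the figure's field axis.

```python
def cross(a, b):
    """Cross product of two 3-element vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def magnetic_torque(dipole_am2, field_tesla):
    """Torque (N*m) on a dipole moment (A*m^2) in a magnetic field (T)."""
    return cross(dipole_am2, field_tesla)

dipole = (10.0, 0.0, 0.0)         # assumed residual dipole moment of the tank
field_perp = (0.0, 100e-9, 0.0)   # field perpendicular to the dipole
field_aligned = (100e-9, 0.0, 0.0)

print(magnetic_torque(dipole, field_perp))     # largest torque for this field strength
print(magnetic_torque(dipole, field_aligned))  # no torque once aligned with the field
```

A torque this small matters because it acts continuously: the momentum it adds must be removed
by the thrusters, and the torque varies with the field, consistent with the correlation noted
above.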
Lesson 122
Minimize Residual Magnetism on Delicate Mechanisms
All prototypes appeared to work well. However, because the flight unit was meticulously
assembled to ensure the tightest fit, it developed much more residual magnetism. But nobody
knew: the test script cycled the relays before power-off, unintentionally creating a reverse
current spike that degaussed the solenoid.
During final test, which did not spuriously degauss, the shutter jammed. Unfortunately, it was
too late for repair, and operation suffered.
[Figure: Mechanism of Shutter Operation. Labels: Rotor, Stator, Electromagnetic Coil, Gap (L),
Frictional Force FF, Magnetic Force FM, with FM ∝ 1/L².]
A Similar Incident
A solenoid actuator failed because its friction plate, made of Precipitation Hardened (PH)
stainless steel, became magnetized. PH steel is prone to magnetic hysteresis, which varied
from unit to unit more than expected.
Lessons Learned:
• Conduct tolerance analysis and specify ranges on units sensitive to subtle dimension changes
(caused by, for example, thermal mismatch, creep, and manufacturing variations). Avoid
unforgiving designs.
• Use positive means (such as non-magnetic plating) to control gap widths in magnetic circuits.
• Make sure test programs do not mask equipment flaws.
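The figure's relation, magnetic force proportional to 1/L², explains why a tighter fit was
enough to jam the shutter. The sketch below shows the sensitivity; the gap values are
hypothetical and only the inverse-square dependence is taken from the source.

```python
def relative_magnetic_force(gap, reference_gap):
    """Force relative to its value at reference_gap, assuming force ~ 1/gap**2."""
    if gap <= 0 or reference_gap <= 0:
        raise ValueError("gaps must be positive")
    return (reference_gap / gap) ** 2

# Effect of closing the gap by 10% and by 50% (units cancel, so any length unit works).
print(relative_magnetic_force(0.9, 1.0))
print(relative_magnetic_force(0.5, 1.0))
```

A 10 percent tighter gap raises the holding force by almost a quarter, and halving the gap
quadruples it, which is why the tolerance analysis called for above matters.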
For more technical information, call John Bohner at (310) 336-1772.
For comments on the Aerospace Lessons Learned Program, including background specifics, call
Paul Cheng at (310) 336-8222.
Lesson 123
Double-Check Start-Up Behavior of Digital Electronics
For comments on the Aerospace Lessons Learned Program, including background specifics, call
Paul Cheng at (310) 336-8222.
Lesson 124
Double-Check Start-Up Behavior of Digital Electronics (II)
The Problem:
A payload computer occasionally locked up during ground testing.
The Cause:
Extensive troubleshooting traced the seemingly random halts to an improbable event in the
computer. The flaw was not found earlier because lower-level tests did not cycle the hardware
on and off frequently enough or adequately exercise the test scripts.
Early on, the computer vendor had notified the prime that the operating system’s boot sequence
should be updated to avoid having a timing glitch cause parity errors in a controller’s
registers. The prime, which extensively tailored the algorithms in order to integrate the
computer to the payload, did not fully understand the problem and applied its own fix, which
turned out to be inadequate.
[Figure: Simplified Failure Scheme. Boxes, marked as infrequent or deterministic: computer
boots with improper code; occasional parity error; exception not handled well; application
runs with interrupt enabled; CPU status register K0 corrupted; application needs to write CPU
status to bit K0; another interrupt happens to be pending; CPU illegally attempts to read from
EEPROM; computer halts. Notice three infrequent events have to converge for the computer to
halt.]
Lessons Learned:
• Develop thorough test cases to verify system performance.
• Include sufficient on-off cycling and long run times during computer testing to expose
subtle hardware/software interaction quirks.
• Heed vendor alerts.
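How much on-off cycling is "sufficient" depends on how rare the failure is. The sketch below
estimates it under two loudly stated assumptions: the infrequent events are independent, and
each has a hypothetical 10 percent chance per power cycle. Neither figure comes from the
incident itself.

```python
import math

def coincidence_probability(*event_probabilities):
    """Probability that all of several independent events occur in one cycle."""
    product = 1.0
    for p in event_probabilities:
        product *= p
    return product

def cycles_needed(p_per_cycle, confidence=0.95):
    """Fewest independent cycles giving at least `confidence` chance of one occurrence."""
    if not 0.0 < p_per_cycle < 1.0:
        raise ValueError("p_per_cycle must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_per_cycle))

# Three infrequent events, each assumed to occur in 10% of power cycles.
p_halt = coincidence_probability(0.1, 0.1, 0.1)
print(p_halt, cycles_needed(p_halt))
```

Under these assumptions the halt shows up about once in a thousand cycles, and roughly three
thousand cycles are needed to be reasonably sure of seeing it at least once, far more than a
typical lower-level test would run.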
For more technical information, call Lee Mendoza at (310) 336-5547.
For comments on the Aerospace Lessons Learned Program, including background specifics,
call Paul Cheng at (310) 336-8222.