Design For Reliability by Adesh
ADESH KUMAR M.TECH-1ST YEAR (MACHINE DESIGN) JAMIA MILLIA ISLAMIA NEW DELHI
Chapter Objectives
Introduce the need for design for reliability.
List the main causes of reliability failures.
Explain how failures relate to their mechanisms.
Describe each failure.
Propose design guidelines against each failure.
What is Reliability?
Reliability is:
The ability of an item to perform its required function under defined customer operating conditions for a stated period of time.
The probability that no (system) failure will occur in a given time interval.
In research, the term reliability means "repeatability" or "consistency": a measure is considered reliable if it gives the same result over and over again.
across all functions: Engineering, Research, Manufacturing, Testing, Packaging, Field service
What is Probability?
Probability is: A measure that describes the chance or likelihood that an event will occur.
The probability that event A occurs is represented by a number between 0 and 1.
When P(A) = 0, the event cannot occur.
When P(A) = 1, the event is certain to occur.
When P(A) = 0.5, the event is as likely to occur as not.
term success of a product.
Reliability that is too high makes the product too expensive.
Reliability that is too low drives up warranty and repair costs, and market share is lost as a result.
Cost-Reliability Functions
Noise Factors are sources of disturbing influences that can disrupt the ideal function, causing error states which lead to quality problems.
Reliability Terms
Mean Time To Failure (MTTF), for non-repairable systems
Mean Time Between Failures (MTBF), for repairable systems
Reliability (survival probability): $R(t)$
Failure probability (cumulative distribution function): $F(t) = 1 - R(t)$
Failure probability density: $f(t)$
Failure rate (hazard rate): $\lambda(t)$
Reliability Function
Probability density function of failures: $f(t) = \lambda e^{-\lambda t}$ for $t > 0$
Probability of failure from 0 to $T$: $F(T) = 1 - e^{-\lambda T}$
Reliability function: $R(T) = 1 - F(T) = e^{-\lambda T}$
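As a rough illustration, the three exponential-model functions can be evaluated numerically. This is a minimal sketch assuming a constant failure rate; the rate `lam = 1e-4` failures/hour and the 1000-hour horizon are made-up inputs, not values from the slides.

```python
import math


def failure_density(t, lam):
    """Failure probability density f(t) = lam * exp(-lam * t), for t > 0."""
    return lam * math.exp(-lam * t)


def failure_probability(t, lam):
    """Cumulative probability of failure from 0 to t: F(t) = 1 - exp(-lam * t)."""
    return 1.0 - math.exp(-lam * t)


def reliability(t, lam):
    """Survival probability R(t) = 1 - F(t) = exp(-lam * t)."""
    return math.exp(-lam * t)


# Assumed inputs for illustration only.
lam = 1e-4   # failures per hour
T = 1000.0   # hours

print(f"f(T) = {failure_density(T, lam):.6e} per hour")
print(f"F(T) = {failure_probability(T, lam):.4f}")
print(f"R(T) = {reliability(T, lam):.4f}")
```

Note that R(T) + F(T) = 1 for any T, which is a quick sanity check on the two functions.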
Series Systems
$R_S = R_1 R_2 \cdots R_n$
Serial reliability
Series systems are also referred to as weakest-link or chain systems.
System failure is caused by the failure of any one component.
Therefore, for a series system, the reliability of the system is the product of the individual component reliabilities.
More components = less reliability.
Serial reliability: $R_S = \prod_{i=1}^{n} x_i$, where $x_i$ is the reliability of component $i$.
Parallel Systems
[Figure: two blocks, 1 and 2, connected in parallel.]
Parallel reliability
Parallel systems are also referred to as redundant.
The system fails only if all of the components fail.
Therefore, for a parallel system, the system probability of failure is the product of the individual component probabilities of failure.
Parallel reliability: $R_P = 1 - \prod_{i=1}^{n} (1 - x_i)$, where $x_i$ is the reliability of component $i$.
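The series and parallel rules can be sketched in a few lines. The function names and the component reliabilities used here (0.9 each) are illustrative assumptions, and the components are assumed to fail independently.

```python
from math import prod


def series_reliability(reliabilities):
    """Series system: every component must work, so reliabilities multiply."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """Parallel system: it fails only if all components fail.

    The system failure probability is the product of the component
    failure probabilities (1 - x_i); reliability is one minus that.
    """
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Assumed component reliabilities, for illustration only.
three_in_series = series_reliability([0.9, 0.9, 0.9])
two_in_parallel = parallel_reliability([0.9, 0.9])

print(f"Three 0.9 components in series:  {three_in_series:.3f}")
print(f"Two 0.9 components in parallel: {two_in_parallel:.3f}")
```

Adding a component to the series chain can only lower the result, while adding one to the parallel group can only raise it.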
Series-Parallel Systems
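[Figure: reliability block diagram of blocks A, B, C, and D with reliabilities $R_A$, $R_B$, $R_C$, $R_D$; block C is duplicated in parallel.]
Reliability of the parallel pair of C blocks: $R_{C'} = 1 - (1 - R_C)(1 - R_C)$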
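A series-parallel system is evaluated by first collapsing each parallel group into one equivalent block and then multiplying along the series chain. The sketch below assumes the layout is A, B, the parallel pair of C blocks, and D, all in series; that ordering and the numeric reliabilities are assumptions for illustration, not values from the slides.

```python
def parallel_pair(r):
    """Equivalent reliability of two identical blocks in parallel."""
    return 1.0 - (1.0 - r) * (1.0 - r)


# Hypothetical block reliabilities.
R_A, R_B, R_C, R_D = 0.95, 0.98, 0.90, 0.99

# Step 1: collapse the redundant C blocks into one equivalent block.
R_C_pair = parallel_pair(R_C)

# Step 2: multiply along the series chain A -> B -> (C || C) -> D.
R_system = R_A * R_B * R_C_pair * R_D

print(f"Equivalent reliability of the C pair: {R_C_pair:.4f}")
print(f"System reliability: {R_system:.4f}")
```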
A Simple Example
A system has 4000 components with a total failure rate of $8 \times 10^{-4}$ failures/hour.
MTBF = $1 / (8 \times 10^{-4})$ = 1250 hours
An Example
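A first-generation computer contains 10000 components, each with a failure rate of 0.5% per 1000 hours. For a reliability of $R(t) = 0.99$:
MTBF $= t / (1 - R(t)) = t / (1 - 0.99)$
$t = \mathrm{MTBF} \times 0.01 = 0.01 / \lambda_{av}$
where $\lambda_{av}$ is the average failure rate of the system,
$N$ = number of components = 10000,
$\lambda$ = failure rate of a component = 0.5% / (1000 hours) = 0.005/1000 = $5 \times 10^{-6}$ per hour.
Therefore, $\lambda_{av} = N\lambda = 10000 \times 5 \times 10^{-6} = 5 \times 10^{-2}$ per hour, so $t = 0.01 / (5 \times 10^{-2}) = 0.2$ hours (12 minutes).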
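The arithmetic of this example can be checked with a short script. It uses the component count and failure rate from the example, treats the components as a series system with constant failure rates, and compares the linear approximation t = MTBF × (1 − R) with the exact exponential result; the variable names are illustrative.

```python
import math

n_components = 10_000
lam_component = 0.005 / 1000   # 0.5% per 1000 hours, expressed per hour
target_reliability = 0.99

# Series system: component failure rates add up.
lam_system = n_components * lam_component
mtbf = 1.0 / lam_system

# Linear approximation used in the example: R(t) ~ 1 - t / MTBF.
t_approx = mtbf * (1.0 - target_reliability)

# Exact result for the exponential model: R(t) = exp(-lam * t).
t_exact = -math.log(target_reliability) / lam_system

print(f"System failure rate: {lam_system:.3f} per hour")
print(f"System MTBF: {mtbf:.1f} hours")
print(f"Time at 99% reliability (approx): {t_approx * 60:.1f} minutes")
print(f"Time at 99% reliability (exact):  {t_exact * 60:.1f} minutes")
```

The two answers agree closely because 1 − R is small, which is the regime where the linear approximation holds.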
Causes of Failure
Misuse: failures attributable to the application of stresses beyond the stated capabilities of the item.
Inherent weakness: failures attributable to weakness inherent in the item itself when subjected to stresses within the stated capabilities of the item.
inadequate design, poor manufacturing, and inappropriate usage. These can be catastrophic to human life.
Overstress mechanisms: these occur due to an insufficient safety factor in design, higher-than-expected random loads, human error, or misapplication.
Wearout mechanisms: these occur late in life and then increase with age. They result from corrosion, material fatigue, poor maintenance, creep, and degradation in strength.
% Failure: percentage of failures in a total population.
MTTF (Mean Time To Failure): the average time of operation to first failure.
MTBF (Mean Time Between Failures): the average time between product failures.
Repairs Per Thousand (R/1000).
Bq Life: the life at which q% of the population will fail.
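To make the Bq life metric concrete, here is a small sketch that computes it under an assumed exponential life model, where the fraction failed by time t is 1 − exp(−λt). The exponential assumption and the failure rate used are illustrative choices, not something these metric definitions require.

```python
import math


def mttf_exponential(lam):
    """Mean time to failure for a constant failure rate lam."""
    return 1.0 / lam


def bq_life_exponential(q_percent, lam):
    """Time by which q% of the population has failed.

    Solves 1 - exp(-lam * t) = q / 100 for t.
    """
    return -math.log(1.0 - q_percent / 100.0) / lam


lam = 1e-4  # assumed failure rate, failures per hour

print(f"MTTF:     {mttf_exponential(lam):.0f} hours")
print(f"B10 life: {bq_life_exponential(10, lam):.1f} hours")
print(f"B50 life: {bq_life_exponential(50, lam):.1f} hours")
```

Under this model the B10 life is far shorter than the MTTF, which is one reason a Bq life is often quoted alongside the mean.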
a population of products using a graphical representation called the bathtub curve. The bathtub curve consists of three periods: an infant mortality period with a decreasing failure rate followed by a normal life period (also known as "useful life") with a low, relatively constant failure rate and concluding with a wear-out period that exhibits an increasing failure rate.
Reliability
[Figure: probability of dying in the next year (deaths per 1000) versus age. From the Statistical Bulletin 79, no. 1, Jan-Mar 1998.]
planning for design and manufacturing.
Why? To determine:
the useful life of the product
what accelerated life testing to use
Reliability must be as close to perfect as possible for the product's useful life.
You MUST know where your product's major points of failure are!
more reliable which can be used as a selling feature by the marketing department. Also, this adds to the company reputation and can be used for comparisons with competition.
Stress Analysis
It establishes the presence of a safety margin, thus enhancing system life.
Stress analysis provides input data for reliability prediction.
It is based on customer requirements.
existing product can be found by studying field failure data. For a new product however, or if significant changes are made to the design, it may be required to estimate or calculate MTBF before any field data is available.
qualitative technique for understanding the behaviour of components in an engineered system.
The objective is to determine the influence of component failure on other components and on the system as a whole.
FMEA can also be used as a stand-alone procedure for relative ranking of failure modes, screening them according to risk.
how it can fail.
Determine the effect of each failure mode and its severity on system function.
Determine the likelihood of occurrence and of detecting the failure.
Calculate the Risk Priority Number (RPN = Severity × Occurrence × Detection).
Consider corrective actions (these may reduce severity or occurrence, or increase the probability of detection).
Start with the highest RPN values (most severe problems) and work down.
Recalculate the RPN after the corrective actions have been determined; the aim is to minimize RPN.
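The RPN ranking step can be sketched as follows. The failure modes and their ratings are hypothetical, and the 1 to 10 rating scale is an assumption (a common convention, but not stated in the slides).

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int     # assumed scale: 1 (minor) to 10 (hazardous)
    occurrence: int   # assumed scale: 1 (rare) to 10 (frequent)
    detection: int    # assumed scale: 1 (easily detected) to 10 (undetectable)

    @property
    def rpn(self):
        """Risk Priority Number = Severity x Occurrence x Detection."""
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes, for illustration only.
modes = [
    FailureMode("seal leak", severity=7, occurrence=4, detection=5),
    FailureMode("connector corrosion", severity=5, occurrence=6, detection=3),
    FailureMode("fatigue crack", severity=9, occurrence=2, detection=8),
]

# Work down from the highest RPN.
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for mode in ranked:
    print(f"{mode.name:<20} RPN = {mode.rpn}")
```

After a corrective action is chosen, the affected rating is updated and the list is sorted again to see whether the priority order has changed.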
series and parallel connections of subsystems
Reliability block diagrams (RBD) represent a system using interconnected blocks arranged in combinations of series and/or parallel configurations.
They can be used to analyze the reliability of a system quantitatively.
Reliability block diagrams can consider active and stand-by states to get estimates of the reliability and availability (or unavailability) of the system.
Reliability block diagrams may be difficult to construct for very complex systems.
concepts and calculations for systematically comparing redundancy and reliability factors as they apply to network storage configurations. We will determine a reliability figure on three very basic architectures. The starting point of our study is the network storage requirements.
server. Later, this storage will be accessible to other servers. The server is already in place, and has been designed to sustain single component hardware failures (with dual host bus adapters (HBAs), for example). Data on this storage must be mirrored, and the storage access must also stand up to hardware failures. The cost of the storage system must be reasonable, while still providing good performance.
Architecture 1
Architecture 1 provides the basic storage necessities we are looking for, with the following advantages and disadvantages:
Advantages:
Storage is accessible if one of the links is down.
Storage A is mirrored onto B.
Other servers can be connected to the concentrator to access the storage.
Disadvantages:
If the concentrator fails, we have no more access to the storage. This concentrator is a single point of failure (SPOF).
Architecture 2
Architecture 2 has been improved to take into account the previous SPOF. A concentrator has been added.
Advantages:
If any links or components go down, storage is still accessible (resilient to hardware failures).
Data is mirrored (Disk A <-> Disk B).
Other servers can be connected to both concentrators to access the storage.
Architecture 3
The main difference is that Disk A and Disk B have only one data path. Disk A is still mirrored to Disk B, as required.
This architecture has all the advantages of the previous architectures, with the following differences:
Disk A can only be accessed through Link C, and Disk B only through Link D.
There is no data multipathing software layer, which results in easier administration and easier troubleshooting.
Determining Reliability
Using the reliability formulas, we can determine which architecture has the highest reliability value. For the purpose of this article, we will use sample MTBF values (as obtained from the manufacturer) and AFR* (Annual Failure Rate) values shown in the table below:
*The AFR for each component was calculated from the MTBF as AFR = 8760/MTBF, where 8760 is the number of hours in a year and the MTBF is in hours. The example MTBF values were taken from real network storage component statistics. However, such values vary greatly, and these numbers are given here purely for illustration.
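The MTBF-to-AFR conversion described in the footnote is a one-line calculation. In this sketch the function names and the 1,000,000-hour MTBF are illustrative assumptions, not figures from the table.

```python
HOURS_PER_YEAR = 8760


def afr_from_mtbf(mtbf_hours):
    """Annual failure rate from an MTBF given in hours."""
    return HOURS_PER_YEAR / mtbf_hours


def mtbf_from_afr(afr):
    """Inverse conversion: MTBF in hours from an annual failure rate."""
    return HOURS_PER_YEAR / afr


# Hypothetical MTBF for illustration.
mtbf = 1_000_000
afr = afr_from_mtbf(mtbf)
print(f"MTBF {mtbf} hours -> AFR {afr:.5f} ({afr * 100:.3f}% per year)")
```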
Determining Reliability
Component        AFR      Variable
HBA 1            0.011    H
HBA 2            0.011    H
Link A           0.022    L
Link B           0.022    L
Concentrator 1   0.0151   C
Concentrator 2   0.0151   C
Link C           0.022    L
Link D           0.022    L
Disk A           0.0088   D
Disk B           0.0088   D
Determining Reliability
Having the rate of failure of each individual component, we can obtain the system's annual failure rate (AFR) and consequently the system reliability (R) and system MTBF values.
The AFR of a group of redundant components is raised to the power equal to the number of redundant components.
The AFR values of non-redundant components are multiplied by the number of those components in series.
Calculation
In the case of Architecture 1, the concentrator (C) is the only non-redundant component.
$\mathrm{AFR}_1 = (H+L)^2 + C + L^2 + D^2$
$\mathrm{AFR}_1 = (0.011+0.022)^2 + 0.0151 + (0.022)^2 + (0.0088)^2 = 0.0167$
$R_1 = 1 - \mathrm{AFR}_1 = 1 - 0.0167 = 0.9833$, or 98.33%
$\mathrm{MTBF}_1 = 8760/\mathrm{AFR}_1 = 8760/0.0167$ = 524,551 hours.
Calculation
Architecture 2 has a different configuration, with no non-redundant components.
$\mathrm{AFR}_2 = (H+L+C+L)^2 + D^2$
$\mathrm{AFR}_2 = (0.011+0.022+0.0151+0.022)^2 + (0.0088)^2 = 0.0050$
$R_2 = 1 - \mathrm{AFR}_2 = 1 - 0.0050 = 0.9950$, or 99.50%
$\mathrm{MTBF}_2 = 8760/\mathrm{AFR}_2 = 8760/0.0050$ = 1,752,000 hours.
Calculation
Architecture 3 has yet another configuration and has no non-redundant components.
$\mathrm{AFR}_3 = (H+L+C+L+D)^2$
$\mathrm{AFR}_3 = (0.011+0.022+0.0151+0.022+0.0088)^2 = 0.0062$
$R_3 = 1 - \mathrm{AFR}_3 = 1 - 0.0062 = 0.9938$, or 99.38%
$\mathrm{MTBF}_3 = 8760/\mathrm{AFR}_3 = 8760/0.0062$ = 1,412,903 hours.
Conclusion
When the calculations are complete, we compare the data:
Architecture 1 = 98.33%, or a system MTBF of 524,551 hours
Architecture 2 = 99.50%, or a system MTBF of 1,752,000 hours
Architecture 3 = 99.38%, or a system MTBF of 1,412,903 hours
The MTBF figures are the most revealing, and indicate that Architecture 2 is statistically the most reliable of all.
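The comparison can be reproduced with a short script that applies the three AFR formulas to the sample component values (H = 0.011, L = 0.022, C = 0.0151, D = 0.0088). Because this sketch does not round the AFR to four decimals before dividing, its MTBF figures differ slightly from the rounded ones quoted in the slides; the dictionary layout and names are illustrative.

```python
HOURS_PER_YEAR = 8760

# Sample annual failure rates used in the calculations.
H, L, C, D = 0.011, 0.022, 0.0151, 0.0088

# System AFR for each architecture, following the formulas in the text.
afr = {
    "Architecture 1": (H + L) ** 2 + C + L ** 2 + D ** 2,
    "Architecture 2": (H + L + C + L) ** 2 + D ** 2,
    "Architecture 3": (H + L + C + L + D) ** 2,
}

for name, value in afr.items():
    reliability = 1.0 - value
    mtbf_hours = HOURS_PER_YEAR / value
    print(f"{name}: AFR = {value:.4f}, R = {reliability:.2%}, "
          f"MTBF = {mtbf_hours:,.0f} hours")

# The most reliable architecture is the one with the lowest system AFR.
best = min(afr, key=afr.get)
print(f"Most reliable: {best}")
```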
Intermittent operation
Roughness
Excessive effort requirements
It is important to understand the failure (why, where, how long, application, etc.).
Two methods for design against failure:
1. By reducing the stress that causes the failure.
2. By increasing the strength of the component.
Either one can be achieved by:
Selecting materials
Changing the package geometry
Changing the dimensions
Protection
Fatigue Failure?
Fatigue is the most common mechanism of failure
mechanism that occurs rapidly, with little or no warning, when the induced stress in the component exceeds the fracture strength of the material.
Occurs in brittle materials (ceramics, glasses, and silicon).
Applied stress and work could break the atomic bonds.
conditions that would produce the least stress in brittle materials should be created.
The brittle material should be polished to
Thermally activated process: the rate of deformation for a given stress level increases significantly with temperature.
Deformation depends on:
1. The applied load.
2. The duration through which the load is applied.
3. Elevated temperature.
Creep can occur at any stress level.
Creep is most important at elevated temperatures.
deformation.
Creep is a time controlled phenomenon.
accumulation of plastic strain due to cyclic loading will eventually lead to cracking of the component and make it unusable.
below the yield strength of the materials used. If possible, use materials that have high yield strength.
Design and control the local plastic deformation at
electrochemical reaction between a material, usually a metal, and its environment that produces a deterioration of the material and its properties.
corrode faster.
Use hermetic packages to prevent moisture
absorption.
Ensure there are no trapped moisture or
joining process generates intermetallic layers which are byproducts of the joining process.
Redesign
Improved fabrication Verification of redesign