Reliability and Its Application in TQM
Submitted By:
Mitanshu Garg (040418) B.Tech IIIrd Year Production & Industrial Engineering Indian Institute of Technology, Roorkee
Abstract
Reliability has to do with the quality of measurement. In its everyday sense, reliability is the "consistency" or "repeatability" of your measures. The inspector's view of reliability is that the product is assessed against a specification or set of attributes and, having passed, is delivered to the customer. However, this approach provides no measure of quality over a period of time, which creates the need for a time-based concept of quality. The inspector's concept is not time-dependent: the product either passes a given test or it fails. Reliability, on the other hand, is usually concerned with failures in the time domain. This distinction marks the difference between traditional quality control and the approach to reliability. Total Quality Management (TQM) is a management strategy aimed at embedding awareness of quality in all organizational processes. Total Quality (TQ) provides an umbrella under which everyone in the organization can strive to create customer satisfaction. TQ is a people-focused management system that aims at a continual increase in customer satisfaction at continually lower real costs.
TABLE OF CONTENTS
Abstract
What is reliability?
Reliability & Validity
Quality vs. Reliability
Failure
Degrees of Failure
Types of Failures - Causes
The Bathtub Curve
Reliability Calculations
Why do Reliability Calculation?
Stress Analysis
Reliability Predictions (MTBF)
Prediction Approaches
Failure mode and effects analysis (FMEA)
Failure Modes, Effects and Criticality Analysis (FMECA)
When to use Reliability Calculations?
CASE STUDY: Network Storage Evaluations Using Reliability Calculations
Network Storage Requirements
Architecture 1
Architecture 2
Architecture 3
Determining Reliability
Conclusion
Barriers to implementing Reliability
References
What is reliability?
We hear the term used a lot in research contexts, but what does it really mean? If you think about how we use the word "reliable" in everyday language, you might get a hint. For instance, we often speak about a machine as reliable: "I have a reliable car." Or, news people talk about a "usually reliable source". In both cases, the word reliable usually means "dependable" or "trustworthy." In research, the term "reliable" also means dependable in a general sense, but that's not a precise enough definition.
The inspector's view of reliability is that the product is assessed against a specification or set of attributes and, having passed, is delivered to the customer. However, this approach provides no measure of quality over a period of time, which creates the need for a time-based concept of quality. The inspector's concept is not time-dependent: the product either passes a given test or it fails. Reliability, on the other hand, is usually concerned with failures in the time domain. This distinction marks the difference between traditional quality control and the approach to reliability.
Reliability has to do with the quality of measurement. In research, the term reliability means "repeatability" or "consistency": a measure is considered reliable if it would give us the same result over and over again (assuming that what we are measuring isn't changing!).
For a product, reliability is defined as the probability that the product performs as expected for a certain period of time, under the given operating conditions, and at a given set of product performance characteristics. The reliability of a product describes the trouble-free time period before the product fails.
The figure above shows four possible situations. In the first one, you are hitting the target consistently, but you are missing the center of the target. That is, you are consistently and systematically measuring the wrong value for all respondents. This measure is reliable, but not valid (that is, it's consistent but wrong). The second shows hits that are randomly spread across the target. You seldom hit the center of the target but, on average, you are getting the right answer for the group (but not very well for individuals). In this case, you get a valid group estimate, but you are inconsistent. Here, you can clearly see that reliability is directly related to the variability of your measure. The third scenario shows a case where your hits are spread across the target and you are consistently missing the center. Your measure in this case is neither reliable nor valid. Finally, we see the "Robin Hood" scenario: you consistently hit the center of the target. Your measure is both reliable and valid. When we look at reliability and validity in this way, we see that, rather than being distinct, they actually form a continuum.
Failure
A failure is any instance in which the product does not function to its specification, causing dissatisfaction or even losses to the user. A commercial failure is a product that does not reach expectations of success, failing to come even close. A major flop goes one step further and is recognized for its complete lack of success.
Degrees of Failure
Partial Failures
Complete Failures
The bathtub curve, displayed in the figure, does not depict the failure rate of a single item, but describes the relative failure rate of an entire population of products over time. Some individual units will fail relatively early (infant mortality failures), others (we hope most) will last until wear-out, and some will fail during the relatively long period typically called normal life. Failures during infant mortality are highly undesirable and are always caused by defects and blunders: material defects, design blunders, errors in assembly, etc. Normal life failures are normally considered to be random cases of "stress exceeding strength." However, as we'll see, many failures often considered normal life failures are actually infant mortality failures. Wear-out is a fact of life due to fatigue or depletion of materials (such as lubrication depletion in bearings). A product's useful life is limited by its shortest-lived component. A product manufacturer must ensure that all specified materials are adequate to function through the intended product life. Note that the bathtub curve is typically used as a visual model to illustrate the three key periods of product failure and is not calibrated to depict the expected behavior of a particular product family. It is rare to have enough short-term and long-term failure information to actually model a population of products with a calibrated bathtub curve.
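The three periods described above can be illustrated with a small sketch. It assumes, purely for illustration, that the population failure rate is the sum of a decreasing Weibull hazard (infant mortality), a constant hazard (normal life) and an increasing Weibull hazard (wear-out). The function names and every parameter value are hypothetical and are not calibrated to any real product.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Illustrative bathtub-shaped failure rate (failures per hour) at age t hours."""
    infant = weibull_hazard(t, shape=0.5, scale=2000.0)    # shape < 1: rate falls with age
    normal_life = 1.0e-4                                   # constant "random" failure rate
    wear_out = weibull_hazard(t, shape=5.0, scale=20000.0) # shape > 1: rate rises with age
    return infant + normal_life + wear_out


if __name__ == "__main__":
    # The printed rate is high early, lowest in mid-life, and rises again late.
    for hours in (10, 100, 1000, 5000, 10000, 20000, 30000):
        print(f"{hours:>6} h  failure rate = {bathtub_hazard(hours):.6f} per hour")
```

Shape parameters below 1 and above 1 are what produce the falling and rising ends of the curve; the constant term dominates in between.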
Reliability Calculations
Reliability calculations are a methodology for analyzing the expected or actual reliability of a product, process or service, and for identifying actions to reduce failures or mitigate their effect. A reliability calculation is simply the analysis of parts and components in an effort to predict and calculate the rate at which an item will fail. It is one of the most common forms of reliability analysis for calculating failure rate and Mean Time Between Failures (MTBF). Reliability calculations are commonly used in the development of products and systems to compare alternative design approaches and to assess progress toward reliability design goals. They're often criticized as not being accurate forecasts of field reliability performance because they don't usually account for all the factors that cause field failures. Nevertheless, predictions are a valuable form of analysis that also provides insight into safety, maintenance and warranty costs and other product considerations. There are four major types of reliability calculations:
Stress Analysis
Reliability Predictions (MTBF)
FMEA (Failure Mode and Effects Analysis)
FMECA (Failure Modes, Effects and Criticality Analysis)
Stress Analysis
Stress analysis establishes the presence of a safety margin, thus enhancing system life. It provides input data for reliability prediction and is based on customer requirements.
Prediction Approaches
MIL-HDBK-217 (Military Handbook) has been the mainstay of reliability predictions for about 40 years, but it has not been updated since 1995, and there are no plans by the US military to update it in the future. Even though MIL-HDBK-217 is becoming more obsolete every day, it remains the most widely used technique for electronics. The handbook includes a series of empirical failure rate models developed using historical piece-part failure data for a wide array of component types. All models predict reliability in terms of failures per million operating hours and assume an exponential distribution (constant failure rate), which allows the addition of failure rates to determine higher assembly reliability. Typical factors used in determining a part's failure rate include a temperature factor ($\pi_T$), power factor ($\pi_P$), power stress factor ($\pi_S$), quality factor ($\pi_Q$) and environmental factor ($\pi_E$) in addition to the base failure rate ($\lambda_b$). For example, the model for a resistor is as follows:
$\lambda_{\text{resistor}} = \lambda_b \, \pi_T \, \pi_P \, \pi_S \, \pi_Q \, \pi_E$
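The multiplicative part model, and the addition of constant failure rates at the assembly level, can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the base rate and factor values are made-up numbers, not figures taken from MIL-HDBK-217.

```python
from math import prod


def part_failure_rate(base_rate, pi_factors):
    """Part failure rate = base rate times the product of all adjustment factors.

    Rates are expressed in failures per million operating hours.
    """
    return base_rate * prod(pi_factors.values())


def assembly_failure_rate(part_rates):
    """With constant failure rates, the rates of parts in series simply add."""
    return sum(part_rates)


def mtbf_hours(rate_per_million_hours):
    """MTBF in hours for a constant failure rate given per million hours."""
    return 1_000_000 / rate_per_million_hours


if __name__ == "__main__":
    # Hypothetical values chosen only to show the arithmetic.
    resistor = part_failure_rate(
        0.002, {"T": 1.5, "P": 1.0, "S": 1.2, "Q": 3.0, "E": 4.0}
    )
    board = assembly_failure_rate([resistor, 0.05, 0.12])  # two other made-up parts
    print(f"resistor: {resistor:.4f} failures per million hours")
    print(f"board:    {board:.4f} failures per million hours")
    print(f"board MTBF: {mtbf_hours(board):,.0f} hours")
```

Because every factor multiplies the base rate, a factor of 1.0 leaves the rate unchanged, while harsher temperature, stress, quality or environment ratings scale it up.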
Bellcore (Telcordia)
Bellcore's approach is very similar to that of MIL-HDBK-217, but it's based primarily on telecommunications data and covers five separate use environments. The approach also assumes an exponential failure distribution and calculates reliability in terms of failures per billion part operating hours, or FITs. The steady-state failure rate ($\lambda_{SS_i}$) depends on the basic part steady-state failure rate ($\lambda_{G_i}$) and the quality ($\pi_{Q_i}$), electrical stress ($\pi_{S_i}$) and temperature ($\pi_{T_i}$) factors as follows:
$\lambda_{SS_i} = \lambda_{G_i} \, \pi_{Q_i} \, \pi_{S_i} \, \pi_{T_i}$
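Since this approach works in FITs rather than failures per million hours, a short sketch of the unit handling may help. The function names and the input values are hypothetical and are used only to show the conversions; they are not Bellcore data.

```python
def steady_state_fits(generic_fits, pi_q, pi_s, pi_t):
    """Steady-state failure rate in FITs: generic rate times quality, stress and temperature factors."""
    return generic_fits * pi_q * pi_s * pi_t


def fits_to_failures_per_million_hours(fits):
    """1 FIT = 1 failure per 10^9 hours, so 1000 FITs = 1 failure per 10^6 hours."""
    return fits / 1000.0


def fits_to_mtbf_hours(fits):
    """MTBF in hours for a constant failure rate given in FITs."""
    return 1e9 / fits


if __name__ == "__main__":
    # Made-up part: generic rate of 20 FITs with illustrative factor values.
    rate = steady_state_fits(20.0, pi_q=1.0, pi_s=1.3, pi_t=0.9)
    print(f"steady-state rate: {rate:.1f} FITs")
    print(f"equivalent: {fits_to_failures_per_million_hours(rate):.4f} failures per million hours")
    print(f"MTBF: {fits_to_mtbf_hours(rate):,.0f} hours")
```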
The IEEE Gold Book provides data concerning the reliability of equipment used in industrial and commercial power distribution systems. Reliability data for different types of equipment are provided along with other aspects of reliability analysis for power distribution systems, such as basic concepts of reliability analysis, probability methods, fundamentals of power system reliability evaluation, economic evaluation of reliability, and cost of power outage data.
RDF 2000 is the new version of the CNET UTE C 80-810 reliability prediction standard, which covers most of the same components as MIL-HDBK-217. The models take into account power on/off cycling as well as temperature cycling and are very complex. Predictions for integrated circuits require information on equipment outside ambient and printed circuit ambient temperatures, type of technology, number of transistors, year of manufacture, junction temperature, working time ratio, storage time ratio, thermal expansion characteristics, number of thermal cycles, thermal amplitude of variation and application of the device, as well as per-transistor, technology-related and package-related base failure rates.
Calculate the Risk Priority Number (RPN = Severity × Occurrence × Detection).
Consider corrective actions (these may reduce the severity or occurrence, or increase the probability of detection).
Start with the higher RPN values (the most severe problems) and work down.
Recalculate the RPN after the corrective actions have been determined; the aim is to minimize the RPN.
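The RPN steps above can be sketched in a few lines. The failure modes and their ratings below are invented for illustration, and the 1 to 10 rating scale is an assumption (a common convention, not something stated in this report); the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # assumed 1-10 scale, 10 = most severe
    occurrence: int  # assumed 1-10 scale, 10 = most frequent
    detection: int   # assumed 1-10 scale, 10 = hardest to detect

    @property
    def rpn(self):
        """Risk Priority Number = Severity x Occurrence x Detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


if __name__ == "__main__":
    # Invented failure modes, for illustration only.
    modes = [
        FailureMode("connector corrosion", severity=6, occurrence=4, detection=5),
        FailureMode("fan bearing wear", severity=4, occurrence=6, detection=3),
        FailureMode("firmware hang", severity=8, occurrence=3, detection=7),
    ]
    for mode in rank_by_rpn(modes):
        print(f"{mode.name:<20} RPN = {mode.rpn}")
```

After a corrective action is chosen, the ratings of the affected failure mode are updated and the list is ranked again to check that its RPN has dropped.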
Architecture 1
Architecture 1 provides the basic storage necessities we are looking for with the following advantages and disadvantages:
Advantages: Storage is accessible if one of the links is down. Storage A is mirrored onto B. Other servers can be connected to the concentrator to access the storage.
Disadvantages: If the concentrator fails, we have no more access to the storage. This concentrator is a single point of failure (SPOF).
Architecture 2
Architecture 2 has been improved to take into account the previous SPOF. A concentrator has been added, and now the storage configuration is redundant and the requirements are satisfied with the following advantages:
If any links or components go down, storage is still accessible (resilient to hardware failures). Data is mirrored (Disk A <-> Disk B). Other servers can be connected to both concentrators to access the storage space.
Architecture 3
Architecture 3 seems very close to architecture 2. The main difference resides in the fact that Disk A and Disk B have only one data path. Disk A is still mirrored to Disk B, as required. This architecture has all the advantages of the previous architectures with the following differences: Disk A can only be accessed through Link C, and Disk B only through Link D. There is no data multipathing software layer, which results in easier administration and easier troubleshooting. In some sense it seems we are losing a level of redundancy in architecture 3. To appreciate the differences between architectures 2 and 3, we will use block diagram analysis to determine and compare their reliability values.
Determining Reliability
Using the reliability formulas discussed earlier, we can determine which architecture has the highest reliability value. For the purpose of this article, we will use sample MTBF values (as obtained from the manufacturer) and the corresponding AFR values. Each component is assigned an AFR variable, as shown in the table below:
Component         AFR Variable
HBA 1             H
HBA 2             H
Link A            L
Link B            L
Concentrator 1    C
Concentrator 2    C
Link C            L
Link D            L
Disk A            D
Disk B            D
The AFR for each component was calculated using the MTBF where (8760/MTBF) = AFR. The example MTBF values were taken from real network storage component statistics. However, such values vary greatly, and these numbers are given here purely for illustration.
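The conversion between MTBF and AFR can be sketched as below. The function names and the MTBF figure used in the demonstration are hypothetical; they only show the arithmetic of dividing the 8760 hours in a year by the MTBF.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def afr_from_mtbf(mtbf_hours):
    """Annual failure rate for a component with the given MTBF in hours."""
    return HOURS_PER_YEAR / mtbf_hours


def mtbf_from_afr(afr):
    """Inverse conversion: MTBF in hours for a given annual failure rate."""
    return HOURS_PER_YEAR / afr


if __name__ == "__main__":
    # Hypothetical component with an MTBF of one million hours.
    afr = afr_from_mtbf(1_000_000)
    print(f"AFR = {afr:.4f}")
    print(f"MTBF recovered from AFR = {mtbf_from_afr(afr):,.0f} hours")
```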
Having the rate of failure of each individual component, we can obtain the system's annual failure rate (AFR) and, consequently, the system reliability (R) and system MTBF values. The AFR values of redundant components are raised to the power equal to the number of redundant components. The AFR values of non-redundant components are multiplied by the number of those components in series. In the case of Architecture 1, the concentrator (C) is the only non-redundant component.
$\mathrm{AFR}_1 = (H+L)^2 + C + L^2 + D^2$
$\mathrm{AFR}_1 = (0.011+0.022)^2 + 0.0151 + 0.022^2 + 0.0088^2 \approx 0.0167$
$R_1 = 1 - \mathrm{AFR}_1 = 1 - 0.0167 = 0.9833$, or 98.33%
$\mathrm{MTBF}_1 = 8760 / \mathrm{AFR}_1 = 8760 / 0.0167 \approx 524{,}551$ hours
Architecture 2 has a different configuration with no non-redundant components.
$\mathrm{AFR}_2 = (H+L+C+L)^2 + D^2$
$\mathrm{AFR}_2 = (0.011+0.022+0.0151+0.022)^2 + 0.0088^2 \approx 0.0050$
$R_2 = 1 - \mathrm{AFR}_2 = 1 - 0.0050 = 0.9950$, or 99.50%
$\mathrm{MTBF}_2 = 8760 / \mathrm{AFR}_2 = 8760 / 0.0050 = 1{,}752{,}000$ hours
Architecture 3 has yet another configuration and also has no non-redundant components.
$\mathrm{AFR}_3 = (H+L+C+L+D)^2$
$\mathrm{AFR}_3 = (0.011+0.022+0.0151+0.022+0.0088)^2 \approx 0.0062$
$R_3 = 1 - \mathrm{AFR}_3 = 1 - 0.0062 = 0.9938$, or 99.38%
$\mathrm{MTBF}_3 = 8760 / \mathrm{AFR}_3 = 8760 / 0.0062 \approx 1{,}412{,}903$ hours
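The same arithmetic can be scripted so that the three architectures are evaluated side by side. The sketch below uses the component AFR values from the calculations above; the helper names (`series`, `redundant`, `summarize`) are hypothetical, and the approach simply mirrors this report's simplified rules (add AFRs in series, raise the AFR of a redundant path to the number of copies).

```python
HOURS_PER_YEAR = 8760

# Component AFR values used in the calculations above.
H, L, C, D = 0.011, 0.022, 0.0151, 0.0088


def series(*afrs):
    """AFR of components in series: the individual AFRs add."""
    return sum(afrs)


def redundant(afr, copies=2):
    """AFR of identical redundant paths: the path AFR raised to the number of copies."""
    return afr ** copies


def summarize(afr):
    """Reliability and MTBF derived from a system AFR."""
    return {"afr": afr, "reliability": 1 - afr, "mtbf_hours": HOURS_PER_YEAR / afr}


architectures = {
    "Architecture 1": redundant(series(H, L)) + C + redundant(L) + redundant(D),
    "Architecture 2": redundant(series(H, L, C, L)) + redundant(D),
    "Architecture 3": redundant(series(H, L, C, L, D)),
}

if __name__ == "__main__":
    for name, afr in architectures.items():
        s = summarize(afr)
        print(f"{name}: AFR = {s['afr']:.4f}, R = {s['reliability']:.2%}, "
              f"MTBF = {s['mtbf_hours']:,.0f} hours")
```

Because the sketch does not round the AFR before dividing, its MTBF figures can differ slightly from hand calculations that round the AFR to four decimal places first.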
Conclusion
When the calculations are complete, we compare the data:
Architecture 1: reliability = 98.33%, system MTBF = 524,551 hours
Architecture 2: reliability = 99.50%, system MTBF = 1,752,000 hours
Architecture 3: reliability = 99.38%, system MTBF = 1,412,903 hours
The MTBF figures are the most revealing, and indicate that architecture 2 is statistically the most reliable of all.