Guy, C.G. “Computer Reliability”
The Electrical Engineering Handbook
Ed. Richard C. Dorf
Boca Raton: CRC Press LLC, 2000
98
Computer Reliability
Chris G. Guy
University of Reading
98.1 Introduction
98.2 Definitions of Failure, Fault, and Error
98.3 Failure Rate and Reliability
98.4 Relationship Between Reliability and Failure Rate
98.5 Mean Time to Failure
98.6 Mean Time to Repair
98.7 Mean Time Between Failures
98.8 Availability
98.9 Calculation of Computer System Reliability
98.10 Markov Modeling
98.11 Software Reliability
98.12 Reliability Calculations for Real Systems
98.1 Introduction
This chapter outlines the knowledge needed to estimate the reliability of any electronic system or subsystem
within a computer. The word estimate was used in the first sentence to emphasize that the following calculations,
even if carried out correctly, can provide no guarantee that a particular example of a piece of electronic
equipment will work for any length of time. However, they can provide a reasonable guide to the probability
that something will function as expected over a given time period. The first step in estimating the reliability
of a computer system is to determine the likelihood of failure of each of the individual components, such as
resistors, capacitors, integrated circuits, and connectors, that make up the system. This information can then
be used in a full system analysis.
98.2 Definitions of Failure, Fault, and Error
A failure occurs when a system or component does not perform as expected. Examples of failures at the
component level could be a base-emitter short in a transistor somewhere within a large integrated circuit or a
solder joint going open circuit because of vibrations. If a component experiences a failure, it may cause a fault,
leading to an error, which may lead to a system failure.
A fault may be either the outward manifestation of a component failure or a design fault. Component failure
may be caused by internal physical phenomena or by external environmental effects such as electromagnetic
fields or power supply variations. Design faults may be divided into two classes. The first class of design fault
is caused by using components outside their rated specification. It should be possible to eliminate this class of
faults by careful design checking. The second class, which is characteristic of large digital circuits such as those
found in computer systems, is caused by the designer not taking into account every logical condition that could
occur during system operation. All computer systems have a software component as an integral part of their
operation, and software is especially prone to this kind of design fault.
A fault may be permanent or transitory. Examples of permanent faults are short or open circuits within a
component caused by physical failures. Transitory faults can be subdivided further into two classes. The first,
usually called transient faults, are caused by such things as alpha-particle radiation or power supply variations.
Large random access memory circuits are particularly prone to this kind of fault. By definition, a transient fault
is not caused by physical damage to the hardware. The second class is usually called intermittent faults. These
faults are temporary but reoccur in an unpredictable manner. They are caused by loose physical connections
between components or by components used at the limits of their specification. Intermittent faults often become
permanent faults after a period of time. A fault may be active or inactive. For example, if a fault causes the
output of a digital component to be stuck at logic 1, and the desired output is logic 1, then this would be
classed as an inactive fault. Once the desired output becomes logic 0, then the fault becomes active.
The consequence of a fault for system operation is an error. As the error may be caused by a permanent
or by a transitory fault, it may be classed as a hard error or a soft error. An error in an individual subsystem
may be due to a fault in that subsystem or to the propagation of an error from another part of the overall system.
The terms fault and error are sometimes interchanged. The term failure is often used to mean anything
covered by these definitions. The definitions given here are those in most common usage.
Physical faults within a component can be characterized by their external electrical effects. These effects are
commonly classified into fault models. The intention of any fault model is to take into account every possible
failure mechanism, so that the effects on the system can be worked out. The manifestation of faults in a system
can be classified according to the likely effects, producing an error model. The purpose of error models is to
try to establish what kinds of corrective action need be taken in order to effect repairs.
98.3 Failure Rate and Reliability
An individual component may fail after a random time, so it is impossible to predict any pattern of failure
from one example. It is possible, however, to estimate the rate at which members of a group of identical
components will fail. This rate can be determined by experimental means using accelerated life tests. In a
normal operating environment, the time for a statistically significant number of failures to have occurred in a
group of modern digital components could be tens or even hundreds of years. Consequently, the manufacturers
must make the environment for the tests extremely unfavorable in order to produce failures in a few hours or
days and then extrapolate back to produce the likely number of failures in a normal environment. The failure
rate is then defined as the number of failures per unit time, in a given environment, divided by the number
of surviving components. It is usually expressed as a number of failures per million hours.
If f(t) is the number of components that have failed up to time t, and s(t) is the number of components that
have survived, then z(t), the failure rate or hazard rate, is defined as
z(t) = (1/s(t)) × df(t)/dt    (98.1)
Most electronic components will exhibit a variation of failure rate with time. Many studies have shown that
this variation can often be approximated to the pattern shown in Fig. 98.1. For obvious reasons this is known
as a bathtub curve. The first phase, where the failure rate starts high but is decreasing with time, is where the
components are suffering infant mortality; in other words, those that had manufacturing defects are failing.
This is often called the burn-in phase. The second part, where the failure rate is roughly constant, is the useful
life period of operation for the component. The final part, where the failure rate is increasing with time, is
where the components are starting to wear out.
FIGURE 98.1 Variation of failure rate with time.
Using the same nomenclature as before, if:
s(t) + f(t) = N    (98.2)
i.e., N is the total number of components in the test, then the reliability r(t) is defined as
r(t) = s(t)/N    (98.3)
or in words, and using the definition from the IEEE Standard Dictionary of Electrical and Electronics Terms,
reliability is the probability that a device will function without failure over a specified time period or amount
of usage, under stated conditions.
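To make these definitions concrete, the short Python sketch below estimates the hazard rate z(t) and the reliability r(t) from a hypothetical life test. The batch size and cumulative failure counts are invented purely for illustration; the arithmetic follows Eqs. (98.1) to (98.3).

# Hypothetical life-test data: N components on test, cumulative failures f(t).
# Illustrates z(t) = (1/s(t)) * df(t)/dt (Eq. 98.1) and r(t) = s(t)/N (Eq. 98.3).
N = 1000
hours = [0, 100, 200, 300, 400]      # observation times, hours
failed = [0, 12, 21, 29, 36]         # cumulative failures f(t) (invented figures)

for i in range(1, len(hours)):
    dt = hours[i] - hours[i - 1]
    df = failed[i] - failed[i - 1]
    survivors = N - failed[i - 1]    # s(t) at the start of the interval
    z = df / (survivors * dt)        # failures per component-hour
    r = (N - failed[i]) / N          # reliability at the end of the interval
    print(f"t = {hours[i]:3d} h   z(t) = {z:.2e} per hour   r(t) = {r:.3f}")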
98.4 Relationship Between Reliability and Failure Rate
Using Eqs. (98.1), (98.2), and (98.3) then
z(t) = -(N/s(t)) × dr(t)/dt    (98.4)
λ is commonly used as the symbol for the failure rate z(t) in the period where it is a constant, i.e., the useful
life of the component. Consequently, we may write Eq. (98.4) as
λ = -(1/r(t)) × dr(t)/dt    (98.5)
Rewriting, integrating, and using the limits of integration as r(t) = 1 at t = 0 and r(t) = 0 at t = ∞ gives the result:
r(t) = e^(-λt)    (98.6)
This result is true only for the period of operation where the failure rate is a constant. For most common
components, real failure rates can be obtained from such handbooks as the American military MIL-HDBK-
217E, as explained in Section 98.12.
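As a quick illustration of Eq. (98.6), the sketch below evaluates r(t) for an assumed constant failure rate quoted, as the handbooks do, in failures per 10^6 hours; the numerical value is borrowed from the worked example in Section 98.12.

import math

lam = 1.32 / 1e6                    # assumed failure rate, converted to failures per hour
for t in (1_000, 10_000, 100_000):  # operating time in hours
    r = math.exp(-lam * t)          # r(t) = e^(-lambda*t), Eq. (98.6)
    print(f"r({t:>7,} h) = {r:.4f}")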
It must also be borne in mind that the calculated reliability is a probability function based on lifetime tests.
There can be no guarantee that any batch of components will exhibit the same failure rate and hence reliability
as those predicted because of variations in manufacturing conditions. Even if the components were made at
the same factory as those tested, the process used might have been slightly different and the equipment will be
older. Quality assurance standards are imposed on companies to try to guarantee that they meet minimum
manufacturing standards, but some cases in the United States have shown that even the largest plants can fall
short of these standards.
98.5 Mean Time to Failure
A figure that is commonly quoted because it gives a readier feel for the system performance is the mean time
to failure or MTTF. This is defined as
MTTF = ∫0^∞ r(t) dt    (98.7)
Hence, for the period where the failure rate is constant:
MTTF = 1/λ    (98.8)
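For a constant failure rate the integral of Eq. (98.7) collapses to Eq. (98.8). A rough numerical check, using an assumed λ of 1.32 failures per 10^6 hours, is sketched below.

import math

lam = 1.32e-6                       # assumed failure rate, failures per hour
dt = 1_000.0                        # integration step, hours
# Crude rectangle-rule evaluation of MTTF = integral of r(t) dt, Eq. (98.7).
mttf_numeric = sum(math.exp(-lam * i * dt) * dt for i in range(20_000))
print(f"numerical integral ~ {mttf_numeric:,.0f} h")
print(f"1/lambda (98.8)    = {1 / lam:,.0f} h")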
98.6 Mean Time to Repair
For many computer systems it is possible to define a mean time to repair (MTTR). This will be a function of
a number of things, including the time taken to detect the failure, the time taken to isolate and replace the
faulty component, and the time taken to verify that the system is operating correctly again. While the MTTF
is a function of the system design and the operating environment, the MTTR is often a function of unpredictable
human factors and, hence, is difficult to quantify. Figures used for MTTR for a given system in a fixed situation
could be predictions based on the experience of the reliability engineers or could be simply the maximum
response time given in the maintenance contract for a computer. In either case, MTTR predictions may be
subject to some fluctuations. To take an extreme example, if the service engineer has a flat tire while on the
way to effect the repair, then the repair time may be many times the predicted MTTR. For some systems no
MTTR can be predicted, as they are in situations that make repair impossible or uneconomic. Computers in
satellites are a good example. In these cases and all others where no errors in the output can be allowed, fault
tolerant approaches must be used in order to extend the MTTF beyond the desired system operational lifetime.
98.7 Mean Time Between Failures
For systems where repair is possible, a figure for the expected time between failures can be defined as
MTBF = MTTF + MTTR (98.9)
The definitions given for MTTF and MTBF are the most commonly accepted ones. In some texts, MTBF is
wrongly used as mean time before failure, confusing it with MTTF. In many real systems, MTTF is very much
greater than MTTR, so the values of MTTF and MTBF will be almost identical, in any case.
98.8 Availability
Availability is defined as the probability that the system will be functioning at a given time during its normal
working period.
Av = total working time / total time    (98.10)
This can also be written as
Av = MTTF / (MTTF + MTTR)    (98.11)
Some systems are designed for extremely high availability. For example, the computers used by AT&T to control
its telephone exchanges are designed for an availability of 0.9999999, which corresponds to an unplanned
downtime of 2 min in 40 years. In order to achieve this level of availability, fault tolerant techniques have to
be used from the design stage, accompanied by a high level of monitoring and maintenance.
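The downtime figure quoted above can be checked directly: an availability of 0.9999999 leaves an unavailability of 10^-7 of the total time, which over 40 years is roughly two minutes. A minimal sketch:

availability = 0.9999999
total_minutes = 40 * 365.25 * 24 * 60            # 40 years, in minutes
downtime = (1 - availability) * total_minutes    # unavailability times total time
print(f"allowed downtime over 40 years: {downtime:.1f} min")   # about 2 min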
98.9 Calculation of Computer System Reliability
For systems that have not been designed to be fault tolerant it is common to assume that the failure of any
component implies the failure of the system. Thus, the system failure rate can be determined by the so-called
parts count method. If the system contains m types of component, each with a failure rate λ_m, then the system
failure rate λ_s can be defined as
λ_s = Σ_m N_m λ_m    (98.12)
where N_m is the number of each type of component.
The system reliability will be
r_s(t) = Π_m [r_m(t)]^(N_m)    (98.13)
If the system design is such that the failure of an individual component does not necessarily cause system
failure, then the calculations of MTTF and r_s(t) become more complicated.
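A small sketch of the parts count method follows. The bill of materials and the individual failure rates are purely illustrative assumptions, not values taken from MIL-HDBK-217E; the calculation itself follows Eqs. (98.12) and (98.13).

import math

# type: (count N_m, failure rate lambda_m in failures per 10^6 hours) -- assumed values
parts = {
    "microprocessor": (1, 1.32),
    "SRAM":           (4, 0.90),
    "resistor":       (120, 0.002),
    "capacitor":      (80, 0.010),
    "connector":      (6, 0.050),
}

lam_s = sum(n * lam for n, lam in parts.values())   # Eq. (98.12), per 10^6 hours
mttf = 1e6 / lam_s                                  # MTTF = 1/lambda_s, in hours
r_one_year = math.exp(-lam_s * 1e-6 * 8760)         # system reliability over one year

print(f"system failure rate: {lam_s:.2f} per 10^6 h")
print(f"predicted MTTF:      {mttf:,.0f} h")
print(f"r_s(1 year):         {r_one_year:.4f}")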
Consider two situations where a computer system is made up of several subsystems. These may be individual
components or groups of components, e.g., circuit boards. The first is where failure of an individual subsystem
implies system failure. This is known as the series model and is shown in Fig. 98.2. This is the same case as
considered previously, and the parts count method, Eqs. (98.12) and (98.13), can be used. The second case is
where failure of an individual subsystem does not imply system failure. This is shown in Fig. 98.3. Only the
failure of every subsystem means that the system has failed, and the system reliability can be evaluated by the
following method. If r(t) is the reliability (or probability of not failing) of each subsystem, then q(t) = 1 – r(t)
is the probability of an individual subsystem failing. Hence, the probability of them all failing is
q_s(t) = [1 – r(t)]^n    (98.14)
for n subsystems.
FIGURE 98.2 Series model.
Hence the system reliability will be:
r_s(t) = 1 – [1 – r(t)]^n    (98.15)
In practice, systems will be made up of differing combinations of parallel and series networks; the simplest
examples are shown in Figs. 98.4 and 98.5.
FIGURE 98.3 Parallel model.
FIGURE 98.4 Parallel series model.
FIGURE 98.5 Series-parallel model.
Parallel-Series System
Assuming that the reliability of each subsystem is identical, then the overall reliability can be calculated thus.
The reliability of one unit is r; hence the reliability of the series path is r^n. The probability of failure of each
path is then q = 1 – r^n. Hence, the probability of failure of all m paths is (1 – r^n)^m, and the reliability of the
complete system is
r_ps = 1 – (1 – r^n)^m    (98.16)
Series-Parallel System
Making similar assumptions, and using a similar method, the reliability can be written as
r_sp = [1 – (1 – r)^n]^m    (98.17)
It is straightforward to extend these results to systems with subsystems having different reliabilities and in different
combinations. It can be seen that these simple models could be used as the basis for a fault tolerant system, i.e.,
one that is able to carry on performing its designated function even while some of its parts have failed.
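These simple models translate directly into code. The sketch below assumes every subsystem has the same reliability r over the mission time and mirrors Eqs. (98.15), (98.16), and (98.17), plus the plain series case; the value of r is an arbitrary example.

def series(r: float, n: int) -> float:
    """n identical subsystems in series: all must work."""
    return r ** n

def parallel(r: float, n: int) -> float:
    """n identical subsystems in parallel, Eq. (98.15)."""
    return 1.0 - (1.0 - r) ** n

def parallel_series(r: float, n: int, m: int) -> float:
    """m parallel paths, each a series chain of n units, Eq. (98.16)."""
    return 1.0 - (1.0 - r ** n) ** m

def series_parallel(r: float, n: int, m: int) -> float:
    """m series stages, each a parallel bank of n units, Eq. (98.17)."""
    return (1.0 - (1.0 - r) ** n) ** m

r = 0.95          # assumed subsystem reliability over the mission time
print(f"series (n=3):          {series(r, 3):.4f}")
print(f"parallel (n=3):        {parallel(r, 3):.4f}")
print(f"parallel-series (3x2): {parallel_series(r, n=3, m=2):.4f}")
print(f"series-parallel (2x3): {series_parallel(r, n=2, m=3):.4f}")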
Practical Systems Using Parallel Sub-Systems
A computer system that uses parallel sub-systems to improve reliability must incorporate some kind of
arbitrator to determine which output to use at any given time. A common method of arbitration involves
adding a voter to a system with N parallel modules, where N is an odd number. For example, if N = 3, a single
incorrect output can be masked by the two correct outputs outvoting it. Hence, the system output will be
correct, even though an error has occurred in one of the sub-systems. This system would be known as Triple-
Modular-Redundant (TMR) (Fig. 98.6).
The reliability of a TMR system is the probability that any two out of the three units will be working. This
can be expressed as
r_tmr = r_1 r_2 r_3 + r_1 r_2 (1 – r_3) + r_1 (1 – r_2) r_3 + (1 – r_1) r_2 r_3
where r_n (n = 1, 2, 3) is the reliability of each subsystem. If r_1 = r_2 = r_3 = r, this reduces to
r_tmr = 3r^2 – 2r^3
FIGURE 98.6 Triple-modular-redundant system.
The reliability of the voter must be included when calculating the overall reliability of such a system. As the
voter appears in every path from input to output, it can be included as a series element in a series-parallel
model. This leads to
r_tmr = r_v [3r^2 – 2r^3]    (98.18)
where r_v is the reliability of the voter.
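The TMR expressions above can be checked with a few lines of code. The sketch below evaluates both the general two-out-of-three formula and the equal-reliability form of Eq. (98.18); the module reliability of 0.90 is an assumed figure.

def tmr(r1: float, r2: float, r3: float) -> float:
    """Probability that at least two of the three modules are working."""
    return (r1 * r2 * r3
            + r1 * r2 * (1 - r3)
            + r1 * (1 - r2) * r3
            + (1 - r1) * r2 * r3)

def tmr_equal(r: float, r_v: float = 1.0) -> float:
    """Equal module reliabilities with voter reliability r_v, Eq. (98.18)."""
    return r_v * (3 * r ** 2 - 2 * r ** 3)

r = 0.90
print(f"single module:       {r:.4f}")
print(f"TMR, perfect voter:  {tmr_equal(r):.4f}")        # 0.9720
print(f"TMR, r_v = 0.999:    {tmr_equal(r, 0.999):.4f}")
assert abs(tmr(r, r, r) - tmr_equal(r)) < 1e-12          # the two forms agree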
More information on methods of using redundancy to improve system reliability can be found in Chapter 93.
98.10 Markov Modeling
Another approach to determining the probability of system failure is to use a Markov model of the system,
rather than the combinatorial methods outlined previously. Markov models involve defining the possible system states
and the transitions between them. The mathematics of Markov modeling is well beyond the scope of this brief
introduction, but most engineering mathematics textbooks cover the technique.
To model the reliability of any system it is necessary to define the various fault-free and faulty states that
could exist. For example, a system consisting of two identical units (A and B), either of which has to work for
the system to work, would have four possible states. They would be (1) A and B working; (2) A working, B
failed; (3) B working, A failed; and (4) A and B failed. The system designer must assign to each state a series
of probabilities that determine whether it will remain in the same state or change to another after a given time
period. This is usually shown in a state diagram, as in Fig. 98.7. This model does not allow for the possibility
of repair, but this could easily be added.
FIGURE 98.7 State diagram for two-unit parallel system.
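As an illustration only, the sketch below treats the two-unit system as a discrete-time Markov chain, merging the two single-failure states (the units are identical) and assuming a per-step failure probability for each unit; no repair transitions are included, matching the model described above.

p = 0.001                      # assumed probability that a working unit fails in one step
# Transition matrix over the states: both working, one working, both failed.
T = [
    [(1 - p) ** 2, 2 * p * (1 - p), p * p],   # from: both working
    [0.0,          1 - p,           p],       # from: one working
    [0.0,          0.0,             1.0],     # from: both failed (absorbing)
]

state = [1.0, 0.0, 0.0]        # start with both units working
for _ in range(1000):          # propagate the state probabilities for 1000 steps
    state = [sum(state[i] * T[i][j] for i in range(3)) for j in range(3)]

print(f"P(system still up after 1000 steps) = {state[0] + state[1]:.4f}")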
98.11 Software Reliability
One of the major components in any computer system is its software. Although software is unlikely to wear
out in a physical sense, it is still impossible to prove that anything other than the simplest of programs is totally
free from bugs. Hence, any piece of software will follow the first and second parts of the normal bathtub curve
(Fig. 98.1). The burn-in phase for hardware corresponds to the early release of a complex program, where bugs
are commonly found and have to be fixed. The useful life phase for hardware corresponds to the time when
the software can be described as stable, even though bugs may still be found. In this phase, where the failure
rate can be characterized as constant (even if it is very low), the hardware performance criteria, such as MTTF
and MTTR, can be estimated. They must be included in any estimation of the overall availability for the computer
system as a whole. Just as with hardware, techniques using redundancy can be used to improve the availability
through fault tolerance.
98.12 Reliability Calculations for Real Systems
The most common source of basic reliability data for electronic components and circuits is the military
handbook Reliability Prediction of Electronic Equipment, published by the U.S. Department of Defense. It has
the designation MIL-HDBK-217E in its most recent version. This handbook provides both the basic reliability
data and the formulae to modify those data for the application of interest. For example, the formula for
predicting the failure rate, λ_p, of a bipolar or MOS microprocessor is given as
λ_p = π_Q (C_1 π_T π_V + C_2 π_E) π_L    failures per 10^6 hours
where π_Q is the part quality factor, with several categories, ranging from a full mil-spec part to a commercial
part; π_T is the temperature acceleration factor, related to both the technology in use and the actual operating
temperature; π_V is the voltage stress derating factor, which is higher for devices operating at higher voltages;
π_E is the application environment factor (the handbook gives figures for many categories of environment,
ranging from laboratory conditions up to the conditions found in the nose cone of a missile in flight); π_L is
the device learning factor, related to how mature the technology is and how long the production of the part
has been going on; C_1 is the circuit complexity factor, dependent on the number of transistors on the chip; and
C_2 is the package complexity, related to the number of pins and the type of package.
The following figures are given for a 16-bit microprocessor, operating on the ground in a laboratory
environment, with a junction temperature of 51°C. The device is assumed to be packaged in a plastic, 64-pin
dual in-line package and to have been manufactured using the same technology for several years:
π_Q = 20    π_T = 0.89    π_V = 1    π_E = 0.38
π_L = 1    C_1 = 0.06    C_2 = 0.033
Hence, the failure rate λ_p for this device, operating in the specified environment, is estimated to be 1.32
failures per 10^6 hours. To calculate the predicted failure rate for a system based around this microprocessor
would involve similar calculations for all the parts, including the passive components, the PCB, and connectors,
and summing all the resultant failure rates. The resulting figure could then be inverted to give a
predicted MTTF. This kind of calculation is repetitive, tedious, and therefore prone to errors, so many companies
now provide software to perform the calculations. The official Department of Defense program for automating
the calculation of reliability figures is called ORACLE. It is regularly updated to include all the changes since
MIL-HDBK-217E was released. Versions for VAX/VMS and the IBM PC are available from the Rome Air Development
Center, RBET, Griffiss Air Force Base, NY 13441-5700. Other software to perform the same function is advertised
in the publications listed under Further Information.
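The microprocessor example above can be reproduced in a few lines; the formula and the factor values are those quoted in this section, and the function name is illustrative only.

def lambda_p(pi_Q, pi_T, pi_V, pi_E, pi_L, C1, C2):
    """MIL-HDBK-217E microprocessor model: failures per 10^6 hours."""
    return pi_Q * (C1 * pi_T * pi_V + C2 * pi_E) * pi_L

rate = lambda_p(pi_Q=20, pi_T=0.89, pi_V=1, pi_E=0.38, pi_L=1, C1=0.06, C2=0.033)
print(f"lambda_p = {rate:.2f} failures per 10^6 hours")   # about 1.32
print(f"MTTF     = {1e6 / rate:,.0f} hours")              # roughly 758,000 hours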
Defining Terms
Availability: This figure gives a prediction for the proportion of time that a given part or system will be in
full working order. It can be calculated from Av = MTTF / (MTTF + MTTR).
Failure rate: The failure rate, λ, is the (predicted or measured) number of failures per unit time for a specified
part or system operating in a given environment. It is usually assumed to be constant during the working
life of a component or system.
Mean time to failure: This figure is used to give an expected working lifetime for a given part, in a given
environment. It is defined by the equation MTTF = ∫0^∞ r(t) dt. If the failure rate λ is constant, then
MTTF = 1/λ.
Mean time to repair: The MTTR figure gives a prediction for the amount of time taken to repair a given part
or system.
Reliability: Reliability r(t) is the probability that a component or system will function without failure over a
specified time period, under stated conditions.
Related Topics
92.2 Local Area Networks • 110.1 Introduction • 110.4 Mean Time to Failure (MTTF) • 110.14 Markov
Models • 110.22 Reliability and Economics
Further Information
The quarterly journal IEEE Transactions on Reliability contains much of the latest research on reliability
estimation techniques.
The monthly journal Microelectronics and Reliability covers the field of reliability estimation and also
includes papers on actual measured reliabilities.
Sometimes, manufacturers make available measured failure rates for their devices.