Ramakumar, R. “Reliability Engineering”
The Electrical Engineering Handbook
Ed. Richard C. Dorf
Boca Raton: CRC Press LLC, 2000
110
Reliability Engineering¹
110.1 Introduction
110.2 Catastrophic Failure Models
110.3 The Bathtub Curve
110.4 Mean Time To Failure (MTTF)
110.5 Average Failure Rate
110.6 A Posteriori Failure Probability
110.7 Units for Failure Rates
110.8 Application of the Binomial Distribution
110.9 Application of the Poisson Distribution
110.10 The Exponential Distribution
110.11 The Weibull Distribution
110.12 Combinatorial Aspects
110.13 Modeling Maintenance
110.14 Markov Models
110.15 Binary Model for a Repairable Component
110.16 Two Dissimilar Repairable Components
110.17 Two Identical Repairable Components
110.18 Frequency and Duration Techniques
110.19 Applications of Markov Process
110.20 Some Useful Approximations
110.21 Application Aspects
110.22 Reliability and Economics
110.1 Introduction
Reliability engineering is a vast field and it has grown significantly during the past five decades (since World
War II). The two major approaches to reliability assessment and prediction are (1) traditional methods based
on probabilistic assessment of field data and (2) methods based on the analysis of failure mechanisms and
physics of failure. The latter is more accurate, but is difficult and time consuming to implement. The first one,
in spite of its many flaws, continues to be in use. Some of the many areas encompassing reliability engineering
are reliability allocation and optimization, reliability growth and modeling, reliability testing including accel-
erated testing, data analysis and graphical techniques, quality control and acceptance sampling, maintenance
engineering, repairable system modeling and analysis, software reliability, system safety analysis, Bayesian
analysis, reliability management, simulation and Monte Carlo techniques, Failure Modes, Effects and Criticality
Analysis (FMECA), and economic aspects of reliability, to mention a few.
Application of reliability techniques is gaining importance in all branches of engineering because of its
effectiveness in the detection, prevention, and correction of failures in the design, manufacturing, and operational
¹Some of the material in this chapter was previously published by CRC Press in The Engineering Handbook, R. C. Dorf, Ed., 1996.
R. Ramakumar
Oklahoma State University
© 2000 by CRC Press LLC
INFORMATION MANAGEMENT SYSTEM FOR MANUFACTURING EFFICIENCY

At current schedules, each of NASA's four Space Shuttle Orbiters must fly two or three times a
year. Preparing an orbiter for its next mission is an incredibly complex process and much of the
work is accomplished in the Orbiter Processing Facility (OPF) at Kennedy Space Center.
The average “flow” — the complete cycle of refurbishing an orbiter — requires the integration of
approximately 10,000 work events, takes 65 days, and consumes some 40,000 technician labor hours. Under the
best conditions, scheduling each of the 10,000 work events in a single flow would be a task of monumental
proportions. But the job is further complicated by the fact that only half the work is standard and
predictable; the other half is composed of problem-generated tasks and jobs specific to the next mission,
which creates a highly dynamic processing environment and requires frequent rescheduling.
For all the difficulties, Kennedy Space Center and its prime contractor for shuttle processing —
Lockheed Space Operations Company (LSOC) — are doing an outstanding job of managing OPF oper-
ations with the help of a number of processing innovations in recent years. One of the most important
is the Ground Processing Scheduling System, or GPSS. The GPSS is a software system for enhancing
efficiency by providing an automated scheduling tool that predicts conflicts between scheduled tasks,
helps human schedulers resolve those conflicts, and searches for near-optimal schedules.
GPSS is a cooperative development of Ames Research Center, Kennedy Space Center, LSOC, and a
related company, Lockheed Missiles and Space Company. It originated at Ames, where a group of
computer scientists conducted basic research on the use of artificial intelligence techniques to automate
the scheduling process. A product of the work was a software system for complex, multifaceted operations
known as the Gerry scheduling engine.
Kennedy Space Center brought Ames and Lockheed together and the group formed an inter-center/NASA
contractor partnership to transfer the technology of the Gerry scheduling engine to the Space Shuttle
program. The transfer was successfully accomplished and GPSS has become the accepted general purpose
scheduling tool for OPF operations. (Courtesy of National Aeronautics and Space Administration.)
Kennedy Space Center technicians are preparing a Space Shuttle Orbiter for its next mission, an intricate task that
requires scheduling 10,000 separate events over 65 days. A NASA-developed computer program automated this
extremely complex scheduling job. (Photo courtesy of National Aeronautics and Space Administration.)
phases of products and systems. Increasing emphasis being placed on quality of components and systems,
coupled with pressures to minimize cost and increase value, further emphasize the need to study, understand,
quantify, and predict reliability and arrive at innovative designs and operational and maintenance procedures.
From the electrical engineering point of view, two (among several) areas that have received significant
attention are electronic equipment (including computer hardware) and electric power systems. Other major
areas include communication systems and software engineering. As the complexity of electronic equipment
grew during and after World War II and as the consequences of failures in the field became more and more
apparent, the U.S. military became seriously involved, promoted the formation of groups, and became instru-
mental in the development of the earliest handbooks and specifications. The great northeast blackout in the
U.S. in November 1965 triggered the serious application of reliability concepts in the power systems area.
The objectives of this chapter are to introduce the reader to the fundamentals and applications of classical
reliability concepts and bring out the important benefits of reliability considerations. Brief summaries of
application aspects of reliability for electronic systems and power systems are also included.
110.2 Catastrophic Failure Models
Catastrophic failure refers to the case in which repair of the component is either not possible, not available, or
of no value to the successful completion of the mission originally planned. Modeling such failures is typically
based on life test results. We can consider the “lifetime” or “time to failure” T as a continuous random variable.
Then,
P[survival up to time t] = P(T > t) ≜ R(t)    (110.1)

where R(t) is the reliability function. Obviously, as t → ∞, R(t) → 0 since the probability of failure increases
with time of operation. Moreover,

P[failure at or before time t] = P(T ≤ t) ≜ Q(t)    (110.2)

where Q(t) is the unreliability function. From the definition of the distribution function of a continuous random
variable, it is clear that Q(t) is indeed the distribution function for T. Therefore, the failure density function
f(t) can be obtained as

f(t) = (d/dt) Q(t)    (110.3)

The hazard rate function λ(t) is defined as

λ(t) ≜ lim[Δt→0] (1/Δt) P[failure in (t, t + Δt), given survival up to t]    (110.4)

It can be shown that

λ(t) = f(t)/R(t)    (110.5)

The four functions, f(t), Q(t), R(t), and λ(t), constitute the set of functions used in basic reliability analysis.
The relationships between these functions are given in Table 110.1.
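As a quick numerical sketch of Eqs. (110.1) through (110.5), the four functions can be computed for an assumed constant-hazard component; the rate value is arbitrary and chosen purely for illustration:

```python
import math

# Assumed exponential life distribution with rate LAM (illustrative value).
LAM = 0.002  # failure rate, per hour

def Q(t):            # unreliability: distribution function of lifetime T
    return 1.0 - math.exp(-LAM * t)

def R(t):            # reliability, R(t) = 1 - Q(t)
    return 1.0 - Q(t)

def f(t, h=1e-4):    # failure density, f(t) = dQ/dt (central difference)
    return (Q(t + h) - Q(t - h)) / (2 * h)

def hazard(t):       # hazard rate, lambda(t) = f(t)/R(t)  (Eq. 110.5)
    return f(t) / R(t)

# For a constant-hazard component, the hazard rate equals LAM at any t:
for t in (10.0, 100.0, 1000.0):
    assert abs(hazard(t) - LAM) < 1e-6
```

The same four-function bookkeeping applies to any life distribution; only Q(t) needs to change.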
110.3 The Bathtub Curve
Of the four functions discussed, the hazard rate function l(t) displays the different stages during the lifetime
of a component most clearly. In fact, typical l(t) plots have the general shape of a bathtub as shown in Fig. 110.1.
The first region corresponds to wearin (infant mortality) or early failures during debugging. The hazard rate
goes down as debugging continues. The second region corresponds to an essentially constant and low failure
rate and failures can be considered to be nearly random. This is the useful lifetime of the component. The third
region corresponds to wearout or fatigue phase with a sharply increasing hazard rate.
“Burn-in” refers to the practice of subjecting components to an initial operating period of t₁ (see Fig. 110.1)
before delivering them to the customer. This screens out the early failures, so that they do not occur after delivery to
customers requiring high-reliability components. Moreover, it is prudent to replace a component as it
approaches the wearout region, i.e., after an operating period of (t₂ − t₁). Electronic components tend to have a
long useful life (constant hazard) period, whereas the wearout region tends to dominate in the case of mechanical components.
TABLE 110.1  Relationships Between Different Reliability Functions

              In terms of f(t)              In terms of λ(t)            In terms of Q(t)             In terms of R(t)
f(t) =        f(t)                          λ(t) exp[−∫₀ᵗ λ(ξ)dξ]       (d/dt) Q(t)                  −(d/dt) R(t)
λ(t) =        f(t)/[1 − ∫₀ᵗ f(ξ)dξ]         λ(t)                        [(d/dt) Q(t)]/[1 − Q(t)]     −(d/dt) ln R(t)
Q(t) =        ∫₀ᵗ f(ξ)dξ                    1 − exp[−∫₀ᵗ λ(ξ)dξ]        Q(t)                         1 − R(t)
R(t) =        1 − ∫₀ᵗ f(ξ)dξ                exp[−∫₀ᵗ λ(ξ)dξ]            1 − Q(t)                     R(t)

FIGURE 110.1  Bathtub-shaped hazard function λ(t): region I (early failures, up to t₁), region II (useful life, t₁ to t₂), and region III (wearout, beyond t₂).
110.4 Mean Time To Failure (MTTF)
The mean or expected value of the continuous random variable “time-to-failure” is the MTTF. This is a very
useful parameter and is often enough to assess the suitability of components. It can be obtained using either
the failure density function f(t) or the reliability function R(t) as follows:
MTTF = ∫₀^∞ t f(t) dt  or  ∫₀^∞ R(t) dt    (110.6)

In the case of repairable components, the repair time can also be considered as a continuous random variable
with an expected value of MTTR. The mean time between failures, MTBF, is the sum of MTTF and MTTR.
Since for well-designed components MTTR << MTTF, MTBF and MTTF are often used interchangeably.
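The equivalence of the two integrals in Eq. (110.6) can be checked numerically; the sketch below uses a simple trapezoidal rule and an assumed constant failure rate, with the improper integral truncated far into the tail:

```python
import math

LAM = 0.001  # assumed constant failure rate, per hour

def R(t):
    return math.exp(-LAM * t)

def f(t):
    return LAM * math.exp(-LAM * t)

def integrate(g, a, b, n=100000):
    # simple composite trapezoidal rule
    h = (b - a) / n
    s = 0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, n))
    return s * h

T_MAX = 20.0 / LAM  # truncation point; R(T_MAX) is negligibly small
mttf_from_f = integrate(lambda t: t * f(t), 0.0, T_MAX)
mttf_from_R = integrate(R, 0.0, T_MAX)

# Both forms of Eq. (110.6) agree with the exact value 1/LAM = 1000 h
assert abs(mttf_from_f - 1000.0) < 1.0
assert abs(mttf_from_R - 1000.0) < 1.0
```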
110.5 Average Failure Rate
The average failure rate over the time interval 0 to T is defined as
AFR(0, T) ≜ AFR(T) = −ln R(T)/T    (110.7)
110.6 A Posteriori Failure Probability

When components are subjected to a burn-in (or wearin) period of duration T, and if the component survives
during (0, T), the probability of failure during (T, T + t) is called the a posteriori failure probability Q_c(t). It can
be found using

Q_c(t) = ∫[T to T+t] f(ξ)dξ / ∫[T to ∞] f(ξ)dξ    (110.8)

The probability of survival during (T, T + t) is

R(t|T) = 1 − Q_c(t) = ∫[T+t to ∞] f(ξ)dξ / ∫[T to ∞] f(ξ)dξ = R(T + t)/R(T) = exp[−∫[T to T+t] λ(ξ)dξ]    (110.9)
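A minimal sketch of Eqs. (110.8) and (110.9), assuming (purely for illustration) a Rayleigh-type wearout law so that the a posteriori failure probability does depend on the prior operating time T:

```python
import math

ALPHA = 1000.0  # assumed scale parameter, hours (illustrative)

def R(t):
    # assumed Rayleigh-type reliability, an increasing-hazard example
    return math.exp(-((t / ALPHA) ** 2))

def Qc(t, T):
    # a posteriori failure probability: Q_c(t) = 1 - R(T + t)/R(T)
    return 1.0 - R(T + t) / R(T)

# A component with increasing hazard degrades with operating age:
# the same 100-h mission is riskier after 500 h of prior operation.
q_fresh = Qc(100.0, 0.0)
q_aged = Qc(100.0, 500.0)
assert q_aged > q_fresh
```

For a constant-hazard component the same calculation would give Q_c(t) independent of T, as noted in Section 110.10.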
110.7 Units for Failure Rates
Several units are used to express failure rates. In addition to l(t) which is usually in number per hour, %/K is
used to denote failure rate in percent per thousand hours and PPM/K is used to express failure rate in parts
per million per thousand hours. The last unit is also known as FIT for “fails in time”. The relationships between
these units are given in Table 110.2.
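The unit relationships described above reduce to powers of ten; the helper functions below are illustrative names, not part of any standard library:

```python
# Failure-rate unit conversions implied by Section 110.7 (cf. Table 110.2):
# lambda is in failures per hour; %/K is percent per 1000 hours;
# PPM/K (= FIT) is parts per million per 1000 hours.

def per_hour_to_percent_per_k(lam):
    return lam * 1e5          # (lam * 1000 h) * 100 %

def per_hour_to_ppm_per_k(lam):
    return lam * 1e9          # (lam * 1000 h) * 1e6

lam = 2e-9                    # an assumed 2-FIT failure rate, per hour
assert abs(per_hour_to_ppm_per_k(lam) - 2.0) < 1e-12
assert abs(per_hour_to_percent_per_k(lam) - 2e-4) < 1e-15
```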
110.8 Application of the Binomial Distribution
In an experiment consisting of n identical independent trials, with each trial resulting in success or failure with
probabilities of p and q, the probability P_r of r successes and (n − r) failures is

P_r = ⁿC_r pʳ (1 − p)ⁿ⁻ʳ    (110.10)

If X denotes the number of successes in n trials, then it is a discrete random variable with a mean value of (np)
and a variance of (npq).
In a system consisting of a collection of n identical components with a probability p that a component is
defective, the probability of finding r defects out of n is given by the P_r in Eq. (110.10). If p is the probability
of success of one component and if at least r of them must be good for system success, then the system reliability
(probability of system success) is given by

R = Σ[k=r to n] ⁿC_k pᵏ (1 − p)ⁿ⁻ᵏ    (110.11)

For systems with redundancy, r < n.
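Eq. (110.11) can be sketched directly; the example values are arbitrary:

```python
from math import comb

def r_out_of_n_reliability(r, n, p):
    # Eq. (110.11): probability that at least r of n independent,
    # identical components (each good with probability p) are good
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

p = 0.9
# r = n gives a series system, r = 1 a parallel system:
assert abs(r_out_of_n_reliability(3, 3, p) - p**3) < 1e-12
assert abs(r_out_of_n_reliability(1, 3, p) - (1 - (1 - p)**3)) < 1e-12
# a 2-out-of-3 system lies between the series and parallel extremes
r23 = r_out_of_n_reliability(2, 3, p)
assert p**3 < r23 < 1 - (1 - p)**3
```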
110.9 Application of the Poisson Distribution
For events that occur “in-time” at an average rate of λ occurrences per unit of time, the probability P_x(t) of
exactly x occurrences during the time interval (0, t) is given by

P_x(t) = (λt)ˣ e^(−λt)/x!    (110.12)

The number of occurrences X in (0, t) is a discrete random variable with a mean value m of (λt) and a standard
deviation σ of √(λt). By setting X = 0 in Eq. (110.12), we obtain the probability of no occurrence in (0, t) as e^(−λt).
If the event is failure, then no occurrence means success and e^(−λt) is the probability of success or system reliability.
This is the well-known and often-used exponential distribution, also known as the constant-hazard model.
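A short sketch of Eq. (110.12), with arbitrary illustrative values for the rate and interval:

```python
import math

def poisson_p(x, lam, t):
    # Eq. (110.12): probability of exactly x occurrences in (0, t)
    return (lam * t) ** x * math.exp(-lam * t) / math.factorial(x)

lam, t = 0.01, 100.0   # assumed rate (per hour) and interval, so lam*t = 1
# x = 0 recovers the exponential reliability e**(-lam*t)
assert abs(poisson_p(0, lam, t) - math.exp(-lam * t)) < 1e-12
# the probabilities over all x sum to 1
assert abs(sum(poisson_p(x, lam, t) for x in range(50)) - 1.0) < 1e-9
```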
110.10 The Exponential Distribution
A constant hazard rate (constant λ) corresponding to the useful lifetime of components leads to the single-
parameter exponential distribution. The functions of interest associated with a constant λ are:

f(t) = λ e^(−λt),  t > 0    (110.13)
TABLE 110.2  Relationships Between Different Failure Rate Units

                    λ (#/hr)        %/K             PPM/K (FIT)
λ (#/hr) =          λ               10⁻⁵ (%/K)      10⁻⁹ (PPM/K)
%/K =               10⁵ λ           %/K             10⁻⁴ (PPM/K)
PPM/K (FIT) =       10⁹ λ           10⁴ (%/K)       PPM/K
R(t) = e^(−λt)    (110.14)

Q(t) = Q_c(t) = 1 − e^(−λt)    (110.15)

The a posteriori failure probability Q_c(t) is independent of the prior operating time T, indicating that the
component does not degrade no matter how long it operates. Obviously, such a scenario is valid only during
the useful lifetime (horizontal portion of the bathtub curve) of the component.
The mean and standard deviation of the random variable “lifetime” are

m = MTTF = 1/λ  and  σ = 1/λ    (110.16)
110.11 The Weibull Distribution
The Weibull distribution has two parameters, a scale parameter a and a shape parameter b. By adjusting these
two parameters, a wide range of experimental data can be modeled in system reliability studies.
The associated functions are

λ(t) = (β/αᵝ) tᵝ⁻¹;  t > 0, α > 0, β > 0    (110.17)

f(t) = (β tᵝ⁻¹/αᵝ) exp[−(t/α)ᵝ]    (110.18)

R(t) = exp[−(t/α)ᵝ]    (110.19)

With β = 1, the Weibull distribution reduces to the constant-hazard model with λ = (1/α). With β = 2, the
Weibull distribution reduces to the Rayleigh distribution.
The associated MTTF is

MTTF = m = α Γ(1 + 1/β)    (110.20)

where Γ denotes the gamma function.
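The Weibull reliability and MTTF of Eqs. (110.19) and (110.20) can be sketched with arbitrary illustrative parameters:

```python
import math

def weibull_R(t, alpha, beta):
    # Eq. (110.19): Weibull reliability function
    return math.exp(-((t / alpha) ** beta))

def weibull_mttf(alpha, beta):
    # Eq. (110.20): MTTF = alpha * Gamma(1 + 1/beta)
    return alpha * math.gamma(1.0 + 1.0 / beta)

alpha = 500.0  # assumed scale parameter, hours
# beta = 1 reduces to the constant-hazard model with lambda = 1/alpha:
assert abs(weibull_R(200.0, alpha, 1.0) - math.exp(-200.0 / alpha)) < 1e-12
assert abs(weibull_mttf(alpha, 1.0) - alpha) < 1e-9
# beta = 2 (Rayleigh): MTTF = alpha * Gamma(1.5) = alpha * sqrt(pi)/2
assert abs(weibull_mttf(alpha, 2.0) - alpha * math.sqrt(math.pi) / 2) < 1e-9
```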
110.12 Combinatorial Aspects
Analysis of complex systems is facilitated by decomposition into functional entities consisting of subsystems
or units and by the application of combinatorial considerations and network modeling techniques.
A series or chain structure consisting of n units is shown in Fig. 110.2. From the reliability point of view,
the system will succeed only if all the units succeed. The units may or may not be physically in series. If R_i is
the probability of success of the ith unit, then the series system reliability R_s is given as

R_s = Π[i=1 to n] R_i    (110.21)
if the units do not interact with each other. If they do, then the conditional probabilities must be carefully
evaluated.
If each of the units has a constant hazard, then

R_s(t) = Π[i=1 to n] exp(−λ_i t)    (110.22)

where λ_i is the constant failure rate for the ith unit or component. This enables us to replace the n components
in series by an equivalent component with a constant hazard λ_s, where

λ_s = Σ[i=1 to n] λ_i    (110.23)

If the components are identical, then λ_s = nλ and the MTTF for the equivalent component is (1/n) of the MTTF
of one component.
A parallel structure consisting of n units is shown in Fig. 110.3. From the reliability point of view, the system
will succeed if any one of the n units succeeds. Once again, the units may or may not be physically or topologically
in parallel. If Q_i is the probability of failure of the ith unit, then the parallel system reliability R_p is given as

R_p = 1 − Π[i=1 to n] Q_i    (110.24)

if the units do not interact with each other (meaning they are independent).
FIGURE 110.2  Series or chain structure (units 1 through n connected from cause to effect).

FIGURE 110.3  Parallel structure (units 1 through n connected between cause and effect).
If each of the units has a constant hazard, then

R_p(t) = 1 − Π[i=1 to n] [1 − exp(−λ_i t)]    (110.25)

and we do not have the luxury of being able to replace the parallel system by an equivalent component with a
constant hazard. The parallel system does not exhibit constant-hazard behavior even though each of the units
has a constant hazard.
The MTTF of the parallel system can be obtained by using Eq. (110.25) in (110.6). The results for the case
of components with identical hazards λ are: (1.5/λ), (1.833/λ), and (2.083/λ) for n = 2, 3, and 4, respectively.
The largest gain in MTTF is obtained by going from one component to two components in parallel. It is
uncommon to have more than two or three components in a truly parallel configuration because of the cost
involved. For two non-identical components in parallel with hazard rates of λ₁ and λ₂, the MTTF is given as

MTTF = 1/λ₁ + 1/λ₂ − 1/(λ₁ + λ₂)    (110.26)

An r-out-of-n structure, also known as a partially redundant system, can be evaluated using Eq. (110.11).
If all the components are identical, independent, and have a constant hazard of λ, then the system reliability
can be expressed as

R(t) = Σ[k=r to n] ⁿC_k e^(−kλt) (1 − e^(−λt))ⁿ⁻ᵏ    (110.27)

For r = 1, the structure becomes a parallel system, and for r = n, it becomes a series system.
Series-parallel systems are evaluated by repeated application of the expressions derived for series and parallel
configurations by employing the well-known network reduction techniques.
Several general techniques are available for evaluating the reliability of complex structures that do not come
under purely series, parallel, or series-parallel configurations. They range from inspection to cutset and tieset methods and
connection matrix techniques that are amenable to computer programming.
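The series and parallel combinations of Eqs. (110.21) and (110.24), together with the parallel MTTF figures quoted above, can be sketched as follows; the harmonic-sum form of the parallel MTTF is the closed-form result of integrating Eq. (110.25) for identical units:

```python
def series_R(rs):
    # Eq. (110.21): product of unit reliabilities
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel_R(rs):
    # Eq. (110.24): 1 minus product of unit unreliabilities
    out = 1.0
    for r in rs:
        out *= (1.0 - r)
    return 1.0 - out

def parallel_mttf_identical(n, lam):
    # MTTF of n identical constant-hazard units in parallel:
    # integrating Eq. (110.25) gives (1/lam) * (1 + 1/2 + ... + 1/n)
    return sum(1.0 / k for k in range(1, n + 1)) / lam

lam = 1.0
# matches the (1.5/lambda), (1.833/lambda), (2.083/lambda) values in the text
assert abs(parallel_mttf_identical(2, lam) - 1.5) < 1e-9
assert abs(parallel_mttf_identical(3, lam) - 1.8333333333) < 1e-6
assert abs(parallel_mttf_identical(4, lam) - 2.0833333333) < 1e-6
```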
110.13 Modeling Maintenance
Maintenance of a component could be a scheduled (or preventive) one or a forced (corrective) one. The latter
follows in-service failures and can be handled using Markov models discussed later. Scheduled maintenance is
conducted at fixed intervals of time, irrespective of the system continuing to operate satisfactorily.
Scheduled maintenance, under ideal conditions, takes very little time (compared to the time between main-
tenances) and the component is restored to an “as new” condition. Even if the component is not repairable,
scheduled maintenance postpones failure and prolongs the life of the component. Scheduled maintenance
makes sense only for those components with increasing hazard rates. Most mechanical systems come under
this category. It can be shown that the density function f_T*(t) with scheduled maintenance included can be
expressed as

f_T*(t) = Σ[k=0 to ∞] f₁(t − kT_M) Rᵏ(T_M)    (110.28)
where

f₁(t) = f_T(t) for 0 < t ≤ T_M, and 0 otherwise    (110.29)

R(t) = component reliability function
T_M = time between maintenances, a constant
f_T(t) = original failure density function

In Eq. (110.28), k = 0 is used only between t = 0 and t = T_M; k = 1 is used only between t = T_M and t = 2T_M;
and so on.
A typical f_T*(t) is shown in Fig. 110.4. The time scale is divided into equal intervals of T_M each. The function
in each segment is a scaled-down version of the one in the previous segment, the scaling factor being equal to
R(T_M). Irrespective of the nature of the original failure density function, scheduled maintenance gives it an
exponential tendency. This is another justification for the widespread use of the exponential distribution in system
reliability evaluations.
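The geometric scaling of successive segments in Eq. (110.28) can be sketched numerically, here assuming (purely for illustration) a Rayleigh failure density:

```python
import math

ALPHA, T_M = 300.0, 100.0  # assumed Rayleigh scale and maintenance interval

def R(t):
    return math.exp(-((t / ALPHA) ** 2))

def f(t):
    # original failure density for the assumed Rayleigh law: f = -dR/dt
    return (2.0 * t / ALPHA**2) * R(t)

def f_star(t):
    # Eq. (110.28): on the k-th segment, f*(t) = f1(t - k*T_M) * R(T_M)**k
    k = int(t // T_M)
    return f(t - k * T_M) * R(T_M) ** k

# the density at corresponding points of successive segments decays
# geometrically, with ratio R(T_M) -- the "exponential tendency":
ratio = f_star(150.0) / f_star(50.0)
assert abs(ratio - R(T_M)) < 1e-12
```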
110.14 Markov Models
Of the different Markov models available, the discrete-state continuous-time Markov process has found many
applications in system reliability evaluation, including the modeling of repairable systems. The model consists
of a set of discrete states, called the state space, in which the system can reside and a set of transition rates
between appropriate states. Using these, a set of first-order differential equations is derived in the standard
vector-matrix form for the time-dependent probabilities of the various states. Solution of these equations
incorporating proper initial conditions gives the probabilities of the system residing in different states as
functions of time. Several useful results can be gleaned from these functions.
110.15 Binary Model for a Repairable Component
The binary model for a repairable component assumes that the component can exist in one of two states: the
UP state (S₀) or the DOWN state (S₁). The transition rates between these two states are assumed to be constant
FIGURE 110.4  Density function with ideal scheduled maintenance incorporated. Each segment of width T_M repeats f₁(t) scaled by successive powers of R(T_M).
and equal to λ and μ, respectively. These transition rates are the constant failure and repair rates implied in the modeling
process, and their reciprocals are the MTTF and MTTR, respectively. Figure 110.5 illustrates the binary model.
The associated Markov differential equations are

[P₀′(t)]   [−λ   μ] [P₀(t)]
[P₁′(t)] = [ λ  −μ] [P₁(t)]    (110.30)

with the initial conditions

[P₀(0)]   [1]
[P₁(0)] = [0]    (110.31)

The coefficient matrix of the Markov differential equations, namely

[−λ   μ]
[ λ  −μ]

is obtained by transposing the matrix of rates of departures

[0   λ]
[μ   0]

and replacing the diagonal entries by the negative of the sum of all the other entries in their respective columns.
Solution of (110.30) with the initial conditions given by (110.31) yields:

P₀(t) = μ/(λ + μ) + [λ/(λ + μ)] e^(−(λ+μ)t)    (110.32)

P₁(t) = [λ/(λ + μ)] [1 − e^(−(λ+μ)t)]    (110.33)

The limiting, or steady-state, probabilities are found by letting t → ∞. They are also known as the limiting
availability A and limiting unavailability U, and they are

P₀ ≜ A = μ/(λ + μ)  and  P₁ ≜ U = λ/(λ + μ)    (110.34)

The time-dependent A(t) and U(t) are simply P₀(t) and P₁(t), respectively.
Referring back to Eq. (110.14) for a constant hazard component and comparing it with Eq. (110.32) which
incorporates repair, the difference between R(t) and A(t) becomes obvious. Availability A(t) is the probability
FIGURE 110.5  State space diagram for a single repairable component (S₀: component up; S₁: component down).
that the component is up at time t and reliability R(t) is the probability that the system has continuously
operated from 0 to t. Thus, R(t) is much more stringent than A(t). While both R(0) and A(0) are unity, R(t)
drops off rapidly as compared to A(t) as time progresses. With a small value of MTTR (or large value of m), it
is possible to realize a very high availability for a repairable component.
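The contrast between R(t) and A(t) can be sketched using Eqs. (110.14), (110.32), and (110.34), with arbitrary illustrative rates chosen so that MTTR << MTTF:

```python
import math

LAM, MU = 0.01, 1.0   # assumed failure and repair rates, per hour

def A(t):
    # Eq. (110.32): availability P0(t) of a repairable binary component
    return MU / (LAM + MU) + (LAM / (LAM + MU)) * math.exp(-(LAM + MU) * t)

def R(t):
    # Eq. (110.14): reliability of the same component without repair
    return math.exp(-LAM * t)

assert abs(A(0.0) - 1.0) < 1e-12 and abs(R(0.0) - 1.0) < 1e-12
# A(t) settles to the steady-state availability mu/(lam+mu) (Eq. 110.34),
# while R(t) keeps decaying toward zero:
A_ss = MU / (LAM + MU)
assert abs(A(1000.0) - A_ss) < 1e-6
assert R(1000.0) < A(1000.0)
```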
110.16 Two Dissimilar Repairable Components
Irrespective of whether the two components are in series or in parallel, the state space consists of four possible
states: S₁ (1 up, 2 up), S₂ (1 down, 2 up), S₃ (1 up, 2 down), and S₄ (1 down, 2 down). The actual
system configuration will determine which of these four states corresponds to system success and failure. The
associated state-space diagram is shown in Fig. 110.6. Analysis of this system results in the following steady-
state probabilities:

P₁ = μ₁μ₂/Denom;  P₂ = λ₁μ₂/Denom;  P₃ = λ₂μ₁/Denom;  P₄ = λ₁λ₂/Denom    (110.35)

where

Denom ≜ (λ₁ + μ₁)(λ₂ + μ₂)    (110.36)
For components in series, A = P₁ and U = (P₂ + P₃ + P₄), and the two components can be replaced by an
equivalent component with a failure rate of λ_s = (λ₁ + λ₂) and a mean repair duration of r_s, where

r_s ≅ (λ₁r₁ + λ₂r₂)/λ_s    (110.37)

Extending this to n components in series, the equivalent system will have

λ_s = Σ[i=1 to n] λ_i  and  r_s ≅ (1/λ_s) Σ[i=1 to n] λ_i r_i    (110.38)

and system unavailability U_s ≅ λ_s r_s = Σ[i=1 to n] λ_i r_i    (110.39)
FIGURE 110.6  State space diagram for two dissimilar repairable components (U: up, D: down; transitions with rates λ₁, μ₁, λ₂, μ₂).
For components in parallel, A = (P₁ + P₂ + P₃) and U = P₄, and the two components can be replaced by an
equivalent component with

λ_p ≅ λ₁λ₂(r₁ + r₂)  and  r_p = r₁r₂/(r₁ + r₂)    (110.40)

and system unavailability U_p = λ_p r_p (= λ₁λ₂r₁r₂)    (110.41)

Extension to more than two components in parallel follows similar lines. For three components in parallel,

λ_p ≅ λ₁λ₂λ₃(r₁r₂ + r₂r₃ + r₃r₁)  and  U_p = λ_p r_p = λ₁λ₂λ₃r₁r₂r₃    (110.42)
110.17 Two Identical Repairable Components
In this case, only three states are needed to complete the state space: S₁ (both up), S₂ (one up and
one down), and S₃ (both down). The corresponding state space diagram is shown in Fig. 110.7. Analysis of
this system results in the following steady-state probabilities:

P₁ = [μ/(λ + μ)]²;  P₂ = 2[λ/(λ + μ)][μ/(λ + μ)];  P₃ = [λ/(λ + μ)]²    (110.43)
110.18 Frequency and Duration Techniques
The expected residence time in a state is the mean value of the passage time from the state in question to any
other state. Cycle time is the time required to complete an “in” and “not-in” cycle for that state. Frequency of
occurrence (or encounter) for a state is the reciprocal of its cycle time. It can be shown that the frequency of
occurrence of a state is equal to the steady-state probability of being in that state times the total rate of departure
from it. Also, the expected value of the residence time is equal to the reciprocal of the total rate of departure
from that state.
Under steady-state conditions, the expected frequency of entering a state must be equal to the expected
frequency of leaving that state (this assumes that the system is “ergodic”, which will not be elaborated for lack
FIGURE 110.7  State space diagram for two identical repairable components (S₁: both up; S₂: one up, one down; S₃: both down; transition rates 2λ, μ, λ, 2μ).
of space). Using this principle, frequency balance equations can easily be written (one for each state) and solved,
in conjunction with the fact that the sum of the steady-state probabilities of all the states must equal unity,
to obtain the steady-state probabilities. This procedure is much simpler than solving the Markov differential
equations and letting t → ∞.
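The frequency-balance procedure can be sketched for the two-identical-component model of Fig. 110.7; the rate values are arbitrary, and the result is checked against the closed forms of Eq. (110.43):

```python
# Frequency balance for the two-identical-component model of Fig. 110.7:
# equate frequency in = frequency out for each state, then normalize,
# instead of solving the Markov differential equations.
LAM, MU = 0.02, 0.5  # assumed failure and repair rates, per hour

# balance equations: 2*lam*P1 = mu*P2  and  lam*P2 = 2*mu*P3
p1 = 1.0                       # unnormalized
p2 = 2.0 * LAM * p1 / MU
p3 = LAM * p2 / (2.0 * MU)
total = p1 + p2 + p3           # normalization: probabilities sum to 1
P1, P2, P3 = p1 / total, p2 / total, p3 / total

# matches the closed forms of Eq. (110.43)
d = LAM + MU
assert abs(P1 - (MU / d) ** 2) < 1e-12
assert abs(P2 - 2 * LAM * MU / d**2) < 1e-12
assert abs(P3 - (LAM / d) ** 2) < 1e-12
```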
110.19 Applications of Markov Process
Once the different states are identified and a state-space diagram is developed, Markov analysis can proceed
systematically (probably with the help of a computer in the case of large systems) to yield a wealth of results
useful in system reliability evaluation. Inclusion of installation time after repair, maintenance, spare, standby
systems, and limitations imposed by restricted repair facilities are some of the many problems that can be
studied.
110.20 Some Useful Approximations
1. For an r-out-of-n structure with failure and repair rates of λ and μ for each component, the equivalent MTTR and
MTTF can be approximated as

MTTR_eq = (MTTR of one component)/(n − r + 1)    (110.44)

MTTF_eq = (MTTF of one component)(MTTF/MTTR)ⁿ⁻ʳ [(r − 1)!(n − r)!/n!]    (110.45)
2. The influence of weather must be considered for components operating in an outdoor environment. If λ and λ′
are the normal weather and stormy weather failure rates, λ′ will be much greater than λ, and the average failure
rate λ_f can be approximated as

λ_f ≅ [N/(N + S)] λ + [S/(N + S)] λ′    (110.46)
where N and S are the expected durations of normal and stormy weather.
3. For well-designed high-reliability components, the failure rate λ will be very small and λt << 1. Then, for
a single component,

R(t) ≅ 1 − λt  and  Q(t) ≅ λt    (110.47)

and for n dissimilar components in series,

R(t) ≅ 1 − Σ[i=1 to n] λ_i t  and  Q(t) ≅ Σ[i=1 to n] λ_i t    (110.48)

For the case of n identical components in parallel,

R(t) ≅ 1 − (λt)ⁿ  and  Q(t) ≅ (λt)ⁿ    (110.49)
For the case of an r-out-of-n configuration,

Q(t) ≅ ⁿC₍ₙ₋ᵣ₊₁₎ (λt)ⁿ⁻ʳ⁺¹    (110.50)
The approximations detailed in (3) are called rare-event approximations.
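The quality of the rare-event approximations in Eqs. (110.47) and (110.49) can be checked against the exact exponential expressions; the rate and mission time below are arbitrary illustrative values satisfying λt << 1:

```python
import math

LAM, T = 1e-4, 10.0   # assumed rate (per hour) and mission time; LAM*T = 1e-3

# single component, Eq. (110.47): Q(t) ~ lam*t
exact_Q = 1.0 - math.exp(-LAM * T)
approx_Q = LAM * T
assert abs(exact_Q - approx_Q) / exact_Q < 1e-3

# n identical components in parallel, Eq. (110.49): Q(t) ~ (lam*t)**n
n = 3
exact_Qp = (1.0 - math.exp(-LAM * T)) ** n
approx_Qp = (LAM * T) ** n
assert abs(exact_Qp - approx_Qp) / exact_Qp < 5e-3
```

As λt grows toward unity, these relative errors grow quickly, which is why the approximations are reserved for rare events.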
110.21 Application Aspects
Electronic systems utilize large numbers of similar components over which the designer has very little control.
Quality control methods can be used in the procurement and manufacturing phases. However, the circuit
designer has no control over the design reliability of the devices except in cases such as custom-designed
integrated circuits. In addition, electronic components cannot be inspected easily because of encapsulation.
Although gross defects can be easily detected by suitable testing processes, defects that are not immediately
effective (for example, weak mechanical bond of a lead-in conductor, material flaws in semiconductors, defective
sealing, etc.) primarily contribute to unreliability. Temperature and voltage are the predominant failure-accel-
erating stresses for the majority of electronic components. As weaker components fail and are replaced by better
ones, the percentage of defects in a population is reduced, resulting in a decreasing hazard rate. Wearout is
rarely of significance in the failure of electronic components and systems. The designer should be careful to
ensure that the loads (voltage, current, temperature) are within rated values and strive for a design that
minimizes hot spots and temperature rises. Parameter drifts and accidental short circuits at connections can
also lead to system failures. The circuit designer can follow a few basic rules to significantly improve electronic
system reliability: reduce the number of adjustable components; avoid selection of components on the basis of
parameter values obtained by testing; assemble components such that adjustments are easily accessible; and
partition circuits into subassemblies for easy testing and diagnosis of problems.
Power systems are expected to provide all customers a reliable supply of electric power upon which much
of modern life depends. Power systems are also very large, consisting of scores of large generators, hundreds
of miles of high-voltage transmission lines, thousands of miles of distribution lines, along with the necessary
transformers, switchgear, and substations interconnecting them. Reliability at the customer level can be
improved by additional investment; the challenge is to balance reliability and the associated investment cost
against the cost of energy charged to customers. This should be done in the presence of a number of random
inputs and events: generator outages, line outages (which are highly weather dependent), random component
outages, and uncertainties in the load demand (which is also weather dependent). Probabilistic techniques to
evaluate power system reliability have been used effectively to resolve this problem. The system is
divided into a number of subsystems and each one is analyzed separately. Then, composite system reliability
evaluation techniques are employed to combine the results and arrive at a number of quantifiable reliability
indices as inputs to managerial decisions. The major subsystems involved are generation, transmission, distri-
bution, substations, and protection systems. Care should be taken to ensure that reliabilities of different parts
of the system conform to each other and that no part of the system is unusually strong or weak. Obviously,
different levels of reliability will be required for different parts of the system depending on the impacts of
failures at different points on the interconnected power system.
110.22 Reliability and Economics
Reliability and economics are very closely related. Issues such as the level of reliability required, the amount of
additional expenditures justified, where to invest the additional resources to maximize reliability, how to achieve
a certain level of overall reliability at minimum cost, and how to assess the cost of failures and monetary
equivalent of non-monetary items are all quite complex and not purely technical. However, once managerial
decisions are made and reliability goals are set, certain well-proven techniques such as incorporating redun-
dancy, improving maintenance procedures, selecting better quality components, etc. can be employed by the
designer to achieve the goals.
Defining Terms
Availability: The availability A(t) is the probability that a system is performing its required function success-
fully at time t. The steady-state availability A is the fraction of time that an item, system, or component
is able to perform its specified or required function.
Bathtub curve: For most physical components and living entities, the plot of failure (or hazard) rate vs. time
has the shape of the longitudinal cross-section of a bathtub and hence its name.
Hazard rate function: The plot of instantaneous failure rate vs. time is called the hazard function. It clearly
and distinctly exhibits the different life cycles of the component.
MTTF: The mean time to failure is the mean or expected value of “time to failure”.
Parallel structure: Also known as a completely redundant system, it describes a system that can succeed when
at least one of two or more components succeeds.
Redundancy: Refers to the existence of more than one means, identical or otherwise, for accomplishing a
task or mission.
Reliability: The reliability R(t) of an item or system is the probability that it has performed successfully over
the time interval from 0 to t. In the case of non-repairable systems, R(t) = A(t). With repair, R(t) ≤ A(t).
Series structure: Also known as a chain structure or non-redundant system, it describes a system whose
success depends on the success of all of its components.
Related Topics
23.2 Testing • 98.5 Mean Time to Failure • 98.10 Markov Modeling • 98.12 Reliability Calculations for Real
Time Systems
References
R. Billinton and R.N. Allan, Reliability Evaluation of Engineering Systems: Concepts and Techniques, 2nd ed.,
New York: Plenum, 1992.
E.E. Lewis, Introduction to Reliability Engineering, New York: John Wiley & Sons, 1987.
M.G. Pecht and F.R. Nash, “Predicting the reliability of electronic equipment”, Proc. IEEE, 82(7), 992–1004,
1994.
R. Ramakumar, Engineering Reliability: Fundamentals and Applications, Englewood Cliffs, N.J.: Prentice–Hall,
1993.
M.L. Shooman, Probabilistic Reliability: An Engineering Approach, 2nd ed., Malabar, Fla.: R.E. Krieger Pub-
lishing Company, 1990.
For Further Information
R. Billinton and R.N. Allan, Reliability Evaluation of Power Systems, London, England: Pitman Advanced
Publishing Program, 1984.
A.E. Green and A.J. Bourne, Reliability Technology, New York: Wiley-Interscience, 1972.
E.J. Henley and H. Kumamoto, Probabilistic Risk Assessment—Reliability Engineering, Design, and Analysis, New
York: IEEE Press, 1991.
IEEE Transactions on Reliability, New York: Institute of Electrical and Electronics Engineers.
P.D.T. O’Connor, Practical Reliability Engineering, 3rd ed. New York: John Wiley & Sons, 1985.
Proceedings: Annual Reliability and Maintainability Symposium, New York: Institute of Electrical and Electronics
Engineers.
D.P. Siewiorek and R.S. Swarz, The Theory and Practice of Reliable System Design, Bedford, Mass.: Digital Press,
1982.
K.S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Englewood
Cliffs, N.J.: Prentice-Hall, 1982.
A. Villemeur, Reliability, Availability, Maintainability and Safety Assessment, Volumes I and II, New York: John
Wiley & Sons, 1992.