Feng, T. “Parallel Processors”
The Electrical Engineering Handbook
Ed. Richard C. Dorf
Boca Raton: CRC Press LLC, 2000
95
Parallel Processors
Tse-yun Feng
The Pennsylvania State University
95.1 Introduction
95.2 Classifications
95.3 Types of Parallel Processors
Ensemble Processors • Array Processors • Associative Processor
95.4 System Utilization
95.1 Introduction
A computer usually consists of four major components: the arithmetic-logic unit (ALU), the main memory
unit (MU), the input/output unit (I/O), and the control unit (CU). Such a computer is known as a uniprocessor
since the processing is achieved by operating on one word or word pair at a time. In order to increase the
computer performance, we may improve the device technology to reduce the switching (gate delay) time.
Indeed, over the past half century we have seen switching times improve from the 200 to 300 ms of relays to the subnanosecond delays of present-day very large scale integration (VLSI) circuits. As the switching speeds of computer
devices approach a limit, however, any further significant improvement in performance is more likely to be in
increasing the number of words or word pairs that can be processed simultaneously. For example, we may use
one ALU in a uniprocessor to compute N sets of additions one at a time, in N steps, or we may design a computer system with N ALUs that adds all N sets at once. Conceptually, such a computer system may still consist of the four major
components mentioned previously except that there are N ALUs. An organization with multiple ALUs under
the control of a single CU is called a parallel processor. To make a parallel processor more efficient and cost-
effective, a fifth major component, called interconnection networks, is usually required to facilitate the inter-
processor and processor-memory communications. In addition, each ALU requires not only its own registers
but also network interfaces; the expanded ALU is then called a processing element (PE). Figure 95.1 shows a
block diagram of a parallel processor.
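The contrast between the two approaches can be illustrated with a minimal Python sketch, using small hypothetical operand lists to stand in for the N operand pairs held by the PEs:

```python
# Illustrative sketch only: lists stand in for the N operand pairs.
a = [3, 1, 4, 1, 5, 9, 2, 6]          # N = 8 operands
b = [2, 7, 1, 8, 2, 8, 1, 8]

# Uniprocessor: one ALU, N sequential steps.
sums_sequential = []
for i in range(len(a)):               # each iteration models one sequential add step
    sums_sequential.append(a[i] + b[i])

# Parallel processor: N ALUs, conceptually a single step.
# (The comprehension below still runs serially on a real CPU; it only models
# the one broadcast "add" instruction executed by all PEs at once.)
sums_parallel = [x + y for x, y in zip(a, b)]

assert sums_sequential == sums_parallel
print(sums_parallel)                   # [5, 8, 5, 9, 7, 17, 3, 14]
```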
95.2 Classifications
Flynn has classified computer systems according to the multiplicity of instruction and data streams, where
computers are partitioned into four groups [Flynn, 1966]:
1. Single instruction stream, single data stream (SISD): The conventional, word-sequential architecture including pipelined computers (usually with parallel ALU)
2. Single instruction stream, multiple data stream (SIMD): The multiple ALU-type architectures (e.g., parallel/array processor). The ALU may be either bit-serial or bit-parallel.
3. Multiple instruction stream, single data stream (MISD): Not as practical as the other classes.
4. Multiple instruction stream, multiple data stream (MIMD): The multiprocessor system.
As a general rule, one could conclude that SISD and SIMD machines are single CU systems, whereas MIMD
machines are multiple CU systems. Flynn’s classification does not address the interactions among the processing
modules and the methods by which processing modules in a concurrent system are controlled. As a result, one
can classify both uniprocessors and pipelined computers as SISD machines, because both instructions and data
are provided sequentially.
We may also classify computer systems according to the number of bits or bit pairs a computer executes at
any instant [Feng, 1972]. For example, a computer may perform operations on one bit or bit pair at a time
through the use of a simple serial ALU. For an M-bit word or operand, the operation repeats M times (Point A
in Fig. 95.2). To speed up the processing, a parallel ALU is usually used so that all bits of a word can be operated
on simultaneously. This is how a conventional word-sequential computer executes on its operands (Point B in
Fig. 95.2). A parallel processor may operate on either (a) all the ith bits of N operands or operand pairs (i.e., a bit slice, or bis) or (b) all N M-bit operands or operand pairs simultaneously (Points C and D in Fig. 95.2,
respectively). Figure 95.2 also shows some of the systems in this classification. It is seen from this classification
that the performance of a computer is proportional to the total number of bits or bit pairs it can execute
simultaneously.
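A minimal sketch, with illustrative word lengths and word counts (the (M, N) pairs below are assumptions, not values from the text), shows how this measure of bits processed per step distinguishes the four operating points:

```python
# Feng's measure: bits (or bit pairs) processed simultaneously = M * N,
# where M is the word length and N is the number of words handled at once.
points = {
    "A: bit-serial, word-serial   (serial ALU)":       (1, 1),
    "B: bit-parallel, word-serial (uniprocessor)":     (32, 1),
    "C: bit-serial, word-parallel (bit-slice)":        (1, 256),
    "D: bit-parallel, word-parallel (parallel proc.)": (32, 256),
}

for name, (m, n) in points.items():
    print(f"{name}: {m * n} bit(s) processed per step")
```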
Feng’s classification [Hwang and Briggs, 1984] was originally intended for parallel processors, and as a result,
the number of CUs in a computer system was not specified. Händler extended Feng’s classification by adding
a third dimension, namely, the number of CUs. Pipelined systems are also included in this classification through
additional parameters [Händler, 1977].
95.3 Types of Parallel Processors
Ensemble Processors
An ensemble system is an extension of the conventional uniprocessor system. It is a collection of N PEs (a PE
here consists of an ALU, a set of local registers, and limited local control capability) and N MUs, under the
control of a single CU. Thus, the organization of an ensemble processor is similar to that shown in Fig. 95.1
except that there are no direct interprocessor and processor-memory communications, i.e., no interconnection
networks. When the need for communication arises, it is done through the CU. This slows down the system
for applications requiring extensive interprocessor and processor-memory communications. For example, the
sum of two matrices A and B can be executed in one step, if R² PEs are available in an ensemble processor,
where R is the rank of the matrices. On the other hand, the product of the same two matrices requires extensive
data alignment between the elements of A and B. As a result, it is ineffective for performing matrix multipli-
cations with an ensemble processor. Therefore, while ensemble processors are capable of executing up to
N identical jobs simultaneously, they have very limited applications. Parallel element processing ensemble
(PEPE) [Evensen and Troy, 1973] is an example of such parallel processors.
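A minimal sketch, assuming R² PEs with one matrix element per PE, makes the contrast concrete: each element of the sum depends only on data already local to its PE, whereas each element of the product requires a full row of A and a full column of B, which an ensemble processor could gather only through the CU.

```python
R = 3
A = [[i + j for j in range(R)] for i in range(R)]
B = [[i * j for j in range(R)] for i in range(R)]

# Matrix sum: PE(i, j) needs only its own A[i][j] and B[i][j] -> one parallel step.
C_sum = [[A[i][j] + B[i][j] for j in range(R)] for i in range(R)]

# Matrix product: PE(i, j) needs row i of A and column j of B -> elements must be
# gathered from other PEs, which an ensemble processor can do only through the CU,
# one transfer at a time.
C_prod = [[sum(A[i][k] * B[k][j] for k in range(R)) for j in range(R)]
          for i in range(R)]

print(C_sum)
print(C_prod)
```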
FIGURE 95.1 A basic parallel processor organization.
Array Processors
Because of the need for interprocessor and processor-memory communication for most applications, a parallel
processor usually has one or more circuits (known as interconnection networks) to support various applications
for efficient processing. In general, an array processor may consist of N identical PEs under the control of a
single CU and a number of MUs. Within each PE there are circuits for network interface as well as its own
local memories. The PEs and MUs communicate with each other through an interconnection network. A typical
array processor organization is shown in Fig. 95.3. Depending on the design, each PE may perform serial-by-
bit (as in MPP) or parallel-by-bit (as in ILLIAC IV) operations.
As can be seen from Fig. 95.3, the interconnection networks play a very important role in parallel processors.
The network usually provides a uniform interconnection among PEs on one hand and PEs and MUs on the
other. Different array processor organizations might use different interconnection networks. In general, the
interconnection networks can be classified into two categories: static and dynamic, as shown in Fig. 95.4.
ILLIAC IV [Barnes et al., 1968] and MPP [Batcher, 1979] are examples of parallel processors using static
interconnections, while STARAN [Batcher, 1973] and BSP [Kuck and Stokes, 1982] are examples using dynamic
interconnections.
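As a sketch of a static interconnection, the following assumes an ILLIAC IV-style mesh of N = n × n PEs in which PE k is linked to PEs k ± 1 and k ± n, with end-around (modulo N) connections:

```python
# Neighbours of PE k in an n x n mesh with end-around connections:
# each PE k is linked to PEs k-1, k+1, k-n, and k+n (all taken modulo N = n*n).
def mesh_neighbours(k: int, n: int) -> list[int]:
    N = n * n
    return [(k - 1) % N, (k + 1) % N, (k - n) % N, (k + n) % N]

# For an 8 x 8 array (64 PEs), PE 0 is directly connected to PEs 63, 1, 56, and 8;
# data for any other PE must be routed through intermediate PEs.
print(mesh_neighbours(0, 8))    # [63, 1, 56, 8]
print(mesh_neighbours(27, 8))   # [26, 28, 19, 35]
```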
The CU usually has its own high-speed registers, local memory, and arithmetic unit. Thus, in many cases,
it is a conventional computer and the instructions are stored in a main memory, together with data. However,
in some machines such as ILLIAC IV, programs are distributed among the local memories of the PEs. Hence,
the instructions are fetched from the processors’ local memories into an instruction buffer in the CU. Each
FIGURE 95.2 Feng’s classification.
instruction is either a local-type instruction, which is executed entirely within the CU, or a parallel instruction, which is executed in the processing array. The primary function of the CU is to examine each
instruction as it is to be executed and to determine where the execution should take place.
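A minimal sketch of this dispatch decision, with a hypothetical instruction format and hypothetical opcode sets, is:

```python
# Toy model of the CU's dispatch decision: scalar (local) instructions execute
# in the CU itself, while array (parallel) instructions are broadcast to the PEs.
LOCAL_OPS = {"branch", "load_scalar", "add_scalar"}     # executed inside the CU
PARALLEL_OPS = {"add_vector", "mul_vector", "route"}    # executed by the PE array

def dispatch(instruction: str) -> str:
    op = instruction.split()[0]
    if op in LOCAL_OPS:
        return f"{instruction!r}: executed in the CU"
    if op in PARALLEL_OPS:
        return f"{instruction!r}: broadcast to the processing array"
    return f"{instruction!r}: unknown opcode"

for instr in ["load_scalar r1, 100", "add_vector v1, v2, v3", "branch loop"]:
    print(dispatch(instr))
```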
Associative Processor
Associative memories, also known as content-addressable memories, retrieve information on the basis of data
content rather than addresses. An associative memory performs comparison (i.e., exclusive-OR or equivalence)
operations at its bit level. The results of the comparison on a group of bits in a word for all words in the memory
are transmitted to a register called a response register or flag. In addition, there are circuits such as a multiple-match resolver, an enable/disable register, and a number of temporary registers, as well as appropriate logic gates, for resolving multiple responses and for information retrieval. For associative processors, arithmetic capabilities are
added to this unit. The unit can be viewed as consisting of a number of bit-serial PEs. Furthermore, the bit-
level logic is moved out of the memory so that the memory part of the processor consists of a number of
random-access memories called word modules. A typical associative processor is shown in Fig. 95.5. STARAN
and MPP (Fig. 95.2) are representative of this bit-serial, word-parallel SIMD organization. In Fig. 95.5 the
common register is where the common operand is stored and the mask register defines the bit positions requiring
operation. The enable/disable register provides local control of individual PEs. Because of its simplicity of design, the per-PE cost of an associative processor is much lower, but the bit-serial operations slow down the system drastically. To compensate for this, these systems are useful only for applications requiring a large number of PEs.
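A minimal sketch of the masked-equality search performed in parallel across all words (assuming 8-bit words and hypothetical contents for the word modules, common register, and mask register) is:

```python
# All words are compared against the common register simultaneously; only the
# bit positions selected by the mask register participate in the comparison.
words  = [0b10110010, 0b10010010, 0b01110010, 0b10110110]   # word modules
common = 0b10110010                                          # search operand
mask   = 0b11110000                                          # compare upper 4 bits only

# Response register: one flag per word, set where the masked bits match.
response = [int((w ^ common) & mask == 0) for w in words]
print(response)                                              # [1, 0, 0, 1]
```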
95.4 System Utilization
As discussed previously, for any computer there is a maximum number of bits or bit pairs that can be processed concurrently, whether it is under single-instruction or multiple-instruction control [Feng, 1972, 1973]. This maximum degree of concurrency, or maximum concurrency (C_m), is an indication of the computer-processing capability. The actual utilization of this capability is indicated by the average concurrency, defined to be

C_a = \frac{\sum_i c_i \, \Delta t_i}{\sum_i \Delta t_i}
FIGURE 95.3 An array processor organization. I/O, input/output devices; LM, local memory; PE, processing element; SM,
shared memory.
where c_i is the concurrency at Δt_i. If Δt_i is set to one time unit, then the average concurrency over a period of T time units is

C_a = \frac{1}{T} \sum_{i=1}^{T} c_i

The average hardware utilization is then

\mu = \frac{C_a}{C_m} = \frac{\sum_{i=1}^{T} c_i}{T \, C_m} = \frac{1}{T} \sum_{i=1}^{T} s_i
FIGURE 95.4 Some examples of static (a) and dynamic (b) interconnection networks.
where s_i is the hardware utilization at time i. Whereas C_m is determined by the hardware design, C_a or μ is highly dependent on the software and applications. A general-purpose computer should achieve a high μ for as many applications as possible, whereas a special-purpose computer would yield a high μ for at least the intended applications. In either case, maximizing the value of μ for a computer design is important. This equation can also be used to evaluate the relative effectiveness of machine designs.
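A minimal numerical sketch, using a hypothetical concurrency trace, shows how C_a and μ are obtained from the per-time-unit values c_i:

```python
# Hypothetical trace: bits processed concurrently in each of T = 8 time units
# on a machine whose maximum concurrency is C_m = 64.
c = [64, 48, 64, 16, 0, 32, 64, 40]
C_m = 64
T = len(c)

C_a = sum(c) / T      # average concurrency
mu = C_a / C_m        # average hardware utilization, also (1/T) * sum of the s_i

print(f"C_a = {C_a}, mu = {mu:.3f}")   # C_a = 41.0, mu = 0.641
```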
For a parallel processor, the degree of concurrency is called the degree of parallelism. A similar discussion can be used to define the average hardware utilization of a parallel processor. The maximum parallelism is then P_m, and the average parallelism is

P_a = \frac{\sum_i p_i \, \Delta t_i}{\sum_i \Delta t_i}

or

P_a = \frac{1}{T} \sum_{i=1}^{T} p_i

for T time units. The average hardware utilization becomes

\mu = \frac{P_a}{P_m} = \frac{\sum_{i=1}^{T} p_i}{T \, P_m} = \frac{1}{T} \sum_{i=1}^{T} r_i
FIGURE 95.5 An associative processor organization.
where r_i is the hardware utilization for parallel processors at time i. With appropriate instrumentation, the average hardware utilization of a system can be determined.
In practice, however, it is not always true that every bit or bit pair that is being processed would be productive. Some of the bits produce only repetitious (superfluous) or even meaningless results. This happens more often and more severely in a parallel processor than in a word-sequential processor. Consider, for example, performing a maximum search operation in a mesh-connected parallel processor (such as ILLIAC IV). For N operands, it takes (N/2) log₂ N comparisons (N/2 comparisons for each of log₂ N iterations) instead of the usual N – 1 comparisons in word-sequential machines. Thus, in effect there are

\frac{N}{2}\log_2 N - (N - 1) = \frac{N}{2}(\log_2 N - 2) + 1

comparisons that are nonproductive. If we let P̂_a be the effective parallelism over a period of T time units and μ̂, p̂_i, and r̂_i be the corresponding effective values, the effective hardware utilization is then

\hat{\mu} = \frac{\hat{P}_a}{P_m} = \frac{\sum_{i=1}^{T} \hat{p}_i}{T \, P_m} = \frac{1}{T} \sum_{i=1}^{T} \hat{r}_i
A successful parallel processor design should yield a high μ̂, as well as the required throughput for, at least,
the intended applications. This not only involves a proper hardware and software design but also the develop-
ment of efficient parallel algorithms for these applications.
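A minimal sketch, assuming N is a power of two and that each comparison occupies one PE for one time unit, checks the comparison counts for the maximum-search example:

```python
import math

N = 64   # operands, e.g., one per PE in an 8 x 8 mesh

parallel_comparisons = (N // 2) * int(math.log2(N))   # (N/2) * log2(N)
sequential_comparisons = N - 1                        # word-sequential machine
nonproductive = parallel_comparisons - sequential_comparisons
# Same quantity via the closed form (N/2)(log2 N - 2) + 1:
closed_form = (N // 2) * (int(math.log2(N)) - 2) + 1

print(parallel_comparisons, sequential_comparisons, nonproductive, closed_form)
# 192 63 129 129
print(f"fraction of comparisons that are productive: "
      f"{sequential_comparisons / parallel_comparisons:.3f}")   # 0.328
```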
Suppose T_u is the execution time of an application program using a conventional word-sequential machine, and T_c is the execution time of the same program using a concurrent system; the speed-up ratio is then defined as

S = \frac{T_u}{T_c}
Naturally, for a specific parallel organization, the speed-up ratio determines how well an application program
can utilize the hardware resources. Supporting software has a direct effect on the speed-up ratio.
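A minimal sketch with hypothetical timings:

```python
T_u = 12.0   # seconds on a word-sequential machine (hypothetical value)
T_c = 1.5    # seconds on the concurrent system (hypothetical value)

S = T_u / T_c
print(f"speed-up ratio S = {S}")   # 8.0
```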
Defining Terms
Array processor: A parallel processor consisting of a number of processing elements, memory modules, and
input/output devices as well as interconnection networks under a single control unit.
Associative processor: A parallel processor consisting of a number of processing elements, memory modules,
and input/output devices under a single control unit. The capability of the processing elements is usually
limited to the bit-serial operations.
Ensemble processor: A parallel processor consisting of a number of processing elements, memory modules,
and input/output devices under a single control unit. It has no interconnection network to provide
interprocessor or processor-memory communications.
Interconnection network: A network of interconnections providing interprocessor and processor-memory
communications. It may be static or dynamic, and distributed or centralized.
Parallel processor: A computing system consisting of a number of processors, memory modules, input/output
devices, and other components under the control of a single control unit. It is also known as a single-instruction-stream, multiple-data-stream (SIMD) machine.
Processing element: A basic processor consisting of an arithmetic-logic unit, a number of registers, network
interfaces, and some local control facilities.
Related Topics
18.1 Special Architectures • 96.5 Parallel Processing
References
G.H. Barnes, R.M. Brown, M. Kato, D.J. Kuck, D.L. Slotnick, and R.A. Stokes, “The ILLIAC IV computer,” IEEE
Trans. Comput., vol. C-17, pp. 746–757, 1968.
K.E. Batcher, “STARAN/RADCAP hardware architecture,” Proc. Sagamore Computer Conf. on Parallel Processing,
pp. 147–152, 1973.
K.E. Batcher, “MPP—A massively parallel processor,” Proc. Int. Conf. on Parallel Processing, p. 249, 1979.
A.J. Evensen and J.L. Troy, “Introduction to the architecture of a 288-element PEPE,” Proc. Sagamore Computer
Conference, pp. 162–169, 1973.
T. Feng, “An overview of parallel processing systems,” 1972 WESCON Tech. Papers, Session 1 — “Parallel
Processing Systems,” pp. 1–2, 1972.
T. Feng, Parallel Processing Characteristics & Implementation of Data Manipulating Functions, Technical Report
RADC-TR-73-189, July 1973.
M.J. Flynn, “Very high speed computing systems,” Proc. IEEE, vol. 54(12), pp. 1901–1909, 1966.
W. Händler, “The impact of classification schemes on computer architecture,” Proc. Int. Conf. on Parallel
Processing, pp. 7–15, 1977.
K. Hwang and F.A. Briggs, Computer Architecture and Parallel Processing, New York: McGraw-Hill, 1984.
D.J. Kuck and R.A. Stokes, “The Burroughs Scientific Processor (BSP),” IEEE Trans. Comput., vol. C-31(5), pp.
363–376, 1982.
Further Information
Proceedings of International Conference on Parallel Processing: An annual conference held since 1972. Recent
proceedings published by CRC Press.
IEEE Transactions on Parallel and Distributed Systems: Started in 1990 as a quarterly, now a monthly, published
by the IEEE Computer Society.
Journal of Parallel and Distributed Computing: A monthly published by Academic Press.
Computer Architecture and Parallel Processing: A book by K. Hwang and F. A. Briggs published by McGraw-Hill.