95 Parallel Processors

Tse-yun Feng
The Pennsylvania State University

95.1 Introduction
95.2 Classifications
95.3 Types of Parallel Processors
    Ensemble Processors • Array Processors • Associative Processor
95.4 System Utilization

95.1 Introduction

A computer usually consists of four major components: the arithmetic-logic unit (ALU), the main memory unit (MU), the input/output unit (I/O), and the control unit (CU). Such a computer is known as a uniprocessor since the processing is achieved by operating on one word or word pair at a time. In order to increase computer performance, we may improve the device technology to reduce the switching (gate delay) time. Indeed, for the past half century we have seen switching speeds improve from the 200 to 300 ms of relays to the subnanosecond times of present-day very large scale integration (VLSI) circuits. As the switching speeds of computer devices approach a limit, however, any further significant improvement in performance is more likely to come from increasing the number of words or word pairs that can be processed simultaneously. For example, we may use one ALU to compute N sets of additions in N steps in a uniprocessor, or we may design a computer system with N ALUs to add all N sets at once. Conceptually, such a computer system may still consist of the four major components mentioned previously except that there are N ALUs. An organization with multiple ALUs under the control of a single CU is called a parallel processor. To make a parallel processor more efficient and cost-effective, a fifth major component, the interconnection network, is usually required to facilitate interprocessor and processor-memory communications. In addition, each ALU requires not only its own registers but also network interfaces; the expanded ALU is then called a processing element (PE). Figure 95.1 shows a block diagram of a parallel processor.

95.2 Classifications

Flynn has classified computer systems according to the multiplicity of instruction and data streams, where computers are partitioned into four groups [Flynn, 1966]:

1. Single instruction stream, single data stream (SISD): The conventional, word-sequential architecture, including pipelined computers (usually with a parallel ALU).
2. Single instruction stream, multiple data stream (SIMD): The multiple-ALU-type architectures (e.g., parallel/array processors). The ALU may be either bit-serial or bit-parallel.
3. Multiple instruction stream, single data stream (MISD): Not as practical as the other classes.
4. Multiple instruction stream, multiple data stream (MIMD): The multiprocessor system.

As a general rule, one could conclude that SISD and SIMD machines are single-CU systems, whereas MIMD machines are multiple-CU systems. Flynn's classification does not address the interactions among the processing modules or the methods by which processing modules in a concurrent system are controlled. As a result, one can classify both uniprocessors and pipelined computers as SISD machines, because both instructions and data are provided sequentially.

We may also classify computer systems according to the number of bits or bit pairs a computer executes at any instant [Feng, 1972]. For example, a computer may perform operations on one bit or bit pair at a time through the use of a simple serial ALU. For an M-bit word or operand, the operation repeats M times (Point A in Fig. 95.2).
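As a concrete illustration of the bit-serial case just described (Point A), the sketch below adds two M-bit operands one bit pair at a time, so the one-bit ALU is exercised M times per operand pair. It is a minimal Python model, not the datapath of any actual machine; the function name and the 8-bit word length are illustrative assumptions.

```python
def bit_serial_add(a, b, m=8):
    """Add two m-bit operands one bit pair at a time (Point A in Feng's
    classification): a one-bit ALU is reused m times per operand pair."""
    carry, result = 0, 0
    for i in range(m):                              # one ALU pass per bit position
        ai, bi = (a >> i) & 1, (b >> i) & 1         # i-th bit of each operand
        result |= (ai ^ bi ^ carry) << i            # one-bit full-adder sum
        carry = (ai & bi) | (carry & (ai ^ bi))     # carry into the next position
    return result & ((1 << m) - 1)                  # m-bit result, final carry dropped

assert bit_serial_add(23, 42) == 65                 # 23 + 42, computed in m = 8 ALU passes
```

Each loop iteration is one pass through the one-bit ALU, so the time per word pair grows linearly with the word length M.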
To speed up the processing, a parallel ALU is usually used so that all bits of a word can be operated on simultaneously. This is how a conventional word-sequential computer executes on its operands (Point B in Fig. 95.2). A parallel processor may execute either (a) all the ith bits of N operands or operand pairs (i.e., a bit slice, or bis) or (b) all N M-bit operands or operand pairs simultaneously (Points C and D in Fig. 95.2, respectively). Figure 95.2 also shows some of the systems in this classification. It is seen from this classification that the performance of a computer is proportional to the total number of bits or bit pairs it can execute simultaneously.

Feng's classification [Hwang and Briggs, 1984] was originally intended for parallel processors, and as a result, the number of CUs in a computer system was not specified. Händler extended Feng's classification by adding a third dimension, namely, the number of CUs. Pipelined systems are also included in this classification through additional parameters [Händler, 1977].

95.3 Types of Parallel Processors

Ensemble Processors

An ensemble system is an extension of the conventional uniprocessor system. It is a collection of N PEs (a PE here consists of an ALU, a set of local registers, and limited local control capability) and N MUs, under the control of a single CU. Thus, the organization of an ensemble processor is similar to that shown in Fig. 95.1 except that there are no direct interprocessor and processor-memory communications, i.e., no interconnection networks. When the need for communication arises, it is done through the CU. This slows down the system for applications requiring extensive interprocessor and processor-memory communications. For example, the sum of two matrices A and B can be executed in one step if R² PEs are available in an ensemble processor, where R is the rank of the matrices. On the other hand, the product of the same two matrices requires extensive data alignment between the elements of A and B. As a result, an ensemble processor is ineffective for performing matrix multiplications. Therefore, while ensemble processors are capable of executing up to N identical jobs simultaneously, they have very limited applications. The parallel element processing ensemble (PEPE) [Evensen and Troy, 1973] is an example of such a parallel processor.

FIGURE 95.1 A basic parallel processor organization.

Array Processors

Because of the need for interprocessor and processor-memory communication in most applications, a parallel processor usually has one or more circuits (known as interconnection networks) to support various applications for efficient processing. In general, an array processor may consist of N identical PEs under the control of a single CU and a number of MUs. Within each PE there are circuits for the network interface as well as its own local memories. The PEs and MUs communicate with each other through an interconnection network. A typical array processor organization is shown in Fig. 95.3. Depending on the design, each PE may perform serial-by-bit (as in MPP) or parallel-by-bit (as in ILLIAC IV) operations. As can be seen from Fig. 95.3, the interconnection networks play a very important role in parallel processors. The network usually provides a uniform interconnection among PEs on one hand and between PEs and MUs on the other. Different array processor organizations might use different interconnection networks.
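Recalling the ensemble-processor matrix example above, the sketch below assumes one hypothetical PE per matrix element (R² PEs in all, with illustrative names). Element-wise addition finishes in a single parallel step because PE(i, j) needs only its locally stored A[i][j] and B[i][j], whereas each product element needs an entire row of A and an entire column of B, i.e., operands held by 2(R − 1) other PEs. Supplying those operands is exactly the data alignment that an ensemble processor must route through the CU and that an array processor's interconnection network is meant to provide.

```python
R = 3                                                     # matrix rank; R*R PEs assumed
A = [[i + j for j in range(R)] for i in range(R)]         # PE(i, j) locally holds A[i][j]
B = [[i * j for j in range(R)] for i in range(R)]         # ... and B[i][j]

# Addition: every PE combines only its own operands -> one parallel step, no communication.
C_add = [[A[i][j] + B[i][j] for j in range(R)] for i in range(R)]

# Multiplication: PE(i, j) needs row i of A and column j of B, so operands resident in
# other PEs must first be moved to it (modeled here simply by indexing across the arrays).
C_mul = [[sum(A[i][k] * B[k][j] for k in range(R)) for j in range(R)] for i in range(R)]
```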
In general, the interconnection networks can be classified into two categories: static and dynamic, as shown in Fig. 95.4. ILLIAC IV [Barnes et al., 1968] and MPP [Batcher, 1979] are examples of parallel processors using static interconnections, while STARAN [Batcher, 1973] and BSP [Kuck and Stokes, 1982] are examples using dynamic interconnections.

The CU usually has its own high-speed registers, local memory, and arithmetic unit. Thus, in many cases, it is a conventional computer, and the instructions are stored in a main memory together with data. However, in some machines such as ILLIAC IV, programs are distributed among the local memories of the PEs. Hence, the instructions are fetched from the processors' local memories into an instruction buffer in the CU. Each instruction is either a local-type instruction, which is executed entirely within the CU, or a parallel instruction, which is executed in the processing array. The primary function of the CU is to examine each instruction as it is to be executed and to determine where the execution should take place.

FIGURE 95.2 Feng's classification.

Associative Processor

Associative memories, also known as content-addressable memories, retrieve information on the basis of data content rather than addresses. An associative memory performs comparison (i.e., exclusive-OR or equivalence) operations at its bit level. The results of the comparison on a group of bits in a word, for all words in the memory, are transmitted to a register called a response register or flag. In addition, there are circuits such as a multiple-match resolver, an enable/disable register, and a number of temporary registers, as well as appropriate logic gates for resolving multiple responses and retrieving information. For associative processors, arithmetic capabilities are added to this unit. The unit can be viewed as consisting of a number of bit-serial PEs. Furthermore, the bit-level logic is moved out of the memory so that the memory part of the processor consists of a number of random-access memories called word modules. A typical associative processor is shown in Fig. 95.5. STARAN and MPP (Fig. 95.2) are representative of this bit-serial, word-parallel SIMD organization. In Fig. 95.5 the common register is where the common operand is stored, and the mask register defines the bit positions requiring operation. The enable/disable register provides local control of individual PEs. Because of its simplicity in design, the per-PE cost of an associative processor is much lower, but the bit-serial operations slow down the system drastically. To compensate for this, a large number of PEs is used, so these systems are useful only for applications that can exploit many PEs.

FIGURE 95.3 An array processor organization. I/O, input/output devices; LM, local memory; PE, processing element; SM, shared memory.

95.4 System Utilization

As discussed previously, for any computer there is a maximum number of bits or bit pairs that can be processed concurrently, whether it is under single-instruction or multiple-instruction control [Feng, 1972, 1973]. This maximum degree of concurrency, or maximum concurrency (C_m), is an indication of the computer's processing capability. The actual utilization of this capability is indicated by the average concurrency, defined to be

C_a = \frac{\sum_i c_i \, \Delta t_i}{\sum_i \Delta t_i}

where c_i is the concurrency during the time interval Δt_i.
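As a small numerical illustration of this definition, the sketch below computes the average concurrency from a trace of (c_i, Δt_i) pairs; the trace values are made up for the example.

```python
def average_concurrency(trace):
    """C_a = sum(c_i * dt_i) / sum(dt_i) for a trace of (c_i, dt_i) pairs,
    where c_i is the number of bits or bit pairs processed concurrently
    during an interval of length dt_i."""
    return sum(c * dt for c, dt in trace) / sum(dt for _, dt in trace)

# Hypothetical trace: 32 bits for 5 time units, 64 bits for 3, 16 bits for 2.
print(average_concurrency([(32, 5), (64, 3), (16, 2)]))   # (160 + 192 + 32) / 10 = 38.4
```

Dividing this figure by the machine's maximum concurrency C_m gives the average hardware utilization developed below.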
If Δt_i is set to one time unit, then the average concurrency over a period of T time units is

C_a = \frac{1}{T} \sum_{i=1}^{T} c_i

The average hardware utilization is then

\mu = \frac{C_a}{C_m} = \frac{\sum_{i=1}^{T} c_i}{T C_m} = \frac{1}{T} \sum_{i=1}^{T} \sigma_i

where σ_i is the hardware utilization at time i. Whereas C_m is determined by the hardware design, C_a or μ is highly dependent on the software and applications. A general-purpose computer should achieve a high μ for as many applications as possible, whereas a special-purpose computer would yield a high μ for at least the intended applications. In either case, maximizing the value of μ for a computer design is important. This equation can also be used to evaluate the relative effectiveness of machine designs.

FIGURE 95.4 Some examples of static (a) and dynamic (b) interconnection networks.

For a parallel processor, the degree of concurrency is called the degree of parallelism. A similar discussion can be used to define the average hardware utilization of a parallel processor. The maximum parallelism is then P_m, and the average parallelism is

P_a = \frac{\sum_i p_i \, \Delta t_i}{\sum_i \Delta t_i}

or, for T time units,

P_a = \frac{1}{T} \sum_{i=1}^{T} p_i

The average hardware utilization becomes

\mu = \frac{P_a}{P_m} = \frac{\sum_{i=1}^{T} p_i}{T P_m} = \frac{1}{T} \sum_{i=1}^{T} \rho_i

where ρ_i is the hardware utilization for parallel processors at time i.

FIGURE 95.5 An associative processor organization.

With appropriate instrumentation, the average hardware utilization of a system can be determined. In practice, however, it is not always true that every bit or bit pair that is being processed would be productive. Some of the bits produce only repetitious (superfluous) or even meaningless results. This happens more often and more severely in a parallel processor than in a word-sequential processor. Consider, for example, performing a maximum search operation in a mesh-connected parallel processor (such as ILLIAC IV). For N operands, it takes (N/2) log₂ N comparisons (N/2 comparisons for each of log₂ N iterations) instead of the usual N − 1 comparisons in word-sequential machines. Thus, in effect there are

\frac{N}{2} \log_2 N - (N - 1) = \frac{N}{2} (\log_2 N - 2) + 1

comparisons that are nonproductive. If we let P̂_a be the effective parallelism over a period of T time units and μ̂, p̂_i, and ρ̂_i be the corresponding effective values, the effective hardware utilization is then

\hat{\mu} = \frac{\hat{P}_a}{P_m} = \frac{\sum_{i=1}^{T} \hat{p}_i}{T P_m} = \frac{1}{T} \sum_{i=1}^{T} \hat{\rho}_i

A successful parallel processor design should yield a high μ̂, as well as the required throughput for, at least, the intended applications. This not only involves a proper hardware and software design but also the development of efficient parallel algorithms for these applications. Suppose T_u is the execution time of an application program using a conventional word-sequential machine, and T_c is the execution time of the same program using a concurrent system; the speed-up ratio is then defined as

S = \frac{T_u}{T_c}

Naturally, for a specific parallel organization, the speed-up ratio determines how well an application program can utilize the hardware resources. Supporting software has a direct effect on the speed-up ratio.

Defining Terms

Array processor: A parallel processor consisting of a number of processing elements, memory modules, and input/output devices, as well as interconnection networks, under a single control unit.

Associative processor: A parallel processor consisting of a number of processing elements, memory modules, and input/output devices under a single control unit. The capability of the processing elements is usually limited to bit-serial operations.
Ensemble processor: A parallel processor consisting of a number of processing elements, memory modules, and input/output devices under a single control unit. It has no interconnection network to provide interprocessor or processor-memory communications.

Interconnection network: A network of interconnections providing interprocessor and processor-memory communications. It may be static or dynamic, distributed or centralized.

Parallel processor: A computing system consisting of a number of processors, memory modules, input/output devices, and other components under the control of a single control unit. It is known as a single-instruction-stream, multiple-data-stream (SIMD) machine.

Processing element: A basic processor consisting of an arithmetic-logic unit, a number of registers, network interfaces, and some local control facilities.

Related Topics

18.1 Special Architectures • 96.5 Parallel Processing

References

G.H. Barnes, R.M. Brown, M. Kato, D.J. Kuck, D.L. Slotnick, and R.A. Stokes, "The ILLIAC IV computer," IEEE Trans. Comput., vol. C-17, pp. 746–757, 1968.
K.E. Batcher, "STARAN/RADCAP hardware architecture," Proc. Sagamore Computer Conf. on Parallel Processing, pp. 147–152, 1973.
K.E. Batcher, "MPP - A massively parallel processor," Proc. Int. Conf. on Parallel Processing, p. 249, 1979.
A.J. Evensen and J.L. Troy, "Introduction to the architecture of a 288-element PEPE," Proc. Sagamore Computer Conf. on Parallel Processing, pp. 162–169, 1973.
T. Feng, "An overview of parallel processing systems," 1972 WESCON Tech. Papers, Session 1, "Parallel Processing Systems," pp. 1–2, 1972.
T. Feng, Parallel Processing Characteristics & Implementation of Data Manipulating Functions, Technical Report RADC-TR-73-189, July 1973.
M.J. Flynn, "Very high speed computing systems," Proc. IEEE, vol. 54(12), pp. 1901–1909, 1966.
W. Händler, "The impact of classification schemes on computer architecture," Proc. Int. Conf. on Parallel Processing, pp. 7–15, 1977.
K. Hwang and F.A. Briggs, Computer Architecture and Parallel Processing, New York: McGraw-Hill, 1984.
D.J. Kuck and R.A. Stokes, "The Burroughs Scientific Processor (BSP)," IEEE Trans. Comput., vol. C-31(5), pp. 363–376, 1982.

Further Information

Proceedings of the International Conference on Parallel Processing: An annual conference held since 1972. Recent proceedings published by CRC Press.
IEEE Transactions on Parallel and Distributed Systems: Started in 1990 as a quarterly, now a monthly, published by the IEEE Computer Society.
Journal of Parallel and Distributed Computing: A monthly published by Academic Press.
Computer Architecture and Parallel Processing: A book by K. Hwang and F.A. Briggs published by McGraw-Hill.