Tinder, R.F., Oklobdzija, V.G., Hamacher, V.C., Vranesic, Z.G., Zaky, S.G., Raymond, J. "Organization," The Electrical Engineering Handbook, Ed. Richard C. Dorf, Boca Raton: CRC Press LLC, 2000

86 Organization

Richard F. Tinder, Washington State University
Vojin G. Oklobdzija, University of California, Davis
V. Carl Hamacher, Queen's University, Canada
Zvonko G. Vranesic, University of Toronto
Safwat G. Zaky, University of Toronto
Jacques Raymond, University of Ottawa

86.1 Number Systems
Positional and Polynomial Representations • Unsigned Binary Number System • Unsigned Binary-Coded Decimal, Hexadecimal, and Octal Systems • Conversion between Number Systems • Signed Binary Numbers • Floating-Point Number Systems

86.2 Computer Arithmetic
Number Representation • Algorithms for Basic Arithmetic Operations • Implementation of Addition • Implementation of the Multiplication Algorithm • Floating-Point Representation

86.3 Architecture
Functional Units • Basic Operational Concepts • Performance • Multiprocessors

86.4 Microprogramming
Levels of Programming • Microinstruction Structure • Microprogram Development • High-Level Languages for Microprogramming • Emulation • Applications of Microprogramming

86.1 Number Systems

Richard F. Tinder

Number systems provide the basis for conveying and quantifying information. Weather data, stocks, pagination of books, weights and measures—these are just a few examples of the use of numbers that affect our daily lives. For this purpose we find the decimal (or Arabic) number system to be reliable and easy to use. This system evolved presumably because early humans were equipped with a crude type of calculator, their ten fingers. A number system that is appropriate for humans, however, may be intractable for use by a machine such as a computer. Likewise, a number system appropriate for a machine may not be suitable for human use. Before concentrating on those number systems that are useful in computers, it will be helpful to review the characteristics that are desirable in any number system. There are four important characteristics in all:

• Distinguishability of symbols
• Arithmetic operations capability
• Error control capability
• Tractability and speed

To one degree or another the decimal system of numbers satisfies these characteristics for hard-copy transfer of information between humans. Roman numerals and binary are examples of number systems that do not satisfy all four characteristics for human use. On the other hand, the binary number system is preferable for use in digital computers. The reason is simply put: current digital electronic machines recognize only two identifiable states, physically represented by a high voltage level and a low voltage level. These two physical states are logically interpreted as the binary symbols 1 and 0.

A fifth desirable characteristic of a number system to be used in a computer is that it have a minimum number of easily identifiable states. The binary number system satisfies this condition. However, the digital computer must still interface with humankind. This is done by converting the binary data to a decimal and character-based form that can be readily understood by humans. A minimum number of identifiable characters (say 1 and 0, or true and false) is not practical or desirable for direct human use. If this is difficult to understand, imagine trying to complete a tax form in binary or in any number system other than decimal.
On the other hand, use of a computer for this purpose would not only be practical but, in many cases, highly desirable.

Positional and Polynomial Representations

The positional form of a number is a set of side-by-side (juxtaposed) digits given generally in fixed-point form as

    N_r = (a_(n–1) … a_3 a_2 a_1 a_0 . a_(–1) a_(–2) a_(–3) … a_(–m))_r        (86.1)

where the radix (or base) r is the total number of digits in the number system and a is a digit in the set defined for radix r. Here, the radix point separates the n integer digits on the left from the m fraction digits on the right. Notice that a_(n–1) is the most significant (highest-order) digit, called the MSD, and that a_(–m) is the least significant (lowest-order) digit, denoted the LSD. The value of the number in Eq. (86.1) is given in polynomial form by

    N_r = Σ (i = –m to n–1) a_i r^i
        = a_(n–1) r^(n–1) + a_(n–2) r^(n–2) + … + a_1 r^1 + a_0 r^0 + a_(–1) r^(–1) + a_(–2) r^(–2) + … + a_(–m) r^(–m)        (86.2)

where a_i is the digit in the ith position with a weight r^i.

Application of Eqs. (86.1) and (86.2) follows directly. For the decimal system r = 10, indicating that there are 10 distinguishable characters, recognized as the decimal numerals 0, 1, 2, …, r – 1 (= 9). Examples of the positional and polynomial representations for the decimal system are

    N_10 = (d_3 d_2 d_1 d_0 . d_(–1) d_(–2) d_(–3))_10 = 3017.528

and

    N_10 = Σ (i = –3 to 3) d_i 10^i
         = 3 × 10^3 + 0 × 10^2 + 1 × 10^1 + 7 × 10^0 + 5 × 10^(–1) + 2 × 10^(–2) + 8 × 10^(–3)
         = 3000 + 0 + 10 + 7 + 0.5 + 0.02 + 0.008
         = 3017.528

where d_i is the decimal digit in the ith position. Exclusive of possible leading and trailing zeros, the MSD and LSD for this number are 3 and 8, respectively. This number could have been written in a form such as N_10 = 03017.52800 without altering its value but implying greater accuracy of the fraction portion.

Unsigned Binary Number System

Applying Eqs. (86.1) and (86.2) to the binary system requires that r = 2, indicating that there are two distinguishable characters, typically 0 and (r – 1) = 1, that are used. In positional representation these characters (numbers) are called binary digits, or bits. Examples of the positional and polynomial notations for a binary number are

    N_2 = (b_(n–1) … b_3 b_2 b_1 b_0 . b_(–1) b_(–2) b_(–3) … b_(–m))_2 = 101101.101_2

with the MSB on the left and the LSB on the right, and

    N_2 = Σ (i = –m to n–1) b_i 2^i
        = 1 × 2^5 + 0 × 2^4 + 1 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0 + 1 × 2^(–1) + 0 × 2^(–2) + 1 × 2^(–3)
        = 32 + 8 + 4 + 1 + 0.5 + 0.125
        = 45.625_10

where b_i is the bit in the ith position. Thus, the bit positions are weighted …, 16, 8, 4, 2, 1, 1/2, 1/4, 1/8, … for any number consisting of integer and fraction portions. Binary numbers so represented are sometimes referred to as natural binary. In positional representation the bits on the extreme left and extreme right are called the MSB (most significant bit) and the LSB (least significant bit), respectively. Notice that by obtaining the value of a binary number, a conversion from binary to decimal has been performed; a small sketch of this positional-weighting conversion is given below. The subject of radix (base) conversion will be dealt with more extensively later. For reference purposes Table 86.1 provides the binary-to-decimal conversion for two-, three-, four-, five-, and six-bit binary. The six-bit binary column is only halfway completed for brevity.
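As a concrete illustration of positional weighting, the following is a minimal C sketch (not from the handbook) that evaluates an unsigned natural-binary string by Eq. (86.2) with r = 2. The function name binary_to_decimal and the use of a double accumulator are illustrative choices only, and no input validation is attempted.

#include <stdio.h>
#include <string.h>

/* Evaluate an unsigned natural-binary string such as "101101.101"
 * by positional weighting, Eq. (86.2) with r = 2.                  */
double binary_to_decimal(const char *s)
{
    const char *dot = strchr(s, '.');                    /* locate the radix point, if any */
    size_t int_len = dot ? (size_t)(dot - s) : strlen(s);
    double value = 0.0;

    /* integer part: weights 1, 2, 4, 8, ... moving left from the radix point */
    for (size_t i = 0; i < int_len; i++)
        value = value * 2.0 + (s[i] - '0');

    /* fraction part: weights 1/2, 1/4, 1/8, ... moving right from the radix point */
    if (dot) {
        double weight = 0.5;
        for (const char *p = dot + 1; *p; p++) {
            value += (*p - '0') * weight;
            weight *= 0.5;
        }
    }
    return value;
}

int main(void)
{
    printf("%s = %g\n", "101101.101", binary_to_decimal("101101.101"));  /* prints 45.625 */
    return 0;
}

Running the sketch on the text's example 101101.101_2 reproduces the value 45.625 obtained from the polynomial expansion above.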
TABLE 86.1 Binary-to-Decimal Conversion Two-Bit Decimal Three-Bit Decimal Four-Bit Decimal Five-Bit Decimal Six-Bit Decimal Binary Value Binary Value Binary Value Binary Value Binary Value 00 0 000 0 0000 0 10000 16 100000 32 01 1 001 1 0001 1 10001 17 100001 33 10 2 010 2 0010 2 10010 18 100010 34 11 3 011 3 0011 3 10011 19 100011 35 100 4 0100 4 10100 20 100100 36 101 5 0101 5 10101 21 100101 37 110 6 0110 6 10110 22 100110 38 111 7 0111 7 10111 23 100111 39 1000 8 11000 24 101000 40 1001 9 11001 25 101001 41 1010 10 11010 26 101010 42 1011 11 11011 27 101011 43 1100 12 11100 28 101100 44 1101 13 11101 29 101101 45 1110 14 11110 30 101110 46 1111 15 11111 31 101111 47 .. .. Nb i i im n = =′+′+′+′+′+′+′+′+′ =+++++ = =- - --- ? 2 120212120212120212 32841050125 45625 1 543 101 3 10 .. . ? 2000 by CRC Press LLC In the natural binary system the number of bits in a unit of data is commonly assigned a name. Examples are: ?4-data-bit unit: nibble (or half-byte) ?8-data-bit unit: byte ?16-data-bit unit: two bytes (or half-word) ?32-data-bit unit: word (or four bytes) ?64-data-bit unit: double-word etc. The word size for a computer is determined by the number of bits that can be manipulated and stored in registers. The foregoing list of names would be applicable to a 32-bit computer. Unsigned Binary-Coded Decimal, Hexadecimal, and Octal Systems While the binary system of numbers is most appropriate for use in computers, it has several disadvantages when used by humans who have become accustomed to the decimal system. For example, binary machine code is long, difficult to assimilate, and tedious to convert to decimal. However there exist simpler ways to represent binary numbers for conversion to decimal representation. Three examples, commonly used, are natural binary- coded decimal (NBCD), binary-coded hexadecimal (BCH), and binary-coded octal (BCO). These number systems are useful in applications where a digital device, such as a computer, must interface with humans. The NBCD code representation is also useful in carrying out computer arithmetic. The NBCD Representation The BCD system as used here is actually an 8, 4, 2, 1 weighted code called natural BCD or NBCD. This system uses patterns of four bits to represent each decimal position of a number and is one of several such weighted BCD code systems. The NBCD code is converted to its decimal equivalent by polynomials of the form N 10 = b 3 ′ 2 3 + b 2 ′ 2 2 + b 1 ′ 2 1 + b 0 ′ 2 0 = b 3 ′ 8 + b 2 ′ 4 + b 1 ′ 2 + b 0 ′ 1 for any b 3 b 2 b 1 b 0 code integer. Thus, decimal 6 is represented as (0′8) + (1′4) + (1′2) + (0′1), or 0110 in NBCD code. Like natural binary, NBCD code is also called “natural” because its bit positional weights are derived from integer powers of 2 n . Table 86.2 shows the NBCD bit patterns for decimal integers 0 through 9. The NBCD code is currently the most widely used of the BCD codes. There are many excellent sources of information on BCD codes. One, in particular, provides a fairly extensive coverage of both weighted and unweighted BCD codes [Tinder, 1991]. TABLE 86.2NBCD Bit Patterns and Decimal Equivalent NBCD NBCD Bit Pattern Decimal Bit Pattern Decimal 0000 0 1000 8 0001 1 1001 9 0010 2 1010 NA 0011 3 1011 NA 0100 4 1100 NA 0101 5 1101 NA 0110 6 1110 NA 0111 7 1111 NA NA = not allowed. ? 2000 by CRC Press LLC Decimal numbers greater than 9 or less than 1 can be represented by the NBCD code if each digit is given in that code and if the results are combined. 
For example, the number 63.98 is represented by (or converted to) NBCD code as 63.98 63.98 10 = 01100011.10011000) NBCD = 1100011.10011 NBCD Here, the code weights are 80, 40, 20, 10; 8, 4, 2, 1; 0.8, 0.4, 0.2, 0.1; and 0.08, 0.04, 0.02, 0.01 for the tens, units, tenths, and hundredths digits, respectively, representing four decades. Conversion between binary and NBCD requires conversion to decimal as an intermediate step. For example, to convert from NBCD to binary requires that groups of four bits be selected in both directions from the radix point to form the decimal number. If necessary, zeros are added to the leftmost or rightmost ends to complete the groups of four bits as in the above example. Negative NBCD numbers can be represented either in sign-magnitude notation or 1’s or 2’s complement notation as discussed later. Another BCD code that is used for number representation and manipulation is called excess 3 BCD (or XS3 NBCD, or simply XS3). XS3 is an example of a biased-weighted code (a bias of 3). This code is formed by adding 0011 2 (= 3 10 ) to the NBCD bit patterns in Table 86.2. Thus, to convert XS3 to NBCD code, 0011 must be subtracted from XS3 code. In four-bit quantities the XS3 code has the useful feature that when adding two numbers together in XS3 notation a carry will result and yield the correct value any time a carry results in decimal (i.e., when 9 is exceeded). This feature is not shared by either natural binary or NBCD addition. The Hexadecimal and Octal Systems The hexadecimal number system requires that r = 16 in Eqs. (86.1) and (86.2), indicating that there are 16 distinguishable characters in the system. By convention, the permissible hexadecimal digits are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F for decimals 0 through 15, respectively. Examples of the positional and polynomial representations for a hexadecimal number are N 16 = (h n–1 …h 3 h 2 h 1 h 0 . h –1 h –2 h –3 …h –m ) 16 = (AF3.C8) 16 with a decimal value of Here, it is seen that a hexadecimal number has been converted to decimal by using Eq. (86.2). The octal number system requires that r = 8 in Eqs. (86.1) and (86.2), indicating that there are eight distinguishable characters in this system. The permissible octal digits are 0, 1, 2, 3, 4, 5, 6, and 7, as one might expect. Examples of the application of Eqs. (86.1) and (86.2) are N 8 = (o n–1 …o 3 o 2 o 1 o 0 . o –1 o –2 o –3 …o –m ) 8 = 501.74 8 with a decimal value of Nh i i im n = =′+′+′+′+′ = =- - -- ? 16 10 16 15 16 3 16 12 16 8 16 280378125 1 210 2 10 . ? 2000 by CRC Press LLC When the hexadecimal and octal number systems are used to represent bit patterns in binary, they are called binary-coded hexadecimal (BCH) and binary-coded octal (BCO), respectively. These two number systems are examples of binary-derived radices. Table 86.3 lists several selected examples showing the relationships between BCH, BCO, binary, and decimal. What emerges on close inspection of Table 86.3 is that each hexadecimal digit corresponds to four binary digits and that each octal digit corresponds to three binary digits. The following example illustrates the relationships between these number systems: 5BF.D8 10110111111.11011 2 = 010110111111.11011000 = 5BF.D8 16 2677.66 = 010110111111.110110 = 2677.66 8 = 1471.84375 10 To separate the binary digits into groups of four (for BCH) or groups of three (for BCO), counting must begin from the radix point and continue outward in both directions. 
Then, where needed, zeros are added to the leading and trailing ends of the binary representation to complete the MSDs and LSDs for the BCH and BCO forms. Conversion between Number Systems It is not the intent of this section to cover all methods for radix (base) conversion. Rather, the plan is to provide general approaches, separately applicable to the integer and fraction portions, followed by specific examples. Conversion of Integers Since the polynomial form of Eq. (86.2) is a geometrical progression, the integer portion can be represented in nested radix form. In source radix s, the nested representation is TABLE 86.3The BCH and BCO Number Systems Binary BCH BCO Decimal Binary BCH BCO Decimal 0000 0 0 0 1010 A 12 10 0001 1 1 1 1011 B 13 11 0010 2 2 2 1100 C 14 12 0011 3 3 3 1101 D 15 13 0100 4 4 4 1110 E 16 14 0101 5 5 5 1111 F 17 15 0110 6 6 6 10000 10 20 16 0111 7 7 7 11011 1B 33 27 1000 8 10 8 110001 31 61 49 1001 9 11 9 1001110 4E 116 78 No i i im n = =′+′+′+′+′ = =- - -- ? 8 5808187848 3219375 1 21012 10 . ? 2000 by CRC Press LLC (86.3) for digits a i having integer values from 0 to s – 1. The nested radix form not only suggests a conversion process but also forms the basis for computerized conversion. Consider that the number in Eq. (86.3) is to be represented in nested radix r form (86.4) where, in general, m 1 n. Then, if N s is divided by r, the results are of the form (86.5) where Q is the integer quotient rearranged as Q 0 = b 1 + r(b 2 + … + b m–1 )))) r and R is the remainder R 0 = b 0 . A second division by r yields Q 0 /r = Q 1 + R 1 /r, where Q 1 is arranged as Q 1 = b 2 + r(b 3 + … + b m–1 ))) r and R 1 = b 1 . Thus, by repeated division of the integer result Q i by r, the remainders yield (b 0 , b 1 , b 2 , …, b m–1 ) r in that order. The conversion method just described, called the radix divide method, can be used to convert between any two integers of different radices. However, the requirement is that the arithmetic required by N s /r must be carried out in source radix, s. Except for source radices 10 and 2, this poses a severe problem for humans. Table 86.4 provides the recommended procedures for integer conversion. The radix divide method is suitable for computer conversion providing, of course, that the computer is programmed to carry out the arithmetic in different radices. TABLE 86.4Summary of Recommended Methods for Integer Conversion by Noncomputer Means Integer Conversion Conversion Method N 10 ? N r Radix division by radix r using Eq. (86.5) N s ? N 10 Eq. (86.2) or Eq. (86.3) N s ) s110 —> N r ) r110 N s ? N 10 by Eq. (86.2) or (86.3) N 10 ? N r radix division by r using Eq. (86.5) Special Cases for Binary Forms N 2 ? N 10 Positional weighting N 2 ? N BCH Partition N 2 into groups of four bits starting from radix point, then apply Table 86.3 N 2 ? N BCO Partition N 2 into groups of three bits starting from radix point, then apply Table 86.3 N BCH ? N 2 Reverse of N 2 ? N BCH N BCO ? N 2 Reverse of N 2 ? N BCO N BCH ? N BCO N BCH ? N 2 ? N BCO N BCO ? N BCH N BCO ? N 2 ? N BCH N NBCD ? N XS3 Add 0011 2 (= 3 10 ) to N NBCD N XS3 ? N NBCD Subtract 0011 2 (= 3 10 ) from N NBCD Nasas asas asasa a asas s s s n n n n n i i i n =++++ ( ) =++++ =+ ? è ? ? ? ÷ - - - - - - = - ? 1 1 2 2 1 1 0 0 012 1 0 1 1 1 L L( ( ))))) N b rb rb b brbr rm i i i m r =++++ =+ ? è ? ? ? ÷ - - = - ? 012 1 0 1 1 1 ( ( )))))L N r Q R r s =+ ? 2000 by CRC Press LLC The integer conversion methods of Table 86.4 can be illustrated by the following simple examples: Example 1. 139 10 ? 
N 2 N/r Q R 139/2 = 69 1 69/2 = 34 1 34/2 = 17 0 17/2 = 8 1 8/2 = 4 0 4/2 = 2 0 2/2 = 1 0 1/2 = 0 1 139 10 = 10001011 2 Example 2. 10001011 2 ? N 10 . By positional weights, N 10 = 128 + 8 + 2 + 1 = 139 10 Example 3. 139 10 ? N 8 N/rQR 139/8 = 17 3 17/8 = 2 1 2/8 = 0 2 139 10 = 213 8 Example 4. 10001011 2 ? N BCO 213 010 001 011 = 213 BCO Example 5. 213 BCO ? N BCH 213 8 B 213 BCO = 010 001 011 = 10001011 2 = 1000 1011 = 8B 16 Example 6. 213 8 ? N 5 213 8 = 2 ′ 8 2 + 1 ′ 8 1 + 3 ′ 8 0 = 139 10 N/r Q R 139/5 = 27 4 27/5 = 5 2 5/5 = 1 0 1/5 = 0 1 213 8 = 1024 5 Check: 1 ′ 5 3 + 2 ′ 5 1 + 4 ′ 5 0 = 125 + 10 + 4 = 139 10 Conversion of Fractions By extracting the fraction portion from Eq. (86.2) one can write (86.6) in radix s. This is called the nested inverse radix form that provides the basis for computerized conversion. .( ) ( ( ))))) () Nasas as sa sa a sa as sm m s ms i i s i m =+++ =+++ =+ - - - - - - - - - -- - -- -+ = ? 1 1 2 2 1 1 1 2 1 1 1 2 L L ? 2000 by CRC Press LLC If the fraction in Eq. (86.6) is represented in nested inverse radix r form, then (86.7) for any fraction represented in radix r. Now, if N s is multiplied by r, the result is of the form .N s ′ r = I + F (86.8) where I is the product integer, I 1 = b –1 , and F 0 is the product fraction arranged as F 1 = r –1 (b –2 + r –1 (b –3 + … + b –p )))) r . By repeated multiplication by r of the remaining fractions F i , the resulting integers yield (b –1 , b –2 , b –3 , … b –m ) r , in that order. The conversion just described is called the radix multiply method and is perfectly general for converting between fractions of different radices. However, as in the case of integer conversion, the requirement is that the arithmetic required by .N s ′ r must be carried out in source radix, s. For noncomputer use by humans, this procedure is usually limited to fraction conversions N 10 ? N r , where the source radix is 10 (decimal). The recommended methods for converting between fractions of different radices are given in Table 86.5. The radix multiply method is well suited to computer use. For any integer of radix s, there exists an exact representation in radix r. This is not the case for a fraction whose conversion is a geometrical progression that never converges. Terminating a fraction conversion at n digits (to the right of the radix point) results in an error or uncertainty. In decimal, this error is given by where the quantity in brackets approaches the value of a –n + 1. Therefore, terminating a fraction conversion at n digits from the radix point results in an error with bounds TABLE 86.5Summary of Recommended Methods for Fraction Conversion by Noncomputer Means Fraction Conversion Conversion Method .N 10 ? .N r Radix multiplication by using Eq. (86.8) .N s ? .N 10 Equation (86.2) or Eq. (86.6) .N r ) s110 ? .N r ) r110 N s ? N s10 by Eq. (86.2) or Eq. (86.6) N 10 ? N r radix multiply by Eq. (86.8) Special Cases for Binary Forms .N 2 ? .N BCH Partition .N 2 into groups of four bits from radix point, then apply Table 86.3 .N 2 ? .N BCO Partition .N 2 into groups of three bits from radix point, then apply Table 86.3 .N BCH ? .N 2 Reverse of .N 2 ? .N BCH .N BCO ? .N 2 Reverse of .N 2 ? .N BCO .N BCH ? .N BCO .N BCH ? .N 2 ? .N BCO .N BCO ? .N BCH .N BCO ? .N 2 ? .N BCH . ( ( ))))) () Nrbrb b rb br r p r i i r i p =+++ =+ -- -- - - -- -+ = ? 11 12 1 1 1 2 L e 10 1 1 2 2 1 =+ + + =+ é ? ê ê ù ? ú ú - - -+() -+() -+() -+() - --+() -+() = ¥ ? ar a r a r ra ar n n n n n n n nni ni i r L ? 
2000 by CRC Press LLC 0 < e 10 £ r –n (a –n + 1) (86.9) in decimal. Equation (86.9) is useful in deciding when to terminate a fraction conversion. Often, it is desirable to terminate a fraction conversion at (n + 1) digits and then round off to n from the radix point. A suitable method for rounding to n digits in radix r is: Perform the fraction conversion to (n + 1) digits from the radix point, then drop the (n + 1) digit if a –(n+1) < r/2, or add r –(n–1) to the result if a –(n+1) 3 r/2. After rounding off to n digits, the maximum error becomes the difference between the rounded result and the smallest value possible. By using Eq. (86.9), this difference is e max = r –n (a –n + 1) – r –n (a –n + a –(n+1) /r) = r –n (1 – a –(n+1) /r) Then, by rounding to n digits, there results an error with bounds 0 < e 10 £ r –n (1 – a –(n+1) /r) (86.10) in decimal. If a –(n+1) < r/2 and the (n + 1) digit is dropped, the maximum error is r –n . Note that for N s ? N 10 ? N r type conversions, the bounds of errors aggregate. The following examples illustrate the fraction conversion methods of Table 86.5. Example 7. 0.654 10 ? N 2 rounded to eight bits .N s ′ rFI 0.654 ′ 2 0.308 1 0.308 ′ 2 0.616 0 0.616 ′ 2 0.232 1 0.232 ′ 2 0.464 0 0.464 ′ 2 0.928 0 0.928 ′ 2 0.856 1 0.654 10 = 0.10100111 2 0.856 ′ 2 0.712 1 0.712 ′ 2 0.424 1 0.424 ′ 2 0.848 0 e max = 2 –8 Example 8. 0.654 10 ? N 8 terminated at four digits .N s ′ rFI 0.654 ′ 8 0.232 5 0.232 ′ 8 0.856 1 0.654 10 = 5166 8 0.856 ′ 8 0.848 6 with error bounds 0.848 ′ 8 0.784 6 0 < e 10 £ 7 ′ 8 –4 = 1.71 ′ 10 –3 Example 9. 0.5166 8 ? N 2 rounded to eight bits and let 0.5166 8 ? N 10 be rounded to four decimal places. 0.5166 8 = 5 ′ 8 –1 + 1 ′ 8 –2 + 6 ′ 8 –3 + 6 ′ 8 –4 = 0.625000 + 0.015625 + 0.011718 + 0.001465 = 0.6538 rounded to four decimal places; e 10 £ 10 –4 .N s ′ rFI 0.6538 ′ 2 0.3076 1 0.3076 ′ 2 0.6152 0 0.6152 ′ 2 0.2304 1 0.2304 ′ 2 0.4608 0 0.4608 ′ 2 0.9216 0 ? 2000 by CRC Press LLC 0.9216 ′ 2 0.8432 1 0.8432 ′ 2 0.6864 1 0.5166 8 = 0.10100111 2 (compare with Example 7) 0.6864 ′ 2 0.3728 1 0.3728 ′ 2 0.7457 0 e 10 £ 10 –4 + 2 –8 = 0.0040 Example 10.0.10100111 2 ? N BCH A· 7 0.10100111 2 = 0.1010 0111 = 0.A7 BCH Signed Binary Numbers To this point only unsigned numbers (assumed to be positive) have been considered. However, both positive and negative numbers must be used in computers. Several schemes have been devised for dealing with negative numbers in computers, but only four are commonly used: ?Signed-magnitude representation ?Radix complement representation ?Diminished radix complement representation ?Excess (offset) code representation Of these, the radix 2 complement representation, called 2’s complement, is the most widely used system in computers. Signed-Magnitude Representation A signed-magnitude number consists of a magnitude together with a symbol indicating its sign (positive or negative). Such a number lies in the decimal range of –(r n–1 – 1) through +(r n–1 – 1) for n integer digits in radix r. A fraction portion, if present, would consist of m digits to the right of the radix point. The most common examples of signed-magnitude numbers are those in the decimal and binary systems. The sign symbols for decimal (+ or –) are well known. In binary it is established practice to use 0 = plus and 1 = minus for the sign symbols and to place one of them in the MSB position for each number. 
Examples in eight-bit binary are Magnitude +45.5 10 = 0 101101.1 2 +0 10 = 00000000 2 Sign bit Magnitude –123 10 = 1 1111011 2 –0 10 = 10000000 2 Sign bit Although the sign-magnitude system is used in computers, it has two drawbacks. There is no unique zero, as indicated by the examples, and addition and subtraction calculations require time-consuming decisions regarding operation and sign as, for example, (–7) minus (–4). Even so, the sign-magnitude representation is commonly used in floating-point number systems. Radix Complement Representation The radix complement of an n-digit number N r is obtained by subtracting it from r n , that is r n – N r . The operation r n – N r is equivalent to complementing the number and adding 1 to the LSD. Thus, the radix complement is N r + 1 LSD where N r = r n – 1 – N r is the complement of a number in radix r. Therefore, one may write ? 2000 by CRC Press LLC Radix complement of N r = r n – N r = N r + 1 (86.11) The complements N r for digits in three commonly used number systems are given in Table 86.6. Notice that the complement of a binary number is formed simply by replacing the 1’s with 0’s and 0’s with 1’s as required by 2 n – 1 – N 2 . With reference to Table 86.6 and Eq. (86.11), the following examples of radix complement representation are offered. Example 11.The 10’s complement of 47.83 is N 10 + 1 LSD = 52.17 Example 12.The 2’s complement of 0101101.101 is N 2 + 1 LSB = 1010010.011 Example 13.The 16’s complement of A3D is N 16 + 1 LSD = 5C2 + 1 = 5C3 The decimal value of Eq. (86.11) can be found from the polynomial expression (86.12) for any n-digit number of radix r. In Eqs. (86.11) and (86.12) the MSD is taken to be the position of the sign symbol. 2’s Complement Representation.The radix complement for binary is the 2’s complement representation. In 2’s complement the MSB is the sign bit, 1 indicating a negative number or 0 if positive. The decimal range of representation for n-integer bits in 2’s complement is from –(2 n–1 ) through +(2 n–1 ). From Eq. (86.11), the 2’s complement is formed by N 2 ) 2’s compl. = 2 n – N 2 = N 2 + 1 (86.13) A few examples in eight-bit binary are shown in Table 86.7. Notice that application of Eq. (86.13) changes the sign of the decimal value of a binary number (+ to –, and vice versa) and that only one zero representation exists. Application of Eq. (86.12) gives the decimal value of any 2’s complement number, including those containing a radix point. For example, the pattern N 2’s compl. = 11010010.011 has a decimal value N 2’s compl. ) 10 = –1 ′ 2 7 + 1 ′ 2 6 + 1 ′ 2 4 + 1 ′ 2 1 + 1 ′ 2 –2 + 1 ′ 2 –3 = –128 + 64 + 16 + 2 + 0.25 + 0.125 = –45.625 10 The same result could have easily been obtained by first applying Eq. (86.13) to N 2’s compl. followed by the use of positional weighting to obtain the decimal value. Thus, Narar n n i i im n radix compl. )() 10 1 1 2 =- + - - =- - ? ? 2000 by CRC Press LLC N 2’s compl. = 00101101.101 = 32 + 8 + 5 + 0.5 + 0.125 = 45.625 10 which is known to be a negative number, –45.625 10 . Negative NBCD numbers can be represented in 2’s complement. The foregoing discussion on 2’s complement applies to NBCD with consideration of how NBCD is formed from binary. As an example, –59.24 10 is represented by 0101 1001.0010 0100) NBCD = 10100110.11011100) 2’s compl. NBCD In a similar fashion, negative NBCD numbers can also be represented in 1’s complement following the procedure given in the next paragraph. 
Sign-magnitude representation of a negative NBCD number simply requires the addition of a sign bit to the NBCD magnitude. Diminished Radix Complement Representation The diminished radix complement of a number is obtained by N r ) dim. rad. compl. = r n – N r – 1 (86.14) = N r Thus, the complement of a number is its diminished radix complement. It also follows that the radix comple- ment of a number is the diminished radix complement with 1 added to the LSD as in Eq. (86.13). The range of representable numbers is –(r n–1 – 1) through +(r n–1 – 1) for radix r. In the binary and decimal number systems, the diminished radix complement representations are the 1’s complement and 9’s complement, respectively. Examples of 1’s complement are shown in Table 86.7 for comparison with those of 2’s complement. Notice that in 1’s complement there are two representations for zero, one for +0 and the other for –0. This fact limits the usefulness of the 1’s complement representation for computer arithmetic. TABLE 86.6 Complements for Three Commonly Used Number Systems TABLE 86.7 Examples of Eight-Bit 2’s and 1’s Complement Representations (MSB = Sign Bit) Complement (–N r ) Decimal 2’s 1’s Digit Binary Decimal Hexadecimal Value Complement Complement 0 1 9 F –128 10000000 1 0 8 E –127 10000001 10000000 2 7 D –31 11100001 11100000 3 6 C –16 11110000 11101111 4 5 B –15 11110001 11110000 5 4 A –3 11111101 11111100 6 3 9 –0 00000000 11111111 7 2 8 +0 00000000 00000000 8 1 7 +3 00000011 00000011 9 0 6 +15 00001111 00001111 A 5 +16 00010000 00010000 B 4 +31 00011111 00011111 C 3 +127 01111111 01111111 D 2 +128 E1 F0 ? 2000 by CRC Press LLC Excess (Offset) Representations Other systems for representing negative numbers use excess or offset codes. Here, a bias B is added to the true value N r of the number to produce an excess number N xs given by N xs = N r + B (86.15) When B = r n–1 exceeds the usable bounds of negative numbers, N xs remains positive. Perhaps the most common use of the excess representation is in floating-point number systems—the subject of the next section. Two examples are given below in eight-bit excess 128 code. Example 14. –43 10 11010101 N 2’s compl. +128 10 10000000 B 85 10 01010101 N xs = –43 10 in excess 128 code Example 15. 27 10 00011011 N 2’s compl. +128 10 10000000 B 155 10 10011011 N xs = 27 10 in excess 128 code The representable decimal range for an excess 2 n–1 number system is –2 n–1 through +(2 n–1 – 1) for an n-bit binary number. However, if N 2 + B > 2 n–1 – 1, overflow occurs and 2 n–1 must be subtracted from (N 2 + B) to give the correct result in excess 2 n–1 code. Floating-Point Number Systems In fixed-point representation [Eq. (86.1)], the radix point is assumed to lie immediately to the right of the integer field and at the left end of the fraction field. The fixed-point system is the most commonly used system for representing bounded orders of magnitude. For example, with 32 bits a binary number could represent decimal numbers with upper and lower bounds of the order of ±10 10 and ±10 –10 . However, for greatly expanded bounds of representation, as in scientific notation, the floating-point representation is needed. A floating-point number (FPN) in radix r has the general form FPN) r = F ′ r E (86.16) where F is the fraction (or mantissa) and E is the exponent. Only fraction digits are used for the mantissa! Take, for example, Planck’s constant h = 6.625 ′ 10 –34 J·s. 
This number can be represented in many different ways in floating-point notation:

    Planck's constant h = 0.6625 × 10^(–33) = 0.06625 × 10^(–32) = 0.006625 × 10^(–31)

All three adhere to the form of Eq. (86.16) and are, therefore, legitimate floating-point numbers in radix 10. Thus, as the radix point floats to the left, the exponent is scaled accordingly. The first form for h is said to be normalized because the MSD of F is nonzero, a means of standardizing the radix point position. Notice that the sign for F is positive while that for E is negative.

In computers the FPN is represented in binary, where the normalized representation requires that the MSB of F always be 1. Thus, the range of F in decimal is

    0.5 ≤ F < 1

Also, the mantissa F is represented in sign-magnitude form. The normalized format for a 32-bit floating-point number in binary, which agrees with the IEEE standard [IEEE, 1985], is shown in Fig. 86.1. Here, the sign bit (1 if negative or 0 if positive) is placed at bit position 0 to indicate the sign of the fraction. Notice that the radix point is assumed to lie between bit positions 8 and 9 to separate the E bit-field from the F bit-field.

Before two FPNs can be added or subtracted in a computer, the E fields must be compared and equalized and the F fields adjusted. The decision-making process can be simplified if all exponents are converted to positive numbers by using the excess representation given by Eq. (86.15). For a q-digit exponent in radix r, the exponent in Eq. (86.16) becomes

    E_xs = E_r + r^(q–1)        (86.17)

where E_r is the actual exponent augmented by a bias of B = r^(q–1). The range of the actual exponent E_r is usually taken to be

    –(r^(q–1) – 1) ≤ E_r ≤ +(r^(q–1) – 1)

In the binary system, required for computer calculations, Eq. (86.17) becomes

    E_xs = E_2 + 2^(q–1)        (86.18)

with a range of actual exponent of –(2^(q–1) – 1) ≤ E_2 ≤ +(2^(q–1) – 1). In 32-bit normalized floating-point form, the exponent is stored in excess 128 code, while the number is stored in sign-magnitude form.

There still remains the question of how the number 0 is to be represented. If the F field is zero, then the exponent can be anything and the number will be zero. However, in computers the normalized FPN_2 limits F to 0.5 ≤ F < 1, since the MSB of F is always 1. The solution to this problem is to assume that the number is zero if the exponent bits are all zero, regardless of the value of the mantissa. This leads, however, to a discontinuity in normalized FPN_2 representation at the low end.

The IEEE standard for normalized FPN_2 representation attempts to remove the problem just described. The IEEE system stores the exponent in excess 2^(q–1) – 1 code and limits the decimal range of the actual exponent to

    –(2^(q–1) – 2) ≤ E_2 ≤ +(2^(q–1) – 1)

For 32-bit FPN representation, the exponent is stored in excess 127 code as indicated in Fig. 86.1. Thus, the allowable range of representable exponents is from

    –126_10 = 00000001_2 through +127_10 = 11111110_2

FIGURE 86.1 IEEE standard bit format for normalized floating-point representation.

This system reserves the use of all 0's or all 1's in the exponent for special conditions [IEEE, 1985; Pollard, 1990]. So that the F field magnitude can diminish linearly to zero when E = –126, the MSB = 1 for F is not specifically represented in the IEEE system but is implied. A small sketch of unpacking these fields from a 32-bit pattern is given below.
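The following is a minimal C sketch (not from the handbook) that pulls the sign, excess-coded exponent, and stored fraction out of a 32-bit single-precision pattern; the example value and all variable names are illustrative. One caution: the standard itself normalizes to 1.F × 2^E, with the implied 1 to the left of the binary point, so the biased exponent obtained this way is one less than a hand calculation that keeps F in the range 0.5 ≤ F < 1, as in the convention used above.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Decompose an IEEE 754 single-precision number into its three fields:
 * 1 sign bit, 8 exponent bits stored in excess-127, and 23 fraction bits
 * (the leading 1 of a normalized value is implied, not stored).          */
int main(void)
{
    float x = 45.78125f;               /* 101101.11001 in binary, the value worked by hand below */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);    /* reinterpret the bit pattern */

    uint32_t sign     = bits >> 31;
    uint32_t exp_xs   = (bits >> 23) & 0xFFu;    /* biased (excess-127) exponent */
    uint32_t fraction = bits & 0x7FFFFFu;        /* 23 explicit fraction bits    */
    int      exp_true = (int)exp_xs - 127;       /* actual exponent              */

    printf("sign = %u, biased exponent = %u, actual exponent = %d\n",
           (unsigned)sign, (unsigned)exp_xs, exp_true);
    printf("stored fraction = 0x%06X (implied leading 1 not stored)\n",
           (unsigned)fraction);
    return 0;
}

For 45.78125 this prints a biased exponent of 132 (actual exponent 5 under the 1.F convention) and fraction bits 0110111001…0, which may be compared with the hand-worked example that follows.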
The following example attempts to illustrate the somewhat confusing aspects of the IEEE normalized representation: The number 101101.11001 2 is to be represented in IEEE normalized FPN 2 notation. 101101.11001 2 = .10110111001 ′ 2 6 Sign bit = 0 (positive) E xs = 6 + 127 = 133 10 = 10000101 2 F = 0110111001…00 (the MSB = 1 is not shown) Therefore, the IEEE normalized FPN is FPN 2 = 0 10000101 0110111001…0 Still other forms of FPNs are in use. In addition to the IEEE system, there are the IBM, Cray, and DEC systems of representation, each with its own single- and double-precision forms. Defining Terms Binary: Representation of quantities in base 2. Complement: Opposite form of a number system. Floating point: Similar to “scientific notation” except used to represent binary operations in a computer. Hexadecimal: Base 16 number system. Mantissa: Fraction portion of a floating-point number. Octal: Base 8 number system. Radix: Base to which numbers are represented. Related Topic 86.2 Computer Arithmetic References H.L. Garner, “Number systems and arithmetic,” in Advances in Computers, vol. 6, F.L. Alt et al., Eds., New York: Academic, 1965, pp. 131–194. IEEE, IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std. 754–1985. D.E. Knuth, The Art of Computer Programming: Seminumerical Algorithms, vol. 2, Reading, Mass: Addison- Wesley, 1969. C. Tung, “Arithmetic,” in Computer Science, A.F. Cardenas et al., Eds., New York: Wiley-Interscience, 1972, chap. 3. Further Information K. Hwang, Computer Arithmetic, New York: Wiley, 1978. L.H. Pollard, Computer Design and Architecture, Englewood Cliffs, N.J.: Prentice-Hall, 1990. R.F. Tinder, Digital Engineering Design: A Modern Approach, Englewood Cliffs, N.J.: Prentice-Hall, 1991. ? 2000 by CRC Press LLC 86.2 Computer Arithmetic Vojin G. Oklobdzija As the ability to perform computation increased from the early days of computers up to the present, so has the knowledge of how to utilize the hardware and software to perform computation. Digital computer arithmetic emerged from that period in two ways: as an aspect of logic design and as a development of efficient algorithms to utilize the available hardware. Given that numbers in a digital computer are represented as a string of zeros and ones and that hardware can perform only a relatively simple and primitive set of Boolean operations, all the arithmetic operations performed are based on a hierarchy of operations that are built upon the very simple ones. What distinguishes computer arithmetic is its intrinsic relation to technology and the ways things are designed and implemented in a digital computer. This comes from the fact that the value of a particular way to compute, or a particular algorithm, is directly evaluated from the actual speed with which this computation is performed. Therefore, there is a very direct and strong relationship between the technology in which digital logic is implemented to compute and the way the computation is structured. This relationship is one of the guiding principles in the development of computer arithmetic. The subject of computer arithmetic can be, for simpler treatment, divided into number representation; basic, arithmetic operations (such as addition, multiplication, and division), and evaluation of functions. Number Representation The only way to represent information in a digital computer is via a string of bits, i.e., zeros and ones. 
The number of bits being used depends on the length of the computer word, which is the quantity of bits on which the hardware is capable of operating (sometimes also the quantity that is brought to the CPU from memory in a single access). The first question is what relationship to use in establishing the correspondence between those bits and a number. Second, we need to make sure that certain properties that exist in the corresponding number representation system are satisfied and that they directly correspond to the operations being performed in hardware on the given string of bits.

This relationship is defined by the rule that associates one numerical value, designated as X (in the text we will use capital X for the numerical value), with the corresponding bit string, designated as x:

    x = {x_(n–1), x_(n–2), …, x_0}    where x_i ∈ {0, 1}

In this case the associated word (the string of bits) is n bits long. When for every value X there exists one and only one corresponding bit string x, we define the number system as nonredundant. If, however, more than one bit string x can represent the same value X, the number system is redundant.

Most commonly we use numbers represented in a weighted number system, where a numerical value is associated with the bit string x according to the equation

    x_e = Σ (i = 0 to n–1) x_i w_i

where w_0 = 1 and w_i = w_(i–1) × r_(i–1). The value r_i is an integer designated as the radix, and in a nonredundant number system it is an integer equal to the number of allowed values for x_i. In general x_i could consist of more than one bit. The numerical value associated with x is designated as the explicit value x_e. In conventional number systems the radix r_i is the same positive integer for all the digit positions x_i, with the canonical set of digit values

    x_i ∈ {0, 1, 2, 3, …, r_i – 1}    for 0 ≤ i ≤ n – 1

An example of a weighted number system with a mixed radix would be the representation of time in weeks, days, hours, minutes, and seconds, with a range for representing 100 weeks:

    r = 10, 10, 7, 24, 60, 60

In digital computers the radices encountered are 2, 4, 10, and 16, with 2 being the most commonly used one. The digit set for x_i can be redundant or nonredundant. If the number of different values x_i can assume is n_x ≤ r, then we have a nonredundant digit set. Otherwise, if n_x > r, we have a redundant digit set. Use of a redundant digit set has its advantages in efficient implementation of algorithms (multiplication and division in particular). Other number representations of interest are nonweighted number systems, where the relative position of the digit does not affect the weight, so that the appropriate interchange of any two digits will not change the value of x. The best example of such a number system is the residue number system (RNS).

We also define the explicit value x_e and the implicit value X_i of a number represented by a bit string x. The implicit value is the only value of interest to the user, while the explicit value provides the most direct interpretation of the bit string x. Mapping of the explicit value to the implicit value is obtained by an arithmetic function that defines the number representation used. It is the task of the arithmetic designer to devise algorithms that result in the correct implicit value of the result for the operations on the operand digits representing the explicit values. In other words, the arithmetic algorithm needs to satisfy the closure property. A small sketch of evaluating the explicit value of a mixed-radix weighted representation is given below.
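The following is a minimal C sketch (not from the handbook) of the weighted-system rule w_0 = 1, w_i = w_(i–1) × r_(i–1), applied to the mixed-radix time example above; the function name weighted_value and the particular digit values are illustrative assumptions.

#include <stdio.h>

/* Explicit value of a digit string under the weighted-system rule
 * w0 = 1, wi = w(i-1) * r(i-1). Digits and radices are listed here
 * with index 0 as the least significant position.                   */
long weighted_value(const int *digit, const int *radix, int n)
{
    long value = 0;
    long w = 1;                      /* w0 = 1 */
    for (int i = 0; i < n; i++) {
        value += (long)digit[i] * w;
        w *= radix[i];               /* w(i+1) = wi * ri */
    }
    return value;
}

int main(void)
{
    /* mixed-radix time example from the text, listed LSD first:
     * seconds, minutes, hours, days, weeks, tens of weeks          */
    int radix[6] = { 60, 60, 24, 7, 10, 10 };
    int digit[6] = { 30, 15, 5, 2, 3, 0 };   /* 3 weeks, 2 days, 5 h, 15 min, 30 s */
    printf("%ld seconds\n", weighted_value(digit, radix, 6));  /* prints 2006130 */
    return 0;
}

With a fixed radix of 2 in every position the same routine reduces to ordinary binary positional weighting.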
The relationship between the implicit value and the explicit value is best illustrated by Table 86.8. Representation of Signed Integers The two most common representations of signed integers are sign and magnitude (SM) representation and true and complement (TC) representation. While SM representation might be easier to understand and convert to and from, it has its own problems. Therefore, we will find TC representation to be more commonly used. Sign and Magnitude Representation (SM).In SM representation signed integer X i is represented by sign bit x s and magnitude x m (x s , x m ). Usually 0 represents the positive sign (+) and 1 represents the negative sign (–). The magnitude of the number x m can be represented in any way chosen for the representation of positive integers. A disadvantage of SM representation is that two representations of zero exist, positive and negative zero: x s = 0, x m = 0 and x s = 1, x m = 0. True and Complement Representation (TC). In TC representation there is no separate bit used to represent the sign. Mapping between the explicit and implicit value is defined as TABLE 86.8The Relationship between the Implicit Value and the Explicit Value Implied Attributes: Radix Point, Expression for Implicit Value X i Numerical Implicit Value X i Negative Number Representation, Others as a Function of Explicit Value x e (in Decimal) Integer magnitude X i = x e 27 Integer, two’s complement X i = –2 5 + x e –5 Integer, one’s complement X i = –(2 5 – 1) + x e –4 Fraction, magnitude X i = –2 –5 x e 27/32 Fraction, two’s complement X i = –2 –4 (2 –5 + x e ) –5/16 Fraction, one’s complement X i = –2 –4 (2 –5 + 1 + x e ) –4/16 Source: A. Avizienis, “Digital computer arithmetic: A unified algorithmic specification,” in Symp. Computers and Automata, Polytechnic Institute of Brooklyn, April 13–15, 1971. ? 2000 by CRC Press LLC The illustration of TC mapping is given in Table 86.9. In this representation positive integers are represented in the true form, while negative integers are represented in the complement form. With respect to how the complementation constant C is chosen, we can further distinguish two representa- tions within the TC system. If the complementation constant is chosen to be equal to the range of possible values taken by x e , C = r n in a conventional number system where 0 £ x e £ r n – 1, then we have defined the range complement (RC) system. If, on the other hand, the complementation constant is chosen to be C = r n – 1, we have defined the diminished radix complement (DRC) (also known as the digit complement [DC]) number system. Representations of the RC and DRC number representation systems are shown in Table 86.10. As can be seen from Table 86.10, the RC system provides for one unique representation of zero because the complementation constant C = r n falls outside the range. There are two representations of zero in the DRC system, x e = 0 and r n – 1. The RC representation is not symmetrical, and it is not a closed system under the change of sign operation. The range for RC is [–1?2r n , 1?2r n – 1]. The DC is symmetrical and has the range of [–(1?2r n – 1), 1?2r n – 1]. For the radix r = 2, RC and DRC number representations are commonly known as two’s complement and one’s complement number representation systems, respectively. Those two representations are illustrated by an example in Table 86.11 for the range of values –(4 £ X i £ 3). 
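Before turning to the arithmetic algorithms, the following minimal C sketch (not from the handbook) maps an n-bit explicit value x_e to its implicit value under the two TC systems just described, reproducing the 3-bit example of Table 86.11; the function names are illustrative assumptions.

#include <stdio.h>

/* Implicit value of an n-bit explicit value xe under the two TC systems:
 * range complement (two's complement, C = 2^n) and diminished radix
 * complement (one's complement, C = 2^n - 1).                            */
int twos_value(unsigned xe, int n)
{
    return (int)xe < (1 << (n - 1)) ? (int)xe : (int)xe - (1 << n);
}

int ones_value(unsigned xe, int n)
{
    return (int)xe < (1 << (n - 1)) ? (int)xe : (int)xe - ((1 << n) - 1);
}

int main(void)
{
    /* reproduce the 3-bit example of Table 86.11 */
    for (unsigned xe = 0; xe < 8; xe++)
        printf("xe = %u%u%u  two's: %+d  one's: %+d\n",
               (xe >> 2) & 1, (xe >> 1) & 1, xe & 1,
               twos_value(xe, 3), ones_value(xe, 3));
    return 0;
}

The output shows the single zero and the asymmetric range (-4 to +3) of two's complement, and the two zeros (+0 and -0) of one's complement noted in the text.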
Algorithms for Basic Arithmetic Operations The algorithms for the arithmetic operation are dependent on the number representation system used. There- fore, their implementation should be examined for each number representation system separately, given that the complexity of the algorithm, as well as its hardware implementation, is dependent on it. Addition and Subtraction in Sign and Magnitude System In the SM number system addition/subtraction is performed on pairs (u s , u m ) and (w s , w m ) resulting in a sum (s s , s m ), where u s and w s are sign bits and u m and w m are magnitudes. The algorithm is relatively complex because it requires comparisons of the signs and magnitudes as well. Extending the addition algorithm in order to perform subtraction is relatively easy because it only involves change of the sign of the operand being subtracted. Therefore, we will consider only the addition algorithm. The algorithm can be described as if u s = w s (signs are equal) then s s = u s and s m = u m + w m (operation includes checking for the overflow) X xxC xC xC i ee = < > ì í ? / –/ 2 2 TABLE 86.10Mapping of the Explicit Value x e into RC and DRC Number Representations x e X i (RC) X i (DRC) 000 111 222 MMM 1?2r n – 1 1?2r n – 1 1?2r n – 1 1?2r n –1?2r n –(1?2r n – 1) MMM r n – 2 –2 –1 r n – 1 –1 0 TABLE 86.9True and Complement Mapping x e X i 00 11 22 MM C/2 –1 C/2 –1 C/2 +1 –(C/2 +1) MM C –2 –2 C –1 –1 C0 ? 2000 by CRC Press LLC if u s 1 w s then if u m > w m : s m = u m – w m s s = u s else: s m = w m – u m s s = w s Addition and Subtraction in True and Complement System Addition in the TC system is relatively simple. It is sufficient to perform modulo addition of the explicit values; therefore, s e = (u e + w e ) mod C Proof will be omitted. In the RC number system this is equivalent to passing the operands through an adder and discarding the carry-out of the most significant position of the adder which is equivalent to performing the modulo addition (given that C = r n ). In the DRC number system the complementation constant is C = r n – 1. Modulo addition in this case is performed by subtracting r n and adding 1. It turns out that this operation can be performed by simply passing the operands through an adder and feeding the carry-out from the most significant digit position into the carry-in at the least significant digit position. This is also called addition with end-around carry. Subtracting two numbers is performed by simply changing the sign of the operand to be subtracted preceding the addition operation. Change of Sign Operation. The change of sign operation involves the following operation: W i = –Z i w e = (–z e ) = (–z e ) mod C = C – Z i mod C = C – z e which means that the change of sign operation consists of subtracting the operand z e from the complementation constant C. In the DRC system complementation is performed by simply complementing each digit of the operand Z i with respect to r – 1. In the case of r = 2, this results in a simple inversion of bits. In the RC system the complementation is performed by complementing each digit of the operand Z i with respect to r – 1 and adding 1 to the resulting z e . Implementation of Addition Carry Look-Ahead Adder (CLA) The first significant speed improvement in the implementation of a parallel adder was a carry-look-ahead adder (CLA) developed by Weinberger and Smith [1958] in 1958. 
The CLA is one of the fastest schemes used for the addition of two numbers even today given that the delay incurred to add two numbers is logarithmically TABLE 86.11 Two’s Complement and One’s Complement Representation Two’s Complement, C = 8 One’s Complement, C = 7 X i x e X i (2’s complement) x e X i (1’s complement) 3 3 011 3 011 2 2 010 2 010 1 1 001 1 001 0 0 000 0 000 –0 0 000 7 111 –1 7 111 6 110 –2 6 110 5 101 –3 5 101 4 100 –4 4 100 3 — ? 2000 by CRC Press LLC dependent on the size of the operands (delay = log[N]). The concept of CLA is illustrated in Fig. 86.2a and b. For each bit position of the adder, a pair of signals (p i , g i ) is generated in parallel. It is possible to generate local carries using (p i , g i ) as seen in the equations. Those signals are designated as: p i = carry-propagate and g i = carry-generate because they take part in propagation and generation of carry signal C i-1 . However, each bit position requires an incoming carry signal C i-1 in order to generate the outgoing carry C i . This makes the addition slow because the carry signal has to ripple from stage to stage as shown in Fig. 86.2a. The adder can be divided into the groups and the carry-generate and carry-propagate signals can be calculated for the entire group (G,P). This will take an additional time equivalent to and AND-OR delay of the logic. However, now we can calculate each group’s carry signals in an additional AND-OR delay. for the generation of the carry signal from the adder only the incoming carry signal into the group is now required. Therefore, the rippling of the carry is limited only to the groups. In the next step we may calculate generate and propagate signals for the group of groups (G*, P*) and continue in that fashion until we have only one group left generating the C out signal from the adder. This process will terminate in log number of steps given that we are generating a tree structure for generation of carries. The computation of carries within the groups is done individually as illustrated in Fig. 86.2a and this process requires only the incoming carry into the group. The logarithmic dependence on the delay (delay = log[N]) is only valid under the assumption that the gate delay is constant without depending on the fan-out and fan-in of the gate. In practice this is not true and even when the bipolar technology (which does not exhibit strong dependence on the fan-out) is used to implement CLA structure, the further expansion of the carry-block is not possible given the practical limitations on the fan-in of the gate. In CMOS technology this situation is much different given that CMOS gate has strong dependency not only on fan-in but on fan-out as well. This limitation takes away much of the advantages gained by using the CLA scheme. However, by clever optimization of the critical path and appropriate use of dynamic logic the CLA scheme can still be advantageous, especially for the adders of a larger size. FIGURE 86.2Carry Look-Ahead adder structure. (a) Generation of carry generate and propagate signals and (b) generation of group signals G, P and intermediate carries. ? 2000 by CRC Press LLC Conditional-Sum Addition Another one of the fast schemes for addition of two numbers that predates CLA is conditional-sum addition (CSA) proposed by Sklansky [1960] in 1960. The essence of the CSA scheme is the realization that we can add two numbers without waiting for the carry signal to be available. 
Simply, the numbers are added twice: once assuming C_in = 0 and once assuming C_in = 1. The results, Sum_0, Sum_1 and Carry_0, Carry_1, are presented at the inputs of a multiplexer. The final values are selected when C_in arrives at the "select" input of the multiplexer. As in the CLA, the input bits are divided into groups which are added "conditionally". It is apparent that, starting from the least significant bit (LSB) position, the hardware complexity starts to grow rapidly. Therefore, in practice, a full-blown implementation of the CSA is not often seen. However, the idea of adding the most significant bit (MSB) portion conditionally and selecting the results once the carry-in signal has been computed in the LSB portion is attractive. Such a scheme (which is a subset of CSA) is known as a "carry-select adder". A 26-bit carry-select adder consisting of two 13-bit portions is shown in Fig. 86.3.

FIGURE 86.3 26-bit carry-select adder.

Multiplication Algorithm

The multiplication operation is performed in a variety of forms in hardware and software. In the beginning of computer development any complex operation was usually programmed in software or coded in the microcode of the machine; some limited hardware assistance was provided. Today it is more likely to find a full hardware implementation of multiplication, for reasons of speed and reduced cost of hardware. In all of these forms, however, multiplication shares the basic algorithm, with some adaptations and modifications to the particular implementation and number system used. For simplicity we will describe a basic multiplication algorithm that operates on positive n-bit-long integers X and Y, resulting in the product P, which is 2n bits long:

    P = X × Y = X × Σ (i = 0 to n–1) y_i r^i = Σ (i = 0 to n–1) X × y_i r^i

This expression indicates that the multiplication process is performed by summing n partial-product terms X × y_i r^i. The ith term is obtained by a simple arithmetic left shift of X by i positions and multiplication by the single digit y_i. For the binary radix r = 2, y_i is 0 or 1 and multiplication by the digit y_i is very simple to perform. The addition of the n terms can be performed at once, by passing the partial products through a network of adders (which is the case in a full hardware multiplier), or sequentially, by passing the partial product through an adder n times. The algorithm to perform multiplication of X and Y can be described as [Ercegovac, 1985]

    p^(0) = 0
    p^(j+1) = (1/r)(p^(j) + r^n × X × y_j)    for j = 0, …, n – 1

It can be easily proved that this recurrence results in p^(n) = XY.

Various modifications of the multiplication algorithm exist; one of the most famous is the Booth recoding algorithm described by Booth [1951]. This algorithm allows for a reduction of the number of partial products, thus speeding up the multiplication process. Generally speaking, the Booth algorithm is a case of using a redundant number system with a radix higher than 2.

Implementation of the Multiplication Algorithm

The speed of the multiply operation is of utmost importance in digital signal processors (DSPs) today, as well as in general-purpose processors. Therefore, research on building a fast parallel multiplier has been going on since the paper published by Wallace [1964]. In his historic paper, Wallace introduced a way of summing the partial product bits in parallel using a tree of carry-save adders, which became generally known as the "Wallace Tree" (Fig. 86.4). A small sketch of the carry-save (3:2) reduction step from which such a tree is built is given below.
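The following is a minimal C sketch (not from the handbook) of one level of carry-save reduction, the building block of the Wallace Tree: a full adder (3:2 counter) in every bit position reduces three operands to a sum word and a carry word with no carry propagation across positions; the function name carry_save_add is an illustrative assumption.

#include <stdio.h>
#include <stdint.h>

/* One level of carry-save reduction: a full adder (3:2 counter) in every
 * bit position reduces three operands to a sum word and a carry word.
 * No carry ripples across bit positions, so the delay of one level is a
 * single full-adder delay regardless of the word length.                 */
void carry_save_add(uint32_t a, uint32_t b, uint32_t c,
                    uint32_t *sum, uint32_t *carry)
{
    *sum   = a ^ b ^ c;                              /* bitwise FA sum                 */
    *carry = ((a & b) | (a & c) | (b & c)) << 1;     /* FA carries, weight 2^(i+1)     */
}

int main(void)
{
    uint32_t s, cy;
    carry_save_add(13, 25, 7, &s, &cy);
    /* a single carry-propagate addition finishes the job: 13 + 25 + 7 = 45 */
    printf("sum word = %u, carry word = %u, total = %u\n",
           (unsigned)s, (unsigned)cy, (unsigned)(s + cy));
    return 0;
}

A tree of such levels reduces the n partial products of a multiplier to two words, which a single fast carry-propagate adder then combines into the final product.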
A suggestion for speed improvement of such a process of adding partial product bits in parallel followed in the paper published by Dadda [1965]. In his 1965 paper, Dadda introduced a notion of a counter structure that will take a number of bits p in the same bit position (of the same “weight”) and output a number q that represents the count of ones in the input. Dadda has introduced a number of ways to compress the partial product bits using such a counter, which later became known as Dadda’s counter. The quest for making the parallel multiplier even faster continued for almost 30 years. The search for producing a fastest “counter” did not result in a general structure that yielded a faster partial product summation than that which used a full-adder (FA) cell, or 3:2 counter. Therefore, the use of the Wallace Tree was almost prevalent in the implementation of the parallel multipliers. In 1981, Weinberger disclosed a structure he called “4-2 carry-save module”. This structure contained a combination of FA cells in an intricate interconnection structure that was yielding faster partial product compression than the use of 3:2 counters. FIGURE 86.4Wallace Tree. ? 2000 by CRC Press LLC The 4:2 compressor (Fig. 86.5) actually compresses five partial product bits into three; however, it is connected in such a way that four of the inputs are coming from the same bit position of the weight j while one bit is fed from the neighboring position j-1 (known as carry-in). The output of such a 4-2 module consists of one bit in the position j and two bits in the position j+1. This structure does not represent a counter (though it became erroneously known as a “4-2 counter”) but a “compressor” which would compress four partial product bits into two (while using one bit laterally connected between adjacent 4-2 compressors). The efficiency of such a structure is higher (it reduces the number of partial product bits by one half). The speed of such a 4-2 compressor has been determined by the speed of 3 XOR gates in series (in the redesigned version of 4-2 compressor) making such a scheme more efficient that the one using 3:2 counters in a regular Wallace Tree. The other equally important feature of the use of a 4-2 compressor is that the interconnections between such cells follow more regular patterns than in the case of the Wallace Tree. Booth Encoding Booth’s algorithm [Booth, 1951] is widely used in the implementations of hardware or software multipliers because its application makes it possible to reduce the number of partial products. It can be used for both sign- magnitude numbers as well as 2’s complement numbers with no need for a correction term or a correction step. A modification of the Booth algorithm was proposed by MacSorley [1961] in which a triplet of bits is scanned instead of two bits. This technique has the advantage of reducing the number of partial prod- ucts by one half regardless of the inputs. This is summarized in Table 86.12. The recoding is performed within two steps: encoding and selec- tion. The purpose of the encoding is to scan the triplet of bits of the multiplier and define the operation to be performed on the multi- plicand, as shown in Table 86.8. This method is actually an applica- tion of a sign-digit representation in radix 4. The Booth-MacSorley algorithm, usually called the Modified Booth algorithm or simply the Booth algorithm, can be generalized to any radix. 
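To make the recoding of Table 86.12 concrete, the following is a minimal C sketch (not from the handbook) that scans overlapping bit triplets of a multiplier and produces the radix-4 digit set {0, ±1, ±2}; the variable names and the 8-bit example multiplier are illustrative assumptions.

#include <stdio.h>

/* Radix-4 (modified Booth) recoding of an n-bit two's complement multiplier y.
 * Overlapping triplets (y[i+1], y[i], y[i-1]) with y[-1] = 0 are scanned two
 * bit positions at a time; each triplet selects 0, +/-1, or +/-2 times the
 * multiplicand per Table 86.12, so roughly n/2 partial products result.      */
int main(void)
{
    const int recode[8] = { 0, +1, +1, +2, -2, -1, -1, 0 };  /* indexed by the triplet */
    int y = 0x5D;            /* example multiplier 01011101, n = 8 bits */
    int n = 8;
    int value = 0;
    int weight = 1;          /* 4^0, 4^1, 4^2, ... */

    for (int i = 0; i < n; i += 2) {
        int triplet = (((y >> (i + 1)) & 1) << 2)
                    | (((y >> i) & 1) << 1)
                    | (i ? (y >> (i - 1)) & 1 : 0);
        int digit = recode[triplet];
        printf("bits %d..%d: digit %+d (partial product %+d x Y x 4^%d)\n",
               i + 1, i, digit, digit, i / 2);
        value += digit * weight;
        weight *= 4;
    }
    printf("recoded value = %d (equals the multiplier, %d)\n", value, y);
    return 0;
}

For the example multiplier the four recoded digits are +1, -1, +2, +1, and their weighted sum reconstructs the original value, confirming that only four partial products are needed where an unrecoded 8-bit multiplier would generate eight.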
Booth recoding necessitates the internal use of 2's complement representation in order to efficiently perform subtraction of the partial products as well as additions. However, the floating-point standard specifies sign-magnitude representation, which is also followed by most of the nonstandard floating-point formats in use today. The advantage of Booth recoding is that it generates only half as many partial products as a multiplier implementation that does not use Booth recoding. However, this benefit comes at the expense of increased hardware complexity. Indeed, the implementation requires hardware for the encoding and for the selection of the partial products (0, ±Y, ±2Y). An optimized encoding is shown in Fig. 86.6.

FIGURE 86.5 4:2 compressor.

TABLE 86.12 Modified Booth Recoding

x_i+2 x_i+1 x_i    Add to Partial Product
0 0 0              +0Y
0 0 1              +1Y
0 1 0              +1Y
0 1 1              +2Y
1 0 0              –2Y
1 0 1              –1Y
1 1 0              –1Y
1 1 1              –0Y

© 2000 by CRC Press LLC

Division Algorithm

Division is a more complex process to implement because, unlike multiplication, it involves guessing the digits of the quotient. Here, we will consider an algorithm for division of two positive integers designated as dividend Y and divisor X, resulting in a quotient Q and an integer remainder Z according to the relation

Y = XQ + Z

In this case the dividend contains 2n digits and the divisor has n digits, in order to produce a quotient with n digits. The algorithm for division is given by the following recurrence relationship [Ercegovac, 1985]:

$z^{(0)} = Y$
$z^{(j+1)} = r z^{(j)} - X r^n q_{n-1-j}$ for $j = 0, \ldots, n-1$

This recurrence relation yields

$z^{(n)} = r^n (Y - XQ)$, that is, $Y = XQ + z^{(n)} r^{-n}$

which defines the division process with remainder $Z = z^{(n)} r^{-n}$. The selection of the quotient digit is done by requiring that 0 ≤ Z < X at each step in the division process. This selection is a crucial part of the algorithm, and the best-known methods are the restoring and nonrestoring division algorithms. In the former, the value of the tentative partial remainder z^(j) is restored after a wrong guess of the quotient digit q_j is made; in the latter, this correction is not done in a separate step but rather in the step that follows. The best-known division algorithm is the so-called SRT algorithm, independently developed by Sweeney, Robertson, and Tocher. Algorithms for higher radices were further developed by Robertson and his students, most notably Ercegovac.

Defining Terms

4:2 Compressor: A structure used in the partial product reduction tree of a parallel multiplier for achieving faster and more efficient reduction of the partial product bits.

Algorithm: Decomposition of the computation into subcomputations with an associated precedence relation that determines the order in which these subcomputations are performed [Ercegovac, 1985].

FIGURE 86.6 Booth encoder.

© 2000 by CRC Press LLC

Booth-MacSorley algorithm: An algorithm used for recoding of the multiplier such that the number of partial products is roughly reduced by a factor of two. It is a special case of the application of the redundant number system to represent the multiplier.

Carry look-ahead adder: An implementation technique for addition that accelerates the propagation of the carry signal, thus increasing the speed of the addition operation.

Dadda's counter: A generalized structure used to produce a number (count) representing the number of bits that are "one". It is used for efficient reduction of the partial products in a parallel multiplier.
Explicit value x e : A value associated with the bit string according to the rule defined by the number repre- sentation system being used. Implicit value X i : The value obtained by applying the arithmetic function defined for the interpretation of the explicit value x e . Nonredundant number system: The system where for each bit string there is one and only one corresponding numerical value x e . Number representation system: A defined rule that associates one numerical value x e with every valid bit string x. Redundant number system: The system in which the numerical value x e could be represented by more than one bit string. SRT algorithm: An algorithm for division of binary numbers which uses redundant number representation. Wallace tree: A technique for summing the partial product bits of a parallel multiplier in a carry-save fashion using full-adder cells. Related Topic 86.1 Number Systems References A. Avizienis, “Digital computer arithmetic: A unified algorithmic specification,” in Symposium on Computers and Automata, Polytechnic Institute of Brooklyn, April 13–15, 1971. A. D. Booth, “A signed binary multiplication technique”, Quarterly J. Mechan. Appl. Math., vol. IV, 1951. L. Dadda, “Some schemes for parallel multipliers,” Alta Frequenza, 34, 349–356, 1965. M. Ercegovac, Digital Systems and Hardware/Firmware Algorithms, New York: Wiley, 1985, chap. 12. O. L. MacSorley, “High speed arithmetic in binary computers”, Proc. IRE, 49(1), 1961. V. G. Oklobdzija and E. R. Barnes, “Some optimal schemes for ALU implementation in VLSI technology”, Proceedings of 7th Symposium on Computer Arithmetic, Urbana, Ill.: University of Illinois, June 4–6, 1985. Sklanski, “Conditional-sum addition logic”, IRE Trans. Electron. Computers, EC-9, 226–231, 1960. C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Electron. Computers, EC-13, 14–17, 1964. S. Waser and M. Flynn, Introduction to Arithmetic for Digital Systems Designers, New York: Holt, 1982. Weinberger and J. L. Smith, “A logic for high-speed addition”, National Bureau of Standards, Circulation 591, p. 3–12, 1958. Further Information For more information about specific arithmetic algorithms and their implementation see K. Hwang, Computer Arithmetic: Principles, Architecture and Design, New York: Wiley, 1979 and also E. Swartzlander, Computer Arithmetic, vols. I and II, Los Alamitos, Calif.: IEEE Computer Society Press, 1980. Publications in IEEE Transactions on Electronic Computers and Proceedings of the Computer Arithmetic Symposia by various authors are very good sources for detailed information on a particular algorithm or implementation. ? 2000 by CRC Press LLC 86.3 Architecture 1 V. Carl Hamacher, Zvonko G. Vranesic, and Safwat G. Zaky Computer architecture can be defined here to mean the functional operation of the individual hardware units in a computer system and the flow of information and control among them. This is a somewhat more general definition than is sometimes used. For example, some articles and books refer to instruction set architecture or the system bus architecture. The main functional units of a single-processor system, a basic way to interconnect them, and features that are used to increase the speed with which the computer executes programs will be described. Following this, a brief introduction to systems that have more than one processor will be provided. 
Functional Units A digital computer, or simply a computer, accepts digitized input information, processes it according to a list of internally stored machine instructions, and produces the resultant output information. The list of instructions is called a program, and internal storage is called computer memory. A computer has five functionally independent main parts: input, memory, arithmetic and logic, output, and control. The input unit accepts digitally encoded information from human operators, through electromechan- ical devices such as a keyboard, or from other computers over digital communication lines. The information received is usually stored in the memory and then operated on by the arithmetic and logic unit circuitry under the control of a program. Results are sent back to the outside world through the output unit. All these actions are coordinated by the control unit. The arithmetic and logic unit, in conjunction with the main control unit, are referred to as the processor. Input and output equipment is usually combined under the term input-output unit (I/O unit). This is reasonable because some standard equipment provides both input and output functions. The simplest example of this is the video terminal consisting of a keyboard for input and a cathode-ray tube for output. The control circuits of the computer recognize two distinct devices, even though the human operator may associate them as being part of the same physical unit. The memory unit stores programs and data. There are two main classes of memory devices called primary and secondary memory. Primary storage, or main memory, is an electronic storage device, constructed from integrated circuits that consist of millions of semiconductor storage cells, each capable of storing one bit of information. These cells are accessed in groups of fixed size called words. The main memory is organized so that the contents of one word can be stored or retrieved in one basic operation called a memory cycle. To provide direct access to any word in the main memory in a short and fixed amount of time, a distinct address number is associated with each word location. A given word is accessed by specifying its address and issuing a control command that starts the storage or retrieval process. The number of bits in a word is called the word length of the computer. Word lengths vary from 16 to 64 bits. Small machines such as personal computers or workstations, may have only a few million words in the main memory, while larger machines have tens of millions of words. The time required to access a word for reading or writing is less than 100 ns. Although primary memory is essential, it tends to be expensive and volatile. Thus cheaper, more permanent, magnetic media secondary storage is used for files of information that contain programs and data. A wide selection of suitable devices is available, including magnetic disks, drums, diskettes, and tapes. Execution of most operations within a computer takes place in the arithmetic and logic unit (ALU) of a processor. Consider a typical example. Suppose that two numbers located in the main memory are to be added, and the sum is to be stored back into the memory. Using a few instructions, each consisting of a few basic steps, determined by the control unit, the operands are first fetched from the memory into the processor. They are then added in the ALU, and the result is stored back in memory. 
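As a software illustration of this sequence, the following C fragment models the basic steps at the register-transfer level; the word-addressed memory array, the register names, and the 32-bit word length are assumptions made for the example, not details from the text.

#include <stdint.h>

#define MEM_WORDS 1024
static uint32_t memory[MEM_WORDS];       /* word-addressed main memory (illustrative) */

/* Model of "add two numbers located in main memory and store the sum back":
 * each statement corresponds roughly to one basic step sequenced by the
 * control unit -- two memory read cycles, one ALU operation, one write cycle.
 */
static void add_memory_operands(uint32_t addr_a, uint32_t addr_b, uint32_t addr_sum)
{
    uint32_t reg_a, reg_b, reg_result;    /* processor registers (illustrative)  */

    reg_a      = memory[addr_a];          /* memory cycle: fetch operand A       */
    reg_b      = memory[addr_b];          /* memory cycle: fetch operand B       */
    reg_result = reg_a + reg_b;           /* ALU operation                       */
    memory[addr_sum] = reg_result;        /* memory cycle: store the sum         */
}

int main(void)
{
    memory[0] = 25;
    memory[1] = 17;
    add_memory_operands(0, 1, 2);         /* memory[2] now holds 42              */
    return 0;
}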
Processors contain a number of high-speed storage elements called registers, which are used for temporary storage of operands. Each register contains one 1 Adapted from V.C. Hamacher, Z.G. Vranesic, and S.G. Zaky, Computer Organization, 4th ed., New York: McGraw-Hill, 1996. With permission. ? 2000 by CRC Press LLC THE FIRST DIGITAL COMPUTERS f all the new technologies to emerge from World War II, none was to have such profound and pervasive impacts as the digital computer. As early as the 1830s, the Englishman Charles Babbage conceived of an “Analytical Engine” that would perform mathematical operations using punched cards, hundreds of gears, and steam power. Babbage’s idea was beyond the capabilities of 19th- century technology, but his vision represented a goal that many were to pursue in the next century and a half. In the mid-1920s, MIT electrical engineer Vannevar Bush devised the “product integraph”, a semi- automatic machine for solving problems in determining the characteristics of complex electrical systems. This was followed a few years later by the “differential analyzer”, the first general equation-solving machine. These machines were mechanical, analog devices, but at the same time that they were being built and copied, the principles of electrical, digital machines were being laid out. In 1937, Claude Shannon published in the Transactions of the AIEE the circuit principles for an “electric adder to the base of two”, and George Stibitz of Bell Labs built such an adding device on his own kitchen table. In that same year, Howard Aiken, then a student at Harvard, proposed a gigantic calculating machine that could be used for everything from vacuum tube design to problems in relativistic physics. O ? 2000 by CRC Press LLC word of data and its access time is about 10 times faster than main memory access time. Large-scale micro- electronic fabrication techniques allow whole processors to be implemented on a single semiconductor chip containing a few million transistors. Basic Operational Concepts To perform a given computational task, an appropriate program consisting of a set of machine instructions is stored in the main memory. Individual instructions are brought from the memory into the processor for execution. Data used as operands are also stored in the memory. A typical instruction may be MOVE MEMLOC, Ri This instruction loads a copy of the operand at memory location MEMLOC into the processor register Ri. The instruction requires a few basic steps to be performed. First, the instruction must be transferred from the memory into the processor, where it is decoded. Then the operand at location MEMLOC must be fetched into the processor. Finally, the operand is placed into register R1. After operands are loaded into the processor registers in this way, instructions such as ADD Ri, Rj, Rk can be used to add the contents of registers Ri and Rj, and then place the result into register Rk. Instruction set design has been intensively studied in recent years to determine the effectiveness of the various alternatives. See Patterson and Hennessey [1994] for a thorough discussion. With support from Thomas J. Watson, president of IBM, Aiken was able to build his machine, the “Automatic Sequence Controlled Calculator”, or “Mark I”. When it was finished in 1944, the Mark I was quickly pressed into war service, calculating ballistics problems for the Navy. In 1943, the government contracted with John W. Mauchly and J. 
Presper Eckert of the University of Pennsylvania to build the “Electronic Numerical Integrator and Computer”—the first true electronic digital computer. When the ENIAC was finally dedicated in February 1946, it was both a marvel and a monster—weighing 30 tons, consuming 150 kW of power, and using 18,000 vacuum tubes. With all of this, it could perform 5,000 additions or 400 multiplications per second, which was about one thousand The ENIAC, pictured above, was the first true electronic digital computer. Early pro- ? 2000 by CRC Press LLC The connection between the main memory and the processor that allows for the transfer of instructions and operands is called the bus, as shown in Fig. 86.7. A bus consists of a set of address, data, and control lines. The bus also allows program and data files to be transferred from their long-term location on magnetic disk storage to the main memory. Long distance digital communication with other computers is also enabled by transfers over the bus to the Communication Line Interface, as shown in the figure. The bus interconnects a number of devices, but only two devices (a sender and a receiver) can use it at any one time. Therefore, some control circuitry is needed to manage the orderly use of the bus when a number of devices wish to use it. Normal execution of programs may sometimes be preempted if some I/O device requires urgent control action or servicing. For example, a monitoring device in a computer-controlled industrial process may detect a dangerous condition that requires the execution of a special service program dedicated to the device. To cause this service program to be executed, the device sends an interrupt signal to the processor. The processor temporarily suspends the program that is being executed and executes the special interrupt service routine. After providing the required service, the processor switches back to the interrupted program. To appreciate the complexity of the computer system software programs needed to control such switching from one program task to another and to manage the general movement of programs and data between primary and secondary storage, consult Tanenbaum [1990]. The need often arises during program loading and execution to transfer blocks of data between the main memory and a disk or other secondary storage I/O devices. Special control circuits are provided to manage these transfers without detailed control actions from the main processor. Such transfers are referred to as direct memory access (DMA). Assuming that accesses to the main memory from both I/O devices (such as disks) and grammers set up problems by plugging in cables and setting switches. ENIAC could perform calculations about one thousand times faster than any other machine of its day. (Photo courtesy of the IEEE Center for the History of Electrical Engineering.) times faster than any other machine of the day. The ENIAC showed the immense possibilities of digital electronic computers. These possibilities occupied engineers and mathematicians for the coming decades. For electrical engineers, the computer represented a challenge and responsibility for the most powerful new machine of the twentieth century. (Courtesy of the IEEE Center for the History of Electrical Engineering.) the main processor can be appropriately interwoven over the bus, I/O-memory transfers and computation in the main processor can proceed in parallel, and performance of the overall system is improved. 
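The interrupt mechanism described above can be summarized with a small, self-contained C model of the instruction loop: the processor completes the current instruction, checks for a pending request, runs the service routine if one is present, and then resumes the interrupted program. The machine model, flag, and routine names below are hypothetical, purely for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool interrupt_request = false;        /* in a real system, set by an I/O device */

static uint32_t fetch_instruction(void)       /* stand-in for the fetch step            */
{
    static uint32_t pc = 0;
    return pc++;                              /* the "instruction" is just its address  */
}

static void decode_and_execute(uint32_t instr)
{
    if (instr == 3)                           /* pretend a device raises a request here */
        interrupt_request = true;
}

static void interrupt_service_routine(void)
{
    printf("servicing urgent device request\n");
}

int main(void)
{
    for (int i = 0; i < 8; i++) {             /* a few instruction cycles               */
        decode_and_execute(fetch_instruction());
        if (interrupt_request) {              /* check after each instruction completes */
            interrupt_request = false;
            interrupt_service_routine();      /* program temporarily suspended here     */
        }                                     /* then execution of the program resumes  */
    }
    return 0;
}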
Performance A major performance measure for computer systems is the time, T, that it takes to execute a complete program for some task. Suppose N machine instructions need to be executed to perform the task. A program is typically written in some high-level language, translated by a compiler program into machine language, and stored on a disk. An operating system software routine then loads the machine language program into the main memory, ready for execution. Assume that, on average, each machine language instruction requires S basic steps for its execution. If basic steps are executed at the rate of R steps per second, then the time to execute the program is T = (N ′ S)/R The main goal in computer architecture is to develop features that minimize T. We will now give an outline of main memory and processor design features that help to achieve this goal. The first concept is that of a memory hierarchy. We have already noted that access to operands in processor registers is significantly faster than access to the main memory. Suppose that when instructions and data are first loaded into the processor, they are stored in a small, fast, cache memory on the processor chip itself. If instructions and data in the cache are accessed repeatedly within a short period of time, as happens often with program loops, then program execution will be speeded up. The cache can only hold small parts of the executing program. When the cache is full, its contents are replaced by new instructions and data as they are fetched from the main memory. A variety of cache replacement algorithms are in use. The objective of these algorithms is to maximize the probability that the instructions and data needed for program execution are found in the cache. This probability is known as the cache hit ratio. A higher hit ratio means that a larger percentage of the instructions and data are being found in the cache, and do not require access to the slower main memory. This leads to a reduction in the memory access basic step time components of S, and hence to a smaller value of T. The basic idea of a cache can be applied at different points in a computer system, resulting in a hierarchy of storage units. A typical memory hierarchy is shown in Fig. 86.8. Some systems have two levels of cache to take the best advantage of size/speed/cost tradeoffs. The main memory is usually not large enough to contain all of the programs and their data. Therefore, the highest level in the memory hierarchy is usually magnetic disk storage. As the figure indicates, it has the largest capacity, but the slowest access time. Segments of a program, often called pages, are transferred from the disk to the main memory for execution. As other pages are needed, they may replace the pages already in the main memory if the main memory is full. The orderly, FIGURE 86.7Interconnection of major components in a computer system. ? 2000 by CRC Press LLC automatic movement of large program and data segments between the main memory and the disk, as programs execute, is managed by a combination of operating system software and control hardware. This is referred to as memory management. We have implicitly assumed that instructions are executed one after another. Most modern processors are designed to allow the execution of successive instructions to overlap, using a technique known as pipelining. In the example in Fig. 
86.9, each instruction is broken down into 4 basic steps—fetch, decode, operate, and write—and a separate hardware unit is provided to perform each of these steps. As a result, the execution of FIGURE 86.8Memory hierarchy. FIGURE 86.9Pipelining of instruction execution. ? 2000 by CRC Press LLC successive instructions can be overlapped as shown, resulting in an instruction completion rate of one per basic time step. If the execution overlap pattern shown in the figure can be maintained for long periods of time, the effective value of S tends toward 1. When the execution of some instruction I depends on the results of a previous instruction, J, which is not yet completed, instruction I must be delayed. The pipeline is said to be stalled, waiting for the execution of instruction J to be completed. While it is not possible to eliminate such situations altogether, it is important to minimize the probability of their occurrence. This is a key consideration in the design of the instruction set of modern processors and the design of the compilers that translate high-level language programs into machine language. Now, imagine that multiple functional units are provided in the processor so that more than one instruction can be in the operate stage. This parallel execution capability, when added to pipelining of the individual instructions, means that execution rates of more than one instruction completion per basic step time can be achieved. This mode of enhanced processor performance is called superscalar processing. The rate, R, of performing basic steps in the processor is usually referred to as the processor clock rate; and it is of the order of 100 to 200 million steps per second in current high-performance VLSI processors. This rate is determined by the technology used in fabricating the processors, and is strongly related to the size or area occupied by individual transistors. This size feature, which is currently in the submicron range, has been steadily decreasing as fabrication techniques improve, allowing increases in R to be achieved. Multiprocessors Physical limits on electronic speeds prevent single processors from being speeded up indefinitely. A major design trend has seen the development of systems that consist of a large number of processors. Such multipro- cessors can be used to speed up the execution of large programs by executing subtasks in parallel. The main difficulty in achieving this type of speedup is in being able to decompose a given task into its parallel subtasks and assign these subtasks to the individual processors in such a way that communication among the subtasks can be done efficiently. Fig. 86.10 shows a block diagram of a multiprocessor system, with the interconnection network needed for data sharing among the processors Pi. Parallel paths are needed in this network in order for parallel activity to proceed in the processors as they access the global memory space represented by the multiple memory units Mi. Defining Terms Arithmetic and logic unit:The logic gates and register storage elements used to perform the basic operations of addition, subtraction, multiplication, and division of numeric operands, and the comparison, shifting, and alignment operations on more general forms of numeric and nonnumeric data. Bus:The collection of data, address, and control lines that enables exchange of information, usually in word- size quantities, among the various computer system units. In practice, a large number of units can be connected to a single bus. 
These units contend in an orderly way for the use of the bus for individual transfers. Cache memory:A high-speed memory for temporary storage of copies of the sections of program and data from the main memory that are currently active during program execution. FIGURE 86.10A multiprocessor system. ? 2000 by CRC Press LLC Computer architecture: The functional operation of the individual hardware units in a computer system and the flow of information and control among them. Control unit: The circuits required for sequencing the basic steps needed to execute machine instructions. Input-output unit (I/O): The equipment and controls necessary for a computer to interact with a human operator, to access mass storage devices such as disks and tapes, or to communicate with other computer systems over communication networks. Memory hierarchy: The collection of cache, primary, and secondary memory units that comprise the total storage capability in the computer system. Memory management: The combination of operating system software and hardware controls that is needed to access and move program and data segments up and down the memory hierarchy during program execution. Memory unit: The unit responsible for storing programs and data. There are two main types of units: primary memory, consisting of millions of bit storage cells fabricated from electronic semiconductor integrated circuits, used to hold programs and data during program execution; and secondary memory, based on magnetic disks, diskettes, and tapes, used to store permanent copies of programs and data. Multiprocessor: a computer system comprising multiple processors and main memory unit modules, con- nected by a network that allows parallel activity to proceed efficiently among these units in executing program tasks that have been sectioned into subtasks and assigned to the processors. Pipelining: The overlapped execution of the multiple steps of successive instructions of a machine language program, leading to a higher rate of instruction completion than can be attained by executing instructions strictly one after another. Processor: The arithmetic and logic unit combined with the control unit needed to sequence the execution of instructions. Some cache memory is also included in the processor. Superscalar processing: The ability to execute instructions at a completion rate that is faster than the normal pipelined rate, by providing multiple functional units in the pipeline to allow a small number of instruc- tions to proceed through the pipeline process in parallel. Related Topics 86.2 Computer Arithmetic ? 86.4 Microprogramming References V.C. Hamacher, Z.G. Vranesic, and S.G. Zaky, Computer Organization, 4th ed., New York: McGraw-Hill, 1996. D. A. Patterson and J. L. Hennessey, Computer Organization and Design—The Hardware/Software Interface, San Mateo, Calif.: Morgan Kaufman, 1994. A.S. Tanenbaum, Structured Computer Organization, 3rd ed., Englewood Cliffs, N.J.: Prentice-Hall, 1990. Further Information The IEEE magazines Computer, Micro, and Software all have interesting articles on subjects related to computer architecture, including software aspects. Also, articles on computer architecture occasionally appear in Scientific American. 86.4 Microprogramming Jacques Raymond Since the 1950s when Wilkes et. al. [1958] defined the term and the concept, microprogramming has been used as a clean and systematic way to define the instruction set of a computer. It has also been used to define a virtual architecture out of a real hardwired one. 
? 2000 by CRC Press LLC Levels of Programming In Fig. 86.12, we see that a computer application is usually realized by programming a given algorithm in a high-level language. A system offering a high-level language capability is implemented at the system level via a compiler. The operating system is (usually) implemented in a lower-level language. The machine instruction set can be hardwired (in a hardware implementation) or implemented via microprogramming (Fig. 86.11). Therefore, microprogramming is simply an extra level in the general structure. Since it is used to define the machine instruction set, it can be considered at the hardware level. Since this definition is done via a program at a low level, but still eventually modifiable, it can also be considered to be at the software level. For these reasons, the term firmware has been coined to name sets of microprograms. In short, microinstructions that specify hardware functions (microoperations such as Open a path, Select operation) are used to form a more complex instruction (Convert to binary, Add decimal). The machine instruction set is defined via a set of microprogram routines and a microprogrammed instruction decoder. In a microprogrammed machine, the hardware is designed in terms of its capabilities (ALU, data paths, I/O, processing units) with little concern for how these capabilities will have to be accessed by the programmers. The microoperations are decoded from microinstructions. The way programmers view the machine is defined at the microprogramming level. This approach offers some differences over the hardwired approach. The advantages are that it is more systematic in implementation, modifiable, economical on most designs, and easier to debug. The disadvantages are that it is uneconomical on simple machines, slower, and needs support software. Like all programs, microprograms reside in memory. The term “control memory” is commonly used for microprograms. Microinstruction Structure On a given hardware, many processing functions are available. In general a subset O of these functions can be performed in parallel, for example, carrying on an addition between two registers while copying a register on an I/O bus. These functions are called microcommands. Horizontal Microinstructions Each of the fields f of a microinstruction specifies a microcommand. If the format of the microinstruction is such that all possible microcommands can be specified, the instruction is called horizontal. Most of the time, it is wasteful in memory as, in a microprogram, not every possible microcommand is specified in each microinstruction. However, it permits the microprogrammer to fully take advantage of all possible parallelisms and to build faster machines. For example, the horizontal specification of an ALU operation, FIGURE 86.11Levels of programming in a computer system. ALUOperation SourcePathA SourcePathB ResultPath CvtDecimal ? 2000 by CRC Press LLC specifies both operands, which register will contain the result, whether or not the result is to be converted to decimal, and the operation to be performed. If this instruction is, for example, part of a microprogram defining a 32-bit addition instruction, assuming a 16-bit path, it is wasteful to specify twice the source and result operands. In some cases, it is possible to design a microinstruction that specifies more microcommands than can be executed in parallel. In that case, the execution of the microcommand is carried out in more than one clock cycle. 
For this reason they are called polyphase microinstructions (as opposed to monophase).

Vertical Microinstructions

At the other extreme, if the microinstruction allows only the specification of a single microcommand at a time, the instruction is called vertical. In that case, only the necessary commands for a particular program are specified, resulting in smaller control memory requirements. However, it is not possible to take advantage of the parallelism offered by the hardware, since only one microcommand is executed at a time. For example, the vertical specification of an ALU operation is as follows:

SourceA   Reg#        1st operand
SourceB   Reg#        2nd operand
Result    Reg#        Result
ALU Op    Operation

Diagonal Microinstructions

Most cases fit in between these two extremes (see Fig. 86.13). Some parallelism is possible; however, microcommands pertaining to a given processing unit are regrouped. This results in shorter microprograms than in the vertical case and may still allow some optimization. For example, a diagonal specification of an ALU operation is as follows:

SelectSources   RegA#  RegB#      Select operands
SelectResult    Reg#   Dec/Bin    Result place and format
SelectALU       Operation         Perform the operation

Optimization

Time and space optimization studies can be performed before designing the microinstruction format. The reader is referred to Das et al. [1973] and Agerwala [1976] for details and more references.

FIGURE 86.12 A view of computer system levels.

© 2000 by CRC Press LLC

Microprogram Development

Microassemblers

The first level of specification of microinstructions is, just like its counterpart at the machine level, the assembler. Although the process and philosophy are exactly the same, it is traditionally called a microassembler. A microassembler is a software program (it is not relevant to know in which language it is written) whose function is to translate a source program into the equivalent binary code. Obviously, to write a source program, a language has to be designed. At the assembly level, languages are usually very close to the hardware structure, and the objects defined are microregisters, gate-level controls, and paths. The operations are the microoperations (sometimes slightly more sophisticated with a microassembler that provides macro facilities). This level provides an easily readable microprogram and does much to help avoid syntax errors. In binary, only the programmer can catch a faulty 1 or 0; the microassembler can catch syntax errors or some faulty register specifications. No microassembler exists that can catch all logic errors in the implementation of a given instruction algorithm; it is still very easy to make mistakes. It should be noted that this level is a good compromise between convenience and cost. The following is a typical example of a microprogram in the microassembler (it implements a 16-bit add on an 8-bit path and ALU):

CLC              Clear carry
Lod A            Get first part of first operand
Add B            Add to first part of second operand
Sto C            Give low byte of final result
Lod a            Get second part of first operand
Adc b            Add to second part of second operand and to carry bit
Sto c            Give high byte of final result
JCS Error        Jump to error routine if the result overflows 16 bits
Jmp FetchNext

FIGURE 86.13 Microinstruction fields versus microcommands.

© 2000 by CRC Press LLC

High-Level Languages for Microprogramming

Many higher-level languages have been designed and implemented; see a discussion of some design philosophies in Malik and Lewis [1978].
In principle, a high-level language program is oriented toward the application it supports and is farther away from the hardware-detailed implementation of the machine it runs on. The appli- cations supported are mostly other machine definitions (emulators) and implementations of some algorithms. The objects defined and manipulated by high-level languages for microprogramming are therefore the virtual components of virtual machines. They are usually much the same as their real counterparts: registers, paths, ALUs, microoperations, etc. Furthermore, writing a microprogram is usually defining a machine architecture. It involves a lot of intricate and complicated details, but the algorithms implemented are mostly quite simple. The advantages offered by high-level languages to write better algorithms without getting lost in implementation details are therefore not exploited. Firmware Implementation Microprogramming usually requires the regular phases of design, coding, test, conformance acceptance, doc- umentation, etc. It differs from other programming activities when it comes to the deliverable. The usual final product takes the form of hardware, namely, a control memory, PROM, ROM, or other media, containing the bit patterns equivalent to the firmware. These implementation steps require special hardware and software support. They include a linker, loader, PAL programmer, or ROM burner; a test program in a control memory test bench is also necessary. Supporting Software It is advisable to test the microprogram before its actual hard implantation, if the implantation process is irreversible or too costly to repeat. Software simulators have been implemented to allow thorough testing of the newly developed microprogram. Needless to say these tools are very specialized to a given environment and therefore costly to develop, as their development cost cannot be distributed over many applications. Emulation Concept In a microprogrammed environment, a computer architecture is softly (or firmly) defined by the microprogram designed to implement its operations, its data paths, and its machine-level instructions. It is easy to see that if one changes the microprogram for another one, then a new computer is defined. In this environment, the desired operation is simulated by the execution of the firmware, instead of being the result of action on real hardwired components. Since the word simulation was already in use for simulation of some system by software, the word emulation was chosen to mean simulation of an instruction set by firmware. Of course, “simulation” by hardware is not a simulation but the real thing. The general structure of an emulator consists of the following pseudocode algorithm: BEGIN Initialize Machine Components Repeat Fetch Instruction Emulate Operation of the current instruction Process interrupts Update instruction counter Until MachineIsOff Perform shutdown procedure END Many variations exist, in particular to process interrupts within the emulation of a lengthy operation or to optimize throughput, but the general principle and structure are fairly constant. ? 2000 by CRC Press LLC Emulation of CPU Operation One of the advantages of microprogramming is that the designer can implement his or her dream instructions simply by emulating its operation. We have seen already the code for a typical 16-bit adder, but it is not difficult to code a parity code generator, a cyclic redundancy check calculator, or an instruction that returns the eigenvalues of an n ′ n matrix. 
This part is straight programming. One consideration is to make sure that the machine is still listening to the outside world (interruptions) or actively monitoring it (I/O flags) in order not to loose asynchronous data while looking for a particular pattern in a 1 megabyte string. Another consideration is to optimize memory usage by combining common processes for different operations. For example, emulating a 32-bit add instruction and emulating a 16-bit add instruction have common parts. This is, however, a programming concern not specific to emulation. I/O System and Interrupts Programming support for I/Os and interrupts is more complicated than for straight machine instructions. This is due to the considerable speed differences between I/O devices and a CPU, the need for synchronization, the need for not losing any external event, and the concerns for optimizing processing time. Microprogramming offers considerable design flexibility, as these problems are more easily handled by programming than with hardware components. Applications of Microprogramming The main application of microprogramming is the emulation of a virtual machine architecture on a different host hardware machine. It is, however, easy to see that the concept of emulation can be broadened to other functions than the traditional hardware operation. It is mainly a matter of point of view. Emulation and simulation are essentially the same process but viewed from different levels. Realizing a 64-bit addition and implementing a communication controller are qualitatively the same type of task. Once this is considered, there are theoretically no limits to the uses of microprogramming. From the programmer’s point of view, programming is the activity of producing, in some language, an implementation of some algorithm. If the language is at the very lowest level, as is the case with micropro- gramming, and at the same time the algorithm is filled with intricate data structures and complex decisions, the task might be enormous, but nothing says it cannot be done (except, maybe, experience). With this perspective of the field, we now look at some existing applications of microprogramming. Operating System Support One of the first applications, besides emulation, was to support some operating system functions. Since microprograms are closer to the hardware and programming directly in microcode removes the overhead of decoding machine-level instructions, it was thought that directly coding operating system (OS) functions would improve their performance. Success was achieved in some areas, such as virtual memory. In general, people write most OS functions in assembly language, probably because the cost is not offset by the benefits, especially with rapidly changing OS versions. The problems raised by the human side of programming have changed the question “Should it be in microcode or in assembler?” to the question “Should it be in assembler or in C?” This parallels the CISC/RISC debate. High-Level Languages Support Early research was done also in the area of support for high-level languages. Support can be in the form of microprogrammed implementations of some language primitive (for example, the trigonometric functions) or support for the definition and processing of data structures (for example, trees and lists primitives). Many interesting research projects have led to esoteric laboratory machines. More common examples include the translate instructions, string searches and compares, or indexing multidimensional arrays. 
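The concern about staying responsive during a long emulated operation, such as scanning a large buffer for a byte pattern, can be sketched as follows; the flag name, the polling interval, and the search routine are illustrative assumptions rather than an actual microcoded interface.

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Sketch of a long emulated operation that polls an external-event flag at
 * intervals so that asynchronous data is not lost while the scan is in
 * progress.  A real microcoded routine would test hardware I/O flags or
 * interrupt lines instead of a C variable.
 */
static volatile bool external_event_pending = false;

static void service_external_event(void)
{
    external_event_pending = false;          /* placeholder for the actual I/O service */
}

static long find_pattern(const unsigned char *buf, size_t len,
                         const unsigned char *pat, size_t plen)
{
    const size_t poll_interval = 4096;       /* check external events every 4K bytes   */

    for (size_t i = 0; i + plen <= len; i++) {
        if ((i % poll_interval) == 0 && external_event_pending)
            service_external_event();        /* stay responsive in mid-scan            */
        if (memcmp(buf + i, pat, plen) == 0)
            return (long)i;                  /* pattern found at offset i              */
    }
    return -1;                               /* pattern not found                      */
}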
Paging, Virtual Memory An early and typical application of microprogramming is the implementation of the paging algorithm for a virtual memory system. It is a typical application since it is a low-level function that must be time optimized and is highly hardware dependent. Furthermore, the various maintenance functions which are required by the ? 2000 by CRC Press LLC paging algorithms and the disk I/Os can be done during the idle time of the processing of other functions or during part of that processing in order to avoid I/O delays. Diagnostics Diagnostic functions have also been an early application of microprogramming. A firmware implementation is ideally suited to test the various components of a computer system, since the gates, paths, and units can be exercised in an isolated manner, therefore allowing one to precisely pinpoint the trouble area. Controllers Real-time controllers benefit from a microprogrammed implementation, due to the speed gained by program- ming only the required functions, therefore avoiding the overhead of general-purpose instructions. Since the microprogrammer can better make use of the available parallelism in the machine, long processes can still support the asynchronous arrival of data by incorporating the interrupt polling at intervals in these processes. High-Level Machines Machines that directly implement the constructs of high-level languages can be easily implemented via micro- programming. For example, Prolog machines and Lisp machines have been tried. It is also possible to conceive an application directly microcoded. Although this could provide a high performance hardware, human errors and software engineering practice seem to make such a machine more of a curiosity than a maintainable system. Defining Terms Control memory: A memory containing a set of microinstructions (a microprogram) that defines the instruc- tion set and operations of a CPU. Emulator: The firmware that simulates a given machine architecture. Firmware: Meant as an intermediate between software, which can be modified very easily, and hardware, which is practically unchangeable (once built); the word firmware was coined to represent the micropro- gram in control memory, i.e., the modifiable representation of the CPU instruction set. High-level language for microprogramming: A high-level language more or less oriented toward the descrip- tion of a machine. Emulators can more easily be written in a high-level language; the source code is compiled into the microinstructions for actual implementation. Horizontal microinstruction: Theoretically, a completely horizontal microinstruction is made up of all the possible microcommands available in a given CPU. In practice, some encoding is provided to reduce the length of the instruction. Microcommand: A small bit field indicating if a gate is open or closed, if a function is enabled or not, if a control path is active or not, etc. A microcommand is therefore the specification of some action within the control structure of a CPU. Microinstruction: The set of microcommands to be executed or not, enabled or not. Each field of a micro- instruction is a microcommand. The instruction specifies the new state of the CPU. Vertical microinstruction: A completely vertical microinstruction would contain one field and therefore would specify one microcommand. An Op code is used to specify which microcommand is specified. In practice, microinstructions that typically contain three or four fields are called vertical. 
Related Topic 86.3 Architecture References T. Agerwala, “Microprogram optimization: a survey,” IEEE Trans. Comput., vol. C25, no. 10, pp. 862–873, 1976. J.D. Bagley, “Microprogrammable virtual machines,” Computer, pp. 38–42, 1976. D.K. Banerji and J. Raymond, Elements of Microprogramming, Englewood Cliffs, N.J.: Prentice-Hall, 1982. G.F. Casaglia, “Nanoprogramming vs. microprogramming,” Computer, pp. 54–58, 1976. ? 2000 by CRC Press LLC S.R. Das, D. K. Banerji, and A. Chattopadhyay, “On control memory minimization in microprogrammed digital computers,” IEEE Trans. Comput., vol. C22, no. 9, pp. 845–848, 1973. L.H. Jones, “An annotated bibliography on microprogramming,” SIGMICRO Newsletter, vol. 6, no. 2, pp. 8–31, 1975. L.H. Jones, “Instruction sequencing in microprogrammed computers,” AFIPS Conf. Proc., vol. 44, pp. 91–98, 1975. K. Malik and T.J. Lewis, “Design objectives for high level microprogramming languages,” in Proceedings of the 11th Annual Microprogramming Workshop, Englewood Cliffs, N.J.: Prentice-Hall, 1978, pp. 154–160. J. Raymond and D.K. Banerji, “Using a microprocessor in an intelligent graphics terminal,” Computer, pp. 18–25, 1976. M.V. Wilkes, W. Renwick, and D.J. Wheeler, “The design of the control unit of an electronic digital computer,” Proc. IEE, pp. 121–128, 1958. Further Information J. Carter, Microprocessor Architecture and Microprogramming, a State Machine Approach, Englewood Cliffs, NJ: Prentice-Hall, 1996. S. Habib, Ed., Microprogramming and Firmware Engineering Methods, New York: Van Nostrand Reinhold, 1988. H. Okuno, N. Osato, and I. Takeuchi, “Firmware approach to fast lisp interpreter. Twentieth annual workshop on microprogramming”, (MICRO-20), ACM, 1987. A. J. Van der Hoeven, P. Van Prooijen, E. F. Deprettere, and P. M. Dewilde, “A hardware design system based on object-oriented principles”, IEEE, pp. 459–463, 1993. ? 2000 by CRC Press LLC