Parhi, K.K., Chassaing, R., Bitler, B. “VLSI for Signal Processing”
The Electrical Engineering Handbook
Ed. Richard C. Dorf
Boca Raton: CRC Press LLC, 2000
18
VLSI for Signal Processing
18.1 Special Architectures
Pipelining • Parallel Processing • Retiming • Unfolding • Folding Transformation • Look-Ahead Technique • Associativity Transformation • Distributivity • Arithmetic Processor Architectures • Computer-Aided Design • Future VLSI DSP Systems
18.2 Signal Processing Chips and Applications
DSP Processors • Fixed-Point TMS320C25-Based Development System • Implementation of a Finite Impulse Response Filter with the TMS320C25 • Floating-Point TMS320C30-Based Development System • EVM Tools • Implementation of a Finite Impulse Response Filter with the TMS320C30 • FIR and IIR Implementation Using C and Assembly Code • Real-Time Applications • Conclusions and Future Directions
18.1 Special Architectures
Keshab K. Parhi (University of Minnesota)
Digital signal processing (DSP) is used in numerous applications. These applications include telephony, mobile
radio, satellite communications, speech processing, video and image processing, biomedical applications, radar,
and sonar. Real-time implementations of DSP systems require design of hardware that can match the application
sample rate to the hardware processing rate (which is related to the clock rate and the implementation style).
Thus, real-time does not always mean high speed. Real-time architectures are capable of processing samples as
they are received from the signal source, as opposed to storing them in buffers for later processing as done in
batch processing. Furthermore, real-time architectures operate on an infinite time series (since the number of
the samples of the signal source is so large that it can be considered infinite). While speech and sonar applications
require lower sample rates, radar and video image processing applications require much higher sample rates.
The sample rate information alone cannot be used to choose the architecture. The algorithm complexity is also
an important consideration. For example, a very complex and computationally intensive algorithm for a low-
sample-rate application and a computationally simple algorithm for a high-sample-rate application may require
similar hardware speed and complexity. These ranges of algorithms and applications motivate us to study a
wide variety of architecture styles.
Using very large scale integration (VLSI) technology, DSP algorithms can be prototyped in many ways. These
options include (1) single or multiprocessor programmable digital signal processors, (2) the use of core
programmable digital signal processor with customized interface logic, (3) semicustom gate-array implemen-
tations, and (4) full-custom dedicated hardware implementation. The DSP algorithms are implemented in the
programmable processors by translating the algorithm to the processor assembly code. This can require an
extensive amount of time. On the other hand, high-level compilers for DSP can be used to generate the assembly
code. Although this is currently feasible, the code generated by the compiler is not as efficient as hand-optimized
code. Design of DSP compilers for generation of efficient code is still an active research topic. In the case of
dedicated designs, the challenge lies in a thorough understanding of the DSP algorithms and theory of archi-
tectures. For example, just minimizing the number of multipliers in an algorithm may not lead to a better
dedicated design. The area saved by the number of multipliers may be offset by the increase in control, routing,
and placement costs.
Off-the-shelf programmable digital signal processors can lead to faster prototyping. These prototyped systems
can prove very effective in fast simulation of computation-intensive algorithms (such as those encountered in
speech recognition, video compression, and seismic signal processing) or in benchmarking and standardization.
After standards are determined, it is more useful to implement the algorithms using dedicated circuits.
Design of dedicated circuits is not a simple task. Dedicated circuits provide limited or no programming
flexibility. They require less silicon area and consume less power. However, the low production volume, high
design cost, and long turnaround time are some of the difficulties associated with the design of dedicated
systems. Another difficulty is the availability of appropriate computer-aided design (CAD) tools for DSP systems.
As time progresses, however, the architectural design techniques will be better understood and can be incor-
porated into CAD tools, thus making the design of dedicated circuits easier. Hierarchical CAD tools can integrate
the design at various levels in an automatic and efficient manner. Implementation of standards for signal and
image processing using dedicated circuits will lead to higher volume production. As time progresses, dedicated
designs will be more acceptable to customers of DSP.
Successful design of dedicated circuits requires careful algorithm and architecture considerations. For exam-
ple, for a filtering application, different equivalent realizations may possess different levels of concurrency. Thus,
some of these realizations may be suitable for a particular application while other realizations may not be able
to meet the sample rate requirements of the application. The lower-level architecture may be implemented in
a word-serial or word-parallel manner. The arithmetic functional units may be implemented in bit-serial or
digit-serial or bit-parallel manner. The synthesized architecture may be implemented with a dedicated data
path or shared data path. The architecture may be systolic or nonsystolic.
Algorithm transformations play an important role in the design of dedicated architectures [Parhi, 1989].
This is because the transformed algorithms can be made to operate with better performance (where the
performance may be measured in terms of speed, area, or power). Examples of these transformations include
pipelining, parallel processing, retiming, unfolding, folding, look-ahead, associativity, and distributivity. These
transformations and other architectural concepts are described in detail in subsequent sections.
Pipelining
Pipelining can increase the amount of concurrency (or the number of activities performed simultaneously) in
an algorithm. Pipelining is accomplished by placing latches at appropriate intermediate points in a data flow
graph that describes the algorithm. Each latch also refers to a storage unit or buffer or register. The latches can
be placed at feed-forward cutsets of the data flow graph. In synchronous hardware implementations, pipelining
can increase the clock rate of the system (and therefore the sample rate). The drawbacks associated with
pipelining are the increase in system latency and the increase in the number of registers. To illustrate the speed
increase using pipelining, consider the second-order three-tap finite impulse response (FIR) filter shown in
Fig. 18.1(a). The signal x(n) in this system can be sampled at a rate limited by the throughput of one
multiplication and two additions. For simplicity, if we assume the multiplication time to be two times the
addition time (T_add), the effective sample or clock rate of this system is 1/(4T_add). By placing latches as shown in
Fig. 18.1(b) at the cutset shown in the dashed line, the sample rate can be improved to the rate of one
multiplication or two additions. While pipelining can be easily applied to all algorithms with no feedback loops
by the appropriate placement of latches, it cannot easily be applied to algorithms with feedback loops. This is
because the cutsets in feedback algorithms contain feed-forward and feedback data flow and cannot be con-
sidered as feed-forward cutsets.
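To make the cutset idea concrete, the following C fragment is a cycle-by-cycle software model (an illustrative sketch, not taken from this chapter) of the three-tap filter with pipelining registers placed between the multipliers and the adder chain; the variables reg_p0, reg_p1, and reg_p2 are hypothetical names for the latches at a feed-forward cutset (not necessarily the exact cutset drawn in Fig. 18.1(b)), and the output appears one clock cycle later than in the unpipelined filter.

/* Illustrative cycle-by-cycle model of a pipelined three-tap FIR filter
 * y(n) = a*x(n) + b*x(n-1) + c*x(n-2); reg_p0..reg_p2 model the latches
 * placed at a feed-forward cutset between the multipliers and the adders. */
#include <stdio.h>

int main(void)
{
    const double a = 0.5, b = 0.25, c = 0.125;       /* tap coefficients */
    const double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};    /* input samples    */

    double x1 = 0, x2 = 0;                       /* delay line: x(n-1), x(n-2) */
    double reg_p0 = 0, reg_p1 = 0, reg_p2 = 0;   /* pipelining latches         */

    for (int n = 0; n < 8; n++) {
        /* Second pipeline stage: add the products latched in the previous cycle. */
        double y = reg_p0 + reg_p1 + reg_p2;     /* equals y(n-1), one cycle late */

        /* First pipeline stage: form the products and latch them for next cycle. */
        reg_p0 = a * x[n];
        reg_p1 = b * x1;
        reg_p2 = c * x2;

        x2 = x1;                                 /* advance the input delay line  */
        x1 = x[n];

        printf("cycle %d: y = %g\n", n, y);
    }
    return 0;
}

Because the multipliers and the adders now work on different iterations in the same clock cycle, the clock period is set by the slower of the two stages rather than by their sum, at the cost of one extra cycle of latency and three additional registers.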
Pipelining can also be used to improve the performance in software programmable multiprocessor systems.
Most software programmable DSP processors are programmed using assembly code. The assembly code is
generated by high-level compilers that perform scheduling. Schedulers typically use the acyclic precedence
graph to construct schedules. The removal of all edges in the signal (or data) flow graph containing delay
elements converts the signal flow graph to an acyclic precedence graph. By placing latches to pipeline a data
flow graph, we can alter the acyclic precedence graph. In particular, the critical path of the acyclic precedence
graph can be reduced. The new precedence graph can be used to construct schedules with lower iteration
periods (although this may often require an increase in the number of processors).
Pipelining of algorithms can increase the sample rate of the system. Sometimes, for a constant sample rate,
pipelining can also reduce the power consumed by the system. This is because the data paths in the pipelined
system can be charged or discharged with lower supply voltage. Since the capacitance remains almost constant,
the power can be reduced. Achieving low power can be important in many battery-powered applications
[Chandrakasan et al., 1992].
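As a rough, first-order illustration (this uses the standard CMOS dynamic-power model and made-up voltages, not figures from this chapter), the switching power of a circuit is approximately

P = C_total V^2 f

where C_total is the switched capacitance, V is the supply voltage, and f is the clock frequency. If pipelining shortens the critical path enough that the same sample rate can be met with the supply reduced from, say, 5 V to 3.3 V, the power drops by roughly (3.3/5)^2 ≈ 0.44, with the capacitance and the clock frequency essentially unchanged.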
Parallel Processing
Parallel processing is related to pipelining but requires replication of hardware units. Pipelining exploits
concurrency by breaking a large task into multiple smaller tasks and by separating these smaller tasks by storage
units. On the other hand, parallelism exploits concurrency by performing multiple larger tasks simultaneously
in separate hardware units.
To illustrate the speed increase due to parallelism, consider the parallel implementation of the second-order
three-tap FIR filter of Fig. 18.1(a) shown in Fig. 18.2. In the architecture of Fig. 18.2, two input samples are
processed and two output samples are generated in each clock cycle period of four addition times. Because
each clock cycle processes two samples, however, the effective sample rate is 1/(2T_add), which is the same as that
of Fig. 18.1(b). The parallel architecture leads to the speed increase with significant hardware overhead. The
entire data flow graph needs to be replicated with an increase in the amount of parallelism. Thus, it is more
desirable to use pipelining as opposed to parallelism. However, parallelism may be useful if pipelining alone
cannot meet the speed demand of the application or if the technology constraints (such as limitations on the
clock rate by the I/O technology) limit the use of pipelining. In obvious ways, pipelining and parallelism can
be combined also. Parallelism, like pipelining, can also lead to power reduction but with significant overhead
in hardware requirements. Achieving pipelining and parallelism can be difficult for systems with feedback loops.
Concurrency may be created in these systems by using the look-ahead transformation.
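The hypothetical C sketch below (not the handbook's code) models the two-parallel structure of Fig. 18.2: each pass of the loop plays the role of one clock cycle, consumes two new input samples, and produces the two outputs y(2k) and y(2k + 1) using two copies of the filter arithmetic.

/* Illustrative model of two-parallel (block) processing of the three-tap FIR
 * filter: one loop iteration = one clock cycle = two inputs and two outputs. */
#include <stdio.h>

#define CYCLES 8                         /* number of two-sample clock cycles */

int main(void)
{
    const double h[3] = {0.5, 0.25, 0.125};               /* h(0), h(1), h(2) */
    const double x[2 * CYCLES] = {1, 2, 3, 4, 5, 6, 7, 8,
                                  9, 10, 11, 12, 13, 14, 15, 16};
    double xm1 = 0, xm2 = 0;     /* x(2k - 1) and x(2k - 2), kept across cycles */

    for (int k = 0; k < CYCLES; k++) {
        double x0 = x[2 * k];                              /* x(2k)     */
        double x1 = x[2 * k + 1];                          /* x(2k + 1) */

        /* Two outputs per cycle, computed by two copies of the data flow graph. */
        double y0 = h[0] * x0 + h[1] * xm1 + h[2] * xm2;   /* y(2k)     */
        double y1 = h[0] * x1 + h[1] * x0  + h[2] * xm1;   /* y(2k + 1) */

        xm1 = x1;                                  /* state for the next cycle */
        xm2 = x0;
        printf("y(%d) = %g   y(%d) = %g\n", 2 * k, y0, 2 * k + 1, y1);
    }
    return 0;
}

The loop body contains two copies of the multiply–add arithmetic, which is exactly the hardware duplication the text refers to; the clock rate is unchanged, but two samples are processed per clock cycle.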
FIGURE 18.1 (a) A three-tap second-order nonrecursive digital filter; (b) the equivalent pipelined digital filter obtained
by placing storage units at the intersection of the signal wires and the feed-forward cutset. If the multiplication and addition
operations require 2 and 1 unit of time, respectively, then the maximum achievable sampling rates for the original and the
pipelined architectures are 1/4 and 1/2 units, respectively.
Retiming
Retiming is similar to pipelining but yet different in some ways [Leiserson et al., 1983]. Retiming is the process
of moving the delays around in the data flow graph. Removal of one delay from all input edges of a node and
insertion of one delay to each outgoing edge of the same node is the simplest example of retiming. Unlike
pipelining, retiming does not increase the latency of the system. However, retiming alters the number of delay
elements in the system. Retiming can reduce the critical path of the data flow graph. As a result, it can lead to
clock period reduction in hardware implementations or critical path of the acyclic precedence graph or the
iteration period in programmable software system implementations.
The single host formulation of the retiming transformation preserves the latency of the algorithm. The
retiming formulation with no constraints on latency (i.e., with separate input and output hosts) can also achieve
pipelining with no retiming or pipelining with retiming. Pipelining with retiming is the most desirable transfor-
mation in DSP architecture design. Pipelining with retiming can be interpreted to be identical to retiming of
the original algorithm with a large number of delays at the input edges. Thus, we can increase the system latency
arbitrarily and remove the appropriate number of delays from the inputs after the transformation.
The retiming formulation assigns retiming variables r(.) to each node in the data flow graph. If i(U → V) is the number of delays associated with the edge U → V in the original data flow graph and r(V) and r(U), respectively, represent the retiming variable value of the nodes V and U, then the number of delays associated with the edge U → V in the retimed data flow graph is given by

i_r(U → V) = i(U → V) + r(V) – r(U)

For the data flow graph to be realizable, i_r(U → V) ≥ 0 must be satisfied. The retiming transformation formulates the problem by calculating path lengths and by imposing constraints on certain path lengths. These constraints are solved as a shortest-path problem.
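The retiming equation can be checked mechanically. The short C sketch below is an illustration only (the four-node graph and the r(.) values are invented for the example, not drawn from the chapter); it applies a candidate set of retiming variables to each edge and verifies that every retimed delay count i_r remains nonnegative.

/* Apply retiming variables to a small (made-up) data flow graph and check
 * i_r(U -> V) = i(U -> V) + r(V) - r(U) >= 0 on every edge.               */
#include <stdio.h>

struct edge { int u, v, delays; };

int main(void)
{
    /* A single loop 0 -> 1 -> 2 -> 3 -> 0 with delay counts 0, 1, 0, 2.   */
    struct edge g[4] = { {0, 1, 0}, {1, 2, 1}, {2, 3, 0}, {3, 0, 2} };
    int r[4] = {0, 0, 0, 1};        /* candidate retiming variables r(node) */
    int feasible = 1;

    for (int e = 0; e < 4; e++) {
        int ir = g[e].delays + r[g[e].v] - r[g[e].u];    /* retimed delays */
        printf("edge %d -> %d: i = %d, i_r = %d\n",
               g[e].u, g[e].v, g[e].delays, ir);
        if (ir < 0)
            feasible = 0;
    }
    printf(feasible ? "retiming is feasible\n" : "retiming is infeasible\n");
    return 0;
}

Note that the total number of delays around the loop (three) is unchanged by the retiming; only their positions move, which is what allows the critical path to be shortened without altering the input/output behavior.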
To illustrate the usefulness of retiming, consider the data flow graph of a two-stage pipelined lattice digital
filter graph shown in Fig. 18.3(a) and its equivalent pipelined-retimed data flow graph shown in Fig. 18.3(b).
If the multiply time is two units and the add time is one unit, the architecture in Fig. 18.3(a) can be clocked
with period 10 units whereas the architecture in Fig. 18.3(b) can be clocked with period 2 units.
FIGURE 18.2 Twofold parallel realization of the three-tap filter of Fig. 18.1(a).
Unfolding
The unfolding transformation is similar to loop unrolling. In J-unfolding, each node is replaced by J nodes
and each edge is replaced by J edges. The J-unfolded data flow graph executes J iterations of the original
algorithm [Parhi, 1991].
The unfolding transformation can unravel the hidden concurrency in a data flow program. The achievable
iteration period for a J-unfolded data flow graph is 1/J times the critical path length of the unfolded data flow
graph. By exploiting interiteration concurrency, unfolding can lead to a lower iteration period in the context
of a software programmable multiprocessor implementation.
The unfolding transformation can also be applied in the context of hardware design. If we apply an unfolding
transformation on a (word-serial) nonrecursive algorithm, the resulting data flow graph represents a word-
parallel (or simply parallel) algorithm that processes multiple samples or words in parallel every clock cycle.
If we apply 2-unfolding to the 3-tap FIR filter in Fig. 18.1(a), we can obtain the data flow graph of Fig. 18.2.
FIGURE 18.3 (a) A two-stage pipelinable time-invariant lattice digital filter. If multiplication and addition operations require 2 and 1 time units, respectively, then this data flow graph can achieve a sampling period of 10 time units (which corresponds to the critical path M_1 → A_2 → M_2 → A_1 → M_3 → A_3 → A_4). (b) The pipelined/retimed lattice digital filter can achieve a sampling period of 2 time units.
Because the unfolding algorithm is based on a graph-theoretic approach, it can also be applied at the bit level.
Thus, unfolding of a bit-serial data flow program by a factor of J leads to a digit-serial program with digit
size J. The digit size represents the number of bits processed per clock cycle. The digit-serial architecture is
clocked at the same rate as the bit-serial (assuming that the clock rate is limited by the communication I/O
bound much before reaching the computation bound of the bit-serial program). Because the digit-serial
program processes J bits per clock cycle, the effective bit rate of the digit-serial architecture is J times higher. A
simple example of this unfolding is illustrated in Fig. 18.4, where the bit-serial adder in Fig. 18.4(a) is unfolded
by a factor of 2 to obtain the digit-serial adder in Fig. 18.4(b) for digit size 2 for a word length of 4. In obvious
ways, the unfolding transformation can be applied to both word level and bit level simultaneously to generate
word-parallel digit-serial architectures. Such architectures process multiple words per clock cycle and process
a digit of each word (not the entire word).

FIGURE 18.4 (a) A least-significant-bit first bit-serial adder for word length of 4; (b) a digit-serial adder with digit size 2 obtained by two-unfolding of the bit-serial adder. The bit position 0 stands for least significant bit.
Folding Transformation
The folding transformation is the reverse of the unfolding transformation. While the unfolding transformation
is simpler, the folding transformation is more difficult [Parhi et al., 1992].
The folding transformation can be applied to fold a bit-parallel architecture to a digit-serial or bit-serial one
or to fold a digit-serial architecture to a bit-serial one. It can also be applied to fold an algorithm data flow
graph to a hardware data flow for a specified folding set. The folding set indicates the processor in which and
the time partition at which a task is executed. A specified folding set may be infeasible, and this needs to be
detected first. The folding transformation performs a preprocessing step to detect feasibility and in the feasible
case transforms the algorithm data flow graph to an equivalent pipelined/retimed data flow graph that can be
folded. For the special case of regular data flow graphs and for linear space–time mappings, the folding
transformation reduces to systolic array design.
In the folded architecture, each edge in the algorithm data flow graph is mapped to a communicating edge
in the hardware architecture data flow graph. Consider an edge U → V in the algorithm data flow graph with an associated number of delays i(U → V). Let the tasks U and V be mapped to the hardware units H_U and H_V, respectively. Assume that N time partitions are available, i.e., the iteration period is N. A modulo operation determines the time partition. For example, the time unit 18 for N = 4 corresponds to time partition 18 modulo 4, or 2. Let the tasks U and V be executed in time partitions u and v, i.e., the lth iterations of tasks U and V are executed in time units Nl + u and Nl + v, respectively. The i(U → V) delays in the edge U → V imply that the result of the lth iteration of U is used for the (l + i)th iteration of V. The (l + i)th iteration of V is executed in time unit N(l + i) + v. Thus the number of storage units needed in the folded edge corresponding to the edge U → V is

D_F(U → V) = N(l + i) + v – Nl – u – P_u = Ni + v – u – P_u
where P_u is the level of pipelining of the hardware operator H_U. The D_F(U → V) delays should be connected to the edge between H_U and H_V, and this signal should be switched to the input of H_V at time partition v. If the D_F(U → V)'s as calculated here were always nonnegative for all edges U → V, then the problem would be solved. However, some D_F()'s would be negative. The algorithm data flow graph needs to be pipelined and retimed such that all the D_F()'s are nonnegative. This can be formulated by simple inequalities using the retiming variables. The retiming formulation can be solved as a path problem, and the retiming variables can be determined if a solution exists. The algorithm data flow graph can be retimed for folding and the calculation of the D_F()'s can be repeated. The folded hardware architecture data flow graph can now be completed. The folding technique is illustrated in Fig. 18.5. The algorithm data flow graph of a two-stage pipelined lattice recursive digital filter of Fig. 18.3(a) is folded for the folding set shown in Fig. 18.5. Fig. 18.5(a) shows the pipelined/retimed data flow graph (preprocessed for folding) and Fig. 18.5(b) shows the hardware architecture data flow graph obtained after folding.
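The arithmetic of the folding equation is easy to mechanize. The C fragment below is a toy illustration (the edge parameters and the iteration period N = 4 are invented, and the structure name folded_edge is hypothetical); it evaluates D_F(U → V) = Ni + v – u – P_u for a few edges and flags any negative result, which is the condition that forces the pipelining/retiming preprocessing step described above.

/* Evaluate the folded delay count D_F(U -> V) = N*i + v - u - P_u for each
 * edge of a made-up folding example with N time partitions.               */
#include <stdio.h>

struct folded_edge {
    int i;    /* delays i(U -> V) on the algorithm edge            */
    int u;    /* time partition in which task U executes           */
    int v;    /* time partition in which task V executes           */
    int pu;   /* pipelining level P_u of the hardware operator H_U */
};

int main(void)
{
    const int N = 4;                                 /* iteration period */
    struct folded_edge e[3] = { {1, 3, 0, 2}, {0, 1, 3, 1}, {0, 2, 1, 2} };

    for (int k = 0; k < 3; k++) {
        int df = N * e[k].i + e[k].v - e[k].u - e[k].pu;
        printf("edge %d: D_F = %d%s\n", k, df,
               df < 0 ? "  (needs pipelining/retiming first)" : "");
    }
    return 0;
}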
As indicated before, a special case of folding can address systolic array design for regular data flow graphs
and for linear mappings. The systolic architectures make use of extensive pipelining and local communication
and operate in a synchronous manner [Kung, 1988]. The systolic processors can also be made to operate in an
asynchronous manner, and such systems are often referred to as wavefront processors. Systolic architectures
have been designed for a variety of applications including convolution, matrix solvers, matrix decomposition,
and filtering.
Look-Ahead Technique
The look-ahead technique is a very powerful technique for pipelining of recursive signal processing algorithms
[Parhi and Messerschmitt, 1989]. This technique can transform a sequential recursive algorithm to an equivalent
concurrent one, which can then be realized using pipelining or parallel processing or both. This technique has
been successfully applied to pipeline many signal processing algorithms, including recursive digital filters (in
direct form and lattice form), adaptive lattice digital filters, two-dimensional recursive digital filters, Viterbi
decoders, Huffman decoders, and finite state machines. This research demonstrated that the recursive signal
processing algorithms can be operated at high speed. This is an important result since modern signal processing
applications in radar and image processing and particularly in high-definition and super-high-definition tele-
vision video signal processing require very high throughput. Traditional algorithms and topologies cannot be
used for such high-speed applications because of the inherent speed bound of the algorithm created by the
feedback loops. The look-ahead transformation creates additional concurrency in the signal processing algo-
rithms and the speed bound of the transformed algorithms is increased substantially. The look-ahead trans-
formation is not free from its drawbacks. It is accompanied by an increase in the hardware overhead. This
difficulty has encouraged us to develop inherently pipelinable topologies for recursive signal processing algo-
rithms. Fortunately, this is possible to achieve in adaptive digital filters using relaxations on the look-ahead or
by the use of relaxed look-ahead [Shanbhag and Parhi, 1992].
To begin, consider a time-invariant one-pole recursive digital filter transfer function

H(z) = X(z)/U(z) = 1/(1 – az^–1)
described by the difference equation
x(n) = ax(n – 1) + u(n)
and shown in Fig. 18.6(a). The maximum achievable speed in this system is limited by the operating speed of
one multiply–add operation. To increase the speed of this system by a factor of 2, we can express x(n) in terms
of x(n – 2) by substitution of one recursion within the other:
x(n) = a[ax(n – 2) + u(n – 1)] + u(n) = a^2 x(n – 2) + au(n – 1) + u(n)
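A quick numerical check of this two-step look-ahead is given by the C sketch below (an illustration with arbitrary input and zero initial conditions, not code from the chapter); the original recursion and the look-ahead recursion produce identical sequences, the difference being that the transformed loop feeds back across two delays instead of one.

/* Verify that the 2-step look-ahead recursion
 *     x(n) = a^2 x(n-2) + a u(n-1) + u(n)
 * matches the original recursion  x(n) = a x(n-1) + u(n)
 * for the same input and zero initial conditions.                         */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double a = 0.75;
    double u[12], x_orig[12], x_la[12];

    for (int n = 0; n < 12; n++)
        u[n] = sin(0.4 * n) + 1.0;                 /* arbitrary test input */

    for (int n = 0; n < 12; n++) {
        double x1 = (n >= 1) ? x_orig[n - 1] : 0.0;
        x_orig[n] = a * x1 + u[n];                           /* original   */

        double x2 = (n >= 2) ? x_la[n - 2] : 0.0;
        double u1 = (n >= 1) ? u[n - 1] : 0.0;
        x_la[n] = a * a * x2 + a * u1 + u[n];                /* look-ahead */
    }

    for (int n = 0; n < 12; n++)
        printf("n = %2d   original = %10.6f   look-ahead = %10.6f\n",
               n, x_orig[n], x_la[n]);
    return 0;
}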
FIGURE 18.5 (a) A pipelined/retimed data flow graph obtained from Fig. 18.3(a) by preprocessing for folding; (b) the folded hardware architecture data flow graph. In our folding notation, the tasks are ordered within a set and the ordering represents the time partition in which the task is executed. For example, S_A1 = (A_2, A_1) implies that A_2 and A_1 are, respectively, executed in even and odd time partitions in the same processor. The notation Φ represents a null operation.
The transfer function of the emulated second-order system is given by

H(z) = (1 + az^–1)/(1 – a^2 z^–2)

and is obtained by using a pole-zero cancellation at –a. In the modified system, x(n) is computed using x(n – 2) as opposed to x(n – 1); thus we look ahead. The modified system has two delays in the multiply–add feedback loop, and these two delays can be distributed to pipeline the multiply–add operation by two stages. Of course, the additional multiply–add operation that represents one zero would also need to be pipelined by two stages to keep up with the sample rate of the system. To increase the speed by four times, we can rewrite the transfer function as

H(z) = (1 + az^–1)(1 + a^2 z^–2)/(1 – a^4 z^–4)

This system is shown in Fig. 18.6(b). Arbitrary speed increase is possible. However, for power-of-two speed increase the hardware complexity grows logarithmically with the speed-up factor. The same technique can be applied to any higher-order system. For example, a second-order recursive filter with transfer function

H(z) = 1/(1 – 2r cos θ z^–1 + r^2 z^–2)

can be modified to

H(z) = (1 + 2r cos θ z^–1 + r^2 z^–2)/(1 – 2r^2 cos 2θ z^–2 + r^4 z^–4)

for a twofold increase in speed. In this example, the output y(n) is computed using y(n – 2) and y(n – 4); thus, it is referred to as scattered look-ahead.

FIGURE 18.6 (a) A first-order recursive digital filter; (b) a four-stage pipelinable equivalent filter obtained by look-ahead computation.
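The pole-zero cancellation behind scattered look-ahead can also be verified numerically. The C sketch below (illustrative only; r and θ are arbitrary values, not design parameters from the chapter) computes the impulse responses of the original second-order filter and of its scattered look-ahead form and shows that they agree.

/* Compare impulse responses of
 *   original:    y(n) = 2 r cos(t) y(n-1) - r^2 y(n-2) + u(n)
 *   transformed: y(n) = 2 r^2 cos(2t) y(n-2) - r^4 y(n-4)
 *                       + u(n) + 2 r cos(t) u(n-1) + r^2 u(n-2)
 * The transformation only adds a canceling pole/zero pair, so the two
 * impulse responses should be identical.                                  */
#include <stdio.h>
#include <math.h>

#define LEN 16

int main(void)
{
    const double r = 0.9, t = 0.3;                 /* arbitrary pole location */
    double u[LEN] = {1.0};                         /* impulse input           */
    double y1[LEN], y2[LEN];

    for (int n = 0; n < LEN; n++) {
        double a1 = (n >= 1) ? y1[n - 1] : 0, a2 = (n >= 2) ? y1[n - 2] : 0;
        y1[n] = 2 * r * cos(t) * a1 - r * r * a2 + u[n];

        double b2 = (n >= 2) ? y2[n - 2] : 0, b4 = (n >= 4) ? y2[n - 4] : 0;
        double u1 = (n >= 1) ? u[n - 1] : 0, u2 = (n >= 2) ? u[n - 2] : 0;
        y2[n] = 2 * r * r * cos(2 * t) * b2 - r * r * r * r * b4
                + u[n] + 2 * r * cos(t) * u1 + r * r * u2;

        printf("n = %2d   original = %12.8f   scattered look-ahead = %12.8f\n",
               n, y1[n], y2[n]);
    }
    return 0;
}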
While look-ahead can transform any recursive digital filter transfer function to pipelined form, it leads to a
hardware overhead proportional to N log_2 M, where N is the filter order and M is the speed-up factor. Instead
of starting with a sequential digital filter transfer function obtained by traditional design approaches and
transforming it for pipelining, it is more desirable to use a constrained filter design program that can satisfy
the filter spectrum and the pipelining constraint. The pipelining constraint is satisfied by expressing the
denominator of the transfer function in scattered look-ahead form. Such filter design programs have now been
developed in both time domain and frequency domain. The advantage of the constrained filter design approach
is that we can obtain pipelined digital filters with marginal or zero hardware overhead compared with sequential
digital filters. The pipelined transfer functions can also be mapped to pipelined lattice digital filters. The reader
might note that the data flow graph of Fig. 18.3(a) was obtained by this approach.
The look-ahead pipelining can also be applied for the design of transversal and adaptive lattice digital filters.
Although look-ahead transformation can be used to modify the adaptive filter recursions to create concurrency,
this requires large hardware overhead. The adaptive filters are based on weight update operations, and the
weights are adapted based on the current error. Eventually, the error becomes close to zero and the filter coefficients converge. Thus, making relaxations on the error can reduce the hardware overhead substantially
without degradation of the convergence behavior of the adaptive filter. Three types of relaxations of look-ahead
are possible. These are referred to as sum relaxation, product relaxation, and delay relaxation. To illustrate these
three relaxations, consider the weight update recursion
w(n + 1) = a(n)w(n) + f(n)
where the term a(n) is typically 1 for transversal least mean square (LMS) adaptive filters and of the form (1 – e(n)) for lattice LMS adaptive digital filters, and f(n) = μe(n)u(n), where μ is a constant, e(n) is the error, and u(n) is the input. The use of look-ahead transforms the above recursion to

w(n + M) = [a(n + M – 1)a(n + M – 2) … a(n)]w(n) + f(n + M – 1) + a(n + M – 1)f(n + M – 2) + … + [a(n + M – 1)a(n + M – 2) … a(n + 1)]f(n)

In sum relaxation, we only retain the single term dependent on the current input for the last term of the look-ahead recursion. The relaxed recursion after sum relaxation is given by

w(n + M) = [a(n + M – 1)a(n + M – 2) … a(n)]w(n) + f(n + M – 1)

In lattice digital filters, the coefficient a(n) is close to 1 for all n, since it can be expressed as (1 – e(n)) and e(n) is close to zero for all n and is positive. The product relaxation on the above equation leads to

w(n + M) = (1 – Me(n + M – 1))w(n) + f(n + M – 1)
The delay relaxation assumes the signal to be slowly varying or to be constant over D samples and replaces the
look-ahead by
w(n + M) = (1 – Me(n + M – 1)) w(n) + f(n + M – D – 1)
These three types of relaxations make it possible to implement pipelined transversal and lattice adaptive digital
filters with marginal increase in hardware overhead. Relaxations on the weight update operations change the
convergence behavior of the adaptive filter, and we are forced to examine carefully the convergence behavior
of the relaxed look-ahead adaptive digital filters. It has been shown that the relaxed look-ahead adaptive digital
filters do not suffer from degradation in adaptation behavior. Furthermore, in image coding, the use of pipelined
adaptive filters could lead to a dramatic increase in pixel rate with no degradation in signal-to-noise ratio of
the coded image and no increase in hardware overhead [Shanbhag and Parhi, 1992].
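To make the flavor of these relaxations concrete, the toy C program below (illustrative only; it is a scalar exercise, not the adaptive filter architecture of Shanbhag and Parhi) identifies an unknown gain with a one-tap LMS filter for which a(n) = 1. The conventional filter updates its weight every sample, while the relaxed version, mimicking the combined sum and delay relaxations with M = 4, holds the weight for M samples and then applies only the most recent correction term; both weights converge to the unknown gain, the relaxed one more slowly.

/* Toy comparison of a sample-by-sample LMS update with a relaxed update that
 * is applied only once every M samples using the latest error term.        */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double c = 2.0;           /* unknown gain to be identified          */
    const double mu = 0.05;         /* step size                              */
    const int M = 4;                /* pipelining depth of the relaxed update */
    double w_serial = 0.0, w_relaxed = 0.0;

    for (int n = 0; n < 400; n++) {
        double u = sin(0.07 * n) + 0.5 * cos(0.31 * n);    /* input sample   */
        double d = c * u;                                  /* desired output */

        /* Conventional LMS: w(n+1) = w(n) + mu * e(n) * u(n).               */
        double e_s = d - w_serial * u;
        w_serial += mu * e_s * u;

        /* Relaxed update: hold w for M samples, then add only the latest
         * correction term (sum relaxation) computed with the held weight
         * (delay relaxation).                                               */
        double e_r = d - w_relaxed * u;
        if (n % M == M - 1)
            w_relaxed += mu * e_r * u;

        if (n % 100 == 99)
            printf("n = %3d   w_serial = %.4f   w_relaxed = %.4f\n",
                   n, w_serial, w_relaxed);
    }
    return 0;
}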
The concurrency created by look-ahead and relaxed look-ahead transformations can also be exploited in the
form of parallel processing. Furthermore, for a constant speed, concurrent architectures (especially the pipelined
architectures) can also lead to low power consumption.
Associativity Transformation
The addition operations in many signal processing algorithms can be interchanged since the add operations
satisfy associativity. Thus, it is possible to move the add operations outside the critical loops to increase the
maximum achievable speed of the system. As an example of the associative transformation, consider the
realization of a second-order recursion x(n) = (5/8)x(n – 1) – (3/4)x(n – 2) + u(n). Two possible realizations are
shown in Fig. 18.7(a). The realization on the left contains one multiplication and two add operations in the
critical inner loop, whereas the realization on the right contains one multiplication and one add operation in
the critical inner loop. The realization on the left can be transformed to the realization on the right using the
associativity transformation. Figure 18.7(b) shows a bit-serial implementation of this second-order recursion
for the realization on the right for a word length of 8. This bit-serial system can be operated in a functionally
correct manner for any word length greater than or equal to 5 since the inner loop computation latency is 5
cycles. On the other hand, if associativity were not exploited, then the minimum realizable word length would
be 6. Thus, associativity can improve the achievable speed of the system.
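The small C sketch below (an illustration, not the bit-serial hardware of Fig. 18.7(b)) evaluates the recursion with the two association orders of Fig. 18.7(a); both orderings produce the same output sequence (exactly here, because the coefficients are dyadic), but in the second form the product (3/4)x(n – 2) is combined with u(n) outside the loop through x(n – 1), which is what leaves only one multiplication and one addition in the critical inner loop of the hardware realization.

/* The second-order recursion x(n) = (5/8) x(n-1) - (3/4) x(n-2) + u(n)
 * evaluated with two different association orders.                        */
#include <stdio.h>

int main(void)
{
    const double u[10] = {1, 0, 0, 1, 0, 0, 1, 0, 0, 1};  /* arbitrary input */
    double xa1 = 0, xa2 = 0;         /* state of the left realization  */
    double xb1 = 0, xb2 = 0;         /* state of the right realization */

    for (int n = 0; n < 10; n++) {
        /* Left realization: both additions lie in the loop through x(n-1). */
        double xa = (0.625 * xa1 + (-0.75) * xa2) + u[n];

        /* Right realization: the x(n-2) product is added to u(n) first, so
         * the loop through x(n-1) contains one multiply and one add.       */
        double t  = (-0.75) * xb2 + u[n];
        double xb = 0.625 * xb1 + t;

        printf("n = %d   left = %10.6f   right = %10.6f\n", n, xa, xb);

        xa2 = xa1; xa1 = xa;         /* advance the delay lines */
        xb2 = xb1; xb1 = xb;
    }
    return 0;
}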
Distributivity
Another local transformation that is often useful is distributivity. In this transformation, a computation (A × B) + (A × C) may be reorganized as A × (B + C). Thus, the number of hardware units can be reduced from
two multipliers and one adder to one multiplier and one adder.
Arithmetic Processor Architectures
In addition to algorithms and architecture designs, it is also important to address implementation styles and
arithmetic processor architectures.
Most DSP systems use fixed-point hardware arithmetic operators. While many number system representa-
tions are possible, the two’s complement number system is the most popular number system. The other number
systems include the residue number system, the redundant or signed-digit number system, and the logarithmic
number system. The residue and logarithmic number systems are rarely used or are used in very special cases
such as nonrecursive digital filters. Shifting or scaling and division are difficult in the residue number system.
Difficulty with addition and the overhead associated with logarithm and antilogarithm converters reduce the
attractiveness of the logarithm number system. The use of the redundant number system leads to carry-free
operation but is accompanied by the overhead associated with redundant-to-two’s complement conversion.
Another approach often used is distributed arithmetic. This approach has recently been used in a few video
transformation chips.
The simplest arithmetic operation is addition. Multiplication can be realized as a series of add-shift opera-
tions, and division and square-root can be realized as a series of controlled add–subtract operations. The
conventional two’s complement adder involves carry ripple operation. This limits the throughput of the adder
operation. In DSP, however, the combined multiply–add operation is most common. Carry–save operations
have been used to realize pipelined multiply–adders using fewer pipelining latches. In a conventional pipelined two’s complement multiplier, the multiplication time is approximately two times the bit-level addition time.
Recently, a technique has been proposed to reduce the multiplication time from 2W bit-level binary adder
times to 1.25W bit-level binary adder times where W is the word length. This technique is based on the use of
hybrid number system representation, where one input operand is in two’s complement number representation
and the other in redundant number representation [Srinivas and Parhi, 1992]. Using an efficient sign-select
redundant-to-two’s complement conversion technique, this multiplier can be made to operate faster and, in
the pipelined mode, would require fewer pipelining latches and less silicon area.
FIGURE 18.7 (a) Two associative realizations of a second-order recursion; (b) an efficient bit-serial realization of the
recursion for a word length of 8.
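As a small illustration of the add–shift view of multiplication mentioned above (a generic C sketch, unrelated to the specific multiplier architectures cited), the function below forms an unsigned 16 × 16-bit product by scanning the multiplier bits and conditionally accumulating shifted copies of the multiplicand; a hardware array or pipelined multiplier performs essentially these conditional additions in parallel or in successive stages.

/* Shift-and-add multiplication of two unsigned 16-bit operands.           */
#include <stdint.h>
#include <stdio.h>

static uint32_t shift_add_multiply(uint16_t a, uint16_t b)
{
    uint32_t acc = 0;
    uint32_t shifted = a;              /* multiplicand, shifted left each step */

    for (int i = 0; i < 16; i++) {
        if (b & (1u << i))
            acc += shifted;            /* controlled add for a 1 bit of b      */
        shifted <<= 1;                 /* align with the next bit position     */
    }
    return acc;
}

int main(void)
{
    printf("%u\n", shift_add_multiply(1234, 567));    /* prints 699678 */
    return 0;
}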
Computer-Aided Design
With progress in the theory of architectures, computer-aided design (CAD) systems for DSP applications have also become more powerful. In the early 1980s, the first silicon compiler system for signal processing was developed
at the University of Edinburgh and was referred to as the FIRST design system. This system only addressed the
computer-aided design of bit-serial signal processing systems. Since then more powerful systems have been
developed. The Cathedral I system from Katholieke Universiteit Leuven and the BSSC (bit-serial silicon com-
piler) from GE Research Center in Schenectady, New York, also addressed synthesis of bit-serial circuits. The
Cathedral system has now gone through many revisions, and the new versions can synthesize parallel multi-
processor data paths and can perform more powerful scheduling and allocation. The Lager design tool at the
University of California at Berkeley was developed to synthesize the DSP algorithms using parametrizable macro
building blocks (such as ALU, RAM, ROM). This system has also gone through many revisions. The Hyper
system, also developed at the University of California at Berkeley, and the MARS design system, developed at the University of Minnesota at Minneapolis, perform higher-level transformations and perform scheduling and
allocation. These CAD tools are crucial to rapid prototyping of high-performance DSP integrated circuits.
Future VLSI DSP Systems
Future VLSI systems will make use of a combination of many types of architectures such as dedicated and
programmable. These systems can be designed successfully with proper understanding of the algorithms,
applications, theory of architectures, and with the use of advanced CAD systems.
Defining Terms
Bit serial: Processing of one bit per clock cycle. If word length is W, then one sample or word is processed
in W clock cycles. In contrast, all W bits of a word are processed in the same clock cycle in a bit-parallel
system.
Digit serial: Processing of more than one but not all bits in one clock cycle. If the digit size is W_1 and the word length is W, then the word is processed in W/W_1 clock cycles. If W_1 = 1, then the system is referred to as a bit-serial system, and if W_1 = W, then the system is referred to as a bit-parallel system. In general, the digit size W_1 need not be a divisor of the word length W, since the least and most significant bits of consecutive words can be overlapped and processed in the same clock cycle.
Folding: The technique of mapping many tasks to a single processor.
Look-ahead: The technique of computing a state x(n) using a previous state x(n – M) without requiring the intermediate states x(n – 1) through x(n – M + 1). This is referred to as an M-step look-ahead. In the case of higher-order computations, there are two forms of look-ahead: clustered look-ahead and scattered look-ahead. In clustered look-ahead, x(n) is computed using the clustered states x(n – M – N + 1) through x(n – M) for an Nth-order computation. In scattered look-ahead, x(n) is computed using the scattered states x(n – iM), where i varies from 1 to N.
Parallel processing: Processing of multiple tasks independently by different processors. This also increases
the throughput.
Pipelining: A technique to increase throughput. A long task is divided into components, and each component
is distributed to one processor. A new task can begin even though the former tasks have not been
completed. In the pipelined operation, different components of different tasks are executed at the same
time by different processors. Pipelining leads to an increase in the system latency, i.e., the time elapsed
between the starting of a task and the completion of the task.
Retiming: The technique of moving the delays around the system. Retiming does not alter the latency of the
system.
Systolic: Flow of data in a rhythmic fashion from a memory through many processors, returning to the memory, just as blood flows to and from the heart.
Unfolding: The technique of transforming a program that describes one iteration of an algorithm to another
equivalent program that describes multiple iterations of the same algorithm.
Word parallel: Processing of multiple words in the same clock cycle.
Related Topic
95.1 Introduction
References
A.P. Chandrakasan, S. Sheng, and R.W. Brodersen, “Low-power CMOS digital design,” IEEE J. Solid State
Circuits, vol. 27(4), pp. 473–484, April 1992.
S.Y. Kung, VLSI Array Processors, Englewood Cliffs, N.J.: Prentice-Hall, 1988.
E.A. Lee and D.G. Messerschmitt, “Pipeline interleaved programmable DSP’s,” IEEE Trans. Acoustics, Speech,
Signal Processing, vol. 35(9), pp. 1320–1345, September 1987.
C.E. Leiserson, F. Rose, and J. Saxe, “Optimizing synchronous circuitry by retiming,” Proc. 3rd Caltech Conf.
VLSI, Pasadena, Calif., pp. 87–116, March 1983.
K.K. Parhi, “Algorithm transformation techniques for concurrent processors,” Proc. IEEE, vol. 77(12), pp.
1879–1895, December 1989.
K.K. Parhi, “Systematic approach for design of digit-serial processing architectures,” IEEE Trans. Circuits Systems,
vol. 38(4), pp. 358–375, April 1991.
K.K. Parhi and D.G. Messerschmitt, “Pipeline interleaving and parallelism in recursive digital filters,” IEEE
Trans. Acoustics, Speech, Signal Processing, vol. 37(7), pp. 1099–1135, July 1989.
K.K. Parhi, C.Y. Wang, and A.P. Brown, “Synthesis of control circuits in folded pipelined DSP architectures,”
IEEE J. Solid State Circuits, vol. 27(1), pp. 29–43, January 1992.
N.R. Shanbhag, and K.K. Parhi, “A pipelined adaptive lattice filter architecture,” Proc. 1992 IEEE Int. Symp.
Circuits and Systems, San Diego, May 1992.
H.R. Srinivas and K.K. Parhi, “High-speed VLSI arithmetic processor architectures using hybrid number
representation,” J. VLSI Signal Processing, vol. 4(2/3), pp. 177–198, 1992.
Further Information
A detailed video tutorial on “Implementation and Synthesis of VLSI Signal Processing Systems” presented by
K.K. Parhi and J.M. Rabaey in March 1992 can be purchased from the customer service department of IEEE,
445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
Special architectures for video communications can be found in the book VLSI Implementations for Image
Communications, published as the fourth volume of the series Advances in Image Communications (edited by
Peter Pirsch) by the Elsevier Science Publishing Co. in 1993. The informative article “Research on VLSI for
Digital Video Systems in Japan,” published by K.K. Parhi in the fourth volume of the 1991 Office of Naval
Research Asian Office Scientific Information Bulletin (pages 93–98), provides examples of video codec designs
using special architectures. For video programmable digital signal processor approaches, see I. Tamitani, H.
Harasaki, and T. Nishitani, “A Real-Time HDTV Signal Processor: HD-VSP,” published in IEEE Transactions
on Circuits and Systems for Video Technology, March 1991, and T. Fujii, T. Sawabe, N. Ohta, and S. Ono,
“Implementation of Super High-Definition Image Processing on HiPIPE,” published in 1991 IEEE International
Symposium on Circuits and Systems, held in June 1991 in Singapore (pages 348–351).
The IEEE Design and Test of Computers published three special issues related to computer-aided design of
special architectures; these issues were published in October 1990 (addressing high-level synthesis), December
1990 (addressing silicon compilations), and June 1991 (addressing rapid prototyping).
Descriptions of various CAD systems can be found in the following references. The description of the FIRST
system can be found in the article “A Silicon Compiler for VLSI Signal Processing,” by P. Denyer et al. in the
Proceedings of the ESSCIRC conference held in Brussels in September 1982 (pages 215–218). The Cathedral
system has been described in R. Jain et al., “Custom Design of a VLSI PCM-FDM Transmultiplexor from System
Specifications to Circuit Layout Using a Computer Aided Design System,” published in IEEE Journal of Solid
State Circuits in February 1986 (pages 73–85). The Lager system has been described in “An Integrated Automatic
Layout Generation System for DSP Circuits,” by J. Rabaey, S. Pope, and R. Brodersen, published in the July
1985 issue of the IEEE Transactions on Computer Aided Design (pages 285–296). The description of the MARS
Design System can be found in C.-Y. Wang and K.K. Parhi, “High-Level DSP Synthesis Using MARS System,”
published in Proceedings of the 1992 IEEE International Symposium on Circuits and Systems in San Diego, May
1992. A tutorial article on high-level synthesis can be found in “The High-Level Synthesis of Digital Systems,”
by M.C. McFarland, A. Parker, and R. Composano, published in the February 1990 issue of the Proceedings of
the IEEE (pages 310–318).
Articles on pipelined multipliers can be found in T.G. Noll et al., “A Pipelined 330-MHz Multiplier,” IEEE Journal of Solid State Circuits, June 1986 (pages 411–416) and in M. Hatamian and G. Cash, “A 70-MHz 8-Bit × 8-Bit Parallel Pipelined Multiplier in 2.5-μm CMOS,” IEEE Journal of Solid State Circuits, 1986.
Technical articles on special architectures and chips for signal and image processing appear at different places,
including proceedings of conferences such as IEEE Workshop on VLSI Signal Processing, IEEE International
Conference on Acoustics, Speech, and Signal Processing, IEEE International Symposium on Circuits and
Systems, IEEE International Solid State Circuits Conference, IEEE Customs Integrated Circuits Conference,
IEEE International Conference on Computer Design, ACM/IEEE Design Automation Conference, ACM/IEEE
International Conference on Computer Aided Design, International Conference on Application Specific Array
Processors, and journals such as IEEE Transactions on Signal Processing, IEEE Transactions on Image Processing,
IEEE Transactions on Circuits and Systems: Part II: Analog and Digital Signal Processing, IEEE Transactions on
Computers, IEEE Journal of Solid State Circuits, IEEE Signal Processing Magazine, IEEE Design and Test Magazine,
and Journal of VLSI Signal Processing.
18.2 Signal Processing Chips and Applications
Rulph Chassaing (Roger Williams University) and Bill Bitler (InfiMed)
Recent advances in very large scale integration (VLSI) have contributed to the current digital signal processors.
These processors are just special-purpose fast microprocessors characterized by architectures and instructions
suitable for real-time digital signal processing (DSP) applications. The commercial DSP processor, a little more
than a decade old, has emerged because of the ever-increasing number of signal processing applications. DSP
processors are now being utilized in a number of applications from communications and controls to speech
and image processing. They have found their way into talking toys and music synthesizers. A number of texts
[such as Chassaing, 1992] and articles [such as Ahmed and Kline, 1991] have been written, discussing the
applications that use DSP processors and the recent advances in DSP systems.
DSP Processors
Digital signal processors are currently available from a number of companies, including Texas Instruments,
Inc. (Texas), Motorola, Inc. (Arizona), Analog Devices, Inc. (Massachusetts), AT&T (New Jersey), and NEC
(California). These processors are categorized as either fixed-point or floating-point processors. Several com-
panies are now supporting both types of processors. Special-purpose digital signal processors, designed for a
specific signal processing application such as for fast Fourier transform (FFT), have also emerged. Currently
available digital signal processors range from simple, low cost processing units through high performance units
such as Texas Instruments’ (TI) TMS320C40 (Chassaing and Martin, 1995) and TMS320C80, and Analog
Devices’ ADSP-21060 SHARC (Chassaing and Ayers, 1996).
One of the first-generation digital signal processors is the (N-MOS technology) TMS32010, introduced by
Texas Instruments in 1982. This first-generation fixed-point processor is based on the Harvard architecture,
with a fast on-chip hardware multiplier/accumulator, and with data and instructions in separate memory spaces,
allowing for concurrent accesses. This pipelining feature enables the processor to execute one instruction while simultaneously fetching the next instruction. Other features include 144 (16-bit) words of on-chip data
RAM and a 16-bit by 16-bit multiply operation in one instruction cycle time of 200 ns. Since many instructions
can be executed in one single cycle, the TMS32010 is capable of executing 5 million instructions per second
(MIPS). Major drawbacks of this first-generation processor are its limited on-chip memory size and much
slower execution time for accessing external memory. Improved versions of this first-generation processor are
now available in C-MOS technology, with a faster instruction cycle time of 160 ns.
The second-generation fixed-point processor TMS32020, introduced in 1985 by TI, was quickly followed by
an improved C-MOS version TMS320C25 [Chassaing and Horning, 1990] in 1986. Features of the TMS320C25
include 544 (16-bit) words of on-chip data RAM, separate program and data memory spaces (each 64 K words),
and an instruction cycle time of 100 ns, enabling the TMS320C25 to execute 10 MIPS. A faster version, TI’s
fixed-point TMS320C50 processor, is available with an instruction cycle time of 35 ns.
The third-generation TMS320C30 (by TI) supports fixed- as well as floating-point operations [Chassaing,
1992]. Features of this processor include 32-bit by 32-bit floating-point multiply operations in one instruction
cycle time of 60 ns. Since a number of instructions, such as load and store, multiply and add, can be performed
in parallel (in one cycle time), the TMS320C30 can execute a pair of parallel instructions in 30 ns, allowing
for 33.3 MIPS. The Harvard-based architecture of the fixed-point processors was abandoned for one allowing
four levels of pipelining, with the three subsequent instructions being concurrently fetched, decoded, and read
while the current instruction is being executed. The TMS320C30 has 2 K words of on-chip memory and a total
of 16 million words of addressable memory spaces for program, data, and input/output. Specialized instructions
are available to make common DSP algorithms such as filtering and spectral analysis execute fast and efficiently.
The architecture of the TMS320C30 was designed to take advantage of higher-level languages such as C and
ADA. The TMS320C31 and TMS320C32, recent members of the third-generation floating-point processors,
are available with a 40 ns instruction cycle time.
DSP starter kits (DSK) are inexpensive development systems available from TI and based on both the fixed-
point TMS320C50 and the floating-point TMS320C31 processors. We will discuss both the fixed-point
TMS320C25 and the floating-point TMS320C30 digital signal processors, including the development tools
available for each of these processors and DSP applications.
Fixed-Point TMS320C25-Based Development System
TMS320C25-based development systems are now available from a number of companies such as Hyperception
Inc., Texas, and Atlanta Signal Processors, Inc., Georgia. The Software Development System (SWDS), available
from TI includes a board containing the TMS320C25, which plugs into a slot on an IBM compatible PC. Within
the SWDS environment, a program can be developed, assembled, and run. Debugging aids supported by the
SWDS include single-stepping, setting of breakpoints, and display/modification of registers.
A typical workstation consists of:
1. An IBM compatible PC. Commercially available DSP packages (such as from Hyperception or Atlanta Signal Processors) include a number of utilities and filter design techniques.
2. The SWDS package, which includes an assembler, a linker, a debug monitor, and a C compiler.
3. Input/output alternatives such as TI’s analog interface board (AIB) or analog interface chip (AIC).
The AIB includes a 12-bit analog-to-digital converter (ADC) and a 12-bit digital-to-analog converter (DAC).
A maximum sampling rate of 40 kHz can be obtained. With (input) antialiasing and (output) reconstruction
filters mounted on a header on the AIB, different input/output (I/O) filter bandwidths can be achieved.
Instructions such as IN and OUT can be used for input/output accesses. The AIC, which provides an inexpensive
I/O alternative, includes 14-bit ADC and DAC, antialiasing/reconstruction filters, all on a single C-MOS chip.
Two inputs and one output are available on the AIC. (A TMS320C25/AIC interface diagram and communication
routines can be found in Chassaing and Horning, 1990.) The TLC32047 AIC is a recent member of the TLC32040
family of voiceband analog interface circuits, with a maximum sampling rate of 25 kHz.
Implementation of a Finite Impulse Response Filter with the TMS320C25
The convolution equation

y(n) = Σ_{k=0}^{N–1} h(k)x(n – k)
     = h(0)x(n) + h(1)x(n – 1) + h(2)x(n – 2) + … + h(N – 1)x(n – (N – 1))          (18.1)
represents a finite impulse response (FIR) filter with length N. The memory organization for the coefficients
h(k) and the input samples x(n – k) is shown in Table 18.1. The coefficients are placed within a specified internal
program memory space and the input samples within a specified data memory space. The program counter
(PC) initially points at the memory location that contains the last coefficient h(N – 1), for example at memory
address FF00h (in hex). One of the (8) auxiliary registers points at the memory address of the last or least
recent input sample. The most recent sample is represented by x(n). The following program segment implements
(18.1):
LARP AR1
RPTK N-1
MACD FF00h,*-
APAC
The first instruction selects auxiliary register AR1, which will be used for indirect addressing. The second
instruction RPTK causes the subsequent MACD instruction to execute N times (repeated N – 1 times). The
MACD instruction has the following functions:
1. Multiplies the coefficient value h(N – 1) by the input sample value x(n – (N – 1)).
2. Accumulates any previous product stored in a special register (TR).
3. Copies the data memory sample value into the location of the next-higher memory. This “data move”
is to model the input sample delays associated with the next unit of time n + 1.
The last instruction APAC accumulates the last multiply operation h(0)x(n).
At time n + 1, the convolution Eq. (18.1) becomes

y(n + 1) = h(0)x(n + 1) + h(1)x(n) + … + h(N – 2)x(n – (N – 3)) + h(N – 1)x(n – (N – 2))          (18.2)
The previous program segment can be placed within a loop, with the PC and the auxiliary register AR1
reinitialized (see the memory organization of the samples x(k) associated with time n + 1 in Table 18.1). Note
that the last multiply operation is h(0)x(.), where x(.) represents the newest sample. This process can be
continuously repeated for time n + 2, n + 3, and so on.
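A plain C reference model of this data-move scheme is sketched below (illustrative only; the processor itself executes the MACD-based loop above, and the function name fir_data_move is made up). The explicit shift of the sample array mirrors the data move that MACD performs while it multiplies and accumulates, so each call consumes one new input sample and produces one output sample.

/* C reference model of FIR convolution with the "data move" delay-line
 * update: x[k] holds x(n - k), and each sample is copied one position
 * toward the oldest end as it is used, once per output sample.            */
#include <stdio.h>

#define N 5                                   /* filter length */

static double fir_data_move(const double h[N], double x[N], double newest)
{
    x[0] = newest;                            /* x[0] = x(n), x[N-1] = oldest */

    double y = 0.0;
    for (int k = N - 1; k >= 0; k--) {        /* start from the oldest sample */
        y += h[k] * x[k];                     /* multiply-accumulate (MACD)   */
        if (k < N - 1)
            x[k + 1] = x[k];                  /* data move: age the sample    */
    }
    return y;
}

int main(void)
{
    const double h[N] = {0.1, 0.2, 0.4, 0.2, 0.1};   /* example coefficients */
    double x[N] = {0};                               /* delay line, cleared  */

    for (int n = 0; n < 8; n++) {
        double input = (n == 0) ? 1.0 : 0.0;         /* impulse input        */
        printf("y(%d) = %g\n", n, fir_data_move(h, x, input));
    }
    return 0;
}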
The characteristics of a frequency selective FIR filter are specified by a set of coefficients that can be readily
obtained using commercially available filter design packages. These coefficients can be placed within a generic
FIR program. Within 5–10 minutes, an FIR filter can be implemented in real time. This includes finding the
coefficients; assembling, linking and downloading the FIR program into the SWDS; and observing the desired
frequency response displayed on a spectrum analyzer. A different FIR filter can be quickly obtained since the
only necessary change in the generic program is to substitute a new set of coefficients.
The approach for modeling the sample delays involves moving the data. A different scheme is used with the
floating-point TMS320C30 processor with a circular mode of addressing.
TABLE 18.1  TMS320C25 Memory Organization for Convolution

                                     Input Samples
  Coefficients         Time n                  Time n + 1              Time n + 2
  PC → h(N – 1)        x(n)                    x(n + 1)                x(n + 2)
       h(N – 2)        x(n – 1)                x(n)                    x(n + 1)
       ...             ...                     ...                     ...
       h(2)            x(n – (N – 3))          x(n – (N – 4))          x(n – (N – 5))
       h(1)            x(n – (N – 2))          x(n – (N – 3))          x(n – (N – 4))
       h(0)            AR1 → x(n – (N – 1))    x(n – (N – 2))          x(n – (N – 3))
Floating-Point TMS320C30-Based Development System
TMS320C30-based DSP development systems are also currently available from a number of companies. The
following are available from Texas Instruments:
1. An evaluation module (EVM). The EVM is a powerful, yet relatively inexpensive 8-bit card that plugs
into a slot on an IBM AT compatible. It includes the third-generation TMS320C30, 16 K of user RAM,
and an AIC for I/O. A serial port connector available on the EVM can be used to interface the TMS320C30
to other input/output devices (the TMS320C30 has two serial ports). An additional AIC can be interfaced
to the TMS320C30 through this serial port connector. A very powerful, yet inexpensive, analog evaluation
fixture, available from Burr-Brown (Arizona), can also be readily interfaced to the serial port on the
EVM. This complete two-channel analog evaluation fixture includes an 18-bit DSP102 ADC, an 18-bit
DSP202 DAC, antialiasing and reconstruction filters. The ADC has a maximum sampling rate of 200 kHz.
2. An XDS1000 emulator—powerful but quite expensive. A module can be readily built as a target system
to interface to the XDS1000 [Chassaing, 1992]. This module contains the TMS320C30, 16 K of static
RAM. Two connectors are included on this module, for interfacing to either an AIC module or to a
second-generation analog interface board (AIB). The AIC was discussed in conjunction with the
TMS320C25. The AIB includes Burr-Brown’s 16-bit ADC and DAC with a maximum sampling rate of
58 kHz. An AIC is also included on this newer AIB version.
EVM Tools
The EVM package includes an assembler, a linker, a simulator, a C compiler, and a C source debugger. The
second-generation TMS320C25 fixed-point processor is supported by C with some degree of success. The
architecture and instruction set of the third-generation TMS320C30 processor facilitate the development of
high-level language compilers. An optimizer option is available with the C compiler for the TMS320C30. A C-
code program can be readily compiled, assembled, linked, and downloaded into either a simulator or the EVM
for real-time processing. A run-time support library of C functions, included with the EVM package, can be
used during linking. During simulation, the input data can be retrieved from a file and the output data written
into a file. Input and output port addresses can be appropriately specified. Within a real-time processing
environment with the EVM, the C source debugger can be used. One can single-step through a C-code program
while observing the equivalent step(s) through the assembly code. Both the C code and the corresponding
assembly code can be viewed through the EVM windows. One can also monitor at the same time the contents
of registers, memory locations, and so on.
Implementation of a Finite Impulse Response Filter with the TMS320C30
Consider again the convolution equation, Eq. (18.1), which represents an FIR filter. Table 18.2 shows the
TMS320C30 memory organization used for the coefficients and the input samples. Initially, all the input samples
can be set to zero. The newest sample x(n), at time n, can be retrieved from an ADC using the following
instructions:
TABLE 18.2  TMS320C30 Memory Organization for Convolution

Coefficients          Time n                  Time n + 1              Time n + 2
AR0 → h(N – 1)        AR1 → x(n – (N – 1))    x(n + 1)                x(n + 1)
h(N – 2)              x(n – (N – 2))          AR1 → x(n – (N – 2))    x(n + 2)
h(N – 3)              x(n – (N – 3))          x(n – (N – 3))          AR1 → x(n – (N – 3))
  .                     .                       .                       .
  .                     .                       .                       .
h(1)                  x(n – 1)                x(n – 1)                x(n – 1)
h(0)                  x(n)                    x(n)                    x(n)
FLOAT *AR3,R3       ; convert the integer input at the port addressed by AR3 to floating point in R3
STF R3, *AR1++%     ; store R3 in the sample buffer and circularly postincrement AR1
These two instructions cause an input value x(n), retrieved from an input port address specified by auxiliary
register AR3, to be loaded into a register R3 (one of eight 40-bit-wide extended precision registers), then stored
in a memory location pointed to by AR1 (AR1 would first be initialized to point at the “bottom” or higher-memory
address of the table for the input samples). AR1 is then postincremented in a circular fashion, designated with
the modulo operator %, to point at the oldest sample x(n – (N – 1)), as shown in Table 18.2. The size of the
circular buffer must first be specified. The following program segment implements (18.1):
RPTS LENGTH-1               ; repeat the next instruction LENGTH times
MPYF *AR0++%,*AR1++%,R0     ; R0 = h(.)x(.), both pointers circularly postincremented
|| ADDF R0,R2,R2            ; in parallel: accumulate the previous product into R2
ADDF R0,R2                  ; accumulate the last product, h(0)x(n)
The repeat “single” instruction RPTS causes the next (multiply) floating-point instruction MPYF to be
executed LENGTH times (the repeat count operand is LENGTH-1), where LENGTH is the length of the FIR filter. Furthermore,
since the first ADDF addition instruction is in parallel (designated by ||) with the MPYF instruction, it is also
executed LENGTH times. From Table 18.2, AR0, one of the eight available auxiliary registers, initially points
at the memory address (a table address) which contains the coefficient h(N – 1), and a second auxiliary register
AR1 now points to the address of the oldest input sample x(n – (N – 1)). The second instruction, which uses
indirect addressing, multiplies the content of the memory location pointed to by AR0, h(N – 1), by the content
of the memory location pointed to by AR1, x(n – (N – 1)), with the result stored in R0. Concurrently (in parallel),
the content of R0 is added to the content of R2, with the result stored in R2. Initially R0 and R2 are set to zero; hence, the
resulting value in R2 is not the product of the first multiply operation. After the first multiply operation, both
AR0 and AR1 are incremented, and h(N – 2) is multiplied by x(n – (N – 2)). Concurrently, the result of the
first multiply operation (stored in R0) is accumulated into R2. The second addition instruction, executed only
once, accumulates the last product h(0)x(n) (similar to the APAC instruction associated with the fixed-point
TMS320C25). The overall result yields an output value y(n) at time n. After the last multiply operation, both
AR0 and AR1 are postincremented to point at the “top” or lower-memory address of each circular buffer. The
process can then be repeated for time n + 1 in order to obtain a second output value y(n + 1). Note that the
newest sample x(n + 1) would be retrieved from an ADC using the FLOAT and STF instructions, then placed
at the top memory location of the buffer (table) containing the samples, overwriting the initial value x(n – (N
– 1)). AR1 is then incremented to point at the address containing x(n – (N – 2)), and the previous four
instructions can be repeated. The last multiply operation involves h(0) and x(.), where x(.) is the newest sample
x(n + 1), at time n + 1. The foregoing procedure would be repeated to produce an output y(n + 2), y(n + 3),
and so on. Each output value would be converted to a fixed-point equivalent value before being sent to a DAC.
The frequency response of an FIR filter with 41 coefficients and a center frequency of 2.5 kHz, obtained from
a signal analyzer, is displayed in Fig. 18.8.
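For readers who prefer to see the same idea in C, the sketch below computes the same circular-buffer convolution as the MPYF/ADDF loop described above: the newest sample overwrites the oldest, one pass through the buffer accumulates the coefficient-sample products, and the write index is advanced circularly (the traversal order differs from the assembly loop, but the accumulated sum is the same). The buffer length, array names, and the fir() function are illustrative assumptions, not code from the EVM package.

/* Circular-buffer FIR filter in C, mirroring the multiply-accumulate loop
   described above.  N, h[], xbuf[], and fir() are illustrative names only. */
#define N 41                       /* filter length                          */

static float h[N];                 /* coefficients h(0) .. h(N-1)            */
static float xbuf[N];              /* circular buffer of input samples       */
static int   newest = 0;           /* index where the next sample is written */

float fir(float x_new)             /* consume one sample, return y(n)        */
{
    float acc = 0.0f;
    int   i, j;

    xbuf[newest] = x_new;          /* overwrite the oldest sample            */

    j = newest;                    /* walk back through the last N samples   */
    for (i = 0; i < N; i++) {
        acc += h[i] * xbuf[j];     /* accumulate h(i) * x(n - i)             */
        j = (j == 0) ? N - 1 : j - 1;
    }

    newest = (newest + 1) % N;     /* advance the write pointer circularly   */
    return acc;
}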
FIR and IIR Implementation Using C and Assembly Code
A real-time implementation of a 45-coefficient bandpass FIR filter and a sixth-order IIR filter with 345 samples,
using C code and TMS320C30 code, is discussed in Chassaing and Bitler [1991]. Tables 18.3 and 18.4 show a
comparison of execution times of those two filters. The C language FIR filter, implemented without the modulo
operator % and compiled with C compiler V4.1, executed two times slower (1.5 times slower using the newer
C compiler V4.4) than an equivalent assembly language filter (which has an execution time similar to that of
a filter routine written in assembly and called by a C program). The C language IIR filter ran 1.3 times slower
than the corresponding assembly language IIR filter. These slower execution times may be acceptable for many
applications. Where execution speed is crucial,
a time-critical function may be written in assembly and called from a C program. In applications where speed
is not absolutely crucial, C provides a better environment because of its portability and maintainability.
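The large gap between the first two rows of Table 18.3 comes largely from how the circular-buffer index is wrapped in C. The fragment below contrasts the two styles under illustrative names (N and the function names are assumptions for the sketch); the remainder operation generally costs far more per sample than the compare-and-reset form, which is consistent with, though not the only possible explanation for, the measured difference.

/* Two ways of advancing a circular-buffer index of length N in C.
   N and the function names are illustrative only. */
#define N 45

static int advance_mod(int j)      /* "C with modulo": compact, but implies  */
{                                  /* a remainder operation every sample     */
    return (j + 1) % N;
}

static int advance_branch(int j)   /* "C without modulo": compare and reset, */
{                                  /* usually much cheaper per sample        */
    j = j + 1;
    if (j == N)
        j = 0;
    return j;
}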
Real-Time Applications
A number of applications are discussed in Chassaing and Horning (1990) using TMS320C25 code and in
Chassaing (1992) using TMS320C30 and C code. These applications include multirate and adaptive filtering,
modulation techniques, and graphic and parametric equalizers. Two applications are briefly discussed here: a
ten-band multirate filter and a video line rate analysis.
1. The functional block diagram of the multirate filter is shown in Fig. 18.9. The multirate design provides
a significant reduction in processing time and data storage, compared to an equivalent single-rate design.
With multirate filtering, we can use a decimation operation in order to obtain a sample rate reduction
or an interpolation operation (as shown in Fig. 18.9) in order to obtain a sample rate increase [Crochiere
and Rabiner, 1983]. A pseudorandom noise generator implemented in software provides the input noise
to the ten octave band filters. Each octave band filter consists of three 1/3-octave filters (each with 41
coefficients), which can be individually controlled. A controlled noise source can be obtained with this
design. Since each 1/3-octave band filter can be turned on or off, the noise spectrum can be shaped
accordingly. The interpolation filter is a low-pass FIR filter with a 2:1 data-rate increase, yielding two
sample outputs for each input sample (a minimal C sketch of such an interpolator appears after this list).
The sample rate of the highest octave-band filter is set at 32,768
samples per second, with each successively lower band processing at half the rate of the next-higher
band. The multirate filter (a nine-band version) was implemented with the TMS320C25 [Chassaing
et al., 1990]. Figure 18.10 shows the three 1/3-octave band filters of band 10 implemented with the EVM
FIGURE 18.8  Frequency response of 41-coefficient FIR filter.
TABLE 18.3  Execution Time and Program Size of FIR Filter

FIR (45 samples)      Execution Time (msec)    Size (words)
C with modulo         4.16                     122
C without modulo      0.338                    116
C-called assembly     0.1666                   74
Assembly              0.1652                   27
TABLE 18.4  Execution Time and Program Size of 6th-Order IIR Filter

IIR (345 samples)     Execution Time (msec)    Size (words)
C                     1.575                    109
Assembly              1.18                     29
in conjunction with the two-channel analog fixture (made by Burr-Brown). The center frequency of the
middle 1/3-octave band 10 filter is at approximately 8 kHz since the coefficients were designed for a
center frequency of 1/4 the sampling rate (the middle 1/3-octave band 9 filter would be centered at 4
kHz, the middle 1/3-octave band 8 filter at 2 kHz, and so on). Note that the center frequency of the
middle 1/3-octave band 1 filter would be at 2 Hz if the highest sampling rate is set at 4 kHz. Observe
from Fig. 18.10 that the crossover frequencies occur at the 3-dB points. Since the main processing time
of the multirate filter (implemented in assembly code) was measured to be 8.8 ms, the maximum
sampling rate was limited to 58 ksps.
2. A video line rate analysis implemented entirely in C code is discussed in Chassaing and Bitler [1992].
A module was built to sample a video line of information. This module included a 9.8-MHz clock, a
high-sampling-rate 8-bit ADC, and appropriate support circuitry (comparator, FIFO buffer, etc.). Interactive
features allowed for the selection of one (out of 256) horizontal lines of information and the
execution of algorithms for digital filtering, averaging, and edge enhancement, with the resulting effects
displayed on the PC screen. Figure 18.11 shows the display of a horizontal line (line #125) of information
FIGURE 18.9  Multirate filter functional block diagram.
FIGURE 18.10  Frequency responses of the 1/3-octave band ten filters.
obtained from a test chart with a charge coupled device (CCD) camera. The function key F3 selects the
1-MHz low-pass filter resulting in the display shown in Fig. 18.12. The 3-MHz filter (with F4) would
pass more of the higher-frequency components of the signal but with less noise reduction. F5 implements
the noise averaging algorithm. The effect of the edge enhancement algorithm (with F7) is displayed in
Fig. 18.13.
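As noted in the description of the multirate filter above, the following is a minimal C sketch of a 2:1 FIR interpolator of the kind used there: each low-rate input sample produces two high-rate output samples, computed from the even- and odd-indexed coefficients of the low-pass interpolation filter. The coefficient array, delay line, and function names are illustrative assumptions, not the code from the cited implementations.

/* 2:1 FIR interpolation using two polyphase branches: one input sample in,
   two output samples out.  L, g[], state[], and interpolate_by_2() are
   illustrative names; the coefficients g[] would come from a low-pass FIR
   design such as the interpolation filter described above. */
#define L 40                            /* interpolation-filter length (even) */

static float g[L];                      /* low-pass interpolation coefficients */
static float state[L / 2];              /* delay line at the low input rate    */

void interpolate_by_2(float x, float y[2])
{
    int i;

    for (i = L / 2 - 1; i > 0; i--)     /* shift the low-rate delay line       */
        state[i] = state[i - 1];
    state[0] = x;                       /* insert the new input sample         */

    y[0] = 0.0f;
    y[1] = 0.0f;
    for (i = 0; i < L / 2; i++) {
        y[0] += g[2 * i]     * state[i];   /* even-indexed coefficients        */
        y[1] += g[2 * i + 1] * state[i];   /* odd-indexed coefficients         */
    }
}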
Conclusions and Future Directions
DSP processors have been used extensively in a number of applications, even in non-DSP applications such as
graphics. The fourth-generation floating-point TMS320C40, code compatible with the TMS320C30, features
an instruction cycle time of 40 ns and six serial ports. The fifth-generation fixed-point TMS320C50, code
compatible with the first two generations of fixed-point processors, features an instruction cycle time of 35 ns
and 10 K words (16-bit) of on-chip data and program memory. Currently, both the fixed-point and floating-
point processors are being supported by TI.
FIGURE 18.11  Display of a horizontal line of video signal.
FIGURE 18.12  Video line signal with 1-MHz filtering.
Defining Terms
C compiler: Program that translates C code into assembly code.
Digital signal processor: Special-purpose microprocessor with an architecture suitable for fast execution of
signal processing algorithms.
Fixed-point processor: A processor capable of operating on scaled integer and fractional data values.
Floating-point processor: Processor capable of operating on integers as well as on fractional data values
without scaling.
On-chip memory: Internal memory available on the digital signal processor.
Pipelining feature: Feature that permits parallel operations of fetching, decoding, reading, and executing.
Special-purpose digital signal processor: Digital signal processor with special features for handling a specific
signal processing application, such as the FFT.
Related Topics
14.3 Design and Implementation of Digital Filters • 79.1 IC Logic Family Operation and Characteristics
References
H. M. Ahmed and R. B. Kline, “Recent advances in DSP systems,” IEEE Communications Magazine, 1991.
R. Chassaing, Digital Signal Processing with C and the TMS320C30, New York: Wiley, 1992.
R. Chassaing and R. Ayers, “Digital signal processing with the SHARC,” in Proceedings of the 1996 ASEE Annual
Conference, 1996.
R. Chassaing and B. Bitler, “Real-time digital filters in C,” in Proceedings of the 1991 ASEE Annual Conference,
1991.
R. Chassaing and B. Bitler, “A video line rate analysis using the TMS320C30 floating-point digital signal
processor,” in Proceedings of the 1992 ASEE Annual Conference, 1992.
R. Chassaing and D. W. Horning, Digital Signal Processing with the TMS320C25, New York: Wiley, 1990.
R. Chassaing and P. Martin, “Parallel processing with the TMS320C40,” in Proceedings of the 1995 ASEE Annual
Conference, 1995.
R. Chassaing, W.A. Peterson, and D. W. Horning, “A TMS320C25-based multirate filter,” IEEE Micro, 1990.
FIGURE 18.13 Video line signal with edge enhancement.
R.E. Crochiere and L.R. Rabiner, Multirate Digital Signal Processing, Englewood Cliffs, N.J.: Prentice-Hall, 1983.
K. S. Lin (ed.), Digital Signal Processing Applications with the TMS320 Family. Theory, Algorithms, and
Implementations, vol. 1, Texas Instruments, Inc., Texas, 1989.
A. V. Oppenheim and R. Schafer, Discrete-Time Signal Processing, Englewood Cliffs, N.J.: Prentice-Hall, 1989.
P. Papamichalis (ed.), Digital Signal Processing Applications with the TMS320 Family. Theory, Algorithms, and
Implementations, vol. 3, Texas Instruments, Inc., Texas, 1990.
Further Information
Rulph Chassaing teaches hands-on workshops on digital signal processing using C and the TMS320C30,
offered at Roger Williams University in Bristol, RI, 02809. He offered a one-week workshop in August 1996,
supported by the National Science Foundation (NSF). He will offer two workshops in August 1997, supported
by NSF, using the TMS320C30 and the TMS320C31. Workshops on the TMS320 family of digital signal
processors are offered by Texas Instruments, Inc. at various locations.
A tutorial “Digital Signal Processing Comes of Age” can be found in the IEEE Spectrum, May 1996.