Parhi, K.K., Chassaing, R., Bitler, B. “VLSI for Signal Processing”
The Electrical Engineering Handbook
Ed. Richard C. Dorf
Boca Raton: CRC Press LLC, 2000
18
VLSI for Signal Processing
18.1 Special Architectures
Pipelining • Parallel Processing • Retiming • Unfolding • Folding Transformation • Look-Ahead Technique • Associativity Transformation • Distributivity • Arithmetic Processor Architectures • Computer-Aided Design • Future VLSI DSP Systems
18.2 Signal Processing Chips and Applications
DSP Processors • Fixed-Point TMS320C25-Based Development System • Implementation of a Finite Impulse Response Filter with the TMS320C25 • Floating-Point TMS320C30-Based Development System • EVM Tools • Implementation of a Finite Impulse Response Filter with the TMS320C30 • FIR and IIR Implementation Using C and Assembly Code • Real-Time Applications • Conclusions and Future Directions
18.1 Special Architectures
Keshab K. Parhi (University of Minnesota)
Digital signal processing (DSP) is used in numerous applications. These applications include telephony, mobile
radio, satellite communications, speech processing, video and image processing, biomedical applications, radar,
and sonar. Real-time implementations of DSP systems require design of hardware that can match the application
sample rate to the hardware processing rate (which is related to the clock rate and the implementation style).
Thus, real-time does not always mean high speed. Real-time architectures are capable of processing samples as
they are received from the signal source, as opposed to storing them in buffers for later processing as done in
batch processing. Furthermore, real-time architectures operate on an infinite time series (since the number of
the samples of the signal source is so large that it can be considered infinite). While speech and sonar applications
require lower sample rates, radar and video image processing applications require much higher sample rates.
The sample rate information alone cannot be used to choose the architecture. The algorithm complexity is also
an important consideration. For example, a very complex and computationally intensive algorithm for a low-
sample-rate application and a computationally simple algorithm for a high-sample-rate application may require
similar hardware speed and complexity. These ranges of algorithms and applications motivate us to study a
wide variety of architecture styles.
Using very large scale integration (VLSI) technology, DSP algorithms can be prototyped in many ways. These
options include (1) single or multiprocessor programmable digital signal processors, (2) the use of core
programmable digital signal processor with customized interface logic, (3) semicustom gate-array implemen-
tations, and (4) full-custom dedicated hardware implementation. The DSP algorithms are implemented in the
programmable processors by translating the algorithm to the processor assembly code. This can require an
extensive amount of time. On the other hand, high-level compilers for DSP can be used to generate the assembly
code. Although this is currently feasible, the code generated by the compiler is not as efficient as hand-optimized
code. Design of DSP compilers for generation of efficient code is still an active research topic. In the case of
dedicated designs, the challenge lies in a thorough understanding of the DSP algorithms and theory of archi-
tectures. For example, just minimizing the number of multipliers in an algorithm may not lead to a better
dedicated design. The area saved by the number of multipliers may be offset by the increase in control, routing,
and placement costs.
Off-the-shelf programmable digital signal processors can lead to faster prototyping. These prototyped systems
can prove very effective in fast simulation of computation-intensive algorithms (such as those encountered in
speech recognition, video compression, and seismic signal processing) or in benchmarking and standardization.
After standards are determined, it is more useful to implement the algorithms using dedicated circuits.
Design of dedicated circuits is not a simple task. Dedicated circuits provide limited or no programming
flexibility. They require less silicon area and consume less power. However, the low production volume, high
design cost, and long turnaround time are some of the difficulties associated with the design of dedicated
systems. Another difficulty is the availability of appropriate computer-aided design (CAD) tools for DSP systems.
As time progresses, however, the architectural design techniques will be better understood and can be incor-
porated into CAD tools, thus making the design of dedicated circuits easier. Hierarchical CAD tools can integrate
the design at various levels in an automatic and efficient manner. Implementation of standards for signal and
image processing using dedicated circuits will lead to higher volume production. As time progresses, dedicated
designs will be more acceptable to customers of DSP.
Successful design of dedicated circuits requires careful algorithm and architecture considerations. For exam-
ple, for a filtering application, different equivalent realizations may possess different levels of concurrency. Thus,
some of these realizations may be suitable for a particular application while other realizations may not be able
to meet the sample rate requirements of the application. The lower-level architecture may be implemented in
a word-serial or word-parallel manner. The arithmetic functional units may be implemented in bit-serial or
digit-serial or bit-parallel manner. The synthesized architecture may be implemented with a dedicated data
path or shared data path. The architecture may be systolic or nonsystolic.
Algorithm transformations play an important role in the design of dedicated architectures [Parhi, 1989].
This is because the transformed algorithms can be made to operate with better performance (where the
performance may be measured in terms of speed, area, or power). Examples of these transformations include
pipelining, parallel processing, retiming, unfolding, folding, look-ahead, associativity, and distributivity. These
transformations and other architectural concepts are described in detail in subsequent sections.
Pipelining
Pipelining can increase the amount of concurrency (or the number of activities performed simultaneously) in
an algorithm. Pipelining is accomplished by placing latches at appropriate intermediate points in a data flow
graph that describes the algorithm. Each latch also refers to a storage unit or buffer or register. The latches can
be placed at feed-forward cutsets of the data flow graph. In synchronous hardware implementations, pipelining
can increase the clock rate of the system (and therefore the sample rate). The drawbacks associated with
pipelining are the increase in system latency and the increase in the number of registers. To illustrate the speed
increase using pipelining, consider the second-order three-tap finite impulse response (FIR) filter shown in
Fig. 18.1(a). The signal x(n) in this system can be sampled at a rate limited by the throughput of one
multiplication and two additions. For simplicity, if we assume the multiplication time to be two times the
addition time (T_add), the effective sample or clock rate of this system is 1/(4T_add). By placing latches as shown in
Fig. 18.1(b) at the cutset shown in the dashed line, the sample rate can be improved to the rate of one
multiplication or two additions. While pipelining can be easily applied to all algorithms with no feedback loops
by the appropriate placement of latches, it cannot easily be applied to algorithms with feedback loops. This is
because the cutsets in feedback algorithms contain feed-forward and feedback data flow and cannot be con-
sidered as feed-forward cutsets.
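To make the cutset idea concrete, the following C fragment is a cycle-by-cycle software model (an illustrative sketch, not taken from this chapter) of the three-tap filter with pipelining registers placed between the multipliers and the adder chain; the variables reg_p0, reg_p1, and reg_p2 are hypothetical names for the latches at a feed-forward cutset (not necessarily the exact cutset drawn in Fig. 18.1(b)), and the output appears one clock cycle later than in the unpipelined filter.

/* Illustrative cycle-by-cycle model of a pipelined three-tap FIR filter
 * y(n) = a*x(n) + b*x(n-1) + c*x(n-2); reg_p0..reg_p2 model the latches
 * placed at a feed-forward cutset between the multipliers and the adders. */
#include <stdio.h>

int main(void)
{
    const double a = 0.5, b = 0.25, c = 0.125;       /* tap coefficients */
    const double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};    /* input samples    */

    double x1 = 0, x2 = 0;                       /* delay line: x(n-1), x(n-2) */
    double reg_p0 = 0, reg_p1 = 0, reg_p2 = 0;   /* pipelining latches         */

    for (int n = 0; n < 8; n++) {
        /* Second pipeline stage: add the products latched in the previous cycle. */
        double y = reg_p0 + reg_p1 + reg_p2;     /* equals y(n-1), one cycle late */

        /* First pipeline stage: form the products and latch them for next cycle. */
        reg_p0 = a * x[n];
        reg_p1 = b * x1;
        reg_p2 = c * x2;

        x2 = x1;                                 /* advance the input delay line  */
        x1 = x[n];

        printf("cycle %d: y = %g\n", n, y);
    }
    return 0;
}

Because the multipliers and the adders now work on different iterations in the same clock cycle, the clock period is set by the slower of the two stages rather than by their sum, at the cost of one extra cycle of latency and three additional registers.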
Pipelining can also be used to improve the performance in software programmable multiprocessor systems.
Most software programmable DSP processors are programmed using assembly code. The assembly code is
generated by high-level compilers that perform scheduling. Schedulers typically use the acyclic precedence
graph to construct schedules. The removal of all edges in the signal (or data) flow graph containing delay
elements converts the signal flow graph to an acyclic precedence graph. By placing latches to pipeline a data
flow graph, we can alter the acyclic precedence graph. In particular, the critical path of the acyclic precedence
graph can be reduced. The new precedence graph can be used to construct schedules with lower iteration
periods (although this may often require an increase in the number of processors).
Pipelining of algorithms can increase the sample rate of the system. Sometimes, for a constant sample rate,
pipelining can also reduce the power consumed by the system. This is because the data paths in the pipelined
system can be charged or discharged with lower supply voltage. Since the capacitance remains almost constant,
the power can be reduced. Achieving low power can be important in many battery-powered applications
[Chandrakasan et al., 1992].
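As a rough, first-order illustration (this uses the standard CMOS dynamic-power model and made-up voltages, not figures from this chapter), the switching power of a circuit is approximately

P = C_total V^2 f

where C_total is the switched capacitance, V is the supply voltage, and f is the clock frequency. If pipelining shortens the critical path enough that the same sample rate can be met with the supply reduced from, say, 5 V to 3.3 V, the power drops by roughly (3.3/5)^2 ≈ 0.44, with the capacitance and the clock frequency essentially unchanged.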
Parallel Processing
Parallel processing is related to pipelining but requires replication of hardware units. Pipelining exploits
concurrency by breaking a large task into multiple smaller tasks and by separating these smaller tasks by storage
units. On the other hand, parallelism exploits concurrency by performing multiple larger tasks simultaneously
in separate hardware units.
To illustrate the speed increase due to parallelism, consider the parallel implementation of the second-order
three-tap FIR filter of Fig. 18.1(a) shown in Fig. 18.2. In the architecture of Fig. 18.2, two input samples are
processed and two output samples are generated in each clock cycle period of four addition times. Because
each clock cycle processes two samples, however, the effective sample rate is 1/(2T_add), which is the same as that
of Fig. 18.1(b). The parallel architecture leads to the speed increase with significant hardware overhead. The
entire data flow graph needs to be replicated with an increase in the amount of parallelism. Thus, it is more
desirable to use pipelining as opposed to parallelism. However, parallelism may be useful if pipelining alone
cannot meet the speed demand of the application or if the technology constraints (such as limitations on the
clock rate by the I/O technology) limit the use of pipelining. In obvious ways, pipelining and parallelism can
be combined also. Parallelism, like pipelining, can also lead to power reduction but with significant overhead
in hardware requirements. Achieving pipelining and parallelism can be difficult for systems with feedback loops.
Concurrency may be created in these systems by using the look-ahead transformation.
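The hypothetical C sketch below (not the handbook's code) models the two-parallel structure of Fig. 18.2: each pass of the loop plays the role of one clock cycle, consumes two new input samples, and produces the two outputs y(2k) and y(2k + 1) using two copies of the filter arithmetic.

/* Illustrative model of two-parallel (block) processing of the three-tap FIR
 * filter: one loop iteration = one clock cycle = two inputs and two outputs. */
#include <stdio.h>

#define CYCLES 8                         /* number of two-sample clock cycles */

int main(void)
{
    const double h[3] = {0.5, 0.25, 0.125};               /* h(0), h(1), h(2) */
    const double x[2 * CYCLES] = {1, 2, 3, 4, 5, 6, 7, 8,
                                  9, 10, 11, 12, 13, 14, 15, 16};
    double xm1 = 0, xm2 = 0;     /* x(2k - 1) and x(2k - 2), kept across cycles */

    for (int k = 0; k < CYCLES; k++) {
        double x0 = x[2 * k];                              /* x(2k)     */
        double x1 = x[2 * k + 1];                          /* x(2k + 1) */

        /* Two outputs per cycle, computed by two copies of the data flow graph. */
        double y0 = h[0] * x0 + h[1] * xm1 + h[2] * xm2;   /* y(2k)     */
        double y1 = h[0] * x1 + h[1] * x0  + h[2] * xm1;   /* y(2k + 1) */

        xm1 = x1;                                  /* state for the next cycle */
        xm2 = x0;
        printf("y(%d) = %g   y(%d) = %g\n", 2 * k, y0, 2 * k + 1, y1);
    }
    return 0;
}

The loop body contains two copies of the multiply–add arithmetic, which is exactly the hardware duplication the text refers to; the clock rate is unchanged, but two samples are processed per clock cycle.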
FIGURE 18.1 (a) A three-tap second-order nonrecursive digital filter; (b) the equivalent pipelined digital filter obtained
by placing storage units at the intersection of the signal wires and the feed-forward cutset. If the multiplication and addition
operations require 2 and 1 unit of time, respectively, then the maximum achievable sampling rates for the original and the
pipelined architectures are 1/4 and 1/2 units, respectively.
Retiming
Retiming is similar to pipelining but yet different in some ways [Leiserson et al., 1983]. Retiming is the process
of moving the delays around in the data flow graph. Removal of one delay from all input edges of a node and
insertion of one delay to each outgoing edge of the same node is the simplest example of retiming. Unlike
pipelining, retiming does not increase the latency of the system. However, retiming alters the number of delay
elements in the system. Retiming can reduce the critical path of the data flow graph. As a result, it can lead to
clock period reduction in hardware implementations or critical path of the acyclic precedence graph or the
iteration period in programmable software system implementations.
The single host formulation of the retiming transformation preserves the latency of the algorithm. The
retiming formulation with no constraints on latency (i.e., with separate input and output hosts) can also achieve
pipelining with no retiming or pipelining with retiming. Pipelining with retiming is the most desirable transfor-
mation in DSP architecture design. Pipelining with retiming can be interpreted to be identical to retiming of
the original algorithm with a large number of delays at the input edges. Thus, we can increase the system latency
arbitrarily and remove the appropriate number of delays from the inputs after the transformation.
The retiming formulation assigns retiming variables r(.) to each node in the data flow graph. If i(U → V) is the number of delays associated with the edge U → V in the original data flow graph and r(V) and r(U), respectively, represent the retiming variable value of the nodes V and U, then the number of delays associated with the edge U → V in the retimed data flow graph is given by

i_r(U → V) = i(U → V) + r(V) – r(U)

For the data flow graph to be realizable, i_r(U → V) ≥ 0 must be satisfied. The retiming transformation formulates the problem by calculating path lengths and by imposing constraints on certain path lengths. These constraints are solved as a shortest-path problem.
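The retiming equation can be checked mechanically. The short C sketch below is an illustration only (the four-node graph and the r(.) values are invented for the example, not drawn from the chapter); it applies a candidate set of retiming variables to each edge and verifies that every retimed delay count i_r remains nonnegative.

/* Apply retiming variables to a small (made-up) data flow graph and check
 * i_r(U -> V) = i(U -> V) + r(V) - r(U) >= 0 on every edge.               */
#include <stdio.h>

struct edge { int u, v, delays; };

int main(void)
{
    /* A single loop 0 -> 1 -> 2 -> 3 -> 0 with delay counts 0, 1, 0, 2.   */
    struct edge g[4] = { {0, 1, 0}, {1, 2, 1}, {2, 3, 0}, {3, 0, 2} };
    int r[4] = {0, 0, 0, 1};        /* candidate retiming variables r(node) */
    int feasible = 1;

    for (int e = 0; e < 4; e++) {
        int ir = g[e].delays + r[g[e].v] - r[g[e].u];    /* retimed delays */
        printf("edge %d -> %d: i = %d, i_r = %d\n",
               g[e].u, g[e].v, g[e].delays, ir);
        if (ir < 0)
            feasible = 0;
    }
    printf(feasible ? "retiming is feasible\n" : "retiming is infeasible\n");
    return 0;
}

Note that the total number of delays around the loop (three) is unchanged by the retiming; only their positions move, which is what allows the critical path to be shortened without altering the input/output behavior.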
To illustrate the usefulness of retiming, consider the data flow graph of a two-stage pipelined lattice digital
filter graph shown in Fig. 18.3(a) and its equivalent pipelined-retimed data flow graph shown in Fig. 18.3(b).
If the multiply time is two units and the add time is one unit, the architecture in Fig. 18.3(a) can be clocked
with period 10 units whereas the architecture in Fig. 18.3(b) can be clocked with period 2 units.
FIGURE 18.2 Twofold parallel realization of the three-tap filter of Fig. 18.1(a).
Unfolding
The unfolding transformation is similar to loop unrolling. In J-unfolding, each node is replaced by J nodes
and each edge is replaced by J edges. The J-unfolded data flow graph executes J iterations of the original
algorithm [Parhi, 1991].
The unfolding transformation can unravel the hidden concurrency in a data flow program. The achievable
iteration period for a J-unfolded data flow graph is 1/J times the critical path length of the unfolded data flow
graph. By exploiting interiteration concurrency, unfolding can lead to a lower iteration period in the context
of a software programmable multiprocessor implementation.
The unfolding transformation can also be applied in the context of hardware design. If we apply an unfolding
transformation on a (word-serial) nonrecursive algorithm, the resulting data flow graph represents a word-
parallel (or simply parallel) algorithm that processes multiple samples or words in parallel every clock cycle.
If we apply 2-unfolding to the 3-tap FIR filter in Fig. 18.1(a), we can obtain the data flow graph of Fig. 18.2.
FIGURE 18.3 (a) A two-stage pipelinable time-invariant lattice digital filter. If multiplication and addition operations require 2 and 1 time units, respectively, then this data flow graph can achieve a sampling period of 10 time units (which corresponds to the critical path M_1 → A_2 → M_2 → A_1 → M_3 → A_3 → A_4). (b) The pipelined/retimed lattice digital filter can achieve a sampling period of 2 time units.
Because the unfolding algorithm is based on a graph-theoretic approach, it can also be applied at the bit level.
Thus, unfolding of a bit-serial data flow program by a factor of J leads to a digit-serial program with digit
size J. The digit size represents the number of bits processed per clock cycle. The digit-serial architecture is
clocked at the same rate as the bit-serial (assuming that the clock rate is limited by the communication I/O
bound much before reaching the computation bound of the bit-serial program). Because the digit-serial
program processes J bits per clock cycle, the effective bit rate of the digit-serial architecture is J times higher. A
simple example of this unfolding is illustrated in Fig. 18.4, where the bit-serial adder in Fig. 18.4(a) is unfolded
by a factor of 2 to obtain the digit-serial adder in Fig. 18.4(b) for digit size 2 for a word length of 4. In obvious
ways, the unfolding transformation can be applied to both word level and bit level simultaneously to generate
word-parallel digit-serial architectures. Such architectures process multiple words per clock cycle and process
a digit of each word (not the entire word).

FIGURE 18.4 (a) A least-significant-bit first bit-serial adder for word length of 4; (b) a digit-serial adder with digit size 2 obtained by two-unfolding of the bit-serial adder. The bit position 0 stands for least significant bit.
Folding Transformation
The folding transformation is the reverse of the unfolding transformation. While the unfolding transformation
is simpler, the folding transformation is more difficult [Parhi et al., 1992].
The folding transformation can be applied to fold a bit-parallel architecture to a digit-serial or bit-serial one
or to fold a digit-serial architecture to a bit-serial one. It can also be applied to fold an algorithm data flow
graph to a hardware data flow for a specified folding set. The folding set indicates the processor in which and
the time partition at which a task is executed. A specified folding set may be infeasible, and this needs to be
detected first. The folding transformation performs a preprocessing step to detect feasibility and in the feasible
case transforms the algorithm data flow graph to an equivalent pipelined/retimed data flow graph that can be
folded. For the special case of regular data flow graphs and for linear space–time mappings, the folding
transformation reduces to systolic array design.
In the folded architecture, each edge in the algorithm data flow graph is mapped to a communicating edge
in the hardware architecture data flow graph. Consider an edge U → V in the algorithm data flow graph with an associated number of delays i(U → V). Let the tasks U and V be mapped to the hardware units H_U and H_V, respectively. Assume that N time partitions are available, i.e., the iteration period is N. A modulo operation determines the time partition. For example, the time unit 18 for N = 4 corresponds to time partition 18 modulo 4, or 2. Let the tasks U and V be executed in time partitions u and v, i.e., the lth iterations of tasks U and V are executed in time units Nl + u and Nl + v, respectively. The i(U → V) delays in the edge U → V imply that the result of the lth iteration of U is used for the (l + i)th iteration of V. The (l + i)th iteration of V is executed in time unit N(l + i) + v. Thus the number of storage units needed in the folded edge corresponding to the edge U → V is

D_F(U → V) = N(l + i) + v – Nl – u – P_u = Ni + v – u – P_u
where P_u is the level of pipelining of the hardware operator H_U. The D_F(U → V) delays should be connected to the edge between H_U and H_V, and this signal should be switched to the input of H_V at time partition v. If the D_F(U → V)'s as calculated here were always nonnegative for all edges U → V, then the problem would be solved. However, some D_F()'s would be negative. The algorithm data flow graph needs to be pipelined and retimed such that all the D_F()'s are nonnegative. This can be formulated by simple inequalities using the retiming variables. The retiming formulation can be solved as a path problem, and the retiming variables can be determined if a solution exists. The algorithm data flow graph can be retimed for folding and the calculation of the D_F()'s can be repeated. The folded hardware architecture data flow graph can now be completed. The folding technique is illustrated in Fig. 18.5. The algorithm data flow graph of a two-stage pipelined lattice recursive digital filter of Fig. 18.3(a) is folded for the folding set shown in Fig. 18.5. Fig. 18.5(a) shows the pipelined/retimed data flow graph (preprocessed for folding) and Fig. 18.5(b) shows the hardware architecture data flow graph obtained after folding.
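The arithmetic of the folding equation is easy to mechanize. The C fragment below is a toy illustration (the edge parameters and the iteration period N = 4 are invented, and the structure name folded_edge is hypothetical); it evaluates D_F(U → V) = Ni + v – u – P_u for a few edges and flags any negative result, which is the condition that forces the pipelining/retiming preprocessing step described above.

/* Evaluate the folded delay count D_F(U -> V) = N*i + v - u - P_u for each
 * edge of a made-up folding example with N time partitions.               */
#include <stdio.h>

struct folded_edge {
    int i;    /* delays i(U -> V) on the algorithm edge            */
    int u;    /* time partition in which task U executes           */
    int v;    /* time partition in which task V executes           */
    int pu;   /* pipelining level P_u of the hardware operator H_U */
};

int main(void)
{
    const int N = 4;                                 /* iteration period */
    struct folded_edge e[3] = { {1, 3, 0, 2}, {0, 1, 3, 1}, {0, 2, 1, 2} };

    for (int k = 0; k < 3; k++) {
        int df = N * e[k].i + e[k].v - e[k].u - e[k].pu;
        printf("edge %d: D_F = %d%s\n", k, df,
               df < 0 ? "  (needs pipelining/retiming first)" : "");
    }
    return 0;
}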
As indicated before, a special case of folding can address systolic array design for regular data flow graphs
and for linear mappings. The systolic architectures make use of extensive pipelining and local communication
and operate in a synchronous manner [Kung, 1988]. The systolic processors can also be made to operate in an
asynchronous manner, and such systems are often referred to as wavefront processors. Systolic architectures
have been designed for a variety of applications including convolution, matrix solvers, matrix decomposition,
and filtering.
Look-Ahead Technique
The look-ahead technique is a very powerful technique for pipelining of recursive signal processing algorithms
[Parhi and Messerschmitt, 1989]. This technique can transform a sequential recursive algorithm to an equivalent
concurrent one, which can then be realized using pipelining or parallel processing or both. This technique has
been successfully applied to pipeline many signal processing algorithms, including recursive digital filters (in
direct form and lattice form), adaptive lattice digital filters, two-dimensional recursive digital filters, Viterbi
decoders, Huffman decoders, and finite state machines. This research demonstrated that the recursive signal
processing algorithms can be operated at high speed. This is an important result since modern signal processing
applications in radar and image processing and particularly in high-definition and super-high-definition tele-
vision video signal processing require very high throughput. Traditional algorithms and topologies cannot be
used for such high-speed applications because of the inherent speed bound of the algorithm created by the
feedback loops. The look-ahead transformation creates additional concurrency in the signal processing algo-
rithms and the speed bound of the transformed algorithms is increased substantially. The look-ahead trans-
formation is not free from its drawbacks. It is accompanied by an increase in the hardware overhead. This
difficulty has encouraged us to develop inherently pipelinable topologies for recursive signal processing algo-
rithms. Fortunately, this is possible to achieve in adaptive digital filters using relaxations on the look-ahead or
by the use of relaxed look-ahead [Shanbhag and Parhi, 1992].
To begin, consider a time-invariant one-pole recursive digital filter transfer function

H(z) = X(z)/U(z) = 1/(1 – az^–1)
described by the difference equation
x(n) = ax(n – 1) + u(n)
and shown in Fig. 18.6(a). The maximum achievable speed in this system is limited by the operating speed of
one multiply–add operation. To increase the speed of this system by a factor of 2, we can express x(n) in terms
of x(n – 2) by substitution of one recursion within the other:
x(n) = a[ax(n – 2) + u(n – 1)] + u(n) = a^2 x(n – 2) + au(n – 1) + u(n)
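A quick numerical check of this two-step look-ahead is given by the C sketch below (an illustration with arbitrary input and zero initial conditions, not code from the chapter); the original recursion and the look-ahead recursion produce identical sequences, the difference being that the transformed loop feeds back across two delays instead of one.

/* Verify that the 2-step look-ahead recursion
 *     x(n) = a^2 x(n-2) + a u(n-1) + u(n)
 * matches the original recursion  x(n) = a x(n-1) + u(n)
 * for the same input and zero initial conditions.                         */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double a = 0.75;
    double u[12], x_orig[12], x_la[12];

    for (int n = 0; n < 12; n++)
        u[n] = sin(0.4 * n) + 1.0;                 /* arbitrary test input */

    for (int n = 0; n < 12; n++) {
        double x1 = (n >= 1) ? x_orig[n - 1] : 0.0;
        x_orig[n] = a * x1 + u[n];                           /* original   */

        double x2 = (n >= 2) ? x_la[n - 2] : 0.0;
        double u1 = (n >= 1) ? u[n - 1] : 0.0;
        x_la[n] = a * a * x2 + a * u1 + u[n];                /* look-ahead */
    }

    for (int n = 0; n < 12; n++)
        printf("n = %2d   original = %10.6f   look-ahead = %10.6f\n",
               n, x_orig[n], x_la[n]);
    return 0;
}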
FIGURE 18.5 (a) A pipelined/retimed data flow graph obtained from Fig. 18.3(a) by preprocessing for folding; (b) the folded hardware architecture data flow graph. In our folding notation, the tasks are ordered within a set and the ordering represents the time partition in which the task is executed. For example, S_A1 = (A_2, A_1) implies that A_2 and A_1 are, respectively, executed in even and odd time partitions in the same processor. The notation Φ represents a null operation.
The transfer function of the emulated second-order system is given by

H(z) = (1 + az^–1)/(1 – a^2 z^–2)

and is obtained by using a pole-zero cancellation at –a. In the modified system, x(n) is computed using x(n – 2) as opposed to x(n – 1); thus we look ahead. The modified system has two delays in the multiply–add feedback loop, and these two delays can be distributed to pipeline the multiply–add operation by two stages. Of course, the additional multiply–add operation that represents one zero would also need to be pipelined by two stages to keep up with the sample rate of the system. To increase the speed by four times, we can rewrite the transfer function as

H(z) = (1 + az^–1)(1 + a^2 z^–2)/(1 – a^4 z^–4)

This system is shown in Fig. 18.6(b). Arbitrary speed increase is possible. However, for power-of-two speed increase the hardware complexity grows logarithmically with the speed-up factor. The same technique can be applied to any higher-order system. For example, a second-order recursive filter with transfer function

H(z) = 1/(1 – 2r cos θ z^–1 + r^2 z^–2)

can be modified to

H(z) = (1 + 2r cos θ z^–1 + r^2 z^–2)/(1 – 2r^2 cos 2θ z^–2 + r^4 z^–4)

for a twofold increase in speed. In this example, the output y(n) is computed using y(n – 2) and y(n – 4); thus, it is referred to as scattered look-ahead.

FIGURE 18.6 (a) A first-order recursive digital filter; (b) a four-stage pipelinable equivalent filter obtained by look-ahead computation.
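The pole-zero cancellation behind scattered look-ahead can also be verified numerically. The C sketch below (illustrative only; r and θ are arbitrary values, not design parameters from the chapter) computes the impulse responses of the original second-order filter and of its scattered look-ahead form and shows that they agree.

/* Compare impulse responses of
 *   original:    y(n) = 2 r cos(t) y(n-1) - r^2 y(n-2) + u(n)
 *   transformed: y(n) = 2 r^2 cos(2t) y(n-2) - r^4 y(n-4)
 *                       + u(n) + 2 r cos(t) u(n-1) + r^2 u(n-2)
 * The transformation only adds a canceling pole/zero pair, so the two
 * impulse responses should be identical.                                  */
#include <stdio.h>
#include <math.h>

#define LEN 16

int main(void)
{
    const double r = 0.9, t = 0.3;                 /* arbitrary pole location */
    double u[LEN] = {1.0};                         /* impulse input           */
    double y1[LEN], y2[LEN];

    for (int n = 0; n < LEN; n++) {
        double a1 = (n >= 1) ? y1[n - 1] : 0, a2 = (n >= 2) ? y1[n - 2] : 0;
        y1[n] = 2 * r * cos(t) * a1 - r * r * a2 + u[n];

        double b2 = (n >= 2) ? y2[n - 2] : 0, b4 = (n >= 4) ? y2[n - 4] : 0;
        double u1 = (n >= 1) ? u[n - 1] : 0, u2 = (n >= 2) ? u[n - 2] : 0;
        y2[n] = 2 * r * r * cos(2 * t) * b2 - r * r * r * r * b4
                + u[n] + 2 * r * cos(t) * u1 + r * r * u2;

        printf("n = %2d   original = %12.8f   scattered look-ahead = %12.8f\n",
               n, y1[n], y2[n]);
    }
    return 0;
}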
While look-ahead can transform any recursive digital filter transfer function to pipelined form, it leads to a
hardware overhead proportional to N log_2 M, where N is the filter order and M is the speed-up factor. Instead
of starting with a sequential digital filter transfer function obtained by traditional design approaches and
transforming it for pipelining, it is more desirable to use a constrained filter design program that can satisfy
the filter spectrum and the pipelining constraint. The pipelining constraint is satisfied by expressing the
denominator of the transfer function in scattered look-ahead form. Such filter design programs have now been
developed in both time domain and frequency domain. The advantage of the constrained filter design approach
is that we can obtain pipelined digital filters with marginal or zero hardware overhead compared with sequential
digital filters. The pipelined transfer functions can also be mapped to pipelined lattice digital filters. The reader
might note that the data flow graph of Fig. 18.3(a) was obtained by this approach.
The look-ahead pipelining can also be applied for the design of transversal and adaptive lattice digital filters.
Although look-ahead transformation can be used to modify the adaptive filter recursions to create concurrency,
this requires large hardware overhead. The adaptive filters are based on weight update operations, and the
weights are adapted based on the current error. Eventually, the error becomes close to zero and the filter coefficients converge. Thus, making relaxations on the error can reduce the hardware overhead substantially
without degradation of the convergence behavior of the adaptive filter. Three types of relaxations of look-ahead
are possible. These are referred to as sum relaxation, product relaxation, and delay relaxation. To illustrate these
three relaxations, consider the weight update recursion
w(n + 1) = a(n)w(n) + f(n)
where the term a(n) is typically 1 for transversal least mean square (LMS) adaptive filters and of the form (1 – e(n)) for lattice LMS adaptive digital filters, and f(n) = μe(n)u(n), where μ is a constant, e(n) is the error, and u(n) is the input. The use of look-ahead transforms the above recursion to

w(n + M) = [a(n + M – 1)a(n + M – 2) … a(n)]w(n) + f(n + M – 1) + a(n + M – 1)f(n + M – 2) + … + [a(n + M – 1)a(n + M – 2) … a(n + 1)]f(n)

In sum relaxation, we only retain the single term dependent on the current input for the last term of the look-ahead recursion. The relaxed recursion after sum relaxation is given by

w(n + M) = [a(n + M – 1)a(n + M – 2) … a(n)]w(n) + f(n + M – 1)

In lattice digital filters, the coefficient a(n) is close to 1 for all n, since it can be expressed as (1 – e(n)) and e(n) is close to zero for all n and is positive. The product relaxation on the above equation leads to

w(n + M) = (1 – Me(n + M – 1))w(n) + f(n + M – 1)
The delay relaxation assumes the signal to be slowly varying or to be constant over D samples and replaces the
look-ahead by
w(n + M) = (1 – Me(n + M – 1)) w(n) + f(n + M – D – 1)
These three types of relaxations make it possible to implement pipelined transversal and lattice adaptive digital
filters with marginal increase in hardware overhead. Relaxations on the weight update operations change the
convergence behavior of the adaptive filter, and we are forced to examine carefully the convergence behavior
of the relaxed look-ahead adaptive digital filters. It has been shown that the relaxed look-ahead adaptive digital
filters do not suffer from degradation in adaptation behavior. Furthermore, in image coding, the use of pipelined
adaptive filters could lead to a dramatic increase in pixel rate with no degradation in signal-to-noise ratio of
the coded image and no increase in hardware overhead [Shanbhag and Parhi, 1992].
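To make the flavor of these relaxations concrete, the toy C program below (illustrative only; it is a scalar exercise, not the adaptive filter architecture of Shanbhag and Parhi) identifies an unknown gain with a one-tap LMS filter for which a(n) = 1. The conventional filter updates its weight every sample, while the relaxed version, mimicking the combined sum and delay relaxations with M = 4, holds the weight for M samples and then applies only the most recent correction term; both weights converge to the unknown gain, the relaxed one more slowly.

/* Toy comparison of a sample-by-sample LMS update with a relaxed update that
 * is applied only once every M samples using the latest error term.        */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double c = 2.0;           /* unknown gain to be identified          */
    const double mu = 0.05;         /* step size                              */
    const int M = 4;                /* pipelining depth of the relaxed update */
    double w_serial = 0.0, w_relaxed = 0.0;

    for (int n = 0; n < 400; n++) {
        double u = sin(0.07 * n) + 0.5 * cos(0.31 * n);    /* input sample   */
        double d = c * u;                                  /* desired output */

        /* Conventional LMS: w(n+1) = w(n) + mu * e(n) * u(n).               */
        double e_s = d - w_serial * u;
        w_serial += mu * e_s * u;

        /* Relaxed update: hold w for M samples, then add only the latest
         * correction term (sum relaxation) computed with the held weight
         * (delay relaxation).                                               */
        double e_r = d - w_relaxed * u;
        if (n % M == M - 1)
            w_relaxed += mu * e_r * u;

        if (n % 100 == 99)
            printf("n = %3d   w_serial = %.4f   w_relaxed = %.4f\n",
                   n, w_serial, w_relaxed);
    }
    return 0;
}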
The concurrency created by look-ahead and relaxed look-ahead transformations can also be exploited in the
form of parallel processing. Furthermore, for a constant speed, concurrent architectures (especially the pipelined
architectures) can also lead to low power consumption.
Associativity Transformation
The addition operations in many signal processing algorithms can be interchanged since the add operations
satisfy associativity. Thus, it is possible to move the add operations outside the critical loops to increase the
maximum achievable speed of the system. As an example of the associative transformation, consider the
realization of a second-order recursion x(n) = (5/8)x(n – 1) – (3/4)x(n – 2) + u(n). Two possible realizations are
shown in Fig. 18.7(a). The realization on the left contains one multiplication and two add operations in the
critical inner loop, whereas the realization on the right contains one multiplication and one add operation in
the critical inner loop. The realization on the left can be transformed to the realization on the right using the
associativity transformation. Figure 18.7(b) shows a bit-serial implementation of this second-order recursion
for the realization on the right for a word length of 8. This bit-serial system can be operated in a functionally
correct manner for any word length greater than or equal to 5 since the inner loop computation latency is 5
cycles. On the other hand, if associativity were not exploited, then the minimum realizable word length would
be 6. Thus, associativity can improve the achievable speed of the system.
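The small C sketch below (an illustration, not the bit-serial hardware of Fig. 18.7(b)) evaluates the recursion with the two association orders of Fig. 18.7(a); both orderings produce the same output sequence (exactly here, because the coefficients are dyadic), but in the second form the product (3/4)x(n – 2) is combined with u(n) outside the loop through x(n – 1), which is what leaves only one multiplication and one addition in the critical inner loop of the hardware realization.

/* The second-order recursion x(n) = (5/8) x(n-1) - (3/4) x(n-2) + u(n)
 * evaluated with two different association orders.                        */
#include <stdio.h>

int main(void)
{
    const double u[10] = {1, 0, 0, 1, 0, 0, 1, 0, 0, 1};  /* arbitrary input */
    double xa1 = 0, xa2 = 0;         /* state of the left realization  */
    double xb1 = 0, xb2 = 0;         /* state of the right realization */

    for (int n = 0; n < 10; n++) {
        /* Left realization: both additions lie in the loop through x(n-1). */
        double xa = (0.625 * xa1 + (-0.75) * xa2) + u[n];

        /* Right realization: the x(n-2) product is added to u(n) first, so
         * the loop through x(n-1) contains one multiply and one add.       */
        double t  = (-0.75) * xb2 + u[n];
        double xb = 0.625 * xb1 + t;

        printf("n = %d   left = %10.6f   right = %10.6f\n", n, xa, xb);

        xa2 = xa1; xa1 = xa;         /* advance the delay lines */
        xb2 = xb1; xb1 = xb;
    }
    return 0;
}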
Distributivity
Another local transformation that is often useful is distributivity. In this transformation, a computation (A × B) + (A × C) may be reorganized as A × (B + C). Thus, the number of hardware units can be reduced from
two multipliers and one adder to one multiplier and one adder.
Arithmetic Processor Architectures
In addition to algorithms and architecture designs, it is also important to address implementation styles and
arithmetic processor architectures.
Most DSP systems use fixed-point hardware arithmetic operators. While many number system representa-
tions are possible, the two’s complement number system is the most popular number system. The other number
systems include the residue number system, the redundant or signed-digit number system, and the logarithmic
number system. The residue and logarithmic number systems are rarely used or are used in very special cases
such as nonrecursive digital filters. Shifting or scaling and division are difficult in the residue number system.
Difficulty with addition and the overhead associated with logarithm and antilogarithm converters reduce the
attractiveness of the logarithm number system. The use of the redundant number system leads to carry-free
operation but is accompanied by the overhead associated with redundant-to-two’s complement conversion.
Another approach often used is distributed arithmetic. This approach has recently been used in a few video
transformation chips.
The simplest arithmetic operation is addition. Multiplication can be realized as a series of add-shift opera-
tions, and division and square-root can be realized as a series of controlled add–subtract operations. The
conventional two’s complement adder involves carry ripple operation. This limits the throughput of the adder
operation. In DSP, however, the combined multiply–add operation is most common. Carry–save operations
have been used to realize pipelined multiply–adders using fewer pipelining latches. In a conventional pipelined two’s complement multiplier, the multiplication time is approximately two times the bit-level addition time.
Recently, a technique has been proposed to reduce the multiplication time from 2W bit-level binary adder
times to 1.25W bit-level binary adder times where W is the word length. This technique is based on the use of
hybrid number system representation, where one input operand is in two’s complement number representation
and the other in redundant number representation [Srinivas and Parhi, 1992]. Using an efficient sign-select
redundant-to-two’s complement conversion technique, this multiplier can be made to operate faster and, in
the pipelined mode, would require fewer pipelining latches and less silicon area.
FIGURE 18.7 (a) Two associative realizations of a second-order recursion; (b) an efficient bit-serial realization of the
recursion for a word length of 8.
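As a small illustration of the add–shift view of multiplication mentioned above (a generic C sketch, unrelated to the specific multiplier architectures cited), the function below forms an unsigned 16 × 16-bit product by scanning the multiplier bits and conditionally accumulating shifted copies of the multiplicand; a hardware array or pipelined multiplier performs essentially these conditional additions in parallel or in successive stages.

/* Shift-and-add multiplication of two unsigned 16-bit operands.           */
#include <stdint.h>
#include <stdio.h>

static uint32_t shift_add_multiply(uint16_t a, uint16_t b)
{
    uint32_t acc = 0;
    uint32_t shifted = a;              /* multiplicand, shifted left each step */

    for (int i = 0; i < 16; i++) {
        if (b & (1u << i))
            acc += shifted;            /* controlled add for a 1 bit of b      */
        shifted <<= 1;                 /* align with the next bit position     */
    }
    return acc;
}

int main(void)
{
    printf("%u\n", shift_add_multiply(1234, 567));    /* prints 699678 */
    return 0;
}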
Computer-Aided Design
With progress in the theory of architectures, computer-aided design (CAD) systems for DSP applications have also become more powerful. In the early 1980s, the first silicon compiler system for signal processing was developed
at the University of Edinburgh and was referred to as the FIRST design system. This system only addressed the
computer-aided design of bit-serial signal processing systems. Since then more powerful systems have been
developed. The Cathedral I system from Katholieke Universiteit Leuven and the BSSC (bit-serial silicon com-
piler) from GE Research Center in Schenectady, New York, also addressed synthesis of bit-serial circuits. The
Cathedral system has now gone through many revisions, and the new versions can synthesize parallel multi-
processor data paths and can perform more powerful scheduling and allocation. The Lager design tool at the
University of California at Berkeley was developed to synthesize the DSP algorithms using parametrizable macro
building blocks (such as ALU, RAM, ROM). This system has also gone through many revisions. The Hyper
system, also developed at the University of California at Berkeley, and the MARS design system, developed at the University of Minnesota at Minneapolis, perform higher-level transformations and perform scheduling and
allocation. These CAD tools are crucial to rapid prototyping of high-performance DSP integrated circuits.
Future VLSI DSP Systems
Future VLSI systems will make use of a combination of many types of architectures such as dedicated and
programmable. These systems can be designed successfully with proper understanding of the algorithms,
applications, theory of architectures, and with the use of advanced CAD systems.
Defining Terms
Bit serial: Processing of one bit per clock cycle. If word length is W, then one sample or word is processed
in W clock cycles. In contrast, all W bits of a word are processed in the same clock cycle in a bit-parallel
system.
Digit serial: Processing of more than one but not all bits in one clock cycle. If the digit size is W_1 and the word length is W, then the word is processed in W/W_1 clock cycles. If W_1 = 1, then the system is referred to as a bit-serial system, and if W_1 = W, then the system is referred to as a bit-parallel system. In general, the digit size W_1 need not be a divisor of the word length W, since the least and most significant bits of consecutive words can be overlapped and processed in the same clock cycle.
Folding: The technique of mapping many tasks to a single processor.
Look-ahead: The technique of computing a state x(n) using a previous state x(n – M) without requiring the intermediate states x(n – 1) through x(n – M + 1). This is referred to as an M-step look-ahead. In the case of higher-order computations, there are two forms of look-ahead: clustered look-ahead and scattered look-ahead. In clustered look-ahead, x(n) is computed using the clustered states x(n – M – N + 1) through x(n – M) for an Nth-order computation. In scattered look-ahead, x(n) is computed using the scattered states x(n – iM), where i varies from 1 to N.
Parallel processing: Processing of multiple tasks independently by different processors. This also increases
the throughput.
Pipelining: A technique to increase throughput. A long task is divided into components, and each component
is distributed to one processor. A new task can begin even though the former tasks have not been
completed. In the pipelined operation, different components of different tasks are executed at the same
time by different processors. Pipelining leads to an increase in the system latency, i.e., the time elapsed
between the starting of a task and the completion of the task.
Retiming: The technique of moving the delays around the system. Retiming does not alter the latency of the
system.
Systolic: Flow of data in a rhythmic fashion from a memory through many processors, returning to the memory, just as blood flows to and from the heart.
Unfolding: The technique of transforming a program that describes one iteration of an algorithm to another
equivalent program that describes multiple iterations of the same algorithm.
Word parallel: Processing of multiple words in the same clock cycle.
Related Topic
95.1 Introduction
References
A.P. Chandrakasan, S. Sheng, and R.W. Brodersen, “Low-power CMOS digital design,” IEEE J. Solid State
Circuits, vol. 27(4), pp. 473–484, April 1992.
S.Y. Kung, VLSI Array Processors, Englewood Cliffs, N.J.: Prentice-Hall, 1988.
E.A. Lee and D.G. Messerschmitt, “Pipeline interleaved programmable DSP’s,” IEEE Trans. Acoustics, Speech,
Signal Processing, vol. 35(9), pp. 1320–1345, September 1987.
C.E. Leiserson, F. Rose, and J. Saxe, “Optimizing synchronous circuitry by retiming,” Proc. 3rd Caltech Conf.
VLSI, Pasadena, Calif., pp. 87–116, March 1983.
K.K. Parhi, “Algorithm transformation techniques for concurrent processors,” Proc. IEEE, vol. 77(12), pp.
1879–1895, December 1989.
K.K. Parhi, “Systematic approach for design of digit-serial processing architectures,” IEEE Trans. Circuits Systems,
vol. 38(4), pp. 358–375, April 1991.
K.K. Parhi and D.G. Messerschmitt, “Pipeline interleaving and parallelism in recursive digital filters,” IEEE
Trans. Acoustics, Speech, Signal Processing, vol. 37(7), pp. 1099–1135, July 1989.
K.K. Parhi, C.Y. Wang, and A.P. Brown, “Synthesis of control circuits in folded pipelined DSP architectures,”
IEEE J. Solid State Circuits, vol. 27(1), pp. 29–43, January 1992.
N.R. Shanbhag, and K.K. Parhi, “A pipelined adaptive lattice filter architecture,” Proc. 1992 IEEE Int. Symp.
Circuits and Systems, San Diego, May 1992.
H.R. Srinivas and K.K. Parhi, “High-speed VLSI arithmetic processor architectures using hybrid number
representation,” J. VLSI Signal Processing, vol. 4(2/3), pp. 177–198, 1992.
Further Information
A detailed video tutorial on “Implementation and Synthesis of VLSI Signal Processing Systems” presented by
K.K. Parhi and J.M. Rabaey in March 1992 can be purchased from the customer service department of IEEE,
445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
Special architectures for video communications can be found in the book VLSI Implementations for Image
Communications, published as the fourth volume of the series Advances in Image Communications (edited by
Peter Pirsch) by the Elsevier Science Publishing Co. in 1993. The informative article “Research on VLSI for
Digital Video Systems in Japan,” published by K.K. Parhi in the fourth volume of the 1991 Office of Naval
Research Asian Office Scientific Information Bulletin (pages 93–98), provides examples of video codec designs
using special architectures. For video programmable digital signal processor approaches, see I. Tamitani, H.
Harasaki, and T. Nishitani, “A Real-Time HDTV Signal Processor: HD-VSP,” published in IEEE Transactions
on Circuits and Systems for Video Technology, March 1991, and T. Fujii, T. Sawabe, N. Ohta, and S. Ono,
“Implementation of Super High-Definition Image Processing on HiPIPE,” published in 1991 IEEE International
Symposium on Circuits and Systems, held in June 1991 in Singapore (pages 348–351).
The IEEE Design and Test of Computers published three special issues related to computer-aided design of
special architectures; these issues were published in October 1990 (addressing high-level synthesis), December
1990 (addressing silicon compilations), and June 1991 (addressing rapid prototyping).
Descriptions of various CAD systems can be found in the following references. The description of the FIRST
system can be found in the article “A Silicon Compiler for VLSI Signal Processing,” by P. Denyer et al. in the
Proceedings of the ESSCIRC conference held in Brussels in September 1982 (pages 215–218). The Cathedral
system has been described in R. Jain et al., “Custom Design of a VLSI PCM-FDM Transmultiplexor from System
Specifications to Circuit Layout Using a Computer Aided Design System,” published in IEEE Journal of Solid
State Circuits in February 1986 (pages 73–85). The Lager system has been described in “An Integrated Automatic
Layout Generation System for DSP Circuits,” by J. Rabaey, S. Pope, and R. Brodersen, published in the July
1985 issue of the IEEE Transactions on Computer Aided Design (pages 285–296). The description of the MARS
Design System can be found in C.-Y. Wang and K.K. Parhi, “High-Level DSP Synthesis Using MARS System,”
published in Proceedings of the 1992 IEEE International Symposium on Circuits and Systems in San Diego, May
1992. A tutorial article on high-level synthesis can be found in “The High-Level Synthesis of Digital Systems,”
by M.C. McFarland, A. Parker, and R. Composano, published in the February 1990 issue of the Proceedings of
the IEEE (pages 310–318).
Articles on pipelined multipliers can be found in T.G. Noll et al., “A Pipelined 330-MHz Multiplier,” IEEE Journal of Solid State Circuits, June 1986 (pages 411–416) and in M. Hatamian and G. Cash, “A 70-MHz 8-Bit × 8-Bit Parallel Pipelined Multiplier in 2.5-μm CMOS,” IEEE Journal of Solid State Circuits, 1986.
Technical articles on special architectures and chips for signal and image processing appear at different places,
including proceedings of conferences such as IEEE Workshop on VLSI Signal Processing, IEEE International
Conference on Acoustics, Speech, and Signal Processing, IEEE International Symposium on Circuits and
Systems, IEEE International Solid State Circuits Conference, IEEE Customs Integrated Circuits Conference,
IEEE International Conference on Computer Design, ACM/IEEE Design Automation Conference, ACM/IEEE
International Conference on Computer Aided Design, International Conference on Application Specific Array
Processors, and journals such as IEEE Transactions on Signal Processing, IEEE Transactions on Image Processing,
IEEE Transactions on Circuits and Systems: Part II: Analog and Digital Signal Processing, IEEE Transactions on
Computers, IEEE Journal of Solid State Circuits, IEEE Signal Processing Magazine, IEEE Design and Test Magazine,
and Journal of VLSI Signal Processing.
18.2 Signal Processing Chips and Applications
Rulph Chassaing (Roger Williams University) and Bill Bitler (InfiMed)
Recent advances in very large scale integration (VLSI) have contributed to the current digital signal processors.
These processors are just special-purpose fast microprocessors characterized by architectures and instructions
suitable for real-time digital signal processing (DSP) applications. The commercial DSP processor, a little more
than a decade old, has emerged because of the ever-increasing number of signal processing applications. DSP
processors are now being utilized in a number of applications from communications and controls to speech
and image processing. They have found their way into talking toys and music synthesizers. A number of texts
[such as Chassaing, 1992] and articles [such as Ahmed and Kline, 1991] have been written, discussing the
applications that use DSP processors and the recent advances in DSP systems.
DSP Processors
Digital signal processors are currently available from a number of companies, including Texas Instruments,
Inc. (Texas), Motorola, Inc. (Arizona), Analog Devices, Inc. (Massachusetts), AT&T (New Jersey), and NEC
(California). These processors are categorized as either fixed-point or floating-point processors. Several com-
panies are now supporting both types of processors. Special-purpose digital signal processors, designed for a
specific signal processing application such as for fast Fourier transform (FFT), have also emerged. Currently
available digital signal processors range from simple, low cost processing units through high performance units
such as Texas Instruments’ (TI) TMS320C40 (Chassaing and Martin, 1995) and TMS320C80, and Analog
Devices’ ADSP-21060 SHARC (Chassaing and Ayers, 1996).
One of the first-generation digital signal processors is the (N-MOS technology) TMS32010, introduced by
Texas Instruments in 1982. This first-generation fixed-point processor is based on the Harvard architecture,
with a fast on-chip hardware multiplier/accumulator, and with data and instructions in separate memory spaces,
allowing for concurrent accesses. This pipelining feature enables the processor to execute one instruction while simultaneously fetching the next instruction. Other features include 144 (16-bit) words of on-chip data
RAM and a 16-bit by 16-bit multiply operation in one instruction cycle time of 200 ns. Since many instructions
can be executed in one single cycle, the TMS32010 is capable of executing 5 million instructions per second
(MIPS). Major drawbacks of this first-generation processor are its limited on-chip memory size and much
slower execution time for accessing external memory. Improved versions of this first-generation processor are
now available in C-MOS technology, with a faster instruction cycle time of 160 ns.
The second-generation fixed-point processor TMS32020, introduced in 1985 by TI, was quickly followed by
an improved C-MOS version TMS320C25 [Chassaing and Horning, 1990] in 1986. Features of the TMS320C25
include 544 (16-bit) words of on-chip data RAM, separate program and data memory spaces (each 64 K words),
and an instruction cycle time of 100 ns, enabling the TMS320C25 to execute 10 MIPS. A faster version, TI’s
fixed-point TMS320C50 processor, is available with an instruction cycle time of 35 ns.
The third-generation TMS320C30 (by TI) supports fixed- as well as floating-point operations [Chassaing,
1992]. Features of this processor include 32-bit by 32-bit floating-point multiply operations in one instruction
cycle time of 60 ns. Since a number of instructions, such as load and store, multiply and add, can be performed
in parallel (in one cycle time), the TMS320C30 can execute a pair of parallel instructions in 30 ns, allowing
for 33.3 MIPS. The Harvard-based architecture of the fixed-point processors was abandoned for one allowing
four levels of pipelining, with the three subsequent instructions being concurrently fetched, decoded, and read
while the current instruction is being executed. The TMS320C30 has 2 K words of on-chip memory and a total
of 16 million words of addressable memory spaces for program, data, and input/output. Specialized instructions
are available to make common DSP algorithms such as filtering and spectral analysis execute fast and efficiently.
The architecture of the TMS320C30 was designed to take advantage of higher-level languages such as C and
ADA. The TMS320C31 and TMS320C32, recent members of the third-generation floating-point processors,
are available with a 40 ns instruction cycle time.
DSP starter kits (DSK) are inexpensive development systems available from TI and based on both the fixed-
point TMS320C50 and the floating-point TMS320C31 processors. We will discuss both the fixed-point
TMS320C25 and the floating-point TMS320C30 digital signal processors, including the development tools
available for each of these processors and DSP applications.
Fixed-Point TMS320C25-Based Development System
TMS320C25-based development systems are now available from a number of companies such as Hyperception
Inc., Texas, and Atlanta Signal Processors, Inc., Georgia. The Software Development System (SWDS), available
from TI includes a board containing the TMS320C25, which plugs into a slot on an IBM compatible PC. Within
the SWDS environment, a program can be developed, assembled, and run. Debugging aids supported by the
SWDS include single-stepping, setting of breakpoints, and display/modification of registers.
A typical workstation consists of:
1. An IBM compatible PC. Commercially available DSP packages (such as from Hyperception or Atlanta Signal Processors) include a number of utilities and filter design techniques.
2. The SWDS package, which includes an assembler, a linker, a debug monitor, and a C compiler.
3. Input/output alternatives such as TI’s analog interface board (AIB) or analog interface chip (AIC).
The AIB includes a 12-bit analog-to-digital converter (ADC) and a 12-bit digital-to-analog converter (DAC).
A maximum sampling rate of 40 kHz can be obtained. With (input) antialiasing and (output) reconstruction
filters mounted on a header on the AIB, different input/output (I/O) filter bandwidths can be achieved.
Instructions such as IN and OUT can be used for input/output accesses. The AIC, which provides an inexpensive
I/O alternative, includes 14-bit ADC and DAC, antialiasing/reconstruction filters, all on a single C-MOS chip.
Two inputs and one output are available on the AIC. (A TMS320C25/AIC interface diagram and communication
routines can be found in Chassaing and Horning, 1990.) The TLC32047 AIC is a recent member of the TLC32040
family of voiceband analog interface circuits, with a maximum sampling rate of 25 kHz.
Implementation of a Finite Impulse Response Filter with the TMS320C25
The convolution equation

y(n) = Σ_{k=0}^{N–1} h(k)x(n – k)
     = h(0)x(n) + h(1)x(n – 1) + h(2)x(n – 2) + … + h(N – 1)x(n – (N – 1))          (18.1)
represents a finite impulse response (FIR) filter with length N. The memory organization for the coefficients
h(k) and the input samples x(n – k) is shown in Table 18.1. The coefficients are placed within a specified internal
program memory space and the input samples within a specified data memory space. The program counter
(PC) initially points at the memory location that contains the last coefficient h(N – 1), for example at memory
address FF00h (in hex). One of the (8) auxiliary registers points at the memory address of the last or least
recent input sample. The most recent sample is represented by x(n). The following program segment implements
(18.1):
LARP AR1
RPTK N-1
MACD FF00h,*-
APAC
The first instruction selects auxiliary register AR1, which will be used for indirect addressing. The second
instruction RPTK causes the subsequent MACD instruction to execute N times (repeated N – 1 times). The
MACD instruction has the following functions:
1. Multiplies the coefficient value h(N – 1) by the input sample value x(n – (N – 1)).
2. Accumulates any previous product stored in a special register (TR).
3. Copies the data memory sample value into the location of the next-higher memory. This “data move”
is to model the input sample delays associated with the next unit of time n + 1.
The last instruction APAC accumulates the last multiply operation h(0)x(n).
At time n + 1, the convolution Eq. (18.1) becomes

y(n + 1) = h(0)x(n + 1) + h(1)x(n) + … + h(N – 2)x(n – (N – 3)) + h(N – 1)x(n – (N – 2))          (18.2)
The previous program segment can be placed within a loop, with the PC and the auxiliary register AR1
reinitialized (see the memory organization of the samples x(k) associated with time n + 1 in Table 18.1). Note
that the last multiply operation is h(0)x(.), where x(.) represents the newest sample. This process can be
continuously repeated for time n + 2, n + 3, and so on.
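A plain C reference model of this data-move scheme is sketched below (illustrative only; the processor itself executes the MACD-based loop above, and the function name fir_data_move is made up). The explicit shift of the sample array mirrors the data move that MACD performs while it multiplies and accumulates, so each call consumes one new input sample and produces one output sample.

/* C reference model of FIR convolution with the "data move" delay-line
 * update: x[k] holds x(n - k), and each sample is copied one position
 * toward the oldest end as it is used, once per output sample.            */
#include <stdio.h>

#define N 5                                   /* filter length */

static double fir_data_move(const double h[N], double x[N], double newest)
{
    x[0] = newest;                            /* x[0] = x(n), x[N-1] = oldest */

    double y = 0.0;
    for (int k = N - 1; k >= 0; k--) {        /* start from the oldest sample */
        y += h[k] * x[k];                     /* multiply-accumulate (MACD)   */
        if (k < N - 1)
            x[k + 1] = x[k];                  /* data move: age the sample    */
    }
    return y;
}

int main(void)
{
    const double h[N] = {0.1, 0.2, 0.4, 0.2, 0.1};   /* example coefficients */
    double x[N] = {0};                               /* delay line, cleared  */

    for (int n = 0; n < 8; n++) {
        double input = (n == 0) ? 1.0 : 0.0;         /* impulse input        */
        printf("y(%d) = %g\n", n, fir_data_move(h, x, input));
    }
    return 0;
}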
The characteristics of a frequency selective FIR filter are specified by a set of coefficients that can be readily
obtained using commercially available filter design packages. These coefficients can be placed within a generic
FIR program. Within 5–10 minutes, an FIR filter can be implemented in real time. This includes finding the
coefficients; assembling, linking and downloading the FIR program into the SWDS; and observing the desired
frequency response displayed on a spectrum analyzer. A different FIR filter can be quickly obtained since the
only necessary change in the generic program is to substitute a new set of coefficients.
The approach for modeling the sample delays involves moving the data. A different scheme is used with the
floating-point TMS320C30 processor with a circular mode of addressing.
TABLE 18.1  TMS320C25 Memory Organization for Convolution

                                     Input Samples
  Coefficients         Time n                  Time n + 1              Time n + 2
  PC → h(N – 1)        x(n)                    x(n + 1)                x(n + 2)
       h(N – 2)        x(n – 1)                x(n)                    x(n + 1)
       ...             ...                     ...                     ...
       h(2)            x(n – (N – 3))          x(n – (N – 4))          x(n – (N – 5))
       h(1)            x(n – (N – 2))          x(n – (N – 3))          x(n – (N – 4))
       h(0)            AR1 → x(n – (N – 1))    x(n – (N – 2))          x(n – (N – 3))
Floating-Point TMS320C30-Based Development System
TMS320C30-based DSP development systems are also currently available from a number of companies. The
following are available from Texas Instruments:
1. An evaluation module (EVM). The EVM is a powerful, yet relatively inexpensive 8-bit card that plugs
into a slot on an IBM AT compatible. It includes the third-generation TMS320C30, 16 K of user RAM,
and an AIC for I/O. A serial port connector available on the EVM can be used to interface the TMS320C30
to other input/output devices (the TMS320C30 has two serial ports). An additional AIC can be interfaced
to the TMS320C30 through this serial port connector. A very powerful, yet inexpensive, analog evaluation
fixture, available from Burr-Brown (Arizona), can also be readily interfaced to the serial port on the
EVM. This complete two-channel analog evaluation fixture includes an 18-bit DSP102 ADC, an 18-bit
DSP202 DAC, antialiasing and reconstruction filters. The ADC has a maximum sampling rate of 200 kHz.
2. An XDS1000 emulator—powerful but quite expensive. A module can be readily built as a target system
to interface to the XDS1000 [Chassaing, 1992]. This module contains the TMS320C30, 16 K of static
RAM. Two connectors are included on this module, for interfacing to either an AIC module or to a
second-generation analog interface board (AIB). The AIC was discussed in conjunction with the
TMS320C25. The AIB includes Burr-Brown’s 16-bit ADC and DAC with a maximum sampling rate of
58 kHz. An AIC is also included on this newer AIB version.
EVM Tools
The EVM package includes an assembler, a linker, a simulator, a C compiler, and a C source debugger. The
second-generation TMS320C25 fixed-point processor is supported by C with some degree of success. The
architecture and instruction set of the third-generation TMS320C30 processor facilitate the development of
high-level language compilers. An optimizer option is available with the C compiler for the TMS320C30. A C-
code program can be readily compiled, assembled, linked, and downloaded into either a simulator or the EVM
for real-time processing. A run-time support library of C functions, included with the EVM package, can be
used during linking. During simulation, the input data can be retrieved from a file and the output data written
into a file. Input and output port addresses can be appropriately specified. Within a real-time processing
environment with the EVM, the C source debugger can be used. One can single-step through a C-code program
while observing the equivalent step(s) through the assembly code. Both the C code and the corresponding
assembly code can be viewed through the EVM windows. One can also monitor at the same time the contents
of registers, memory locations, and so on.
Implementation of a Finite Impulse Response Filter with the TMS320C30
Consider again the convolution equation, Eq. (18.1), which represents an FIR filter. Table 18.2 shows the
TMS320C30 memory organization used for the coefficients and the input samples. Initially, all the input samples
can be set to zero. The newest sample x(n), at time n, can be retrieved from an ADC using the following
instructions:
TABLE 18.2  TMS320C30 Memory Organization for Convolution

Coefficients          Time n                  Time n + 1              Time n + 2
AR0 → h(N – 1)        AR1 → x(n – (N – 1))    x(n + 1)                x(n + 1)
h(N – 2)              x(n – (N – 2))          AR1 → x(n – (N – 2))    x(n + 2)
h(N – 3)              x(n – (N – 3))          x(n – (N – 3))          AR1 → x(n – (N – 3))
  .                     .                       .                       .
  .                     .                       .                       .
h(1)                  x(n – 1)                x(n – 1)                x(n – 1)
h(0)                  x(n)                    x(n)                    x(n)
FLOAT *AR3,R3       ; convert the integer input at the port addressed by AR3 to floating point in R3
STF R3, *AR1++%     ; store R3 in the sample buffer and circularly postincrement AR1
These two instructions cause an input value x(n), retrieved from an input port address specified by auxiliary
register AR3, to be loaded into a register R3 (one of eight 40-bit-wide extended precision registers), then stored
in a memory location pointed to by AR1 (AR1 would first be initialized to point at the “bottom” or higher-memory
address of the table for the input samples). AR1 is then postincremented in a circular fashion, designated with
the modulo operator %, to point at the oldest sample x(n – (N – 1)), as shown in Table 18.2. The size of the
circular buffer must first be specified. The following program segment implements (18.1):
RPTS LENGTH-1               ; repeat the next instruction LENGTH times
MPYF *AR0++%,*AR1++%,R0     ; R0 = h(.)x(.), both pointers circularly postincremented
|| ADDF R0,R2,R2            ; in parallel: accumulate the previous product into R2
ADDF R0,R2                  ; accumulate the last product, h(0)x(n)
The repeat “single” instruction RPTS causes the next (multiply) floating-point instruction MPYF to be
executed LENGTH times (the repeat count operand is LENGTH-1), where LENGTH is the length of the FIR filter. Furthermore,
since the first ADDF addition instruction is in parallel (designated by ||) with the MPYF instruction, it is also
executed LENGTH times. From Table 18.2, AR0, one of the eight available auxiliary registers, initially points
at the memory address (a table address) which contains the coefficient h(N – 1), and a second auxiliary register
AR1 now points to the address of the oldest input sample x(n – (N – 1)). The second instruction, which uses
indirect addressing, multiplies the content of the memory location pointed to by AR0, h(N – 1), by the content
of the memory location pointed to by AR1, x(n – (N – 1)), with the result stored in R0. Concurrently (in parallel),
the content of R0 is added to the content of R2, with the result stored in R2. Initially R0 and R2 are set to zero; hence, the
resulting value in R2 is not the product of the first multiply operation. After the first multiply operation, both
AR0 and AR1 are incremented, and h(N – 2) is multiplied by x(n – (N – 2)). Concurrently, the result of the
first multiply operation (stored in R0) is accumulated into R2. The second addition instruction, executed only
once, accumulates the last product h(0)x(n) (similar to the APAC instruction associated with the fixed-point
TMS320C25). The overall result yields an output value y(n) at time n. After the last multiply operation, both
AR0 and AR1 are postincremented to point at the “top” or lower-memory address of each circular buffer. The
process can then be repeated for time n + 1 in order to obtain a second output value y(n + 1). Note that the
newest sample x(n + 1) would be retrieved from an ADC using the FLOAT and STF instructions, then placed
at the top memory location of the buffer (table) containing the samples, overwriting the initial value x(n – (N
– 1)). AR1 is then incremented to point at the address containing x(n – (N – 2)), and the previous four
instructions can be repeated. The last multiply operation involves h(0) and x(.), where x(.) is the newest sample
x(n + 1), at time n + 1. The foregoing procedure would be repeated to produce an output y(n + 2), y(n + 3),
and so on. Each output value would be converted to a fixed-point equivalent value before being sent to a DAC.
The frequency response of an FIR filter with 41 coefficients and a center frequency of 2.5 kHz, obtained from
a signal analyzer, is displayed in Fig. 18.8.
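For readers who prefer to see the same idea in C, the sketch below computes the same circular-buffer convolution as the MPYF/ADDF loop described above: the newest sample overwrites the oldest, one pass through the buffer accumulates the coefficient-sample products, and the write index is advanced circularly (the traversal order differs from the assembly loop, but the accumulated sum is the same). The buffer length, array names, and the fir() function are illustrative assumptions, not code from the EVM package.

/* Circular-buffer FIR filter in C, mirroring the multiply-accumulate loop
   described above.  N, h[], xbuf[], and fir() are illustrative names only. */
#define N 41                       /* filter length                          */

static float h[N];                 /* coefficients h(0) .. h(N-1)            */
static float xbuf[N];              /* circular buffer of input samples       */
static int   newest = 0;           /* index where the next sample is written */

float fir(float x_new)             /* consume one sample, return y(n)        */
{
    float acc = 0.0f;
    int   i, j;

    xbuf[newest] = x_new;          /* overwrite the oldest sample            */

    j = newest;                    /* walk back through the last N samples   */
    for (i = 0; i < N; i++) {
        acc += h[i] * xbuf[j];     /* accumulate h(i) * x(n - i)             */
        j = (j == 0) ? N - 1 : j - 1;
    }

    newest = (newest + 1) % N;     /* advance the write pointer circularly   */
    return acc;
}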
FIR and IIR Implementation Using C and Assembly Code
A real-time implementation of a 45-coefficient bandpass FIR filter and a sixth-order IIR filter with 345 samples,
using C code and TMS320C30 code, is discussed in Chassaing and Bitler [1991]. Tables 18.3 and 18.4 show a
comparison of execution times of those two filters. The C language FIR filter, implemented without the modulo
operator % and compiled with C compiler V4.1, executed two times slower (1.5 times slower using the newer
C compiler V4.4) than an equivalent assembly language filter (which has an execution time similar to that of
a filter routine written in assembly and called by a C program). The C language IIR filter ran 1.3 times slower
than the corresponding assembly language IIR filter. These slower execution times may be acceptable for many
applications. Where execution speed is crucial,
a time-critical function may be written in assembly and called from a C program. In applications where speed
is not absolutely crucial, C provides a better environment because of its portability and maintainability.
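The large gap between the first two rows of Table 18.3 comes largely from how the circular-buffer index is wrapped in C. The fragment below contrasts the two styles under illustrative names (N and the function names are assumptions for the sketch); the remainder operation generally costs far more per sample than the compare-and-reset form, which is consistent with, though not the only possible explanation for, the measured difference.

/* Two ways of advancing a circular-buffer index of length N in C.
   N and the function names are illustrative only. */
#define N 45

static int advance_mod(int j)      /* "C with modulo": compact, but implies  */
{                                  /* a remainder operation every sample     */
    return (j + 1) % N;
}

static int advance_branch(int j)   /* "C without modulo": compare and reset, */
{                                  /* usually much cheaper per sample        */
    j = j + 1;
    if (j == N)
        j = 0;
    return j;
}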
Real-Time Applications
A number of applications are discussed in Chassaing and Horning (1990) using TMS320C25 code and in
Chassaing (1992) using TMS320C30 and C code. These applications include multirate and adaptive filtering,
modulation techniques, and graphic and parametric equalizers. Two applications are briefly discussed here: a
ten-band multirate filter and a video line rate analysis.
1. The functional block diagram of the multirate filter is shown in Fig. 18.9. The multirate design provides
a significant reduction in processing time and data storage, compared to an equivalent single-rate design.
With multirate filtering, we can use a decimation operation in order to obtain a sample rate reduction
or an interpolation operation (as shown in Fig. 18.9) in order to obtain a sample rate increase [Crochiere
and Rabiner, 1983]. A pseudorandom noise generator implemented in software provides the input noise
to the ten octave band filters. Each octave band filter consists of three 1/3-octave filters (each with 41
coefficients), which can be individually controlled. A controlled noise source can be obtained with this
design. Since each 1/3-octave band filter can be turned on or off, the noise spectrum can be shaped
accordingly. The interpolation filter is a low-pass FIR filter with a 2:1 data-rate increase, yielding two
sample outputs for each input sample (a minimal C sketch of such an interpolator appears after this list).
The sample rate of the highest octave-band filter is set at 32,768
samples per second, with each successively lower band processing at half the rate of the next-higher
band. The multirate filter (a nine-band version) was implemented with the TMS320C25 [Chassaing
et al., 1990]. Figure 18.10 shows the three 1/3-octave band filters of band 10 implemented with the EVM
FIGURE 18.8  Frequency response of 41-coefficient FIR filter.
TABLE 18.3  Execution Time and Program Size of FIR Filter

FIR (45 samples)      Execution Time (msec)    Size (words)
C with modulo         4.16                     122
C without modulo      0.338                    116
C-called assembly     0.1666                   74
Assembly              0.1652                   27
TABLE 18.4  Execution Time and Program Size of 6th-Order IIR Filter

IIR (345 samples)     Execution Time (msec)    Size (words)
C                     1.575                    109
Assembly              1.18                     29
in conjunction with the two-channel analog fixture (made by Burr-Brown). The center frequency of the
middle 1/3-octave band 10 filter is at approximately 8 kHz since the coefficients were designed for a
center frequency of 1/4 the sampling rate (the middle 1/3-octave band 9 filter would be centered at 4
kHz, the middle 1/3-octave band 8 filter at 2 kHz, and so on). Note that the center frequency of the
middle 1/3-octave band 1 filter would be at 2 Hz if the highest sampling rate is set at 4 kHz. Observe
from Fig. 18.10 that the crossover frequencies occur at the 3-dB points. Since the main processing time
of the multirate filter (implemented in assembly code) was measured to be 8.8 ms, the maximum
sampling rate was limited to 58 ksps.
2. A video line rate analysis implemented entirely in C code is discussed in Chassaing and Bitler [1992].
A module was built to sample a video line of information. This module included a 9.8-MHz clock, a
high-sampling-rate 8-bit ADC, and appropriate support circuitry (comparator, FIFO buffer, etc.). Interactive
features allowed for the selection of one (out of 256) horizontal lines of information and the
execution of algorithms for digital filtering, averaging, and edge enhancement, with the resulting effects
displayed on the PC screen. Figure 18.11 shows the display of a horizontal line (line #125) of information
FIGURE 18.9  Multirate filter functional block diagram.
FIGURE 18.10  Frequency responses of the 1/3-octave band ten filters.
obtained from a test chart with a charge coupled device (CCD) camera. The function key F3 selects the
1-MHz low-pass filter resulting in the display shown in Fig. 18.12. The 3-MHz filter (with F4) would
pass more of the higher-frequency components of the signal but with less noise reduction. F5 implements
the noise averaging algorithm. The effect of the edge enhancement algorithm (with F7) is displayed in
Fig. 18.13.
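As noted in the description of the multirate filter above, the following is a minimal C sketch of a 2:1 FIR interpolator of the kind used there: each low-rate input sample produces two high-rate output samples, computed from the even- and odd-indexed coefficients of the low-pass interpolation filter. The coefficient array, delay line, and function names are illustrative assumptions, not the code from the cited implementations.

/* 2:1 FIR interpolation using two polyphase branches: one input sample in,
   two output samples out.  L, g[], state[], and interpolate_by_2() are
   illustrative names; the coefficients g[] would come from a low-pass FIR
   design such as the interpolation filter described above. */
#define L 40                            /* interpolation-filter length (even) */

static float g[L];                      /* low-pass interpolation coefficients */
static float state[L / 2];              /* delay line at the low input rate    */

void interpolate_by_2(float x, float y[2])
{
    int i;

    for (i = L / 2 - 1; i > 0; i--)     /* shift the low-rate delay line       */
        state[i] = state[i - 1];
    state[0] = x;                       /* insert the new input sample         */

    y[0] = 0.0f;
    y[1] = 0.0f;
    for (i = 0; i < L / 2; i++) {
        y[0] += g[2 * i]     * state[i];   /* even-indexed coefficients        */
        y[1] += g[2 * i + 1] * state[i];   /* odd-indexed coefficients         */
    }
}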
Conclusions and Future Directions
DSP processors have been used extensively in a number of applications, even in non-DSP applications such as
graphics. The fourth-generation floating-point TMS320C40, code compatible with the TMS320C30, features
an instruction cycle time of 40 ns and six serial ports. The fifth-generation fixed-point TMS320C50, code
compatible with the first two generations of fixed-point processors, features an instruction cycle time of 35 ns
and 10 K words (16-bit) of on-chip data and program memory. Currently, both the fixed-point and floating-
point processors are being supported by TI.
FIGURE 18.11  Display of a horizontal line of video signal.
FIGURE 18.12  Video line signal with 1-MHz filtering.
Defining Terms
C compiler: Program that translates C code into assembly code.
Digital signal processor: Special-purpose microprocessor with an architecture suitable for fast execution of
signal processing algorithms.
Fixed-point processor: A processor capable of operating on scaled integer and fractional data values.
Floating-point processor: Processor capable of operating on integers as well as on fractional data values
without scaling.
On-chip memory: Internal memory available on the digital signal processor.
Pipelining feature: Feature that permits parallel operations of fetching, decoding, reading, and executing.
Special-purpose digital signal processor: Digital signal processor with special features for handling a specific
signal processing application, such as the FFT.
Related Topics
14.3 Design and Implementation of Digital Filters • 79.1 IC Logic Family Operation and Characteristics
References
H. M. Ahmed and R. B. Kline, “Recent advances in DSP systems,” IEEE Communications Magazine, 1991.
R. Chassaing, Digital Signal Processing with C and the TMS320C30, New York: Wiley, 1992.
R. Chassaing and R. Ayers, “Digital signal processing with the SHARC,” in Proceedings of the 1996 ASEE Annual
Conference, 1996.
R. Chassaing and B. Bitler, “Real-time digital filters in C,” in Proceedings of the 1991 ASEE Annual Conference,
1991.
R. Chassaing and B. Bitler, “A video line rate analysis using the TMS320C30 floating-point digital signal
processor,” in Proceedings of the 1992 ASEE Annual Conference, 1992.
R. Chassaing and D. W. Horning, Digital Signal Processing with the TMS320C25, New York: Wiley, 1990.
R. Chassaing and P. Martin, “Parallel processing with the TMS320C40,” in Proceedings of the 1995 ASEE Annual
Conference, 1995.
R. Chassaing, W.A. Peterson, and D. W. Horning, “A TMS320C25-based multirate filter,” IEEE Micro, 1990.
FIGURE 18.13 Video line signal with edge enhancement.
R.E. Crochiere and L.R. Rabiner, Multirate Digital Signal Processing, Englewood Cliffs, N.J.: Prentice-Hall, 1983.
K. S. Lin (ed.), Digital Signal Processing Applications with the TMS320 Family. Theory, Algorithms, and
Implementations, vol. 1, Texas Instruments, Inc., Texas, 1989.
A. V. Oppenheim and R. Schafer, Discrete-Time Signal Processing, Englewood Cliffs, N.J.: Prentice-Hall, 1989.
P. Papamichalis (ed.), Digital Signal Processing Applications with the TMS320 Family. Theory, Algorithms, and
Implementations, vol. 3, Texas Instruments, Inc., Texas, 1990.
Further Information
Rulph Chassaing teaches hands-on workshops on digital signal processing using C and the TMS320C30,
offered at Roger Williams University in Bristol, RI, 02809. He offered a one-week workshop in August 1996,
supported by the National Science Foundation (NSF). He will offer two workshops in August 1997, supported
by NSF, using the TMS320C30 and the TMS320C31. Workshops on the TMS320 family of digital signal
processors are offered by Texas Instruments, Inc. at various locations.
A tutorial “Digital Signal Processing Comes of Age” can be found in the IEEE Spectrum, May 1996.