Chapter 9
Analysis of Finite
Wordlength Effects
Introduction
Ideally, the system parameters along with the
signal variables have infinite precision, taking
any value between $-\infty$ and $\infty$
In practice, they can take only discrete values
within a specified range since the registers of
the digital machine where they are stored are
of finite length
The discretization process results in nonlinear
difference equations characterizing the
discrete-time systems
These nonlinear equations, in principle,
are almost impossible to analyze and
deal with exactly
However, if the quantization amounts
are small compared to the values of
signal variables and filter parameters, a
simpler approximate theory based on a
statistical model can be applied
Using the statistical model, it is possible to
derive the effects of discretization and develop
results that can be verified experimentally
Sources of errors:
(1) Filter coefficient quantization
(2) A/D conversion
(3) Quantization of arithmetic operations
(4) Limit cycles
Consider the first-order IIR digital filter
$y[n] = \alpha y[n-1] + x[n]$
where y[n] is the output signal and x[n]
is the input signal
When implemented on a digital machine,
the filter coefficient $\alpha$ can assume only
certain discrete values $\hat{\alpha}$ approximating
the original design value $\alpha$
The desired transfer function is
$H(z) = \dfrac{1}{1 - \alpha z^{-1}} = \dfrac{z}{z - \alpha}$
The actual transfer function implemented is
$\hat{H}(z) = \dfrac{z}{z - \hat{\alpha}}$
which may be much different from the desired
transfer function H(z)
Thus, the actual frequency response may
be quite different from the desired
frequency response
The coefficient quantization problem is
similar to the sensitivity problem
encountered in analog filter
implementation
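As a rough numerical illustration of how the desired $H(z) = z/(z - \alpha)$ and the implemented $\hat{H}(z) = z/(z - \hat{\alpha})$ can differ, the following Python sketch rounds a design coefficient to a short fixed-point fraction and compares the pole location and the gain at z = 1. The value $\alpha = 0.95$, the 4 fractional bits, and the helper names are assumptions chosen for illustration only.

```python
import math

def quantize_round(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits (ties upward)."""
    step = 2.0 ** -frac_bits
    return math.floor(value / step + 0.5) * step

def dc_gain(alpha):
    """H(z) = z / (z - alpha) evaluated at z = 1."""
    return 1.0 / (1.0 - alpha)

alpha = 0.95                           # assumed design value (pole location)
alpha_hat = quantize_round(alpha, 4)   # assumed 4 fractional bits

print("desired pole    :", alpha, " gain at z = 1:", dc_gain(alpha))
print("implemented pole:", alpha_hat, " gain at z = 1:", dc_gain(alpha_hat))
```

With these assumed values the pole moves from 0.95 to 0.9375, and the gain at z = 1 drops from about 20 to 16, so even a small coefficient change can alter the response noticeably when the pole is close to the unit circle.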
A/D Conversion Error - generated by
the filter input quantization process
If the input sequence x[n] has been
obtained by sampling an analog signal
$x_a(t)$, then the actual input to the digital
filter is
$\hat{x}[n] = x[n] + e[n]$
where e[n] is the A/D conversion error
Arithmetic Quantization Error - For the
first-order digital filter, the desired
output of the multiplier is
$v[n] = \alpha y[n-1]$
Due to product quantization, the actual output
of the multiplier of the implemented filter is
$\hat{v}[n] = \alpha y[n-1] + e_\alpha[n] = v[n] + e_\alpha[n]$
where $e_\alpha[n]$ is the product roundoff error
Limit Cycles - The nonlinearity of
the arithmetic quantization process
may manifest in the form of
oscillations at the filter output,
usually in the absence of input or,
sometimes, in the presence of
constant input signals or sinusoidal
input signals
§ 9.1 Quantization Process and Error
There are two basic types of binary representations of
data: (1) fixed-point and (2) floating-point
formats
Various problems can arise in the digital
implementation of the arithmetic operations
involving the binary data
These problems are caused by the finite wordlength
limitations of the registers storing the data and
the results of arithmetic operations
For example, in fixed-point arithmetic, the
product of two b-bit numbers is 2b bits long,
which has to be quantized to b bits to fit the
prescribed wordlength of the registers
In fixed-point arithmetic, an addition operation
can result in a sum exceeding the register
wordlength, causing an overflow
In floating-point arithmetic, there is essentially no
overflow, but results of both addition and
multiplication may have to be quantized
In both fixed-point and floating-point formats,
a negative number can be represented in one
of three different forms
Analysis of various quantization effects on the
performance of a digital filter depends on
(1) Data format (fixed-point or floating-point),
(2) Type of representation of negative
numbers,
(3) Type of quantization, and
(4) Digital filter structure implementing the
transfer function
Since the number of all possible
combinations of the type of arithmetic,
type of quantization method, and digital
filter structure is very large,
quantization effects in some selected
practical cases are discussed
The analysis presented can be extended
easily to other cases
In DSP applications, it is a common practice
to represent the data either as a fixed-point
fraction or as a floating-point binary number
with the mantissa as a binary fraction
Assume the available wordlength is (b+1) bits
with the most significant bit (MSB)
representing the sign
Consider the data to be a (b+1)-bit fixed-point
fraction
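Representation of a general (b+1)-bit
fixed-point fraction is shown below
$s \;.\; a_{-1}\; a_{-2}\; \cdots\; a_{-b}$
where s is the sign bit and the bits $a_{-1}, a_{-2}, \ldots, a_{-b}$
have weights $2^{-1}, 2^{-2}, \ldots, 2^{-b}$
Smallest positive number that can be
represented in this format will have a least
significant bit (LSB) of 1 with remaining
bits being all 0’s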
Decimal equivalent of the smallest positive
number is $\delta = 2^{-b}$
Numbers represented with (b+1) bits are thus
quantized in steps of $2^{-b}$, called the quantization
step
An original data x represented as a $(\beta+1)$-bit
fraction is converted into a (b+1)-bit fraction
Q(x) either by truncation or rounding
The quantization process for truncation
or rounding can be modeled as shown
below
x → [Q] → Q(x)
Since representation of a positive binary
fraction is the same independent of
format being used to represent the
negative binary fraction, the effect of
quantization of a positive fraction
remains unchanged
The effect of quantization on negative
fractions is different for the three
different representations
§ 9.1.1 Quantization of Fixed-Point Numbers
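Truncation of a $(\beta+1)$-bit fixed-point
number to (b+1) bits is achieved by
simply discarding the least significant
$(\beta - b)$ bits as shown below
$s \;.\; a_{-1}\; a_{-2} \cdots a_{-b}\; a_{-b-1} \cdots a_{-\beta}$  (bits $a_{-b-1}, \ldots, a_{-\beta}$ to be discarded)
$s \;.\; a_{-1}\; a_{-2} \cdots a_{-b}$  (after truncation)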
Range of truncation error $\varepsilon_t = Q(x) - x$
(assuming $\beta \gg b$):
Positive number and two’s complement
negative number
$-\delta < \varepsilon_t \le 0$
Sign-magnitude negative number and
ones’-complement negative number
$0 \le \varepsilon_t < \delta$
Range of rounding error $\varepsilon_r = Q(x) - x$
(assuming $\beta \gg b$)
For all positive and negative
numbers
$-\delta/2 < \varepsilon_r \le \delta/2$
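The error ranges for truncation and rounding of fixed-point fractions can be checked numerically. The Python sketch below is only an illustration under assumptions: the function names, the wordlength b = 4, and the random test data are hypothetical, two's-complement truncation is modeled as rounding toward minus infinity, and sign-magnitude truncation as rounding toward zero.

```python
import math
import random

def truncate_twos_complement(x, b):
    """Discard bits below 2**-b in two's-complement form (toward -infinity)."""
    step = 2.0 ** -b
    return math.floor(x / step) * step

def truncate_sign_magnitude(x, b):
    """Discard bits below 2**-b of the magnitude (toward zero)."""
    step = 2.0 ** -b
    return math.copysign(math.floor(abs(x) / step) * step, x)

def round_nearest(x, b):
    """Round to the nearest multiple of 2**-b, ties toward +infinity."""
    step = 2.0 ** -b
    return math.floor(x / step + 0.5) * step

b = 4                       # assumed number of fractional bits
delta = 2.0 ** -b
random.seed(0)
samples = [random.uniform(-1.0, 1.0) for _ in range(10000)]

err_tc = [truncate_twos_complement(x, b) - x for x in samples]
err_sm_neg = [truncate_sign_magnitude(x, b) - x for x in samples if x < 0]
err_rnd = [round_nearest(x, b) - x for x in samples]

print("delta =", delta)
print("two's-complement truncation :", min(err_tc), max(err_tc))
print("sign-magnitude, x < 0       :", min(err_sm_neg), max(err_sm_neg))
print("rounding                    :", min(err_rnd), max(err_rnd))
```

The printed extremes should stay inside $(-\delta, 0]$, $[0, \delta)$, and $(-\delta/2, \delta/2]$ respectively.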
§ 9.1.2 Quantization of Floating-Point Numbers
In floating-point format a decimal
number x is represented as $x = 2^E \cdot M$
where E is the exponent and M is the
mantissa
Mantissa M is a binary fraction
restricted to lie in the range
$1/2 \le M < 1$
Exponent E is either a positive or a
negative binary number
The quantization of a floating-point
number is carried out only on the
mantissa
Range of relative error:
$\varepsilon = [Q(x) - x]/x = [Q(M) - M]/M$
Two’s complement truncation
$-2\delta < \varepsilon_t \le 0$, $x > 0$
$0 \le \varepsilon_t < 2\delta$, $x < 0$
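A small experiment can illustrate the relative error bound for two's-complement truncation of the mantissa. The sketch below is an assumption-laden illustration: it uses Python's `math.frexp` to obtain a mantissa with magnitude in [1/2, 1), models two's-complement truncation as `floor`, and picks b = 8 and the random test values arbitrarily; the function name is hypothetical.

```python
import math
import random

def truncate_mantissa(x, b):
    """Quantize x = 2**E * M by truncating M to b fractional bits
    (two's-complement style, i.e. floor); E is left untouched."""
    mantissa, exponent = math.frexp(x)       # 0.5 <= |mantissa| < 1
    step = 2.0 ** -b
    q_mantissa = math.floor(mantissa / step) * step
    return math.ldexp(q_mantissa, exponent)

b = 8                        # assumed mantissa wordlength
delta = 2.0 ** -b
random.seed(1)
values = [random.uniform(0.01, 100.0) for _ in range(5000)]

rel_pos = [(truncate_mantissa(x, b) - x) / x for x in values]
rel_neg = [(truncate_mantissa(-x, b) - (-x)) / (-x) for x in values]

print("2*delta =", 2 * delta)
print("x > 0: relative error in", (min(rel_pos), max(rel_pos)))
print("x < 0: relative error in", (min(rel_neg), max(rel_neg)))
```

Because the mantissa magnitude is at least 1/2, an absolute mantissa error smaller than $\delta$ becomes a relative error smaller than $2\delta$, which is where the factor of 2 comes from.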
§ 9.1.3 Analysis of Coefficient
Quantization Effects
The transfer function $\hat{H}(z)$ of the digital
filter implemented with quantized
coefficients is different from the desired
transfer function H(z)
Main effect of coefficient quantization is
to move the poles and zeros to different
locations from the original desired
locations
The actual frequency response $\hat{H}(e^{j\omega})$ is
thus different from the desired frequency
response $H(e^{j\omega})$
In some cases, the poles may move
outside the unit circle, causing the
implemented digital filter to become
unstable even though the original transfer
function H(z) is stable
The effect of coefficient quantization
can be easily studied using
MATLAB
To this end, the M-files a2dT (for
truncation) and a2dR (for rounding)
can be used
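The same kind of experiment can be sketched outside MATLAB. The Python code below does not reproduce the M-files a2dT and a2dR; the two quantizer functions are hypothetical stand-ins that truncate or round the coefficient magnitude to a given number of fractional bits, and the second-order denominator $z^2 + a_1 z + a_2$ with $a_1 = -1.89$, $a_2 = 0.9025$ and the 5-bit fraction are assumed example values.

```python
import cmath
import math

def quantize_truncate(coeff, bits):
    """Truncate the magnitude of coeff to `bits` fractional bits."""
    step = 2.0 ** -bits
    return math.copysign(math.floor(abs(coeff) / step) * step, coeff)

def quantize_round(coeff, bits):
    """Round the magnitude of coeff to `bits` fractional bits."""
    step = 2.0 ** -bits
    return math.copysign(math.floor(abs(coeff) / step + 0.5) * step, coeff)

def poles(a1, a2):
    """Roots of z**2 + a1*z + a2."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return (-a1 + disc) / 2.0, (-a1 - disc) / 2.0

def max_radius(a1, a2):
    return max(abs(p) for p in poles(a1, a2))

a1, a2 = -1.89, 0.9025       # assumed coefficients, poles near radius 0.95
bits = 5                     # assumed fractional wordlength

r_orig = max_radius(a1, a2)
r_trunc = max_radius(quantize_truncate(a1, bits), quantize_truncate(a2, bits))
r_round = max_radius(quantize_round(a1, bits), quantize_round(a2, bits))

print("largest pole radius, unquantized:", r_orig)
print("largest pole radius, truncated  :", r_trunc)
print("largest pole radius, rounded    :", r_round)
```

For these assumed values, truncation moves one pole onto the unit circle at z = 1, while rounding keeps both poles inside it, which shows how sensitive pole locations near the unit circle can be to the quantization method.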
§ 9.1.4 Coefficient Quantization
Effects On a Direct Form IIR Filter
Gain responses of a fifth-order elliptic
lowpass filter with unquantized and
quantized coefficients
[Figure: fullband gain response and passband details;
original - solid line, quantized - dashed line]
Pole and zero locations of the filter with
quantized coefficients (denoted by "x"
and "o") and those of the filter with
unquantized coefficients (denoted by "+"
and "*")
§ 9.1.4 Coefficient Quantization
Effects On a Direct Form FIR Filter
Gain responses of a 39th-order
equiripple lowpass FIR filter with
unquantized and quantized coefficients
[Figure: gain response; original - solid line, quantized - dashed line]
§ 9.2 A/D Conversion Noise Analysis
A/D converters used for digital
processing of analog signals in general
employ two’s-complement fixed-point
representation to represent the digital
equivalent of the input analog signal
For the processing of bipolar analog
signals,the A/D converter generates a
bipolar output represented as a fixed-
point signed binary fraction
The digital sample generated by the A/D
converter is the binary representation of
the quantized version of that produced
by an ideal sampler with infinite
precision
If the output word is of length (b+1) bits
including the sign bit, the total number
of discrete levels available for the
representation of the digital equivalent
is $2^{b+1}$
The dynamic range of the output
register depends on the binary number
representation selected for the A/D
converter
The model of a practical A/D conversion
system is as shown below
$x_a(t)$ → Ideal sampler → $x[n] = x_a(nT)$ → Quantizer → $\hat{x}[n] = Q(x[n])$ → Coder → $\hat{x}_{eq}[n]$
The quantization process employed
by the quantizer can be either
rounding or truncation
Assuming rounding is used, the
input-output characteristic of a 3-bit
A/D converter with the output in
two’s-complement form is as shown
next
[Figure: input-output characteristic of the 3-bit A/D converter]
In general, for a (b+1)-bit bipolar A/D
converter employing two’s-complement
representation, the full-scale range is
given by
$-(2^{b+1} - 1)\dfrac{\delta}{2} \le x_a(nT) < (2^{b+1} - 1)\dfrac{\delta}{2}$
Denote the difference between the
quantized value $Q(x[n]) = \hat{x}[n]$ and the
input sample x[n] as the quantization error:
$e[n] = Q(x[n]) - x[n] = \hat{x}[n] - x[n]$
It follows from the input-output characteristic
of the 3-bit bipolar A/D converter given
earlier that e[n] is in the range
$-\delta/2 < e[n] \le \delta/2$
assuming that a sample exactly halfway
between two levels is rounded up to the
nearest higher level and assuming that the
analog input is within the A/D converter full-
scale range
When the input analog sample is outside the
full-scale range of the A/D converter, the
magnitude of error e[n] increases linearly with
an increase in the magnitude of the input
In such a situation, the error e[n] is called the
saturation noise or the overload noise as the
A/D converter output is "clipped" to the
maximum value $1 - 2^{-b}$ if the analog input is
positive or to the minimum value -1 if the
analog input is negative
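The rounding and clipping behavior of the quantizer can be sketched in a few lines of Python. This is a simplified model under assumptions, not a description of any particular converter: the function name `adc_round` and the test samples are hypothetical, levels are taken as multiples of $\delta = 2^{-b}$ from -1 to $1 - 2^{-b}$, and a halfway value is rounded up to the higher level.

```python
import math

def adc_round(sample, b):
    """(b+1)-bit two's-complement quantizer with rounding and saturation."""
    delta = 2.0 ** -b
    level = math.floor(sample / delta + 0.5) * delta   # halfway values go up
    return min(max(level, -1.0), 1.0 - delta)          # clip on overload

b = 2                        # 3-bit converter, delta = 0.25
delta = 2.0 ** -b
for sample in (0.30, 0.125, -0.60, 0.90, 1.50, -2.00):
    q = adc_round(sample, b)
    print(f"x = {sample:+.3f}   Q(x) = {q:+.3f}   e = {q - sample:+.3f}")
```

For the first three samples the error magnitude stays within $\delta/2$ = 0.125. The last three lie outside the full-scale range, so the output is clipped to 0.75 or -1 and the error grows with the input magnitude.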
A clipping of the A/D converter output
causes signal distortion with highly
undesirable effects and must be avoided
by scaling down the input analog signal
$x_a(nT)$ to ensure that it remains within
the A/D converter full-scale range
We therefore assume that input analog
samples are within the A/D converter
full-scale range and thus there is no
saturation error
Now, the input-output characteristic of an A/D
converter is nonlinear, and the analog input
signal is not known a priori in most cases
It is thus reasonable to assume for analysis
purposes that the error e[n] is a random signal
with a statistical model as shown below
$x[n]$ → (+) → $\hat{x}[n] = x[n] + e[n]$, with e[n] applied to the adder
For simplified analysis, the following
assumptions are made:
(1) The error sequence {e[n]} is a sample
sequence of a wide-sense stationary (WSS)
white noise process, with each sample e[n]
being uniformly distributed over the range of
the quantization error
(2) The error sequence is uncorrelated with its
corresponding input sequence {x[n]}
(3) The input sequence is a sample sequence of
a stationary random process
These assumptions hold in most
practical situations for input signals
whose samples are large and change in
amplitude very rapidly in time relative
to the quantization step in a somewhat
random fashion
These assumptions have also been
verified experimentally and by computer
simulations
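Mean and variance of the error sample e[n]:
Rounding -
$m_e = \dfrac{(\delta/2) - (\delta/2)}{2} = 0$
$\sigma_e^2 = \dfrac{((\delta/2) - (-\delta/2))^2}{12} = \dfrac{\delta^2}{12}$
Two’s-complement truncation -
$m_e = \dfrac{0 - \delta}{2} = -\dfrac{\delta}{2}$
$\sigma_e^2 = \dfrac{(0 - (-\delta))^2}{12} = \dfrac{\delta^2}{12}$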
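These mean and variance values follow from the uniform-error assumption and can be checked with a quick simulation. The Python sketch below is an illustration only: the wordlength b = 7, the uniformly distributed test input, and the sample count are assumptions, and the input is kept well inside the full-scale range so that no saturation occurs.

```python
import math
import random

b = 7                        # assumed: delta = 2**-7
delta = 2.0 ** -b
random.seed(2)
# Input spans many quantization steps and stays inside the full-scale range.
samples = [random.uniform(-0.5, 0.5) for _ in range(100000)]

def mean_and_variance(values):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

# Quantization error e = Q(x) - x for rounding and two's-complement truncation.
err_round = [math.floor(x / delta + 0.5) * delta - x for x in samples]
err_trunc = [math.floor(x / delta) * delta - x for x in samples]

m_r, var_r = mean_and_variance(err_round)
m_t, var_t = mean_and_variance(err_trunc)

print("delta**2 / 12      :", delta ** 2 / 12)
print("rounding   mean/var:", m_r, var_r)
print("truncation mean/var:", m_t, var_t, " (-delta/2 =", -delta / 2, ")")
```

The estimates should come out close to 0 and $-\delta/2$ for the means and close to $\delta^2/12$ for both variances.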
§ 9.3 Signal-to-Quantization Noise Ratio
The effect of the additive quantization noise
e[n] on the input signal x[n] is measured by the
signal-to-quantization noise ratio
$SNR_{A/D} = 10 \log_{10}\left(\dfrac{\sigma_x^2}{\sigma_e^2}\right)$ dB
where $\sigma_x^2$ is the input signal variance
representing the signal power and $\sigma_e^2$ is the noise
variance representing the quantization noise
power
Therefore
$SNR_{A/D} = 10 \log_{10}\left(\dfrac{48\,\sigma_x^2}{2^{-2b} (R_{FS})^2}\right)$
$= 6.02b + 16.81 - 20 \log_{10}\left(\dfrac{R_{FS}}{\sigma_x}\right)$ dB
This expression can be used to determine the
minimum wordlength of an A/D converter
needed to meet a specified $SNR_{A/D}$
Note, $SNR_{A/D}$ increases by 6 dB for each bit
added to the wordlength
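The constants 6.02 and 16.81 in the SNR expression can be traced in a few lines. The derivation below is a sketch that assumes the full-scale range spans all $2^{b+1}$ levels, that is $R_{FS} = 2^{b+1}\delta$ (an assumption not stated explicitly here), together with the noise variance $\sigma_e^2 = \delta^2/12$:

```latex
\begin{align*}
\delta &= \frac{R_{FS}}{2^{b+1}}, \qquad
\sigma_e^2 = \frac{\delta^2}{12} = \frac{2^{-2b} (R_{FS})^2}{48} \\
SNR_{A/D} &= 10\log_{10}\frac{\sigma_x^2}{\sigma_e^2}
           = 10\log_{10}\frac{48\,\sigma_x^2}{2^{-2b} (R_{FS})^2} \\
          &= 20\,b\log_{10} 2 + 10\log_{10} 48 - 20\log_{10}\frac{R_{FS}}{\sigma_x} \\
          &\approx 6.02\,b + 16.81 - 20\log_{10}\frac{R_{FS}}{\sigma_x} \ \text{dB}
\end{align*}
```

The term $20\,b\log_{10} 2 \approx 6.02\,b$ is the source of the 6 dB gain per added bit.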
For a given wordlength, the actual SNR
depends on $\sigma_x$, the rms value of the
input signal amplitude, and the full-scale
range $R_{FS}$ of the A/D converter
Example - Determine the SNR in the
digital equivalent of an analog sample
x[n] with a zero-mean Gaussian
distribution using a (b+1)-bit A/D
converter having $R_{FS} = K\sigma_x$
Here
$SNR_{A/D} = 6.02b + 16.81 - 20\log_{10}\left(\dfrac{R_{FS}}{\sigma_x}\right)$
$= 6.02b + 16.81 - 20\log_{10}(K)$
Computed values of the SNR (in dB) for various values of K are as given below:
        b = 7    b = 9    b = 11   b = 13   b = 15
K = 4   46.91    58.95    70.99    83.04    95.08
K = 6   43.39    55.43    67.47    79.51    91.56
K = 8   40.89    52.93    64.97    77.01    89.05
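The tabulated values can be regenerated from the expression $6.02b + 16.81 - 20\log_{10}(K)$. The short Python sketch below uses the unrounded constants $20\log_{10} 2$ and $10\log_{10} 48$, so its results should agree with the table to about 0.01 dB; the function and variable names are hypothetical.

```python
import math

def snr_ad_db(b, k):
    """SNR in dB of a (b+1)-bit A/D converter with R_FS = k * sigma_x."""
    return (20.0 * b * math.log10(2.0)
            + 10.0 * math.log10(48.0)
            - 20.0 * math.log10(k))

word_bits = (7, 9, 11, 13, 15)
table = {k: [snr_ad_db(b, k) for b in word_bits] for k in (4, 6, 8)}

print("        " + "".join(f"b = {b:<5d}" for b in word_bits))
for k, row in table.items():
    print(f"K = {k}   " + "".join(f"{value:<9.2f}" for value in row))

# Gain obtained by adding one bit to the wordlength.
print("dB per added bit:", snr_ad_db(8, 4) - snr_ad_db(7, 4))
```

A smaller K means the signal uses more of the full-scale range and gives a higher SNR, at the cost of a higher probability that a Gaussian input exceeds the range and is clipped.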
§ 9.4 Limit Cycles in IIR Digital Filters
So far we have treated the analysis of finite
wordlength effects using a linear model of the
system
A practical digital filter is a nonlinear system
because of the quantization of the arithmetic
operations
Such nonlinearities may cause an IIR filter,
which is stable under infinite precision, to
exhibit an unstable behavior under finite
precision arithmetic for specific input signals
This type of instability usually
results in an oscillatory periodic
output called a limit cycle
The system remains in this
condition until an input of
sufficiently large amplitude is
applied to move the system into a
more conventional operation
Limit cycles occur in IIR filters due to
the presence of feedback
Such oscillations are absent in FIR
filters as they do not have any feedback
path
There are two types of limit cycles
(1) Granular limit cycle is usually of low
amplitude
(2) Overflow limit cycle has large
amplitudes
Two types of granular limit cycles have
been observed in IIR digital filters,
(1) Inaccessible limit cycle - can appear
only if the initial conditions of the digital
filter at the time of starting pertain to
that limit cycle
(2) Accessible limit cycle - can appear by
starting the digital filter with initial
conditions not pertaining to the limit
cycle
Consider the first-order IIR filter
shown below
[Figure: first-order IIR filter with input x[n], adder, quantizer Q, delay $z^{-1}$, multiplier $\alpha$, and output $\hat{y}[n]$]
Assume the quantization operation to be
rounding and the filter to be implemented
with a signed 6-bit fractional arithmetic
The nonlinear difference equation
characterizing the filter is given by
$\hat{y}[n] = Q(\alpha \hat{y}[n-1]) + x[n]$
For $x[n] = 0.04\,\delta[n]$, $\hat{y}[-1] = 0$, and $\alpha = 0.6$, the
output of the filter is as shown below
[Figure: filter output for $\alpha = 0.6$; amplitude versus time index n]
The limit cycle generated has a period of 1
For $x[n] = 0.04\,\delta[n]$, $\hat{y}[-1] = 0$, and $\alpha = -0.6$, the
output of the filter is as shown below
[Figure: filter output for $\alpha = -0.6$; amplitude versus time index n]
The limit cycle generated has a period of 2
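The two granular limit cycles of the first-order filter $\hat{y}[n] = Q(\alpha \hat{y}[n-1]) + x[n]$ can be reproduced with a short simulation. The Python sketch below rests on assumptions: the signed 6-bit fraction is modeled as a sign bit plus b = 5 fractional bits (step $2^{-5}$), rounding sends halfway values upward, the input sample 0.04 is used as given without quantizing it, and the function names are hypothetical.

```python
import math

def quantize_round(value, b):
    """Round to the nearest multiple of 2**-b (halfway values go up)."""
    step = 2.0 ** -b
    return math.floor(value / step + 0.5) * step

def first_order_response(alpha, length=21, b=5, impulse=0.04):
    """y[n] = Q(alpha * y[n-1]) + x[n], x[n] = impulse at n = 0, y[-1] = 0."""
    y_prev = 0.0
    output = []
    for n in range(length):
        x = impulse if n == 0 else 0.0
        y = quantize_round(alpha * y_prev, b) + x
        output.append(y)
        y_prev = y
    return output

y_pos = first_order_response(0.6)
y_neg = first_order_response(-0.6)

print("alpha =  0.6:", y_pos[:6])
print("alpha = -0.6:", y_neg[:6])
```

Under these assumptions the output for $\alpha = 0.6$ settles at the constant value $2^{-5}$ = 0.03125 instead of decaying to zero (period 1), and for $\alpha = -0.6$ it alternates between -0.03125 and +0.03125 (period 2).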
Limit-cycle-like oscillations can also result
from overflow in digital filters implemented
with finite precision arithmetic
The amplitude of the overflow oscillations can
cover the whole dynamic range of the register
experiencing the overflow
Overflow limit cycles are thus much more
serious in nature than the granular limit
cycles
Consider the causal all-pole second-
order IIR digital filter shown below
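[Figure: second-order all-pole IIR structure with input x[n], an adder followed by a single quantizer Q, two delays $z^{-1}$, feedback multipliers labeled with $\alpha_1$ and $\alpha_2$, and output $\hat{y}[n]$]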
Assume implementation using sign-magnitude
4-bit arithmetic with a rounding of the sum of
products by a single quantizer
Consider x[n] = 0 for $n \ge 0$
Let $\alpha_1 = -0.875$, $\alpha_2 = 0.875$,
$\hat{y}[-1] = 0.625$, and $\hat{y}[-2] = 0.125$
[Figure: filter output for $\alpha_1 = -0.875$, $\alpha_2 = 0.875$; amplitude versus time index n]
The second-order direct form IIR
structure with multiplier coefficients $\alpha_1$
and $\alpha_2$ remains stable if $|\alpha_2| < 1$ and
$|\alpha_1| < 1 + \alpha_2$
However, the structure can still get into
a zero-input overflow oscillation mode
for a large range of values of the filter
constants satisfying the stability
constraint when implemented using
two’s-complement arithmetic with
rounding
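The stability-triangle condition $|\alpha_2| < 1$, $|\alpha_1| < 1 + \alpha_2$ can be compared against the pole radii directly. The Python sketch below assumes the denominator has the form $1 + \alpha_1 z^{-1} + \alpha_2 z^{-2}$, so that the poles are the roots of $z^2 + \alpha_1 z + \alpha_2$; the function names and the test points are hypothetical. It only checks infinite-precision stability and says nothing about overflow oscillations.

```python
import cmath

def in_stability_triangle(a1, a2):
    """Stability condition on the coefficients of 1 + a1*z**-1 + a2*z**-2."""
    return abs(a2) < 1.0 and abs(a1) < 1.0 + a2

def max_pole_radius(a1, a2):
    """Largest root magnitude of z**2 + a1*z + a2."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return max(abs((-a1 + disc) / 2.0), abs((-a1 - disc) / 2.0))

test_points = [(-0.875, 0.875), (1.0, 0.5), (1.95, 0.9), (0.5, 1.2)]
for a1, a2 in test_points:
    print(f"a1 = {a1:+.3f}, a2 = {a2:+.3f}: "
          f"inside triangle = {in_stability_triangle(a1, a2)}, "
          f"max pole radius = {max_pole_radius(a1, a2):.4f}")
```

For each assumed test point, the coefficient pair lies inside the triangle exactly when the largest pole radius is below 1.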
It has been shown that overflow limit cycles
under zero-input cannot occur if the filter
coefficients lie in the shaded region inside the
stability triangle shown below
Homework
Read textbook pp. 583-611 and
pp. 639-650
Problems:
9.8, 9.13, 9.14, 9.15,
M9.3, M9.4, M9.5