Delp, E.J., Allebach, J., Bouman, C.A., Rajala, S.A., Bose, N.K., Sibul, L.H., Wolf, W., Zhang, Y.-Q. "Multidimensional Signal Processing." The Electrical Engineering Handbook. Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.

17 Multidimensional Signal Processing

17.1 Digital Image Processing
Image Capture • Point Operations • Image Enhancement • Digital Image Compression • Reconstruction • Edge Detection • Analysis and Computer Vision

17.2 Video Signal Processing
Sampling • Quantization • Vector Quantization • Video Compression • Information-Preserving Coders • Predictive Coding • Motion-Compensated Predictive Coding • Transform Coding • Subband Coding • HDTV • Motion Estimation Techniques • Token Matching Methods • Image Quality and Visual Perception • Visual Perception

17.3 Sensor Array Processing
Spatial Arrays, Beamformers, and FIR Filters • Discrete Arrays for Beamforming • Discrete Arrays and Polynomials • Velocity Filtering

17.4 Video Processing Architectures
Computational Techniques • Heterogeneous Multiprocessors • Video Signal Processors • Instruction Set Extensions

17.5 MPEG-4 Based Multimedia Information System
MPEG-4 Multimedia System

17.1 Digital Image Processing

Edward J. Delp, Jan Allebach, and Charles A. Bouman

What is a digital image? What is digital image processing? Why does the use of computers to process pictures seem to be everywhere? The space program, robots, and even people with personal computers are using digital image processing techniques. In this section we shall describe what a digital image is, how one obtains digital images, what the problems with digital images are (they are not trouble-free), and finally how these images are used by computers. A discussion of processing the images is presented later in the section. At the end of this section is a bibliography of selected references on digital image processing. The use of computers to process pictures is about 30 years old.
While some work was done more than 50 years ago, the year 1960 is usually the accepted date when serious work was started in such areas as optical character recognition, image coding, and the space program. NASA's Ranger moon mission was one of the first programs to return digital images from space. The Jet Propulsion Laboratory (JPL) established one of the early general-purpose image processing facilities using second-generation computer technology. The early attempts at digital image processing were hampered by the relatively slow computers used, e.g., the IBM 7094, the fact that computer time itself was expensive, and the fact that image digitizers had to be built by the research centers. It was not until the late 1960s that image processing hardware was generally available (although expensive). Today it is possible to put together a small laboratory system for less than $60,000; a system based on a popular home computer can be assembled for about $5,000. As the cost of computer hardware decreases, more uses of digital image processing will appear in all facets of life. Some people have predicted that by the turn of the century at least 50% of the images we handle in our private and professional lives will have been processed on a computer.

Edward J. Delp, Purdue University
Jan Allebach, Purdue University
Charles A. Bouman, Purdue University
Sarah A. Rajala, North Carolina State University
N. K. Bose, Pennsylvania State University
L. H. Sibul, Pennsylvania State University
Wayne Wolf, Princeton University
Ya-Qin Zhang, Microsoft Research, China

© 2000 by CRC Press LLC

Image Capture

A digital image is nothing more than a matrix of numbers. The question is: how does this matrix represent a real image that one sees on a computer screen? Like all imaging processes, whether they are analog or digital, one first starts with a sensor (or transducer) that converts the original imaging energy into an electrical signal.
These sensors, for instance, could be the photomultiplier tubes used in an x-ray system that convert the x-ray energy into a known electrical voltage. The transducer system used in ultrasound imaging is an example where sound pressure is converted to electrical energy; a simple TV camera is perhaps the most ubiquitous example. An important fact to note is that the process of conversion from one energy form to an electrical signal is not necessarily a linear process. In other words, a proportional change in the input energy to the sensor will not always cause the same proportional change in the output electrical signal. In many cases calibration data are obtained in the laboratory so that the relationship between the input energy and the output electrical signal is known. These data are necessary because some transducer performance characteristics change with age and other usage factors.

The sensor is not the only thing needed to form an image in an imaging system. The sensor must have some spatial extent before an image is formed. By spatial extent we mean that the sensor must not be a simple point source examining only one location of energy output. To explain this further, let us examine two types of sensors used in imaging: a CCD video camera and the ultrasound transducer used in many medical imaging applications. The CCD camera consists of an array of light sensors known as charge-coupled devices. The image is formed by examining the output of each sensor in a preset order for a finite time. The electronics of the system then forms an electrical signal which produces an image that is shown on a cathode-ray tube (CRT) display. The image is formed because there is an array of sensors, each one examining only one spatial location of the region to be sensed. The process of sampling the output of the sensor array in a particular order is known as scanning.
Scanning is the typical method used to convert a two-dimensional energy signal or image to a one-dimensional electrical signal that can be handled by the computer. (An image can be thought of as an energy field with spatial extent.) Another form of scanning is used in ultrasonic imaging. In this application there is only one sensor instead of an array of sensors. The ultrasound transducer is moved or steered (either mechanically or electrically) to various spatial locations on the patient's chest or stomach. As the sensor is moved to each location, the output electrical signal of the sensor is sampled and the electronics of the system then form a television-like signal which is displayed. Nearly all the transducers used in imaging form an image by either using an array of sensors or a single sensor that is moved to each spatial location.

One immediately observes that both of the approaches discussed above are equivalent in that the energy is sensed at various spatial locations of the object to be imaged. This energy is then converted to an electrical signal by the transducer. The image formation processes just described are classical analog image formation, with the distance between the sensor locations limiting the spatial resolution in the system. In the array sensors, resolution is determined by how close the sensors are located in the array. In the single-sensor approach, the spatial resolution is limited by how far the sensor is moved. In an actual system spatial resolution is also determined by the performance characteristics of the sensor. Here we are assuming for our purposes perfect sensors.

In digital image formation one is concerned about two processes: spatial sampling and quantization. Sampling is quite similar to scanning in analog image formation. The second process is known as quantization or analog-to-digital conversion, whereby at each spatial location a number is assigned to the amount of energy the transducer observes at that location.
This number is usually proportional to the electrical signal at the output of the transducer. The overall process of sampling and quantization is known as digitization. Sometimes the digitization process is just referred to as analog-to-digital conversion, or A/D conversion; however, the reader should remember that digitization also includes spatial sampling. The digital image formation process is summarized in Fig. 17.1.

The spatial sampling process can be considered as overlaying a grid on the object, with the sensor examining the energy output from each grid box and converting it to an electrical signal. The quantization process then assigns a number to the electrical signal; the result, which is a matrix of numbers, is the digital representation of the image. Each spatial location in the image (or grid) to which a number is assigned is known as a picture element or pixel (or pel). The size of the sampling grid is usually given by the number of pixels on each side of the grid, e.g., 256 × 256, 512 × 512, 488 × 380.

The quantization process is necessary because all information to be processed using computers must be represented by numbers. The quantization process can be thought of as one where the input energy to the transducer is represented by a finite number of energy values. If the energy at a particular pixel location does not take on one of the finite energy values, it is assigned to the closest value. For instance, suppose that we assume a priori that only energy values of 10, 20, 50, and 110 will be represented (the units are of no concern in this example). Suppose at one pixel an energy of 23.5 was observed by the transducer. The A/D converter would then assign this pixel the energy value of 20 (the closest one). Notice that the quantization process makes mistakes; this error in assignment is known as quantization error or quantization noise.
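The nearest-level assignment just described can be sketched in a few lines of Python; the levels 10, 20, 50, and 110 are the illustrative values from the example above, not values from any real digitizer.

```python
# Assign a measured energy to the closest of a fixed set of levels.
# Levels 10, 20, 50, 110 are the example values used in the text.
LEVELS = [10, 20, 50, 110]

def quantize(value, levels=LEVELS):
    """Return the allowed level closest to the measured value."""
    return min(levels, key=lambda lv: abs(lv - value))

print(quantize(23.5))          # 20, as in the text's example
print(quantize(23.5) - 23.5)   # -3.5, the quantization error for this pixel
```

The difference between the assigned level and the true measurement is exactly the quantization error the text describes.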
In our example, each pixel is represented by one of four possible values. For ease of representation of the data, it would be simpler to assign to each pixel an index value 0, 1, 2, 3, instead of 10, 20, 50, 110. In fact, this is typically done by the quantization process. One needs a simple table to know that a pixel assigned the value 2 corresponds to an energy of 50. Also, the number of possible energy levels is typically some integer power of two, which further aids in representation. This power is known as the number of bits needed to represent the energy of each pixel. In our example each pixel is represented by two bits.

One question that immediately arises is how accurate the digital representation of the image is when one compares the digital image with a corresponding analog image. It should first be pointed out that after the digital image is obtained one requires special hardware to convert the matrix of pixels back to an image that can be viewed on a CRT display. The process of converting the digital image back to an image that can be viewed is known as digital-to-analog conversion, or D/A conversion.

FIGURE 17.1 Digital image formation: sampling and quantization.

The quality of representation of the image is determined by how close spatially the pixels are located and how many levels or numbers are used in the quantization, i.e., how coarse or fine the quantization is. The sampling accuracy is usually measured by how many pixels there are in a given area and is cited in pixels/unit length, e.g., pixels/cm. This is known as the spatial sampling rate. One would desire to use the lowest rate possible to minimize the number of pixels needed to represent the object. If the sampling rate is too low, then obviously some details of the object to be imaged will not be represented very well. In fact, there is a mathematical theorem which determines the lowest sampling rate possible to preserve details in the object.
This rate is known as the Nyquist sampling rate (named after the late Bell Laboratories engineer Harry Nyquist). The theorem states that the sampling rate must be at least twice the highest spatial frequency of detail one expects to image in the object. If the object has details closer than, say, 1 mm, one must take at least 2 pixels/mm. (The Nyquist theorem actually says more than this, but a discussion of the entire theorem is beyond the scope of this section.) If we sample at a lower rate than the theoretical lowest limit, the resulting digital representation of the object will be distorted. This type of distortion or sampling error is known as aliasing error. Aliasing errors usually manifest themselves in the image as moiré patterns (Fig. 17.2). The important point to remember is that there is a lower limit to the spatial sampling rate such that object detail can be maintained.

The sampling rate can also be stated as the total number of pixels needed to represent the digital image, i.e., the matrix size (or grid size). One often sees these sampling rates cited as 256 × 256, 512 × 512, and so on. If the same object is imaged with a larger matrix size, the sampling rate has obviously increased. Typically, images are sampled on 256 × 256, 512 × 512, or 1024 × 1024 grids, depending on the application and type of modality. One immediately observes an important issue in digital representation of images: that of the large number of pixels needed to represent the image. A 256 × 256 image has 65,536 pixels and a 512 × 512 image has 262,144 pixels! We shall return to this point later when we discuss processing or storage of these images.

The quality of the representation of the digital image is also determined by the number of levels or shades of gray that are used in the quantization. If one has more levels, then fewer mistakes will be made in assigning values at the output of the transducer.
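The bookkeeping that connects the number of quantization levels to the number of bits per pixel can be sketched as follows; the four-level table is again the hypothetical example used earlier.

```python
# Store compact indices rather than raw energy values, plus a lookup
# table to recover the energy. 2**bits levels fit in `bits` bits.
LEVELS = [10, 20, 50, 110]
index_of = {energy: i for i, energy in enumerate(LEVELS)}
energy_of = dict(enumerate(LEVELS))

# Smallest b with 2**b >= number of levels.
bits = (len(LEVELS) - 1).bit_length()

print(index_of[50])            # 2: the pixel is stored as an index
print(energy_of[2])            # 50: the table recovers the energy
print(bits)                    # 2 bits per pixel for four levels
print((256 - 1).bit_length())  # 8 bits for 256 gray levels
```

The same arithmetic gives the 6-bit (64-level), 8-bit (256-level), and 10-bit (1024-level) figures cited in the next paragraph.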
Figure 17.3 demonstrates how the number of gray levels affects the digital representation of an artery. When a small number of levels is used, the quantization is coarse and the quantization error is large. The quantization error usually manifests itself in the digital image by the appearance of false contouring in the picture.

FIGURE 17.2 This image shows the effects of aliasing due to sampling the image at too low a rate. The image should be straight lines converging at a point. Because of undersampling, it appears as if there are patterns in the lines at various angles. These are known as moiré patterns.

One usually needs at least 6 bits or 64 gray levels to represent an image adequately. Higher-quality imaging systems use 8 bits (256 levels) or even as many as 10 bits (1024 levels) per pixel. In most applications, the human observer cannot distinguish quantization error when there are more than 256 levels. (Many times the number of gray levels is cited in bytes. One byte is 8 bits, i.e., high-quality monochrome digital imaging systems use one byte per pixel.)

One of the problems briefly mentioned previously is the large number of pixels needed to represent an image, which translates into a large amount of digital data needed for the representation. A 512 × 512 image with 8 bits/pixel (1 byte/pixel) of gray level representation requires 2,097,152 bits of computer data to describe it. A typical computer file that contains 1000 words usually requires only about 56,000 bits to describe it. The 512 × 512 image is 37 times larger! (A picture is truly worth more than 1000 words.) This data requirement is one of the major problems with digital imaging, given that the storage of digital images in a computer file system is expensive. Perhaps another example will demonstrate this problem.
Many computers and word processing systems have the capability of transmitting information over telephone lines to other systems at data rates of 2400 bits per second. At this speed it would require nearly 15 minutes to transmit a 512 × 512 image! Moving objects are imaged digitally by taking digital snapshots of them, i.e., digital video. True digital imaging would acquire about 30 images/s to capture all the important motion in a scene. At 30 images/s, with each image sampled at 512 × 512 and with 8 bits/pixel, the system must handle 62,914,560 bits/s. Only very expensive acquisition systems are capable of handling these large data rates.

The greatest advantage of digital images is that they can be processed on a computer. Any type of operation that one can do on a computer can be done to a digital image. Recall that a digital image is just a (huge) matrix of numbers. Digital image processing is the process of using a computer to extract useful information from this matrix. Processing that cannot be done optically or with analog systems (such as early video systems) can be easily done on computers. The disadvantage is that a large amount of data needs to be processed and on some small computer systems this can take a long time (hours). We shall examine image processing in more detail in the next subsection and discuss some of the computer hardware issues in a later chapter.

FIGURE 17.3 This image demonstrates the effects of quantization error. The upper left image is a coronary artery image with 8 bits (256 levels or shades of gray) per pixel. The upper right image has 4 bits/pixel (16 levels). The lower left image has 3 bits/pixel (8 levels). The lower right image has 2 bits/pixel (4 levels). Note the false contouring in the images as the number of possible levels in the pixel representation is reduced.
This false contouring is the quantization error; as the number of levels increases, the quantization error decreases because fewer mistakes are being made in the representation.

Point Operations

Perhaps the simplest image processing operation is that of modifying the values of individual pixels in an image. These operations are commonly known as point operations. A point operation might be used to highlight certain regions in an image. Suppose one wished to know where all the pixels in a certain gray level region were spatially located in the image. One would modify all those pixel values to 0 (black) or 255 (white) such that the observer could see where they were located.

Another example of a point operation is contrast enhancement or contrast stretching. The pixel values in a particular image may occupy only a small region of the gray level distribution. For instance, the pixels in an image may only take on values between 0 and 63, when they could nominally take on values between 0 and 255. This is sometimes caused by the way the image was digitized and/or by the type of transducer used. When this image is examined on a CRT display the contrast looks washed out. A simple point operation that multiplies each pixel value in the image by four will increase the apparent contrast in the image; the new image now has gray values between 0 and 252. This operation is shown in Fig. 17.4.

Possibly the most widely used point operation in medical imaging is pseudo-coloring. In this point operation all the pixels in the image with a particular gray value are assigned a color. Various schemes have been proposed for appropriate pseudo-color tables that assign the gray values to colors. It should be mentioned that point operations are often cascaded, i.e., an image undergoes contrast enhancement and then pseudo-coloring.

The operations described above can be thought of as operations (or algorithms) that modify the range of the gray levels of the pixels.
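The two point operations described above, gray-level highlighting and contrast stretching by a factor of four, can be sketched as follows; the highlight band of 30 to 50 is an arbitrary illustrative choice.

```python
def stretch(pixel):
    """Contrast stretching from the text: gray values 0..63 map to 0..252."""
    return pixel * 4

def highlight(pixel, lo=30, hi=50):
    """Set pixels inside the gray-level band to white, all others to black."""
    return 255 if lo <= pixel <= hi else 0

row = [0, 10, 40, 63]
print([stretch(p) for p in row])    # [0, 40, 160, 252]
print([highlight(p) for p in row])  # [0, 0, 255, 0]
```

Both functions look only at the gray value of a single pixel, never at its position, which is exactly what makes them point operations.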
An important feature that describes a great deal about an image is the histogram of the pixel values. A histogram is a table that lists how many pixels in an image take on a particular gray value. These data are often plotted as a function of the gray value. Point operations are also known as histogram modification or histogram stretching. The contrast enhancement operation shown in Fig. 17.4 modifies the histogram of the resultant image by stretching the gray values from a range of 0–63 to a range of 0–252. Some point operations are such that the resulting histogram of the processed image has a particular shape. A popular form of histogram modification is known as histogram equalization, whereby the pixels are modified such that the histogram of the processed image is almost flat, i.e., all the pixel values occur equally often.

It is impossible to list all possible types of point operations; however, the important thing to remember is that these operations process one pixel at a time by modifying the pixel based only on its gray level value and not on where it is located spatially (i.e., its location in the pixel matrix). These operations are performed to enhance the image, make it easier to see certain structures or regions in the image, or to force a particular shape on the histogram of the image. They are also used as initial operations in a more complicated image processing algorithm.

FIGURE 17.4 Contrast stretching. The image on the left has gray values between 0 and 63, causing the contrast to look washed out. The image on the right has been contrast enhanced by multiplying the gray levels by four.

Image Enhancement

Image enhancement is the use of image processing algorithms to remove certain types of distortion in an image. The image is enhanced by removing noise, making the edge structures in the image stand out, or any other operation that makes the image look better.¹
Point operations discussed above are generally considered to be enhancement operations. Enhancement also includes operations that use groups of pixels and the spatial location of the pixels in the image. The most widely used algorithms for enhancement are based on pixel functions that are known as window operations. A window operation performed on an image is nothing more than the process of examining the pixels in a certain region of the image, called the window region, and computing some type of mathematical function derived from the pixels in the window. In most cases the windows are square or rectangular, although other shapes have been used. After the operation is performed, the result of the computation is placed in the center pixel of the window. For example, suppose a 3 × 3 pixel window has been extracted from the image. The values of the pixels in the window, labeled a_1, a_2, . . ., a_9, are used to compute a new pixel value which replaces the value of a_5, and the window is moved to a new center location until all the pixels in the original image have been processed.

As an example of a window operation, suppose we computed the average value of the pixels in the window. This operation is known as smoothing and will tend to reduce noise in the image, but unfortunately it will also tend to blur edge structures in the image. Another window operation often used is the computation of a linear weighted sum of the pixel values. Let a′_5 be the new pixel value that will replace a_5 in the original image. We then form

    a′_5 = Σ_{i=1}^{9} α_i a_i        (17.1)

where the α_i's are any real numbers. For the simple smoothing operation described above we set α_i = 1/9 for all i. By changing the values of the α_i weights, one can perform different types of enhancement operations to an image. Any window operation that can be described by Eq. (17.1) is known as a linear window operation or convolution operator.
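Eq. (17.1) can be sketched directly. This is a minimal illustration, not production filtering code: border pixels are simply left unchanged here, whereas a real system would choose an explicit border policy.

```python
def window_op(image, weights):
    """Replace each interior pixel by the weighted sum of its 3 x 3
    neighborhood (Eq. 17.1); border pixels are left unchanged."""
    out = [row[:] for row in image]
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            out[r][c] = sum(weights[dr + 1][dc + 1] * image[r + dr][c + dc]
                            for dr in (-1, 0, 1) for dc in (-1, 0, 1))
    return out

SMOOTH = [[1 / 9] * 3 for _ in range(3)]   # alpha_i = 1/9: simple smoothing

img = [[9, 9, 9],
       [9, 0, 9],
       [9, 9, 9]]
# The noisy center pixel is pulled toward its neighbors:
# the window average is 8 * 9 / 9 = 8, up to floating-point rounding.
print(window_op(img, SMOOTH)[1][1])
```

Swapping in a different weight matrix changes the behavior of the operator without touching the loop, which is the point of Eq. (17.1).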
If some of the α_i coefficients take on negative values, one can enhance the appearance of edge structures in the image. It is also possible to compute a nonlinear function of the pixels in the window. One of the more powerful nonlinear window operations is median filtering. In this operation all the pixels in the window are listed in descending magnitude and the middle, or median, pixel is obtained. The median pixel then is used to replace a_5. The median filter is used to remove noise from an image and at the same time preserve the edge structure in the image. More recently there has been a great deal of interest in morphological operators. These are also nonlinear window operations that can be used to extract or enhance shape information in an image.

In the preceding discussion, all of the window operations were described on 3 × 3 windows. Current research in window operations is directed at using larger window sizes, i.e., 9 × 9, 13 × 13, or 21 × 21. The philosophy in this work is that small window sizes only use local information and what one really needs is information that is more global in nature.

¹ Image enhancement is often confused with image restoration. Image enhancement is the ad hoc application of various processing algorithms to enhance the appearance of the image. Image restoration is the application of algorithms that use knowledge of the degradation process to enhance or restore the image, e.g., deconvolution algorithms used to remove the effect of the aperture point spread function in blurred images. A discussion of image restoration is beyond the scope of this section.

Digital Image Compression

Image compression refers to the task of reducing the amount of data required to store or transmit a digital image. As discussed earlier, in its natural form, a digital image comprises an array of numbers. Each such number is the sampled value of the image at a pixel (picture element) location.
These numbers are represented with finite precision using a fixed number of bits. Until recently, the dominant image size was 512 × 512 pixels with 8 bits or 1 byte per pixel. The total storage size for such an image is 512² ≈ 0.25 × 10⁶ bytes, or 0.25 Mbytes. When digital image processing first emerged in the 1960s, this was considered to be a formidable amount of data, and so interest in developing ways to reduce this storage requirement arose immediately. Since that time, image compression has continued to be an active area of research. The recent emergence of standards for image coding algorithms and the commercial availability of very large scale integration (VLSI) chips that implement image coding algorithms are indicative of the present maturity of the field, although research activity continues apace.

With declining memory costs and increasing transmission bandwidths, 0.25 Mbytes is no longer considered to be the large amount of data that it once was. This might suggest that the need for image compression is not as great as it previously was. Unfortunately (or fortunately, depending on one's point of view), this is not the case, because our appetite for image data has also grown enormously over the years. The old 512 × 512 pixels × 1 byte per pixel "standard" was a consequence of the spatial and gray scale resolution of sensors and displays that were commonly available until recently. At this time, displays with more than 10³ × 10³ pixels and 24 bits/pixel to allow full color representation (8 bits each for red, green, and blue) are becoming commonplace. Thus, our 0.25-Mbyte standard image size has grown to 3 Mbytes. This is just the tip of the iceberg, however. For example, in desktop printing applications, a 4-color (cyan, magenta, yellow, and black) image of an 8.5 × 11 in.² page sampled at 600 dots per in. requires 134 Mbytes.
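The storage figures quoted above follow from simple arithmetic, reproduced here as a sketch (sizes in bytes, with 10⁶ bytes per Mbyte; the text rounds 512² bytes down to 0.25 Mbytes).

```python
# Storage required = number of pixels * bytes per pixel.
legacy = 512 * 512 * 1                   # the old 512 x 512, 8-bit standard
display = 1000 * 1000 * 3                # ~10^3 x 10^3 pixels, 24 bits/pixel
page = int(8.5 * 600) * (11 * 600) * 4   # 8.5 x 11 in. page, 600 dpi, 4 colors

print(legacy / 1e6)    # ~0.26 Mbytes, the text's "0.25-Mbyte" standard image
print(display / 1e6)   # 3.0 Mbytes for the full-color display
print(page / 1e6)      # ~134.6 Mbytes for the printed page
```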
In remote sensing applications, a typical hyperspectral image contains terrain irradiance measurements in each of 200 10-nm-wide spectral bands at 25-m intervals on the ground. Each measurement is recorded with 12-bit precision. Such data are acquired from aircraft or satellite and are used in agriculture, forestry, and other fields concerned with the management of natural resources. Storage of these data from just a 10 × 10 km² area requires 4800 Mbytes.

Figure 17.5 shows the essential components of an image compression system. At the system input, the image is encoded into its compressed form by the image coder. The compressed image may then be subjected to further digital processing, such as error control coding, encryption, or multiplexing with other data sources, before being used to modulate the analog signal that is actually transmitted through the channel or stored in a storage medium. At the system output, the image is processed step by step to undo each of the operations that was performed on it at the system input. At the final step, the image is decoded into its original uncompressed form by the image decoder. Because of the role of the image encoder and decoder in an image compression system, image coding is often used as a synonym for image compression. If the reconstructed image is identical to the original image, the compression is said to be lossless. Otherwise, it is lossy.

Image compression algorithms depend for their success on two separate factors: redundancy and irrelevancy. Redundancy refers to the fact that each pixel in an image does not take on all possible values with equal probability, and the value that it does take on is not independent of the values of the other pixels in the image. If this were not true, the image would appear as a white noise pattern such as that seen when a television receiver is tuned to an unused channel.
FIGURE 17.5 Overview of an image compression system.

From an information-theoretic point of view, such an image contains the maximum amount of information. From the point of view of a human or machine interpreter, however, it contains no information at all. Irrelevancy refers to the fact that not all the information in the image is required for its intended application. First, under typical viewing conditions, it is possible to remove some of the information in an image without producing a change that is perceptible to a human observer. This is because of the limited ability of the human viewer to detect small changes in luminance over a large area or larger changes in luminance over a very small area, especially in the presence of detail that may mask these changes. Second, even though some degradation in image quality may be observed as a result of image compression, the degradation may not be objectionable for a particular application, such as teleconferencing. Third, the degradation introduced by image compression may not interfere with the ability of a human or machine to extract the information from the image that is important for a particular application. Lossless compression algorithms can only exploit redundancy, whereas lossy methods may exploit both redundancy and irrelevancy.

A myriad of approaches have been proposed for image compression. To bring some semblance of order to the field, it is helpful to identify those key elements that provide a reasonably accurate description of most encoding algorithms. These are shown in Fig. 17.6. The first step is feature extraction. Here the image is partitioned into N × N blocks of pixels. Within each block, a feature vector is computed which is used to represent all the pixels within that block.
If the feature vector provides a complete description of the block, i.e., the block of pixel values can be determined exactly from the feature vector, then the feature is suitable for use in a lossless compression algorithm. Otherwise, the algorithm will be lossy. For the simplest feature vector, we let the block size N = 1 and take the pixel values themselves to be the features. Another important example for N = 1 is to let the feature be the error in the prediction of the pixel value based on the values of neighboring pixels which have already been encoded and, hence, whose values would be known at the decoder. This feature forms the basis for predictive encoding, of which differential pulse-code modulation (DPCM) is a special case. For larger block sizes, the most important example is to compute a two-dimensional (2-D) Fourier-like transform of the block of pixels and to use the N² transform coefficients as the feature vector. The widely used Joint Photographic Experts Group (JPEG) standard image coder is based on the discrete cosine transform (DCT) with a block size of N = 8. In all of the foregoing examples, the block of pixel values can be reconstructed exactly from the feature vector; in the last example, the inverse DCT is used. Hence, all these features may form the basis for a lossless compression algorithm.

An example of a feature vector that does not provide a complete description of the pixel block is one consisting of the mean and variance of the pixels within the block and an N × N binary mask indicating whether or not each pixel exceeds the mean. From this vector, we can only reconstruct an approximation to the original pixel block which has the same mean and variance as the original. This feature is the basis for the lossy block truncation coding algorithm.
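A sketch of the block truncation coding feature vector just described, computed here for a flattened block of four pixel values; the sample values are arbitrary.

```python
def btc_features(block):
    """Mean, variance, and over-the-mean binary mask for a pixel block.
    These features describe the block only approximately (lossy)."""
    n = len(block)
    mean = sum(block) / n
    variance = sum((p - mean) ** 2 for p in block) / n
    mask = [1 if p > mean else 0 for p in block]
    return mean, variance, mask

mean, variance, mask = btc_features([10, 20, 30, 40])
print(mean, variance, mask)   # 25.0 125.0 [0, 0, 1, 1]
```

From these three features a decoder can rebuild only an approximation sharing the block's mean and variance, which is why the feature is lossy.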
Ideally, the feature vector should be chosen to provide as nonredundant as possible a representation of the image and to separate those aspects of the image that are relevant to the viewer from those that are irrelevant. The second step in image encoding is vector quantization. This is essentially a clustering step in which we partition the feature space into cells, each of which will be represented by a single prototype feature vector. Since all feature vectors belonging to a given cell are mapped to the same prototype, the quantization process is irreversible and, hence, cannot be used as part of a lossless compression algorithm. Figure 17.7 shows an example for a two-dimensional feature space. Each dot corresponds to one feature vector from the image. Each X signifies the prototype used to represent all the feature vectors contained within its quantization cell, the boundary of which is indicated by the dashed lines. Despite the simplicity with which vector quantization may be described, the implementation of a vector quantizer is a computationally complex task unless some structure is imposed on it. The clustering is based on minimizing the distortion between the original and quantized feature vectors, averaged over the entire image. The distortion measure can be chosen to account for the relative sensitivity of the human viewer to different kinds of degradation. In one dimension, the vector quantizer reduces to the Lloyd-Max scalar quantizer.

FIGURE 17.6  Key elements of an image encoder.

The final step in image encoding is entropy coding. Here we convert the stream of prototype feature vectors to a binary stream of 0’s and 1’s. Ideally, we would like to perform this conversion in a manner that yields the minimum average number of binary digits per prototype feature vector.
In 1948, Claude Shannon proved that it is possible to code a discrete memoryless source using, on the average, as few binary digits per source symbol as the source entropy, defined as

H = −Σ_n p_n log2(p_n)

Here p_n denotes the probability or relative frequency of occurrence of the nth symbol in the source alphabet, and log2(x) = ln(x)/ln(2) is the base 2 logarithm of x. The units of H are bits/source symbol. The proof of Shannon’s theorem is based on grouping the source symbols into large blocks and assigning binary code words of varying length to each block of source symbols. More probable blocks of source symbols are assigned shorter code words, whereas less probable blocks are assigned longer code words. As the block length approaches infinity, the bit rate tends to H. Huffman determined the optimum variable-length coding scheme for a discrete memoryless source using blocks of any finite length. Table 17.1 provides an example illustrating the concept of source coding. The source alphabet contains eight symbols with the probabilities indicated. For convenience, these symbols have been labeled in order of decreasing probability. In the context of image encoding, the source alphabet would simply consist of the prototype feature vectors generated by the vector quantizer. The entropy of this source is 2.31 bits/source symbol. If we were to use a fixed-length code for this source, we would need to use three binary digits for each source symbol as shown in Table 17.1. On the other hand, the code words for the Huffman code contain from 1 to 5 code letters (binary digits). In this case, the average code word length is

l̄ = Σ_n p_n l_n = 2.31 binary digits

Here l_n is the number of code letters in the code word for the source symbol a_n. This is the average number of binary digits per source symbol that would be needed to encode the source, and it is equal to the entropy. Thus, for this particular source, the Huffman code achieves the lower bound.
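The numbers for the source of Table 17.1 can be verified directly. The short Python sketch below (variable names are mine) builds a Huffman code by repeatedly merging the two least probable nodes, then checks that both the entropy and the average code word length come out to 2.3125 bits:

```python
import heapq
from math import log2

# The eight-symbol source of Table 17.1.
probs = {"a1": 1/2, "a2": 1/8, "a3": 1/8, "a4": 1/16,
         "a5": 1/16, "a6": 1/16, "a7": 1/32, "a8": 1/32}

H = -sum(p * log2(p) for p in probs.values())   # source entropy, bits/symbol

# Build the Huffman tree: repeatedly merge the two least probable nodes.
heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
heapq.heapify(heap)
tie_break = len(heap)
while len(heap) > 1:
    p0, _, c0 = heapq.heappop(heap)   # least probable node
    p1, _, c1 = heapq.heappop(heap)   # next least probable node
    # Prepend a distinguishing 0 or 1 to every code word in each subtree.
    merged = {s: "0" + w for s, w in c0.items()}
    merged.update({s: "1" + w for s, w in c1.items()})
    heapq.heappush(heap, (p0 + p1, tie_break, merged))
    tie_break += 1
code = heap[0][2]

avg_len = sum(probs[s] * len(w) for s, w in code.items())
assert abs(H - 2.3125) < 1e-9 and abs(avg_len - 2.3125) < 1e-9
# Prefix condition: no code word is a prefix of another, longer code word.
words = list(code.values())
assert not any(u != v and v.startswith(u) for u in words for v in words)
```

Because the probabilities here are all powers of 1/2, any optimal code meets the entropy bound exactly; the assertions hold regardless of how ties between equal probabilities are broken.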
It can be shown that in general the rate for the Huffman code will always be within 1 binary digit of the source entropy. By grouping source symbols into blocks of length L and assigning code words to each block, this maximum distance can be decreased to 1/L binary digits. Note the subtle distinction here between bits, which are units of information, a property of the source alone, and binary digits, which are units of code word length, and hence only a property of the code used to represent the source. Also note that the Huffman code satisfies the prefix condition, i.e., no code word is the prefix of another longer code word. This means that a stream of 0’s and 1’s may be uniquely decoded into the corresponding sequence of source symbols without the need for markers to delineate boundaries between code words. The Huffman code is determined from the binary tree shown in Fig. 17.8. This tree is constructed recursively by combining the two least probable symbols in the alphabet into one composite symbol whose probability of occurrence is the sum of the probabilities of the two symbols that it represents. The code words for these two symbols are the same as that of the composite symbol with a 0 or a 1 appended at the end to distinguish between them. This procedure is repeated until the reduced alphabet contains only a single symbol. Then the code word for a particular source symbol is determined by traversing the tree from its root to the leaf node for that source symbol.

FIGURE 17.7  Vector quantization of a 2-D feature space.

Reconstruction

The objective of image reconstruction is to compute an unknown image from many complex measurements of the image. Usually, each measurement depends on many pixels in the image which may be spatially distant from one another.
TABLE 17.1  A Discrete Source with an Eight-Symbol Alphabet and Two Schemes for Encoding It

Source Symbol   Probability p_n   Fixed-Length Code   Huffman Code
a_1             1/2               000                 0
a_2             1/8               001                 100
a_3             1/8               010                 101
a_4             1/16              011                 1100
a_5             1/16              100                 1101
a_6             1/16              101                 1110
a_7             1/32              110                 11110
a_8             1/32              111                 11111

H = 2.31 bits/source symbol; l̄ = 3 binary digits/source symbol (fixed-length code); l̄ = 2.31 binary digits/source symbol (Huffman code).

FIGURE 17.8  Binary tree used to generate the Huffman code for the source shown in Table 17.1.

A typical reconstruction problem is tomography, in which each measurement is obtained by integrating the pixel values along a ray through the image. Figure 17.9 illustrates the measurement of these ray integrals in the projection process. For each angle θ a set of ray integrals is computed by varying the position t at which the ray passes through the image. The points along a ray are given by all the solutions (x,y) to the equation

t = x cos θ + y sin θ

We may therefore compute the ideal projection integrals by the following expression, known as the Radon transform

p(θ,t) = ∫∫ f(x,y) δ(t − x cos θ − y sin θ) dx dy     (17.2)

where δ(t − x cos θ − y sin θ) is an impulse function that is nonzero along the projection ray, and the integrals run over −∞ to ∞. In practice, these projection integrals may be measured using a variety of physical techniques. In transmission tomography, λ_T photons are emitted into an object under test. A detector then counts the number of photons, λ(θ,t), which pass through the object without being absorbed. Collimators are used to ensure the detected energy passes straight through the object along the desired path. Since the attenuation of energy as it passes through the object is exponentially related to the integral of the object’s density, the projection integral may be computed from the formula

p(θ,t) = −log( λ(θ,t) / λ_T )

In emission tomography, one wishes to measure the rate of photon emission at each pixel. In this case, various methods may be used to collect and count all the photons emitted along a ray passing through the object.
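In the discrete setting the projection operation of Eq. (17.2) has a very simple special case, which the sketch below illustrates (the toy image is mine, not from the text): at θ = 0 the rays t = x pick out columns of the image, at θ = 90° they pick out rows, and every projection conserves the total "mass" of the cross section.

```python
import numpy as np

# Toy 4 x 4 cross section f(x, y); the array is indexed as f[y, x].
f = np.array([[0, 0, 0, 0],
              [0, 5, 2, 0],
              [0, 1, 3, 0],
              [0, 0, 0, 0]], dtype=float)

p_theta0  = f.sum(axis=0)   # p(0, t): integrate along y for each ray t = x
p_theta90 = f.sum(axis=1)   # p(90deg, t): integrate along x for each ray t = y

# Each projection integrates all of f exactly once, so each one conserves
# the total mass of the cross section -- a useful sanity check on real
# projection data as well.
assert np.isclose(p_theta0.sum(), f.sum())
assert np.isclose(p_theta90.sum(), f.sum())
```

Projections at other angles require interpolating the image along tilted rays, but the same conservation property holds at every θ.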
Once the projections p(θ,t) have been measured, the objective is to compute the unknown cross section f(x,y). The image and projections may be related by first computing the Fourier transform of the 2-D image

F(ω_x, ω_y) = ∫∫ f(x,y) e^{−j(ω_x x + ω_y y)} dx dy

and the 1-D projection for each angle

P(θ,ω) = ∫ p(θ,t) e^{−jωt} dt

FIGURE 17.9  Projection data for angle θ, resulting in the one-dimensional function p(θ,t).

These two transforms are then related by the Fourier slice theorem

F(ω cos θ, ω sin θ) = P(θ,ω)

In words, P(θ,ω) corresponds to the value of the 2-D Fourier transform F(ω_x, ω_y) along a 1-D line at an angle of θ passing through the origin. The Fourier slice theorem may be used to develop two methods for inverting the Radon transform and thereby computing the image f. The first method, known as filtered back projection, computes the inverse Fourier transform in polar coordinates using the transformed projection data

f(x,y) = (1/2π) ∫_0^π ∫ P(θ,ω) |ω| e^{jω(x cos θ + y sin θ)} dω dθ

Notice that the |ω| term accounts for the integration in polar coordinates. A second inversion method results from performing all computations in the space domain rather than first transforming to the frequency domain ω. This can be done by expressing the inner integral of filtered back projection as a convolution in the space domain

(1/2π) ∫ P(θ,ω) |ω| e^{jωs} dω = ∫ p(θ,t) h(s − t) dt

Here h(t) is the inverse Fourier transform of |ω|. This results in the inversion formula known as convolution back projection

f(x,y) = ∫_0^π ∫ p(θ,t) h(x cos θ + y sin θ − t) dt dθ

In practice, h must be a low-pass approximation to the true inverse Fourier transform of |ω|. This is necessary to suppress noise in the projection data. In practice, the choice of h is the most important element in the design of the reconstruction algorithm.

Edge Detection

The ability to find gray level edge structures in images is an important image processing operation.
We shall define an edge to be a region in the image where there is a large change in gray level over a relatively small spatial region. The process of finding edge locations in digital images is known as edge detection. Most edge detection operators, also known as edge operators, use a window operator to first enhance the edges in the image, followed by thresholding the enhanced image. There has been a great deal of research performed in the area of edge detection. Some of the research issues include robust threshold selection, window size selection, noise response, edge linking, and the detection of edges in moving objects. While it is beyond the scope of this section to discuss these issues in detail, it is obvious that such things as threshold selection will greatly affect the performance of the edge detection algorithm. If the threshold is set too high, then many edge points will be missed; if set too low, then many “false’’ edge points will be obtained because of the inherent noise in the image. The investigation of the “optimal’’ choice of the threshold is an important research area. Selection of the particular window operation to enhance the edges of an image, as an initial step in edge detection, has recently been based on using models of the performance of the human visual system in detecting edges.

Analysis and Computer Vision

The process of extracting useful measurements from an image or sequence of images is known as image analysis or computer vision. Before analysis can be performed one must first determine pertinent features or attributes of the object in the scene and extract information about these features. The selection of which features in the image to measure must be chosen a priori, based on empirical results.
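Returning for a moment to edge detection, the enhance-then-threshold structure described above can be sketched in a few lines. The Sobel windows and the 50% threshold used here are illustrative assumptions, not choices made in the text:

```python
import numpy as np

# A test image with a vertical step edge between columns 3 and 4.
img = np.zeros((8, 8))
img[:, 4:] = 100.0

# Sobel windows enhance horizontal and vertical gray-level changes.
sx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sy = sx.T

def window_op(f, w):
    """Apply a 3x3 window operator (valid region only, no padding)."""
    out = np.zeros((f.shape[0] - 2, f.shape[1] - 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (f[i:i+3, j:j+3] * w).sum()
    return out

gx, gy = window_op(img, sx), window_op(img, sy)
magnitude = np.hypot(gx, gy)                 # enhanced image
edges = magnitude > 0.5 * magnitude.max()    # thresholding step

# The detected edge points line up in the two columns straddling the step.
assert edges.any()
assert set(np.nonzero(edges)[1].tolist()) == {2, 3}
```

Lowering the threshold on a noisy image would flag spurious points, and raising it would drop genuine edge points, which is exactly the trade-off discussed above.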
Most features used consist of shape properties, shape change properties, shading, texture, motion, depth, and color. After the features are extracted, one must then use the feature measurements to determine scene characteristics such as object identification. In the past, simple pattern recognition algorithms, e.g., nearest-neighbor classification, have been used to compare the feature measurements of an image to a set of feature measurements that correspond to a known object. A decision is then made as to whether or not the features of the image match those of the known type. Recently, there has been work in the application of artificial intelligence techniques to image analysis. These approaches are very much different from classical statistical pattern recognition in that the feature measurements are used in a different manner as part of a larger system that attempts to model the scene and then determine what is in it based on the model.

Defining Terms

Digital image: An array of numbers representing the spatial distribution of energy in a scene which is obtained by a process of sampling and quantization.
Edge: A localized region of rapid change in gray level in the image.
Entropy: A measure of the minimum amount of information required on the average to store or transmit each quantized feature vector.
Image compression or coding: The process of reducing the number of binary digits or bits required to represent the image.
Image enhancement: An image processing operation that is intended to improve the visual quality of the image or to emphasize certain features.
Image feature: An attribute of a block of image pixels.
Image reconstruction: The process of obtaining an image from nonimage data that characterizes that image.
Lossless vs. lossy compression: If the reconstructed or decoded image is identical to the original, the compression scheme is lossless. Otherwise, it is lossy.
Pixel: A single sample or picture element in the digital image which is located at specific spatial coordinates.
Point operation: An image processing operation in which individual pixels are mapped to new values irrespective of the values of any neighboring pixels.
Projection: A set of parallel line integrals across the image oriented at a particular angle.
Quantization: The process of converting from a continuous-amplitude image to an image that takes on only a finite number of different amplitude values.
Sampling: The process of converting from a continuous-parameter image to a discrete-parameter image by discretizing the spatial coordinate.
Tomography: The process of reconstructing an image from projection data.
Vector quantization: The process of replacing an exact vector of features by a prototype vector that is used to represent all feature vectors contained within a cluster.
Window operation: An image processing operation in which the new value assigned to a given pixel depends on all the pixels within a window centered at that pixel location.

Related Topics

15.1 Coding, Transmission, and Storage • 73.6 Data Compression

References

H. C. Andrews and B. R. Hunt, Digital Image Restoration, Englewood Cliffs, N.J.: Prentice-Hall, 1977.
D. H. Ballard and C. M. Brown, Computer Vision, Englewood Cliffs, N.J.: Prentice-Hall, 1982.
H. Barrow and J. Tenenbaum, “Computational vision,’’ Proc. IEEE, vol. 69, pp. 572–595, May 1981.
A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, Norwell, Mass.: Kluwer Academic Publishers, 1991.
R. C. Gonzalez and P. Wintz, Digital Image Processing, Reading, Mass.: Addison-Wesley, 1991.
G. T. Herman, Image Reconstruction from Projections, New York: Springer-Verlag, 1979.
T. S. Huang, Image Sequence Analysis, New York: Springer-Verlag, 1981.
A. K. Jain, Fundamentals of Digital Image Processing, Englewood Cliffs, N.J.: Prentice-Hall, 1989.
A. Kak and M. Slaney, Principles of Computerized Tomographic Imaging, New York: IEEE Press, 1988.
A. Macovski, Medical Imaging Systems, Englewood Cliffs, N.J.: Prentice-Hall, 1983.
M. D. McFarlane, “Digital pictures fifty years ago,’’ Proc. IEEE, pp. 768–770, July 1972.
W. K. Pratt, Digital Image Processing, New York: Wiley, 1991.
A. Rosenfeld and A. Kak, Digital Picture Processing, vols. 1 and 2, San Diego: Academic Press, 1982.
J. Serra, Image Analysis and Mathematical Morphology, vols. 1 and 2, San Diego: Academic Press, 1982 and 1988.

Further Information

A number of textbooks are available that cover the broad area of image processing and several that focus on more specialized topics within this field. The texts by Gonzalez and Wintz [1991], Jain [1989], Pratt [1991], and Rosenfeld and Kak (Vol. 1) [1982] are quite broad in their scope. Gonzalez and Wintz’s treatment is written at a somewhat lower level than that of the other texts. For a more detailed treatment of computed tomography and other medical imaging modalities, the reader may consult the texts by Herman [1979], Macovski [1983], and Kak and Slaney [1988]. To explore the field of computer vision, the reader is advised to consult the text by Ballard and Brown [1982]. Current research and applications of image processing are reported in a number of journals. Of particular note are the IEEE Transactions on Image Processing; the IEEE Transactions on Pattern Analysis and Machine Intelligence; the IEEE Transactions on Geoscience and Remote Sensing; the IEEE Transactions on Medical Imaging; the Journal of the Optical Society of America, A; Optical Engineering; the Journal of Electronic Imaging; and Computer Vision, Graphics, and Image Processing.

17.2 Video Signal Processing

Sarah A. Rajala

Video signal processing is the area of specialization concerned with the processing of time sequences of image data, i.e., video.
Because of the significant advances in computing power and increases in available transmission bandwidth, there has been a proliferation of potential applications in the area of video signal processing. Applications such as high-definition television, digital video, multimedia, video phone, interactive video, medical imaging, and information processing are the driving forces in the field today. As diverse as the applications may seem, it is possible to specify a set of fundamental principles and methods that can be used to develop the applications. Considerable understanding of a video signal processing system can be gained by representing the system with the block diagram given in Fig. 17.10. Light from a real-world scene is captured by a scanning system and causes an image frame f(x,y,t_0) to be formed on a focal plane. A video signal is a sequence of image frames that are created when a scanning system captures a new image frame at periodic intervals in time. In general, each frame of the video sequence is a function of two spatial variables x and y and one temporal variable t. An integral part of the scanning system is the process of converting the original analog signal into an appropriate digital representation. The conversion process includes the operations of sampling and quantization.

FIGURE 17.10  Video signal processing system block diagram.

Sampling is the process of converting a continuous-time/space signal into a discrete-time/space signal. Quantization is the process of converting a continuous-valued signal into a discrete-valued signal. Once the video signal has been sampled and quantized, it can be processed digitally. Processing can be performed on special-purpose hardware or general-purpose computers. The type of processing performed depends on the particular application. For example, if the objective is to generate high-definition television, the processing would typically include compression and motion estimation.
In fact, in most of the applications listed above these are the fundamental operations. Compression is the process of compactly representing the information contained in an image or video signal. Motion estimation is the process of estimating the displacement of the moving objects in a video sequence. The displacement information can then be used to interpolate missing frame data or to improve the performance of compression algorithms. After the processing is complete, a video signal is ready for transmission over some channel or storage on some medium. If the signal is transmitted, the type of channel will vary depending on the application. For example, today analog television signals are transmitted one of three ways: via satellite, terrestrially, or by cable. All three channels have limited transmission bandwidths and can adversely affect the signals because of the imperfect frequency responses of the channels. Alternatively, with a digital channel, the primary limitation will be the bandwidth. The final stage of the block diagram shown in Fig. 17.10 is the display. Of critical importance at this stage is the human observer. Understanding how humans respond to visual stimuli, i.e., the psychophysics of vision, will not only allow for better evaluation of the processed video signals but will also permit the design of better systems.

Sampling

If a continuous-time video signal satisfies certain conditions, it can be exactly represented by and be reconstructed from its sample values. The conditions which must be satisfied are specified in the sampling theorem. The sampling theorem can be stated as follows:

Sampling Theorem: Let f(x,y,t) be a bandlimited signal with F(ω_x, ω_y, ω_t) = 0 for |ω_x| > ω_xM, |ω_y| > ω_yM, and |ω_t| > ω_tM. Then f(x,y,t) is uniquely determined by its samples f(jX_S, kY_S, lT_S) = f(j,k,l), where j,k,l = 0, ±1, ±2, ...
if ω_sx > 2ω_xM, ω_sy > 2ω_yM, and ω_st > 2ω_tM, where ω_sx = 2π/X_S, ω_sy = 2π/Y_S, and ω_st = 2π/T_S. Here X_S is the sampling period along the x direction, ω_sx is the spatial sampling frequency along the x direction, Y_S is the sampling period along the y direction, ω_sy is the spatial sampling frequency along the y direction, T_S is the sampling period along the temporal direction, and ω_st is the temporal sampling frequency. Given these samples, f(x,y,t) can be reconstructed by generating a periodic impulse train in which successive impulses have amplitudes that are successive sample values. This impulse train is then processed through an ideal low-pass filter with appropriate gain and cut-off frequencies. The resulting output signal will be exactly equal to f(x,y,t). (Source: Oppenheim et al., 1983, p. 519.)

If the sampling theorem is not satisfied, aliasing will occur. Aliasing occurs when the signal is undersampled and therefore no longer recoverable by low-pass filtering. Figure 17.11(a) shows the frequency spectrum of a sampled bandlimited signal with no aliasing. Figure 17.11(b) shows the frequency response of the same signal with aliasing. The aliasing occurs at the points where there is overlap in the diamond-shaped regions. For video signals aliasing in the temporal direction will give rise to flicker on the display. For television systems, the standard temporal sampling rate is 30 frames per second in the United States and Japan and 25 frames per second in Europe. However, these rates would be insufficient without the use of interlacing. If the sampling rate (spatial and/or temporal) of a system is fixed, a standard approach for minimizing the effects of aliasing for signals that do not satisfy the sampling theorem is to use a presampling filter. Presampling filters are low-pass filters whose cut-off frequencies are chosen to be less than ω_xM, ω_yM, and ω_tM.
Although the signal will still not be able to be reconstructed exactly, the degradations are less annoying. Another problem in a real system is the need for an ideal low-pass filter to reconstruct an analog signal. An ideal filter is not physically realizable, so in practice an approximation must be made. Several very simple filter structures are common in video systems: sample and hold, bilinear, and raised cosine.

Quantization

Quantization is the process of converting the continuous-valued amplitude of the video signal into a discrete-valued representation, i.e., a finite set of numbers. The output of the quantizer is characterized by quantities that are limited to a finite number of values. The process is a many-to-one mapping, and thus there is a loss of information. The quantized signal can be modeled as

f_q(j,k,l) = f(j,k,l) − e(j,k,l)

where f_q(j,k,l) is the quantized video signal and e(j,k,l) is the quantization noise. If too few bits per sample are used, the quantization noise will produce visible false contours in the image data. The quantizer is a mapping operation which generally takes the form of a staircase function (see Fig. 17.12). A rule for quantization can be defined as follows: Let {d_k, k = 1, 2, ..., N + 1} be the set of decision levels, with d_1 the minimum amplitude value and d_{N+1} the maximum amplitude value of f(j,k,l). If f(j,k,l) is contained in the interval (d_k, d_{k+1}), then it is mapped to the kth reconstruction level r_k. Methods for designing quantizers can be broken into two categories: uniform and nonuniform. The input-output function for a typical uniform quantizer is shown in Fig. 17.12. The mean square value of the quantizing noise can be easily calculated if it is assumed that the amplitude probability distribution is constant within each quantization step. The quantization step size for a uniform quantizer is

q = (d_{N+1} − d_1)/N

and all errors between −q/2 and q/2 are equally likely.
The mean square quantization error is given by

ē² = ∫_{−q/2}^{q/2} (e²/q) de = q²/12

FIGURE 17.11  (a) Frequency spectrum of a sampled signal with no aliasing. (b) Frequency spectrum of a sampled signal with aliasing.

If one takes into account the exact amplitude probability distribution, an optimal quantizer can be designed. Here the objective is to choose a set of decision levels and reconstruction levels that will yield the minimum quantization error. If f has a probability density function p_f(f), the mean square quantization error is

ē² = Σ_{i=1}^{N} ∫_{d_i}^{d_{i+1}} (f − r_i)² p_f(f) df

where N is the number of quantization levels. To minimize the mean square quantization error, ē² is differentiated with respect to d_i and r_i. This results in the Max quantizer:

d_i = (r_i + r_{i−1})/2

and

r_i = [ ∫_{d_i}^{d_{i+1}} f p_f(f) df ] / [ ∫_{d_i}^{d_{i+1}} p_f(f) df ]

Thus, the decision levels need to be midway between the reconstruction levels, and the reconstruction levels are at the centroid of that portion of p_f(f) between d_i and d_{i+1}. Unfortunately these requirements do not lead to an easy solution. Max used an iterative numerical technique to obtain solutions for various quantization levels assuming a zero-mean Gaussian input signal. These results and the quantization levels for other standard amplitude distributions can be found in Jain [1989]. A more common and less computationally intense approach to nonuniform quantization is to use a compandor (compressor–expander). The input signal is passed through a nonlinear compressor before being quantized uniformly. The output of the quantizer must then be expanded to the original dynamic range (see Fig. 17.13). The compression and expansion functions can be determined so that the compandor approximates a Max quantizer.

FIGURE 17.12  Characteristics of a uniform quantizer.

Vector Quantization

Quantization does not have to be done on a single pixel at a time.
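Before moving to block-based quantization, the q²/12 result derived above for the uniform quantizer is easy to check numerically; a dense grid of samples stands in for the flat amplitude distribution assumed in the derivation (the step size and signal range here are arbitrary illustrative choices):

```python
import numpy as np

q = 1.0 / 8.0                             # step size of a uniform quantizer on [0, 1)
f = np.arange(0.0, 1.0, 1.0 / 100000.0)   # dense grid ~ flat amplitude distribution

# Midpoint (staircase) quantizer: map each sample to the center of its step.
f_q = (np.floor(f / q) + 0.5) * q

e = f - f_q                               # quantization noise, uniform on (-q/2, q/2)
mse = np.mean(e**2)
assert abs(mse - q**2 / 12.0) < 1e-6      # matches the q^2/12 formula
```

With too coarse a step the same noise becomes visible as the false contours mentioned earlier; the formula quantifies how fast the error shrinks as q is reduced.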
In fact, better results can be achieved if the video data are quantized on a vector (block) basis. In vector quantization, the image data are first processed into a set of vectors. A code book (set of code words or templates) that best matches the data to be quantized is then generated. Each input vector is then quantized to the closest code word. Compression is achieved by transmitting only the indices for the code words. At the receiver, the images are reconstructed using a table look-up procedure. Two areas of ongoing research are finding better methods for designing the code books and developing better search and update techniques for matching the input vectors to the code words.

Video Compression

Digital representations of video signals typically require a very large number of bits. If the video signal is to be transmitted and/or stored, compression is often required. Applications include conventional and high-definition television, video phone, video conferencing, multimedia, remote-sensed imaging, and magnetic resonance imaging. The objective of compression (source encoding) is to find a representation that maximizes picture quality while minimizing the data per picture element (pixel). A wealth of compression algorithms has been developed during the past 30 years for both image and video compression. However, the ultimate choice of an appropriate algorithm is application dependent. The following summary will provide some guidance in that selection process. Compression algorithms can be divided into two major categories: information-preserving, or lossless, techniques and lossy techniques. Information-preserving techniques introduce no errors in the encoding/decoding process; thus, the original signal can be reconstructed exactly. Unfortunately, the achievable compression rate, i.e., the reduction in bit rate, is quite small, typically on the order of 3:1.
On the other hand, lossy techniques introduce errors in the coding/decoding process; thus, the received signal cannot be reconstructed exactly. The advantage of the lossy techniques is the ability to achieve much higher compression ratios. The limiting factor on the compression ratio is the required quality of the video signal in a specific application. One approach to compression is to reduce the spatial and/or temporal sampling rate and the number of quantization levels. Unfortunately, if the sampling is too low and the quantization too coarse, aliasing, contouring, and flickering will occur. These distortions are often much greater than the distortions introduced by more sophisticated techniques at the same compression rate. Compression systems can generally be modeled by the block diagram shown in Fig. 17.14. The first stage of the compression system is the mapper. This is an operation in which the input pixels are mapped into a representation that can be more effectively encoded. This stage is generally reversible. The second stage is the quantizer and performs the same type of operation as described earlier. This stage is not reversible. The final stage attempts to remove any remaining statistical redundancy. This stage is reversible and is typically achieved with one of the information-preserving coders.

Information-Preserving Coders

The data rate required for an original digital video signal may not represent its average information rate. If the original signal is represented by M possible independent symbols with probabilities p_i, i = 0, 1, ..., M − 1, then the information rate as given by the first-order entropy of the signal H is

H = −Σ_{i=0}^{M−1} p_i log2(p_i) bits per sample

FIGURE 17.13  Nonuniform quantization using a compandor.

According to Shannon’s coding theorem [see Jain, 1989], it is possible to perform lossless coding of a source with entropy H bits per symbol using H + ε bits per symbol, where ε is a small positive quantity.
The maximum obtainable compression rate C is then given by

C = (average bit rate of the original data) / (average bit rate of the encoded data)

Huffman Coding

One of the most efficient information-preserving (entropy) coding methods is Huffman coding. Construction of a Huffman code involves arranging the symbol probabilities in decreasing order and considering them as leaf nodes of a tree. The tree is constructed by merging the two nodes with the smallest probability to form a new node. The probability of the new node is the sum of the two merged nodes. This process is continued until only two nodes remain. At this point, 1 and 0 are arbitrarily assigned to the two remaining nodes. The process now moves down the tree, decomposing probabilities and assigning 1’s and 0’s to each new pair. The process continues until all symbols have been assigned a code word (string of 1’s and 0’s). An example is given in Fig. 17.15. Many other types of information-preserving compression schemes exist (see, for example, Gonzalez and Wintz [1987]), including arithmetic coding, the Lempel-Ziv algorithm, shift coding, and run-length coding.

FIGURE 17.14  Three-stage model of an encoder.

FIGURE 17.15  An example of constructing a Huffman code.

Predictive Coding

Traditionally one of the most popular methods for reducing the bit rate has been predictive coding. In this class, differential pulse-code modulation (DPCM) has been used extensively. A block diagram for a basic DPCM system is shown in Fig. 17.16. In such a system the difference between the current pixel and a predicted version of that pixel gets quantized, coded, and transmitted to the receiver. This difference is referred to as the prediction error and is given by

e_i = f_i − f̂_i

The prediction is based on previously transmitted and decoded spatial and/or temporal information and can be linear or nonlinear, fixed or adaptive. The difference signal e_i is then passed through a quantizer. The signal
at the output of the quantizer is the quantized prediction error e_iq, which is entropy encoded for transmission. The first step at the receiver is to decode the quantized prediction error. After decoding, e_iq is added to the predicted value of the current pixel f̂_i to yield the reconstructed pixel value. Note that as long as a quantizer is included in the system, the output signal will not exactly equal the input signal. The predictors can include pixels from the present frame as well as those from previous frames (see Fig. 17.17). If the motion and the spatial detail are not too high, frame (or field) prediction works well. If the motion and/or the spatial detail are high, intrafield prediction generally works better. A primary reason is that there is less correlation between frames and fields when the motion is high. For more information on predictive coding, see Musmann et al. [1985] or Jain [1989].

Motion-Compensated Predictive Coding

Significant improvements in image quality, at a fixed compression rate, can be obtained when adaptive prediction algorithms take into account the frame-to-frame displacement of moving objects in the sequence. Alternatively, one could increase the compression rate for a fixed level of image quality. The amount of increase in performance will depend on one's ability to estimate the motion in the scene. Techniques for estimating the motion are described in a later subsection. Motion-compensated prediction algorithms can be divided into two categories: one estimates the motion on a block-by-block basis, and the other estimates the motion one pixel at a time. For the block-based methods an estimate of the displacement is obtained for each block in the image. The block matching is achieved by finding the maximum correlation between a block in the current frame and a somewhat larger search area in the previous frame.
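The tree-merging construction described under Huffman Coding above can be sketched in a few lines (an illustrative implementation using a heap rather than an explicitly sorted list; tie-breaking, and hence the exact code words, may differ from Fig. 17.15):

```python
import heapq, itertools

def huffman_code(probs):
    """Build a Huffman code for symbols 0..M-1 with the given probabilities.

    Repeatedly merge the two least-probable nodes; on each merge, prepend
    '0' to the code words on one side and '1' on the other.
    Returns {symbol: code word}.
    """
    counter = itertools.count(len(probs))       # unique heap tie-breakers
    heap = [(p, i, {i: ""}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)         # smallest probability
        p1, _, c1 = heapq.heappop(heap)         # second smallest
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))
    return heap[0][2]

probs = [0.4, 0.3, 0.1, 0.1, 0.06, 0.04]
code = huffman_code(probs)
avg = sum(p * len(code[s]) for s, p in enumerate(probs))
# The average code length lies between H and H + 1 bits per symbol.
```

The resulting code is prefix-free, so the decoder can parse the bit stream without separators.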
A number of researchers have proposed ways to reduce the computational complexity, including using a simple matching criterion and using logarithmic searches for finding the peak value of the correlation. The second category obtains a displacement estimate at each pixel in a frame. These techniques are referred to as pel recursive methods. They tend to provide more accurate estimates of the displacement, but at the expense of higher complexity. Both categories of techniques have been applied to video data; however, block matching is used more often in real systems. The primary reason is that more efficient implementations have been feasible. It should be noted, however, that every pixel in a block will be assigned the same displacement estimate. Thus, the larger the block size, the greater the potential for errors in the displacement estimate for a given pixel. More details can be found in Musmann et al. [1985].

[Figure 17.16: Block diagram of a basic DPCM system.]
[Figure 17.17: Transform coding system.]

Transform Coding

In transform coding, the video signal f(x,y,t) is subjected to an invertible transform, then quantized and encoded (see Fig. 17.17). The purpose of the transformation is to convert statistically dependent picture elements into a set of statistically independent coefficients. In practice, one of the separable fast transforms in the class of unitary transforms is used, e.g., cosine, Fourier, or Hadamard. In general, the transform coding algorithms can be implemented in 2-D or 3-D. However, because of the real-time constraints of many video signal processing applications, it is typically more efficient to combine a 2-D transform with a predictive algorithm in the temporal direction, e.g., motion compensation. For 2-D transform coding the image data are first subdivided into blocks. Typical block sizes are 8 × 8 or 16 × 16.
The transform independently maps each image block into a block of transform coefficients; thus, the processing of each block can be done in parallel. At this stage the data have been mapped into a new representation, but no compression has occurred. In fact, with the Fourier transform there is an expansion in the amount of data. This occurs because the transform generates coefficients that are complex-valued. To achieve compression the transform coefficients must be quantized and then coded to remove any remaining redundancy. Two important issues in transform coding are the choice of transformation and the allocation of bits in the quantizer. The most commonly used transform is the discrete cosine transform (DCT). In fact, many of the proposed image and video standards utilize the DCT. The reasons for choosing a DCT include: its performance is superior to the other fast transforms and is very close to that of the optimal Karhunen-Loeve transform; it produces real-valued transform coefficients; and it has good symmetry properties, thus reducing the blocking artifacts inherent in block-based algorithms. One way to reduce these artifacts is by using a transform whose basis functions are even, i.e., the DCT; another is to use overlapping blocks. For bit allocation, one can determine the variance of the transform coefficients and then assign the bits so the distortion is minimized. An example of a typical bit allocation map is shown in Fig. 17.18.

Subband Coding

Recently, subband coding has proved to be an effective technique for image compression. Here, the original video signal is filtered into a set of bandpass signals (subbands), each sampled at successively lower rates. This process is known as the subband analysis stage. Each of the bandpass images is then quantized and encoded for transmission/storage. At the receiver, the signals must be decoded and then an image reconstructed from the subbands.
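As an illustration of the 2-D transform stage described under Transform Coding, here is a direct-from-definition orthonormal 2-D DCT-II of one block (an O(N^4) sketch for clarity; practical coders use fast separable algorithms):

```python
import math

def dct2(block):
    """Orthonormal 2-D DCT-II of an N x N block, computed directly
    from the definition. Returns real-valued coefficients."""
    n = len(block)
    def c(u):
        return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = c(u) * c(v) * s
    return coeffs

# A constant 8 x 8 block has all its energy in the DC coefficient:
flat = [[10.0] * 8 for _ in range(8)]
C = dct2(flat)
# C[0][0] == 80.0 (= 8 * 10); all other coefficients are ~ 0.
```

This energy compaction into a few low-frequency coefficients is what makes the quantizer's bit allocation effective.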
The process at the receiver is referred to as the subband synthesis stage. A one-level subband analysis results in 4 subbands, and a 2-level analysis in 16 equal subbands or 7 unequal subbands. A block diagram for a separable two-dimensional subband analysis system is shown in Fig. 17.19.

[Figure 17.18: A typical bit allocation for 16 × 16 block coding of an image using the DCT.]

HDTV

High-definition television (HDTV) has received much attention in the past few years. With the recent push for all-digital implementations of HDTV, the need for video signal processing techniques has become more obvious. In order for the digital HDTV signal to fit in the transmission bandwidth, there is a need for a compression ratio of approximately 10:1, with little or no degradation introduced. The goal of HDTV is to produce high-quality video signals by enhancing the detail and improving the aspect ratio and the viewing distance. The detail is enhanced by increasing the video bandwidth. The proposed aspect ratio of 16:9 will allow for a wide-screen format which is more consistent with the formats used in the motion-picture industry. The eye's ability to resolve fine detail is limited. To achieve full resolution of the detail, the HDTV image should be viewed at a distance of approximately three times the picture height. To accommodate typical home viewing environments, larger displays are needed.

Motion Estimation Techniques

Frame-to-frame changes in luminance are generated when objects move in video sequences. The luminance changes can be used to estimate the displacement of the moving objects if an appropriate model of the motion is specified.
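The subband analysis and synthesis stages described above can be illustrated with the two-tap Haar filter pair (an illustrative choice; practical subband coders use longer quadrature-mirror or wavelet filters). For a 2-D separable system, this 1-D stage is applied along rows and then columns to produce the four subbands:

```python
def haar_analysis(x):
    """Split x (even length) into lowpass and highpass subbands,
    each at half the input sampling rate (filter + 2:1 decimation)."""
    low = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    high = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    return low, high

def haar_synthesis(low, high):
    """Reconstruct the full-rate signal from the two subbands."""
    x = []
    for l, h in zip(low, high):
        x.extend([l + h, l - h])
    return x

x = [4.0, 6.0, 10.0, 2.0, 8.0, 8.0]
low, high = haar_analysis(x)     # low = [5.0, 6.0, 8.0], high = [-1.0, 4.0, 0.0]
assert haar_synthesis(low, high) == x    # perfect reconstruction
```

With no quantization in between, the synthesis stage reconstructs the input exactly; compression comes from quantizing the subbands, typically spending fewer bits on the highpass bands.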
A variety of motion models have been developed for dynamic scene analysis in machine vision and for video communications applications. In fact, motion estimates were first used as a control mechanism for the efficient coding of a sequence of images in an effort to reduce the temporal redundancy. Motion estimation algorithms can be classified in two broad categories: gradient or differential-based methods and token matching or correspondence methods. The gradient methods can be further divided into pel recursive, block matching, and optical flow methods.

Pel Recursive Methods

Netravali and Robbins [1979] developed the first pel recursive method for television signal compression. The algorithm begins with an initial estimate of the displacement, then iterates recursively to update the estimate. The iterations can be performed at a single pixel or at successive pixels along a scan line. The true displacement D at each pixel is estimated by

    D̂_i = D̂_{i-1} + U_i

where D̂_i is the displacement estimate at the ith iteration and U_i is the update term; U_i is an estimate of D - D̂_{i-1}. They then used the displaced frame difference (DFD)

    DFD(x, y, D̂_{i-1}) = I(x, y, t) - I(x - D̂_{i-1}, t - T_S)

to obtain a relationship for the update term U_i. In the previous equation, T_S is the temporal sample spacing and x denotes the spatial position vector. If the displacement estimate is updated from sample to sample using a steepest-descent algorithm to minimize the weighted sum of the squared displaced frame differences over a neighborhood, then D̂_i becomes

    D̂_i = D̂_{i-1} - (ε/2) ∇_D [ Σ_j W_j DFD²(x_j, D̂_{i-1}) ]

where W_j ≥ 0 and

    Σ_j W_j = 1

A graphical representation of pel recursive motion estimation is shown in Fig. 17.20. A variety of methods to calculate the update term have been reported. The advantage of one method over another is mainly in the improvement in compression. It should be noted that pel recursive algorithms assume that the displacement to be estimated is small. If the displacement is large, the estimates will be poor. Noise can also affect the accuracy of the estimate.

[Figure 17.19: A two-dimensional subband analysis system for generating four equal subbands.]
[Figure 17.20: A graphical illustration of pel recursive motion estimation. The distance between the x and o pixels in the frame at t - 1 is D̂_{i-1}.]

Block Matching

Block matching methods estimate the displacement within an M × N block in an image frame. The estimate is determined by finding the best match for the M × N block in the frame at time t within the frame at t - T_S. An underlying assumption in the block matching techniques is that each pixel within a block has the same displacement. A general block matching algorithm is given as follows:

1. Segment the image frame at time t into a fixed number of blocks of size M × N.
2. Specify the size of the search area in the frame at time t - 1. This depends on the maximum expected displacement. If D_max is the maximum displacement in either the horizontal or vertical direction, then the size of the search area, SA, is

       SA = (M + 2D_max) × (N + 2D_max)

   Figure 17.21 illustrates the search area in the frame at time t - 1 for an M × N block at time t.
3. Using an appropriately defined matching criterion, e.g., mean-squared error or sum of absolute differences, find the best match for the M × N block.
4. Proceed to the next block in frame t and repeat step 3 until displacement estimates have been determined for all blocks in the image.

Optical Flow Methods

The optical flow is defined as the apparent motion of the brightness patterns from one frame to the next. The optical flow is an estimate of the velocity field and hence requires two equations to solve for it. Typically a constraint is imposed on the motion model to provide the necessary equations.
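Steps 1 through 4 of the block matching algorithm above can be sketched as follows (an exhaustive search using the sum of absolute differences; the frame size, block size, and D_max values are illustrative, and the search is clipped at the frame borders):

```python
def block_match(cur, prev, M=4, N=4, d_max=2):
    """Estimate one displacement per M x N block of the current frame by
    exhaustively searching an (M + 2*d_max) x (N + 2*d_max) area of the
    previous frame, minimizing the sum of absolute differences (SAD)."""
    rows, cols = len(cur), len(cur[0])
    estimates = {}
    for bi in range(0, rows - M + 1, M):            # step 1: tile the frame
        for bj in range(0, cols - N + 1, N):
            best = None
            for di in range(-d_max, d_max + 1):     # step 2: search area
                for dj in range(-d_max, d_max + 1):
                    si, sj = bi + di, bj + dj
                    if si < 0 or sj < 0 or si + M > rows or sj + N > cols:
                        continue                     # clip at frame borders
                    sad = sum(abs(cur[bi + i][bj + j] - prev[si + i][sj + j])
                              for i in range(M) for j in range(N))
                    if best is None or sad < best[0]:  # step 3: best match
                        best = (sad, di, dj)
            estimates[(bi, bj)] = (best[1], best[2])   # step 4: next block
    return estimates

# A bright 2 x 2 square that moved down and right by one pixel:
prev = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
for i in (4, 5):
    for j in (4, 5):
        prev[i][j] = 9
        cur[i + 1][j + 1] = 9
est = block_match(cur, prev)
# The block at (4, 4) is matched at displacement (-1, -1): its content
# came from one pixel up and to the left in the previous frame.
```

Note that, as the text points out, every pixel in a block receives the same displacement estimate.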
Optical flow can give useful information about the spatial arrangement of the objects in a scene, as well as the rate of change of those objects. Horn [1986] also defines a motion field, which is a two-dimensional velocity field resulting from the projection of the three-dimensional velocity field of an object in the scene onto the image plane. The motion field and the optical flow are not the same. In general, the optical flow has been found difficult to compute because of the algorithm's sensitivity to noise. Also, the estimates may not be accurate at scene discontinuities. However, because of its importance in assigning a velocity vector at each pixel, there continues to be research in the field. The optical flow equation is based on the assumption that the brightness of a pixel at location (x,y) is constant over time; thus,

    I_x (dx/dt) + I_y (dy/dt) + I_t = 0

where dx/dt and dy/dt are the components of the optical flow, and I_x, I_y, and I_t denote the partial derivatives of the image brightness with respect to x, y, and t. Several different constraints have been used with the optical flow equation to solve for dx/dt and dy/dt. A common constraint to impose is that the velocity field is smooth.

Token Matching Methods

Token matching methods are often referred to as discrete methods since the goal is to estimate the motion only at distinct image features (tokens). The result is a sparse velocity field. The algorithms attempt to match the set of discrete features in the frame at time t - 1 with a set that best resembles them in the frame at time t. Most of the algorithms in this group assume that the estimation will be achieved in a two-step process. In the first step, the features are identified. The features could be points, corners, centers of mass, lines, or edges. This step typically requires segmentation and/or feature extraction. The second step determines the various velocity parameters. The velocity parameters include a translation component, a rotation component, and the rotation axis. The token matching algorithms fail if there are no distinct features to use.

All of the methods described in this subsection assume that the intensity at a given pixel location is reasonably constant over time. In addition, the gradient methods assume that the size of the displacements is small. Block matching algorithms have been used extensively in real systems, because the computational complexity is not too great. The one disadvantage is that there is only one displacement estimate per block. To date, optical flow algorithms have found limited use because of their sensitivity to noise. Token matching methods work well for applications in which the features are well defined and easily extracted. They are probably not suitable for most video communications applications.

[Figure 17.21: An illustration of block matching.]

Image Quality and Visual Perception

An important factor in designing video signal processing algorithms is that the final receiver of the video information is typically a human observer. This has an impact on how the quality of the final signal is assessed and how the processing should be performed. If our objective is video transmission over a limited bandwidth channel, we do not want to waste unnecessary bits on information that cannot be seen by the human observer. In addition, it is undesirable to introduce artifacts that are particularly annoying to the human viewer. Unfortunately, there are no perfect quantitative measures of visual perception. The human visual system is quite complicated. In spite of the advances that have been made, no complete model of human perception exists. Therefore, we often have to rely on subjective testing to evaluate picture quality. Although no comprehensive model of human vision exists, certain functions can be characterized and then used in designing improved solutions. For more information, see Netravali and Haskell [1988].
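Returning to the optical flow equation above: one concrete way to obtain the two needed equations is a local least-squares (Lucas-Kanade-style) window constraint, rather than the global smoothness constraint mentioned in the text. This is an illustrative sketch, not the handbook's method; the quadratic test image and window radius are assumptions, and estimates on discrete data are only approximate:

```python
def flow_at(I1, I2, x, y, r=1):
    """Estimate the flow (u, v) at pixel (x, y): u along the column index j,
    v along the row index i. The constraint I_x u + I_y v + I_t = 0 is
    solved in least squares over a (2r+1) x (2r+1) window, giving a
    2 x 2 linear system."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for i in range(x - r, x + r + 1):
        for j in range(y - r, y + r + 1):
            ix = (I1[i][j + 1] - I1[i][j - 1]) / 2.0  # I_x (central diff, along j)
            iy = (I1[i + 1][j] - I1[i - 1][j]) / 2.0  # I_y (central diff, along i)
            it = I2[i][j] - I1[i][j]                  # I_t (frame difference)
            a11 += ix * ix; a12 += ix * iy; a22 += iy * iy
            b1 -= ix * it;  b2 -= iy * it
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        return None            # aperture problem: flow not determined locally
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

# A quadratic brightness pattern shifted by one pixel along j:
I1 = [[float(i * i + j * j) for j in range(9)] for i in range(9)]
I2 = [[float(i * i + (j - 1) * (j - 1)) for j in range(9)] for i in range(9)]
u, v = flow_at(I1, I2, 4, 4)
# (u, v) is approximately (1, 0): the pattern moved one pixel along j.
```

The `None` branch illustrates the noise-sensitivity issue noted in the text: where the local gradients do not span both directions, the system is singular and no reliable velocity vector exists.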
Subjective Quality Ratings

There are two primary categories of subjective testing: category-judgment (rating-scale) methods and comparison methods. Category-judgment methods ask the subjects to view a sequence of pictures and assign each picture (video sequence) to one of several categories. Categories may be based on overall quality or on visibility of impairment (see Table 17.2). Comparison methods require the subjects to compare a distorted test picture with a reference picture. Distortion is added to the test picture until both pictures appear of the same quality to the subject. Viewing conditions can have a great impact on the results of such tests. Care must be taken in the experimental design to avoid biases in the results.

Visual Perception

In this subsection, a review of the major aspects of human psychophysics that have an impact on video signal processing is given. The phenomena of interest include light adaptation, visual thresholding and contrast sensitivity, masking, and temporal phenomena.

Light Adaptation

The human visual system (HVS) has two major classes of photoreceptors, the rods and the cones. Because these two types of receptors adapt to light differently, two different adaptation time constants exist. Furthermore, these receptors respond at different rates going from dark to light than from light to dark. It should also be noted that although the HVS can adapt to an enormous range of light intensity levels, on the order of 10^10 in millilamberts, it does so adaptively; the simultaneous range is on the order of 10^3.

Visual Thresholding and Contrast Sensitivity

Determining how sensitive an observer is to small changes in luminance is important in the design of video systems. One's sensitivity will determine how visible noise will be and how accurately the luminance must be represented.
The contrast sensitivity is determined by measuring the just-noticeable difference (JND) as a function of the brightness. The JND is the amount of additional brightness needed to distinguish a patch from the background; it is a visibility threshold. What is significant is that the JND depends on the background and surrounding luminances, the size of the background and surrounding areas, and the size of the patch, with the primary dependence being on the luminance of the background.

TABLE 17.2  Quality and Impairment Ratings

    Quality scale    Impairment scale                  Comparison scale
    5  Excellent     5  Imperceptible                   3  Much better
    4  Good          4  Perceptible but not annoying    2  Better
    3  Fair          3  Slightly annoying               1  Slightly better
    2  Poor          2  Annoying                        0  Same
    1  Bad           1  Very annoying                  -1  Slightly worse
                                                       -2  Worse
                                                       -3  Much worse

Masking

The response to visual stimuli is greatly affected by what other visual stimuli are in the immediate neighborhood (spatially and temporally). An example is the reduced sensitivity of the HVS to noise in areas of high spatial activity. Another example is the masking of details in a new scene by what was present in the previous scene. In both cases, the masking phenomenon can be used to improve the quality of image compression systems.

Temporal Effects

One relevant temporal phenomenon is the flicker fusion frequency. This is a temporal threshold which determines the point at which the HVS fuses the motion in a sequence of frames. Unfortunately this frequency varies as a function of the average luminance. The HVS is more sensitive to flicker at high luminances than at low luminances. The spatial-temporal frequency response of the HVS is important in determining the sensitivity to small-amplitude stimuli. In both the temporal and spatial directions, the HVS responds as a bandpass filter (see Fig. 17.22). Also significant is the fact that the spatial and temporal properties are not independent of one another, especially at low frequencies.
For more details on image quality and visual perception see Schreiber [1991] and Netravali and Haskell [1988].

[Figure 17.22: A perspective view of the spatio-temporal threshold surface.]

Defining Terms

Aliasing: Distortion introduced in a digital signal when it is undersampled.
Compression: Process of compactly representing the information contained in a signal.
Motion estimation: Process of estimating the displacement of moving objects in a scene.
Quantization: Process of converting a continuous-valued signal into a discrete-valued signal.
Sampling: Process of converting a continuous-time/space signal into a discrete-time/space signal.
Scanning system: System used to capture a new image at periodic intervals in time and to convert the image into a digital representation.

Related Topics

8.5 Sampled Data • 15.1 Coding, Transmission, and Storage

References

R. C. Gonzalez and P. Wintz, Digital Image Processing, Reading, Mass.: Addison-Wesley, 1987.
R. A. Haddad and T. W. Parsons, Digital Signal Processing: Theory, Applications, and Hardware, New York: Computer Science Press, 1991.
B. P. Horn, Robot Vision, Cambridge, Mass.: The MIT Press, 1986.
A. K. Jain, Fundamentals of Digital Image Processing, Englewood Cliffs, N.J.: Prentice-Hall, 1989.
N. Jayant, "Signal compression: Technology targets and research directions," IEEE Journal on Selected Areas in Communications, vol. 10, no. 5, pp. 796–818, 1992.
H. G. Musmann, P. Pirsch, and H.-J. Grallert, "Advances in picture coding," Proc. IEEE, vol. 73, no. 4, pp. 523–548, 1985.
A. N. Netravali and B. G. Haskell, Digital Pictures: Representation and Compression, New York: Plenum Press, 1988.
A. N. Netravali and J. D. Robbins, "Motion-compensated television coding: Part I," Bell Syst. Tech. J., vol. 58, no. 3, pp. 631–670, 1979.
A. V. Oppenheim, A. S. Willsky, and I. T. Young, Signals and Systems, Englewood Cliffs, N.J.: Prentice-Hall, 1983.
W. F.
Schreiber, Fundamentals of Electronic Imaging Systems, Berlin: Springer-Verlag, 1991.

Further Information

Other recommended sources of information include IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Image Processing, the Proceedings of the IEEE, April 1985, vol. 73, and Multidimensional Systems and Signal Processing Journal, 1992, vol. 3.

17.3 Sensor Array Processing

N. K. Bose and L. H. Sibul

Multidimensional signal processing tools apply to aperture and sensor array processing. Planar sensor arrays can be considered to be sampled apertures. Three-dimensional or volumetric arrays can be viewed as multidimensional spatial filters. Therefore, the topics of sensor array processing, aperture processing, and multidimensional signal processing can be studied under a unified format. The basic function of the receiving array is transduction of propagating waves in the medium into electrical signals. Propagating waves are fundamental in radar, communication, optics, sonar, and geophysics. In electromagnetic applications, basic transducers are antennas and arrays of antennas. A large body of literature that exists on antennas and antenna arrays can be exploited in the areas of aperture and sensor array processing. Much of the antenna literature deals with transmitting antennas and their radiation patterns. Because of the reciprocity of transmitting and receiving transducers, key results that have been developed for transmitters can be used for analysis of receiver aperture and/or array processing. Transmitting transducers radiate energy in desired directions, whereas receiving apertures/arrays act as spatial filters that emphasize signals from a desired look direction while discriminating against interferences from other directions. The spatial filter wavenumber response is called the receiver beam pattern. Transmitting apertures are characterized by their radiation patterns.
Conventional beamforming deals with the design of fixed beam patterns for given specifications. Optimum beamforming is the design of beam patterns to meet a specified optimization criterion; it can be compared to optimum filtering, detection, and estimation. Adaptive beamformers sense their operating environment (for example, the noise covariance matrix) and adjust beamformer parameters so that their performance is optimized [Monzingo and Miller, 1980]. Adaptive beamformers can be compared with adaptive filters.

Multidimensional signal processing techniques have found wide application in seismology, where groups of identical seismometers, called seismic arrays, are used for event location, studies of the earth's sedimentation structure, and separation of coherent signals from noise by employing velocity filtering [Claerbout, 1976]; such noise may itself propagate coherently across the array, but with a different horizontal velocity. Velocity filtering is performed by multidimensional filters and also allows for the enhancement of signals which may occupy the same wavenumber range as noise or undesired signals. In a broader context, beamforming can be used to separate signals received by sensor arrays based on frequency, wavenumber, and velocity (speed as well as direction) of propagation. Both the transfer function and the unit impulse response of a velocity filter are two-dimensional functions in the case of one-dimensional arrays. The transfer function involves frequency and wavenumber (due to spatial sampling by equally spaced sensors) as independent variables, whereas the unit impulse response depends upon time and location within the array. Two-dimensional filtering is not limited to velocity filtering by means of seismic arrays. Two-dimensional spatial filters are frequently used, for example, in the interpretation of gravity and magnetic maps to differentiate between regional and local features.
Input data for these filters may be observations in the survey of an area conducted over a planar grid over the earth's surface. Two-dimensional wavenumber digital filtering principles are useful for this purpose. Velocity filtering by means of two-dimensional arrays may be accomplished by properly shaping a three-dimensional response function H(k_1, k_2, ω). Velocity filtering by three-dimensional arrays may be accomplished through a four-dimensional function H(k_1, k_2, k_3, ω), as explained in the following subsection.

Spatial Arrays, Beamformers, and FIR Filters

A propagating plane wave, s(x, t), is, in general, a function of the three space variables x ≜ (x_1, x_2, x_3) and the time variable t. The 4-D Fourier transform of the stationary signal s(x, t) is

    S(k, ω) = ∫∫∫∫ s(x, t) exp[-j(ωt - Σ_{i=1}^{3} k_i x_i)] dx_1 dx_2 dx_3 dt   (17.3)

which is referred to as the wavenumber-frequency spectrum of s(x, t), where k ≜ (k_1, k_2, k_3) denotes the wavenumber variables in radians per unit distance and ω is the frequency variable in radians per second. If c denotes the velocity of propagation of the plane wave, the following constraint must be satisfied:

    k_1² + k_2² + k_3² = ω²/c²

If the 4-D Fourier transform of the unit impulse response h(x, t) of a 4-D linear shift-invariant (LSI) filter is denoted by H(k, ω), then the response y(x, t) of the filter to s(x, t) is the 4-D linear convolution of h(x, t) and s(x, t), which is uniquely characterized by its 4-D Fourier transform

    Y(k, ω) = H(k, ω) S(k, ω)   (17.4)

The inverse 4-D Fourier transform, which forms a 4-D Fourier transform pair with Eq. (17.3), is

    s(x, t) = (1/(2π)^4) ∫∫∫∫ S(k, ω) exp[j(ωt - Σ_{i=1}^{3} k_i x_i)] dk_1 dk_2 dk_3 dω   (17.5)

It is noted that S(k, ω) in Eq. (17.3) is product separable, i.e., expressible in the form

    S(k, ω) = S_1(k_1) S_2(k_2) S_3(k_3) S_4(ω)   (17.6)
where each function on the right-hand side is a univariate function of the respective independent variable, if and only if s(x, t) in Eq. (17.3) is also product separable. In beamforming, S_i(k_i) in Eq. (17.6) would be the far-field beam pattern of a linear array along the x_i-axis. For example, the normalized beam pattern of a uniformly weighted (shaded) linear array of length L is

    S(k, θ) = sin[(kL/2) sin θ] / [(kL/2) sin θ]

where λ = 2π/k is the wavelength of the propagating plane wave and θ is the angle of arrival at the array site, as shown in Fig. 17.23. Note that θ is explicitly admitted as a variable in S(k, θ) to allow for the possibility that, for a fixed wavenumber, the beam pattern could be plotted as a function of the angle of arrival. In that case, when θ is zero, the wave impinges on the array broadside and the normalized beam pattern evaluates to unity. The counterpart, in aperture and sensor array processing, of the use of window functions in spectral analysis for reduction of sidelobes is the use of aperture shading. In aperture shading, one simply multiplies a uniformly weighted aperture by the shading function. The resulting beam pattern is, then, simply the convolution of the beam pattern of the uniformly shaded volumetric array and the beam pattern of the shading function. The Fourier transform relationship between the stationary signal s(x, t) and the wavenumber-frequency spectrum S(k, ω) allows one to exploit high-resolution spectral analysis techniques for the high-resolution estimation of the direction of arrival [Pillai, 1989]. The superscripts *, t, and H denote, respectively, complex conjugate, transpose, and conjugate transpose.

Discrete Arrays for Beamforming

An array of sensors could be distributed at distinct points in space in various ways. Line arrays, planar arrays, and volumetric arrays could be either uniformly spaced or nonuniformly spaced, including the possibility of placing sensors randomly according to some probability distribution function. Uniform spacing along each coordinate axis permits one to exploit the well-developed multidimensional signal processing techniques concerned with filter design, DFT computation via FFT, and high-resolution spectral analysis of sampled signals [Dudgeon, 1977]. Nonuniform spacing sometimes might be useful for reducing the number of sensors, which otherwise might be constrained to satisfy a maximum spacing between uniformly placed sensors to avoid grating lobes due to aliasing, as explained later. A discrete array, uniformly spaced, is convenient for the synthesis of a digital filter or beamformer by performing digital signal processing operations (namely delay, sum, and multiplication or weighting) on the signal received by a collection of sensors distributed in space. The sequence and nature of the operations dictate the type of beamformer. Common beamforming systems are of the straight summation, delay-and-sum, and weighted delay-and-sum types. The geometrical distribution of sensors and the weights w_i associated with each sensor are crucial factors in the shaping of the filter characteristics. In the case of a linear array of N equispaced sensors, which are spaced D units apart, starting at the origin x_1 = 0, the function

    W(k_1) = (1/N) Σ_{n=0}^{N-1} w_n exp(-jk_1 nD)   (17.8)

becomes the array pattern, which may be viewed as the frequency response function for a finite impulse response (FIR) filter characterized by the unit impulse response sequence {w_n}. In the case when w_n = 1, Eq. (17.8) simplifies to

    W(k_1) = (1/N) [sin(k_1 ND/2) / sin(k_1 D/2)] exp[-j(N - 1)k_1 D/2]

If the N sensors are symmetrically placed on both sides of the origin, including one at the origin, and the sensor weights are w_n = 1, then the linear array pattern becomes

    W(k_1) = (1/N) sin(k_1 ND/2) / sin(k_1 D/2)   (17.9)

[Figure 17.23: Uniformly weighted linear array.]

For planar arrays, direct generalizations of the preceding linear array results can be obtained.
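The uniformly weighted linear array pattern above, (1/N) sin(k_1 ND/2)/sin(k_1 D/2), and its grating lobes can be checked numerically (a minimal sketch; the values of N and D are illustrative):

```python
import math

def array_pattern(k1, N, D):
    """|W(k1)| for an N-element uniformly weighted array with spacing D:
    (1/N) |sin(N k1 D / 2) / sin(k1 D / 2)|, with the limiting value 1
    where the denominator vanishes (main lobe and grating lobes)."""
    num = math.sin(N * k1 * D / 2)
    den = math.sin(k1 * D / 2)
    if abs(den) < 1e-12:      # main lobe, or a grating lobe at multiples of 2*pi/D
        return 1.0
    return abs(num / den) / N

N, D = 8, 0.5
assert array_pattern(0.0, N, D) == 1.0                  # main lobe
assert array_pattern(2 * math.pi / D, N, D) == 1.0      # grating lobe at 2*pi/D
assert array_pattern(2 * math.pi / (N * D), N, D) < 1e-9  # first null
```

The main lobe repeats every 2π/D in wavenumber, which is exactly the spatial-aliasing picture described in the text; halving D pushes the grating lobes out by a factor of two.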
To wit, if the sensors with unity weights are located at coordinates (kD, lD), where k = 0, ±1, ±2, . . ., ±[(N–1)/2], and l = 0, ±1, ±2, . . ., ±[(M–1)/2], for odd integer values of N and M, then the array pattern function becomes (17.10) Routine generalizations to 3-D spatial arrays are also possible. The array pattern functions for other geometrical distributions may also be routinely generated. For example, if unit weight sensors are located at the six vertices and the center of a regular hexagon, each of whose sides is D units long, then the array pattern function can be shown to be Wk N we n jknD n N () 1 0 1 1 1 = - = - ? Wk N kND kD j NkD () sin sin exp () 1 1 1 1 1 2 2 1 2 = ? è ? ? ? ÷ ? è ? ? ? ÷ - - ì í ? ? ? ü y ? t ? Wk N kND kD () sin sin 1 1 1 1 2 2 = ? è ? ? ? ÷ ? è ? ? ? ÷ Wkk NM jkkD klD NM kND kD kMD kD lk M M N N (, ) exp{( } sin sin sin sin 12 1 2 1 1 2 2 1 1 2 2 2 2 1 2 1 2 1 2 1 2 =-+ = ? è ? ? ? ÷ ? è ? ? ? ÷ ? è ? ? ? ÷ ? è ? ? ? ÷ =- () ? è ? ? =- () ? è ? ? - - - - ?? ? 2000 by CRC Press LLC (17.11) The array pattern function reveals how selective a particular beamforming system is. In the case of a typical array function shown in Eq. (17.9), the beamwidth, which is the width of the main lobe of the array pattern, is inversely proportional to the array aperture. Because of the periodicity of the array pattern function, the main lobe is repeated at intervals of 2p/D. These repetitive lobes are called grating lobes, whose existence may be interpreted in terms of spatial frequency aliasing resulting from a sampling interval D due to the N receiving sensors located at discrete points in space. If the spacing D between sensors satisfies (17.12) where l is the smallest wavelength component in the signal received by the array of sensors, then the grating lobes have no effect on the received signal. A plane wave of unit amplitude which is incident upon the array at bearing angle u degrees, as shown in Fig. 
17.23, produces outputs at the sensors given by the vector

    s(θ) ≜ s_θ = [exp(j0)  exp(j k_1 D sin θ)  . . .  exp(j k_1 (N – 1) D sin θ)]^t    (17.13)

where k_1 = 2π/λ is the wavenumber. In array processing, the array output y_θ may be viewed as the inner product of an array weight vector w and the steering vector s_θ. Thus, the beamformer response along a direction characterized by the angle θ is, treating w as complex,

    y_θ = (w, s_θ) = \sum_{k=0}^{N-1} w_k^* exp(j k_1 k D sin θ)    (17.14)

The beamforming system is said to be robust if it performs satisfactorily despite certain perturbations [Ahmed and Evans, 1982]. It is possible for each component s_kθ of s_θ to belong to an interval [s_kθ – f_kθ, s_kθ + f_kθ], and a robust beamformer will require the existence of at least one weight vector w which will guarantee the output y_θ to belong to an output envelope for each s_θ in the input envelope. The robust beamforming problem can be translated into an optimization problem, which may be tackled by minimizing the value of the array output power

    P(θ) = w^H(θ) R w(θ)    (17.15)

when the response to a unit amplitude plane wave incident at the steering direction θ is constrained to be unity, i.e., w^H(θ)s(θ) = 1, where R is the additive noise-corrupted signal autocorrelation matrix. The solution is called the minimum variance beamformer and is given by

    w_MV(θ) = R^{-1} s(θ) / [s^H(θ) R^{-1} s(θ)]    (17.16)

and the corresponding power output is

    P_MV(θ) = 1 / [s^H(θ) R^{-1} s(θ)]    (17.17)

The minimum variance power as a function of θ can be used as a form of the data-adaptive estimate of the directional power spectrum. However, in this mode of solution, the coefficient vector is unconstrained except at the steering direction.
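The minimum variance beamformer of Eqs. (17.16) and (17.17) can be sketched numerically. The scenario below (array size, spacing, and the noise model used to build R) is an illustrative assumption, not taken from the text:

```python
import numpy as np

# Minimum variance beamformer sketch: unit-power plane wave from the steering
# direction plus white noise (assumed model for the autocorrelation matrix R).
N = 6
d_over_lambda = 0.5                      # D / lambda, satisfying Eq. (17.12)
theta = np.deg2rad(20.0)                 # steering direction

n = np.arange(N)
s = np.exp(1j * 2 * np.pi * d_over_lambda * n * np.sin(theta))  # steering vector, Eq. (17.13)

R = np.outer(s, s.conj()) + 0.1 * np.eye(N)   # noise-corrupted signal autocorrelation

Rinv_s = np.linalg.solve(R, s)
w = Rinv_s / (s.conj() @ Rinv_s)         # w_MV, Eq. (17.16)
P = 1.0 / np.real(s.conj() @ Rinv_s)     # P_MV, Eq. (17.17)

unity_response = w.conj() @ s            # constraint w^H s = 1 should hold exactly
```

The distortionless constraint w^H(θ)s(θ) = 1 is recovered by construction, while the white-noise term keeps R invertible.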
Consequently, a signal tends to be regarded as an unwanted interference and is, therefore, suppressed in the beamformed output unless it is almost exactly aligned with the steering direction. Therefore, it is desirable to broaden the signal acceptance angle while at the same time preserving the optimum beamformer's ability to reject noise and interference outside this region of angles. One way of achieving this is by the application of the principle of superdirectivity.

Discrete Arrays and Polynomials

It is common practice to relate discrete arrays to polynomials for array synthesis purposes [Steinberg, 1976]. For volumetric equispaced arrays (it is only necessary that the spacing be uniform along each coordinate axis, so that the spatial sampling periods D_i and D_j along, respectively, the ith and jth coordinate axes could be different for i ≠ j), the weight associated with the sensor located at coordinate (i_1 D_1, i_2 D_2, i_3 D_3) is denoted by w(i_1, i_2, i_3). The function in the complex variables z_1, z_2, and z_3 that is associated with the sequence {w(i_1, i_2, i_3)} is the generating function for the sequence and is denoted by

    W(z_1, z_2, z_3) = \sum_{i_3} \sum_{i_2} \sum_{i_1} w(i_1, i_2, i_3) z_1^{i_1} z_2^{i_2} z_3^{i_3}    (17.18)

In the electrical engineering and geophysics literature, the generating function W(z_1, z_2, z_3) is sometimes called the z-transform of the sequence {w(i_1, i_2, i_3)}. When there are a finite number of sensors, a realistic assumption for any physical discrete array, W(z_1, z_2, z_3) becomes a trivariate polynomial. In the special case when w(i_1, i_2, i_3) is product separable, the polynomial W(z_1, z_2, z_3) is also product separable. In particular, this separability property holds when the shading is uniform, i.e., w(i_1, i_2, i_3) = 1. When the support of the uniform shading function is defined by i_1 = 0, 1, . . ., N_1 – 1, i_2 = 0, 1, . . ., N_2 – 1, and i_3 = 0, 1, . .
., N_3 – 1, the associated polynomial becomes

    W(z_1, z_2, z_3) = \sum_{i_1=0}^{N_1-1} \sum_{i_2=0}^{N_2-1} \sum_{i_3=0}^{N_3-1} z_1^{i_1} z_2^{i_2} z_3^{i_3} = \prod_{k=1}^{3} (z_k^{N_k} – 1)/(z_k – 1)    (17.19)

In this case, all results developed for the synthesis of linear arrays become directly applicable to the synthesis of volumetric arrays. For a linear uniform discrete array composed of N_1 sensors with intersensor spacing D_1, starting at the origin and receiving a signal at a known fixed wavenumber k_1 at a receiving angle θ, the far-field beam pattern

    S(k_1, θ) ≜ \sum_{r=0}^{N_1-1} e^{j k_1 r D_1 sin θ}

may be associated with the polynomial \sum_{r=0}^{N_1-1} z_1^r by setting z_1 = e^{j k_1 D_1 sin θ}. This polynomial has all its zeros on the unit circle in the z_1-plane. If the array just considered is not uniform but has a weighting factor w_r, for r = 0, 1, . . ., N_1 – 1, the space factor

    Q(θ) ≜ \sum_{r=0}^{N_1-1} w_r e^{j k_1 r D_1 sin θ}

may again be associated with a polynomial, \sum_{r=0}^{N_1-1} w_r z_1^r. By the pattern multiplication theorem, it is possible to get the polynomial associated with the total beam pattern of an array with weighted sensors by multiplying the polynomial associated with the array element pattern and the polynomial associated with the space factor Q(θ). The array factor |Q(θ)|^2 may also be associated with the polynomial spectral factor

    |Q(θ)|^2 ↔ [\sum_{r=0}^{N_1-1} w_r z_1^r] [\sum_{r=0}^{N_1-1} w_r^* (z_1^*)^r]    (17.20)

where the weighting (shading) factor is allowed to be complex. Uniformly distributed apertures and uniformly spaced volumetric arrays which admit product separable sensor weightings can be treated by using the well-developed theory of linear discrete arrays and their associated polynomials. When the product separability property does not hold, scope exists for applying results from multidimensional systems theory [Bose, 1982] concerning multivariate polynomials to the synthesis problem of volumetric arrays.
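The pattern multiplication theorem above can be checked numerically with polynomial products; the element pattern, weights, and spacing below are illustrative choices, not values from the text:

```python
import numpy as np

# Pattern multiplication: the polynomial of the total beam pattern is the
# product (convolution of coefficients) of the element-pattern polynomial and
# the space-factor polynomial.
w = np.array([1.0, 2.0, 3.0, 2.0, 1.0])   # space-factor weights w_r (illustrative)
e = np.array([1.0, 1.0])                  # element-pattern polynomial coefficients (illustrative)

total = np.convolve(e, w)                 # polynomial of the total beam pattern

theta = np.deg2rad(30.0)
z1 = np.exp(1j * 2 * np.pi * 0.5 * np.sin(theta))   # z1 = exp(j k1 D1 sin(theta)), D1/lambda = 0.5

def evaluate(p, z):
    """Evaluate sum_r p[r] * z**r."""
    return (z ** np.arange(len(p))) @ p

lhs = evaluate(total, z1)                 # total pattern evaluated directly
rhs = evaluate(e, z1) * evaluate(w, z1)   # product of element pattern and space factor

# Uniform shading: 1 + z + ... + z^(N-1) has all its zeros on the unit circle.
zeros = np.roots(np.ones(5))
```

The two evaluations agree because multiplying polynomials corresponds to convolving their coefficient sequences.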
Velocity Filtering

Combination of individual sensor outputs in a more sophisticated way than the delay-and-sum technique leads to the design of multichannel velocity filters for linear and planar as well as spatial arrays. Consider, first, a linear (1-D) array of sensors, which will be used to implement velocity discrimination. The pass and rejection zones are defined by straight lines in the (k_1, ω)-plane, where

    k_1 = ω/V = (ω sin θ)/v

is the wavenumber, ω the angular frequency in radians/second, V the apparent velocity on the earth's surface along the array line, v the velocity of wave propagation, and θ the horizontal arrival direction. The transfer function of a "pie-slice" or "fan" velocity filter [Bose, 1985],

    H(k_1, ω) = 1,  –|ω|/V ≤ k_1 ≤ |ω|/V
              = 0,  otherwise

totally rejects wavenumbers outside the range –|ω|/V ≤ k_1 ≤ |ω|/V and completely passes wavenumbers within that range. Thus, the transfer function defines a high-pass filter which passes signals with apparent velocities of magnitude greater than V at a fixed frequency ω. If the equispaced sensors are D units apart, the spatial sampling results in a periodic wavenumber response with period 1/D. Therefore, for a specified apparent velocity V, the resolvable wavenumber and frequency bands are, respectively, –1/(2D) ≤ k_1 ≤ 1/(2D) and –V/(2D) ≤ ω ≤ V/(2D), where V/(2D) represents the folding frequency in radians/second. Linear arrays are subject to the limitation that the source is required to be located on the extended line of sensors so that plane wavefronts approaching the array site at a particular velocity excite the individual sensors, assumed equispaced, at arrival times which are also equispaced. In seismology, the equispaced interval between successive sensor arrival times is called a move-out or step-out and equals (D sin θ)/v = D/V. However, when the sensor-to-source azimuth varies, two or more independent signal move-outs may be present. Planar (2-D) arrays are then required to discriminate between velocities as well as azimuth.
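The ideal pie-slice response above can be sketched as a binary mask over a sampled (k_1, ω) grid; the value of V and the grid resolution below are illustrative assumptions:

```python
import numpy as np

# Ideal fan velocity filter: pass |k1| <= |omega| / V, reject otherwise.
V = 2.0
k1 = np.linspace(-1.0, 1.0, 101)
omega = np.linspace(-2.0, 2.0, 101)
K1, W = np.meshgrid(k1, omega)           # rows index omega, columns index k1

H = (np.abs(K1) <= np.abs(W) / V).astype(float)

# A wave with apparent velocity |omega / k1| > V lies inside the passband;
# a slower wave falls in the rejection zone.
fast_passed = H[np.argmin(np.abs(omega - 1.5)), np.argmin(np.abs(k1 - 0.5))]
slow_rejected = H[np.argmin(np.abs(omega - 0.5)), np.argmin(np.abs(k1 - 0.5))]
```

The sample at (k_1, ω) = (0.5, 1.5) has apparent velocity 3 > V and is passed, while (0.5, 0.5) has apparent velocity 1 < V and is rejected.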
Spatial (3-D) arrays provide additional scope for the enhancement of discriminating capabilities when sensor/source locations are arbitrary. In such cases, an array origin is chosen, the mth sensor location is denoted by a vector (x_1m  x_2m  x_3m)^t, and the frequency wavenumber response of an array of N sensors is given by

    H(ω, k_1, k_2, k_3) = \sum_{m=1}^{N} H_m(ω) exp[–j 2π \sum_{i=1}^{3} k_i x_im]

where H_m(ω) denotes the frequency response of a filter associated with the mth recording device (sensor). The sum of all N filters provides a flat frequency response, so that waveforms arriving from the estimated directions of arrival at estimated velocities are passed undistorted and other waveforms are suppressed. In the planar specialization, the 2-D array of sensors leads to the theory of 3-D filtering involving a transfer function in the frequency wavenumber variables f, k_1, and k_2. The basic design equations for the optimum, in the least-mean-square error sense, frequency wavenumber filters have been developed [Burg, 1964]. This procedure of Burg can be routinely generalized to the 4-D filtering problem mentioned above.

Acknowledgment

N.K. Bose and L.H. Sibul acknowledge the support provided by the Office of Naval Research under, respectively, Contract N00014-92-J-1755 and the Fundamental Research Initiatives Program.

Defining Terms

Array pattern: Fourier transform of the receiver weighting function taking into account the positions of the receivers.
Beamformers: Systems commonly used for detecting and isolating signals that are propagating in a particular direction.
Grating lobes: Repeated main lobes in the array pattern interpretable in terms of spatial frequency aliasing.
Velocity filtering: Means for discriminating signals from noise or other undesired signals because of their different apparent velocities.
Wavenumber: 2π × (spatial frequency in cycles per unit distance).

Related Topic

14.3 Design and Implementation of Digital Filters

References

K.M. Ahmed and R.J. Evans, "Robust signal and array processing," IEE Proceedings, F: Communications, Radar, and Signal Processing, vol. 129, no. 4, pp. 297–302, 1982.
N.K. Bose, Applied Multidimensional Systems Theory, New York: Van Nostrand Reinhold, 1982.
N.K. Bose, Digital Filters, New York: Elsevier Science North-Holland, 1985. Reprint ed., Malabar, Fla.: Krieger Publishing, 1993.
J.P. Burg, "Three-dimensional filtering with an array of seismometers," Geophysics, vol. 29, no. 5, pp. 693–713, 1964.
J.F. Claerbout, Fundamentals of Geophysical Data Processing, New York: McGraw-Hill, 1976.
D.E. Dudgeon, "Fundamentals of digital array processing," Proc. IEEE, vol. 65, pp. 898–904, 1977.
R.A. Monzingo and T.W. Miller, Introduction to Adaptive Arrays, New York: Wiley, 1980.
S.M. Pillai, Array Signal Processing, New York: Springer-Verlag, 1989.
B.D. Steinberg, Principles of Aperture and Array System Design, New York: Wiley, 1976.

Further Information

Adaptive Signal Processing, edited by Leon H. Sibul, includes papers on adaptive arrays, adaptive algorithms and their properties, as well as other applications of adaptive signal processing techniques (IEEE Press, New York, 1987). Adaptive Antennas: Concepts and Applications, by R.T. Compton, Jr., emphasizes adaptive antennas for electromagnetic wave propagation applications (Prentice-Hall, Englewood Cliffs, N.J., 1988). Array Signal Processing: Concepts and Techniques, by D.H. Johnson and D.E. Dudgeon, incorporates results from discrete-time signal processing into array processing applications such as signal detection, estimation of direction of propagation, and frequency content of signals (Prentice-Hall, Englewood Cliffs, N.J., 1993).
Neural Network Fundamentals with Graphs, Algorithms, and Applications, by N.K. Bose and P. Liang, contains the latest information on adaptive-structure networks, growth algorithms, and adaptive techniques for learning and capability for generalization (McGraw-Hill, New York, N.Y., 1996).

17.4 Video Processing Architectures

Wayne Wolf

Video processing has become a major application of computing: personal computers display multimedia data, digital television provides more channels, etc. The characteristics of video algorithms are very different from those of traditional computing applications, and these demands require new architectures. Two fundamental characteristics of video processing make it challenging and different from applications like database processing. First, the video processor must handle streaming data that arrives constantly. Traditional applications assume that data has a known, fixed location. In video processing, not only are new input samples always arriving, but our time reference in the stream is constantly changing: at one time instant we may consider a sample x_t, but at the next sampling interval that sample becomes x_{t–1}. The need to sweep through the data stream puts additional demands on the memory system. Second, streaming data must be processed in realtime: if the deadline for completing an output is missed, the results will be visible on the screen. When designing realtime systems, it is not sufficient to look at aggregate throughput, because data can become backed up for a period and still meet some long-term timing requirement; processing must complete every realtime result by the appointed deadline. Architectures must provide underlying support for predictable computation times. The challenges of processing streaming data in realtime are made greater by the fact that video processing algorithms are becoming very complex.
Video compression algorithms make use of several different techniques and complex search algorithms to maximize their ability to compress the video data; video display systems provide much more sophisticated controls to the user; content analysis systems make use of multiple complex algorithms working together; mixed computer graphics-video systems combine geometric algorithms with traditional video algorithms. Expect video processing algorithms to become more complex in the future. This complexity puts greater demands on the realtime nature of the video architecture: more complex algorithms generally have less predictable execution times. The architecture should be designed so that algorithms can take advantage of idle hardware caused by early completions of functions, rather than letting hardware sit idle while it waits for other operations to complete. Luckily, VLSI technology is also advancing rapidly and allows us to build ever more sophisticated video processing architectures. The state of video processing architectures will continue to advance as VLSI allows us to integrate more transistors on a chip; in particular, the ability to integrate a significant amount of memory along with multiple processing elements will provide great strides in video processing performance over the next several years. However, the basic techniques for video processing used today [Pir98] will continue to be the basis for video architectures in the long run. This chapter section first reviews two basic techniques for performing video operations: single instruction multiple data (SIMD) and vectorization; and then looks at the three major styles of video architectures: heterogeneous multiprocessors, video signal processors, and microprocessor instruction set extensions. 
Computational Techniques

Many of the fundamental operations in video processing are filters that can be described as linear equations; for example,

    \sum_{1 ≤ i ≤ n} c_i x_i

There are two techniques for implementing such equations: single-instruction multiple data (SIMD) processing and vector processing. The two are similar in underlying hardware structure; the most important differences come in how they relate to the overall computer architecture of which they are a part.

SIMD

The term SIMD comes from Flynn's classification of computer architectures, based on the number of data elements processed simultaneously and the number of instructions used to control the operations on those data. In a SIMD machine, a single instruction is used to control the operation performed on many data elements; thus, the same operation is performed simultaneously on all that data. Figure 17.24 shows a SIMD structure: several function units, each with its own register file and an ALU for performing operations on data; the controller sends identical signals to all function units so that the same operation is performed on all function units at the same time; there is also a network that allows processing elements to pass data among themselves. Consider how to use a SIMD machine to perform the filtering operation given at the beginning of this section. The multiplications are all independent, so we can perform N multiplications in parallel on the N processing elements. We need to perform N – 1 additions on the multiplication results; by properly arranging the computation in a tree, many of those operations can be performed in parallel as well. We will need to use the data transfer network in two ways: to transfer x values between processing elements for the data streaming time shift, and to transfer the partial addition results in the addition tree. SIMD architectures can of course be used to implement multidimensional functions as well.
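The one-dimensional filter mapping just described can be simulated directly; the pure-Python sketch below (coefficients and data are illustrative) stands in for the lockstep processing elements, with one parallel multiply step followed by an addition tree of logarithmic depth:

```python
# SIMD mapping of sum(c[i] * x[i]): one multiply step on all PEs at once,
# then pairwise parallel additions arranged as a tree.
c = [1, 2, 3, 4, 5, 6, 7, 8]
x = [8, 7, 6, 5, 4, 3, 2, 1]

partial = [ci * xi for ci, xi in zip(c, x)]  # step 1: all N multiplies in parallel

steps = 0
while len(partial) > 1:                      # each loop iteration = one parallel add step
    if len(partial) % 2:
        partial.append(0)                    # pad so PEs pair up evenly
    partial = [partial[i] + partial[i + 1] for i in range(0, len(partial), 2)]
    steps += 1

result = partial[0]                          # dot product after ceil(log2 N) add steps
```

For N = 8 the tree needs only 3 addition steps rather than 7 sequential ones, which is the parallelism the text describes.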
For example, two-dimensional correlation is used in video compression, image recognition, etc., and can easily be mapped onto a SIMD machine. SIMD architectures provide a high degree of parallelism at high speeds. Instruction distribution and decoding is not a bottleneck. Furthermore, each processing element has its own data registers, and the communication network between the processing elements can be designed to be fast. However, not all algorithms can be efficiently mapped onto SIMD architectures. Global computation is difficult in SIMD machines. Operations that cause global changes to the machine state also create problems.

Vectorization

Vector instructions were originally invented for supercomputers to improve the performance of scientific calculations. Although video operations are generally done in fixed-point rather than floating-point arithmetic, vector instructions are well-suited to the many video operations that can be expressed in linear algebra. Vectorization was used in many early video processors. More recently, SIMD has become more popular, but with vectorization becoming more popular in general-purpose microprocessors, there may be a resurgence of vector units for multimedia computation. A vector is a data structure supported by hardware. The vector is stored in memory as a set of memory locations; special vector registers are also provided to hold the vectors for arithmetic operations. Our filter example could be implemented as a single vector instruction (after loading the vector registers with the c and x vectors): a vector multiply-accumulate instruction, similar to scalar multiply-accumulate instructions in DSPs, could multiply the x_i's by the c_i's and accumulate the result. The motivation for supporting vector instructions is pipelining the arithmetic operations. If an arithmetic operation takes several clock cycles, pipelining allows high throughput at a high clock rate at the cost of latency. As shown in Fig.
17.25, vectors are well-suited to pipelined execution because all the operations in the vector are known in advance to be independent. Vector units allow linear algebra to be performed at very high speeds with high hardware utilization. Furthermore, because they have a long history in scientific computing, compiling high-level languages into vector instructions is well understood. However, the latencies of the integer arithmetic operations used in video processing are smaller than those of the floating-point operations typically used in scientific vector processors.

FIGURE 17.24 A SIMD architecture.

Heterogeneous Multiprocessors

The earliest style of video processor is the heterogeneous multiprocessor. These machines cannot execute arbitrary programs — they are restricted to a single algorithm or variations on that algorithm. The microarchitecture of the machine is tuned to the target application. In the early days of digital video, special-purpose heterogeneous multiprocessors were the only way to implement VLSI video processing because chips were not large enough to support the hardware required for instruction-set processors. Today, heterogeneous multiprocessors are used to implement low-cost video systems, since by specializing the hardware for a particular application, less hardware is generally required, resulting in smaller, less-expensive chips. A simple heterogeneous architecture is shown in Fig. 17.26. This machine implements a sum-of-absolute-differences correlation in two dimensions for block motion estimation. The architecture of this machine is derived from the data flow of the computation, where for each offset (r, s), the sum of absolute differences between an n × n macroblock M and a T × T reference area R can be computed:

    \sum_{1 ≤ i ≤ n} \sum_{1 ≤ j ≤ n} |M(i, j) – R(i + r, j + s)|

The machine executes one column of the computation per clock cycle: n absolute differences are formed and then passed on to a summation unit.
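The block search performed by the machine of Fig. 17.26 can be sketched in software; the block size n, reference size T, and random data below are illustrative, with the macroblock planted at offset (2, 3) so the search has a known answer:

```python
import numpy as np

# Exhaustive sum-of-absolute-differences block motion search: for each offset
# (r, s), compare an n x n macroblock M against an n x n window of the T x T
# reference area R, keeping the offset with the smallest SAD.
rng = np.random.default_rng(0)
n, T = 4, 8
R = rng.integers(0, 256, size=(T, T))
M = R[2:2 + n, 3:3 + n].copy()            # macroblock known to match at (r, s) = (2, 3)

best_sad, best_offset = None, None
for r in range(T - n + 1):
    for s in range(T - n + 1):
        sad = int(np.abs(M - R[r:r + n, s:s + n]).sum())
        if best_sad is None or sad < best_sad:
            best_sad, best_offset = sad, (r, s)
```

Each inner evaluation corresponds to what the hardware computes one column at a time; the software version makes the (T – n + 1)² offsets of the search space explicit.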
This machine is not a SIMD architecture because it does not execute instructions — it is designed to perform one algorithm. Heterogeneous architectures can also be used for more complex algorithms. Figure 17.27 shows a sketch of a possible architecture for MPEG-style video compression [MPE]. The unit has separate blocks for the major operations: block motion estimation, discrete cosine transform (DCT) calculation, and channel coding; it also has a processor used for overall control.

FIGURE 17.25 Pipelining to support vector operations.

FIGURE 17.26 A heterogeneous multiprocessor.

Heterogeneous architectures are designed by careful examination of the algorithm to be implemented. The most time-critical functions must be identified early. Those operations are typically implemented as special-purpose function units. For example, the block motion estimation engine of Fig. 17.26 can be used as a special-purpose function unit in a more complex application like an MPEG video compressor. Communication links must be provided between the function units to provide adequate bandwidth for the data transfers. In structured communication architectures, data transfers can be organized around buses or more general communication networks like crossbars. Heterogeneous communication systems make specialized connections as required by the algorithm. Many modern heterogeneous video processors use as much structured communication as possible but add specialized communication links as required to meet performance requirements. Many modern heterogeneous processors are at least somewhat programmable. Basic architectures may use registers to control certain parameters of the algorithm. More complex algorithms may use general-purpose microprocessors as elements of the architecture. Small microcontrollers are frequently used for system interfacing, such as talking to a keyboard or other controlling device.
Larger microprocessors can be used to run algorithms that do not benefit from special-purpose function units. Heterogeneous multiprocessors will continue to dominate high-volume, low-cost markets for video and multimedia functions. When an application is well-defined, it is often possible to design a special-purpose architecture that performs only that operation but is significantly cheaper than a system built from a programmable processor. Furthermore, heterogeneous multiprocessors may require significantly less power than programmable solutions and are, therefore, attractive for an increasing number of battery-operated multimedia devices. However, heterogeneous multiprocessors are not well-suited to other application areas. If the algorithm is not well-defined, if the system must be able to execute a variety of algorithms, or if the size of the market will not support the cost of designing an application-specific solution, heterogeneous multiprocessors are not appropriate.

Video Signal Processors

The term digital signal processor (DSP) is generally reserved for microprocessors optimized for signal processing algorithms and running at audio rates. A video signal processor (VSP) is a DSP that is capable of running at video rates. Using separate names for audio and video rate processors is reasonable because VSPs provide much greater parallelism and significantly different microarchitectures. Many early video processors were vector machines because vector units provide high throughput with relatively small amounts of hardware. Today, most VSPs make use of very-long instruction word (VLIW) processor technology, as shown in Fig. 17.28. The architecture has several function units connected to a single register file. The operations on all the function units are controlled by the instruction decoder based on the current instruction. A VLIW machine differs from a SIMD machine in two important ways.
First, the VLIW machine connects all function units to the same register file, while the SIMD machine uses separate registers for the function units. The common register file gives the VLIW machine much more flexibility; for example, a data value can be used on one function unit on one cycle and on another function unit on the next cycle without having to copy the value. Second, the function units in the VLIW machine need not perform the same operation. The instruction is divided into fields, one for each unit. Under control of its instruction field, each instruction unit can request data from the register file and perform operations as required.

FIGURE 17.27 A heterogeneous architecture for MPEG-style compression.

Although having a common register file is very flexible, there are physical limitations on the number of function units that can be connected to a single register file. A single addition requires three ports to the register file: one to read each operand and a third to write the result back to the register file. Register files are built from static random access memories (SRAMs) and slow down as the number of read/write ports grows. As a result, VLIW machines are typically built in clusters, as shown in Fig. 17.29. Each cluster has its own register file and function units, with three or four function units per cluster typical in today's technology. A separate interconnection network allows data transfers between the clusters. When data held in one register file is needed in a different cluster, an instruction must be executed to transfer the data over the interconnection network to the other register file. The major difference between VLIW architectures and the superscalar architectures found in modern microprocessors is that VLIW machines have statically scheduled operations.
A superscalar machine has hardware that examines the instruction stream to determine what operations can be performed in parallel; for example, when two independent operations appear in consecutive instructions, those instructions can be executed in parallel. A VLIW machine relies on a compiler to identify parallelism in the program and to pack those operations into instruction words. This requires sophisticated compilers that can extract parallelism and effectively make use of it when generating instructions. Video is especially well-suited to VLIW because video programs have a great deal of parallelism that is relatively easy to identify and take advantage of in a VLIW machine. VLIW has potential performance advantages because its control unit is relatively simple. Because the work of finding parallelism is performed by the compiler, a VLIW machine does not require the sophisticated execution unit of a superscalar processor. This allows a VLIW video processor to run at high clock rates. However, it does rely on the compiler's ability to find enough parallelism to keep the function units busy. Furthermore, complex algorithms may have some sections that are not highly parallel and therefore will not be sped up by the VLIW mechanism. If one is not careful, these sequential sections of code can come to limit the overall performance of the application. Practical video signal processors are not pure VLIW machines, however. In general, they are in fact hybrid machines that use VLIW processing for some operations and heterogeneous multiprocessing techniques for others. This is necessary to meet the high performance demands of video processing; certain critical operations can be sped up with special-purpose function units, leaving the VLIW processor to perform the rest.

FIGURE 17.28 A simple VLIW machine.

FIGURE 17.29 A clustered VLIW machine.

An example of this technique is the Trimedia TM-1 processor [Rat96] shown in Fig. 17.30.
This machine has a VLIW processor. It also has several function units for specialized video operations, principal among these being a variable-length decoder for channel coding and an image coprocessor. The TM-1 also supports multiple DMA channels to speed up data transfers, as well as timers to support realtime operation. VLIW VSPs represent one end of the programmable video processor spectrum. These machines are designed from the ground up to execute video algorithms. The VLIW architecture is very well-suited to video applications due to the embarrassingly high levels of parallelism available in video programs. Special-purpose function units can be used to speed up certain key operations. However, VLIW VSPs may not be as well-suited to executing code that is more typically found on workstation microprocessors, such as error checking, bit-level operations, etc. As a result, VLIW VSPs can be used in conjunction with standard microprocessors to implement a complex video application, with the VSP performing the traditional, parallel video sections of the code and the microprocessor performing the less regular computations.

Instruction Set Extensions

Both heterogeneous multiprocessors and VSPs are specialized architectures for video. However, there are many applications in which it is desirable to execute video programs directly on a workstation or PC: programs that are closely tied to the operating system, mixed video/graphics applications, etc. Traditional microprocessors are fast but are not especially well-utilized by video programs. For these applications, microprocessor instruction set extensions have been developed to allow video algorithms to be executed more efficiently on traditional microprocessors. The basic principle of instruction set extensions is subword parallelism [Lee95], as illustrated in Fig. 17.31.
This technique takes advantage of the fact that modern microprocessors support native 32- or 64-bit operations while most video algorithms require much smaller data accuracy, such as 16 bits or even 8 bits. One can divide the microprocessor data path, on which the instructions are executed, into subwords. This is a relatively simple operation, mainly entailing the addition of a small amount of logic to cut the ALU's carry chain at the appropriate points when subword operations are performed. When a 64-bit data path is divided for use by 16-bit subwords, the machine can support four simultaneous subword operations. Subword parallelism is often referred to as SIMD because a single microprocessor instruction causes the same operation to be performed on all the subwords in parallel. However, there is no separate SIMD instruction unit — all the work is done by adding a small amount of hardware to the microprocessor data path. Subword parallelism is powerful because it has a very small cost in the microprocessor (both in terms of chip area and performance) and because it provides substantial speedups on parallel code. A typical instruction set extension will of course support logical and arithmetic operations on subwords. It may support saturation arithmetic as well as two's-complement arithmetic. Saturation arithmetic generates the maximum value on overflow, more closely approximating physical devices. It may also support permutation operations so that the order of subwords in a word can be shuffled. Loads and stores are performed on words — not subwords.

FIGURE 17.30 The Trimedia TM-1 video signal processor.

ISA extensions have been defined for the major microprocessor architectures. The MAX extension for the HP PA-RISC architecture [Lee96] was the first ISA extension and introduced the notion of subword parallelism. The VIS (Visual Instruction Set) extension [Tre96] has been added to the Sun SPARC architecture.
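The carry-chain cut that all of these extensions rely on can be emulated in software; the sketch below (my own illustration, not code from any of the ISAs named above) performs four parallel 16-bit additions inside one 64-bit word, wrapping within each subword instead of carrying across the boundary:

```python
# Emulation of subword parallelism: four 16-bit lanes packed into a 64-bit
# word, with the carry chain cut at each 16-bit boundary.
MASK16 = 0xFFFF

def pack4(vals):
    """Pack four 16-bit values into one 64-bit word, lane 0 in the low bits."""
    word = 0
    for i, v in enumerate(vals):
        word |= (v & MASK16) << (16 * i)
    return word

def padd16(a, b):
    """Parallel add of four 16-bit subwords; each lane wraps independently."""
    out = 0
    for i in range(4):
        lane = ((a >> (16 * i)) & MASK16) + ((b >> (16 * i)) & MASK16)
        out |= (lane & MASK16) << (16 * i)   # discard the carry out of this lane
    return out

a = pack4([1, 2, 0xFFFF, 100])
b = pack4([10, 20, 1, 200])
sums = [(padd16(a, b) >> (16 * i)) & MASK16 for i in range(4)]
```

The third lane (0xFFFF + 1) wraps to 0 instead of carrying into its neighbor, which is exactly the effect of cutting the ALU's carry chain.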
The Intel x86 architecture has been extended with the MMX instructions [Pel96]. MMX supports operations on 8-bit bytes, 16-bit words, 32-bit doublewords, and 64-bit quadwords; all of these data types are packed into 64-bit words. All MMX operations are performed in the floating-point registers, which means that the floating-point registers must be saved at the beginning of MMX code and restored at its end. (Although floating-point operations access these registers as a stack, MMX instructions can address the registers arbitrarily.) MMX supports addition, subtraction, comparison, multiplication, shifts, and logical operations. Arithmetic can optionally be performed in saturation mode. There are also instructions for packing subwords into and unpacking them out of words. Some conversion operations are provided so that intermediate calculations can be performed at higher precision and then converted to a smaller format. The Sun VIS extension also uses the floating-point registers. The MAX-2 extension is the latest extension to the HP architecture. It uses the integer registers rather than the floating-point registers. It does not directly implement multiplication, but instead provides a shift-and-add operation for software-driven multiplication. MAX-2 also supports a permutation operation to allow subwords to be rearranged within a word. The ability to mix multimedia instructions with other instructions on a standard microprocessor is the clear advantage of instruction set extensions. These extensions are very well-suited to the implementation of complex algorithms because the microprocessor can efficiently execute the less regular computations as well as the highly parallel video operations. Furthermore, instruction set extensions take advantage of the huge resources available to microprocessor manufacturers to build high-performance chips.
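The difference between the two arithmetic modes mentioned above is easy to see on a single 8-bit lane. The sketch below contrasts ordinary wraparound addition with unsigned saturating addition in the spirit of MMX's saturation mode; it is plain Python, not a call into any actual MMX intrinsic.

```python
# Saturating vs. wraparound arithmetic on an unsigned 8-bit lane,
# illustrating MMX-style saturation mode in plain Python.

def add8_wrap(a, b):
    """Two's-complement (modular) add: the result wraps past 255."""
    return (a + b) & 0xFF

def add8_saturate(a, b):
    """Unsigned saturating add: clamp at the maximum value instead of
    wrapping, which better matches physical quantities such as pixel
    brightness."""
    return min(a + b, 0xFF)

# Brightening a nearly-white pixel: wraparound turns it almost black,
# while saturation pins it at white.
print(add8_wrap(250, 10))      # 4
print(add8_saturate(250, 10))  # 255
```

The pixel example shows why saturation is the right default for imaging: a brightened highlight should stay white, not flip to near-black as modular arithmetic would make it.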
The main disadvantages of instruction set extensions are related to the tight coupling of the video and non-video instructions. First, the memory system is not changed to fit the characteristics of the video application.
FIGURE 17.31 Implementing subword parallelism on a microprocessor.
The streaming data typical of video is not well-suited to the caches used in microprocessors. Caches rely on temporal and spatial locality; they assume that a variable is used many times after its initial use. In contrast, streaming data is used a small, fixed number of times and then discarded, to be replaced by a new datum. Second, the available parallelism is limited by the width of the data path. A 64-bit data path can exhibit at most four-way parallelism when subdivided into 16-bit subwords. Other architectures can be more easily extended for greater parallelism when technology and cost permit. Summary There is no one best way to design a video processing architecture. The structure of the architecture depends on the intended application environment, the algorithms to be run, performance requirements, cost constraints, and other factors. Computer architects have developed techniques that span a wide range of this design space: heterogeneous multiprocessors handle low-cost applications effectively; VLIW video signal processors provide specialized video processing; and instruction set extensions enhance video performance on traditional microprocessors. As VLSI technology improves further, these techniques will be extended to create machines that hold significant amounts of video memory on-chip with the processing elements that operate on the video data. Defining Terms ALU: Arithmetic/logic unit. MPEG: A set of standards for video compression. Processing element: A computational unit in a parallel architecture.
SIMD (single-instruction, multiple-data): An architecture in which a single instruction controls the operation of many separate processing elements. Heterogeneous multiprocessor: An architecture in which several dissimilar processing units are connected together to perform a particular computation. Vector processor: A machine that operates on vector and matrix quantities in a pipelined fashion. VLIW (very long instruction word): An architecture in which several ALUs are connected to a common register file, under the control of an instruction word that allows each ALU's operation to be determined separately. References [Lee95] R. B. Lee, Accelerating multimedia with enhanced microprocessors, IEEE Micro, April 1995, pp. 22–32. [Lee96] R. B. Lee, Subword parallelism with MAX-2, IEEE Micro, August 1996, pp. 51–59. [MPE] MPEG Web site, http://www.mpeg.org. [Pel96] A. Peleg and U. Weiser, MMX technology extension to the Intel architecture, IEEE Micro, August 1996, pp. 42–50. [Pir98] P. Pirsch and J.-J. Stolberg, VLSI implementations of image and video multimedia processing systems, IEEE Transactions on Circuits and Systems for Video Technology, 8(7), November 1998, pp. 878–891. [Rat96] S. Rathnam and G. Slavenburg, An architectural overview of the programmable media processor, TM-1, in Proc. Compcon, IEEE Computer Society Press, 1996, pp. 319–326. [Tre96] M. Tremblay, J. M. O'Connor, V. Narayanan, and L. He, VIS speeds new media processing, IEEE Micro, August 1996, pp. 10–20. Further Reading Two journals, IEEE Transactions on Circuits and Systems for Video Technology and IEEE Micro, provide up-to-date information on developments in video processing. A number of conferences cover this area, including the International Solid-State Circuits Conference (ISSCC) and the Silicon Signal Processing (SiSP) Workshop.
17.5 MPEG-4 Based Multimedia Information System Ya-Qin Zhang 1 The recent creation and finalization of the Moving Picture Experts Group MPEG-4 international standard has provided a common platform and unified framework for multimedia information representation. In addition to providing highly efficient compression of both natural and synthetic audio-visual (AV) content such as video, audio, sound, texture maps, graphics, still images, MIDI, and animated structures, MPEG-4 enables greater capabilities for manipulating AV content in the compressed domain through its object-based representation. MPEG-4 is a natural result of the technological convergence of several fields: digital television, computer graphics, interactive multimedia, and the Internet. This tutorial chapter briefly discusses some example features and applications enabled by the MPEG-4 standard. During the last decade, a spectrum of standards in digital video and multimedia has emerged for different applications. These standards include: the ISO JPEG standard for still images [1]; ITU-T H.261 for video conferencing at 64 kilobits per second (kbps) to 2 megabits per second (Mbps) [2]; ITU-T H.263 for PSTN-based video telephony [3]; ISO MPEG-1 for storage on CD-ROM at VHS quality [4]; the ISO MPEG-2 standard for digital television [5]; and the recently completed ISO MPEG-4 international standard for multimedia representation and integration [6]. Two new ISO standards are under development to address next-generation still image coding (JPEG-2000) and content-based multimedia information description (MPEG-7). Several special issues of IEEE journals have been devoted to summarizing recent advances in digital image and video compression and advanced television in terms of standards, algorithms, implementations, and applications [7–11].
The successful convergence and implementation of MPEG-1 and MPEG-2 became a catalyst for propelling new digital consumer markets such as Video CD, Digital TV, DVD, and DBS. While the MPEG-1 and MPEG-2 standards were primarily targeted at providing high compression efficiency for storage and transmission of pixel-based video and audio, MPEG-4 is envisioned to support a wide variety of multimedia applications and new functionalities for object-based audio-visual (AV) content. The recent completion of MPEG-4 Version 1 is expected to provide a stimulus to emerging multimedia applications in wireless networks, the Internet, and content creation. The MPEG-4 effort was originally conceived in late 1992 to address very low bit rate (VLBR) video applications at below 64 kbps, such as PSTN-based videophone, video e-mail, security applications, and video over cellular networks. The main motivations for focusing MPEG-4 on VLBR applications were:
• Applications such as PSTN videophone and remote monitoring were important, but not adequately addressed by established or emerging standards. In fact, new products were being introduced to the market with proprietary schemes. The need for a standard at rates below 64 kbps was imminent.
• Research activities had intensified in VLBR video coding, some of which had gone beyond the boundary of the traditional statistical, pixel-oriented methodology. It was felt that a breakthrough in video compression was possible within a five-year window. This "quantum leap" would likely make compressed-video quality at rates below 64 kbps adequate for many applications such as videophone.
Based on these assumptions, a workplan was generated to have the MPEG-4 Committee Draft (CD) completed in 1997, providing a generic audio-visual coding standard at very low bit rates.
Several MPEG-4 seminars were held in parallel with the WG11 meetings, many workshops and special sessions were organized, and several special issues were devoted to these topics. However, as of the July 1994 WG11 meeting in Norway, there was still no clear evidence that a "quantum leap" in compression technology was going to happen within the MPEG-4 timeframe. On the other hand, the ITU-T had embarked on an effort to define the H.263 standard for videophone applications over PSTN and mobile networks. The need for defining a pure compression standard at very low bit rates was, therefore, not entirely justified.
1 The author was the director of the Multimedia Technology Laboratory at Sarnoff Corporation in Princeton, New Jersey when this work was performed.
In light of this situation, a change of direction was called for, refocusing on new or improved functionalities and applications not addressed by existing and emerging standards. Examples include object-oriented features for content-based multimedia databases, error-robust communication in wireless networks, and hybrid natural and synthetic image authoring and rendering. With the technological convergence of digital video, computer graphics, and the Internet, MPEG-4 aims at providing an audio-visual coding standard allowing for interactivity, high compression, and/or universal accessibility, with a high degree of flexibility and extensibility. In particular, MPEG-4 intends to establish a flexible content-based audio-visual environment that can be customized for specific applications and that can be adapted in the future to take advantage of new technological advances. It is foreseen that this environment will be capable of addressing new application areas ranging from conventional storage and transmission of audio and video to truly interactive AV services requiring content-based AV database access, e.g., video games or AV content creation.
Efficient coding, manipulation, and delivery of AV information over the Internet will be key features of the standard. MPEG-4 Multimedia System Figure 17.32 shows an architectural overview of MPEG-4. The standard defines a syntax for representing individual audio-visual objects, with both natural and synthetic content. These objects are first encoded independently into their own elementary streams. Scene description information is provided separately, defining the location of these objects in space and time; the objects are then composed into the final scene presented to the user. This representation includes support for user interaction and manipulation. The scene description uses a tree-based structure, following the Virtual Reality Modeling Language (VRML) design. Moving far beyond the capabilities of VRML, MPEG-4 scene descriptions can be dynamically constructed and updated, enabling much higher levels of interactivity. Object descriptors associate scene description components with the actual elementary streams that contain the corresponding coded data. As shown in Fig. 17.32, these components are encoded separately and transmitted to the receiver. The receiving terminal then has the responsibility of composing the individual objects for presentation and of managing user interaction. The eight MPEG-4 functionalities are defined and clustered into three classes:
• Content-based interactivity: content-based manipulation and bit stream editing; content-based multimedia data access tools; hybrid natural and synthetic data coding; improved temporal access.
• Compression: improved coding efficiency; coding of multiple concurrent data streams.
• Universal access: robustness in error-prone environments; content-based scalability.
FIGURE 17.32 MPEG-4 Overview.
Audio-visual objects, natural audio, and synthetic media are independently coded and then combined according to scene description information (courtesy of the ISO/MPEG-4 committee).
Some of the applications enabled by these functionalities include:
• Video streaming over the Internet.
• Multimedia authoring and presentation.
• Viewing the contents of video data at different resolutions, speeds, angles, and quality levels.
• Storage and retrieval of multimedia databases over mobile links with high error rates and low channel capacity (e.g., to a Personal Digital Assistant).
• Multipoint teleconferencing with selective transmission, decoding, and display of "interesting" parties.
• Interactive home shopping with customers' selections from a video catalogue.
• Stereo-vision and multiview of video content, e.g., sports.
• "Virtual" conference and classroom.
• Video e-mail, agents, and answering machines.
Object-Based Authoring Tool Example Figure 17.33 shows an example of an object-based authoring tool for MPEG-4 AV content, developed by the Multimedia Technology Laboratory at Sarnoff Corporation in Princeton, New Jersey. This tool has the following features:
• Compression/decompression of different visual objects into MPEG-4-compliant bitstreams.
• Drag-and-drop of video objects into a window while resizing the objects or adapting them to different frame rates, speeds, transparencies, and layers.
• Substitution of different backgrounds.
• Mixing natural image and video objects with computer-generated, synthetic textures and animated objects.
• Creating metadata information for each visual object.
This set of authoring tools can be used for interactive Web design, digital studio production, and multimedia presentation.
FIGURE 17.33 An example of a multimedia authoring system using MPEG-4 tools and functionalities (courtesy of Sarnoff Corporation).
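The object-plus-scene-description model described above can be sketched in miniature: media objects live in separate elementary streams, and a tree-shaped scene description places them in space for the receiving terminal to compose. The class and field names below are illustrative only, not taken from the MPEG-4 Systems specification.

```python
# Miniature sketch of MPEG-4-style composition: leaf nodes carry an
# object-descriptor ID that stands in for a reference to a coded
# elementary stream; interior nodes position their children relative
# to themselves, VRML-style.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SceneNode:
    name: str
    x: int = 0                              # offset relative to the parent node
    y: int = 0
    object_descriptor: Optional[int] = None  # elementary-stream ID, if a media leaf
    children: list = field(default_factory=list)

def compose(node, ox=0, oy=0, out=None):
    """Walk the scene tree, resolving each media object's absolute
    position; this mimics the receiving terminal's composition step."""
    if out is None:
        out = []
    ax, ay = ox + node.x, oy + node.y
    if node.object_descriptor is not None:
        out.append((node.name, node.object_descriptor, ax, ay))
    for child in node.children:
        compose(child, ax, ay, out)
    return out

scene = SceneNode("root", children=[
    SceneNode("background", object_descriptor=1),
    SceneNode("group", x=100, y=50, children=[
        SceneNode("speaker_video", object_descriptor=2, x=10, y=10),
        SceneNode("logo", object_descriptor=3, x=200, y=0),
    ]),
])

for name, od, x, y in compose(scene):
    print(name, od, x, y)
# background 1 0 0
# speaker_video 2 110 60
# logo 3 300 50
```

Because positions are relative, moving the "group" node moves both of its media children at once; dynamic scene updates of this kind are what lets MPEG-4 manipulate individual objects without re-encoding their streams.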
It empowers users to compose and interact with digital video at a higher semantic level. References 1. JPEG Still Image Coding Standard, ISO/IEC 10918-1, 1990. 2. Video Codec for Audiovisual Services at 64 to 1920 kbps, CCITT Recommendation H.261, 1990. 3. Video Coding for Narrow Telecommunication Channels at below 64 kbps, ITU-T Recommendation H.263, ITU-T/SG15/LBC, May 1995. 4. Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbps, ISO/IEC 11172, 1992. 5. Generic Coding of Moving Pictures and Associated Audio, ISO/IEC 13818, 1994. 6. MPEG-4 Draft International Standard, ISO/IEC JTC1/SC29/WG11, October 1998. 7. Y.-Q. Zhang, W. Li, and M. Liou, eds., Advances in Digital Image and Video Compression, Special Issue, Proceedings of the IEEE, Feb. 1995. 8. M. Kunt, ed., Digital Television, Special Issue, Proceedings of the IEEE, July 1995. 9. Y.-Q. Zhang, F. Pereira, T. Sikora, and C. Reader, eds., MPEG-4, Special Issue, IEEE Transactions on Circuits and Systems for Video Technology, Feb. 1997. 10. T. Chen, R. Liu, and A. Tekalp, eds., Multimedia Signal Processing, Special Issue, Proceedings of the IEEE, May 1998. 11. M. T. Sun, K. Ngan, T. Sikora, and S. Panchanathan, eds., Representation and Coding of Images and Video, Special Issue, IEEE Transactions on Circuits and Systems for Video Technology, November 1998. 12. MPEG-4 Requirements Ad-Hoc Group, MPEG-4 Requirements, ISO/IEC JTC1/SC29/WG11, Maceio, Nov. 1996.