Principe, J.C. "Artificial Neural Networks," The Electrical Engineering Handbook, Ed. Richard C. Dorf, Boca Raton: CRC Press LLC, 2000.

20 Artificial Neural Networks

Jose C. Principe, University of Florida

20.1 Definitions and Scope
    Introduction • Definitions and Style of Computation • ANN Types and Applications
20.2 Multilayer Perceptrons
    Function of Each PE • How to Train MLPs • Applying Back-Propagation in Practice • A Posteriori Probabilities
20.3 Radial Basis Function Networks
20.4 Time-Lagged Networks
    Memory Structures • Training-Focused TLN Architectures
20.5 Hebbian Learning and Principal Component Analysis Networks
    Hebbian Learning • Principal Component Analysis • Associative Memories
20.6 Competitive Learning and Kohonen Networks

20.1 Definitions and Scope

Introduction

Artificial neural networks (ANNs) are among the newest signal-processing technologies in the engineer's toolbox. The field is highly interdisciplinary, but our approach will restrict the view to the engineering perspective. In engineering, neural networks serve two important functions: as pattern classifiers and as nonlinear adaptive filters. We will provide a brief overview of the theory, learning rules, and applications of the most important neural network models.

Definitions and Style of Computation

An ANN is an adaptive, most often nonlinear system that learns to perform a function (an input/output map) from data. Adaptive means that the system parameters are changed during operation, normally called the training phase. After the training phase the ANN parameters are fixed and the system is deployed to solve the problem at hand (the testing phase). The ANN is built with a systematic step-by-step procedure to optimize a performance criterion or to follow some implicit internal constraint, which is commonly referred to as the learning rule. The input/output training data are fundamental in neural network technology, because they convey the necessary information to "discover" the optimal operating point. The nonlinear nature of the neural network processing elements (PEs) provides the system with great flexibility to achieve practically any desired input/output map, i.e., some ANNs are universal mappers.

There is a style in neural computation that is worth describing (Fig. 20.1). An input is presented to the network and a corresponding desired or target response is set at the output (when this is the case the training is called supervised). An error is composed from the difference between the desired response and the system output. This error information is fed back to the system, which adjusts its parameters in a systematic fashion (the learning rule). The process is repeated until the performance is acceptable.

FIGURE 20.1 The style of neural computation.
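In code, this style of computation is just a loop. The sketch below is purely illustrative; `predict` and `update` are placeholders for whatever forward map and learning rule a particular ANN uses, not an API defined in this chapter.

```python
def train(net, data, acceptable_error):
    """Supervised neural computation: present an input, compare the
    output with the desired response, and feed the error back into
    the learning rule until performance is acceptable."""
    while True:
        total_error = 0.0
        for x, desired in data:
            y = net.predict(x)       # forward pass through the topology
            error = desired - y      # difference from the target response
            net.update(x, error)     # learning rule adjusts the parameters
            total_error += error ** 2
        if total_error <= acceptable_error:
            return net
```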
It is clear from this description that the performance hinges heavily on the data. If one does not have data that cover a significant portion of the operating conditions, or if the data are noisy, then neural network technology is probably not the right solution. On the other hand, if there is plenty of data but the problem is too poorly understood to derive an approximate model, then neural network technology is a good choice. This operating procedure should be contrasted with traditional engineering design, which consists of exhaustive subsystem specifications and intercommunication protocols. In ANNs, the designer chooses the network topology, the performance function, the learning rule, and the criterion to stop the training phase, but the system adjusts the parameters automatically. So, it is difficult to bring a priori information into the design, and when the system does not work properly it is also hard to refine the solution incrementally. But ANN-based solutions are extremely efficient in terms of development time and resources, and in many difficult problems ANNs provide performance that is hard to match with other technologies. A decade ago Denker said that "ANNs are the second best way to implement a solution," a remark motivated by the simplicity of their design and by their universality, the best way remaining the traditional design obtained by studying the physics of the problem. At present, ANNs are emerging as the technology of choice for many applications, such as pattern recognition, prediction, system identification, and control.

ANN Types and Applications

It is always risky to establish a taxonomy of a technology, but our motivation is to provide a quick overview of the application areas and of the most popular topologies and learning paradigms.

| Application | Topology | Supervised Learning | Unsupervised Learning |
|---|---|---|---|
| Association | Hopfield [Zurada, 1992; Haykin, 1994] | — | Hebbian [Zurada, 1992; Haykin, 1994; Kung, 1993] |
| | Multilayer perceptron [Zurada, 1992; Haykin, 1994; Bishop, 1995] | Back-propagation [Zurada, 1992; Haykin, 1994; Bishop, 1995] | — |
| | Linear associative mem. [Zurada, 1992; Haykin, 1994] | — | Hebbian |
| Pattern recognition | Multilayer perceptron [Zurada, 1992; Haykin, 1994; Bishop, 1995] | Back-propagation | — |
| | Radial basis functions [Zurada, 1992; Bishop, 1995] | Least mean square | k-means [Bishop, 1995] |
| Feature extraction | Competitive [Zurada, 1992; Haykin, 1994] | — | Competitive |
| | Kohonen [Zurada, 1992; Haykin, 1994] | — | Kohonen |
| | Multilayer perceptron [Kung, 1993] | Back-propagation | — |
| | Principal comp. anal. [Zurada, 1992; Kung, 1993] | — | Oja's [Zurada, 1992; Kung, 1993] |
| Prediction, system ID | Time-lagged networks [Zurada, 1992; Kung, 1993; de Vries and Principe, 1992] | Back-propagation through time [Zurada, 1992] | — |
| | Fully recurrent nets [Zurada, 1992] | Back-propagation through time | — |

It is clear that multilayer perceptrons (MLPs), the back-propagation algorithm, and their extensions (time-lagged networks, TLNs, and back-propagation through time, BPTT, respectively) hold a prominent position in ANN technology. It is therefore only natural to spend most of our overview presenting the theory and tools of back-propagation learning. It is also important to notice that Hebbian learning (and its extension, the Oja rule) is a very useful (and biologically plausible) learning mechanism. It is an unsupervised learning method, since there is no need to specify the desired or target response to the ANN.

20.2 Multilayer Perceptrons

Multilayer perceptrons are a layered arrangement of nonlinear PEs, as shown in Fig. 20.2. The layer that receives the input is called the input layer, and the layer that produces the output is the output layer. The layers that do not have direct access to the external world are called hidden layers. A layered network with just the input and output layers is called the perceptron. Each connection between PEs is weighted by a scalar w_i, called a weight, which is adapted during learning.

FIGURE 20.2 MLP with one hidden layer (d-k-m).

The PEs in the MLP are composed of an adder followed by a smooth saturating nonlinearity of the sigmoid type (Fig. 20.3). The most common saturating nonlinearities are the logistic function and the hyperbolic tangent. The hard threshold nonlinearity is used in other types of networks.

FIGURE 20.3 A PE and the most common nonlinearities.
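In code, a PE of this kind reduces to a few lines. The following sketch is our own illustration (the function name and NumPy usage are our choices), showing the adder followed by the two common saturating nonlinearities:

```python
import numpy as np

def pe_output(x, w, b, nonlinearity="logistic"):
    """A single PE: an adder (weighted sum plus bias) followed by a
    smooth saturating nonlinearity of the sigmoid type."""
    net = np.dot(w, x) + b
    if nonlinearity == "logistic":
        return 1.0 / (1.0 + np.exp(-net))   # saturates at 0 and 1
    if nonlinearity == "tanh":
        return np.tanh(net)                 # saturates at -1 and 1
    raise ValueError("unknown nonlinearity")
```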
The importance of the MLP is that it is a universal mapper (it implements arbitrary input/output maps) when the topology has at least two hidden layers and a sufficient number of PEs [Haykin, 1994]. Even MLPs with a single hidden layer are able to approximate continuous input/output maps. This means that we will rarely need to choose topologies with more than two hidden layers. But these are existence proofs, so the issue that we must solve as engineers is to choose how many layers and how many PEs in each layer are required to produce good results.

Many problems in engineering can be thought of in terms of a transformation of an input space, containing the input, to an output space where the desired response exists. For instance, dividing data into classes can be thought of as transforming the input into 0 and 1 responses that will code the classes [Bishop, 1995]. Likewise, identification of an unknown system can also be framed as a mapping (function approximation) from the input to the system output [Kung, 1993]. The MLP is highly recommended for these applications.

Function of Each PE

Let us study briefly the function of a single PE with two inputs [Zurada, 1992]. If the nonlinearity is the threshold nonlinearity, we can immediately see that the output is simply 1 or –1, so the PE divides the input space into two half-spaces. The surface that divides these subspaces is called a separation surface, and in this case it is the line of equation

    w_1 x_1 + w_2 x_2 + b = 0    (20.1)

i.e., the PE weights and the bias control the orientation and position of the separation line, respectively (Fig. 20.4). In many dimensions the separation surface becomes a hyperplane of dimension one less than the dimensionality of the input space. So, each PE creates a dichotomy in the input space. For smooth nonlinearities the separation surface is not crisp; it becomes fuzzy, but the same principles apply. In this case, the size of the weights controls the width of the fuzzy boundary (larger weights shrink the fuzzy boundary).

FIGURE 20.4 A two-input PE and its separation surface.

The perceptron input/output map is built from a juxtaposition of linear separation surfaces, so the perceptron gives zero classification error only for linearly separable classes (i.e., classes that can be exactly classified by hyperplanes). When one adds a layer to the perceptron, creating a one-hidden-layer MLP, the type of separation surfaces changes drastically. It can be shown that this learning machine is able to create "bumps" in the input space, i.e., an area of high response surrounded by low responses [Zurada, 1992]. The function of each PE is always the same, no matter whether the PE is part of a perceptron or an MLP. However, notice that the output layer in the MLP works with the results of the hidden-layer activations, creating an embedding of functions and producing more complex separation surfaces. The one-hidden-layer MLP is able to produce nonlinear separation surfaces. If one adds an extra layer (i.e., two hidden layers), the learning machine can now combine bumps at will, which makes it a universal mapper, since there is evidence that any function can be approximated by localized bumps. One important aspect to remember is that changing a single weight in the MLP can drastically change the location of the separation surfaces; i.e., the MLP achieves the input/output map through the interplay of all its weights.
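To make Eq. (20.1) concrete, here is a small illustrative sketch of a two-input threshold PE and the dichotomy it creates; the weights, bias, and test points are arbitrary choices of ours:

```python
import numpy as np

# The line w1*x1 + w2*x2 + b = 0 (Eq. 20.1) splits the plane in two.
w = np.array([1.0, -2.0])   # weights set the orientation of the line
b = 0.5                     # bias sets its position

def threshold_pe(x):
    """Two-input PE with a hard threshold: output is 1 on one side
    of the separation line and -1 on the other."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(threshold_pe(np.array([2.0, 0.0])))    # -> 1
print(threshold_pe(np.array([-2.0, 1.0])))   # -> -1
```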
How to Train MLPs

One fundamental issue is how to adapt the weights w_i of the MLP to achieve a given input/output map. The core ideas have been around for many years in optimization, and they are extensions of well-known engineering principles, such as the least mean square (LMS) algorithm of adaptive filtering [Haykin, 1994]. Let us review the theory here. Assume that we have a linear PE (f(net) = net) and that we want to adapt the weights so as to minimize the square difference between the desired signal and the PE response (Fig. 20.5). This problem has an analytical solution known as least squares [Haykin, 1994]. The optimal weights are obtained as the product of the inverse of the input autocorrelation matrix (R^{-1}) and the cross-correlation vector (P) between the input and the desired response.

FIGURE 20.5 Computing analytically optimal weights for the linear PE.

The analytical solution is equivalent to a search for the minimum of the quadratic performance surface J(w_i) using gradient descent, where the weights at each iteration k are adjusted by

    w_i(k+1) = w_i(k) - \eta\,\nabla J(k), \qquad \nabla J = \frac{\partial J}{\partial w_i}    (20.2)

where η is a small constant called the step size and ∇J(k) is the gradient of the performance surface at iteration k. In the late 1960s, Bernard Widrow proposed a very efficient estimate for computing the gradient at each iteration:

    J(k) = \frac{1}{2}\, e^2(k), \qquad \frac{\partial J(k)}{\partial w_i} \approx -e(k)\, x_i(k)    (20.3)

which, when substituted into Eq. (20.2), produces the so-called LMS algorithm. He showed that LMS converges to the analytic solution provided the step size η is small enough. Since it is a steepest-descent procedure, the largest step size is limited by the inverse of the largest eigenvalue of the input autocorrelation matrix. The larger the step size (below this limit), the faster the convergence, but the final values will "rattle" around the optimal value in a basin with a radius proportional to the step size. Hence, there is a fundamental trade-off between speed of convergence and accuracy in the final weight values. One great appeal of the LMS algorithm is that it is very efficient (just one multiplication per weight) and requires only local quantities to be computed.

The LMS algorithm can be framed as a computation of partial derivatives of the cost with respect to the unknowns, i.e., the weight values. In fact, with the chain rule one writes

    \frac{\partial J}{\partial w_i} = \frac{\partial J}{\partial y}\,\frac{\partial y}{\partial w_i} = -(d - y)\,\frac{\partial}{\partial w_i}\Big(\sum_i w_i x_i\Big) = -e\, x_i    (20.4)

and we obtain the LMS algorithm for the linear PE. What happens if the PE is nonlinear? If the nonlinearity is differentiable (smooth), we can still apply the same method, because the chain rule prescribes that (Fig. 20.6)

    \frac{\partial J}{\partial w_i} = \frac{\partial J}{\partial y}\,\frac{\partial y}{\partial \mathrm{net}}\,\frac{\partial \mathrm{net}}{\partial w_i} = -(d - y)\, f'(\mathrm{net})\, x_i = -e\, f'(\mathrm{net})\, x_i    (20.5)

where f'(net) is the derivative of the nonlinearity computed at the operating point.

FIGURE 20.6 How to extend LMS to nonlinear PEs with the chain rule.

Equation (20.5) is known as the delta rule, and it will train the perceptron [Haykin, 1994]. Note that throughout the derivation we skipped the pattern index p for simplicity, but this rule is applied for each input pattern. However, the delta rule cannot train MLPs, since it requires knowledge of the error signal at each PE. The principle of ordered derivatives can be extended to multilayer networks, provided we organize the computations in flows of activation and error propagation. The principle is very easy to understand, but a little complex to formulate in equation form [Haykin, 1994].
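Before deriving the multilayer case, note that the single-PE updates just derived fit in a few lines. The following sketch is a minimal LMS implementation of Eqs. (20.2)-(20.4) for a linear PE; the function name and the example data are our own illustration:

```python
import numpy as np

def lms(X, d, eta=0.01, epochs=50):
    """LMS for a linear PE. For each pattern, the gradient estimate
    dJ/dw_i ~ -e*x_i (Eq. 20.3) is plugged into the gradient-descent
    update of Eq. (20.2)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, d):
            e = target - np.dot(w, x)   # instantaneous error
            w += eta * e * x            # w(k+1) = w(k) + eta*e(k)*x(k)
    return w

# Example: identify the weights [2, -1] from noiseless data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
d = X @ np.array([2.0, -1.0])
print(lms(X, d))   # approaches [2, -1]
```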
Suppose that we want to adapt the weights connected to a hidden-layer PE, the ith PE (Fig. 20.7). One can decompose the computation of the partial derivative of the cost with respect to the weight w_ij as

    \frac{\partial J}{\partial w_{ij}} = \underbrace{\frac{\partial J}{\partial y_i}}_{1}\;\underbrace{\frac{\partial y_i}{\partial \mathrm{net}_i}\,\frac{\partial \mathrm{net}_i}{\partial w_{ij}}}_{2}    (20.6)

i.e., the partial derivative with respect to the weight is the product of the partial derivative with respect to the PE state (part 1 in Eq. (20.6)) times the partial derivative of the local activation with respect to the weights (part 2 in Eq. (20.6)). This last quantity is exactly the same as for the nonlinear PE (f'(net_i) x_j), so the big issue is the computation of ∂J/∂y_i. For an output PE, ∂J/∂y_i becomes the injected error e in Eq. (20.4). For the hidden ith PE, ∂J/∂y_i is evaluated by summing all the errors that reach the PE from the top layer through the topology when the injected errors e_k are clamped at the top layer, or in an equation

    \frac{\partial J}{\partial y_i} = \sum_k \frac{\partial J}{\partial y_k}\,\frac{\partial y_k}{\partial \mathrm{net}_k}\,\frac{\partial \mathrm{net}_k}{\partial y_i} = \sum_k e_k\, f'(\mathrm{net}_k)\, w_{ki}    (20.7)

Substituting back into Eq. (20.6) we finally get

    \frac{\partial J}{\partial w_{ij}} = \underbrace{-\,x_j\, f'(\mathrm{net}_i)}_{1}\;\underbrace{\left(\sum_k e_k\, f'(\mathrm{net}_k)\, w_{ki}\right)}_{2}    (20.8)

FIGURE 20.7 How to adapt the weights connected to the ith PE.

This equation embodies the back-propagation training algorithm [Haykin, 1994; Bishop, 1995]. It can be rewritten as the product of a local activation (part 1) and a local error (part 2), exactly as with the LMS and delta rules. But now the local error is a composition of errors that flow through the topology, which becomes equivalent to the existence of a desired response at the PE.

There is an intrinsic flow in the implementation of the back-propagation algorithm: first, inputs are applied to the net and activations are computed everywhere to yield the output activation. Second, the external errors are computed by subtracting the net output from the desired response. Third, these external errors are utilized in Eq. (20.8) to compute the local errors for the layer immediately preceding the output layer, and the computations are chained up to the input layer. Once all the local errors are available, Eq. (20.2) can be used to update every weight. These three steps are then repeated for the other training patterns until the error is acceptable.

Step three is equivalent to injecting the external errors into the dual topology and back-propagating them up to the input layer [Haykin, 1994]. The dual topology is obtained from the original one by reversing the data flow and substituting summing junctions by splitting nodes and vice versa. The error at each PE of the dual topology is then multiplied by the activation of the original network to compute the weight updates. So, effectively, the dual topology is being used to compute the local errors, which makes the procedure highly efficient. This is the reason back-propagation trains a network of N weights with a number of multiplications proportional to N (O(N)), instead of O(N^2) for previous methods of computing partial derivatives known in control theory. Using the dual topology to implement back-propagation is the best and most general method to program the algorithm in a digital computer.
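The three-step flow just described can be written compactly. The following sketch is our own minimal formulation for a one-hidden-layer MLP with tanh PEs and no biases; it computes the local errors of Eqs. (20.5) and (20.7) and applies the updates of Eqs. (20.2) and (20.8):

```python
import numpy as np

def f(net):
    return np.tanh(net)

def fprime(net):
    return 1.0 - np.tanh(net) ** 2

def backprop_step(x, d, W1, W2, eta=0.1):
    """One pattern of back-propagation for a one-hidden-layer MLP
    (tanh PEs, biases omitted for brevity)."""
    # Step 1: forward pass, activations computed everywhere.
    net1 = W1 @ x
    y1 = f(net1)
    net2 = W2 @ y1
    y2 = f(net2)
    # Step 2: external errors at the output.
    e = d - y2
    delta2 = e * fprime(net2)                 # local error, output PEs (Eq. 20.5)
    # Step 3: chain the local errors back through the topology (Eq. 20.7).
    delta1 = fprime(net1) * (W2.T @ delta2)   # local error, hidden PEs
    # Update every weight: local error times local activation (Eqs. 20.2, 20.8).
    W2 += eta * np.outer(delta2, y1)
    W1 += eta * np.outer(delta1, x)
    return W1, W2
```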
Applying Back-Propagation in Practice

Now that we know an algorithm to train MLPs, let us examine the practical issues of applying it. We will address the following aspects: size of the training set vs. number of weights, search procedures, how to stop training, and how to set the topology for maximum generalization.

Size of Training Set

The size of the training set is very important for good performance. Remember that the ANN gets its information from the training set. If the training data do not cover the full range of operating conditions, the system may perform badly when deployed. Under no circumstances should the training set have fewer samples than the number of weights in the ANN. A good size for the training data is ten times the number of weights in the network, with the lower limit being around three times the number of weights (these values should be taken as an indication, subject to experimentation in each case) [Haykin, 1994].

Search Procedures

Searching along the direction of the gradient is fine if the performance surface is quadratic. However, in ANNs this is rarely the case, because of the use of nonlinear PEs and topologies with several layers. So gradient descent can be caught in local minima, and the search becomes very slow in regions of small curvature. One efficient way to speed up the search in regions of small curvature and, at the same time, to stabilize it in narrow valleys is to include a momentum term in the weight adaptation:

    w_{ij}(n+1) = w_{ij}(n) + \eta\,\delta(n)\,x_j(n) + \alpha\left(w_{ij}(n) - w_{ij}(n-1)\right)    (20.9)

The value of the momentum α should be set experimentally between 0.5 and 0.9. There are many more modifications to the conventional gradient search, such as adaptive step sizes, annealed noise, conjugate gradients, and second-order methods (using information contained in the Hessian matrix), but the simplicity and power of momentum learning are hard to beat [Haykin, 1994; Bishop, 1995].

How to Stop Training

The stop criterion is a fundamental aspect of training. The simple ideas of capping the number of iterations or of letting the system train until it reaches a predetermined error value are not recommended. The reason is that we want the ANN to perform well on the test set data; i.e., we would like the system to perform well on data it never saw before (good generalization) [Bishop, 1995]. The error in the training set tends to decrease with iteration when the ANN has enough degrees of freedom to represent the input/output map. However, the system may be memorizing the training patterns (overfitting) instead of finding the underlying mapping rule. This is called overtraining. To avoid overtraining, the performance on a validation set, i.e., a set of input data that the system never saw before, must be checked regularly during training (e.g., once every 50 passes over the training set). Training should be stopped when the error in the validation set starts to increase, despite the fact that the error in the training set continues to decrease. This method is called cross validation. The validation set should be 10% of the training set, and distinct from it.
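Here is a hedged sketch of the two ideas above, momentum (Eq. 20.9) and cross-validation stopping. The `train_one_epoch`, `error`, `get_weights`, and `set_weights` methods are hypothetical helpers we assume for illustration, not part of any particular library:

```python
import numpy as np

def momentum_step(w, delta, x, prev_dw, eta=0.1, alpha=0.9):
    """Weight update with momentum (Eq. 20.9): delta is the local error,
    x the local activation, and alpha (typically 0.5-0.9) reuses a
    fraction of the previous weight change to coast through flat regions
    and to stabilize the search in narrow valleys."""
    dw = eta * delta * x + alpha * prev_dw
    return w + dw, dw

def train_with_cross_validation(net, train_set, val_set,
                                max_passes=10_000, check_every=50):
    """Stop training when the validation error turns upward, keeping
    the best weights seen so far (all `net` methods are assumed)."""
    best_err, best_w = np.inf, net.get_weights()
    for n in range(max_passes):
        net.train_one_epoch(train_set)
        if n % check_every == 0:
            err = net.error(val_set)
            if err > best_err:      # validation error rising: overtraining
                break
            best_err, best_w = err, net.get_weights()
    net.set_weights(best_w)
    return net
```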
Size of the Topology

The size of the topology should also be carefully selected. If the number of layers or the size of each layer is too small, the network does not have enough degrees of freedom to classify the data or to approximate the function, and performance suffers. On the other hand, if the size of the network is too large, performance may also suffer. This is the phenomenon of overfitting that we mentioned above. One alternative way to control overfitting is to reduce the size of the network. There are basically two procedures to set the size of the network: either one starts small and adds new PEs, or one starts with a large network and prunes PEs [Haykin, 1994]. One quick way to prune the network is to impose a penalty term in the performance function, a regularizing term, such as limiting the slope of the input/output map [Bishop, 1995]. A regularization term that can be implemented locally is

    w_{ij}(n+1) = w_{ij}(n)\left(1 - \frac{\lambda}{\left(1 + w_{ij}(n)\right)^{2}}\right) + \eta\,\delta(n)\,x(n)    (20.10)

where λ is the weight decay parameter and δ the local error. Weight decay tends to drive unimportant weights to zero.

A Posteriori Probabilities

We will finish the discussion of the MLP by noting that this topology, when trained with the mean square error, is able to estimate directly at its outputs a posteriori probabilities, i.e., the probability that a given input pattern belongs to a given class [Bishop, 1995]. This property is very useful because the MLP outputs can be interpreted as probabilities and operated on as numbers. In order to guarantee this property, one has to make sure that each class is attributed to one output PE, that the topology is sufficiently large to represent the mapping, that the training has converged to the absolute minimum, and that the outputs are normalized between 0 and 1. The first three requirements are met by good design, while the last can be easily enforced if the softmax activation is used for the output PEs [Bishop, 1995]:

    y_i = \frac{\exp(\mathrm{net}_i)}{\sum_j \exp(\mathrm{net}_j)}    (20.11)

20.3 Radial Basis Function Networks

The radial basis function (RBF) network constitutes another way of implementing arbitrary input/output mappings. The most significant difference between the MLP and the RBF lies in the PE nonlinearity. While the PE in the MLP responds to the full input space, the PE in the RBF is local, normally a Gaussian kernel in the input space. Hence, it only responds to inputs that are close to its center; i.e., it has basically a local response.

FIGURE 20.8 Radial basis function (RBF) network.

The RBF network is also a layered net, with the hidden layer built from Gaussian kernels and a linear (or nonlinear) output layer (Fig. 20.8). Training of the RBF network is normally done in two stages [Haykin, 1994]: first, the centers x_i are adaptively placed in the input space using competitive learning or k-means clustering [Bishop, 1995], which are unsupervised procedures. (Competitive learning is explained later in the chapter.) The variance of each Gaussian is chosen as a percentage (30 to 50%) of the distance to the nearest center. The goal is to cover the input data distribution adequately. Once the RBF centers are located, the second-layer weights w_i are trained using the LMS procedure. RBF networks are easy to work with, they train very fast, and they have shown good properties for both function approximation and classification. The problem is that they require lots of Gaussian kernels in high-dimensional spaces.
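As an illustration, here is a hedged sketch of the softmax normalization of Eq. (20.11) and of the forward pass of an RBF network; the names, and the max subtraction used for numerical stability, are our own choices:

```python
import numpy as np

def softmax(net):
    """Eq. (20.11): outputs are positive and sum to 1, so they can be
    read as a posteriori class probabilities."""
    z = np.exp(net - np.max(net))   # subtracting the max avoids overflow
    return z / np.sum(z)

def rbf_output(x, centers, sigmas, w):
    """RBF forward pass: a hidden layer of local Gaussian kernels
    followed by a linear output layer (trained with LMS).
    centers: (K, d), sigmas: (K,), w: (m, K), x: (d,)."""
    phi = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigmas ** 2))
    return w @ phi
```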
20.4 Time-Lagged Networks

The MLP is the most common neural network topology, but it can only handle instantaneous information, since the system has no memory and is feedforward. In engineering, the processing of signals that exist in time requires systems with memory, i.e., linear filters. Another alternative for implementing memory is to use feedback, which gives rise to recurrent networks. Fully recurrent networks are difficult to train and to stabilize, so it is preferable to develop topologies based on MLPs in which explicit subsystems that store the past information are included. These subsystems are called short-term memory structures [de Vries and Principe, 1992]. The combination of an MLP with short-term memory structures is called a time-lagged network (TLN). The memory structures can eventually be recurrent, but the feedback is local, so stability is still easy to guarantee. Here we will cover just one TLN topology, called focused, where the memory is at the input layer. The most general TLNs have memory added anywhere in the network, but they require other, more involved training strategies (BPTT [Haykin, 1994]). The interested reader is referred to de Vries and Principe [1992] for further details. The function of the short-term memory in the focused TLN is to represent the past of the input signal, while the nonlinear PEs provide the mapping, as in the MLP (Fig. 20.9).

FIGURE 20.9 A focused TLN.

Memory Structures

The simplest memory structure is built from a tap delay line (Fig. 20.10). The memory by delays is a single-input, multiple-output system that has no free parameters except its size K. The tap delay memory is the memory utilized in the time-delay neural network (TDNN), which has been applied successfully in speech recognition and system identification [Kung, 1993].

FIGURE 20.10 Tap delay line memory.

A different mechanism for linear memory is feedback (Fig. 20.11). Feedback allows the system to remember past events because of the exponential decay of the response. This memory has limited resolution because of the low-pass filtering required for long memories. But notice that, unlike the memory by delay, memory by feedback provides the learning system with a free parameter μ that controls the length of the memory. Memory by feedback has been used in the Elman and Jordan networks [Haykin, 1994].

FIGURE 20.11 Memory by feedback (context PE).

It is possible to combine the advantages of memory by feedback with those of memory by delays in linear systems called dispersive delay lines. The most studied of these memories is a cascade of low-pass functions called the gamma memory [de Vries and Principe, 1992]. The gamma memory has a free parameter μ that controls and decouples the memory depth from the resolution of the memory. Memory depth D is defined as the first moment of the impulse response from the input to the last tap K, while memory resolution R is the number of taps per unit time. For the gamma memory, D = K/μ and R = μ; i.e., changing μ modifies the memory depth and resolution inversely. This recursive parameter μ can be adapted with the output MSE just like the other network parameters; i.e., the ANN is able to choose the best memory depth to minimize the output error, something the tap delay memory cannot do.

FIGURE 20.12 Gamma memory (dispersive delay line).
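The sketch below illustrates the gamma memory, assuming the standard per-stage low-pass form g_k(n) = (1 − μ)g_k(n − 1) + μ g_{k−1}(n − 1) from de Vries and Principe [1992]; the function name and layout are ours:

```python
import numpy as np

def gamma_memory(x, K, mu):
    """Run an input signal x through a K-tap gamma memory. Each tap is a
    leaky integrator fed by the previous tap:
        g_k(n) = (1 - mu) * g_k(n-1) + mu * g_{k-1}(n-1)
    so depth D = K/mu and resolution R = mu. Setting mu = 1 recovers the
    plain tap delay line."""
    n_samples = len(x)
    g = np.zeros((K + 1, n_samples))
    g[0] = x                          # tap 0 is the input itself
    for n in range(1, n_samples):
        for k in range(1, K + 1):
            g[k, n] = (1 - mu) * g[k, n - 1] + mu * g[k - 1, n - 1]
    return g[1:]                      # taps 1..K feed the MLP input layer
```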
Training-Focused TLN Architectures

The appeal of the focused architecture is that the MLP weights can still be adapted with back-propagation. However, the input/output mapping produced by these networks is static. The input memory layer brings in past input information to establish the value of the mapping. As we know from engineering, the size of the memory is fundamental for identifying, for instance, an unknown plant, or for performing prediction with a small error. But note that with the focused TLN the models for system identification become nonlinear (i.e., nonlinear moving average, NMA). When the tap delay implements the short-term memory, straight back-propagation can be utilized, since the only adaptive parameters are the MLP weights. When the gamma memory is utilized (or the context PE), the recursive parameter is adapted in a totally adaptive framework (or the parameter is preset by some external consideration). The equations to adapt the context PE and the gamma memory are shown in Figs. 20.11 and 20.12, respectively. For the context PE, δ(n) refers to the total error that is back-propagated from the MLP and reaches the dual context PE.

20.5 Hebbian Learning and Principal Component Analysis Networks

Hebbian Learning

Hebbian learning is an unsupervised learning rule that captures the similarity between an input and an output through correlation. To adapt a weight w_i using Hebbian learning, we adjust the weights according to Δw_i = η x_i y, or in an equation [Haykin, 1994],

    w_i(n+1) = w_i(n) + \eta\, x_i(n)\, y(n)    (20.12)

where η is the step size, x_i is the ith input, and y is the PE output.

FIGURE 20.13 Hebbian PE.

The output of the single PE is an inner product between the input and the weight vector (formula in Fig. 20.13). It measures the similarity between the two vectors: if the input is close to the weight vector the output y is large; otherwise it is small. The weights are computed by an outer product of the input X and output Y, i.e., W = XY^T, where T means transpose. The problem with Hebbian learning is that it is unstable; i.e., the weights keep growing with the number of iterations [Haykin, 1994]. Oja proposed stabilizing the Hebbian rule by normalizing the new weight by its size, which gives the rule [Haykin, 1994]

    w_i(n+1) = w_i(n) + \eta\, y(n)\left[x_i(n) - y(n)\, w_i(n)\right]    (20.13)

The weights now converge to finite values. They still define in the input space the direction where the data cluster has its largest projection, which corresponds to the eigenvector with the largest eigenvalue of the input correlation matrix [Kung, 1993]. The power of the PE output equals the largest eigenvalue of the input correlation matrix.

Principal Component Analysis

Principal component analysis (PCA) is a well-known technique in signal processing that is used to project a signal onto a signal-specific basis. The importance of PCA is that it provides the best linear projection to a subspace in terms of preserving the signal energy [Haykin, 1994]. Normally, PCA is computed analytically through a singular value decomposition. PCA networks offer an alternative to this computation by providing an iterative implementation that may be preferred for real-time operation in embedded systems.

FIGURE 20.14 PCA network.

The PCA network is a one-layer network with linear processing elements (Fig. 20.14). One can extend Oja's rule to multiple output PEs (fewer than or equal to the number of input PEs) according to the formula shown in Fig. 20.14, which is called Sanger's rule [Haykin, 1994]. The rows of the weight matrix (which contain the weights connected to the output PEs, in descending order of eigenvalue) are the eigenvectors of the input correlation matrix. If we set the number of output PEs equal to M < D, we will be projecting the input data onto the M largest principal components, and the output powers will be given by the M largest eigenvalues. Note that we are performing an eigendecomposition through an iterative procedure.
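A minimal sketch of Oja's rule (Eq. 20.13) follows; with a small enough step size the weight vector converges to the dominant eigenvector of the input correlation matrix. The function name and initialization are our own choices:

```python
import numpy as np

def oja_rule(X, eta=0.01, epochs=200):
    """Train a single linear PE with Oja's rule (Eq. 20.13). The weight
    vector converges, with unit norm, to the principal eigenvector of
    the input correlation matrix."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = np.dot(w, x)            # inner product: similarity to w
            w += eta * y * (x - y * w)  # Hebbian term minus normalization
    return w
```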
Associative Memories

Hebbian learning is also the rule used to create associative memories [Zurada, 1992]. The most utilized associative memory implements heteroassociation, where the system is able to associate an input X with a designated output Y, which can be of a different dimension (Fig. 20.15). So, in heteroassociation the signal Y works as the desired response.

FIGURE 20.15 Associative memory (heteroassociation).

We can train such a memory using Hebbian learning or LMS, but LMS provides a more efficient encoding of the information. Associative memories differ from conventional computer memories in several respects. First, they are content addressable, and the information is distributed throughout the network, so they are robust to noise in the input. With nonlinear PEs or recurrent connections (as in the famous Hopfield network) [Haykin, 1994], they display the important property of pattern completion; i.e., when the input is distorted or only partially available, the recall can still be perfect.

FIGURE 20.16 Autoassociator.

A special case of associative memory is the autoassociator (Fig. 20.16), where the training output of size D is equal to the input signal (also of size D) [Kung, 1993]. Note that the hidden layer has fewer PEs (M << D) than the input (a bottleneck layer), and W_1 = W_2^T is enforced. The function of this network is one of encoding or data reduction. The training of this network (the W_2 matrix) is done with LMS. It can be shown that this network also implements PCA with M components, even when the hidden layer is built from nonlinear PEs.

20.6 Competitive Learning and Kohonen Networks

Competition is a very efficient way to divide the computing resources of a network. Instead of having each output PE more or less sensitive to the full input space, as in the associative memories, in a competitive network each PE specializes in a piece of the input space and represents it [Haykin, 1994].

FIGURE 20.17 Competitive neural network.

Competitive networks are linear, single-layer nets (Fig. 20.17). Their functionality is directly related to the competitive learning rule, which belongs to the unsupervised category. First, only the PE that has the largest output gets its weights updated. The weights of the winning PE are updated according to the formula in Fig. 20.17, in such a way that they approach the present input. The step size controls how large this adjustment is (see Fig. 20.17). Notice that there is an intrinsic nonlinearity in the learning rule: only the PE with the largest output (the winner) has its weights updated; all the other weights remain unchanged. This is the mechanism that allows the competitive net PEs to specialize.

Competitive networks are used for clustering; i.e., a net with M output PEs will seek M clusters in the input space. The weights of each PE will correspond to the center of mass of one of the M clusters of input samples. When a given pattern is shown to the trained net, only one of the outputs will be active, and it can be used to label the sample as belonging to one of the clusters. No more information about the input data is preserved.

Competitive learning is one of the fundamental components of the Kohonen self-organizing feature map (SOFM) network, which is also a single-layer network with linear PEs [Haykin, 1994]. Kohonen learning creates annealed competition in the output space by adapting not only the winner PE weights but also those of its spatial neighbors, using a Gaussian neighborhood function Λ. The output PEs are arranged in linear or two-dimensional neighborhoods (Fig. 20.18).

FIGURE 20.18 Kohonen SOFM.

Kohonen SOFM networks produce a mapping from the continuous input space to the discrete output space that preserves topological properties of the input space (i.e., local neighbors in the input space are mapped to neighbors in the output space).
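Below is a hedged sketch of both update rules. We pick the winner as the PE whose weight vector is nearest the input, which for normalized inputs is equivalent to picking the PE with the largest output; all names are our own:

```python
import numpy as np

def competitive_step(W, x, eta=0.1):
    """Winner-take-all update: only the winning PE moves toward the
    present input; all other weights remain unchanged."""
    winner = np.argmin(np.sum((W - x) ** 2, axis=1))
    W[winner] += eta * (x - W[winner])
    return winner

def kohonen_step(W, grid, x, eta=0.1, sigma=1.0):
    """SOFM update: the winner and its spatial neighbors on the output
    grid move toward x, weighted by a Gaussian neighborhood function.
    grid holds the (fixed) map coordinates of each output PE."""
    winner = np.argmin(np.sum((W - x) ** 2, axis=1))
    d2 = np.sum((grid - grid[winner]) ** 2, axis=1)   # distances on the map
    neighborhood = np.exp(-d2 / (2.0 * sigma ** 2))
    W += eta * neighborhood[:, None] * (x - W)
    return winner
```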
During training, both the spatial neighborhood and the learning constant are decreased slowly, by starting with a large neighborhood σ_0 and decreasing it (N_0 controls the scheduling). The initial step size η_0 also needs to be scheduled (by K). The Kohonen SOFM network is useful for projecting the input onto a subspace, as an alternative to PCA networks. The topological properties of the output space provide more information about the input than straight clustering.

References

C. M. Bishop, Neural Networks for Pattern Recognition, New York: Oxford University Press, 1995.
B. de Vries and J. C. Principe, "The gamma model — a new neural model for temporal processing," Neural Networks, vol. 5, pp. 565–576, 1992.
S. Haykin, Neural Networks: A Comprehensive Foundation, New York: Macmillan, 1994.
S. Y. Kung, Digital Neural Networks, Englewood Cliffs, N.J.: Prentice-Hall, 1993.
J. M. Zurada, Introduction to Artificial Neural Systems, St. Paul, Minn.: West Publishing, 1992.

Further Information

The literature in this field is voluminous. We decided to limit the references to textbooks for an engineering audience, with different levels of sophistication. Zurada is the most accessible text, Haykin the most comprehensive. Kung provides interesting applications of PCA networks and of nonlinear signal processing and system identification. Bishop concentrates on the design of pattern classifiers. Interested readers are directed to the following journals for more information: IEEE Transactions on Signal Processing, IEEE Transactions on Neural Networks, Neural Networks, Neural Computation, and the Proceedings of the Neural Information Processing Systems Conference (NIPS).