Information Theory and Machine Learning

David Kaye

April 25, 2008

Abstract

This project will start by looking at classical information theory, covering many results from Claude Shannon's 1948 paper A Mathematical Theory of Communication, before moving on to machine learning. This will be covered in the context of neural networks, as they are an influential field of study with many modern applications. Finally it will look at quantum information theory, comparing and contrasting it with the classical theory, before moving on to introduce the reader to quantum neural networks and some of their applications.

Contents

1 Introduction

Part I: Classical Information Theory

2 Introduction to Information Theory
   2.1 Binary Symmetric Channel
   2.2 Linear Codes
   2.3 Error Correcting Codes
       2.3.1 Repetition Codes
       2.3.2 Block Codes

3 Probability, Information and Entropy
   3.1 Ensembles
   3.2 Probability Book-Keeping
   3.3 Information and Entropy
   3.4 Examples

4 Information Coding
   4.1 Introduction
   4.2 Source Coding
   4.3 Symbol Codes
   4.4 Further Entropy and Information
   4.5 The Noisy Channel Coding Theorem

Part II: Machine Learning and Neural Networks

5 Introduction to Neural Networks
   5.1 Neurons
   5.2 Neural Networks

6 Learning Types
   6.1 Supervised Learning
   6.2 Reinforcement Learning
   6.3 Unsupervised Learning
   6.4 Goals for Learning

7 Learning Algorithms
   7.1 Error Correction Learning
   7.2 Hebbian Learning
   7.3 Competitive Learning
   7.4 Self Organising Maps

8 Information Theory Applied to Neural Networks
   8.1 Capacity of a Single Neuron
   8.2 Network Architecture for Associative Memory
   8.3 Memories
   8.4 Unlearning
   8.5 Capacity of a Hopfield Network
Part III: Quantum Information Theory and the Cutting Edge

9 Quantum Mechanics
   9.1 Motivation
   9.2 Qubits
   9.3 The EPR Paradox
   9.4 Entanglement
   9.5 The Bell States
   9.6 The No Cloning Theorem

10 Quantum Entropy and Information
   10.1 Von Neumann Entropy
   10.2 Quantum Distance Measures
   10.3 Quantum Error Correction
   10.4 Quantum Teleportation
   10.5 Dense Coding
   10.6 Quantum Data Compression

11 Quantum Neural Networks
   11.1 Motivation
   11.2 Architecture
   11.3 Training Quantum Neural Networks
   11.4 Quantum Associative Memory
   11.5 Implementation
   11.6 Performance

12 Conclusion

A Notation

Bibliography

Chapter 1

Introduction

When we wish to communicate with somebody we generally need to do so over an imperfect channel, be it a crackling telephone line, a noisy room or even just an unreliable internet connection. The channel will usually add noise to whatever we are saying or sending, and so we need to protect against this. One way of doing this is to add redundancy to the message, allowing the recipient to check the message against the redundancy in order to detect, and hopefully correct, any errors that have occurred. Unfortunately this decreases the rate at which we can communicate: if we are capable of sending 10 megabytes per second, but the tenth is purely for correcting errors (and therefore contains no new information), then our effective rate of communication is just 9 megabytes per second.

To overcome this difficulty we can try to compress the data we wish to send. If we can find a way to convey 10 megabytes' worth of information using only 9 megabytes, then we can use the tenth for error correction. This means that, for the same rate of communication, we can now correct some of the errors that will undoubtedly occur while the message is in transit. It was Claude Shannon who pioneered the field of information theory, the focus of which is to provide solutions to the problem of communicating over different types of channels. It is this that we will focus on in Part I.
Error correcting codes work by having a list of permitted words and attempting to correct any deviations from these allowed forms. We can therefore view error correction as a form of pattern recognition, a huge field in its own right. Popular tools in this area are so-called 'neural networks': computational models inspired by the neurons in our own brains. As one might expect, given our natural aptitude at recognising blurred pictures of people, trees and fruit, these networks are inherently suited to pattern recognition and classification (not to mention various other tasks). Part II will look at neural networks, covering their similarities to, and differences from, the biological neural networks found within our crania. It will focus on the ways in which they can store patterns and other data. We will then proceed to examine the limits of these neural networks: how much information can they store? How reliable are they? And what happens if we try to store too much information in them?

Next we shall take a brief look at the history of physics in order to prepare ourselves for some results to follow. In the early twentieth century it was discovered that the laws of classical physics (for example, those of Sir Isaac Newton) did not provide a complete description of the world we inhabit. Experiments had shown that under certain conditions particles such as electrons display properties that are unquestionably wave-like, and photons (hitherto considered purely as waves) were shown to have distinctly particle-like properties. The theory surrounding these observations was dubbed quantum mechanics.

These findings were not merely physical curiosities; they turned out to have wide-reaching implications for almost all areas of science. Information theory, for example, is turned on its head when we incorporate the laws of quantum mechanics. Under the quantum paradigm, many of the activities we take for granted (like copying unknown data as and when we please) can no longer be done: quantum information simply does not behave in the same way as classical information. It is the behaviour and control of this new type of information that we shall spend the majority of Part III discussing. For the remainder we shall delve into an exciting new field: that of quantum neural networks. It is a highly speculative area, so much so that research is still underway to ascertain what they are capable of and how they differ in form and function from the (classically formulated) neural networks covered in Part II.

Part I

Classical Information Theory

Chapter 2

Introduction to Information Theory

2.1 Binary Symmetric Channel

Suppose Alice wishes to send a message (an email perhaps) to Bob using her computer. This message will consist of a string of digits x, each element of which can be either 0 or 1. Note that any string of binary digits of length N may be considered as an element of an N-dimensional vector space defined over the field of integers modulo 2. This vector will need to cross some kind of channel, like a phone line, over which there will be some noise. As a result of the noise the received vector y may differ from x. We shall make the following assumptions about the channel:

1. The channel only transmits 1s and 0s; this is called a binary channel.

2. The probability of yi differing from xi is equal to f, i.e.

   P(yi = 0 | xi = 1) = f,    (2.1a)
   P(yi = 1 | xi = 0) = f.    (2.1b)

Therefore the probability of yi being the same as xi is equal to 1 - f. This is called a symmetric channel.
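To make the channel model concrete, the following short Python sketch (my own illustration, not part of the original project; the function name and the fixed random seed are arbitrary choices) simulates a binary symmetric channel by flipping each transmitted bit independently with probability f.

import random

def binary_symmetric_channel(x, f, rng=random.Random(0)):
    # Transmit a list of bits over a BSC: each bit is flipped
    # independently with probability f.
    return [bit ^ 1 if rng.random() < f else bit for bit in x]

# Send a short message over a fairly noisy channel (f = 0.1).
x = [0, 1, 0, 1, 1, 0, 0]
y = binary_symmetric_channel(x, f=0.1)
errors = sum(xi != yi for xi, yi in zip(x, y))
print(y, "bit errors:", errors)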
Figure 2.1: The binary symmetric channel.

Computers store and manipulate all data in binary form. The storage medium may also be viewed as a channel; however, instead of transporting the data from one spatial location to another it transports them to a different temporal location. In an ideal world our storage media would be error free and we could utilise their entire capacity for storing data. Unfortunately our media are not perfect, and as such some of their capacity must be used for bookkeeping data in order to detect and correct errors.

Given a hard disk with a bit error probability of f there are two ways to increase its reliability. First there are physical methods. These aim to directly decrease f by making the hard disk from better quality components, making it airtight and cooling it to reduce thermal noise. All of these methods, whilst effective, are expensive and increase the financial cost of the channel. The second method is the 'system method', which involves adjusting the system itself in order to make it error tolerant. System methods involve encoding the source message s to add redundancy; it is this encoded message (denoted t) that is transmitted over the binary symmetric channel. The channel adds noise to t, meaning that in general the received message r differs from t. r is then passed through a decoder, which uses the redundancy to extract (or attempt to extract) the transmitted message t and the noise that was added to it. The redundancy is then stripped and (if the error correction has been successful) the source message s is recovered. This method has the advantage of providing error resistance at little or no additional financial cost; however, there is an increased cost in terms of computational resources.

There is a particular class of codes which we shall be looking at due to their very useful properties. These are called linear codes, for reasons that shall become apparent.

2.2 Linear Codes

A code C over a field F (meaning that elements of C are constructed from elements of F) is linear if:

1. for any two codewords u, v in C we have u + v in C;

2. for all codewords u, and elements a in F, we have a.u also a codeword (note that F = {0, 1} for the binary symmetric channel).

From this definition it follows that the null vector (a string of all 0s) is a member of every linear code. A linear code may also be viewed as a group under addition, a feature that has led some authors to call them group codes.

Given a vector x, we define its weight w(x) to be the number of non-zero elements of x. For example if x = (01100101), then w(x) = 4. We will only be considering binary codes, so it is possible to define the distance between codewords x and y as d(x, y) = w(x - y). The minimum distance is the smallest value of d(x, y) over all pairs of codewords x and y. This brings us to the first advantage of linear codes: the smallest distance between any two codewords is equal to the smallest value of w(x) over all non-zero codewords in C. This means that if our code has M codewords, we need only make M - 1 comparisons (comparing each codeword to the null vector), unlike a nonlinear code where we would be required to make C(M, 2) = (1/2)M(M - 1) comparisons (comparing every possible pair of codewords).
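As a quick illustration of this weight/distance relationship, here is a small Python sketch of my own (not from the project; the four-codeword even-weight code is just a convenient toy example) that checks that the minimum distance of a linear code equals its minimum non-zero weight.

def weight(x):
    # Hamming weight: number of non-zero elements of a bit tuple.
    return sum(x)

def distance(x, y):
    # Hamming distance d(x, y) = w(x - y) for binary vectors.
    return sum(a != b for a, b in zip(x, y))

# A tiny linear code over GF(2): {000, 011, 101, 110}.
code = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

# For a linear code the minimum distance equals the minimum weight
# of the non-zero codewords, so M - 1 comparisons suffice.
min_weight = min(weight(c) for c in code if any(c))
min_dist = min(distance(u, v) for u in code for v in code if u != v)
assert min_weight == min_dist
print("minimum distance:", min_dist)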
The second advantage is that linear codes are easy to specify due to their group-like nature, whereas with a nonlinear code we may need to list every single codeword. For a linear code we simply need to list a set of 'basis' codewords from which the other codewords may be generated, since we know that the sum of any two codewords is also a codeword.

The final advantage of using linear codes is that they are very easy to encode and decode - these operations amount to matrix algebra. If we have a k x n matrix G whose rows form a basis of an (n, k) code (one that encodes strings of length k into codewords of length n), then G is called a generator matrix of that code. By using the elementary row and column operations

1. interchange two rows
2. multiply a row by a scalar
3. add one row to another
4. interchange two columns
5. multiply a column by a scalar,

it is possible to convert a k x n generator matrix G into what is known as standard form: G = [Ik | M], where Ik is the k x k identity matrix and M is a k x (n - k) matrix. When G is in standard form all codewords of C may be formed by multiplying the k-dimensional vector x that we seek to encode by the transpose of G, like so: y = G^T x.

2.3 Error Correcting Codes

2.3.1 Repetition Codes

Repetition codes (denoted Rn) are simple: just repeat each bit that you wish to send n (odd) times. For example, to send 0101100 using the repetition code R3 we would go through the following procedure:

s     0    1    0    1    1    0    0    source message
t     000  111  000  111  111  000  000  transmitted message (t = s + redundancy)
n     001  000  010  001  000  000  100  noise added to transmission
r     001  111  010  110  111  000  100  received message (t + n)
d     000  111  000  111  111  000  000  decoded message
s'    0    1    0    1    1    0    0    message (d - redundancy)

Table 2.1: Encoding and decoding using the R3 repetition code.

The probability of a decoding error is dominated by the probability of two bit errors occurring in a single triplet, which scales with f^2. An error would also occur if we had three errors in a triplet, but this probability scales with f^3 and is often negligible in comparison to the probability of two errors occurring. Repetition codes certainly give us a rapidly decreasing error rate (think of R5 used on a channel with a bit error probability of f = 0.05), but they also significantly restrict our rate of communication. Using a repetition code Rm, our communication rate drops to 1/m, as we are sending m bits over the channel for every bit of information in our source message. In our perfect fantasy world we would like to combine a low error probability with a high transmission rate.

2.3.2 Block Codes

Block codes take a sequence of source bits of length k and convert it to a sequence of bits of length n (n > k) to add redundancy. In a linear block code the extra n - k bits are called 'parity check' bits. For the (7, 4) Hamming code n = |t| = 7 and k = |s| = 4. To create one of the Hamming codewords, place the source word in a tri-circle diagram (figure 2.2). Set ti = si for i = 1 ... 4, then set t5 ... t7 so that the parity within each circle is even. The parity of s1 s2 s3 is 0 if s1 + s2 + s3 is even, and 1 if s1 + s2 + s3 is odd. Given that the Hamming code is linear, it can be written compactly in terms of matrices, meaning that all codewords may be written in the form t = G^T s.

s     t          s     t
0000  0000000    1000  1000101
0001  0001011    1001  1001110
0010  0010111    1010  1010010
0011  0011100    1011  1011001
0100  0100110    1100  1100011
0101  0101101    1101  1101000
0110  0110001    1110  1110100
0111  0111010    1111  1111111

Table 2.2: The sixteen source words of the Hamming code.

Figure 2.2: Tri-circle diagram.

        [ 1 0 0 0 ]
        [ 0 1 0 0 ]
        [ 0 0 1 0 ]
G^T =   [ 0 0 0 1 ]
        [ 1 1 1 0 ]
        [ 0 1 1 1 ]
        [ 1 0 1 1 ]

The generator matrix of the Hamming code, written as its transpose. Note that G^T = [I4 ; P], with the 4 x 4 identity on top and the 3 x 4 parity matrix P formed by the last three rows.
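Written as code, the encoding step is just this matrix multiplication. The sketch below is my own (not the project's code) and uses NumPy; the matrix P is read off from the generator matrix above.

import numpy as np

# G^T = [I4; P] as given above: identity on top, parity rows below.
P = np.array([[1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]])
GT = np.vstack([np.eye(4, dtype=int), P])

def hamming_encode(s):
    # t = G^T s (mod 2): the first four bits are the source word,
    # the last three are the parity checks.
    return GT @ np.array(s) % 2

print(hamming_encode([0, 0, 0, 1]))   # -> [0 0 0 1 0 1 1], matching Table 2.2
print(hamming_encode([1, 1, 1, 1]))   # -> [1 1 1 1 1 1 1]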
Now that we have a swish new encoder we need a correspondingly swish decoder. Whilst decoding it is important to remember that any of the transmitted bits may be flipped - even our parity bits. We will make one further assumption here, namely that we have no idea which codewords will be sent (i.e. as far as we know they are all equally likely). For a given received vector r the optimal decoder selects the codeword t that differs from r in the fewest places. There is more than one method of doing this. One very simple method would be to compare r to each of the codewords t one by one, counting the number of places where ri differs from ti and selecting the codeword which minimises these discrepancies after the 16 comparisons. For a small code such as the (7, 4) Hamming code this inefficiency is not too troublesome; however, if we generalise the Hamming code to an (n, k) code then we need to perform 2^k comparisons per codeword received. As k increases this becomes devastatingly inefficient, and as a result this method is rarely used.

The pattern of parity violations is called the syndrome. If we have parity violations in circles two and three then the syndrome, denoted z, is (0, 1, 1). It follows from this definition that for all codewords t we have z = (0, 0, 0). The syndrome is calculated using the following formula:

syndrome = (parity calculated from r1...r4) + (parity received in r5...r7) [modulo 2].

Once we have found the circles with odd parity we must search for the smallest number of flipped bits that could produce this parity violation. Given z = (0, 1, 1) we need to flip a bit that lies only in circles two and three. Luckily a unique such bit exists for each possible syndrome, and so we can build up a map between the syndrome and the flipped bit.

z            000   001   010   011   100   101   110   111
perpetrator  none  r7    r6    r4    r5    r1    r2    r3

Table 2.3: The eight possible syndromes for the Hamming code.

The above syndromes could all be caused by more complex bit-flip patterns. For example, upon receiving the string 0110101 [syndrome 100] we can see that the error could lie either with r5 or with r3 and r4 together. However a larger number of flipped bits is necessarily less likely, so we choose to flip r5. Using this method of decoding we can see that if one bit is flipped the error is detected and corrected; however, if two bits are flipped then the true error is not identified, and thus the 'correction' applied to r leads to 3 bit errors. If r3 and r4 had in fact been the culprits then our decoding would have given us the codeword 0110001, i.e. s' = 0110, rather than the true source word 0101.

We have our matrix G^T = [I4 ; P], so now we define a matrix H = [P | I3] to compute the syndrome. This is a linear operation, performed by premultiplying r by H: z = Hr. All of the received vectors can (by definition) be written in the form r = G^T s + n, meaning that the syndrome is equal to HG^T s + Hn. Note that HG^T = 0, so our syndrome is calculated from Hn alone. In essence, the problem we are facing with syndrome decoding is: given Hn, find the most probable vector n that gives this particular value of z. Any decoder which solves this problem is known as a maximum likelihood decoder, for reasons which are hopefully clear.
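A direct implementation of this syndrome decoder might look as follows. This is a sketch of my own (not code from the project); it rebuilds H = [P | I3] from the parity rows given earlier and handles only the single-bit-error case described above.

import numpy as np

P = np.array([[1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]])
H = np.hstack([P, np.eye(3, dtype=int)])    # H = [P | I3]

def syndrome_decode(r):
    # Compute z = Hr (mod 2); if z is non-zero, flip the single bit whose
    # column of H equals z (the 'perpetrator' of Table 2.3).
    r = np.array(r)
    z = H @ r % 2
    if z.any():
        i = next(i for i in range(7) if np.array_equal(H[:, i], z))
        r[i] ^= 1
    return r[:4]                             # strip the parity bits

t = [0, 1, 0, 1, 1, 0, 1]                    # codeword for s = 0101 (Table 2.2)
r = t.copy(); r[4] ^= 1                      # flip r5 in transit
print(syndrome_decode(r))                    # -> [0 1 0 1], the source word recovered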
The probability of block error (denoted pB) is the probability that the decoded message and the source message are not in fact the same, P(s' differs from s). The probability of bit error is the average probability that a particular decoded bit does not match its corresponding source bit,

pb = (1/k) Sum_{i=1}^{k} P(s'i differs from si).

For the Hamming code pB is the probability of more than one of the seven transmitted bits being flipped,

pB = Sum_{m=2}^{7} C(7, m) f^m (1 - f)^(7-m),

which to leading order scales as O(f^2), exactly the same as our R3 code. However, the Hamming code has a much higher rate of transmission - a rate of 4/7 (four source bits for every seven transmitted bits), a significant contrast to the R3 code with its measly 1/3 transmission rate.

In his groundbreaking 1948 paper (entitled A Mathematical Theory of Communication) Claude Shannon proved that it is possible to have an arbitrarily small probability of bit error combined with a non-zero rate of transmission. This is the Noisy Channel Coding Theorem and will be discussed in chapter 4. Before we are able to truly appreciate this result, we must define and explain some new concepts.

Chapter 3

Probability, Information and Entropy

3.1 Ensembles

We start by defining an ensemble X as a triplet (x, Ax, Px) where x is a random variable, Ax is the set of values that x may take, and Px is the corresponding probability that x will take each value of Ax. To illustrate this we will look at a real ensemble consisting of the letters of the English alphabet (from now on we will always assume that we are dealing with the English alphabet) and their various probabilities when a character is drawn at random from a block of standard English text. The values of Ax and Px are shown below:

Ax  Px       Ax  Px       Ax     Px
a   0.0575   j   0.0006   s      0.0567
b   0.0128   k   0.0084   t      0.0706
c   0.0263   l   0.0335   u      0.0334
d   0.0285   m   0.0235   v      0.0069
e   0.0913   n   0.0596   w      0.0119
f   0.0173   o   0.0689   x      0.0073
g   0.0133   p   0.0192   y      0.0164
h   0.0313   q   0.0008   z      0.0007
i   0.0599   r   0.0508   space  0.1928

Table 3.1: Probability of each letter of the alphabet in standard English text.

Tables such as this are easily available on the internet and are widely used to defeat substitution ciphers (a cipher is a code whose purpose is to hide or obfuscate the source message) such as rot13. In this cipher each letter is assigned a number (a = 1, b = 2, etc.), then each number n is mapped to n + 13 modulo 26, giving rot13(a) = n, rot13(p) = c and so on. This is obviously a very simple substitution cipher, the use of which has been consigned to the history books along with other substitution ciphers due to the negligible protection they give when faced with modern computers. This is not to say that they are useless, nor that they ever were, simply that they do not offer any meaningful protection when an adversary has even modest computational resources.
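For concreteness, here is a tiny Python sketch of the rot13 cipher just described (my own illustration; it shifts letters by 13 places and leaves other characters alone). Since 13 + 13 = 26, applying it twice recovers the original text, which is part of why it offers no real protection.

def rot13(text):
    # Map each letter a..z to the letter 13 places later (wrapping around).
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base + 13) % 26 + base))
        else:
            out.append(ch)            # leave spaces and punctuation untouched
    return ''.join(out)

print(rot13("attack at dawn"))          # -> "nggnpx ng qnja"
print(rot13(rot13("attack at dawn")))   # applying it twice gives the original back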
A joint ensemble XY is a pair of ensembles where the outcome is an ordered pair (x, y), where x can be any member of Ax = (a1, a2, ...) with probabilities Px = (px1, px2, ...), and y can be any member of By with probabilities Py, each defined similarly. There is no requirement that x and y be independent. For example, given the binary values of the lowercase letters in A.S.C.I.I. (the American Standard Code for Information Interchange), we can define x to be the first four digits and y to be the last four digits of the code, as displayed in Table 3.2.

letter  binary value  x     y       letter  binary value  x     y
a       01100001      0110  0001    o       01101111      0110  1111
b       01100010      0110  0010    p       01110000      0111  0000
c       01100011      0110  0011    q       01110001      0111  0001
d       01100100      0110  0100    r       01110010      0111  0010
e       01100101      0110  0101    s       01110011      0111  0011
f       01100110      0110  0110    t       01110100      0111  0100
g       01100111      0110  0111    u       01110101      0111  0101
h       01101000      0110  1000    v       01110110      0111  0110
i       01101001      0110  1001    w       01110111      0111  0111
j       01101010      0110  1010    x       01111000      0111  1000
k       01101011      0110  1011    y       01111001      0111  1001
l       01101100      0110  1100    z       01111010      0111  1010
m       01101101      0110  1101    space   00100000      0010  0000
n       01101110      0110  1110

Table 3.2: Binary expansion of the lowercase A.S.C.I.I. letters.

As Table 3.1 tells us, given the first four digits (x1, x2, x3, x4) of the code, some strings (y1, y2, y3, y4) (i.e. letters) will be more likely to occur than others. Regrettably we must now review some simple rules of probability in order to fully prepare ourselves for one of Shannon's discoveries.

3.2 Probability Book-Keeping

Given a random variable X taking its outcome x from Ax and a random variable Y taking outcomes y from By, denote the probability that x takes the value ai by P(x = ai), also written P(ai) or P(x). Given a subset K of Ax, the probability that x takes a value in K is given by P(x in K) = Sum_{k in K} P(x = k). The probability P(ai) is called the marginal probability of x = ai and is defined by summation over y: P(ai) = Sum_{y in By} P(ai, y).

The joint probability of (ai, bj) is the probability of x = ai and y = bj occurring together. It is used to define the conditional probability P(ai | bj) = P(ai, bj) / P(bj), but beware: if P(bj) = 0 then this is undefined (as one would expect). We must also not forget the following rules:

Product (chain) rule:  P(x, y | F) = P(x | y, F) x P(y | F).    (3.1)

Sum rule:  P(x | F) = Sum_y P(x, y | F) = Sum_y P(x | y, F) x P(y | F).    (3.2)

Independence:  x and y are independent if and only if P(x, y) = P(x) x P(y).    (3.3)

3.3 Information and Entropy

The Shannon information content of an outcome x = ai is defined to be

h(ai) = log 1/P(ai)    (3.4)

(unless otherwise stated, all logarithms will be to base 2) and is measured in bits (though this is unrelated to binary digits). The entropy of an ensemble X is the average Shannon information content of its outcomes, and is given by

H(X) = Sum_{x in Ax} P(x) x log 1/P(x)    (3.5)

[note that P(x) = 0 implies 0 x log(1/0), which we define to be 0, just as for the limit]. The entropy of X is a measure of the uncertainty in the value that X will take. Since 0 <= P(x) <= 1 for all values of x (by definition of probability), 1/P(x) >= 1, meaning that log(1/P(x)) >= 0, giving us our result that H(X) >= 0, with equality if and only if pi = 1 for one of the i's. It should be noted that on occasion the entropy is written as H(p), where p is a vector consisting of the probabilities of the outcomes xi, so p = (px1, px2, ..., pxn).

The joint entropy of X and Y is defined by the following equation:

H(X, Y) = Sum_{x, y in Ax, Ay} P(x, y) x log 1/P(x, y).    (3.6)

As one might expect, the entropy of two independent random variables is additive, so H(X, Y) = H(X) + H(Y) if X and Y are independent. This describes the situation P(x, y) = P(x) x P(y). Taking the logarithm of |Ax| gives us an upper bound for the entropy, so H(X) <= log |Ax|.
The entropy is maximised (we have equality) if Px is uniform, i.e. p1 = p2 = ... = pn = 1/|Ax|. The redundancy of an ensemble measures the difference between H(X) and its maximum possible value, log |Ax|:

redundancy = 1 - H(X) / log |Ax|.    (3.7)

All of the preceding results have referred to discrete random variables where |Ax| is finite. However, the concepts involved generalise to continuous random variables (use the probability density) and infinite sets (where we must be aware that H(X) may tend to infinity).

The relative entropy between two probability distributions P(X) and Q(X) (both defined over Ax) is

DKL(P||Q) = Sum_x P(x) x log(P(x)/Q(x)).    (3.8)

It is important to note that DKL(P||Q) is not the same as DKL(Q||P). The relative entropy is sometimes known as the Kullback-Leibler divergence. The relative entropy satisfies Gibbs' inequality, which states that DKL(P||Q) >= 0.

At this point in the proceedings it is advisable to make a special mention of the binary entropy function H2(f). This function describes the entropy of a random variable that may take the value 0 or 1, and does so with probabilities 1 - f and f respectively. It takes its maximum at f = 0.5, in agreement with our earlier result. Written explicitly, the binary entropy function is

H2(f) = f x log(1/f) + (1 - f) x log(1/(1 - f)).    (3.9)

The binary entropy function is useful because it allows us to understand one of Shannon's theorems, namely the Noisy Channel Coding Theorem. This states that for a binary symmetric channel with probability of bit-flip f, the maximum rate of information transfer is given by subtracting H2(f) from 1, or symbolically:

C(f) = 1 - H2(f) = 1 - [f log(1/f) + (1 - f) log(1/(1 - f))].    (3.10)

3.4 Examples

We have looked at some relatively abstract features of probabilities and ensembles, so now is the time to illustrate their meaning by looking at some numerical examples. Taking our ensemble as the alphabet, we can see that the information content of the letter v being selected is given by

h(v) = log 1/P(v) = log(1/0.0069) = 7.1792,

so we get approximately 7.2 bits of information when the letter v is the outcome. Let us compare this to another letter, e. The information content of e occurring is

h(e) = log 1/P(e) = log(1/0.0913) = 3.4532,

so e is worth only about 3.5 bits of information, demonstrating that less probable outcomes convey more information than more probable ones.

Denoting our alphabet ensemble by Psi, we can calculate its entropy in the usual way:

H(Psi) = Sum_{mu in Psi} P(mu) x log 1/P(mu).

Substituting in our values from Table 3.1, we can easily calculate the average Shannon information content (entropy) of Psi to be 4.11 bits.

In the next chapter we shall meet another of Shannon's groundbreaking theorems, the Source Coding Theorem, which defines limits on our data compression algorithms.
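These numbers are easy to reproduce. The short Python sketch below is my own (not the project's code); it uses the probabilities from Table 3.1 to compute the information content of v and e, evaluates the binary entropy function, and anticipates the capacity figure that reappears in the worked example of chapter 4.

from math import log2

def info_content(p):
    # Shannon information content h = log2(1/p), in bits.
    return log2(1 / p)

def entropy(probs):
    # Entropy of an ensemble: the average information content of its outcomes.
    return sum(p * log2(1 / p) for p in probs if p > 0)

# The two worked examples, using P(v) and P(e) from Table 3.1.
print(round(info_content(0.0069), 2))   # about 7.18 bits for the letter v
print(round(info_content(0.0913), 2))   # about 3.45 bits for the letter e

# The binary entropy function H2(f) and the capacity C(f) = 1 - H2(f).
H2 = lambda f: entropy([f, 1 - f])
print(H2(0.5))                          # 1.0: maximum uncertainty at f = 0.5
print(round(1 - H2(0.05), 4))           # 0.7136, as in the example of chapter 4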
Chapter 4

Information Coding

4.1 Introduction

In this chapter we will exploit the results we have obtained in order to reproduce some of Shannon's key findings.

4.2 Source Coding

Source coding is essentially data compression; it is also known as noiseless channel coding. The raw bit content of an ensemble X is denoted H0(X) and takes the value log |Ax|. The raw bit content of X provides a lower bound for the number of yes/no questions that must be asked in order to uniquely identify an outcome. The raw bit content of a joint ensemble XY is simply H0(XY) = H0(X) + H0(Y).

Lossy Compression

Lossy compression methods, such as those used by the ubiquitous MP3 (MPEG-1 Layer 3, where MPEG stands for Motion Picture Experts Group) and the increasingly popular Ogg Vorbis codecs, actually throw away information in order to achieve better compression rates. MP3 and Ogg Vorbis are most commonly used to compress audio tracks from compact discs (typically around 45 megabytes) to a manageable size (generally around 5 megabytes). JPEG compression behaves in a similar way, acting on our pictures to prevent them taking up valuable space on camera memory cards and computer hard disks.

Because lossy compression algorithms throw away data there is a chance that when we compress two different files we will end up with two identical files, meaning that we cannot uniquely identify the source. This is called a failed encoding, and its probability of occurrence is denoted delta. Our goal when using a lossy algorithm is to achieve a satisfactory trade-off between the probability of an encoding failure and the level of compression. If we risk a higher probability of failure then we will undoubtedly achieve better compression, but it is down to us to decide what value of delta is acceptable for each situation.

We can implement a lossy compression algorithm to compress text by simply leaving out uncommon letters of the alphabet; this decreases the size of our alphabet, and our encoder simply deletes any letters that it is not expecting. If we remove three letters from our alphabet, say a, z and q, then we will achieve a compressed file size calculated as follows:

1 - P(a) - P(z) - P(q) = 1 - 0.0575 - 0.0007 - 0.0008 = 0.941.

So a text file compressed using this method will be about 94% of the original file size. This gives us a probability of failure P(a) + P(q) + P(z) = 0.059, since any files differing only in the number and positions of a's, z's and q's will be indistinguishable after compression. It should be noted that this is not a terribly useful compression algorithm, but it serves to illustrate our concepts.

If we are looking to create an algorithm that will give us a particular value of delta, then what we are really trying to do is create the smallest subset S_delta of our alphabet S such that the probability of a particular chosen letter not lying in S_delta is less than or equal to delta. In our previous example we had S_delta as the alphabet excluding a, z and q. This algorithm did meet our requirement that P(x not in S_delta) <= 0.06; however, it is not optimal, as we could have removed several letters in place of a. An easy method of removing the maximum number of letters is to rearrange them in order of decreasing probability and remove letters, starting with the least likely, until the probability of failure is as close as possible to (but not greater than) delta.
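The greedy procedure just described, dropping the least probable letters until the failure probability would exceed delta, can be sketched as follows. This is my own illustration with a made-up toy distribution standing in for the full Table 3.1.

# Letter probabilities from Table 3.1 would go here; a short stand-in
# distribution is used so the sketch stays self-contained.
probs = {'e': 0.35, 't': 0.25, 'a': 0.20, 'o': 0.12, 'q': 0.05, 'z': 0.03}

def smallest_subset(probs, delta):
    # Greedily drop the least probable symbols while the total probability
    # of the dropped symbols stays <= delta; return the reduced alphabet.
    kept = sorted(probs, key=probs.get, reverse=True)
    failure = 0.0
    while kept and failure + probs[kept[-1]] <= delta:
        failure += probs[kept.pop()]
    return kept, failure

S_delta, failure = smallest_subset(probs, delta=0.06)
print(S_delta, failure)   # drops 'z' (0.03), keeps the rest: failure = 0.03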
For a particular value of delta we define the essential bit content H_delta(X) of our reduced ensemble to be log |S_delta|. As with the raw bit content, the essential bit content of X is additive. So given n independent, identically distributed random variables (X, X, ..., X), which we shall denote by X^n, we have H(X^n) = n x H(X). This result applies to H_delta(X) as well, but as n approaches infinity H_delta(X^n) becomes less and less dependent on delta: for any value of delta it is approximately equal to n times the entropy of X, i.e. H_delta(X^n) is approximately n x H(X).

Figure 4.1: H_delta(X^n) for various values of n [13].

This neatly brings us to the Source Coding Theorem, one of Shannon's discoveries, which tells us about limits on our data compression.

The Source Coding Theorem: X is an ensemble with entropy H(X) = H. Given some positive epsilon and some delta between 0 and 1, there exists some non-zero n0 in N such that for any n > n0,

| (1/n) H_delta(X^n) - H | < epsilon.    (4.1)

Or, to put it plainly, we can compress X^n into just over nH(X) bits with negligible risk of failure as n tends to infinity. On the other hand, if we try to compress it into fewer than nH(X) bits, it is almost certain that our encoding will fail.

Lossless Compression

With a lossless compression algorithm, all input files are mapped to different output files. Consequently, while some files are decreased in size, others are necessarily increased in size. One example of a common lossless algorithm is the Free Lossless Audio Codec (FLAC). Lossless compression frequently makes use of symbol codes, which we shall now discuss.

4.3 Symbol Codes

These codes are lossless; they preserve the information content of the input precisely. As a result of this, the pigeonhole principle ("You can't put N pigeons into N - K boxes unless at least one box has more than one pigeon in it.") dictates that, in order to avoid a box containing more than one pigeon (a failed encoding), they must sometimes produce encoded strings which are longer than the original input. Our goal when using symbol codes is therefore to assign shorter encodings to more probable input strings, increasing the probability that our code will compress the input.

Recalling our ensemble X = (x, Ax, Px) from the previous chapter, we shall now define Ax^N to be the set of ordered N-tuples drawing elements from Ax. We shall further define Ax^+ to be the set of all strings of finite length constructed using elements of Ax. Given these two definitions, we are now able to define a symbol code. A symbol code C for an ensemble X is a map C : Ax -> {0, 1}^+. An extended symbol code is defined similarly, C^+ : Ax^+ -> {0, 1}^+, and is made by concatenating the corresponding codewords, e.g. C^+(ai aj ak) = c(ai)c(aj)c(ak), where c(ai) denotes the codeword corresponding to an element ai of Ax; the length of this codeword is written as l(ai), or sometimes just li.

In order for a symbol code to be useful, it must have the following properties:

* any encoded string must have a unique decoding,
* it must be easy to decode,
* the code must achieve the greatest possible amount of compression.

In order to be uniquely decodable, no distinct strings may have the same encoding. So for all distinct x and y in Ax^+ we require that c(x) differs from c(y). A symbol code is easy to decode if it is possible to identify the end of a codeword as soon as it arrives. This amounts to requiring that no codeword be a prefix of another; e.g. 101 is a prefix of 10101, so a symbol code which included both of these as codewords would not be easy to decode. If this condition is met then our symbol code is called a prefix code (such codes are sometimes known as instantaneous or self-punctuating codes). These codes may be drawn as binary trees with the end of each branch representing a codeword. If there are no unused branches then the code is complete. In the figures below the shaded strings are codewords.

Figure 4.2: A prefix code (complete).

Figure 4.3: An incomplete, non-prefix code.

The expected length L(C, X) of a symbol code C for X is

L(C, X) = Sum_{x in Ax} P(x) l(x) = Sum_{i=1}^{|Ax|} pi li    (4.2)

and is bounded below by H(X) if C is uniquely decodable.
The expected length of a uniquely decodable symbol code is minimised (and equal to H(X)) if and only if the lengths of the codewords li are equal to their Shannon information content, i.e. li = log(1/pi).

If our code consists solely of codewords with length l, then we have 2^l different codewords. We may view this as each codeword having a 'cost' of 2^(-l) out of a budget of 1. We may spend this budget on codewords in various different ways: for example, if l = 2 then we might not want C = {00, 01, 10, 11}, so we might replace 00 and 01 with the string 0, which has a cost of 2 x 2^(-l) = 2^(-l+1). If we go over our 'budget' of 1 then we lose unique decodability. To see this in a trivial case we may take C = {0, 1, 10}, giving a total amount spent on codewords of 2^(-1) + 2^(-1) + 2^(-2) = 1.25, and as we can see, the string 10 may be decoded either as c(x2)c(x1) or as c(x3). This is known as Kraft's inequality, and is formally stated as:

if a symbol code is uniquely decodable then its codeword lengths satisfy Sum_i 2^(-li) <= 1.    (4.3)

If we have equality then the code is complete.

Figure 4.4: Symbol code budgets [13].

We are now in a position to state the Source Coding Theorem for symbol codes. It is an existence theorem which tells us that for an ensemble X there exists a prefix code C such that

H(X) <= L(C, X) <= H(X) + 1.    (4.4)

Whether one is able to find such a code is somewhat more problematic, and the theorem can serve both as a source of hope and taunting despair for those who try.
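Both the budget argument and the prefix property are easy to check mechanically. The sketch below is my own (with arbitrary example codes, one of them being the {0, 1, 10} counterexample above); it computes the Kraft sum and tests whether a set of codewords forms a prefix code.

def kraft_sum(lengths):
    # Left-hand side of Kraft's inequality for a set of codeword lengths.
    return sum(2 ** -l for l in lengths)

def is_prefix_code(codewords):
    # A prefix code: no codeword is a prefix of another.
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

complete_prefix = ["0", "10", "110", "111"]
over_budget = ["0", "1", "10"]            # the counterexample from the text

print(kraft_sum(len(c) for c in complete_prefix))   # 1.0 -> complete
print(is_prefix_code(complete_prefix))              # True
print(kraft_sum(len(c) for c in over_budget))       # 1.25 -> over budget
print(is_prefix_code(over_budget))                  # False ("1" is a prefix of "10")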
Figure 4.5: Relationship between the mutual information and the various entropies of X and Y.

4.4 Further Entropy and Information

A discrete memoryless channel is one that takes input from a discrete alphabet X and gives an output from another discrete alphabet Y. The probability of a particular output y being produced by an input x is given by p(y|x); these probabilities are defined (but not necessarily non-zero) for all x in X and y in Y. The channel is memoryless if this probability distribution is independent of all previous input/output pairs; in other words, using the channel does not change its properties.

Given a particular value of y, say y = yi, the conditional entropy of an ensemble X is the entropy of the probability distribution P(x|yi). It is defined in a similar way to the entropy of X:

H(X|yi) = Sum_{x in X} P(x|y = yi) x log 1/P(x|yi).    (4.5)

The conditional entropy of the two ensembles X and Y is the conditional entropy of X averaged over all the values of y:

H(X|Y) = Sum_{y in Y} P(y) x H(X|y) = Sum_{x, y in X, Y} P(x, y) x log 1/P(x|y).    (4.6)

It is the average uncertainty about x after we have learned y.

We define the chain rule for entropy as follows:

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).    (4.7)

Verbosely, the entropy of two ensembles X and Y is equal to the entropy of Y given X, added to the entropy of X. It is certainly hard to see how it could be any other way.

The mutual information between X and Y is defined by the following:

mutual information = I(X : Y) = H(X) - H(X|Y).    (4.8)

The mutual information of two ensembles is symmetric and has a minimum value of zero, which corresponds to the case where nothing about X may be inferred from a knowledge of Y. The mutual information is therefore the reduction in our uncertainty about X as a result of learning Y.

Our noisy channel consists of an input alphabet X, an output alphabet Y and a set of probabilities P(x, y) = P(y|x) x P(x). It can therefore be thought of as a joint ensemble XY.

Figure 4.6: Mutual information between X and Y over a binary symmetric channel with f = 0.15 [13].

Defining it so has the major advantage that we can now apply Bayes' theorem to the problem. Bayes' theorem simply tells us the following:

P(x|y) = P(y|x)P(x) / P(y) = P(y|x)P(x) / Sum_z P(y|z)P(z).    (4.9)

This is simple if we are sending information bitwise across the binary symmetric channel, but it also applies if X is an ensemble of codewords. Suppose we are dealing with a binary symmetric channel with a probability of bit error f = 0.1 and X = {1001, 0110, 0000}, with each codeword being equally likely (we have no prior information about which codewords will be sent), and Y the set of all 4-digit binary strings. Then, if we receive y = 0100, we can use Bayes' theorem to work out the most likely source word.
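Applying Bayes' theorem to that example is straightforward to do numerically. The following sketch is my own (not the project's code); it uses the f = 0.1 and codeword set quoted above and prints the posterior probability of each codeword given y = 0100.

def likelihood(y, x, f=0.1):
    # P(y | x) over a BSC: f for each flipped bit, 1 - f for each preserved bit.
    flips = sum(a != b for a, b in zip(x, y))
    return f ** flips * (1 - f) ** (len(x) - flips)

codewords = ["1001", "0110", "0000"]      # equally likely a priori
y = "0100"

evidence = sum(likelihood(y, x) for x in codewords) / len(codewords)
for x in codewords:
    posterior = (likelihood(y, x) / len(codewords)) / evidence
    print(x, round(posterior, 3))
# 0000 and 0110 each differ from 0100 in one place, so they tie as most probable;
# 1001 differs in three places and is far less likely.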
The information conveyed by a noisy channel is the mutual information between the input and output alphabets, so we would like to find a way to maximise this. The capacity of a channel is defined to be the maximum value that I(X : Y) can take over all probability distributions. By symmetry we can see that this corresponds to P(x = 0) = P(x = 1) = 1/2, since there is no preference for 0s or 1s in our channel. Having illustrated all of these ideas we can now move on to one of Shannon's main achievements: a description of how efficiently we may communicate over noisy channels.

4.5 The Noisy Channel Coding Theorem

The noisy channel coding theorem is in three parts, which we shall observe in turn. The first part states (quoting from Mackay [13]):

For every discrete memoryless channel, the channel capacity

C = max_{PX} I(X : Y)

has the following property. For any epsilon > 0 and R < C, for large enough n, there exists a code of length n and rate >= R and a decoding algorithm, such that the probability of block error is < epsilon.

The max over PX means that the maximisation is taken over all probability distributions on X. Ultimately the theorem says that as long as we try to communicate at a rate smaller than the channel capacity, then there will be a code of block length n (for some n, possibly large) that will allow us to communicate with an arbitrarily small probability of error.

If we are willing to accept a particular probability of error (that is not arbitrarily small) then it turns out that we are able to communicate at a rate higher than the channel capacity. The process by which we do this bears a large resemblance to the lossy data compression algorithm we looked at in the previous chapter. In order to achieve the higher rate using an (n, k) code, Arthur takes his source message and splits it up into blocks of length n. He then passes these blocks through the decoder in order to obtain corresponding blocks of length k, which he then sends over the noisy channel. Upon reception of the length-k blocks Belinda passes them through the encoder in order to retrieve length-n blocks that, she hopes, are the same as the original length-n blocks that Arthur wanted to send. By using this method, if a probability of error equal to f is deemed acceptable, then communication is possible up to a rate R(f), given by

R(f) = C / (1 - H2(f)),    (4.10)

where C is the channel capacity and H2(f) is the binary entropy function evaluated at f.

The process of selecting a source word, encoding it, its corruption by noise and its subsequent decoding defines a chain of probabilities known as a Markov chain. Our source word s is encoded to x, which is corrupted to become y, which is subsequently decoded to s-hat; the chain of probabilities is

P(s, x, y, s-hat) = P(s) x P(x|s) x P(y|x) x P(s-hat|y).    (4.11)

The data processing inequality, which states that processing data necessarily discards information, applies to this chain, telling us that I(s : s-hat) <= I(x : y). The definition of channel capacity tells us that I(x : y) <= nC, and therefore I(s : s-hat) <= nC. If the system achieves a rate R with a bit error probability f, then I(s : s-hat) is >= Rn(1 - H2(f)). However, since I(s : s-hat) > nC is not achievable, neither is R > C / (1 - H2(f)). The maximum rate at which one can reliably communicate over a noisy channel with a probability of bit error pb = p is known as the Shannon limit. Possibly the most remarkable consequence of this is that it tells us we can select an arbitrarily small probability of bit error and yet still have a non-zero rate of communication!

Example

If we have a channel with probability of bit error p = 0.05 then our channel capacity will be given by

C(0.05) = 1 - H2(0.05) = 1 - [0.05 x log(1/0.05) + 0.95 x log(1/0.95)]
        = 1 - [0.2161 + 0.0703] = 1 - 0.2864 = 0.7136.

So a probability of bit error equal to 0.05 leads to a channel capacity of about 0.71 (for each bit sent, about 0.71 bits of information are received).

Part II

Machine Learning and Neural Networks

Chapter 5

Introduction to Neural Networks

5.1 Neurons

Before we look into neural networks we must first explain what we mean by the word 'neuron'. It should be noted that we will not be looking at biological neurons, but at artificial 'idealised' neurons. We need to simplify because brains have several hundred species of neurons [3], which would make the analysis and derivations truly horrendous. These idealised neurons will obviously draw some inspiration from biological neurons, but are simplified so as to bring to the fore the features most important to us.

An artificial neuron (simply called a neuron from here onwards) consists of one or more inputs (also known as synapses, a term brought over from the biologists), labelled xi, and one output, y. Each input is assigned a 'synaptic strength/weight' denoted wi; this indicates the level of influence that an input xi has on the overall activation. This synaptic weight can be positive or negative, depending on whether we want xi to increase or inhibit the activation of the neuron. The neuron computes the sum Sum_i wi xi to find its activation, a. This activation, a, is then used as the argument to a function f, somewhat unimaginatively called the activation function, which determines the output, y. A neuron with three inputs is shown in figure (5.1). Two possible activation functions are shown in figures (5.2) and (5.3).

As an extreme example, consider a neuron with eleven binary inputs: ten ordinary inputs with synaptic weight wi = 1, and an eleventh with synaptic weight w11 = -20. The neuron could be considered a vote counter for ten people - if the weighted inputs total more than 5 the neuron fires and the motion passes. Each voter may either use their input to vote 'yay' (xi = 1) or 'nay' (xi = 0). Alternatively, a naysayer may use the eleventh input to veto the entire decision (setting x11 = 1), forcing the activation to be reduced by twenty so that, no matter how many people say 'yay', the decision cannot go ahead.
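The vote-counting neuron can be written out directly. The sketch below is my own (not the project's code) and assumes a hard threshold of 5 as the activation function; the neuron fires only when the weighted sum of the inputs exceeds the threshold, and the eleventh input's weight of -20 acts as the veto.

def neuron(inputs, weights, threshold=5):
    # A simple threshold neuron: fire (output 1) when the weighted
    # sum of the inputs exceeds the threshold.
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > threshold else 0

# Ten voters with weight 1 and a veto input with weight -20.
weights = [1] * 10 + [-20]

votes = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(neuron(votes + [0], weights))   # 7 'yays', no veto  -> 1 (motion passes)
print(neuron(votes + [1], weights))   # same votes, vetoed -> 0 (motion blocked)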
Figure 5.1: A neuron with three inputs.

Figure 5.2: Hyperbolic tangent activation function.

Figure 5.3: Piecewise activation function.

5.2 Neural Networks

Neural networks are, unsurprisingly, systems of interconnected neurons. At their most fundamental level our brains are nothing more than vast, immensely complex neural networks (see Brunak and Lautrup for the different species of neurons in the brain). They provide us with an interesting counterpart to traditional computational methods, as the following paragraphs will explain.

Computer memory is address-based memory - to retrieve a memory you must be able to recall its address; if you cannot recall the address you are unable to retrieve the memory (we shall not concern ourselves with the storage or recall of the address itself!). It is also not associative: for example, if you retrieve an image of your wife's face, you will not be able to recall her name unless you know the address of where her name is stored. As a result, it is best not to rely on a computer to remember your wife's name. As many people who have had a system failure will attest, computer memory is not robust/fault tolerant. This means that a small fault with the R.A.M. in your computer can (and usually does) lead to catastrophic results. Finally, memories are not distributed around the entire computer but stored entirely within the R.A.M. chips. Whilst this makes upgrading much less hassle, it also means that when retrieving data from R.A.M. the majority of the computer sits idle as it waits for the data: only the C.P.U. and a few circuits are actually doing anything during this process.

Biological (neural network) memory, on the other hand, is content addressable - seeing your wife's face will (barring any psychological difficulties) bring her name to mind. It is also robust, surviving our best attempts to stop it working. This point is illustrated beautifully by the following quote, taken from Mackay [13]:

Our brains are noisy lumps of meat that are in a continual state of change, with cells being damaged by natural processes, alcohol and boxing.

Unlike computer memory, it is distributed - all the processing and storage is distributed throughout the entire network. It is impossible to isolate one area where information is stored and one where it is processed, since all neurons take part in these tasks. In a network as complex as our brain it is possible to observe some specialisation in certain regions, but within a region each neuron is used to store parts of multiple memories.

Given their fundamentally different nature from standard computers, neural networks are 'programmed' differently. We refer to this programming as 'training' the network, and there are many different ways in which it can be done. When training a network there are three things that we must specify:

1. Architecture: this simply describes the network and its topology. It should include things like the number of neurons, the number of inputs they have, their initial synaptic weights and other fundamental features. This can often be achieved with a 'simple' diagram.

2. Activity Rule: this describes how the activities of the neurons change in response to each other. This is typically just a list of the activation sum and activation function for each neuron.

3. Learning Rule: this describes how the weights in the network change with time; it acts over a longer time period than the activity rule.
We will be looking at variations in the learning rule rather than in the architecture or activity rule. Our architecture will consist of several fully connected layers of neurons (every neuron in layer n is connected to every neuron in layer n + 1). Such networks are called Hopfield networks after John Hopfield; an example is shown in figure (8.1). There is no requirement for a neural network to be like this: we can equally well have a network in which each neuron can affect its own activation. These are called feedback networks, and an example is shown in figure (5.5).

Figure 5.4: A simple feedforward network.

Figure 5.5: A simple feedback network.

Unless otherwise stated we will treat the activity rule as an abstract function so as to preserve the generality of the results we derive.

We will be looking at the various different ways of training neural networks, along with their respective advantages and shortcomings. We will then look at the different tasks to which neural networks can be put and how the concepts of information theory will help us. We will conclude the section by studying neural network memory, its various properties and how it can be effectively utilised. Throughout this investigation we will restrict ourselves to neural networks operating in discrete time. Whilst many of the results we will obtain also apply to continuous-time (spiking) neural networks, their derivation is more complex and no more informative for being so.

Chapter 6

Learning Types

6.1 Supervised Learning

If we subject our unsuspecting network to supervised learning, it means that we have an external teacher who has some knowledge that he wishes to pass on to the network. We assume that he has been diligent enough to prepare a set of examples which accurately characterise his knowledge; we further assume that the neural network has no 'knowledge' other than that which the teacher will force onto it. The examples consist of sets of input data with corresponding desired responses/outputs from the network. We use the following algorithm to teach n examples to the network:

1. set i = 1
2. subject the network to the input of example i
3. compare the output of the network, y, to the desired response, d, to create an error vector e
4. use the error vector to adjust the parameters of the network so that the network's response is closer to the desired response
5. if i = n: STOP, else go to 6
6. set i = i + 1
7. go to step 2.

The algorithm is repeated until the network is deemed to have learned as much as possible from the teacher. When this occurs we remove the teacher and let the network operate freely. As we shall see in the following chapter, this is an example of an error-correcting feedback loop.

There are two ways to perform supervised learning: online and offline. In online learning the supervision takes place within the network itself. The learning takes place in real time, and once the training examples have been worked through the network operates dynamically, continuing to learn from all input vectors submitted to it. In offline learning the supervision is carried out at some kind of remote facility. If the network is a software program then the supervision could be a separate program on the same computer, or it could be a program running on a separate computer. Once the training is complete the supervision facility is disconnected and the network parameters are fixed. The network runs statically after its training.
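As an illustration of the teaching loop above, here is a minimal sketch of my own (not from the project): a single linear neuron trained by repeatedly presenting examples, comparing its output with the desired response and adjusting the weights to reduce the error. The learning rate and the tiny example set are arbitrary; this is the error-correction learning that chapter 7 derives more carefully.

examples = [([0.0, 0.0], 0.0),   # (input vector, desired response)
            ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0),
            ([1.0, 1.0], 2.0)]

weights = [0.0, 0.0]
eta = 0.1                         # learning rate

for epoch in range(200):          # repeat the teaching loop until the error is small
    for x, d in examples:
        y = sum(w * xi for w, xi in zip(weights, x))   # network output
        e = d - y                                       # error signal from the teacher
        weights = [w + eta * e * xi for w, xi in zip(weights, x)]

print([round(w, 3) for w in weights])   # -> approximately [1.0, 1.0]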
The most frequently used method of updating the synaptic weights is the backpropagation algorithm. This algorithm has two distinct phases of computation: forward, where the activations and outputs are calculated from the input to the output, and backward, where the synaptic weight adjustments are calculated from the output neurons back to the inputs. To calculate the change in one neuron's synaptic weights one must analyse every neuron that can connect to it, which can lead to scaling issues.

6.2 Reinforcement Learning

One of the main problems with supervised learning is that without the teacher present the network cannot learn new ways of interpreting the data in the example set. One possible way of overcoming this problem is to use reinforcement learning. This is when a network learns an input/output map by trial and error in an online manner. The trial and error process attempts to maximise what is known as the reinforcement index (a scalar measure of how well the network is performing). There are two subtypes of reinforcement learning: associative and nonassociative.

In nonassociative reinforcement learning, reinforcement is the only stimulus that the network receives. The network simply repeats the action which gives it the greatest reinforcement. In associative reinforcement learning the network is provided with more stimuli than just reinforcement. The network is required to learn a map between the various stimuli and their corresponding optimal actions. Formally, we declare the following:

* The network is interacting with an environment in discrete time.
* The environment has a finite set of states that it can be in, denoted X.
* The environment's state at time n is given by x(n) (where x(n) is in X). The initial state is x(0).
* The network has a finite number of actions to choose from, the set of which is denoted A. This set may depend on x(n), so the network's choice of actions may be restricted.
* At time n the network receives x(n) as its input and performs action a(n).
* The action taken affects the environment and moves it from state x(n) to state y. The new state is determined entirely by a(n) and x(n) - it is independent of all previous actions and states.
* Pxy(a) denotes the probability that the system will be moved into state y by the action a.
* At time n + 1 the network receives a reinforcement that has been determined from x(n) and a(n).

The so-called evaluation function provides a natural measure of the network's performance; it is defined as follows:

J(x) = E[ Sum_{k=0}^{infinity} gamma^k r(k + 1) | x(0) = x ].    (6.1)

The summation term is called the cumulative discounted reinforcement. gamma is the discount rate parameter; it lies in the range 0 <= gamma < 1 and adjusts the extent to which reinforcement received further in the future is discounted. If gamma = 0 then only immediate reinforcement is taken into account, and as gamma tends to 1 the network takes longer-term consequences into account.
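The effect of the discount rate gamma on the quantity inside the expectation in (6.1) can be seen with a tiny sketch of my own (the reward sequence is an arbitrary made-up example).

def cumulative_discounted_reinforcement(rewards, gamma):
    # Sum_k gamma^k * r(k+1): the quantity inside the expectation in (6.1).
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 5.0]    # reinforcement received at steps 1..4
print(cumulative_discounted_reinforcement(rewards, gamma=0.0))   # 1.0: only the immediate reward counts
print(cumulative_discounted_reinforcement(rewards, gamma=0.9))   # 1.0 + 0.9**3 * 5 = 4.645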
In supervised learning the performance of the network was judged by measuring its responses to a set of examples and using a known error function. By contrast, with reinforcement learning the performance of the network is judged on the basis of any measure that we choose to provide it with. In supervised learning the teacher is able to immediately tell the network what to do in order to improve its performance. This allows us to use any and all knowledge that we currently have in order to guide the network to an optimal response. With reinforcement learning the system acts by trial and error; it must try various actions before it can judge which is optimal. The trial and error nature of the learning, coupled with a delayed reward, means that the operation is slowed down. We should not discount reinforcement learning on this basis, since it is still a very important tool, especially when combined with supervised learning (this applies doubly when the neural networks are brains and we are thinking about how humans learn). With these differences noted, it is time to move on to the third and final type of learning: unsupervised learning.

6.3 Unsupervised Learning

As one might expect, unsupervised learning takes place without a teacher of any kind. This is most often used when creating an associative memory: the inputs are repeatedly presented to the network, whose free parameters gradually become tuned to the input data. One of the major advantages of unsupervised learning is that the network develops the ability to form its own internal representations of the data, rather than having these imposed on it (as happens in supervised learning). This enables it to form new classes as and when it needs to. Another advantage becomes apparent when we consider how changes to the weights of the system are calculated. Backpropagation is a good algorithm, but it doesn't scale very well. If we have l layers, and the average number of inputs to a neuron in each layer is m_{in}, then changing a synaptic weight in the first layer affects (m_{in})^l other neurons. So the time taken to train the network grows exponentially with the number of layers due to the increasing number of effects that must be calculated.

6.4 Goals for Learning

When training a neural network we usually want it to perform one of the following tasks. We can put them to other uses, but these are the main ones.

Prediction: This is a 'temporal signal processing problem': given a set of past examples of the behaviour of a system, we require the network to predict its next action/state. This is similar to error correction as we can use the next state of the system to train the network to (hopefully) better predict the system in future. One popular idea is trying to predict the stock market.

Approximation: If we have a function y = f(x) (f unknown) and a set of paired values (x_i, y_i) (where y_i = f(x_i)) then, given a large enough example set, we may use the neural network to predict values of y_i not given by our examples. This is obviously a candidate for supervised learning as we need to teach the network our initial data. When we are training a network for this approximation we must be careful not to over-train the network. Doing so would lead to a 'join the dots' approximation (overfitting to the data) rather than an accurate representation of the function.

Auto-association: We want the neural network to store a set of patterns; we achieve this by repeatedly presenting them to the network.
The network is then presented with either a partial or a noisy pattern and is required to recall/retrieve the corresponding stored pattern. Autoassociative memories are perfect candidates for unsupervised learning.

Pattern Classification: This takes place with supervised learning: we want the network to sort its input into a finite number of classes. The network is presented with a large series of examples and taught what classes they each belong to (e.g. person, rock or fruit). The network is then presented with a previously unseen input and is required to classify it correctly.

Control: Neural networks are also very good at maintaining a controlled state in a system or process due to their ability to learn via several different methods. This should not be terribly surprising given that our brains are, at their most fundamental level, large neural networks that control our actions and learn from a multitude of different stimuli.

Now that we have discussed the various types of learning, their advantages, pitfalls and what goals they are particularly well matched to, it is time to delve deeper and learn about the details of several different algorithms by which neural networks may learn.

Chapter 7
Learning Algorithms

7.1 Error Correction Learning

Error correction learning is the main form that supervised learning takes. The teacher wishes to train the network to respond correctly to a number of inputs and does so by trying to minimise the difference between the network's response and the desired response. The most common way to do this is by using the backpropagation algorithm, which we shall explain here. The algorithm is used to alter the weights of the network in a systematic manner. We build on the definition of the error vector (e = d − y) by defining the total 'error energy':

E(n) = ½ Σ_{j=1}^{M} e_j²(n)    (7.1)

(where M is the number of output neurons), which gives us a measure of how well the network is performing for a particular training example. The average squared error energy is given by

E_{av} = (1/N) Σ_{n=1}^{N} E(n),    (7.2)

and gives us a measure of the performance of the network which takes into account all of the training examples. In order to simplify the notation we shall assume that N = 1 and so E = E_{av}. In training the network, our goal is to minimise E, which is a function of all the parameters of the network (the synaptic weights and biases). The activation a_k and output y_k of neuron k are given by:

a_k = Σ_{i=0}^{m} w_{ki} y_i,    (7.3)

and

y_k = f_k(a_k).    (7.4)

The change to weight w_{ki} is denoted Δw_{ki} and is proportional to ∂E/∂w_{ki}. As a result, those weights which affect E the most will be adjusted more, whilst those which have very little effect will be adjusted by a correspondingly small amount. We will now derive the formula for the backpropagation algorithm. By the chain rule, we can write the following:

∂E/∂w_{ki} = (∂E/∂e_k)(∂e_k/∂y_k)(∂y_k/∂a_k)(∂a_k/∂w_{ki}).    (7.5)

Differentiating equation (7.1) gives us

∂E/∂e_k = e_k.    (7.6)

From our equation for the error, e_k = d_k − y_k, we simply get

∂e_k/∂y_k = −1.    (7.7)

If we partially differentiate y_k with respect to a_k then we see that:

∂y_k/∂a_k = df_k/da_k.    (7.8)

Finally, we differentiate our expression for the activation of neuron k, a_k, to find:

∂a_k/∂w_{ki} = y_i.    (7.9)

Substituting equations (7.6), (7.7), (7.8) and (7.9) into (7.5) we get the following:

∂E/∂w_{ki} = −e_k (df_k/da_k) y_i.    (7.10)

By using (7.6), (7.7) and (7.8) we may define

δ_k = −∂E/∂a_k = e_k f_k′(a_k),    (7.11)

and then we can use the following expression for Δw_{ki}:

Δw_{ki} = −η ∂E/∂w_{ki} = η δ_k y_i,    (7.12)
where η is the learning rate parameter, which must be chosen with care. If it is too small then the rate of convergence will be so slow as to be useless, but if it is too high then we might get no convergence at all.

If neuron k is an output neuron then all is well, but if it is hidden (i.e. not an output neuron) then we do not have e_k, so we redefine δ_k as

δ_k = −∂E/∂a_k = −(∂E/∂y_k)(∂y_k/∂a_k)    (7.13)

by using the chain rule. As a result we now need to find ∂E/∂y_k. From (7.1) it follows that

∂E/∂y_k = Σ_j e_j (∂e_j/∂y_k) = Σ_j e_j (∂e_j/∂a_j)(∂a_j/∂y_k).    (7.14)

Note that e_j = d_j − f_j(a_j), giving us ∂e_j/∂a_j = −f_j′. Also, a_j = Σ_k w_{jk} y_k, giving us ∂a_j/∂y_k = w_{jk}. This implies that

∂E/∂y_k = −Σ_j e_j f_j′ w_{jk} = −Σ_j δ_j w_{jk},    (7.15)

which we substitute back into equation (7.13) to find:

δ_k = f_k′ Σ_j δ_j w_{jk}.    (7.16)

Figure 7.1: The initial setup of our neural network.

The δ_j in this formula are given by equation (7.11) when the hidden neuron immediately precedes the output layer, and by (7.16) itself, applied recursively, for every layer before that. The error is a function of the synaptic weights; this allows us to define a multi-dimensional error performance surface on which we hope to find a minimum value. If the network has units with linear activation functions then the surface is quadratic and has a unique minimum which the algorithm above will always allow us to reach. If we have nonlinear units, then the surface will have a global minimum but will also have local ones which the algorithm might get 'trapped' in.

7.2 Hebbian Learning

Perhaps the best way to describe Hebbian learning is to begin with a quotation from Hebb himself (taken from Simon Haykin[9]):

When an axon¹ of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency as one of the cells firing B, is increased.

¹ i.e. the neuron's output.

This definition has since been expanded to say the following:

1. if two neurons on either side of a synapse fire together then the strength of that synapse will be increased.
2. if on the other hand the two neurons fire asynchronously, then the strength of the synapse will be decreased.

Any synapse which obeys the above rule is called a Hebbian synapse. When we are using a Hebbian learning technique the synaptic modifications can be one of several different types. They can be Hebbian, in which case the synaptic strength increases when the firing of the neurons on either side of it is positively correlated. They can be anti-Hebbian, where the synaptic strength increases when the firing is negatively correlated and decreases when it is positively correlated. Alternatively the modifications may be non-Hebbian, where the modification of synaptic strengths does not depend on the correlation of the two neurons' firing. One example of this would be a slow decay in synaptic strength.

Hebb's postulate may be expressed as Δw_{ki} = F(y_k, x_i): the change in weight w_{ki} is a function of the presynaptic and postsynaptic activities x_i and y_k. We are using x_i instead of y_i so that the adjustment to w_{ki} is normalised by the current synaptic weight; if w_{ki} is small then i and k firing together should only adjust the weight by a small amount. We will look at the special case where F(y_k, x_i) = η y_k x_i.
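Before pointing out its shortcomings, here is a minimal sketch of this special case for a single neuron; the particular numbers and the tanh activity rule are illustrative assumptions only.

```python
import numpy as np

def hebbian_step(w, x, y, eta=0.2):
    """Plain Hebbian rule: delta w_ki = eta * y_k * x_i (no forgetting term)."""
    return w + eta * y * x

# toy usage: repeated correlated firing only ever increases the weights
w = np.array([0.10, 0.16, 0.20])        # initial synaptic weights
x = np.array([0.54, 0.89, 0.46])        # presynaptic activities (made-up values)
for _ in range(5):
    y = np.tanh(w @ x)                  # postsynaptic activity
    w = hebbian_step(w, x, y)
print(w)                                # the weights grow monotonically
```

Because every term in this update is non-negative, the weights can only grow, which is exactly the saturation problem addressed next.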
This simple update rule does have one disadvantage though: the synaptic weight cannot decrease, and as a result repeated correlated firing of neurons k and i will saturate the synaptic weight. Therefore we add a forgetting factor to slowly decrease the weight over time. This gives us a new formula:

Δw_{ki} = η y_k x_i − α y_k w_{ki},    (7.17)

or alternatively:

Δw_{ki} = α y_k [c x_i − w_{ki}]   where   c = η/α.    (7.18)

The new synaptic weight of neuron k is given by taking the sum of the old weight and the Hebbian adjustment, like so:

new synaptic weight of neuron k = w_k + Δw_k.    (7.19)

Example

Given our starting neural network from figure (7.1), we wish to calculate the weight changes using a Hebbian learning rule with a learning rate parameter η = 0.2 and a forgetting factor α = 0.01. We will be using the same activation function as in the case of supervised learning; the only difference is how we will calculate the weight changes. We will only compute one set of weight changes as the same principle and method apply to all layers of neurons and there is little benefit to repeating the calculations. We can see here that the weight changes for neuron 4 are as follows:

Δw_{41} = η y_4 x_1 − α y_4 w_{41} = (0.2 × 0.99 × [0.1 × 0.54]) − (0.01 × 0.99 × 0.1) = 0.010
Δw_{42} = (0.2 × 0.99 × [0.16 × 0.89]) − (0.01 × 0.99 × 0.16) = 0.027
Δw_{43} = (0.2 × 0.99 × [0.2 × 0.46]) − (0.01 × 0.99 × 0.2) = 0.016.

Working through the same process for neuron 5 gives us the following weight change, expressed as a vector: Δw_5 = (0.008, 0.033, 0.023)^T. So the new synaptic weight vectors for neurons 4 and 5 are:

w_4 = (0.10, 0.16, 0.20)^T + (0.010, 0.027, 0.016)^T = (0.110, 0.187, 0.216)^T
w_5 = (0.09, 0.21, 0.30)^T + (0.008, 0.033, 0.023)^T = (0.098, 0.243, 0.323)^T.

7.3 Competitive Learning

In competitive learning we have only one output neuron firing at any one time; all of the output neurons are competing for this 'privilege'. The firing neuron is called the winner-takes-all neuron. We might also wish to have 3 output groups and have one neuron fire per group, but to keep things slightly simpler we shall assume that we have only one group and therefore only want one neuron firing at any one time. Neural networks of this type are very good for classifying input as we can simply allocate each output neuron to a different input class and we will get an unambiguous decision from the network about which class the input belongs to. We must however take care not to simply assume that the network knows the answer: it can only make useful contributions if its training is good. Even then we must exercise caution against blind faith in the network's response[7].

There are three main requirements for a competitive learning rule:

1. We must have identical neurons (same connections and activation functions) but with randomly distributed synaptic weights.
2. The strength of each neuron must be finite and uniform across all neurons (i.e. each neuron must have the same amount of synaptic weight to distribute among its inputs).
3. There must be some mechanism in place to allow competition between the output neurons or output groups (to ensure that only one neuron fires).

We need the first so that the network will begin its training by responding differently to different inputs. If it reacted identically to all input patterns then we would be unable to train it. The second is needed so that no single neuron dominates the competition and wins the competition for all input patterns. The third condition is to ensure that only one neuron fires per input.
After all, it is not very useful if we ask the network whether an input x belongs to A, B or C (all mutually exclusive) and the network responds with A and B. We can implement requirement 2 by requiring that the weights of each neuron all add up to 1, so Σ_j w_{kj} = 1 for all k. We can implement condition 3 by having lateral synapses between the output neurons with negative weights. Thus one neuron firing will inhibit all the others.

The winner-takes-all neuron must have the highest activation of all the output neurons. When it fires its output is set equal to 1 and all the other output neurons are set to 0. If neuron k wins, then all of its synaptic weights are decreased slightly and the removed weight is redistributed among the active synapses. Note that if a neuron has no active inputs then it will not learn. The weight changes are calculated as follows:

Δw_{kj} = η(x_j − w_{kj}) if neuron k wins, and Δw_{kj} = 0 otherwise.    (7.20)

As we can see from this equation, the weight vector of neuron k (w_k) will converge to the input pattern x (assuming we have an appropriate learning rate parameter η). Once w_k is deemed sufficiently close to x the network is considered to be trained.

7.4 Self Organising Maps

Self organising maps were introduced by Teuvo Kohonen and hence are sometimes called Kohonen maps. They take an input pattern (of arbitrary dimension) and convert it to a discrete one or two dimensional map of interconnected neurons². The weight vector of neuron k is denoted w_k = (w_{k1}, w_{k2}, . . . , w_{km}), where there are m inputs. The input signal is denoted x. When we are using a self organising map the input pattern usually has one or two large components, with the rest being small. All neurons are exposed to every input component so as to allow the map to mature effectively.

² 3D outputs are possible but less common.

There are three distinct phases to the algorithm: competition, cooperation and synaptic adjustment. In the competitive phase the neurons all compute their output; the one with the highest output is deemed the winner. If all neurons possess the same activation function then this process amounts to finding the best match of x with the neurons' weight vectors, i.e. maximising w_k · x across all k (equivalently, the winning index is i(x) = arg min_k ‖x − w_k‖).

Once the winning neuron has been found, the algorithm enters the cooperative phase. This is where lateral connections between the neurons come into play: the firing of the winning neuron gives some excitation to nearby neurons. We will assume that neuron k is the winner. Those neurons near to the winner will get more excitation from its firing than those further away. We define d_{ik} to be the distance between neurons k and i (with d_{kk} = 0). We now define h_{ki}, which describes the excitation received by neuron i from neuron k as a function of the distance between them. We have two requirements of h_{ki}: we require it to be unimodal in d_{ik}, taking its maximum value at d_{kk}, and we further require that h tends to 0 as the distance tends to infinity. With a one dimensional output, we can easily define d_{ik} as being |i − k|. It is relatively simple to define a higher dimensional distance: we simply take d_{ik} = ‖r_i − r_k‖, measuring r from an arbitrary point. In order to help the network converge, we would like h to decrease with time, so that successive examples/presentations will affect fewer neurons and allow the network to become finely tuned to the inputs.

Finally, the network enters the synaptic adaptation phase, where we use a Hebbian learning rule with a forgetting factor g(y_k). For example, Δw_k = η y_k x − g(y_k) w_k.
If we set g(y_k) = η y_k and y_k = h_{ki}(x) then we get a weight adjustment of

Δw_k = η h_{ki}(x)(x − w_k).    (7.21)

In order to encourage convergence of the network, we would also like η to decrease with time, to allow the network to become finely tuned to the inputs. There are two distinct phases of training when we are using a self organising map. In the first phase we allow h_{ki} to include all the neurons and decrease slowly; η should also decrease gradually in this phase. This is the 'rough' training of the network. The second phase is where we seek convergence/fine tuning; this is where we hold η constant and h should include only those neurons closest to the winner. The second phase generally lasts much longer than the first.

There are many other types of learning algorithm; unfortunately we do not have the space to cover them here, as each one could quite easily be the subject of an entire book. We will however study memory learning in more depth as this will become useful to us in Part III.

Chapter 8
Information Theory Applied to Neural Networks

The key concept when applying Shannon's information theory to neural networks is mutual information. Depending on the scenario we may wish to adjust the mutual information between synaptic weights or output neurons. We will often describe ideas by images, but the concepts generalise to any data set. To give a few examples (in all cases the x_a are input vectors and the y_a are the output vectors):

1. Acquiring information for the first time (e.g. cramming before an exam). The notes consist of the vector x which is fed into the neural network, producing a memory y. We need to maximise the information conveyed to y (what we remember) from x (the notes).

2. Associative memory (given part of an image, recall the rest of it). Non-overlapping parts of the same image are given by x_a, x_b and x_c, which together produce outputs y_a, y_b and y_c. Now our goal is to maximise the mutual information of all the output vectors.

3. Consistent memory (given two independent images, keep them isolated). Two independent images x_a and x_b produce outputs y_a and y_b respectively. Our aim here is to minimise the mutual information of y_a and y_b, since due to the independence of x_a and x_b we cannot infer anything about y_b from y_a or vice versa.

Maximising the mutual information between x and y, I(x : y), is a fundamental goal of signal processing. This goal can be summarised as the Maximum Mutual Information (Infomax) Principle, due to Linsker:

The transformation of a random vector x observed in the input layer of a neural system to a random vector y produced in the output layer of the system should be so chosen that the activities of the neurons in the output layer jointly maximise information about activities in the input layer. The objective function to be maximised is the mutual information I(y : x) between the vectors x and y.

The mutual information therefore gives neural networks an analogue of the channel capacity we met in our brief excursion into information theory, which defines the maximum reliable rate of information transmission through a channel. We will look at autoassociative memory in order to illustrate these ideas.

8.1 Capacity of a Single Neuron

If we use our neuron to transfer information from one person to another then it is taking on the role of a channel. We will look at the case of a single neuron and then generalise to Hopfield networks.
We will work with binary inputs; analysis may also be done on continuous (and stepped) inputs, but the work is considerably more complex. We will take our neuron to have k inputs and assume that we have n (k-dimensional) input vectors to classify. These vectors are said to be in a general layout if any subset of k or fewer vectors is linearly independent and no k + 1 of them lie on a (k − 1)-dimensional plane. If we consider the case k = 2 then this latter requirement is simply that no three points lie in a straight line. In order to simplify the analysis we will also assume that our neuron has a binary threshold function, defined below:

f(a) = 0 for a ≤ 0, and f(a) = 1 for a > 0.    (8.1)

Let us assume that our neuron has 6 synapses, so its inputs are points in a 6 dimensional binary space. We wish to classify some data using the neuron. This data consists of a set of points {x(i)}_{i=1}^{5} in the 6D binary space and a set of corresponding target values {t_i}_{i=1}^{5}. If we use x(3) as the input to the neuron then we want its output to be t_3. The receiver is only allowed to test the neuron at the 5 points x(1) through x(5) and must use the weights of the network to reconstruct the original target values. The sender must therefore choose an appropriate learning algorithm to encode this information into the synaptic weights.

There are 2^5 distinct binary labellings of 5 points; if these points are in a general layout then the neuron is not necessarily able to store them all. The number of labellings that the neuron can produce for n k-dimensional points in general position is denoted T(n, k) and satisfies the recurrence relation

T(n, k) = T(n − 1, k) + T(n − 1, k − 1).    (8.2)

Using this, it is possible to show that T(n, k) is given by the following formula:

T(n, k) = 2^n for k ≥ n, and T(n, k) = 2 Σ_{i=0}^{k−1} \binom{n−1}{i} for k < n.    (8.3)

If the neuron is able to produce all 2^5 labellings then it can reliably classify all of the information. If it is only able to produce two different labellings for the entire set then it can store one bit of information. The number of different labellings the neuron can produce is a natural measure of its capacity. In order to move from labellings to capacity, we use the following equation:

capacity = log[T(n, k)].    (8.4)

We can check if our neuron is likely to be able to store all the information by checking the probability that it will be able to store all n bits; this is calculated by taking the ratio of the achievable labellings to the number of possible labellings, like so:

probability of successful storage = T(n, k)/2^n.    (8.5)

If we plot a graph of this probability against n and k then we see an interesting result, shown in figure (8.1).

Figure 8.1: Probability of successfully labelling all points[13], from various different perspectives. (c) and (d) are drawn at k = 1000.

When n ≤ k we can see that the probability of successfully labelling all of the inputs is 1, so we are guaranteed success. If we look at the region k < n < 2k then there is a very high probability of successfully labelling all the inputs. So provided we are willing to accept a small probability of error, the capacity of a single neuron is slightly less than two bits per input.
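The counting argument above is easy to check numerically. Below is a small sketch of T(n, k) from formula (8.3) and the storage probability (8.5); the function names are ours and the example values simply reproduce the regimes discussed above.

```python
from math import comb

def T(n, k):
    """Number of labellings of n points in general position achievable by a
    binary threshold neuron with k inputs (equation (8.3))."""
    if k >= n:
        return 2 ** n
    return 2 * sum(comb(n - 1, i) for i in range(k))

def storage_probability(n, k):
    """Probability of successfully storing all n target bits (equation (8.5))."""
    return T(n, k) / 2 ** n

print(storage_probability(1000, 1000))   # n = k   -> 1.0 (guaranteed success)
print(storage_probability(2000, 1000))   # n = 2k  -> 0.5
print(storage_probability(3000, 1000))   # n = 3k  -> essentially 0
```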
Our single neuron is a trivial example of a feedforward 'network'. Feedforward networks can be used to create interpolative memory, meaning that if the inputs differ slightly from the memory then the output is also allowed to differ slightly. For example, if we replaced our lone neuron's activation function by

f(a) = tanh(3a)    (8.6)

then its output would be continuous. Suppose we had classified the pattern (1, 1, 1) as type '1' using our neuron, with all weights being equal and taking the value 1/3. If we then submit x = (1, 1, 0) as our input, then the neuron's activation will be 2/3 instead of 3/3 as it was for the stored memory. However the neuron will still fire, with an output of

f(a) = tanh(3 × 2/3) = tanh(2) = 0.964.

Note that whilst this is not the same as the response to (1, 1, 1) (which is 0.995), it is still close. A rather more interesting use of neural networks is to create so-called accretive memory, where the output is required to be exactly the same as one of the stored memories; no deviations due to noise are allowed. For this task feedback networks are most appropriate. The use of networks is also advantageous as it allows us to store vectors, not just classify them.

8.2 Network Architecture for Associative Memory

When we supply a network with a noisy/incomplete vector x̄(i) we must allow it some time to recover the clean version x(i). As a result we will feed the network the noisy vector and wait until its output no longer changes with time, at which point we will take its output to be the recovered memory. If x̄(i) causes the network's output to converge to x(i) then the memory has been successfully recalled.

Given that we will be using a feedback network we must specify how our neurons will calculate their activations and update their state. They may either all do this simultaneously (synchronous updating) or they may do so one at a time according to some sequence that may be predetermined or random (asynchronous updating). When using asynchronous updates the neurons update their activations and states at a mean rate (per unit of time). We will look at asynchronous updates in a binary Hopfield feedback network. It can be shown that if we use asynchronous updates then the network will always converge to a stable state, and the memories of the neural network correspond to these stable states. This result is quite general and applies to a Hopfield network with a continuous activation function, e.g. f(a) = tanh(βa). In order to guarantee convergence with synchronous updates we must use continuous time.

8.3 Memories

The nature of memory recall (wanting certain sets of neurons to fire simultaneously) makes it an ideal candidate for Hebbian learning. After the training process is complete some memories will be more accessible than others, that is to say that they are more easily recalled. For example we might be able to recall x(1) if it is corrupted by ≤ 3 flipped bits whilst only being able to recall x(2) if it is corrupted by ≤ 2 flipped bits. The accessibility of a memory x(i) is defined to be the fraction of initial network states that converge to it. A memory that can tolerate a large number of flipped bits is said to form a large attractor, and similarly for a small attractor. The terms 'large' and 'small' here are relative to the number of bit-flips that other memories can sustain.
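The following sketch shows one way such a binary Hopfield memory can be set up and probed: patterns are stored with a Hebbian outer-product rule and recalled by asynchronous updates from a corrupted starting state. The ±1 coding, the weight normalisation and the helper names are illustrative assumptions, not a prescription from the text.

```python
import numpy as np

def store(patterns):
    """Hebbian (outer-product) storage for a binary +/-1 Hopfield network."""
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns) / n
    np.fill_diagonal(W, 0)                          # no self-connections
    return W

def recall(W, x, steps=500, rng=np.random.default_rng(0)):
    """Asynchronous updating: one randomly chosen neuron at a time."""
    x = x.copy()
    for _ in range(steps):
        i = rng.integers(len(x))
        a = W[i] @ x
        if a != 0:                                  # keep the old state on a tie
            x[i] = 1 if a > 0 else -1
    return x

patterns = np.array([[1, -1, 1, -1, 1],
                     [1, 1, -1, -1, 1]])
W = store(patterns)
noisy = np.array([1, -1, 1, -1, -1])                # first pattern, one bit flipped
print(recall(W, noisy))                             # recovers [1, -1, 1, -1, 1]
```

Starting states that converge to a given pattern in this way are exactly what the accessibility defined above counts.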
The encoding of memories is not without its problems; there are a number of ways in which it can fail.

• Individual memories may get corrupted: the stable state of the network might be slightly displaced from the memory we wish to store.
• Memories might not form attractors, or they might form such small attractors as to be effectively useless.
• Spurious memories can be formed.
• Memories might interfere with each other, resulting in a stable state that is an amalgamation of two memories.

It is important to note that the spurious memories are not formed by transitive associations like (A + B) and (B + C) leading to (A + C), but by illogical, meaningless connections caused by correlations in the structure of memories. This means that the spurious memories are naturally formed when we create a memory; there is no process which can prevent them. These memories interfere with the running of our network and so we would like to find a way to limit their effect.

8.4 Unlearning

Luckily Hopfield et al. came up with a method of moderating the effects of these spurious memories[11]. After the network has been trained we choose a random state as our input and allow the network to converge on a stable state. We then adjust the synaptic weights in order to decrease the accessibility of this stable state by a very small amount. We then lather, rinse and repeat. They called this process 'unlearning' as it is nothing more than forgetting each memory by a small amount. After its initial training the memories encoded in the network (including the spurious ones) can have widely varying accessibilities. The unlearning process causes the accessibility of each of the stored memories to converge and the total accessibility of the spurious memories to decrease. One must exercise caution when implementing the unlearning algorithm, since if it runs for too long then it ends up erasing everything, including the memories we want it to store.

A convincing case has been made for R.E.M.¹ sleep and dreams being nothing more than our brains undergoing a process of unlearning in order to decrease the accessibility of these spurious states. In doing so it would be keeping our memories consistent and useful, which is nice.

¹ Rapid Eye Movement

8.5 Capacity of a Hopfield Network

When we are looking at the storage capacity of a Hopfield network the important quantities are the number of neurons and the number of patterns to be stored. We shall denote these numbers by n and p respectively. For a fixed number of neurons there is an upper limit to the number of patterns that can be stored, as one would expect. What is less intuitive, however, is that if we try to store too many patterns in the network (i.e. the ratio p/n is too large) then the network undergoes a sharp transition into complete amnesia, where none of the memories correspond to stable states. Statistical analysis has found that this transition occurs at

p = 0.138n.    (8.7)

It is interesting to note that there is no threshold below which the stable states correspond exactly to the patterns we wish to store. Just below the amnesia threshold there is a correlation of 98.4% between the memories and stable states; above this threshold there is absolutely no correlation. Just below the amnesia threshold the probability of error in each memory is 0.016 per bit, and as p/n decreases, so does this probability. There are n neurons in the network, each of which has n − 1 inputs, which gives us a total of ½n(n − 1) synaptic weights. The mutual information between the input pattern and the stable state is 1 − H₂(0.016) ≈ 0.88 bits per bit.
The number of patterns we are storing is 0.138n, and each one consists of n binary digits, so we are trying to store 0.138n² bits in our network. We must then multiply this by the mutual information to find the amount of information actually stored, which turns out to be 0.122n² bits. If we assume that n is large (as it must be if we wish to store a useful amount of information) then the difference between n and n − 1 is negligible and we may consider the number of synaptic weights to be ½n². This means that our network is storing

0.122n² bits / (½n² weights) ≈ 0.24 bits per synapse,    (8.8)

which is significantly less than our binary neuron classifier.

Part III
Quantum Information Theory and the Cutting Edge

Chapter 9
Quantum Mechanics

9.1 Motivation

The laws of physics were once all classical, governed by Sir Isaac Newton's laws. These laws were believed to be all pervasive, governing all motion from the stars in the night sky to the atoms from which we are formed. The planets behaved as large rocky billiard balls, orbited by smaller rocky billiard balls that we called 'moons'. Atoms were smaller balls, bouncing off each other throughout the universe. Unfortunately there is a flaw in this simple, intuitive and intellectually satisfying world view: it is wrong¹. The planets warp space and time by their very presence, and atoms do not bounce off each other but happily coexist not only as particles, but also as waves. The theory describing atoms and waves at their most fundamental level is called 'quantum mechanics'.

¹ It should be noted that the Newtonian view of the universe works very accurately between these two scales, and as such should not be disregarded.

Just as the laws of physics were required to change or adapt to the new quantum regime, the laws of information are required to do the same. The reason for this is simple: all information is physical, and as such is subject to our newly revised physical laws. In order to adapt, we must now revise some of our fundamental concepts in order to take into account the fact that we are now dealing with information in a quantum framework (very small) rather than a classical one (the kind we deal with every day). In order to build up this quantum theory of information, however, we must first review some of the fundamental concepts of quantum mechanics.

9.2 Qubits

In the classical framework a bit could take one of two values, 0 or 1. We have a similar concept in quantum information theory, namely the qubit, which has states |0⟩ and |1⟩. The notation |·⟩ is called a 'ket', and may have its inner product taken by multiplying on the left with a 'bra' (⟨·|). Given a state |a⟩ ('ket a') we define ⟨a| ('bra a') by ⟨a| = (|a⟩)†, where † denotes taking the Hermitian conjugate (conjugate transpose). Unlike a classical bit, a qubit may also be in a linear superposition of these two states, meaning that it is in a state |ψ⟩, where

|ψ⟩ = α|0⟩ + β|1⟩    (9.1)

(with |α|² + |β|² = 1). The qubit is then said to be in a coherent state; if it interacts with its environment in any way (if we try to measure it, for instance) then the qubit will decohere and fall into one of the states |0⟩ or |1⟩. Interestingly enough, the process by which it does so is not arbitrary: it will decohere to |0⟩ with probability |α|² and to |1⟩ with probability |β|². Each qubit therefore lies in a 2-dimensional vector space over the field of complex numbers; such a vector space is called a Hilbert space, after David Hilbert.
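To make the measurement probabilities concrete, here is a small sketch that represents a qubit as a normalised complex 2-vector and samples measurement outcomes in the {|0⟩, |1⟩} basis; the particular state chosen is just an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# |psi> = alpha|0> + beta|1> as a normalised complex vector (alpha, beta)
alpha, beta = 1 / np.sqrt(2), 1j / np.sqrt(2)
psi = np.array([alpha, beta])
assert np.isclose(np.vdot(psi, psi).real, 1.0)     # |alpha|^2 + |beta|^2 = 1

# decoherence on measurement: outcome 0 with probability |alpha|^2, 1 with |beta|^2
probs = np.abs(psi) ** 2
samples = rng.choice([0, 1], size=10_000, p=probs)
print(samples.mean())                              # close to 0.5 for this state
```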
A system is said to have n qubits if and only if it has 2^n mutually orthogonal states; this corresponds to its state lying in a 2^n-dimensional Hilbert space. We shall denote these states by |x⟩ with x representing a string of n binary digits; for example, |01110010⟩ is one of the states of an eight qubit system. Two states are orthogonal if their inner product, denoted ⟨a|b⟩, is zero. For our purposes we will reduce inner products of coherent states to inner products of |0⟩ and |1⟩ states, using the fact that ⟨0|0⟩ = ⟨1|1⟩ = 1 and ⟨0|1⟩ = ⟨1|0⟩ = 0.

9.3 The EPR Paradox

The EPR paradox was a thought experiment proposed by Einstein, Podolsky and Rosen in an attempt to demonstrate that quantum mechanics provided an incomplete description of nature. It seems to show that the measurement of particles can be used in order to violate the fundamental principles of relativity. Suppose we have two particles in the state |φ⟩ = (1/√2)(|00⟩ + |11⟩) and give one to Alice and one to Bob, who can be arbitrarily far apart. If Alice measures her particle and discovers that it is in state |0⟩ then the combined state must be |00⟩, so if Bob measures his particle he will find it to be in state |0⟩. This decoherence occurs instantaneously in spite of any distance between Alice and Bob. Could this decoherence be used to allow Alice and Bob to communicate faster than light? No, though there is a coupling between the particles, called 'entanglement', which we shall look at shortly.

Einstein, Podolsky and Rosen decided that there must be some kind of internal state of the particles that causes them to be in state |0⟩ or state |1⟩ before the separation occurred. This state is not accessible to us until we perform a measurement, however, so we may only speak of probabilities of the outcomes. Theories of this kind are known as 'local hidden variable theories'. Local hidden variable theories do have the attractive property of simplicity, but they cannot explain the results of measurements made in a different basis. John Bell proved that any local hidden variable theory satisfies a particular inequality (Bell's inequality, unsurprisingly), but in experiments it was shown that this inequality is consistently violated. So no local hidden variable theory can explain quantum mechanics.

An alternative explanation is that the measurement of Alice's particle does affect Bob's. This is rather problematic for causality, however, as we may set up two observers: one who sees Alice measure her particle first and another who sees Bob measure first. Relativity requires that the laws of physics must explain equally well the observations of each observer, so one observer could say that Bob's measurement affected Alice's particle and the other observer could say the opposite. Clearly both cannot be correct, especially when experiments showed that the results obtained were invariant under a change of observer. This tells us that the results can be explained equally well either by Alice measuring first or by Bob, and as such it is not possible to use decoherence to communicate. All that can be said is that Alice and Bob will observe the same random behaviour.

A third explanation, one proposed by Andrew Steane[19], is that the state vector |φ⟩ is not an intrinsic property of the quantum system; rather it is an expression of the information content of a quantum variable. There is mutual information between A and B, and so the information content of B changes when we learn something about A.
This approach gives us the same behaviour as classical information, so there is no controversy.

9.4 Entanglement

We may express the qubits |0⟩ and |1⟩ as the vectors (1, 0)^t and (0, 1)^t respectively. As such we may build up strings by taking the tensor product of the constituent qubits, for example:

|01⟩ = |0⟩|1⟩ = (1, 0)^t ⊗ (0, 1)^t = (0, 1, 0, 0)^t.    (9.2)

Similarly we find that |00⟩ = (1, 0, 0, 0)^t, |10⟩ = (0, 0, 1, 0)^t and |11⟩ = (0, 0, 0, 1)^t. This allows us to build up composite states with simple vector algebra. For example, given a state |α⟩ = (1/√2)(|00⟩ + |01⟩) we may write this in vector form as (1/√2)(1, 1, 0, 0)^t by adding the corresponding vectors together.

This representation of qubits is very useful because it allows us to represent any state α in 'density matrix' form. This is just the matrix ρ_α generated by |α⟩⟨α|, the outer product of (1, 1, 0, 0)^t with (1, 1, 0, 0) (dropping the normalisation):

ρ_α = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.    (9.3)

If ρ_α cannot be factorised, then |α⟩ is said to be an entangled state. It is possible to have different degrees of entanglement; we can easily construct a ρ_ψ that is partially factorisable into two parts, ρ_ε and ρ_δ:

ρ_ψ = ρ_ε ⊗ ρ_δ = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} ⊗ \begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}.    (9.4)

It is important to notice that ρ_ψ is only partially entangled: it can be written as a product of density matrices, but ρ_δ itself cannot be factorised. Entanglement is a purely quantum mechanical phenomenon; it has no classical analogue.

9.5 The Bell States

The Bell states are a set of four possible ways in which two qubits can be entangled. They are named after John Bell, whom we met in our discussion of the EPR paradox, and are defined by the following equations:

|φ±⟩ = |00⟩ ± |11⟩,    (9.5)
|ψ±⟩ = |01⟩ ± |10⟩.    (9.6)

The Bell states are all mutually orthogonal and are maximally entangled. They become very important when we try to communicate over quantum channels.

9.6 The No Cloning Theorem

This theorem simply states that an unknown quantum state cannot be reliably cloned[26]. We should qualify our use of the word 'reliably' here. Essentially we mean that under some circumstances it is possible to clone an unknown quantum state. These circumstances boil down to the requirement that the quantum states we are trying to clone must be mutually orthogonal. First we construct an operator

P = |0⟩⟨0| ⊗ V + |1⟩⟨1| ⊗ U    (9.7)

where the operators U, V are unitary. This is an example of a quantum logic gate acting on two qubits. The ket-bra operators act on the first qubit and determine if it is in state |0⟩ or |1⟩. If it is in state |0⟩ then the operator V is applied; if it is in state |1⟩ then U is applied. If we define U and V by the following equations: U = |0⟩⟨1| + |1⟩⟨0|, V = I (the identity matrix), then we have made ourselves what is known as a 'controlled-NOT' or CNOT gate. If the first qubit is in state |0⟩ then the pair of qubits are left alone; however if the first qubit is in state |1⟩ then the NOT operator is applied to the second qubit. The gate is controlled because the NOT operator only comes into play if the first qubit is in state |1⟩. It is important to note that our CNOT gate can only clone states |1⟩ and |0⟩, not any arbitrarily chosen states that we choose to hurl at it. For example, it would not be able to clone the state |a⟩ = (1/√2)(|0⟩ + |1⟩), as this is in a superposition of the two states |0⟩ and |1⟩.

Chapter 10
Quantum Entropy and Information

10.1 Von Neumann Entropy

The entropy of a quantum state ρ was defined by Von Neumann to be

S(ρ) ≡ −tr(ρ log ρ).    (10.1)

If we let λ_x be the eigenvalues of ρ then we can find a basis in which ρ = diag(λ_1, λ_2, …),
and thus we can rewrite the Von Neumann entropy as follows:

S(ρ) ≡ −Σ_x λ_x log λ_x.    (10.2)

Once more, we take 0 × log 0 ≡ 0. Throughout this chapter we will drop normalisation factors such as 1/√2 in front of entangled qubits so as to keep the notation uncluttered.

10.2 Quantum Distance Measures

Before we advance it is instructive to discuss distance measures for quantum information: we need a way to measure how different two quantum states are. There are two types of distance measure in wide use: static, for use when we are in possession of two quantum states and want to see how different they are, and dynamic, for use when we wish to see how well a particular process has preserved the information content of a system. In the classical world the trace distance (also known as the Kolmogorov distance) between two probability distributions is defined to be

D(p_x, q_x) = ½ Σ_x |p_x − q_x|.    (10.3)

The trace distance is symmetric and obeys the triangle inequality, so D(x, y) ≤ D(x, z) + D(z, y). In the quantum regime there is an analogue of the trace distance, which is defined by the following equation for quantum states ρ and σ in density matrix formulation:

D(ρ, σ) = ½ tr|ρ − σ|,    (10.4)

where |A| = √(A†A). If ρ and σ commute then D(ρ, σ) reduces to the Kolmogorov distance between the eigenvalues of σ and ρ. If we act on ρ and σ with a trace preserving quantum operator ε then we discover that

D(ερ, εσ) ≤ D(ρ, σ).    (10.5)

Essentially this means that one cannot increase the distance between two quantum states by acting on them with a trace preserving operator. This is a quantum analogue of the data processing inequality we came across in chapter 4, so now we have a scientific basis for saying 'Things can only get worse'. Another useful distance measure, one that is useful when considering data compression, is fidelity. The fidelity of two states ρ and σ is defined to be:

F(ρ, σ) ≡ tr√(ρ^{1/2} σ ρ^{1/2}).    (10.6)

The fidelity is (despite its appearance) symmetric in its arguments.

10.3 Quantum Error Correction

Single Qubit Flip Code

To explain how we can correct a single qubit being flipped from |0⟩ to |1⟩ or vice versa we must define another quantum gate: the Toffoli gate. The Toffoli gate is sometimes called a controlled-controlled-NOT gate (CCNOT) because its action is to check two qubits and, if they are both in state |1⟩, flip the third. This may be written as:

|x, y, z⟩ → |x, y, z + (x·y)⟩    (10.7)

where the addition on the third qubit is done modulo 2. To begin with we spread our qubit |ψ⟩ = (α|0⟩ + β|1⟩) across three like so:

|ψ⟩|0⟩|0⟩ = (α|0⟩ + β|1⟩)|0⟩|0⟩ → α|000⟩ + β|111⟩.    (10.8)

This process can be achieved by the use of two CNOT gates, one acting from the first qubit to the second and one acting from the first to the third. Note that for the part in state |0⟩ we get |000⟩ 'for free', whereas for the part in state |1⟩ the CNOT gates become active and give us |111⟩. It is important to see that this process does not clone |ψ⟩; were we to do so we would be performing the following:

|ψ⟩|0⟩ → |ψ⟩|ψ⟩ = α²|00⟩ + αβ|01⟩ + βα|10⟩ + β²|11⟩.    (10.9)

This equation has cross terms (e.g. |01⟩) which do not appear in equation (10.8). The situation gets worse if we try |ψ⟩|0⟩|0⟩ → |ψ⟩|ψ⟩|ψ⟩. Suppose we now send our triples across a noisy quantum channel in which one of the qubits gets its state flipped (for generality we will assume that this occurs in the |000⟩ term as well as the |111⟩ term).
We must then find a way to make sure that the first qubit is restored to its original state. We do not care about the final states of qubits two and three, since these are only present to preserve the state of qubit one; we only measure the first qubit at the end. We can do this by applying three quantum gates. If we apply two CNOT gates (first to third and first to second) and then a Toffoli gate (which checks qubits two and three then acts on qubit one), then any single bit flip error will be corrected.

Worked Example

We wish to send |ψ⟩ = α|0⟩ + β|1⟩ → α|000⟩ + β|111⟩ across a noisy quantum channel. Upon doing so, the first qubit is flipped, giving

α|000⟩ + β|111⟩ → α|100⟩ + β|011⟩.

First we apply a CNOT gate from the first to the third qubit, resulting in the state α|101⟩ + β|011⟩. Then we apply a CNOT gate from the first to the second qubit, giving us α|111⟩ + β|011⟩. Finally we apply the Toffoli gate, which flips the state of the first qubit if and only if both the second and third are in state |1⟩. This gives us a final state of α|011⟩ + β|111⟩, from which we measure the first qubit only, to receive our original qubit |ψ⟩ = α|0⟩ + β|1⟩.

Single Qubit Phase Flip Code

Flipped qubits are not the only type of error we may encounter in the quantum world. It is also possible for the phase of qubits to be flipped like so:

α|0⟩ + β|1⟩ → α|0⟩ − β|1⟩.    (10.10)

Such a phase flip is accomplished by the 'Z' operator, defined as:

Z ≡ |0⟩⟨0| − |1⟩⟨1|.    (10.11)

Let us define two quantum states |+⟩ and |−⟩ by the following equations:

|+⟩ = α|0⟩ + β|1⟩,    (10.12)
|−⟩ = α|0⟩ − β|1⟩.    (10.13)

Let us now apply the Z gate to each of these states:

Z|+⟩ = Z(α|0⟩ + β|1⟩) = α|0⟩ − β|1⟩ = |−⟩,    (10.14)
Z|−⟩ = α|0⟩ + β|1⟩ = |+⟩.    (10.15)

As we can see, a phase flip in the {|0⟩, |1⟩} basis is equivalent to a bit flip in the {|+⟩, |−⟩} basis! Therefore we can use the qubit flip code in the {|+⟩, |−⟩} basis to protect us from phase flips.

The Shor Code

This code was created by Peter Shor by concatenating the qubit flip and phase flip codes[18]. We start by moving to the {|+⟩, |−⟩} basis and using the phase flip code:

α|0⟩ + β|1⟩ → α|+++⟩ + β|−−−⟩ = α(|0⟩ + |1⟩)^{⊗3} + β(|0⟩ − |1⟩)^{⊗3},    (10.16)

where (|φ⟩)^{⊗3} = |φ⟩|φ⟩|φ⟩ = |φφφ⟩. We now use the single qubit flip code to get the following:

α|0⟩ + β|1⟩ → α(|000⟩ + |111⟩)^{⊗3} + β(|000⟩ − |111⟩)^{⊗3}.    (10.17)

We are now protected against both qubit flips and phase flips. It turns out, however, that the Shor code protects us against arbitrary errors. So even a small error on a single qubit (a shift such as β → β + η, say), no matter how tiny, will be corrected.

10.4 Quantum Teleportation

Quantum teleportation is a process by which Arthur can send Belinda a qubit in an unknown state using only classical bits and a previously shared entangled pair. The process is called teleportation because the no cloning theorem dictates that Arthur cannot retain a copy of the qubit, so its state must be destroyed. Let us assume that Arthur and Belinda both possess half of an entangled pair of qubits in the Bell state

|φ⁺⟩ = |00⟩ + |11⟩.    (10.18)

Arthur's qubit is in a state |ρ⟩ = α|0⟩ + β|1⟩. The state of all three qubits is now:

|ρ⟩|φ⁺⟩ = (α|0⟩ + β|1⟩)(|00⟩ + |11⟩) = α|000⟩ + α|011⟩ + β|100⟩ + β|111⟩.    (10.19)

After some mathematical juggling we can rearrange the above into:

|ρ⟩|φ⁺⟩ = |φ⁺⟩(α|0⟩ + β|1⟩) + |φ⁻⟩(α|0⟩ − β|1⟩) + |ψ⁺⟩(α|1⟩ + β|0⟩) + |ψ⁻⟩(α|1⟩ − β|0⟩),    (10.20)

with the Bell states referring to the two qubits in Arthur's possession.
This leaves Belinda's qubit in a superposition of the four bracketed states, each of which is in some sense similar to |ρ⟩. Arthur now measures his two qubits in the Bell basis and randomly obtains one of the states. Given that there are four Bell states this corresponds to two bits of classical information. Arthur then telephones Belinda and tells her which Bell state he obtained. Using this information Belinda is able to determine the state her qubit is in and is therefore able to perform a unitary operation to put her qubit into the state |ρ⟩.

Arthur's Bell state | Belinda's qubit state | Operator to recover |ρ⟩
|φ⁺⟩ | α|0⟩ + β|1⟩ | |0⟩⟨0| + |1⟩⟨1| ≡ I
|φ⁻⟩ | α|0⟩ − β|1⟩ | |0⟩⟨0| − |1⟩⟨1| ≡ Z
|ψ⁺⟩ | α|1⟩ + β|0⟩ | |0⟩⟨1| + |1⟩⟨0| ≡ X
|ψ⁻⟩ | α|1⟩ − β|0⟩ | |1⟩⟨0| − |0⟩⟨1| ≡ Y

Table 10.1: Arthur's possible measurements of his qubits and the corresponding operators that Belinda must use to reconstruct |ρ⟩.

10.5 Dense Coding

It is easy to use qubits to transmit classical information. If we wish to send the string 0110 we can transmit |0110⟩. Belinda now measures the qubits in the {|0⟩, |1⟩} basis and recovers the string 0110 with no ambiguity. This method is fine, there is nothing inherently wrong with it, but it does seem a little wasteful: why bother sending the information via qubits? This allows us to send 1 classical bit for each qubit, so we have not gained anything by using a quantum channel.

Suppose now that Arthur and Belinda share two entangled qubits in the state |00⟩ + |11⟩. It turns out that Arthur can now send two classical bits to Belinda by sending her his half of the entangled pair (one qubit). This is known as dense coding, or sometimes superdense coding, depending on how dramatic the author wishes to make the process sound. Dense coding is a counterpart to teleportation: in the latter we use two classical bits to transfer a single qubit, whereas in the former we use a single qubit to transfer two classical bits.

Starting from (|00⟩ + |11⟩) Arthur can generate any of the Bell basis states by using the quantum logic gates {I, X, Y, Z}, as demonstrated in table (10.2). His choice of one state out of four corresponds to two classical bits of information¹.

Value | Transformation | New state
0 | ψ₀ = (I ⊗ I)ψ₀ | |00⟩ + |11⟩
1 | ψ₁ = (X ⊗ I)ψ₀ | |10⟩ + |01⟩
2 | ψ₂ = (Y ⊗ I)ψ₀ | |01⟩ − |10⟩
3 | ψ₃ = (Z ⊗ I)ψ₀ | |00⟩ − |11⟩

Table 10.2: Initial preparation by Arthur.

Arthur then sends his half of the entangled pair to Belinda, who needs to find what state it is in. She does this by using a CNOT gate (also known as an XOR gate) on the pair to place them into one of the quantum states in table (10.3).

Initial state | State after CNOT | First qubit | Second qubit
ψ₀ | |00⟩ + |10⟩ | |0⟩ + |1⟩ | |0⟩
ψ₁ | |11⟩ + |01⟩ | |1⟩ + |0⟩ | |1⟩
ψ₂ | |01⟩ − |11⟩ | |0⟩ − |1⟩ | |1⟩
ψ₃ | |00⟩ − |10⟩ | |0⟩ − |1⟩ | |0⟩

Table 10.3: State of qubits after Belinda applies the CNOT gate.

Belinda is now able to measure the second qubit without disturbing the first. This allows her to distinguish between (|00⟩ ± |11⟩) and (|01⟩ ± |10⟩). To find the sign of the phase Belinda now acts on the first qubit with the Hadamard gate (below) to get the results shown in table (10.4):

H = (|0⟩ + |1⟩)⟨0| + (|0⟩ − |1⟩)⟨1|.    (10.21)

Initial state | H(first qubit)
ψ₀ | [|0⟩ + |1⟩] + [|0⟩ − |1⟩] = |0⟩
ψ₁ | [|0⟩ − |1⟩] + [|0⟩ + |1⟩] = |0⟩
ψ₂ | [|0⟩ + |1⟩] − [|0⟩ − |1⟩] = |1⟩
ψ₃ | [|0⟩ + |1⟩] − [|0⟩ − |1⟩] = |1⟩

Table 10.4: State of the first qubit after Belinda applies the H gate.

Belinda now measures the first qubit in the {|0⟩, |1⟩} basis.
This allows her to determine the state ψ_i that Arthur chose and therefore to recover two classical bits with no ambiguity. Dense coding is beneficial if we wish for our communications to be secure (which we usually do), as the qubit Arthur sends can only yield the two bits to the person holding its entangled partner. Thus it can be very useful for the transmission of cryptographic keys.

¹ The tables in this section have been reproduced from Morali Kota's Quantum Entanglement as a resource for Quantum Information[12].

10.6 Quantum Data Compression

Suppose we have a source that produces a state |ψ_i⟩ with probability p_i. If it can produce m different states then we may write this as

source = (p_i, |ψ_i⟩)_{i=1}^{m}.    (10.22)

We will treat our source as if it is memoryless, so the generation of each quantum state is independent of all those that went before. It will produce strings of n qubits. In order to ease the notation we will write a general string of n qubits as

|ψ_{i_1}⟩|ψ_{i_2}⟩ · · · |ψ_{i_n}⟩ = |ψ_I⟩    (10.23)

with

p_{i_1} p_{i_2} · · · p_{i_n} = p_I.    (10.24)

Such a string will lie in a 2^n-dimensional Hilbert space (denoted H). A compression scheme of rate R_n is a map

C^n : H → K    (10.25)

where K is a Hilbert space of dimension 2^{nR_n}. Decompression is the map back to H from K:

D^n : K → H.    (10.26)

A compression scheme is a pair of operations C^n and D^n. A compression scheme is reliable and has a rate R when the following two conditions are met:

R_n → R, with 0 < R < 1, as n → ∞,    (10.27)
Σ_I p_I F(|ψ_I⟩⟨ψ_I|, D^n(C^n(|ψ_I⟩⟨ψ_I|))) → 1 as n → ∞.    (10.28)

In words we may state these as:

• Asymptotically, R qubits are used for the compression of each qubit in the initial string.
• The quantum states are (on average) close, and the expectation of the fidelity of the processed and original states tends towards 1 as n approaches infinity. So as n approaches infinity, the compression/decompression loses less information.

We are now in a position to provide a final comparison of quantum information theory to its classical counterpart: a quantum source coding theorem. This is usually referred to as Schumacher's quantum noiseless channel coding theorem (just as the source coding theorem could be referred to as Shannon's noiseless channel coding theorem). Schumacher's theorem states that given a source (p_i, |ψ_i⟩) with Von Neumann entropy S(ρ), then for any R > S(ρ) there exists a reliable compression scheme of rate R. Conversely, there does not exist any reliable compression scheme with a rate R < S(ρ). So the entropy of any source provides a natural limit on the extent to which we can compress data (be it classical or quantum).

Having completed this review of quantum information theory and some of its primary results, it is now time to move on to a much more speculative area of research: quantum neural networks.

Chapter 11
Quantum Neural Networks

11.1 Motivation

Quantum neural networks are a very new, speculative field. No attempts have yet been made to build one, as the theoretical analysis is still in its infancy. Having said this, there are still several reasons why we should investigate them. One of the main practical reasons to research them comes from quantum algorithms. Quantum algorithms have sped up several classically inefficient computational tasks. Grover's search algorithm can search an unordered database with n entries in O(n^{1/2}) time, compared to classical algorithms which will require O(n) time.
Peter Shor's factoring algorithm can factorise numbers in polynomial time with respect to the number of binary digits, unlike the best known classical algorithms, which require a time that grows faster than any polynomial in the number of binary digits. Given these dramatic improvements, it is hoped that quantum neural networks may provide similar advantages over classical neural networks.

One of the major problems with using artificial neural networks for associative memory is their limited storage capacity. Whilst they function admirably for pattern association and completion, they do not do so efficiently. A quality that we would like to preserve is their guaranteed convergence to a stable state (providing they are not overloaded). According to Ezhov and Ventura[6], if we have a quantum Hopfield network trained with a single memory then we are guaranteed to have convergence to its stable state. Due to the linearity of quantum mechanics, however, we can place several of these single memory quantum neural networks into a superposition. Then all of them will act as a single memory state no matter how many memories are stored; according to Ventura and Martinez[24], a quantum associative memory has exponential storage capacity in the number of qubits!

A frequently overlooked reason to study such networks is that they provide a source of intellectual stimulation. A rich theory of quantum information has been developed and it would seem negligent not to apply it as widely as possible. When information theory was applied to physics it was made clear why some effects (decoherence, for example) may travel faster than light without causing problems with our understanding of the universe: these effects are free to travel faster than light as they please, provided that no information is transferred in the process. Just as the laws of physics were elucidated by the advent of information theory, so might the field of neural networks be advanced and illuminated by the application of quantum information theory.

Of the few papers that have been written on quantum neural networks and quantum neural network-like systems, most have been focussed on creating a network that functions as an associative memory for pattern recognition. Consequently this is where we will devote most of our attention. We will look into their architecture first, explaining how to 'quantise' a neural network, then we will have a brief overview of some work that has been done on training these quantum neural networks. We will then move on to the specific case of training a quantum associative memory and investigating its benefits and pitfalls.

11.2 Architecture

Before studying quantum neural networks we must first adequately define our architecture. The route we will be taking is the most common one and consists of replacing each neuron with a qubit. Instead of having a neuronal state lying in the range 0 ≤ y ≤ 1, our qubit will be in the state α|0⟩ + β|1⟩. This now raises the question of how to quantise the synaptic weights. Once more we will move with the crowd: the synaptic weight between two quantum neurons (qubits) will be given by the amount of entanglement that they share.

11.3 Training Quantum Neural Networks

In his work, Altaisky discussed one of the problems with trying to directly quantise a neural network[1]. The activation function that we are used to using in our classical artificial neural networks is (in general) nonlinear, which causes problems when we try to involve quantum mechanics, an inherently linear theory.
He proposes that the closest we can come is to have an operator $\hat{F}$ act on the weighted sum, like so:

    \text{state of qubit} = \hat{F} \sum_i \hat{w}_i |x_i\rangle.    (11.1)

The learning rule he used was also non-unitary, but this was due to physical considerations and an alternative, unitary rule was suggested for theoretical work.

Ricks and Ventura have discussed an alternative way to train the network. Their method was to search through all the possible weight vectors of the network in order to find one which is consistent with the network's training data[17]. This is not without its problems, however, since it is not possible to guarantee that the network will correctly classify all the training data. It is also possible for the network to overfit the data when using this training method, just like a classical neural network. The algorithm does not scale well, but has the advantage that it can be used to achieve arbitrary levels of accuracy. In the same paper they present a randomised search algorithm which searches each neuron's weight space independently. The results of the simulations they ran indicated that their randomised algorithm is very efficient in comparison with traditional training methods such as backpropagation. After 74 presentations of the training data it had correctly classified 95% of the training set. Standard backpropagation was able to classify 98% of the set, but only after 300 presentations.

11.4 Quantum Associative Memory

In this section we will follow the approach of Ezhov and Ventura[6] as they create a quantum associative memory. We will give only an overview of the method, as the process is highly technical and not especially illuminating. There are two distinct phases to using an associative memory: memorisation and recall. We will discuss each in turn and then combine the algorithms to summarise what we find.

Memorisation

In the process of memorisation we are seeking to store $m$ patterns, each having a length of $n$ qubits. The algorithm uses several well known operators (Toffoli gates, CNOT gates, etc.) that act on one, two or three qubits. One new operator in the process is given below:

    \hat{S}^p = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & \sqrt{\frac{p-1}{p}} & \frac{-1}{\sqrt{p}} \\ 0 & 0 & \frac{1}{\sqrt{p}} & \sqrt{\frac{p-1}{p}} \end{pmatrix}    (11.2)

The parameter $p$ ranges from 1 to $m$ and so there is a unique $\hat{S}^p$ for each of the desired memories. The algorithm requires $2n + 1$ qubits: the first $n$ qubits actually store the memories, whilst the other $n + 1$ are used for book-keeping and are restored to the state $|0\rangle$ after each iteration. After $m$ iterations the first $n$ qubits are in a coherent superposition of states that correspond to the patterns we wanted to store. It is important to note that this training does not introduce the spurious memories that training a classical associative memory does: our memory is free from these illogical phantoms. The encoding process is polynomial in the number of patterns and the length of the patterns, requiring $O(mn)$ steps to encode the $m$ patterns as a superposition of states over $n$ qubits. This is optimal, since simply reading the patterns cannot be done in less than $O(mn)$ time.

Recall

The recall process is implemented by Grover's search algorithm. The use of Grover's search algorithm corresponds to measuring the system and having it collapse to the state we are searching for. If we know only $n - k$ of the qubits when we start our search, then the collapse of the system to the appropriate memory may be considered as performing pattern completion. Pattern completion is all well and good, but ultimately we want our memory to perform pattern association too.
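Before turning to association, it is worth checking numerically how rotations of the form (11.2) distribute amplitude during memorisation. The Python sketch below models only the amplitudes involved (not the full $2n+1$-qubit state) and reflects our reading of the algorithm rather than its exact circuit: applying the rotation for $p = m, m-1, \ldots, 1$ peels amplitude $1/\sqrt{p}$ off the as-yet-unused portion of the state, leaving every stored pattern with amplitude $1/\sqrt{m}$.

import math

def memorisation_amplitudes(m):
    # Toy model of the memorisation step: at stage p an S^p-style rotation
    # transfers a fraction 1/sqrt(p) of the remaining amplitude onto the
    # newly stored pattern.  Stages are applied for p = m, m-1, ..., 1.
    remaining = 1.0
    stored = []
    for p in range(m, 0, -1):
        stored.append(remaining / math.sqrt(p))
        remaining *= math.sqrt((p - 1) / p)
    return stored, remaining

amplitudes, leftover = memorisation_amplitudes(5)
print(amplitudes)  # every entry is 1/sqrt(5), approximately 0.447
print(leftover)    # approximately 0: all amplitude has been distributed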
It turns out that with only a slight modification, the recall algorithm can 'recognise' noisy images. Unfortunately, adding association to our network can result in the recall of spurious memories. These spurious memories are not stored in the network, but are an unfortunate side effect of the associative recall process.

Combining The Algorithms

A quantum associative memory can be constructed by combining the two algorithms above. There are $2^n$ distinct patterns, of which the network stores $m$. The memorisation takes $O(mn)$ time and the recall of a pattern requires $O(\sqrt{2^n})$ time. The recall process takes $O(\sqrt{2^n})$ time because it is searching all possible patterns that the system could store. This recall time is exponential, which is not ideal; however, Ezhov and Ventura suggest that a non-unitary recall mechanism could improve upon this. If unitarity is required, then Grover's search algorithm has been proved to be optimal.

11.5 Implementation

All models of quantum computation are plagued by decoherence due to unwanted interaction with the system's environment. It has been suggested that quantum neural networks may be implemented before traditional quantum computers by virtue of their significantly lower requirements on both the number of qubits and the number of state transformations needed to perform useful calculations. One proposal to counteract the problem of decoherence is to not use a superposition of states to store the memories. Unfortunately this proposal does not take advantage of quantum mechanics; as such it would provide us with little more than a very expensive artificial neural network that just happens to use qubits. The other major problem is common to all attempts to create a physical neural network: the high density of connections makes them very difficult to implement in small-scale systems.

11.6 Performance

Quantum neural networks appear to offer several advantages over classical neural networks. Menneer and Narayanan found that training their quantum neural networks required 50% fewer weight changes than their classical counterparts[14]. They also found that the quantum neural networks were able to learn particular sets of training data that the classical networks could not. They then make the significant point that quantum neural networks do not suffer from the catastrophic amnesia that overloaded classical neural networks do.

According to Trugenberger, a quantum neural network is capable of storing every pattern that can be formed, giving a capacity that is exponential in the number of qubits[20]. His recall mechanism is probabilistic: it transforms the initial superposition of states into a superposition centred on the input pattern. This means that any measurement of the system is more likely to result in a state close to the input pattern. Unfortunately, Brun et al. take issue with Trugenberger's claim and proceed to build a convincing case that Trugenberger is in fact mistaken[2]. They make the valid point that once a single memory has been recalled the network's superposition has collapsed and it can no longer be used. The memory state cannot be perfectly cloned, and even if we use a probabilistic method of cloning (which is unreliable, and as such does not violate the no-cloning theorem) then its quality will degrade over time, limiting its utility.
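The no-cloning theorem invoked here follows directly from the linearity of quantum mechanics; the short argument below is the standard one, after Wootters and Zurek[26], rather than the specific form used by Brun et al. Suppose some unitary $U$ could copy an arbitrary unknown state onto a blank register, so that for any two states

    U\,|\psi\rangle|0\rangle = |\psi\rangle|\psi\rangle, \qquad U\,|\phi\rangle|0\rangle = |\phi\rangle|\phi\rangle.

Taking the inner product of these two equations and using the fact that $U$ preserves inner products gives

    \langle\phi|\psi\rangle = \langle\phi|\psi\rangle^2,

so $\langle\phi|\psi\rangle$ must equal 0 or 1. Only identical or orthogonal states could be copied by such a device, and no machine can clone an unknown, arbitrary quantum state.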
Following this response, Trugenberger published a reply[21] in which he moderates his statements about storage capacity and explains that the memories could be stored in an operator $\hat{M}$ which acts on $|0\rangle$ so that

    \hat{M}|0\rangle = |M\rangle    (11.3)

where $|M\rangle$ is the uniform superposition of memory states. He states that, at the very least, an associative memory can be formed for a polynomial number of patterns without the appearance of spurious memories (and hence with no transition to amnesia). He also agrees that $\hat{M}$ would degrade over time and suggests that for a number of memories polynomial in $n$ it could be repeatedly and efficiently manufactured using a sequence of elementary unitary gates involving at most two qubits.

Ventura et al. state that quantum neural networks have a storage capacity that is exponential in the number of qubits, and Trugenberger believes that in the worst case they have a storage capacity that is polynomial in the number of qubits. Brun et al., however, are convinced that these views are erroneous and that quantum neural networks hardly offer any advantage over their classical counterparts when producing an associative memory. The question has not been conclusively answered, and so it appears that only further research will settle the issue once and for all.

Chapter 12
Conclusion

In part 1, we discovered some of the main uses of information theory. We now understand how to compress data and correct errors that arise from using noisy channels. Thanks to the ingenuity of programmers, people the world over can reap the benefits of Shannon's insight without knowing the intimate details of conditional probabilities and noisy channels. A good deal of current research is aimed at broadcast technologies, where simply retransmitting corrupted data is not a viable option. So-called digital fountain codes have been devised which have the remarkable property that, provided one receives a certain proportion (nine out of ten, say) of the codewords per set (e.g. a frame of a television image), it is possible to rebuild the rest using redundant information contained in those that did arrive. In other words, there is not only redundancy within each codeword, but also between the codewords themselves. The advent of High Definition Television (HDTV) has brought greater focus on these codes as broadcasters attempt to reliably broadcast ever more information over the already crowded airwaves.

In part 2 we investigated pattern recognition and associative memory through the use of neural networks. As we have mentioned before, neural networks have inspired a vast amount of research. Profiteers are drawn to neural networks by their ability to recognise trends and predict future behaviour. They hope to use this to predict the stock market and make their millions. The continued failure of neural networks to reliably predict the stock market has not prevented huge investments of time into this and probably never will. A contrasting area of neural network research focuses on the application of self-organising maps to the reconstruction of surfaces, a technology that has been used to store three-dimensional models of archaeological finds[5]. This is very important culturally, as it allows us to keep an enduring 'copy' of an artefact lest disaster strike and the original be lost or destroyed.

Research in quantum information is also heading in several directions at once. A vast amount of resources has been ploughed into the investigation of quantum cryptography and quantum key distribution.
Implemented correctly, these provide uncompromisingly secure communication and are therefore of great interest to governments and large businesses. By contrast, quantum neural networks are very much a niche area. From the exchanges between Trugenberger[20, 21] and Brun et al.[2] we can see that there is still a healthy debate over whether quantum neural networks provide any benefit over classical neural networks, let alone enough to justify significant investment.

Over the course of this report we have touched on some of the ideas that have helped to shape the world in which we live. Computers, email and even televisions would not exist in their present form were it not for the insightful work of Claude Shannon and the many who followed in his footsteps during the decades following his original paper. Further to this, we now understand how patterns are recognised, memories stored and associations made. We have seen candidates for a new generation of computers that would render present machines obsolete overnight. Fifty years ago people would never have predicted our current technological state, and given the models and ideas we have covered here, it seems unlikely that we can imagine what technology will look like in another fifty years. Will neural networks and quantum computers be commonplace, as televisions and mobile telephones are now? Or will further research relegate them to the private research of a few dedicated individuals? It seems that only time and further research will sate our curiosity in these unique and fascinating areas.

Appendix A
Notation

In order to keep our notation uncluttered we will denote strings of digits $(x_1, x_2, x_3, \ldots)$ simply by the letter $x$, on the understanding that $x$ is a vector. Should we wish to refer to a specific element of $x$ we will use a subscript, like so: $x_4$. Different input patterns will be denoted like so: $x(1)$, $x(2)$ and so on. In part 2 we denote the weight vector of neuron $k$ by $w_k$, on the understanding that it too is a vector, with components $(w_{k1}, w_{k2}, w_{k3}, \ldots)$. If we wish to refer to a particular synaptic weight then we will add another subscript, giving $w_{ki}$ (a scalar). Matrices will be denoted by capital letters. In light of the above conventions we will write the scalar product simply as

    w_k \cdot x = \sum_i w_{ki} x_i    (A.1)

with the summation over $i$ being implied whenever we write $w_k \cdot x$.

Bibliography

[1] M. V. Altaisky. Quantum Neural Network, 2001.
[2] T. Brun, H. Klauck, A. Nayak, M. Rötteler, and Ch. Zalka. Comment on 'Probabilistic Quantum Memories'. Phys. Rev. Lett., 91(20):209801, Nov 2003.
[3] Soren Brunak and Benny Lautrup. Neural Networks: Computers With Intuition. World Scientific Publishing Co Pte Ltd, 1989.
[4] Collaborative. Quantiki. http://www.quantiki.org.
[5] Margarita Díaz-Andreu, Richard Hobbs, Nick Rosser, Kate Sharpe, and Trinks. Long Meg: Rock Art Recording Using 3D Laser Scanning. Past (The Newsletter of the Prehistoric Society), (50):2–6, 2005.
[6] Alexandr Ezhov and Dan Ventura. Quantum Neural Networks. Future Directions for Intelligent Systems and Information Science 2000, 2000.
[7] Neil Fraser. Neural Network Follies, 2003. http://neil.fraser.name/writing/tank/.
[8] Ugur Halici. Artificial Neural Networks, 2004. http://vision1.eee.metu.edu.tr/ halici/courses/543LectureNotes/
[9] Simon Haykin. Neural Networks, A Comprehensive Foundation. Prentice Hall International Incorporated, 1999.
[10] Raymond Hill. A First Course In Coding Theory. Oxford University Press, date unknown.
[11] J.J. Hopfield, D.I. Feinstein, and R.G. Palmer. 'Unlearning' has a Stabilizing Effect in Collective Memories. Nature, 304:158–159, Jul 1983.
[12] Morali Kota. Quantum Entanglement as a Resource for Quantum Communication. Technical report, Massachusetts Institute of Technology, 2002. http://www.cs.caltech.edu/cbss/2002/pdf/quantum morali.pdf.
[13] David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2004.
[14] Tammy Menneer and Ajit Narayanan. Quantum-inspired Neural Networks, 1995. http://citeseer.ist.psu.edu/menneer95quantuminspired.html.
[15] Michael Nielsen and Isaac Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2000.
[16] D. Petz. Quantum Source Coding and Data Compression. To be published in the proceedings of the Conference on Search and Communication Complexity.
[17] Bob Ricks and Dan Ventura. Training a Quantum Neural Network. Neural Information Processing Systems, pages 1019–1026, Dec 2003.
[18] Peter Shor. Scheme for Reducing Decoherence in Quantum Computer Memory. Physical Review A, 52:2493–2496, 1995.
[19] Andrew Steane. Quantum Computing, 1997. arXiv:quant-ph/9708022v2.
[20] C. A. Trugenberger. Probabilistic Quantum Memories. Phys. Rev. Lett., 87(6):067901, Jul 2001.
[21] C. A. Trugenberger. Reply to Comment on 'Probabilistic Quantum Memories'. Phys. Rev. Lett., 91(20):209802, Nov 2003.
[22] Vlatko Vedral. An Introduction to Quantum Information Science. Oxford University Press, 2006.
[23] Dan Ventura. On the Utility of Entanglement in Quantum Neural Computing. In International Joint Conference on Neural Networks, pages 286–295, July 2001.
[24] Dan Ventura and Tony R. Martinez. Quantum Associative Memory. Information Sciences, 1-4:273–296, 2000.
[25] Li Weigang. Quantum Neural Computing Study. http://www.cic.unb.br/ weigang/qc/aci.html, accessed on 09/07/2007.
[26] W. K. Wootters and W. H. Zurek. A Single Quantum Cannot Be Cloned. Nature, 299:802–803, 1982.