Information Theory and Machine Learning
David Kaye
April 25, 2008
Abstract
This project will start by looking at classical information theory, covering many results from
Claude Shannon’s 1948 paper A Mathematical Theory of Communication before moving on to
machine learning. This will be covered in the context of neural networks, as they are an
influential field of study with many modern applications. Finally it will look at quantum
information theory, comparing and contrasting it with the classical theory before moving on to
introduce the reader to quantum neural networks and some of their applications.
Contents

1 Introduction  3

I Classical Information Theory  5

2 Introduction to Information Theory  6
2.1 Binary Symmetric Channel  6
2.2 Linear Codes  7
2.3 Error Correcting Codes  8
2.3.1 Repetition Codes  8
2.3.2 Block Codes  8

3 Probability, Information and Entropy  11
3.1 Ensembles  11
3.2 Probability Book-Keeping  12
3.3 Information and Entropy  12
3.4 Examples  14

4 Information Coding  15
4.1 Introduction  15
4.2 Source Coding  15
4.3 Symbol Codes  17
4.4 Further Entropy and Information  19
4.5 The Noisy Channel Coding Theorem  20

II Machine Learning and Neural Networks  22

5 Introduction to Neural Networks  23
5.1 Neurons  23
5.2 Neural Networks  23

6 Learning Types  28
6.1 Supervised Learning  28
6.2 Reinforcement Learning  29
6.3 Unsupervised Learning  30
6.4 Goals for Learning  30

7 Learning Algorithms  32
7.1 Error Correction Learning  32
7.2 Hebbian Learning  34
7.3 Competitive Learning  36
7.4 Self Organising Maps  37

8 Information Theory Applied to Neural Networks  38
8.1 Capacity of a Single Neuron  39
8.2 Network Architecture for Associative Memory  40
8.3 Memories  41
8.4 Unlearning  41
8.5 Capacity of a Hopfield Network  42

III Quantum Information Theory and the Cutting Edge  43

9 Quantum Mechanics  44
9.1 Motivation  44
9.2 Qubits  44
9.3 The EPR Paradox  45
9.4 Entanglement  46
9.5 The Bell States  46
9.6 The No Cloning Theorem  46

10 Quantum Entropy and Information  48
10.1 Von Neumann Entropy  48
10.2 Quantum Distance Measures  48
10.3 Quantum Error Correction  49
10.4 Quantum Teleportation  51
10.5 Dense Coding  51
10.6 Quantum Data Compression  53

11 Quantum Neural Networks  54
11.1 Motivation  54
11.2 Architecture  55
11.3 Training Quantum Neural Networks  55
11.4 Quantum Associative Memory  55
11.5 Implementation  56
11.6 Performance  57

12 Conclusion  58

A Notation  60

Bibliography  62
Chapter 1
Introduction
When we wish to communicate with somebody we generally need to do so over an imperfect
channel, be it a crackling telephone line, a noisy room or even just an unreliable internet
connection. The channel will usually add noise to whatever we are saying/sending and so we
need to protect against this. One way of doing this is to add redundancy into the message,
allowing the recipient to check the message against the redundancy to detect, and hopefully
correct, any errors that have occurred.
Unfortunately this decreases the rate at which we can communicate: if we are capable of
sending 10 megabytes per second, but the tenth is purely for correcting errors (and therefore
contains no new information), then our effective rate of communication is just 9 megabytes per
second. To overcome this difficulty we can try to compress the data we wish to send. If we can
find a way to convey 10 megabytes worth of information using only 9 megabytes then we can
use the tenth for error correction. This would mean that for the same rate of communication,
we can now correct some errors that will undoubtedly occur whilst the message is in transit.
It was Claude Shannon who pioneered the field of information theory, the focus of which is
to provide solutions to the problem of communicating over different types of channels. It is this
that we will focus on in part 1.
Error-correcting codes work by having a list of permitted words and attempting to correct
any deviations from these allowed forms. We can therefore view error correction as a form of
pattern recognition, a huge field in its own right.
Popular tools in this area are so-called 'neural networks': computational models inspired by the neurons in our own brains. As one might expect given our natural aptitude at recognising blurred pictures of people, trees and fruit, these networks are inherently suited to pattern recognition and classification (not to mention various other tasks).
Part 2 will look at neural networks, covering their similarities to and differences from the
biological neural networks found within our crania. It will focus on the ways in which they
can store patterns and other data. We will then proceed to examine the limits of these neural
networks: how much information can they store? How reliable are they? And what happens if
we try to store too much information in them?
Next we shall take a brief look at the history of physics in order to prepare ourselves for
some results to follow. In the early twentieth century it was discovered that the laws of classical
physics (for example, those of Sir Isaac Newton) did not provide a complete description of the
world we inhabit. Experiments had shown that under certain conditions particles such as electrons display properties that were unquestionably wave-like, and photons (hitherto considered
purely as waves) were shown to have distinctly particle-like properties. The theory surrounding
these observations was dubbed quantum mechanics.
These findings were not merely physical curiosities; they turned out to have wide-reaching
implications for almost all areas of science. Information theory, for example, is turned on its
head when we incorporate the laws of quantum mechanics. Under the quantum paradigm, many
of the activities we take for granted (like copying unknown data as and when we please) can
no longer be done: quantum information simply does not behave in the same way as classical
information.
It is the behaviour and control of this new type of information that we shall spend the majority
of part 3 discussing. For the remainder we shall delve into an exciting new field: that of quantum
neural networks. It is a highly speculative area, so much so that research is still underway to
ascertain what they are capable of and how they differ in form and function from the (classically
formulated) neural networks covered in part 2.
Part I
Classical Information Theory
Chapter 2
Introduction to Information Theory
2.1 Binary Symmetric Channel
Suppose Alice wishes to send a message (an email perhaps) to Bob using her computer. This message will consist of a string of digits x, each element of which can be either 0 or 1. Note that any string of binary digits of length n may be considered as an element of an n-dimensional vector space defined over the field of integers modulo 2. This vector will need to cross some kind of channel, like a phone line, over which there will be some noise. As a result of the noise the received vector y may differ from x. We shall make the following assumptions about the channel:
1. The channel only transmits 1s and 0s; this is called a binary channel.
2. The probability of y_i differing from x_i is equal to f, i.e.
$$P(y_i = 0 \mid x_i = 1) = f, \qquad (2.1a)$$
$$P(y_i = 1 \mid x_i = 0) = f. \qquad (2.1b)$$
Therefore the probability of y_i being the same as x_i is equal to 1 − f. This is called a symmetric channel.
Figure 2.1: The binary symmetric channel.
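To make the channel model concrete, here is a minimal Python sketch of a binary symmetric channel (the function name and the use of the standard random module are illustrative choices):

```python
import random

def binary_symmetric_channel(x, f):
    """Transmit a list of bits x: each bit is flipped independently with
    probability f and left untouched with probability 1 - f."""
    return [bit ^ 1 if random.random() < f else bit for bit in x]

# Example: send a short message over a fairly noisy channel.
x = [0, 1, 0, 1, 1, 0, 0]
y = binary_symmetric_channel(x, f=0.1)
print(x, "->", y)
```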
Computers store and manipulate all data in binary form. The storage medium may also
be viewed as a channel, however instead of transporting the data from one spatial location to
another it transports them to a different temporal location. In an ideal world our storage media
would be error free and we could utilise their entire capacity for storing data. Unfortunately
our media are not perfect and as such some of their capacity must be used for bookkeeping data
in order to detect and correct errors.
Given a hard disk with a bit error probability of f there are two ways to increase its
reliability. First there are physical methods. These aim to directly decrease f by making the
hard disk from better quality components, making it airtight and cooling it to reduce thermal
noise. All of these methods, whilst effective, are expensive and increase the financial cost of the
channel.
The second method is the 'system method'; this involves adjusting the system itself in order to make it error tolerant. System methods involve encoding the source message s to add redundancy; it is this encoded message (denoted t) that is transmitted over the binary symmetric channel. The channel adds noise to t, meaning that in general the received message r differs from t. r is then passed through a decoder, which uses the redundancy to extract (or attempt to extract) the transmitted message t and the noise that was added to it. The redundancy is then stripped and (if the error correction has been successful) the source message s is recovered. This method has the advantage of providing error resistance at little or no additional financial cost; however, there is an increased cost in terms of computational resources.
There is a particular class of codes that we shall be looking at due to their very useful properties; these are called linear codes, for reasons that shall become apparent.
2.2 Linear Codes
A code C over a field F (meaning that elements of C are constructed from elements of F) is linear if:
1. for any two codewords u, v ∈ C we have u + v ∈ C
2. for all codewords u and elements a ∈ F, a · u is also a codeword (note that F = {0, 1} for the binary symmetric channel).
From this definition it follows that the null vector (a string of all 0’s) is a member of all linear
codes. A linear code may also be viewed as a group under addition, a feature that has led to
some authors calling them group codes.
Given a vector x, we define its weight w(x) to be the number of non-zero elements of x. For
example if x = (01100101), then w(x) = 4. We will only be considering binary codes so it is
possible to define the distance between codewords x and y as d(x, y) = w(x − y). The minimum
distance is the smallest value of d(x, y) for all codewords x and y.
This brings us to the first advantage of linear codes: the smallest distance between any two
codewords is equal to the smallest value of w(x) for all non zero codewords in C. This means
that if our code has M codewords, we need only make M − 1 comparisons (comparing each codeword to the null vector), unlike a nonlinear code where we would be required to make $\binom{M}{2} = \tfrac{1}{2}M(M-1)$ comparisons (comparing every possible pair of codewords).
The second advantage is that linear codes are easy to specify due to their group-like nature. Whereas with a nonlinear code we may need to list every single codeword, for a linear code we simply need to list a set of 'basis' codewords from which the other codewords may be generated, since we know that the sum of any two codewords is also a codeword.
The final advantage of using linear codes is that they are very easy to encode and decode - these operations amount to matrix algebra. If we have a k × n matrix G whose rows form a basis of an (n, k) code (one that encodes strings of length k into codewords of length n), then G is called a generator matrix of that code. By using elementary row operations:
1. interchange two rows
2. multiply a row by a scalar
3. add one row to another
4. interchange two columns
5. multiply a column by a scalar,
it is possible to convert a k × n generator matrix G into what is known as standard form: G = [I_k | M], where I_k is the k × k identity matrix and M is a k × (n − k) matrix. When G is in standard form all codewords of C may be formed by multiplying the k-dimensional vector x that we seek to encode by the transpose of G, like so: y = G^T x.
2.3 Error Correcting Codes
2.3.1 Repetition Codes
Repetition codes (denoted R_n) are simple: just repeat each bit that you wish to send n times (with n odd). For example, to send 0101100 using the repetition code R3 we would go through the following procedure:
s    0     1     0     1     1     0     0      source message s
t    000   111   000   111   111   000   000    transmitted message (t = s + redundancy)
n    001   000   010   001   000   000   100    noise added to transmission n
r    001   111   010   110   111   000   100    received message (t + n)
d    000   111   000   111   111   000   000    decoded message d
s̄    0     1     0     1     1     0     0      message (d − redundancy)

Table 2.1: Encoding and decoding using the R3 repetition code
The probability of a decoding error is dominated by the probability of two bit errors occurring in a single triplet, which scales with f^2. An error would also occur if we had three errors in a triplet, but this probability scales with f^3, and is often negligible in comparison to the probability of two errors occurring.
Repetition codes certainly give us a rapidly decreasing error rate (think of R5 with an error rate of 0.05) but they also significantly restrict our rate of communication. Using a repetition code R_m, our communication rate drops to 1/m, as we are sending m bits over the channel for every bit of information in our source message. In our perfect fantasy world we would like to combine a low error probability with a high transmission rate.
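A minimal sketch of the R3 procedure of Table 2.1, with a majority vote used for decoding (the function names are illustrative):

```python
import random

def encode_R3(source):
    """Repeat every source bit three times."""
    return [bit for bit in source for _ in range(3)]

def decode_R3(received):
    """Majority vote within each triplet recovers the most likely source bit."""
    decoded = []
    for i in range(0, len(received), 3):
        triplet = received[i:i + 3]
        decoded.append(1 if sum(triplet) >= 2 else 0)
    return decoded

source = [0, 1, 0, 1, 1, 0, 0]
transmitted = encode_R3(source)
# Flip each transmitted bit with probability f = 0.1 (binary symmetric channel).
received = [b ^ 1 if random.random() < 0.1 else b for b in transmitted]
print(decode_R3(received))
```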
2.3.2 Block Codes
Block codes take a sequence of source bits of length k and convert it to a sequence of bits of length n (n > k) to add redundancy. In a linear block code the extra n − k bits are called 'parity check' bits. For the (7, 4) Hamming code n = |t| = 7 and k = |s| = 4.
To create one of the Hamming codewords, place the source word in a tri-circle diagram. Set t_i = s_i for i = 1 . . . 4, then set t_5 . . . t_7 so that the parity within each circle is even. The parity of s_1 s_2 s_3 is 0 if s_1 + s_2 + s_3 is even, and 1 if s_1 + s_2 + s_3 is odd.
Given that the Hamming Code is linear, it can be written compactly in terms of matrices (meaning that all codewords may be written in the following form: t = G^T s).
s      t          s      t
0000   0000000    1000   1000101
0001   0001011    1001   1001110
0010   0010111    1010   1010010
0011   0011100    1011   1011001
0100   0100110    1100   1100011
0101   0101101    1101   1101000
0110   0110001    1110   1110100
0111   0111010    1111   1111111

Table 2.2: The sixteen source words of the Hamming Code
Figure 2.2: Tri-circle diagram.

$$G = \begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 & 1
\end{pmatrix}$$

The generator matrix of the Hamming Code. Note that G^T = [I_4 ; P].
Now that we have a swish new encoder we need a correspondingly swish decoder. Whilst
decoding it is important to remember that any of the transmitted bits may be flipped - even
our parity bits. We will make one further assumption here, namely that we have no idea what
codewords will be sent (i.e. as far as we know they are all equally likely).
For a given received vector r the optimal decoder selects the codeword t that differs from
r in the fewest places. There is more than one method of doing this. One very simple method
would be to compare r to each of the codewords t one by one, counting the number of places where r_i ≠ t_i and selecting the codeword which minimises these discrepancies after the 16 comparisons.
For a small code such as the (7, 4) Hamming code this inefficiency is not too troublesome; however, if we generalise the Hamming code to an (n, k) code then we need to perform 2^k comparisons per codeword received. As k increases this becomes devastatingly inefficient and as a result this method is rarely used.
The pattern of parity violations is called the syndrome. If we have parity violation in circles two and three then the syndrome, denoted z, is (0, 1, 1). It follows from this definition that for all codewords t we have z = (0, 0, 0). The syndrome is calculated by using the following formula:
Syndrome = (parity calculated from r_1–r_4) + (parity received in r_5–r_7) [modulo 2].
Once we have found the circles with odd parity we must search for the smallest number of
flipped bits that will produce this parity violation. Given z = (0, 1, 1) we need to flip a bit that
lies only in circles two and three. Luckily a unique bit exists for each possible syndrome and so
we can build up a map between the syndrome and the flipped bit.
z            000    001   010   011   100   101   110   111
perpetrator  none   r7    r6    r4    r5    r1    r2    r3

Table 2.3: The eight possible syndromes for the Hamming code.
The above syndromes could all be caused by more complex bit-flip patterns. For example, upon receiving the string 0110101 [syndrome 100] we can see that the error could lie either with r5 or with r3 and r4 together. However, a larger number of flipped bits is necessarily less likely, so we choose to flip r5. Using this method of decoding we can see that if one bit is flipped the error is detected and corrected; however, if two bits are flipped then the true error is not identified and thus the 'correction' applied to r leads to 3 bit errors. If r3 and r4 had in fact been the culprits then our decoding would have given us a string s̄ = 0110101.
We have our matrix G^T = [I_4 ; P], so now we define a matrix H = [P | I_3] to compute the syndrome. This is a linear operation and is performed by applying H to r like so: z = Hr. All of the received codewords can (by definition) be written in the following form: r = G^T s + n, meaning that the syndrome is equal to HG^T s + Hn. Note that HG^T = 0, so our syndrome is calculated from Hn. In essence, the problem we are facing with syndrome decoding is: given Hn, find the most probable vector n that gives this particular value of z. Any decoder which solves this problem is known as a maximum likelihood decoder, for reasons which are hopefully clear.
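A sketch of encoding and syndrome decoding for the (7, 4) Hamming code, using the generator matrix G given above and the parity-check matrix H = [P | I_3], with all arithmetic modulo 2 and the syndrome-to-bit map of Table 2.3 (the helper names are illustrative):

```python
# (7,4) Hamming code: codeword t = s G (mod 2), syndrome z = H r (mod 2).
G = [[1, 0, 0, 0, 1, 0, 1],
     [0, 1, 0, 0, 1, 1, 0],
     [0, 0, 1, 0, 1, 1, 1],
     [0, 0, 0, 1, 0, 1, 1]]
# H = [P | I3]; each row is one of the three parity checks (one per circle).
H = [[1, 1, 1, 0, 1, 0, 0],
     [0, 1, 1, 1, 0, 1, 0],
     [1, 0, 1, 1, 0, 0, 1]]

def encode(s):
    """Encode a 4-bit source word into a 7-bit codeword."""
    return [sum(s[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

def syndrome(r):
    """Compute the 3-bit syndrome of a received word."""
    return tuple(sum(H[i][j] * r[j] for j in range(7)) % 2 for i in range(3))

# Map each non-zero syndrome to the single bit it implicates (Table 2.3).
PERPETRATOR = {(1, 0, 1): 0, (1, 1, 0): 1, (1, 1, 1): 2, (0, 1, 1): 3,
               (1, 0, 0): 4, (0, 1, 0): 5, (0, 0, 1): 6}

def decode(r):
    """Flip the single most likely erroneous bit, then strip the parity bits."""
    r = list(r)
    z = syndrome(r)
    if z in PERPETRATOR:
        r[PERPETRATOR[z]] ^= 1
    return r[:4]

t = encode([1, 0, 1, 1])
t[4] ^= 1                 # corrupt one bit in transit
print(decode(t))          # recovers [1, 0, 1, 1]
```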
The probability of block error (denoted p_B) is the probability that the decoded message and the source message are not in fact the same, P(s̄ ≠ s). The probability of bit error is the average probability that a particular decoded bit doesn't match its corresponding source bit,
$$p_b = \frac{1}{k} \sum_{i=1}^{k} P(\bar{s}_i \neq s_i).$$
For the Hamming code p_B is the probability of more than one bit being flipped, of order f^2 + f^3 + . . . + f^7. This probability scales with O(f^2), exactly the same as our R3 code; however, the Hamming code has a much higher rate of transmission - a rate of 4/7 (four source bits for every seven transmitted bits). A significant contrast to the R3 code with its measly 1/3 transmission rate.
In his groundbreaking 1948 paper (entitled A Mathematical Theory of Communication) Claude Shannon proved that it is possible to have an arbitrarily small probability of bit error combined with a non-zero rate of transmission. This is the Noisy Channel Coding Theorem and will be discussed in chapter 4. Before we are able to truly appreciate this result, we must define and explain some new concepts.
Chapter 3
Probability, Information and Entropy
3.1 Ensembles
We start by defining an ensemble X as a triplet (x, A_x, P_x) where x is a random variable, A_x is the set of values that x may take, and P_x is the corresponding probability that x will take each value of A_x. To illustrate this we will look at a real ensemble consisting of the letters of the English alphabet (from now on we will always assume that we are dealing with the English alphabet) and their various probabilities when a character is drawn at random from a block of standard English text. The sets of values A_x and P_x are shown below:
Ax  Px       Ax  Px       Ax     Px
a   0.0575   j   0.0006   s      0.0567
b   0.0128   k   0.0084   t      0.0706
c   0.0263   l   0.0335   u      0.0334
d   0.0285   m   0.0235   v      0.0069
e   0.0913   n   0.0596   w      0.0119
f   0.0173   o   0.0689   x      0.0073
g   0.0133   p   0.0192   y      0.0164
h   0.0313   q   0.0008   z      0.0007
i   0.0599   r   0.0508   space  0.1928

Table 3.1: Probability of each letter of the alphabet in standard English text
Tables such as this are easily available on the internet and are widely used to defeat substitution ciphers (a cipher is a code whose purpose is to hide or obfuscate the source message) such as rot13. In this cipher each letter is assigned a number (a = 1, b = 2, etc.), then each number n is mapped to n + 13 modulo 26, giving rot13(a) = n, rot13(p) = c and so on. This is obviously a very simple substitution cipher, the use of which has been condemned to the history books along with other substitution ciphers due to the negligible protection they give when faced with modern computers. This is not to say that they are useless, nor that they ever were, simply that they do not offer any meaningful protection when an adversary has even modest computational resources.
A joint ensemble XY is a pair of ensembles where the outcome is an ordered pair (x, y),
where x can be any member of Ax = (a1 , a2 , . . .) with probabilities Px = (px1 , px2 , . . .), and y
can be any member of By with probabilities Py , each defined similarly. There is no requirement
that x and y be independent. For example, given the binary values of the lowercase letters in
A.S.C.I.I. (American Standard Code for Information Interchange) we can define x to be the first four digits and y to be the last four digits of the code,
as displayed in Table 3.2.
letter  binary value  x     y        letter  binary value  x     y
a       01100001      0110  0001     o       01101111      0110  1111
b       01100010      0110  0010     p       01110000      0111  0000
c       01100011      0110  0011     q       01110001      0111  0001
d       01100100      0110  0100     r       01110010      0111  0010
e       01100101      0110  0101     s       01110011      0111  0011
f       01100110      0110  0110     t       01110100      0111  0100
g       01100111      0110  0111     u       01110101      0111  0101
h       01101000      0110  1000     v       01110110      0111  0110
i       01101001      0110  1001     w       01110111      0111  0111
j       01101010      0110  1010     x       01111000      0111  1000
k       01101011      0110  1011     y       01111001      0111  1001
l       01101100      0110  1100     z       01111010      0111  1010
m       01101101      0110  1101     space   00100000      0010  0000
n       01101110      0110  1110

Table 3.2: Binary expansion of the lower case A.S.C.I.I. letters
As Table 3.1 tells us, given the first four digits (x1 , x2 , x3 , x4 ) of the code some strings
(y1 , y2 , y3 , y4 ) (i.e. letters) will be more likely to occur than others. Regrettably we must now
review some simple rules of probability in order to fully prepare ourselves for one of Shannon’s
discoveries.
3.2 Probability Book-Keeping
Given a random variable X taking its outcome x from A_x and a random variable Y taking outcomes y from B_y, denote the probability that x takes the value a_i by P(x = a_i) ≡ P(a_i) ≡ P(x). Given a subset K ⊆ A_x, the probability that x takes a value in K is given by $P(x \in K) = \sum_{k \in K} P(x = k)$. The probability P(a_i) is called the marginal probability of x = a_i and is defined by summation over y: $P(a_i) = \sum_{y \in B_y} P(a_i, y)$.
The joint probability of (a_i, b_j) is the probability of x = a_i and y = b_j occurring together. It is used to define the conditional probability $P(a_i \mid b_j) = \frac{P(a_i, b_j)}{P(b_j)}$, but beware: if P(b_j) = 0 then this is undefined (as one would expect). We must also not forget the following rules:
$$\text{Product (Chain) Rule:} \quad P(x, y \mid F) = P(x \mid y, F) \times P(y \mid F) \qquad (3.1)$$
$$\text{Sum Rule:} \quad P(x \mid F) = \sum_y P(x, y \mid F) = \sum_y P(x \mid y, F) \times P(y \mid F) \qquad (3.2)$$
$$\text{Independence:} \quad x \text{ and } y \text{ are independent if and only if } P(x, y) = P(x) \times P(y). \qquad (3.3)$$

3.3 Information and Entropy
The Shannon information content of an outcome x = a_i is defined to be
$$h(a_i) = \log \frac{1}{P(a_i)} \qquad (3.4)$$
and is measured in bits (though this is unrelated to binary digits); unless otherwise stated, all logarithms will be to base 2. The entropy of an ensemble X is the average Shannon information content of its outcome, and is given by:
$$H(X) = \sum_{x \in A_x} P(x) \times \log \frac{1}{P(x)}, \qquad (3.5)$$
[note that P(x) = 0 ⇒ 0 × log(1/0) ≡ 0, just as we define it for the limit]. The entropy of X is a measure of the uncertainty in the value that X will take. Since 0 ≤ P(x) ≤ 1 for all values of x (by definition of probability), 1/P(x) ≥ 1, meaning that log(1/P(x)) ≥ 0, giving us our result that H(X) ≥ 0 with equality if and only if p_i = 1 for one of the i's. It should be noted that on occasion the entropy is written as H(p) where p is a vector consisting of the probabilities of the outcomes x_i, so p = (p_{x_1}, p_{x_2}, . . . , p_{x_n}).
The joint entropy of X and Y is defined by the following equation:
$$H(X, Y) = \sum_{x,y \in A_x, A_y} P(x, y) \times \log \frac{1}{P(x, y)}. \qquad (3.6)$$
As one might expect, the entropy of two independent random variables is additive, so H(X, Y) = H(X) + H(Y) if X and Y are independent. This describes the situation P(x, y) = P(x) × P(y).
Taking the logarithm of |Ax | gives us an upper bound for the entropy, so H(X) ≤ log |Ax |.
The entropy is maximised (we have equality) if Px is uniform, i.e. p1 = p2 = . . . = pn = |Ax |−1 .
The redundancy of an ensemble measures the difference between H(X) and its maximum possible value, log |A_x|:
$$\text{redundancy} = 1 - \frac{H(X)}{\log |A_x|}. \qquad (3.7)$$
All of the preceding results have referred to discrete random variables where |Ax | is finite.
However, the concepts involved generalise to continuous random variables (use the probability
density) and infinite sets (where we must be aware that H(X) may tend to infinity).
The relative entropy between two probability distributions P(X) and Q(X) (both defined over A_x) is
$$D_{KL}(P \| Q) = \sum_{x} P(x) \times \log \frac{P(x)}{Q(x)}. \qquad (3.8)$$
It is important to note that D_KL(P||Q) is not the same as D_KL(Q||P). The relative entropy is sometimes known as the Kullback-Leibler divergence. The relative entropy satisfies Gibbs' inequality, which states that D_KL(P||Q) ≥ 0.
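A small sketch of equation (3.8), with two made-up distributions chosen to show that the divergence is non-negative and asymmetric:

```python
from math import log2

def kl_divergence(P, Q):
    """Relative entropy D_KL(P||Q) = sum_x P(x) log2(P(x)/Q(x))."""
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.9, 0.1]
Q = [0.5, 0.5]
print(kl_divergence(P, Q), kl_divergence(Q, P))  # two different non-negative values
```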
At this point in the proceedings it is advisable to make a special mention of the binary entropy function H_2(f). This function describes the entropy of a random variable that may take the value 0 or 1, and does so with probabilities 1 − f and f respectively. It takes its maximum at f = 0.5, in agreement with our earlier result. Written explicitly, the binary entropy function is:
$$H_2(f) = f \times \log \frac{1}{f} + (1 - f) \times \log \frac{1}{1 - f}. \qquad (3.9)$$
The binary entropy function is useful because it allows us to understand one of Shannon's theorems, namely the Noisy Channel Coding Theorem. This states that for a binary symmetric channel with probability of bit-flip f, the maximum rate of information transfer is given by subtracting H_2(f) from 1, or symbolically:
$$C(f) = 1 - H_2(f) = 1 - \left[ f \log \frac{1}{f} + (1 - f) \log \frac{1}{1 - f} \right]. \qquad (3.10)$$
3.4 Examples
We have looked at some relatively abstract features of probabilities and ensembles, so now is the time to illustrate their meaning by looking at some numerical examples.
Taking our ensemble as the alphabet, we can see that the information content of the letter v being selected is given by
$$h(v) = \log \frac{1}{P(v)} = \log \frac{1}{0.0069} = 7.1792$$
so we get approximately 7.2 bits of information when the letter v is the outcome. Let us compare this to another letter, e; the information content of e occurring is:
$$h(e) = \log \frac{1}{P(e)} = \log \frac{1}{0.0913} = 3.4532$$
so e is worth only about 3.5 bits of information, demonstrating that less probable outcomes convey more information than more probable ones.
Denoting our alphabet ensemble by Ψ we can calculate its entropy in the usual way:
$$H(\Psi) = \sum_{\mu \in \Psi} P(\mu) \times \log \frac{1}{P(\mu)}.$$
Substituting in our values from Table 3.1, we can easily calculate the average Shannon information content (entropy) of Ψ to be 4.11.
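These numbers are easy to check; a small sketch using the probabilities of Table 3.1:

```python
from math import log2

# Letter probabilities from Table 3.1.
P = {'a': 0.0575, 'b': 0.0128, 'c': 0.0263, 'd': 0.0285, 'e': 0.0913,
     'f': 0.0173, 'g': 0.0133, 'h': 0.0313, 'i': 0.0599, 'j': 0.0006,
     'k': 0.0084, 'l': 0.0335, 'm': 0.0235, 'n': 0.0596, 'o': 0.0689,
     'p': 0.0192, 'q': 0.0008, 'r': 0.0508, 's': 0.0567, 't': 0.0706,
     'u': 0.0334, 'v': 0.0069, 'w': 0.0119, 'x': 0.0073, 'y': 0.0164,
     'z': 0.0007, ' ': 0.1928}

def h(p):
    """Shannon information content of an outcome with probability p, in bits."""
    return log2(1 / p)

print(h(P['v']))                          # about 7.18 bits
print(h(P['e']))                          # about 3.45 bits
print(sum(p * h(p) for p in P.values()))  # entropy of the ensemble, about 4.1 bits
```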
In the next chapter we shall meet another of Shannon’s groundbreaking theorems, the Source
Coding Theorem, which defines limits on our data compression algorithms.
Chapter 4
Information Coding
4.1 Introduction
In this chapter we will exploit the results we have obtained in order to reproduce some of Shannon's key findings.
4.2 Source Coding
Source coding is essentially data compression; it is also known as noiseless channel coding. The
raw bit content of an ensemble X is denoted H0 (X) and takes the value log |Ax |. The raw bit
content of X provides a lower bound for the number of yes/no questions that must be asked in
order to uniquely identify an outcome. The raw bit content of a joint ensemble XY is simply
H0 (XY ) = H0 (X) + H0 (Y ).
Lossy Compression
Lossy compression methods, such as those used by the ubiquitous MP3 (MPEG-1 Layer 3, where MPEG stands for Motion Picture Experts Group) and the increasingly popular Ogg Vorbis codecs, actually throw away information in order to achieve better compression rates. MP3 and Ogg Vorbis are most commonly used to compress audio tracks from compact discs (typically around 45 megabytes) to a manageable size (generally
around 5 megabytes). JPEG compression behaves in a similar way, acting on our pictures to
prevent them taking up valuable space on camera memory cards and computer hard disks.
Because lossy compression algorithms throw away data there is a chance that when we
compress two different files we will end up with two identical files, meaning that we cannot
uniquely identify the source. This is called a failed encoding and its probability of occurrence is denoted δ. Our goal when using a lossy algorithm is to achieve a satisfactory trade-off between the probability of an encoding failure and the level of compression. If we risk a higher probability of failure then we will undoubtedly achieve better compression, but it is down to us to decide what value of δ is acceptable for each situation.
We can implement a lossy compression algorithm to compress text by simply leaving out uncommon letters of the alphabet; this decreases the size of our alphabet, and our encoder simply deletes any letters that it is not expecting. If we remove three letters from our alphabet, say a, z and q, then we will achieve a compressed file size calculated as follows:
$$1 - P(a) - P(z) - P(q) = 1 - 0.0575 - 0.0007 - 0.0008 = 0.941.$$
Figure 4.1: (1/N) H_δ(X^N) plotted against δ for N = 10, 210, 410, 610, 810, 1010 [13].
So a text file compressed using this method will be about 94% of the original file size. This
gives us a probability of failure P (a) + P (q) + P (z) = 0.059, since any files differing only in
the number and permutations of a’s, z’s and q’s will be indistinguishable after compression.
It should be noted that this is not a terribly useful compression algorithm, but it serves to
illustrate our concepts.
If we are looking to create an algorithm that will give us a particular value of δ then what we
are really trying to do is create the smallest subset Sδ of our alphabet S such that the probability
of a particular chosen letter not lying in Sδ is less than or equal to δ. In our previous example,
we had Sδ as the alphabet excluding a, z and q. This algorithm did meet our requirement that
P(x ∉ S_δ) ≤ 0.06; however it is not optimal, as we could have taken several letters in place of a. An easy method of removing the maximum number of letters is to arrange them in order of decreasing probability and remove letters, starting with the least likely, until the probability of failure is as close as possible to (but not greater than) δ.
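That greedy procedure is straightforward to write down; a sketch, assuming the letter-probability dictionary P from the earlier entropy sketch:

```python
def smallest_subset(P, delta):
    """Keep the most probable symbols; drop the least likely ones while the
    total probability of the dropped symbols stays at most delta."""
    kept = dict(P)
    failure_prob = 0.0
    # Consider symbols in order of increasing probability.
    for symbol, prob in sorted(P.items(), key=lambda item: item[1]):
        if failure_prob + prob <= delta:
            failure_prob += prob
            del kept[symbol]
    return kept, failure_prob

# With delta = 0.06 this removes several rare letters, not just a, z and q.
S_delta, achieved = smallest_subset(P, delta=0.06)
print(sorted(set(P) - set(S_delta)), achieved)
```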
For a particular value of δ we define the essential bit content Hδ (X) of our reduced ensemble
to be log |Sδ |. As with the raw bit content, the essential bit content of X is additive. So
given n independent, identically distributed random variables (X, X, X, . . . , X) (which we shall denote by X^n) then H(X^n) = n × H(X). This result applies to H_δ(X) as well, but as n approaches infinity, H_δ(X^n) becomes less and less dependent on δ. For any value of δ, it is
approximately equal to n times the entropy of X, i.e. Hδ (X n ) ≈ n × H(X). This neatly brings
us to the Source Coding Theorem, one of Shannon’s discoveries, which tells us about limits on
our data compression.
The Source Coding Theorem: X is an ensemble with entropy H(X) = H. Given some positive ε and some δ between 0 and 1, there exists some non-zero n_0 ∈ ℕ such that for any n > n_0,
$$\left| \frac{1}{n} H_\delta(X^n) - H \right| < \epsilon. \qquad (4.1)$$
Or, to put it plainly, we can compress X into slightly more than nH(X) bits with negligible risk of failure as n → ∞. On the other hand, if we try to compress it into less than nH(X) bits, it is almost certain that our encoding will fail.
Lossless Compression
With a lossless compression algorithm, all input files are mapped to different output files.
Consequently while some files are decreased in size, others are necessarily increased in size. One
example of a common lossless algorithm is the Free Lossless Audio Codec (FLAC). Lossless
compression frequently makes use of symbol codes, which we shall now discuss.
4.3 Symbol Codes
These codes are lossless: they preserve the information content of the input precisely. As a
result of this, the pigeonhole principle (”You can’t put N pigeons into N − K boxes unless at
least one box has more than one pigeon in it.”) dictates that in order to avoid a box containing
more than one pigeon (a failed encoding) they must sometimes produce encoding strings which
are longer than the original input. Thus our goal when using symbol codes is to assign shorter
encodings to more probable input strings, thus increasing the probability that our code will
compress the input.
Recalling our ensemble X = (x, A_x, P_x) from the previous chapter, we shall now define A_x^N to be the set of ordered N-tuples drawing elements from A_x. We shall further define A_x^+ to be the set of all strings of finite length constructed by using elements of A_x. Given these two definitions, we are now able to define a symbol code. A symbol code C for an ensemble X is a map C : A_x → {0, 1}^+. An extended symbol code is defined similarly, C^+ : A_x^+ → {0, 1}^+, and is made by concatenating the corresponding codewords, e.g. C(a_i a_j a_k) = c(a_i)c(a_j)c(a_k), where c(a_i) denotes the codeword corresponding to an element a_i in A_x; the length of this codeword is written as l(a_i) or sometimes just l_i.
In order for a symbol code to be useful, it must have the following properties:
• any encoded string must have a unique decoding,
• it must be easy to decode
• the code must achieve the greatest possible amount of compression.
In order to be uniquely decodable, no distinct strings may have the same encoding. So for all
distinct x and y in A_x^+ we require that c(x) ≠ c(y).
A symbol code is easy to decode if it is possible to identify the end of a codeword as soon
as it arrives. This amounts to requiring that no codeword be a prefix of another, e.g. 101 is
a prefix of 10101 so a symbol code which included both of these as codewords would not be
easy to decode. If this condition is met then our symbol code is called a prefix code (sometimes
known as instantaneous or self-punctuating codes). These codes may be drawn as binary trees
with the end of each branch representing a codeword. If there are no unused branches then the
code is complete. In the figures below the shaded strings are codewords. The expected length
L(C, X) of a symbol code C for X is
$$L(C, X) = \sum_{x \in A_x} P(x)\, l(x) = \sum_{i=1}^{|A_x|} p_i l_i \qquad (4.2)$$
and is bounded below by H(X) if C is uniquely decodable. The expected length of a uniquely decodable symbol code is minimised (and equal to H(X)) if and only if the lengths of the codewords l_i are equal to their Shannon information content, i.e. l_i = log(1/p_i).
Figure 4.2: A prefix code (complete).
Figure 4.3: An incomplete, non-prefix code.
Figure 4.4: Symbol Code Budgets [13].
If our code consists solely of codewords with length l, then we have 2^l different codewords. We may view this as each codeword having a 'cost' of 2^{-l} out of a budget of 1. We may spend
this budget on codewords in various different ways; for example if l = 2 then we might not want C = {00, 01, 10, 11}, so we might replace 00 and 01 with the string 0, which will have a cost of 2 × 2^{-l} = 2^{-l+1}. If we go over our 'budget' of 1 then we lose unique decodability. To see this in a trivial case we may take C = {0, 1, 10}, giving a total amount spent on codewords of 2^{-1} + 2^{-1} + 2^{-2} = 1.25, and as we can see: the string 10 may be decoded as c(x_2)c(x_1) or c(x_3). This is known as Kraft's Inequality, and is formally stated as:
$$\text{if a symbol code is uniquely decodable then} \quad \sum_i 2^{-l_i} \leq 1. \qquad (4.3)$$
Conversely, if a set of codeword lengths satisfies this inequality then there exists a uniquely decodable (prefix) code with those lengths. If we have equality then the code is complete.
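A small sketch that checks Kraft's inequality and the prefix property for a couple of toy codes (the helper names are illustrative):

```python
def kraft_sum(lengths):
    """Left-hand side of Kraft's inequality for a binary code."""
    return sum(2 ** -l for l in lengths)

def is_prefix_code(codewords):
    """True if no codeword is a prefix of another."""
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

C1 = ['0', '10', '110', '111']   # complete prefix code: Kraft sum = 1
C2 = ['0', '1', '10']            # over budget: Kraft sum = 1.25, not uniquely decodable
for C in (C1, C2):
    print(C, kraft_sum(len(c) for c in C), is_prefix_code(C))
```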
We are now in a position to state the Source Coding Theorem for symbol codes. It is an existence theorem which tells us that for an ensemble X there exists a prefix code C such that
$$H(X) \leq L(C, X) \leq H(X) + 1. \qquad (4.4)$$
Whether one is able to find such a code is somewhat more problematic, and the theorem can
serve both as a source of hope and taunting despair for those who try.
Figure 4.5: Relationship between the mutual information and the various entropies of X and
Y.
4.4 Further Entropy and Information
A discrete memoryless channel is one that takes input from a discrete alphabet X and gives
an output from another discrete alphabet Y . The probability of a particular output y being
produced by an input x is given by p(y|x); these probabilities are defined (but not necessarily non-zero) for all x ∈ X and y ∈ Y. The channel is memoryless if this probability distribution is independent of all previous input/output pairs; in other words, using the channel does not
change its properties.
Given a particular value of y, say y = y_i, the conditional entropy of an ensemble X is the entropy of the probability distribution P(x|y_i). It is defined in a similar way to the entropy of X:
$$H(X \mid y_i) = \sum_{x \in X} P(x \mid y = y_i) \times \log \frac{1}{P(x \mid y_i)}. \qquad (4.5)$$
The conditional entropy of the two ensembles X and Y is the conditional entropy of X when
averaged over all the values of y:
$$H(X \mid Y) = \sum_{y \in Y} P(y) \times H(X \mid y) = \sum_{x,y \in X,Y} P(x, y) \times \log \frac{1}{P(x \mid y)}. \qquad (4.6)$$
It is the average uncertainty about x after we have learned y.
We define the chain rule for entropy as follows:
$$H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y). \qquad (4.7)$$
Verbosely, the entropy of two ensembles X and Y is equal to the entropy of Y given X, added
to the entropy of X. It is certainly hard to see how it could be any other way.
The mutual information between X and Y is defined by the following:
$$\text{mutual information} = I(X : Y) = H(X) - H(X \mid Y). \qquad (4.8)$$
The mutual information of two ensembles is symmetric and has a minimum value of zero, which
corresponds to the case where nothing about X may be inferred from a knowledge of Y . The
mutual information is therefore the reduction in our uncertainty about X as a result of learning
Y.
Our noisy channel consists of an input alphabet X, an output alphabet Y and a set of
probabilities P (x, y) = P (y|x) × P (x). It can therefore be thought of as a joint ensemble XY .
Figure 4.6: Mutual information between X and Y over a binary symmetric channel with f = 0.15 [13].
Defining it so has the major advantage that we can now apply Bayes’ theorem to the problem.
Bayes’ theorem simply tells us the following:
$$P(x \mid y) = \frac{P(y \mid x)P(x)}{P(y)} = \frac{P(y \mid x)P(x)}{\sum_z P(y \mid z)P(z)}. \qquad (4.9)$$
This is simple if we are sending information bitwise across the binary symmetric channel, but
it also applies if X is an ensemble of codewords.
Suppose we are dealing with a binary symmetric channel with a probability of bit error f = 0.1 and X = {1001, 0110, 0000}, with each codeword being equally likely (we have no prior information about what codewords will be sent). Y is the set of all 4-digit binary strings. Then, if we receive y = 0100, we can use Bayes' theorem to work out the most likely source word.
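Carrying that calculation out explicitly, a sketch with a uniform prior over the three codewords:

```python
def likelihood(y, x, f):
    """P(y|x) over a binary symmetric channel: each differing bit costs a factor f."""
    p = 1.0
    for yi, xi in zip(y, x):
        p *= f if yi != xi else 1 - f
    return p

def posterior(y, codewords, f):
    """P(x|y) by Bayes' theorem with a uniform prior over the codewords."""
    joint = {x: likelihood(y, x, f) / len(codewords) for x in codewords}
    total = sum(joint.values())
    return {x: p / total for x, p in joint.items()}

print(posterior('0100', ['1001', '0110', '0000'], f=0.1))
```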
The information conveyed by a noisy channel is the mutual information between the input
and output alphabets, so we would like to find a way to maximise this. The capacity of a channel is defined to be the maximum value that I(X : Y) can take over all probability distributions. By symmetry we can see that for the binary symmetric channel this corresponds to P(x = 0) = P(x = 1) = 1/2, since there is no preference for 0s or 1s in our channel.
Having illustrated all of these ideas we can now move on to one of Shannon's main achievements: a description of how efficiently we may communicate over noisy channels.
4.5 The Noisy Channel Coding Theorem
The noisy channel coding theorem is in three parts, which we shall observe in turn. The first part states (quoting from MacKay [13]):
For every discrete memoryless channel, the channel capacity
$$C = \max_{P_X} I(X : Y)$$
has the following property. For any ε > 0 and R < C, for large enough n, there exists a code of length n and rate ≥ R and a decoding algorithm, such that the probability of block error is < ε.
The maximum over P_X is taken over all probability distributions on X. Ultimately the theorem says that as long as we try to communicate at a rate smaller than the channel capacity then there will
be a code of block length n (for some n, possibly large) that will allow us to communicate with
an arbitrarily small probability of error.
If we are willing to accept a particular probability of error (that is not arbitrarily small) then
it turns out that we are able to communicate at a rate higher than the channel capacity. The
process by which we do this bears a large resemblance to the lossy data compression algorithm
we looked at in the previous chapter.
In order to achieve the higher rate using an (n, k) code, Arthur takes his source message and
splits it up into blocks of length n. He then passes these blocks through the decoder in order
to obtain corresponding blocks of length k, which he then sends over the noisy channel. Upon
reception of the length k blocks Belinda passes them through the encoder in order to retrieve
length n blocks that, she hopes, are the same as the original length n blocks that Arthur wanted
to send.
By using this method, if a probability of error equal to f is deemed acceptable, then communication is possible up to a rate R(f ), given by:
$$R(f) = \frac{C}{1 - H_2(f)}, \qquad (4.10)$$
where C is the channel capacity and H_2(f) is the binary entropy function evaluated at f.
The process of selecting a source word, encoding it, its corruption by noise and its subsequent decoding defines a chain of probabilities known as a Markov chain. Our source word s is encoded to x, which is corrupted to become y, which is subsequently decoded to ŝ; the chain of probabilities is:
$$P(s, x, y, \hat{s}) = P(s) \times P(x \mid s) \times P(y \mid x) \times P(\hat{s} \mid y). \qquad (4.11)$$
The data processing inequality, which states that processing data necessarily discards information, applies to this chain, telling us that I(s : ŝ) ≤ I(x : y). The definition of channel capacity tells us that I(x : y) ≤ nC, and therefore I(s : ŝ) ≤ nC. If the system achieves a rate R with a bit error probability f, then I(s : ŝ) ≥ Rn(1 − H_2(f)). However, since I(s : ŝ) > nC is not achievable, neither is R > C/(1 − H_2(f)). The maximum rate at which one can reliably communicate over a noisy channel with a probability of bit error p_b = p is known as the Shannon limit.
Possibly the most remarkable consequence of this is that it tells us we can select an arbitrarily
small probability of bit error, and yet still have a non-zero rate of communication!
Example
If we have a channel with probability of bit error p = 0.05 then our channel capacity will be given by:
$$C(0.05) = 1 - H_2(0.05) = 1 - \left[ 0.05 \times \log \frac{1}{0.05} + 0.95 \times \log \frac{1}{0.95} \right] = 1 - [0.2161 + 0.0703] = 1 - 0.2864 = 0.7136.$$
So a probability of bit error equal to 0.05 leads to a channel capacity of about 0.71: each use of the channel can convey at most about 0.71 bits of information.
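The same arithmetic in a short sketch of equations (3.9) and (3.10):

```python
from math import log2

def H2(f):
    """Binary entropy function, equation (3.9)."""
    if f in (0.0, 1.0):
        return 0.0
    return f * log2(1 / f) + (1 - f) * log2(1 / (1 - f))

def capacity(f):
    """Capacity of a binary symmetric channel, equation (3.10)."""
    return 1 - H2(f)

print(capacity(0.05))   # about 0.7136 bits per channel use
```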
Part II
Machine Learning and Neural Networks
Chapter 5
Introduction to Neural Networks
5.1 Neurons
Before we look into neural networks we must first explain what we mean by the word ‘neuron’.
It should be noted that we will not be looking at biological neurons, but artificial ‘idealised’
neurons. We need to simplify because brains have several hundred species of neurons [3], which would make the analysis and derivations truly horrendous. These will obviously draw
some inspiration from the biological neurons, but are simplified so as to bring to the fore the
features most important to us.
An artificial neuron (simply called a neuron from here onwards) consists of one or more inputs (also known as synapses, a term brought over from the biologists), labelled x_i, and one output, y. Each input is assigned a 'synaptic strength/weight' denoted w_i; this indicates the level of influence that an input x_i has on the overall activation. This synaptic weight can be positive or negative depending on whether we want x_i to increase or inhibit the activation of the neuron. The neuron now computes the sum $a = \sum_i w_i x_i$ to find its activation, a. This activation, a, is then used as the argument to a function f, somewhat unimaginatively called the activation function, which determines the output, y. A neuron with three inputs is shown in figure (5.1). Two possible activation functions are shown in figures (5.2) and (5.3).
In an extreme example, consider a neuron with eleven binary inputs, ten ordinary inputs
with synaptic weight wi = 1, and the eleventh with synaptic weight w11 = −20. The neuron
could be considered a vote counter for ten people - if the weighted inputs total more than 5
the neuron fires and the motion passes. Each voter may either use their input neuron to vote
'yay' (x_i = 1) or 'nay' (x_i = 0). Alternatively, a naysayer may use their vote to veto the entire decision (setting x_11 = 1), forcing the activation to be reduced by twenty so that no matter how many people say 'yay', the decision cannot go ahead.
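The vote-counting neuron just described can be written down directly; a sketch using a simple threshold activation (an illustrative choice):

```python
def neuron(inputs, weights, activation):
    """Weighted sum of the inputs followed by an activation function."""
    a = sum(w * x for w, x in zip(weights, inputs))
    return activation(a)

def step(a, threshold=5):
    """Fire (output 1) only if the activation exceeds the threshold."""
    return 1 if a > threshold else 0

weights = [1] * 10 + [-20]          # ten ordinary votes and one veto input
votes = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

print(neuron(votes + [0], weights, step))   # 7 'yay' votes: motion passes
print(neuron(votes + [1], weights, step))   # same votes but vetoed: motion fails
```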
5.2 Neural Networks
Neural networks are, unsurprisingly, a system of interconnected neurons. At their most fundamental level our brains are nothing more than vast, immensely complex neural networks [refer
to Brunak and Lautrup for different species of neurons in the brain]. They provide us with an
interesting counterpart to traditional computational methods, as the following paragraphs will
explain.
Figure 5.1: A neuron with three inputs.
Figure 5.2: Hyperbolic tangent activation function.
Figure 5.3: Piecewise activation function.
Computer memory is address-based memory: to retrieve a memory you must be able to recall its address; if you cannot recall the address you are unable to retrieve the memory (we shall not concern ourselves with the storage or recall of the address itself!). It
is also not associative, for example if you retrieve an image of your wife’s face, you will not be
able to recall her name unless you know the address of where her name is stored. As a result,
it is best not to rely on a computer to remember your wife’s name.
As many people who have had a system failure will attest, computer memory is not robust/fault tolerant. This means that a small fault with the R.A.M. in your computer can (and
usually does) lead to catastrophic results. Finally, memories are not distributed around the entire computer but stored entirely within the R.A.M. chips. Whilst this makes upgrading much
less hassle it also means that when retrieving data from R.A.M. the majority of the computer
is sitting idle as it waits for the data: only the C.P.U. and a few circuits are actually doing
anything during this process.
Biological (neural network) memory on the other hand, is content addressable - seeing your
wife’s face will (barring any psychological difficulties) bring her name to mind. It is also robust,
surviving our best attempts to stop it working. This point is illustrated beautifully by the
following quote, taken from MacKay [13]:
Our brains are noisy lumps of meat that are in a continual state of change, with
cells being damaged by natural processes, alcohol and boxing.
Unlike computer memory, it is distributed - all the processing and storage is distributed
throughout the entire network. It is impossible to isolate a one area where information is stored
and one where it is processed since all neurons take part in these tasks. In a network as complex
as our brain it is possible to observe some specialisation in certain regions, but within a region
each neuron is used to store parts of multiple memories.
Given their fundamentally different nature to standard computers, neural networks are
‘programmed’ differently. We refer to this programming as ‘training’ the network, and there are
many different ways in which it can be done. When training a network there are three things
that we must specify:
1. Architecture: this simply describes the network and its topology. It should include
things like the number of neurons, the number of inputs they have, their initial synaptic
weights and other fundamental features. This can often be achieved with a ‘simple’
diagram.
2. Activity Rule: this describes how the activities of the neurons change in response to
each other. This is typically just a list of the activation sum and activation function for
each neuron.
3. Learning Rule: this describes how the weights in the network change with time; it acts
over a longer time period than the activity rule.
We will be looking at variations in the learning rule rather than architecture or activity
rule. Our architecture will consist of several fully connected layers of neurons (every neuron in
layer n is connected to every neuron in layer n + 1). Such networks are called Hopfield networks
after John Hopfield; an example is shown in figure (8.1). There is no requirement for a neural network to be like this: we can equally well have a network in which each neuron can affect its own activation. These are called feedback networks and an example is shown in figure (5.5). Unless otherwise stated we will treat the activity rule as an abstract function so as to preserve
the generality of the results we derive.
We will be looking at the various different ways of training neural networks along with their
respective advantages and shortcomings. We will then look at the different tasks to which neural
networks can be put and how the concepts of information theory will help us. We will conclude the section by studying neural network memory, its various properties and how it can be effectively utilised. Throughout this investigation we will restrict ourselves to neural networks operating in discrete time. Whilst many of the results we will obtain also apply to continuous-time (spiking) neural networks, their derivation is more complex and no more informative for being so.
Figure 5.4: A simple feedforward network.
Figure 5.5: A simple feedback network.
Chapter 6
Learning Types
6.1 Supervised Learning
If we subject our unsuspecting network to supervised learning, it means that we have an external
teacher who has some knowledge that he wishes to pass on to the network. We assume that
he has been diligent enough to prepare a set of examples which accurately characterise his
knowledge; we further assume that the neural network has no 'knowledge' other than that which
the teacher will force onto it. The examples consist of sets of input data with corresponding
desired responses/outputs from the network. We use the following algorithm to teach n examples
to the network:
1. set i = 1
2. subject the network to the input of example i
3. compare the output of the network, y, to the desired response, d, to create an error vector e.
4. use the error vector to adjust the parameters of the network so that the network's response is closer to the desired response.
5. if i = n: STOP, else go to step 6
6. set i = i + 1
7. go to step 2.
The algorithm is repeated until the network is deemed to have learned as much as possible
from the teacher. When this occurs we remove the teacher and let the network operate freely.
As we shall see in the following chapter, this is an example of an error correcting feedback loop.
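The algorithm above maps naturally onto code; a minimal sketch for a single linear neuron, using a simple error-correction weight update (the learning rate and update rule are illustrative choices, anticipating the next chapter):

```python
def train(examples, weights, rate=0.1, passes=20):
    """Supervised learning for one linear neuron: for each (input, desired) example,
    compare the output with the desired response and nudge the weights by the error."""
    for _ in range(passes):
        for x, d in examples:
            y = sum(w * xi for w, xi in zip(weights, x))      # network output
            e = d - y                                          # error (scalar here)
            weights = [w + rate * e * xi for w, xi in zip(weights, x)]
    return weights

# Teach the neuron a simple 2-input mapping; the weights approach [1.0, 0.0].
examples = [([1, 0], 1), ([0, 1], 0), ([1, 1], 1)]
print(train(examples, weights=[0.0, 0.0]))
```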
There are two ways to perform supervised learning, online and offline. In online learning
the supervision takes place within the network itself. The learning takes place in real time
and once the training examples have been worked through the network operates dynamically,
continuing to learn from all input vectors submitted to it.
In offline learning the supervision is carried out at some kind of remote facility. If the
network is a software program then the supervision could be a separate program on the same
computer, or it could be a program running on a separate computer. Once the training is
complete the supervision facility is disconnected and the network parameters are fixed. The
network runs statically after its training.
The most frequently used method of updating the synaptic weights is the backpropagation
algorithm. This algorithm has two distinct phases of computation: forward, where the activations
and outputs are calculated from the input to the output, and backward, where the synaptic
weight adjustments are calculated from the output neurons to the inputs. To calculate the
change in one neuron’s synaptic weights one must analyse every neuron that can connect to it,
which can lead to scaling issues.
6.2 Reinforcement Learning
One of the main problems with supervised learning is that without the teacher present the
network cannot learn new ways of interpreting the data in the example set. One possible way
of overcoming this problem is to use reinforcement learning. This is when a network learns an
input/output map by trial and error in an online manner. The trial and error process attempts
to maximise what is known as the reinforcement index (a scalar measure of how well the network
is performing).
There are two subtypes of reinforcement learning: associative and nonassociative. In nonassociative reinforcement learning reinforcement is the only stimulus that the network receives.
The network simply repeats the action which gives it the greatest reinforcement. In associative
reinforcement learning the network is provided with more stimuli than just reinforcement. The
network is required to learn a map between the various stimuli and their corresponding optimal
action. Formally, we declare the following:
• The network is interacting with an environment in discrete time.
• The environment has a finite set of states that it can be in, denoted X.
• The environment’s state at time n is given by x(n) (where x(n) ∈ X). The initial state is
x(0).
• The network has a finite number of actions to choose from, the set of which is denoted A.
This set may depend on x(n), so the network’s choice of actions may be restricted.
• At time n the network receives x(n) as its input and performs action a(n).
• The action taken affects the environment and moves it from state x(n) to state y. The
new state is determined entirely by a(n) and x(n) - it is independent of all previous actions
and states.
• Pxy (a) denotes the probability that the system will be moved into state y by the action
a.
• At time n + 1 the network receives a reinforcement signal r(n + 1) that is determined by x(n)
and a(n).
The so-called evaluation function provides a natural measure of the networks performance,
it is defined as follows:
J(x) = E[ Σ_{k=0}^{∞} γ^k r(k + 1) | x(0) = x0 ].    (6.1)
The summation term is called the cumulative discounted reinforcement. γ is the discount rate
parameter, it lies in the range 0 ≤ γ < 1 and adjusts the extent to which reinforcement signals
from longer ago are discounted. If γ = 0 then only immediate reinforcement is taken into account,
and as γ → 1 the network takes longer term consequences into account. If we had γ = 1 then
infinite reinforcement would be possible, hence we restrict γ to be strictly less than one. The
expectation of the cumulative discounted reinforcement is taken with respect to the method the
network is using to select actions.
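To make the discounting concrete, here is a small, hypothetical Python sketch that evaluates the cumulative discounted reinforcement for a finite recording of reinforcement signals; truncating the sum at the length of the recording is an assumption made purely for illustration.

def discounted_reinforcement(rewards, gamma):
    """Sum_{k} gamma^k * r(k+1) over a finite list of reinforcement signals."""
    assert 0 <= gamma < 1, "gamma = 1 would allow infinite reinforcement"
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# J(x0) would then be the average of this quantity over many runs started in x0
print(discounted_reinforcement([1, 0, 1, 1], gamma=0.9))   # 1 + 0.81 + 0.729 = 2.539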
The basic aim behind reinforcement learning is to learn J(x) so that the cumulative discounted reinforcement may be predicted in the future. There is another way to implement
reinforcement learning, called the Adaptive Heuristic Critic method. Essentially it uses an
external ‘critic’ to refine the reinforcement signal into a higher quality ‘heuristic’ reinforcement
signal.
In supervised learning the performance of the network was judged by measuring its responses
to a set of examples and using a known error function. By contrast, with reinforcement learning
the performance of the network is judged on the basis of any measure that we choose to provide
it with. In supervised learning the teacher is able to immediately tell the network what to
do in order to improve its performance. This allows us to use any and all knowledge that
we currently have in order to guide the network to an optimal response. With reinforcement
learning the system acts by trial and error - it must try various actions before it can judge which
is optimal. The trial and error nature of the learning, coupled with a delayed reward means
that the operation is slowed down. We should not discount reinforcement learning on this basis,
since it is still a very important tool, especially when combined with supervised learning (this
applies doubly when the neural networks are brains and we are thinking about how humans
learn). With these differences noted, it is time to move on to the third and final type of learning:
unsupervised learning.
6.3
Unsupervised Learning
As one might expect, unsupervised learning takes place without a teacher of any kind. This
is most often used when creating an associative memory: the inputs are repeatedly presented
to the network, whose free parameters gradually become tuned to the input data. One of the
major advantages of unsupervised learning is that the network develops the ability to form its
own internal representations of the data, rather than having these imposed on it (as happens
in supervised learning). This enables it to form new classes as and when it needs to.
Another advantage becomes apparent when we consider how changes to the weights of the
system are calculated. Backpropagation is a good algorithm, but it doesn’t scale very well.
If we have l layers, and the average number of inputs to a neuron in each layer is min then
changing a synaptic weight in the first layer affects (min )l other neurons. So the time taken to
train the network grows exponentially with the number of layers due to the increasing number
of effects that must be calculated.
6.4
Goals for Learning
When training a neural network we usually want it to perform one of the following tasks. We
can put them to other uses, but these are the main ones.
Prediction: This is a ‘temporal signal processing problem’: given a set of past examples of
the behaviour of a system, we require the network to predict its next action/state. This is
similar to error correction as we can use the next state of the system to train the network
to (hopefully) better predict the system in future. One popular idea is trying to predict
the stock market.
Approximation: If we have a function y = f (x) (f unknown) and a set of paired values
(xi , yi ) (where yi = f (xi )) then given a large enough example set we may use the neural
network to predict values of yi not given by our examples. This is obviously a candidate
for supervised learning as we need to teach the network our initial data. When we are
training a network for this approximation we must be careful not to over-train the network.
Doing so would lead to a ‘join the dots’ approximation (overfitting to the data) rather
than an accurate representation of the function.
Auto-association: We want the neural network to store a set of patterns, we achieve this
by repeatedly presenting them to the network. The network is then presented with either
a partial or a noisy pattern and is required to recall/retrieve the corresponding pattern.
Autoassociative memories are perfect candidates for unsupervised learning.
Pattern Classification: This takes place with supervised learning, we want the network to
sort its input into a finite number of classes. The network is presented with a large series
of examples and taught what classes they each belong to (e.g. person, rock or fruit). The
network is then presented with a previously unseen input and is required to classify it
correctly.
Control: Neural networks are also very good at maintaining a controlled state in a system
or process due to their ability to learn via several different methods. This should not be
terribly surprising given that our brains are, at their most fundamental level, large neural
networks that control our actions and learn from a multitude of different stimuli.
Now that we have discussed the various types of learning, their advantages, pitfalls and
what goals they are particularly well matched to, it is time to delve deeper and learn about the
details of several different algorithms by which neural networks may learn.
Chapter 7
Learning Algorithms
7.1
Error Correction Learning
Error correction learning is the main form that supervised learning takes. The teacher wishes to
train the network to respond correctly to a number of inputs and does so by trying to minimise
the difference between the network’s response and the desired response. The most common way
to do this is by using the backpropagation algorithm, which we shall explain here.
The algorithm is used to alter the weights of the network in a systematic manner. We build
on the definition of the error vector (e = d − y) by defining the total ‘error energy’:
E(n) = (1/2) Σ_{j=1}^{M} ej^2(n)    (7.1)
(where M is the number of output neurons) which gives us a measure of how well the network
is performing for a particular training example. The average squared error energy is given by
Eav = (1/N) Σ_{n=1}^{N} E(n),    (7.2)
and gives us a measure of the performance of the network which takes into account all of the
training examples.
In order to simplify the notation we shall assume that N = 1 and so E = Eav . In training
the network, our goal is to minimise E, which is a function of all the parameters of the network
(the synaptic weights and biases). The activation ak and output yk of neuron k are given by:
ak = Σ_{i=0}^{m} wki yi ,    (7.3)
and
yk = fk (ak ).    (7.4)
The change to weight wki is denoted ∆wki and is proportional to ∂E/∂wki . As a result, those
weights which affect E the most will be adjusted more, whilst those which have very little effect
will be adjusted by a correspondingly small amount. We will now derive the formula for the
backpropagation algorithm.
By the chain rule, we can write the following:
∂E/∂wki = (∂E/∂ek)(∂ek/∂yk)(∂yk/∂ak)(∂ak/∂wki).    (7.5)
Differentiating equation (7.1) gives us
∂E/∂ek = ek .    (7.6)
From our equation for the error we simply get
∂ek/∂yk = −1.    (7.7)
If we partially differentiate yk with respect to ak then we see that:
∂yk/∂ak = dfk/dak .    (7.8)
Finally, we differentiate our expression for the activation of neuron k, ak , to find:
∂ak/∂wki = yi .    (7.9)
Substituting equations (7.6), (7.7), (7.8) and (7.9) into (7.5) we get the following:
∂E/∂wki = −ek (dfk/dak) yi .    (7.10)
By using (7.6), (7.7) and (7.8) we may define
δk = −∂E/∂ak = ek f′(ak),    (7.11)
then we can use the following expression for ∆wki :
∆wki = −η ∂E/∂wki = η δk yi ,    (7.12)
where η is the learning rate parameter, which must be chosen with care. If it is too small then
the rate of convergence will be so slow as to be useless, but if it is too high then we might get
no convergence at all.
If neuron k is an output neuron then all is well, but if it is hidden (i.e. not an output neuron)
then we do not have ek , so we expand δk as
δk = −∂E/∂ak = −(∂E/∂yk)(∂yk/∂ak)    (7.13)
by using the chain rule. As a result we now need to find ∂E/∂yk .
From (7.1) it follows that
∂E/∂yk = Σ_j ej (∂ej/∂yk) = Σ_j ej (∂ej/∂aj)(∂aj/∂yk).    (7.14)
Note that ej = dj − fj (aj ), giving us ∂ej/∂aj = −fj′ . Also, aj = Σ_k wjk yk , giving us ∂aj/∂yk = wjk . This implies that
∂E/∂yk = − Σ_j ej fj′ wjk = − Σ_j δj wjk ,    (7.15)
which we substitute back into equation (7.13) to find:
δk = fk′ Σ_j δj wjk .    (7.16)
Figure 7.1: The initial setup of our neural network.
The δj in this formula is found from equation (7.11) for the hidden neurons immediately preceding the output layer, and from (7.16) for every layer thereafter.
The error is a function of the synaptic weights; this allows us to define a multi-dimensional
error performance surface on which we hope to find a minimum value. If the network has units
with linear activation functions then the surface is quadratic and has a unique minimum
which the algorithm above will always allow us to reach. If we have nonlinear units, then the
surface will have a global minimum but will also have local ones which the algorithm might get
‘trapped’ in.
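The derivation above can be condensed into a short numerical sketch. The following is an assumed, illustrative implementation (not code from this project) for a single hidden layer of tanh units, applying equations (7.11), (7.12) and (7.16) directly; biases are omitted to keep it short.

import math

def backprop_step(x, d, W1, W2, eta=0.1):
    # forward pass: ak = sum_i wki yi, yk = f(ak), with f = tanh
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [math.tanh(sum(w * hi for w, hi in zip(row, h))) for row in W2]
    # output deltas, equation (7.11): delta_k = e_k f'(a_k), where f'(a) = 1 - tanh(a)^2
    delta_out = [(dk - yk) * (1 - yk * yk) for dk, yk in zip(d, y)]
    # hidden deltas, equation (7.16): delta_k = f'(a_k) * sum_j delta_j w_jk
    delta_hid = [(1 - hk * hk) * sum(dj * W2[j][k] for j, dj in enumerate(delta_out))
                 for k, hk in enumerate(h)]
    # weight updates, equation (7.12): Delta w_ki = eta * delta_k * y_i
    for j, dj in enumerate(delta_out):
        for k in range(len(h)):
            W2[j][k] += eta * dj * h[k]
    for k, dk in enumerate(delta_hid):
        for i in range(len(x)):
            W1[k][i] += eta * dk * x[i]
    return y

# purely illustrative usage: 3 inputs, 2 hidden units, 1 output
W1 = [[0.1, -0.2, 0.05], [0.3, 0.1, -0.1]]
W2 = [[0.2, -0.3]]
for _ in range(100):
    backprop_step([1.0, 0.0, 1.0], [0.5], W1, W2)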
7.2
Hebbian Learning
Perhaps the best way to describe Hebbian learning is to begin with a quotation from Hebb
himself (taken from Simon Haykin[9]):
When an axon1 of cell A is near enough to excite a cell B and repeatedly or
persistently takes part in firing it, some growth process or metabolic changes take
place in one or both cells such that A’s efficiency as one of the cells firing B, is
increased.
This definition has since been expanded to say the following:
1. if two neurons on either side of a synapse fire together then the strength of that synapse
will be increased.
2. if on the other hand the two neurons fire asynchronously, then the strength of the synapse
will be decreased.
Any synapse which obeys the above rule is called a Hebbian synapse. When we are using a
Hebbian learning technique the synaptic modifications can be one of several different types.
1 output
They can be Hebbian, in which case the synaptic strength increases when the firing of the
neurons on either side of it is positively correlated. They can be anti-Hebbian, where the
synaptic strength increases when the firing is negatively correlated and decreases when the
correlation of firing patterns is positive. Alternatively
the modifications may be non-Hebbian, where the modification of synaptic strengths does not
depend on the correlation of the two neurons’ firing. One example of this would be a slow decay
in synaptic strength.
Hebb’s postulate may be expressed as ∆wki = F (yk , xi ), the change in weight wki is a
function of the presynaptic and postsynaptic activities xi and yk . We are using xi instead of yi
so that the adjustment to wki is normalised by the current synaptic weight - if wki is small then
i and k firing together should only adjust the weight by a small amount. We will look at the
special case where F (yk , xi ) = ηyk xi . This simple function does have one disadvantage though
- the synaptic weight cannot decrease, as a result of this repeated correlated firing of neurons k
and i will saturate the synaptic weight. Therefore we add a forgetting factor to slowly decrease
the weight over time. This gives us a new formula:
∆wki = η yk xi − α yk wki ,    (7.17)
or alternatively
∆wki = α yk [c xi − wki ]   where   c = η/α .    (7.18)
The new synaptic weight of neuron k is given by taking the sum of the old weight and the
Hebbian adjustment, like so:
wk (new) = wk + ∆wk .    (7.19)
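A minimal sketch of this update rule (an illustration only, with made-up activity values) might look as follows in Python.

def hebbian_update(w, x, y, eta=0.2, alpha=0.01):
    """Equation (7.17): Delta w_ki = eta*y_k*x_i - alpha*y_k*w_ki, for one neuron k.
    w: current weight vector, x: presynaptic activities, y: postsynaptic activity."""
    return [w_i + eta * y * x_i - alpha * y * w_i for w_i, x_i in zip(w, x)]

# purely illustrative call
print(hebbian_update([0.10, 0.16, 0.20], x=[0.5, 0.9, 0.4], y=0.99))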
Example
Given our starting neural network from figure (7.1), we wish to calculate the weight changes
using a Hebbian learning rule with a learning rate parameter η = 0.2 and a forgetting factor
α = 0.01. We will be using the same activation function as in the case of supervised learning,
the only difference is how we will calculate the weight changes. We will only compute one set
of weight changes as the same principle and method apply to all layers of neurons and there is
little benefit to repeating the calculations.
We can see here that the weight changes for neuron 4 are as follows:
∆w41 = ηy4 x1 − αy4 w41 = (0.2 × 0.99 × [0.1 × 0.54]) − (0.01 × 0.99 × 0.1) = 0.010
∆w42 = (0.2 × 0.99 × [0.16 × 0.89]) − (0.01 × 0.99 × 0.16) = 0.027
∆w43 = (0.2 × 0.99 × [0.2 × 0.46]) − (0.01 × 0.99 × 0.2) = 0.016.
Working through the same process for neuron 5 gives us the following weight change, expressed
as a vector:


∆w5 = (0.008, 0.033, 0.023)t .
So the new synaptic weight vectors for neurons 4 and 5 are:
w4 = (0.10, 0.16, 0.20)t + (0.010, 0.027, 0.016)t = (0.110, 0.187, 0.216)t ,
w5 = (0.09, 0.21, 0.30)t + (0.008, 0.033, 0.023)t = (0.098, 0.243, 0.323)t .
7.3
Competitive Learning
In competitive learning we have only one output neuron firing at any one time, all of the
output neurons are competing for this ‘privilege’. The firing neuron is called the winner takes
all neuron. We might also wish to have 3 output groups and have one neuron fire per group,
but to keep things slightly simpler we shall assume that we have only one group and therefore
only want one neuron firing at any one time. Neural networks of this type are very good for
classifying input as we can simply allocate each output neuron to a different input class and we
will get an unambiguous decision from the network about which class the input belongs to. We
must however take care not to simply assume that the network knows the answer - it can only
make useful contributions if its training is good. Even then we must exercise caution against
blind faith in the network’s response[7].
There are three main requirements for a competitive learning rule,
1. we must have identical neurons (same connections and activation functions) but with
randomly distributed synaptic weights.
2. The strength of each neuron must be finite and uniform across all neurons (i.e. each
neuron must have the same amount of synaptic weight to distribute among its inputs).
3. There must be some mechanism in place to allow competition between the output neurons
or output groups (to ensure that only one neuron fires).
We need the first so that the network will begin its training by responding differently to
different inputs. If it reacted identically to all input patterns then we would be unable to train
it. The second is needed so that no single neuron dominates the competition and wins the
competition for all input patterns. The third condition is to ensure that only one neuron fires
per input. After all, it is not very useful if we ask the network whether an input x belongs to
A, B or C (all mutually exclusive) and the network responds with A and B.
We can implement requirement 2 by requiring that the weights of each neuron all add up
to 1, so Σ_j wkj = 1 for all k. We can implement condition 3 by having intra-neuronal synapses
with negative weights. Thus one neuron firing will inhibit all the others. The winner-takes-all
neuron must have the highest activation of all the output neurons. When it fires its output
is set equal to 1 and all the other output neurons are set to 0. If neuron k wins, then all of
its synaptic weights are decreased slightly and the removed weight is redistributed among the
active synapses. Note that if a neuron has no active inputs then it will not learn.
The weight changes are calculated as follows:
∆wkj = η(xj − wkj )   if neuron k wins,
∆wkj = 0              otherwise.    (7.20)
As we can see from this equation, the weight vector of neuron k (wk ) will converge to the input
pattern x (assuming we have an appropriate learning rate parameter η). Once wk is deemed
sufficiently close to x the network is considered to be trained.
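A hypothetical sketch of one training step under equation (7.20); the winner is taken here to be the neuron whose weight vector has the largest dot product with the input.

def competitive_step(W, x, eta=0.1):
    """W: list of weight vectors (one per output neuron), x: input vector."""
    # competition: the neuron with the highest activation wins
    activations = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in W]
    winner = max(range(len(W)), key=lambda k: activations[k])
    # equation (7.20): only the winner's weights move, towards the input pattern
    W[winner] = [w_i + eta * (x_i - w_i) for w_i, x_i in zip(W[winner], x)]
    return winner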
7.4
Self Organising Maps
Self organising maps were introduced by Teuvo Kohonen and hence are sometimes called Kohonen maps.
They take an input pattern (of arbitrary dimension) and convert it to a discrete one or two
dimensional map of interconnected neurons2 .
The weight vector of neuron k is denoted wk = (wk1 , wk2 , . . . , wkm ), where there are m
inputs. The input signal is denoted x. When we are using a self organising map the input
pattern usually has one or two large components, with the rest being small. All neurons are
exposed to every input component so as to allow the map to mature effectively. There are three
distinct phases to the algorithm: competition, cooperation and synaptic adjustment.
In the competitive phase the neurons all compute their output, the one with the highest
output is deemed the winner. If all neurons possess the same activation function then this
process amounts to finding the best match of x with the neuron’s weight vectors, i.e. maximising
wk · x across all k (for normalised weight vectors this is equivalent to minimising ||x − wk ||).
Once the winning neuron has been found, the algorithm enters the cooperative phase. This
is where lateral connections between the neurons come into play - the firing of the winning
neuron gives some excitation to nearby neurons. We will assume that neuron k is the winner.
Those neurons near to the winner will get more excitation from it firing than those further
away. We define dik to be the distance between neurons k and i (dkk = 0). We now define hki ,
which describes the excitation received by neuron i from neuron k as a function of the distance
between them. We have two requirements of hki : we require it to be unimodal in dik , taking its
maximum value at dkk . We further require that h tends to 0 as the distance tends to infinity.
With a one dimensional output, we can easily define dik as being ||i − k||. It is relatively
simple to define a higher dimensional distance, we simply take dik = ||ri −rk ||, measuring r from
an arbitrary point. In order to help the network converge, we would like h to decrease with
time, so that successive examples/presentations will affect fewer neurons and allow the network
to become finely tuned to the inputs.
Finally, the network enters the synaptic adaptation phase where we use a Hebbian learning
rule with a forgetting factor g(yk ). For example, ∆wk = ηyk x − g(yk )wk . If we set g(yk ) = yk
and yk = hki (x) then we get a weight adjustment of
∆wk = ηhki (x)(x − wk ).
(7.21)
In order to encourage convergence of the network, we would like η to decrease with time, to
allow the network to become finely tuned to the inputs.
There are two distinct phases of training when we are using a self organising map. In the
first phase we allow hkj to include all the neurons and decrease slowly, η should also decrease
gradually in this phase. This is the ‘rough’ training of the network. The second phase is where
we seek convergence/fine tuning, this is where we hold η constant and h should include only
those neurons closest to the winner. The second phase generally lasts much longer than the
first.
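The three phases can be strung together in a few lines. The following is an assumed illustration for a one-dimensional map; the Gaussian neighbourhood is one common choice of h that satisfies the two requirements above, not the only one.

import math

def som_step(W, x, eta, sigma):
    """One presentation of input x to a 1-D self organising map.
    W: list of weight vectors indexed by position on the map."""
    # competition: the winner minimises ||x - w_k||
    dist2 = [sum((x_i - w_i) ** 2 for x_i, w_i in zip(x, w)) for w in W]
    k = min(range(len(W)), key=lambda j: dist2[j])
    # cooperation + adaptation: equation (7.21) with h_ki = exp(-d_ik^2 / (2 sigma^2))
    for i, w in enumerate(W):
        h = math.exp(-((i - k) ** 2) / (2 * sigma ** 2))
        W[i] = [w_i + eta * h * (x_i - w_i) for w_i, x_i in zip(w, x)]
    return k

# eta and sigma would both be decreased over time, as described above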
There are many other types of learning algorithm, unfortunately we do not have the space
to cover them here as each one could quite easily be the subject of an entire book. We will
however study memory learning in more depth as this will become useful to us in part 3.
2 3D outputs are possible but less common
Chapter 8
Information Theory Applied to
Neural Networks
The key concept when applying Shannon’s information theory to neural networks is mutual
information. Depending on the scenario we may wish to adjust the mutual information between
synaptic weights or output neurons. We will often describe ideas by images, but the concepts
generalise to any data set. To give a few examples, in all cases xa are input vectors and ya are
the output vectors.
1. Acquiring information for the first time (e.g. cramming before an exam). The notes
consist of the vector x which are fed into the neural network, producing a memory y. We
need to maximise the information conveyed to y (what we remember) from x (the notes).
2. Associative Memory (given part of an image, recall the rest of it). Non overlapping parts
of the same image are given by xa , xb and xc , which together produce outputs ya , yb and
yc . Now our goal is to maximise the mutual information of all the output vectors.
3. Consistent memory (given two independent images, keep them isolated). Two independent
images xa and xb produce outputs ya and yb respectively. Our aim here is to minimise the
mutual information of ya and yb , since due to the independence of xa and xb we cannot
infer anything about yb from ya or vice versa.
Maximising mutual information between x and y (I(x : y)) is the fundamental goal of signal
processing. This goal can be summarised as the Maximum Mutual Information (Infomax)
Principle, due to Linsker:
The transformation of a random vector x observed in the input layer of a neural
system to a random vector y produced in the output layer of the system should be so
chosen that the activities of the neurons in the output layer jointly maximise information about activities in the input layer. The objective function to be maximised
is the mutual information I(y : x) between the vectors x and y.
The mutual information therefore gives neural networks an analogue of the channel capacity
we met in our brief excursion into information theory, which defines the maximum reliable rate
of information transmission through a channel. We will look at autoassociative memory in order
to illustrate these ideas.
8.1
Capacity of a Single Neuron
If we use our neuron to transfer information from one person to another then it is taking on the
role of a channel. We will look at the case of a single neuron and then generalise to Hopfield
networks.
We will work with binary inputs, analysis may also be done on continuous (and stepped)
inputs, but the work is considerably more complex. We will take our neuron to have k inputs
and assume that we have n (k-dimensional) input vectors to classify. These vectors are said to
be in a general layout if any subset of k or fewer vectors is linearly independent and no k + 1 of
them lie on a k − 1 dimensional plane. If we consider the case k = 2 then this latter requirement
is simply that no three points lie in a straight line.
In order to simplify the analysis we will also assume that our neuron has a binary threshold
function, defined below:
f (a) = 0 for a ≤ 0,   f (a) = 1 for a > 0.    (8.1)
Let us assume that our neuron has 6 synapses, so its inputs are points in a 6 dimensional binary
space. We wish to classify some data using the neuron. This data consists of a set of points
{x(i)}_{i=1}^{5} in the 6D binary space and a set of corresponding target values {ti}_{i=1}^{5} . If we use
x(3) as the input to the neuron then we want its output to be t3 .
The receiver is only allowed to test the neuron at the 5 points x(1) through x(5) and must use
the weights of the network to reconstruct the original target values. The sender must therefore
choose an appropriate learning algorithm to encode this information into the synaptic weights.
There are 2^5 distinct binary labellings of 5 points; if these points are in a general layout
then the neuron is not necessarily able to store them all. The number of labellings that the
neuron can produce for n k-dimensional points in a general position is denoted T (n, k) and
satisfies the recurrence relation
T (n, k) = T (n − 1, k) + T (n − 1, k − 1).    (8.2)
Using this, it is possible to show that T (n, k) is given by the following formula:
T (n, k) = 2^n                               for k ≥ n,
T (n, k) = 2 Σ_{i=0}^{k−1} (n−1 choose i)    for k < n.    (8.3)
If the neuron is able to produce all 2^5 labellings then it can reliably classify all of the
information. If it is only able to produce two different labellings for the entire set then it
can store one bit of information. The number of different labellings the neuron can produce
is a natural measure of its capacity. In order to move from labellings to capacity, we use the
following equation:
capacity = log[T (n, k)].
(8.4)
We can check if our neuron is likely to be able to store all the information by checking the
probability that it will be able to store all n bits. This is calculated by taking the ratio of the
number of achievable labellings to the number of possible labellings, like so:
probability of successful storage = T (n, k) / 2^n .    (8.5)
If we plot a graph of this probability against n and k then we see an interesting result, shown
below:
Figure 8.1: Probability of successfully labelling all points[13], from various different perspectives; panels (c) and (d) are drawn at k = 1000.
When n < k we can see that the probability of successfully labelling all of the inputs is 1,
so we are guaranteed success. If we look at the region k < n < 2k then there is a very high
probability of successfully labelling all the inputs. So provided we are willing to accept a small
probability of error, the capacity of a single neuron is slightly less than two bits per input.
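The numbers behind figure 8.1 are easy to reproduce. The following short script is an illustration (not the code used to generate the figure); it evaluates T(n, k) from equation (8.3) and the storage probability from equation (8.5).

from math import comb, log2

def T(n, k):
    """Number of achievable labellings of n points in general position, equation (8.3)."""
    if k >= n:
        return 2 ** n
    return 2 * sum(comb(n - 1, i) for i in range(k))

def storage_probability(n, k):
    """Equation (8.5): fraction of the 2^n labellings that the neuron can realise."""
    return T(n, k) / 2 ** n

print(storage_probability(5, 6))      # 1.0  (n < k: success guaranteed)
print(storage_probability(12, 6))     # 0.5  (the n = 2k boundary)
print(log2(T(12, 6)) / 6)             # about 1.83 bits per input, just under two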
Our single neuron is a trivial example of a feedforward ‘network’. Feedforward networks
can be used to create interpolative memory, meaning that if the inputs differ slightly from the
memory then the output is also allowed to differ slightly. For example, if we replaced our lone
neuron’s activation function by, for example:
f (a) = tanh(3a)
(8.6)
then its output would be continuous. Suppose we had classified the pattern (1, 1, 1) as type ‘1’
using our neuron, with all weights being equal and taking the value 1/3. If we then submit
x = (1, 1, 0) as our input, then the neuron’s activation will be 2/3 instead of 3/3 as it was for
the stored memory. However the neuron will still fire with an output of
f (a) = tanh(3 × 2/3) = tanh(2) = 0.964.
Note that whilst this is not the same as the response to (1, 1, 1) (0.995), it is still close.
A rather more interesting use of neural networks is to create so-called accretive memory,
where the output is required to be exactly the same as one of the stored memories - no deviations
due to noise are allowed. For this task feedback networks are most appropriate. The use of
networks is also advantageous as is allows us to store vectors, not just classify them.
8.2
Network Architecture for Associative Memory
When we supply a network with a noisy/incomplete vector x̄(i) we must allow it some time to
recover the clean version x(i). As a result we will feed the network the noisy vector and wait
until its output no longer changes with time, at which point we will take its output to be the
recovered memory. If x̄(i) causes the network’s output to converge to x(i) then the memory has
been successfully recalled.
Given that we will be using a feedback network we must specify how our neurons will
calculate their activations and update their state. They may either all do this simultaneously
(synchronous updating) or they may do so one at a time according to some sequence that may
be predetermined or random (asynchronous updating). When using asynchronous updates the
neurons update their activations and states at a mean rate (per unit of time).
We will look at asynchronous updates in a binary Hopfield feedback network. It can be
shown that if we use asynchronous updates then the network will always converge to a stable
state and the memories of the neural network correspond to these stable states. This result
is quite general and applies to a Hopfield network with a continuous activation function, e.g.
f (a) = tanh(βa). In order to guarantee convergence with synchronous updates we must use
continuous time.
8.3
Memories
The nature of memory recall (wanting certain sets of neurons to fire simultaneously) makes it
an ideal candidate for Hebbian learning.
After the training process is complete some memories will be more accessible than others,
that is to say that they are more easily recalled. For example we might be able to recall x(1) if
it is corrupted by ≤ 3 flipped bits whilst only being able to recall x(2) if it is corrupted by ≤ 2
flipped bits. The accessibility of a memory x(i) is defined to be the fraction of initial network
states that converge to it. A memory that can tolerate a large number of flipped bits is said
to form a large attractor, similarly for a small attractor. The terms ‘large’ and ‘small’ here are
relative to the number of bit-flips that other memories can sustain.
The encoding of memories is not without its problems, there are a number of ways in which
it can fail.
• Individual memories may get corrupted - the stable state of the network might be slightly
displaced from the memory we wish to store.
• Memories might not form attractors, or they might form such small attractors as to be
effectively useless.
• Spurious memories can be formed.
• Memories might interfere with each other, resulting in a stable state that is an amalgamation of two memories.
It is important to note that the spurious memories are not formed by transitive associations
like (A + B) and (B + C) leading to (A + C), but illogical, meaningless connections caused by
correlations in the structure of memories. This means that the spurious memories are naturally
formed when we create a memory - there is no process which can prevent them. These memories
interfere with the running of our network and so we would like to find a way to limit their effect.
8.4
Unlearning
Luckily Hopfield et al. came up with a method of moderating the effects of these spurious
memories[11]. After the network has been trained we choose a random state as our input and
allow the network to converge on a stable state. We then adjust the synaptic weights in order
to decrease the accessibility of this stable state by a very small amount. We then lather, rinse
and repeat. They called this process ‘unlearning’ as it is nothing more than forgetting each
memory by a small amount.
After its initial training the memories encoded in the network (including the spurious ones)
can have widely varying accessibilities. The unlearning process causes the accessibility of each of
the stored memories to converge and the total accessibility of the spurious memories to decrease.
One must exercise caution when implementing the unlearning algorithm since if it runs for too
long then it ends up erasing everything, including the memories we want it to store.
A convincing case has been made for R.E.M.1 sleep and dreams being nothing more than
our brains undergoing a process of unlearning in order to decrease the accessibility of these
spurious states. In doing so it would be keeping our memories consistent and useful, which is
nice.
8.5
Capacity of a Hopfield Network
When we are looking at the storage capacity of a Hopfield network the important quantities are
the number of neurons and the number of patterns to be stored. We shall denote these numbers
by n and p respectively. For a fixed number of neurons there is an upper limit to the number
of patterns that can be stored, as one would expect.
What is less intuitive however, is that if we try to store too many patterns in the network
(i.e. the ratio p/n is too large) then the network undergoes a sharp transition into complete
amnesia, where none of the memories correspond to stable states. Statistical analysis has found
that this transition occurs at
p = 0.138n.    (8.7)
It is interesting to note that there is no threshold below which the stable states correspond
exactly to the patterns we wish to store. Just below the amnesia threshold there is a correlation
of 98.4% between the memories and stable states, above this threshold there is absolutely no
correlation.
Just below the amnesia threshold the probability of error in each memory is 0.016 per bit,
and as p/n decreases, so does this probability. There are n neurons in the network, each of which
has n − 1 inputs, which gives us a total of (1/2)n(n − 1) synaptic weights. The mutual information
between each bit of the desired pattern and the corresponding bit of the stable state is 1 − H2(0.016) ≈ 0.88. The number of patterns
we are storing is 0.138n, and each one consists of n binary digits, so we are trying to store
0.138n^2 bits in our network. We must then multiply this by the mutual information to find the
amount of information actually stored, which turns out to be 0.122n^2 bits.
If we assume that n is large (as it must be if we wish to store a useful amount of information)
then the difference between n and n − 1 is negligible and we may consider the number of synaptic
weights to be (1/2)n^2 . This means that our network is storing
0.122n^2 bits / ((1/2)n^2 weights) = 0.24 bits per synapse,    (8.8)
which is significantly less than our binary neuron classifier.
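To make the ideas of this chapter slightly more concrete, here is a small, assumed sketch of a binary (±1) Hopfield network with Hebbian storage and asynchronous updates; the 0.138n limit quoted above is an asymptotic, statistical result and will not be visible at this toy size.

import random

def train_hopfield(patterns):
    """Hebbian storage: w_ij proportional to the correlation of bits i and j."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j] / n
    return W

def recall(W, state, sweeps=10):
    """Asynchronous updates until (hopefully) a stable state is reached."""
    n = len(state)
    state = list(state)
    for _ in range(sweeps):
        for i in random.sample(range(n), n):
            a = sum(W[i][j] * state[j] for j in range(n))
            state[i] = 1 if a >= 0 else -1
    return state

memory = [1, 1, 1, -1, -1, 1, -1, -1]
W = train_hopfield([memory])
noisy = list(memory); noisy[0] = -1            # corrupt one bit
print(recall(W, noisy) == memory)              # True: the attractor recovers the memory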
1 Rapid Eye Movement
Part III
Quantum Information Theory and
the Cutting Edge
Chapter 9
Quantum Mechanics
9.1
Motivation
The laws of physics were once all classical - governed by Sir Isaac Newton’s laws. These laws
were believed to be all pervasive, governing all motion from the stars in the night sky to the
atoms from which we are formed. The planets behaved as large rocky billiard balls, orbited
by smaller rocky billiard balls that we called ‘moons’. Atoms were smaller balls, bouncing off
each other throughout the universe. Unfortunately there is a flaw in this simple, intuitive and
intellectually satisfying world view: it is wrong1 . The planets warp space and time by their
very presence and atoms do not bounce off each other but happily coexist not only as particles,
but also as waves. The theory describing atoms and waves at their most fundamental level is
called ‘quantum mechanics’.
Just as the laws of physics were required to change or adapt to the new quantum regime, the
laws of information are required to do the same. The reason for this is simple: all information
is physical, and as such is subject to our newly revised physical laws. In order to adapt, we must
now revise some of our fundamental concepts in order to take into account the fact that we are
now dealing with information in a quantum framework (very small) rather than a classical one
(the kind we deal with every day). In order to build up this quantum theory of information
however, we must first review some of the fundamental concepts of quantum mechanics.
9.2
Qubits
In the classical framework a bit could take one of two values, 0 or 1. We have a similar concept
in quantum information theory, namely the qubit, which has states |0i and |1i. The notation
|.i is called a ‘ket’, and may have its inner product taken by multiplying on the left with a ‘bra’
(h.|). Given a state |ai (‘ket a’) we define ha| (‘bra a’) by ha| = (|ai)† where † denotes taking
the Hermitian conjugate (conjugate transpose). Unlike a classical bit a qubit may also be in a
linear superposition of these two states, meaning that it is in a state |ψi, where
|ψi = α|0i + β|1i
(9.1)
(where |α|^2 + |β|^2 = 1). The qubit is then said to be in a coherent state; if it interacts with its
environment in any way (if we try to measure it for instance) then the qubit will decohere and
fall into one of the states |0i or |1i. Interestingly enough, the process by which it does so is not
random: it will decohere to |0i with probability |α|^2 and to |1i with probability |β|^2 .
1 It should be noted that the Newtonian view of the universe works very accurately between these two scales, and as such should not be disregarded.
Each qubit therefore lives in a 2 dimensional vector space over the field of complex numbers; such
a vector space is called a Hilbert space after David Hilbert. A system is said to have n qubits if
and only if it has 2^n mutually orthogonal states, which corresponds to lying in a 2^n -dimensional
Hilbert space. We shall denote these states by |xi with x representing a string of n binary digits,
for example: |01110010i is one of the states of an eight qubit system. Two states are orthogonal
if their inner product, denoted ha|bi, is zero. For our purposes we will reduce inner products of
coherent states to inner products of |0i and |1i states, using the fact that h0|0i = h1|1i = 1 and
h0|1i = h1|0i = 0.
9.3
The EPR Paradox
The EPR paradox was a thought experiment proposed by Einstein, Podolsky and Rosen in
an attempt to demonstrate that quantum mechanics provided an incomplete description of
nature. It seems to show that the measurement of particles can be used in order to violate the
fundamental principles of relativity. Suppose we have two particles in the state |φi = (1/√2)(|00i + |11i)
and give one to Alice and one to Bob, who can be arbitrarily far apart. If Alice measures her
particle and discovers that it is in state |0i then the combined state must be |00i, so if Bob
measures his particle he will find it to be in state |0i. This decoherence occurs instantaneously
in spite of any distance between Alice and Bob. Could this decoherence be used to allow Alice
and Bob to communicate faster than light? No. Though there is a coupling between the particles,
called ‘entanglement’, which we shall look at shortly.
Einstein, Podolsky and Rosen decided that there must be some kind of internal state of the
particles that causes them to be in state |0i or state |1i before the separation occurred. This
state is not accessible to us until we perform a measurement, however, so we may only speak of
probabilities of the outcomes. Theories of this kind are known as ‘local hidden variable theories’.
Local hidden variable theories do have the attractive property of simplicity, but they cannot
explain the results of measurements made in a different basis. John Bell proved that any local
hidden variable theory satisfies a particular inequality (Bell’s inequality, unsurprisingly), but
in experiments it was shown that this inequality is consistently violated. So no local hidden
variable theory can explain quantum mechanics.
An alternative explanation is that the measurement of Alice’s particle does affect Bob’s.
This is rather problematic for causality however, as we may set up two observers: one who sees
Alice measure her particle first and another who sees Bob measure first. Relativity requires that
the laws of physics must explain equally well the observations of each observer, so one observer
could say that Bob’s measurement affected Alice’s particle and the other observer could say
the opposite. Clearly both cannot be correct, especially when experiments showed that the
results obtained were invariant under a change of observer. This tells us that the results can be
explained equally well either by Alice measuring first, or Bob and as such it is not possible to
use decoherence to communicate. All that can be said is that Alice and Bob will observe the
same random behaviour.
A third explanation, one proposed by Andrew Steane[19], is that the state vector |φi is
not an intrinsic property of the quantum system, rather it is an expression of the information
content of a quantum variable. There is mutual information between A and B, and so the
information content of B changes when we learn something about A. This approach gives us
the same behaviour as classical information, so there is no controversy.
9.4
Entanglement
We may express the qubits |0i and |1i as the vectors (1, 0)^t and (0, 1)^t respectively. As such we may
build up strings by taking the tensor product of the constituent qubits, for example:
|01i = |0i|1i = (1, 0)^t ⊗ (0, 1)^t = (0, 1, 0, 0)^t .    (9.2)
Similarly we find that |00i = (1000)t , |10i = (0010)t and |11i = (0001)t . This allows us
to build up composite states with simple vector algebra. For example, given a state |αi =
(1/√2)(|00i + |01i) we may write this in vector form as (1/√2)(1, 1, 0, 0)^t by adding the corresponding
vectors together.
This representation of qubits is very useful because it allows us to represent any state α in
‘density matrix’ form. This is just the matrix ρα generated by |αihα|, the tensor product of
(1100) with (1100)t .


ρα = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]] (rows listed in order).    (9.3)
If ρα cannot be factorised, then |αi is said to be an entangled state. It is possible to have
different degrees of entanglement; we can easily construct a ρψ that is partially factorisable into
two parts, ρǫ and ρδ :
ρψ = ρǫ ⊗ ρδ = [[1, 0], [0, 0]] ⊗ [[1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1]].    (9.4)
It is important to notice that ρψ is only partially entangled - it can be written as a product of
density matrices, but ρδ cannot be factorised. Entanglement is a purely quantum mechanical
phenomenon, it has no classical analogue.
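These manipulations are easy to check numerically. A small illustrative script (using NumPy, which is assumed to be available) reproduces equation (9.2) and the unnormalised density matrix of equation (9.3).

import numpy as np

ket0 = np.array([1, 0])
ket1 = np.array([0, 1])

# equation (9.2): |01> = |0> tensor |1>
ket01 = np.kron(ket0, ket1)
print(ket01)                       # [0 1 0 0]

# equation (9.3): unnormalised density matrix of |00> + |01>
alpha = np.kron(ket0, ket0) + np.kron(ket0, ket1)    # (1, 1, 0, 0)
rho_alpha = np.outer(alpha, alpha)
print(rho_alpha)                   # 1s in the top-left 2x2 block, zeros elsewhere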
9.5
The Bell States
The Bell states are a set of four possible ways in which two qubits can be entangled. They are
named after John Bell, of EPR fame and are defined by the following equations:
|φ±i = |00i ± |11i,    (9.5)
|ψ±i = |01i ± |10i.    (9.6)
The Bell states are all mutually orthogonal and are maximally entangled. They become very
important when we try to communicate over quantum channels.
9.6
The No Cloning Theorem
This theorem simply states that an unknown quantum state cannot be reliably cloned[26].
We should qualify our use of the word ‘reliably’ here. Essentially we mean that under some
circumstances, it is possible to clone an unknown quantum state. These circumstances boil
down to the requirement that the quantum states we are trying to clone must be mutually
orthogonal. First we construct an operator
P = |0ih0| ⊗ V + |1ih1| ⊗ U
(9.7)
where the operators U, V are unitary. This is an example of a quantum logic gate acting on two
qubits. The ket-bra operators act on the first qubit and determine if it is in state |0i or |1i. If
it is in state |0i then the operator V is applied, if it is in state |1i then U is applied.
If we define U and V by the following equations: U = |0ih1| + |1ih0|, V = I (the identity
matrix) then we have made ourselves what is known as a ‘controlled-NOT’ or CNOT gate. If
the first qubit is in state |0i then the pair of qubits are left alone, however if the first qubit is in
state |1i then the NOT operator is applied to the second qubit. The gate is controlled because
the NOT operator only comes into play if the first qubit is in state |1i. It is important to note
that our CNOT gate can only clone states |1i and |0i, not any arbitrarily chosen states that we
choose to hurl at it. For example, it would not be able to clone the state |ai = (1/√2)(|0i + |1i) as
this is in a superposition of the two states |0i and |1i.
Chapter 10
Quantum Entropy and Information
10.1
Von Neumann Entropy
The entropy of a quantum state ρ was defined by Von Neumann to be
S(ρ) ≡ −tr(ρ log ρ).
(10.1)
If we let λx be the eigenvalues of ρ then we can find a basis in which ρ = diag(λx1 , λx2 , . . .),
and thus we can rewrite the Von Neumann entropy as follows:
S(ρ) ≡ − Σ_x λx log λx .    (10.2)
Once more, we take 0 × log 0 ≡ 0. Throughout this chapter we will drop normalisation factors
such as 1/√2 in front of entangled qubits so as to keep the notation uncluttered.
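As a quick numerical illustration (assuming NumPy, and taking logarithms to base 2 so that entropy is measured in bits):

import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -sum_x lambda_x log2 lambda_x over the eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]               # convention: 0 log 0 = 0
    return float(-np.sum(evals * np.log2(evals)))

pure = np.array([[1, 0], [0, 0]])              # a qubit known to be |0>
mixed = np.array([[0.5, 0], [0, 0.5]])         # a maximally mixed qubit
print(von_neumann_entropy(pure))               # 0.0
print(von_neumann_entropy(mixed))              # 1.0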
10.2
Quantum Distance Measures
Before we advance it is instructive to discuss distance measures for quantum information - we
need a way to measure how different two quantum states are. There are two types of distance
measure in wide use: static , for use when we are in possession of two quantum states and want
to see how different they are, and dynamic, for use when we wish to see how well a particular
process has preserved the information content of a system.
In the classical world the trace distance (also known as the Kolmogorov distance) between
two probability distributions is defined to be
D(px , qx ) = (1/2) Σ_x |px − qx |.    (10.3)
The trace distance is symmetric and obeys the triangle inequality, so D(y, x) ≤ D(x, z)+D(z, y).
In the quantum regime there is an analogue of the trace distance, which is defined by the
following equation for quantum states ρ and σ in density matrix formulation:
D(ρ, σ) = (1/2) tr|ρ − σ|.    (10.4)
where |A| = √(A†A). If ρ and σ commute then D(ρ, σ) reduces to the Kolmogorov distance
between the eigenvalues of σ and ρ.
If we act on ρ and σ with a trace preserving quantum operator ǫ then we discover that
D(ǫρ, ǫσ) ≤ D(ρ, σ).
(10.5)
Essentially this means that one cannot increase the distance between two quantum states by
acting on them with a trace preserving operator. This is a quantum analogue of the data
processing inequality we came across in chapter 4, so now we have a scientific basis for saying
‘Things can only get worse’.
Another useful distance measure, one that is useful when considering data compression, is
fidelity. The fidelity of two states ρ and σ is defined to be:
F (ρ, σ) ≡ tr √( ρ^{1/2} σ ρ^{1/2} ) .    (10.6)
The fidelity is (despite its appearance) symmetric in its arguments.
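Both measures are straightforward to evaluate for small density matrices. The following sketch (NumPy and SciPy assumed available) implements (10.4) and (10.6) directly; it is an illustration rather than a general-purpose routine.

import numpy as np
from scipy.linalg import sqrtm    # matrix square root

def trace_distance(rho, sigma):
    """Equation (10.4): D = (1/2) tr |rho - sigma|."""
    evals = np.linalg.eigvalsh(rho - sigma)    # |A| has eigenvalues |lambda_i| for Hermitian A
    return 0.5 * np.sum(np.abs(evals))

def fidelity(rho, sigma):
    """Equation (10.6): F = tr sqrt( sqrt(rho) sigma sqrt(rho) )."""
    s = sqrtm(rho)
    return np.real(np.trace(sqrtm(s @ sigma @ s)))

rho = np.array([[1.0, 0.0], [0.0, 0.0]])       # |0><0|
sigma = np.array([[0.5, 0.0], [0.0, 0.5]])     # maximally mixed state
print(trace_distance(rho, sigma))              # 0.5
print(fidelity(rho, sigma))                    # about 0.707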
10.3
Quantum Error Correction
Single Qubit Flip Code
To explain how we can correct a single qubit being flipped from |0i to |1i or vice versa we
must define another quantum gate - the Toffoli gate. The Toffoli gate is sometimes called a
controlled-controlled NOT gate (CCNOT) because its action is to check two qubits, and if they
are both in state |1i, flip the third. This may be written as:
|x, y, zi → |x, y, z + (x.y)i
(10.7)
where the addition on the third qubit is done modulo 2.
To begin with we spread our qubit |ψi = (α|0i + β|1i) across three like so:
|ψi|0i|0i = (α|0i + β|1i)|0i|0i → α|000i + β|111i.
(10.8)
This process can be achieved by the use of two CNOT gates, one acting from the first qubit
to the second and one acting from the first to the third. Note that for the part in state |0i we
get |000i ‘for free’, whereas for the part in state |1i the CNOT gates become active and give us
|111i. It is important to see that this process does not clone |ψi, were we to do so we would be
performing the following:
|ψi|0i → |ψi|ψi = α2 |00i + αβ|01i + βα|10i + β 2 |11i.
(10.9)
This equation has pairs of qubits in mixed states (e.g. |01i), which do not appear in equation
(10.8). The situation gets worse if we try |ψi|0i|0i → |ψi|ψi|ψi.
Suppose we now send our triples across a noisy quantum channel in which one of the qubits
gets its state flipped (for generality we will assume that this occurs in the |000i term as well
as the |111i term). We must then find a way to make sure that the original qubit remains in
its original state. We do not care about the final states of qubits two and three, since these are
only present to preserve the state of qubit one - we only measure the first qubit at the end.
We can do this by applying three quantum gates. If we apply two CNOT gates (first to
third and first to second) and then a Toffoli gate (which checks qubits two and three then acts
on qubit one), then any single bit flip error will be corrected.
Worked Example
We wish to send
|ψi = α|0i + β|1i → α|000i + β|111i
across a noisy quantum channel. Upon doing so, the first qubit is flipped, giving
α|000i + β|111i → α|100i + β|011i.
First we apply a CNOT gate from the first to the third qubit, resulting in a state:
α|101i + β|011i.
Then we apply a CNOT gate from the first to the second qubit, giving us:
α|111i + β|011i.
Finally we apply the Toffoli gate, which flips the state of the first qubit if and only if both the
second and third are in state |1i. This gives us a final state of:
α|011i + β|111i.
From which we measure the first qubit only to receive our original qubit |ψi = α|0i + β|1i.
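The worked example above can be checked by brute force on an 8-dimensional state vector. The sketch below is an assumed illustration: the single helper applies a CNOT or Toffoli gate by permuting the basis labels, and the final print shows the amplitudes landing on |011i and |111i as in the example.

import numpy as np

def controlled_flip(state, controls, target):
    """Flip 'target' on each basis state where all 'controls' are 1 (CNOT/Toffoli on 3 qubits).
    Qubit 0 is the leftmost bit of the basis label |q0 q1 q2>."""
    out = np.zeros_like(state)
    for idx, amp in enumerate(state):
        bits = [(idx >> (2 - q)) & 1 for q in range(3)]
        if all(bits[c] for c in controls):
            bits[target] ^= 1
        out[bits[0] * 4 + bits[1] * 2 + bits[2]] += amp
    return out

alpha, beta = 0.6, 0.8                          # arbitrary qubit with |alpha|^2 + |beta|^2 = 1
state = np.zeros(8)
state[0b100], state[0b011] = alpha, beta        # the channel has flipped the first qubit

state = controlled_flip(state, [0], 2)          # CNOT, first to third
state = controlled_flip(state, [0], 1)          # CNOT, first to second
state = controlled_flip(state, [1, 2], 0)       # Toffoli acting on the first qubit
print(state)                                    # amplitude alpha on |011>, beta on |111>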
Single Qubit Phase Flip Code
Flipped qubits are not the only type of error we may encounter in the quantum world. It is
also possible for the phase of qubits to be flipped like so:
α|0i + β|1i → α|0i − β|1i.
(10.10)
Such a phase flip is accomplished by the ‘Z’ operator, defined as:
Z ≡ |0ih0| − |1ih1|.
(10.11)
Let us define two quantum states |+i and |−i by the following equations:
|+i = α|0i + β|1i,    (10.12)
|−i = α|0i − β|1i.    (10.13)
Let us now apply the Z gate to each of these states
Z|+i = Z(α|0i + β|1i) = α|0i − β|1i = |−i,    (10.14)
Z|−i = α|0i + β|1i = |+i.    (10.15)
As we can see, a phase flip in the {|0i, |1i} basis is equivalent to a bit flip in the {|+i, |−i} basis!
Therefore we can use the qubit flip code in the {|+i, |−i} basis to protect us from phase flips.
The Shor Code
This code was created by Peter Shor by concatenating the qubit flip and phase flip codes[18].
We start by moving to the {|+i, |−i} basis and using the phase flip code:
α|0i + β|1i → α| + ++i + β| − −−i = α (|0i + |1i)⊗3 + β (|0i − |1i)⊗3 .
(10.16)
Where (|φi)⊗3 = |φi|φi|φi = |φφφi. We now use the single qubit flip code to get the following:
α|0i + β|1i → α (|000i + |111i)⊗3 + β (|000i − |111i)⊗3 .
(10.17)
We are now protected against both qubit flips and phase flips. It turns out however, that the
Shor code protects us against arbitrary errors. So a phase shift (such as β → β + η), even a tiny
one, will be corrected.
10.4
Quantum Teleportation
Quantum teleportation is a process by which Arthur can send Belinda a qubit in an unknown
state using only classical bits. The process is called teleportation because the no cloning theorem
dictates that Arthur cannot retain a copy of the qubit, so its state must be destroyed. Let us
assume that Arthur and Belinda both possess half of an entangled pair of qubits in the Bell
state
|φ+i = |00i + |11i.    (10.18)
Arthur’s qubit is in a state |ρi = α|0i + β|1i. The state of all three qubits is now:
|ρi|φ+ i = (α|0i + β|1i)(|00i + |11i)
(10.19)
= α|000i + α|011i + β|100i + β|111i.
After some mathematical juggling we can rearrange the above into:
|ρi|φ+ i =
|φ+ i(α|0i + β|1i) + |φ− i(α|0i − β|1i)
+|ψ + i(α|1i + β|0i) + |ψ − i(α|1i − β|0i).
(10.20)
Here the Bell states refer to the two qubits in Arthur’s possession. This leaves
Belinda’s qubit in a superposition of the four bracketed states, each of which is in some sense
similar to |ρi. Arthur now measures his two qubits in the Bell basis and randomly obtains one
of the states. Given that there are four Bell states this corresponds to two bits of classical
information. Arthur then telephones Belinda and tells her what Bell state he obtained. Using
this information Belinda is able to determine the state her qubit is in and is therefore able to
perform a unitary operation to put her qubit into the state |ρi.
Arthur's Bell state | Belinda's qubit state | Operator to recover |ρi
|φ+i | α|0i + β|1i | |0ih0| + |1ih1| ≡ I
|φ−i | α|0i − β|1i | |0ih0| − |1ih1| ≡ Z
|ψ+i | α|1i + β|0i | |0ih1| + |1ih0| ≡ X
|ψ−i | α|1i − β|0i | |1ih0| − |0ih1| ≡ Y
Table 10.1: Arthur's possible measurements of his qubits and the corresponding operators that Belinda must use to reconstruct |ρi.
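The rearrangement in equation (10.20) is easy to verify numerically. The following assumed sketch (NumPy) checks it for one choice of α and β; the factor of two that appears is one of the normalisation factors the text suppresses.

import numpy as np
from functools import reduce

def kets(*labels):
    """Tensor product of |0>/|1> kets, e.g. kets(0, 1, 1) = |011>."""
    basis = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
    return reduce(np.kron, (basis[l] for l in labels))

alpha, beta = 0.6, 0.8
rho   = alpha * kets(0) + beta * kets(1)
phi_p = kets(0, 0) + kets(1, 1)                  # unnormalised Bell states
phi_m = kets(0, 0) - kets(1, 1)
psi_p = kets(0, 1) + kets(1, 0)
psi_m = kets(0, 1) - kets(1, 0)

lhs = np.kron(rho, phi_p)                        # the three-qubit state of (10.19)
rhs = (np.kron(phi_p, alpha * kets(0) + beta * kets(1))
       + np.kron(phi_m, alpha * kets(0) - beta * kets(1))
       + np.kron(psi_p, alpha * kets(1) + beta * kets(0))
       + np.kron(psi_m, alpha * kets(1) - beta * kets(0)))

print(np.allclose(2 * lhs, rhs))                 # True, up to the suppressed normalisation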
10.5
Dense Coding
It is easy to use qubits to transmit classical information. If we wish to send the string 0110 we
can transmit |0110i. Belinda now measures the qubits in the {|0i, |1i} basis and recovers the
string 0110 with no ambiguity. This method is fine, there is nothing inherently wrong with it,
but it does seem a little wasteful - why bother sending the information via qubits? This allows
us to send 1 classical bit for each qubit, so we have not gained anything by using a quantum
channel.
Suppose now that Arthur and Belinda share two entangled qubits in the state |00i + |11i.
It turns out that Arthur can now send two classical bits to Belinda by sending her his half of
the entangled pair (one qubit). This is known as dense coding, or sometimes superdense coding
depending on how dramatic the author wishes to make the process sound. Dense coding is a
counterpart to teleportation, in the latter we use two classical bits to transfer a single qubit,
whereas in the former we use a single qubit to transfer two classical bits.
Starting from (|00i + |11i) Arthur can generate any of the Bell basis states by using the
quantum logic gates {I, X, Y, Z} as demonstrated in table (10.2). His choice of one state out of
four corresponds to two classical bits of information1 .
Value | Transformation | New State
0 | ψ0 = (I ⊗ I)ψ0 | |00i + |11i
1 | ψ1 = (X ⊗ I)ψ0 | |10i + |01i
2 | ψ2 = (Y ⊗ I)ψ0 | |01i − |10i
3 | ψ3 = (Z ⊗ I)ψ0 | |00i − |11i
Table 10.2: Initial preparation by Arthur.
Arthur then sends his half of the entangled pair to Belinda, who needs to find what state it
is in. She does this by using a CNOT gate (also known as an XOR gate) on the pair to place
them into one of the quantum states in table (10.3).
Initial State | State after CNOT | First Qubit | Second Qubit
ψ0 | |00i + |10i | |0i + |1i | |0i
ψ1 | |11i + |01i | |1i + |0i | |1i
ψ2 | |01i − |11i | |0i − |1i | |1i
ψ3 | |00i − |10i | |0i − |1i | |0i
Table 10.3: State of qubits after Belinda applies the CNOT gate.
Belinda is now able to measure the second qubit without disturbing the first. This allows
her to distinguish between (|00i ± |11i) and (|01i ± |10i). To find the sign of the phase Belinda
now acts on the first qubit with the Hadamard gate (below) to get the results shown in table
(10.4).
H = (|0i + |1i)h0| + (|0i − |1i)h1|
(10.21)
Initial state | H(First Qubit)
ψ0 | [|0i + |1i] + [|0i − |1i] = |0i
ψ1 | [|0i − |1i] + [|0i + |1i] = |0i
ψ2 | [|1i − |0i] + [|0i + |1i] = |1i
ψ3 | [|0i + |1i] − [|0i − |1i] = |1i
Table 10.4: State of first qubit after Belinda applies the H gate.
Belinda now measures the first qubit in the {|0i, |1i} basis. Combined with her earlier measurement
of the second qubit, this allows her to determine the state ψi that Arthur chose and therefore recover two classical bits
with no ambiguity.
Dense coding is beneficial if we wish for our communications to be secure (which we usually
do), as the qubit Arthur sends can only yield the two classical bits to the person holding its entangled
partner. Thus it can be very useful for the transmission of cryptographic keys.
1 The tables in this section have been reproduced from Morali Kota’s Quantum Entanglement as a resource for Quantum Information[12].
10.6
Quantum Data Compression
Suppose we have a source that produces a state |ψi i with probability pi . If it can produce m
different states then we may write this as
source = (pi , |ψi i)_{i=1}^{m} .    (10.22)
We will treat our source as if it is memoryless. So the generation of each quantum state is
independent of all those that went before. It will produce strings of n qubits. In order to ease
the notation we will write a general string of n qubits as:
|ψi1 i|ψi2 i · · · |ψin i = |ψI i    (10.23)
with
pi1 pi2 · · · pin = pI .    (10.24)
Such a string will lie in a 2^n dimensional Hilbert space (denoted H). A compression scheme of
rate Rn is a map
C^n : H → K    (10.25)
where K is a Hilbert space of dimension 2^{n Rn} . Decompression is the map back to H from K:
D^n : K → H.    (10.26)
A compression scheme is a pair of operations C^n and D^n . A compression scheme is reliable
and has a rate R when the following two conditions are met:
Rn → R, with 0 < R < 1, as n → ∞,    (10.27)
Σ_I pI F( |ψI ihψI | , D^n ◦ C^n (|ψI ihψI |) ) → 1, as n → ∞.    (10.28)
In words we may state these as
• Asymptotically, R qubits are used for the compression of each qubit in the initial string.
• The quantum states are (on average) close: the expected fidelity between the processed and
original states tends towards 1 as n approaches infinity, so in this limit the compression and
decompression lose less and less information.
We are now in a position to provide a final comparison of quantum information theory to
its classical counterpart: a quantum source coding theorem. This is usually referred to as
Schumacher's quantum noiseless channel coding theorem (just as the source coding theorem
could be referred to as Shannon's noiseless channel coding theorem). Schumacher's theorem
states that given a source (p_i, |ψ_i⟩) with Von Neumann entropy S(ρ), for any R > S(ρ)
there exists a reliable compression scheme of rate R. Conversely, there does not exist any reliable
compression scheme with a rate R < S(ρ). So the entropy of any source provides a natural
limit on the extent to which we can compress data (be it classical or quantum).
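As a small numerical illustration of this limit, the sketch below computes S(ρ) for the density matrix ρ = Σ_i p_i |ψ_i⟩⟨ψ_i| of a source. The particular two-state source and the function name are made up for the example; only the formula S(ρ) = −Tr(ρ log₂ ρ) comes from the theory above.

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log2 rho), evaluated via the eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]               # 0 log 0 is taken to be 0
    return float(-np.sum(evals * np.log2(evals)))

# A hypothetical source: |0> with probability 3/4 and |+> with probability 1/4.
ket0 = np.array([1.0, 0.0])
ket_plus = np.array([1.0, 1.0]) / np.sqrt(2)
probs, states = [0.75, 0.25], [ket0, ket_plus]

rho = sum(p * np.outer(s, s.conj()) for p, s in zip(probs, states))
print(von_neumann_entropy(rho))                # ~0.48 qubits per signal state
```

Schumacher's theorem then says this source can be compressed reliably at any rate above roughly 0.48 qubits per emitted state, but at no rate below it; had the two states been orthogonal, S(ρ) would reduce to the Shannon entropy of the probabilities and we would recover the classical result.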
Having completed this review of quantum information theory and some of its primary results,
it is now time to move on to a much more speculative area of research: quantum neural networks.
Chapter 11
Quantum Neural Networks
11.1    Motivation
Quantum neural networks are a very new, speculative field. No attempts to build one have yet
been made, as the theoretical analysis is still in its infancy. Having said this, there are still
several reasons why we should investigate them. One of the main practical reasons to research
them comes from quantum algorithms.
Quantum algorithms have sped up several classically inefficient computational tasks. Grover's
search algorithm can find an item in an unordered database with n entries in O(n^{1/2}) time,
compared to classical algorithms which require O(n) time. Peter Shor's factoring algorithm can
factorise numbers in polynomial time with respect to the number of binary digits, unlike the
best known classical algorithms, which require a time that grows faster than any polynomial in
the number of binary digits.
Given these dramatic improvements, it is hoped that quantum neural networks may provide
similar advantages over classical neural networks.
One of the major problems with using artificial neural networks for associative memory
is their limited storage capacity. Whilst they function admirably for pattern association and
completion, they do not do so efficiently. A quality that we would like to preserve is their
guaranteed convergence to a stable state (provided they are not overloaded). According to
Ezhov and Ventura[6], if we have a quantum Hopfield network trained with a single memory
then we are guaranteed to have convergence to its stable state. Due to the linearity of quantum
mechanics, however, we can place several of these single-memory quantum neural networks into
a superposition, and each will continue to behave as a single-memory network no matter how
many memories are stored in total. Indeed, according to Ventura and Martinez[24] a quantum
associative memory has a storage capacity that is exponential in the number of qubits!
A frequently overlooked reason to study quantum neural networks is that they provide a source of
intellectual stimulation. A rich theory of quantum information has been developed and it would
seem negligent not to apply it as widely as possible. When information theory was applied to
physics it was made clear why some effects (decoherence, for example) may travel faster than
light without causing problems with our understanding of the universe: these effects are free
to travel faster than light as they please, provided that no information is transferred in the
process. Just as the laws of physics were elucidated by the advent of information theory, so
might the field of neural networks be advanced and illuminated by the application of quantum
information theory.
Of the few papers that have been written on quantum neural networks and quantum neural
network-like systems, most have been focussed on creating a network that functions as an
associative memory for pattern recognition. Consequently this is where we will devote most of
our attention. We will look into their architecture first, explaining how to ‘quantise’ a neural
network, then we will have a brief overview of some work that has been done on training these
quantum neural networks. We will then move on to the specific case of training a quantum
associative memory and investigating its benefits and pitfalls.
11.2    Architecture
Before studying quantum neural networks we must first adequately define our architecture. The
route we will be taking is the most common one and consists of replacing each neuron with a
qubit. Instead of having a neuronal state lying in the range 0 ≤ y ≤ 1, our qubit will
be in the state α|0⟩ + β|1⟩.
This now raises the question of how to quantise the synaptic weights. Once more we will
move with the crowd: the synaptic weight between two quantum neurons (qubits) will be given
by the amount of entanglement that they share.
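To make this a little more concrete, the sketch below computes one common entanglement measure, the entropy of entanglement, for a pair of two-qubit states. The choice of this particular measure and the example states are our own assumptions; the quantum neural network literature uses several different quantifications of entanglement as the 'weight'.

```python
import numpy as np

def entanglement_entropy(psi):
    """Entropy of entanglement of a two-qubit pure state (4 amplitudes):
    the Von Neumann entropy of the reduced state of the first qubit.
    It is 0 for product states and 1 for a maximally entangled pair."""
    psi = np.asarray(psi, dtype=complex)
    psi = psi / np.linalg.norm(psi)
    m = psi.reshape(2, 2)                      # amplitude c[first qubit, second qubit]
    rho_first = m @ m.conj().T                 # partial trace over the second qubit
    evals = np.linalg.eigvalsh(rho_first)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

product = np.kron([1, 0], [0.6, 0.8])          # |0> (x) (0.6|0> + 0.8|1>)
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)     # (|00> + |11>)/sqrt(2)

print(entanglement_entropy(product))           # ~0.0: no 'synaptic weight'
print(entanglement_entropy(bell))              # 1.0: maximal 'synaptic weight'
```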
11.3    Training Quantum Neural Networks
In his work, Altaisky discussed one of the problems with trying to directly quantise a neural
network[1]. The activation function that we are used to using in our classical artificial neural
networks is (in general) nonlinear, which causes problems when we try to involve quantum
mechanics, an inherently linear theory. He proposes that the closest we can come is to have an
operator F̂ act on the weighted sum, like so:

state of qubit = F̂ Σ_i ŵ_i |x_i⟩.        (11.1)

The learning rule he used was also non-unitary, but this was due to physical considerations, and
an alternative, unitary rule was suggested for theoretical work.
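A toy rendering of equation (11.1) is given below. Everything concrete in it is an assumption made purely for illustration: F̂ is replaced by simple renormalisation of the summed state, and the weight operators are arbitrary 2×2 matrices, not anything prescribed by Altaisky.

```python
import numpy as np

def quantum_neuron(weights, inputs):
    """Toy version of equation (11.1): apply a 2x2 weight operator to each input
    qubit state, sum the results, then renormalise (our stand-in for F)."""
    y = sum(w @ x for w, x in zip(weights, inputs))
    return y / np.linalg.norm(y)

ket0 = np.array([1.0, 0.0])
ket_plus = np.array([1.0, 1.0]) / np.sqrt(2)

w1 = np.array([[0.8, 0.0], [0.0, 0.6]])        # an arbitrary diagonal weight operator
w2 = np.array([[0.0, 1.0], [1.0, 0.0]])        # a NOT-like weight operator

print(quantum_neuron([w1, w2], [ket0, ket_plus]))   # the neuron's output state
```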
Ricks and Ventura have discussed an alternative way to train the network. Their method
was to search through all the possible weight vectors of the network in order to find one which
is consistent with the network’s training data[17]. This is not without its problems however,
since it is not possible to guarantee that the network will correctly classify all the training data.
It is also possible for the network to overfit the data when using this training method, just like
a classical neural network. This algorithm does not scale well, but has the advantage that it
can be used to achieve arbitrary levels of accuracy.
In the same paper they present a randomised search algorithm which searches each neuron's
weight space independently. The results of the simulations they ran indicated that their
randomised algorithm is very efficient by comparison to traditional training methods such as
backpropagation. After 74 presentations of the training data it had correctly classified 95% of
the training set. Standard backpropagation was able to classify 98% of the set, but only after
300 presentations.
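The flavour of such a search can be conveyed with an entirely classical caricature, sketched below. The random_search_fit helper and the toy data are our own inventions, and nothing here captures the quantum parallelism that makes Ricks and Ventura's version interesting; it only shows the basic idea of sampling weight vectors and keeping whichever is most consistent with the training data.

```python
import numpy as np

def random_search_fit(X, y, trials=5000, seed=0):
    """Sample weight vectors at random and keep whichever classifies the most
    training examples correctly (simple threshold-unit classifier)."""
    rng = np.random.default_rng(seed)
    best_w, best_hits = None, -1
    for _ in range(trials):
        w = rng.normal(size=X.shape[1])
        hits = int(np.sum((X @ w > 0) == y))
        if hits > best_hits:
            best_w, best_hits = w, hits
    return best_w, best_hits / len(y)

# A toy, linearly separable training set.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([True, True, False, False])
w, accuracy = random_search_fit(X, y)
print(accuracy)    # 1.0 on this toy set
```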
11.4    Quantum Associative Memory
In this section we will follow the approach of Ezhov and Ventura[6] as they create a quantum
associative memory. We will be giving an overview of the method only, as the process is highly
technical and not very illuminating. There are two distinct phases to using an associative
memory: memorisation and recall. We will discuss each in turn and then combine the algorithms
to summarise what we find.
Memorisation
In the process of memorisation we are seeking to store m patterns, each having a length of n
qubits. The algorithm uses several well known operators (Toffoli gates, CNOT gates etc.) that
act on one, two or three qubits. One new operator in the process is given below:

Ŝ^p = [ 1   0        0             0        ]
      [ 0   1        0             0        ]
      [ 0   0   √((p−1)/p)      −1/√p       ]        (11.2)
      [ 0   0      1/√p       √((p−1)/p)    ]
The parameter p ranges from 1 to m and so there is a unique Ŝ^p for each of the desired memories.
The algorithm requires 2n + 1 qubits: the first n qubits actually store the memories, whilst the
other n + 1 are used for book-keeping and are restored to the state |0⟩ after each iteration. After
m iterations the first n qubits are in a coherent superposition of states that correspond to the
patterns we wanted to store. It is important to note that this training does not introduce the
spurious memories that training a classical associative memory does; our memory is free from
these illogical phantoms.
The encoding process is polynomial in the number of patterns and the length of the patterns
and as such requires O(mn) iterations to encode the m patterns as a superposition of states
over n qubits. This is optimal since just reading the patterns cannot be done any faster than
O(mn) time.
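As a quick sanity check on equation (11.2), the sketch below builds Ŝ^p for a few values of p and confirms that each is unitary, as any operator used in the encoding must be. The sign placement of the ±1/√p entries follows our transcription above and varies between presentations; any such placement yields a unitary operator, so the check rather than the exact matrix is the point.

```python
import numpy as np

def S_hat(p):
    """The two-qubit operator of equation (11.2) for a given p (1 <= p <= m)."""
    a = np.sqrt((p - 1) / p)
    b = 1 / np.sqrt(p)
    return np.array([[1, 0, 0,  0],
                     [0, 1, 0,  0],
                     [0, 0, a, -b],
                     [0, 0, b,  a]])

for p in range(1, 5):
    S = S_hat(p)
    assert np.allclose(S @ S.T, np.eye(4))     # each S^p is unitary (real orthogonal)
print(S_hat(2).round(3))
```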
Recall
The recall process is implemented by Grover's search algorithm. The use of Grover's search
algorithm corresponds to measuring the system and having it collapse to the state we are
searching for. If only n − k of the qubits are specified when we start our search, then the
collapse of the system to the appropriate memory may be considered as performing pattern
completion.
Pattern completion is all well and good, but ultimately we want our memory to perform
pattern association too. It turns out that with only a slight modification, the recall algorithm
can 'recognise' noisy images. Unfortunately adding association to our network can result in the
recall of spurious memories. These spurious memories are not stored in the network, but are
an unfortunate side effect of the associative recall process.
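For orientation, the sketch below runs the textbook Grover iteration on a classical state-vector simulation. It is our own illustrative code, and the actual recall procedure is a modified version of Grover's algorithm rather than this plain search, but it shows the marked state being found with near-certainty after roughly (π/4)√(2^n) iterations, which is where the recall times quoted in the next subsection come from.

```python
import numpy as np

def grover_probability(n_qubits, target):
    """Probability of measuring `target` after ~(pi/4)sqrt(N) Grover iterations,
    starting from the uniform superposition over N = 2**n_qubits states."""
    N = 2 ** n_qubits
    amp = np.full(N, 1 / np.sqrt(N))
    for _ in range(int(np.pi / 4 * np.sqrt(N))):
        amp[target] *= -1                      # oracle: flip the marked amplitude
        amp = 2 * amp.mean() - amp             # diffusion: inversion about the mean
    return float(amp[target] ** 2)

print(grover_probability(8, target=42))        # ~1.0 after only 12 iterations
```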
Combining The Algorithms
A quantum associative memory can be constructed by combining the two algorithms above.
There are 2^n distinct patterns, of which the network can store m. The memorisation takes
O(mn) time and the recall of a pattern requires O(√(2^n)) time. The recall process takes O(√(2^n))
time because it is searching over all possible patterns that the system can store. This recall time
is exponential in n, which is not very good; however, Ezhov and Ventura suggest that a nonunitary
recall mechanism could improve upon this. If unitarity is required, then Grover's search
algorithm has been proved to be optimal.
11.5    Implementation
All models of quantum computation are plagued by decoherence due to unwanted interaction
with the system’s environment. It has been suggested that quantum neural networks may
be implemented before traditional quantum computers by virtue of their significantly lower
requirements on both the number of qubits and the number of state transformations needed to
perform useful calculations.
One proposal to counteract the problem of decoherence is to not use a superposition of
states to store the memories. Unfortunately this proposal does not take advantage of quantum
mechanics. As such it would provide us with little more than a very expensive artificial neural
network that just happens to use qubits.
The other major problem is common to all attempts to create a physical neural network:
the high density of connections makes them very difficult to implement in small scale systems.
11.6    Performance
Quantum neural networks appear to offer several advantages over classical neural networks.
Menneer and Narayanan found that training their quantum neural networks required 50% fewer
weight changes than their classical counterparts[14]. They also found that the quantum neural
networks were able to learn particular sets of training data that the classical networks could
not. They then make the significant point that quantum neural networks do not suffer from the
catastrophic amnesia that overloaded classical neural networks do.
According to Trugenberger, a quantum neural network is capable of storing each of the
patterns that can be formed, giving a capacity that is exponential in the number of qubits[20].
His recall mechanism is probabilistic: it transforms the initial superposition of states into a
superposition centred on the input pattern. This means that any measurement of the system
is more likely to result in a state that includes the input pattern.
Unfortunately, Brun et al. take issue with Trugenberger's claim and proceed to build a
convincing case that Trugenberger is in fact mistaken[2]. They make the valid point that once a
single memory has been recalled, the network's superposition has collapsed and it can no longer
be used. The memory state cannot be perfectly cloned, and even if we use a probabilistic method
of cloning (which is unreliable, and as such does not violate the no-cloning theorem) then its
quality will degrade over time, limiting its utility.
Following this response, Trugenberger published a reply[21] in which he moderates his statements
about storage capacity, explaining that the memories could be stored in an operator M̂ which
acts on |0⟩ so that

M̂|0⟩ = |M⟩        (11.3)

where |M⟩ is the uniform superposition of memory states. He states that at the very least, an
associative memory can be formed for a polynomial number of patterns without the appearance
of spurious memories (and hence no transition to amnesia).
He also agrees that M̂ would degrade over time and suggests that, for a number of memories
polynomial in n, it could be repeatedly and efficiently manufactured using a sequence of
elementary unitary gates involving at most two qubits.
Ventura et al. state that quantum neural networks have a storage capacity that is exponential
in the number of qubits, and Trugenberger believes that in the worst case they have a storage
capacity that is polynomial in the number of qubits. Brun et al., however, are convinced
that this view is erroneous and that quantum neural networks hardly offer any advantage over
their classical counterparts when producing an associative memory. The question has not been
conclusively answered and so it appears that only further research will settle the issue once and
for all.
Chapter 12
Conclusion
In part 1, we discovered some of the main uses of information theory. We now understand
how to compress data and correct errors that arise from using noisy channels. Thanks to the
ingenuity of programmers, people the world over can reap the benefits of Shannon’s insight
without knowing the intimate details of conditional probabilities and noisy channels.
A good deal of current research is aimed at broadcast technologies, where simple retransmission
of corrupted data is not a viable option. So-called digital fountain codes have been devised
which have the remarkable property that, provided one receives a certain proportion (nine out
of ten, say) of the codewords per set (e.g. per frame of a television image), it is possible to rebuild
the missing ones using redundant information contained in those that did arrive. In other words,
there is not only redundancy within each codeword, but also between the codewords themselves.
The advent of High Definition Television (HDTV) has caused greater focus on these codes as
broadcasters attempt to reliably broadcast ever more information over the already crowded
airwaves.
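The underlying idea of redundancy between codewords can be illustrated with something far simpler than a real fountain code: a single XOR parity packet. The sketch below is a toy of our own with made-up packet data, not an LT or Raptor code, but it shows how any one lost packet can be rebuilt from the ones that did arrive.

```python
import numpy as np

# One redundant parity packet lets the receiver rebuild any single lost packet.
rng = np.random.default_rng(1)
packets = [rng.integers(0, 2, size=8, dtype=np.uint8) for _ in range(9)]
parity = np.bitwise_xor.reduce(packets)        # the tenth, redundant packet

lost = 3                                       # pretend packet 3 never arrived
received = [p for i, p in enumerate(packets) if i != lost] + [parity]
rebuilt = np.bitwise_xor.reduce(received)      # XOR of the nine that did arrive

assert np.array_equal(rebuilt, packets[lost])  # the missing packet is recovered
```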
In part 2 we investigated pattern recognition and associative memory by the use of neural
networks. As we have mentioned before, neural networks have inspired a vast amount of research. Profiteers are drawn to neural networks by their ability to recognise trends and predict
future behaviour. They hope to use this to predict the stock market and make their millions.
The continued failure of neural networks to reliably predict the stock market has not prevented
huge investments of time into this and probably never will.
A contrasting area of neural network research focuses on the application of self-organising
maps to the reconstruction of surfaces, a technology that has been used to store three-dimensional
models of archaeological finds[5]. This is therefore very important culturally, as it allows
us to have an enduring 'copy' of an artefact lest disaster strike and the original be lost or
destroyed.
Research in quantum information is also heading in several directions at once. A vast amount
of resources has been ploughed into the investigation of quantum cryptography and quantum
key distribution. Implemented correctly, these provide uncompromisingly secure communication
and are therefore of great interest to governments and large businesses.
By contrast, quantum neural networks are very much a niche area. From the exchanges
between Trugenberger[20, 21] and Brun et al.[2] we can see that there is still a healthy debate
over whether quantum neural networks provide any benefit over classical neural networks, let
alone enough to justify significant investment.
Over the course of this report we have touched on some of the ideas that have helped to
shape the world in which we live. Computers, email and even televisions would not exist in their present form were
it not for the insightful work of Claude Shannon and the many who followed in his footsteps
during the decades following his original paper.
Further to this, we now understand how patterns are recognised, memories stored and
associations made. We have seen candidates for a new generation of computers that would
render present machines obsolete overnight; fifty years ago, people would never have predicted
our current technological state.
Given the models and ideas we have covered here, it seems unlikely that we can imagine what
technology will look like in another fifty years. Will neural networks and quantum computers be
commonplace, as televisions and mobile telephones are now? Or will further research relegate
them to the private research of a few dedicated individuals? It seems that only time and further
research will sate our curiosity in these unique and fascinating areas.
Appendix A
Notation
In order to keep our notation uncluttered we will denote strings of digits (x1 , x2 , x3 , . . .) simply
by the letter x, on the understanding that x is a vector. Should we wish to refer to a specific
element of x we will use a subscript, like so: x4 . Different input patterns will be denoted like
so: x(1), x(2) and so on.
In part 2 we will denote the weight vector of neuron k by wk on the understanding that it too
is a vector, with components (wk1 , wk2 , wk3 , . . .). If we wish to refer to a particular synaptic
weight then we will add another subscript, giving wki (a scalar). Matrices will be denoted by
capital letters.
In light of the above conventions we will denote the scalar product simply as

wk · x = Σ_i wki xi        (A.1)

with the summation over i being implied.
Bibliography
[1] M. V. Altaisky. Quantum Neural Network, 2001.
[2] T. Brun, H. Klauck, A. Nayak, M. Rötteler, and Ch. Zalka. Comment on ‘Probabilistic
Quantum Memories’. Phys. Rev. Lett., 91(20):209801, Nov 2003.
[3] Soren Brunak and Benny Lautrup. Neural Networks: Computers With Intuition. World
Scientific Publishing Co Pte Ltd, 1989.
[4] Collaborative. Quantiki. http://www.quantiki.org.
[5] Margarita Díaz-Andreu, Richard Hobbs, Nick Rosser, Kate Sharpe, and Trinks. Long Meg: Rock Art
Recording Using 3D Laser Scanning. Past (The Newsletter of the Prehistoric Society), (50):2–6, 2005.
[6] Alexandr Ezhov and Dan Ventura. Quantum Neural Networks. Future Directions for
Intelligent Systems and Information Science 2000, 2000.
[7] Neil Fraser. Neural Network Follies, 2003. http://neil.fraser.name/writing/tank/.
[8] Ugur Halici. Artificial neural networks, 2004. http://vision1.eee.metu.edu.tr/ halici/courses/543LectureNotes/
[9] Simon Haykin. Neural Networks, A Comprehensive Foundation. Prentice Hall International
Incorporated, 1999.
[10] Raymond Hill. A First Course In Coding Theory. Oxford University Press, unknown.
[11] J.J. Hopfield, D.I. Feinstein, and R.G. Palmer. ‘Unlearning’ has a Stabilizing Effect in
Collective Memories. Nature, 304:158–159, Jul 1983.
[12] Morali Kota. Quantum Entanglement as a resource for Quantum Communication. Technical report,
Massachusetts Institute of Technology, 2002. http://www.cs.caltech.edu/cbss/2002/pdf/quantum morali.pdf.
[13] David J. C. Mackay. Information Theory, Inference and Learning Algorithms. Cambridge
University Press, 2004.
[14] Tammy Menneer and Ajit Narayanan.
Quantum-inspired Neural Networks, 1995.
http://citeseer.ist.psu.edu/menneer95quantuminspired.html.
[15] Michael Nielsen and Isaac Chuang. Quantum Computation and Quantum Information.
Cambridge University Press, 2000.
[16] D. Petz. Quantum Source Coding and Data Compression. To be published in the proceedings of Conference on Search and Communication Complexity.
[17] Bob Ricks and Dan Ventura. Training a Quantum Neural Network. Neural Information
Processing Systems, pages 1019–1026, Dec 2003.
[18] Peter Shor. Scheme for Reducing Decoherence in Quantum Computer Memory. Physical
Review A, 52:2493–2496, 1995.
[19] Andrew Steane. Quantum Computing, 1997. arXiv:quant-ph/9708022v2.
[20] C. A. Trugenberger. Probabilistic Quantum Memories. Phys. Rev. Lett., 87(6):067901, Jul
2001.
[21] C. A. Trugenberger. Reply to Comment on Probabilistic Quantum Memories. Phys. Rev.
Lett., 91(20):209802, Nov 2003.
[22] Vlatko Vedral. An Introduction to Quantum Information Science. Oxford University Press,
2006.
[23] Dan Ventura. On the Utility of Entanglement in Quantum Neural Computing. In International Joint Conference on Neural Networks, pages 286–295, July 2001.
[24] Dan Ventura and Tony R. Martinez. Quantum Associative Memory. Information Sciences,
1-4:273–296, 2000.
[25] Li Weigang. Quantum Neural Computing Study. http://www.cic.unb.br/ weigang/qc/aci.html,
accessed on 09/07/2007.
[26] W. K. Wootters and W. H. Zurek. A Single Quantum Cannot Be Cloned. Nature, 299:802–
803, 1982.