Application of Quantum Computing principles
to Natural Language Processing
B.Tech Project Report
Submitted in partial fulfillment of the requirements for the degree of
Bachelor of Technology (Honors)
by
Vipul Singh
Roll No : 100050057
under the guidance of
Prof. Pushpak Bhattacharyya
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
Acknowledgement
First and foremost, I express my sincere gratitude towards my guide Prof. Pushpak Bhattacharyya for his guidance and for the freedom he has been providing
us for our research work. He is a daily source of inspiration for me to strive harder
in the pursuit of my research goals.
Next, I would like to thank Prof. Pranab Sen, Tata Institute of Fundamental
Research, Mumbai and Prof. Avatar Tathagat Tulsi, IIT Bombay for their valuable inputs on our work and help with designing the Quantum Viterbi algorithm.
Next, I am really thankful to my batchmate Dikkala Sai Nishanth for being a great
colleague in this journey of learning. I am grateful to him for being a co-operative
co-learner and partner in this project. Last but not least, I would like to thank
my family, friends and teachers for their love and kind support.
Abstract
The discovery of quantum mechanics has led to some radical changes in the theory of computation. A quantum theory of computing has emerged and has been
applied to give fascinating theoretical results, even for problems that are intractable on classical computers. With quantum computers being a part of the foreseeable future, it is definitely
worthwhile to take a look at whether they can speed up the existing algorithms for
common tasks in Natural Language Processing (NLP).
This thesis gives a description of the principles on which quantum computing is
based, namely qubits, their superposition and the process of measurement after the
application of quantum operations or gates, and also some of the above-mentioned
results/algorithms. Then, we explore some search methods pertaining to Machine
Learning and Natural Language Processing and see if these can be integrated into
the world of quantum computing. Of particular interest to us has been the problem
of Part-of-Speech (POS) tagging for which we develop a quantum counterpart to
the classical Viterbi algorithm. We provide results pertaining to our implementation of the
same on the British National Corpus (BNC).
Closely related to POS tagging is machine translation among similar languages, for which our quantum counterpart gives a large reduction in the
running time of the Viterbi algorithm. Following this, we foray into the realm of
quantum ideas applied to other intelligence tasks, for example, quantum random
walks for the A-star search algorithm.
Contents

1 Introduction
  1.1 Motivation
  1.2 Aim of the Thesis
  1.3 Experimental Setup
  1.4 Road Map

2 Quantum Computing Principles
  2.1 Qubit - The Quantum Bit
    2.1.1 Bits vs. Qubits
    2.1.2 Superposition
    2.1.3 Representation
  2.2 Quantum States
    2.2.1 Entanglement
    2.2.2 Registers
  2.3 Operators - Quantum Gates
    2.3.1 Reversible Logic Gates
    2.3.2 Matrix Operator Correspondence
    2.3.3 Commonly used gates
    2.3.4 Quantum Fourier Transform
  2.4 Measurement in Quantum Mechanics
    2.4.1 A Qualitative Overview
    2.4.2 The Quantitative Overview
    2.4.3 Collapsing of States

3 Classical Optimization and Search Techniques
  3.1 Hidden Markov Model
    3.1.1 Stochastic Process
    3.1.2 Markov Property and Markov Modelling
    3.1.3 The urn example
    3.1.4 Formal Description of the Hidden Markov Model
    3.1.5 The Trellis Diagram
    3.1.6 Formulating the Part-of-Speech tagging problem using HMM
    3.1.7 The Viterbi Algorithm
    3.1.8 Pseudocode
  3.2 Maximum Entropy Approach
    3.2.1 Entropy - Thermodynamic and Information
    3.2.2 The Maximum Entropy Model
    3.2.3 Application to Statistical Machine Learning
  3.3 The ME Principle and a Solution
    3.3.1 Proof for the ME Formulation
    3.3.2 Generalized Iterative Scaling
  3.4 Improved Iterative Scaling
    3.4.1 The Model in parametric form
    3.4.2 Maximum Likelihood
    3.4.3 The objective to optimize
    3.4.4 Deriving the iterative step
  3.5 Swarm Intelligence
    3.5.1 Foundations
    3.5.2 Example Algorithms and Applications
    3.5.3 Case Study: Ant Colony Optimization applied to the NP-hard Travelling Salesman Problem
  3.6 Boltzmann Machines
    3.6.1 Structure
    3.6.2 Probability of a state
    3.6.3 Equilibrium State

4 Some popular Quantum Computing Ideas
  4.1 Deutsch-Jozsa Algorithm
    4.1.1 Problem Statement
    4.1.2 Motivation and a Classical Approach
    4.1.3 The Deutsch Quantum Algorithm
  4.2 Shor's Algorithm
    4.2.1 The factorization problem
    4.2.2 The integers mod n
    4.2.3 A fast classical algorithm for modular exponentiation
    4.2.4 Reduction of the Factorization problem
    4.2.5 The Algorithm
    4.2.6 An example factorization
  4.3 Grover's Algorithm
    4.3.1 The search problem
    4.3.2 The Oracle
    4.3.3 The Grover Iteration
    4.3.4 Performance of the algorithm
    4.3.5 An example
  4.4 The Quantum Minimum Algorithm
    4.4.1 The Problem
    4.4.2 The Algorithm
    4.4.3 Running Time and Precision
  4.5 Quantum Walks
    4.5.1 Random Walks
      An example: A one-dimensional random walk
    4.5.2 Terminology used with Random Walks
    4.5.3 Quantum Analogue: Quantum Markov Chains or Quantum Walks
    4.5.4 Application to Element-Distinctness Problem

5 Quantum Computing and Intelligence Tasks
  5.1 Quantum Classification
    5.1.1 Learning in a Quantum World
    5.1.2 The Helstrom Oracle
    5.1.3 Binary Classification
    5.1.4 Weighted Binary Classification
  5.2 Quantum Walk for A-star search
    5.2.1 The A∗ Algorithm
      The Heart of A∗: The Heuristic Function
    5.2.2 A Quantum Approach?

6 The Quantum Viterbi
  6.1 The Approach
    6.1.1 Can Grover be used?
  6.2 The Algorithm
    6.2.1 The Classical Version
    6.2.2 Quantum exponential searching
    6.2.3 The Grover Iteration
    6.2.4 The Quantum Approach to Viterbi
  6.3 Experimental Results
    6.3.1 Implementation
    6.3.2 Results
    6.3.3 Tag-wise Precision and Recall Analysis
    6.3.4 Concluding Remarks

7 Machine Translation among Close Languages
  7.1 Machine Translation
    7.1.1 What is machine translation?
    7.1.2 How does machine translation work?
    7.1.3 Advantages of machine translation
  7.2 Similarity to POS tagging for close languages
    7.2.1 The izafat phenomenon
  7.3 Phrase-Book Translation
  7.4 Experiments and Results
    7.4.1 Training corpus
    7.4.2 Issues
    7.4.3 Results
    7.4.4 Analysis

8 Conclusions

9 Future Work
List of Figures

2.1 Sphere representation for a qubit in the state: α = cos(θ/2) and β = e^(iφ) sin(θ/2)
2.2 Circuit representation of Hadamard, CNOT and Toffoli gates, respectively
3.1 An example Hidden Markov Model with three urns
3.2 An Example Trellis
3.3 The Pareto Optimal frontier is the set of hollow points. Operational decisions must be restricted along this set if operational efficiency is to be maintained
3.4 The Pareto hypervolume
3.5 Search process for m=1000 ants
3.6 Search process for m=5000 ants
3.7 Graphical representation for a Boltzmann machine with a few labelled weights
5.1 Start State for the 8-puzzle problem
5.2 Goal State for the 8-puzzle problem
Chapter 1
Introduction
1.1
Motivation
With the development of quantum mechanics, new paradigms have opened up in
many sciences. One such paradigm is a novel way of performing computation
directly using quantum mechanical principles. Quantum computing looks at the act
of computing from a viewpoint that is radically different from classical theories of
computation, the most popular among the latter being the Turing machine model.
Once a quantum theory of computing was developed, the next task was to develop
algorithms for various problems using quantum computing which brings us to the
focus of this thesis. A close relation between quantum mechanics, natural language
processing and the functioning of the mind has been proposed by many previous
works [4, 8]. In this thesis, we study further how quantum computing can be used
to give efficient algorithms for common problems which are a part of NLP.
1.2
Aim of the Thesis
The aim of the thesis was to develop quantum computing algorithms for classical
tasks in Natural Language Processing. Although these algorithms would need a
quantum computer to be actually implemented in practice, here we intend to perform a theoretical study of such algorithms ignoring the implementation part for the
time being, as is the case with most existing quantum algorithms such as Grover [3]
and Shor [2]. One problem we studied was that of part-of-speech tagging. Given
labeled data (sentences with the part-of-speech tags of the words given), a popular
classical algorithm for solving this is the Viterbi algorithm [13] (it assumes that the data follows a bigram Markov model). In Chapter 6 we present
a quantum version of the Viterbi algorithm which runs faster than the classical
Viterbi algorithm.
1.3
Experimental Setup
We also present a discussion of accuracy and precision results obtained from running a classical simulation of the quantum Viterbi algorithm on the BNC English
Corpus, which has a large tag set of 57 tags. The experimental setup is briefly described
here. Firstly, since we are simulating a quantum algorithm on a classical machine
we face an exponential blow-up in time which will be described in detail in Chapter
6. To combat this blow-up we ran the algorithm for different folds of the corpus as
different processes on a multi-core machine after performing compiler optimizations. Also, we had to cut down on the suggested number of iterations
to reduce execution time, losing some amount of accuracy in the process.
1.4
Road Map
The layout of this thesis is as follows. First the principles of quantum computing
are presented which the familiar reader can skip. This chapter is then followed by
a broad analysis of various search and optimization techniques used classically in
NLP. The intent of this section is two-fold. First, the unfamiliar reader can familiarize himself with the presented techniques. Second, we study the theoretical
aspects of these techniques in detail to gain deeper insight into how the quantum
versions of them might be developed.
Then we go on to study some popular quantum computing algorithms for classical problems such as factoring and searching in an unsorted database. We present
the quantum nature of these algorithms and where they depart from the limitations
imposed by classical computation models.
After this, we move on to look at some recent work in quantum algorithms for
classification and we look at the development of a quantum A∗ algorithm. Next, we
present the chapter on the Quantum Viterbi algorithm where we give the quantum
algorithm we have developed and also describe the results of the performance of a
simulation of the algorithm for Part-of-Speech tagging.
Next we look at the problem of Machine Translation (in Chapter 7) which is
a hallmark problem of NLP. We note that machine translation of close languages
such as Hindi-Urdu and Hindi-Marathi is simplified by the fact that most sentence translations are simply word-by-word replacements, hence allowing the problem to be
modelled as a POS tagging problem and making it amenable to the Viterbi algorithm. We also present a study we have undertaken with a relatively
small parallel corpus of Hindi-Urdu sentences and present the results of the quantum
Viterbi approach to machine translation on this corpus. We end with the conclusions and insights we have gathered from our study and the direction in which
future work will proceed.
Chapter 2
Quantum Computing Principles
The massive amount of processing power generated by computer manufacturers
has not yet been able to quench our thirst for speed and computing capacity. In
1947, American computer engineer Howard Aiken said that just six electronic digital computers would satisfy the computing needs of the United States. Others have
made similar errant predictions about the amount of computing power that would
support our growing technological needs. Of course, Aiken didn’t count on the
large amounts of data generated by scientific research, the proliferation of personal
computers or the emergence of the Internet, which have only fuelled our need for
more, more and more computing power.
Will we ever have the amount of computing power we need or want? If, as
Moore’s Law states, the number of transistors on a microprocessor continues to
double every 18 months, the year 2020 or 2030 will find the circuits on a microprocessor measured on an atomic scale. And the logical next step will be to
create quantum computers, which will harness the power of atoms and molecules
to perform memory and processing tasks. Quantum computers have the potential
to perform certain calculations significantly faster than any silicon-based computer.
Scientists have already built basic quantum computers that can perform certain
calculations; but a practical quantum computer is still years away. In this chapter,
we explore what a quantum computer is and how it operates.
2.1
Qubit - The Quantum Bit
In quantum computing, a qubit or quantum bit is a unit of quantum information:
the quantum analogue of the classical bit.
2.1.1
Bits vs. Qubits
A bit is the basic unit of information. It is used to represent information by computers. Regardless of its physical realization, a bit is always understood to be either
a 0 or a 1. An analogy to this is a light switch with the off position representing 0
and the on position representing 1.
A qubit is a two-state quantum-mechanical system, such as the polarization of
a single photon: here the two states are vertical polarization and horizontal polarization. It has a few similarities to a classical bit, but is overall very different. Like
a bit, a qubit can have two possible values, normally a 0 or a 1. The difference is
that whereas a bit must be either 0 or 1, a qubit can be 0, 1, or a superposition of
both.
2.1.2
Superposition
Think of a qubit as an electron in a magnetic field. The electron’s spin may be
either in alignment with the field, which is known as a spin-up state, or opposite to
the field, which is known as a spin-down state. Changing the electron’s spin from
one state to another is achieved by using a pulse of energy, such as from a laser; let's say that we use 1 unit of laser energy. But what if we only use half a unit of
laser energy and completely isolate the particle from all external influences? According to quantum law, the particle then enters a superposition of states, in which
it behaves as if it were in both states simultaneously. Each qubit utilized could take
a superposition of both 0 and 1.
The principle of quantum superposition states that if a physical system may be
in one of many configurations (arrangements of particles or fields), then the most
general state is a combination of all of these possibilities, where the amount in
each configuration is specified by a complex number.
2.1.3
Representation
The two states in which a qubit may be measured are known as basis states (or
basis vectors). As is the tradition with any sort of quantum states, Dirac, or bra-ket
notation, is used to represent them. This means that the two computational basis
states are conventionally written as |0⟩ and |1⟩ (pronounced "ket 0" and "ket 1").
A pure qubit state is a linear quantum superposition of the basis states. This means
that the qubit can be represented as a linear combination of |0⟩ and |1⟩:
|ψ⟩ = α|0⟩ + β|1⟩
where α and β are probability amplitudes and can in general both be complex numbers.
The possible states for a single qubit can be visualised using a Bloch sphere
as shown in Figure 2.1.¹ Represented on such a sphere, a classical bit could only
be at the "North Pole" or the "South Pole", in the locations where |0⟩ and |1⟩ are,
respectively. The rest of the surface of the sphere is inaccessible to a classical bit,
but a pure qubit state can be represented by any point on the surface. For example,
the pure qubit state (|0⟩ + i|1⟩)/√2 would lie on the equator of the sphere, on the
positive y-axis.
Figure 2.1: Sphere representation for a qubit in the state: α = cos(θ/2) and β = e^(iφ) sin(θ/2)
2.2
Quantum States
2.2.1
Entanglement
An important distinguishing feature between a qubit and a classical bit is that
multiple qubits can exhibit quantum entanglement. Entanglement is a non-local
property that allows a set of qubits to express higher correlation than is possible
in classical systems. Take, for example, two entangled qubits in the Bell state
(1/√2)(|00⟩ + |11⟩).
Imagine that these two entangled qubits are separated, with one each given
to Alice and Bob. Alice makes a measurement of her qubit, obtaining |0⟩ or |1⟩.
¹ Source: http://en.wikipedia.org/wiki/Bloch_sphere
Because of the qubits' entanglement, Bob must now get exactly the same measurement as Alice; i.e., if she measures a |0⟩, Bob must measure the same, as |00⟩ is
the only state where Alice's qubit is a |0⟩.
This is a real phenomenon (Einstein called it "spooky action at a distance"),
the mechanism of which cannot, as yet, be explained by any theory - it simply
must be taken as given. Quantum entanglement allows qubits that are separated by
incredible distances to interact with each other instantaneously (not limited to the
speed of light). No matter how great the distance between the correlated particles,
they will remain entangled as long as they are isolated.
Entanglement also allows multiple states (such as the Bell state mentioned
above) to be acted on simultaneously, unlike classical bits that can only have one
value at a time. Entanglement is a necessary ingredient of any quantum computation that cannot be done efficiently on a classical computer. Many of the successes
of quantum computation and communication, such as quantum teleportation and
superdense coding, make use of entanglement, suggesting that entanglement is a
resource that is unique to quantum computation.
2.2.2
Registers
A number of entangled qubits taken together is a qubit register. Quantum computers perform calculations by manipulating qubits within a register. An example of a
3-qubit register:
Consider first a classical computer that operates on a three-bit register. The
state of the computer at any time is a probability distribution over the 2³ = 8 different three-bit strings 000, 001, 010, 011, 100, 101, 110, 111. If it is a deterministic
computer, then it is in exactly one of these states with probability 1. However, if
it is a probabilistic computer, then there is a possibility of it being in any one of a
number of different states. We can describe this probabilistic state by eight nonnegative numbers A,B,C,D,E,F,G,H (where A = probability computer is in state
000, B = probability computer is in state 001, etc.). There is a restriction that these
probabilities sum to 1.
The state of a three-qubit quantum computer is similarly described by an eight-dimensional vector (a,b,c,d,e,f,g,h), called a ket. However, instead of the sum of
the coefficient magnitudes adding up to one, the sum of the squares of the coefficient magnitudes, |a|² + |b|² + ... + |h|², must equal one. Moreover, the coefficients
can have complex values. Since the absolute squares of these complex-valued coefficients denote the probabilities of the corresponding states, the phase between any two
coefficients (states) represents a meaningful parameter, which presents a fundamental difference between quantum computing and probabilistic classical computing.
Now, an eight-dimensional vector can be specified in many different ways depending on what basis is chosen for the space. The basis of bit strings (e.g., 000,
001, ..., 111) is known as the computational basis. Other possible bases are unit-length, orthogonal vectors, etc. Ket notation is often used to make the choice of
basis explicit.
For example, the state (a,b,c,d,e,f,g,h) in the computational basis can be written
as: a|000⟩ + b|001⟩ + c|010⟩ + d|011⟩ + e|100⟩ + f|101⟩ + g|110⟩ + h|111⟩ where,
e.g., |010⟩ = (0, 0, 1, 0, 0, 0, 0, 0).
Similarly, the computational basis for a single qubit (two dimensions) is |0⟩ =
(1, 0) and |1⟩ = (0, 1).
Taken together, quantum superposition and entanglement create an enormously
enhanced computing power. Where a 2-bit register in an ordinary computer can
store only one of four binary configurations (00, 01, 10, or 11) at any given time, a
2-qubit register in a quantum computer can store all four numbers simultaneously,
because each qubit represents two values. If more qubits are added, the capacity
expands exponentially.
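As a concrete illustration of the above (a minimal sketch of our own, not part of the thesis; the 3-qubit state used here is hypothetical), the following Python/NumPy snippet stores the eight amplitudes of a 3-qubit register and checks that their squared magnitudes sum to one:
import numpy as np

# A hypothetical 3-qubit state in the computational basis:
# |psi> = (1/sqrt(2))|000> + (1/2)|011> + (1/2)|111>
# Index k corresponds to the bit string of k (000 -> 0, ..., 111 -> 7).
psi = np.zeros(8, dtype=complex)
psi[0b000] = 1 / np.sqrt(2)
psi[0b011] = 1 / 2
psi[0b111] = 1 / 2

# The squared magnitudes of the amplitudes must sum to 1 ...
probabilities = np.abs(psi) ** 2
assert np.isclose(probabilities.sum(), 1.0)

# ... and give the probability of observing each basis state on measurement.
for k, p in enumerate(probabilities):
    if p > 0:
        print(f"|{k:03b}> observed with probability {p:.3f}")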
2.3
Operators - Quantum Gates
2.3.1
Reversible Logic Gates
Ordinarily, in a classical computer, the logic gates other than the NOT gate are not
reversible. Thus, for instance, for an AND gate one cannot recover the two input
bits from the output bit; for example, if the output bit is 0, we cannot tell from this
whether the input bits are 0,1 or 1,0 or 0,0.
In quantum computing and specifically the quantum circuit model of computation, a quantum gate (or quantum logic gate) is a basic quantum circuit operating
on a small number of qubits. They are the building blocks of quantum circuits, like
classical logic gates are for conventional digital circuits. Unlike many classical
logic gates, quantum logic gates are reversible. However, classical computing
can be performed using only reversible gates. For example, the reversible Toffoli
gate can implement all Boolean functions. This gate has a direct quantum equivalent, showing that quantum circuits can perform all operations performed by
classical circuits.
2.3.2
Matrix Operator Correspondence
We can treat an n-qubit state as a vector consisting of 2^n complex numbers, each
representing the coefficient of a state from the computational basis. Now, a gate
operates on such a state and yields another of the same dimension. So, a gate can
be seen as a function that transforms a 2^n-dimensional vector to another. Hence,
in the vector-matrix representation in n-qubit space, a gate is a square matrix of
dimension 2^n, whose i-th column is the vector that results when we apply the gate
on the i-th element of the computational basis.
For a quantum computer gate, we require a very special kind of reversible function, namely a unitary mapping, that is, a mapping on the state-space that preserves
the inner product. So, if H is a gate and |ψ⟩ and |φ⟩ represent two quantum states in
n-qubit space, then |ψ′⟩ = H|ψ⟩ and |φ′⟩ = H|φ⟩ will also be n-qubit states and will
satisfy the property that ⟨ψ′|φ′⟩ = ⟨ψ|φ⟩, where ⟨..|..⟩ denotes the inner product
in bra-ket notation.
Hence, quantum logic gates are represented by unitary matrices. Note: a complex square matrix U is unitary if U*U = UU* = I, where I is the identity matrix
and U* is the conjugate transpose of U. The real analogue of a unitary matrix is an
orthogonal matrix.
The most common quantum gates operate on spaces of one, two or three qubits.
This means that as matrices, quantum gates can be described by 2×2, 4×4, or
8×8 unitary matrices.
2.3.3
Commonly used gates
Quantum gates are usually represented as matrices. A gate which acts on k qubits
is represented by a 2^k × 2^k unitary matrix. The number of qubits in the input and
output of the gate have to be equal. The action of the quantum gate is found by
multiplying the matrix representing the gate with the vector which represents the
quantum state.
• Hadamard gate
The Hadamard gate acts on a single qubit. It maps the basis state |0⟩ to
(|0⟩ + |1⟩)/√2 and |1⟩ to (|0⟩ − |1⟩)/√2, and represents a rotation of π about the
axis (x̂ + ẑ)/√2. It is represented by the Hadamard matrix:
H = (1/√2) [ 1  1 ]
           [ 1 −1 ]
Since HH* = I, where I is the identity matrix, H is indeed a unitary matrix.
• Controlled Gates
Controlled gates act on 2 or more qubits, where one or more qubits act as
a control for some operation. For example, the controlled NOT gate (or
CNOT) acts on 2 qubits, and performs the NOT operation on the second
qubit only when the first qubit is |1⟩, and otherwise leaves it unchanged. It
is represented by the matrix:
CNOT = [ 1 0 0 0 ]
       [ 0 1 0 0 ]
       [ 0 0 0 1 ]
       [ 0 0 1 0 ]
More generally, if U is a gate that operates on single qubits with matrix representation
U = [ x00 x01 ]
    [ x10 x11 ],
then the controlled-U gate is a gate that operates on two qubits in such a way
that the first qubit serves as a control. It maps the basis states as follows:
|00⟩ ↦ |00⟩
|01⟩ ↦ |01⟩
|10⟩ ↦ |1⟩U|0⟩ = |1⟩(x00|0⟩ + x10|1⟩)
|11⟩ ↦ |1⟩U|1⟩ = |1⟩(x01|0⟩ + x11|1⟩)
The matrix representing the controlled U is:
C(U) = [ 1 0  0    0   ]
       [ 0 1  0    0   ]
       [ 0 0  x00  x01 ]
       [ 0 0  x10  x11 ]
Figure 2.2: Circuit representation of Hadamard, CNOT and Toffoli gates, respectively
• Toffoli Gate
The Toffoli gate, also CCNOT gate, is a 3-bit gate, which is universal for
classical computation. The quantum Toffoli gate is the same gate, defined
for 3 qubits. If the first two bits are in the state |1⟩, it applies a Pauli-X
(bit inversion) on the third bit, else it does nothing. It is an example of a
controlled gate. It swaps the states |110⟩ and |111⟩; it is an identity map for
the other 6 states in the computational basis for a 3-qubit space. The matrix
representation is:
[ 1 0 0 0 0 0 0 0 ]
[ 0 1 0 0 0 0 0 0 ]
[ 0 0 1 0 0 0 0 0 ]
[ 0 0 0 1 0 0 0 0 ]
[ 0 0 0 0 1 0 0 0 ]
[ 0 0 0 0 0 1 0 0 ]
[ 0 0 0 0 0 0 0 1 ]
[ 0 0 0 0 0 0 1 0 ]
It can also be described as the gate which maps |a, b, c⟩ to |a, b, c ⊕ ab⟩.
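The gate matrices above are easy to experiment with numerically. The following sketch (our own illustration, assuming NumPy; it is not part of the original text) builds the Hadamard and CNOT matrices, checks unitarity, and applies them to basis states:
import numpy as np

H = (1 / np.sqrt(2)) * np.array([[1, 1],
                                 [1, -1]], dtype=complex)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

def is_unitary(U):
    # U is unitary iff U U* = I, where U* is the conjugate transpose.
    return np.allclose(U @ U.conj().T, np.eye(U.shape[0]))

assert is_unitary(H) and is_unitary(CNOT)

# Applying a gate = multiplying its matrix with the state vector.
ket0 = np.array([1, 0], dtype=complex)   # |0>
plus = H @ ket0                          # (|0> + |1>)/sqrt(2)

# A 2-qubit state is the Kronecker product of single-qubit states;
# CNOT applied to (H|0>) (x) |0> yields the Bell state (|00> + |11>)/sqrt(2).
bell = CNOT @ np.kron(plus, ket0)
print(np.round(bell, 3))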
2.3.4
Quantum Fourier Transform
This is a linear transformation on quantum bits, and is the quantum analogue of the
discrete Fourier transform. The quantum Fourier transform is a part of many quantum algorithms, notably Shor’s algorithm for factoring and computing the discrete
logarithm, the quantum phase estimation algorithm for estimating the eigenvalues
of a unitary operator, and algorithms for the hidden subgroup problem.
The quantum Fourier transform can be performed efficiently on a quantum
computer, with a particular decomposition into a product of simpler unitary matrices. Using a simple decomposition, the discrete Fourier transform can be implemented as a quantum circuit consisting of only O(n²) Hadamard gates and controlled phase shift gates, where n is the number of qubits. This can be compared
with the classical discrete Fourier transform, which takes O(n·2^n) gates (where n
is the number of bits), which is exponentially more than O(n²).
The quantum Fourier transform is the classical discrete Fourier transform applied to the vector of amplitudes of a quantum state. The classical (unitary) Fourier
transform acts on a vector (x_0, ..., x_{N−1}) and maps it to the vector (y_0, ..., y_{N−1})
according to the formula:
y_k = (1/√N) Σ_{j=0}^{N−1} x_j ω^(jk)
where ω = e^(2πi/N) is a primitive N-th root of unity.
Similarly, the quantum Fourier transform acts on a quantum state Σ_{i=0}^{N−1} x_i |i⟩ and
maps it to a quantum state Σ_{i=0}^{N−1} y_i |i⟩ according to the formula:
y_k = (1/√N) Σ_{j=0}^{N−1} x_j ω^(jk)
This can also be expressed as the map
|j⟩ ↦ (1/√N) Σ_{k=0}^{N−1} ω^(jk) |k⟩
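As a small numeric check of this definition (an illustrative sketch of ours, assuming NumPy), the quantum Fourier transform of an amplitude vector can be computed directly from the formula and compared against NumPy's inverse FFT, which uses the same positive sign convention but a 1/N factor instead of 1/√N:
import numpy as np

def qft(amplitudes):
    # Build F[k, j] = omega^(j*k) / sqrt(N) with omega = exp(2*pi*i/N)
    # and apply it to the amplitude vector.
    x = np.asarray(amplitudes, dtype=complex)
    N = len(x)
    omega = np.exp(2j * np.pi / N)
    F = np.array([[omega ** (j * k) for j in range(N)]
                  for k in range(N)]) / np.sqrt(N)
    return F @ x

# QFT(x) = sqrt(N) * ifft(x) under NumPy's conventions.
x = np.array([1, 0, 0, 0, 0, 0, 0, 1], dtype=complex) / np.sqrt(2)
assert np.allclose(qft(x), np.sqrt(len(x)) * np.fft.ifft(x))
assert np.isclose(np.sum(np.abs(qft(x)) ** 2), 1.0)   # the QFT is unitary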
Equivalently, the quantum Fourier transform on an n-qubit vector (N = 2^n) can
be viewed as a unitary matrix acting on quantum state vectors, where the unitary
matrix F_N is given by
F_N = (1/√N) ×
[ 1  1        1           1           ···  1              ]
[ 1  ω        ω²          ω³          ···  ω^(N−1)        ]
[ 1  ω²       ω⁴          ω⁶          ···  ω^(2(N−1))     ]
[ 1  ω³       ω⁶          ω⁹          ···  ω^(3(N−1))     ]
[ ⋮  ⋮        ⋮           ⋮           ⋱    ⋮              ]
[ 1  ω^(N−1)  ω^(2(N−1))  ω^(3(N−1))  ···  ω^((N−1)(N−1)) ]
2.4
Measurement in Quantum Mechanics
2.4.1
A Qualitative Overview
One of the most difficult and controversial problems in quantum mechanics is the
so-called measurement problem. Opinions on the significance of this problem vary
widely. At one extreme the attitude is that there is in fact no problem at all, while
at the other extreme the view is that the measurement problem is one of the great
unsolved puzzles of quantum mechanics. The issue is that quantum mechanics
only provides probabilities for the different possible outcomes in an experiment; it provides no mechanism by which the actual, finally observed result
comes about. Of course, probabilistic outcomes feature in many areas of classical
physics as well, but in that case, probability enters the picture simply because there
is insufficient information to make a definite prediction. In principle, that missing
information is there to be found, it is just that accessing it may be a practical impossibility. In contrast, there is no missing information for a quantum system: what
we see is all that we can get, even in principle.
In Dirac's words: "The intermediate character of the state formed by superposition thus expresses itself through the probability of a particular result for an observation being intermediate between the corresponding probabilities for the original
states, not through the result itself being intermediate between the corresponding
results for the original states."
2.4.2
The Quantitative Overview
For an ideal measurement in quantum mechanics, also called a von Neumann measurement, the only possible measurement outcomes are equal to the eigenvalues
(say k) of the operator representing the observable. Consider a system prepared in
state |ψ⟩. Since the eigenstates of the observable Ô form a complete basis called the
eigenbasis, the state vector |ψ⟩ can be written in terms of the eigenstates as
|ψ⟩ = c_1|1⟩ + c_2|2⟩ + c_3|3⟩ + · · ·
where c_1, c_2, . . . are complex numbers in general. The eigenvalues O_1, O_2, O_3, ...
are all possible values of the measurement. The corresponding probabilities are
given by
Pr(O_n) = |⟨n|ψ⟩|² / ⟨ψ|ψ⟩ = |c_n|² / Σ_k |c_k|²
Usually |ψ⟩ is assumed to be normalized, i.e. ⟨ψ|ψ⟩ = 1. Therefore, the expression
above reduces to
Pr(O_n) = |⟨n|ψ⟩|² = |c_n|².
A quantum computer operates by setting the n qubits in a controlled initial state
that represents the problem at hand and by manipulating those qubits with a fixed
sequence of quantum logic gates. The sequence of gates to be applied is called a
quantum algorithm. The calculation ends with measurement of all the states, collapsing each qubit into one of the two pure states, so the outcome can be at most n
classical bits of information.
For example, if we prepare a 2-qubit system in the state |ψ⟩ = (1/√2)|00⟩ +
(1/√3)|01⟩ + (1/√6)|11⟩, then a measurement on the system will yield results corresponding to the state |00⟩ with probability 1/2, state |01⟩ with probability 1/3, and state
|11⟩ with probability 1/6.
Partial measurement We can even perform a measurement on just one register.
Then, the probability of the state |0⟩ being measured on the register is just a sum
of the probabilities of all states wherein this particular register is in the |0⟩ state.
So, in the above example, a measurement on the first register will yield |0⟩ with
probability 1/2 + 1/3 = 5/6.
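The measurement rules above are easy to reproduce numerically. The sketch below (our own illustration; the state is the 2-qubit example used in the text) computes the full-measurement probabilities, the partial-measurement probability for the first register, and the renormalized post-measurement state discussed in the next subsection:
import numpy as np

# |psi> = (1/sqrt(2))|00> + (1/sqrt(3))|01> + (1/sqrt(6))|11>
amps = {"00": 1/np.sqrt(2), "01": 1/np.sqrt(3), "10": 0.0, "11": 1/np.sqrt(6)}

# Full measurement: Pr(outcome) = |amplitude|^2.
probs = {s: abs(a) ** 2 for s, a in amps.items()}
assert np.isclose(sum(probs.values()), 1.0)
print(probs)   # roughly 1/2, 1/3, 0, 1/6

# Partial measurement of the first qubit: sum the probabilities of all
# basis states whose first bit is 0.
p_first_is_0 = sum(p for s, p in probs.items() if s[0] == "0")
print(p_first_is_0)   # 5/6

# Collapse after observing 0 on the first qubit: keep the consistent states
# and renormalize, preserving their mutual ratios.
post = {s: a for s, a in amps.items() if s[0] == "0"}
norm = np.sqrt(sum(abs(a) ** 2 for a in post.values()))
post = {s: a / norm for s, a in post.items()}
print({s: abs(a) ** 2 for s, a in post.items()})   # 3/5 for |00>, 2/5 for |01>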
2.4.3
Collapsing of States
A postulate of quantum mechanics states that the process of measurement formally
causes an instantaneous collapse of the quantum state to the eigenstate corresponding to the measured value of the observable. A consequence of this is that the
results of a subsequent measurement are essentially unrelated to the form of the pre-collapse quantum state (unless the eigenstates of the operators representing the
observables coincide). So, in the example mentioned in the previous subsection,
if a measurement on the system had yielded the result corresponding to eigenstate
|00⟩, then all subsequent measurements would have given the same result too because the system would have collapsed to this state.
The scenario is slightly different in the case of partial measurement. Here,
the measured register collapses entirely into a particular state and then, the states
that remain in the system must all have this register in the measured state. Also,
as expected, the mutual ratios of the probabilities associated with these states stay
conserved. So, in the example where we did a measurement on the first register
only, the resultant state would be
( (1/√2)|00⟩ + (1/√3)|01⟩ ) / √(1/2 + 1/3) = √(3/5)|00⟩ + √(2/5)|01⟩
Chapter 3
Classical Optimization and Search Techniques
In this chapter we discuss a few popular optimization techniques in use in current-day
natural language processing algorithms. First we present the Hidden Markov
Model (HMM) used for part-of-speech tagging (POS-tagging) among other tasks.
Then we formulate the POS-tagging problem using HMM and present its classical
solution which is due to the Viterbi algorithm.
Then we present the Maximum Entropy approach, which is a heuristic used
in problems related to finding probability distributions. Next up is the Maximum
Entropy Markov Model (MEMM), a discriminative model that extends a standard
maximum entropy classifier by assuming that the unknown values to be learnt are
connected in a Markov chain rather than being conditionally independent of each
other. MEMMs find applications in information extraction, segmentation and in
natural language processing, specifically in part-of-speech tagging.
This is followed by an overview of some methods namely Generalised Iterative
Scaling and an improved iterative version of it, which find use in solving for the
training objectives of many problems which use maximum likelihood estimation
on the training data to get the parameters.
Then comes the concept of swarm intelligence, which is inspired by the action
of insects such as ants. Finally we briefly discuss Boltzmann machines. They
were one of the first examples of a neural network capable of learning internal
representations, and are able to represent and (given sufficient time) solve difficult
combinatorial problems.
3.1
Hidden Markov Model
The Hidden Markov model is a stochastic model in which the system being modelled is assumed to be a Markov process with unobserved (hidden) states. A key
aspect of the HMM is its Markov property which is described in brief below along
with other background definitions required.
3.1.1
Stochastic Process
Definition 1. Stochastic Process: A stochastic process is a collection of random
variables often used to represent the evolution of some random value over time.
There is indeterminacy in a stochastic process. Even if we know the initial
conditions, the system can evolve in possibly many different ways.
3.1.2
Markov Property and Markov Modelling
Definition 2. Markov Property: A stochastic process has the Markov property if
the conditional probability distribution of future states of the process (conditional
on both past and present values) depends only upon the present state, not on the
sequence of events that preceded it. That is, the process is memoryless.
A Markov model is a stochastic model that follows the Markov property. Next
we present the HMM through the urn problem which eases the exposition. In a
further sub-section the formal description of the HMM is given.
3.1.3
The urn example
There are N urns, each containing balls of different colours mixed in known proportions. An urn is chosen and a ball is taken out of it. The colour of the ball is
noted and the ball is replaced. The choice of the urn from which the nth ball will
be picked is determined by a random number and the urn from which the (n − 1)th
ball was picked. Hence, the process becomes a Markov process.
The problem to be solved is the following: Given the ball colour sequence find
the underlying urn sequence. Here the urn sequence is unknown (hidden) from
us and hence the name Hidden Markov Model. The diagram¹ below shows the
architecture of an example HMM. The quantities marked on the transition arrows
represent the transition probabilities.
¹ Source: http://en.wikipedia.org/wiki/Hidden_Markov_model
Figure 3.1: An example Hidden Markov Model with three urns
3.1.4
Formal Description of the Hidden Markov Model
The hidden Markov model can be mathematically described as follows:
N = number of states
T = number of observations
θ_{i=1...N} = emission parameter for an observation associated with state i
φ_{i=1...N, j=1...N} = probability of transition from state i to state j
φ_{i=1...N} = N-dimensional vector, composed of φ_{i,1...N}; must sum to 1
x_{t=1...T} = state of observation at time t
y_{t=1...T} = observation at time t
F(y|θ) = probability distribution of an observation, parametrized on θ
x_{t=2...T} ∼ Categorical(φ_{x_{t−1}})
y_{t=1...T} ∼ F(θ_{x_t})
3.1.5
The Trellis Diagram
Given the set of states in the HMM, we can draw a linear representation of the state
transitions given an input sequence by repeating the set of states at every stage.
This gives us the trellis diagram. A sample trellis is shown in Figure 3.2.² Each
level of the trellis contains all the possible states and transitions from each state
onto the states in the next level. Along with every transition, an observation is
emitted simultaneously (in the figure a time unit is crossed and observations vary
with time).
² Source: Prof. Pushpak Bhattacharyya's lecture slides on HMM from the course CS 344 - Artificial Intelligence at IIT Bombay, spring 2013
Figure 3.2: An Example Trellis
3.1.6
Formulating the Part-of-Speech tagging problem using HMM
The POS tagging problem can be described as follows. We are given a sentence
which is a sequence of words. Each word has a POS tag which is unknown. The
task is to find the POS tags of each word and return the POS tag sequence corresponding to the sentence. Here the POS tags constitute the hidden states. As in the
urn problem, we again assume that words (balls) are emitted by POS tags (urns), a
property called the lexical assumption. That is, the probability of seeing a particular word depends only on the POS tag previously seen. Also, as was the case in the
urn problem, the probability of a word having a particular POS tag is dependent
only on the POS tag of the previous word (urn to urn probability). Having modelled the problem as given above, we need to explain how the transition tables are
constructed. The transition probabilities come from data. This is a data-driven approach to POS tagging, and using data on sentences which are already POS tagged
we construct the transition tables. Given this formulation, we next present an algorithm which given an input sentence and the transition tables outputs the most
probable POS tag sequence.
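To make the data-driven construction of the transition and emission tables concrete, here is a small counting sketch in Python (our own illustration; the two tagged sentences are hypothetical):
from collections import Counter, defaultdict

# Hypothetical POS-tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

transition_counts = defaultdict(Counter)   # counts of tag -> next tag
emission_counts = defaultdict(Counter)     # counts of tag -> word

for sentence in corpus:
    prev_tag = "^"                          # start-of-sentence tag, as in the text
    for word, tag in sentence:
        transition_counts[prev_tag][tag] += 1
        emission_counts[tag][word] += 1
        prev_tag = tag
    transition_counts[prev_tag]["$"] += 1   # end-of-sentence tag

def normalize(counts):
    # Turn raw counts into conditional probabilities P(next | current).
    return {prev: {k: v / sum(c.values()) for k, v in c.items()}
            for prev, c in counts.items()}

transition_probs = normalize(transition_counts)
emission_probs = normalize(emission_counts)
print(transition_probs["^"])    # {'DET': 1.0}
print(emission_probs["NOUN"])   # {'dog': 0.5, 'cat': 0.5}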
3.1.7
The Viterbi Algorithm
The Viterbi algorithm[13] is a dynamic programming algorithm for finding the
most likely sequence of hidden states that result in the sequence of observed states.
Here the hidden states are the POS tags (or urns in the example) and the observed
sequence is the word sequence (ball colours).
The state transition probabilities are known (in practice these are estimated
from labelled data) and so are the probabilities of emitting each word in the sentence given the POS tag of the previous word. We start at the start of the input
sentence. We define two additional POS tags ˆ and $ to represent the tag for the
start of the sentence and the terminal character at the end of the sentence (full stop,
exclamation mark and question mark).
A straight-forward algorithm to find the most probable POS tag sequence (hidden sequence) would be to just try all possibilities starting from the beginning of
the sentence. Here, our problem has more structure. We will exploit the Markov
assumption we made earlier to get a much more efficient algorithm which is precisely the Viterbi algorithm.
In the trellis for POS tagging problem the following are the major changes to
be done.
• The observations (words) do not vary with time. Instead they vary with the
position of the pointer in the input sentence.
• The states are the POS tags. The state transition probabilities are pre-computed
using a POS-tagged corpus.
Next, we observe that due to the Markov assumption, once we have traversed
a part of the sentence, the transition probabilities do not depend on the entire sentence seen so far. They depend only on the previous POS tag. This crucial observation gives rise to the Viterbi algorithm:
Suppose we are given an HMM with S possible POS tags (states), initial probabilities π_i of being in state i, the transition probabilities P(s_j|s_i) of going from
state i to j and the emission probabilities P(y_t|s_i) of emitting y_t from the state s_i.
If the input sentence is y_1, y_2, . . . , y_T, then the most probable state sequence that
produces it is given by the recurrence relations
V_{1,k} = P(y_1|s_k) π_k    (3.1)
V_{t,k} = P(y_t|s_k) max_{s_x ∈ S} (P(s_k|s_x) · V_{t−1,x})    (3.2)
where V_{t,k} is the probability of the most probable state sequence which emitted
the first t words that has k as the final state. The Viterbi path (most likely state
sequence) can be remembered by storing back pointers which contain the state
s_x which was chosen in the second equation. The complexity of the algorithm is
O(|T|·|S|²), where T is the input sequence of words and S is the set of POS
tags.
3.1.8
Pseudocode
Pseudocode for the Viterbi algorithm is given below:
# Given
# Set of states: Array S
# Start state: s0
# End state: se
# Symbol sequence: Array w
# State transition probabilities: Matrix a
# Symbol emission probabilities: Matrix b
# alpha: Matrix alpha, initialised to 0
# All indices in arrays start on 1 in this pseudocode
# Returns
# Probability p of the most likely state sequence
# Initialisation F1
foreach s in S do
    alpha[1][s] := a[s0][s]*b[s][w[1]]
done
# Induction F2
for i := 1 to length(w)-1 do
    foreach s in S do
        foreach s' in S do
            alpha[i+1][s] := max(alpha[i+1][s], alpha[i][s']*a[s'][s])
        done
        alpha[i+1][s] *= b[s][w[i+1]]
    done
done
# Termination F3
p := 0
foreach s in S do
    p := max(p, alpha[length(w)][s]*a[s][se])
done
return p
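For completeness, here is a runnable Python version of the recurrences (3.1)-(3.2) with back pointers, so that the most likely tag sequence itself is recovered (our own sketch; the toy probability tables at the end are hypothetical):
def viterbi(words, states, start_p, trans_p, emit_p):
    # V[t][s] = probability of the best state sequence for words[0..t] ending in s.
    V = [{s: start_p[s] * emit_p[s].get(words[0], 0.0) for s in states}]
    back = [{}]

    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            # Recurrence (3.2): maximise over the previous state.
            prev, best = max(((p, V[t - 1][p] * trans_p[p][s]) for p in states),
                             key=lambda item: item[1])
            V[t][s] = best * emit_p[s].get(words[t], 0.0)
            back[t][s] = prev

    # Follow the back pointers from the best final state.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]

# Toy example with hypothetical probabilities.
states = ["DET", "NOUN"]
start_p = {"DET": 0.8, "NOUN": 0.2}
trans_p = {"DET": {"DET": 0.1, "NOUN": 0.9}, "NOUN": {"DET": 0.4, "NOUN": 0.6}}
emit_p = {"DET": {"the": 0.9}, "NOUN": {"dog": 0.7}}
print(viterbi(["the", "dog"], states, start_p, trans_p, emit_p))
# (['DET', 'NOUN'], 0.4536)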
In the next section, we present the concept of Maximum Entropy and see how
it is applied to NLP tasks via an example for Statistical Machine Learning.
3.2
Maximum Entropy Approach
"Gain in entropy always means loss of information, and nothing more".
- G.N. Lewis (1930)
3.2.1
Entropy - Thermodynamic and Information
In statistical mechanics, entropy is of the form:
S = −k Σ_i p_i log p_i,
where p_i is the probability of the microstate i taken from an equilibrium ensemble.
The defining expression for entropy in Shannon’s theory of information is of the
form:
H = − Σ_i p_i log p_i,
where p_i is the probability of the message m_i taken from the message space M.
Mathematically H may also be seen as an average information, taken over the
message space, because when a certain message occurs with probability p_i, the information − log p_i will be obtained.
A connection can be made between the two. If the probabilities in question
are the thermodynamic probabilities p_i, the (reduced) Gibbs entropy σ can then be
seen as simply the amount of Shannon information needed to define the detailed
microscopic state of the system, given its macroscopic description. To be more
concrete, in the discrete case using base two logarithms, the reduced Gibbs entropy
is equal to the minimum number of yes/no questions needed to be answered in
order to fully specify the microstate, given that we know the macrostate.
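As a tiny numerical illustration of the information-theoretic definition (our own example, not from the text), the entropy of a distribution follows directly from H = −Σ_i p_i log p_i:
import math

def shannon_entropy(probs, base=2):
    # H = -sum_i p_i * log(p_i); terms with p_i = 0 contribute nothing.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))    # 1.0 bit: a fair coin
print(shannon_entropy([0.9, 0.1]))    # ~0.469 bits: a biased coin
print(shannon_entropy([0.25] * 4))    # 2.0 bits: the flattest 4-way distribution
The last line hints at why the Maximum Entropy principle below prefers the flattest distribution consistent with the constraints: among all four-way distributions, the uniform one has the largest entropy.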
3.2.2
The Maximum Entropy Model
Language modelling is the attempt to characterize, capture and exploit regularities
in natural language. In statistical language modelling, large amounts of text are
used to automatically determine the model's parameters, in a process known as
training. While building models, we may use each knowledge source separately
and then combine. Under the Maximum Entropy approach, one does not construct
separate models. Instead, we build a single, combined model, which attempts to
capture all the information provided by the various knowledge sources. Each such
knowledge source gives rise to a set of constraints, to be imposed on the combined
model. The intersection of all the constraints, if not empty, contains a (possibly
infinite) set of probability functions, which are all consistent with the knowledge
sources. Once the desired knowledge sources have been incorporated, no other
features of the data are assumed about the source. Instead, the worst (flattest)
of the remaining possibilities is chosen. Let us illustrate these ideas with a simple
example.
3.2.3
Application to Statistical Machine Learning
Suppose we wish to predict the next word in a document[11], given the history,
i.e., what has been read so far. Assume we wish to estimate P (BANK|h), namely
the probability of the word BANK given the document's history. One estimate
may be provided by a conventional bigram. The bigram would partition the event
space (h, w) based on the last word of the history. Consider one such equivalence
class, say, the one where the history ends in THE. The bigram assigns the same
probability estimate to all events in that class:
P_BIGRAM(BANK|THE) = K_{THE,BANK}
That estimate is derived from the distribution of the training data in that class.
Specifically, it is derived as:
K_{THE,BANK} = C(THE, BANK) / C(THE)
Another estimate may be provided by a particular trigger pair, say (LOAN ↦ BANK).
Assume we want to capture the dependency of BANK on whether or not LOAN occurred before it in the same document. Thus a different partition of the event space
will be added. Similarly to the bigram case, consider now one such equivalence
class, say, the one where LOAN did occur in the history. The trigger component
assigns the same probability estimate to all events in that class:
P_{LOAN↦BANK}(BANK | LOAN ∈ h) = K_{BANK|LOAN∈h}
That estimate is derived from the distribution of the training data in that class.
Specifically, it is derived as:
K_{BANK|LOAN∈h} = C(BANK, LOAN ∈ h) / C(LOAN ∈ h)
These estimates are clearly mutually inconsistent. How can they be reconciled?
Linear interpolation solves this problem by averaging the two answers. The backoff method solves it by choosing one of them. The Maximum Entropy approach,
on the other hand, does away with the inconsistency by relaxing the conditions imposed by the component sources.
Consider the bigram. Under Maximum Entropy, we no longer insist that P (BANK|h)
always have the same value K_{THE,BANK} whenever the history ends in THE. Instead, we acknowledge that the history may have other features that affect the
probability of BANK. Rather, we only require that, in the combined estimate,
P(BANK|h) be equal to K_{THE,BANK} on average in the training data:
E_{h ends in THE} [P_COMBINED(BANK|h)] = K_{THE,BANK}
where E stands for an expectation, or average. The constraint expressed by this
equation is much weaker. There are many different functions P_COMBINED that
would satisfy it. Similarly,
E_{LOAN∈h} [P_COMBINED(BANK|h)] = K_{BANK|LOAN∈h}
In general, we can define any subset S of the event space, and any desired expectation K, and impose the constraint:
Σ_{(h,w)∈S} P(h, w) = K
The subset S can be specified by an index function, also called selector function,
f_S, an indicator for the belongingness of the pair (h, w) in S. So, we have
Σ_{(h,w)} P(h, w) f_S(h, w) = K
We need not restrict ourselves to index functions. Any real-valued function f(h, w)
can be used. We call f(h, w) a constraint function, and the associated K the desired
expectation. So, we have
⟨f, P⟩ = K
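To make the selector functions and desired expectations concrete, the following sketch (our own toy example with a hypothetical four-event corpus) computes the two constraint values discussed above as empirical averages:
# Toy corpus of (history, next_word) pairs; purely hypothetical.
events = [
    (["he", "went", "to", "the"], "BANK"),
    (["she", "sat", "by", "the"], "RIVER"),
    (["the", "LOAN", "from", "the"], "BANK"),
    (["a", "LOAN", "for", "a"], "HOUSE"),
]

# Bigram constraint: K_{THE,BANK} = C(THE, BANK) / C(THE)
ends_in_the = [(h, w) for h, w in events if h[-1] == "the"]
k_the_bank = sum(w == "BANK" for _, w in ends_in_the) / len(ends_in_the)

# Trigger constraint: K_{BANK | LOAN in h} = C(BANK, LOAN in h) / C(LOAN in h)
loan_in_h = [(h, w) for h, w in events if "LOAN" in h]
k_bank_loan = sum(w == "BANK" for _, w in loan_in_h) / len(loan_in_h)

print(k_the_bank, k_bank_loan)   # 0.666..., 0.5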
3.3
The ME Principle and a Solution
Now, we give a general description of the Maximum Entropy model and its solution. The Maximum Entropy (ME) Principle can be stated as follows[6]
1. Reformulate the different information sources as constraints to be satisfied
by the target (combined) estimate.
2. Among all probability distributions that satisfy these constraints, choose the
one that has the highest entropy.
Given a general event space {x}, to derive a combined probability function P (x),
each constraint j is associated with a constraint function fj (x) and a desired expectation Kj . The constraint is then written as:
E_P f_j = Σ_x P(x) f_j(x) = K_j
Given consistent constraints, a unique ME solution is guaranteed to exist, and to
be of the form:
P(x) = Π_j µ_j^(f_j(x))
where the µ_j's are some unknown constants, to be found.
3.3.1
Proof for the ME Formulation
Here, we give a proof for the unique ME solution that we proposed in the previous
subsection. Suppose there are N different points in the event space, and we assign
a probability p_i to each. Then, the objective to be maximised is the entropy, given
by H = − Σ_{i=1}^{N} p_i ln p_i. The constraints are:
Σ_i p_i = 1
Σ_i p_i f_j(x_i) = K_j   ∀ j ∈ {1, 2, ..., m}
So, we introduce Lagrange multipliers and now maximise
F = − Σ_{i=1}^{N} p_i ln p_i + λ (Σ_{i=1}^{N} p_i − 1) + Σ_{j=1}^{m} λ_j (Σ_{i=1}^{N} p_i f_j(x_i) − K_j)
∂F/∂p_i = − ln p_i − 1 + λ + Σ_{j=1}^{m} λ_j f_j(x_i) = 0
ln p_i = λ − 1 + Σ_{j=1}^{m} λ_j f_j(x_i)
p_i = e^(λ−1) e^(Σ_{j=1}^{m} λ_j f_j(x_i))
p_i = e^(λ−1) Π_{j=1}^{m} e^(λ_j f_j(x_i))
p_i = a Π_{j=1}^{m} µ_j^(f_j(x_i))
where a = e^(λ−1) is a normalization constant and e^(λ_j) = µ_j
3.3.2
Generalized Iterative Scaling
To search the exponential family defined by p_i = Π_{j=1}^{m} µ_j^(f_j(x_i)) for the µ_j's that will
make P(x) satisfy all the constraints, an iterative algorithm exists, which is guaranteed to converge to the solution. GIS[5] starts with some arbitrary µ_j^(0) values,
which define the initial probability estimate:
P^(0)(x) = Π_j (µ_j^(0))^(f_j(x))
Each iteration creates a new estimate, which is improved in the sense that it matches
the constraints better than its predecessor. Each iteration (say k) consists of the
following steps:
1. Compute the expectations of all the f_j's under the current estimate function.
   Namely, compute E_{P^(k)} f_j = Σ_x P^(k)(x) f_j(x)
2. Compare the actual values E_{P^(k)} f_j to the desired values K_j, and update
   the µ_j's according to the following formula:
   µ_j^(k+1) = µ_j^(k) · K_j / E_{P^(k)} f_j
3. Define the next estimate function based on the new µ_j's:
   P^(k+1)(x) = Π_j (µ_j^(k+1))^(f_j(x))
Iterating is continued until convergence or near-convergence.
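A minimal Python sketch of these three steps (our own illustration): the toy event space has three points and two binary features, exactly one of which fires for each point, so the plain ratio update shown in step 2 is the appropriate GIS step; we renormalize P at every iteration rather than carrying the normalization constant explicitly:
import numpy as np

F = np.array([[1, 0],    # f_0(x), f_1(x) for x = 0
              [1, 0],    # x = 1
              [0, 1]])   # x = 2
K = np.array([0.7, 0.3])           # desired expectations of the two features

mu = np.ones(2)                    # arbitrary starting values mu_j^(0)
for _ in range(100):
    P = np.prod(mu ** F, axis=1)   # P(x) proportional to prod_j mu_j^(f_j(x))
    P /= P.sum()
    E = F.T @ P                    # current expectations E_P f_j
    mu *= K / E                    # GIS update: mu_j <- mu_j * K_j / E_P f_j

print(np.round(P, 3))              # converges to [0.35, 0.35, 0.3]
print(np.round(F.T @ P, 3))        # both constraints are met: [0.7, 0.3]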
3.4
Improved Iterative Scaling
Iterative Scaling and its variants are all based on the central idea of the Gradient
Descent algorithm for optimizing convex training objectives. It is presented here
using a model which occurs at many places in a maximum entropy approach to
natural language processing.
3.4.1
The Model in parametric form
The problem we consider is a language modelling problem[9], which is to define
the distribution P(y|x), where y and x are sequences. For example, y can be the POS tag
sequence and x the input sequence. Henceforth the boldface indicating that x is a
sequence will be dropped unless the context demands further elucidation.
Given just the above information, the maximum entropy approach maximises
the entropy of the model giving us a model of the following form.
\[ P_\Lambda(y|x) = \frac{1}{Z_\Lambda(x)} \exp\left( \sum_{i=1}^{n} \lambda_i f_i(x, y) \right) \tag{3.3} \]
where
• fi (x, y) is a binary-valued function, called a feature of (x,y), associated with
the model. The model given above has n features.
• λi is a real-valued weight attached with fi whose absolute value measures
the ’importance’ of the feature fi . Λ is the vector of the weights: Λ =
{λ1 , λ2 , . . . , λn }.
• ZΛ (x) is the normalizing factor which ensures that PΛ is a probability distribution.
\[ Z_\Lambda(x) = \sum_y \exp\left( \sum_{i=1}^{n} \lambda_i f_i(x, y) \right) \]
3.4.2
Maximum Likelihood
The next thing to do would be to train the model, i.e., find the parameters $\lambda_i$ so as to maximize some objective over the training data. Here, we choose to maximize the likelihood of the training data. The likelihood is computed by assuming that the model is the correct underlying distribution and hence is a function of the parameters of the model. The likelihood of the training data is expressed as follows (N is the number of training instances):
\[ M(\Lambda) = \prod_{i=1}^{N} P(x_i, y_i) = \prod_{i=1}^{N} P_\Lambda(y_i|x_i) \, P(x_i) \]
Now, we note that log(x) is a one-to-one map for x > 0. Therefore the value of
x which maximizes f (x) is the same as that which maximizes log(f (x)). Henceforth we work with the logarithm of the likelihood expression as it is mathematically easier to work with. The log-likelihood expression denoted by L(Λ) is given
below:
\[ L(\Lambda) = \log(M(\Lambda)) = \sum_{i=1}^{N} \log\left(P_\Lambda(y_i|x_i)\right) + C \]
where C is independent of Λ and is hence treated as a constant. It is dropped from
the expression henceforth as it does not affect the maximization problem.
Now, we express the log-likelihood expression in terms of the empirical probability
distribution p̃(x, y) obtained from the training data as follows:
\[ \tilde{p}(x, y) = \frac{c(x, y)}{\sum_{x,y} c(x, y)} \]
where c(x, y) is the number of times the instance (x, y) occurs in the training data.
The log-likelihood expression becomes the following:
\[ L_{\tilde{p}}(\Lambda) = \sum_{x,y} \log P_\Lambda(y|x)^{c(x,y)} = \sum_{x,y} \tilde{p}(x, y) \log\left(P_\Lambda(y|x)\right) \]
We ignore $\sum_{x,y} c(x, y)$ as it is constant for a given training set (= N).
3.4.3
The objective to optimize
Hence we arrive at the objective to be maximized. The maximum likelihood problem is to discover $\Lambda^* \equiv \arg\max_\Lambda L_{\tilde{p}}(\Lambda)$ where
\begin{align*}
L_{\tilde{p}}(\Lambda) &= \sum_{x,y} \tilde{p}(x, y) \log\left(P_\Lambda(y|x)\right) \\
&= \sum_{x,y} \tilde{p}(x, y) \sum_i \lambda_i f_i(x, y) - \sum_{x,y} \tilde{p}(x, y) \log\left( \sum_y \exp\left( \sum_{i=1}^{n} \lambda_i f_i(x, y) \right) \right) \\
&= \sum_{x,y} \tilde{p}(x, y) \sum_i \lambda_i f_i(x, y) - \sum_x \tilde{p}(x) \log\left( \sum_y \exp\left( \sum_{i=1}^{n} \lambda_i f_i(x, y) \right) \right)
\end{align*}
3.4.4
Deriving the iterative step
Suppose we have a model with some arbitrary set of parameters Λ = {λ1 , λ2 , . . . , λn }.
We would like to find a new set of parameters Λ+∆ = {λ1 +δ1 , λ2 +δ2 , . . . , λn +
δn } which yield a model of higher log-likelihood. The change in log-likelihood is
\begin{align*}
L_{\tilde{p}}(\Lambda + \Delta) - L_{\tilde{p}}(\Lambda) &= \sum_{x,y} \tilde{p}(x, y) \log P_{\Lambda+\Delta}(y|x) - \sum_{x,y} \tilde{p}(x, y) \log P_\Lambda(y|x) \\
&= \sum_{x,y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) - \sum_x \tilde{p}(x) \log \frac{Z_{\Lambda+\Delta}(x)}{Z_\Lambda(x)}
\end{align*}
Now, we make use of the inequality −log(α) ≥ 1 − α to establish a lower
bound on the above change in likelihood expression.
\begin{align*}
L_{\tilde{p}}(\Lambda + \Delta) - L_{\tilde{p}}(\Lambda) &\geq \sum_{x,y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \frac{Z_{\Lambda+\Delta}(x)}{Z_\Lambda(x)} \\
&= \sum_{x,y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \frac{\sum_y \exp\left(\sum_i (\lambda_i + \delta_i) f_i(x, y)\right)}{\sum_y \exp\left(\sum_i \lambda_i f_i(x, y)\right)} \\
&= \sum_{x,y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \sum_y \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}{Z_\Lambda(x)} \exp\left(\sum_i \delta_i f_i(x, y)\right) \\
&= \sum_{x,y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \sum_y P_\Lambda(y|x) \exp\left(\sum_i \delta_i f_i(x, y)\right) \\
&= A(\Delta|\Lambda)
\end{align*}
Now we know that if we can find a ∆ such that A(∆|Λ) > 0 then we have an improvement in the likelihood. Hence, we try to maximize A(∆|Λ) with respect
to each δi . Unfortunately the derivative of A(∆|Λ) with respect to δi yields an
equation containing all of {δ1 , δ2 . . . . , δn } and hence the constraint equations for
δi are coupled.
To get around this, we first observe that the coupling is due to the summation
of the δi s present inside the exponentiation function. We consider a counterpart
expression with the summation placed outside the exponentiation and compare the
two expressions. We find that we can indeed establish an inequality using an important property called Jensen's inequality. First, we define the quantity
\[ f^{\#}(x, y) = \sum_i f_i(x, y) \]
If fi are binary-valued then f # (x, y) just gives the total number of features which
are non-zero (applicable) at the point (x,y). We rewrite A(∆|Λ) in terms of f # (x, y)
as follows:
\[ A(\Delta|\Lambda) = \sum_{x,y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \sum_y P_\Lambda(y|x) \exp\left( f^{\#}(x, y) \sum_i \frac{f_i(x, y)}{f^{\#}(x, y)} \, \delta_i \right) \]
Now, we note that $\frac{f_i(x, y)}{f^{\#}(x, y)}$ is a p.d.f. over i. Jensen's inequality states that for a p.d.f. p(x),
\[ \exp\left( \sum_x p(x) q(x) \right) \leq \sum_x p(x) \exp(q(x)) \]
Now, using Jensen's inequality, we get
\[ A(\Delta|\Lambda) \geq \sum_{x,y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \sum_y P_\Lambda(y|x) \sum_i \frac{f_i(x, y)}{f^{\#}(x, y)} \exp\left(\delta_i f^{\#}(x, y)\right) = B(\Delta|\Lambda) \]
where B(∆|Λ) is a new lower-bound on the change in likelihood. B(∆|Λ) can be
maximized easily because there is no coupling of variables in its derivative. The
derivative of B(∆|Λ) with respect to δi is,
\[ \frac{\partial B(\Delta)}{\partial \delta_i} = \sum_{x,y} \tilde{p}(x, y) f_i(x, y) - \sum_x \tilde{p}(x) \sum_y P_\Lambda(y|x) f_i(x, y) \exp\left(\delta_i f^{\#}(x, y)\right) \]
Notice that in the expression for $\frac{\partial B(\Delta)}{\partial \delta_i}$, the variable $\delta_i$ appears alone, without the other parameters. Therefore, we can solve for each $\delta_i$ individually. The final IIS algorithm is as follows:
• Start with some arbitrary values for the $\lambda_i$.
• Repeat until convergence:
  – Solve $\frac{\partial B(\Delta)}{\partial \delta_i} = 0$ for $\delta_i$, for each i.
  – Set $\lambda_i = \lambda_i + \delta_i$ for each i.
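As a concrete (hypothetical) illustration of the inner solve, note that when $f^{\#}(x, y)$ equals a constant C for every pair, the equation $\partial B(\Delta)/\partial \delta_i = 0$ has the closed-form solution $\delta_i = \frac{1}{C} \log\left(E_{\tilde{p}}[f_i] / E_{P_\Lambda}[f_i]\right)$. The NumPy sketch below (ours, with array names chosen only for the example) computes exactly that quantity.

import numpy as np

def iis_step_constant_fsharp(p_tilde_xy, p_model_y_given_x, p_tilde_x, F, C):
    # One IIS update under the simplifying assumption f#(x, y) = C for every (x, y).
    # p_tilde_xy        : (X, Y) empirical joint distribution p~(x, y)
    # p_model_y_given_x : (X, Y) current model P_Lambda(y | x)
    # p_tilde_x         : (X,)  empirical marginal p~(x)
    # F                 : (X, Y, n) feature values f_i(x, y)
    E_emp = np.einsum('xy,xyn->n', p_tilde_xy, F)                      # empirical expectations of f_i
    E_mod = np.einsum('x,xy,xyn->n', p_tilde_x, p_model_y_given_x, F)  # model expectations of f_i
    return np.log(E_emp / E_mod) / C                                   # delta_i for each feature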
3.5
Swarm Intelligence
Swarm Intelligence (SI)[10] is a relatively new paradigm being applied in a host of
research settings to improve the management and control of large numbers of interacting entities such as communication, computer and sensor networks, satellite
constellations and more. Attempts to take advantage of this paradigm and mimic
the behaviour of insect swarms however often lead to many different implementations of SI. Here, we provide a set of general principles for SI research and development. A precise definition of self-organized behaviour is described and provides
the basis for a more axiomatic and logical approach to research and development as
opposed to the more prevalent ad hoc approach in using SI concepts. The concept
of Pareto optimality is utilized to capture the notions of efficiency and adaptability.
3.5.1
Foundations
The use of swarm intelligence principles makes it possible to control and manage
complex systems of interacting entities even though the interactions between and
among the entities is minimal.
As an example, consider how ants actually solve shortest path problems. Their
motivation for solving these problems stems from their need to find sources of
food. Many ants set out in search of a food source by apparently randomly choosing several different paths. Along the way they leave traces of pheromone. Once
ants find a food source, they retrace their path back to their colony by following
their scent back to their point of origin. Since many ants go out from their colony
in search of food, the ants that return first are presumably those that have found
the food source closest to the colony or at least have found a source that is in some
sense more accessible. In this way, an ant colony can identify the shortest or best
path to the food source.
The cleverness and simplicity of this scheme is highlighted when this process
is examined from what one could conceive of as the ant's perspective - they simply
follow the path with the strongest scent (or so it seems). The shortest path will
have the strongest scent because less time has elapsed between when the ants set
out in search of food and when they arrive back at the colony, hence there is less
time for the pheromone to evaporate. This leads more ants to go along this path
further strengthening the pheromone trail and thereby reinforcing the shortest path
to the food source and so exhibits a form of reinforcement learning.
But this simple method of reinforcement or positive feedback also exhibits important characteristics of efficient group behaviour. If, for instance, the shortest
path is somehow obstructed, then the second best shortest path will, at some later
point in time, have the strongest pheromone, hence will induce ants to traverse it
thereby strengthening this alternate path. Thus, the decay in the pheromone level leads to redundancy, robustness and adaptivity, i.e., what some describe as emergent behaviour.
Efficiency via Pareto Optimality
Optimization problems are ubiquitous and even social insects must face them. Certainly, the efficient allocation of resources present problems where some goal or
objective must be maintained or achieved. Such goals or objectives are often mathematically modelled using objective functions, functions of decision variables or
parameters that produce a scalar value that must be either minimized or maximized.
The challenge presented in these often difficult problems is to find the values of
those parameters that either minimize or maximize, i.e., optimize, the objective
function value subject to some constraints on the decision variables.
In multi-objective optimization problems (MOPs), system efficiency in a mathematical sense is often based on the definition of Pareto optimality, a well-established way of characterizing a set of optimal solutions when several objective
functions are involved. Each operating point or vector of decision variables (operational parameters) produces several objective function values corresponding to a
single point in objective function space (this implies a vector of objective function
values). A Pareto optimum corresponds to a point in objective function space with
the property that when it is compared to any other feasible point in objective function space, at least one objective function value (vector component) is superior to
the corresponding objective function value (vector component) of this other point.
Pareto optima therefore constitute a special subset of points in objective function
space that lie along what is referred to as the Pareto optimal frontier: the set of
points that together dominate (are superior to) all other points in objective function
space.
Figure 3.3: The Pareto Optimal frontier is the set of hollow points. Operational decisions must be restricted along this set if operational efficiency is to be maintained
Determining several Pareto optima can be quite valuable for enhancing the
survival value of a species (or managing a complex system) because it enables
adaptive behaviour. Thus, if in an ant colony a path to a food source becomes congested, then other routes must be utilized. Although the distances to food sources
are generally minimized, as is the level of congestion, these often conflicting objectives can be efficiently traded off when the shortest distance is sacrificed to lessen
the level of congestion.
The Measure of Pareto Optima: A rather intuitive yet surprisingly little known
aspect of Pareto optima is its measure. This measure is based on the size of the
set of points in objective function space that are dominated by the Pareto optimal
frontier - in essence a Lebesgue measure or hypervolume.
Figure 3.4: The Pareto hypervolume
3.5.2
Example Algorithms and Applications
• Ant colony optimization
A class of optimization algorithms modelled on the actions of an ant colony,
ACO is a probabilistic technique useful in problems that deal with finding
better paths through graphs. Artificial 'ants' (simulation agents) locate optimal solutions by moving through a parameter space representing all possible
solutions. Natural ants lay down pheromones directing each other to resources while exploring their environment. The simulated ’ants’ similarly
record their positions and the quality of their solutions, so that in later simulation iterations more ants locate better solutions.
• Artificial bee colony algorithm
Artificial bee colony algorithm (ABC) is a meta-heuristic algorithm that
simulates the foraging behaviour of honey bees. The algorithm has three
phases: employed bee, onlooker bee and scout bee. In the employed bee and
the onlooker bee phases, bees exploit the sources by local searches in the
neighbourhood of the solutions selected based on deterministic selection in
the employed bee phase and the probabilistic selection in the onlooker bee
phase. In the scout bee phase which is an analogy of abandoning exhausted
food sources in the foraging process, solutions that are not beneficial any
more for search progress are abandoned, and new solutions are inserted instead of them to explore new regions in the search space. The algorithm has
a well-balanced exploration and exploitation ability.
• Particle swarm optimization
PSO is a global optimization algorithm for dealing with problems in which
a best solution can be represented as a point or surface in an n-dimensional
space. Hypotheses are plotted in this space and seeded with an initial velocity, as well as a communication channel between the particles. Particles then
move through the solution space, and are evaluated according to some fitness
criterion after each time-step. Over time, particles are accelerated towards
those particles within their communication grouping which have better fitness values. The main advantage of such an approach over other global minimization strategies such as simulated annealing is that the large number of
members that make up the particle swarm make the technique impressively
resilient to the problem of local minima.
3.5.3
Case Study: Ant Colony Optimization applied to the NP-hard
Travelling Salesman Problem
Travelling salesman problem (TSP) consists of finding the shortest route in complete weighted graph G with n nodes and n(n-1) edges, so that the start node and
the end node are identical and all other nodes in this tour are visited exactly once.
We apply the Ant Colony[12] heuristic to obtain an approximate solution to the
problem. We use virtual ants to traverse the graph and discover paths for us. Their
movement depends on the amount of pheromone on the graph edges. We assume
the existence of ant’s internal memory. In symbols, what we have is:
• Complete weighted graph G = (N, A)
• N = set of nodes representing the cities
• A = set of arcs
• Each arc (i, j) in A is assigned a value (length) dij , which is the distance
between cities i and j.
Tour Construction
$\tau_{ij}$ refers to the desirability of visiting city j directly after city i. The heuristic information is chosen as $\eta_{ij} = \frac{1}{d_{ij}}$.
We apply the following constructive procedure to each ant:
1. Choose, according to some criterion, a start city at which the ant is positioned;
2. Use pheromone and heuristic values to probabilistically construct a tour by
iteratively adding cities that the ant has not visited yet, until all cities have
been visited;
3. Go back to the initial city;
4. After all ants have completed their tour, they may deposit pheromone on the
tours they have followed.
Continue for a fixed number of iterations or till the pheromone distribution becomes almost constant.
Ant System
The Ant System (proposed in 1991) uses the following heuristics and formulae for
probability propagation
• Initialize the pheromone trails with a value slightly higher than the expected
amount of pheromone deposited by the ants in one iteration; a rough estimate
of this value can be obtained by setting
\[ \tau_{ij} = \tau_0 = \frac{m}{C^{nn}} \]
where m is the number of ants, and $C^{nn}$ is the length of a tour generated by the nearest-neighbour heuristic.
• In AS, these m artificial ants concurrently build a tour of the TSP.
• Initially, put ants on randomly chosen cities. At each construction step, ant
k applies a probabilistic action choice rule, called random proportional rule,
to decide which city to visit next.
\[ p_{ij}^{k} = \frac{\tau_{ij}^{\alpha} \, \eta_{ij}^{\beta}}{\sum_{l \in N_i^k} \tau_{il}^{\alpha} \, \eta_{il}^{\beta}}, \quad \text{if } j \in N_i^k \]
• Each ant k maintains a memory Mk which contains the cities already visited,
in the order they were visited. This memory is used to define the feasible
neighbourhood Nik in the construction rule.
• We can adopt any of the following two: Parallel implementation: at each
construction step all ants move from current city to next one; Sequential implementation: ant builds complete tour before next one starts to build another
Update of Pheromone Trails
• Forget bad decisions:
$\tau_{ij} \leftarrow (1 - \rho)\,\tau_{ij} \quad \forall i, j$, where $\rho \in (0, 1]$ is the evaporation rate
• So, if an arc is not chosen by the ants, its pheromone value decreases exponentially
• $\Delta\tau_{ij}^{k}$ is the amount of pheromone ant k deposits on the arcs it has visited and $C^k$ is the length of tour $T^k$ built by the k-th ant. Then, they are related as follows:
$\Delta\tau_{ij}^{k} = 1/C^k$, if arc (i, j) belongs to tour $T^k$; 0 otherwise
• The update then happens as follows (a short sketch combining these rules is given below):
\[ \tau_{ij} \leftarrow \tau_{ij} + \sum_{k=1}^{m} \Delta\tau_{ij}^{k}, \quad \forall (i, j) \]
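Putting the construction and update rules together, the following Python sketch (our own illustration, not the implementation used in the experiments below) runs a bare-bones Ant System on a distance matrix; parameter names mirror the text (α, β, ρ, m, and τ0 = m/C^nn).

import numpy as np

def aco_tsp(d, m=50, n_iter=200, alpha=1.0, beta=5.0, rho=0.5, rng=None):
    # Minimal Ant System sketch for the TSP on an n x n distance matrix d.
    rng = rng or np.random.default_rng(0)
    d = np.asarray(d, dtype=float)
    n = len(d)
    eta = 1.0 / (d + np.eye(n))                      # heuristic 1/d_ij (diagonal padded to avoid 1/0)
    tour = [0]                                       # nearest-neighbour tour for tau_0 = m / C_nn
    while len(tour) < n:
        rest = [j for j in range(n) if j not in tour]
        tour.append(min(rest, key=lambda j: d[tour[-1]][j]))
    C_nn = sum(d[tour[i]][tour[(i + 1) % n]] for i in range(n))
    tau = np.full((n, n), m / C_nn)
    best_len, best_tour = np.inf, None
    for _ in range(n_iter):
        all_tours = []
        for _ in range(m):
            t = [int(rng.integers(n))]
            while len(t) < n:
                i = t[-1]
                feas = [j for j in range(n) if j not in t]
                w = np.array([tau[i, j] ** alpha * eta[i, j] ** beta for j in feas])
                t.append(int(rng.choice(feas, p=w / w.sum())))   # random proportional rule
            L = sum(d[t[i]][t[(i + 1) % n]] for i in range(n))
            all_tours.append((t, L))
            if L < best_len:
                best_len, best_tour = L, t
        tau *= (1 - rho)                             # evaporation: forget bad decisions
        for t, L in all_tours:                       # each ant deposits 1/C_k on its arcs
            for i in range(n):
                tau[t[i], t[(i + 1) % n]] += 1.0 / L
    return best_tour, best_len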
Computational Experiments: As an experiment, the problem of 32 cities in Slovakia has been solved using ACO. The optimal solution to that problem has a
length of route 1453km. Parameters are α = 1, β = 5. The number of iterations
was set to 1000.
With m = 1000, the result was the tour with length 1621 km in 34th iteration
(difference 11.56% from optimal route).
Figure 3.5: Search process for m=1000 ants
With m = 5000, algorithm ACO finds the tour with length 1532km in 21st
iteration (difference 5.44% from optimal route).
Figure 3.6: Search process for m=5000 ants
3.6
Boltzmann Machines
One of the first examples of a neural network capable of learning internal representations, Boltzmann machines3 are able to represent and (given sufficient time) solve
difficult combinatoric problems. They are named after the Boltzmann distribution
in statistical mechanics, which is used in their sampling function.
Figure 3.7: Graphical representation for a Boltzmann machine with a few labelled
weights
3.6.1
Structure
A Boltzmann machine, is a network of stochastic units with an energy defined for
the network. The global energy E, in a Boltzmann machine is:
\[ E = -\left( \sum_{i<j} w_{ij} s_i s_j + \sum_i \theta_i s_i \right) \]
³ Content and figure from http://en.wikipedia.org/wiki/Boltzmann_machine
where wij is the connection strength between unit j and unit i; si ∈ {0, 1} is the
state of unit i; θi is the bias of unit i in the global energy function.
The connections in a Boltzmann machine have two restrictions:
• $w_{ii} = 0 \quad \forall i$. (No unit has a connection with itself.)
• $w_{ij} = w_{ji} \quad \forall i, j$. (All connections are symmetric.)
3.6.2
Probability of a state
The difference in the global energy that results from a single unit i being 0(off)
versus 1(on), written ∆Ei , is given by:
\[ \Delta E_i = \sum_j w_{ij} s_j + \theta_i \]
This can be expressed as the difference of energies of two states:
∆Ei = Ei=off − Ei=on
We then substitute the energy of each state with its relative probability according
to the Boltzmann Factor (the property of a Boltzmann distribution that the energy
of a state is proportional to the negative log probability of that state):
∆Ei = −kB T ln(pi=off ) − (−kB T ln(pi=on ))
where kB is Boltzmann’s constant and is absorbed into the artificial notion of temperature T . We then rearrange terms and consider that the probabilities of the unit
being on and off must sum to one:
\begin{align*}
\frac{\Delta E_i}{T} &= \ln(p_{i=\text{on}}) - \ln(p_{i=\text{off}}) \\
\frac{\Delta E_i}{T} &= \ln(p_{i=\text{on}}) - \ln(1 - p_{i=\text{on}}) \\
\frac{\Delta E_i}{T} &= \ln\left( \frac{p_{i=\text{on}}}{1 - p_{i=\text{on}}} \right) \\
-\frac{\Delta E_i}{T} &= \ln\left( \frac{1 - p_{i=\text{on}}}{p_{i=\text{on}}} \right) \\
-\frac{\Delta E_i}{T} &= \ln\left( \frac{1}{p_{i=\text{on}}} - 1 \right) \\
\exp\left(-\frac{\Delta E_i}{T}\right) &= \frac{1}{p_{i=\text{on}}} - 1
\end{align*}
We can now solve for pi=on , the probability that the ith unit is on.
\[ p_{i=\text{on}} = \frac{1}{1 + \exp\left(-\frac{\Delta E_i}{T}\right)} \]
where the scalar T is referred to as the temperature of the system. This relation is
the source of the logistic function found in probability expressions in variants of
the Boltzmann machine.
3.6.3
Equilibrium State
The network is run by repeatedly choosing a unit and setting its state according
to the above formula. After running for long enough at a certain temperature, the
probability of a global state of the network will depend only upon that global state’s
energy, according to a Boltzmann distribution. This means that log-probabilities of
global states become linear in their energies. This relationship is true when the machine is at thermal equilibrium, meaning that the probability distribution of global
states has converged. If we start running the network from a high temperature, and
gradually decrease it until we reach a thermal equilibrium at a low temperature, we
may converge to a distribution where the energy level fluctuates around the global
minimum. This process is called simulated annealing.
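A minimal sketch of this procedure, assuming nothing beyond the update rule $p_{i=\text{on}} = 1/(1 + \exp(-\Delta E_i / T))$ derived above, is given below; the weights, biases and temperature schedule are illustrative values, not taken from any experiment in this report.

import numpy as np

def anneal_boltzmann(W, theta, T_schedule, steps_per_T=1000, rng=None):
    # Simulated annealing on a Boltzmann machine: W is symmetric with zero diagonal,
    # theta are the biases; one randomly chosen unit is updated per step.
    rng = rng or np.random.default_rng(0)
    n = len(theta)
    s = rng.integers(0, 2, size=n)               # random initial state
    for T in T_schedule:                         # gradually decreasing temperatures
        for _ in range(steps_per_T):
            i = int(rng.integers(n))
            dE = W[i] @ s + theta[i]             # energy gap Delta E_i of unit i
            p_on = 1.0 / (1.0 + np.exp(-dE / T))
            s[i] = rng.random() < p_on           # set unit i on with probability p_on
    return s

# Example usage on a tiny 3-unit machine (illustrative numbers)
W = np.array([[0.0, 1.0, -2.0], [1.0, 0.0, 3.0], [-2.0, 3.0, 0.0]])
theta = np.array([0.1, -0.2, 0.05])
state = anneal_boltzmann(W, theta, T_schedule=np.geomspace(5.0, 0.1, 20))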
Chapter 4
Some popular Quantum
Computing Ideas
A quantum algorithm is a step-by-step procedure such that each of the steps can
be performed on a classical computer. Quantum computers can execute algorithms
that sometimes dramatically outperform classical computation. The best-known
example of this is Shor’s discovery of an efficient quantum algorithm for factoring
integers, whereas the same problem appears to be intractable on classical computers. Understanding what other computational problems can be solved significantly
faster using quantum algorithms is one of the major challenges in the theory of
quantum computation. In an attempt to gain an insight in the same, we study a few
of the existing quantum algorithms.
The first among them is the Deutsch-Jozsa algorithm used to determine the
nature of a function, followed by Shor’s algorithm for factoring integers and then
Grover’s algorithm which efficiently searches for an element in an unsorted database.
4.1
Deutsch-Jozsa Algorithm
4.1.1
Problem Statement
In the Deutsch-Jozsa problem, we are given a black box quantum computer known
as an oracle that implements the function f : {0, 1}n → {0, 1}. In layman’s terms,
it takes n-digit binary values as input and produces either a 0 or a 1 as output for
each such value. We are promised that the function is either constant (0 on all
inputs or 1 on all inputs) or balanced (returns 1 for half of the input domain and 0
for the other half); the task then is to determine if f is constant or balanced by using
the oracle.
4.1.2
Motivation and a Classical Approach
The Deutsch-Jozsa problem [1] is specifically designed to be easy for a quantum
algorithm and hard for any deterministic classical algorithm. The motivation is to
show a black box problem that can be solved efficiently by a quantum computer
with no error, whereas a deterministic classical computer would need exponentially
many queries to the black box to solve the problem.
For a conventional deterministic algorithm, where n is the number of bits/qubits, $2^{n-1} + 1$ evaluations of f will be required in the worst case. To prove that f is constant, just over half the set of inputs must be evaluated and their outputs found to be identical.
4.1.3
The Deutsch Quantum Algorithm
1. The algorithm begins with the n + 1 qubit state $|0\rangle^{\otimes n}|1\rangle$. That is, the first n qubits are each in the state $|0\rangle$ and the final one is $|1\rangle$.
2. Apply a Hadamard transformation to each qubit to obtain the state
\[ \frac{1}{\sqrt{2^{n+1}}} \sum_{x=0}^{2^n - 1} |x\rangle (|0\rangle - |1\rangle). \]
3. We have the function f implemented as a quantum oracle. The oracle maps the state $|x\rangle|y\rangle$ to $|x\rangle|y \oplus f(x)\rangle$, where $\oplus$ is addition modulo 2.
4. Applying the quantum oracle gives
\[ \frac{1}{\sqrt{2^{n+1}}} \sum_{x=0}^{2^n - 1} |x\rangle (|f(x)\rangle - |1 \oplus f(x)\rangle). \]
5. For each x, f(x) is either 0 or 1. A quick check of these two possibilities yields
\[ \frac{1}{\sqrt{2^{n+1}}} \sum_{x=0}^{2^n - 1} (-1)^{f(x)} |x\rangle (|0\rangle - |1\rangle). \]
6. At this point, ignore the last qubit. Apply a Hadamard transformation to each qubit to obtain
\[ \frac{1}{2^n} \sum_{x=0}^{2^n - 1} (-1)^{f(x)} \left[ \sum_{y=0}^{2^n - 1} (-1)^{x \cdot y} |y\rangle \right] = \frac{1}{2^n} \sum_{y=0}^{2^n - 1} \left[ \sum_{x=0}^{2^n - 1} (-1)^{f(x)} (-1)^{x \cdot y} \right] |y\rangle \]
where $x \cdot y = x_0 y_0 \oplus x_1 y_1 \oplus \cdots \oplus x_{n-1} y_{n-1}$ is the sum of the bitwise products.
7. Finally, we examine the probability of measuring $|0\rangle^{\otimes n}$, namely $\left| \frac{1}{2^n} \sum_{x=0}^{2^n - 1} (-1)^{f(x)} \right|^2$, which evaluates to 1 if f(x) is constant (constructive interference) and 0 if f(x) is balanced (destructive interference).
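To make the interference argument of step 7 concrete, here is a small NumPy simulation (ours, for illustration only) that prepares the post-oracle amplitudes $(-1)^{f(x)}/\sqrt{2^n}$, applies $H^{\otimes n}$, and checks the probability of $|0\rangle^{\otimes n}$; the example functions `constant` and `balanced` are made up for the demonstration.

import numpy as np

def deutsch_jozsa(f, n):
    # Simulate the Deutsch-Jozsa circuit on n qubits, ignoring the ancilla
    # (which only contributes the phase (-1)^f(x)).  Returns True iff f is
    # judged constant, i.e. |0...0> is observed with probability 1.
    N = 2 ** n
    amps = np.array([(-1) ** f(x) for x in range(N)]) / np.sqrt(N)   # state after the oracle
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    Hn = H
    for _ in range(n - 1):                       # build H^{(x)n} on all n qubits
        Hn = np.kron(Hn, H)
    final = Hn @ amps
    p_zero = abs(final[0]) ** 2                  # probability of measuring |0...0>
    return p_zero > 0.5

constant = lambda x: 0
balanced = lambda x: bin(x).count('1') % 2       # parity: 1 on exactly half the inputs
print(deutsch_jozsa(constant, 3), deutsch_jozsa(balanced, 3))   # True  False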
The Deutsch-Jozsa algorithm provided inspiration for Shor's algorithm and Grover's
algorithm, two of the most revolutionary quantum algorithms, which are described
now.
4.2
Shor’s Algorithm
Shor’s algorithm, given in 1994 by mathematician Peter Shor, is an algorithm for
integer factorization. On a quantum computer, Shor’s algorithm runs in polynomial time. First, we describe the problem of factorization more formally followed
by an overview of some mathematical concepts required to understand the algorithm. The familiar reader can skip these subsections and continue reading from
the subsection ’Reduction of the Factorization problem’.
4.2.1
The factorization problem
The factorization problem definition is given below.
Problem Definition: Given an integer n, factorize n as a product of primes.
Typically the integer n is very large (a few hundred digits long). Hence the
brute force approach of checking whether each number between 2 and n − 1 is a
factor of n which takes exponential time to complete, is not efficient and it can take
many years for the computation to finish. In fact, there is no deterministic algorithm known that can factorize n in polynomial-time. This limitation is exploited
by the famous Rivest-Shamir-Adleman encryption scheme (RSA).
We will assume (both for simplicity and with a view to RSA cryptanalysis) that
n = pq where p and q are large unknown primes. We must determine p and q.
4.2.2
The integers mod n
Let R = {0, 1, 2, . . . , n − 1} with addition and multiplication modulo n. For a, b ∈
R we compute a + b mod n and ab mod n by first computing the sum or product
as an ordinary integer, then taking the remainder upon division by n.
These operations are easily performed in polynomial time in the input size
l = log(n) using a classical logical circuit of size polynomial in l. For x ∈ R and
a ≥ 0, the value of xa mod n can also be determined in polynomial time and space
via the square-and-multiply algorithm which is described in brief below.
4.2.3
A fast classical algorithm for modular exponentiation
The method is based on the following observation:
\[ x^a = \begin{cases} x \, (x^2)^{\frac{a-1}{2}}, & \text{if } a \text{ is odd} \\ (x^2)^{\frac{a}{2}}, & \text{if } a \text{ is even.} \end{cases} \tag{4.1} \]
Now, due to the modular nature of squaring, the number of digits of $x^2$ is limited by the length of n. We compute $x^a$ by repeated squaring, taking the result modulo n each time before proceeding to the next iteration, which gives rise to the following recursive algorithm for exponentiation.
Function exp-by-squaring(x, a)
  if a < 0 then return exp-by-squaring(1/x, -a);
  else if a = 0 then return 1;
  else if a = 1 then return x;
  else if a is even then return exp-by-squaring(x*x, a/2);
  else if a is odd then return x * exp-by-squaring(x*x, (a-1)/2).
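For reference, an iterative Python version of the same square-and-multiply idea, reducing modulo n at every step as described above, might look as follows (a sketch only; Python's built-in pow(x, a, n) does the same job).

def exp_mod(x, a, n):
    # Square-and-multiply computation of x**a mod n, reducing modulo n at each step.
    result = 1
    x %= n
    while a > 0:
        if a & 1:                  # current bit of the exponent is 1
            result = (result * x) % n
        x = (x * x) % n            # square, keeping the operand bounded by n
        a >>= 1
    return result

assert exp_mod(13, 10, 55) == pow(13, 10, 55) == 34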
4.2.4
Reduction of the Factorization problem
Using randomization, factorization can be reduced to finding the order of an element in the multiplicative group (mod n), where the order r is the smallest $r \geq 1$ such that $x^r \bmod n$ is 1.
Suppose we choose x randomly from {2, . . . , n − 1} and find the order r of x with respect to n. Then, if r is even,
\[ (x^{r/2} - 1)(x^{r/2} + 1) \equiv 0 \pmod{n} \]
Now consider $\gcd(x^{r/2} - 1, n)$. This fails to be a non-trivial divisor of n only if $x^{r/2} \equiv -1 \pmod{n}$ or r is odd. This procedure, when applied to a random x (mod n), yields a factor of n with probability at least $1 - \frac{1}{2^{k-1}}$, where k is the number of distinct odd prime factors of n. We will accept this statement without proof.
It can be seen that the above probability is at least $\frac{1}{2}$ if $k \geq 2$. If k = 1, implying n has only one odd prime factor, n can be easily factored in polynomial time using classical algorithms. (Reference here)
Shor's algorithm finds the factors of n indirectly by first choosing a random x and then finding the order r of x with respect to n. Then it finds $\gcd(x^{r/2} - 1, n)$,
which will be a factor of n with high probability. It continues doing the same until
n has been completely factorized. The algorithm requires a quantum computer
only for finding the period of x in polynomial time. This part of the algorithm is
presented next.
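The classical skeleton of this reduction can be sketched as follows (order finding is done here by brute force purely for illustration; it is exactly the part that the quantum period-finding routine of the next subsection replaces, and the sketch assumes n is an odd composite that is not a prime power).

from math import gcd
from random import randrange

def order(x, n):
    # Brute-force order finding: the quantum part of Shor's algorithm replaces this.
    r, y = 1, x % n
    while y != 1:
        y = (y * x) % n
        r += 1
    return r

def shor_classical(n):
    # Classical skeleton of Shor's reduction: pick a random x, find its order r,
    # and try gcd(x^(r/2) - 1, n) as a non-trivial factor.
    while True:
        x = randrange(2, n)
        g = gcd(x, n)
        if g > 1:
            return g                               # lucky: x already shares a factor with n
        r = order(x, n)
        if r % 2 == 0 and pow(x, r // 2, n) != n - 1:
            return gcd(pow(x, r // 2, n) - 1, n)

print(shor_classical(55))   # prints 5 or 11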
4.2.5
The Algorithm
We present only the quantum part of the algorithm in this section. The complete
algorithm is presented at the end of the section. The algorithm uses two quantum
registers which hold integers represented in binary and some additional workspace.
1. Find q such that $q = 2^l$ for some integer l and $n^2 \leq q < 2n^2$. In a quantum
gate array we need not even keep the values of n, x and q in memory, as they
can be built into the structure of the gate array.
2. Next, the first register is put in the uniform superposition of states representing numbers a (mod q). This leaves the registers in the following state.
\[ \frac{1}{q^{1/2}} \sum_{a=0}^{q-1} |a\rangle |0\rangle. \tag{4.2} \]
3. Next xa mod n is computed using the square-and-multiply algorithm. This
can be done reversibly. This leaves our registers in the following state.
\[ \frac{1}{q^{1/2}} \sum_{a=0}^{q-1} |a\rangle |x^a \ (\mathrm{mod}\ n)\rangle. \tag{4.3} \]
4. Then the Fourier transform is performed on the first register, as described in
Chapter 2 which maps |ai to
\[ \frac{1}{q^{1/2}} \sum_{c=0}^{q-1} \exp(2\pi i a c / q) |c\rangle. \tag{4.4} \]
This leaves the registers in the following state
\[ \frac{1}{q} \sum_{a=0}^{q-1} \sum_{c=0}^{q-1} \exp(2\pi i a c / q) |c\rangle |x^a \ (\mathrm{mod}\ n)\rangle. \tag{4.5} \]
5. Finally, we observe the system. We now compute the probability that our machine ends in a particular state $|c, x^k \bmod n\rangle$, where $0 \leq k < r$. Summing up over all possible ways to reach this state, this probability is
\[ \left| \frac{1}{q} \sum_{a:\, x^a \equiv x^k} \exp(2\pi i a c / q) \right|^2 \]
Since the order is r, this sum is over all a such that $a \equiv k \pmod{r}$. Therefore, the above sum can be written as
\[ \left| \frac{1}{q} \sum_{b=0}^{\lfloor (q-k-1)/r \rfloor} \exp(2\pi i (br + k) c / q) \right|^2 \]
Since $\exp(2\pi i k c / q)$ factors out of the sum and has magnitude 1, we drop it. Now, on the remaining part of the expression, Shor's algorithm performs an estimation analysis of the above probability expression and derives the following lemma, which we present without proof.
Lemma 1. The probability of seeing a given state $|c, x^k \ (\mathrm{mod}\ n)\rangle$ is at least $\frac{1}{3r^2}$ if there is a d such that
\[ -\frac{r}{2} \leq rc - dq \leq \frac{r}{2}. \tag{4.6} \]
Next, Shor proceeds to prove that the probability of obtaining r via the above algorithm is at least $\frac{\delta}{\log\log r}$. We will accept the above statement without proof. Hence, by repeating the experiment $O(\log\log r)$ times, we are assured of a high probability of success.
4.2.6
An example factorization
We show the running of Shor over the factorization of n = 55. Since $n^2 \leq q < 2n^2$ and $q = 2^l$, $q = 2^{13} = 8192$. Suppose we choose x = 13. The running of the
algorithm on this input is described below.
1. We initialize the initial state to be a superposition of states representing a (mod 8192).
\[ |\psi\rangle = \frac{1}{\sqrt{8192}} \left( |0, 0\rangle + |1, 0\rangle + \ldots + |8191, 0\rangle \right) \]
2. Next the modular exponentiation gate is applied.
\begin{align*}
|\psi\rangle &= \frac{1}{\sqrt{8192}} \left( |0, 1\rangle + |1, 13\rangle + |2, 13^2 \bmod 55\rangle + \ldots + |8191, 13^{8191} \bmod 55\rangle \right) \\
&= \frac{1}{\sqrt{8192}} \left( |0, 1\rangle + |1, 13\rangle + |2, 4\rangle + \ldots + |8191, 2\rangle \right)
\end{align*}
3. Next we perform the Fourier transform on the first register.
\[ |\psi\rangle = \frac{1}{8192} \sum_{a=0}^{8191} \sum_{c=0}^{8191} \exp(2\pi i a c / 8192) |c\rangle |13^a \ (\mathrm{mod}\ 55)\rangle. \]
4. Now we observe the registers. Register 2 can be in any of the states with
equal probability. Hence all powers of x mod 55 are almost equally likely to
be observed if r << q. Suppose we observe 28 as a power of x mod 55.
This occurs 410 times in the series as a varies from 0 to 8191. Then the
probability of observing register 1 to be in state c is
\[ Pr(c) = \frac{1}{8192 \cdot 410} \left| \sum_{d=0}^{409} \exp(2\pi i \, r d c / 8192) \right|^2 \]
Here r = 20. Among the states which can be observed with reasonably high probability is $|4915\rangle$, which is observed with probability 4.4%.
5. Now c/q = 4915/8192. Shor's algorithm uses the method of continued fractions to find d/r from c/q. Applying it here gives us that r is a multiple of $r_1 = 5$, and that on trying $r_1, 2r_1, \ldots, \lfloor \log(n)^{1+\epsilon} \rfloor r_1$ as values for r we are guaranteed to find r with a very high probability. Here, we find that r = 20.
6. Now, the algorithm uses the Euclidean algorithm to find the factors of 55.
m = 13(20/2) mod 55 = 1310 mod 55 = 34
and the factors of n = 55 are,
gcd(m + 1, 55) = gcd(35, 55) = 5
gcd(m − 1, 55) = gcd(33, 55) = 11
4.3
Grover’s Algorithm
The Grover algorithm, given by Lov Grover in 1996, solves the problem of searching for an element in an unsorted database with N entries in $O(N^{1/2})$ time. Note that with classical computation models this problem cannot be solved in less than linear time (O(N)).
4.3.1
The search problem
Assume $N = 2^n$. Suppose that we have a function f(x) from $\{0, 1\}^n$ to $\{0, 1\}$ which is zero on all inputs except for a single (marked) item $x_0$: $f(x) = \delta_{x, x_0}$. By querying this function we wish to find the marked item $x_0$. If we have no information about the particular $x_0$, then finding this marked item is very difficult. In the worst case it will take $2^n - 1$ queries to find $x_0$ for a deterministic algorithm. In general, if the search problem has M solutions, then the classical algorithm might take as many as $2^n - M$ steps.
For large N, the Grover algorithm could yield very large performance increases. The key idea is that although finding a solution to the search problem is hard, recognising a solution is easy. We wish to search through a list of N elements; let us index the elements by $x \in \{0, \ldots, N - 1\}$ and call them $y_x$.
4.3.2
The Oracle
Rather than dealing with the list itself, we focus on the index of the list, x. Given
some value of x, we can tell whether yx solves the search problem. We assume
that we can construct some device to tell us if yx solves the search problem. This
device is called an Oracle.
• The Oracle takes as input an index value in a qubit register |xi. It also takes
a single Oracle qubit, |qi. The state given to the Oracle is thus |ψi = |xi|qi.
• The Oracle is represented by a unitary operator, O. If x indexes a solution to the search problem, O sets f(x) = 1, and f(x) = 0 if it doesn't index a solution.
• If f (x) = 1, the Oracle flips the state of |qi. We write this as O|xi|qi =
|xiX f (x) |qi. X is just our quantum NOT operator.
• So if f(x) = 1, $|q\rangle \mapsto X|q\rangle$; else $|q\rangle \mapsto |q\rangle$.
• We choose to initially program $|q\rangle = \frac{1}{\sqrt{2}}(|0\rangle - |1\rangle)$. Then $X|q\rangle = \frac{1}{\sqrt{2}}(-|0\rangle + |1\rangle) = -|q\rangle$, and $O|x\rangle|q\rangle = (-1)^{f(x)} |x\rangle|q\rangle$.
• The Oracle therefore takes $|x\rangle \mapsto (-1)^{f(x)} |x\rangle$. So the term indexing the
solution is marked with a − sign.
The Oracle does not find the solution to the problem, it simply recognises the
answer when presented with one. The key to quantum search is that we can look
at all solutions simultaneously: the Oracle just manipulates the state coefficients
using a unitary operator.
4.3.3
The Grover Iteration
1. Begin with $|x\rangle = \frac{1}{\sqrt{N}} \sum_{j=0}^{N-1} |j\rangle$.
2. Apply the Oracle to $|x\rangle$: $|x\rangle \mapsto \frac{1}{\sqrt{N}} \sum_{j=0}^{N-1} (-1)^{f(j)} |j\rangle$
3. Apply the QFT to |xi.
4. Reverse the sign of all terms in |xi except for the term |0i.
5. Apply the Inverse QFT.
6. Return to step 2 and repeat.
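Steps 3–5 taken together act as an "inversion about the mean" of the amplitudes, so a compact way to simulate the iteration classically is the NumPy sketch below (our illustration only; it tracks the full state vector, which of course defeats the quantum speed-up).

import numpy as np

def grover(n, marked, iterations):
    # State-vector simulation of Grover's search on N = 2**n items.
    # `marked` is the set of solution indices; returns the final probabilities.
    N = 2 ** n
    psi = np.full(N, 1 / np.sqrt(N))                 # uniform superposition
    for _ in range(iterations):
        psi[list(marked)] *= -1                      # oracle: flip the sign of marked items
        mean = psi.mean()
        psi = 2 * mean - psi                         # inversion about the mean (steps 3-5)
    return np.abs(psi) ** 2

n, marked = 6, {42}
k = int(round(np.pi / 4 * np.sqrt(2 ** n)))          # ~ (pi/4) * sqrt(N/M) iterations
probs = grover(n, marked, k)
print(k, probs[42])                                  # probability of the marked item is close to 1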
4.3.4
Performance of the algorithm
The point at which we terminate Grover's algorithm and measure the result is critical. This is because the probability associated with the correct state rises towards 1 after a certain number of iterations and then oscillates periodically between the two extremes, 0 and 1. It has been shown that the optimum number of iterations is $\sim \frac{\pi}{4}\sqrt{\frac{N}{M}}$, where M is the number of solutions. It has also been shown that this is the best that any quantum search algorithm can do.
4.3.5
An example
Apply Grover's algorithm to N = 4 with solution x = 2.
• We start with $|x\rangle = \frac{1}{2}(|0\rangle + |1\rangle + |2\rangle + |3\rangle)$.
• Apply the Oracle: $|x\rangle \mapsto \frac{1}{2}(|0\rangle + |1\rangle - |2\rangle + |3\rangle)$.
• Apply the QFT: $F|x\rangle = \frac{1}{2}(|0\rangle + |1\rangle - |2\rangle + |3\rangle)$.
• Flip the signs of all terms except $|0\rangle$: $F|x\rangle \mapsto \frac{1}{2}(|0\rangle - |1\rangle + |2\rangle - |3\rangle)$.
• Inverse QFT: $|x\rangle = |2\rangle$.
• So when we measure $|x\rangle$, we are guaranteed the right answer.
4.4
The Quantum Minimum Algorithm
We now present a quantum algorithm for finding the minimum value among a given
set of numbers. This algorithm is faster than the fastest possible classical algorithm
and, as usual, is probabilistic. This algorithm uses the Grover search algorithm
repeatedly to find the minimum with a high probability. First, we formally present
the problem with notation and then we present the algorithm.
4.4.1
The Problem
Let T [0..N − 1] be an unsorted table of N items, each holding a value from an
ordered set. The minimum searching problem is to find the index y such that T [y]
is minimum. This clearly requires a linear number of probes on a classical probabilistic Turing machine. [16] gave a simple quantum algorithm which solves the
problem using $O(N^{1/2})$ probes. The algorithm makes repeated calls to Grover's search algorithm to find the index of an item smaller than the value determined by a particular threshold index. If there are $t \geq 1$ marked entries, Grover's algorithm will return one of them with equal probability after an expected number of $O(\sqrt{N/t})$ iterations. If no entry is marked, it will run forever.
4.4.2
The Algorithm
The algorithm is as follows:
1. Choose threshold index 0 ≤ y ≤ N − 1 uniformly at random.
2. Repeat the following and interrupt it when the total running time is more than $22.5\sqrt{N} + 1.4 \log^2 N$. Then go to step 2(c).
(a) Initialize the register as a uniform superposition over the N states, i.e., give each state a coefficient of $\frac{1}{\sqrt{N}}$. Mark every item j for which
T [j] < T [y]. This would be an O(N ) operation on a classical computer but here, the entire state which is a superposition of the N basis
states, is acted upon at once by a quantum operator.
(b) Apply the quantum exponential searching algorithm of [15].
(c) Observe the register: let y′ be the outcome. If T[y′] < T[y], then set the threshold index y to y′.
3. Return y.
4.4.3
Running Time and Precision
By convention, we assume that stage 2a takes log(N ) time steps and that one iteration in Grover’s algorithm takes one time step. The expected number of iterations
used by Grover to find the index of a marked item among N items, where t items are marked, is at most $\frac{9}{2}\sqrt{N/t}$. The expected total time before y holds the index of the minimum is at most $m_0 = \frac{45}{4}\sqrt{N} + \frac{7}{10}\log^2 N$.
The algorithm given above finds the minimum with probability at least $\frac{1}{2}$. This probability can be improved to $1 - \frac{1}{2^c}$ by running the algorithm c times.
4.5
Quantum Walks
A generalization of Grover's search technique, quantum walks [17] have led to a
number of quantum algorithms for problems such as element distinctness (which
will be described later). In this section, we present the basics of quantum walks and
also an application of quantum walks due to Ambainis, namely, element distinctness. First, we describe random walks, which are the classical analogue of quantum
walks. Random walks provided the inspiration for quantum walks.
4.5.1
Random Walks
A random walk is a mathematical formulation of a path that consists of a succession
of random steps. For example, the path traced by a gas molecule, the path traced
by an animal foraging for food are all random walks. Often, random walks are
assumed to be Markov chains or Markov processes in discrete time although there
can be other types of random walks too. A classical Markov chain is said to be
a random walk on an underlying graph if the nodes of the graph are the states in
S, and a state s has a non-zero probability to go to state t if and only if the edge (s, t) exists in the graph. A simple random walk on a graph G(V, E) is described by repeated applications of a stochastic matrix P, where $P(u, v) = \frac{1}{d_u}$ if (u, v) is an edge in G and $d_u$ is the degree of the vertex u. If G is connected and non-bipartite, the distribution of the random walk $D_t = P^t D_0$ converges to a stationary
distribution π which is independent of the initial distribution D0 .
An example: A one-dimensional random walk
The elementary one-dimensional random walk is a walk on the integer line Z which
starts at 0 and at each time step moves +1 or -1 with equal probability.
4.5.2
Terminology used with Random Walks
There are many definitions which capture the rate of convergence to the limiting
distribution in a random walk. Some important terms are defined here.
Definition 3. Mixing Time:
\[ M = \min\{T \mid \forall t \geq T, D_0 : \|D_t - \pi\| \leq \epsilon\} \tag{4.7} \]
where the distance between two distributions $d_1$ and $d_2$ is given by $\|d_1 - d_2\| = \sum_i |d_1(i) - d_2(i)|$.
Definition 4. Filling Time:
\[ \tau = \min\{T \mid \forall t \geq T, D_0, X \subseteq V : D_t(X) \geq (1 - \epsilon)\pi(X)\} \tag{4.8} \]
Definition 5. Dispersion Time:
\[ \xi = \min\{T \mid \forall t \geq T, D_0, X \subseteq V : D_t(X) \leq (1 + \epsilon)\pi(X)\} \tag{4.9} \]
4.5.3
Quantum Analogue: Quantum Markov Chains or Quantum Walks
Let G(V, E) be a graph, and let HV be the Hilbert space spanned by the states
|vi where v ∈ V . We denote by n, or |V | the number of vertices in G. Assume
that G is d-regular. Let HA be the auxiliary Hilbert space of dimension d spanned
by the states |1i through |di. Let C be a unitary transformation on HA . Label
each directed edge with a number between 1 and d, such that the edges labelled a
form a permutation. Now we can define a shift operator S on HA ⊗ HV such that
S|a, vi = |a, ui where u is the ath neighbour of v. Hence, one step of the quantum
walk is given by the operator U = S.(C ⊗ I). This is called a coined quantum
walk.
Example: Consider the cycle graph with n nodes. This graph is 2-regular. The
Hilbert space for the walk would be $\mathbb{C}^2 \otimes \mathbb{C}^n$. We choose C to be the Hadamard transform
\[ H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \]
and the shift S is defined as
\[ S|R, i\rangle = |R, (i + 1) \bmod n\rangle, \qquad S|L, i\rangle = |L, (i - 1) \bmod n\rangle \]
where R denotes a move to the node on the right of the node indexed i and L
denotes a move to the left. The quantum walk is then defined to be the repeated application of the Hadamard transform on the first register followed by an application
of the shift operator S.
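A direct NumPy simulation of this coined walk on the cycle (our illustration; the symmetric initial coin state is an arbitrary choice) is given below.

import numpy as np

def hadamard_walk_on_cycle(n, steps):
    # Coined quantum walk on the n-cycle: the Hadamard coin acts on the 2-dim
    # coin register |L>,|R>, then the shift moves the walker left or right.
    # Returns the position distribution after `steps` applications of U = S (C x I).
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    psi = np.zeros((2, n), dtype=complex)
    psi[:, 0] = np.array([1, 1j]) / np.sqrt(2)         # symmetric start at node 0
    for _ in range(steps):
        psi = H @ psi                                  # coin toss on the first register
        left = np.roll(psi[0], -1)                     # |L, i> -> |L, (i - 1) mod n>
        right = np.roll(psi[1], 1)                     # |R, i> -> |R, (i + 1) mod n>
        psi = np.stack([left, right])
    return (np.abs(psi) ** 2).sum(axis=0)              # probability of each node

print(hadamard_walk_on_cycle(n=16, steps=10).round(3))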
Having this general idea of quantum walks in mind, we now proceed to examine their applications to algorithmic problems. We first remark that Grover's search algorithm is a special case of a quantum walk. Next we describe
the application of quantum walks to the element distinctness problem.
4.5.4
Application to Element-Distinctness Problem
The element distinctness problem is as follows. Given numbers $x_1, x_2, \ldots, x_N \in [M]$, are there $i, j \in [N]$ such that $i \neq j$ and $x_i = x_j$? Any classical solution to this problem will need $\Omega(N)$ queries. Ambainis gave a quantum walk algorithm for this problem that gives the answer in $O(N^{2/3})$ queries. The main idea is as follows. We have vertices $v_S$ corresponding to sets $S \subseteq \{1, 2, \ldots, N\}$. Two vertices $v_S$ and $v_T$ are connected by an edge if S and T differ in one variable. A vertex is marked if S contains i, j such that $x_i = x_j$. At each moment of time, we know $x_i$ for all $i \in S$. This enables us to check if the vertex is marked with no queries. Also, it enables us to move to an adjacent vertex $v_T$ by querying just one variable $x_i$ for $i \notin S$ and $i \in T$.
Then we define a quantum walk on subsets of the type S. Ambainis shows that
if x1 , x2 · · · xN are not distinct, this walk finds a set S containing i, j such that
xi = xj within O(N 2/3 ) steps.
With that, we conclude the current chapter on quantum computing literature.
Next, we move on to see the applications of quantum computing to more complex
algorithmic applications involving intelligence tasks.
Chapter 5
Quantum Computing and
Intelligence Tasks
Given the quantum computing techniques we have seen in the previous chapter, we would like to see if any of them can be applied to natural language processing tasks. There exist approaches in literature for applying quantum principles to
machine learning tasks such as classification [?]. Since NLP relies heavily on machine learning, we would like to study quantum machine learning algorithms too.
In this chapter, we first study a quantum approach to classification proposed by
Sébastian Gambs in 2008. Then we present our approach to a quantum Viterbi
algorithm using a modified version of Grover’s search algorithm. Later, we present
other possible approaches currently being investigated by us for the same problem.
5.1
Quantum Classification
Quantum classification is defined as the task of predicting the associated class of
an unknown quantum state drawn from an ensemble of pure states given a finite
number of copies of this state.
5.1.1
Learning in a Quantum World
Definition 6. (Quantum training dataset). A quantum training dataset containing
n pure quantum states can be described as Dn = {(|ψ1 i, y1 ), ...,
(|ψn i, yn )}, where |ψi i is the ith quantum state of the training dataset and yi is the
class associated with this state.
Example: (Quantum training dataset composed of pure states defined on d
qubits). In the context where all the pure states in the training dataset live in a
Hilbert space formed by d qubits and we are interested in the task of binary classification: $|\psi_i\rangle \in \mathbb{C}^{2^d}$ and $y_i \in \{-1, +1\}$.
Definition 7. (Training error). The training error (or error rate) of a classifier
f is defined as the probability that this classifier predicts the wrong class yi on a
quantum state |ψi i drawn randomly from the states of the quantum training dataset
Dn . Formally:
\[ \epsilon_f = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Prob}\left(f(|\psi_i\rangle) \neq y_i\right) \]
In the context of quantum classification, the notion of regret also takes a particular importance.
Definition 8. (Regret). The regret $r_f$ of a classifier f is defined as the difference between its error rate $\epsilon_f$ and the smallest achievable error $\epsilon_{opt}$ that can be achieved on the same problem. Formally:
\[ r_f = \epsilon_f - \epsilon_{opt} \]
The regret of a classifier, as well as its error, can potentially take any value in
the range between zero and one. The concept of regret is particularly meaningful
in the context of hard learning problems, where the raw error rate alone is not an
appropriate measure to characterize the inherent difficulty of the learning.
Definition 9. (Classification cost). The classification cost corresponds to the number of copies of the unknown quantum state |ψ? i that will be used by the classifier
to predict the class y? of this state.
5.1.2
The Helstrom Oracle
For the purpose of quantum classification, we will be using an abstraction called
the Helstrom Oracle.
Definition 10. The Helstrom oracle is an abstract construction that takes as input:
Version 1: a classical description of the density matrices $\rho_-$ and $\rho_+$ corresponding to the −1 and +1 tagged states, and their a priori probabilities $p_-$ and $p_+$,
or
Version 2: a finite number of copies of each state of the quantum training dataset
Dn
From this input, the oracle can be trained to produce an efficient implementation (exact or approximative) of the POVM (Positive-Operator Valued Measurement) of the Helstrom measurement fhel , in the form of a quantum circuit that can
distinguish between ρ− and ρ+. In the second version of the oracle, its training
cost corresponds to the minimum amount of copies of each state of the training
dataset that the oracle has to sacrifice in order to construct fhel .
5.1.3
Binary Classification
Let m− be the number of quantum states in Dn for which yi = −1 (negative class),
and its complement m+ be the number of states for which yi = +1 (positive class),
such that m− + m+ = n, the total number of data points in Dn . Moreover, p− is
the a priori probability of observing the negative class and is equal to $p_- = \frac{m_-}{n}$, and $p_+$ is its complementary probability for the positive class such that $p_- + p_+ = 1$.
Definition 11. (Statistical mixture of the negative class). The statistical mixture representing the negative class, $\rho_-$, is defined as $\frac{1}{m_-} \sum_{i=1}^{n} I\{y_i = -1\} |\psi_i\rangle\langle\psi_i|$, where I{.} is the indicator function which equals 1 if its premise is true and 0 otherwise.
Definition 12. (Statistical mixture of the positive class). In the same manner, the statistical mixture representing the positive class $\rho_+$ is defined as $\frac{1}{m_+} \sum_{i=1}^{n} I\{y_i = +1\} |\psi_i\rangle\langle\psi_i|$.
Theorem 1. (Helstrom measurement). The error rate of distinguishing between the two classes $\rho_-$ and $\rho_+$ is bounded from below by $\epsilon_{hel} = \frac{1}{2} - \frac{D(\rho_-, \rho_+)}{2}$, where $D(\rho_-, \rho_+) = \mathrm{Tr}|p_-\rho_- - p_+\rho_+|$ is a distance measure between $\rho_-$ and $\rho_+$ called the trace distance (here, $p_-$ and $p_+$ represent the a priori probabilities of classes $\rho_-$ and $\rho_+$, respectively). Moreover, this bound can be achieved exactly by the optimal POVM called the Helstrom measurement.
The Helstrom measurement is a binary classifier that has a null regret, which means $r_{hel} = 0$.
Remark: Error rate of the Helstrom measurement for extreme cases:
• Consider the case where both the negative class and the positive class are equiprobable. If $\rho_-$ and $\rho_+$ are two density matrices which correspond to the same state, their trace distance $D(\rho_-, \rho_+)$ is equal to zero, which means that the error $\epsilon_{hel}$ of the Helstrom measurement is $\frac{1}{2}$.
• On the other hand, if $\rho_-$ and $\rho_+$ are orthogonal, this means that $D(\rho_-, \rho_+) = 1$ and that the Helstrom measurement has an error $\epsilon_{hel} = 0$.
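Since the bound only involves the trace distance, it is easy to evaluate numerically once $\rho_-$ and $\rho_+$ are known. The NumPy sketch below (our illustration, not code from the quantum classification paper) computes $\epsilon_{hel}$ and reproduces the two extreme cases of the remark.

import numpy as np

def helstrom_error(rho_minus, rho_plus, p_minus, p_plus):
    # Helstrom bound eps_hel = 1/2 - D/2, with D = Tr|p- rho- - p+ rho+|
    # (the prior-weighted trace distance).
    M = p_minus * rho_minus - p_plus * rho_plus
    eigenvalues = np.linalg.eigvalsh(M)              # M is Hermitian
    D = np.abs(eigenvalues).sum()                    # trace norm of M
    return 0.5 - 0.5 * D

# Orthogonal pure states |0> and |1> with equal priors: error 0
rho0 = np.array([[1.0, 0.0], [0.0, 0.0]])
rho1 = np.array([[0.0, 0.0], [0.0, 1.0]])
print(helstrom_error(rho0, rho1, 0.5, 0.5))          # 0.0
print(helstrom_error(rho0, rho0, 0.5, 0.5))          # 0.5 (identical states)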
5.1.4
Weighted Binary Classification
(Reduction from weighted binary classification to standard binary classification via
Helstrom oracle). Given the access to an Helstrom oracle that takes as inputs the
description of the density matrices ρ− and ρ+ (and their a priori probabilities p−
and p+ ), it is possible to reduce the task of weighted binary classification to the
task of standard binary classification.
Training cost: null
Classification cost: Θ(1).
Proof: The weight $w_i$ of a particular state can be converted to a probability $p_i$ reflecting its importance by setting $p_i = \frac{w_i}{\sum_{j=1}^{n} w_j}$.
Let $\hat{p}_-$ be the new a priori probability of the negative class, which is equal to $\hat{p}_- = \sum_{i=1}^{n} p_i I\{y_i = -1\}$, and $\hat{p}_+$ its complementary probability such that $\hat{p}_- + \hat{p}_+ = 1$.
The Helstrom measurement which discriminates between the density matrices in which the weights are incorporated is precisely the POVM which minimizes the weighted error. Therefore, it suffices to call the Helstrom oracle with inputs
\[ \hat{\rho}_- = \sum_{i=1}^{n} p_i I\{y_i = -1\} |\psi_i\rangle\langle\psi_i| \qquad \hat{\rho}_+ = \sum_{i=1}^{n} p_i I\{y_i = +1\} |\psi_i\rangle\langle\psi_i| \]
This reduction makes only one call to the Helstrom oracle and requires only one
copy of the unknown quantum state at classification.
5.2
Quantum Walk for A-star search
The A∗ search algorithm is a heuristic based searching/graph traversal algorithm
which has vast applications in the field of artificial intelligence. In this section, we
first describe the A∗ algorithm in detail and then present our ideas of a quantum A∗
using the idea of quantum walks.
5.2.1
The A∗ Algorithm
A∗ is a graph search algorithm which uses a best-first search to find the least cost
path from a given initial node to a goal node. As A∗ traverses the graph, it follows
a path of lowest expected cost and keeps a sorted priority queue of alternate path
segments along the way. It uses a knowledge-plus-heuristic cost function to determine the order in which the search visits nodes in the tree. A∗ is primarily a search
algorithm and the basic building blocks of any search algorithm are the following:
1. State Space: This is the space of states of the graph among which we are
searching for a solution.
2. Start State: The start state is the state from which our search starts. The start
state is denoted by S0 .
3. Goal State: The goal state is the state we intend to find via the search. The
goal state is denoted by G.
4. Operator: This constitutes a transformation between states. It is a function
which takes a state as input and gives as output another state. This is used to
move from state to state thereby traversing the graph. Also each use of the
operator adds to the cost of taking that particular path.
5. Cost: The amount of effort involved in using the operator. This can be a
different function for different search problems.
6. Optimal Path: The path with the least cost to move from the start state to the
goal state.
Now we give an example to illustrate the above building blocks in a concrete
setting. We look at the 8-puzzle problem. We have a 3x3 square with the numbers
1 through 8 randomly arranged in 8 of the spaces of the square. This is our initial
configuration S0 . The empty square is regarded as free space and we can slide the
other 8 blocks upward, downward, left or right into the empty space. The aim is
to arrive at the goal configuration by sliding the blocks and making the minimum
number of moves (each movement of any block in any direction is regarded as one
move) This can be modelled in our search paradigm as follows. The state space the
set of all possible configurations of the 8 numbered blocks. The start state is the
initial configuration given to us and the goal state is the one shown in Figure 5.2.
The available operators are move left, move right, move up and move down if the
move is permitted. The cost is simply the total number of times the operators are
used in arriving at the goal state.
Figure 5.1: Start State for the 8-puzzle problem
We now present the A∗ search algorithm.
1. Create a search graph G consisting solely of the start node S. Put S on a list
called OPEN.
2. Create a list called CLOSED that is initially empty.
3. Loop: if OPEN is empty, exit with failure.
4. Select the first node on OPEN, remove it from OPEN and put it on CLOSED. Call this node n.
5. If n is the goal node, exit with the solution obtained by tracing a path along the pointers from S to n. (Pointers are established in step 7.)
Figure 5.2: Goal State for the 8-puzzle problem
6. Expand node n, generating the set M of its successors that are not ancestors of n. Install these members of M as successors of n in G.
7. Establish a pointer to n from those members of M not already in G (i.e., not already on either OPEN or CLOSED). Add these members of M to OPEN. For each member of M that was already on OPEN or CLOSED, decide whether or not to redirect its pointer to n. For each member of M already on CLOSED, decide for each of its descendants in G whether or not to redirect its pointer.
8. Reorder OPEN using the cost function f which is a mix of knowledge function g and heuristic function h. Reorder in the ascending order of cost.
9. Go to step 3.
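A compact Python sketch of the loop above, using a heap for the sorted OPEN list, is shown below. It is a simplified illustration: it does not redirect pointers for nodes already on CLOSED, so it assumes a consistent heuristic, and the function and parameter names are our own.

import heapq
from itertools import count

def a_star(start, goal, neighbours, h):
    # Minimal A* sketch: neighbours(n) yields (successor, step_cost) pairs and
    # h is an admissible heuristic (h(n) <= h*(n)); returns a least-cost path.
    tie = count()                                    # tie-breaker so the heap never compares nodes
    open_list = [(h(start), next(tie), 0, start, [start])]
    closed = set()
    while open_list:
        f, _, g, node, path = heapq.heappop(open_list)   # OPEN node with smallest f = g + h
        if node == goal:
            return path, g
        if node in closed:
            continue
        closed.add(node)
        for succ, cost in neighbours(node):
            if succ not in closed:
                heapq.heappush(open_list,
                               (g + cost + h(succ), next(tie), g + cost, succ, path + [succ]))
    return None, float('inf')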
The Heart of A∗ : The Heuristic Function
The cost function f is maintained at every node in the search graph. For a node n,
f (n) = g(n) + h(n), where g(n) is the least cost path to n from S0 found so far
and h(n) is a function which satisfies the property that h(n) ≤ h∗ (n) where h∗ (n)
is the actual cost of the optimal path to G from n which is to be found. If g ∗ (n) is
the least cost path from S0 to n, then g and h satisfy the following relations.
\[ g(n) \geq g^*(n) \tag{5.1} \]
\[ h(n) \leq h^*(n) \tag{5.2} \]
For example, in the 8-puzzle problem a possible heuristic function is to let h(n)
be the total number of tiles displaced. Since we know that any path to the goal
configuration has to make a number of moves at least the total number of displaced
tiles, this is a valid heuristic function. A property of A∗ is that if we choose an
admissible heuristic such as the one above, the algorithm always terminates, finding the optimal path. Of course, the better the heuristic, the faster the algorithm. Now, we give
our insights on how quantum computing ideas can possibly be applied to the A∗
algorithm.
5.2.2
A Quantum Approach?
We notice that A∗ at its heart, is a graph traversal scheme. We know from the previous chapter that quantum walks can also be used as a graph traversal scheme. It is
worthwhile to look at whether a quantum counterpart to the classical A∗ algorithm
can be designed. There already exist quantum walk algorithms for performing a
multi-dimensional grid search which offer speed-up over their classical counterparts. We believe we can apply these algorithms in the A∗ setting to get a quantum
A∗ which is faster than the classical one.
Chapter 6
The Quantum Viterbi
Now we present the quantum Viterbi algorithm developed by us and analyse the
precision and accuracy results on the BNC corpus.
6.1
The Approach
If we look at the trellis of the states on which Viterbi runs, we can say that the
Viterbi algorithm is searching for the ’best’ possible path from the first level of the
trellis to the last. Hence, we could call it a search problem. Classically, we would
have to search each path one after the other and compare their fitness values. Due
to the Markov assumption in the problem, our search is now split into stages. To
make the next step in the path, we still need to search among all possible transitions
from our current position and then choose the best one. The breakthrough quantum
computing brings is the ability to perform computation on many variables simultaneously. For example, in Shor’s algorithm, the routine for finding the period,
applies the Fourier transform on all the states simultaneously. In contrast, classical
period finding algorithms would have to search among the different xa mod n to
find the period.
Hence, a natural approach to a quantum Viterbi would be to try and model
it as a result of an observation over a quantum superposition of the various
paths in the trellis. However, Viterbi is not a random search among all the paths.
There is greater structure to the problem lent by the Markov assumption which
reduces the search over all possible paths to a search over all the possible next
states into which we can transition. So instead of a quantum superposition over all
possible trellis paths, we generate a new quantum superposition of states (possible
transitions) at each level of the trellis. Now, our task is to find an operator or a sequence of operators whose application to a uniform quantum superposition
of the possible transitions from a state will lead to a quantum superposition which
when observed has a very high probability of yielding the desired state.
6.1.1 Can Grover be used?
In the Grover algorithm, we see that the Oracle is a fixed operator which has knowledge of the
solution state and does not vary across iterations. The Grover search
is an O(√N) time algorithm when we have N states to search among. At each
tag, we spend O(T) time to select one out of the T possible paths ending there
from the previous stage. So, at each stage of the classical Viterbi trellis, we spend
O(T^2). Via the Grover search, we wish to bring this time down to O(T√T). But,
in our case, we do not know beforehand which element we are searching for,
as we want to find the maximum among T numbers. To do this, we need to modify
Grover as used in the quantum minimum-finding algorithm. Using this insight, we
next present the quantum Viterbi algorithm.
6.2 The Algorithm
The classical Viterbi algorithm is first presented again for reference.
6.2.1 The Classical Version
Firstly, given a tagged corpus and a tag-set t1, t2, ..., tT of size T, we learn tag-wise probabilities of starting the sentence, πi; tag-to-tag transition probabilities
P(tj|ti); and tag-to-word generation probabilities P(wk|ti). Then, given a sentence w1, w2, ..., wL of length L,
1. Initialize a 2-dimensional vector V of size T × (L + 1) to all zeroes
2. Initialize a 2-dimensional vector B of size T × L to null
3. Set V[i][0] to πi ∀ i ∈ {1, ..., T}
4. For k in 1 to L,
       For j in 1 to T,
           B[j][k] = argmax_i P(tj|ti) × P(wk|ti) × V[i][k − 1]
           m = B[j][k]
           V[j][k] = P(tj|tm) × P(wk|tm) × V[m][k − 1]
5. In the BNC corpus used by us, all sentences end in punctuation marks. So, we
assign the corresponding tag to the last word in the sentence. Say the index
for this tag is p. Then, tag_L = p.
6. Now, we use the back-pointers stored in B to find the path that gave us
maximum score and ended in p.
For k in L − 1 to 1, tag_k = B[tag_{k+1}][k + 1]
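For reference, the procedure above can be written out as the following minimal Python sketch (an illustration only; the learnt tables pi, trans and emit are assumed inputs, and for simplicity the final tag is chosen by score rather than fixed to the punctuation tag of step 5):

    # A minimal sketch of the classical procedure above (illustration only).
    # pi[i] is the start probability of tag i, trans[i][j] = P(t_j | t_i), and
    # emit[i] maps words to P(word | t_i); tags are indexed 0 .. T-1 here.
    def viterbi(words, pi, trans, emit, T):
        L = len(words)
        V = [[0.0] * (L + 1) for _ in range(T)]   # scores, step 1
        B = [[None] * (L + 1) for _ in range(T)]  # back-pointers, step 2
        for i in range(T):                        # step 3
            V[i][0] = pi[i]
        for k in range(1, L + 1):                 # step 4
            w = words[k - 1]
            for j in range(T):
                m = max(range(T),
                        key=lambda i: trans[i][j] * emit[i].get(w, 0.0) * V[i][k - 1])
                B[j][k] = m
                V[j][k] = trans[m][j] * emit[m].get(w, 0.0) * V[m][k - 1]
        tags = [max(range(T), key=lambda j: V[j][L])]  # best final tag (cf. step 5)
        for k in range(L, 1, -1):                      # step 6: follow back-pointers
            tags.append(B[tags[-1]][k])
        return list(reversed(tags))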
6.2.2 Quantum exponential searching
This search algorithm [15] receives as input a superposition of N states, of which
say t are marked. It gives as output one of those t states. The steps are:
1. Initialize m = 1 and λ = 6/5.
2. Choose j uniformly at random from the whole numbers smaller than m.
3. Apply j iterations of Grover's algorithm starting from the initial state |ψ0⟩ = (1/√N) Σ_i |i⟩.
4. Observe the register; let i be the outcome. If i is a marked state, the problem is solved; exit and return i.
5. Otherwise, set m to min(λm, √N) and go back to step 2.
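The behaviour of this routine can be mimicked classically by sampling the measurement outcome from the known success probability of j Grover iterations, sin²((2j + 1)θ) with sin²(θ) = t/N; the sketch below is such a simulation (an illustration of ours, not code from [15]):

    import math
    import random

    # Classical simulation of the exponential search (illustration only): after j
    # Grover iterations on N items with t marked, a measurement yields a marked
    # item with probability sin^2((2j + 1) * theta), where sin^2(theta) = t / N.
    def exponential_search(N, marked):
        theta = math.asin(math.sqrt(len(marked) / N))
        m, lam = 1, 6 / 5
        while True:
            j = random.randrange(int(m))                       # step 2
            if random.random() < math.sin((2 * j + 1) * theta) ** 2:
                return random.choice(sorted(marked))           # steps 3-4
            m = min(lam * m, math.sqrt(N))                     # step 5

    print(exponential_search(64, {5, 17}))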
6.2.3 The Grover Iteration
[3] shows that the unitary transform G, defined below, efficiently implements what
we called an iteration above. S0 is an operator that inverts the sign of the coefficient
of the |0⟩ state. Similarly, St inverts the sign of the coefficients of all marked states. T
is defined by its action on the states |0⟩, |1⟩, ..., |N − 1⟩ as

T|j⟩ = (1/√N) Σ_{i=0}^{N−1} (−1)^{i·j} |i⟩

where i·j denotes the bitwise dot product of the two binary strings denoting i and
j. Then the transform G is given by

G = −T S0 T St

Grover considers only the case when N is a power of 2, since the transform T is
well-defined only in this case.
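For concreteness, the following numpy sketch (an illustration only) builds G for N = 8 with a single marked state and checks that repeated application amplifies the amplitude of that state:

    import numpy as np

    n, N, marked = 3, 8, 5                     # 3 qubits, 8 basis states, |5> marked

    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    T = H
    for _ in range(n - 1):                     # T = H (x) H (x) H, the 3-qubit Hadamard
        T = np.kron(T, H)

    S0 = np.eye(N); S0[0, 0] = -1              # inverts the sign of |0>
    St = np.eye(N); St[marked, marked] = -1    # inverts the sign of the marked state
    G = -T @ S0 @ T @ St                       # the Grover iteration

    psi = np.full(N, 1 / np.sqrt(N))           # uniform superposition
    for _ in range(2):                         # about (pi/4) * sqrt(N) iterations
        psi = G @ psi
    print(abs(psi[marked]) ** 2)               # ~0.94: the marked state dominates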
6.2.4 The Quantum Approach to Viterbi
Note that step 4 in 6.2.1 has an inner loop that runs T times and needs to find the
maximum among T quantities each time, leading to the O(T^2) component in the
running time on a classical computer. On a quantum computer, the T quantities
to be compared can be prepared together into a superposition of states in log T
time (because we need log T qubits to represent T states), and then we modify the
Quantum Minimum Algorithm to get a Quantum Maximum Algorithm by changing the < comparison operator to ≥ in step 2a. Since the T quantities for a fixed j, k are
in a superposed state, we use an operator that fetches the required values from the
V table and the probability values learnt during the training phase, simultaneously
for all possible i, in constant time. Then, the triplets are multiplied together to give
the T quantities, again in constant time. Now, the Quantum Maximum Algorithm
takes O(√T) time to find the maximum among the T values, hence
giving the reduction in overall running time from O(T^2 L) to O(T^(3/2) L).
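A classical mock-up of this maximum finding, modelled on the minimum-finding algorithm of Durr and Hoyer [16] with the comparison flipped (an illustration, not our simulation code), looks as follows: a threshold index is kept and repeatedly replaced by an index holding a larger value, which a quantum device would locate via the exponential search of 6.2.2.

    import random

    # Classical mock-up of quantum maximum finding (illustration only): a real
    # device would find an index in `better` via the exponential search of 6.2.2.
    def quantum_maximum(values, max_rounds=10):
        y = random.randrange(len(values))        # random initial threshold index
        for _ in range(max_rounds):
            better = [i for i, v in enumerate(values) if v > values[y]]
            if not better:                       # y already holds the maximum
                break
            y = random.choice(better)            # replace the threshold
        return y

    scores = [0.2, 0.7, 0.1, 0.9, 0.4]
    print(quantum_maximum(scores))               # -> 3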
6.3 Experimental Results
6.3.1 Implementation
We implemented a simulation of the quantum algorithm on a classical computer
and assigned part-of-speech tags to the British National Corpus, which has a 61-strong tagset available at http://www.natcorp.ox.ac.uk/docs/c5spec.html. We padded this with 3 dummy tags in order to work with 64 states, i.e., 6-qubit quantum registers.
Additionally, instead of running the Quantum Minimum Algorithm for a time
of 22.5√N + 1.4 log2 N, we restricted the number of iterations of step 2 to 10,
thus giving a total running time of the same order (because the Quantum Exponential Searching over N states is an O(√N) operation). The rationale behind this
was that the probability of getting the correct result from the Grover search algorithm (of which
the Quantum Exponential Search is a generalization) oscillates with the
number of iterations, rising quickly and peaking periodically. Thus, by performing
a slightly smaller number of iterations, we do not lose out much on precision but save
on execution time. (Note that the simulation of a quantum algorithm on a classical
Turing machine incurs an exponential blow-up in execution time.)
We used smoothing in our implementation, where the tag-to-word probability
is boosted by 10^(-8) for all words. This ensures that a positive probability value is
assigned even to words not present in the training corpus. Pushing up the probabilities of solely the words absent from the training corpus could end up changing
the tags of other words; to avoid this, the tag-to-word probabilities of all words
are increased by the same amount regardless of their original value. Without
smoothing, the algorithm would leave some words untagged. With smoothing, no
word is left untagged, and hence the overall precision and recall values become the same.
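In code, this uniform boost amounts to something like the following (illustrative names, not the exact implementation):

    EPS = 1e-8                                     # the uniform boost used above

    def smoothed_emission(emit, tag, word):
        """emit[tag] maps words seen with `tag` in training to P(word | tag)."""
        return emit.get(tag, {}).get(word, 0.0) + EPS

    print(smoothed_emission({"nn1": {"time": 0.01}}, "nn1", "qubit"))  # -> 1e-08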
6.3.2 Results
Although the quantum Viterbi algorithm is probabilistic, the probability of success can be made high by setting the running time of the algorithm appropriately.
Hence, the precision of the algorithm can be brought as close to that of the classical version as needed. For our implementation the classical version yielded a
precision and recall of 0.9289 whereas the quantum counterpart yielded 0.9067 as
the precision and recall.
Tag    Classical    Quantum      Difference
nn0    0.867388     0.909962     -0.0425745
unc    0.224541     0.258896     -0.0343547
ajc    0.892725     0.897757     -0.00503225
pun    1            1            0
np0    0.808852     0.807513     0.001339
dt0    0.889688     0.887663     0.00202525
vbd    0.999746     0.993134     0.00661225
at0    0.997748     0.989039     0.0087095
pnp    0.9847       0.975611     0.009089
nn2    0.910331     0.897774     0.0125575
dtq    0.988764     0.975926     0.0128385
vhz    0.999684     0.98396      0.0157243
aj0    0.902669     0.884891     0.0177777
vbb    0.992438     0.973092     0.0193463
xx0    0.994644     0.975089     0.0195548
av0    0.859771     0.840036     0.0197355
dps    0.995861     0.974889     0.020972
vm0    0.974927     0.953286     0.0216412
vvn    0.877453     0.855244     0.0222092
avq    0.641234     0.61802      0.0232142
cjc    0.999177     0.975847     0.0233302
ord    0.984701     0.961115     0.0235863
vhg    0.978126     0.953638     0.0244878
cjs    0.83856      0.813751     0.0248092
nn1    0.930382     0.904409     0.0259725
vbz    0.99572      0.96692      0.0288
vhb    0.955453     0.926358     0.0290948
crd    0.937628     0.907157     0.0304707
vbn    1            0.967927     0.0320727
vvz    0.839655     0.807522     0.0321325
pnq    0.983776     0.948785     0.03499
vdd    1            0.96372      0.03628
prp    0.965542     0.928909     0.0366335
prf    0.997954     0.959898     0.0380552
vvi    0.895492     0.854277     0.0412152
vdg    0.918877     0.877193     0.0416842
avp    0.736335     0.692918     0.0434172
vvg    0.863298     0.816867     0.046431
vvd    0.859617     0.813175     0.0464418
pnx    0.945212     0.895753     0.049459
cjt    0.964409     0.914131     0.050278
vhd    0.98311      0.928726     0.054384
pni    0.881366     0.826813     0.0545532
vdn    0.996154     0.941558     0.0545952
to0    0.966268     0.909248     0.0570202
vbg    0.982608     0.921845     0.0607627
vbi    0.997867     0.928526     0.0693417
ex0    0.968648     0.89689      0.0717575
vvb    0.590293     0.512389     0.077904
vhi    0.916013     0.836266     0.0797468
zz0    0.532633     0.444551     0.088082
vdz    0.997071     0.89678      0.100291
ajs    0.878247     0.773294     0.104954
vdb    0.97538      0.85216      0.12322
vhn    0.817375     0.616667     0.200709
itj    0.615165     0.372024     0.243142
vdi    0.926695     0.628504     0.298191

Table 6.1: Tag-wise comparison of precision values obtained by
both the classical and quantum Viterbi algorithms.
Tag    Classical    Quantum      Difference
vdg    0.964209     0.985294     -0.0210847
ajs    0.955312     0.969154     -0.0138413
xx0    0.958409     0.96795      -0.009541
vdz    0.970937     0.98         -0.00906325
cjt    0.887344     0.894839     -0.00749475
dps    0.984227     0.986229     -0.00200175
pun    1            1            0
vdn    1            1            0
vbd    0.989565     0.988928     0.0006365
dtq    0.991827     0.989443     0.002385
crd    0.976715     0.972657     0.00405825
at0    0.961239     0.956836     0.0044035
vbi    0.989855     0.984467     0.005388
cjc    0.991317     0.985674     0.00564275
vbn    0.990903     0.983238     0.00766475
prf    0.979257     0.970943     0.00831375
nn0    0.967757     0.959004     0.008752
vhg    1            0.991071     0.0089285
ajc    0.899988     0.890287     0.0097015
vdd    0.997205     0.9875       0.0097055
vvi    0.923541     0.911903     0.011638
vbz    0.995352     0.982435     0.0129175
ord    0.982694     0.969212     0.0134827
vm0    0.985806     0.971792     0.0140138
prp    0.906497     0.891614     0.0148828
vbg    0.986774     0.971668     0.0151065
nn1    0.906681     0.890365     0.0163158
vvd    0.871606     0.854049     0.0175568
vhd    0.98567      0.967624     0.018046
nn2    0.972807     0.954353     0.0184537
pnq    0.991778     0.969166     0.0226122
np0    0.8974       0.874631     0.022769
vhz    0.991969     0.968629     0.0233397
to0    0.962689     0.939272     0.023417
aj0    0.89451      0.870274     0.0242367
dt0    0.941652     0.909601     0.0320515
pnx    0.964999     0.931361     0.0336375
avq    0.865876     0.832125     0.0337505
vvn    0.827202     0.788522     0.0386807
pni    0.887096     0.847107     0.039989
pnp    0.953722     0.912573     0.0411495
zz0    0.308415     0.266807     0.041608
vhi    0.937314     0.889001     0.0483137
ex0    0.971073     0.921361     0.0497125
vvg    0.88181      0.826416     0.0553935
cjs    0.803035     0.74663      0.0564043
vdi    0.943283     0.866551     0.0767322
vvz    0.936522     0.856728     0.0797933
vhb    0.938971     0.855368     0.0836035
av0    0.901366     0.792911     0.108455
vbb    0.995897     0.869538     0.126359
itj    0.780903     0.65         0.130903
avp    0.809851     0.656315     0.153536
unc    0.730885     0.569445     0.161441
vvb    0.736583     0.535853     0.20073
vdb    0.965909     0.6866       0.279309
vhn    0.767851     0.440196     0.327655

Table 6.2: Tag-wise comparison of recall values obtained by both
the classical and quantum Viterbi algorithms.
Tag1   Tag2   Classical     Quantum       Difference
vdi    vdb    0.0696285     0.323111      -0.253482
itj    av0    0.0927187     0.217262      -0.124543
itj    vvn    0.0349679     0.125         -0.0900321
zz0    pnp    0.0383333     0.115625      -0.0772917
ex0    av0    0.0264066     0.0976915     -0.0712849
vdb    vdi    0.0246204     0.0939719     -0.0693515
itj    nn2    0.0149679     0.0833332     -0.0683653
vhn    vhd    0.182625      0.25          -0.0673752
vbi    vbb    0.00183934    0.0647824     -0.062943
vvb    nn1    0.261705      0.32404       -0.0623347
vhi    vhb    0.0825578     0.143788      -0.0612307
vbg    nn1    0.0173924     0.078155      -0.0607626
to0    prp    0.0329976     0.0854874     -0.0524897
vdn    vvn    0             0.0508658     -0.0508658
vhn    av0    0             0.05          -0.05
cjt    dt0    0.0344765     0.0779157     -0.0434392
nn0    nn0    0.867388      0.909962      -0.0425745
vdg    nn1    0.0811227     0.122807      -0.0416843
itj    vdd    0             0.0416668     -0.0416668
vhn    at0    0             0.0416668     -0.0416668
vhn    vvn    0             0.0416668     -0.0416668
vvd    vvn    0.0992355     0.137942      -0.0387068
ajs    nn1    0.0174963     0.0561497     -0.0386534
prf    av0    0.00163094    0.0387577     -0.0371267
pnx    dt0    0.0209213     0.0558179     -0.0348965
zz0    at0    0.256993      0.291853      -0.0348605
unc    unc    0.224541      0.258896      -0.0343547
vhd    vhn    0.0159773     0.0497156     -0.0337384
ajs    av0    0.0565788     0.0900432     -0.0334643
itj    at0    0.0687203     0.10119       -0.0324701
pnq    np0    0.00742375    0.0397013     -0.0322776
vbz    pnp    0.00165125    0.0315426     -0.0298913
zz0    np0    0.0877019     0.11672       -0.0290181
avp    prp    0.232996      0.260796      -0.0278002
ajs    aj0    0.0349714     0.0626435     -0.0276721
vdb    vbz    0             0.0263157     -0.0263157
vhb    vhi    0.0433937     0.0693684     -0.0259747
vdd    vbz    0             0.0252422     -0.0252422
vdz    vm0    0             0.0240275     -0.0240275
vdi    vvi    0.00183823    0.0241926     -0.0223543
vdz    vdb    0             0.0217391     -0.0217391

Table 6.3: Tag-to-tag pairwise comparison of confusion values obtained by both the classical and quantum Viterbi algorithms. Listed
above are those pairs for which the quantum algorithm increased
the confusion by at least 2% w.r.t. the classical one.
6.3.3 Tag-wise Precision and Recall Analysis
For the tags NN0, UNC and AJC the quantum Viterbi algorithm yielded a higher
precision value. And for the tags VDG, AJS, XX0 and VDZ the quantum algorithm
gives a higher recall. On the other hand, there are also a number of tags for which
the quantum algorithm performs particularly badly. For the tags VDI, ITJ, VHN,
VDB, AJS and VDZ the quantum algorithm gave a significantly lower precision
value. And for the tags VHN, VDB, VVB, AVP, ITJ, AV0, CJS and PNP the recall
is significantly lowered by the quantum algorithm.
The quantum Viterbi is a probabilistic algorithm and does badly in the case of
tags with specific word forms. For example, VDI is the infinitive form of the verb
DO, i.e. ’do’ whereas VDZ is the -s form of verb DO, i.e. does. These are word
forms with specific tags and the classical algorithm yields high accuracy in these
cases. Here, the introduction of randomization by the quantum algorithm ends up
lowering the accuracy significantly. For the tags VDI, VHN, VDB, VDZ, VHI,
VVB, VVI and EX0, the quantum implementation yielded a precision which was
at least 6% lower than that given by the classical algorithm.
It is worth noticing that AV0 and AVP, both adverb forms, suffered losses of
10-15% in recall on using the quantum Viterbi, which suggests that for these,
the probabilistic quantum algorithm for maximum-finding doesn’t do as well as
the deterministic classical version on scores for words tagged AV0 or AVP. One
possible reason is that the values among which maximum is to be determined are
close to each other, in which case a deterministic algorithm will go through but a
probabilistic one has higher chances of failing. This is confirmed by the precision
values being on the lower side (0.73 and 0.85) for these tags in the classical algorithm.
6.3.4 Concluding Remarks
The current implementation of the quantum Viterbi algorithm does just 2% worse
on overall precision than its classical counterpart. We know that the Grover search
algorithm gives maximum accuracy when the running time is ∼ (π/4)√(N/M), where
M is the number of solutions. Here, M = 1 and N = 64, the closest power of
2 to the number of tags (61). The accuracy will increase marginally as we inch
closer to this value of the running time. The larger gain is in the reduction in time
complexity of the algorithm. Of course, this is small when N is of the order of
just 2^6. To see an application where such a decrease can have a significant impact
on the overall algorithm, we delve into the problem of Machine Translation among
close languages in the next chapter.
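As a quick illustration of this formula for the setting above:

    import math

    N, M = 64, 1                                  # 64 padded states, one solution
    print(math.pi / 4 * math.sqrt(N / M))         # ~6.28 Grover iterations at the peak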
Chapter 7
Machine Translation among Close Languages
7.1 Machine Translation
7.1.1 What is machine translation?
Machine translation (MT) is the translation of text by a computer, with no human
involvement. Pioneered in the 1950s, machine translation can also be referred to
as automated translation, automatic or instant translation. On a basic level, MT
performs simple substitution of words in one natural language for words in another, but that alone usually cannot produce a good translation of a text because
recognition of whole phrases and their closest counterparts in the target language
is needed. Solving this problem with corpus and statistical techniques is a rapidly
growing field that is leading to better translations, handling differences in linguistic
typology, translation of idioms, and the isolation of anomalies.
7.1.2 How does machine translation work?
There are two types of machine translation system:
• Rule-based systems use a combination of language and grammar rules plus
dictionaries for common words. Specialist dictionaries are created to focus on certain industries or disciplines. Rule-based systems typically deliver
consistent translations with accurate terminology when trained with specialist dictionaries. The basic approach involves linking the structure of the
input sentence with the structure of the output sentence using a parser and
an analyser for the source language, a generator for the target language, and
a transfer lexicon for the actual translation. Its biggest downfall is that everything must be made explicit: orthographical variation and erroneous input
must be made part of the source language analyser in order to cope with
it, and lexical selection rules must be written for all instances of ambiguity.
Adapting to new domains in itself is not that hard, as the core grammar is
the same across domains, and the domain-specific adjustment is limited to
lexical selection adjustment.
• Statistical systems have no knowledge of language rules. Instead they ”learn”
to translate by analysing large amounts of data for each language pair. They
can be trained for specific industries or disciplines using additional data
relevant to the sector needed. Typically statistical systems deliver more
fluent-sounding but less consistent translations. Google Translate and similar statistical translation programs work by detecting patterns in hundreds
of millions of documents that have previously been translated by humans
and making intelligent guesses based on the findings. Generally, the more
human-translated documents available in a given language, the more likely
it is that the translation will be of good quality. SMT's biggest downfalls
include its dependence on huge amounts of parallel text, its problems with morphology-rich languages (especially when translating into such
languages), and its inability to correct singleton errors.
7.1.3 Advantages of machine translation
Some advantages, owing to which research in this field should be pursued, are:
• When time is a crucial factor, machine translation can save the day. The
software can translate content quickly and provide a quality output to the
user in no time at all.
• The next benefit is that it is comparatively cheap. It might look like an unnecessary investment but in the long run it is a very small cost considering
the return it provides.
• Confidentiality: giving sensitive data to a translator might be risky, whereas
with machine translation your information is protected.
7.2 Similarity to POS tagging for close languages
Among pairs of close languages such as Hindi and Urdu, which almost follow word-for-word translation, we can treat the words of one language as part-of-speech tags
for the corresponding words of the other and then simply employ the Viterbi POS-tagging algorithm to obtain machine translation. Note that here the number of
states involved in the Viterbi trellis, i.e., T, would be of the order of 10^4, and hence
an O(T^(3/2) L) algorithm will be much more efficient than an O(T^2 L) one. This is
a problem where a quantum version of the POS-tagger can drastically bring down
computing time, if deployed on a quantum computer.
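The following small sketch (an illustration, assuming word-for-word alignment) shows how such a parallel corpus can be recast as tagging data, with each urdu word acting as the "tag" of the aligned hindi word; the example pair is taken from the phrase-book corpus of Section 7.4.1:

    from collections import Counter

    # Illustration: recasting a word-aligned parallel corpus as tagging data,
    # with each urdu word acting as the "tag" of the hindi word aligned to it.
    def build_counts(parallel):              # parallel: list of (hindi_words, urdu_words)
        start, trans, emit = Counter(), Counter(), Counter()
        for hindi, urdu in parallel:         # assumes word-for-word alignment
            start[urdu[0]] += 1
            for prev, cur in zip(urdu, urdu[1:]):
                trans[(prev, cur)] += 1      # tag-to-tag (urdu-to-urdu) counts
            for h, u in zip(hindi, urdu):
                emit[(u, h)] += 1            # "tag" u generates hindi word h
        return start, trans, emit

    corpus = [("subh prabhat".split(), "subha bakhair".split())]
    print(build_counts(corpus))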
7.2.1 The izafat phenomenon
Talking of Hindi-Urdu translation, it is imperative to discuss the phenomenon of
izafat, a feature of Urdu orthography derived from Persian. Most of the time, it
indicates either description or possession, which are explained via examples below:
1. To indicate that the word following the izafat describes the word preceding
it. That is, the second word is being used as an adjective. So, whereas
the normal adjective noun pair is simply adjective + noun, the structure of
the descriptive izafat construct is the other way around: noun + izafat +
adjective. Example: mughal means ’Moghul’, and azam means ’greatest’.
To say ’greatest Moghul’ using the izafat, we would say: mughal-i-azam
2. To express the idea that the word preceding the izafat is possessed by the word
following it. In other words, it does the same thing as ka, ke and ki from
Hindi but in the reverse order. For instance, the word gham (noun) means
’sadness’, and the word dil (noun) means ’heart’. If we wanted to say ’the
heart's sadness' in regular Urdu, we would put the correct form of ka, ke or
ki in, and say dil ka gham. But if we wanted to say the same thing using an
izafat construction, we would reverse the order of the two nouns, and stick
the izafat between them: gham-i-dil.
Hindi-Urdu translation is then like a POS-tagging problem, modulo izafat,
which can be dealt with to some extent in post-processing.
7.3 Phrase-Book Translation
A phrase book is a collection of ready-made phrases, usually for a foreign language
along with a translation, indexed and often in the form of questions and answers.
To test our modelling of statistical machine translation among close languages as
a part-of-speech problem, we use urdu-english and hindi-english phrasebooks and
build a corpus from the translations for English sentences that are common in both
phrasebooks.
7.4 Experiments and Results
7.4.1 Training corpus
We built a parallel corpus of 54 sentences, containing 237 distinct urdu words,
which act as tags for our Viterbi algorithm. A few examples:
1. urdu: mukhtalif ravaayat aur akaaeed ke log ek saath aate hai
hindi: vibhinn paramparaaon aur dharmon ke log ek saath aate hai
2. urdu: bambaari se unka achha hona imkaan nahi hai
hindi: bambaari se unka bhala hona sambhaavit nahin hai
3. urdu: Sadar Bush ne shayad wohi hasil kiya jisme woh sabase zyada dilchaspi
rakhte the
hindi: Rashtrapati Bush ne sambhavtah wahi paaya jisme woh sabse adhik
dilchaspi rakhte the
4. urdu: jab tak hum kaarrawaahi shuru nahi karenge , hamaare shahari jamhooriyat
ki taakat me ummeed khote rahenge
hindi: jab tak hum kaarrawaahi aarambh nahi karenge , hamaare naagrik
loktantra ki shakti me aashaa khote rahenge
5. urdu: hume ilaakaai ta-aavun aur yakajahatee ko mazboot banaane ke amal
ki raaftaar badaana chaahiye
hindi: hume kshetriya sahayog aur ekeekaran ko mazboot banaane ki prakriyaa
ki gati badaana chaahiye
6. urdu: hukoomaton aur bain-ul-akvaamee tanzeemon ko un taaleem policiyon
ki himaayat karna chahiye jinka moassar hona saabit kiya gaya hai
hindi: sarkaaron aur antarraashtriya sangathanon ko un shiksha neetiyon ka
samarthan karna chahiye jinka prabhavi hona siddh kiya gaya hai
7. urdu: ek hoshiyaar nayi takneek chand hafte kabal ek akhbaar me bayaan ki
gayi thi
hindi: ek nayi chatur takneek kuchh saptaah poorv ek samaachaar-patr me
varnit ki gayi thi
8. urdu: isme koi tajzub nahi ki bahut log khud ko anmol samajhte hai
hindi: isme koi aashchary nahi ki kai log khud ko amoolya samajhte hai
9. urdu: khush-amdid
hindi: svagat
10. urdu: ap ka taluq kahan se hai
hindi: ap ka vaasta kaha se hai
11. urdu: subha bakhair
hindi: subh prabhat
12. urdu: kya zara ahistah kehenge
hindi: kripaya thode dhire boliye
7.4.2 Issues
Since the corpus is not too dense, we face the following issues:
• There are 54 sentences and 237 tags. Hence, for most tags, the probability
that they can start a sentence, obtained via the usual learning routine, is 0.
When we construct a test-corpus sentence starting with a word that does not
occur at the beginning of any sentence in the training corpus, the score for
the correct path of tags stays at 0 in the Viterbi algorithm. Hence, we add a
small constant, 0.005, to the start-probability of each tag.
• Since most tags have only one hindi word corresponding to them in the training corpus, the tag→word probabilities in many cases turn out to be zeroes,
often leaving just a single non-zero value in the vector in which the maximum is
to be found. This takes away the possibility of a hindi word being translated as different urdu words that might not all have occurred as instances in
the training corpus. Since most words do have a one-to-one mapping and, if
found in the corpus, mostly come with their correct counterparts, we use a small
smoothing factor of only 0.0001.
• Since urdu words from the training corpus are the tags themselves, the tag→tag
transition probabilities are zero for most pairs. So, if we build sentences
using hindi words from different sentences and ask for a translation, it is
highly likely that their corresponding urdu words would never have been
adjacent to each other in the training corpus. This issue is tackled by adding
a smoothing factor of 0.1 to all the transition probabilities. This factor is
quite large because we wish to account for how little information the corpus
contains regarding which words can follow a particular word in the urdu
language (a sketch of these adjustments is given below).
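The sketch below uses illustrative dictionary names (not our actual code) and simply adds the three constants to the learnt tables:

    # Illustrative smoothing of the learnt tables (hypothetical names).
    START_SMOOTH = 0.005    # added to every tag's sentence-start probability
    EMIT_SMOOTH = 0.0001    # added to every tag-to-word probability
    TRANS_SMOOTH = 0.1      # added to every tag-to-tag transition probability

    def smooth(start, trans, emit, tags):
        """start[t], trans[(a, b)] and emit[(t, w)] are the raw probabilities
        learnt from the parallel corpus; tags is the set of urdu words."""
        start_s = {t: start.get(t, 0.0) + START_SMOOTH for t in tags}
        trans_s = {(a, b): trans.get((a, b), 0.0) + TRANS_SMOOTH
                   for a in tags for b in tags}
        emit_s = lambda tag, word: emit.get((tag, word), 0.0) + EMIT_SMOOTH
        return start_s, trans_s, emit_s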
Since the tagset has increased in size from 57 to 237 as compared to the experiments on the BNC corpus, we increase the number of iterations in step 2 of the
Quantum Minimum Algorithm from 10 to 20.
7.4.3 Results
For testing purposes, we use a manually constructed corpus of 11 hindi sentences
containing words occurring in the training data, and run the quantum Viterbi. The
inputs and their corresponding outputs are as follows:
1. hindi: saubhagya
urdu: allah-ka-fazal-ho
2. hindi: ratri me milenge
urdu: bakhair me ek
3. hindi: namaste ap ka svagat hai
urdu: mein ap se khush-amdid hay
4. hindi: sone ka moolya seemit nahin hai
urdu: sone ki tanzeemon tajzub nahi hai
5. hindi: hamaare naagrik loktantra ki shakti me aashaa rakhte the
urdu: hamaare bain-ul-akvaamee jamhooriyat ki mukaable me Sweden rakhte
the
6. hindi: sriman ko mera dhanyavad
urdu: sahib ko anmol shukriya
7. hindi: ap ne acchi sehat ke lie ek takneek varnit ki thi
urdu: ap ho ache sehat safr leyae amreeki takneek amal ki thi
8. hindi: ruko mai samajha nahin thode dhire boliye
urdu: roko hain samjha nahi zara the kehenge
9. hindi: pradhanmantri ko prabhavi sangathanon ka samarthan karna chahiye
urdu: vazeer-e-aazam ko sath kitne ka himaayat isme chahiye
10. hindi: mai tum se kaphi ummeed karta hu
urdu: main ap shab kafi tavakko hum hoon
11. hindi: amreeki kaanoon saral nahin hai
urdu: amreeki kaanoon aasaan nahi hai
7.4.4 Analysis
We observe that many words are translated correctly individually but the sentence
as a whole goes wrong because of some incorrect tags. This is a disadvantage of
modelling Machine Translation as POS-tagging using the 1-level Markov Model
as we lose context beyond the next word in the sentence.
Looking at specific examples, we see that ratri me milenge gets translated as
bakhair me ek. The translation goes wrong in the first word itself. This is because
the only instance of ratri (meaning: night) in the training corpus is when subh ratri
(meaning: good night) is translated as shab bakhair where subh corresponds to
bakhair and ratri to shab, i.e., the order gets interchanged. Our algorithm learns
bakhair as the translation for ratri and uses it when run on the test data, yielding
wrong output.
sriman ko mera dhanyavad is translated as sahib ko anmol shukriya instead of
sahib ko mera shukriya because the words ko and mera do not occur adjacent to
each other in training data while ko and anmol do, hence the tag→tag transition
probability is higher in the latter case.
ke to safr, aashaa to Sweden, ek to amreeki, etc. are some examples of absurd
translations that we come across. This is due to the fact that the tag→tag transition
probabilities have been shifted by quite a bit due to the smoothing factor of 0.1
and hence there are paths in the Viterbi trellis that should not get such high scores
but are now being assigned them. This hints towards using a different learning
algorithm for the smoothing factor itself, which is something to be looked into in
future work.
Chapter 8
Conclusions
Over the course of the past two semesters, we have come across various intriguing
and novel aspects of both quantum computing and natural language processing. We
have studied why a quantum computer allows us to gain an exponential speed-up
over the classical one in many cases, the reason being its ability to operate over all
states in one quantum step, using the concept of superposition of qubit states. Also,
a recurring feature in our study of various quantum algorithms has been the Fourier
Transform, which is closely related to the problem of finding hidden subgroups of Abelian groups.
The Grover algorithm is a quantum approach to the search problem which uses
an Oracle that can recognize a solution and uses the Fourier and inverse Fourier
Transforms iteratively to propel the probability of the desired state upwards.
We have also investigated a whole array of classical optimization techniques
and gained an in-depth understanding of their working. The quantum Viterbi algorithm has been thoroughly studied by us and we have implemented the same on
the BNC corpus, achieving satisfactory results, just 2% shy of the classical precision. To show applicability to real-life tasks where a time reduction from O(T 2 )
to O(T 3/2 ) would be significant, we have dealt with the problem of statistical machine translation among close languages, which can be modelled as a POS-tagging
problem with the words of one language behaving as tagset. |T | here would therefore be of the order of thousands.
There are other classical search techniques too wherein ideas can be sought
from the quantum realm. For example, quantum random walks could potentially be used for
the A-star algorithm. These and quantum versions for other classical algorithms
presented in this report, could be investigated further.
Chapter 9
Future Work
We have done an extensive literature survey on various classical optimization techniques in this thesis. We have also looked at various quantum computing techniques. We further developed a quantum version of the Viterbi algorithm. There
are numerous other problems open for study as to how they can be implemented
on a quantum computer more efficiently. If the advent of commercial quantum
computers does occur, then quantum algorithms being developed as such will find
great applicability in all areas where computation is done. Among the other problems we are looking at for quantum counterparts are the A∗ search algorithm and
quantum gradient descent. For the A∗ search problem, we are investigating the
work done on quantum walks and their algorithmic applications [14].
Another line of future work is to run simulations of the quantum Viterbi algorithm for machine translation of close languages and analyse the quality of the
results. We have made forays in this area and have presented our results and analysis in this report, but we believe a more extensive analysis of the same can be made
to obtain greater insights into the performance of the algorithm.
References
[1] D. Deutsch and R. Jozsa. Rapid solutions of problems by quantum computation. Proceedings of the Royal Society of London A, 1992.
[2] P. Shor. Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer. In Proc. 35th Annual Symposium on Foundations of Computer Science, 1994.
[3] L.K. Grover. A fast quantum mechanical algorithm for database search. In Proc. 28th Annual ACM Symposium on Theory of Computing (STOC '96), page 212, 1996.
[4] J.C.H. Chen. Quantum Computing and Natural Language Processing. Master's Thesis, Universität Hamburg, 2002.
[5] J.N. Darroch and D. Ratcliff. Generalized Iterative Scaling for Log-Linear Models. The Annals of Mathematical Statistics, Volume 43, pages 1470-1480, 1972.
[6] E.T. Jaynes. Information Theory and Statistical Mechanics. Physical Review 106, pages 620-630, 1957.
[7] S. Gambs. Quantum Classification. 2008.
[8] S. Clark, B. Coecke, E. Grefenstette, S. Pulman and M. Sadrzadeh. A quantum teleportation inspired algorithm produces sentence meaning from word meaning and grammatical structure. October 2013.
[9] A. Berger. The Improved Iterative Scaling Algorithm: A gentle introduction. December 1997.
[10] M. Fleischer. Foundations of Swarm Intelligence - From Principles to Practice. 2003.
[11] R. Rosenfeld. A Maximum Entropy Approach to Adaptive Statistical Language Modeling. 1996.
[12] I. Brezina Jr. and Z. Cickova. Solving the Travelling Salesman Problem Using the Ant Colony Optimization. Management Information Systems, Vol. 6, 2011.
[13] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory 13, pages 260-269, 1967.
[14] S. Aaronson and A. Ambainis. Quantum search of spatial regions. Theory of Computing, pages 200-209, 2005.
[15] M. Boyer, G. Brassard, P. Hoyer and A. Tapp. Tight bounds on quantum searching. Fortschritte der Physik, 1998.
[16] C. Durr and P. Hoyer. A quantum algorithm for finding the minimum. http://arxiv.org/abs/quant-ph/9607014, 1996.
[17] A. Ambainis. Quantum walks and their algorithmic applications. http://arxiv.org/abs/quant-ph/0403120, 2008.
Appendix
The BNC Basic (C5) Tagset used for POS tagging
1. AJ0 Adjective (general or positive) (e.g. good, old, beautiful)
2. AJC Comparative adjective (e.g. better, older)
3. AJS Superlative adjective (e.g. best, oldest)
4. AT0 Article (e.g. the, a, an, no) [N.B. no is included among articles, which
are defined here as determiner words which typically begin a noun phrase,
but which cannot occur as the head of a noun phrase.]
5. AV0 General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest). [Note that adverbs, unlike
adjectives, are not tagged as positive, comparative, or superlative. This is because of the relative rarity of comparative and superlative adverbs.]
6. AVP Adverb particle (e.g. up, off, out) [N.B. AVP is used for such ”prepositional adverbs”, whether or not they are used idiomatically in a phrasal verb:
e.g. in ’Come out here’ and ’I can’t hold out any longer’, the same AVP tag
is used for out.]
7. AVQ Wh-adverb (e.g. when, where, how, why, wherever) [The same tag is
used, whether the word occurs in interrogative or relative use.]
8. CJC Coordinating conjunction (e.g. and, or, but)
9. CJS Subordinating conjunction (e.g. although, when)
10. CJT The subordinating conjunction that [N.B. that is tagged CJT when it
introduces not only a nominal clause, but also a relative clause, as in ’the day
that follows Christmas’. Some theories treat that here as a relative pronoun,
whereas others treat it as a conjunction. We have adopted the latter analysis.]
11. CRD Cardinal number (e.g. one, 3, fifty-five, 3609)
12. DPS Possessive determiner (e.g. your, their, his)
13. DT0 General determiner: i.e. a determiner which is not a DTQ. [Here a
determiner is defined as a word which typically occurs either as the first
word in a noun phrase, or as the head of a noun phrase. E.g. This is tagged
DT0 both in ’This is my house’ and in ’This house is mine’.]
14. DTQ Wh-determiner (e.g. which, what, whose, whichever) [The category
of determiner here is defined as for DT0 above. These words are tagged as
wh-determiners whether they occur in interrogative use or in relative use.]
15. EX0 Existential there, i.e. there occurring in the there is ... or there are ...
construction
16. ITJ Interjection or other isolate (e.g. oh, yes, mhm, wow)
17. NN0 Common noun, neutral for number (e.g. aircraft, data, committee)
[N.B. Singular collective nouns such as committee and team are tagged NN0,
on the grounds that they are capable of taking singular or plural agreement
with the following verb: e.g. ’The committee disagrees/disagree’.]
18. NN1 Singular common noun (e.g. pencil, goose, time, revelation)
19. NN2 Plural common noun (e.g. pencils, geese, times, revelations)
20. NP0 Proper noun (e.g. London, Michael, Mars, IBM) [N.B. the distinction
between singular and plural proper nouns is not indicated in the tagset, plural
proper nouns being a comparative rarity.]
21. ORD Ordinal numeral (e.g. first, sixth, 77th, last) . [N.B. The ORD tag is
used whether these words are used in a nominal or in an adverbial role. Next
and last, as ”general ordinals”, are also assigned to this category.]
22. PNI Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody)
[N.B. This tag applies to words which always function as [heads of] noun
phrases. Words like some and these, which can also occur before a noun
head in an article-like function, are tagged as determiners (see DT0 and AT0
above).]
23. PNP Personal pronoun (e.g. I, you, them, ours) [Note that possessive pronouns like ours and theirs are tagged as personal pronouns.]
24. PNQ Wh-pronoun (e.g. who, whoever, whom) [N.B. These words are tagged
as wh-pronouns whether they occur in interrogative or in relative use.]
25. PNX Reflexive pronoun (e.g. myself, yourself, itself, ourselves)
26. POS The possessive or genitive marker ’s or ’ (e.g. for ’Peter’s or somebody
else’s’, the sequence of tags is: NP0 POS CJC PNI AV0 POS)
27. PRF The preposition of. Because of its frequency and its almost exclusively
postnominal function, of is assigned a special tag of its own.
28. PRP Preposition (except for of) (e.g. about, at, in, on, on behalf of, with)
29. PUL Punctuation: left bracket - i.e. ( or [
30. PUN Punctuation: general separating mark - i.e. . , ! , : ; - or ?
31. PUQ Punctuation: quotation mark - i.e. ’ or ”
32. PUR Punctuation: right bracket - i.e. ) or ]
33. TO0 Infinitive marker to
34. UNC Unclassified items which are not appropriately classified as items of the
English lexicon. [Items tagged UNC include foreign (non-English) words,
special typographical symbols, formulae, and (in spoken language) hesitation fillers such as er and erm.]
35. VBB The present tense forms of the verb BE, except for is, ’s: i.e. am, are,
’m, ’re and be [subjunctive or imperative]
36. VBD The past tense forms of the verb BE: was and were
37. VBG The -ing form of the verb BE: being
38. VBI The infinitive form of the verb BE: be
39. VBN The past participle form of the verb BE: been
40. VBZ The -s form of the verb BE: is, ’s
41. VDB The finite base form of the verb DO: do
42. VDD The past tense form of the verb DO: did
43. VDG The -ing form of the verb DO: doing
44. VDI The infinitive form of the verb DO: do
45. VDN The past participle form of the verb DO: done
46. VDZ The -s form of the verb DO: does, ’s
47. VHB The finite base form of the verb HAVE: have, ’ve
48. VHD The past tense form of the verb HAVE: had, ’d
49. VHG The -ing form of the verb HAVE: having
50. VHI The infinitive form of the verb HAVE: have
51. VHN The past participle form of the verb HAVE: had
52. VHZ The -s form of the verb HAVE: has, ’s
53. VM0 Modal auxiliary verb (e.g. will, would, can, could, ’ll, ’d)
54. VVB The finite base form of lexical verbs (e.g. forget, send, live, return)
[Including the imperative and present subjunctive]
55. VVD The past tense form of lexical verbs (e.g. forgot, sent, lived, returned)
56. VVG The -ing form of lexical verbs (e.g. forgetting, sending, living, returning)
57. VVI The infinitive form of lexical verbs (e.g. forget, send, live, return)
58. VVN The past participle form of lexical verbs (e.g. forgotten, sent, lived,
returned)
59. VVZ The -s form of lexical verbs (e.g. forgets, sends, lives, returns)
60. XX0 The negative particle not or n’t
61. ZZ0 Alphabetical symbols (e.g. A, a, B, b, c, d)