Statistical aspects of inferring Bayesian networks from marginal observations

Master's thesis at the Faculty of Mathematics and Physics of the Albert-Ludwigs-Universität Freiburg, submitted by Kai von Prillwitz under the supervision of Prof. David Gross, 23 October 2015.

Abstract

We investigate statistical aspects of inferring compatibility between causal models and small data samples. The causal models considered here include hidden variables, which complicate the task of causal inference. A proposed causal model can be rejected as an explanation for generating the data if the empirical distribution (of the observable variables) differs significantly from the distributions compatible with the model. The hypothesis tests employed here are based on inequality constraints constituting outer approximations to the true set of distributions compatible with the model. We start by working with inequalities in a recently developed entropic framework and implement equally recent techniques of entropy estimation. In a second step we derive and implement analogous constraints on the level of certain generalized covariance matrices. In contrast to actual covariances, these matrices are independent of the alphabets (the outcome values) of the variables. Furthermore, we distinguish two different approaches to hypothesis testing. Our methods are demonstrated by an application to real empirical data, the so-called ‘iris (flower) data set’.

Zusammenfassung (Summary)

In this thesis we investigate statistical aspects of determining the compatibility of causal models with small data sets. The causal models under consideration contain hidden, unmeasurable variables, which makes the task additionally difficult. A given causal model can be ruled out as an explanation for the observed data if the empirical probability distribution (of the observable quantities) differs significantly from the distributions compatible with the model. The hypothesis tests carried out here are based on inequalities that constitute an outer approximation to the true set of compatible distributions. In a first step we use recently developed inequalities based on entropies of the probability distributions; the entropy estimation methods we employ are equally recent. In a second step we derive novel inequalities based on matrices that can be regarded as a generalization of covariances and that, unlike covariances, are independent of the values actually taken by the variables. In addition, we examine two different approaches to the hypothesis tests. As an application of our methods to real data we consider the so-called ‘iris flower data set’.

Contents

1 Introduction
  1.1 Philosophical and mathematical background
  1.2 Outline
2 Basic concepts
  2.1 Introduction to probability theory
    2.1.1 Discrete random variables
    2.1.2 Joint and marginal distributions
    2.1.3 Conditional probabilities and (conditional) independence
    2.1.4 Expected value, variance and covariance
  2.2 Bayesian networks
    2.2.1 Markov condition
    2.2.2 Faithfulness assumption
    2.2.3 Hidden variables
    2.2.4 Hidden common ancestor models
  2.3 Information theory
    2.3.1 Shannon entropy
    2.3.2 Joint and conditional entropy
    2.3.3 Mutual information
  2.4 Hermitian and positive semidefinite matrices
    2.4.1 Definitions and notation
    2.4.2 Inverse, pseudoinverse and other functions
    2.4.3 Projections
3 Testing entropic inequalities
  3.1 Entropic inequality constraints
  3.2 Entropy estimation
    3.2.1 Introduction to estimators
    3.2.2 Maximum likelihood estimation
    3.2.3 Minimax estimation
    3.2.4 MLE and minimax estimator for entropy
    3.2.5 Comparison of MLE and minimax estimator for entropy
    3.2.6 Conclusion
  3.3 Hypothesis tests
    3.3.1 Introduction to hypothesis tests
    3.3.2 Direct approach
    3.3.3 Indirect approach (bootstrap)
    3.3.4 Additional inequalities
    3.3.5 Summary
4 Tests based on generalized covariance matrices
  4.1 Introduction
  4.2 Encoding probability distributions in matrices
    4.2.1 One- and two-variable matrices
    4.2.2 The compound matrix
  4.3 The inequality
    4.3.1 Motivation by covariances for the triangular scenario
    4.3.2 General inequality for hidden common ancestor models
    4.3.3 An equivalent representation
    4.3.4 Covariances revisited
  4.4 Proving the inequality
    4.4.1 Invariance under local transformations
    4.4.2 Proof for a special family of distributions
    4.4.3 Counter example
    4.4.4 Generating the whole scenario by local transformations
    4.4.5 Brief summary of the proof
  4.5 Comparison between matrix and entropic inequality
    4.5.1 Analytical investigation
    4.5.2 Numerical simulations
    4.5.3 Hypothesis tests
    4.5.4 Summary
5 Application to the iris data set
  5.1 The iris data set
  5.2 Discretizing the data
  5.3 Proposing a model
  5.4 Rejecting the proposed model
6 Conclusion and outlook
A Generalized proof of Theorem 4.1
  A.1 Proof for a special family of distributions
    A.1.1 General notation
    A.1.2 The proposition
    A.1.3 The proof
  A.2 Locally transforming A^j = {A^j_x}_x → A^j_0
B Proof of Corollary 4.2

1 Introduction

The scope of this thesis is causal inference, the mathematical theory of ‘what causes what’. Even though causal inference, or more generally the concept of causation, is basic to human thinking and has a long philosophical history, a solid mathematical theory has long been missing. Note that the following brief overview of the philosophical background is mainly based on the two secondary sources [1] and [2]. Similarly, large parts of the mathematical history are based on the epilogue of [3]. (Quotes from primary sources have been adopted from these secondary sources; at each quote we give a reference to the supposed primary source, if available, and indicate in parentheses the secondary source in which the quote was found.)

1.1 Philosophical and mathematical background

Philosophical theories about causation date back at least to Aristotle, according to whom “we do not have knowledge of a thing until we have grasped its why, that is to say, its cause” [4] (quoted from [1]). Aristotle distinguishes four fundamental ‘causes’, or answers to ‘why’ questions: the material cause (“that out of which”), the formal cause (“the form”), the efficient cause (“the primary source of the change or rest”) and the final cause (“the end, that for the sake of which a thing is done”) [4] (quoted from [1]). In modern science the term ‘cause’ typically refers to Aristotle’s ‘efficient cause’, as it comes closest to today’s understanding of the phrase ‘X causes Y’. It would seem odd to say that the material or the shape of an object caused the object.

An important work on causation of the modern era is A Treatise of Human Nature [5] by the Scottish philosopher David Hume. Before Hume, the traditional view on causation was predominantly rationalistic: it was assumed that causal relations, being intrinsic truths of nature, could be inferred by pure reasoning. Hume, on the other hand, advocated an empirical theory [2]. “Thus we remember to have seen that species of object we call flame, and to have felt that species of sensation we call heat. We likewise call to mind their constant conjunction in all past instances. Without any farther ceremony, we call the one cause and the other effect, and infer the existence of the one from that of the other” [5] (quoted from [3]). One severe problem of Hume’s theory is that the ‘principle of constant conjunction’ identifies any two regularly co-occurring events as directly causally connected. However, it is also possible that the connection between the two events is due to a common cause. Today, this falls under the concept of ‘spurious correlation’ and is related to the statement ‘correlation does not imply causation’.

The list of philosophers contributing to the discussion about causation is long, including Aquinas, Descartes, Hobbes, Spinoza, Leibniz, Locke, Newton, Kant, Mill and others [2]. But gaining ground in mathematics, or modern science in general, turned out to be more difficult. In 1913 Bertrand Russell wrote: “All philosophers imagine that causation is one of the fundamental axioms of science, yet oddly enough, in advanced sciences, the word ‘cause’ never occurs....
The law of causality, I believe, is a relic of a bygone age, surviving, like the monarchy, only because it is erroneously supposed to do no harm” (quoted from [3]). Karl Pearson, a founder of mathematical statistics, likewise “strongly denies the need for an independent concept of causal relation beyond correlation” in the third edition of his book The Grammar of Science [6], and thereby “exterminated causation from modern statistics before it had a chance to take root” [3].

A major advance was brought about by Sir Ronald Fisher, who established the randomized experiment, a scientific method for testing causal relations based on real data [7]. To illustrate the idea, assume that the efficacy of a new drug is to be tested. From a purely observational study the conclusion is drawn that the drug is beneficial to recovery. But in fact, it might be that both taking the drug and the chance of recovery are independently influenced by a person’s social and financial background. In order to identify the actual effect of the drug, the treatment should be assigned at random, thereby excluding background influences.

Modern theories of causal inference include Granger causality [8] and the Rubin causal model (or Neyman-Rubin causal model) [9, 10]. Granger causality uses temporal information to infer the causal relation between two variables, but, as in Hume’s philosophical theory, the result may be misleading if a third variable is involved. The Rubin causal model measures the causal effect of X on Y for a single unit u, e.g. a person, as the difference in Y (at time t_2) given different treatments X (at time t_1), i.e. Y_{x_1}(u) − Y_{x_0}(u) for binary X. Since at time t_1 a single unit can only be exposed to one of the treatments (e.g. take the drug or not take the drug), only one of the two values can be measured. Sometimes this problem is even called the Fundamental Problem of Causal Inference [11]. A solution could be to measure the other value for a ‘similar’ unit, but then drawing conclusions for the original unit requires additional assumptions.

According to Pearl, there are two reasons for the slow progress of mathematical theories and the caution with which causal inference is often treated.

1. Whereas a causal statement like ‘X causes Y’ is directed, algebraic equations like Newton’s law F = ma are undirected. The equation does not tell us whether the force causes the acceleration or vice versa, as it can be brought into several equivalent forms. Thus, from a purely algebraic description, it is impossible to infer (the direction of) causation.

2. From a fundamental point of view, the causal effect of X on Y can be understood as the change of Y when externally manipulating X. In probability theory, the typical language of causation, however, such manipulative statements cannot be expressed [3].

Pearl’s solution to the second problem is to introduce a completely new calculus into probability theory, the do-calculus [3]. First, he introduces the new expression P(y | do(x)), which is read as ‘the probability that y occurs given that we fix X to x’. This is in general different from the typical conditional probability P(y | x), where x is only observed. Second, Pearl provides several rules for manipulating expressions involving the do-symbol, with the aim of eliminating them from the equation and thus making the final expression evaluable by traditional statistical means [3]. This is a remarkable result, as it means that causal effects can in some cases be inferred from purely observational data.
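To make the difference between observing and intervening concrete, the following short Python sketch works through a hypothetical drug example: a background variable Z influences both whether the drug X is taken and the recovery Y, while the drug itself has no effect at all. All probabilities and variable names are invented purely for illustration; the sketch merely contrasts the observational quantity P(Y = 1 | X = x) with the interventional quantity P(Y = 1 | do(X = x)).

    # Z = background (0/1), X = drug taken (0/1), Y = recovery (0/1); numbers are hypothetical.
    P_Z = {0: 0.5, 1: 0.5}
    P_X1_given_Z = {0: 0.2, 1: 0.8}              # P(X=1 | Z=z): good background -> drug taken more often
    P_Y1_given_XZ = {(0, 0): 0.3, (1, 0): 0.3,   # P(Y=1 | X=x, Z=z): recovery depends on Z only,
                     (0, 1): 0.8, (1, 1): 0.8}   # i.e. the drug has no causal effect on recovery

    def p_x_given_z(x, z):
        return P_X1_given_Z[z] if x == 1 else 1.0 - P_X1_given_Z[z]

    def p_y1_given_x(x):
        # Observational quantity: condition on X in the joint distribution over (Z, X, Y).
        num = sum(P_Z[z] * p_x_given_z(x, z) * P_Y1_given_XZ[(x, z)] for z in P_Z)
        den = sum(P_Z[z] * p_x_given_z(x, z) for z in P_Z)
        return num / den

    def p_y1_do_x(x):
        # Interventional quantity: fixing X by hand removes the influence of Z on X,
        # so Z is simply averaged over its own distribution P(Z).
        return sum(P_Z[z] * P_Y1_given_XZ[(x, z)] for z in P_Z)

    print(p_y1_given_x(1), p_y1_given_x(0))   # 0.70 vs 0.40: the drug *looks* beneficial
    print(p_y1_do_x(1), p_y1_do_x(0))         # 0.55 vs 0.55: intervening reveals no effect

Conditioning makes the drug appear beneficial because taking it is correlated with a favourable background, whereas intervening, which is what a randomized experiment implements physically, reveals that the drug does nothing.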
Note that Fisher’s randomized experiment also uses Pearl’s idea of intervening in the system. Not letting the patients themselves decide whether or not to take the drug, but assigning the treatment by an external rule, corresponds to applying one of the do-statements do(treatment) or do(placebo).

While Pearl’s do-calculus is not the subject of this thesis, the solution of the first problem brings us closer to the framework used here. The idea is to encode the assumptions about the causal relations between the variables in a graphical model, also called a (causal) Bayesian network. These assumptions can be translated into (conditional) independence statements. Of course, one could also list all these independence relations without the graph, but there are at least two reasons for the use of a graphical representation. First, when specifying a model it is much easier to think in terms of graphs, where one can simply connect any two variables one assumes to have a direct causal relation. In particular, larger models can imply non-obvious independence statements that can be obtained algorithmically from the graph. Second, it may (and will) happen that different causal assumptions (on the same set of variables) lead to the same conditional independence relations. For example, the graphs X → Y → Z and X ← Y → Z both imply that X and Z are conditionally independent given Y (written X ⊥⊥ Z | Y), while no other independence relations hold. To distinguish such models, interventions in the spirit of Pearl’s do-calculus are required. Since for different graphs the effect of the intervention might be different (otherwise one could still not distinguish the models), the graphical representation is indeed necessary.

Even though not all models are distinguishable without intervening in the system, one can still obtain some knowledge about the causal structure even without such interventions. This is precisely the subject of this thesis. The whole issue becomes dramatically more challenging if some of the variables are not observable (also called hidden or latent) [12, 13, 14, 15]. Independence statements involving hidden variables cannot be evaluated from the empirical data, which renders testing these statements impossible. The independence relations containing only observable variables, if they exist at all, might carry only little information. Thus, one strives to derive stronger constraints on the marginal distributions of the observed variables, typically in the form of inequalities. If such an inequality is violated by the data, the proposed model can be rejected as an explanation of the data. If no violation is found, one can unfortunately not conclude to have found the one correct model: first, because other models might also be compatible with the data; second, because the inequalities are typically not tight (in the sense that an inequality might be satisfied even though the data are incompatible with the model); and third, because the number of inequalities constraining the model might be very large, so that it is impractical to test all of them.

It was mentioned above that, for example, the models X → Y → Z and X ← Y → Z are indistinguishable. The model X → Y ← Z, on the other hand, implies different constraints, namely that X and Z are unconditionally independent but conditionally dependent given Y. Thus, this model can be distinguished from the other two by purely observational data.
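As a small numerical illustration of these statements, the following Python sketch builds exact joint distributions for a chain X → Y → Z and a collider X → Y ← Z over binary variables (the particular conditional probabilities are arbitrary) and checks the relevant independence relations directly from the joint distribution.

    import itertools, math

    def chain_joint():
        # X -> Y -> Z: Y is a noisy copy of X, Z a noisy copy of Y.
        P = {}
        for x, y, z in itertools.product((0, 1), repeat=3):
            P[(x, y, z)] = 0.5 * (0.9 if y == x else 0.1) * (0.8 if z == y else 0.2)
        return P

    def collider_joint():
        # X -> Y <- Z: X and Z are independent causes, Y is a noisy XOR of both.
        P = {}
        for x, y, z in itertools.product((0, 1), repeat=3):
            P[(x, y, z)] = 0.25 * (0.9 if y == (x ^ z) else 0.1)
        return P

    def marginal(P, keep):
        M = {}
        for outcome, p in P.items():
            key = tuple(outcome[i] for i in keep)
            M[key] = M.get(key, 0.0) + p
        return M

    def indep(P, i, j):
        # Check P(Xi, Xj) = P(Xi) P(Xj) up to numerical precision.
        Pij, Pi, Pj = marginal(P, (i, j)), marginal(P, (i,)), marginal(P, (j,))
        return all(math.isclose(Pij[(a, b)], Pi[(a,)] * Pj[(b,)], abs_tol=1e-12)
                   for (a, b) in Pij)

    def cond_indep(P, i, j, k):
        # Check P(Xi, Xj | Xk = c) = P(Xi | Xk = c) P(Xj | Xk = c) for every value c.
        for c in (0, 1):
            Pc = {o: p for o, p in P.items() if o[k] == c}
            norm = sum(Pc.values())
            if not indep({o: p / norm for o, p in Pc.items()}, i, j):
                return False
        return True

    for name, P in (("chain   ", chain_joint()), ("collider", collider_joint())):
        print(name, "X indep Z:", indep(P, 0, 2), " X indep Z | Y:", cond_indep(P, 0, 2, 1))
    # chain    X indep Z: False  X indep Z | Y: True
    # collider X indep Z: True   X indep Z | Y: False

For the chain (and likewise the fork X ← Y → Z), X and Z are dependent but become independent once Y is given; for the collider the pattern is reversed, which is exactly what makes it distinguishable from the other two structures by observational data alone.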
Testing inequality constraints for real data amounts to a statistical hypothesis test and requires reliable estimation of the involved quantities. Hence, there are two aspects of identifying possible causal models (Bayesian networks) from marginal observations: (1) deriving inequality constraints for the observable marginal distributions of a proposed network, and (2) statistically testing these inequalities. Both aspects are examined in this thesis.

1.2 Outline

The starting point of this thesis is a hypothesis test for a specific causal model, recently proposed in [16]. The test is based on an entropic inequality constraint introduced in the same paper. In addition to the arguably rather disappointing power of the test, a heuristic was used in its construction, so it is not actually known whether the type-I-error rate meets the design rate of 5%. Note that all required concepts are thoroughly introduced later.

The goal of this thesis is threefold. First, we want to improve the hypothesis test from [16], both in terms of its power and in terms of its reliability (by which we mean the control of the type-I-error rate). To this end, we consider recently introduced, advanced techniques of entropy estimation [17, 18], additional entropic inequality constraints that were already introduced in [16] but not implemented in the hypothesis test, and an alternative approach to the hypothesis test itself. As a final means we leave the entropic framework and derive analogous inequality constraints based on certain generalized covariance matrices. While this is motivated by the search for a more powerful hypothesis test, deriving the new type of inequalities is interesting in its own right and can thus be considered the second goal of this thesis. The third goal is an application of the developed hypothesis tests to real empirical data.

The rest of this thesis is organized as follows. The required basic graph-theoretical and mathematical concepts are introduced in Chapter 2. Estimating entropies and constructing hypothesis tests based on entropic inequalities is the subject of Chapter 3. The derivation of the above-mentioned matrix inequalities as well as a comparison to the entropic framework is pursued in Chapter 4. The application to the ‘iris data set’ is presented in Chapter 5. Finally, the thesis is concluded in Chapter 6.

2 Basic concepts

This chapter provides an introduction to the basic mathematical and graph-theoretical concepts required for the rest of the thesis. More specialized concepts will be presented along the text before they are needed. We start with a short overview of probability theory in Section 2.1. This is followed by an introduction to directed acyclic graphs (DAGs), which are used to model causal assumptions, in Section 2.2. In particular, the hidden common ancestor models that are considered throughout the whole thesis are introduced in this section. Section 2.3 provides a brief overview of the information-theoretical concepts that are required for Chapter 3. The basics of the matrix framework employed in Chapter 4 are introduced in Section 2.4.

2.1 Introduction to probability theory

Since our aim is to constrain probability distributions of variables that follow a given causal model, probability theory is the basic language used in this thesis.

2.1.1 Discrete random variables

Consider a discrete random variable A with outcomes a1, ..., aK.
The set {a1 , ..., aK } is called the alphabet of A and likewise K is called the alphabet size. For all variables considered in this thesis, K is assumed to be finite. To each outcome we assign a probability 0 ≤ P (A = ai ) ≤ 1, (2.1) with the normalization constraint K X P (A = ai ) = 1. (2.2) i=1 Several alternative notations for P (A = ai ) will be used throughout the thesis. A first measure to keep expressions short, is to write PA (ai ). If the variable is clear from the context the name of the variable might be dropped completely, leaving us with P (ai ). As another frequently used short hand 14 2 BASIC CONCEPTS notation, or when referring to the distribution itself, we also write P (A). In addition, when the specific values ai are not important (i.e. when they only appear as labels inside of probabilities like P (A = ai )), we usually assume integer values and write P (A = i). In general, the notation should always be clear from the context or will be explained at the corresponding position. 2.1.2 Joint and marginal distributions When considering two random variables A and B the joint probability that both, A = ai and B = bj occur, is written as P (A = ai , B = bj ). The distribution of a singe variable can be calculated using the law of total probability, P (A = ai ) = X P (A = ai , B = bj ) . (2.3) j This summation is also called marginalization (over B) and the resulting distribution P (A) is called the marginal distribution of A. It is easy to check that the probabilities P (A = ai ) indeed satisfy conditions (2.1) and (2.2), assuming that the joint distribution satisfies them. In general, for n random variables the joint distribution is referred to as P (A1 , ..., An ). The distribution of any subset of variables may then be called the marginal distribution of these variables. 2.1.3 Conditional probabilities and (conditional) independence The distribution of a variable A may change conditioned on the observation of another variable B. We write P (A = ai | B = bj ) or simply P (A | B) to denote the conditional probability of A given B. The joint distribution can be decomposed as P (A, B) = P (A | B) P (B) . (2.4) If we find P (A | B) = P (A) (meaning P (A = ai | B = bj ) = P (A = ai ) ∀i, j) the variables are called independent. In that case, one also finds P (B | A) = P (B) and the joint distribution factorizes according to P (A, B) = P (A) P (B) . (2.5) Independence statements often include the conditioning on a third variable. A and B are said to be conditionally independent given C, also written as 15 2 BASIC CONCEPTS ⊥ B | C, if the conditional distribution factorizes according to A⊥ P (A, B | C) = P (A | C) P (B | C) . (2.6) Since conditional distributions are also valid probability distributions satisfying (2.1) and (2.2), (2.6) is a straightforward generalization of (2.5). Generalizations to larger sets of variables are likewise straightforward. 2.1.4 Expected value, variance and covariance The expected value of a random variable is defined as E [A] = X ai P (A = ai ) . (2.7) i The variance can then be written as h Var [A] = E |A − E [A]|2 h i i = E |A|2 − |E [A]|2 = X i 2 |ai | P X (ai ) − ai P i 2 (ai ) ≥ 0. (2.8) The non-negativity can be seen right from the first line, since the expectation value of a non-negative quantity will also be non-negative. For the sake of generality we allow complex valued alphabets here. The complex conjugate of x ∈ C is denoted by x∗ . 
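As a small numerical illustration of (2.7) and (2.8), the expected value and both forms of the variance can be evaluated directly from a probability table; the distribution below is an arbitrary example with a real alphabet.

    # Arbitrary example: a variable A with alphabet {1, 2, 3}.
    P_A = {1: 0.2, 2: 0.5, 3: 0.3}

    E = sum(a * p for a, p in P_A.items())                                 # expected value (2.7)
    Var = sum(abs(a - E) ** 2 * p for a, p in P_A.items())                 # first line of (2.8)
    Var_alt = sum(abs(a) ** 2 * p for a, p in P_A.items()) - abs(E) ** 2   # second line of (2.8)

    print(E, Var, Var_alt)   # 2.1, 0.49, 0.49 (up to rounding): both forms of (2.8) agree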
As a generalization for two random variables one defines the covariance as Cov [A, B] = E [(A − E [A])∗ (B − E [B])] = E [A∗ B] − E [A]∗ E [B] X = a∗i bj [P (ai , bj ) − P (ai ) P (bj )] . (2.9) i,j Note that for complex variables we obtain Cov [B, A] = Cov [A, B]∗ instead of full symmetry. If A and B are independent, their joint distribution factorizes and thus Cov [A, B] = 0. The other direction is not true, i.e. depending on the values ai and bj even dependent variables can have covariance zero. Further statistical aspects that play a role in Chapter 3 will be introduced later. 2 BASIC CONCEPTS 2.2 16 Bayesian networks In this section we introduce necessary properties and terminology of graphical models required for the following chapters. For a more detailed treatment of the topic see for example Pearl [3] or Spirtes, Glymour and Scheines [19]. A nice online introduction can be found in the Stanford Encyclopedia of Philosophy on the topic of Probabilistic Causation [20]. Causal assumptions on a set of random variables are often modeled using socalled directed acyclic graphs (DAGs). Each random variable is represented by one node (or vertex). A directed edge between two nodes indicates direct causal influence from one variable on the other. The whole graph being directed means that each edge has exactly one arrowhead. A graph is called acyclic if there exists no directed path from one variable to itself (e.g. A → B → C → A), i.e. no variable must be its own cause. This also implies that if A causes B, B cannot simultaneously cause A. When dealing with DAGs, one often uses genealogical terminology to indicate the relation between variables. If there exists a directed path from A to B, A is called an ancestor of B and B a descendant of A. If the path has length one, i.e. if there is a direct link from A to B, we call them parent and child. In the DAG A → B → C, for example, B is a child of A and a parent of C. Furthermore C is a descendant of A and likewise A is an ancestor of C. We denote the set of parents of a variable A by PA (A) and likewise the sets of descendants and non-descendants by D (A) and ND (A). Note that DAGs can be defined independently of any causal interpretation. In the first place, the DAG is used to encode conditional independence relations, e.g. in the DAG A → B → C, A and C are conditionally independent given B. The total model consisting of the DAG and its implied independence relations is called a Bayesian network. The causal interpretation of a Bayesian network is threefold. First, it is simply convenient and in some sense natural to think of the edges as causal links. Second, if interventions in the spirit of Pearls do-calculus are considered, additional assumptions concerning the locality of these manipulations allow for a causal interpretation. For more details, see Pearls definition of a causal Bayesian network [3]. Finally, and most relevant for this thesis, if we find a violation of the constraints implied by the DAG in real data, the model can be rejected as an explanation for generating the data regardless of a possible causal interpretation. This means that we are in particular able to falsify causal 17 2 BASIC CONCEPTS assumptions. The causal interpretation becomes more relevant when trying to verify causal effects rather than to falsify them. Distinguishing causality from mere correlations can be a delicate task. 2.2.1 Markov condition A list of fundamental independence relations implied by the DAG is given by the Markov condition. 
The Markov condition states that any variable should be conditionally independent of its non-descendants given its parents, written as A ⊥⊥ ND(A) | PA(A). In particular, once the parents of A are known, further knowledge of more distant ancestors does not change the distribution of A anymore. More distant ancestors have only an indirect influence on A by affecting A’s parents or other less distant ancestors. These independence relations imply that the joint distribution of all variables factorizes according to

    P(A^1, ..., A^n) = ∏_{j=1}^{n} P(A^j | PA(A^j)).    (2.10)

As an example, consider the so-called instrumentality DAG in Figure 1. The Markov condition implies the independence relations C ⊥⊥ λ and B ⊥⊥ C | A, λ. The total distribution can be written as

    P(A, B, C, λ) = P(A | C, λ) P(B | A, λ) P(C) P(λ).    (2.11)

Figure 1: Instrumentality DAG, with edges C → A, λ → A, A → B and λ → B. The instrument C can under certain assumptions be used to infer the causal effect of A on B [16, 21, 22, 23]. The variable λ comprises all additional influences on A and B and may be unobserved (see Subsection 2.2.3 for hidden variables). Here, the DAG serves simply as an illustration for the Markov condition.

Note that the conditional independence relations given by the Markov condition may imply further independence relations, which can be obtained algorithmically by the so-called d-separation criterion [3]. The Markov condition is of particular importance for us, since any distribution compatible with the DAG has to factorize according to (2.10). Violation of this factorization, or of any constraints derived from it, is a proper witness of incompatibility of the data with the assumed causal model.

2.2.2 Faithfulness assumption

The Markov condition is a sufficient but not necessary condition for conditional independence [20], in the sense that in data compatible with the DAG any conditional independence implied by the Markov condition (and hence also by the d-separation criterion) will hold, but additional independence relations are possible. The faithfulness assumption states that the Markov condition should also be necessary, i.e. that there exist no independence relations other than those implied by the Markov condition. This can also be understood as saying that all edges in the graph are indeed required. For example, in the graph A → B the Markov condition implies no independence relations at all, and the faithfulness assumption states that we should then indeed find a dependence between the two variables. If we find that A and B are independent, the graph and the distribution are said to be not faithful to one another. For another illustration, consider the case that a distribution is compatible with more than one DAG and that for some reason one has to decide which graph is ‘the correct one’. Loosely speaking, the faithfulness assumption would suggest choosing the simplest one. Complex graphs that allow more dependence relations than are actually observed could be regarded as overfitting the data. In that sense the faithfulness assumption would be a formal version of Occam’s razor [20]. However, even with the faithfulness assumption it is unlikely that a unique graph can be identified. A simple example of several equally complex graphs that entail the same conditional independence relations was already given in the introduction, namely A → B → C, A ← B ← C and A ← B → C, which all imply A ⊥⊥ C | B.
The faithfulness assumption is used in many theorems and algorithms in causal inference [20, 24, 3, 19] and from an ideal point of view the assumption can be justified since distributions that are not faithful to a DAG have 19 2 BASIC CONCEPTS Lebesgue measure zero [24]. However, for practical purposes the faithfulness assumption is also subject to criticism [25, 24]. Fortunately, the approach followed in this thesis does not require the faithfulness assumption since we are not trying to decide between different Markov equivalent DAGs but rather to reject a given DAG (and thus all its Markov equivalents). For this purpose, violation of the Markov condition is a sufficient criterion. 2.2.3 Hidden variables Variables that are too complex to be properly characterized (e.g. comprising incomplete background knowledge) or that can simply not be observed due to other (maybe practical) reasons, have to be included in the model as so-called hidden or latent variables. Variables that are not hidden are called observables in this thesis. As an example, in the debate of smoking as a cause of lung cancer, one could think of an alternative model where a gene is the common cause of both the cancer and a strong craving for nicotine. Since we do not even know whether or not such a gene exists, this common cause has to be treated as a hidden variable. Hidden variables can substantially complicate the task of causal inference. Independence statements including hidden variables for obvious reasons cannot be evaluated (from empirical data). The remaining independence relations, if they exist, may carry only little information. Also, for large DAGs and alphabets, testing all accessible independence relations might become impractical. Concerning the distribution of the remaining observables, the simple product structure given by the Markov condition (see (2.10)) gets lost due to marginalization over the hidden variables. Considering n observables A1 , ..., An and m hidden variables λ1 , ..., λm , the distribution of the observables can be written as P A1 , ..., An = X n Y λ1 ,...,λm j=1 P Aj | PA Aj m Y P (λk | PA (λk )) . (2.12) k=1 The set of all distributions of this form can have a highly complex geometrical structure [12, 13, 14, 15]. In particular, this set will in general be non-convex, meaning that if two distributions P1 and P2 are of the above form, then a mixture of these distributions will in general not be of that form. Typically one aims to find an outer approximation (or if possible a precise description) to this set given in the form of inequality (and equality) 20 2 BASIC CONCEPTS constraints. A distribution violating such an inequality will then automatically fail to be compatible with the DAG. Figure 2 illustrates these set relations. all distributions true set inequality description Figure 2: Illustration of set inclusions for distributions compatible with models including hidden variables. The true set has such a complex structure that deciding membership becomes unfeasible. The four black curves correspond to inequalities that (upper) bound the correlations (or more general dependence relations) between the observables. The set corresponding to the ‘inequality description’ is the set of distributions satisfying all four inequalities. Violation of any inequality is evidence of non-membership to the true set of distributions. 
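To make the marginalization in (2.12) concrete, the following Python sketch computes the observable marginal P(A, B, C) of the instrumentality DAG of Figure 1, where the binary λ is hidden; all conditional probabilities are arbitrary illustrative numbers.

    import itertools

    # Hypothetical conditional distributions, all variables binary.
    P_C = {0: 0.5, 1: 0.5}
    P_L = {0: 0.7, 1: 0.3}                                                 # distribution of the hidden lambda
    P_A1_given_CL = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.8, (1, 1): 0.3}   # P(A=1 | C=c, lambda=l)
    P_B1_given_AL = {(0, 0): 0.2, (0, 1): 0.9, (1, 0): 0.7, (1, 1): 0.4}   # P(B=1 | A=a, lambda=l)

    def bern(p1, value):
        # Probability that a binary variable with P(X=1) = p1 takes the given value.
        return p1 if value == 1 else 1.0 - p1

    def p_obs(a, b, c):
        # Observable marginal: sum the hidden lambda out of the Markov factorization
        # P(A | C, lambda) P(B | A, lambda) P(C) P(lambda), cf. (2.11) and (2.12).
        return sum(P_C[c] * P_L[l]
                   * bern(P_A1_given_CL[(c, l)], a)
                   * bern(P_B1_given_AL[(a, l)], b)
                   for l in P_L)

    # The eight observable probabilities still sum to one, but the set of distributions
    # obtainable in this way no longer has the simple product structure of (2.10).
    print(sum(p_obs(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3)))   # 1.0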
2.2.4 Hidden common ancestor models Consider the case that there are no direct causal links between the observable variables but all correlations are mediated by hidden common ancestors. Furthermore, assume that all ancestors are independent of each other and that the observables do not affect the hidden variables (i.e. the hidden variables have only outgoing edges). We call such a model a hidden common ancestor model. Distributions compatible with this scenario are of the form P A1 , ..., An = X { λx }x P A1 | { λx }x| A1 ...P An | { λx }x|An Y P (λx ) . x (2.13) 21 2 BASIC CONCEPTS The set { λx }x contains all hidden ancestors and { λx }x| j all ancestors of A the observable Aj . Note that we will often index ancestors with the names of the observables they are connecting. To distinguish such ‘set indices’ from the usual integer indices, we will use the letters x, y, z for the former and i, j, k, ... for the latter. In fact, this distinction is majorly required for the notations used in Appendix A. At this point it serves primary to ensure a consistent notation throughout the whole document. In a hidden common ancestor model, an ancestor of only one observable puts no constraints on the observable distribution since the marginalization will only affect one term, e.g. X P A1 | { λx }x| λ0 , λ0 P (λ0 ) = P A1 | { λx }x| 1 A A1 . (2.14) Any distribution P A1 | { λx }x| 1 can be obtained by just not letting λ0 A have any effect at all. In that sense λ0 can always be absorbed into A1 or its other ancestors. Likewise, for the DAGs considered in this section, one ancestor common to all observables does not constrain the observable distribution. To see this, first realize that the joint distribution can be decomposed according to P A1 , ..., An = X P A1 | λ ...P (An | λ) P (λ) . (2.15) λ Now, think of λ as being composed of n subvariables λj , one for each observable Aj and also with the corresponding alphabet size. If we let Aj be deterministically dependent on λj , i.e. P Aj = kj | λ1 = l1 , ..., λn = ln = P Aj = kj | λj = lj = δkj lj , (2.16) we obtain P A1 = k1 , ..., An = kn = X δk1 l1 ...δkn ln P (λ1 = l1 , ..., λn = ln ) l1 ,...,ln = P (λ1 = k1 , ..., λn = kn ) . (2.17) By choosing this deterministic dependence between the ancestor λ and the observables A1 , ...An , the latter simply inherit the distribution of the former. Essentially, we simulated a collection of variables by one larger variable. Thus, any distribution can be realized by one ancestor common to all variables. 22 2 BASIC CONCEPTS A λAB B λAC C λBC Figure 3: The triangular scenario. Three observables with one hidden common ancestor for each pair. The most simple non-trivial example consists of three observables and two ancestors, each connecting one pair of observables. If also the last pair is connected by a third ancestor, one obtains the so-called triangular scenario (see Figure 3). Even though it is one of the most simple examples, the structure of distributions compatible with the triangular scenario is already highly complex. In general, two observables that share no ancestor are independent. To mathematically confirm this intuitive statement, consider the bi-partite marginals of the general distribution (2.13), P Aj , Ak P Aj | { λx }x| j P Ak | { λx }x| X = Ak A { λx }x| Aj ∪{ λx }x| Ak j∈x∨k∈x Y P Aj | { λx }x| j P (λx ) · X A { λx }x| j∈x Aj P A j X P Ak | { λx }x| Y Ak { λx }x| = P (λx ) x = Y P A k∈x Ak k P (λx ) . 
(2.18) From the first to the second line we assumed that { λx }x| j and { λx }x| k are A A disjoint sets, i.e. that Aj and Ak have no common ancestor. The notation 23 2 BASIC CONCEPTS j ∈ x ∨ k ∈ x shall indicate that the product runs only over ancestors of the observables Aj and/or Ak . Any distribution that violates such an independence statement implied by a given DAG cannot be compatible with that DAG. 2.3 Information theory The inequality constraints encountered in Chapter 3 are given in terms of entropies of the observable variables. Here, we provide a brief introduction to that topic. More details can for example be found in [26, 27, 28]. 2.3.1 Shannon entropy The Shannon entropy of a probability distribution is defined as H (A) = E [− log P (A)] = − K X P (A = ai ) log P (A = ai ) . (2.19) i=1 Entropy is a measure of randomness or uncertainty of a distribution. Since log 1 = 0 and x log x −→ 0, the entropy of a deterministic distribution x→0 P (A = ai ) = δij , where one outcome occurs with certainty, is zero. The maximal value log K is obtained for the uniform distribution P (A = ai ) = K1 . Entropy can also be understood as a measure of information gained from an observation. The more random the distribution P (A), the more will be learned by conducting an experiment with this underlying distribution. For example, we learn more by flipping a fair coin than by flipping a manipulated coin that always shows heads. In the latter case we learn nothing since we already knew the result beforehand. In the context of information, − log P (A = ai ) is also called the information content of the outcome ai , and entropy is the information content of the whole distribution P (A). Note that outcomes with very large information content are suppressed due to their small probability. Also note that the entropy is independent of the actual alphabet (i.e. the outcome values ai ), since only the probabilities P (A = ai ) appear. Since 0 ≤ P (A = ai ) ≤ 1, − log P (A = ai ) is always non-negative leading to H (A) ≥ 0. Contrary to the typically employed base-2 logarithm in information theory, we employ the natural logarithm here. The difference 24 2 BASIC CONCEPTS is merely a constant factor. The unit of the entropy with base-2 logarithm is called bit, with the natural logarithm nat. 2.3.2 Joint and conditional entropy Entropy can easily be generalized to joint distributions. The joint entropy of the variables A1 , ..., An is simply H A1 , ..., An = − X P (a1 , ..., an ) log P (a1 , ..., an ) . (2.20) a1 ,...,an Note that the expression is symmetric in the sense that H (A, B) = H (B, A) (for notational convenience we consider only two variables from now on). As for probabilities, one can decompose the joint entropy in terms of a conditional entropy H (B | A) and the marginal entropy H (A), H (A, B) = H (B | A) + H (A) X X with H (B | A) = − P (a) P (b | a) log P (b | a) . a (2.21) (2.22) b H (B | A) can be understood as the average information that we gain from learning B when we already know A. If for example B is deterministically dependent on A (i.e. P (b | a) ∈ {0, 1}), then H (B | A) = 0. In general, the following inequality relations hold: 1. H (B | A) ≥ 0 2. H (A, B) ≥ H (A) 3. H (A, B) ≤ H (A) + H (B) with equality iff A ⊥ ⊥B 4. H (B | A) ≤ H (B) with equality iff A ⊥ ⊥B (2.23) (2.24) (2.25) (2.26) Note that, as usual, ‘iff’ is short hand for ‘if and only if’. The first inequality follows directly from the definition (2.22) and log P (b | a) ≤ 0. 
The second inequality follows from the first one inserted in (2.21). Due to the symmetry of H (A, B), H (B) is a lower bound as well. The inequality represents the intuitive statement, that the uncertainty of two variables should be larger than the uncertainty of each single variable. A proof of inequality number three, which states that the total uncertainty cannot be larger than the sum of the single uncertainties, can for example be found in [26]. The fourth inequality follows from the third one (in fact they are equivalent) and (2.21). It says that the uncertainty about B does not grow when learning A. 25 2 BASIC CONCEPTS 2.3.3 Mutual information The mutual information shared by two variables is defined as I (A; B) = X a,b P (a, b) log P (a, b) . P (a) P (b) (2.27) The definition suggests that mutual information measures the closeness of the joint distribution P (a, b) to the product of its marginals P (a) P (b). Thus, it can be considered as a measure of dependence. The mutual information of two variables can be expressed in terms of their entropies via the relations I (A; B) = = = = H (A) + H (B) − H (A, B) H (A) − H (A | B) H (B) − H (B | A) H (A, B) − H (A | B) − H (B | A) . (2.28) (2.29) (2.30) (2.31) A graphical illustration of these relations can be found in Figure 4. Mutual information satisfies the following bounds: ⊥B 1. I (A; B) ≥ 0 with equality iff A ⊥ 2. I (A; B) ≤ H (A) (2.32) (2.33) The bounds follow directly from (2.26) and (2.23) inserted in (2.30) (in fact (2.32) is equivalent to each of (2.25) and (2.26)). The upper bound is achieved for deterministically dependent variables in which case we have H (A) = H (B) = H (A, B). Due to symmetry, H (B) is of course always an upper bound as well. The conditional mutual information of A and B given a third variable C is their mutual information given a specific value c of C averaged over all c, I (A; B | C) = X c P (c) X a,b P (a, b | c) log P (a, b | c) . P (a | c) P (b | c) (2.34) Conditional mutual information also satisfies the bound I (A; B | C) ≥ 0. In terms of entropies it can for example be expressed as I (A; B | C) = H (A, C) + H (B, C) − H (A, B, C) − H (C) . (2.35) 26 2 BASIC CONCEPTS H(A,B) H(A|B) I(A;B) H(A) H(B|A) H(B) Figure 4: Graphical illustration of the relation between marginal, conditional and joint entropy and mutual information. Note that A, B and C can also be replaced by setsA1 , ...An , B 1 ,...,nB moand C 1 , ..., C l . The (conditional) mutual information I {Ai }i ; {B j }j | C k k then measures the dependence between the two sets A1 , ..., An and B 1 , ..., B m (given the set C 1 , ..., C k ). 2.4 Hermitian and positive semidefinite matrices In Chapter 4 we derive constraints on probability distributions in terms of certain matrix inequalities. Here, we introduce the necessary matrixtheoretical concepts. Some basic properties of positive semidefinite matrices can be found in [29]. More details and complete introductions to the topic can for example be found in [30, 31, 32]. 2.4.1 Definitions and notation A complex valued square matrix M ∈ Cn×n is called positive semidefinite, written M ≥ 0, if x† M x ≥ 0 ∀x ∈ Cn , (2.36) where x† denotes the conjugate transpose (or adjoint) of x, and x is defined as a column vector. When convenient, we use the Dirac notation from quantum mechanics (see e.g. [33]) to denote a vector x as an abstract state |xi ∈ H and its adjoint as hx|. H denotes a general Hilbert space. A matrix 27 2 BASIC CONCEPTS M ∈ Cn×n is called hermitian if M † = M . 
Any hermitian matrix possesses a spectral decomposition (also called eigenvalue decomposition) M= n X λj |ji hj| , (2.37) j=1 with real eigenvalues λj and orthonormal eigenstates |ji. The set of eigenvalues {λj }nj=1 is also called the spectrum of M . M is positive semidefinite if and only if M is hermitian and has non-negative spectrum. From there we can conclude that the determinant of a positive semidefinite matrix, which is simply the product of its eigenvalues, is non-negative as well. Positive semidefiniteness induces a partial order among matrices. We say that M ≥ N if M − N ≥ 0. In general, it need neither be the case that M ≥ N nor N ≥ M . The kernel and range of a matrix M are defined as ker (M ) = {|xi ∈ H | M |xi = 0} , range (M ) = {|xi ∈ H | ∃ |yi ∈ H with M |yi = |xi} . (2.38) (2.39) Taking a look at the spectral decomposition (2.37), the range of a hermitian matrix M can be written as the span of the eigenstates corresponding to non-zero eigenvalues, range (M ) = span ({|ji} | λj 6= 0). Similarly, the kernel can be written as the span of the eigenstates with eigenvalue zero, ker (M ) = span ({|ji} | λj = 0). Since eigenstates corresponding to different eigenvalues are orthogonal, range and kernel of a hermitian matrix are orthogonal subspaces 2.4.2 Inverse, pseudoinverse and other functions P Consider a hermitian matrix M = of M can then be calculated as M −1 j = λj |ji hj| with λj 6= 0 ∀j. The inverse n X 1 |ji hj| . j=1 λj (2.40) If we allow λj = 0 we can define the pseudoinverse of M , M = X j λj 6=0 1 |ji hj| , λj (2.41) 28 2 BASIC CONCEPTS which is the inverse restricted to the range of M . In general, using the spectral decomposition, the action of a complex valued function f : C → C on M can be defined as f (M ) = n X f (λj ) |ji hj| , (2.42) j=1 as long as f (λj ) is properly defined. 2.4.3 Projections A hermitian matrix P is called a projection if P 2 = P . This property is also called idempotence. The spectrum of P consists only of the eigenvalues 0 and 1. Thus, the spectral decomposition reads P = X |ji hj| . (2.43) j λj =1 Two projections P1 , P2 are called orthogonal (to each other) if their ranges are orthogonal subspaces. In that case one obtains P1 P2 = P2 P1 = 0. (2.44) This is not to be confused with a single projection being called orthogonal which is the case if its range and kernel are orthogonal subspaces. The latter is always true for hermitian projections and only those are important in this thesis. A single projection that is not orthogonal is called oblique. The projection PM onto the range of a hermitian matrix M can be obtained using the pseudoinverse M via the relations PM = M M = M M . (2.45) 3 TESTING ENTROPIC INEQUALITIES 3 29 Testing entropic inequalities The goal of this chapter is to investigate hypothesis tests based on entropic inequality constraints that are used to decide compatibility of empirical data with a given DAG. In particular, we want to improve a hypothesis test testing compatibility with the triangular scenario (see Figure 3) that was proposed in [16]. As a first means to this end, in Section 3.2, we implement recent techniques of estimating entropies from [17, 18]. In Section 3.3 we show that the heuristic that was used to construct the hypothesis test in [16] leads to an unreliable control of the type-I-error rate. To circumvent this problem we consider an alternative approach to the hypothesis test based on the relation between hypothesis tests and confidence intervals. 
At the end of Section 3.3 we implement this alternative approach for additional entropic inequalities constraining the triangular scenario, that were derived but not further considered in [16]. As the very first step, in Section 3.1, we briefly present the method of generating entropic inequalities constraining distributions compatible with a given DAG introduced in [16]. Section 3.2 also contains a general introduction to estimation theory. An application of our methods to real data is presented in Chapter 5. 3.1 Entropic inequality constraints In Subsection 2.2.3 it was mentioned that DAGs with hidden variables impose non-trivial constraints on the observable marginal distributions. To characterize the set of distributions compatible with the DAG, one often has to resort to outer approximations in terms of inequality constraints. Violation of such an inequality allows one to reject the assumed causal model as an explanation for generating the data. Recently, it has been proposed to work on the level of entropies of the marginal distributions. The key idea behind using entropies is that algebraic independence conditions, for example P (A, B) = P (A) P (B), translate into linear conditions on the level of entropies, H (A, B) = H (A) + H (B) or simply I (A; B) = 0. Working with linear constraints is arguably much simpler than working with polynomial constraints. In [16] an algorithm for the entropic characterization of any DAG has been developed. It consists of the three main steps listed below: 30 3 TESTING ENTROPIC INEQUALITIES 1. List the elementary inequalities. 2. Add the constraints implied by the DAG. 3. Eliminate all entropies including hidden variables or any non-observable terms. In the first step, the so-called elementary inequalities constrain the entropies of any set of random variables. We have seen special cases of these inequalities for the bi- (or tri-) partite case already in Section 2.3. For the general case consider the set of variables A = {A1 , ..., An }. Monotonicity demands H (A \ Aj ) ≤ H (A) (c.f. (2.24)), implying that the entropy of any set of variables should be larger than the entropy of any subset of these variables. The so-called sub-modularity condition demands j 0 k 0 0 j k 0 H (A ) + H A , A , A ≤ H (A , A ) + H A , A for any subset A0 ⊂ A. A comparison with (2.35) reveals that this is equivalent to the non-negativity of the conditional mutual information, I Aj ; Ak | A0 = H Aj , A0 + H Ak , A0 − H Aj , Ak , A0 − H (A0 ) . (3.1) Finally, one demands the entropy of the empty set to be zero, H (∅) = 0. The elementary inequalities are also known as the polymatroidal axioms. One should note that they provide only an outer approximation to the true set of possible entropies. A tight description is not generally known [28]. In the second step of the algorithm the conditional independence constraints of the form I (A; B | C) = 0 implied by the Markov condition (and hence also the d-separation criterion) are added. The elimination of the hidden variables from the set of inequalities and equalities can be done by employing the so-called Fourier-Motzkin elimination [34]. Using this procedure, inequalities for several DAGs have been derived [16]. As a first example, distributions compatible with the instrumentality DAG from Figure 1, where λ is assumed to be hidden, have to satisfy I (B; C | A)+ I (A; C) ≤ H (A). In fact, this is the only entropic constraint that is not implied by the elementary inequalities. 
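As a numerical sanity check (with arbitrary illustrative numbers), one can construct a distribution that is compatible with the instrumentality DAG by definition, compute the required entropies in nats, and confirm that I(B; C | A) + I(A; C) ≤ H(A) holds; the mutual informations are expressed through joint entropies as in (2.28) and (2.35).

    import itertools, math

    def H(P):
        # Shannon entropy (in nats) of a distribution given as {outcome: probability}.
        return -sum(p * math.log(p) for p in P.values() if p > 0)

    def marginal(P, keep):
        M = {}
        for outcome, p in P.items():
            key = tuple(outcome[i] for i in keep)
            M[key] = M.get(key, 0.0) + p
        return M

    # Observable distribution P(A, B, C) of an instrumentality DAG with hidden binary lambda;
    # C is uniform and the conditional probabilities below are arbitrary.
    P_L = {0: 0.7, 1: 0.3}
    P_A1 = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.8, (1, 1): 0.3}   # P(A=1 | C=c, lambda=l)
    P_B1 = {(0, 0): 0.2, (0, 1): 0.9, (1, 0): 0.7, (1, 1): 0.4}   # P(B=1 | A=a, lambda=l)
    bern = lambda p1, v: p1 if v == 1 else 1.0 - p1
    P_ABC = {(a, b, c): sum(0.5 * P_L[l] * bern(P_A1[(c, l)], a) * bern(P_B1[(a, l)], b) for l in P_L)
             for a, b, c in itertools.product((0, 1), repeat=3)}

    H_A, H_C = H(marginal(P_ABC, (0,))), H(marginal(P_ABC, (2,)))
    H_AB, H_AC, H_ABC = H(marginal(P_ABC, (0, 1))), H(marginal(P_ABC, (0, 2))), H(P_ABC)
    I_AC = H_A + H_C - H_AC                      # I(A;C), cf. (2.28)
    I_BC_given_A = H_AB + H_AC - H_ABC - H_A     # I(B;C|A), cf. (2.35)
    lhs, rhs = I_BC_given_A + I_AC, H_A
    print(round(lhs, 4), "<=", round(rhs, 4), ":", lhs <= rhs + 1e-12)   # constraint satisfied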
The number of inequalities on the level of probabilities, on the other hand, increases exponentially with the alphabet sizes of the variables [21]. The only drawback of the entropic characterization is that it is only an outer approximation, i.e. there might be distributions that are incompatible with the scenario but fail to violate 3 TESTING ENTROPIC INEQUALITIES 31 the entropic inequality. In this sense, entropic inequalities are a necessary but non-sufficient conditions for the compatibility of given data with an underlying causal model. As a second example, distributions compatible with the triangular scenario from Figure 3 have to satisfy the inequality H (A) + H (B) + H (C) − H (A, B) − H (A, C) ≤ 0 ⇔ I (A; B) + I (A; C) ≤ H (A) , (3.2) and permutations thereof. The inequality can intuitively be understood as follows (see also [16]). If the mutual information of A and B is large, then A depends strongly on the ancestor λAB . But then, the dependence of A on λAC is necessarily small. Since all correlations between A and C are mediated by λAC , the mutual information of A and C is consequently small as well. Inequality (3.2) gives a precise bound to this intuition. In addition, distributions compatible with the triangular scenario are constrained by the less intuitive inequalities 3HA + 3HB + 3HC − 3HAB − 2HAC − 2HBC + HABC ≤ 0(3.3) and 5HA + 5HB + 5HC − 4HAB − 4HAC − 4HBC + 2HABC ≤ 0(3.4) (and permutations of 3.3). To save space we employed the short hand notations HAB = H (A, B) and so on. Even after rewriting the inequalities in terms of mutual information (see Subsection 3.3.4), a simple, intuitive understanding similar to the one given above for inequality (3.2) is not available. One particular problem is caused by the involvement of ‘tri-partite mutual information’ (see (3.42)) which can, opposed to the usual mutual information, be negative. In that sense, ‘tri-partite mutual information’ is not a well defined information measure, making a simple intuition difficult. It is worth noting that inequality (3.2) is based on bi-partite information alone. On one hand, this might suggest that 3.2 is the least restrictive one, on the other hand it can also be employed if no tri-partite information is available. Such a scenario might arise for example in quantum mechanics, where several observables (e.g. position and momentum of a single particle) are not jointly measurable. In the following we will mainly focus on inequality (3.2) and come back to inequalities (3.3) and (3.4) in Subsection 3.3.4. 3 TESTING ENTROPIC INEQUALITIES 3.2 32 Entropy estimation In order to test an entropic constraint like inequality (3.2) from a data set, one first has to statistically estimate the single quantities appearing in the inequality. Since mutual information can be expressed as I (A; B) = H (A) + H (B) − H (A, B), it suffices to find a reliable estimator for entropy. Estimating joint entropies is also effectively the same as estimating marginal entropies. Regardless of the number of variables we can simply write H A1 , ..., An = − X P (a1 , ..., an ) log P (a1 , ..., an ) (3.5) a1 ,...,an as H (P ) = − X pi log pi , (3.6) i where i runs over the total alphabet of the collection of variables A1 , ..., An , and pi denotes the corresponding probability P (a1 , ..., an ). Note that reliably estimating entropies is an up-to-date problem. 
While it is not new that simply calculating the entropy of the observed distribution is not the best choice, the estimator that we employ in thesis thesis has been introduced only recently (2014/15) [17, 18]. We thus provide a rather detailed elaboration of the topic. 3.2.1 Introduction to estimators An accessible introduction to estimation theory can for example be found in [35]. When collecting data in the real world, one does usually not know the true probability distribution P underlying the data generating process. Assume that we make N observations, each independently drawn from the same distribution P (the observations can be considered as N independent and identically distributed (i.i.d.) random variables). The observations are called a sample of size N of the distribution P . Further assume that the outcome of any observation can be assigned to one of K categories, the alphabet of the distribution P . The number of observations that fall in category i is denoted by Ni and the distribution P̂ defined in terms of the probabilities p̂i = Ni/N is called the empirical distribution. The empirical distribution P̂ is an estimate of the true distribution P . 33 3 TESTING ENTROPIC INEQUALITIES Note that in statistics one distinguishes between so-called parametric and non-parametric estimation. Parametric estimation means that certain assumptions about the probability distribution have been made, for example that the distribution is characterized by some real parameter θ. Estimating the distribution then amounts to estimating that parameter. If no such assumptions have been made, one speaks of non-parametric estimation. Strictly speaking, non-parametric estimation only exists in the continuous case. The mere assumption that the distribution is discrete (with finite alphabet size) already renders the model parametric, since each probability pi can be considered as one parameter. The next step after estimating the distribution P is the so-called functional estimation, where one aims to estimate a quantity Q(P ). Since we do not have access to the true distribution P our estimate can only be based on the empirical distribution P̂ . Naively, one could simply calculate Q(P̂ ), also called the plug-in estimator, but in general it is advisable to change the function Q as well. A general estimator of Q(P ) is then denoted by Q̂(P̂ ). Since it should be clear that the estimate is based on the empirical distribution, we will typically omit the functional dependence on P̂ and simply write Q̂. Different estimators of the same quantity will be given appropriate indices, for example Q̂a and Q̂b . An estimator Q̂ should be as close to the true quantity Q as possible. There are several quantities that help characterize the performance of an estimator. Definition 3.1. For a fixed true distribution P the expected deviation between an estimator Q̂ and the true value Q is called the bias of Q̂, h i h i BP Q̂ = EP Q̂ − Q. (3.7) If BP Q̂ = 0, the estimator is called unbiased. The index P denotes that P is hold fixed. The expectation is taken with respect to all possible empirical distributions that can arise from the true distribution P . Explicitly, this can be written as h i EP Q̂ = X ProbP P̂ · Q̂ P̂ . (3.8) P̂ The probability that a specific empirical distribution occurs is given by the 34 3 TESTING ENTROPIC INEQUALITIES multinomial distribution ProbP P̂ = N! K pN1 ...pN K N1 !...NK ! 1 if N1 + ... + NK = N otherwise. 
Intuitively, it seems reasonable that a good estimator should be unbiased, but in fact for some quantities (entropy being one of them) unbiased estimators do not even exist. Also, when trying to reduce the bias of an estimator, one might simultaneously increase its variance. The variance of an estimator is defined in the usual way as

Var_P[Q̂] = E_P[(Q̂ − E_P[Q̂])²] = E_P[Q̂²] − E_P[Q̂]².   (3.10)

Note that in contrast to the random variables in the introduction of the variance in Subsection 2.1.4, Q̂ is assumed to be real valued. A more suitable quantity that one often tries to keep small (rather than the bias alone) is the mean square error.

Definition 3.2. The mean square error (MSE) of an estimator Q̂ is defined as

MSE_P[Q̂] = E_P[(Q̂ − Q)²].   (3.11)

While the variance of an estimator is its fluctuation around its own expected value, the MSE is the fluctuation around the correct value Q. For unbiased estimators this implies Var_P[Q̂] = MSE_P[Q̂]. In general, the MSE can be decomposed according to

MSE_P[Q̂] = E_P[(Q̂ − Q)²]
          = E_P[Q̂²] − 2Q·E_P[Q̂] + Q²
          = E_P[Q̂²] − E_P[Q̂]² + E_P[Q̂]² − 2Q·E_P[Q̂] + Q²
          = Var_P[Q̂] + B_P[Q̂]².   (3.12)

Minimizing the MSE means finding a proper trade-off between minimizing the variance and the bias.

The final property of estimators that we want to introduce here is consistency. While it is problematic to demand general unbiasedness, it is reasonable to demand that for sample size N → ∞ the estimator should approach the true value.

Definition 3.3. An estimator Q̂ = Q̂(N) is said to be consistent if it converges in probability to the true value Q,

lim_{N→∞} Prob_P( |Q̂(N) − Q(P)| > ε ) = 0   for all ε > 0.   (3.13)

Convergence in probability allows exceptions in the sense that there might (or will) be empirical distributions P̂(N) for which Q̂ does not approach the correct value. The probability measure of these distributions, however, is zero. Loosely speaking, this means that there are only few such distributions, which are furthermore very unlikely to occur.

3.2.2 Maximum likelihood estimation

A standard estimator used in statistics is the so-called maximum likelihood estimator (MLE). In parametric estimation the MLE θ̂_MLE of a parameter θ is defined as the parameter value for which the probability to make the given observation is maximized. Formally, this corresponds to maximizing the likelihood function L(θ) = Prob_θ(P̂) (often written as Prob(P̂ | θ)). Typically one rather considers the log-likelihood function log L(θ), since frequently occurring product expressions then split into more convenient sums. The following intuitive result is standard knowledge in statistics, but a proof is rarely given. We reproduce the result for the sake of completeness.

Proposition 3.1. The MLE of a true discrete distribution P is simply the empirical distribution P̂.

Proof. According to (3.9) the log-likelihood function reads

log L(P) = log Prob_P(P̂)
         = log[ N!/(N_1! ⋯ N_K!) · p_1^{N_1} ⋯ p_K^{N_K} ]
         = log( N!/(N_1! ⋯ N_K!) ) + N_1 log p_1 + ... + N_K log p_K.   (3.14)

When maximizing this function we have to take care of the additional constraint p_1 + ... + p_K = 1, which can be implemented by using a Lagrange multiplier λ. The function that we need to maximize then reads

log( N!/(N_1! ⋯ N_K!) ) + N_1 log p_1 + ... + N_K log p_K − λ(p_1 + ... + p_K − 1).   (3.15)

The condition that the ith partial derivative ∂_{p_i} vanishes becomes

N_i / p_i = λ   ⇔   p̂_i / p_i = λ/N.   (3.16)

This immediately requires λ/N = 1 and thus p_i = p̂_i.
To see this, assume ∃i s.t. p_i > p̂_i. The normalization constraint then implies the existence of another j ≠ i s.t. p_j < p̂_j. But then we have

p̂_i / p_i < 1 < p̂_j / p_j,   (3.17)

which contradicts the requirement that this ratio should be the same for all i. Thus, the empirical probability p̂_i is indeed the MLE of the true p_i.

The MLE features the invariance property that for a one-to-one function g(θ) one finds ĝ_MLE = g(θ̂_MLE) [35]. As a convention, one typically extends this definition to arbitrary functions g [35]. Thus, when referring to the MLE of the entropy H(P), we simply mean the plug-in estimator

Ĥ_MLE = H(P̂) = − Σ_i p̂_i log p̂_i.   (3.18)

3.2.3 Minimax estimation

The MLE is an intuitive estimator that is typically easy to calculate. It also features numerous optimality properties in the asymptotic regime, i.e. when the sample size approaches infinity for a fixed alphabet [35]. For finite samples (in particular when the alphabet is large compared to the sample size), however, there are in general no performance guarantees for the MLE. For a more sophisticated estimator with a finite-alphabet guarantee in the form of an optimally bounded mean square error, consider the following definitions.

Definition 3.4. The risk of an estimator Q̂, depending on the alphabet size K and the sample size N, is defined as

R_Q̂(K, N) = sup_{P ∈ M_K} MSE_P[Q̂].   (3.19)

Here, M_K is the set of all probability distributions with alphabet size K. The sample size N appears on the right hand side implicitly in the estimator Q̂ = Q̂(P̂(N)), since the possible empirical distributions depend on N.

Definition 3.5. The minimax risk for estimating a quantity Q, depending on the alphabet size K and the sample size N, is defined as

R_Q(K, N) = inf_{Q̂} R_Q̂(K, N).   (3.20)

The risk of an estimator is its worst case behaviour in terms of the MSE. The minimax risk is the best worst case behaviour possible for any estimator Q̂ of Q. It is desirable to have an estimator that achieves the minimax risk of the quantity of interest. In addition, it is of particular interest to know the sample size N as a function of the alphabet size K that is required for consistent estimation (see Definition 3.3) when both N and K go to infinity. This relation between N and K is also called the sample complexity. Different estimators will have different sample complexities. Again, it is desirable to have an estimator that achieves a global lower bound of the sample complexity. Note that one will typically not find strict statements of the form, say, N = 2K³. Instead one might find that N is bounded from below by K³, in the sense that ∃c₁ s.t. N ≥ c₁K³ samples are required for consistent estimation. Adopting the notation from [18], we denote this as N ≳ K³. On the other hand, if ∃c₂ s.t. the sample size N ≤ c₂K³ is sufficient for consistent estimation, one writes N ≲ K³. If both N ≳ K³ and N ≲ K³ hold, meaning that a sample size ∝ K³ is necessary and sufficient, we write N ≍ K³.

3.2.4 MLE and minimax estimator for entropy

For entropy estimation, the ideal sample complexity was shown to be N ≍ K/log K [36]. This means that consistent entropy estimation is possible for sample size ∝ K/log K. For smaller sample sizes, consistent estimation is not possible. This result is extended in [18], where the minimax risk of entropy estimation is shown to be

R_H(K, N) ≍ ( K / (N log K) )² + (log² K)/N.   (3.21)

In addition, an estimator achieving this bound (and thus also the sample complexity N ≍ K/log K) is constructed.
For the MLE, on the other hand, it is known that N ≳ K samples are required for consistent estimation and that the risk is [18]

R_{Ĥ_MLE}(K, N) ≍ (K/N)² + (log² K)/N.   (3.22)

Thus, the MLE is clearly suboptimal. In (3.21) and (3.22) the first term on the right hand side corresponds to the (squared) bias while the second term corresponds to the variance. Recall that according to (3.12) the MSE decomposes as MSE = B² + Var. It is generally acknowledged that in entropy estimation the main difficulty is handling the bias [17, 18]. In fact, it is easy to see that no unbiased estimator exists. To this end, one only has to realize that E_P[Ĥ] (see (3.8) and (3.9)) is a polynomial in the probabilities p_i, while H(P) is a non-polynomial function. Furthermore, it can be shown that the MLE is always negatively biased [37]. A comparison of (3.21) and (3.22) shows that the advantage of the minimax estimator over the MLE indeed lies in the reduced bias.

Other attempts (than minimax estimation) to correct the bias of the MLE exist. The typical first-order bias correction of a single term −p̂_i log p̂_i is simply −p̂_i log p̂_i + 1/(2N). Note that when we enlarge the alphabet but put no probability mass on the new outcomes (so that the distribution essentially does not change), applying the bias correction 1/(2N) to all of the new terms can hugely overcorrect the bias. Thus, it is advisable to only use the bias correction for terms with p̂_i > 0. This gives rise to the so-called Miller-Madow bias correction [37, 38, 39]. When applying the bias correction to all terms, we speak of the naive bias correction.

In the next subsection, following the construction of the minimax estimator from [17, 18], we numerically verify its optimal performance. To this end, we compare the minimax estimator to the MLE and its Miller-Madow (MM-MLE) as well as naively bias corrected versions (n.b.c. MLE). The minimax estimator, analogous to the MLE, estimates each term −p_i log p_i separately. Different estimators are applied for 'large' and 'small' empirical probabilities p̂_i. It turns out that for large values the bias corrected MLE works well. For small probabilities the expression −p_i log p_i is 'unsmooth' in the sense that the derivative diverges to infinity (for p_i → 0). This causes small errors in the estimate p̂_i to lead to large errors in the estimate −p̂_i log p̂_i. Controlling the bias in this sensitive regime is particularly problematic and not handled well by the typical bias corrections of the MLE. Even the Miller-Madow correction is rather crude: for p̂_i > 0, however small it might be, the full correction 1/(2N) is applied, and suddenly for p̂_i = 0 no correction is applied at all. The minimax estimator provides a smoother solution.

If p̂_i > Δ ≡ c₁ log N / N, the bias corrected MLE −p̂_i log p̂_i + 1/(2N) is used. In practice, c₁ ≈ 0.5 yields good results [17] (see footnote 5 below). In the case p̂_i ≤ Δ a polynomial approximation of −p_i log p_i is calculated and then estimated. The order of the polynomial should be D ≈ c₂ log N, where a good choice of the constant turns out to be c₂ ≈ 0.7 [17]. The employed approximation is the so-called minimax polynomial, also called best approximation in the Chebyshev sense [40, 41]. It is defined as the polynomial with the smallest maximal distance to the true function,

max_{0≤x≤Δ} |P_minimax(x) − f(x)| = inf_{P ∈ poly_D} max_{0≤x≤Δ} |P(x) − f(x)|.   (3.23)

The space poly_D is the space of all polynomials of order up to D. In our case the target function is f(x) = −x log x.
One may realize that the idea behind the minimax polynomial is similar to the idea behind the minimax risk from estimation theory, see Definition 3.5. The minimax polynomial can be calculated using the Remez algorithm [40, 41]. It is possible (and advisable) to calculate the polynomial for the interval 0 ≤ x ≤ 1 and then perform a variable transformation to the desired interval [0, Δ]. In this way, one can calculate polynomials up to a desired order (e.g. 10) and store them for future applications. If Σ_{d=0}^{D} r_d x^d is the polynomial for 0 ≤ x ≤ 1, then the polynomial for 0 ≤ x ≤ Δ reads [18]

Σ_{d=0}^{D} (r_d − δ_{d,1} log Δ) Δ^{−d+1} x^d.   (3.24)

Footnote 5: In [17] the authors state that in practical applications c₁ ∈ [0.1, 0.5] yielded good results. In a newer version of the article the authors recommend c₁ ∈ [0.05, 0.2]. Our own tests in the next subsection show that the estimator with c₁ = 0.5 works well.

The final polynomial turns out to be easier to estimate than the original expression f(p_i) = −p_i log p_i, so that the gain in estimation accuracy is larger than the loss due to the approximation. To estimate the polynomial, each monomial p_i^d is estimated separately by the estimate N_i(N_i − 1) ⋯ (N_i − d + 1)/N^d. Under so-called Poisson sampling this estimate is unbiased [17, 18]. Poisson sampling means that each N_i is independently drawn from a Poisson distribution with expectation N p_i. In contrast, when drawing a sample from the original multinomial distribution (3.9), the N_i are not independent due to the normalization constraint Σ_i N_i = N. Poisson sampling can be justified since the Poisson distribution is peaked sharply around its expectation N p_i. Thus, already for rather small N, the normalization constraint will be satisfied at least approximately. Poisson sampling is used as a technique to simplify analytical calculations; mathematical relations between the Poisson model and the multinomial model exist [18]. For numerical simulations the samples will be drawn from the proper multinomial distribution.

3.2.5 Comparison of MLE and minimax estimator for entropy

We briefly summarize the different estimators that we want to compare in this subsection. All estimators have in common that each summand in H = Σ_i (−p_i log p_i) is estimated separately.

Maximum likelihood estimator (MLE): The empirical probability p̂_i = N_i/N is used to estimate −p_i log p_i by −p̂_i log p̂_i. Hence, the MLE is simply the plug-in estimator. In general, the MLE is expected to suffer from severe bias.

Naively bias corrected MLE (n.b.c. MLE): Independently of the value p̂_i, the estimate −p̂_i log p̂_i is replaced by −p̂_i log p̂_i + 1/(2N).

Miller-Madow MLE (MM-MLE): A bias corrected version of the MLE where for p̂_i > 0 the estimate −p̂_i log p̂_i is replaced by −p̂_i log p̂_i + 1/(2N).

Minimax estimator: For large probabilities p̂_i > Δ = c₁ log N / N (c₁ = 0.5) the bias corrected version of the MLE, −p̂_i log p̂_i + 1/(2N), is used. For p̂_i ≤ Δ an optimal polynomial approximation of −p_i log p_i (of order D ≈ c₂ log N with c₂ = 0.7) is estimated. Employing the polynomial approximation aims to reduce the bias.
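To make the comparison concrete, the following Python sketch implements the four estimators summarized above in simplified form. It is only an illustration under stated assumptions: the function name entropy_estimators is hypothetical, the counts vector is assumed to cover the whole alphabet (including zero counts), and the least-squares polynomial fit is a stand-in for the true minimax (Remez) polynomial of (3.23)/(3.24); the code used for the simulations in this thesis is not reproduced here.

```python
import numpy as np

def entropy_estimators(counts, c1=0.5, c2=0.7):
    """Return MLE, naively bias-corrected MLE, Miller-Madow MLE and a
    simplified minimax-style estimate of H(P) from a vector of counts."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p_hat = counts / N
    nz = p_hat > 0

    plugin_terms = np.where(nz, -p_hat * np.log(np.where(nz, p_hat, 1.0)), 0.0)
    H_mle = plugin_terms.sum()                  # plug-in estimator, eq. (3.18)
    H_nbc = H_mle + len(p_hat) / (2 * N)        # naive correction: +1/(2N) for every term
    H_mm = H_mle + nz.sum() / (2 * N)           # Miller-Madow: only for terms with p_hat > 0

    # Simplified minimax-style estimator: threshold Delta = c1 * log(N)/N and a
    # polynomial of order D ~ c2 * log(N) for the small-probability regime.
    delta = c1 * np.log(N) / N
    D = max(int(round(c2 * np.log(N))), 1)
    # Least-squares approximation of -x log x on [0, delta] as a stand-in for
    # the Remez-computed minimax polynomial; coefficients in increasing degree.
    x = np.linspace(1e-12, delta, 200)
    g = np.polynomial.Polynomial.fit(x, -x * np.log(x), D).convert().coef

    def poly_term(n_i):
        # Estimate sum_d g_d * p_i^d, each monomial via the falling-factorial
        # estimator N_i (N_i - 1) ... (N_i - d + 1) / N^d (unbiased under Poisson sampling).
        est = g[0]
        ff = 1.0
        for d in range(1, len(g)):
            ff *= max(n_i - (d - 1), 0.0)
            est += g[d] * ff / N ** d
        return est

    H_mmx = 0.0
    for n_i, ph in zip(counts, p_hat):
        if ph > delta:
            H_mmx += -ph * np.log(ph) + 1 / (2 * N)   # large-probability regime
        else:
            H_mmx += poly_term(n_i)                   # small-probability regime
    return H_mle, H_nbc, H_mm, H_mmx
```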
We start by reproducing some of the numerical results from Reference [17]. It is worth mentioning that the authors of [17] only compared the minimax estimator to the MLE without bias correction (and one additional estimator that we do not regard here); no bias corrected version of the MLE was considered (see footnote 6). Indeed, for uniform distributions and quite large samples one rarely is in the regime p̂_i ≤ Δ. In this case the minimax estimator essentially reduces to the n.b.c. MLE. Therefore, some of the results for the minimax estimator from [17] can already be obtained by the n.b.c. MLE. Since the idea of the minimax estimator is to significantly reduce the bias at the cost of slightly increasing the variance, it even happens that the n.b.c. MLE yields slightly better results than the minimax estimator. Note that this is no contradiction to the definition of the minimax estimator. The minimax estimator only guarantees the best worst case behaviour; for specific distributions (here uniform ones) other estimators might perform better. The superiority of the minimax estimator becomes more evident when considering non-uniform distributions or very small sample sizes. Thus, we extend the numerical simulations in this direction.

Footnote 6: Note that this was the case in the version of [17] that was available when writing this chapter (arXiv version 3). Newer versions also contain the Miller-Madow MLE and a lot of additional estimators.

MSE along N = 8K/log K. In Reference [17] it is shown numerically that the empirical mean square error MSE = (1/M) Σ_{m=1}^{M} (Ĥ_m − H)² along N = 8K/log K is bounded for the minimax estimator but unbounded for the MLE. Note that these results follow theoretically from the risks given in (3.21) and (3.22). Along N = cK/log K the minimax risk turns out to be

R_H ≍ 1/c² + (log³ K)/(cK)  →  1/c²   (K → ∞),   (3.25)

while the risk of the MLE becomes

R_{Ĥ_MLE} ≍ (log² K)/c² + (log³ K)/(cK)  ∝  log² K   (K ≫ 1).   (3.26)

The log² K increase for the MLE stems from the uncontrolled bias. For each alphabet size K, the samples are drawn from the corresponding uniform distribution. The empirical MSE is obtained by averaging (Ĥ − H)² over M = 10 Monte Carlo simulations. Our results for all four estimators are given in Figure 5. They are in accordance with the results from Reference [17] (for the minimax estimator and the uncorrected MLE).

Figure 5: MSE along N = 8K/log K for the minimax estimator, the MLE and the bias corrected versions of the MLE. As expected the MSE of the MLE grows with increasing alphabet size K. The Miller-Madow correction reduces the MSE but it remains an increasing function of the alphabet size. The MSEs of the other two estimators are bounded.

We observe that the unboundedness of the MLE already vanishes for the n.b.c. MLE. Note that the minimax estimator does not reduce to the n.b.c. MLE here. To see this, realize that for large K the sample size N = 8K/log K is of the same order of magnitude as the alphabet size, or even smaller. Consequently, many of the empirical observation frequencies take the value N_i = 0, 1 and thus satisfy the condition N_i/N = p̂_i ≤ Δ = c₁ log N / N (for N = 50, c₁ = 0.5 and log being the natural logarithm we have 0/50, 1/50 ≤ 0.0391). For these p̂_i the minimax estimator indeed resorts to the polynomial approximation instead of the first-order bias correction −p̂_i log p̂_i + 1/(2N). The strong results for the minimax estimator shown in Figure 5 thus provide evidence that the polynomial approximation at the heart of the minimax estimator indeed works well.
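A reduced version of this experiment can be sketched as follows: for each K, draw samples of size N = 8K/log K from the uniform distribution and average the squared error over a few runs. For brevity only the plug-in and the naively corrected estimator are defined inline; the minimax-style estimator from the sketch above could be passed in as a further estimator argument. In line with (3.25) and (3.26), the error of the MLE should grow with K while the corrected version stays bounded.

```python
import numpy as np

rng = np.random.default_rng(5)

def mle_entropy(counts):
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

nbc_entropy = lambda c: mle_entropy(c) + len(c) / (2 * c.sum())   # naive +1/(2N) per term

def empirical_mse(K, estimator, M=10):
    """Empirical MSE of an entropy estimator for the uniform distribution on K
    symbols, with sample size N = 8K/log K (rounded), averaged over M runs."""
    N = max(int(round(8 * K / np.log(K))), 1)
    H_true = np.log(K)
    errs = [(estimator(rng.multinomial(N, np.full(K, 1.0 / K))) - H_true) ** 2
            for _ in range(M)]
    return np.mean(errs)

for K in (10, 100, 1000, 10_000):
    print(K, empirical_mse(K, mle_entropy), empirical_mse(K, nbc_entropy))
```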
Performance for large K and N. Again motivated by Reference [17], we consider uniform distributions for three combinations of the alphabet size K and the sample size N:

data rich:               K = 200,     N = 10 000
data sparse:             K = 20 000,  N = 10 000
extremely data sparse:   K = 20 000,  N = 1 000

The terms 'data rich' and 'data sparse' have been adopted from [17]; the extremely data sparse regime was not considered there. Furthermore, we consider a non-uniform distribution in the data sparse regime. In both cases, the goal is that there should be a large number of empirical probabilities p̂_i = 0, which the MLE and its bias corrected versions are not expected to handle well. The non-uniform distribution is generated by drawing each probability p_i from a beta distribution

p_Beta(x) = Γ(α + β)/(Γ(α)Γ(β)) · x^(α−1) (1 − x)^(β−1),   0 ≤ x ≤ 1,  α, β > 0,   (3.27)

with α = 0.6 and β = 0.5. The emerging vector p is then normalized in order to obtain a valid probability distribution. In all four cases we draw 20 samples and plot the resulting estimates together with the true entropy, see Figure 6. In the data rich regime (upper left plot) the minimax estimator and the bias corrected versions of the MLE coincide and are extremely accurate. The MLE is slightly biased but still acceptable. Due to the large sample size we are almost never in the regime p̂_i ≤ Δ in which the polynomial approximation of the minimax estimator is used. Thus, the minimax estimator essentially reduces to the n.b.c. MLE. In the uniform, data sparse case (upper right plot) the MLE as well as its Miller-Madow version are strongly negatively biased. The best result is obtained by the naively bias corrected MLE, but the performance of the minimax estimator is satisfying as well. In the extremely data sparse regime (lower left plot) all variants of the MLE are strongly biased, while the minimax estimator is still rather close to the true entropy. For the non-uniform distribution (lower right plot) we get a similar picture. The minimax estimator clearly outperforms the other estimators in this case. The results demonstrate the great performance of the minimax estimator for rather large alphabets. In the context of causal inference, however, one rarely deals with alphabets of size 200 or even 20 000. In the next paragraph we therefore consider smaller alphabets.

Figure 6: Estimated and true entropies for different distributions, alphabet sizes and sample sizes (panels: uniform K = 200, N = 10 000; uniform K = 20 000, N = 10 000; uniform K = 20 000, N = 1000; Beta-generated K = 20 000, N = 10 000). The minimax estimator is the only estimator that always provides a reliable result.

Performance for small alphabets. We conduct the same simulations as before (again for uniform distributions) but this time for alphabets of size K = 2 and K = 10 with sample size N = 50. Figure 7 suggests that the superiority of the minimax estimator vanishes for smaller alphabets. One possible explanation is that these combinations of K and N already correspond to the data rich regime from above, where the bias corrected versions of the MLE coincided with the minimax estimator as well.
Eventually, we are not interested in estimating a single entropy term but more complicated entropic expressions constraining a given DAG. One example is the expression I(A; B) + I(A; C) − H(A), which is upper bounded by zero for distributions compatible with the triangular scenario (see also (3.2)). This requires in particular estimation of mutual information, which is the subject of the next paragraph.

Figure 7: Estimated and true entropies for the uniform distributions with alphabet sizes K = 2, 10 and sample size N = 50. In both cases the minimax estimator (almost) coincides with the bias corrected versions of the MLE.

Mutual information for small alphabets. In order to estimate mutual information we use the decomposition I(A; B) = H(A) + H(B) − H(A, B) and estimate each entropy separately. The joint distribution P(A, B) required to estimate H(A, B) has alphabet size K². For K = 10 (and N = 50) we might already be in a data sparse regime in which the minimax estimator typically outperforms the different versions of the MLE. To generate a joint distribution with some dependence between the variables, we (partially following [17]) first draw the marginal P(A) with the help of the Beta distribution (3.27) (see also the non-uniform case in the paragraph 'Performance for large K and N'). Then, with probability x we set b = a, and with probability (1 − x) we set b uniformly at random. The resulting joint distribution reads P(a, b) = P(a)[x δ_ab + (1 − x)/K]. For x = 0 the variables are independent, while for x = 1 they are deterministically dependent. For x = k/10 (k = 0, ..., 10) we draw 100 samples and calculate the empirical MSE of each estimator. We consider alphabet sizes K = 2, 10 and sample size N = 50. The results are shown in Figure 8. For K = 2 there is hardly any difference between the estimators, but for K = 10 the minimax estimator is, as suspected, clearly superior to the MLE and its bias corrected versions. One may also realize that (for K = 10) the n.b.c. MLE is strong for weak dependence (MSE(x = 0) ≈ 0) but weak for strong dependence (MSE(x = 1) ≈ 0.8). The reason is that for x = 1 there are many probabilities P(a, b) = 0, which leads to a large overcorrection of the bias when applying the correction +1/(2N) to all terms. The uncorrected MLE and the Miller-Madow MLE show the opposite behaviour, though with smaller magnitude.

Figure 8: MSE for estimating mutual information by employing the minimax estimator, the MLE and the bias corrected versions of the MLE. The parameter x determines the dependence between the variables (x = 0: independent, x = 1: deterministically dependent). For K = 2 all estimators perform equally well. For K = 10 the minimax estimator is clearly superior to the MLE and its bias corrected versions.
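The construction of the dependent joint distribution and the mutual-information MSE can be sketched as follows. For brevity only the plug-in estimator is used; the minimax estimator (or any of the corrected variants) could be substituted for mle_entropy. The function names and the random seed are illustrative choices, not the code behind Figure 8.

```python
import numpy as np

rng = np.random.default_rng(6)

def mle_entropy(p_or_counts):
    """Plug-in entropy; works for count vectors and for probability vectors alike."""
    p = np.asarray(p_or_counts, dtype=float).ravel()
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log(p))

def dependent_joint(K, x, rng):
    """P(a, b) = P(A=a) * [x * delta_ab + (1 - x)/K], with P(A) drawn from a
    Beta(0.6, 0.5) distribution and normalized, as described in the text."""
    pA = rng.beta(0.6, 0.5, size=K)
    pA /= pA.sum()
    P = np.outer(pA, np.full(K, (1 - x) / K))
    P[np.diag_indices(K)] += x * pA
    return P

def mi_plugin(table):
    """I(A;B) = H(A) + H(B) - H(A,B), each entropy estimated by the plug-in."""
    return (mle_entropy(table.sum(axis=1)) + mle_entropy(table.sum(axis=0))
            - mle_entropy(table))

K, N = 10, 50
for x in (0.0, 0.5, 1.0):
    P = dependent_joint(K, x, rng)
    I_true = mi_plugin(P)                     # exact, since P sums to one
    errs = [(mi_plugin(rng.multinomial(N, P.ravel()).reshape(K, K)) - I_true) ** 2
            for _ in range(100)]              # 100 samples per x, as in the text
    print(x, np.mean(errs))
```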
3.2.6 Conclusion

All tests confirmed the theoretically expected great performance of the minimax estimator. In particular for large alphabets (compared to the sample size) the minimax estimator was typically far superior to the MLE and its bias corrected versions. While in some cases the Miller-Madow MLE and the naively bias corrected MLE performed quite well, they performed far worse in other cases. The minimax estimator is the only estimator that always provided reliable results. Even on the rare occasion (Figure 6, upper right plot) that another estimator performed better than the minimax estimator, the results of the latter were still satisfying. The only drawback is that the superiority of the minimax estimator seems to diminish for extremely small alphabets. In particular for K = 2 and N = 50 (which is the typical scenario considered in the next section) all estimators performed similarly. Still, the minimax estimator is overall the sole reliable estimator considered in this section and should always be preferred to the other estimators.

3.3 Hypothesis tests

In this section we construct and elaborate on hypothesis tests based on entropic inequalities. Precisely, we want to test membership to the triangular scenario (Figure 3) by employing inequalities (3.2) to (3.4). While primarily focusing on inequality (3.2), the latter inequalities come into consideration in Subsection 3.3.4. To estimate the required entropy terms, we employ the techniques introduced in the previous section, in particular the minimax estimator. The observables are assumed to be binary and the sample size shall be N = 50. Binary variables correspond to the case of simple 'yes' or 'no' statements. In a real study the observables could represent the occurrence of some symptoms while the hidden variables describe potential causes that are unmeasurable or not known to exist at all (e.g. unknown exposure to a substance or genetic factors). Larger alphabets of the observables could emerge if the symptoms can be further characterized by their strength, or if they are directly assessed in a quantitative way (e.g. the concentration of a substance in a blood sample). One can always construct binary variables from such data by asking whether a certain threshold value is exceeded or not. In that sense binary variables represent a rather general case. On the other hand, if the original data are not binary, one might lose information by employing such a thresholding procedure. We have also seen that for binary variables we hardly benefit from the minimax estimator (see Figure 8). It might therefore be preferable to keep larger alphabets. The main reason to consider the binary case is that the large simulations performed in this section would be computationally extremely expensive for larger alphabets. Note that apart from being discrete, no assumptions about the alphabets of the hidden variables are made.

3.3.1 Introduction to hypothesis tests

A hypothesis test first requires stating the null hypothesis, which is the standard hypothesis that is only rejected if strong evidence is found against it. The opposite is called the alternative hypothesis, which is accepted if and only if the null hypothesis is rejected. Often the null hypothesis states that some parameter takes a certain value, θ = θ₀. The alternative hypothesis then comprises all other possibilities, i.e. θ ≠ θ₀. In our case, one can think of two null hypotheses, one stating that the distribution underlying the data generating process is compatible with the triangular scenario (Figure 3), the second stating that the distribution satisfies inequality (3.2), I(A; B) + I(A; C) − H(A) ≤ 0.
Note that in both cases the null hypothesis does not single out one specific distribution (or parameter value), but comprises a large set of distributions. Since there are distributions that are not compatible with the DAG but fail to violate the inequality, the two hypotheses are indeed different. Since our ultimate aim is to decide compatibility with the DAG, the first null hypothesis should be preferred (if possible). In general, we will denote the null hypothesis by h₀ and the alternative hypothesis by h₁. Ideally, the null hypothesis should be accepted whenever it is true and rejected when it is false. Unfortunately, this ideal scenario is not realizable, since samples from compatible distributions can violate h₀ and samples from incompatible distributions can satisfy it. A type-I error is made when the null hypothesis is rejected although it is actually true. The opposite, i.e. accepting the null hypothesis when it is actually false, is called a type-II error. The type-I (type-II) error rate is denoted by α (β). The capability to correctly reject the null hypothesis (i.e. to correctly identify incompatible data) is called the power of the hypothesis test and evaluates to 1 − β. In general, there is a trade-off between the type-I and type-II error rates, meaning that trying to decrease one leads to an increase of the other. Hypothesis tests are often constructed to control the type-I error, typically at α = 0.05. This means that the test must reject at most 100α% of samples stemming from distributions compatible with the null hypothesis. The bound α is then also called the significance level of the test, and one says that the null hypothesis is rejected (or accepted) at the 100α% level.

We distinguish two different approaches to hypothesis testing, the direct and the indirect approach. After a first introduction and methodical comparison below, the implementation of the approaches and a detailed comparison follow in Subsections 3.3.2 and 3.3.3. In both cases, the quantity (or statistic) that we have to estimate from the data is T ≡ I(A; B) + I(A; C) − H(A) (see inequality (3.2)). Our direct test is similar to the one already introduced in [16]. The differences lie in a small revision of the construction of the test, and in the fact that we also consider the minimax estimator of entropy.

Direct approach. Assume that for some sample we obtain the estimate T̂. If, under the null hypothesis, the probability to obtain an even larger value (T̂′) is at most α,

P(T̂′ > T̂ | h₀) ≤ α,   (3.28)

the result is called significant and we reject the null hypothesis at the 100α% level. 'Under the null hypothesis' means that we have to consider all distributions compatible with h₀. Calculating the probability (3.28) requires knowledge of the distribution of estimates T̂′ of the statistic T under the null hypothesis, in particular under the worst case (or least favorable) distribution among h₀. Loosely speaking, the worst case distribution is the distribution leading to the largest estimates T̂′. The requirement of a worst case distribution causes a huge problem, since finding (or even proving) the worst case is far from obvious. The best we can do is make an educated guess and try to confirm with numerical simulations that we do not find an even worse case. Once a candidate for the worst case distribution is selected, the corresponding distribution of T̂′ values can be constructed via a large number of Monte Carlo simulations.
In practice, we are interested in the 100(1 − α)% quantile, t, of this distribution, defined by

P(T̂′ > t | h₀, worst case) = α.   (3.29)

The value t is then employed as a threshold value for the final hypothesis test. Whenever we find T̂ > t for some data set, the null hypothesis is rejected. For a graphical illustration in comparison to the indirect approach see Figure 9. In terms of the quantile t, the worst case distribution is the h₀-compatible distribution yielding the largest t value. By definition, we then obtain

P(T̂′ > t | h₀) ≤ α   (3.30)

for all other distributions compatible with the null hypothesis. Thus at most 100α% of samples stemming from compatible distributions are rejected, implying that the type-I error rate is, as desired, upper bounded by α. Note, however, that this is only true if we found the correct threshold value (and thus the correct worst case distribution). Otherwise, the hypothesis test tends to reject too many samples and does not properly work at the 100α% level. A major advantage of the direct approach is that it allows us to implement the preferred null hypothesis h₀: 'data are compatible with the triangular scenario'. To this end, the worst case distribution is searched only among distributions that are compatible with the DAG, instead of the larger set of distributions that are compatible with the inequality T ≤ 0.

Indirect approach. A major drawback of the direct approach is its dependence on our ability to find the correct threshold value. This will prove difficult already for the triangular scenario with binary observables and employing inequality (3.2). For inequalities (3.3) and in particular (3.4), for which we lack an intuitive understanding, the task becomes even more complex, if not intractable. Similar problems might occur for larger DAGs, or already for larger alphabets. In the direct approach, once a threshold value is at hand, only the point estimate T̂ of the data sample is taken into account, without any measure of uncertainty. A natural alternative approach, without the necessity of a threshold value, is to compute a confidence interval for the estimate T̂ and check whether this interval overlaps with T = 0. In the current case, we would be interested in a one-sided 95% interval [T̂_0.05, T̂_max]. The upper endpoint T̂_max is the maximal value that can be achieved by any empirical distribution; for T = I(A; B) + I(A; C) − H(A), for example, T̂_max ≈ log 2. (Depending on the distribution and the employed estimator, T̂_max might also be smaller or even larger than log 2. In principle, since only the lower endpoint of the interval is relevant for our purpose, we could also set T̂_max = ∞.) If the confidence interval overlaps with zero, T̂_0.05 ≤ 0, we accept the null hypothesis h₀: 'data are compatible with the inequality T ≤ 0'. If the lower endpoint of the interval is larger than zero, T̂_0.05 > 0, the null hypothesis is rejected at the 5% level. For a graphical illustration with comparison to the direct approach see Figure 9. Note that since no additional information about the DAG is included, the indirect approach automatically uses the null hypothesis of compatibility with the inequality instead of the stronger hypothesis of compatibility with the DAG. The main task in the indirect approach is the construction of the confidence interval.
If we could sample at will from the true underlying distribution P, we could draw an arbitrary number of samples, reconstruct the correct distribution of T̂ values and calculate any quantity of interest, including confidence intervals. Typically, however, we only have access to a single data set of presumably small size. Thus, we have to resort to other methods. One typical approximation in statistics, using asymptotic normal theory, is to estimate

T̂_0.05 = T̂ + z_0.05 σ̂,   (3.31)

where T̂ is the original estimate, σ̂ some estimate of its standard deviation, and z_0.05 ≈ −1.645 the 5% quantile of the standard normal distribution. In detail, the approximation assumes that estimates of the statistic T are distributed normally around their mean. Our data estimate T̂ automatically serves as an estimate of the mean value of this distribution. The estimate σ̂ of the standard deviation has to be calculated by other means. The expression T̂ + z_0.05 σ̂ is then the 5% quantile of the distribution N(T̂, σ̂). Aside from the necessity to assess σ̂, the main problem of this procedure is the reliance on a strong asymptotic approximation. In practice, this approximation may be highly inaccurate and consequently lead to wrong confidence intervals [42]. A more sophisticated method, typically resulting in more accurate intervals, is introduced in Subsection 3.3.3.

Figure 9: Plot on the left: Schematic distribution of T̂′ values under the worst case distribution compatible with the DAG, P(T̂′ | h₀, worst case). The 95% quantile of this distribution, t, is employed as a threshold value for the direct approach. If a real data estimate T̂ falls into the shaded area (or beyond; T̂ > t) the null hypothesis is rejected. Plot on the right: Schematic, estimated distribution of T̂′ values given the observed value T̂, P(T̂′ | observed T̂). The 5% quantile, T̂_0.05, is the lower endpoint of a left-sided 95% confidence interval for the estimate T̂. If this interval (the unshaded area) does not overlap with zero (T̂_0.05 > 0) the null hypothesis (T ≤ 0) is rejected. Note the difference that the direct approach requires the calculation of a right-sided 95% quantile, while the indirect approach uses a left-sided interval (or quantile). Also, in the direct approach the quantile (i.e. the value t) is calculated beforehand for the (supposed) worst case among h₀, and later we simply test T̂ > t. In the indirect approach the interval (i.e. the lower endpoint T̂_0.05) is calculated for each data sample for which we want to test compatibility, and we then test T̂_0.05 > 0. Thus, for the direct approach the preparation (constructing the threshold t) is complicated, while the resulting test is rather simple. The indirect approach needs no such preparation, but the actual test is more complex. Since the approaches are rather different in nature, it is difficult to foresee which approach might result in the stronger test. More advantages and disadvantages of the two approaches are pointed out in the following subsections.

3.3.2 Direct approach

The first step in implementing the direct approach is to identify the worst case distribution. In [16] the following educated guess was made for the triangular scenario with inequality (3.2) (T = I(A; B) + I(A; C) − H(A) ≤ 0):

1. The worst case distribution should lie on the boundary T = 0.
2. Among the DAG-compatible distributions this requires A to be a deterministic function of either B or C (by choice B).

3. The fluctuations of T̂ should be largest if A = B ∼ uniform and independently C ∼ uniform.

The obtained threshold value (using the maximum likelihood estimator of entropy) for α = 0.05 was t = 0.0578 bits (or t = 0.0401 nats). In the following we show that the supposed worst case distribution is not the true worst case. While we can slightly adjust the aforementioned threshold value, the main message is that finding the correct value is a formidable task. We keep the above assumptions (1) and (2) intact, but replace the uniform distributions from assumption (3) by Bernoulli distributions A = B ∼ (q_AB, 1 − q_AB) and C ∼ (q_C, 1 − q_C). We consider two scenarios:

1. Fix q_AB = 0.5 and vary 0.5 ≤ q_C ≤ 1.

2. Set q_AB = q_C and vary 0.5 ≤ q_C ≤ 1.

In both scenarios we calculate the 95% quantile of estimates T̂ as a function of q_C. To this end, we conduct 200 000 Monte Carlo simulations for each considered value of q_C. To estimate T̂ we employ the maximum likelihood estimator as well as the minimax estimator from Subsection 3.2.4. While later restricting to the minimax estimator, there are two reasons to keep the MLE for now. First, we want to compare our results of the direct test to the results from [16], which were also based on the MLE. Second, we want to check whether the two estimators behave similarly when varying q_AB and q_C. The results are presented in Figure 10.

Figure 10: Threshold value t (95% quantile of the distribution of T̂ values) obtained for the families of distributions described in the text (one panel for the MLE, one for the minimax estimator). The originally supposed worst case value (corresponding to q_AB = q_C = 0.5) is exceeded in both cases and for both estimators. The results suggest that the task of analytically finding (or proving) the correct threshold value might be complicated.

In both scenarios and for both estimators, the ideal threshold value is not provided by the originally supposed worst case distribution (q_AB = q_C = 0.5). The maximal values t_MLE = 0.0461 and t_minimax = 0.0506 are instead obtained for q_AB = 0.5 and q_C ≈ 0.9. Other combinations of q_AB and q_C might lead to even larger values. Note that in the uniform case, within sufficient accuracy, we obtain the same value t_MLE = 0.0400 as [16]. In general, the qualitative behaviour is similar for both estimators (though more pronounced for the minimax estimator) and suggests that finding the correct worst case distribution with analytical arguments might be intractable. A monotone behaviour, for example, would probably be easier to handle. Also recall that we only relaxed the third assumption from the heuristics employed in [16]. The validity of the first two assumptions is not obvious either. For larger DAGs, inequalities or alphabets, the task becomes even more complex. Amongst other things, numerical simulations similar to those from Figure 10 become drastically more time consuming for larger alphabets. Overall, we should thus be cautious when using a hypothesis test based on a threshold value obtained by such vague means. An underestimated threshold value would cause the test to reject h₀ too readily. The seemingly large power would be misleading, since the test would no longer properly control the type-I error rate at 5%.
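A minimal sketch of how such a threshold can be obtained by simulation is given below. It draws samples from a candidate worst-case distribution (A = B ∼ Bernoulli(q_AB), independently C ∼ Bernoulli(q_C), which lies on the boundary T = 0) and takes the 95% quantile of the resulting T̂ values, cf. (3.29). For simplicity the plug-in (MLE) estimate of T is used and the number of Monte Carlo runs is reduced; the minimax estimator and 200 000 runs would be substituted for the actual analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def T_hat(counts):
    """Plug-in estimate of T = I(A;B) + I(A;C) - H(A) from a 2x2x2 count table."""
    P = counts / counts.sum()
    H_A = entropy(P.sum(axis=(1, 2)))
    H_B = entropy(P.sum(axis=(0, 2)))
    H_C = entropy(P.sum(axis=(0, 1)))
    return (H_A + H_B - entropy(P.sum(axis=2))) \
         + (H_A + H_C - entropy(P.sum(axis=1))) - H_A

def candidate_worst_case(q_AB, q_C):
    """A = B ~ Bernoulli(q_AB), independently C ~ Bernoulli(q_C): lies on T = 0."""
    P = np.zeros((2, 2, 2))
    for a in (0, 1):
        for c in (0, 1):
            pa = q_AB if a == 0 else 1 - q_AB
            pc = q_C if c == 0 else 1 - q_C
            P[a, a, c] = pa * pc          # b = a deterministically
    return P

def threshold(P, N=50, alpha=0.05, runs=20_000):
    """(1 - alpha) quantile of T-hat under samples of size N drawn from P, cf. (3.29)."""
    flat = P.ravel()
    stats = [T_hat(rng.multinomial(N, flat).reshape(2, 2, 2)) for _ in range(runs)]
    return np.quantile(stats, 1 - alpha)

print(threshold(candidate_worst_case(0.5, 0.5)))   # originally supposed worst case
print(threshold(candidate_worst_case(0.5, 0.9)))   # the scan above suggests larger t here
```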
Despite these problems we still want to run tests based on the obtained threshold values in order to get an impression of the tests' performances. We consider the same family of distributions that was used in [16]: three initially perfectly correlated, binary variables are flipped independently with probability p_flip. This gives rise to the distribution

P(a, b, c) = (1/2)[(1 − p_flip)³ + p_flip³]                       if a = b = c,
P(a, b, c) = (1/2)[p_flip(1 − p_flip)² + p_flip²(1 − p_flip)]     otherwise.   (3.32)

Figure 11 shows the true value T as a function of 0 ≤ p_flip ≤ 0.5. For p_flip = 0 the distribution is certainly not compatible with the DAG and neither with the inequality (I(A; B) = I(A; C) = H(A) = log 2 → T = log 2 > 0). For p_flip = 0.5 all variables are independently uniform, which is clearly compatible with the DAG and leads to T = −log 2 ≤ 0 (I(A; B) = I(A; C) = 0, H(A) = log 2). For the critical value satisfying T(p_flip) = 0, we obtain p_flip = 0.0584. Thus, the distribution violates the inequality only for rather small flip probabilities. On the other hand, we do not know at which value of p_flip the distribution changes its compatibility with the DAG. Since the entropic description is an outer approximation, we only know that this value has to be larger than 0.0584.

Figure 11: Value of the statistic T = I(A; B) + I(A; C) − H(A) for the family of 'flip distributions' (3.32). Starting with three binary, perfectly correlated observables, each observable is flipped independently with probability p_flip. Values T > 0 are evidence of incompatibility of the distribution with the triangular scenario (for the DAG see Figure 3). Values T ≤ 0 indicate (but do not prove) compatibility with the DAG. For the critical value satisfying T(p_flip) = 0, we obtain p_flip = 0.0584.

In order to get an impression of the direct hypothesis test we consider 10 000 samples (of size N = 50) for each value of p_flip = 0, 0.005, ..., 0.1. Figure 12 shows the ratio of samples that get rejected by the hypothesis test (i.e. for which we find T̂ > t) as a function of p_flip. While we are particularly interested in the power of the test, we generally call this ratio the rejection rate. Recall that the power is defined as the capability to correctly reject samples from incompatible distributions. Thus, for distributions that are incompatible with the triangular scenario, the rejection rate is indeed the power. However, right now this is only known to be the case for p_flip < 0.0584. In this regime, the violation of the inequality (see Figure 11) implies incompatibility with the DAG. For larger values of p_flip we do not know whether or not the distribution is compatible with the DAG. For this reason we use the general term 'rejection rate' instead of 'power'. The test shown in Figure 12 is similar to the test originally proposed in [16]. One difference is that we use a slightly updated threshold value. Furthermore, we also use the minimax estimator of entropy rather than only the MLE, as was the case in [16]. For values close to p_flip = 0 the test correctly rejects almost all samples. For p_flip ≈ 0.1 the rejection rate is close to zero. Presumably, the distribution is compatible with the DAG for these large values of p_flip, meaning that the small rejection rates are indeed desired. For the range in between, the rejection rate varies only slowly. Instead of the rather flat curve, we would have preferred a sharp edge near p_flip = 0.0584. In this case, compatible distributions would be reliably accepted while incompatible distributions would be reliably rejected.
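A sketch of how such a rejection-rate curve could be simulated is given below: it constructs the flip distribution (3.32), draws samples of size N = 50, estimates T with the plug-in estimator and compares against a threshold t. The reduced numbers of samples, the plug-in estimator and the MLE-based threshold 0.0401 nats quoted from [16] are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(2)

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def T_hat(counts):
    P = counts / counts.sum()
    H_A = entropy(P.sum(axis=(1, 2)))
    H_B = entropy(P.sum(axis=(0, 2)))
    H_C = entropy(P.sum(axis=(0, 1)))
    return (H_A + H_B - entropy(P.sum(axis=2))) \
         + (H_A + H_C - entropy(P.sum(axis=1))) - H_A

def flip_distribution(p_flip):
    """Eq. (3.32): three perfectly correlated bits, each flipped independently."""
    P = np.empty((2, 2, 2))
    same = 0.5 * ((1 - p_flip) ** 3 + p_flip ** 3)
    diff = 0.5 * (p_flip * (1 - p_flip) ** 2 + p_flip ** 2 * (1 - p_flip))
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                P[a, b, c] = same if a == b == c else diff
    return P

def rejection_rate(p_flip, t, N=50, samples=2_000):
    """Fraction of size-N samples whose T-hat exceeds the threshold t."""
    flat = flip_distribution(p_flip).ravel()
    hits = sum(T_hat(rng.multinomial(N, flat).reshape(2, 2, 2)) > t
               for _ in range(samples))
    return hits / samples

for p in (0.0, 0.02, 0.06, 0.1):
    print(p, rejection_rate(p, t=0.0401))
```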
Figure 12: Rejection rates of the direct hypothesis tests based on inequality (3.2) for the family of 'flip distributions' (3.32). One test uses the MLE, the other uses the minimax estimator of entropy (see Subsection 3.2.4). Both tests aim for the null hypothesis h₀: 'data are compatible with the triangular scenario'. For p_flip < 0.0584 (vertical line) the null hypothesis is violated. In this regime, large rejection rates (being the powers of the tests) are desired. For p_flip ≥ 0.0584 compatibility with the triangular scenario is not known.

The main cause for the flat curve is the large variance of the distribution of estimates T̂. To give an example, for p_flip = 0.06 the standard deviation of our 10 000 estimates is roughly σ ≡ √(Var[T̂]) ≈ 0.16 for both estimators. Similar standard deviations are obtained for other values of p_flip (which are not too close to zero). For values of p_flip for which the true value T is close to the threshold value t, measured in units of the standard deviation σ, the test is naturally rather indecisive (corresponding to a rejection rate in the interval, say, [0.2, 0.8]). Next, realize that the total range of T values for 0 ≤ p_flip ≤ 0.1 is −0.25 ≤ T ≤ log 2 ≈ 0.69. Since this range is rather small compared to σ ≈ 0.16 (less than six standard deviations), the range of indecisive p_flip values is quite large. A graphical illustration is provided in Figure 13.

Figure 13: Histograms of estimates T̂ (employing the minimax estimator) for p_flip = 0.03 and p_flip = 0.09 obtained by 100 000 Monte Carlo simulations for each value. Due to the large width of the distributions (σ ≈ 0.16), when varying p_flip the number of rejected samples (T̂ > t, black line) changes comparatively slowly. Concerning the large widths, it is worth noting that the histogram corresponding to p_flip = 0.03 roughly spans the total range of values −0.25 ≤ T ≤ log 2 corresponding to the regime 0 ≤ p_flip ≤ 0.1. For a significantly smaller width, the rejection rate would make a sudden jump when the center of the distribution is shifted over the threshold value t. This behaviour would have been desirable.

One straightforward possibility to obtain a sharper rejection curve is to increase the sample size, resulting in a reduction of the variance of estimates T̂. However, this is not an option if some real data sample is as small as N = 50. Another solution might be to find yet another estimation technique that reduces the variance of the MLE. The minimax estimator mainly aims to reduce the bias (which usually causes the most trouble when estimating entropies) and might even slightly increase the variance. However, finding an estimator that reduces the variance without disproportionately increasing the bias seems unlikely, since the minimax estimator already minimizes the combination of both terms. As a completely different matter, in Figure 12 the rejection rate for the test based on the MLE is systematically larger than the rate based on the minimax estimator.
A detailed view on the distributions of estimates T̂ for the supposed worst case distribution and an exemplary 'flip distribution' explains this difference. For the supposed worst case distribution the bias of T̂_MLE is B[T̂_MLE | worst case] = 0.012. For the 'flip distribution' with p_flip = 0.06 we obtain the larger bias B[T̂_MLE | p_flip = 0.06] = 0.034. This difference (+0.022) leads to a systematic overestimation of T̂_MLE for the 'flip distribution', which implies an overestimated rejection rate. This effect is particularly problematic if it occurs for distributions that are actually compatible with the DAG and should not be rejected. Furthermore, the opposite effect could decrease the rejection rate for incompatible distributions. In this way, an uncontrolled bias reduces the reliability of the hypothesis test. For the minimax estimator, on the other hand, we find B[T̂_minimax | worst case] = 0.003 and B[T̂_minimax | p_flip = 0.06] = 0.002. In this case, the difference between the biases (−0.001) is of much smaller magnitude, indicating a superior bias control by the minimax estimator. As a consequence, the corresponding test is potentially more reliable. The aim of the following subsections is to improve the direct test from Figure 12, both in terms of the power as well as the control of the type-I error rate.

3.3.3 Indirect approach (bootstrap)

As already pointed out several times, the main disadvantage of the direct approach is its dependence on our ability to find the worst case distribution. The indirect approach is free of this optimization problem but requires estimation of the lower endpoint of a left-sided 95% confidence interval [T̂_0.05, T̂_max]. Strong normality assumptions might lead to inaccurate intervals when the assumptions are not met, for example in the small sample regime considered in this thesis. More accurate intervals can be obtained by a technique called bootstrapping, introduced by statistician Bradley Efron in 1979 [43]. Bootstrapping belongs to the larger class of resampling techniques. The idea is that since we are not able to draw samples from the true distribution P, we instead draw so-called bootstrap samples from the empirical distribution P̂. We denote an empirical distribution of such a bootstrap sample by P̂* and a bootstrap estimate of the statistic T by T̂*. The sample size of the bootstrap samples is the same as the size of the original sample. By drawing a large number of bootstrap samples we obtain a whole distribution of estimates T̂* from which desired quantities like confidence intervals can be estimated. Ideally, bootstrapping should mimic sampling from the true distribution P, but since P̂ and P are in general not the same, bootstrapping is only an approximation as well. There exist several methods to estimate the endpoints of confidence intervals based on the bootstrap statistic T̂*. While the simple and intuitive techniques are often suboptimal, there also exist more involved or computationally heavy techniques which lead to more accurate intervals. Even though the bootstrap methods rest on certain assumptions that will not be entirely satisfied in practice, the assumptions should be closer to the true situation than the traditional normality assumptions (see (3.31)) [42]. A sound overview of relevant methods and a guideline to their application is provided by [44].
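The basic resampling step can be sketched as follows: starting from a (hypothetical) observed count table of size N = 50, B bootstrap samples of the same size are drawn from the empirical distribution and the plug-in estimate T̂* is recorded for each. The observed counts are invented for illustration, and the plug-in estimator stands in for the minimax estimator used in the actual simulations.

```python
import numpy as np

rng = np.random.default_rng(3)

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def T_hat(counts):
    """Plug-in estimate of T = I(A;B) + I(A;C) - H(A) from a 2x2x2 count table."""
    P = counts / counts.sum()
    H_A = entropy(P.sum(axis=(1, 2)))
    H_B = entropy(P.sum(axis=(0, 2)))
    H_C = entropy(P.sum(axis=(0, 1)))
    return (H_A + H_B - entropy(P.sum(axis=2))) \
         + (H_A + H_C - entropy(P.sum(axis=1))) - H_A

# Hypothetical observed count table of N = 50 joint outcomes (a, b, c).
observed = np.array([[[20, 2], [2, 1]],
                     [[1, 2], [2, 20]]])
N = observed.sum()
P_emp = observed / N                        # empirical distribution P-hat

# Draw B bootstrap samples of size N from P-hat and collect the sorted T-hat* values.
B = 999
T_star = np.sort([T_hat(rng.multinomial(N, P_emp.ravel()).reshape(2, 2, 2))
                  for _ in range(B)])
print(T_hat(observed), T_star[49], T_star[-50])   # original estimate, rough 5%/95% order statistics
```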
Here, we briefly introduce two methods, the simple percentile bootstrap and the advanced BCa bootstrap (bias corrected and accelerated). As a common framework, following [44], we assume that B = 999 bootstrap samples are drawn and the resulting estimates T̂*_i, i = 1, ..., B, are sorted in increasing order (T̂*_i ≤ T̂*_j whenever i < j). Larger numbers B ≈ 2000 are sometimes recommended, but the large simulations conducted here are quite expensive already for B = 999.

Percentile bootstrap: If we could sample from the true distribution we could calculate a large number of estimates T̂ and obtain T̂_0.05 as the 5% quantile of this distribution. Since the true distribution is not available, we replace T̂ by the bootstrap statistic T̂* and estimate the lower endpoint of the confidence interval by

T̂_0.05 = T̂*_50.   (3.33)

This method may perform poorly if the distributions of T̂ and T̂* differ significantly. In particular, the performance might suffer, first, if the distribution of T̂ is biased (E[T̂] ≠ T) or generally asymmetric, and second, if the standard deviation σ̂ = σ_T(T̂) of that distribution depends on the true value T.

BCa bootstrap: The bias corrected and accelerated bootstrap improves on the percentile method by addressing the aforementioned problems of the latter. The lower endpoint of the confidence interval is estimated by

T̂_0.05 = T̂*_⌊Q⌋,   (3.34)

with

Q = (B + 1) Φ( b + (b + z_0.05) / (1 − a(b + z_0.05)) ),   (3.35)

where ⌊·⌋ denotes the integer part, Φ the cumulative distribution function (CDF) of the standard normal distribution, and z_0.05 ≈ −1.645 its 5% quantile. The bias correction constant b can be estimated by

b = Φ⁻¹( #{T̂*_i < T̂} / B ),   (3.36)

where #{T̂*_i < T̂} is the number of bootstrap estimates that are smaller than the original estimate. The acceleration constant a (correcting the potential dependence of σ̂ = σ_T(T̂) on the true value T) can be estimated using a jackknife estimate: from the initial sample, omit the ith observation and estimate T̂_(i) based on the remaining sample of size N − 1. Proceed for all i = 1, ..., N and denote the mean of the estimates T̂_(i) by T̄. Then calculate

a = Σ_{i=1}^{N} (T̄ − T̂_(i))³ / ( 6 [ Σ_{i=1}^{N} (T̄ − T̂_(i))² ]^{3/2} ).   (3.37)

Some of the above formulas might look peculiar, but the BCa bootstrap is motivated and thoroughly examined in [42]. The obtained confidence intervals are usually highly accurate.

We simulate hypothesis tests based on the percentile and BCa bootstrap for the family of 'flip distributions' (3.32) known from the direct approach in Subsection 3.3.2. First, we compare the results of the different bootstrap approaches to each other (Figure 14); then (in Figure 15) we compare the more reliable bootstrap approach to the minimax version of the direct test from Figure 12. The bootstrap tests were carried out using the minimax estimator of entropy as well. For each value of p_flip we conducted 1000 initial Monte Carlo simulations and B = 999 bootstrap simulations for each initial sample.
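The two interval endpoints (3.33)-(3.37) can be written down compactly as functions of the sorted bootstrap estimates, the original estimate and the jackknife (leave-one-out) estimates. The sketch below reads the 50th order statistic of (3.33) more generally as the ⌊(B + 1)α⌋-th, and the synthetic usage numbers at the end are purely illustrative; how the jackknife estimates T̂_(i) are computed depends on the estimator and is not shown.

```python
import numpy as np
from scipy.stats import norm

def percentile_lower(T_star_sorted, alpha=0.05):
    """Percentile bootstrap lower endpoint, eq. (3.33): for B = 999 and alpha = 0.05
    this is the 50th smallest bootstrap estimate."""
    B = len(T_star_sorted)
    return T_star_sorted[int(np.floor((B + 1) * alpha)) - 1]

def bca_lower(T_star_sorted, T_hat_obs, T_jackknife, alpha=0.05):
    """BCa bootstrap lower endpoint, eqs. (3.34)-(3.37)."""
    B = len(T_star_sorted)
    z_alpha = norm.ppf(alpha)                               # z_0.05 ~ -1.645
    # bias correction constant b, eq. (3.36)
    b = norm.ppf(np.sum(T_star_sorted < T_hat_obs) / B)
    # acceleration constant a from the jackknife estimates, eq. (3.37)
    d = np.mean(T_jackknife) - np.asarray(T_jackknife)
    a = np.sum(d ** 3) / (6.0 * np.sum(d ** 2) ** 1.5)
    # corrected quantile index, eqs. (3.34) and (3.35)
    Q = (B + 1) * norm.cdf(b + (b + z_alpha) / (1 - a * (b + z_alpha)))
    idx = min(max(int(np.floor(Q)), 1), B)                  # guard against degenerate cases
    return T_star_sorted[idx - 1]

# Hypothetical usage with synthetic numbers (stand-ins for real bootstrap output):
rng = np.random.default_rng(4)
T_star = np.sort(rng.normal(0.05, 0.16, size=999))          # stand-in bootstrap estimates
T_jack = rng.normal(0.05, 0.02, size=50)                    # stand-in jackknife estimates
print(percentile_lower(T_star), bca_lower(T_star, 0.05, T_jack))
```

The null hypothesis T ≤ 0 would then be rejected for a given sample whenever the returned lower endpoint is larger than zero.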
A general observation from Figure 14 is that the bootstrap tests are powerful, say rejection rate ≥ 0.8, only for values of p_flip extremely close to zero. While the test based on the percentile bootstrap seems to be more powerful, we have in fact one rather objective criterion for the 'correctness' of the different methods. By construction we want to test at the 5% level, i.e. at most 5% of samples from compatible distributions should be rejected. The true distribution is compatible with the null hypothesis (T ≤ 0) exactly for p_flip ≥ 0.0584 (see Figure 11). The rejection rate at the critical value p_flip = 0.0584 thus indicates whether or not a test properly works at the 5% level. We find the rejection rates 0.147 (percentile) and 0.048 (BCa). The BCa value is, as theoretically expected, significantly closer to the desired rate of 0.05. The large rejection rate of the percentile method suggests a general overestimation by that method. For the comparison to the direct test we therefore consider the BCa bootstrap.

Figure 14: Rejection rates of the indirect (bootstrap) hypothesis tests based on inequality (3.2) for the family of 'flip distributions' (3.32). Both tests employ the minimax estimator of entropy and use the null hypothesis h₀: 'data are compatible with the inequality T ≤ 0'. For p_flip < 0.0584 (vertical line) the true distribution violates the null hypothesis (see Figure 11). In this regime, large rejection rates (being the powers of the tests) are desired. For p_flip ≥ 0.0584 the null hypothesis is satisfied and small rejection rates are expected.

Figure 15: Rejection rates of the direct test and the (BCa) bootstrap test based on inequality (3.2) for the family of 'flip distributions' (3.32). Both tests employ the minimax estimator of entropy. The direct test uses the null hypothesis h₀: 'data are compatible with the triangular scenario'. The bootstrap test uses the weaker null hypothesis h₀: 'data are compatible with the inequality T ≤ 0'. For p_flip < 0.0584 (vertical line) the true distribution violates the inequality. In this regime, large rejection rates (being the powers of the tests) are desired. The value of p_flip for which compatibility with the triangular scenario is established is unknown.

Figure 15 reveals that the bootstrap test is significantly weaker than the direct test. Depending on the unknown value of p_flip for which the distribution becomes compatible with the triangular scenario, small rejection rates for p_flip ≥ 0.0584 might actually be desired. For p_flip < 0.0584, however, the weak power of the bootstrap test is in fact disappointing. There are at least two possible reasons for the inferiority of the bootstrap test.

1. Due to failure at finding the correct worst case distribution, the threshold value used in the direct approach could be too small. The large rejection rate would then (partially) be caused by the fact that the test does not properly work at the 5% level.

2. The discrepancy between the two null hypotheses (compatibility with the DAG as opposed to compatibility with the inequality) might be rather large. The stricter null hypothesis of the direct test is naturally rejected more frequently.

We illustrate the discrepancy mentioned in the second explanation by estimating the value p_flip^(DAG) at which the distribution becomes compatible with the DAG. To this end, we employ the data shown in Figure 15 and assume that the direct test correctly works at the 5% level.
Since this assumption might not be correct, the following argument is not rigorous but serves merely as an illustration. By interpolation between the flip probabilities p_flip = 0.090 and p_flip = 0.095 (having rejection rates right above and right below 5%) we obtain the estimate p_flip^(DAG) ≈ 0.0919. Since there might be incompatible distributions with rejection rate < 5%, this value is in fact only a lower bound. But already p_flip^(DAG) ≈ 0.0919 is significantly larger than the value p_flip = 0.0584 above which inequality (3.2) is satisfied. This consideration indicates that the set constrained by inequality (3.2) might be a clearly suboptimal approximation to the true set of distributions compatible with the DAG.

The above arguments might suggest that bootstrapping is doomed to result in a weaker test. But the results were by no means clear beforehand. First, a general, theoretical comparison of the tests is difficult due to their different natures (see Figure 9). Second, there are also reasonable arguments in favor of the bootstrap test:

• By bad luck there might be a small, not representative set of distributions compatible with the DAG whose samples lead to comparatively large violations of the inequality. This would result in a threshold value so large that the power of the direct test would be unreasonably small. Since the bootstrap approach does not involve a worst case distribution, it would not be affected by such a disproportionate worst case.

• In Subsection 3.3.2 we identified the large variance of the estimates T̂ (for fixed p_flip) as the main reason for the flat rejection curve of the direct test. The bootstrap principle suggests that a distribution of bootstrap estimates T̂* should have a similarly large variance. But the actual quantity of interest in the indirect approach is the lower endpoint, T̂_0.05, of the confidence interval estimated from such a distribution. This is in contrast to the direct approach where, once the threshold value is available, only the point estimate T̂ is required. The variance of the endpoints T̂_0.05 might indeed be smaller than the variance of the estimates T̂ (or T̂*) themselves. This would be the case if the bootstrap distributions for different initial samples were similar, or if the estimation technique (here the BCa method) provided appropriate corrections. While the bootstrap test would still have a low rejection rate for p_flip ≥ 0.0584, the power would increase more rapidly when decreasing p_flip below that critical value.

Unfortunately, it seems that these effects played only a minor role, if they occurred at all. The opposite effects discussed above (the stronger null hypothesis of the direct test and a potentially underestimated threshold value) clearly dominate the discrepancy between the two tests. Note that since a proper theoretical comparison between the two approaches is difficult, the list of advantages and disadvantages might not be complete. Also, an underestimated threshold value is by no means an advantage of the direct approach. The seemingly larger power would be misleading, since the test would not correctly operate at the 5% level anymore. In fact, we have shown in Figure 10 that the threshold value from [16] was underestimated. Moreover, we have no proof, not even reasonable arguments, that our slightly improved threshold value is correct. The (BCa) bootstrap test, on the other hand, could be verified to correctly work at the 5% level.
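For completeness, the way such rejection rates are estimated can be sketched as follows. The sampler standing in for the flip distributions (3.32), the sample size and the endpoint estimator are placeholders passed in from outside, since their implementations are not reproduced here.

```python
import numpy as np

def rejection_rate(sampler, lower_endpoint, n_mc=1000, seed=0):
    """Monte Carlo estimate of the rejection rate of an indirect (bootstrap) test:
    reject whenever the estimated lower endpoint of the confidence interval for T
    lies above 0, i.e. the sample appears incompatible with the null hypothesis T <= 0."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_mc):
        sample = sampler(rng)                  # one initial sample from the model under study
        if lower_endpoint(sample) > 0:         # e.g. the BCa endpoint of eq. (3.34)
            rejections += 1
    return rejections / n_mc
```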
We can thus trust the bootstrap test more than the direct test. If the bootstrap test rejects some given data, we can be extremely confident that the data are indeed incompatible with inequality (3.2) and thus in particular with the triangular scenario. Recall that this is the only rigorous inference we are able to draw anyway, first, since compatibility with the inequality does not imply compatibility with the DAG, and second, because compatibility with the DAG does not guarantee that this DAG is the 'one correct' explanation for the data. Of course it would be preferable if the bootstrap test showed strong performance for a larger range of distributions.

3.3.4 Additional inequalities

In this subsection we show that we can improve on the performance of the current bootstrap test by employing inequalities (3.3) and (3.4). As discussed before, we also encounter an increasing difficulty in finding a worst case distribution for the new inequalities, which would be required for a direct test. From now on, we will often refer to inequality (3.2) ((3.3), (3.4)) as the 'first (second, third) inequality'. Analogous to the statistic T^1_ent ≡ T = I(A;B) + I(A;C) − H(A) of the first inequality, we denote the statistics corresponding to inequalities (3.3) and (3.4) by

T^2_ent ≡ 3H_A + 3H_B + 3H_C − 3H_AB − 2H_AC − 2H_BC + H_ABC ,   (3.38)
T^3_ent ≡ 5H_A + 5H_B + 5H_C − 4H_AB − 4H_AC − 4H_BC + 2H_ABC ,   (3.39)

again using the shorthand notation H_AB = H(A,B) and so on. The subscript 'ent' stands for 'entropic' and is introduced in light of Chapter 4 (Subsection 4.5.3) where yet another statistic T_mat appears. For the simulated hypothesis tests we will again consider the family of 'flip distributions' (3.32) that was already considered in Subsections 3.3.2 and 3.3.3. Figure 16 shows T^2_ent and T^3_ent in comparison to T^1_ent as functions of p_flip. We observe that the critical value of p_flip for which T^i_ent = 0 is largest for T^3_ent and smallest for T^1_ent. This means that, at least for the family of 'flip distributions', the second and third inequality are stronger than the first one, suggesting that they should also lead to more powerful tests. Since T^2_ent and T^3_ent involve tri-partite information, this tendency is not surprising.

Figure 16: The statistics T^1_ent, T^2_ent and T^3_ent for the family of 'flip distributions' (3.32) as functions of p_flip. For the critical flip probabilities (satisfying T^i_ent(p_flip) = 0) we obtain p_flip^(1.ent) = 0.0584, p_flip^(2.ent) = 0.0750 and p_flip^(3.ent) = 0.0797. Thus, for the family of 'flip distributions', the new inequalities are stronger than the first one, suggesting that they should also result in more powerful hypothesis tests. The larger absolute values of T^2_ent and T^3_ent carry no direct meaning.

Before tackling the construction of a direct hypothesis test, we try to interpret the new inequalities. In terms of mutual information the inequalities T^i_ent ≤ 0 (for i = 2, 3) can for example be rewritten as

2.  I_AB + I_AC + I_BC ≤ H_AB + I_ABC ,   (3.40)
3.  I_AB + I_AC + I_BC ≤ H_ABC + 3 I_ABC .   (3.41)

The 'tri-partite information' I_ABC, also called interaction information [45], is defined as

I_ABC = I_AB|C − I_AB = H_AB + H_AC + H_BC − H_ABC − H_A − H_B − H_C ,   (3.42)

and is symmetric in the three observables.
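As a concrete reference for the three statistics, the following sketch evaluates T^1_ent, T^2_ent, T^3_ent and the interaction information (3.42) from a given joint distribution using plug-in (maximum likelihood) entropies in nats; the minimax estimator used in the actual simulations is not reproduced here, and the example distribution is an arbitrary placeholder. Since the statistics are linear combinations of entropies, the base of the logarithm only rescales them and does not affect their signs.

```python
import numpy as np

def H(p):
    """Shannon entropy (in nats) of a probability array, ignoring zero entries."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def entropic_statistics(p_abc):
    """T^1_ent, T^2_ent, T^3_ent of (3.2), (3.38), (3.39) and I_ABC of (3.42)
    for a joint probability array p_abc[a, b, c]."""
    H_ABC = H(p_abc)
    H_AB, H_AC, H_BC = H(p_abc.sum(2)), H(p_abc.sum(1)), H(p_abc.sum(0))
    H_A, H_B, H_C = H(p_abc.sum((1, 2))), H(p_abc.sum((0, 2))), H(p_abc.sum((0, 1)))
    I_AB, I_AC = H_A + H_B - H_AB, H_A + H_C - H_AC
    T1 = I_AB + I_AC - H_A
    T2 = 3*H_A + 3*H_B + 3*H_C - 3*H_AB - 2*H_AC - 2*H_BC + H_ABC
    T3 = 5*H_A + 5*H_B + 5*H_C - 4*H_AB - 4*H_AC - 4*H_BC + 2*H_ABC
    I_ABC = H_AB + H_AC + H_BC - H_ABC - H_A - H_B - H_C   # interaction information
    return T1, T2, T3, I_ABC

# toy check with three independent uniform bits: all three statistics are negative
p = np.full((2, 2, 2), 1/8)
print(entropic_statistics(p))
```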
Now, recall the rather simple interpretation of the first inequality I (A; B) + I (A; C) ≤ H (A) from Section 3.1: ‘If the mutual information of A and B is large, then A depends strongly on the ancestor λAB . But then, the dependence of A on λAC is necessarily small. Since all correlations between A and C are mediated by λAC , the mutual information of A and C is consequently small as well. Inequality (3.2) gives a precise bound for this intuition.’ For the new inequalities such an interpretation is not that simple. In the representations (3.40) and (3.41), the inequalities also seem to bound the sum of pairwise mutual information terms, but the interaction information IABC (involved in the upper bound) complicates an intuitive understanding. First, IABC is not lower bounded by zero. Second, the general behaviour of IABC when varying multiple pairwise mutual information terms is difficult to predict. In addition, the mere increase in the number of involved terms complicates any potential interpretation of the inequalities and raises the risk of our intuition to be flawed. Also, an advantage of the first inequality was that it clearly singles out the variable A, so that the interpretation could be built around that variable. Inequalities (3.40) and (3.41), in addition to the tri-partite terms, also involve the dependency between the variables B and C. The third inequality is even completely symmetric in all variables, resulting in the loss of variable A as the natural starting point for an intuitive interpretation. The above problems complicate the task of finding a worst case distribution required for a direct test. In general, one could try to use the same rationale for finding the worst case that was already employed for the first inequality in Subsection 3.3.2. While it makes sense to assume that the worst case i distribution should again lie on the boundary Tent = 0, problems arise at the second step. For the first inequality some intuition was employed to find 1 (or propose) a rather simple structure of distributions satisfying Tent = 0. Due to their less intuitive interpretations, this is not as straightforward for the second and third inequality. Any pure guess would result in an even less trustworthy threshold value. Still, one such guess is to start with the same distribution as before, namely A = B ∼ uniform and independently 3 TESTING ENTROPIC INEQUALITIES 67 C ∼ uniform, and then replace the uniforms by general Bernoulli distributions. The second inequality shows similar qualitative behaviour as the first 2 = 0 and the threshold one (recall Figure 10). In particular, we have Tent value can be improved by considering the non-uniform cases. For the third 3 = − log 2 in the uniform case. Thus, the inequality, however, we obtain Tent 3 distribution does not even lie on the boundary Tent = 0. This suggests, that in particular for the third inequality, the supposed worst case distribution of the first inequality is not a good candidate here. But also for the second inequality, choosing the old worst case guess has no sound standing. Due to the above problems we would not be able to trust a direct hypothesis for the new inequalities and thus stop the construction at this point. A more detailed elaboration of the inequalities might help to overcome or at least reduce the problems. But here, we instead decide to focus on the indirect hypothesis tests based on bootstrapping. This approach does not suffer from the above problems. 
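Before turning to those tests, the claims about the uniform guess made above can be checked directly. The sketch below builds the joint distribution with A = B ~ Bernoulli(q) and an independent C ~ Bernoulli(r) and evaluates T^2_ent and T^3_ent with plug-in entropies (in nats); for q = r = 1/2 it returns T^2_ent = 0 and T^3_ent = −log 2 as stated. It only illustrates the check, not any search for a worst case distribution.

```python
import numpy as np

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def T23_for_guess(q, r):
    """T^2_ent and T^3_ent for A = B ~ Bernoulli(q) and an independent C ~ Bernoulli(r)."""
    p = np.zeros((2, 2, 2))
    for a_val in (0, 1):
        for c_val in (0, 1):
            # A and B always coincide, C is drawn independently
            p[a_val, a_val, c_val] = (q if a_val else 1 - q) * (r if c_val else 1 - r)
    H_A, H_B, H_C = H(p.sum((1, 2))), H(p.sum((0, 2))), H(p.sum((0, 1)))
    H_AB, H_AC, H_BC = H(p.sum(2)), H(p.sum(1)), H(p.sum(0))
    H_ABC = H(p)
    T2 = 3*H_A + 3*H_B + 3*H_C - 3*H_AB - 2*H_AC - 2*H_BC + H_ABC
    T3 = 5*H_A + 5*H_B + 5*H_C - 4*H_AB - 4*H_AC - 4*H_BC + 2*H_ABC
    return T2, T3

print(T23_for_guess(0.5, 0.5))   # approximately (0.0, -log 2)
```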
In general, bootstrapping is a rather easy to implement, automated procedure which often significantly eases the task for the scientist at the cost of an increased computational burden. While the latter is a relevant point for the large simulations we are conducting here, it causes no real problem when applying the test to a single (real world) data set. The main disadvantage remains that we are not able to implement the stronger null hypothesis of compatibility with the DAG. Instead, the approach automatically tests compatibility with the inequality T^i_ent ≤ 0.

The implementation of the bootstrap tests is exactly the same as for the first inequality. For a general explanation of the approach see Subsection 3.3.3. As before, we employ the BCa method to estimate the required lower endpoint of the confidence interval. The results in comparison to the indirect as well as the direct test based on the first inequality are presented in Figure 17. Note that the new tests correctly work at the 5% level. For the second inequality we find a rejection rate of 4.1% at p_flip^(2.ent) = 0.0750 and for the third inequality 4.9% at p_flip^(3.ent) = 0.0797.

We observe that the tests based on the second and third inequality are significantly more powerful than the bootstrap test employing the first inequality. The difference from the second to the third inequality is comparatively small. This is in accordance with the quite large gap between the critical values p_flip^(1.ent) = 0.0584 and p_flip^(2.ent) = 0.0750 compared to the smaller gap between the latter value and p_flip^(3.ent) = 0.0797 (a larger critical value corresponds to a more restrictive inequality and thus indicates a more powerful test).

Figure 17: Rejection rates of the bootstrap tests for the second ((3.3) or (3.40)) and third ((3.4) or (3.41)) inequality compared to the direct and the bootstrap test based on the first inequality (3.2). In all cases the minimax estimator of entropy was employed. As before, we consider the family of 'flip distributions' (3.32). The vertical lines mark the critical values p_flip^(1.ent) = 0.0584, p_flip^(2.ent) = 0.0750 and p_flip^(3.ent) = 0.0797 below which the respective inequalities are violated. As expected, the tests based on the second and third inequality are more powerful than the bootstrap test for the first inequality. The power of the direct test employing the first inequality is not reached, though.

Even though the power of the first inequality's direct test is not reached or surpassed, the clear improvement over the first bootstrap test considerably enhances the usefulness of the bootstrap approach. Since the threshold value involved in the direct test is rather questionable, one might now actually prefer the bootstrap test based on the third inequality (if tri-partite information is available).

3.3.5 Summary

At this point it is reasonable to recapitulate what we have accomplished so far. Recall that the direct hypothesis test employing the first inequality (3.2) has already been proposed in [16] (for the maximum likelihood estimator and with a slightly different threshold value).
At the end of Chapter 1 we stated that one prime goal of this thesis is to improve on this test.

• As a first drawback we observed in Figure 10 that the heuristic for finding the threshold value employed in [16] is flawed. While we could slightly amend the threshold value, the main observation was that it will in general be extremely difficult to find the correct threshold value. Thus, in addition to the arguably weak power of the direct test (see Figure 12), we do not even know if the test correctly works at the 5% level. Our goal is therefore not only to improve the power of this test, but also to improve its reliability, by which we mean a proper control of the type-I error rate of 5%.

• Another means that was intended to increase the reliability of the test, and hopefully also the power, was the implementation of the minimax estimator of entropy, introduced in Section 3.2. We could confirm that the minimax estimator is often (far) superior to the MLE, but for the alphabet and sample sizes considered in this section the differences were rather insignificant. Figure 12 shows that the powers of the direct tests based on the MLE and the minimax estimator are indeed similar. While the minimax test is even slightly less powerful, the discussion at the end of Subsection 3.3.2 suggests that the minimax test might be more reliable.

• To overcome the problem of the poorly controlled type-I error rate of the direct test, we considered a bootstrap approach to hypothesis testing (Subsection 3.3.3). Employing the BCa method, we were able to properly control the type-I error rate at 5% (see Figure 14). Unfortunately, the comparison to the direct test shown in Figure 15 reveals that the bootstrap test is considerably less powerful than the direct test.

• In order to improve the unsatisfying power of the bootstrap test, we implemented bootstrap tests for the additional inequalities (3.3) and (3.4) (Subsection 3.3.4). In Figure 17 it can be seen that these tests are indeed significantly more powerful than the bootstrap test based on inequality (3.2). The power of the direct test is unfortunately not reached or surpassed. On the plus side, like the first bootstrap test, the new tests correctly work at the 5% level.

Overall, we have been able to construct a test that is more reliable than the direct test from Figure 12 (or originally [16]) in terms of a better controlled type-I error rate. On the other hand, we have not been able to improve on the power of this test. Our final measure in this direction will be to leave the entropic framework and derive similar inequality constraints based on certain generalized covariance matrices. While deriving the new matrix inequalities is a significant goal on its own, we particularly hope to be able to construct more powerful tests in this new framework. Both the derivation and the implementation of our new inequality are the subject of Chapter 4. An application to real data, employing an entropic as well as the matrix inequality, will be presented in Chapter 5.

4 Tests based on generalized covariance matrices

4.1 Introduction

In Subsection 2.2.4 we have introduced so-called hidden common ancestor models, where all correlations between the observable variables are mediated by hidden common ancestors. A special case is the triangular scenario (Figure 3) consisting of three observables with one common ancestor for each pair.
In Section 3.1 the entropic inequality I(A;B) + I(A;C) ≤ H(A), constraining all distributions of the observable variables that are compatible with the triangular scenario, was introduced. The inequality was the subject of intensive simulations of statistical tests in Section 3.3. The authors of [16] did not stop at the triangular scenario but also considered general hidden common ancestor models. A hidden common ancestor model can be characterized by the number of observables (n) and the maximal number of observables that may be connected by a single ancestor (m). We may also call this number the degree of an ancestor. An example with n = 5 and m = 3 is given in Figure 18. In [16], it was shown that for compatibility with this kind of scenario the inequality

Σ_{i=2}^n I(A^1; A^i) ≤ (m − 1) H(A^1)   (4.1)

(and permutations thereof) must be satisfied. The inequality bounds the mutual information that A^1 shares with all the other observables. One might expect that similar inequalities also hold for other measures of correlation. Going further, it might be possible to use the tools from [16] to find entropic inequalities for a given DAG and then generalize these inequalities to other measures of correlation.

Figure 18: An example of a hidden common ancestor model with n = 5 observables and ancestors of degree up to m = 3 (two ancestors of degree 3 and two ancestors of degree 2). Applying inequality (4.1) to this DAG results in the constraint I_{A^1 A^2} + I_{A^1 A^3} + I_{A^1 A^4} + I_{A^1 A^5} ≤ 2 H_{A^1}.

In this chapter we prove the analog to inequality (4.1) on the level of certain generalized covariance matrices. As a special case, constraints on usual covariances, or rather correlation coefficients, can be derived. The motivation to do this is twofold. First, when going to entropies we lose some information, since already the elementary inequalities constraining entropies of any set of random variables are only an outer approximation (see Section 3.1). Second, we have seen in Chapter 3 that estimating entropies and in particular establishing statistical tests for entropic inequalities can be a thorny issue. One might hope that our new inequality gives rise to simpler or more powerful tests.

The rest of this chapter is structured as follows. After introducing the general framework in Section 4.2, we motivate and propose the new inequality in Section 4.3. A step by step proof is provided in Section 4.4. In Section 4.5 we compare the strength of our new inequality to the entropic inequality (4.1). For a special class of distributions this can be done analytically. For more general distributions we conduct a number of numerical simulations. We also study the performance of statistical hypothesis tests based on our new inequality and compare the results to the analogous entropic tests from Section 3.3. An application to real data of the techniques developed in this chapter, as well as the techniques from Chapter 3, is presented in Chapter 5.

Note that we always assume the alphabets of all variables (observables as well as hidden ancestors) to be discrete and finite. Even if this should not be explicitly stated in some of the following sections, propositions, lemmata etc., finiteness of the alphabets is always implicitly assumed. The number of observables (n) should be finite as well. This implies that the number of ancestors as well as the maximal degree of an ancestor (m) are also finite.
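A direct numerical check of (4.1) for a given joint distribution of the observables might look as follows; the joint array, its shape and the value of m are placeholders, and plug-in entropies are used for simplicity.

```python
import numpy as np

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def entropic_constraint(p_joint, m):
    """Left- and right-hand side of (4.1): sum_i I(A^1; A^i) <= (m - 1) H(A^1).
    p_joint is an n-dimensional probability array whose axis 0 corresponds to A^1."""
    n = p_joint.ndim
    axes = range(n)
    H_A1 = H(p_joint.sum(axis=tuple(a for a in axes if a != 0)))
    lhs = 0.0
    for i in range(1, n):
        p_1i = p_joint.sum(axis=tuple(a for a in axes if a not in (0, i)))
        H_Ai = H(p_1i.sum(axis=0))
        lhs += H_A1 + H_Ai - H(p_1i)           # I(A^1; A^i)
    return lhs, (m - 1) * H_A1

# toy example: three independent uniform bits and ancestors of degree m = 2
p = np.full((2, 2, 2), 1/8)
print(entropic_constraint(p, m=2))             # lhs = 0 <= rhs = log 2
```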
4.2 Encoding probability distributions in matrices

4.2.1 One- and two-variable matrices

The covariance of two random variables A and B can be written as

Cov[A,B] = Σ_{i=1}^{K_A} Σ_{j=1}^{K_B} a*_i [P(A = a_i, B = b_j) − P(A = a_i) P(B = b_j)] b_j ,   (4.2)

see (2.9) in Subsection 2.1.4. For the sake of generality we allow complex valued variables. Recall that in this case we have Cov[B,A] = Cov[A,B]* instead of full symmetry. If we define the (real valued) matrix

M^{A:B} := [P(A = a_i, B = b_j) − P(A = a_i) P(B = b_j)]_{i,j=1}^{K_A,K_B}   (4.3)

and the vectors a := (a_1, ..., a_{K_A})^T, b := (b_1, ..., b_{K_B})^T, the covariance can be written as the matrix product

Cov[A,B] = a† M^{A:B} b .   (4.4)

The vectors a and b carry the alphabets of the variables A and B, while the matrix M^{A:B} carries the information about the joint and marginal distributions. Note that the alphabet sizes K_A and K_B are assumed to be finite. For the covariance, independent variables satisfy Cov[A,B] = 0 while the converse is not necessarily true. In contrast, it can be seen directly from the definition that M^{A:B} is the zero-matrix if and only if A and B are independent. This statement is in particular independent of the actual alphabets of A and B. Thus, M^{A:B} encodes the distribution of A and B in a more elementary way than the covariance does. For this reason we prefer to work with the M-matrices instead of covariances. We will see later that this indeed makes a difference.

Starting with expression (2.8), the variance of A can be written as

Var[A] = Σ_{i=1}^{K_A} |a_i|² P(A = a_i) − |Σ_{i=1}^{K_A} a_i P(A = a_i)|²
       = Σ_{i=1}^{K_A} Σ_{j=1}^{K_A} a*_i [P(A = a_i) δ_{ij} − P(A = a_i) P(A = a_j)] a_j
       = a† M^A a ,   (4.5)

by defining the matrix

M^A := [P(A = a_i) δ_{ij} − P(A = a_i) P(A = a_j)]_{i,j=1}^{K_A} .   (4.6)

Recall that Var[A] ≥ 0 independently of the chosen alphabet (even if some of the outcome values a_i coincide). Thus, a† M^A a ≥ 0 for all a ∈ C^{K_A}, which means that M^A is positive semidefinite.

4.2.2 The compound matrix

To capture the joint information about A and B in one matrix, we define the compound matrix

ℳ^{A:B} := \begin{pmatrix} M^A & M^{A:B} \\ M^{B:A} & M^B \end{pmatrix} .   (4.7)

Note that ℳ^{A:B} is symmetric since M^{B:A} = (M^{A:B})^T and M^A = (M^A)^T. Since all M-matrices are real valued, the symmetry also implies hermiticity. In general, for n random variables A^1, ..., A^n we define

ℳ^{A^1:...:A^n} := \begin{pmatrix}
M^{A^1} & M^{A^1:A^2} & \cdots & M^{A^1:A^n} \\
M^{A^2:A^1} & M^{A^2} & \cdots & M^{A^2:A^n} \\
\vdots & \vdots & \ddots & \vdots \\
M^{A^n:A^1} & M^{A^n:A^2} & \cdots & M^{A^n}
\end{pmatrix} .   (4.8)

Just as M^A can be considered an alphabet-independent generalization of the variance Var[A] and M^{A:B} of the covariance Cov[A,B], the matrix ℳ^{A^1:...:A^n} is a generalization of the covariance matrix

Cov[A^1 : ... : A^n] = \begin{pmatrix}
Var[A^1] & Cov[A^1,A^2] & \cdots & Cov[A^1,A^n] \\
Cov[A^2,A^1] & Var[A^2] & \cdots & Cov[A^2,A^n] \\
\vdots & \vdots & \ddots & \vdots \\
Cov[A^n,A^1] & Cov[A^n,A^2] & \cdots & Var[A^n]
\end{pmatrix}
= \begin{pmatrix}
(a^1)† M^{A^1} a^1 & (a^1)† M^{A^1:A^2} a^2 & \cdots & (a^1)† M^{A^1:A^n} a^n \\
(a^2)† M^{A^2:A^1} a^1 & (a^2)† M^{A^2} a^2 & \cdots & (a^2)† M^{A^2:A^n} a^n \\
\vdots & \vdots & \ddots & \vdots \\
(a^n)† M^{A^n:A^1} a^1 & (a^n)† M^{A^n:A^2} a^2 & \cdots & (a^n)† M^{A^n} a^n
\end{pmatrix} .   (4.9)

Note that to denote the covariance matrix Cov[A^1 : ... : A^n] we separate arguments by a colon, while for the scalar covariance Cov[A^1, A^2] we use a comma.
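The matrices (4.3) and (4.6) are straightforward to build from a joint distribution. The following sketch does so and checks the identities (4.4) and (4.5) for one arbitrary choice of alphabets; the example distribution, the alphabets and the function names are placeholders.

```python
import numpy as np

def M_pair(p_ab):
    """M^{A:B} of (4.3): joint distribution minus product of marginals."""
    return p_ab - np.outer(p_ab.sum(axis=1), p_ab.sum(axis=0))

def M_single(pa):
    """M^A of (4.6): diag(marginal) minus outer product of the marginal with itself."""
    return np.diag(pa) - np.outer(pa, pa)

# example: two correlated bits
p_ab = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
pa, pb = p_ab.sum(axis=1), p_ab.sum(axis=0)
a = np.array([0.0, 1.0])           # alphabet of A (any values would do)
b = np.array([-1.0, 2.0])          # alphabet of B

cov = a @ M_pair(p_ab) @ b         # eq. (4.4)
var = a @ M_single(pa) @ a         # eq. (4.5)
cov_direct = np.sum(p_ab * np.outer(a, b)) - (pa @ a) * (pb @ b)
print(np.isclose(cov, cov_direct), var >= 0)
```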
At full length we could address the framework based on the M -matrices as the ‘generalized covariance matrix framework’. As a short hand notation we will usually simply write ‘matrix framework’ and refer to inequality constraints in this framework as ‘matrix inequalities’ rather than ‘inequalities based on generalized covariance matrices’. To conclude this subsection, we want to deduce from the non-negativity of 1 n the covariance matrix, that also the compound matrix MA :...:A is positive semidefinite. Lemma 4.1. The covariance matrix of the variables A1 , ..., An (with finite alphabets) is positive semidefinite, h i Cov A1 : ... : An ≥ 0. (4.10) Proof. When writing the variables A1 , ..., An in one random vector A = T A1 · · · An one can express the covariance matrix as h i h i Cov A1 : ... : An = E (A − E [A])∗ (A − E [A])T . (4.11) Using the linearity of the expectation value, one finds for an arbitrary vector 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 76 c ∈ Cn , h i c† Cov A1 : ... : An c h i = c† E (A − E [A])∗ (A − E [A])T c h = E c† (A − E [A])∗ (A − E [A])T c i 2 = E c† (A − E [A])∗ ≥ 0. (4.12) h Lemma 4.2. For all complex valued block matrices X (ij) in i,j=1 (of finite dimension) it is the case that X (11) · · · X (1n) . .. .. Y := . . ≥0 .. (n1) (nn) X ··· X (4.13) if and only if † † (x1 ) X (11) x1 · · · (x1 ) X (1n) xn .. .. .. ≥0 Z := . . . (xn )† X (n1) x1 · · · (xn )† X (nn) xn (4.14) for all complex valued vectors x1 , ..., xn of suitable dimension. Proof. If Z ≥ 0 for all suitable x1 , ..., xn , then in particular 1 0 ≤ 1 ··· 1 . Z .. 1 x1 . · · · (xn )† Y .. xn = † (x1 ) for all suitable x1 , ..., xn , and thus Y ≥ 0. (4.15) 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 77 For the converse we can calculate r1∗ r 1 . ∗ · · · rn Z .. = rn n X ri∗ xi i,j=1 | † X (ij) xj rj } | {z } {z † yi =( ) :=y j y1 . · · · (y n )† Y .. yn = † (y 1 ) ≥ 0. (4.16) Since this is true for arbitrary suitable x1 , ..., xn and likewise arbitrary r ∈ Cn we obtain the desired statement. 1 n The matrix MA :...:A is of the form of the matrix Y in Lemma 4.2 and the covariance matrix is of the form of the matrix Z. Thus, by combining Lemma 4.1 and the if-part of Lemma 4.2 we obtain: 1 n Corollary 4.1. The compound matrix MA :...:A of the variables A1 , ..., An (with finite alphabets) defined in (4.8) is positive semidefinite. Aside from being an interesting feature on its own, the non-negativity of 1 n i j MA :...:A (or rather the bi-partite case MA :A ) will be used later in the proof of the shortly proposed inequality. 4.3 4.3.1 The inequality Motivation by covariances for the triangular scenario To motivate the general inequality proposed in the next subsection, we first consider the triangular scenario (see Figure 3). We want to construct an inequality similar to the entropic inequality I (A; B) + I (A; C) ≤ H (A), first introduced in (3.2). By replacing mutual information with covariances, one might expect |Cov [A, B]| + |Cov [A, C]| to be bounded. The covariance 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 78 Cov [B, C] should not appear in the inequality. Starting from the covariance matrix Var [A] Cov [A, B] Cov [A, C] Var [B] Cov [B, C] Cov [A : B : C] = Cov [B, A] , Cov [C, A] Cov [C, B] Var [C] (4.17) and replacing Cov [B, C] by 0, we propose the inequality Z A:B:C Var [A] Cov [A, B] Cov [A, C] Cov [B, A] Var [B] 0 := ≥ 0. Cov [C, A] 0 Var [C] (4.18) Note that this inequality is not trivially satisfied. 
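Both Corollary 4.1 and the just proposed constraint (4.18) are easy to probe numerically. The sketch below assembles the compound matrix of (4.8) for a randomly drawn joint distribution of three binary variables and checks its smallest eigenvalue, and then evaluates the matrix Z^{A:B:C} of (4.18) for one particular choice of alphabets; the distribution and the alphabets are arbitrary placeholders.

```python
import numpy as np

def M_pair(p_xy):
    return p_xy - np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0))

def M_single(px):
    return np.diag(px) - np.outer(px, px)

p = np.random.default_rng(0).random((2, 2, 2))
p /= p.sum()                                        # some joint distribution p(a, b, c)
pA, pB, pC = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))
pAB, pAC, pBC = p.sum(2), p.sum(1), p.sum(0)

# compound matrix of (4.8) for n = 3; by Corollary 4.1 it is positive semidefinite
compound = np.block([
    [M_single(pA),     M_pair(pAB),     M_pair(pAC)],
    [M_pair(pAB).T,    M_single(pB),    M_pair(pBC)],
    [M_pair(pAC).T,    M_pair(pBC).T,   M_single(pC)],
])
print(np.linalg.eigvalsh(compound).min() >= -1e-12)   # True for any distribution

# proposed matrix Z^{A:B:C} of (4.18) for one choice of alphabets a = b = c = (0, 1)
alph = np.array([0.0, 1.0])
var = lambda px: alph @ M_single(px) @ alph
cov = lambda p_xy: alph @ M_pair(p_xy) @ alph
Z = np.array([[var(pA),  cov(pAB), cov(pAC)],
              [cov(pAB), var(pB),  0.0     ],
              [cov(pAC), 0.0,      var(pC) ]])
print(np.linalg.eigvalsh(Z).min())   # >= 0 exactly if (4.18) holds for these alphabets
```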
Of course there exist joint distributions with Cov[B,C] = 0 for which the inequality is trivially satisfied, but here we assume that Cov[B,C] might in fact be non-zero, which will in general also affect the covariances Cov[A,B] and Cov[A,C]. Since the determinant of a positive semidefinite matrix must be non-negative, we obtain the inequality

Var[A] Var[B] Var[C] − Var[B] |Cov[A,C]|² − Var[C] |Cov[A,B]|² ≥ 0
(∗)⟺ |Cov[A,B]|² / (Var[A] Var[B]) + |Cov[A,C]|² / (Var[A] Var[C]) ≤ 1
  ⟺ |Corr[A,B]|² + |Corr[A,C]|² ≤ 1 ,   (4.19)

where

Corr[A,B] := Cov[A,B] / √(Var[A] Var[B])   (4.20)

denotes the usual correlation coefficient. Inequality (4.19) can be considered as the analog of I(A;B) + I(A;C) ≤ H(A) for (squared) correlation coefficients. Keep in mind that, like Cov[A,B], also Corr[A,B] will in general be complex for complex valued random variables. The absolute values keep the whole expression real. If the alphabets are real valued, as is usually the case, we do not have to worry about that issue at all. In fact, the main reason to work with complex variables is merely that some matrix-theoretical results are better established in the complex case. A restriction to real variables might have required some additional attention. As a different issue, for the equivalence relation marked by (∗) we assumed Var[A], Var[B], Var[C] ≠ 0.

Note that we have not yet proven that inequality (4.19) indeed constrains distributions compatible with the triangular scenario, but only proposed it. Also, as mentioned before, we generally do not want to work on the level of covariances and variances, but rather with the M-matrices introduced in the previous subsections. In this more general framework the analog of inequality (4.19), or rather (4.18), reads

X^{A:B:C} := \begin{pmatrix}
M^A & M^{A:B} & M^{A:C} \\
M^{B:A} & M^B & 0 \\
M^{C:A} & 0 & M^C
\end{pmatrix} ≥ 0 .   (4.21)

In this framework (i.e. once the inequality is proven) we do not have to worry about possibly complex valued variables or vanishing variances at all.

4.3.2 General inequality for hidden common ancestor models

Now, consider a general hidden common ancestor model (see Subsection 2.2.4 and Section 4.1) with n observables and ancestors of degree up to m. We desire an inequality that bounds the pairwise dependence between the variable A^1 and all other variables in terms of the M-matrices (we pick the specific variable A^1 instead of a general A^j simply for notational convenience). Again, we start with the full matrix ℳ^{A^1:...:A^n} (see (4.8)). Matrices M^{A^i:A^j} carrying the dependence between pairs of variables not including A^1 (i.e. i, j ≠ 1) are set to 0 since they should not appear in the inequality. To take account of the maximal degree m of the ancestors, the matrix M^{A^1} is equipped with the prefactor m − 1. This is in analogy to the m − 1 prefactor of the term H(A^1) in the general entropic inequality (4.1). We propose the final inequality in the following theorem:

Theorem 4.1. Distributions compatible with a hidden common ancestor model with n observables A^1, ..., A^n (with finite alphabets) and ancestors of degree up to m (with likewise finite alphabets) satisfy the inequality

X^{A^1:...:A^n} := \begin{pmatrix}
(m−1) M^{A^1} & M^{A^1:A^2} & M^{A^1:A^3} & \cdots & M^{A^1:A^n} \\
M^{A^2:A^1} & M^{A^2} & 0 & \cdots & 0 \\
M^{A^3:A^1} & 0 & M^{A^3} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
M^{A^n:A^1} & 0 & 0 & \cdots & M^{A^n}
\end{pmatrix} ≥ 0 .   (4.22)

Note that we could restrict the inequality to variables that A^1 shares at least one ancestor with.
For any pair A1 , Aj without a common ancestor, the DAG demands A1 and Aj to be independent (see Subsection 2.2.4). If a distribution violates this independence relation, the distribution is known to be incompatible with the DAG without having to consider any complicated inequality. On the other hand, if a distribution satisfies all required 1 n independence relations, one can use the inequality X A :...A ≥ 0 in its above 1 j j form. Since we have M A :A = 0 for independent variables, the M A block will be disconnected from the rest of the matrix and trivially be positive semidefinite. We may further assume that one of the ancestors of A1 indeed has degree m. Otherwise, one should replace m by m0 , the maximal degree of A1 ’s ancestors. Using m would still result in a valid inequality, but the inequality would be unnecessarily loose. Even though the main focus of this thesis lies on hypothesis tests, Theorem 4.1 is an important result on its own. Testable constraints for models including hidden variables, that are based on the model’s structure alone, are rare to this day. With this regard, Theorem 4.1 can even be understood as the main result of this chapter. The proof is carried out in Section 4.4 and partially prepared in the following subsection. For the sake of readability, two steps of the proof are presented only for the triangular scenario. The generalization to arbitrary hidden common ancestor models can be found in Appendix A. A brief recapitulation of all major steps is given in Subsection 4.4.5. Note that the proof is rather lengthy, Section 4.4 spanning over roughly 25 pages. In principle, it is possible to skip Section 4.4 without causing difficulties at understanding the rest of the thesis. 4.3.3 An equivalent representation Before proving inequality (4.22) we derive an equivalent representation that P better resembles the entropic inequality ni=2 I (A1 ; Ai ) ≤ (m − 1)H (A1 ). This representation also turns out to be more convenient for some parts of the proof. First, we provide a lemma that gives a necessary and sufficient condition for a block matrix to be positive semidefinite. The lemma already exists in the literature in several different versions, see for example Reference [46] Theorem IX.5.9 or Reference [30] Theorem 7.7.7. We present the lemma (and the proof) for the sake of completeness and to have a version of the lemma which is best suited for our purposes. 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 81 Lemma 4.3. If R ∈ Cn×n , S ∈ Cm×m and Q ∈ Cn×m , then ! R Q X= ≥0 Q† S (4.23) if and only if R ≥ 0, S ≥ 0, R ≥ QS Q† , QPS = Q, (4.24) where S denotes the pseudoinverse of S as introduced in Subsection 2.4.2 and PS denotes the projection onto the range of S. The conditions (4.24) can equivalently be replaced by R ≥ 0, S ≥ 0, S ≥ Q† R Q, PR Q = Q. (4.25) First, note that the precondition that the lower left block Q† is the adjoint of the upper right block Q is required since a positive semidefinite matrix is necessarily hermitian. Second, in (4.24), the conditions S ≥ 0 and R ≥ QS Q† already imply R ≥ 0. Third, to gain intuition for the conditions, think of R, S and Q as scalars. The diagonal entries of a positive semidefinite matrix have to be non-negative, thus R ≥ 0 and S ≥ 0. From the positivity of the determinant one concludes RS ≥ |Q|2 . In (4.24), the conditions R ≥ QS Q† and QPS = Q are essentially the generalization of this scalar condition. We prove Lemma 4.3 for the conditions (4.24). The proof for the conditions (4.25) is analogous. Proof. 
For the if-part choose arbitrary r ∈ Cn and s ∈ Cm . Also recall from Subsection 2.4.3 that the projection onto the range √ of a positive semidefinite matrix M can be written as PM = M M . Since M has the same range as M itself (and is also positive semidefinite), we can further write P M = P√ M = √ √ M M. (4.26) 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 82 Using this identity and the conditions (4.24) we obtain r † s† ! R Q Q† S r s ! = r † Rr + r † Qs + s† Q† r + s† Ss √ √ √ √ = r † Rr + r † Q S Ss + s† S S Q† r + s† Ss √ √ √ √ ≥ r † QS Q† r + r † Q S Ss + s† S S Q† r + s† Ss √ √ † √ † √ S Q† r + Ss S Q r + Ss = ≥ 0. (4.27) From the second to the√third line we used QPS = Q and S ≥ 0 which √ allows us to write PS = S S. From the third to the fourth line we used R ≥ QS Q† . For the only if-part, R ≥ 0 and S ≥ 0 are clear since in particular vectors T T of the form x1 = r 0S and x2 = 0R s with r ∈ Cn and s ∈ Cm satisfy xi † Xxi ≥ 0. Next, for arbitrary r ∈ Cn , we find 0 ≤ r † −r † QS ! R Q Q† S r † −S Q r ! = r † Rr − r † QS Q† r − r † QS Q† r + r † Q S SS } Q† r | {z =PS S =S = r † R − QS Q† r. (4.28) Thus, we obtain R ≥ QS Q† . Finally, we need to show that QPS = Q. For this purpose, define PS⊥ = 1S − PS , the projection onto the orthogonal complement of range (S) (i.e. the kernel of S). This projection satisfies in particular SPS⊥ = PS⊥ S = 0. Let s ∈ Cm , r ∈ Cn , x ≥ 0 and θ ∈ R be arbitrary, then 0 ≤ r † xeiθ s† PS⊥ ! R Q Q† S r −iθ ⊥ xe PS s ! = r † Rr + xe−iθ r † QPS⊥ s + xeiθ s† PS⊥ Q† r + x2 s† PS⊥ SPS⊥ s (4.29) | = r † Rr + 2xRe e−iθ r † QPS⊥ s . {z 0 } (4.30) 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 83 We know that r † Rr ≥ 0 but if r † QPS⊥ s 6= 0 we can always choose an appropriate θ and large enough x to make the whole expression negative. Thus, we require r † QPS⊥ s = 0 ∀r ∈ Cn and ∀s ∈ Cm . From there we can conclude QPS⊥ = 0 and thus QPS = Q PS + PS⊥ = Q1S = Q. (4.31) We can now formulate and prove the following equivalent representation of inequality (4.22): Proposition 4.1. A probability distribution on n observables A1 , ..., An (with 1 n finite alphabets) satisfies inequality (4.22), X A :...:A ≥ 0, if and only if YA 1 :...:An := n √ X M A1 M A 1 :Ai i i M A M A :A 1 √ M A1 ≤ (m − 1) 1A1 . (4.32) i=2 To get an intuitive understanding of inequality (4.32), recall that the mai 1 i trices M A and M A :A can be considered as generalizations of Var [Ai ] and Cov [A1 , Ai ]. Inequality (4.32) can then be understood as the generalization of n X 1 q i=2 h Var [A1 ] Cov A1 , Ai i h i 1 1 i 1 q ≤ (m − 1) Cov A , A i Var [A ] Var [A1 ] n h i2 X Corr A1 , Ai ⇔ ≤ (m − 1) , i=2 (4.33) assuming Var [Aj ] 6= 0 ∀j = 1, ..., n. Note that at this point this inequality serves only as an illustration. The inequality is properly stated as a corollary of Theorem 4.1 in Subsection 4.3.4 and proven in Appendix B. 1 2 Proof. (Prop. 4.1) By identifying R = (m − 1) M A , S = diag M A , ..., M A 1 2 1 n n and Q = M A :A . . . M A :A we can use Lemma 4.3 (with the first set 1 n of conditions) which in this case reduces to the statement that X A :...:A ≥ 0 if and only if R ≥ QS Q† . The conditions R ≥ 0 and S ≥ 0 are trivially 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 84 i satisfied since each matrix M A is already known to be positive semidefinite. Concerning the condition QPS = Q, we can conclude from Lemma 4.3 applied to the matrices 1 M 1 i 1 1 MA M A :A = i 1 i M A :A MA A1 :Ai i i ! 
≥ 0, 1 (4.34) i that M A :A PM Ai = M A :A (the matrices MA :A are known to be positive semidefinite according to Corollary 4.2.2). Due to the block diagonal structure of S, this implies QPS = = = MA MA MA 1 :A2 1 :A2 . . . MA 1 :A PM A 2 n PM A 2 . . . M A 1 :A2 . . . MA 1 :A 1 :An ... PM A n P M An n = Q. (4.35) Analogously, one obtains PR Q = Q which will be required later. Thus, the † only non-trivial condition Q . This condition can equivalently be √ QS √ is R †≥ replaced by 1R ≥ R QS Q R . To see this, first note that the relation Z1 ≥ Z2 is invariant under transformations of the form Zi → U Zi U † where U is an arbitrary matrix of suitable dimension, † † † † † x† U Z 1 U | {zx} = y Z1 y ≥ y Z2 y = x U Z2 U x. (4.36) :=y ‘⇒’: √ Since R ≥ 0 , also R is√positive semidefinite and in particular hermitian. √ The transformation Zi → R Zi R is thus of the above form and respects matrix ordering. By applying this transformation to both sides of R ≥ QS Q† , one obtains ⇒ R ≥ QS Q† √ √ √ †√ R R R ≥ R QS Q R | {z } PR ⇒ 1R ≥ √ †√ R QS Q R . (4.37) 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 85 In the second step we used that any projection is upper bounded by the identity. ‘⇐’: ⇒ √ †√ R QS Q R ≤ 1R √ √ †√ √ √ √ R R QS Q R R ≤ R1R R PR QS Q† PR ⇒ ⇒ ≤ R † ≤ R QS Q (4.38) From line one to line two we performed the matrix-order-preserving trans√ √ √ † √ formation Zi → RZi R (note that R = R). Next, we identified √ √ √ √ √ √ R R = R R = PR and R1R R = R. In the last step we used PR Q = Q, which can be shown in exactly the same way as the statement QPS = Q from above. √ √ Explicitly writing down the condition 1R ≥ R QS Q† R in terms of the M -matrices concludes the proof of Proposition 4.1. The product QS Q† evaluates to n QS Q† = X MA 1 :Ai i i 1 M A M A :A . (4.39) i=2 With R = (m − 1) M A the inequality 1R ≥ 1 √ †√ R QS Q R reads n √ √ X 1 1 1 i i i 1 √ M A1 M A1 √ ≤ 1A1 M A :A M A M A :A m−1 m−1 i=2 n √ √ X 1 i i i 1 M A1 M A :A M A M A :A M A1 ≤ (m − 1) 1A1 . ⇔ ! i=2 | {z 1 :...:An YA } (4.40) 1 n Thus, X A :...A ≥ 0 is equivalent to Y A is short for 1M A1 . 1 :...:An ≤ (m − 1) 1A1 . Note that 1A1 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 86 4.3.4 Covariances revisited In Subsection 4.3.3 we have introduced the inequality n i2 h X Corr A1 , Ai ≤ (m − 1) (4.41) i=2 (see also (4.33)) in order to get an intuitive understanding of the matrix inequality (4.32), n √ X M A1 M A 1 :Ai i i M A M A :A 1 √ M A1 ≤ (m − 1) 1A1 . (4.42) i=2 It is possible to prove inequality (4.41), more precisely a version on the level of covariances and variances rather than correlation coefficients, as a corollary of Theorem 4.1. Corollary 4.2. All distributions compatible with a hidden common ancestor model with n observables A1 , ..., An (with finite alphabets) and ancestors of degree up to m, satisfy the inequality n n h i2 Y X Cov A1 , Aj j=2 h i Var Ak ≤ (m − 1) k=2 n Y h i Var Ai . (4.43) i=1 k6=j Inequality (4.41) can be obtained by demanding that all observables have non-vanishing variance. The full proof of Corollary 4.2 is presented in Appendix B. Here we sketch the general idea for the special case of the triangular scenario for which Theorem 4.1 states M A M A:B M A:C B:A MB 0 = M ≥ 0. 
C:A C M 0 M X A:B:C (4.44) Given that X A:B:C is positive semidefinite, Lemma 4.2 allows us to conclude that the matrix Z A:B:C Var [A] Cov [A, B] Cov [A, C] Var [B] 0 = Cov [B, A] Cov [C, A] 0 Var [C] (4.45) 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 87 (first introduced in (4.18)) is positive semidefinite as well (for arbitrary alphabets). To apply Lemma 4.2, recall that one can for example write Cov [A, B] = a† M A:B b, where the vectors a, b carry the alphabets of A and B. Positive semidefiniteness of Z A:B:C implies det Z A:B:C ≥ 0 which amounts to the inequality (see also Subsection 4.3.1) |Cov [A, B]|2 Var [C]+|Cov [A, C]|2 Var [B] ≤ Var [A] Var [B] Var [C] . (4.46) This is the special case of inequality (4.43) for the triangular scenario. Calculating the determinant in the general case requires a bit more effort, which is why the general proof has been moved to Appendix B. Note that the covariance inequality for one specific choice of alphabet values of the observables is not equivalent to the alphabet independent matrix inequality X A:B:C ≥ 0. The matrix inequality will always be at least as powerful as the covariance inequality. To illustrate this statement, we characterize the strength of an inequality by the number of distributions violating the inequality. More violations correspond to a stronger inequality. Now, assume that a distribution violates X A:B:C ≥ 0. From Lemma 4.2 we can only conclude that there exist alphabets for which the matrix Z A:B:C from (4.45) is not positive semidefinite either. However, there might, and in general will, exist other alphabets for which we obtain Z A:B:C ≥ 0. Thus, it might happen that even though X A:B:C ≥ 0 is violated, the inequality Z A:B:C ≥ 0 is satisfied. Going further from Z A:B:C ≥ 0 to the scalar covariance inequality (4.46), recall that (4.46) simply states the non-negativity of det Z A:B:C . But even if Z A:B:C is not positive semidefinite, the determinant might still be positive (for an even number of negative eigenvalues). Thus, violation of Z A:B:C ≥ 0 (and in particular X A:B:C ≥ 0) does not imply violation of inequality (4.46). Considering the other direction, a negative determinant (i.e. violation of inequality (4.46)) automatically implies that Z A:B:C is not positive semidefinite. Going further, by applying Lemma 4.2 in the other direction, as soon as we find violation of Z A:B:C ≥ 0 for an arbitrary choice of alphabets, we know that X A:B:C ≥ 0 is violated as well. Thus, violation of det Z A:B:C ≥ 0 implies violation of X A:B:C ≥ 0. The latter inequality is therefore stronger (or at least not weaker) than the former. This holds not only for the triangular scenario but also for general hidden common ancestor models. Working in the general matrix framework with 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 88 1 n the inequality X A :...:A ≥ 0 is thus not just a matter of taste. The matrix inequality is indeed stronger than the inequality for covariances. The gap between the inequalities will be illustrated later by one example in Figure 23. 4.4 Proving the inequality The proof of Theorem 4.1 is splitted into several parts. First, we show that if the inequality (‘the inequality’ might refer to any of (4.22) or (4.32)) is satisfied for one given distribution, then it will also be satisfied for any distribution that can be obtained from the initial one by local transformations. The concept of local transformations will be introduced along the way. 
Second, the inequality is shown to hold for one specific family of distributions. To demonstrate that the inequality is not trivially satisfied, a counterexample is presented as well. To conclude the proof, we show that all distributions compatible with a given hidden common ancestor model can be obtained by local transformations (and a subsequent limit procedure) starting with the family of distributions shown to be compatible in the previous step. To wrap everything up, we give a brief overview of all important steps of the proof. Note that in this section some parts of the proof are presented only for the triangular scenario. The generalization to arbitrary hidden common ancestor models can be found in Appendix A. Also note that this section spans over roughly 25 pages. In principle, the rest of the thesis can be understood even if this section is skipped. 4.4.1 Invariance under local transformations A local transformation of a single random variable A → A0 can be defined via conditional probabilities, essentially employing the law of total probability (2.3), P (A0 = k) = KA X P (A0 = k | A = l) P (A = l) . (4.47) l=1 In the following we will usually use the short hand notation PA0 |A (k | l) ≡ P (A0 = k | A = l) , (4.48) 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 89 and so on. Locally transforming several variables A1 , ..., An → A10 , ..., An0 with joint distribution P (A1 , ..., An ) reads PA10 ,...,An0 (k1 , ..., kn ) = X PA10 |A1 (k1 | l1 ) ...PAn0 |An (kn | ln ) PA1 ,...,An (l1 , ..., ln ) . (4.49) l1 ,...,ln The transformations are called local, since each single transformation Aj → Aj0 will only affect the marginal of the variable Aj . In particular, a product distribution P (A1 , ...An ) = P (A1 ) ...P (An ) will be transformed to a product distribution P (A10 , ..., An0 ) = P (A10 ) ...P (An0 ). In our matrix framework a local transformation A → A0 (between variables with finite alphabets) can be represented by the matrix 0 iK 0 ,KA A h T A ,A := PA0 |A (k | l) k.l=1 (4.50) . When likewise representing probability distributions as vectors, PA := PA (1) · · · PA (KA ) T , (4.51) the transformation A → A0 reads 0 PA0 = T A ,A PA . We further define 0 0 T A,A := T A ,A T (4.52) . (4.53) Since the transformation matrices are real valued, the transpose is simultaneously the adjoint. Note that for a given transformation A → A0 the 0 backwards transformation A0 → A does generally not exist. T A,A is thus 0 not the inverse of T A ,A . Recall, that the goal of this subsection is to show that inequality (4.22) from Theorem 4.1 remains valid under local transformations. We prepare the proof by introducing two lemmata. Lemma 4.4. Under local transformations A → A0 , B → B 0 (between variables with finite alphabets) the matrix M A:B defined in (4.3) properly transforms as 0 0 0 0 (4.54) M A :B = T A ,A M A:B T B,B . 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 90 Proof. 0 0 MkAA:B ,kB = PA0 ,B 0 (kA , kB ) − PA0 (kA ) PB 0 (kB ) X = PA0 |A (kA | lA ) PB 0 |B (kB | lB ) PA,B (lA , lB ) lA ,lB − X PA0 |A (kA | lA ) PA (lA ) lA X = X PB 0 |B (kB | lB ) PB (lB ) lB PA0 |A (kA | lA ) [PA,B (lA , lB ) − PA (lA ) PB (lB )] PB 0 |B (kB | lB ) lA ,lB 0 0 TkAA ,l,AA MlA:B T B,B A ,lB lB ,kB X = lA ,lB h 0 0 T A ,A M A:B T B,B = i (4.55) kA ,kB Unfortunately, and maybe surprisingly, the single variable matrix M A from 0 (4.6) does not satisfy this exact transformation behaviour, that is M A 6= 0 0 T A ,A M A T A,A . 
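Lemma 4.4, and the failure of the exact transformation rule for M^A, can be checked numerically. The sketch below draws a random joint distribution and random (column-)stochastic transformation matrices; all dimensions and names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def M_pair(p_xy):
    return p_xy - np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0))

def M_single(px):
    return np.diag(px) - np.outer(px, px)

def random_channel(k_out, k_in):
    """Column-stochastic matrix T^{A',A} of conditional probabilities, cf. (4.50)."""
    T = rng.random((k_out, k_in))
    return T / T.sum(axis=0)

p_ab = rng.random((3, 4))
p_ab /= p_ab.sum()                                  # some joint distribution p(a, b)
T_A = random_channel(2, 3)                          # local transformation A -> A'
T_B = random_channel(5, 4)                          # local transformation B -> B'

p_ab_new = T_A @ p_ab @ T_B.T                       # transformed joint distribution, cf. (4.49)

# Lemma 4.4: M^{A':B'} equals T^{A',A} M^{A:B} T^{B,B'}
print(np.allclose(M_pair(p_ab_new), T_A @ M_pair(p_ab) @ T_B.T))

# the single-variable matrix does not transform exactly ...
pA, pA_new = p_ab.sum(axis=1), p_ab_new.sum(axis=1)
diff = M_single(pA_new) - T_A @ M_single(pA) @ T_A.T
print(np.allclose(diff, 0))                         # generally False
# ... but the difference is positive semidefinite (Lemma 4.5 below)
print(np.linalg.eigvalsh(diff).min() >= -1e-12)
```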
The reason is the structural difference between the matrices M A:B and M A brought about by the Kronecker delta appearing in M A . Fortunately, we have the following lemma which is sufficient for our purposes. Lemma 4.5. Under a local transformation A → A0 (between variables with 0 finite alphabets) the matrices M A and M A defined by (4.6) satisfy 0 0 0 M A ≥ T A ,A M A T A,A . (4.56) Proof. We have to show that h 0 0 0 i a† M A − T A ,A M A T A,A a X = h 0 0 a∗k1 M A − T A ,A M A T A,A k1 ,k2 0 i k1 ,k2 ak 2 (4.57) is non-negative for all complex valued vectors a of suitable dimension. To 0 0 0 this end we explicitly write down the matrices M A and T A ,A M A T A,A , 0 MkA1 ,k2 = PA0 (k1 ) δk1 ,k2 − PA0 (k1 ) PA0 (k2 ) = X PA0 |A (k1 | l1 ) PA (l1 ) δk1 ,k2 l1 − X l1 PA0 |A (k1 | l1 ) PA (l1 ) X l2 PA0 |A (k2 | l2 ) PA (l2 ) (4.58) 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 91 and h 0 T A ,A M A T A,A X = 0 i k1 ,k2 PA0 |A (k1 | l1 ) [PA (l1 ) δl1 ,l2 − PA (l1 ) PA (l2 )] PA0 |A (k2 | l2 ) l1 ,l2 X = PA0 |A (k1 | l1 ) PA (l1 ) PA0 |A (k2 | l1 ) l1 X − PA0 |A (k1 | l1 ) PA (l1 ) PA (l2 ) PA0 |A (k2 | l2 ) . (4.59) l1 ,l2 By combining the two expressions one obtains 0 h 0 M A − T A ,A M A T A,A = δk1 ,k2 X 0 i k1 ,k2 PA0 |A (k1 | l1 ) PA (l1 ) l1 − X PA0 |A (k1 | l1 ) PA0 |A (k2 | l1 ) PA (l1 ) . (4.60) l1 Insertion into (4.57) yields X 0 h 0 a∗k1 M A − T A ,A M A T A,A k1 ,k2 |ak1 |2 = X X k1 l1 − X a∗k1 k1 ,k2 i k1 ,k2 ak2 PA0 |A (k1 | l1 ) PA (l1 ) X PA0 |A (k1 | l1 ) PA0 |A (k2 | l1 ) PA (l1 ) ak2 l1 X X PA (l1 ) |ak1 |2 PA0 |A (k1 = 0 l1 k1 | | X l1 ) − ak1 PA0 |A (k1 k1 {z ≥0 ≥ 0. 0 0 | 2 l1 ) } (4.61) 0 M A − T A ,A M A T A,A is thus positive semidefinite. To see the last step, note that |·|2 is a convex function. By definition, a (suitably defined) function f is convex if λf (x1 ) + (1 − λ) f (x2 ) ≥ f (λx1 + (1 − λ) x2 ) (4.62) ∀0 ≤ λ ≤ 1 and ∀x1 , x2 in the domain of f . This inequality straightforwardly extends to larger mixtures given that the mixing coefficients are non-negative 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 92 and sum to unity. In our case the role of the mixing coefficients is played by the probabilities PA0 |A (k1 | l1 ). The function |·|2 is well known to be convex. By employing Lemmata 4.4 and 4.5 we can finally prove the desired statement. Lemma 4.6. If a probability distribution on the variables A1 , ..., An (with finite alphabets) satisfies inequality (4.22), XA 1 :...:An (m − 1) M A 2 1 M A :A .. = . .. . An :A1 M 1 1 M A :A 2 MA 2 ··· 0 .. . .. . ··· 0 .. . 0 1 · · · M A :A ··· 0 .. .. . . .. . 0 n 0 MA n ≥ 0, then also the distribution on the variables A10 , ..., An0 (with finite alphabets) obtained by local transformations A1 → A10 , ..., An → An0 satisfies XA 10 :...:An0 (m − 1) M A 20 10 M A :A .. = . .. . An0 :A10 M 10 10 M A :A 20 MA 0 .. . 0 Proof. We show that in fact X A pound transformation matrix 10 :...:An0 TA T := 1 n 20 ··· 0 .. . .. . ··· ≥ T XA 10 · · · M A :A ··· 0 .. .. . . .. . 0 n0 0 MA 1 :...:An 10 ,A1 n0 ≥ 0. (4.63) T T where T is the com- ... T An0 ,An . (4.64) 1 n Note that from X A :...:A ≥ 0 we can conclude T X A :...:A T T ≥ 0 (see also 10 n0 (4.36) and the text above; note that T T = T † ). The relation X A :...:A ≥ 1 n 10 n0 T X A :...:A T T therefore implies X A :...:A ≥ 0. To show the desired relation we can calculate 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 93 10 n0 1 n X A :...:A − T X A :...:A T T h i 10 10 1 1 1 10 (m − 1) M A − T A ,A M A T A ,A M A20 :A10 − T A20 ,A2 M A2 :A1 T A1 ,A10 = .. . .. . 
by Lemma 4.4 h i 10 10 1 1 1 10 (m − 1) M A − T A ,A M A T A ,A = 0 .. . MA 10 :A20 − TA 20 10 20 MA − TA ,A1 ,A2 MA 1 :A2 2 TA 2 MA TA 2 ,A20 ,A20 0 .. . 20 20 ,A2 .. 0 .. . .. 0 MA − TA ··· 2 2 MA TA . ,A20 . ··· · · · .. . .. . ··· .. . .. . by Lemma 4.5 ≥0, (4.65) In the last step we used that a block diagonal matrix is positive semidefinite if this is true for each block. Each single block is positive semidefinite due to Lemma 4.5. 4.4.2 Proof for a special family of distributions As the second step of the proof of Theorem 4.1 we show that inequality (4.22), or rather the equivalent inequality (4.32), is satisfied for a specific family of distributions. Here, we will only consider the special case of the triangular scenario (see Figure 3), i.e. n = 3 observables A, B, C and ancestors λAB , λAC , λBC of degree m = 2. Inequality (4.32) applied to the triangular scenario reads √ √ √ √ M A M A:B M B M B:A M A + M A M A:C M C M C:A M A ≤ 1A . (4.66) The general case is treated in Appendix A.1. While there are strong similarities, the proof for the general case requires one more final step and the notation is more complicated. Fortunately, the general procedure can be understood equally well by restricting to the triangular scenario. We model each observable as the joint of two independent subvariables A = { A1 , A2 }, B = { B1 , B2 } and C = { C1 , C2 }, with distributions P (A) = 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 94 A A1 A2 B2 C1 B1 B C2 C Figure 19: The triangular scenario where each observable is modeled by two subvariables. Correlations between pairs of subvariables play the role of the ancestors of the original scenario. P (A1 , A2 ) = P (A1 ) P (A2 ) etc. The hidden ancestor structure is modeled by assuming that one subvariable of each observable is correlated with exactly one subvariable of any other observable, by choice A1 ↔ B2 , B1 ↔ C2 and C1 ↔ A2 . For our purpose we assume these correlations to be perfect, i.e. 1 δk l , PA1 ,B2 (kA , lB ) = KAB A B 1 PC1 ,A2 (kC , lA ) = δk l KAC C A 1 PB1 ,C2 (kB , lC ) = δk l (4.67) KBC B C where KAB (KAC , KBC ) is the common (finite) alphabet size of A1 and B2 (C1 and A2 ; B1 and C2 ). Without loss of generality, not only the alphabet sizes but also the alphabets themselves can be assumed to coincide (taking the integer values kA , lB = 1, ..., KAB etc). There shall be no further correlations beside those defined here. A graphical illustration is provided by Figure 19. The joint distribution of the variables A, B, C reads PA,B,C = PA1 ,A2 ,B1 ,B2 ,C1 ,C2 (kA , lA , kB , lB , kC , lC ) = PA1 ,B2 (kA , lB ) PB1 ,C2 (kB , lC ) PC1 ,A2 (kC , lA ) 1 = δk l δk l δk l . KAB KAC KBC A B B C C A (4.68) 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 95 Note that aside from being finite, the alphabet sizes KAB , KAC and KBC are arbitrary, justifying the term ‘family of distributions’. This will become important in the next step of the proof in Subsection 4.4.4, where distributions on large alphabets are used to model more general distributions on smaller alphabets. We further denote the total alphabet sizes of the variables A, B and C by KA := KAB KAC , KB := KAB KBC , KC := KAC KBC . (4.69) 1 δ δ δ Proposition 4.2. The distributions P (A, B, C) = KAB KAC KBC kA lB kB lC kC lA on the variables A = { A1 , A2 }, B = { B1 , B2 } and C = { C1 , C2 } with arbitrary, finite alphabet sizes KAB , KAC , KBC , satisfy inequality (4.66), √ √ √ √ M A M A:B M B M B:A M A + M A M A:C M C M C:A M A ≤ 1A . 
The corresponding statement for general hidden common ancestor models is proven in Appendix A.1. Proof. We have to construct the matrices M A , M B , M C , M A:B and M A:C defined in (4.3) and (4.6). As the main ingredient we require the mono- and bi-partite marginals of P (A, B, C). Marginalization over, for example C, amounts to marginalization over the two subvariables C1 and C2 . This leads to the bi-partite distributions 1 δk l KAB KAC KBC A B 1 δk l . and PA1 ,A2 ,C1 ,C2 (kA , lA , kC , lC ) = KAB KAC KBC C A PA1 ,A2 ,B1 ,B2 (kA , lA , kB , lB ) = (4.70) Continuing the marginalization, one obtains the single variable marginals 1 1 = , KAB KAC KA 1 1 PB1 ,B2 (kB , lB ) = = KAB KBC KB 1 1 and PC1 ,C2 (kC , lC ) = = . KAC KBC KC PA1 ,A2 (kA , lA ) = (4.71) 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 96 As initially demanded, the distributions factorize according to, for example, PA1 ,A2 (kA , lA ) = PA1 (kA ) PA2 (lA ) with PA1 (kA ) = 1/KAB and PA2 (lA ) = 1/K . Since this was clear by construction, one could have written down AC the marginals without any calculations. With (4.70) and (4.71) we essentially have all ingredients that are required to write down the M -matrices. We aim, however, for a concise operator representation of the M -matrices which will considerably simplify all following calculations. To this end, we switch to the Dirac notation and represent mono-partite marginals as Ket-vectors and bi-partite marginals as operators. P (A) = P (A1 , A2 ) = K AB K AC X X PA1 (kA ) PA2 (lA ) |kA iA1 ⊗ |lA iA2 kA =1 lA =1 = K AB K AC X X 1 1 |kA iA1 ⊗ |lA iA2 KAC kA =1 lA =1 KAB = √ 1 K AB X K AC X 1 1 √ √ |kA iA1 ⊗ |lA iA2 KAB KAC kA =1 lA =1 KAB KAC 1 |IA1 i ⊗ |IA2 i = √ KA 1 |IA i . = √ KA (4.72) In the last two steps we introduced the normalized states K AB X 1 |IA1 i := √ |kA iA1 , KAB kA =1 AC 1 KX |IA2 i := √ |lA iA2 , KAC lA =1 |IA i := |IA1 i ⊗ |IA2 i . (4.73) Similarly, we obtain 1 |IB i KB 1 and P (C) = √ |IC i , KC P (B) = √ (4.74) (4.75) 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 97 with definitions of the |Ii -states analogous to those from (4.73). In the bi-partite case we obtain (note that we use upright boldface to denote the operator representation of a bi-partite probability distribution) P (A, B) = X PA1 ,A2 ,B1 ,B2 (kA , lA , kB , lB ) kA ,lA ,kB ,lB |kA iA1 ⊗ |lA iA2 hkB |B1 ⊗ hlB |B2 X 1 = δkA lB kA ,lA ,kB ,lB KAB KAC KBC |kA iA1 ⊗ |lA iA2 hkB |B1 ⊗ hlB |B2 X 1 √ |kA iA1 hkA |B2 ⊗ = KAB KAC KBC kA 1 ⊗ √ KAC X |lA iA2 √ lA 1 KBC X hkB |B1 kB A1 ↔B2 1 = √ 1 ⊗ |IA2 i hIB1 | , KA KB (4.76) where we defined the symbol A1 ↔B2 1 := X |kA iA1 hkA |B2 . (4.77) kA When acting from the left, A1 ↔B2 1 transforms a state from the space of B2 to A1 ↔B2 the ‘same’ state in the space of A1 , e.g. 1 |IB2 i = |IA1 i. This is possible since A1 and B2 have the same alphabets (in particular the same alphabet size KAB ). When acting from the right, the transformation goes in the other A1 ↔B2 direction. Thus, 1 can essentially be regarded as an ‘identity operation between isomorphic spaces’. Similarly, we obtain A2 ↔C1 1 |IA1 i hIC2 | ⊗ 1 . P (A, C) = √ K A KC (4.78) Note that the roles of A1 and A2 are reversed compared to P (A, B). The reason is that A is correlated with C via A2 while the correlation with B was mediated by A1 . Employing (4.72) to (4.78) we can now simply write 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 98 down the matrices M A:B and M A:C , M A:B = P (A, B) − P (A)P (B)† ! A1 ↔B2 1 = √ 1 ⊗ |IA2 i hIB1 | − |IA i hIB | KA KB ! 
A1 ↔B2 1 = √ 1 − |IA1 i hIB2 | ⊗ |IA2 i hIB1 | , KA KB and M A:C 1 =√ |IA1 i hIC2 | KA KC (4.79) ! A2 ↔C1 1 − |IA2 i hIC1 | (4.80) Concerning the mono-partite matrices M A , M B and M C , additional care has to be taken due to the Kronecker delta in the definition of the matrices (see (4.6)). One can write MA = X k,l,k0 ,l0 = δkk0 δll0 PA1 ,A2 (k, l) |kiA1 hk 0 |A1 ⊗ |liA2 hl0 |A2 − P (A)P (A)† 1 1 X |kiA1 hk|A1 ⊗ |liA2 hl|A2 − |IA1 i hIA1 | ⊗ |IA2 i hIA2 | KA k,l KA 1 (1A1 ⊗ 1A2 − |IA1 i hIA1 | ⊗ |IA2 i hIA2 |) KA 1 (1A − |IA i hIA |) . = KA = (4.81) Similarly, we obtain 1 (1B − |IB i hIB |) KB 1 = (1C − |IC i hIC |) . KC MB = and M C (4.82) (4.83) It is instructive to realize that 1Ω − |IΩ i hIΩ | (for Ω = A, B, C) is a projection. Due to this fact, taking the square root or (pseudo) inverse of M Ω (as required by inequality (4.66)) simply amounts to taking the square root or inverse of the prefactor 1/KΩ . Using this realization and (4.79), (4.81) and √ √ (4.82), we can calculate the product M A M A:B M B M B:A M A . We 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 99 start by considering the product of the first two matrices, √ M A M A:B A1 ↔B2 1 = KA (1A − |IA i hIA |) × √ 1 − |IA1 i hIB2 | ⊗ |IA2 i hIB1 | KA KB p A1 ↔B2 1 = KA M A:B − √ 1 − |IA1 i hIB2 | ⊗ |IA2 i hIB1 | |IA1 i hIA1 | ⊗ |IA2 i hIA2 | × KB p 1 = KA M A:B − √ (|IA i hIB2 | − |IA1 i hIB2 |) ⊗ |IA2 i hIB1 | {z } KB | 1 p 0 = KA M A:B . p (4.84) √ We see that M A merely has a scalar-multiplicative effect on M A:B (M A:B √ √ is an ‘eigenoperator’ of M A ). The effect of M A on M B:A from the right is exactly the same, and the action of M B is analogous as well (providing the prefactor KB ). By exploiting this behaviour and using M A:B from (4.79), we obtain √ √ M A M A:B M B M B:A M A q = KA M A:B =KA KB M = A1 ↔B2 M A:B B M M B:A q KA B:A ! 1 − |IA1 i hIB2 | ⊗ |IA2 i hIB1 | × B2 ↔A1 ! 1 − |IB2 i hIA1 | ⊗ |IB1 i hIA2 | = (1A1 − |IA1 i hIA1 |) ⊗ |IA2 i hIA2 | . (4.85) √ √ When calculating M A M A:C M C M C:A M A it is important to take into account the reversed roles of A1 and A2 . Thus, √ √ M A M A:C M C M C:A M A = |IA1 i hIA1 | ⊗ (1A2 − |IA2 i hIA2 |) . (4.86) To this step √ of the proof√it remains to show that sum of √ conclude √ the A:B B B:A A:C C C:A A A A A M M M M M and M M M M M is indeed upper bounded by the identity. To this end, one can realize that the two terms are hermitian, mutually orthogonal projections, i.e. [(1A1 − |IA1 i hIA1 |) ⊗ |IA2 i hIA2 |]2 =(1A1 − |IA1 i hIA1 |) ⊗ |IA2 i hIA2 | , (4.87) [|IA1 i hIA1 | ⊗ (1A2 − |IA2 i hIA2 |)] = |IA1 i hIA1 | ⊗ (1A2 − |IA2 i hIA2 |) (4.88) 2 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 100 and [(1A1 − |IA1 i hIA1 |) ⊗ |IA2 i hIA2 |] × [|IA1 i hIA1 | ⊗ (1A2 − |IA2 i hIA2 |)] = 0. (4.89) The sum of two such projections, P1 and P2 , is again a projection, (P1 + P2 )2 = P 2 + P1 P2 + P2 P1 + P22 1 |{z} =P1 | {z } =0 = P1 + P2 . | {z } =0 |{z} =P2 (4.90) Furthermore, recall that the spectrum of a projection consists only of the eigenvalues 0 and 1. Due to this, any projection is upper bounded by the identity operator (for operators that are diagonal in the same basis it suffices to compare the eigenvalues). Thus, the family of distributions considered in this subsection satisfies the desired inequality (4.66), √ √ √ √ M A M A:B M B M B:A M A + M A M A:C M C M C:A M A ≤ 1A . 
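The chain of operator manipulations above can also be cross-checked numerically. The following is a minimal Python sketch (our own code and helper names, not part of the derivation); it constructs the family (4.68) for small, arbitrary alphabet sizes, builds the M-matrices of (4.3) and (4.6) from its marginals and verifies inequality (4.66). In line with the calculation above, $\sqrt{M^A}$ is implemented as the pseudo-inverse square root and $M^B$, $M^C$ enter through their pseudo-inverses.

```python
import numpy as np

def psqrt_inv(M, tol=1e-12):
    """Pseudo-inverse square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(M)
    w_inv = np.where(w > tol, 1.0 / np.sqrt(np.clip(w, tol, None)), 0.0)
    return (V * w_inv) @ V.T

K_AB, K_AC, K_BC = 2, 3, 2          # arbitrary finite alphabet sizes
# joint distribution (4.68) over (A1, A2, B1, B2, C1, C2)
P = np.zeros((K_AB, K_AC, K_BC, K_AB, K_AC, K_BC))
for kA in range(K_AB):              # value shared by A1 and B2
    for lA in range(K_AC):          # value shared by A2 and C1
        for kB in range(K_BC):      # value shared by B1 and C2
            P[kA, lA, kB, kA, lA, kB] = 1.0 / (K_AB * K_AC * K_BC)

# compound observables A = (A1, A2), B = (B1, B2), C = (C1, C2)
pA = P.sum(axis=(2, 3, 4, 5)).reshape(-1)
pB = P.sum(axis=(0, 1, 4, 5)).reshape(-1)
pC = P.sum(axis=(0, 1, 2, 3)).reshape(-1)
PAB = P.sum(axis=(4, 5)).reshape(pA.size, pB.size)
PAC = P.sum(axis=(2, 3)).reshape(pA.size, pC.size)

MA = np.diag(pA) - np.outer(pA, pA)             # mono-partite matrices, cf. (4.6)
MB = np.diag(pB) - np.outer(pB, pB)
MC = np.diag(pC) - np.outer(pC, pC)
MAB = PAB - np.outer(pA, pB)                    # bi-partite matrices, cf. (4.3)
MAC = PAC - np.outer(pA, pC)

S = psqrt_inv(MA)
lhs = (S @ MAB @ np.linalg.pinv(MB) @ MAB.T @ S
       + S @ MAC @ np.linalg.pinv(MC) @ MAC.T @ S)
print(np.linalg.eigvalsh(lhs).max() <= 1 + 1e-9)   # True: inequality (4.66) holds
```

As expected from the proof, the largest eigenvalue of the sum of the two orthogonal projections equals one, so the bound in (4.66) is attained but not violated.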
4.4.3 Counter example In this subsection we construct a distribution for which inequality (4.32) ((4.66) for the triangular scenario) is violated, proving that the inequality poses indeed a non-trivial constraint to the corresponding hidden common ancestor model. Since calculations are simpler than in the previous subsection (in fact fairly similar but without the tensor product structure) we directly consider the general case of n observables A1 , ..., An and ancestors of arbitrary degree m. Note that, as pointed out in Subsection 2.2.4, any distribution can be realized by one ancestor common to all observables (i.e. m = n). Thus, we restrict ourselves to the non-trivial case m < n. It is reasonable that a distribution where all observables are perfectly correlated is not compatible with such a scenario. It is desirable that this distribution also violates inequality (4.32). Here, we show that this is indeed the case. The joint distribution of all observables (with common, finite alphabet size K) reads 1 PA1 ,...,An (k1 , ..., kn ) = δk1 k2 ...kn . (4.91) K 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 101 The ‘multidimensional Kronecker delta’ demands that all indices coincide. For the mono- and bi-partite marginals one finds 1 δk ,k , K 1 j 1 . PAj (kj ) = K (4.92) PA1 ,Aj (k1 , kj ) = (4.93) Employing the Dirac notation from the previous subsection, the vector and operator representations of these distributions can be expressed as P A1 , Aj and P Aj = K 1 A1 ↔Aj 1 X 1 |kiA1 hk|Aj = K k=1 K (4.94) = K 1 X |ki j K k=1 A (4.95) 1 = √ |IAj i . K The M -matrices become M and A1 :Aj M Aj 1 = K = K X A1 ↔Aj ! 1 − |IA1 i hIAj | (4.96) δkl PAj (k) |kiAj hl|Aj − P Aj P Aj † k.l=1 1 (1Aj − |IAj i hIAj |) . (4.97) K √ √ 1 j j j 1 To calculate the expression M A1 M A :A M A M A :A M A1 required for inequality (4.32), we start by considering the product of the first two matrices. Similarly to the previous subsection, we find = √ M A1 M A1 :Aj √ ! A1 ↔Aj 1 1 − |IA1 i hIAj | = K (1A1 − |IA1 i hIA1 |) K ! √ 1 A1 ↔Aj K = 1 − |IA1 i hIAj | K √ 1 j = KM A :A . (4.98) √ j matrix M A in the middle The second matrix√ M A1 at the right, and the √ 1 j j j 1 of the expression M A1 M A :A M A M A :A M A1 similarly provide the 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 102 scalar factors √ K and K. Thus, one can calculate √ √ 1 j j j 1 M A1 M A :A M A M A :A M A1 =K 2 M A = 1 :Aj MA j :A1 A1 ↔Aj ! 1 − |IA1 i hIAj | Aj ↔A1 ! 1 − |IAj i hIA1 | =1A1 − |IA1 i hIA1 | . (4.99) Since we get one such summand for each j = 2, ..., n, inequality (4.32) becomes ? (n − 1) (1A1 − |IA1 i hIA1 |) ≤ (m − 1) 1A1 . (4.100) Since 1A1 − |IA1 i hIA1 | is a projection, the left hand side has eigenvalues 0 and n − 1, the latter being realized by states orthogonal to |IA1 i. Since all eigenvalues of the right hand side take the value m−1, and since we demand m < n, the left hand side is not upper bounded by the right hand side, (n − 1) (1A1 − |IA1 i hIA1 |) (m − 1) 1A1 . (4.101) Thus, we found a distribution that violates the inequality. 4.4.4 Generating the whole scenario by local transformations In this subsection we show that one can reach all distributions compatible with a given hidden common ancestor model by local transformations (and a subsequent limit procedure) starting with the corresponding family of distributions introduced in Subsection 4.4.2 (Appendix A.1 for the general case). 
According to Lemma 4.6, all distributions obtained by local transformations will automatically satisfy inequality (4.22) (and equivalently (4.32)). We divide the current elaboration into three steps. First of all, recall that in Subsection 4.4.2 (and Appendix A.1) we modeled the hidden ancestors by a set of subvariables with perfect correlation. In case of the triangular scenario we had λAB , A1 ↔ B2 , λBC , B1 ↔ C2 and λAC , C1 ↔ A2 . The joint distribution of the observables could be written as (see also (4.68)) P (A, B, C) = P (A1 , A2 , B1 , B2 , C1 , C2 ) = P (A1 , B2 ) P (B1 , C2 ) P (C1 , A2 ) 1 δA B δB C δC A . = KAB KAC KBC 1 2 1 2 1 2 (4.102) 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 103 Note that as a short hand notation we omit explicitly assigning values to the variables. In particular, the Kronecker deltas are to be understood that both variables shall take the same value. On the other hand, according to Subsection 2.2.4 (employing the Markov condition and marginalizing over the hidden ancestors), all distributions compatible with the triangular scenario are of the form P (A, B, C) = X P (A | λAB , λAC ) P (B | λAB , λBC ) P (C | λAC , λBC ) λAB ,λAC ,λAB ·P (λAB ) P (λAC ) P (λBC ) . (4.103) This decomposition looks considerably different from the above family of distributions. The first of our three steps thus aims to transform the distribution (4.102) to better resemble (4.103). This transformation also leads to a larger family of obtainable distributions. Here, as before in Subsection 4.4.2, this step is only performed for the triangular scenario while the general case in presented in Appendix A.2. Essentially, the reason is again mainly a notational one. The general case requires more complicated notations, some of which are introduced only in Appendix A.1. As the result of the first step, we will obtain (4.103) but with uniform ancestors P (λx = j) = 1/Kx . In the second step, we use these uniform ancestors with large alphabets to model more general ancestors with arbitrary rational probabilities. The third step extends the result to irrational probabilities. Both, the second and the third step can directly be presented for the most general case since all ancestors can be considered separately. Step 1: Locally transforming A = { A1 , A2 } → A0 ,... Before establishing the actual result of this step, we illustrate to what extent the family of distributions from (4.102) is restricted. The variable A was defined as the joint of the two subvariables A1 and A2 with factorizing distribution P (A) = P (A1 , A2 ) = P (A1 ) P (A2 ) (and similarly for B and C; see also Subsection 4.4.2). Furthermore, the variables A1 and B2 , B1 and C2 as well as C1 and A2 have the same alphabet sizes. This strongly limits the alphabet sizes of the compound observables A, B and C. It is for example impossible to let all variables be binary. Constructing A and B as binary would force C to have either one or four outcomes. Thus, already for the 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 104 reason to allow arbitrary alphabet sizes of the observables, local transformations A = { A1 , A2 } → A0 , B = { B1 , B2 } → B 0 and C = { C1 , C2 } → C 0 are required. Proposition 4.3. 
Starting with the family of distributions PA1 ,A2 ,B1 ,B2 ,C1 ,C2 = 1 δ δ δ (first introduced in (4.68)), one can obtain all KAB KAC KBC A1 B2 B1 C2 C1 A2 distributions of the form (4.103) with uniform ancestors P (λx = j) = 1/Kx (and finite alphabets), P (A, B, C) X = P (A | λAB , λAC ) P (B | λAB , λBC ) P (C | λAC , λBC ) λAB ,λAC ,λAB · 1 , KAB KAC KBC (4.104) via local transformations A = { A1 , A2 } → A0 , B = { B1 , B2 } → B 0 and C = { C1 , C2 } → C 0 . The case of general hidden common ancestor models is considered in Appendix A.2. Proof. By locally transforming PA1 ,A2 ,B1 ,B2 ,C1 ,C2 = we obtain 1 δ δ δ , KAB KAC KBC A1 B2 B1 C2 C1 A2 P (A0 , B 0 , C 0 ) P (A0 | A1 , A2 ) P (B 0 | B1 , B2 ) P (C 0 | C1 , C2 ) X = A1 ,A2 ,B1 ,B2 ,C1 ,C2 X = ·P (A1 , A2 , B1 , B2 , C1 , C2 ) P (A0 | A1 , A2 ) P (B 0 | B1 , B2 ) P (C 0 | C1 , C2 ) A1 ,A2 ,B1 ,B2 ,C1 ,C2 · (∗) = X 1 δA B δB C δC A KAB KAC KBC 1 2 1 2 1 2 P (A0 | A1 , C1 ) P (B 0 | B1 , A1 ) P (C 0 | C1 , B1 ) A1 ,B1 ,C1 = X 1 KAB KAC KBC P (A0 | λAB , λAC ) P (B 0 | λAB , λBC ) P (C 0 | λAC , λBC ) λAB ,λAC ,λBC · 1 . KAB KAC KBC (4.105) 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 105 For the equality marked with (∗), note that strictly speaking the Kronecker delta δA1 ,B2 only demands that A1 and B2 take the same value, not necessarily that they are the same variable. In a notation where we explicitly write down the values of the variables, the Kronecker delta δA1 ,B2 has the effect P (B 0 = kB | B1 = lB1 , B2 = lB2 ) → P (B 0 = kB | B1 = lB1 , B2 = lA1 ) . (4.106) But at this point, we can simply define P (B 0 = kB | B1 = lB1 , A1 = lA1 ) := P (B 0 = kB | B1 = lB1 , B2 = lA1 ) . (4.107) 0 0 This enables us to replace P (B | B1 , B2 ) by P (B | B1 , A1 ). The same argument holds for the effect of the other Kronecker deltas. In the last step we renamed the remaining variables A1 → λAB , B1 → λBC and C1 → λAC in order to introduce our typical hidden ancestor notation. One could say that we identified the remaining subvariable of each pair of correlated subvariables as the common ancestor of the two involved observables. Step 2: Modeling ancestors with P (λx = j) ∈ Q by ancestors with P (λx = j) = 1/Kx Considering the general case instead of only the triangular scenario (see Appendix A.2), we have shown that we can generate all distributions of the form Y 1 X , P A1 , ..., An = P A1 | { λx }x| 1 ...P An | { λx }x|An A x Kx { λx }x (4.108) where { λx }x denotes the set of all hidden ancestors and { λx }x| j the set A of all hidden ancestors of the observable Aj . The case of arbitrary ancestordistributions reads (see also (2.13)) P A1 , ..., An = X { λx }x P A1 | { λx }x| A1 ...P An | { λx }x|An Y P (λx ) . x (4.109) Proposition 4.4. Any distribution of the form (4.109) with rational valued ancestor-probabilities P (λx = j) ∈ Q (of finite alphabet size) can be modeled by a distribution of the form (4.108). 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 106 Following all previous steps of the proof, the equivalence of these two models immediately implies that all distributions of the form (4.109) with P (λx = j) ∈ Q satisfy inequality (4.22). Proof. Assume that we have a distribution P (A1 , ..., An ) of the form (4.109) with arbitrary, rational ancestors. We show how to model one ancestor at a time by a uniform ancestor. The ancestor under consideration from the ‘rational model’ is denoted by λx and has alphabet size Kx . The corresponding ancestor from the ‘uniform model’ is denoted by λ0x and has alphabet size Kx0 . 
The probabilities P (λx = j) ∈ Q of the ancestor λx can be written as P (λx = j) = zj , Z (4.110) all with the common denominator Z. In order to model this ancestor by a uniform λ0x , we choose for λ0x the alphabet size Kx0 = Z. Furthermore, for all observables Ar that depend on the ancestor λx , we define the conditional probabilities P (Ar | ..., λ0x = j 0 , ...) := P (Ar | ..., λx = j, ...) , (4.111) for exactly zj outcomes j 0 of the uniform ancestor λ0x . To be explicit, we Pj P choose the outcomes j 0 = j−1 l=1 zl . We further define αj := l=1 zl + 1, ..., Pj 0 l=1 zl (for j = 0, ..., Kx ; α0 = 0). Starting with the uniform ancestor λx , one obtains X Z X Y P A1 | ..., λ0x = j 0 , ... ...P (An | ..., λ0x = j 0 , ...) P (λ0x = j 0 ) P (λy ) { λy }y6=x j 0 =1 = X Kx X y6=x αj 1 Y P A1 | ..., λx = j, ... ...P (An | ..., λx = j, ...) P (λy ) Z X { λy }y6=x j=1 j 0 =αj−1 +1 = X Kx X y6=x P A1 | ..., λx = j, ... ...P (An | ..., λx = j, ...) { λy }y6=x j=1 = X y6=x P (λx =j) P A1 | { λy }y| A1 { λy }y zj Y P (λy ) Z |{z} Y ...P An | { λy }y|An P (λy ) (4.112) y In this calculation, the other ancestors λy , y 6= x can be either uniform or rational. Thus, the procedure can be applied to one ancestor at a time, allowing us to stepwise model a distribution with all ancestors rational by a 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 107 distribution with all ancestors uniform. In this way, any distribution of the form (4.109) with rational ancestor-probabilities P (λx ) can be obtained. In fact, this implies that the families of distributions defined by (4.108) and (4.109) are the same. In both models the same distributions P (A1 , ..., An ) can be realized. Step 3: From rational to arbitrary real P (λx ) In the first part of this step we show explicitly that an arbitrary real distribution of an ancestor λx can be obtained as a limit of rational valued distributions. We then have to show that this limit procedure respects inequality (4.22). Step 3a: Real distributions as limits of rational distributions Proposition 4.5. Any distribution of the form (4.109) with real valued ancestor-probabilities P (λx = j) ∈ R (of finite alphabet size) can be obtained as a limit of a sequence of such distributions with rational valued ancestor-probabilities Pk (λx = j) ∈ Q. Proof. As in the second step we can treat each ancestor λx separately. We use the fact that any real number, in particular any irrational number, can be written as the limit of a sequence of rational numbers. When generalizing this concept to a whole probability distribution, we have to take into account that probabilities have to sum to unity and be non-negative. Consider an arbitrary λx with finite alphabet size Kx . Assume that N ≤ Kx of the probabilities P (λx = j) are irrational and write them in decreasing order 1 > p(1) ≥ p(2) ≥ ... ≥ p(N ) > 0. (4.113) The possibly rational probabilities get the superscripts N + 1, ..., Kx in any order. For simplicity, define the sum of these rational probabilities as q := p(N +1) + ... + p(Kx ) ∈ Q. (4.114) Note that we can assume N ≥ 2, since N = 0 is trivial and N = 1 is impossible. In the latter case one irrational number and many rational numbers would have to sum to one, which is not possible. 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 108 (j) Denote by n o the sequence of rational numbers that converges to a single p (j) (j) (j) pk . For j = N + 1, ..., Kx we can simply choose pk = p ∀k. 
k∈N To make sure that for each n the probabilities sum to one, we approach p(1) , ..., p(N −1) from below and p(N ) from above. If the decimal expansion of p(1) is p(1) = 0.n1 n2 n3 ..., (4.115) then choose the sequence n (1) pk o k = {0.n1 , 0.n1 n2 , 0.n1 n2 n3 , ...} . (1) (1) (1) (4.116) 0 ≤ pk < 1. Thus, By construction pk ∈ Q, pk ≤ p(1) and nin particular o (j) (1) for j = 2, ..., N − 1 can be pk is a valid probability. The sequences pk k (N ) defined analogously. Finally, for p define (N ) pk =1−q− N −1 X (j) (4.117) pk . j=1 (j) (N ) (j) Since q ∈ Q and each pk ∈ Q we also have pk ∈ Q. Since pk ≤ p(j) for P −1 (j) (N ) = p(N ) > 0. Also, j = 1, ..., N − 1 we have pk ≥ 1 − q − N j=1 p (N ) by construction pk ≤ 1. In particular, the total distribution satisfies PN (j) (j) q + j=1 pk = 1. Hence, the pk form a valid, rational valued probability distribution with the desired convergence to an arbitrary, initially fixed distribution with potentially irrational probabilities. To be on the safe side, since the sequence of the whole distribution consists of Kx (or effectively N ) single sequences, the alphabet size Kx should be finite. Successively performing this procedure for all ancestors, we obtain the full set of distributions that can be written according to (4.109). At last, this the whole family of distributions compatible with the given hidden common ancestor model. Step 3b: Limit respects the inequality The transformation in Step 1 was guaranteed to respect the inequality due to Proposition 4.6. In Step 2 we simply showed the equivalence of two models of distributions, implying that the target-model inherits the compatibility with 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 109 the inequality from the starting-model. Here, we have to take additional care since the employed limit procedure is covered by neither of the two previous explanations. Fortunately, this can be done with a rather general and simple statement about limits of continuous functions. The details of the construction of the sequences from Step 3a are not relevant. Lemma 4.7. The limit procedure from Proposition 4.5 respects inequality 1 n (4.22), X A :...:A ≥ 0. Proof. It is again sufficient to consider the limit procedure separately for all ancestors. For a single ancestor, we have to show that the matrix 1 n 1 n X A :...:A ≡ X A :...:A [P (λx )] (first defined in (4.8)) corresponding to the limit distribution P (λx ) is positive semidefinite. We assume (or rather know from the previous steps of the proof) that this is true for the ma1 n 1 n trices XkA :...:A ≡ X A :...:A [Pk (λx )] corresponding to each element of the sequence of rational valued distributions Pk (λx ). To prove the statement for the limit distribution we employ the definition of positive semidefiniteness, 1 n i.e. we show v † X A :...:A v ≥ 0 for all complex valued v of suitable dimension. Each element of an X-matrix is of the form δll0 PAj (l) − PAj (l) PAj (l0 ) or PAj ,Aj0 (l, l0 ) − PAj (l) PAj0 (l0 ). Each single probability is of the form (4.109) with additional marginalization over all but one or two observables. This means that each single probability and hence each matrix element is a polynomial of the ancestor-probabilities P(k) (λx = j). The expectaA1 :...:An tion value v † X(k) v is a linear combination of the matrix elements and thus also a polynomial of the ancestor-probabilities. This means in particA1 :...:An ular that v † X(k) v is a continuous function of the ancestor-probabilities P(k) (λx = j). 
From this continuity (and P (λx ) = limk→∞ Pk (λx )) it follows that 1 n 1 n v † X A :...:A v = lim v † XkA :...:A v. (4.118) k→∞ 1 n But this implies that the global bound for v † XkA :...:A v ≥ 0 (global mean1 n ing for all k) is also true for v † X A :...:A v. The negation of this statement, 1 n v † X A :...:A v < 0, could easily be shown to contradict the definition of continuity and/or convergence. In combination with all the previous steps of the proof, we have shown that 1 n X A :...:A ≥ 0 for all X-matrices arising from distributions given by (4.109). This concludes the proof of Theorem 4.1. 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 110 4.4.5 Brief summary of the proof Since the proof of Theorem 4.1 consisted of a lot of individual steps, we briefly sum up the line of arguments and give an overview of all involved propositions and major lemmata. Theorem 4.1 states that all distributions compatible with a given hidden common ancestor model with n observables connected by ancestors of degree up to m satisfy inequality (4.22), XA 1 :...:An (m − 1) M A 2 1 M A :A .. = . .. . n 1 M A :A 1 1 M A :A 2 MA 2 0 .. . 0 ··· 0 ... .. . ··· 1 · · · M A :A ··· 0 .. ... . .. . 0 n 0 MA 1 n ≥ 0. n Proposition 4.1 The equivalence of the inequalities X A :...:A ≥ 0 and n √ √ X 1 i i i 1 A1 :...:An Y = M A1 M A :A M A M A :A M A1 ≤ (m − 1) 1A1 i=2 is shown. This allows us to freely choose the more convenient representation in any of the following steps. Proposition 4.2 (A.1) A specific family of distributions is shown to satisfy the inequality. The ancestors λx are modeled by a collection of perfectly correlated subvariables, one for each observable connected by the ancestor. The final observables are defined as the joint of all their subvariables. In case of the triangular scenario this family of distribution reads P (A, B, C) = P (A1 , A2 , B1 , B2 , C1 , C2 ) 1 δA B δB C δC A . = KAB KAC KBC 1 2 1 2 1 2 Proposition 4.3 (A.2) Starting with the above family of distributions (for a general hidden common ancestor model) all distributions of the form P A1 , ..., An = X { λx }x P A1 | { λx }x| A1 ...P An | { λx }x|An Y x 1 Kx can be reached by local transformations. The terms 1/Kx correspond to uniform ancestor-probabilities P (λx = j) = 1/Kx . 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 111 Lemma 4.6 The local transformations performed in the previous proposition respect the inequality. Thus, all distributions of the above form satisfy the inequality. Proposition 4.4 The families of distributions P A1 , ..., An = X P A1 | { λx }x| { λx }x A1 ...P An | { λx }x|An Y P (λx ) , x once with rational ancestor-probabilities P (λx = j) ∈ Q and once with uniform ancestor-probabilities P (λx = j) = 1/Kx coincide. Thus, all distributions of the above form with rational ancestor-probabilities satisfy the inequality. Proposition 4.5 Any distribution of the form P A1 , ..., An = X { λx }x P A1 | { λx }x| A1 ...P An | { λx }x|An Y P (λx ) x with arbitrary real ancestor-probabilities P (λx = j) ∈ R can be obtained as the limit of a sequence of such distributions with rational ancestor-probabilities. In this way we obtain the whole family of distributions compatible with the corresponding hidden common ancestor model. Lemma 4.7 The limit procedure from the previous proposition respects the inequality. Thus, all distributions compatible with the corresponding hidden common ancestor model satisfy the inequality. This is exactly the statement of Theorem 4.1. 
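To make Step 2 of the above summary (Proposition 4.4) concrete, the following minimal Python sketch expands an ancestor with rational probabilities $z_j/Z$ into a uniform ancestor with $Z$ outcomes by duplicating the corresponding columns of the conditional-probability tables, as in (4.111). The function names and the simplified setting (a single ancestor whose dependent observables are conditioned on nothing else) are our own illustrative assumptions.

```python
import numpy as np
from fractions import Fraction
from functools import reduce
from math import gcd

def uniformize_ancestor(p_lambda, cond_tables):
    """Replace an ancestor with rational probabilities z_j / Z by a uniform
    ancestor with Z outcomes; each conditional column j is duplicated z_j times."""
    fracs = [Fraction(p).limit_denominator() for p in p_lambda]
    Z = reduce(lambda a, b: a * b // gcd(a, b), [f.denominator for f in fracs])
    counts = [int(f * Z) for f in fracs]                    # the integers z_j
    new_tables = [np.repeat(T, counts, axis=-1) for T in cond_tables]
    return np.full(Z, 1.0 / Z), new_tables

# toy example: one observable A with P(A | lambda), ancestor probabilities (1/3, 2/3)
P_A_given_lam = np.array([[0.9, 0.2],
                          [0.1, 0.8]])                      # columns indexed by lambda
p_lam = [1/3, 2/3]
p_unif, (T,) = uniformize_ancestor(p_lam, [P_A_given_lam])
print(T @ p_unif, P_A_given_lam @ np.array(p_lam))          # identical marginals P(A)
```

Both calls return the same marginal distribution of A, which is exactly the content of the calculation (4.112) in this toy case.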
Note that while not explicitly necessary for all steps of the proof, the alphabets of the observables as well as of the ancestors should be discrete and finite. Concerning the observables, already the M -matrices as the basic constituent parts of the inequality require finite and in particular discrete alphabets. Concerning the ancestors, already the subvariables in the family of distributions from Proposition 4.2 were always treated as discrete and finite. In Proposition 4.3 the ancestors inherited the alphabet sizes of these subvariables. In Proposition 4.4 we exploited that all the rational probabilities P (λx = j) of one ancestor can be expressed as fractions with the same 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 112 (finite) denominator. This implicitly requires a finite number of probabilities P (λx = j) as well. In Proposition 4.5 the sequence of rational valued distributions Pk (λx ) → P (λx ) consists of one sequence per outcome λx = j. All these sequences are taken simultaneously. To be on the safe side, the number of sequences, and thus the alphabet size, should be finite. 4.5 Comparison between matrix and entropic inequality The equivalent inequalities X A Y A1 :...:An = n √ X M A1 M A 1 :...:An 1 :Ai i ≥ 0 and i M A M A :A 1 √ M A1 ≤ (m − 1) 1A1 (4.119) i=2 (see inequalities (4.22) and (4.32)), based on generalized covariance matrices, have been derived as an analog to the entropic inequality n X I A1 ; Ai ≤ (m − 1)H A1 (4.120) i=2 (originally (4.1)). One major purpose to derive the matrix inequality was that in this framework the inequality might be stronger than the corresponding entropic inequality (meaning that the set of distributions compatible with the former is smaller than the set of distributions compatible with the latter). A reasonable motivation for this assumption was the fact that already the elementary inequalities (see Section 3.1) constraining the entropies of any set of random variables entail only an outer approximation. On the 1 n other hand, the starting point of the derivation of X A :...:A ≥ 0 was rather arbitrary and primarily motivated by the analogy to the known entropic inequality. It is therefore difficult to estimate the degree of approximation inherent to the matrix inequality. In this section we compare the strengths of the entropic and the matrix inequality by considering some exemplary families of distributions. For a simple family (but flexible with respect to the number of observables and their alphabet size) the comparison can be done analytically (or with precise numerics). The elaboration of this case is presented in Subsection 4.5.1. More general distributions for which the comparison is performed via Monte 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 113 Carlo simulations are considered in Subsection 4.5.2. Finally, in Subsection 4.5.3, we simulate hypothesis tests based on the matrix inequality. The results are compared to the analogous entropic tests examined exhaustively in Section 3.3. 4.5.1 Analytical investigation We consider the mixture of the perfectly correlated and the uniform distribution, 1 δA1 ,...,An + (1 − v) n . (4.121) P A1 , ..., An = v K K The parameter v can be referred to as the visibility of the correlated distribution. In that context, the uniform distribution can be considered as noise. For v = 0 the distribution will be compatible with any DAG and satisfy any inequality. For v = 1 the distribution will be incompatible with any (non-trivial) DAG and violate any (non-trivial) inequality. 
Concerning the attribute ‘non-trivial’, recall that a DAG where all observables share one common ancestor puts no constraint on the compatible distributions. For the comparison between the two inequalities we are interested in the critical value vc for which compatibility with a given inequality changes. The smaller the value vc the less distributions satisfy the inequality. This means that the inequality with the smaller vc imposes a stronger constraint on the set of compatible distributions and thus also leads to the closer approximation to the true set of distributions compatible with a corresponding DAG. A graphical illustration is provided by Figure 20. In order to simply compare the matrix and the entropic inequality we do not need to choose a specific DAG that is constrained by the inequalities. If we nevertheless wanted to do so, any pair of observables should at least share one ancestor. Otherwise the DAG would demand two observables without a common ancestor do be independent (see Subsection 2.2.4). The distributions (4.121) with v > 0 would clearly violate this constraint. Furthermore, we should assume that one of the ancestors of A1 indeed has degree m. Otherwise, we could replace m by m0 , the maximal degree of A1 ’s ancestors. See also the annotations below Theorem 4.1. Note that we do not consider the family of ‘flip distributions’ (3.32) from Chapter 3, because there exists no straightforward generalization to larger alphabets. 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 114 DAG description pure noise perfect correlation vcDAG v=0 vcineq v=1 inequality description Figure 20: Set relation of distributions compatible with a DAG and distributions compatible with a corresponding inequality description. The distribution (4.121) with v = 1 is incompatible with both descriptions. When decreasing v, at the critical value vcineq compatibility with the inequality is established. At some unknown value vcDAG ≤ vcineq the distribution starts to be compatible with the DAG as well. The smaller vcineq the better is the inequality description. Matrix description We start by considering the matrix inequality Y A :...:A ≤ (m − 1) 1A1 (which 1 n turns out to be simpler than the equivalent inequality X A :...:A ≥ 0). Due j 1 j to the symmetry of the distributions, the matrices M A and M A :A do not depend on j. The marginal distributions read 1 n δkl 1 + (1 − v) 2 , K K 1 PAj (k) = . K PA1 ,Aj (k, l) = v (4.122) (4.123) For the M -matrices one obtains j A Mk,l = δkl PAj (k) − PAj (k) PAj (l) 1 1 = δkl − 2 , K K (4.124) 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 115 and 1 A :A Mk,l j = PA1 ,Aj (k, l) − PA1 (k) PAj (l) 1 1 1 = vδkl + (1 − v) 2 − 2 K K K 1 1 = vδkl − v 2 . K K We see in particular that M A inequality (4.32) reads 1 :Aj j = vM A . By further defining M := M A √ M vM M vM M ≤ (m − 1) 1 √ −1 √ −1 ⇔ v 2 (n − 1) M M M −1 M M ⇔ v 2 (n − 1) 1 ⇔ v 2 (n − 1) ≤ ≤ ≤ (m − 1) 1 (m − 1) 1 (m − 1) n √ X (4.125) j j=2 s ⇔ v ≤ m−1 . n−1 (4.126) From the first to the second line we used that M has full rank which implies that the pseudoinverse is the ordinary inverse. From the second to the third √ −1 √ −1 line we used M M −1 = 1 and M M M = 1. According to inequality (4.126) the critical value in the matrix framework is s vcmat = m−1 . n−1 (4.127) Note that vcmat is independent of the alphabet size K. 
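The analytic result (4.127) can be cross-checked numerically. The sketch below is our own code; as before, the square roots and the middle factors in (4.32) are implemented via pseudo-inverses. It builds the mixture (4.121) for the triangular scenario and scans the visibility v; the endpoint v = 1 is precisely the perfectly correlated counter example of Subsection 4.4.3.

```python
import numpy as np

def Y_max_eig(P):
    """Maximal eigenvalue of the left hand side of (4.32) for a joint
    distribution P of n variables sharing a common alphabet."""
    n = P.ndim
    p1 = P.sum(axis=tuple(range(1, n)))
    M1 = np.diag(p1) - np.outer(p1, p1)
    w, V = np.linalg.eigh(M1)
    S = (V * np.where(w > 1e-12, 1 / np.sqrt(np.clip(w, 1e-12, None)), 0)) @ V.T
    Y = np.zeros_like(M1)
    for j in range(1, n):
        P1j = P.sum(axis=tuple(a for a in range(n) if a not in (0, j)))
        pj = P1j.sum(axis=0)
        M1j = P1j - np.outer(p1, pj)
        Mj = np.diag(pj) - np.outer(pj, pj)
        Y += S @ M1j @ np.linalg.pinv(Mj) @ M1j.T @ S
    return np.linalg.eigvalsh(Y).max()

n, m, K = 3, 2, 5                            # triangular scenario, arbitrary K
corr = np.zeros((K,) * n)
corr[tuple(np.arange(K) for _ in range(n))] = 1.0 / K
unif = np.full((K,) * n, 1.0 / K ** n)

for v in (0.70, 0.71, 1.0):                  # v = 1.0: counter example of 4.4.3
    print(v, Y_max_eig(v * corr + (1 - v) * unif) <= m - 1 + 1e-9)
```

The verdict flips between v = 0.70 and v = 0.71, i.e. at $v_c^{\mathrm{mat}} = \sqrt{1/2} \approx 0.707$, and the transition point does not move when K is changed, in agreement with (4.127).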
Entropic description

In order to evaluate inequality (4.1) note that, similar to the matrix framework, the mutual information $I(A^1;A^j)$ and the entropies $H_1 := H(A^j)$ and $H_2 := H(A^1,A^j)$ are independent of $j$. Further using $I(A^1;A^j) = H(A^1) + H(A^j) - H(A^1,A^j)$, inequality (4.1) can be written as

$$\sum_{i=2}^{n} I(A^1;A^i) \le (m-1)\,H(A^1)
\;\Leftrightarrow\; (n-1)\,(2H_1 - H_2) \le (m-1)\,H_1
\;\Leftrightarrow\; 2 - \frac{H_2}{H_1} \le \frac{m-1}{n-1}. \qquad (4.128)$$

Note that we have $H_1 \neq 0$ (see below) and $n-1 \neq 0$ since a hidden common ancestor model with only one observable is rather boring. Also note that the inequality does not depend on the single values $n$ and $m$ but only on the ratio $\frac{m-1}{n-1}$. Using (4.122) and (4.123), the mono- and bi-partite entropies can be expressed as

$$H_1 = -\sum_{k=1}^{K} \frac{1}{K}\log\frac{1}{K} = \log K \qquad (4.129)$$

and

$$H_2 = -K\left(\frac{v}{K} + \frac{1-v}{K^2}\right)\log\left(\frac{v}{K} + \frac{1-v}{K^2}\right) - \left(K^2-K\right)\frac{1-v}{K^2}\,\log\frac{1-v}{K^2}. \qquad (4.130)$$

The first term in $H_2$ corresponds to the $K$ 'diagonal probabilities' ($k=l$ in the expression (4.122)) while the second term corresponds to the remaining $K^2-K$ probabilities. Even though the expression of $H_2$ seems rather complicated, it can be used in this form to numerically solve inequality (4.128) for arbitrary $n$, $m$ and $K$. By doing so, we can obtain the corresponding value $v_c^{\mathrm{ent}}$. In the limit $K\to\infty$, $v_c^{\mathrm{ent}}$ can be calculated analytically. To this end, we approximate the two terms of $H_2$ as

$$-K\left(\frac{v}{K} + \frac{1-v}{K^2}\right)\log\left(\frac{v}{K} + \frac{1-v}{K^2}\right)
= \left(v + \frac{1-v}{K}\right)\log K - \left(v + \frac{1-v}{K}\right)\log\left(v + \frac{1-v}{K}\right)
\;\xrightarrow{K \gg 1}\; v\log K - v\log v, \qquad (4.131)$$

since all terms proportional to $\frac{1-v}{K}$ vanish in the limit, and

$$-\left(K^2-K\right)\frac{1-v}{K^2}\log\frac{1-v}{K^2}
= -\left(1-\frac{1}{K}\right)(1-v)\left[-2\log K + \log(1-v)\right]
\;\xrightarrow{K \gg 1}\; 2(1-v)\log K - (1-v)\log(1-v). \qquad (4.132)$$

By combining (4.131) and (4.132) we obtain

$$H_2 \;\xrightarrow{K \gg 1}\; v\log K - v\log v + 2(1-v)\log K - (1-v)\log(1-v) = (2-v)\log K + h(v), \qquad (4.133)$$

where we identified $-v\log v - (1-v)\log(1-v)$ as the binary entropy function $h(v)$ (i.e. the entropy of a binary variable with probabilities $v$ and $1-v$). The binary entropy satisfies $0 \le h(v) \le \log 2$ (recall that any entropy is lower bounded by zero and upper bounded by $\log K$, where $K$ is the alphabet size). Inserting this limit of $H_2$ and $H_1 = \log K$ into inequality (4.128), one can solve the inequality according to

$$2 - \frac{H_2}{H_1} \le \frac{m-1}{n-1}
\;\Leftrightarrow\; 2 - \frac{(2-v)\log K + h(v)}{\log K} \le \frac{m-1}{n-1}
\;\Leftrightarrow\; 2 - (2-v) - \frac{h(v)}{\log K} \le \frac{m-1}{n-1}
\;\xrightarrow{K \gg 1}\; v \le \frac{m-1}{n-1}. \qquad (4.134)$$

The critical value for the entropic inequality and $K\to\infty$ is thus

$$v_c^{\mathrm{ent}} \;\xrightarrow{K\to\infty}\; \frac{m-1}{n-1}. \qquad (4.135)$$

Comparison $v_c^{\mathrm{ent}}$ vs. $v_c^{\mathrm{mat}}$

First, consider the case of large alphabets $K\to\infty$. In this case we have $v_c^{\mathrm{mat}} = \sqrt{\frac{m-1}{n-1}}$ as opposed to $v_c^{\mathrm{ent}} = \frac{m-1}{n-1}$. Recall that we can restrict to the case $m < n$: equality would mean that there exists one ancestor common to all observables, any distribution could be realized by this scenario, and both inequalities would be trivially satisfied for any $v$. For $m < n$ we have $\frac{m-1}{n-1} < 1$ and thus $v_c^{\mathrm{ent}} = \frac{m-1}{n-1} < \sqrt{\frac{m-1}{n-1}} = v_c^{\mathrm{mat}}$, meaning that the entropic inequality is stronger than the matrix inequality in the limit of large alphabets. The smaller the ratio $\frac{m-1}{n-1}$, the more significant is the advantage of the entropic inequality (in terms of the ratio $v_c^{\mathrm{mat}}/v_c^{\mathrm{ent}}$).

For smaller alphabets we start by considering the triangular scenario, i.e. $n=3$ and $m=2$. Independently of the alphabet size $K$, the critical value of the matrix inequality is $v_c^{\mathrm{mat}} = \sqrt{1/2} \approx 0.707$.
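For finite K, inequality (4.128) with (4.129) and (4.130) has to be solved numerically. A minimal sketch of such a solver (our code; the natural logarithm is used, but the base cancels in the ratio $H_2/H_1$ anyway):

```python
import numpy as np
from scipy.optimize import brentq

def v_c_entropic(n, m, K):
    """Critical visibility of the mixture family (4.121) for the entropic
    inequality (4.1): solves 2 - H2/H1 = (m-1)/(n-1) for v."""
    H1 = np.log(K)
    def gap(v):
        p_diag = v / K + (1 - v) / K ** 2       # the K 'diagonal' probabilities
        p_off = (1 - v) / K ** 2                # the remaining K^2 - K probabilities
        H2 = -K * p_diag * np.log(p_diag) - (K ** 2 - K) * p_off * np.log(p_off)
        return 2 - H2 / H1 - (m - 1) / (n - 1)
    return brentq(gap, 1e-9, 1 - 1e-9)

for K in (2, 3, 10, 11, 100):                   # triangular scenario: n = 3, m = 2
    print(K, round(v_c_entropic(3, 2, K), 3))
```

This reproduces the values listed in the following table.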
By numerically solving inequality (4.128) with $H_1 = \log K$ and $H_2$ from (4.130), we obtain the K-dependent value $v_c^{\mathrm{ent}}$:

    K                2        3        10       11       100      10^100
    v_c^ent (approx) 0.780    0.761    0.709    0.705    0.637    0.503

For $K \le 10$ we find $v_c^{\mathrm{mat}} < v_c^{\mathrm{ent}}$, meaning that the matrix inequality is stronger than the entropic inequality in the regime of small alphabets. In the case of binary variables, which was considered for the entropic hypothesis tests in Chapter 3, the advantage of the matrix inequality is rather significant. For $K > 10$ the advantage changes in favour of the entropic inequality. The value of $v_c^{\mathrm{ent}}$ for $K = 10^{100}$ shall illustrate that the convergence $v_c^{\mathrm{ent}} \to \frac{1}{2}$ for $K \to \infty$ is rather slow.

It seems to be generally true that for binary variables the matrix inequality is always stronger than the entropic inequality. While we present no general proof of this statement, Figure 21 strongly supports the claim. The same observation can be made for $K = 3$. For $K \ge 4$, on the other hand, we are able to find scenarios with $v_c^{\mathrm{ent}} < v_c^{\mathrm{mat}}$.

Finally, we investigate the critical alphabet size $K_c$ at which the entropic inequality becomes stronger than the matrix inequality. To be precise, $K_c$ is defined as the alphabet size satisfying $v_c^{\mathrm{mat}} \le v_c^{\mathrm{ent}}$ for $K \le K_c$ and $v_c^{\mathrm{ent}} < v_c^{\mathrm{mat}}$ for $K > K_c$. Figure 22 shows $K_c$ as a function of the ratio $\frac{m-1}{n-1}$. Starting at $K_c = 3$ for $\frac{m-1}{n-1} \to 0$, $K_c$ is an increasing function of $\frac{m-1}{n-1}$. For $\frac{m-1}{n-1} \to 1$, $K_c$ diverges to extremely large values.

Figure 21: Comparison of the critical values $v_c^{\mathrm{ent}}$ and $v_c^{\mathrm{mat}}$ as a function of $\frac{m-1}{n-1}$ in the binary case $K = 2$. We considered the values $\frac{m-1}{n-1} = \frac{k}{1000}$ for $k = 1, ..., 1000$. For $\frac{m-1}{n-1} = 1$ both critical values coincide at $v_c = 1$. For $\frac{m-1}{n-1} < 1$ we uniformly obtain $v_c^{\mathrm{mat}} < v_c^{\mathrm{ent}}$. Note that the double-logarithmic plot uses the base-10-logarithm for the x-axis but the base-2-logarithm for the y-axis.

Figure 22: Critical alphabet size $K_c$, above which the entropic inequality is stronger than the matrix inequality, as a function of $\frac{m-1}{n-1}$ (the triangular scenario is marked in the plot). We considered the values $\frac{m-1}{n-1} = \frac{k}{100}$ for $k = 1, ..., 99$. For small $\frac{m-1}{n-1}$ the matrix inequality is stronger only for alphabets as small as $K \le 3$. For large $\frac{m-1}{n-1}$ the advantage of the matrix inequality extends to extremely large alphabets.

At this point it is instructive to discuss the meaning of the ratio $\frac{m-1}{n-1}$ for the DAG. A small value corresponds to a DAG with a large number of observables that are only weakly connected. A large ratio $\frac{m-1}{n-1}$ corresponds to a DAG with strongly connected observables. Here, weak and strong connectivity are to be understood in terms of the (relative) number of observables connected by a single ancestor; the number of ancestors plays no role. In that sense, the entropic inequality tends to be stronger for weakly connected graphs while the matrix inequality tends to be stronger for strongly connected graphs.

We can summarize the results as follows:

• For K = 2, 3 the matrix inequality is always stronger than the entropic inequality (by observation).

• For K → ∞ the entropic inequality is always stronger than the matrix inequality (by analytical proof).

• For increasing $\frac{m-1}{n-1}$ the advantage of the matrix inequality extends to larger alphabets (by observation).

Note that we arrived at these results by considering only one specific family of distributions.
The statements are not necessarily true in the general case and should thus be understood as tendencies rather than as strict rules.

4.5.2 Numerical simulations

In this subsection we compare the entropic and the matrix inequality for three different families of random distributions. As opposed to the precise calculations from the previous subsection, Monte Carlo simulations are required in this case. Due to the accompanying computational burden, we have to restrict the simulations to rather small alphabets and DAGs; in fact, we consider only the triangular scenario. In this case, according to the previous subsection, we expect the matrix inequality to be stronger for K ≲ 10. This statement, while not being universally true, can be confirmed on a qualitative level.

Model 1: Random from DAG + correlation

We construct distributions $P_{\mathrm{DAG}}$ according to the formula

$$P_{\mathrm{DAG}} = \sum_{\lambda_{AB},\lambda_{AC},\lambda_{BC}} P(A\,|\,\lambda_{AB},\lambda_{AC})\, P(B\,|\,\lambda_{AB},\lambda_{BC})\, P(C\,|\,\lambda_{AC},\lambda_{BC})\, P(\lambda_{AB})\, P(\lambda_{AC})\, P(\lambda_{BC}). \qquad (4.136)$$

Any distribution of this form is compatible with the triangular scenario and thus with both inequalities. To construct distributions that violate the inequalities, we mix $P_{\mathrm{DAG}}$ with the perfectly correlated distribution,

$$P_{\mathrm{final}} = v\,P_{\mathrm{corr}} + (1-v)\,P_{\mathrm{DAG}}, \qquad \text{where} \quad P_{\mathrm{corr}} = \frac{\delta_{A,B,C}}{K}. \qquad (4.137)$$

For v = 1 (v = 0) both inequalities will be violated (satisfied). The common alphabet size of all observables is K. For all the pairwise ancestors we choose the alphabet size K². Due to this, we have K⁵ conditional probabilities (e.g. $P(A\,|\,\lambda_{AB},\lambda_{AC})$) for each observable. We assume that the observables are deterministic functions of the ancestors, such that each conditional probability is either 0 or 1. To make sure that, for example, for fixed $\lambda_{AB}$ and $\lambda_{AC}$ exactly one output of A has probability 1, we generate a K² × K² random matrix with entries k = 0, ..., K−1. The matrix elements specify to which output of A a given combination of $\lambda_{AB}$, $\lambda_{AC}$ is mapped. The elements are drawn from a binomial distribution Bin(K−1, p),

$$p_{\mathrm{Bin}(K-1,p)}(k) = \binom{K-1}{k}\, p^{k}\, (1-p)^{K-1-k} \qquad \text{for } k = 0, ..., K-1,\; p \in [0,1]. \qquad (4.138)$$

The parameter p is uniformly distributed in [0,1], fixed for one matrix but different for different matrices (there is one matrix for each observable). To generate the marginals of the ancestors, each of the K² probabilities $P(\lambda_x)$ (per ancestor) is drawn uniformly from [0,1] with subsequent normalization. For K = 2, 3, 4 we generate 1000 distributions $P_{\mathrm{DAG}}$ and consider mixing parameters v = k/20, k = 1, ..., 19. Figure 23 shows the number of distributions violating the respective inequalities as a function of v (a code sketch of this sampling procedure is given below the figure caption).

Figure 23: Numbers of distributions (out of 1000), generated according to Model 1, violating the entropic and the matrix inequality, respectively, for K = 2, 3, 4. In addition to the matrix inequality $X^{A^1:...:A^n} \ge 0$ we also considered the special case of the covariance inequality (4.43) with alphabet {0, ..., K−1} for all observables. For K = 2 the results for the covariance inequality and the general matrix inequality coincide. For K = 3, 4 the matrix inequality is significantly stronger than the covariance inequality. See also the comment concerning this gap in the last paragraph of Subsection 4.3.4. The matrix inequality is typically also stronger than the entropic inequality, the difference being more significant for larger K.
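The sampling procedure of Model 1 can be summarized in a few lines of Python (a sketch under the assumptions stated above; function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def model1_distribution(K, v):
    """One Model-1 distribution: P_final = v*P_corr + (1-v)*P_DAG, cf. (4.137)."""
    KA = K * K                                   # ancestor alphabet size K^2
    def response():                              # deterministic response function
        p = rng.uniform()                        # one p per observable, cf. (4.138)
        return rng.binomial(K - 1, p, size=(KA, KA))
    fA, fB, fC = response(), response(), response()
    def ancestor():                              # uniform [0,1] weights, normalized
        w = rng.uniform(size=KA)
        return w / w.sum()
    pAB, pAC, pBC = ancestor(), ancestor(), ancestor()
    P = np.zeros((K, K, K))
    for lab in range(KA):
        for lac in range(KA):
            for lbc in range(KA):
                P[fA[lab, lac], fB[lab, lbc], fC[lac, lbc]] += (
                    pAB[lab] * pAC[lac] * pBC[lbc])
    corr = np.zeros((K, K, K))
    corr[np.arange(K), np.arange(K), np.arange(K)] = 1.0 / K
    return v * corr + (1 - v) * P

P = model1_distribution(3, v=0.8)                # one example draw; P sums to one
```

Each generated distribution is then checked against the entropic and the matrix inequality, and the violations are counted per value of v.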
We observe that in most cases the matrix inequality is stronger than the entropic inequality. This is in accordance with the intuition that we gained in Subsection 4.5.1. There, we found that for the triangular scenario and a simple family of distributions, the matrix inequality is stronger than the entropic inequality for all K ≤ 10. Contrary to the intuition that the difference between the inequalities should diminish for increasing K, in Figure 23 the advantage of the matrix inequality is even larger for larger K. Since the required time for the simulations increases rapidly with K (≈ 7 seconds for K = 2, ≈ 80 seconds for K = 3 and ≈ 20 minutes for K = 4), we can unfortunately not consider much larger alphabets. Furthermore, the impression from Subsection 4.5.1, that for K = 2 the matrix inequality is always stronger than the entropic inequality, turns out not to be universally true: for K = 2 and small v we found a slight advantage of the entropic inequality. On a qualitative level, however, we can confirm that for the triangular scenario with small alphabets the matrix inequality is typically (significantly) stronger than the entropic inequality.

Model 2: Simple random distributions

We generate distributions P(A, B, C) by drawing each of the K³ probabilities independently from a simple probability distribution with subsequent normalization. Compared to the previous model this construction is considerably less expensive in terms of computation time. We are thus able to explore larger alphabet sizes, say up to K = 15. If we draw all K³ probabilities P(A, B, C) from a uniform distribution, the correlation between the variables A, B and C can be expected to be small. Indeed, we observe no violations of any inequality in this case (for K = 2, ..., 15). We therefore desire a distribution P(A, B, C) where only 'few' probabilities are significantly different from zero. Taking a look at the perfectly correlated distribution, one might expect that ≈ K non-zero probabilities lead to the strongest correlations. Since we do not want perfect correlations, we aim for K²/log K non-zero probabilities. This choice is rather arbitrary and is justified retrospectively by the sound results. To generate distributions of the desired type, we first decide for each of the K³ probabilities whether it should be zero or not. This is done by drawing a Bernoulli variable with success probability 1/(K log K). The expected number of successes, and thus of non-zero probabilities, is K³ · 1/(K log K) = K²/log K. Next, each probability chosen to be non-zero is drawn uniformly from the interval [0,1]. Since this procedure is rather an educated guess than a bullet-proof construction, there is no guarantee that the 'degree of correlation' will be the same for all considered alphabet sizes. We generate 10 000 distributions for K = 2, ..., 15 and for each K compute the numbers of distributions violating the two inequalities (a code sketch of the construction follows below). We observe (Figure 24) that for K ≥ 3 the matrix inequality is considerably more powerful than the entropic inequality. The reason for the small number of violations for K = 2 is most likely that the construction of the distributions does not lead to sufficient correlations in this case.
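A sketch of the Model 2 construction (our code; the text does not specify the base of the logarithm, we take the natural one here):

```python
import numpy as np

rng = np.random.default_rng(1)

def model2_distribution(K):
    """One Model-2 distribution: each of the K^3 probabilities is non-zero with
    probability 1/(K log K); non-zero entries are uniform in [0,1], then normalized."""
    mask = rng.random((K, K, K)) < 1.0 / (K * np.log(K))
    P = np.where(mask, rng.random((K, K, K)), 0.0)
    if P.sum() == 0.0:                 # can happen for small K: simply redraw
        return model2_distribution(K)
    return P / P.sum()

P = model2_distribution(10)            # one example draw
```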
Figure 24: Numbers of distributions (out of 10 000), generated according to Model 2, violating the entropic and the matrix inequality, respectively, as a function of K. For K ≥ 3 the matrix inequality is significantly stronger than the entropic inequality.

Based on the results from Subsection 4.5.1, the advantage of the matrix inequality could be expected for K ≤ 10. Figure 24 shows that for other families of distributions this advantage extends to larger alphabets. The gap between the inequalities (for K ≥ 3) is surprisingly large.

Model 3: Draw marginal, construct joint distribution

For the construction of distributions according to this model see also Subsection 3.2.5, in particular the paragraph 'Mutual information for small alphabets', where we compared different techniques for estimating mutual information. We begin by drawing the marginal distribution P(A). Each of the K probabilities is drawn independently from a beta distribution (see (3.27)) with parameters α = 0.1, β = 1. To ensure that the probabilities sum to one, we normalize at the end. Then, to construct the bi-partite marginal P(A, B), we set B = A with probability x and B uniform with probability 1−x,

$$P_{A,B}(k,l) = P_A(k)\left(x\,\delta_{kl} + \frac{1-x}{K}\right). \qquad (4.139)$$

For x = 1, A and B are maximally correlated; for x = 0 they are independent. The bi-partite marginal P(A, C) is defined in exactly the same way. Since the inequalities should only be violated for rather strong correlations, we consider the values x = 0.7 and x = 0.9. In both cases we draw 10 000 distributions for each alphabet size K = 2, ..., 15. Figure 25 shows the numbers of distributions violating the two inequalities as functions of K. Note that it turned out to be difficult to find values of α, β and x (or even define them as functions of K) such that we obtain similar numbers of violations for all K.

Figure 25: Numbers of distributions (out of 10 000), generated according to Model 3, violating the entropic and the matrix inequality, respectively (panels for x = 0.7 and x = 0.9). For x = 0.7 the matrix inequality is stronger than the entropic inequality. For x = 0.9 we obtain the opposite result, even though with smaller magnitude.

For x = 0.7 we find the familiar picture that the matrix inequality is stronger than the entropic inequality. For x = 0.9, on the other hand, it is exactly the other way around. Thus, we have found a clear example proving that for the triangular scenario with small alphabets the matrix inequality is not always stronger than the entropic inequality. Even in this case, however, the advantage of the entropic inequality is rather weak compared to the typical advantage of the matrix inequality (see the x = 0.7 case and Figures 23 and 24).

4.5.3 Hypothesis tests

With regard to the statistical emphasis of this thesis, the presumably most important comparison of the matrix framework and the entropic framework is in terms of the hypothesis tests the respective inequalities give rise to. We aim to perform the same simulations that were conducted in Section 3.3 for the entropic inequality (3.2). As the general setting, we consider the triangular scenario with binary observables and samples of size 50.
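The test statistic used in the following is the minimal eigenvalue of the matrix in inequality (4.140), defined in the next paragraph. As a preview, a sketch of how this statistic can be estimated from a finite sample (our code; the 'flip distribution' is built from the verbal description of (3.32), and the three structurally vanishing eigenvalues are identified as those of smallest magnitude):

```python
import numpy as np

rng = np.random.default_rng(2)

def flip_distribution(p_flip):
    """Binary 'flip distribution' (3.32): A = B = C perfectly correlated,
    then each variable is flipped independently with probability p_flip."""
    P = np.zeros((2, 2, 2))
    for s in (0, 1):
        for a in (0, 1):
            for b in (0, 1):
                for c in (0, 1):
                    P[a, b, c] += 0.5 * np.prod(
                        [(1 - p_flip) if x == s else p_flip for x in (a, b, c)])
    return P

def T_mat(P_hat):
    """Minimal-eigenvalue statistic of X^{A:B:C} (4.140) for a (possibly
    empirical) distribution P_hat; the three zero eigenvalues are dropped."""
    pA, pB, pC = P_hat.sum((1, 2)), P_hat.sum((0, 2)), P_hat.sum((0, 1))
    MA, MB, MC = (np.diag(p) - np.outer(p, p) for p in (pA, pB, pC))
    MAB = P_hat.sum(2) - np.outer(pA, pB)
    MAC = P_hat.sum(1) - np.outer(pA, pC)
    Z = np.zeros((2, 2))
    X = np.block([[MA, MAB, MAC], [MAB.T, MB, Z], [MAC.T, Z, MC]])
    w = np.linalg.eigvalsh(X)
    return np.delete(w, np.argsort(np.abs(w))[:3]).min()

P = flip_distribution(0.05)
data = rng.choice(8, size=50, p=P.reshape(-1))                 # one sample, N = 50
P_hat = np.bincount(data, minlength=8).reshape(2, 2, 2) / 50   # empirical distribution
print(T_mat(P_hat))
```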
The employed representation of the matrix inequality is the original inequality (4.22), which for the triangular scenario reads M A M A:B M A:C 0 = M B:A M B ≥ 0. C:A C M 0 M X A:B:C (4.140) Recall, that this inequality is the analog of the entropic inequality (3.2) based on the generalized covariance matrices (the M -matrices). In particular, both inequalities are based on bi-partite information alone. The entropic inequalities (3.3) and (3.4), on the other hand, require access to the full distribution P (A, B, C). In the following, we refer to inequality (4.140) as the ‘matrix inequality’ and to inequality (3.2) ((3.3), (3.4)) as the ‘first (second, third) entropic inequality’ To estimate X A:B:C from a data sample, we simply plug in the empirical distributions P̂ (A, B), P̂ (A, C), P̂ (A) , ... into the definitions of the M -matrices from (4.3) and (4.6). In order to test whether or not the matrix is positive semidefinite we calculate its minimal eigenvalue, n Tmat := min eigenvalues X A:B:C o . (4.141) If Tmat < 0, the matrix X A:B:C is not positive semidefinite. If Tmat ≥ 0, then also X A:B:C ≥ 0. Since in the binary case each M -matrix is a 2 × 2 matrix, X A:B:C has a total number of six eigenvalues. It turns out that three eigenvalues are always zero (in general one eigenvalue per observable). These eigenvalues will be ignored. Otherwise, the statistic Tmat would be upper bounded by zero, causing problems in the bootstrap simulations. The samples are drawn from the family of ‘flip distributions’ introduced in (3.32). Starting with A, B, C perfectly correlated, each variable is independently flipped with probability pflip . In Figure 26 Tmat is shown as a function (mat.) of pflip . For pflip < pflip = 0.0796 we obtain Tmat (pflip ) < 0, meaning that the matrix inequality is violated in this regime of flip probabilities. Recall, that for the entropic inequalities we found the critical flip probabilities (1.ent) (2.ent) (3.ent) pflip = 0.0584, pflip = 0.0750 and pflip = 0.0797 (see Figure 16). A larger value indicates a stronger inequality. Thus, the matrix inequality should be significantly stronger than the first entropic inequality, its direct 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 127 Tmat 0.2 0.1 0.05 0.10 0.15 pflip 0.20 -0.1 -0.2 n o Figure 26: Minimal-eigenvalue-statistic Tmat = min eigenvalues X A:B:C for the family of ‘flip distributions’ (3.32) as function of pflip . The matrix inequality is satisfied for pflip ≥ 0.0796. analog in the entropic framework. The strength of the matrix inequality even seems to be comparable to the second and third entropic inequalities which resort to tri-partite information. Recall that we introduced two different approaches to hypothesis testing, the direct and the indirect (or bootstrap) approach. For the complete introduction see Section 3.3. As in Section 3.3, we start by considering the direct approach. Direct approach For the direct approach we require a threshold value tmat . If for a data estimate T̂mat we observe T̂mat < tmat the null hypothesis h0 : ‘sample is compatible with the triangular scenario’ is rejected. Note that for the first entropic inequality it was exactly the other way around, i.e. the null hypothesis was rejected for T̂ent > tent . The reason is, that for distributions compatible with the triangular scenario, the statistic Tent was upper bounded by zero, while here the statistic Tmat is lower bounded by zero. The threshold has to be chosen such that the type-I-error rate is upper bounded by α = 0.05. 
This means, that at most 5% of samples stemming from a compatible distribution are allowed to falsely violate the inequality T̂mat < tmat . The threshold value is thus defined as the 5% quantile (95% quantile in the entropic case) of the supposed worst case distribution (the distribution with the smallest 5% 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 128 quantile among all distributions compatible with the DAG). In the entropic framework, reliably identifying the worst case distribution turned out to be the main problem of the direct approach. tmat 0.000 0.6 0.7 0.8 -0.005 -0.010 -0.015 -0.020 qC 1.0 ◼ 0.9 ◼ ◼ ◼ ◼◼ ◼◼◼◼◼◼◼◼◼◼◼◼◼ ◼ ◼ ◼ ◼ ◼ ◼ ◼ qAB=0.5 ◼ qAB=qC Figure 27: Supposed threshold value tmat calculated as the 5% quantile of the statistic T̂mat for underlying observables A = B ∼ (qAB 1 − qAB ) and C ∼ (qC 1 − qC ). For this family of distributions, it seems that the most extreme threshold value, tmat ≈ −0.019044, is indeed obtained in the uniform case qAB = qAC = 0.5. For each considered combination (qAB , qC ), the distribution of estimates T̂mat was reconstructed by drawing 100 000 samples. Here, we pursue the same approach that was considered in the entropic case in Subsection 3.3.2. The rationale was that the worst case distribution should satisfy Tent = 0 which should require A to be a deterministic function of B (by choice), and C should be independent of A and B. While a natural choice seemed to be A = B ∼ uniform and C ∼ uniform (see [16]), we generalized the approach to distributions A = B ∼ qAB 1 − qAB and C ∼ qC 1 − qC and found that the uniform case was not the worst case (see Figure 10). In contrast to this inconvenient observation, Figure 27 raises hope that in the matrix framework the optimal threshold value might indeed be obtained by the uniform distribution qAB = qC = 0.5. However, we have no proof. Employing the threshold value tmat = −0.019044, we simulate the hypothesis test for the family of ‘flip distributions’ (3.32). We are interested in the power of the test which is defined as the ratio of correctly rejected samples (i.e. T̂mat < tmat ) stemming from DAG-incompatible distributions. For 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 129 (mat.) (3.ent) values pflip < pflip = 0.0796 (or in fact pflip < pflip = 0.0797) the true distribution is known to be incompatible with the DAG. In this regime, the ratio of rejected samples is indeed the power of the test. For larger values of pflip , compatibility of the DAG and the distribution is not known. In Figure 28 we compare the ratio of rejected samples to the corresponding ratio of the entropic test from Subsection 3.3.2. rejection rate 1.0 ▲● ▲● ▲● ▲● ● ● ● ● ▲ 0.8 0.6 0.4 ▲ ▲ ● ● ▲ ● ● ● ▲ ▲ 1. ent. direct ● mat. direct ● ▲ ● ▲ ● ▲ ▲ 0.2 ● ▲ ● ▲▲ ▲▲ ● ● ▲▲ ● ●● ●●● ●●●● 0.02 0.04 0.06 0.08 0.10 0.12 0.14 pflip Figure 28: Comparison of rejection rates of the direct hypothesis tests based on the entropic inequality (3.2) and the matrix inequality (4.140). Each data point is based on 10 000 samples drawn from the ‘flip distribution’ (3.32) with (mat.) flip probability pflip . The vertical line marks the critical value pflip = 0.0796 below which the true distribution violates the matrix inequality and is thus incompatible with the triangular scenario. In this regime a large rejection rate (being the power of the test) is desired. (mat.) (1.ent) In accordance with pflip > pflip , the matrix test is significantly more powerful than the entropic test. 
The general shapes of the curves are however rather similar. A sharp step near the vertical line would have been preferable. In Subsection 3.3.2, we identified the large variance of estimates T̂ent (or now T̂mat ) as the reason for the rather flat curves. The similar shapes of the curves thus suggest that from a statistical point of view Tmat is not easier to estimate than Tent . The advantage of the matrix test is not due to simpler statistics, but due to the general superiority of the matrix inequality (for this family of distributions). 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 130 Indirect approach For the indirect approach no worst case distribution is needed. h Instead,i the min 0.95 point estimate T̂mat is equipped with a confidence interval T̂mat , T̂mat . If 0.95 < 0, the null hypothesis h0 : ‘sample is compatible with the we find T̂mat inequality Tmat ≥ 0’ is rejected. Recall, that this hypothesis is weaker than the null hypothesis from the direct approach, stating compatibility with the DAG. To construct the confidence interval we resort to bootstrapping. From the original empirical distribution so-called bootstrap samples are drawn. ∗ is used to estimate the upper The distribution of bootstrap estimates T̂mat endpoint of the confidence interval. To this end, we employ the advanced ‘BCa bootstrap’ technique that already yielded the most reliable results in the entropic case (Subsection 3.3.3). Regarding the results from the di(mat.) rect approach, or in general the critical flip probabilities pflip = 0.0796, (1.ent) (2.ent) (3.ent) pflip = 0.0584, pflip = 0.0750 and pflip = 0.0797, the matrix bootstrap test should be significantly more powerful than the bootstrap test using the first entropic inequality. In fact, the matrix bootstrap test is expected to be of comparable strength to the test based on the third entropic inequality. Figure 29 confirms these expectations. The matrix bootstrap test is even slightly more powerful than the third entropic bootstrap test and comes close (mat.) to the first entropic direct test. At pflip = 0.8 ≈ pflip the matrix bootstrap test rejects 5.2% of all samples. This suggests that the test correctly works at the 5% level. Since we do not know how trustworthy the threshold value lying at the heart of the direct test is, one might actually prefer the matrix bootstrap test over the first entropic direct test. The larger power compared to the third entropic bootstrap test deserves attention as well. After all, the third entropic inequality requires access to the full observable distribution P (A, B, C), while the matrix inequality requires only bi-partite information. 4.5.4 Summary In this section we have seen that for small DAGs and alphabets the matrix 1 n inequality (4.22), X A :...:A ≥ 0, is typically (but not always) stronger than the analogous entropic inequality (4.1). Concerning the simulated hypothesis tests for the triangular scenario, we made the following observations: • The bootstrap hypothesis test based on the inequality X A 1 :...:An ≥0 4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 131 rejection rate 1.0 ▲▼◼ ▲◼● ▲ ▲ 0.8 0.6 0.4 0.2 ▼ ● ▲ ◼ ● ◼●▲ ◼●▲ ◼●▲ ▼ ◼●▲ ▼ ◼●▲ ▼ ◼●▲ ▼ ◼●▲ ▼ ◼●▲ ◼●▲ ▼▼ ●▲ ◼◼ ●▲▲ ▼ ◼ ●◼ ●▲ ▼▼▼▼ ◼ ●▲ ●▲ ●●●●●● p ◼ ◼ ◼ ▼▼▼ ▼ ▼ ▼ flip ▲ 1. ent. direct ▼ 1. ent. boots ◼ 3. ent. boots ● mat. boots 0.02 0.04 0.06 0.08 0.10 0.12 Figure 29: Comparison of rejection rates of the matrix bootstrap test and several entropic tests. The matrix test is based on inequality (4.140). 
• The bootstrap hypothesis test based on the inequality X^{A1:...:An} ≥ 0 turned out to be more powerful than all considered bootstrap tests in the entropic framework (see Figure 29). The matrix test, properly controlling the type-I-error rate at 5%, is only slightly less powerful than the direct test based on the entropic inequality (3.2). The latter test does arguably not have the desired control of the type-I-error rate.

• The direct test based on the matrix inequality is significantly more powerful than the analogous test in the entropic framework (see Figure 28). Furthermore, Figure 27 suggests that for the matrix inequality the initially proposed worst case distribution (required for calculating the threshold value needed in the direct approach) might in fact be correct. This was not the case in the entropic framework. However, we have no proof. In general, the problems of finding the worst case distribution (in particular for larger DAGs and alphabets) remain.

5 Application to the iris data set

In this chapter we pursue the last goal of this thesis, an application of the developed methods to real data. This application serves primarily for illustrative purposes. The data that we are considering have already been studied intensively, so we will not contribute new results but rather try to reconstruct existing knowledge.

5.1 The iris data set

A collection of freely available data sets can be found online at the machine learning repository of the University of California [47]. Here, we are considering the iris data set, which is one of the simplest but also most famous data sets from the repository. The data set was introduced by Ronald Fisher in his paper The use of multiple measurements in taxonomic problems [48] from 1936. A long list of papers citing the iris data set can be found on the corresponding page of the machine learning repository. The data set even has its own article in the free online encyclopedia Wikipedia (https://en.wikipedia.org/wiki/Iris_flower_data_set).

Figure 30: Schematic representation of a blossom; 2: sepals, 3: petals. According to Meyers Konversationslexikon 1888, remade by Petr Dlouhý / Wikimedia, licensed under CC BY-SA 3.0. https://commons.wikimedia.org/wiki/File:Bluete-Schema.svg

The iris data set contains several size attributes of the blossoms of iris flowers. Amongst others, an iris blossom consists of so-called petals and sepals. The petals are the usually colorful leaves of the blossom when in bloom, while the sepals play a primarily protective role for the buds or a supportive role for the petals, see Figure 30. The data set lists the petal length, petal width, sepal length and sepal width of N = 150 iris flowers. In fact, the sample consists of three subsamples of size 50, each subsample corresponding to a different type of iris flower. In the fields of machine learning and pattern recognition the goal is to predict the type of iris flower using the four size attributes [47].
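For readers who wish to reproduce the numbers reported below, the data set can be obtained from the UCI repository [47]; one convenient way (a hypothetical aid of this rewrite, not a tool used in the thesis) is to load it via scikit-learn:

```python
from sklearn.datasets import load_iris

iris = load_iris()
measurements = iris.data        # 150 x 4: sepal length, sepal width, petal length, petal width [cm]
species = iris.target           # labels 0, 1, 2 -- the classification we pretend not to know
sl, sw, pl, pw = measurements.T # unpack the four attributes as separate arrays
```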
Here, we pretend that we do not know about the classification into the different types and propose a pairwise hidden common ancestor model for the four variables. We then try to reject this model by employing the full data set. In our model, the hidden ancestors could for example stand for genetic or environmental factors. While a rejection of our model would not necessarily imply the existence of the three different types of iris flowers (which could be considered as a common ancestor of all four attributes), it could at least be considered as a hint in this direction. In that sense, we would partially reconstruct the already existing knowledge of the three different types. While this application might not be the most spectacular one, and a pairwise hidden common ancestor model might seem artificial in this scenario, the data are well suited to illustrate our methods.

5.2 Discretizing the data

The first issue that we have to take care of is the continuity, or rather the precision, of the data. The M-matrices from the generalized covariance framework as well as our entropy estimation techniques require discrete variables. In the iris data set all four attributes are given in centimeters with one additional position after the decimal point. This suggests a natural discretization in steps of 0.1 cm, but the associated alphabet sizes would be too large for reliable estimation of the required quantities (in particular the bi-partite entropies and bi-partite distributions required for the M-matrices). More details about the marginal distributions of the four attributes are shown in Figure 31. Using a discretization in steps of 0.1 cm would lead to alphabet sizes of the single variables ranging from K = 25 up to K = 60. The joint distribution of the petal width and sepal width would have 625 possible outcomes. For the sample size N = 150 this would be disproportionately large.

Figure 31: Histograms of the marginal observation frequencies of the four attributes (petal length, petal width, sepal length, sepal width) in the iris data set. The table below shows the minimal and maximal values of the four sizes. The corresponding alphabet sizes K are obtained by assuming a discretization in steps of 0.1 cm between the minimal and maximal values. The last column shows the number of actually occurring different values, which is typically close to the corresponding K.

                 min. val. [cm]   max. val. [cm]   corres. K   # of diff. vals.
petal length          1.0              6.9             60             43
petal width           0.1              2.5             25             22
sepal length          4.3              7.9             37             35
sepal width           2.0              4.4             25             23

In fact, the entropy estimation results from Figure 6 for the (extremely) data-sparse regime suggest that the entropy estimation might still work. The simulated hypothesis tests from Section 3.3 and Subsection 4.5.3 for N = 50 and K = 2, on the other hand, suggest that we should keep the alphabets small. We thus choose a discretization resulting in the alphabet size K = 3 for all variables and define the thresholds separating the categories such that all categories have roughly 50 counts. For the sepal length, for example, all values ≤ 5.4 cm are assigned to the first category, all values > 6.2 cm to the third, and all values in between to the second category. The three categories then have 52, 47 and 51 counts.
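As an illustration of this discretization (a sketch assuming the arrays pl, pw, sl, sw from the loading snippet above; only the sepal-length thresholds are quoted in the text, the others are approximated here by empirical quantiles):

```python
import numpy as np

# Sepal length with the thresholds 5.4 cm and 6.2 cm quoted above:
# values <= 5.4 -> category 0, 5.4 < values <= 6.2 -> 1, values > 6.2 -> 2.
sl_cat = np.digitize(sl, [5.4, 6.2], right=True)
print(np.bincount(sl_cat))                  # roughly the 52, 47, 51 counts reported above

# For the other attributes, thresholds at the empirical 1/3 and 2/3 quantiles
# likewise yield categories holding about 50 flowers each.
def equal_count_bins(values, k=3):
    thresholds = np.quantile(values, [i / k for i in range(1, k)])
    return np.digitize(values, thresholds, right=True)

pl_cat, pw_cat, sw_cat = (equal_count_bins(v) for v in (pl, pw, sw))
```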
In light of the three different types of iris flowers, all with subsample size 50, this discretization seems natural. On the other hand, one might argue that the discretization is specifically tailored to support knowledge that we pretend not to have at this point. We therefore also considered discretizations where the categories are defined by equidistant steps between the minimal and maximal value of the respective variable, aiming for alphabet sizes K = 3 as well as K = 6. In all cases we obtained qualitatively the same results. Here we present only the originally proposed discretization. Note that since results might in general depend on the chosen discretization, our methods are best suited for data with inherently well defined categories. Such data could for example arise as the results of questionnaires with well defined choices for the answers, or in medical data where one only asks for the presence of a disease or a symptom.

5.3 Proposing a model

For choosing an appropriate pairwise hidden common ancestor model, we first check for independence relations between the variables. By default we allow that any pair of variables shares a common ancestor. If we find that two variables are independent, however, the faithfulness assumption from Subsection 2.2.2 suggests that these variables should have no ancestor in common. In order to decide whether or not two given variables, say A and B, are independent, we first estimate the mutual information Î(A; B). Since the mutual information is upper bounded by each of the marginal entropies Ĥ(A) and Ĥ(B), we further calculate the ratio

    Îr(A; B) := Î(A; B) / min{ Ĥ(A), Ĥ(B) },    (5.1)

which we call the relative mutual information of A and B. Îr(A; B) is bounded between zero and one. Since even a sample from a distribution with actually independent variables might not exactly satisfy Îr(A; B) = 0, we will consider any pair of variables for which Îr(A; B) ≤ 0.05 as independent. Being more careful, one might conduct hypothesis tests already at this point. However, the main intention of this section is to see the hypothesis tests of our inequality constraints in action, rather than to illuminate all details of the model construction. In addition, as can be seen in (5.2), most observed dependence relations are quite strong, indicating that hypothesis tests would most likely reject hypothesized independence relations.

Denoting the attributes as PL (petal length), PW (petal width), SL (sepal length) and SW (sepal width), we find the following relative mutual information values:

    Îr(PL; PW) = 0.79    Îr(PL; SL) = 0.46    Îr(PW; SL) = 0.41
    Îr(PL; SW) = 0.18    Îr(PW; SW) = 0.22    Îr(SL; SW) = 0.09    (5.2)

In all cases the relative mutual information is larger than our predefined threshold value of 0.05. Thus, we allow that all pairs of variables share a common ancestor. A graphical depiction of the model is provided by Figure 32. At first glance, the model might seem quite artificial, but in fact a general hidden common ancestor model is arguably more reasonable than a model with direct links between the observables. It would seem odd to assume that, for example, the petal width had a direct causal influence on the sepal length. Genetic and environmental common causes appear to be more natural. The restriction to pairwise ancestors, on the other hand, might not necessarily be one's first choice. Then again, recall that this application serves primarily for illustrative purposes.
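The ratio (5.1) is straightforward to compute from plug-in quantities. The sketch below (hypothetical code; the category arrays are those from the discretization sketch above) uses the maximum likelihood entropy estimate for brevity, whereas the thesis relies on the estimators discussed in Chapter 3, so the resulting numbers may differ slightly from (5.2).

```python
import numpy as np

def plugin_entropy(labels, base=2):
    """Maximum likelihood ('plug-in') entropy estimate of a discrete sample."""
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -(p * np.log(p)).sum() / np.log(base)

def relative_mutual_information(a, b):
    """Îr(A;B) = Î(A;B) / min{Ĥ(A), Ĥ(B)} with Î(A;B) = Ĥ(A) + Ĥ(B) - Ĥ(A,B)."""
    joint = a * (b.max() + 1) + b      # encode value pairs as single labels
    i_ab = plugin_entropy(a) + plugin_entropy(b) - plugin_entropy(joint)
    return i_ab / min(plugin_entropy(a), plugin_entropy(b))

# e.g. relative_mutual_information(sl_cat, sw_cat) should be of the order of the
# 0.09 reported in (5.2), up to differences between the entropy estimators.
```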
The model implies neither unconditional nor any conditional independence relations. The faithfulness assumption suggests that we should then not find any independence relations between the observables in the data. While we confirmed this for the unconditional independence relations (see (5.2)), we refrain from doing so in the conditional case. The main reason is that the estimation of the conditional mutual information I(A; B | C) (see (2.35)), involving the tri-partite entropy H(A, B, C), will be less reliable than the estimation of the unconditional mutual information involving at most bi-partite entropies. In particular, if we were actually to conduct hypothesis tests in order to find or reject conditional independence relations, with sample size N = 150 and tri-partite alphabet size K^3 = 27 the power of the tests might be rather small. Note that the inequality constraints that we test in the next subsection involve only bi-partite information as well. In that sense, checking only (unconditional) independence relations that require at most bi-partite information is consistent with our general approach.

Figure 32: Pairwise hidden common ancestor model for the four attributes petal length (PL), petal width (PW), sepal length (SL) and sepal width (SW) of the iris data set. The hidden variables are not further specified but could for example stand for genetic or environmental factors.

5.4 Rejecting the proposed model

The model from Figure 32 is a hidden common ancestor model with n = 4 observables and ancestors of degree m = 2. In this case the general entropic inequality (4.1) reads

    I(A^1; A^2) + I(A^1; A^3) + I(A^1; A^4) ≤ H(A^1),    (5.3)

where {A^1, A^2, A^3, A^4} can be any permutation of the observables {PL, PW, SL, SW}. Note that effectively there are only four different inequalities, since (5.3) is invariant under permutations of A^2, A^3 and A^4. Analogously, the general matrix inequality (4.22) reads

    X^{A1:A2:A3:A4} =
        ( M^{A1}      M^{A1:A2}   M^{A1:A3}   M^{A1:A4} )
        ( M^{A2:A1}   M^{A2}      0           0         )
        ( M^{A3:A1}   0           M^{A3}      0         )   ≥ 0,    (5.4)
        ( M^{A4:A1}   0           0           M^{A4}    )

with the M-matrices defined as in (4.3) and (4.6). In order for the data to be compatible with the model, all four inequalities (in the chosen framework) have to be satisfied. Thus, in general, four hypothesis tests have to be conducted. If we want the composite test to have a significance level of 5%, the single tests will generally have to aim for an even smaller type-I-error rate. In the case of k independent hypothesis tests one can calculate the required significance level α of the single tests via [49]

    α = 1 − (1 − ᾱ)^(1/k),    (5.5)

where ᾱ is the desired level of the composite test. For ᾱ = 0.05 and k = 4 one obtains α ≈ 0.013. The relation (5.5), also known as the Šidák correction [49, 50], is one of many methods to control the type-I-error rate of a composite hypothesis test. The general problem of simultaneous inference [51] and the required multiple comparison procedures [52] become drastically more complicated for dependent tests. In our case the tests are indeed not independent. The mutual information I(PL; PW), for example, appears in the test with A^1 = PL as well as A^1 = PW. In fact, any mutual information appears in exactly two entropic tests. In the matrix framework an analogous observation can be made for the bi-partite M-matrices.
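The quantities entering (5.4) and (5.5) can be evaluated along the following lines (a hypothetical sketch that reuses m_single and m_bipartite from the earlier Chapter 4 sketch and the category arrays from Section 5.2):

```python
import numpy as np
# reuses m_single, m_bipartite and the arrays pl_cat, pw_cat, sl_cat, sw_cat defined above

def joint_counts(x, y, k=3):
    cnt = np.zeros((k, k))
    np.add.at(cnt, (x, y), 1)
    return cnt

def x_matrix(a1, others, k=3, m=2):
    """Block matrix X^{A1:A2:...:An} of (5.4): (m-1) M^{A1} and the mono-partite
    M-matrices on the diagonal, the bi-partite M^{A1:Aj} in the first block row
    and column, zeros everywhere else."""
    n = len(others) + 1
    blocks = [[np.zeros((k, k)) for _ in range(n)] for _ in range(n)]
    blocks[0][0] = (m - 1) * m_single(np.bincount(a1, minlength=k))
    for j, aj in enumerate(others, start=1):
        blocks[j][j] = m_single(np.bincount(aj, minlength=k))
        blocks[0][j] = m_bipartite(joint_counts(a1, aj, k))
        blocks[j][0] = blocks[0][j].T
    return np.block(blocks)

T_mat = np.linalg.eigvalsh(x_matrix(pl_cat, [pw_cat, sl_cat, sw_cat])).min()
# compare with the estimate (5.7) reported below

# Sidak correction (5.5): single-test level for a composite level of 5% and k = 4 tests
alpha = 1 - (1 - 0.05) ** (1 / 4)       # ~0.0127, consistent with the ~0.013 quoted above
```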
We do not want to go into the details of multiple comparison procedures, but rather consider a single test, as we did for the simulations in Section 3.3 and Subsection 4.5.3. When trying to reject the model by employing a single inequality, it is reasonable to select for A^1 the variable that shows the strongest correlations with the other attributes. In the previous subsection we already listed the (relative) mutual information of all pairs of variables (see (5.2)). The largest values are obtained for the petal length (PL), closely followed by the petal width (PW). The sepal width (SW) shows by far the smallest correlations with the other variables. By choosing A^1 = PL we obtain the real data estimates

    T̂ent = Î(PL; PW) + Î(PL; SL) + Î(PL; SW) − Ĥ(PL) = 0.47    (5.6)

and

    T̂mat = min{ eigenvalues of X̂^{PL:PW:SL:SW} } = −0.14.    (5.7)

Both inequalities, Tent ≤ 0 as well as Tmat ≥ 0, are violated, which is a first indication that the data are incompatible with the model. Of course we still have to conduct the actual hypothesis tests. We decide to consider only the bootstrap tests, since the direct tests would require the calculation of threshold values above/below which the null hypothesis would be rejected. Note that we cannot use the values from Section 3.3 and Subsection 4.5.3, since they have been calculated for the triangular scenario with K = 2 and N = 50. Here, the model consists of four observables with K = 3 and N = 150. More importantly, as already discussed several times, the main problem of the direct approach would be that we could not be sure whether the threshold values were actually correct. When employing wrong threshold values, the tests would not have the desired significance level of 5%. The bootstrap tests, on the other hand, properly controlled the type-I-error rate at 5% in all previous simulations.

We draw B = 999 bootstrap samples and use the BCa method to estimate the required endpoints of the confidence intervals, T̂ent^0.05 in the entropic case and T̂mat^0.95 in the matrix framework. Figure 33 shows histograms of the bootstrap estimates T̂*ent and T̂*mat together with the estimated endpoints as well as the original estimates T̂ent and T̂mat. We find T̂ent^0.05 = 0.29 > 0 in the entropic framework and T̂mat^0.95 = −0.116 < 0 in the matrix framework. In both cases the null hypothesis (Tent ≤ 0 or Tmat ≥ 0, respectively) is rejected. This implies, in particular, that with our hypothesis tests we come to the conclusion that the data are incompatible with the proposed causal structure from Figure 32. The correlations between the variables are stronger than those allowed by a pairwise hidden common ancestor model. This means that there either has to be direct causal influence between the observables or, if we want to stick with a hidden common ancestor model, that ancestors of larger degree are required. Further tests showed that we cannot reject a model with ancestors of maximal degree m = 3. An ancestor of degree m = 4 could stand for the type of iris flower that we pretended not to know at the beginning of the model construction.

Figure 33: Histograms of the bootstrap estimates T̂*ent and T̂*mat (see (5.6) and (5.7)) for the iris data set. In the entropic case (left plot) the whole distribution lies in the regime T̂*ent > 0. Likewise, in the matrix case (right plot) all bootstrap estimates satisfy T̂*mat < 0. It is thus not surprising that also the estimated endpoints of the confidence intervals satisfy T̂ent^0.05 = 0.29 > 0 (entropic case) and T̂mat^0.95 = −0.116 < 0 (matrix case), meaning that both tests reject their respective null hypothesis (compatibility with the inequality).
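A minimal sketch of the indirect test in the matrix framework is given below (hypothetical code reusing x_matrix and the category arrays from above). For simplicity it shows a plain percentile bootstrap for the one-sided endpoint; the BCa variant used in the thesis additionally applies bias and acceleration corrections, which are omitted here, so the resulting endpoint may differ somewhat from the quoted −0.116.

```python
import numpy as np

def t_mat_iris(pl_s, pw_s, sl_s, sw_s):
    return np.linalg.eigvalsh(x_matrix(pl_s, [pw_s, sl_s, sw_s])).min()

rng = np.random.default_rng(1)
N, B = len(pl_cat), 999
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, N, N)     # resample the 150 flowers with replacement
    boot[b] = t_mat_iris(pl_cat[idx], pw_cat[idx], sl_cat[idx], sw_cat[idx])

upper = np.quantile(boot, 0.95)     # percentile analogue of the endpoint T_mat^0.95
reject = upper < 0                  # null hypothesis 'T_mat >= 0' rejected if True
```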
6 Conclusion and outlook

In this thesis we have investigated statistical aspects of identifying, or rather rejecting, causal models (DAGs, Bayesian networks) given data sets of finite size. To be precise, we constructed and elaborated on hypothesis tests based on inequalities constraining the probability distributions that are compatible with a given DAG. Since the inequality constraints are outer approximations to the true set of distributions compatible with the DAG, a rejection of the null hypothesis 'data satisfy the inequality' implies the rejection of the causal model. A rigorous conclusion in the other direction, stating that we found the one correct model explaining the data, cannot be drawn within our approach. One fundamental reason, independently of the statistical aspects considered here, is that more than one DAG might be compatible with the data. Choosing between several compatible models is a task that lies outside the scope of this thesis.

It is worth mentioning that our approach does not require a causal interpretation of the underlying DAG [recall that in the first instance a DAG encodes (conditional) independence relations]. By observing correlations that are too strong to be compatible with the DAG, the DAG can be rejected independently of its interpretation. Also note that the faithfulness assumption, stating that all correlations allowed by the DAG should indeed be observed, is not required by the present approach. In a sense, the faithfulness assumption bounds correlations from below, while our approach tests correlation bounds from above. Nevertheless, the faithfulness assumption might be used as a formal version of Occam's razor when choosing between different compatible models.

At the end of the introductory Chapter 1 we stated three major goals of this thesis: first, improve the hypothesis test proposed in [16] based on an entropic inequality; second, derive an analogous inequality based on certain generalized covariance matrices; and finally, apply our methods to real empirical data. Note that we considered two different approaches to hypothesis testing. The direct approach (also considered in [16]) requires the calculation of a threshold value above/below which the null hypothesis is rejected. In the indirect (or bootstrap) approach a confidence interval of the statistic of interest is estimated (via bootstrapping). The null hypothesis is rejected when the confidence interval does not overlap with the compatibility region of the inequality. We furthermore considered both types of hypothesis tests in the entropic as well as the (generalized covariance) matrix framework.

In Chapter 3 we pursued the first of the above mentioned goals within the entropic framework. As one means we implemented recent techniques of entropy estimation [17, 18]. We could confirm that the new (minimax) estimator is often clearly superior to the maximum likelihood estimate (and two bias-corrected versions). In the case of small samples (N = 50) and small alphabets (K = 2), which we considered in all of our simulated hypothesis tests, however, the advantage of the minimax estimator was rather insignificant.
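As a pointer for the reader, the sketch below shows one of the classical bias-corrected estimators alluded to above, the Miller correction [38], next to the plug-in estimate; it is only meant to illustrate the kind of estimators the minimax estimator of [17, 18] was compared against, whose construction is too involved to reproduce here. The helper plugin_entropy is the one from the sketch in Section 5.3.

```python
import numpy as np
# reuses plugin_entropy from the sketch in Section 5.3

def miller_madow_entropy(labels):
    """Plug-in estimate plus the Miller bias correction (K_observed - 1)/(2N) [38],
    here in nats; one of the classical bias-corrected estimators referred to above."""
    n = len(labels)
    k_observed = np.count_nonzero(np.bincount(labels))
    return plugin_entropy(labels, base=np.e) + (k_observed - 1) / (2 * n)
```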
In addition to the arguably weak power of the entropic direct test from [16], we observed that the test might not have the desired type-I-error rate of 5%. Utilizing the bootstrap approach, we could circumvent this problem. The power of the entropic bootstrap test, however, was even smaller than the power of the entropic direct test. This difference was, at least partially, due to a weaker null hypothesis implicitly used by the bootstrap test (compatibility with the inequality as opposed to compatibility with the DAG). By implementing additional entropic inequalities from [16], more powerful entropic bootstrap tests could be constructed. However, these bootstrap tests were still less powerful than the entropic direct test.

In Chapter 4 we constructed tests implementing our newly derived inequalities in the (generalized covariance) matrix framework. The matrix bootstrap test was more powerful than all entropic bootstrap tests, almost matching the power of the entropic direct test. With a direct test in the matrix framework we were finally able to significantly surpass the power of the entropic direct test. The observation that made us doubt the proper type-I-error control of the entropic direct test could not be made in the matrix framework. Nevertheless, the reliability of the matrix direct test remains questionable as well. Overall, we could improve the power of the original test from [16] or its reliability, but not both at once.

Not considered further in this thesis, a straightforward increase of the power would be achieved by increasing the sample size. With a larger sample size the estimation of the entropies (as well as of the generalized covariance matrices) would become more precise. For real data (a small sample) this solution is unfortunately not realizable. Decreasing the variance and the bias of the estimates by employing other estimation techniques seems unlikely, at least in the entropic case. The employed minimax estimator of entropy already aims at minimizing the combination of variance and bias.

Concerning the second goal of the thesis, the newly derived inequality in the matrix framework, constraining hidden common ancestor models, is an interesting result on its own. Constraints for such models based on the structure alone are rare to this day. For the triangular scenario and small alphabets, as considered in our simulated hypothesis tests, the new inequality turned out to be stronger than the analogous entropic inequality. However, we have seen that for other settings the opposite might be true. Note that, as a corollary of our matrix inequality, we also derived an inequality on the level of the usual covariances. The matrix inequality, being independent of the actual alphabets of the variables, is typically stronger (and never weaker) than the covariance inequality for one particular choice of outcome values. Looking beyond this thesis, one might hope that entropic inequalities for other scenarios (which can be derived with the algorithm from [16]) can be translated to the matrix framework as well. As a drawback, since we encode probability distributions in matrices, our approach is restricted to bi-partite information (the domain of the employed M-matrices can be associated with one variable and the codomain with a second (or the same) variable). A generalization to more than bi-partite information is not straightforward.
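For completeness, the covariance-level corollary mentioned above is easy to evaluate once numerical outcome values are fixed; written in terms of correlation coefficients (cf. (B.11) in Appendix B) it bounds the sum of squared correlations of A^1 with the other observables by m − 1. A hypothetical check on numerically encoded data could look as follows:

```python
import numpy as np

def correlation_lhs(a1, others):
    """Sum of squared correlation coefficients Corr(A1, Aj)^2, to be compared
    with m - 1 (the correlation form (B.11) of the covariance corollary)."""
    return sum(np.corrcoef(a1, aj)[0, 1] ** 2 for aj in others)

# e.g. for a pairwise (m = 2) hidden common ancestor model one would require
# correlation_lhs(a1, [a2, a3, a4]) <= 1.
```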
Concerning the third goal, we demonstrated how our methods can be used to reject a proposed model based on real data. Even though the scenario of the iris flowers might not be the most spectacular one, it was well suited for illustrative purposes. We were able to falsify the proposed model with bootstrap tests in both the entropic and the matrix framework. As a next step it would be desirable to contribute with our methods to current research. While our whole approach specifically aims at models with hidden variables, the current restriction of the matrix framework to purely hidden common ancestor models might complicate the search for up-to-date applications. Another obstacle is the assumption of discrete ancestors. While it might be possible to adapt the entropic approach to continuous ancestors, our proof of the matrix inequality strongly rests on the assumption of discrete, even finite, ancestors. This assumption could be problematic, since in some cases we might not have enough knowledge about the hidden variables to justify it. If, on the other hand, discrete ancestors can be justified or tolerated, our methods can be readily applied to any suitable data set. Concerning the observables, our methods are best suited for discrete observables with well defined categories. In that case no manual discretization of the observables is required. Data with well defined categories could for example emerge as the results of questionnaires.

Acknowledgements

I want to thank Prof. David Gross for supervising this thesis and for insightful discussions. Furthermore, I want to thank Rafael Chaves and Johan Åberg for co-supervising the thesis, for likewise insightful discussions and for valuable feedback.

A Generalized proof of Theorem 4.1

In the main text, the following two steps of the proof of Theorem 4.1 have only been formulated and performed for the special case of the triangular scenario:

• Proposition 4.2 in Subsection 4.4.2: proof for a special family of distributions

• Proposition 4.3 in Subsection 4.4.4, Step 1: locally transforming A = {A_1, A_2} → A', ...

In both cases, the reason was mainly to ensure proper readability. In fact, the main difference in the general proofs presented here will be the generalized (and more complicated) notation. Employing this notation, a lot of the calculations should be extremely familiar. It is therefore advisable to first read the specialized proofs in the main text, or to reread those proofs in case of any difficulties.

A.1 Proof for a special family of distributions

The idea in Subsection 4.4.2 was to model each ancestor as a collection of correlated subvariables, one for each observable connected by that ancestor. Each observable was then defined as the product of all its subvariables. A graphical illustration for the triangular scenario was provided by Figure 19.

A.1.1 General notation

Now, considering the observables A^1, ..., A^n, if for example A^1, A^2 and A^3 share one ancestor, the ancestor will be denoted by λ_123. The subvariables corresponding to that ancestor will be denoted by A^1_123, A^2_123 and A^3_123. The collection of these subvariables, or rather the set of indices {1, 2, 3}, will be referred to as the correlator of A^1, A^2 and A^3. Usually, we will label correlators by a single letter x, y or z (e.g. x = {1, 2, 3}). In contrast, the observables themselves are as usual labeled by one of the letters i, j, k, l, ... (e.g. A^j).
To state that Aj is part of the correlator x, we write j ∈ x. The corresponding subvariable is denoted by Ajx . Note that each correlator x 146 A GENERALIZED PROOF OF THEOREM 4.1 corresponds to exactly one ancestor λx . Also note that while strictly speaking x is defined as a set, we write for example A1123 instead of A1{ 1,2,3 } when addressing a specific subvariable A1x . The set of all correlators is denoted by χ. We will further use the following notations for sets of subvariables: { Ajx }x := { Ajx | j fixed, x ∈ χ s.t. j ∈ x } . { Ajx } j { Ajk } := { Ajx | j ∈ x, x fixed } := { Ajx | j, k fixed, x ∈ χ s.t. j, k ∈ x } j {A { j 1 (A.1) {A1x{x A12 1 A145 145 1 A134 {A14{ 2 A12 5 A145 2 {A5x{x A23 {A2x{x 4 A145 4 A134 {A4x{x 3 A134 3 A23 {A3x{x Figure 34: DAG from Figure 18 where all observables are decomposed into subvariables, one for each ancestor (or correlator) of the observable. The set j { Aj145 } = { A1145 , A4145 , A5145 } is the set of all subvariables corresponding to the correlator {1, 4, 5} (or the ancestor λ145 in the original model). { A14 } = { A1134 , A1145 } is the set of all subvariables of A1 that share a correlator with A4 . The set, say, { A2x }x is the set of all subvariables composing the variable A2 . In words, { Ajx }x is the set of all subvariables composing the observable Aj j and { Ajx } is the set of all subvariables composing the correlator x. { Ajk } 147 A GENERALIZED PROOF OF THEOREM 4.1 is the set of all subvariables of Aj sharing a correlator with Ak . Figure 34 provides a graphical illustration based on the DAG from Figure 18. All correlations are mediated by the correlators; there are no additional corre⊥ A1124 ). lations between subvariables from different correlators (e.g. A1123 ⊥ The joint distribution of A1 , ..., An reads P A1 , ..., An = P { Ajx } Y j (A.2) . x∈χ Q Q In the following, we will simply write x instead of x∈χ . Marginalization over one variable corresponds to summation over all of its subvariables. When calculating P (A1 , Aj ) all factors on the right hand side of (A.2) where neither A1 nor Aj appears become one. The remaining distribution reads P A1 , Aj = Y P A1x , Ajx Y P A1y Y x y z 1,j∈x 1∈y,j ∈y / 1∈z,j∈z / P Ajz . (A.3) In the second product, for example, y runs over all correlators including a subvariable of A1 but no subvariable of A2 . This type of notation will be used heavily for the rest of this section. Further marginalizing over A1 yields P Aj = Y P Ajx Y x = z 1∈z,j∈z / 1,j∈x Y P Ajz P Ajx . (A.4) x j∈x (the final expression also being valid for j = 1) which is in accordance with the claim that Aj is the product of all its independent subvariables Ajx . A.1.2 The proposition For the desired family of distributions that is supposed to satisfy inequality (4.32), we assume perfect correlation between all subvariables belonging to one correlator, 1 j δ j j. (A.5) P { Ajx } = Kx { Ax } The multidimensional Kronecker delta demands that all subvariables in the j set { Ajx } take the same value. Kx is the common alphabet size of these subvariables. Insertion into the general distribution (A.2) yields Y 1 P A1 , ..., An = δ{ Aj }j . (A.6) x x Kx 148 A GENERALIZED PROOF OF THEOREM 4.1 Proposition A.1. The distributions P (A1 , ..., An ) = x K1x δ{ Aj }j on the x variables Aj = { Ajx }x , with arbitrary, finite alphabet sizes Kx , satisfy inequality (4.32), Q n √ X M A1 M A 1 :Aj j MA MA j :A1 √ M A1 ≤ (m − 1) 1A1 . j=2 Recall that we formulated the inequality specifically around the variable A1 . 
An analogous inequality holds for any other variable Ai , the sum then running over j = 1, ...i−1, i+1, ..., n. Alternatively, one could simply rename the variables. A.1.3 The proof Constructing the marginal distributions Either by marginalization of the distribution P (A1 , ..., An ) = x K1x δ{ Aj }j , x or by simply writing the marginals down (which are clear by construction), one obtains the subvariable-marginals Q 1 δ 1 j, Kx Ax ,Ax 1 P Ajx = . Kx P A1x , Ajx (A.7) = (A.8) In the bi-partite case it is always implied that j 6= 1. Inserting (A.7) and (A.8) into (A.3) leads to the total bi-partite marginals P A1 , Aj Y = x Y 1 1 Y 1 δA1x ,Ajx . Kx Ky z Kz y 1,j∈x (A.9) 1∈z,j∈z / 1∈y,j ∈y / It is convenient to introduce the short hand notations K1j := Y Kx , K1j := Y x y 1,j∈x 1∈y,j ∈y / K1 :=K1j K1j , Kj := K1j K1j Y K1j := Ky , Kz , z 1∈z,j∈z / and δ{ A1 },{ Aj } := j 1 Y x δA1x ,Ajx . 1,j∈x (A.10) 149 A GENERALIZED PROOF OF THEOREM 4.1 The indices indicate over which correlators the multiplication runs. For example, K1j is the result of multiplying the alphabet sizes of all correlators containing A1 but not Aj . The delta function δ{ A1 },{ Aj } demands that each 1 j of the variables A1x coincides with its counterpart Ajx . Using these short hand notations, (A.9) can be written as 1 1 δ{ A1 },{ Aj } 1 K K j K1j 1j 1j 1 1 = q δ{ A1 },{ Aj } q . 1 j K1j K1j K1 Kj P A1 , Aj = (A.11) By inserting expression A.8 into equation A.4 one finds the total one-variable marginals P Aj = Y x 1 1 = . Kx Kj (A.12) j∈x Marginals in operator notation In order to simplify the calculations with the M -matrices, we write down the marginal distributions P (Aj ) and P (A1 , Aj ) as states and operators in the Dirac notation. For comparison, see (4.72) to (4.78) in the proof of Proposition 4.2. We can write P Aj 1 O E 1 E IAj . = q I j := q x Kj Kj x (A.13) j∈x E The states IAjx are defined as E IAj x Kx 1 X √ := |ki j . Kx k=1 Ax (A.14) Note that all |Ii -states are normalized. Similarly, the operator corresponding to the bi-partite marginal reads 1 P A ,A j =q 1 K1 K j { A1j }↔{ Aj1 } 1 ⊗ Ij1 ED I1j , (A.15) 150 A GENERALIZED PROOF OF THEOREM 4.1 where the individual factors have the inner tensor product structure { A1j }↔{ Aj1 } 1 O A1x ↔Ajx 1 , := (A.16) x 1,j∈x E 1 Ij E O IA1x , := (A.17) x 1∈x,j ∈x / E j I1 E O IAj . := (A.18) x x 1∈x,j∈x / The meaning of the indices is the same as for the alphabet sizes from (A.10). A1x ↔Ajx The operator 1 can be understood as an ‘identity operator between the isomorphic spaces of the variables A1x and Ajx ’. It is essentially the operator version of the scalar Kronecker delta δA1x ,Ajx , and is explicitly defined as A1x ↔Ajx 1 := Kx X |kiA1x hk|Ajx . (A.19) k=1 See also the original definition (4.77) and the explanations below. For practical purposes, it is only important that when acting from the left on a state from the space of A1x ↔Ajx j Ax , transforms E E A1x ↔Ajx 1 this state to the ‘same’ state in the space of A1x , e.g. 1 IAjx = IA1x (if both 1, j ∈ x). Acting from the right on a state from the space of A1x , the transformation goes in the other { A1j }↔{ Aj1 } 1 direction. For the compound operator for example, { A1j }↔{ Aj1 } 1 one consequently obtains, E j I1 = Ij1 where the states E E O 1 Ij := IA1x , E are defined as (A.20) x 1,j∈x E j I1 := E O IAj . x x (A.21) 1,j∈x The M -matrices For the original definition of the matrices see (4.3) and (4.6). 
Using the 1 j marginals (A.13) and (A.15), the bi-partite matrix M A :A (in a concise 151 A GENERALIZED PROOF OF THEOREM 4.1 operator form) can be written down without any further calculations (except for some simple manipulations), MA 1 :Aj ED = P A1 , Aj − P A 1 P A j { A1j }↔{ Aj1 } 1 1 = q K1 Kj { A1j }↔{ Aj1 1 = ⊗ Ij1 q K1 Kj 1 † I1j − q 1 K1 Kj ED 1 I I j } − Ij1 ED I1j ⊗ Ij1 ED I1j . (A.22) E From the second to the third line we used the decompositions |I 1 i = Ij1 ⊗ E 1 Ij E E and |I j i = I1j ⊗ I1j . For the mono-partite matrices we obtain M Aj K 1 Xj δkk0 |kiAj hk 0 |Aj − P (A)P (A)† = Kj k,k0 =1 = E D 1 1Aj − I j I j . Kj (A.23) Formally, the expressions coincide with the specific expressions for the triangular scenario from (4.79) and (4.81). However, here, the matrices generally have an even deeper tensor product structure. When trying to conclude the desired inequality constraint (4.32), this structure will indeed make a difference. √ √ 1 j j j 1 Calculating the matrix product M A1 M A :A M A M A :A M A1 is comj pletely analogous to the triangular scenario. First, note that M A is a scalar multiplicative of a projection. Thus, taking the (pseudo) inverse and square root simply corresponds to taking the inverse and square root of the prefactor 1/Kj . We start by considering the product of the first two matrices (see 152 A GENERALIZED PROOF OF THEOREM 4.1 (A.22) and (A.23)), p M A1 M A p E D 1 K1 1A1 − I 1 I 1 × p = 1 :Aj { A1j }↔{ Aj1 } K1 Kj 1 ED ED − Ij1 I1j ⊗ Ij1 I1j p = K1 M A1 :Aj p 1 :Aj p A1 :Aj = K1 M A j 1 1 1 E D 1 1 E D 1 { Aj }↔{ A1 } 1 E D j 1 E D j −p Ij ⊗ Ij̄ Ij̄ × 1 − Ij I1 ⊗ Ij I1 I Kj j 1 −p Kj E D E D E D 1 I1j − Ij1 I1j ⊗ Ij1 I1j Ij | {z } 0 = K1 M . (A.24) √ As it was the case in the calculations for the triangular scenario, √M A1 1 j merely has a scalar-multiplicative effect on M A :A . The effect of M A1 j 1 j on M A :A from the right is exactly the same, and analogously the M A in the middle reduces to the scalar prefactor Kj . By exploiting this behaviour 1 j and using M A :A from (A.22), we obtain √ √ 1 j j j 1 M A1 M A :A M A M A :A M A1 =K1 Kj M A = 1 :Aj { A1j }↔{ Aj1 } 1 MA j { A1 }↔{ A1j } ED ED ED ED 1 − I1j Ij1 ⊗ I1j Ij1 − Ij1 I1j ⊗ Ij1 I1j × = 1{ A1j } − Ij1 j :A1 ED Ij1 ⊗ Ij1 ED Ij1 . (A.25) For each single j, this expression is a projection. However, in contrast to the special case of the triangular scenario, projections for different j are in general not orthogonal to each other anymore. As a consequence, the sum Pn √ A1 A1 :Aj Aj Aj :A1 √ A1 M M M M M is not a projection (and typically j=2 neither a scalar multiple of a projection). Showing that the sum is bounded by (m − 1) 1A1 is therefore more complicated in the general case. A detailed consideration of the tensor product structure becomes necessary. Proving the inequality: Gaining intuition By a simple example we explicitly confirm that the projections √ considering √ 1 :Aj j j :A1 1 1 A A A MA M M M M A for different j are in general not orthog- 153 A GENERALIZED PROOF OF THEOREM 4.1 onal to each other. With the help of a second example we try to get an impression of how the different projections overlap. Consider the extreme case of three observables that share one common ancestor (and thus are perfectly correlated). A1 consists of only one subvariable A1123 . Taking a look at (A.25) (last line), for both j = 2, 3 one obtains 1{ A1j } = 1A1123 , E 1 Ij E 1 Ij (A.26) = E IA1 , (A.27) : does not exist. 
(A.28) 123 E Concerning the last one, recall that Ij1 is the tensor product of all states E IA1x such that j ∈ / x. But here, both j = 2, 3 are part of the only correlator E x = { 1, 2, 3 }, so that no such IA1x exists. In total, (A.25) reads √ M A1 M A 1 :Aj j MA MA j :A1 √ M A1 = 1A1123 − IA1123 ED IA1123 . (A.29) We find the same projection for both j = 2, 3. In particular, they are not orthogonal to each other. On the other hand, since their sum is a scalar multiple of a projection, the proper (and trivial) inequality for this scenario (m = 3; A1123 = A1 ), 2 (1A1 − |IA1 i hIA1 |) ≤ 21A1 , (A.30) can easily be seen to be satisfied. The inequality is trivial in the sense that this scenario puts no constraints on the distribution at all. In general, two projections will neither be the same nor orthogonal, but overlap only on some subspaces of the tensor product space. For a more complex example that illustrates this behaviour consider four variables of which all triplets share one ancestor. There are three correlators including the variable A1 : { 1, 2, 3 }, { 1, 2, 4 } and { 1, 3, 4 }. Explicitly writing down all tensor products, the left hand side of inequality (4.32) (using (A.25)) 154 A GENERALIZED PROOF OF THEOREM 4.1 reads 4 √ X M A1 M A 1 :Aj j MA MA j :A1 √ M A1 j=2 = 1A1123 ⊗ 1A1124 − IA1123 ED IA1123 ⊗ IA1124 + 1A1123 ⊗ 1A1134 − IA1123 ED IA1123 ⊗ IA1134 + 1A1124 ⊗ 1A1134 − IA1124 ED IA1124 ⊗ IA1134 ED IA1124 ⊗ IA1134 ED IA1134 ED IA1134 ⊗ IA1124 ED IA1124 ED IA1134 ⊗ IA1123 ED IA1123 . (A.31) We see that each of the terms 1A1123 , 1A1124 and 1A1134 appears in two of the three projections. These terms are exactly the causes of the overlaps (i.e. the non-orthogonality). Since we have m = 3 for the current scenario, one may realize that the number of occurrences of each overlapping term is indeed upper bounded by m − 1. To verify this intuitive statement for the general case, a more detailed treatment follows. Proving the inequality: General treatment √ √ 1 j j j 1 An important realization is that all terms M A1 M A :A M A M A :A EM A1 are diagonal in the same basis, partially spanned by the states IA1x (see (A.25) and the definitions of the |Ii -states from (A.14), A.17 and A.20). In order to validate inequality (4.32), it is thus enough to show that all diagonal √ √ P 1 j j j 1 elements of nj=2 M A1 M A :A M A M A :A M A1 are upper bounded by m − 1. We label the correlators x including the observable A1 by x1 , ..., xN . We further denote the basisE states of the variable A1xi by |0i i , |1i i , |2i i , ... |Ki − 1i and set |0i i ≡ IA1xi . The other states are arbitrary and never occur explicitly. Similar notation is used for the identity operators, i.e. 1i ≡ 1A1xi . The projections (A.25), √ √ A1 :Aj Aj Aj :A1 A1 M A1 M M M M ED ED ED = 1{ A1j } ⊗ Ij1 Ij1 − Ij1 Ij1 ⊗ Ij1 Ij1 , | {z =:T1 (j) } | {z =:T2 (j) } (A.32) A GENERALIZED PROOF OF THEOREM 4.1 155 can now be written down in a way that explicitly takes into account the full tensor product structure. The second term reads N O T2 (j) = |0i i h0i | i=1 = |01 i h01 | ⊗ |02 i h02 | ⊗ ... ⊗ |0N i h0N | , (A.33) independently of j. Only the diagonal element corresponding to the tensor product state |01 , 02 , ..., 0N i will be one, while all others are zero. For the first term of the projection (A.32), we obtain O T1 (j) = 1i ⊗ |0i i h0i | i j∈xi O i j ∈x / i = 11 ⊗ 12 ⊗ |03 i h03 | ⊗ 14 ⊗ ... ⊗ |0N i h0N | . e.g. (A.34) The second line depends on the ancestors that A1 shares with Aj . 
The identity operators occur exactly at the positions of the common correlators (or ancestors) of A1 and Aj . Still considering this example, the diagonal elements corresponding to states of the form |α1 , α2 , 03 , α4 , ..., 0N i with αi = 0, 1, ...Kxi − 1 will be one, all others will be zero. The positions of the αi are exactly the positions of the identity operators in T1 (j). Note that the element corresponding to the state |01 , 02 , ..., 0N i will always be one, but it will be canceled by the term T2 (j). Thus, for the total projection √ √ 1 j j j 1 M A1 M A :A M A M A :A M A1 , all elements corresponding to states of the form |α1 , α2 , 03 , α4 , ..., 0N i with at least one of the αi 6= 0 will be one. Now, the final question is in how many projections a specific identity operator 1i (or a combination of several identity operators) can occur (i.e. in how √ √ 1 j j j 1 many projections M A1 M A :A M A M A :A M A1 a specific diagonal element can be one). For a given j = 2, ..., n, the identity operator 1A1xi = 1i occurs if and only if j ∈ xi . Since each ancestor connects at most m observables, each correlator containing A1 can contain at most m − 1 other variables Aj . Thus, the term 1A1xi can occur at most m − 1 times. Of course any combination of more than one identity operator cannot occur more often than each identity individually. This means that a given diagonal element can be one in at most m − 1 projections. Thus, considering the sum of all projections, each diagonal element is bounded by m − 1. Since the total A GENERALIZED PROOF OF THEOREM 4.1 156 operator is diagonal we can conclude the desired inequality n X j=2 1{ A1j } − Ij1 √ | Ij1 ⊗ Ij1 Ij1 ≤ (m − 1) 1A1 , {z } √ 1 j j 1 ED j M A1 M A1 :A M A = MA :A ED (A.35) MA for the considered special family of distributions. This finishes the proof of Proposition A.1. Example We illustrate the above construction of the matrix representation of Pn √ A1 A1 :Aj Aj Aj :A1 √ A1 M M M M M by an example. Consider again j=2 four variables with one ancestor for each triplet. In an operator notation, Pn √ A1 A1 :Aj Aj Aj :A1 √ A1 M M M M M was already calculated in (A.31). j=2 To establish the same order of the tensor product decomposition in all terms, we rewrite this expression as 4 √ X 1 M A1 M A :Aj j MA MA j :A1 √ M A1 j=2 1A1123 1A1123 1A1124 ⊗ 01134 01134 − 01123 01123 ⊗ 01124 01124 ⊗ 01134 01134 + 1A1134 − 01123 01123 ⊗ 01124 01124 ⊗ 01134 01134 ⊗ 01124 01124 ⊗ + 01123 01123 ⊗ 1A1124 ⊗ 1A1134 − 01123 01123 ⊗ 01124 01124 ⊗ 01134 01134 . (A.36) E Note that we also changed the notation from IA1x to |01x i. In the following, = ⊗ we completely neglect the indices since they should be clear from the order of the tensor product decomposition. When assuming that all variables are we have to calculate diagonal elements for each projection √ binary, √ eight 1 1 A1 :Aj Aj Aj :A1 A A M M M M M (for short denoted as Pj , j = 2, 3, 4 from now on). 
Starting with P2 (second line in (A.36)) we obtain h0, 0, 0 | P2 | 0, 0, 0i = h0, 0, 0| (1 ⊗ 1 ⊗ |0i h0| − |0i h0| ⊗ |0i h0| ⊗ |0i h0|) |0, 0, 0i = h0 | 1 | 0i h0 | 1 | 0i h0 | 0i h0 | 0i − h0 | 0i h0 | 0i h0 | 0i h0 | 0i h0 | 0i h0 | 0i =1 − 1 =0, (A.37) 157 A GENERALIZED PROOF OF THEOREM 4.1 h1, 0, 0 | P2 | 1, 0, 0i = h1, 0, 0| (1 ⊗ 1 ⊗ |0i h0| − |0i h0| ⊗ |0i h0| ⊗ |0i h0|) |1, 0, 0i = h1 | 1 | 1i h0 | 1 | 0i h0 | 0i h0 | 0i − h1 | 0i h0 | 1i h0 | 0i h0 | 0i h0 | 0i h0 | 0i =1 − 0 =1, (A.38) and similarly h0, 0, 1 | P2 h0, 1, 0 | P2 h0, 1, 1 | P2 h1, 0, 1 | P2 h1, 1, 0 | P2 h1, 1, 1 | P2 | 0, 0, 1i | 0, 1, 0i | 0, 1, 1i | 1, 0, 1i | 1, 1, 0i | 1, 1, 1i = = = = = = 0 1 0 0 1 0. (A.39) (A.40) (A.41) (A.42) (A.43) (A.44) A suitable matrix representation of this projection reads P2 = |0, 0, 0i |0, 0, 1i |0, 1, 0i |0, 1, 1i |1, 0, 0i |1, 0, 1i |1, 1, 0i |1, 1, 1i 0 0 1 0 1 0 1 . (A.45) 0 In the same basis, the other two projections can be written as P3 = |0, 0, 0i |0, 0, 1i |0, 1, 0i |0, 1, 1i |1, 0, 0i |1, 0, 1i |1, 1, 0i |1, 1, 1i 0 1 0 0 1 1 0 0 , (A.46) 158 A GENERALIZED PROOF OF THEOREM 4.1 and 0 |0, 0, 0i |0, 0, 1i 1 1 |0, 1, 0i 1 |0, 1, 1i . P4 = 0 |1, 0, 0i 0 |1, 0, 1i 0 |1, 1, 0i 0 |1, 1, 1i Finally, the sum of all three projections satisfies P2 + P3 + P4 0 0 1 0 = 0 = 1 0 1 2 2 1 2 1 1 + 0 0 + (A.47) 0 1 0 0 1 1 0 1 1 0 0 0 0 0 ≤21. A.2 1 (A.48) Locally transforming Aj = { Ajx }x → Aj0 In Proposition 4.3 we have shown for the special case of the triangular scenario how to locally transform distributions from the ‘correlated subvariable model’ to distributions from the ‘ancestor model’. The starting distributions read P (A, B, C) = P (A1 , A2 , B1 , B2 , C1 , C2 ) 1 = δA B δB C δC A , KAB KAC KBC 1 2 1 2 1 2 (A.49) 0 159 A GENERALIZED PROOF OF THEOREM 4.1 (see also (4.102)) while the resulting distributions were of the form P (A0 , B 0 , C 0 ) X = P (A0 | λAB , λAC ) P (B 0 | λAB , λBC ) P (C 0 | λAC , λBC ) λAB ,λAC ,λAB · 1 , KAB KAC KBC (A.50) (see also (4.104)). The latter is the general form of all distributions compatible with the triangular scenario, but with uniform distributions of the ancestors, P (λx = j) = 1/Kx . For general hidden common ancestor models, Q we start with the distributions P (A1 , ..., An ) = x K1x δ{ Aj }j introduced in x (A.6). We will also use the notations related to subvariables Ajx introduced j in that section. In particular, recall that by { Ajx } we denoted the set of all subvariables composing the correlator x and by { Ajx }x the set of all subvariables composing the observable Aj , see also (A.1) and Figure 34. In addition, we introduce the notation j { Ajx }x := { Ajx | j ∈ x, x ∈ χ } (A.51) for the the set of all subvariables Ajx . In the ancestor framework, recall that we denoted the set of all ancestors λx by { λx }x and the set of all ancestors of the observable Aj by { λx }x| j , see also (4.108) and (4.109). A Proposition A.2. Starting with the family of distributions P (A1 , ..., An ) = Q 1 x Kx δ{ Aj }j introduced in (A.6), one can obtain all distributions of the form x (4.108), P A1 , ..., An = X P A1 | { λx }x| { λx }x A1 ...P An | { λx }x|An Y x 1 , Kx (A.52) (with finite alphabets) via local transformations Aj = { Ajx }x → Aj0 . Proof. Recall that for a single variable a local transformation reads P Aj0 = X P Aj0 | Aj P Aj Aj = X { Ajx }x P Aj0 | { Ajx }x P { Ajx }x . 
(A.53) 160 A GENERALIZED PROOF OF THEOREM 4.1 Applied to the joint distribution P (A1 , ..., An ), by locally transforming all variables we arrive at P A10 , ..., An0 P A10 | A1 ...P (An0 | An ) P A1 , ..., An X = A1 ,...,An P A10 | { A1x }x ...P (An0 | { Anx }x ) P A1 , ..., An X = { A1x }x ,...,{ An x }x P A10 | { A1x }x ...P (An0 | { Anx }x ) X = Y x j { Ajx }x 1 δ j j Kx { Ax } The summation over all but one subvariable of each correlator x cancels the δ{ Aj }j and forces all involved subvariables to coincide. This allows us to x model all these correlated subvariables by one single variable. The latter can be identified as the common ancestor of all observables belonging to the given correlator. We consequently rename the remaining subvariable of the correlator x as λx and finally obtain P A10 , ..., An0 = P A10 | { A1x }x ...P (An0 | { Anx }x ) X x j { Ajx }x = X { λx }x Y P A10 | { λx }x| A10 1 δ j j Kx { Ax } ...P An0 | { λx }x|An0 Y x 1 . Kx (A.54) 161 B PROOF OF COROLLARY 4.2 B Proof of Corollary 4.2 Corollary 4.2 states that all distributions compatible with a hidden common ancestor model with n observables A1 , ..., An (with finite alphabets) and ancestors of degree up to m, satisfy the inequality n n h i2 Y X Cov A1 , Aj j=2 h i Var Ak ≤ (m − 1) n Y h i Var Ai . (B.1) i=1 k=2 k6=j Proof. From Theorem 4.1 we know that for all distributions compatible with the given hidden common ancestor model, the matrix XA 1 :...An (m − 1) M A 2 1 M A :A .. = . . .. n 1 M A :A 1 1 M A :A 2 MA 2 ··· 0 .. . .. . ··· 0 .. . 0 1 · · · M A :A ··· 0 .. .. . . .. . 0 n 0 MA n (B.2) is positive semidefinite. Lemma 4.2 allows us to conclude that the matrix · · · Cov [A1 , An ] ··· 0 .. ... A1 :...:An . Z ... 0 ··· 0 Var [An ] (B.3) is positive semidefinite as well. To apply Lemma 4.2, recall that one can † i j write Cov [Ai , Aj ] = ai M A :A aj where the vectors ai , aj carry the alpha(m − 1) Var [A1 ] Cov [A1 , A2 ] Cov [A2 , A1 ] Var [A2 ] .. . 0 := .. .. . . Cov [An , A1 ] 0 ··· 0 ... ... 1 n 1 n bets of Ai and Aj . Positive semidefiniteness of Z A :...:A implies det Z A :...:A ≥ 1 n 0. For simplicity, we write Z ≡ Z A :...:A from now on. To calculate the determinant we use Laplace’s formula and expand the matrix along the first column. In that case, Laplace’s formula reads det Z = n X (−1)j+1 Zj1 det Z (j,1) . j=1 | {z dj } (B.4) 162 B PROOF OF COROLLARY 4.2 The matrix Z (j,1) is obtained from the matrix Z by erasing the jth row and the first column. In general, this formula has to be applied recursively until the determinants on the right hand side can be calculated by other means. Fortunately, we only require one application of (B.4) here. • For j = 1 we obtain (−1)1+1 = 1, Z11 = (m − 1) Var [A1 ] and Var [A2 ] .. Z (1,1) = . n Var [A ] . (B.5) The determinant of the diagonal matrix Z (1,1) is simply the product of Q its diagonal elements, det Z (1,1) = ni=2 Var [Ai ]. Thus, the j = 1-term on the right hand side of (B.4) reads d1 = (m − 1) n Y h i Var Ai . (B.6) i=1 This is exactly the right hand side of the desired inequality (B.1). • For j ≥ 2 we obtain the prefactors (−1)j+1 , Zj1 = Cov [Aj , A1 ] and the matrix Z (j,1) (where we abbreviate Cov , C, Var , V and Ai , i): C [1, 2] C [1, 3] · · · V [2] 0 ··· .. 0 . V [3] .. . 0 .. . .. . . .. 0 ··· ··· C [1, j − 1] 0 .. . V [j − 1] 0 0 0 .. . .. . 0 0 V [j + 1] 0 .. . 0 .. . · · · · · · C [1, n] ··· ··· 0 .. . .. .. . . .. .. . . .. .. . . 0 0 0 ··· 0 C [1, j] C [1, j + 1] 0 0 .. .. . . .. .. . . 
0 V [n] To guide the eye, the main diagonal is colored light green. Up to the ‘j − 1-column’, the element directly below the diagonal is (in general) non-zero. The ‘j-column’ is non-zero only in the first row. From the ‘j + 1’ to the last column all elements below the diagonal vanish as 163 B PROOF OF COROLLARY 4.2 well. Thus, by permuting the ‘j-column’ to the left (requiring j − 2 permutations), one arrives at the upper triangular matrix: C [1, j] C [1, 2] C [1, 3] · · · 0 V [2] 0 ··· .. . V [3] 0 .. .. . . 0 . .. . .. . .. C [1, j − 1] 0 0 V [j − 1] 0 0 V [j + 1] 0 ··· 0 · · · · · · C [1, n] ··· ··· 0 .. . .. .. . . .. .. . . .. .. . . 0 C [1, j + 1] 0 ··· ··· V [n] 0 The determinant of this matrix is simply the product of its diagonal elements. Each of the j − 2 permutations required to bring the matrix to the upper triangular form introduces one factor (−1). Thus, h det Z (j,1) = (−1)j−2 Cov A1 , Aj n iY h i Var Ak . (B.7) k=2 k6=j The contribution of each j = 2, ..., n to the Laplace decomposition of 1 n det Z A :...:A from (B.4) becomes dj = (−1) j+1 | j−2 (−1) {z h j 1 i h j Cov A , A Cov A , A n iY 1 :...:An = n X i k6=j Taking (B.6) and (B.8) together, the determinant of Z A det Z A h Var Ak . (B.8) k=2 } =−1 1 1 :...:An reads dj j=1 = (m − 1) n Y i=1 h i Var Ai − n n h i2 Y X Cov A1 , Aj j=2 h Var Ak i k=2 k6=j (B.9) 164 B PROOF OF COROLLARY 4.2 Thus, the requirement det Z A n n h i2 Y X Cov A1 , Aj j=2 1 :...:An h ≥ 0 becomes i Var Ak ≤ (m − 1) n Y h i Var Ai . (B.10) i=1 k=2 k6=j In the case that Var [Ai ] 6= 0 for all observables Ai , this inequality can be Q divided by ni=1 Var [Ai ], yielding n X 2 |Cov [A1 , Aj ]| ≤ (m − 1) 1 j j=2 Var [A ] Var [A ] ⇒ n h i2 X Corr A1 , Aj ≤ (m − 1) . (B.11) j=2 In this way, we obtain the inequality for correlation coefficients initially introduced for illustrative purposes in Subsection 4.3.3. REFERENCES 165 References [1] A. Falcon. Aristotle on Causality. Stanford Encyclopedia of Philosophy, http://plato.stanford.edu/entries/aristotle-causality/ (2006, revised 2015). [2] M. Hulswit. Cause to Causation. A Peircean Perspective. Kluwer Publishers (2002). [3] J. Pearl. Causality: Models, Reasoning, and Inference. CUP (2009). [4] Aristotle. Physics. 194 b17-20 (350 BC). [5] D. Hume. A Treatise of Human Nature. (1738). [6] K. Pearson. The Grammar of Science, chapter: Contingency and Correlation - The Insufficiency of Causation. A. and C. Black (1911). [7] R. A. Fisher. The Design of Experiments. Macmillan (9th ed. 1971, orginally 1935). [8] C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37 (3): 424-438 (1969). [9] J. Neyman. Sur les applications de la theorie des probabilites aux experiences agricoles: Essai des principes. Master’s Thesis (1923). [10] J. Sekhon. The Neyman-Rubin Model of Causal Inference and Estimation via Matching Methods, in The Oxford Handbook of Political Methodology, OUP (2008). [11] P. W. Holland. Statistics and causal inference. J. Amer. Statist. Assoc, 81 (396): 945-960 (1986). [12] C. Kang and J. Tian. Inequality constraints in causal models with hidden variables. UAI 2006, pages 233–240. [13] C. Kang and J. Tian. Polynomial constraints in causal bayesian networks. UAI 2007, pages 200–208. [14] J. Tian and J. Pearl. On the testable implications of causal models with hidden variables. UAI 2002, pages 519–527. REFERENCES 166 [15] G. Ver Steeg and A. Galstyan. A sequence of relaxations constraining hidden variable models. UAI 2011, pages 717–727. [16] R. 
Chaves, L. Luft, T. O. Maciel, D. Gross, D. Janzing, and B. Schölkopf. Inferring latent structures via information inequalities. Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, UAI 2014, pages 112 – 121, AUAI Press (2014). [17] J. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax Estimation of Functionals of Discrete Distributions. Used version: arXiv:1406.6956v3 (2014). Now published: IEEE Transactions on Information Theory, 61 (5): 2835-2885 (2015). [18] Y. Wu and P. Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. arXiv:1407.0381, (2014). [19] P. Spirtes, N. Glymour, and R. Scheienes. Causation, Prediction, and Search. 2nd ed. MIT Press (2001). [20] C. Hitchcock. Probabilistic Causation. Stanford Encyclopedia of Philosophy, http://plato.stanford.edu/entries/causation-probabilistic/ (1997, revised 2010). [21] B. Bonet. Instrumentality tests revisited. UAI 2001, pages 48–55. [22] A. S. Goldberger. Structural equation methods in the social sciences. Econometrica, 40 (6): 979-1001 (1972). [23] J. Pearl. On the testability of causal models with latent and instrumental variables. UAI 1995, pages 435–443. [24] C. Uhler, G. Raskutti, P. Bühlmann, and B. Yu. Geometry of the faithfulness assumption in causal inference. Annals of Statistics, 41 (2): 436-463 (2013). [25] N. Cartwright. Hunting Causes and Using Them. CUP (2007). [26] R. B. Ash. Information Theory. Dover Publications (1990, originally 1965). [27] T. M. Cover and J. A. Thomas. Elements of Information Theory. 2nd ed. Wiley (2006). REFERENCES 167 [28] R. W. Yeung. Information Theory and Network Coding. Springer (2008). [29] R. Bhatia. Positive Definite Matrices. PUP (2007). [30] R. A. Horn and C. R. Johnson. Matrix Analysis. 2nd ed. CUP (2013). [31] F. Zhang. Matrix Theory: Basic Results and Techniques. Springer (1999). [32] S. Axler. Linear Algebra Done Right. 2nd ed. Springer (1997). [33] C. Cohen-Tannoudji, B. Diu, and F. Laloë. Quantum Mechanics. Wiley (2006). [34] H. P. Williams. Fourier’s method of linear programming and its dual. Amer. Math. Monthly, 93 (9): 681-695 (1986). [35] K. Knight. Mathematical Statistics. Chapman & Hall (2000). [36] G. Valiant and P. Valiant. Estimating the unseen: an n/ log(n)-sample estimator for entropy and support size, shown optimal via new clts. Proceedings of the 43rd annual ACM symposium on Theory of computing, ACM 2011, pages 685–694. [37] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15: 1191-1253 (2003). [38] G. Miller. Note on the bias of information estimates, in Information Theory in Psychology: Problems and Methods, Free Press (1955). [39] T. Schürmann. Bias analysis in entropy estimation. J. Phys. A: Math. Gen, 37 (27): L295-L301 (2004). [40] R. Pachón and L. N. Trefethen. Barycentric-remez algorithms for best polynomial approximation in the chebfun system. BIT Numer Math, 49 (4): 721-741 (2009). [41] L. Veidinger. On the numerical determination of the best approximations in the chebychev sense. Numerische Mathematik, 2: 99-105 (1960). REFERENCES 168 [42] B. Efron and T. J. DiCiccio. Bootstrap confidence intervals. Statistical Science, 11 (3): 189-228 (1996). [43] B. Efron. Bootstrap methods: Another look at the jackknife. The Anals of Statistics, 7 (1): 1-26 (1979). [44] J. Carpenter and J. Bithell. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Statist. Med, 19 (9): 1141-1164 (2000). [45] W. J. McGill. 
Multivariate information transmission. Psychometrika, 19: 97-116 (1954). [46] R. Bhatia. Matrix Analysis (Graduate texts in mathematics, 169). Springer (1997). [47] M. Lichman. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, Irvine, CA: University of California, School of Information and Computer Science (2013). [48] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7 (2): 179-188 (1936). [49] H. Abdi. Bonferroni and Šidák corrections for multiple comparisons, in Encyclopedia of Measurement and Statistics, Sage Publications (2007). [50] Z. Šidák. Rectangular confidence region for the means of multivariate normal distributions. JASA, 62 (318): 626-633 (1967). [51] R. G. Miller. Simultaneous Statistical Inference. 2nd ed. Springer (1981). [52] Y. Hochberg and A. C. Tamhane. Multiple Comparison Procedures. Wiley (1987).