Statistical aspects of inferring Bayesian
networks from marginal observations
Master's thesis
at the
Faculty of Mathematics and Physics of the
Albert-Ludwigs-Universität Freiburg
submitted by
Kai von Prillwitz
under the supervision of Prof. David Gross
23 October 2015
Abstract
We investigate statistical aspects of inferring compatibility between causal
models and small data samples. The considered causal models include hidden variables complicating the task of causal inference. A proposed causal
model can be rejected as an explanation for generating the data if the empirical distribution (of the observable variables) differs significantly from the
distributions compatible with the model. In fact, the utilized hypothesis
tests are based on inequality constraints constituting outer approximations
to the true set of distributions compatible with the model. We start by working with inequalities in a recently developed entropic framework, implementing likewise recent techniques of entropy estimation. In a second step we derive and implement analogous constraints on the level of certain generalized covariance matrices. In contrast to actual covariances, our matrices are independent of the alphabets (the outcome values) of the variables.
Furthermore, we distinguish two different approaches to hypothesis testing.
Our methods are demonstrated by an application to real empirical data, the
so-called ‘iris (flower) data set’.
Zusammenfassung
In this thesis we investigate statistical aspects of determining the compatibility of causal models with small data sets. The causal models under consideration contain hidden, unmeasurable variables, which makes the task additionally difficult. A given causal model can be ruled out as an explanation for the observed data if the empirical probability distribution (of the observable quantities) differs significantly from the distributions compatible with the model. The hypothesis tests carried out here are based on inequalities that constitute an outer approximation to the true set of compatible distributions. In a first step we use recently developed inequalities based on entropies of the probability distributions; the entropy estimation methods we employ are likewise recent. In a second step we derive novel inequalities based on matrices that can be regarded as generalizations of covariances and that, unlike covariances, are independent of the values actually taken by the variables. Furthermore, we examine two different approaches to the hypothesis tests. As an application of our methods to real data we consider the so-called 'Iris flower data set'.
Contents

1 Introduction  8
  1.1 Philosophical and mathematical background  8
  1.2 Outline  12

2 Basic concepts  13
  2.1 Introduction to probability theory  13
    2.1.1 Discrete random variables  13
    2.1.2 Joint and marginal distributions  14
    2.1.3 Conditional probabilities and (conditional) independence  14
    2.1.4 Expected value, variance and covariance  15
  2.2 Bayesian networks  16
    2.2.1 Markov condition  17
    2.2.2 Faithfulness assumption  18
    2.2.3 Hidden variables  19
    2.2.4 Hidden common ancestor models  20
  2.3 Information theory  23
    2.3.1 Shannon entropy  23
    2.3.2 Joint and conditional entropy  24
    2.3.3 Mutual information  25
  2.4 Hermitian and positive semidefinite matrices  26
    2.4.1 Definitions and notation  26
    2.4.2 Inverse, pseudoinverse and other functions  27
    2.4.3 Projections  28

3 Testing entropic inequalities  29
  3.1 Entropic inequality constraints  29
  3.2 Entropy estimation  32
    3.2.1 Introduction to estimators  32
    3.2.2 Maximum likelihood estimation  35
    3.2.3 Minimax estimation  36
    3.2.4 MLE and minimax estimator for entropy  37
    3.2.5 Comparison of MLE and minimax estimator for entropy  40
    3.2.6 Conclusion  46
  3.3 Hypothesis tests  47
    3.3.1 Introduction to hypothesis tests  47
    3.3.2 Direct approach  52
    3.3.3 Indirect approach (bootstrap)  58
    3.3.4 Additional inequalities  64
    3.3.5 Summary  68

4 Tests based on generalized covariance matrices  71
  4.1 Introduction  71
  4.2 Encoding probability distributions in matrices  73
    4.2.1 One- and two-variable matrices  73
    4.2.2 The compound matrix  74
  4.3 The inequality  77
    4.3.1 Motivation by covariances for the triangular scenario  77
    4.3.2 General inequality for hidden common ancestor models  79
    4.3.3 An equivalent representation  80
    4.3.4 Covariances revisited  86
  4.4 Proving the inequality  88
    4.4.1 Invariance under local transformations  88
    4.4.2 Proof for a special family of distributions  93
    4.4.3 Counter example  100
    4.4.4 Generating the whole scenario by local transformations  102
    4.4.5 Brief summary of the proof  110
  4.5 Comparison between matrix and entropic inequality  112
    4.5.1 Analytical investigation  113
    4.5.2 Numerical simulations  120
    4.5.3 Hypothesis tests  125
    4.5.4 Summary  130

5 Application to the iris data set  132
  5.1 The iris data set  132
  5.2 Discretizing the data  133
  5.3 Proposing a model  135
  5.4 Rejecting the proposed model  137

6 Conclusion and outlook  141

A Generalized proof of Theorem 4.1  145
  A.1 Proof for a special family of distributions  145
    A.1.1 General notation  145
    A.1.2 The proposition  147
    A.1.3 The proof  148
  A.2 Locally transforming A^j = {A^j_x}_x → A^j_0  158

B Proof of Corollary 4.2  161
1 Introduction
The scope of this thesis is causal inference, the mathematical theory of
‘what causes what’. Even though causal inference, or in general the concept
of causation, is basic to human thinking and has a long philosophical history,
a solid mathematical theory has long been missing. Note that the following
brief overview of the philosophical background is mainly based on the two
secondary sources [1] and [2]. Similarly, large parts of the mathematical
history are based on the epilogue of [3].^1
1.1 Philosophical and mathematical background
Philosophical theories about causation date back at least to Aristotle, according to whom “we do not have knowledge of a thing until we have grasped
its why, that is to say, its cause” [4]^2. Aristotle distinguishes four fundamental ‘causes’, or answers to ‘why’ questions, the material cause (“that out
of which”), the formal cause (“the form”), the efficient cause (“the primary
source of the change or rest”) and the final cause (“the end, that for the
sake of which a thing is done”) [4]^2. In modern science the term ‘cause’
typically refers to Aristotle’s ‘efficient cause’ as it comes closest to today’s
understanding of the phrase ‘X causes Y’. It would seem odd to say that the
material or the shape of an object caused the object.
An important work on causation of the modern era is A Treatise of Human
Nature [5] by the Scottish philosopher David Hume. Before Hume, the traditional view on causation was predominantly rationalistic. It was assumed
that causal relations, being intrinsic truths of nature, could be inferred by
pure reasoning. Hume, on the other hand, advocated an empirical theory
[2]. “Thus we remember to have seen that species of object we call flame,
and to have felt that species of sensation we call heat. We likewise call to
mind their constant conjunction in all past instances. Without any farther
ceremony, we call the one cause and other effect, and infer the existence of
the one from that of the other” [5]^3. One severe problem of Hume’s theory is that from the ‘principle of constant conjunction’ any two regularly co-occurring events are identified as directly causally connected. However, it is also possible that the connection between the two events is due to a common cause. Today, this falls under the concept of ‘spurious correlation’ and is related to the statement ‘correlation does not imply causation’.

^1 Quotes from primary sources have been adopted from these secondary sources. At each quote we give a reference to the supposed primary source (if available) and use a footnote to indicate the secondary source in which the quote was found.
^2 Quoted from [1].
^3 Quoted from [3].
The list of philosophers contributing to the discussion about causation is
long, including Aquinas, Descartes, Hobbes, Spinoza, Leibniz, Locke, Newton, Kant, Mill and others [2]. But gaining ground in mathematics, or
modern science in general, turned out to be more difficult. In 1913 Bertrand
Russell wrote “All philosophers imagine that causation is one of the fundamental axioms of science, yet oddly enough, in advanced sciences, the word
‘cause’ never occurs.... The law of causality, I believe, is a relic of bygone
age, surviving, like the monarchy, only because it is erroneously supposed to
do no harm”^4. Also Karl Pearson, a founder of mathematical statistics, in
the third edition of his book The Grammar of Science [6], “strongly denies
the need for an independent concept of causal relation beyond correlation”
and “exterminated causation from modern statistics before it had chance to
take root” [3]. A major advance was brought about by Sir Ronald Fisher
who established the randomized experiment, a scientific method for testing
causal relations based on real data [7]. To illustrate the idea, assume that
the efficacy of a new drug is to be tested. From a purely observational study
the conclusion is drawn that the drug is beneficial to recovery. But in fact,
it might be that both taking the drug and the chance of recovery are independently influenced by a person's social and financial background. In order
to identify the actual effect of the drug, the treatment should be assigned
at random, thereby excluding background influences.
Modern theories of causal inference include Granger causality [8] and the Rubin causal model (or Neyman-Rubin causal model) [9, 10]. Granger causality
uses temporal information to infer the causal relation between two variables,
but, as in Hume’s philosophical theory, the result may be misleading if a
third variable is involved. The Rubin causal model measures the causal effect of X on Y for a single unit u, e.g. a person, as the difference in Y (at
time t_2) given different treatments X (at time t_1), i.e. Y_{x_1}(u) − Y_{x_0}(u) for binary X. Since at time t_1 one single unit can only be exposed to one of the
treatments (e.g. take the drug or not take the drug) only one of the values
can be measured. Sometimes this problem is even called the Fundamental Problem of Causal Inference [11]. A solution could be to measure the other value for a ‘similar’ unit, but then drawing conclusions for the original unit requires additional assumptions.

^4 Quoted from [3].
According to Pearl, there are two reasons for the slow progress of mathematical theories and the caution with which causal inference is often treated.
1. Whereas a causal statement like ‘X causes Y’ is directed, algebraic
equations like Newton’s law F = ma are undirected. The equation
does not tell us whether the force causes the acceleration or vice versa,
as it can be brought into several equivalent forms. Thus, from a purely
algebraic description, it is impossible to infer (the direction) of causation.
2. From a fundamental point of view the causal effect of X on Y can be
understood as the change of Y when externally manipulating X. In
probability theory, the typical language of causation, however, such
manipulative statements cannot be expressed [3].
Pearl’s solution to the second problem is to introduce a completely new calculus to probability theory, the do-calculus [3]. First, he introduces the new
expression P (y | do (x)) which is read as ‘the probability that y occurs given
that we fix X to x’. This is in general different from the typical conditional
probability P (y | x) where x is only observed. Second, Pearl provides several rules to manipulate expressions involving the do-symbol with the aim
to eliminate them from the equation and thus make the final expression
evaluable by traditional statistical means [3]. This is a remarkable result as
it means that causal effects can in some cases be inferred from purely observational data. Note that Fisher’s randomized experiment also uses Pearl’s
idea of intervening in the system. Not letting the patients themselves decide
whether or not to take the drug, but assigning the treatment by an external rule, corresponds to applying one of the do-statements do (treatment) or
do (placebo).
While Pearl’s do-calculus is not the subject of this thesis, the solution of
the first problem brings us closer to our utilized framework. The idea is to
encode the assumptions about the causal relations between the variables in a
graphical model, also called a (causal) Bayesian network. These assumptions
can be translated to (conditional) independence statements. Of course one
could also list all these independence relations without the graph, but there
are at least two reasons for the use of a graphical representation. First, when
specifying a model it is much easier to think in terms of graphs, where one
can simply connect any two variables one assumes to have a direct causal
relation. In particular larger models can imply non-obvious independence
statements that can algorithmically be obtained from the graph. Second, it
may (and will) happen that different causal assumptions (on the same set of
variables) lead to the same conditional independence relations. For example,
the graphs X → Y → Z and X ← Y → Z both imply that X and Z are conditionally independent given Y (written X ⊥⊥ Z | Y), while no other independence relations hold. To distinguish such models, interventions in
the spirit of Pearl’s do-calculus are required. Since for different graphs the
effect of the intervention might be different (otherwise one could still not
distinguish the models), the graphical representation is indeed necessary.
Even though not all models are distinguishable without intervening in the
system, one can still obtain some knowledge about the causal structure even
without such interventions. This is precisely the subject of this thesis. The
whole issue becomes dramatically more challenging if some of the variables
are not observable (also called hidden or latent) [12, 13, 14, 15]. Any independence statement including hidden variables cannot be evaluated from
the empirical data, which renders testing these statements impossible. The
independence relations containing only observable variables, if they exist at
all, might carry only little information. Thus, one strives to derive stronger
constraints on the marginal distributions of the observed variables, typically
in the form of inequalities. If for some data such an inequality is violated,
the proposed model can be rejected as an explanation of the data. If no
violation is found, one can unfortunately not conclude to have found the
one correct model, first, since other models might also be compatible with
the data, second, since the inequalities are typically not tight (in the sense
that an inequality might be satisfied even though the data are incompatible
with the model), and third, since the number of inequalities constraining
the model might be very large, so that it is impractical to test all of them.
It was mentioned above that for example the models X → Y → Z and
X ← Y → Z are indistinguishable. The model X → Y ← Z, on the other
hand, implies different constraints, namely that X and Z are unconditionally independent but conditionally dependent given Y . Thus, this model can
be distinguished from the other two by purely observational data. Testing
inequality constraints for real data amounts to a statistical hypothesis test
and requires reliable estimation of the involved quantities. Hence, there are
two aspects of identifying possible causal models (Bayesian networks) from
marginal observations: (1) Deriving inequality constraints for the observable
marginal distributions of a proposed network, and (2) statistically testing
these inequalities. Both aspects are examined in this thesis.
1.2 Outline
The starting point of this thesis is a hypothesis test, recently proposed in [16], for a specific causal model. The test is based on an entropic inequality
constraint introduced in the same paper. In addition to the arguably rather
disappointing power of the test, a heuristic was used in its construction,
which implies that it is not actually known whether the type-I-error rate
meets the design rate of 5%. Note that all required concepts are thoroughly
introduced later. The goal of this thesis is threefold. First, we want to
improve the hypothesis test from [16], both in terms of its power as well as
its reliability (by which we mean the control of the type-I-error rate). To
this end, we consider recently introduced, advanced techniques of entropy
estimation [17, 18], additional entropic inequality constraints that were already introduced in [16] but not implemented in the hypothesis test, and an
alternative approach to the hypothesis test itself. As a final means we leave
the entropic framework and derive analogous inequality constraints based
on certain generalized covariance matrices. While this is motivated by the
search for a more powerful hypothesis test, deriving the new type of inequalities is interesting on its own and can thus be considered the second goal
of this thesis. The third goal is an application of the developed hypothesis
tests to real empirical data.
The rest of this thesis is organized as follows: The required basic graph theoretical and mathematical concepts are introduced in Chapter 2. Estimating
entropies and constructing hypothesis tests based on entropic inequalities
is the subject of Chapter 3. The derivation of the above mentioned matrix
inequalities as well as a comparison to the entropic framework is pursued in
Chapter 4. The application to the ‘iris data set’ is presented in Chapter 5.
Finally, the thesis is concluded in Chapter 6.
2 Basic concepts
This chapter provides an introduction to the basic mathematical and graph
theoretical concepts required for the rest of the thesis. More specialized concepts will be presented along the text before they are needed. We start with
a short overview of probability theory in Section 2.1. This is followed by
an introduction to directed acyclic graphs (DAGs), which are used to model
causal assumptions, in Section 2.2. In particular, the hidden common ancestor models that are considered throughout the whole thesis are introduced
in this section. Section 2.3 provides a brief overview of the information theoretical concepts that are required for Chapter 3. The basics of the matrix
framework employed in Chapter 4 are introduced in Section 2.4.
2.1 Introduction to probability theory
Since our aim is to constrain probability distributions of variables that follow
a given causal model, probability theory is the basic language used in this
thesis.
2.1.1 Discrete random variables
Consider a discrete random variable A with outcomes a1 , ..., aK . The set
{a1 , ..., aK } is called the alphabet of A and likewise K is called the alphabet
size. For all variables considered in this thesis, K is assumed to be finite.
To each outcome we assign a probability
$$0 \le P(A = a_i) \le 1, \qquad (2.1)$$
with the normalization constraint
$$\sum_{i=1}^{K} P(A = a_i) = 1. \qquad (2.2)$$
Several alternative notations for P (A = ai ) will be used throughout the
thesis. A first measure to keep expressions short, is to write PA (ai ). If the
variable is clear from the context the name of the variable might be dropped
completely, leaving us with P(a_i). As another frequently used shorthand notation, or when referring to the distribution itself, we also write P(A). In
addition, when the specific values ai are not important (i.e. when they only
appear as labels inside of probabilities like P (A = ai )), we usually assume
integer values and write P (A = i). In general, the notation should always
be clear from the context or will be explained at the corresponding position.
2.1.2 Joint and marginal distributions
When considering two random variables A and B, the joint probability that both A = a_i and B = b_j occur is written as P(A = a_i, B = b_j). The distribution of a single variable can be calculated using the law of total probability,
$$P(A = a_i) = \sum_{j} P(A = a_i, B = b_j). \qquad (2.3)$$
This summation is also called marginalization (over B) and the resulting
distribution P (A) is called the marginal distribution of A. It is easy to
check that the probabilities P (A = ai ) indeed satisfy conditions (2.1) and
(2.2), assuming that the joint distribution satisfies them. In general, for
n random variables the joint distribution is referred to as P (A1 , ..., An ).
The distribution of any subset of variables may then be called the marginal
distribution of these variables.
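To make the marginalization in (2.3) concrete, here is a minimal sketch in Python with NumPy (the language and the joint distribution are illustrative choices, not taken from the thesis): the marginal of A is obtained by summing the joint distribution over the outcomes of B.

```python
import numpy as np

# Joint distribution P(A, B): rows indexed by a_i, columns by b_j.
# The numbers are an arbitrary example, not from the thesis.
P_AB = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

# Marginalization (2.3): sum over the outcomes of B.
P_A = P_AB.sum(axis=1)

print(P_A)          # [0.3 0.7]
print(P_A.sum())    # 1.0, so normalization (2.2) is preserved
```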
2.1.3 Conditional probabilities and (conditional) independence
The distribution of a variable A may change conditioned on the observation
of another variable B. We write P (A = ai | B = bj ) or simply P (A | B) to
denote the conditional probability of A given B. The joint distribution can
be decomposed as
P (A, B) = P (A | B) P (B) .
(2.4)
If we find P (A | B) = P (A) (meaning P (A = ai | B = bj ) = P (A = ai )
∀i, j) the variables are called independent. In that case, one also finds
P (B | A) = P (B) and the joint distribution factorizes according to
P (A, B) = P (A) P (B) .
(2.5)
Independence statements often include conditioning on a third variable. A and B are said to be conditionally independent given C, also written as A ⊥⊥ B | C, if the conditional distribution factorizes according to
$$P(A, B \mid C) = P(A \mid C)\, P(B \mid C). \qquad (2.6)$$
Since conditional distributions are also valid probability distributions satisfying (2.1) and (2.2), (2.6) is a straightforward generalization of (2.5).
Generalizations to larger sets of variables are likewise straightforward.
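As a small numerical illustration of (2.6), the following sketch (Python/NumPy, with made-up conditional distributions) constructs a joint distribution in which A and B are conditionally independent given C and verifies the factorization for every value of C.

```python
import numpy as np

# Conditional distributions (columns indexed by c); illustrative values only.
P_A_given_C = np.array([[0.2, 0.7],
                        [0.8, 0.3]])
P_B_given_C = np.array([[0.5, 0.1],
                        [0.5, 0.9]])
P_C = np.array([0.4, 0.6])

# Joint distribution P(A, B, C) built so that A and B are independent given C.
P_ABC = np.einsum('ac,bc,c->abc', P_A_given_C, P_B_given_C, P_C)

# Check (2.6): P(A, B | C) = P(A | C) P(B | C) for all values of C.
P_AB_given_C = P_ABC / P_ABC.sum(axis=(0, 1), keepdims=True)
product = np.einsum('ac,bc->abc', P_A_given_C, P_B_given_C)
print(np.allclose(P_AB_given_C, product))   # True
```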
2.1.4 Expected value, variance and covariance
The expected value of a random variable is defined as
$$E[A] = \sum_i a_i\, P(A = a_i). \qquad (2.7)$$
The variance can then be written as
$$\mathrm{Var}[A] = E\!\left[|A - E[A]|^2\right] = E\!\left[|A|^2\right] - |E[A]|^2 = \sum_i |a_i|^2 P(a_i) - \Big|\sum_i a_i P(a_i)\Big|^2 \;\ge\; 0. \qquad (2.8)$$
The non-negativity can be seen right from the first line, since the expectation
value of a non-negative quantity will also be non-negative. For the sake of
generality we allow complex valued alphabets here. The complex conjugate
of x ∈ C is denoted by x∗ . As a generalization for two random variables one
defines the covariance as
$$\mathrm{Cov}[A, B] = E\!\left[(A - E[A])^* (B - E[B])\right] = E[A^* B] - E[A]^* E[B] = \sum_{i,j} a_i^*\, b_j \left[P(a_i, b_j) - P(a_i)\, P(b_j)\right]. \qquad (2.9)$$
Note that for complex variables we obtain Cov [B, A] = Cov [A, B]∗ instead
of full symmetry. If A and B are independent, their joint distribution factorizes and thus Cov [A, B] = 0. The other direction is not true, i.e. depending
on the values ai and bj even dependent variables can have covariance zero.
Further statistical aspects that play a role in Chapter 3 will be introduced
later.
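A direct implementation of (2.9) for finite alphabets might look as follows (Python/NumPy sketch with arbitrary outcome values and probabilities); it also makes explicit that the covariance depends on the outcome values a_i, b_j, in contrast to the generalized matrices introduced later in Chapter 4.

```python
import numpy as np

# Outcome values (alphabets) of A and B and their joint distribution (illustrative).
a = np.array([0.0, 1.0])                 # alphabet of A
b = np.array([-1.0, 2.0])                # alphabet of B
P_AB = np.array([[0.3, 0.2],
                 [0.1, 0.4]])            # P(A = a_i, B = b_j)

P_A = P_AB.sum(axis=1)
P_B = P_AB.sum(axis=0)

# Covariance according to (2.9): sum_{i,j} a_i* b_j [P(a_i, b_j) - P(a_i) P(b_j)].
cov = np.einsum('i,j,ij->', np.conj(a), b, P_AB - np.outer(P_A, P_B))
print(cov)                               # 0.3 for these example numbers
```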
2.2 Bayesian networks
In this section we introduce necessary properties and terminology of graphical models required for the following chapters. For a more detailed treatment
of the topic see for example Pearl [3] or Spirtes, Glymour and Scheines [19].
A nice online introduction can be found in the Stanford Encyclopedia of
Philosophy on the topic of Probabilistic Causation [20].
Causal assumptions on a set of random variables are often modeled using so-called directed acyclic graphs (DAGs). Each random variable is represented
by one node (or vertex). A directed edge between two nodes indicates direct
causal influence from one variable on the other. The whole graph being
directed means that each edge has exactly one arrowhead. A graph is called
acyclic if there exists no directed path from one variable to itself (e.g. A → B → C → A), i.e. no variable may be its own cause. This also implies that
if A causes B, B cannot simultaneously cause A. When dealing with DAGs,
one often uses genealogical terminology to indicate the relation between
variables. If there exists a directed path from A to B, A is called an ancestor
of B and B a descendant of A. If the path has length one, i.e. if there is
a direct link from A to B, we call them parent and child. In the DAG
A → B → C, for example, B is a child of A and a parent of C. Furthermore
C is a descendant of A and likewise A is an ancestor of C. We denote the
set of parents of a variable A by PA (A) and likewise the sets of descendants
and non-descendants by D (A) and ND (A).
Note that DAGs can be defined independently of any causal interpretation.
In the first place, the DAG is used to encode conditional independence relations, e.g. in the DAG A → B → C, A and C are conditionally independent
given B. The total model consisting of the DAG and its implied independence relations is called a Bayesian network. The causal interpretation of
a Bayesian network is threefold. First, it is simply convenient and in some
sense natural to think of the edges as causal links. Second, if interventions
in the spirit of Pearl's do-calculus are considered, additional assumptions concerning the locality of these manipulations allow for a causal interpretation. For more details, see Pearl's definition of a causal Bayesian network
[3]. Finally, and most relevant for this thesis, if we find a violation of the
constraints implied by the DAG in real data, the model can be rejected as
an explanation for generating the data regardless of a possible causal interpretation. This means that we are in particular able to falsify causal
assumptions. The causal interpretation becomes more relevant when trying
to verify causal effects rather than to falsify them. Distinguishing causality
from mere correlations can be a delicate task.
2.2.1 Markov condition
A list of fundamental independence relations implied by the DAG is given
by the Markov condition. The Markov condition states that any variable
should be conditionally independent of its non-descendants given its parents, written as A ⊥⊥ ND(A) | PA(A). In particular, once the parents of
A are known, further knowledge of more distant ancestors does not change
the distribution of A anymore. More distant ancestors have only an indirect
influence on A by affecting A’s parents or other less distant ancestors. These
independence relations imply that the joint distribution of all variables factorizes according to
$$P(A^1, \ldots, A^n) = \prod_{j=1}^{n} P\!\left(A^j \mid \mathrm{PA}(A^j)\right). \qquad (2.10)$$
As an example, consider the so-called instrumentality DAG in Figure 1. The Markov condition implies the independence relations C ⊥⊥ λ and B ⊥⊥ C | A, λ. The total distribution can be written as
$$P(A, B, C, \lambda) = P(A \mid C, \lambda)\, P(B \mid A, \lambda)\, P(C)\, P(\lambda). \qquad (2.11)$$
Figure 1: Instrumentality DAG (edges C → A, λ → A, A → B, λ → B). The instrument C can, under certain assumptions, be used to infer the causal effect of A on B [16, 21, 22, 23]. The variable λ comprises all additional influences on A and B and may be unobserved (see Subsection 2.2.3 for hidden variables). Here, the DAG serves simply as an illustration for the Markov condition.
Note that the conditional independence relations given by the Markov condition may imply further independence relations that can algorithmically be
obtained by the so-called d-separation criterion [3]. The Markov condition
is of particular importance for us, since any distribution compatible with the
DAG has to factorize according to (2.10). Violation of this factorization, or of any constraint derived from it, is a proper witness of incompatibility of the data with the assumed causal model.
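To illustrate the factorization (2.10), and (2.11) in particular, the following sketch draws samples from the instrumentality DAG of Figure 1 by sampling each node given its parents (ancestral sampling). The concrete conditional distributions are arbitrary illustrative choices, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_instrumentality(n):
    """Draw n observable samples (C, A, B) from the instrumentality DAG of Figure 1,
    following P(A,B,C,lambda) = P(A|C,lam) P(B|A,lam) P(C) P(lam), eq. (2.11)."""
    samples = []
    for _ in range(n):
        lam = rng.integers(0, 2)                  # P(lambda): uniform on {0, 1}
        C = rng.integers(0, 2)                    # P(C): uniform on {0, 1}
        # P(A | C, lambda): a noisy function of both parents (illustrative)
        A = (C ^ lam) if rng.random() < 0.9 else rng.integers(0, 2)
        # P(B | A, lambda): a noisy function of both parents (illustrative)
        B = (A & lam) if rng.random() < 0.9 else rng.integers(0, 2)
        samples.append((C, A, B))                 # lambda is hidden, so not returned
    return np.array(samples)

data = sample_instrumentality(1000)
print(data[:5])
```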
2.2.2 Faithfulness assumption
The Markov condition is a sufficient but not necessary condition for conditional independence [20], in the sense that in data that are compatible with
the DAG any conditional independence implied by the Markov condition
(and hence also the d-separation criterion) will hold, but additional independence relations are possible. The faithfulness assumption states that the
Markov condition should also be necessary, i.e. that there exist no independence relations other than those implied by the Markov condition. This
can also be understood such that all edges in the graph are indeed required.
For example in the graph A → B, the Markov condition implies no independence relations at all and the faithfulness assumption states that we should
then indeed find a dependence between the two variables. If we find that A
and B are independent, the graph and the distribution are said to be not
faithful to one another.
For another illustration, consider the case that a distribution is compatible
with more than one DAG and that for some reason one has to decide which
graph is ‘the correct one’. Loosely speaking the faithfulness assumption
would suggest to choose the most simple one. Complex graphs that allow
more dependence relations than actually observed, could be regarded as
overfitting the data. In that sense the faithfulness assumption would be a
formal version of Occam’s razor [20]. However, even with the faithfulness
assumption it is unlikely that a unique graph can be identified. A simple
example of several equally complex graphs that entail the same conditional
independence relations was already given in the introduction, namely A →
B → C, A ← B ← C and A → B ← C which all imply A ⊥
⊥ C | B.
The faithfulness assumption is used in many theorems and algorithms in
causal inference [20, 24, 3, 19] and from an ideal point of view the assumption
can be justified since distributions that are not faithful to a DAG have
Lebesgue measure zero [24]. However, for practical purposes the faithfulness
assumption is also subject to criticism [25, 24]. Fortunately, the approach
followed in this thesis does not require the faithfulness assumption since
we are not trying to decide between different Markov equivalent DAGs but
rather to reject a given DAG (and thus all its Markov equivalents). For this
purpose, violation of the Markov condition is a sufficient criterion.
2.2.3 Hidden variables
Variables that are too complex to be properly characterized (e.g. comprising incomplete background knowledge) or that can simply not be observed
due to other (maybe practical) reasons, have to be included in the model as
so-called hidden or latent variables. Variables that are not hidden are called
observables in this thesis. As an example, in the debate of smoking as a
cause of lung cancer, one could think of an alternative model where a gene
is the common cause of both the cancer and a strong craving for nicotine.
Since we do not even know whether or not such a gene exists, this common
cause has to be treated as a hidden variable. Hidden variables can substantially complicate the task of causal inference. Independence statements
including hidden variables for obvious reasons cannot be evaluated (from
empirical data). The remaining independence relations, if they exist, may
carry only little information. Also, for large DAGs and alphabets, testing
all accessible independence relations might become impractical. Concerning
the distribution of the remaining observables, the simple product structure
given by the Markov condition (see (2.10)) gets lost due to marginalization
over the hidden variables. Considering n observables A1 , ..., An and m hidden variables λ1 , ..., λm , the distribution of the observables can be written
as
$$P(A^1, \ldots, A^n) = \sum_{\lambda_1, \ldots, \lambda_m} \prod_{j=1}^{n} P\!\left(A^j \mid \mathrm{PA}(A^j)\right) \prod_{k=1}^{m} P\!\left(\lambda_k \mid \mathrm{PA}(\lambda_k)\right). \qquad (2.12)$$
The set of all distributions of this form can have a highly complex geometrical structure [12, 13, 14, 15]. In particular, this set will in general be
non-convex, meaning that if two distributions P1 and P2 are of the above
form, then a mixture of these distributions will in general not be of that
form. Typically one aims to find an outer approximation (or if possible a
precise description) to this set given in the form of inequality (and equality)
constraints. A distribution violating such an inequality will then automatically fail to be compatible with the DAG. Figure 2 illustrates these set
relations.
Figure 2: Illustration of set inclusions for distributions compatible with models including hidden variables. The true set has such a complex structure
that deciding membership becomes unfeasible. The four black curves correspond to inequalities that (upper) bound the correlations (or more general
dependence relations) between the observables. The set corresponding to
the ‘inequality description’ is the set of distributions satisfying all four inequalities. Violation of any inequality is evidence of non-membership to the
true set of distributions.
2.2.4 Hidden common ancestor models
Consider the case that there are no direct causal links between the observable variables but all correlations are mediated by hidden common ancestors.
Furthermore, assume that all ancestors are independent of each other and
that the observables do not affect the hidden variables (i.e. the hidden variables have only outgoing edges). We call such a model a hidden common
ancestor model. Distributions compatible with this scenario are of the form
$$P(A^1, \ldots, A^n) = \sum_{\{\lambda_x\}_x} P\!\left(A^1 \mid \{\lambda_x\}_{x|_{A^1}}\right) \cdots P\!\left(A^n \mid \{\lambda_x\}_{x|_{A^n}}\right) \prod_x P(\lambda_x). \qquad (2.13)$$
The set {λ_x}_x contains all hidden ancestors and {λ_x}_{x|_{A^j}} all ancestors of the observable A^j. Note that we will often index ancestors with the names
of the observables they are connecting. To distinguish such ‘set indices’ from
the usual integer indices, we will use the letters x, y, z for the former and
i, j, k, ... for the latter. In fact, this distinction is mainly required for the notation used in Appendix A. At this point it serves primarily to ensure a consistent notation throughout the whole document.
In a hidden common ancestor model, an ancestor of only one observable puts
no constraints on the observable distribution since the marginalization will
only affect one term, e.g.
$$\sum_{\lambda_0} P\!\left(A^1 \mid \{\lambda_x\}_{x|_{A^1}}, \lambda_0\right) P(\lambda_0) = P\!\left(A^1 \mid \{\lambda_x\}_{x|_{A^1}}\right). \qquad (2.14)$$
Any distribution P(A^1 | {λ_x}_{x|_{A^1}}) can be obtained by just not letting λ_0 have any effect at all. In that sense λ_0 can always be absorbed into A^1
or its other ancestors. Likewise, for the DAGs considered in this section,
one ancestor common to all observables does not constrain the observable
distribution. To see this, first realize that the joint distribution can be
decomposed according to
$$P(A^1, \ldots, A^n) = \sum_{\lambda} P(A^1 \mid \lambda) \cdots P(A^n \mid \lambda)\, P(\lambda). \qquad (2.15)$$
Now, think of λ as being composed of n subvariables λj , one for each observable Aj and also with the corresponding alphabet size. If we let Aj be
deterministically dependent on λj , i.e.
$$P(A^j = k_j \mid \lambda_1 = l_1, \ldots, \lambda_n = l_n) = P(A^j = k_j \mid \lambda_j = l_j) = \delta_{k_j l_j}, \qquad (2.16)$$
we obtain
$$P(A^1 = k_1, \ldots, A^n = k_n) = \sum_{l_1, \ldots, l_n} \delta_{k_1 l_1} \cdots \delta_{k_n l_n}\, P(\lambda_1 = l_1, \ldots, \lambda_n = l_n) = P(\lambda_1 = k_1, \ldots, \lambda_n = k_n). \qquad (2.17)$$
By choosing this deterministic dependence between the ancestor λ and the
observables A1 , ...An , the latter simply inherit the distribution of the former. Essentially, we simulated a collection of variables by one larger variable. Thus, any distribution can be realized by one ancestor common to all
variables.
Figure 3: The triangular scenario. Three observables A, B, C with one hidden common ancestor for each pair (λ_AB, λ_AC, λ_BC).
The simplest non-trivial example consists of three observables and two ancestors, each connecting one pair of observables. If the last pair is also connected by a third ancestor, one obtains the so-called triangular scenario (see Figure 3). Even though it is one of the simplest examples, the structure of distributions compatible with the triangular scenario is already highly complex.
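As an illustration of the form (2.13), the short sketch below (Python/NumPy, with arbitrary response functions) generates observable data from the triangular scenario: each pair of observables shares one independent hidden ancestor, and each observable is a noisy function of its own two ancestors only.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_triangular(n, noise=0.1):
    """Draw n observable samples (A, B, C) from the triangular scenario of Figure 3."""
    lam_ab = rng.integers(0, 2, size=n)     # three independent hidden ancestors
    lam_ac = rng.integers(0, 2, size=n)
    lam_bc = rng.integers(0, 2, size=n)

    def noisy(x):
        # with probability `noise`, replace the outcome by a fresh fair coin flip
        return np.where(rng.random(n) < noise, rng.integers(0, 2, size=n), x)

    # Each observable depends only on its two hidden ancestors (illustrative choice).
    A = noisy(lam_ab ^ lam_ac)
    B = noisy(lam_ab & lam_bc)
    C = noisy(lam_ac | lam_bc)
    return A, B, C

A, B, C = sample_triangular(10_000)
print(A.mean(), B.mean(), C.mean())
```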
In general, two observables that share no ancestor are independent. To
mathematically confirm this intuitive statement, consider the bi-partite
marginals of the general distribution (2.13),
$$\begin{aligned}
P(A^j, A^k) &= \sum_{\{\lambda_x\}_{x|_{A^j}} \cup \{\lambda_x\}_{x|_{A^k}}} P\!\left(A^j \mid \{\lambda_x\}_{x|_{A^j}}\right) P\!\left(A^k \mid \{\lambda_x\}_{x|_{A^k}}\right) \prod_{j \in x \,\vee\, k \in x} P(\lambda_x) \\
&= \Bigg(\sum_{\{\lambda_x\}_{x|_{A^j}}} P\!\left(A^j \mid \{\lambda_x\}_{x|_{A^j}}\right) \prod_{j \in x} P(\lambda_x)\Bigg) \Bigg(\sum_{\{\lambda_x\}_{x|_{A^k}}} P\!\left(A^k \mid \{\lambda_x\}_{x|_{A^k}}\right) \prod_{k \in x} P(\lambda_x)\Bigg) \\
&= P(A^j)\, P(A^k). \qquad (2.18)
\end{aligned}$$
From the first to the second line we assumed that {λ_x}_{x|_{A^j}} and {λ_x}_{x|_{A^k}} are disjoint sets, i.e. that A^j and A^k have no common ancestor. The notation
j ∈ x ∨ k ∈ x shall indicate that the product runs only over ancestors
of the observables Aj and/or Ak . Any distribution that violates such an
independence statement implied by a given DAG cannot be compatible with
that DAG.
2.3 Information theory
The inequality constraints encountered in Chapter 3 are given in terms of
entropies of the observable variables. Here, we provide a brief introduction
to that topic. More details can for example be found in [26, 27, 28].
2.3.1 Shannon entropy
The Shannon entropy of a probability distribution is defined as
$$H(A) = E[-\log P(A)] = -\sum_{i=1}^{K} P(A = a_i) \log P(A = a_i). \qquad (2.19)$$
Entropy is a measure of randomness or uncertainty of a distribution. Since
log 1 = 0 and x log x → 0 as x → 0, the entropy of a deterministic distribution P(A = a_i) = δ_{ij}, where one outcome occurs with certainty, is zero. The maximal value log K is obtained for the uniform distribution P(A = a_i) = 1/K.
Entropy can also be understood as a measure of information gained from
an observation. The more random the distribution P (A), the more will be
learned by conducting an experiment with this underlying distribution. For
example, we learn more by flipping a fair coin than by flipping a manipulated coin that always shows heads. In the latter case we learn nothing
since we already knew the result beforehand. In the context of information,
− log P (A = ai ) is also called the information content of the outcome ai , and
entropy is the information content of the whole distribution P (A). Note that
outcomes with very large information content are suppressed due to their
small probability. Also note that the entropy is independent of the actual
alphabet (i.e. the outcome values ai ), since only the probabilities P (A = ai )
appear. Since 0 ≤ P (A = ai ) ≤ 1, − log P (A = ai ) is always non-negative
leading to H (A) ≥ 0. Contrary to the typically employed base-2 logarithm
in information theory, we employ the natural logarithm here. The difference
is merely a constant factor. The unit of the entropy with base-2 logarithm
is called bit, with the natural logarithm nat.
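A direct implementation of (2.19) in nats might look as follows (Python/NumPy sketch); zero probabilities are handled via the convention x log x → 0.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy (2.19) in nats of a probability vector p."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                        # convention: 0 * log 0 = 0
    return -np.sum(nz * np.log(nz))

print(shannon_entropy([1.0, 0.0]))       # 0 (deterministic distribution)
print(shannon_entropy([0.25] * 4))       # log(4) ~ 1.386 (uniform distribution)
```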
2.3.2 Joint and conditional entropy
Entropy can easily be generalized to joint distributions. The joint entropy
of the variables A1 , ..., An is simply
$$H(A^1, \ldots, A^n) = -\sum_{a_1, \ldots, a_n} P(a_1, \ldots, a_n) \log P(a_1, \ldots, a_n). \qquad (2.20)$$
Note that the expression is symmetric in the sense that H (A, B) = H (B, A)
(for notational convenience we consider only two variables from now on).
As for probabilities, one can decompose the joint entropy in terms of a
conditional entropy H (B | A) and the marginal entropy H (A),
$$H(A, B) = H(B \mid A) + H(A), \qquad (2.21)$$
$$\text{with} \quad H(B \mid A) = -\sum_{a} P(a) \sum_{b} P(b \mid a) \log P(b \mid a). \qquad (2.22)$$
H (B | A) can be understood as the average information that we gain from
learning B when we already know A. If for example B is deterministically
dependent on A (i.e. P (b | a) ∈ {0, 1}), then H (B | A) = 0. In general, the
following inequality relations hold:
1. H(B | A) ≥ 0   (2.23)
2. H(A, B) ≥ H(A)   (2.24)
3. H(A, B) ≤ H(A) + H(B), with equality iff A ⊥⊥ B   (2.25)
4. H(B | A) ≤ H(B), with equality iff A ⊥⊥ B   (2.26)
Note that, as usual, ‘iff’ is short hand for ‘if and only if’. The first inequality
follows directly from the definition (2.22) and log P (b | a) ≤ 0. The second
inequality follows from the first one inserted in (2.21). Due to the symmetry
of H(A, B), H(B) is a lower bound as well. The inequality represents the intuitive statement that the uncertainty of two variables should be at least as large as the uncertainty of each single variable. A proof of inequality number
three, which states that the total uncertainty cannot be larger than the sum
of the single uncertainties, can for example be found in [26]. The fourth
inequality follows from the third one (in fact they are equivalent) and (2.21).
It says that the uncertainty about B does not grow when learning A.
2.3.3 Mutual information
The mutual information shared by two variables is defined as
$$I(A; B) = \sum_{a,b} P(a, b) \log \frac{P(a, b)}{P(a)\, P(b)}. \qquad (2.27)$$
The definition suggests that mutual information measures the closeness of
the joint distribution P (a, b) to the product of its marginals P (a) P (b).
Thus, it can be considered as a measure of dependence. The mutual information of two variables can be expressed in terms of their entropies via the
relations
$$\begin{aligned}
I(A; B) &= H(A) + H(B) - H(A, B) &\quad (2.28)\\
&= H(A) - H(A \mid B) &\quad (2.29)\\
&= H(B) - H(B \mid A) &\quad (2.30)\\
&= H(A, B) - H(A \mid B) - H(B \mid A). &\quad (2.31)
\end{aligned}$$
A graphical illustration of these relations can be found in Figure 4. Mutual
information satisfies the following bounds:
1. I(A; B) ≥ 0, with equality iff A ⊥⊥ B   (2.32)
2. I(A; B) ≤ H(A)   (2.33)
The bounds follow directly from (2.26) and (2.23) inserted in (2.30) (in
fact (2.32) is equivalent to each of (2.25) and (2.26)). The upper bound
is achieved for deterministically dependent variables in which case we have
H (A) = H (B) = H (A, B). Due to symmetry, H (B) is of course always an
upper bound as well.
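The identity (2.28) is easy to check numerically. The sketch below (Python/NumPy, with a hypothetical joint distribution) computes I(A; B) both from the definition (2.27) and as H(A) + H(B) − H(A, B).

```python
import numpy as np

def H(p):
    """Shannon entropy in nats of a (possibly multi-dimensional) probability array."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

# Hypothetical joint distribution P(A, B).
P_AB = np.array([[0.30, 0.10],
                 [0.05, 0.55]])
P_A, P_B = P_AB.sum(axis=1), P_AB.sum(axis=0)

# Definition (2.27).
mask = P_AB > 0
I_def = np.sum(P_AB[mask] * np.log(P_AB[mask] / np.outer(P_A, P_B)[mask]))

# Entropic expression (2.28).
I_ent = H(P_A) + H(P_B) - H(P_AB)

print(np.isclose(I_def, I_ent))          # True
```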
The conditional mutual information of A and B given a third variable C is
their mutual information given a specific value c of C averaged over all c,
$$I(A; B \mid C) = \sum_{c} P(c) \sum_{a,b} P(a, b \mid c) \log \frac{P(a, b \mid c)}{P(a \mid c)\, P(b \mid c)}. \qquad (2.34)$$
Conditional mutual information also satisfies the bound I (A; B | C) ≥ 0.
In terms of entropies it can for example be expressed as
I (A; B | C) = H (A, C) + H (B, C) − H (A, B, C) − H (C) .
(2.35)
Figure 4: Graphical illustration of the relation between marginal, conditional and joint entropy and mutual information (regions labeled H(A), H(B), H(A|B), H(B|A) and I(A;B), with total area H(A,B)).
Note that A, B and C can also be replaced by sets A^1, ..., A^n, B^1, ..., B^m and C^1, ..., C^l. The (conditional) mutual information I({A^i}_i; {B^j}_j | {C^k}_k) then measures the dependence between the two sets A^1, ..., A^n and B^1, ..., B^m (given the set C^1, ..., C^l).
2.4 Hermitian and positive semidefinite matrices
In Chapter 4 we derive constraints on probability distributions in terms
of certain matrix inequalities. Here, we introduce the necessary matrix-theoretical concepts. Some basic properties of positive semidefinite matrices
can be found in [29]. More details and complete introductions to the topic
can for example be found in [30, 31, 32].
2.4.1 Definitions and notation
A complex valued square matrix M ∈ C^{n×n} is called positive semidefinite, written M ≥ 0, if
$$x^\dagger M x \ge 0 \quad \forall x \in \mathbb{C}^n, \qquad (2.36)$$
where x^† denotes the conjugate transpose (or adjoint) of x, and x is defined as a column vector. When convenient, we use the Dirac notation from quantum mechanics (see e.g. [33]) to denote a vector x as an abstract state |x⟩ ∈ H and its adjoint as ⟨x|. H denotes a general Hilbert space. A matrix
M ∈ C^{n×n} is called hermitian if M^† = M. Any hermitian matrix possesses a spectral decomposition (also called eigenvalue decomposition)
$$M = \sum_{j=1}^{n} \lambda_j\, |j\rangle\langle j|, \qquad (2.37)$$
with real eigenvalues λ_j and orthonormal eigenstates |j⟩. The set of eigenvalues {λ_j}_{j=1}^{n} is also called the spectrum of M. M is positive semidefinite if
and only if M is hermitian and has non-negative spectrum. From there we
can conclude that the determinant of a positive semidefinite matrix, which
is simply the product of its eigenvalues, is non-negative as well.
Positive semidefiniteness induces a partial order among matrices. We say
that M ≥ N if M − N ≥ 0. In general, it need neither be the case that
M ≥ N nor N ≥ M .
The kernel and range of a matrix M are defined as
$$\ker(M) = \{|x\rangle \in \mathcal{H} \mid M|x\rangle = 0\}, \qquad (2.38)$$
$$\mathrm{range}(M) = \{|x\rangle \in \mathcal{H} \mid \exists\, |y\rangle \in \mathcal{H} \text{ with } M|y\rangle = |x\rangle\}. \qquad (2.39)$$
Taking a look at the spectral decomposition (2.37), the range of a hermitian matrix M can be written as the span of the eigenstates corresponding to non-zero eigenvalues, range(M) = span({|j⟩ | λ_j ≠ 0}). Similarly, the kernel can be written as the span of the eigenstates with eigenvalue zero, ker(M) = span({|j⟩ | λ_j = 0}). Since eigenstates corresponding to different eigenvalues are orthogonal, the range and kernel of a hermitian matrix are orthogonal subspaces.
2.4.2 Inverse, pseudoinverse and other functions
Consider a hermitian matrix M = Σ_j λ_j |j⟩⟨j| with λ_j ≠ 0 for all j. The inverse of M can then be calculated as
$$M^{-1} = \sum_{j=1}^{n} \frac{1}{\lambda_j}\, |j\rangle\langle j|. \qquad (2.40)$$
If we allow λ_j = 0 we can define the pseudoinverse M^+ of M,
$$M^{+} = \sum_{j:\, \lambda_j \neq 0} \frac{1}{\lambda_j}\, |j\rangle\langle j|, \qquad (2.41)$$
which is the inverse restricted to the range of M . In general, using the
spectral decomposition, the action of a complex valued function f : C → C
on M can be defined as
$$f(M) = \sum_{j=1}^{n} f(\lambda_j)\, |j\rangle\langle j|, \qquad (2.42)$$
as long as f(λ_j) is properly defined.
2.4.3 Projections
A hermitian matrix P is called a projection if P 2 = P . This property is also
called idempotence. The spectrum of P consists only of the eigenvalues 0
and 1. Thus, the spectral decomposition reads
$$P = \sum_{j:\, \lambda_j = 1} |j\rangle\langle j|. \qquad (2.43)$$
Two projections P1 , P2 are called orthogonal (to each other) if their ranges
are orthogonal subspaces. In that case one obtains
P1 P2 = P2 P1 = 0.
(2.44)
This is not to be confused with a single projection being called orthogonal
which is the case if its range and kernel are orthogonal subspaces. The
latter is always true for hermitian projections and only those are important
in this thesis. A single projection that is not orthogonal is called oblique.
The projection P_M onto the range of a hermitian matrix M can be obtained using the pseudoinverse M^+ via the relations
$$P_M = M^{+} M = M M^{+}. \qquad (2.45)$$
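The relations (2.41) and (2.45) can be verified numerically for a small example. The sketch below (Python/NumPy, with an arbitrary rank-deficient matrix) builds the pseudoinverse from the spectral decomposition (2.37) and checks that M⁺M is idempotent, i.e. a projection onto the range of M; NumPy's built-in np.linalg.pinv is used only as a cross-check.

```python
import numpy as np

# A rank-deficient hermitian matrix (arbitrary example).
M = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])

# Spectral decomposition (2.37) and pseudoinverse (2.41).
eigvals, eigvecs = np.linalg.eigh(M)
inv_vals = np.array([1.0 / l if abs(l) > 1e-12 else 0.0 for l in eigvals])
M_plus = eigvecs @ np.diag(inv_vals) @ eigvecs.conj().T

# (2.45): M_plus @ M is the (orthogonal) projection onto range(M).
P_M = M_plus @ M
print(np.allclose(P_M, P_M @ P_M))                # idempotent, hence a projection
print(np.allclose(P_M, np.linalg.pinv(M) @ M))    # agrees with NumPy's pseudoinverse
```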
3 Testing entropic inequalities
The goal of this chapter is to investigate hypothesis tests based on entropic
inequality constraints that are used to decide compatibility of empirical data
with a given DAG. In particular, we want to improve a hypothesis test of compatibility with the triangular scenario (see Figure 3) that was proposed
in [16]. As a first means to this end, in Section 3.2, we implement recent
techniques of estimating entropies from [17, 18]. In Section 3.3 we show that
the heuristic that was used to construct the hypothesis test in [16] leads to
an unreliable control of the type-I-error rate. To circumvent this problem
we consider an alternative approach to the hypothesis test based on the
relation between hypothesis tests and confidence intervals. At the end of
Section 3.3 we implement this alternative approach for additional entropic
inequalities constraining the triangular scenario, that were derived but not
further considered in [16].
As the very first step, in Section 3.1, we briefly present the method of generating entropic inequalities constraining distributions compatible with a given
DAG introduced in [16]. Section 3.2 also contains a general introduction to
estimation theory. An application of our methods to real data is presented
in Chapter 5.
3.1 Entropic inequality constraints
In Subsection 2.2.3 it was mentioned that DAGs with hidden variables impose non-trivial constraints on the observable marginal distributions. To
characterize the set of distributions compatible with the DAG, one often
has to resort to outer approximations in terms of inequality constraints. Violation of such an inequality allows one to reject the assumed causal model
as an explanation for generating the data. Recently, it has been proposed
to work on the level of entropies of the marginal distributions. The key idea
behind using entropies is that algebraic independence conditions, for example P (A, B) = P (A) P (B), translate into linear conditions on the level of
entropies, H (A, B) = H (A) + H (B) or simply I (A; B) = 0. Working with
linear constraints is arguably much simpler than working with polynomial
constraints. In [16] an algorithm for the entropic characterization of any
DAG has been developed. It consists of the three main steps listed below:
1. List the elementary inequalities.
2. Add the constraints implied by the DAG.
3. Eliminate all entropies including hidden variables or any non-observable
terms.
In the first step, the so-called elementary inequalities constrain the entropies of any set of random variables. We have seen special cases of these
inequalities for the bi- (or tri-) partite case already in Section 2.3. For
the general case consider the set of variables A = {A^1, ..., A^n}. Monotonicity demands H(A \ A^j) ≤ H(A) (cf. (2.24)), implying that the entropy of any set of variables should be at least as large as the entropy of any subset of these variables. The so-called sub-modularity condition demands H(A') + H(A^j, A^k, A') ≤ H(A^j, A') + H(A^k, A') for any subset A' ⊂ A. A comparison with (2.35) reveals that this is equivalent to the non-negativity of the conditional mutual information,
$$I(A^j; A^k \mid A') = H(A^j, A') + H(A^k, A') - H(A^j, A^k, A') - H(A'). \qquad (3.1)$$
Finally, one demands the entropy of the empty set to be zero, H (∅) = 0.
The elementary inequalities are also known as the polymatroidal axioms.
One should note that they provide only an outer approximation to the true
set of possible entropies. A tight description is not generally known [28].
In the second step of the algorithm the conditional independence constraints
of the form I (A; B | C) = 0 implied by the Markov condition (and hence
also the d-separation criterion) are added. The elimination of the hidden
variables from the set of inequalities and equalities can be done by employing
the so-called Fourier-Motzkin elimination [34].
Using this procedure, inequalities for several DAGs have been derived [16].
As a first example, distributions compatible with the instrumentality DAG
from Figure 1, where λ is assumed to be hidden, have to satisfy I (B; C | A)+
I (A; C) ≤ H (A). In fact, this is the only entropic constraint that is not
implied by the elementary inequalities. The number of inequalities on the
level of probabilities, on the other hand, increases exponentially with the
alphabet sizes of the variables [21]. The only drawback of the entropic
characterization is that it is only an outer approximation, i.e. there might
be distributions that are incompatible with the scenario but fail to violate
the entropic inequality. In this sense, entropic inequalities are necessary but not sufficient conditions for the compatibility of given data with an underlying causal model.
As a second example, distributions compatible with the triangular scenario
from Figure 3 have to satisfy the inequality
$$H(A) + H(B) + H(C) - H(A, B) - H(A, C) \le 0 \;\Leftrightarrow\; I(A; B) + I(A; C) \le H(A), \qquad (3.2)$$
and permutations thereof. The inequality can intuitively be understood as
follows (see also [16]). If the mutual information of A and B is large, then
A depends strongly on the ancestor λAB . But then, the dependence of A
on λAC is necessarily small. Since all correlations between A and C are
mediated by λAC , the mutual information of A and C is consequently small
as well. Inequality (3.2) gives a precise bound to this intuition. In addition,
distributions compatible with the triangular scenario are constrained by the
less intuitive inequalities
$$3H_A + 3H_B + 3H_C - 3H_{AB} - 2H_{AC} - 2H_{BC} + H_{ABC} \le 0 \qquad (3.3)$$
$$\text{and} \quad 5H_A + 5H_B + 5H_C - 4H_{AB} - 4H_{AC} - 4H_{BC} + 2H_{ABC} \le 0 \qquad (3.4)$$
(and permutations of (3.3)). To save space we employed the shorthand notations H_{AB} = H(A, B) and so on. Even after rewriting the inequalities in
terms of mutual information (see Subsection 3.3.4), a simple, intuitive understanding similar to the one given above for inequality (3.2) is not available.
One particular problem is caused by the involvement of ‘tri-partite mutual
information’ (see (3.42)), which, as opposed to the usual mutual information, can be negative. In that sense, ‘tri-partite mutual information’ is not a
well defined information measure, making a simple intuition difficult. It is
worth noting that inequality (3.2) is based on bi-partite information alone.
On the one hand, this might suggest that (3.2) is the least restrictive one; on the other hand, it can also be employed if no tri-partite information is available.
Such a scenario might arise for example in quantum mechanics, where several observables (e.g. position and momentum of a single particle) are not
jointly measurable. In the following we will mainly focus on inequality (3.2)
and come back to inequalities (3.3) and (3.4) in Subsection 3.3.4.
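Given the full joint distribution of the three observables, inequality (3.2) can be evaluated directly. The sketch below (Python/NumPy; the perfectly correlated example distribution is an illustrative choice) computes I(A; B) + I(A; C) − H(A), which must be non-positive for any distribution compatible with the triangular scenario.

```python
import numpy as np

def H(p):
    """Shannon entropy in nats of a probability array (marginal or joint)."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def triangular_lhs(P_ABC):
    """Left-hand side of (3.2), I(A;B) + I(A;C) - H(A), for a joint array P(A,B,C)."""
    P_A = P_ABC.sum(axis=(1, 2))
    P_B = P_ABC.sum(axis=(0, 2))
    P_C = P_ABC.sum(axis=(0, 1))
    P_AB = P_ABC.sum(axis=2)
    P_AC = P_ABC.sum(axis=1)
    I_AB = H(P_A) + H(P_B) - H(P_AB)
    I_AC = H(P_A) + H(P_C) - H(P_AC)
    return I_AB + I_AC - H(P_A)

# Perfect correlation A = B = C of unbiased binary variables violates (3.2):
# I(A;B) = I(A;C) = H(A) = log 2, so the left-hand side equals log 2 > 0.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 1, 1] = 0.5
print(triangular_lhs(P))     # ~ 0.693 > 0, incompatible with the triangular scenario
```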
3.2 Entropy estimation
In order to test an entropic constraint like inequality (3.2) from a data
set, one first has to statistically estimate the single quantities appearing in
the inequality. Since mutual information can be expressed as I (A; B) =
H (A) + H (B) − H (A, B), it suffices to find a reliable estimator for entropy.
Estimating joint entropies is also effectively the same as estimating marginal
entropies. Regardless of the number of variables we can simply write
$$H(A^1, \ldots, A^n) = -\sum_{a_1, \ldots, a_n} P(a_1, \ldots, a_n) \log P(a_1, \ldots, a_n) \qquad (3.5)$$
as
$$H(P) = -\sum_{i} p_i \log p_i, \qquad (3.6)$$
where i runs over the total alphabet of the collection of variables A1 , ..., An ,
and pi denotes the corresponding probability P (a1 , ..., an ).
Note that reliable entropy estimation is an active research topic. While it is not new that simply calculating the entropy of the observed distribution is not the best choice, the estimator that we employ in this thesis has been introduced only recently (2014/15) [17, 18]. We thus provide a rather detailed elaboration of the topic.
3.2.1 Introduction to estimators
An accessible introduction to estimation theory can for example be found
in [35].
When collecting data in the real world, one usually does not know the true probability distribution P underlying the data-generating process. Assume
that we make N observations, each independently drawn from the same
distribution P (the observations can be considered as N independent and
identically distributed (i.i.d.) random variables). The observations are called
a sample of size N of the distribution P . Further assume that the outcome
of any observation can be assigned to one of K categories, the alphabet of
the distribution P . The number of observations that fall in category i is
denoted by Ni and the distribution P̂ defined in terms of the probabilities
p̂i = Ni/N is called the empirical distribution. The empirical distribution P̂
is an estimate of the true distribution P .
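As a small illustration of these notions, the following Python sketch (our own example; the chosen distribution and sample size are arbitrary) draws a sample of size N from a discrete distribution P and computes the empirical distribution P̂:

import numpy as np

rng = np.random.default_rng(0)

P = np.array([0.5, 0.25, 0.15, 0.1])        # an arbitrary true distribution over K = 4 categories
K, N = len(P), 50                           # alphabet size and sample size

sample = rng.choice(K, size=N, p=P)         # N i.i.d. observations
counts = np.bincount(sample, minlength=K)   # category counts N_i

P_hat = counts / N                          # empirical distribution: p̂_i = N_i / N
print(counts, P_hat)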
Note that in statistics one distinguishes between so-called parametric and
non-parametric estimation. Parametric estimation means that certain assumptions about the probability distribution have been made, for example that the distribution is characterized by some real parameter θ. Estimating the distribution then amounts to estimating that parameter. If no
such assumptions have been made, one speaks of non-parametric estimation.
Strictly speaking, non-parametric estimation only exists in the continuous
case. The mere assumption that the distribution is discrete (with finite alphabet size) already renders the model parametric, since each probability pi
can be considered as one parameter.
The next step after estimating the distribution P is the so-called functional
estimation, where one aims to estimate a quantity Q(P ). Since we do not
have access to the true distribution P our estimate can only be based on
the empirical distribution P̂ . Naively, one could simply calculate Q(P̂ ),
also called the plug-in estimator, but in general it is advisable to change
the function Q as well. A general estimator of Q(P ) is then denoted by
Q̂(P̂ ). Since it should be clear that the estimate is based on the empirical
distribution, we will typically omit the functional dependence on P̂ and
simply write Q̂. Different estimators of the same quantity will be given
appropriate indices, for example Q̂a and Q̂b .
An estimator Q̂ should be as close to the true quantity Q as possible. There
are several quantities that help characterize the performance of an estimator.
Definition 3.1. For a fixed true distribution P the expected deviation between an estimator Q̂ and the true value Q is called the bias of Q̂,

B_P[Q̂] = E_P[Q̂] − Q.   (3.7)

If B_P[Q̂] = 0, the estimator is called unbiased.
The index P denotes that P is held fixed. The expectation is taken with respect to all possible empirical distributions that can arise from the true distribution P. Explicitly, this can be written as

E_P[Q̂] = Σ_{P̂} Prob_P(P̂) · Q̂(P̂).   (3.8)
The probability that a specific empirical distribution occurs is given by the multinomial distribution

Prob_P(P̂) = N!/(N_1! ··· N_K!) · p_1^{N_1} ··· p_K^{N_K}  if N_1 + ... + N_K = N,  and  Prob_P(P̂) = 0  otherwise.   (3.9)
Intuitively, it seems reasonable that a good estimator should be unbiased,
but in fact for some quantities (entropy being one of them) unbiased estimators do not even exist. Also, when trying to reduce the bias of an estimator,
one might simultaneously increase its variance. The variance of an estimator
is defined in the usual way as

Var_P[Q̂] = E_P[(Q̂ − E_P[Q̂])²] = E_P[Q̂²] − E_P[Q̂]².   (3.10)
Note that in contrast to the random variables in the introduction of the
variance in Subsection 2.1.4, Q̂ is assumed to be real valued. A more suitable
quantity than the bias that one often tries to keep small is the mean square
error.
Definition 3.2. The mean square error (MSE) of an estimator Q̂ is defined as

MSE_P[Q̂] = E_P[(Q̂ − Q)²].   (3.11)
While the variance of an estimator is its fluctuation around its own expected
value, the MSE is the fluctuation around the correct value Q. For unbiased
estimators this implies VarP (Q̂) = MSEP (Q̂). In general, the MSE can be
decomposed according to
MSE_P[Q̂] = E_P[(Q̂ − Q)²]
          = E_P[Q̂²] − 2Q E_P[Q̂] + Q²
          = E_P[Q̂²] − E_P[Q̂]² + E_P[Q̂]² − 2Q E_P[Q̂] + Q²
          = Var_P[Q̂] + B_P[Q̂]².   (3.12)
Minimizing the MSE means finding a proper trade-off between minimizing
the variance and the bias.
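The decomposition (3.12) can be checked numerically. The following sketch (our own example; the distribution is arbitrary and the estimated quantity is simply the entropy of the empirical distribution) estimates bias, variance and MSE by Monte Carlo simulation:

import numpy as np

rng = np.random.default_rng(1)

def plugin_entropy(p_hat):
    # Entropy of the empirical distribution (natural logarithm).
    nz = p_hat[p_hat > 0]
    return -np.sum(nz * np.log(nz))

P = np.array([0.5, 0.25, 0.15, 0.1])   # example true distribution
H_true = -np.sum(P * np.log(P))        # true entropy H(P)
N, M = 50, 10000                       # sample size, number of Monte Carlo runs

estimates = np.empty(M)
for m in range(M):
    counts = rng.multinomial(N, P)     # draw one sample of size N
    estimates[m] = plugin_entropy(counts / N)

bias = estimates.mean() - H_true
var = estimates.var()
mse = np.mean((estimates - H_true) ** 2)
# The empirical quantities satisfy mse = bias**2 + var, illustrating (3.12).
print(f"bias={bias:.4f}, var={var:.4f}, mse={mse:.4f}, bias^2+var={bias**2 + var:.4f}")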
The final property of estimators that we want to introduce here is consistency. While it is problematic to demand general unbiasedness, it is reasonable to demand that for the sample size N → ∞ the estimator should
approach the true value.
Definition 3.3. An estimator Q̂ = Q̂(N ) is said to be consistent if it converges in probability to the true value Q,
lim_{N→∞} Prob_P(|Q̂(N) − Q(P)| > ε) = 0   ∀ε > 0.   (3.13)
Convergence in probability still allows exceptions in the sense that, for any finite N, there may be empirical distributions P̂(N) for which Q̂ deviates noticeably from the correct value. The probability of observing such distributions, however, vanishes as N → ∞. Loosely speaking, such exceptional samples become increasingly rare as the sample grows.
3.2.2 Maximum likelihood estimation
A standard estimator used in statistics is the so-called maximum likelihood
estimator (MLE). In parametric estimation the MLE θ̂MLE of a parameter
θ is defined as the parameter value for which the probability to make the
given observation is maximized. Formally, this corresponds to maximizing
the likelihood function L(θ) = Prob_θ(P̂) (often written as Prob(P̂ | θ)).
Typically one rather considers the log-likelihood function log L (θ), since frequently occurring product expressions then split into more convenient sums.
The following intuitive result is standard knowledge in statistics, but a proof
is rarely given. We reproduce the result for the sake of completeness.
Proposition 3.1. The MLE of a true discrete distribution P is simply the
empirical distribution P̂ .
Proof. According to (3.9) the log-likelihood function reads

log L(P) = log Prob_P(P̂)
         = log [ N!/(N_1! ··· N_K!) · p_1^{N_1} ··· p_K^{N_K} ]
         = log [ N!/(N_1! ··· N_K!) ] + N_1 log p_1 + ... + N_K log p_K.   (3.14)
When maximizing this function we have to take care of the additional constraint p1 + ... + pK = 1, which can be implemented by using a Lagrange
multiplier λ. The function that we need to maximize then reads
log [ N!/(N_1! ··· N_K!) ] + N_1 log p_1 + ... + N_K log p_K − λ(p_1 + ... + p_K − 1).   (3.15)
The condition that the ith partial derivative ∂_{p_i} vanishes becomes

N_i/p_i = λ  ⇔  p̂_i/p_i = λ/N.   (3.16)

This immediately requires λ/N = 1 (i.e. λ = N) and thus p_i = p̂_i. To see this, assume
∃i s.t. pi > p̂i . The normalization constraint then implies the existence of
another j ≠ i s.t. p_j < p̂_j. But then we have
p̂_i/p_i < 1 < p̂_j/p_j,   (3.17)
which contradicts the requirement that this ratio should be the same for all
i. Thus, the empirical probability p̂i is indeed the MLE of the true pi .
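The statement of Proposition 3.1 can also be checked numerically. The following sketch (our own illustration; the counts are arbitrary) compares the multinomial log-likelihood at the empirical distribution with the log-likelihood at randomly drawn alternative distributions:

import numpy as np

rng = np.random.default_rng(2)

def log_likelihood(p, counts):
    # Multinomial log-likelihood up to the constant log(N!/(N_1!...N_K!)).
    return np.sum(counts * np.log(p))

counts = np.array([23, 14, 9, 4])      # observed N_i, N = 50
p_hat = counts / counts.sum()          # empirical distribution

# Compare the log-likelihood at p_hat with that at random alternative distributions.
best_alternative = max(
    log_likelihood(rng.dirichlet(np.ones(4)), counts) for _ in range(10000)
)
print(log_likelihood(p_hat, counts), ">=", best_alternative)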
The MLE features the invariance property that for a one-to-one function
g (θ) one finds ĝMLE = g(θ̂MLE ) [35]. As a convention, one typically extends
this definition to arbitrary functions g [35]. Thus, when referring to the
MLE of the entropy H (P ), we simply mean the plug-in estimator
Ĥ_MLE = H(P̂) = − Σ_i p̂_i log p̂_i.   (3.18)
3.2.3 Minimax estimation
The MLE is an intuitive estimator that is typically easy to calculate. It also
features numerous optimality properties in the asymptotic regime (i.e. when the sample size approaches infinity) [35]. For finite sample sizes and alphabets, however, there are in general no performance guarantees for the MLE. For a more
sophisticated estimator with a finite alphabet guarantee in form of an optimally bounded mean square error, consider the following definition.
Definition 3.4. The risk of an estimator Q̂, depending on the alphabet size
K and the sample size N , is defined as
R_{Q̂}(K, N) = sup_{P∈M_K} MSE_P[Q̂].   (3.19)
Here, MK is the set of all probability distributions with alphabet size K.
The sample size N appears on the right hand side implicitly in the estimator
Q̂ = Q̂(P̂ (N )) since the possible empirical distributions depend on N .
Definition 3.5. The minimax risk for estimating a quantity Q, depending
on the alphabet size K and the sample size N , is defined as
R_Q(K, N) = inf_{Q̂} R_{Q̂}(K, N).   (3.20)
The risk of an estimator is its worst case behaviour in terms of the MSE.
The minimax risk is the best worst case behaviour possible for any estimator
Q̂ of Q.
It is desirable to have an estimator that achieves the minimax risk of the
quantity of interest. In addition, it is of particular interest to know the sample size N as function of the alphabet size K that is required for consistent
estimation (see Definition 3.3) when both N and K go to infinity. This
relation between N and K is also called the sample complexity. Different
estimators will have different sample complexities. Again, it is desirable to
have an estimator that achieves a global lower bound of the sample complexity. Note that one will typically not find strict statements of the form, say,
N = 2K³. Instead one might find that N is bounded from below by K³, in the sense that ∃c₁ s.t. N ≥ c₁K³ samples are required for consistent estimation. Adopting the notation from [18], we denote this as N ≳ K³. On the other hand, if ∃c₂ s.t. the sample size N ≤ c₂K³ is sufficient for consistent estimation, one writes N ≲ K³. If both N ≳ K³ and N ≲ K³ hold, meaning that a sample size ∝ K³ is necessary and sufficient, we write N ≍ K³.
3.2.4 MLE and minimax estimator for entropy
For entropy estimation, the ideal sample complexity was shown to be N ≍ K/log K [36]. This means that consistent entropy estimation is possible for a sample size ∝ K/log K. For smaller sample sizes, consistent estimation is not possible. This result is extended in [18] where the minimax risk of entropy estimation is shown to be

R_H(K, N) ≍ (K/(N log K))² + (log²K)/N.   (3.21)
In addition, an estimator achieving this bound (and thus also the sample complexity N ≍ K/log K) is constructed. For the MLE, on the other hand, it is known that N ≳ K samples are required for consistent estimation and that the risk is [18]

R_{Ĥ_MLE}(K, N) ≍ (K/N)² + (log²K)/N.   (3.22)
Thus, the MLE is clearly suboptimal. In (3.21) and (3.22) the first term on
the right hand side corresponds to the (squared) bias while the second term
corresponds to the variance. Recall that according to (3.12) the MSE can
be decomposed according to MSE = B² + Var. It is generally acknowledged
that in entropy estimation the main difficulty is handling the bias [17, 18].
In fact, it is easy to see that no unbiased estimator exists. To this end, one
only has to realize that EP [Ĥ] (see (3.8) and (3.9)) is a polynomial in the
probabilities pi , while H (P ) is a non-polynomial function. Furthermore, it
can be shown that the MLE is always negatively biased [37]. A comparison
of (3.21) and (3.22) shows that the advantage of the minimax estimator over
the MLE indeed lies in the reduced bias.
Other attempts (than minimax estimation) to correct the bias of the MLE exist. The typical first order bias correction of a single term −p̂_i log p̂_i is simply −p̂_i log p̂_i + 1/(2N). Note that when we enlarge the alphabet but put no probability mass on the new outcomes (so that the distribution essentially does not change), applying the bias correction 1/(2N) to all of the new terms can hugely overcorrect the bias. Thus, it is advisable to only use the bias correction for terms with p̂_i > 0. This gives rise to the so-called Miller-Madow bias correction [37, 38, 39]. When applying the bias correction to all terms, we speak of the naive bias correction.
In the next subsection, following the construction of the minimax estimator
from [17, 18], we numerically verify its optimal performance. To this end, we
compare the minimax estimator to the MLE and its Miller-Madow (MM-MLE) as well as naively bias corrected versions (n.b.c. MLE). The minimax
estimator, analogous to the MLE, estimates each term −pi log pi separately.
Different estimators are applied for ‘large’ and ‘small’ empirical probabilities
p̂i . It turns out that for large values the bias corrected MLE works well. For
small probabilities the expression −pi log pi is ‘unsmooth’ in the sense that
the derivative diverges to infinity (for pi → 0). This causes small errors in
the estimate p̂i to lead to large errors in the estimate −p̂i log p̂i . Controlling
the bias in this sensitive regime is particularly problematic and not handled
well by the typical bias corrections of the MLE. Even the Miller-Madow
correction is rather crude. For p̂_i > 0, however small it might be, the full correction 1/(2N) is applied, and suddenly for p̂_i = 0 no correction is applied at all. The minimax estimator provides a smoother solution.
If p̂_i > ∆ ≡ c₁ log N / N, the bias corrected MLE −p̂_i log p̂_i + 1/(2N) is used. In practice, c₁ ≈ 0.5 yields good results [17]⁵. In the case p̂_i ≤ ∆ a polynomial approximation of −p_i log p_i is calculated and then estimated. The order of the polynomial should be D ≈ c₂ log N, where a good choice of the constant turns out to be c₂ ≈ 0.7 [17]. The employed approximation is the so-called
minimax polynomial, also called best approximation in the Chebyshev sense
[40, 41]. It is defined as the polynomial with the smallest maximal distance
to the true function,
max_{0≤x≤∆} |P_minimax(x) − f(x)| = inf_{P∈poly_D} max_{0≤x≤∆} |P(x) − f(x)|.   (3.23)
The space polyD is the space of all polynomials of order up to D. In our
case the target function is f (x) = −x log x. One may realize that the idea
behind the minimax polynomial is similar to the idea behind the minimax
risk from estimation theory, see Definition 3.5. The minimax polynomial
can be calculated using the Remez algorithm [40, 41]. It is possible (and
recommendable) to calculate the polynomial for the interval 0 ≤ x ≤ 1 and
then perform a variable transformation to the desired interval [0, ∆]. In this
way, one can calculate polynomials up to a desired order (e.g. 10) and store them for future applications. If Σ_{d=0}^{D} r_d x^d is the polynomial for 0 ≤ x ≤ 1, then the polynomial for 0 ≤ x ≤ ∆ reads [18]

Σ_{d=0}^{D} (r_d − δ_{d,1} log ∆) ∆^{−d+1} x^d.   (3.24)
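The variable transformation in (3.24) is easy to implement. The following sketch (our own; an ordinary least-squares fit stands in for the actual minimax/Remez polynomial, so the coefficients are only illustrative) rescales coefficients obtained on [0, 1] to the interval [0, ∆]:

import numpy as np

def rescale_poly(r, delta):
    # Transform coefficients r_d of an approximation of -x*log(x) on [0, 1]
    # into coefficients on [0, delta] following Eq. (3.24):
    # (r_d - delta_{d,1} * log(delta)) * delta**(1 - d).
    d = np.arange(len(r))
    shifted = np.array(r, dtype=float)
    shifted[1] -= np.log(delta)            # the Kronecker-delta term for d = 1
    return shifted * delta ** (1.0 - d)

# Sanity check with an ad hoc (NOT minimax) least-squares fit on [0, 1].
x = np.linspace(1e-6, 1.0, 2000)
r = np.polynomial.polynomial.polyfit(x, -x * np.log(x), deg=6)

delta = 0.04
coeffs = rescale_poly(r, delta)
y = np.linspace(1e-6, delta, 500)
approx = np.polynomial.polynomial.polyval(y, coeffs)
print(np.max(np.abs(approx - (-y * np.log(y)))))   # approximation error on [0, delta]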
⁵ In [17] the authors state that in practical applications c₁ ∈ [0.1, 0.5] yielded good results. In a newer version of the article the authors recommend c₁ ∈ [0.05, 0.2]. Our own tests in the next subsection show that the estimator with c₁ = 0.5 works well.
The final polynomial turns out to be easier to estimate than the original expression f(p_i) = −p_i log p_i, so that the gain in estimation accuracy is larger than the loss due to the approximation. To estimate the polynomial, each monomial p_i^d is estimated separately by the estimate \widehat{p_i^d} = N_i(N_i − 1) ··· (N_i − d + 1)/N^d. Under so-called Poisson sampling this estimate is unbiased [17, 18]. Poisson sampling means that each N_i is independently drawn from a Poisson distribution with expectation N p_i. In contrast, when drawing a sample from the original multinomial distribution (3.9), the N_i are not independent due to the normalization constraint Σ_i N_i = N. Poisson sampling can be justified since the Poisson distribution is peaked sharply around its expectation N p_i. Thus, already for rather small N, the normalization constraint will be satisfied at least approximately. Poisson sampling is used as a technique to simplify analytical calculations. Mathematical relations between the Poisson model and the multinomial model exist [18]. For numerical simulations the samples will be drawn from the proper multinomial distribution.
3.2.5 Comparison of MLE and minimax estimator for entropy
We briefly summarize the different estimators that we want to compare in this subsection. All estimators have in common that each summand in H = Σ_i −p_i log p_i is estimated separately.

Maximum likelihood estimator (MLE): The empirical probability p̂_i = N_i/N is used to estimate −p_i log p_i by −p̂_i log p̂_i. Hence, the MLE is simply the plug-in estimator. In general, the MLE is expected to suffer from severe bias.

Naively bias corrected MLE (n.b.c. MLE): Independently of the value p̂_i, the estimate −p̂_i log p̂_i is replaced by −p̂_i log p̂_i + 1/(2N).

Miller-Madow MLE (MM-MLE): A bias corrected version of the MLE where for p̂_i > 0 the estimate −p̂_i log p̂_i is replaced by −p̂_i log p̂_i + 1/(2N).

Minimax estimator: For large probabilities p̂_i > ∆ = c₁ log N / N (c₁ = 0.5) the bias corrected version of the MLE, −p̂_i log p̂_i + 1/(2N), is used. For p̂_i ≤ ∆ an optimal polynomial approximation of −p_i log p_i (of order D ≈ c₂ log N with c₂ = 0.7) is estimated. Employing the polynomial approximation aims to reduce the bias (a condensed code sketch of all four estimators follows below).
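The following Python sketch is our own condensed implementation of the estimators as described above; the minimax branch uses a least-squares polynomial as a stand-in for the true minimax polynomial, so it only approximates the estimator of [17, 18]:

import numpy as np

def entropy_estimators(counts, c1=0.5, c2=0.7):
    # Sketch of the four entropy estimators discussed above (natural logarithm).
    counts = np.asarray(counts)
    N = counts.sum()
    p_hat = counts / N

    def neg_plogp(p):
        return np.where(p > 0, -p * np.log(np.where(p > 0, p, 1.0)), 0.0)

    mle = neg_plogp(p_hat).sum()
    nbc = (neg_plogp(p_hat) + 1.0 / (2 * N)).sum()           # naive correction for all terms
    mm = mle + np.count_nonzero(counts) / (2.0 * N)          # Miller-Madow: only observed terms

    # Minimax-style estimator (sketch).
    delta = c1 * np.log(N) / N
    D = max(1, int(round(c2 * np.log(N))))
    x = np.linspace(1e-6, 1.0, 2000)
    r = np.polynomial.polynomial.polyfit(x, -x * np.log(x), deg=D)   # stand-in polynomial on [0, 1]
    g = r.copy()
    g[1] -= np.log(delta)
    g = g * delta ** (1.0 - np.arange(D + 1))                # rescale to [0, delta], Eq. (3.24)

    est = 0.0
    for Ni in counts:
        pi_hat = Ni / N
        if pi_hat > delta:
            est += -pi_hat * np.log(pi_hat) + 1.0 / (2 * N)
        else:
            # Unbiased estimates of the monomials p_i^d: N_i(N_i-1)...(N_i-d+1)/N^d.
            mono = np.array([np.prod(Ni - np.arange(k)) / N ** k for k in range(D + 1)])
            est += float(g @ mono)
    return {"MLE": mle, "n.b.c. MLE": nbc, "MM-MLE": mm, "minimax (sketch)": est}

rng = np.random.default_rng(3)
counts = rng.multinomial(50, np.full(10, 0.1))   # K = 10, N = 50, uniform distribution
print(entropy_estimators(counts))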
We start by reproducing some of the numerical results from Reference [17].
It is worth mentioning that the authors of [17] only compared the minimax
estimator to the MLE without bias correction (and one additional estimator
that we do not regard here). No bias corrected version of the MLE was
considered6 . Indeed, for uniform distributions and quite large samples one
rarely is in the regime p̂i ≤ ∆. In this case the minimax estimator essentially
reduces to the n.b.c. MLE. Therefore, some of the results for the minimax
estimator from [17] can already be obtained by the n.b.c. MLE. Since the
idea of the minimax estimator is to significantly reduce the bias at the cost
of slightly increasing the variance, it even happens that the n.b.c. MLE
yields slightly better results than the minimax estimator. Note that this
is no contradiction to the definition of the minimax estimator. The minimax estimator only guarantees the best worst case behaviour. For specific
distributions (here uniforms) other estimators might perform better. The
superiority of the minimax estimator becomes more evident when considering non-uniform distributions or very small sample sizes. Thus, we extend
the numerical simulations in this direction.
MSE along N = 8K/log K   In Reference [17] it is shown numerically that the empirical mean square error \widehat{MSE} = (1/M) Σ_{m=1}^{M} (Ĥ − H)² along N = 8K/log K is bounded for the minimax estimator but unbounded for the MLE. Note that these results follow theoretically from the risks given in (3.21) and (3.22). Along N = cK/log K the minimax risk turns out to be

R_H ≍ 1/c² + log³K/(cK)  →  1/c²  for K → ∞,   (3.25)

while the risk of the MLE becomes

R_{Ĥ_MLE} ≍ log²K/c² + log³K/(cK) ∝ log²K  for K ≫ 1.   (3.26)

The log²K increase for the MLE stems from the uncontrolled bias.
For each alphabet size K, the samples are drawn from the corresponding uniform distribution. The empirical MSE is obtained by averaging (Ĥ − H)² over M = 10 Monte Carlo simulations. Our results for all four estimators are given in Figure 5. They are in accordance with the results from Reference [17] (for the minimax estimator and the uncorrected MLE).

⁶ Note that this was the case in the version of [17] that was available when writing this chapter (arXiv version 3). Newer versions also contain the Miller-Madow MLE and a lot of additional estimators.
Figure 5: MSE along N = 8K/log K for the minimax estimator, the MLE and the bias corrected versions of the MLE. As expected the MSE of the MLE grows with increasing alphabet size K. The Miller-Madow correction reduces the MSE but it remains an increasing function of the alphabet size. The MSEs of the other two estimators are bounded.
We observe that the unboundedness of the MLE already vanishes for the n.b.c. MLE. Note that the minimax estimator does not reduce to the n.b.c. MLE. To see this, realize that for large K the sample size N = 8K/log K is of the same order of magnitude as the alphabet size, or even smaller. Consequently, many of the empirical observation frequencies take the value N_i = 0, 1 and thus satisfy the condition N_i/N = p̂_i ≤ ∆ = c₁ log N / N (for N = 50, c₁ = 0.5 and log being the natural logarithm we have 0/50, 1/50 ≤ 0.039...). For these p̂_i the minimax estimator indeed resorts to the polynomial approximation instead of the first order bias correction −p̂_i log p̂_i + 1/(2N). The strong results for the minimax estimator shown in Figure 5 thus provide evidence that the polynomial approximation at the heart of the minimax estimator indeed works well.
Performance for large K and N Again motivated by Reference [17], we
consider uniform distributions for three combinations of the alphabet size
K and the sample size N :
                          K         N
data rich                 200       10 000
data sparse               20 000    10 000
extremely data sparse     20 000    1000
The terms ‘data rich’ and ‘data sparse’ have been adopted from [17]. The
extremely data sparse regime was not considered in [17]. Furthermore, we
consider a non-uniform distribution in the data sparse regime. In both
cases, the goal is that there should be a large number of empirical probabilities p̂i = 0 which should not be handled well by the MLE and its bias
corrected versions. The non-uniform distribution is generated by drawing
each probability p_i from a beta distribution

p_Beta(x) = Γ(α + β)/(Γ(α)Γ(β)) · x^{α−1} (1 − x)^{β−1},   0 ≤ x ≤ 1,  α, β > 0,   (3.27)
with α = 0.6 and β = 0.5. The emerging vector p is then normalized in
order to obtain a valid probability distribution. In all four cases we draw 20
samples and plot the resulting estimates together with the true entropy, see
Figure 6.
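For reference, this non-uniform distribution (here with K = 20 000 and N = 10 000) can be generated as in the following sketch (our own; only a single plug-in estimate is computed for brevity, while the plots also show the bias corrected and minimax estimators):

import numpy as np

rng = np.random.default_rng(7)

K, N = 20000, 10000

p = rng.beta(0.6, 0.5, size=K)      # draw each p_i from a Beta(0.6, 0.5)
p /= p.sum()                        # normalize to obtain a valid distribution
H_true = -np.sum(p[p > 0] * np.log(p[p > 0]))

counts = rng.multinomial(N, p)      # one sample of size N
p_hat = counts[counts > 0] / N
H_mle = -np.sum(p_hat * np.log(p_hat))   # plug-in (MLE) estimate
print(H_true, H_mle)                # the plug-in estimate is negatively biased here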
In the data rich regime (upper left plot) the minimax estimator and the bias
corrected versions of the MLE coincide and are extremely accurate. The
MLE is slightly biased but still acceptable. Due to the large sample size
we are almost never in the regime p̂i ≤ ∆ in which the polynomial approximation of the minimax estimator is used. Thus, the minimax estimator
essentially reduces to the n.b.c. MLE. In the uniform, data sparse case (upper right plot) the MLE as well as its Miller-Madow version are strongly
negatively biased. The best result is obtained by the naively bias corrected
MLE, but the performance of the minimax estimator is satisfying as well. In
the extremely data sparse regime (lower left plot) all variants of the MLE are
strongly biased, while the minimax estimator is still rather close to the true
entropy. For the non-uniform distribution (lower right plot) we get a similar
picture. The minimax estimator clearly outperforms the other estimators in
this case.
The results demonstrate the great performance of the minimax estimator
for rather large alphabets. In the context of causal inference, however, one
rarely deals with alphabets of size 200 or even 20 000. In the next paragraph
we therefore consider smaller alphabets.
Figure 6: Estimated and true entropies for different distributions, alphabet sizes and sample sizes (panels: uniform K = 200, N = 10 000; uniform K = 20 000, N = 10 000; uniform K = 20 000, N = 1000; Beta, K = 20 000, N = 10 000). The minimax estimator is the only estimator that always provides a reliable result.
Performance for small alphabets We conduct the same simulations as
before (again for uniform distributions) but this time for alphabets of size
K = 2 and K = 10 with sample size N = 50. Figure 7 suggests that
the superiority of the minimax estimator vanishes for smaller alphabets.
One possible explanation is that these combinations of K and N already
correspond to the data rich regime from above, where the bias corrected
versions of the MLE coincided with the minimax estimator as well.
Ultimately, we are not interested in estimating a single entropy term but in more complicated entropic expressions constraining a given DAG. One example is the expression I(A; B) + I(A; C) − H(A), which is upper bounded
by zero for distributions compatible with the triangular scenario (see also
(3.2)). This requires in particular estimation of mutual information, which
is the subject of the next paragraph.
Figure 7: Estimated and true entropies for the uniform distributions with alphabet sizes K = 2, 10 and sample size N = 50. In both cases the minimax estimator (almost) coincides with the bias corrected versions of the MLE.
Mutual information for small alphabets   In order to estimate mutual information we use the decomposition I(A; B) = H(A) + H(B) − H(A, B) and estimate each entropy separately. The joint distribution P(A, B) required to estimate H(A, B) has alphabet size K². For K = 10 (and N = 50) we might already be in a data sparse regime in which the minimax estimator typically outperforms the different versions of the MLE. To generate a joint distribution with some dependence between the variables, (partially following [17]) we first draw the marginal P(A) with the help of a beta distribution (3.27) (see also the non-uniform case in the paragraph ‘Performance for large K and N’). Then, with probability x we set b = a, and with probability (1 − x) we set b uniformly at random. The resulting joint distribution reads P(a, b) = P(a)[x δ_ab + (1 − x)/K]. For x = 0 the variables are independent while for x = 1 they are deterministically dependent. For x = k/10 (k = 0, ..., 10) we draw 100 samples and calculate the empirical MSE of each estimator. We consider alphabet sizes K = 2, 10 and sample size N = 50. The results are shown in Figure 8.
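The construction of the joint distribution can be summarized in a few lines (our own sketch; for brevity only the plug-in estimate of the mutual information is shown, whereas the simulations also use the bias corrected and minimax estimators):

import numpy as np

rng = np.random.default_rng(4)

K, N, x = 10, 50, 0.5

PA = rng.beta(0.6, 0.5, size=K)                    # marginal P(A) from a Beta(0.6, 0.5)
PA /= PA.sum()
PAB = PA[:, None] * (x * np.eye(K) + (1 - x) / K)  # P(a, b) = P(a) [x delta_ab + (1 - x)/K]

counts = rng.multinomial(N, PAB.ravel()).reshape(K, K)   # one sample of size N

def plugin_entropy(p):
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

p_hat = counts / N
# Plug-in estimate of I(A;B) = H(A) + H(B) - H(A,B).
I_hat = (plugin_entropy(p_hat.sum(axis=1)) + plugin_entropy(p_hat.sum(axis=0))
         - plugin_entropy(p_hat.ravel()))
print(I_hat)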
For K = 2 there is hardly any difference between the different estimators,
but for K = 10 the minimax estimator is, as suspected, clearly superior
to the MLE and its bias corrected versions. One may also realize that (for
K = 10) the n.b.c. MLE is strong for weak dependence (MSE (x = 0) ≈
0) but weak for strong dependence (MSE (x = 1) ≈ 0.8). The reason is
that for x = 1 there are many probabilities P (a, b) = 0, which leads to a
large overcorrection of the bias when applying the correction +1/(2N) to all terms. The uncorrected MLE and the Miller-Madow MLE show the opposite behaviour, though with smaller magnitude.

Figure 8: MSE for estimating mutual information by employing the minimax estimator, the MLE and the bias corrected versions of the MLE (panels: K = 10 and K = 2, both with N = 50). The parameter x determines the dependence between the variables (x = 0: independent, x = 1: deterministically dependent). For K = 2 all estimators perform equally well. For K = 10 the minimax estimator is clearly superior to the MLE and its bias corrected versions.
3.2.6 Conclusion
All tests confirmed the theoretically expected great performance of the minimax estimator. In particular for large alphabets (compared to the sample
size) the minimax estimator was typically far superior to the MLE and its
bias corrected versions. While in some cases the Miller-Madow MLE and
the naively bias corrected MLE performed quite well, they have far worse
performance in other cases. The minimax estimator is the only estimator
that always provided reliable results. Even in the rare occasion (Figure 6,
upper right plot) that another estimator performed better than the minimax
estimator, the results of the latter were still satisfying. The only drawback
is that the superiority of the minimax estimator seems to diminish for extremely small alphabets. In particular for K = 2 and N = 50 (which is
the typical scenario considered in the next section) all estimators performed
similarly. Still, the minimax estimator is overall the sole reliable estimator considered in this section and should always be preferred to the other
estimators.
3.3 Hypothesis tests
In this section we construct and elaborate on hypothesis tests based on entropic inequalities. Precisely, we want to test membership to the triangular
scenario (Figure 3) by employing inequalities (3.2) to (3.4). While we primarily focus on inequality (3.2), the latter inequalities come into consideration
in Subsection 3.3.4. To estimate the required entropy terms, we employ the
techniques introduced in the previous section, in particular the minimax estimator. The observables are assumed to be binary and the sample size shall
be N = 50. Binary variables correspond to the case of simple ‘yes’ or ‘no’
statements. In a real study the observables could represent the occurrence
of some symptoms while the hidden variables describe potential causes that
are unmeasurable or not known to exist at all (e.g. unknown exposure to
a substance or genetic factors). Larger alphabets of the observables could
emerge if the symptoms can be further characterized by their strength, or
if they are directly assessed in a quantitative way (e.g. the concentration of
a substance in a blood sample). One can always construct binary variables
from such data by asking if a certain threshold value is exceeded or not. In
that sense binary variables represent a rather general case. On the other
hand, if the original data are not binary, one might lose information by employing such a thresholding procedure. We have also seen that for binary
variables we hardly benefit from the minimax estimator (see Figure 8). It
might therefore be preferable to keep larger alphabets. The main reason
to consider the binary case is that the large simulations performed in this
section would be computationally extremely expensive for larger alphabets.
Note that apart from being discrete, no assumptions about the alphabets of
the hidden variables are made.
3.3.1 Introduction to hypothesis tests
A hypothesis test first requires one to state the null hypothesis, which is the standard hypothesis that is only rejected if strong evidence is found against it.
The opposite is called the alternative hypothesis which is accepted if and
only if the null hypothesis is rejected. Often the null hypothesis states that
some parameter takes a certain value, θ = θ₀. The alternative hypothesis then comprises all other possibilities, i.e. θ ≠ θ₀. In our case, one can
think of two null hypotheses, one stating that the distribution underlying
the data generating process is compatible with the triangular scenario (Figure 3), the second stating that the distribution satisfies inequality (3.2),
I (A; B) + I (A; C) − H (A) ≤ 0. Note that in both cases the null hypothesis
does not single out one specific distribution (or parameter value), but comprises a large set of distributions. Since there are distributions that are not
compatible with the DAG but fail to violate the inequality, the two hypotheses are indeed different. Since our ultimate aim is to decide compatibility
with the DAG, the first null hypothesis should be preferred (if possible).
In general, we will denote the null hypothesis by h0 and the alternative
hypothesis by h1 .
Ideally, the null hypothesis should be accepted whenever it is true and rejected when it is false. Unfortunately, this ideal scenario is not realizable
since samples from compatible distributions can violate h0 just as samples from incompatible distributions can satisfy it.
A type-I-error is made when the null hypothesis is rejected although it is
actually true. The opposite, i.e. accepting the null hypothesis when it is actually false, is called a type-II-error. The type-I(II)-error rate is denoted by
α(β). The capability to correctly reject the null hypothesis (i.e. to correctly
identify incompatible data) is called the power of the hypothesis test and
evaluates to 1 − β. In general, there is a trade-off between type-I- and type-II-error rate, meaning that trying to decrease one leads to an increase of the
other. Hypothesis tests are often constructed to control the type-I-error,
typically α = 0.05. This means that the test must reject at most 100α% of
samples stemming from distributions compatible with the null hypothesis.
The bound α is then also called the significance level of the test and one says
that the null hypothesis is rejected (or accepted) at the 100α% level.
We distinguish two different approaches to hypothesis testing, the direct and
the indirect approach. After a first introduction and methodical comparison
below, the implementation of the approaches and a detailed comparison
follow in Subsections 3.3.2 and 3.3.3. In both cases, the quantity (or statistic)
that we have to estimate from the data is T ≡ I (A; B) + I (A; C) − H (A)
(see inequality (3.2)). Our direct test is similar to the one already introduced
in [16]. The differences lie in a small revision in the construction of the test,
and the fact that we also consider the minimax estimator of entropy.
Direct approach Assume that for some sample we obtain the estimate
T̂ . If, under the null hypothesis, the probability to obtain an even larger
value (T̂′) is smaller than α,

P(T̂′ > T̂ | h0) ≤ α,   (3.28)
the result is called significant and we reject the null hypothesis at the 100α%
level. ‘Under the null hypothesis’ means that we have to consider all distributions compatible with h0 . Calculating the probability (3.28) requires
knowledge of the distribution of estimates T̂′ of the statistic T under the null hypothesis, in particular under the worst case (or least favorable) distribution among h0. Loosely speaking, the worst case distribution is the distribution leading to the largest estimates T̂′. The requirement of a worst case distribution causes a huge problem, since finding (or even proving) the worst case is far from obvious. The best we can do is make an educated guess and try to confirm with numerical simulations that we do not find an even worse case. Once a candidate for the worst case distribution is selected, the corresponding distribution of T̂′ values can be constructed via a large number of Monte Carlo simulations. In practice, we are interested in the 100(1 − α)% quantile, t, of this distribution, defined by

P(T̂′ > t | h0, worst case) = α.   (3.29)
The value t is then employed as a threshold value for the final hypothesis
test. Whenever we find T̂ > t for some data set, the null hypothesis is
rejected. For a graphical illustration in comparison to the indirect approach
see Figure 9. In terms of the quantile t, the worst case distribution is the
h0 -compatible distribution yielding the largest t value. By definition, we
then obtain
P(T̂′ > t | h0) ≤ α   (3.30)
for all other distributions compatible with the null hypothesis. Thus at
most 100α% of samples stemming from compatible distributions are rejected,
implying that the type-I-error rate is, as desired, upper bounded by α. Note,
however, that this is only true if we found the correct threshold value (and
thus the correct worst case distribution). Otherwise, the hypothesis test
tends to reject too many samples and does not work properly at the 100α%
level.
A major advantage of the direct approach is that it allows to implement
the preferred null hypothesis h0 : ‘data are compatible with the triangular
scenario’. To this end, the worst case distribution is searched only among
distributions that are compatible with the DAG, instead of the larger set of
distributions that are compatible with the inequality T ≤ 0.
Indirect approach A major drawback of the direct approach is its dependence on our ability to find the correct threshold value. This will prove
difficult already for the triangular scenario with binary observables and employing inequality (3.2). For inequalities (3.3) and in particular (3.4), for
which we lack an intuitive understanding, the task becomes even more complex, if not intractable. Similar problems might occur for larger DAGs, or
already for larger alphabets.
In the direct approach, once a threshold value is at hand, only the point
estimate T̂ of the data sample is taken into account, without any measure
of uncertainty. A natural alternative approach, without the necessity of a
threshold value, is to compute a confidence interval for the estimate T̂ and
check if this interval overlaps with T = 0. In the current case, we would be interested in a one-sided 95% interval [T̂_0.05, T̂_max]. The upper endpoint T̂_max
is the maximal value that can be achieved by any empirical distribution, for
T = I (A; B) + I (A; C) − H (A) for example T̂max ≈ log 2. (Depending on
the distribution and the employed estimator, T̂max might also be smaller
or even larger than log 2. In principle, since only the lower endpoint of
the interval is relevant for our purpose, we could also set T̂max = ∞.) If
the confidence interval overlaps with zero, T̂0.05 ≤ 0, we accept the null
hypothesis h0 : ‘data are compatible with the inequality T ≤ 0’. If the lower
endpoint of the interval is larger than zero, T̂0.05 > 0, the null hypothesis
is rejected at the 5% level. For a graphical illustration with comparison to
the direct approach see Figure 9. Note that since no additional information
about the DAG is included, the indirect approach automatically uses the
null hypothesis of compatibility with the inequality instead of the stronger
hypothesis of compatibility with the DAG.
The main task in the indirect approach is the construction of the confidence
interval. If we could sample at will from the true underlying distribution
P , we could draw an arbitrary number of samples, reconstruct the correct
distribution of T̂ values and calculate any quantity of interest, including
confidence intervals. Typically, however, we only have access to a single data
set of presumably small size. Thus, we have to resort to other methods. One
typical approximation in statistics, using asymptotic normal theory, is to
estimate

T̂_0.05 = T̂ + z_0.05 σ̂,   (3.31)

where T̂ is the original estimate, σ̂ some estimate of its standard deviation, and z_0.05 ≈ −1.645 the 5% quantile of the standard normal distribution. In detail, the approximation assumes that estimates of the statistic T are distributed normally around their mean. Our data estimate T̂ automatically serves as an estimate of the mean value of this distribution. The estimate σ̂ of the standard deviation has to be calculated by other means. The expression T̂ + z_0.05 σ̂ is then the 5% quantile of the distribution N(T̂, σ̂).

Figure 9: Plot on the left: Schematic distribution of T̂′ values under the worst case distribution compatible with the DAG. The 95% quantile of this distribution, t, is employed as a threshold value for the direct approach. If a real data estimate T̂ falls into the shaded area (or beyond; T̂ > t) the null hypothesis is rejected. Plot on the right: Schematic, estimated distribution of T̂′ values given the observed value T̂. The 5% quantile, T̂_0.05, is the lower endpoint of a left-sided 95% confidence interval for the estimate T̂. If this interval (the unshaded area) does not overlap with zero (T̂_0.05 > 0) the null hypothesis (T ≤ 0) is rejected. Note the difference, that for the direct approach the calculation of a right-sided 95% quantile is required, while for the indirect approach it is a left-sided interval (or quantile). Also, in the direct approach the quantile (i.e. the value t) is calculated beforehand for the (supposed) worst case among h0, and later we simply test T̂ > t. In the indirect approach the interval (i.e. the lower endpoint T̂_0.05) is calculated for each data sample for which we want to test compatibility, and we then test T̂_0.05 > 0. Thus, for the direct approach the preparation (constructing the threshold t) is complicated, while the resulting test is rather simple. The indirect approach needs no such preparation but therefore the actual test is more complex. Since the approaches are rather different in nature, it is difficult to foresee which approach might result in the stronger test. More advantages and disadvantages of the two approaches are pointed out in the following subsections.

Aside from
the necessity to assess σ̂, the main problem of this procedure is the reliance
on a strong asymptotic approximation. In practice, this approximation may
be highly inaccurate and consequently lead to wrong confidence intervals
[42]. A more sophisticated method, typically resulting in more accurate
intervals, is introduced in Subsection 3.3.3.
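For completeness, the normal-approximation interval (3.31) amounts to the following simple computation (our own sketch; the numbers in the example are purely illustrative):

from statistics import NormalDist

def normal_ci_lower(t_hat, sigma_hat, alpha=0.05):
    # Lower endpoint of the one-sided (1 - alpha) interval in (3.31): T_hat + z_alpha * sigma_hat.
    return t_hat + NormalDist().inv_cdf(alpha) * sigma_hat

# Illustrative numbers only: reject h0 (T <= 0) at the 5% level iff the endpoint exceeds zero.
print(normal_ci_lower(t_hat=0.12, sigma_hat=0.16) > 0)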
3.3.2 Direct approach
The first step for implementing the direct approach is to identify the worst
case distribution. In [16] the following educated guess was made for the
triangular scenario with inequality (3.2) (T = I (A; B) + I (A; C) − H (A) ≤
0):
1. The worst case distribution should lie on the boundary T = 0.
2. Among the DAG-compatible distributions this requires A to be a deterministic function of either B or C (by choice B).
3. The fluctuations of T̂ should be largest if A = B ∼ uniform and
independently C ∼ uniform.
The obtained threshold value (using the maximum likelihood estimator of
entropy) for α = 0.05 was t = 0.0578 bits (or t = 0.0401 nats).
In the following we show that the supposed worst case distribution is not the
true worst case. While we can slightly adjust the aforementioned threshold
value, the main message is that finding the correct value is a formidable
task. We keep the above assumptions (1) and (2) intact, but replace the uniform distributions from assumption (3) by Bernoulli distributions A = B ∼ (qAB, 1 − qAB) and C ∼ (qC, 1 − qC). We consider two scenarios:
1. Fix qAB = 0.5 and vary 0.5 ≤ qC ≤ 1.
2. Set qAB = qC and vary 0.5 ≤ qC ≤ 1.
In both scenarios we calculate the 95% quantile of estimates T̂ as a function
of qC . To this end, we conduct 200 000 Monte Carlos simulations for each
considered value of qC . To estimate T̂ we employ the maximum likelihood
estimator as well as the minimax estimator from Subsection 3.2.4. While
we later restrict ourselves to the minimax estimator, there are two reasons to keep the
MLE for now. First, we want to compare our results of the direct test to
the results from [16], which were also based on the MLE. Second, we want
to check whether the two estimators behave similarly when varying qAB
and qC . The results are presented in Figure 10.
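The numerical construction of the threshold can be sketched as follows (our own illustration; for brevity the plug-in estimator of the entropies is used and fewer Monte Carlo simulations are run than in our actual computations, so the printed value will only roughly match the thresholds reported below):

import numpy as np

rng = np.random.default_rng(5)

def plugin_entropy(p):
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def T_hat(counts):
    # Plug-in estimate of T = I(A;B) + I(A;C) - H(A) from joint counts over (a, b, c).
    p = counts / counts.sum()
    HA = plugin_entropy(p.sum(axis=(1, 2)))
    HB = plugin_entropy(p.sum(axis=(0, 2)))
    HC = plugin_entropy(p.sum(axis=(0, 1)))
    HAB = plugin_entropy(p.sum(axis=2).ravel())
    HAC = plugin_entropy(p.sum(axis=1).ravel())
    return (HA + HB - HAB) + (HA + HC - HAC) - HA

def candidate(q_ab, q_c):
    # Candidate worst case: A = B ~ Bernoulli(q_ab), independently C ~ Bernoulli(q_c).
    P = np.zeros((2, 2, 2))
    for a in (0, 1):
        for c in (0, 1):
            P[a, a, c] = (q_ab if a == 0 else 1 - q_ab) * (q_c if c == 0 else 1 - q_c)
    return P

N, M = 50, 20000
P = candidate(0.5, 0.9).ravel()
estimates = [T_hat(rng.multinomial(N, P).reshape(2, 2, 2)) for _ in range(M)]
print(np.quantile(estimates, 0.95))   # candidate threshold t (95% quantile)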
Figure 10: Threshold value t (95% quantile of the distribution of T̂ values) obtained for the families of distributions described in the text (panels: MLE and minimax estimator). The originally supposed worst case value (corresponding to qAB = qC = 0.5) is exceeded in both cases and for both estimators. The results suggest that the task of analytically finding (or proving) the correct threshold value might be complicated.
In both scenarios and for both estimators, the ideal threshold value is not
provided by the originally supposed worst case distribution (qAB = qC =
0.5). The maximal values tMLE = 0.0461 and tminimax = 0.0506 are instead
obtained for qAB = 0.5 and qC ≈ 0.9. Other combinations of qAB and qC
might lead to even larger values. Note that in the uniform case, within sufficient accuracy, we obtain the same value tMLE = 0.0400 as [16]. In general,
the qualitative behaviour is similar for both estimators (though more pronounced for the minimax estimator) and suggests that finding the correct
worst case distribution with analytical arguments might be intractable. A
monotone behaviour, for example, would probably be more tractable. Also recall that we only relaxed the third assumption from the heuristics employed
in [16]. The validity of the first two assumptions is not obvious either. For
larger DAGs, inequalities or alphabets, the task becomes even more complex.
Amongst other things, numerical simulations similar to those from Figure 10
become drastically more time consuming for larger alphabets. Overall, we
should thus be cautious when using a hypothesis test based on a threshold
value obtained by such vague means. An underestimated threshold value
would cause the test to be more susceptible to reject h0 . The seemingly
large power would be misleading, since the test would not properly control
the type-I-error rate at 5% anymore.
Despite these problems we still want to run tests based on the obtained
threshold values in order to get an impression of the tests’ performances.
We consider the same family of distributions that was used in [16]: Three
initially perfectly correlated, binary variables are flipped independently with
probability pflip . This gives rise to the distribution
P(a, b, c) = (1/2)[(1 − pflip)³ + pflip³]  if a = b = c,  and  P(a, b, c) = (1/2) pflip(1 − pflip)  otherwise.   (3.32)
Figure 11 shows the true value T as a function of 0 ≤ pflip ≤ 0.5. For pflip = 0
the distribution is certainly not compatible with the DAG and neither with
the inequality (I (A; B) = I (A; C) = H (A) = log 2 → T = log 2 > 0). For
pflip = 0.5 all variables are independently uniform which is clearly compatible
with the DAG and leads to T = − log 2 ≤ 0 (I (A; B) = I (A; C) = 0,
H (A) = log 2). For the critical value satisfying T (pflip ) = 0, we obtain
pflip = 0.0584. Thus, the distribution violates the inequality only for rather
small flip probabilities. On the other hand, we do not know at which value
of pflip the distribution changes its compatibility with the DAG. Since the
entropic description is an outer approximation, we only know that this value
has to be larger than 0.0584.
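The true value T(pflip) and the critical flip probability can be computed directly from (3.32), as in the following sketch (our own illustration; natural logarithms are used throughout):

import numpy as np

def flip_distribution(p):
    # Joint distribution (3.32): perfectly correlated binary triple, each bit flipped with prob. p.
    P = np.full((2, 2, 2), 0.5 * p * (1 - p))
    P[0, 0, 0] = P[1, 1, 1] = 0.5 * ((1 - p) ** 3 + p ** 3)
    return P

def entropy(p):
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def T_true(p):
    # T = I(A;B) + I(A;C) - H(A) with I(A;B) = H(A) + H(B) - H(A,B) etc.
    P = flip_distribution(p)
    HA = entropy(P.sum(axis=(1, 2)))
    HB = entropy(P.sum(axis=(0, 2)))
    HC = entropy(P.sum(axis=(0, 1)))
    HAB = entropy(P.sum(axis=2).ravel())
    HAC = entropy(P.sum(axis=1).ravel())
    return (HA + HB - HAB) + (HA + HC - HAC) - HA

# Locate the flip probability where T changes sign (bisection on [0, 0.5]).
lo, hi = 0.0, 0.5
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if T_true(mid) > 0 else (lo, mid)
print(0.5 * (lo + hi))   # should be close to the value 0.0584 quoted in the text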
In order to get an impression of the direct hypothesis test we consider 10 000
samples (of size N = 50) for each value of pflip = 0, 0.005, ..., 0.1. Figure 12
shows the ratio of samples that get rejected by the hypothesis test (i.e. for
which we find T̂ > t) as a function of pflip . While we are particularly
interested in the power of the test, we generally call this ratio the rejection
rate. Recall that the power is defined as the capability to correctly reject samples from incompatible distributions.

Figure 11: Value of the statistic T = I(A; B) + I(A; C) − H(A) for the family of ‘flip distributions’ (3.32). Starting with three binary, perfectly correlated observables, each observable is flipped independently with probability pflip. Values T > 0 are evidence of incompatibility of the distribution with the triangular scenario (for the DAG see Figure 3). Values T ≤ 0 indicate (but do not prove) compatibility with the DAG. For the critical value satisfying T(pflip) = 0, we obtain pflip = 0.0584.

Thus, for distributions that are
incompatible with the triangular scenario, the rejection rate is indeed the
power. However, right now this is only known to be the case for pflip <
0.0584. In this regime, the violation of the inequality (see Figure 11) implies
incompatibility with the DAG. For larger values of pflip we do not know
whether or not the distribution is compatible with the DAG. For this reason
we use the general term ‘rejection rate’ instead of ‘power’. The test shown
in Figure 12 is similar to the test originally proposed in [16]. One difference
is that we use a slightly updated threshold value. Furthermore, we also use
the minimax estimator of entropy rather than only the MLE, as was the
case in [16].
For values close to pflip = 0 the test correctly rejects almost all samples. For
pflip ≈ 0.1 the rejection rate is close to zero. Presumably, the distribution
is compatible with the DAG for these large values of pflip , meaning that the
small rejection rates are indeed desired. For the range in between, the rejection rate varies only slowly. Instead of the rather flat curve, we would have
preferred a sharp edge near pflip = 0.0584.

Figure 12: Rejection rates of the direct hypothesis tests based on inequality (3.2) for the family of ‘flip distributions’ (3.32). One test uses the MLE, the other uses the minimax estimator of entropy (see Subsection 3.2.4). Both tests aim for the null hypothesis h0: ‘data are compatible with the triangular scenario’. For pflip < 0.0584 (vertical line) the null hypothesis is violated. In this regime, large rejection rates (being the powers of the tests) are desired. For pflip ≥ 0.0584 compatibility with the triangular scenario is not known.

In this case, compatible distributions would be reliably accepted while incompatible distributions would
be reliably rejected. The main cause for the flat curve is the large variance
of the distribution of estimates T̂. To give an example, for pflip = 0.06 the standard deviation of our 10 000 estimates is roughly σ ≡ √Var[T̂] ≈ 0.16
for both estimators. Similar standard deviations are obtained for other values of pflip (which are not too close to zero). For values of pflip for which
the true value T is close to the threshold value t, measured in units of the
standard deviation σ, the test is naturally rather indecisive (corresponding
to a rejection rate in the interval, say, [0.2, 0.8]). Next, realize that the total
range of T values for 0 ≤ pflip ≤ 0.1 is −0.25 ≤ T ≤ log 2 ≈ 0.69. Since
this range is rather small compared to σ ≈ 0.16 (less than six standard
deviations), the range of indecisive pflip values is quite large. A graphical
illustration is provided in Figure 13.
One straightforward possibility to obtain a sharper rejection-curve is to increase the sample size, resulting in a reduction of the variance of estimates
T̂.

Figure 13: Histograms of estimates T̂ (employing the minimax estimator) for pflip = 0.03 and pflip = 0.09 obtained by 100 000 Monte Carlo simulations for each value. Due to the large width of the distributions (σ ≈ 0.16), when varying pflip the number of rejected samples (T̂ > t, black line) changes comparatively slowly. Concerning the large widths, it is worth noting that the histogram corresponding to pflip = 0.03 roughly spans the total range of values −0.25 ≤ T ≤ log 2, corresponding to the regime 0 ≤ pflip ≤ 0.1. For a significantly smaller width, the rejection rate would make a sudden jump when the center of the distribution is shifted over the threshold value t. This behaviour would have been desirable.

However, this is not an option if some real data sample is as small as
N = 50. Another solution might be to find yet another estimation technique
that reduces the variance of the MLE. The minimax estimator mainly aims
to reduce the bias (which usually causes the most trouble when estimating
entropies) and might even slightly increase the variance. However, finding
an estimator that reduces the variance without disproportionately increasing the bias seems unlikely, since the minimax estimator already minimizes
the combination of both terms.
As a completely different matter, in Figure 12 the rejection rate for the
test based on the MLE is systematically larger than the rate based on the
minimax estimator. A detailed view on the distributions of estimates T̂
for the supposed worst case distribution and an exemplary ‘flip distribution’ explains this difference. For the supposed worst case distribution the
bias of T̂_MLE is B[T̂_MLE | worst case] = 0.012. For the ‘flip distribution’ with pflip = 0.06 we obtain the larger bias B[T̂_MLE | pflip = 0.06] = 0.034. This difference (+0.022) leads to a systematic overestimation of T̂_MLE for the ‘flip distribution’ which implies an overestimated rejection rate. This effect is in particular problematic if it occurs for distributions that are actually compatible with the DAG and should not be rejected. Furthermore, the opposite effect could decrease the rejection rate for incompatible distributions. In this way, an uncontrolled bias reduces the reliability of the hypothesis test. For the minimax estimator, on the other hand, we find B[T̂_minimax | worst case] = 0.003 and B[T̂_minimax | pflip = 0.06] = 0.002. In this case, the difference between the biases (−0.001) is of much smaller magnitude, indicating a superior bias control by the minimax estimator. As a consequence, the corresponding test is potentially more reliable.
The aim of the following subsections is to improve the direct test from Figure
12, both in terms of the power as well as the control of the type-I-error rate.
3.3.3 Indirect approach (bootstrap)
As already pointed out several times, the main disadvantage of the direct
approach is its dependence on our ability to find the worst case distribution. The indirect approach is free of this optimization problem but requires estimation of the lower endpoint of a left-sided 95% confidence interval [T̂_0.05, T̂_max]. Strong normality assumptions might lead to inaccurate
intervals when the assumptions are not met, for example in the small sample
regime considered in this thesis. More accurate intervals can be obtained by
a technique called bootstrapping, introduced by statistician Bradley Efron
in 1979 [43]. Bootstrapping belongs to the larger class of resampling techniques. The idea is that since we are not able to draw samples from the true
distribution P , we instead draw so-called bootstrap samples from the empirical distribution P̂ . We denote an empirical distribution of such a bootstrap
sample by P̂ ∗ and a bootstrap estimate of the statistic T by T̂ ∗ . The sample
size of the bootstrap samples is the same as the size of the original sample. By drawing a large number of bootstrap samples we obtain a whole
distribution of estimates T̂ ∗ from which desired quantities like confidence
intervals can be estimated. Ideally, bootstrapping should mimic sampling
from the true distribution P , but since P̂ and P are in general not the same,
bootstrapping is only an approximation as well.
There exist several methods to estimate the endpoints of confidence intervals based on the bootstrap statistic T̂ ∗ . While the simple and intuitive
techniques are often suboptimal, there also exist more involved or computationally heavy techniques which lead to more accurate intervals. Even
though the bootstrap methods rest on certain assumptions that will not be
entirely satisfied in practice, the assumptions should be closer to the true situation than the traditional normality assumptions (see (3.31)) [42]. A sound
overview of relevant methods and a guideline to their application is provided
by [44]. Here, we briefly introduce two methods, the simple percentile bootstrap and the advanced BCa bootstrap (bias corrected and accelerated). As
a common framework, following [44], we assume that B = 999 bootstrap
samples are drawn and the resulting estimates T̂i∗, i = 1, ..., B, are sorted in
increasing order (T̂i∗ ≤ T̂j∗ whenever i < j). Larger numbers B ≈ 2000 are
sometimes recommended but the large simulations conducted here are quite
expensive already for B = 999.
Percentile bootstrap: If we could sample from the true distribution we could calculate a large number of estimates T̂ and obtain T̂_0.05 as the 5% quantile of this distribution. Since the true distribution is not available we replace T̂ by the bootstrap statistic T̂∗ and estimate the lower endpoint of the confidence interval by

T̂_0.05 = T̂∗_50.   (3.33)

This method may perform poorly if the distributions of T̂ and T̂∗ differ significantly. In particular, the performance might suffer, first, if the distribution of T̂ is biased (E[T̂] ≠ T) or generally asymmetric, and second, if the standard deviation σ̂ = σ_T(T̂) of that distribution depends on the true value T.
BCa bootstrap: The bias corrected and accelerated bootstrap improves on the percentile method by addressing the aforementioned problems of the latter. The lower endpoint of the confidence interval is estimated by

T̂_0.05 = T̂∗_⌊Q⌋,   (3.34)

with

Q = (B + 1) Φ( b + (b + z_0.05)/(1 − a(b + z_0.05)) ),   (3.35)

where ⌊·⌋ denotes the integer part, Φ the cumulative distribution function (CDF) of the standard normal distribution, and z_0.05 ≈ −1.645 its 5% quantile. The bias correction constant b can be estimated by

b = Φ⁻¹( #{T̂∗_i < T̂} / B ),   (3.36)

where #{T̂∗_i < T̂} is the number of bootstrap estimates that are smaller than the original estimate. The acceleration constant a (correcting the potential dependence of σ̂ = σ_T(T̂) on the true value T) can be estimated using a jack-knife estimate. From the initial sample, omit the ith observation and estimate T̂_(i) based on the remaining sample of size N − 1. Proceed for all i = 1, ..., N. Denote the mean of estimates T̂_(i) by T̄. Then calculate

a = Σ_{i=1}^{N} (T̄ − T̂_(i))³ / [ 6 ( Σ_{i=1}^{N} (T̄ − T̂_(i))² )^{3/2} ].   (3.37)
Some of the above formulas might look peculiar, but the BCa bootstrap
is motivated and thoroughly examined in [42]. The obtained confidence
intervals are usually highly accurate.
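To make the two recipes concrete, the following minimal sketch (in Python; not the code used for the simulations in this thesis) computes both lower endpoints for a generic statistic. The function T_hat is a placeholder for whichever estimator of T is used, e.g. a plug-in or minimax-estimated version of T = I(A; B) + I(A; C) − H(A); degenerate cases (no or all bootstrap estimates below the original estimate) are not handled.

```python
import numpy as np
from scipy.stats import norm

def lower_endpoints(sample, T_hat, B=999, alpha=0.05, seed=None):
    """Percentile and BCa estimates of the lower 5% confidence endpoint."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    N = len(sample)
    T_orig = T_hat(sample)

    # B bootstrap estimates from resamples of size N drawn from P-hat, sorted.
    T_star = np.sort([T_hat(sample[rng.integers(0, N, N)]) for _ in range(B)])

    # Percentile method, eq. (3.33): the 50th of the 999 sorted estimates.
    T_perc = T_star[int(np.floor((B + 1) * alpha)) - 1]

    # BCa method, eqs. (3.34)-(3.37).
    b = norm.ppf(np.sum(T_star < T_orig) / B)                  # bias correction (3.36)
    T_jack = np.array([T_hat(np.delete(sample, i, axis=0)) for i in range(N)])
    d = T_jack.mean() - T_jack                                 # jack-knife deviations
    denom = 6.0 * np.sum(d**2) ** 1.5
    a = np.sum(d**3) / denom if denom > 0 else 0.0             # acceleration (3.37)
    z = norm.ppf(alpha)
    Q = (B + 1) * norm.cdf(b + (b + z) / (1.0 - a * (b + z)))  # eq. (3.35)
    T_bca = T_star[np.clip(int(np.floor(Q)), 1, B) - 1]        # eq. (3.34)
    return T_perc, T_bca
```

The indirect test can then reject the null hypothesis h0: T ≤ 0 whenever the returned lower endpoint lies above zero.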
We simulate hypothesis tests based on the percentile and BCa bootstrap for
the family of ‘flip distributions’ (3.32) known from the direct approach in
Subsection 3.3.2. First, we compare the results of the different bootstrap
approaches to each other (Figure 14), then (in Figure 15) we compare the
more reliable bootstrap approach to the minimax version of the direct test
from Figure 12. The bootstrap tests were carried out using the minimax
estimator of entropy as well. For each value of pflip we conducted 1000
initial Monte Carlo simulations and B = 999 bootstrap simulations for each
initial sample.
A general observation from Figure 14 is that the bootstrap tests are powerful,
say rejection rate ≥ 0.8, only for values of pflip extremely close to zero.
While the test based on the percentile bootstrap seems to be more powerful,
we have in fact one rather objective criterion for the ‘correctness’ of the
different methods. By construction we want to test at the 5% level, i.e. at
[Plot ‘bootstrap tests’: rejection rate versus pflip for the percentile and BCa bootstrap tests.]
Figure 14: Rejection rates of the indirect (bootstrap) hypothesis tests based
on inequality (3.2) for the family of ‘flip distributions’ (3.32). Both tests
employ the minimax estimator of entropy and use the null hypothesis h0 :
‘data are compatible with the inequality T ≤ 0’. For pflip < 0.0584 (vertical
line) the true distribution violates the null hypothesis (see Figure 11). In
this regime, large rejection rates (being the powers of the tests) are desired.
For pflip ≥ 0.0584 the null hypothesis is satisfied and small rejection rates
are expected.
most 5% of samples from compatible distributions should be rejected. The
true distribution is compatible with the null hypothesis (T ≤ 0) exactly
for pflip ≥ 0.0584 (see Figure 11). The rejection rate at the critical value
pflip = 0.0584 thus indicates whether or not a test properly works at the 5%
level. We find the rejection rates 0.147 (percentile) and 0.048 (BCa). The
BCa value is, as theoretically expected, significantly closer to the desired
rate of 0.05. The large rejection rate of the percentile method suggests that this method generally rejects too often. For the comparison to the direct test we therefore consider only the BCa bootstrap.
Figure 15 reveals that the bootstrap test is significantly weaker than the
direct test. Depending on the unknown value of pflip for which the distribution becomes compatible with the triangular scenario, small rejection rates
for pflip ≥ 0.0584 might actually be desired. For pflip < 0.0584, however, the
weak power of the bootstrap test is in fact disappointing. There are at least
two possible reasons for the inferiority of the bootstrap test.
[Plot: rejection rate versus pflip for the direct test and the BCa bootstrap test.]
Figure 15: Rejection rates of the direct test and the (BCa) bootstrap test
based on inequality (3.2) for the family of ‘flip distributions’ (3.32). Both
tests employ the minimax estimator of entropy. The direct test uses the
null hypothesis h0 : ‘data are compatible with the triangular scenario’. The
bootstrap test uses the weaker null hypothesis h0 : ‘data are compatible with
the inequality T ≤ 0’. For pflip < 0.0584 (vertical line) the true distribution
violates the inequality. In this regime, large rejection rates (being the powers
of the tests) are desired. The value of pflip for which compatibility with the
triangular scenario is established is unknown.
1. Due to a failure to find the correct worst case distribution, the threshold value used in the direct approach could be too small. The large
rejection rate would then (partially) be caused by the fact that the
test does not properly work at the 5% level.
2. The discrepancy between the two null hypotheses (compatibility with
the DAG as opposed to compatibility with the inequality) might be quite
large. The stricter null hypothesis of the direct test is naturally more
frequently rejected.
We illustrate the discrepancy mentioned in the second explanation by estimating the value pflip^(DAG) at which the distribution becomes compatible with the DAG. To this end, we employ the data shown in Figure 15 and assume that the direct test correctly works at the 5% level. Since this assumption might not be correct, the following argument is not rigorous but serves merely as an illustration. By interpolation between the flip probabilities pflip = 0.090 and pflip = 0.095 (having rejection rates right above and right below 5%) we can calculate the estimate pflip^(DAG) ≈ 0.0919. Since there might be incompatible distributions with rejection rate < 5%, this value is in fact only a lower bound. But already pflip^(DAG) ≈ 0.0919 is significantly larger than the value pflip = 0.0584 above which inequality (3.2) is satisfied. This consideration indicates that the set constrained by inequality (3.2) might be a clearly suboptimal approximation to the true set of distributions compatible with the DAG.
The above arguments might suggest that bootstrapping is doomed to result
in a weaker test. But the results were by no means clear beforehand. First,
a general, theoretical comparison of the tests is difficult due to their different
natures (see Figure 9). Second, there are also reasonable arguments in favor
of the bootstrap test:
• By bad luck there might be a small, unrepresentative set of distributions compatible with the DAG whose samples lead to comparatively
large violations of the inequality. This would result in a threshold
value so large that the power of the direct test would be unreasonably
small. Since the bootstrap approach does not involve a worst case
distribution it would not be affected by such a disproportionate worst
case.
• In Subsection 3.3.2 we identified the large variance of estimates T̂ (for
fixed pflip ) as the main reason for the flat rejection curve of the direct
test. The bootstrap principle suggests that a distribution of bootstrap
estimates T̂ ∗ should have a similarly large variance. But the actual
quantity of interest in the indirect approach is the lower endpoint, T̂0.05 ,
of the confidence interval estimated from such a distribution. This is
in contrast to the direct approach where, once the threshold value is
available, only the point estimate T̂ is required. The variance of endpoints T̂0.05 might indeed be smaller than the variance of the estimates
T̂ (or T̂ ∗ ) itself. This would be the case if the bootstrap distributions
for different initial samples were similar, or if the estimation technique
(here the BCa method) provided appropriate corrections. While the
bootstrap test would still have low rejection rate for pflip ≥ 0.0584, the
power would increase more rapidly when decreasing pflip below that
critical value.
Unfortunately, it seems that these effects played only a minor role, if they
occurred at all. The opposite effects discussed above (stronger null hypothesis of the direct test and potentially underestimated threshold value) clearly
dominate the discrepancy between the two tests. Note that since a proper
theoretical comparison between the two approaches is difficult, the list of advantages and disadvantages might not be complete. Also, an underestimated
threshold value is by no means an advantage of the direct approach. The
seemingly larger power would be misleading, since the test would not correctly operate at the 5% level anymore. In fact, we have shown in Figure 10
that the threshold value from [16] was underestimated. Moreover, we have
no proof, not even reasonable arguments, that our slightly improved threshold value is correct. The (BCa) bootstrap test, on the other hand, could be
verified to correctly work at the 5% level. We can thus trust the bootstrap
test more than the direct test. If the bootstrap test rejects some given data,
we can be extremely confident that the data are indeed incompatible with
inequality (3.2) and thus in particular with the triangular scenario. Recall that this is the only rigorous inference we are able to draw anyway, first,
since compatibility with the inequality does not imply compatibility with
the DAG, and second, because compatibility with the DAG does not guarantee that this DAG is the ‘one correct’ explanation for the data. Of course
it would be preferable if the bootstrap test showed strong performance for
a larger range of distributions.
3.3.4 Additional inequalities
In this subsection we show that we can improve on the performance of
the current bootstrap test by employing inequalities (3.3) and (3.4). As
discussed before, we also encounter an increasing difficulty in finding a worst
case distribution for the new inequalities, which would be required for a
direct test. From now on, we will often refer to inequality (3.2) ((3.3), (3.4))
as the ‘first (second, third) inequality’.
Analogous to the statistic T^1_ent ≡ T = I(A; B) + I(A; C) − H(A) of the first inequality, we denote the statistics corresponding to inequalities (3.3) and (3.4) by

T^2_ent ≡ 3H_A + 3H_B + 3H_C − 3H_AB − 2H_AC − 2H_BC + H_ABC ,    (3.38)
T^3_ent ≡ 5H_A + 5H_B + 5H_C − 4H_AB − 4H_AC − 4H_BC + 2H_ABC ,    (3.39)
again using the short hand notation H_AB = H(A, B) and so on. The subscript ‘ent’ stands for ‘entropic’ and is introduced in light of Chapter 4 (Subsection 4.5.3) where yet another statistic T_mat appears. For the simulated hypothesis tests we will again consider the family of ‘flip distributions’ (3.32) that was already considered in Subsections 3.3.2 and 3.3.3. Figure 16 shows T^2_ent and T^3_ent in comparison to T^1_ent as functions of pflip. We observe that the critical value of pflip for which T^i_ent = 0 is largest for T^3_ent and smallest for T^1_ent. This means that, at least for the family of ‘flip distributions’, the second and third inequality are stronger than the first one, suggesting that they should also lead to more powerful tests. Since T^2_ent and T^3_ent involve tri-partite information this tendency is not surprising.
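For reference, here is a small sketch of the plug-in versions of these statistics computed from an empirical joint distribution p[a, b, c]. The simulations in this section instead use the minimax estimator of entropy, which is not reproduced here; only the combination of entropy terms is shown.

```python
import numpy as np

def H(p):
    """Shannon entropy of a (possibly multi-dimensional) pmf."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def entropic_statistics(p_abc):
    """Plug-in values of T^1_ent, T^2_ent (3.38) and T^3_ent (3.39)."""
    HA, HB, HC = (H(p_abc.sum(axis=ax)) for ax in [(1, 2), (0, 2), (0, 1)])
    HAB, HAC, HBC = (H(p_abc.sum(axis=ax)) for ax in [2, 1, 0])
    HABC = H(p_abc)
    T1 = (HA + HB - HAB) + (HA + HC - HAC) - HA
    T2 = 3*HA + 3*HB + 3*HC - 3*HAB - 2*HAC - 2*HBC + HABC
    T3 = 5*HA + 5*HB + 5*HC - 4*HAB - 4*HAC - 4*HBC + 2*HABC
    return T1, T2, T3
```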
Figure 16: The statistics T^1_ent, T^2_ent and T^3_ent for the family of ‘flip distributions’ (3.32) as functions of pflip. For the critical flip probabilities (satisfying T^i_ent(pflip) = 0) we obtain pflip^(1.ent) = 0.0584, pflip^(2.ent) = 0.0750 and pflip^(3.ent) = 0.0797. Thus, for the family of ‘flip distributions’, the new inequalities are stronger than the first one, suggesting that they should also result in more powerful hypothesis tests. The larger absolute values of T^3_ent and T^2_ent carry no direct meaning.
Before tackling the construction of a direct hypothesis test, we try to interpret the new inequalities. In terms of mutual information the inequalities T^i_ent ≤ 0 (for i = 2, 3) can for example be rewritten as

2.  I_AB + I_AC + I_BC ≤ H_AB + I_ABC    (3.40)
and 3.  I_AB + I_AC + I_BC ≤ H_ABC + 3 I_ABC .    (3.41)
The ‘tri-partite information’ IABC , also called interaction information [45],
is defined as
I_ABC = I_AB|C − I_AB = H_AB + H_AC + H_BC − H_ABC − H_A − H_B − H_C ,    (3.42)
and is symmetric in the three observables. Now, recall the rather simple interpretation of the first inequality I (A; B) + I (A; C) ≤ H (A) from Section
3.1: ‘If the mutual information of A and B is large, then A depends strongly
on the ancestor λAB . But then, the dependence of A on λAC is necessarily
small. Since all correlations between A and C are mediated by λAC , the mutual information of A and C is consequently small as well. Inequality (3.2)
gives a precise bound for this intuition.’ For the new inequalities such an
interpretation is not that simple. In the representations (3.40) and (3.41),
the inequalities also seem to bound the sum of pairwise mutual information
terms, but the interaction information IABC (involved in the upper bound)
complicates an intuitive understanding. First, IABC is not lower bounded
by zero. Second, the general behaviour of IABC when varying multiple pairwise mutual information terms is difficult to predict. In addition, the mere
increase in the number of involved terms complicates any potential interpretation of the inequalities and raises the risk that our intuition is flawed.
Also, an advantage of the first inequality was that it clearly singles out the
variable A, so that the interpretation could be built around that variable.
Inequalities (3.40) and (3.41), in addition to the tri-partite terms, also involve the dependency between the variables B and C. The third inequality
is even completely symmetric in all variables, resulting in the loss of variable
A as the natural starting point for an intuitive interpretation.
The above problems complicate the task of finding a worst case distribution
required for a direct test. In general, one could try to use the same rationale
for finding the worst case that was already employed for the first inequality
in Subsection 3.3.2. While it makes sense to assume that the worst case distribution should again lie on the boundary T^i_ent = 0, problems arise at the second step. For the first inequality some intuition was employed to find (or propose) a rather simple structure of distributions satisfying T^1_ent = 0.
Due to their less intuitive interpretations, this is not as straightforward for
the second and third inequality. Any pure guess would result in an even
less trustworthy threshold value. Still, one such guess is to start with the
same distribution as before, namely A = B ∼ uniform and independently
C ∼ uniform, and then replace the uniforms by general Bernoulli distributions. The second inequality shows similar qualitative behaviour as the first one (recall Figure 10). In particular, we have T^2_ent = 0 and the threshold value can be improved by considering the non-uniform cases. For the third inequality, however, we obtain T^3_ent = − log 2 in the uniform case. Thus, the distribution does not even lie on the boundary T^3_ent = 0. This suggests that, in particular for the third inequality, the supposed worst case distribution of the first inequality is not a good candidate here. But also for the second inequality, choosing the old worst case guess has no sound standing.
Due to the above problems we would not be able to trust a direct hypothesis test for the new inequalities and thus stop the construction at this point. A more
detailed elaboration of the inequalities might help to overcome or at least
reduce the problems. But here, we instead decide to focus on the indirect
hypothesis tests based on bootstrapping. This approach does not suffer
from the above problems. In general, bootstrapping is a rather easy to
implement, automated procedure which often significantly eases the task
for the scientist at cost of increasing the computational burden. While the
latter is a relevant point for the large simulations we are conducting here,
it causes no real problem when applying the test to a single (real world)
data set. The main disadvantage remains that we are not able to implement
the stronger null hypothesis of compatibility with the DAG. Instead, the approach automatically tests compatibility with the inequality T^i_ent ≤ 0.
The implementation of the bootstrap tests is exactly the same as for the
first inequality. For a general explanation of the approach see Subsection
3.3.3. As before, we employ the BCa method to estimate the required lower
endpoint of the confidence interval. The results in comparison to the indirect
as well as the direct test based on the first inequality are presented in Figure
17. Note that the new tests correctly work at the 5% level. For the second inequality we find a rejection rate of 4.1% at pflip^(2.ent) = 0.0750 and for the third inequality 4.9% at pflip^(3.ent) = 0.0797.
We observe that the tests based on the second and third inequality are significantly more powerful than the bootstrap test employing the first inequality.
The difference from the second to the third inequality is comparatively small.
This is in accordance with the quite large gap between the critical values pflip^(1.ent) = 0.0584 and pflip^(2.ent) = 0.0750 compared to the smaller gap between the latter value and pflip^(3.ent) = 0.0797 (a larger critical value corresponds
[Plot ‘tests for diff. entropic ineqs.’: rejection rate versus pflip for the direct test (first inequality) and the bootstrap tests for the first, second and third inequality.]
Figure 17: Rejection rates of the bootstrap tests for the second ((3.3) or
(3.40)) and third ((3.4) or (3.41)) inequality compared to the direct and
the bootstrap test based on the first inequality (3.2). In all cases the minimax estimator of entropy was employed. As before, we consider the family of ‘flip distributions’ (3.32). The vertical lines mark the critical values pflip^(1.ent) = 0.0584, pflip^(2.ent) = 0.0750 and pflip^(3.ent) = 0.0797 below which the respective inequalities are violated. As expected, the tests based on the second and third inequality are more powerful than the bootstrap test for the first inequality. However, the power of the direct test employing the first inequality is not reached.
to a more restrictive inequality and thus indicates a more powerful test).
Even though the power of the first inequality’s direct test is not reached or
even surpassed, the clear improvement over the first bootstrap test considerably enhances the usefulness of the bootstrap approach. Since the threshold
value involved in the direct test is rather questionable, one might now actually prefer the bootstrap test based on the third inequality (if tri-partite
information is available).
3.3.5 Summary
At this point it is reasonable to recapitulate what we have accomplished so
far. Recall that the direct hypothesis test employing the first inequality (3.2)
has already been proposed in [16] (for the maximum likelihood estimator and
with a slightly different threshold value). At the end of Chapter 1 we have
stated that one prime goal of this thesis is to improve on this test.
• As a first drawback we observed in Figure 10 that the heuristics for
finding the threshold value employed in [16] is flawed. While we could
slightly amend the threshold value, the main observation was that it
will in general be extremely difficult to find the correct threshold value.
Thus, in addition to the arguably weak power of the direct test (see
Figure 12), we do not even know if the test correctly works at the 5%
level. Our goal is therefore not only to improve the power of this test,
but also to improve the reliability, by which we mean a proper control
of the type-I-error rate of 5%.
• Another means that was intended to increase the reliability of the test
and hopefully also the power was the implementation of the minimax
estimator of entropy, introduced in Section 3.2. We could confirm
that the minimax estimator is often (far) superior to the MLE, but for
the alphabet and sample sizes considered in this section the differences
were rather insignificant. Figure 12 shows that the powers of the direct
tests based on the MLE and the minimax estimator are indeed similar.
While the minimax test is even slightly less powerful, the discussion
at the end of Subsection 3.3.2 suggests that the minimax test might
be more reliable.
• To overcome the problem of the poorly controlled type-I-error rate of
the direct test, we considered a bootstrap approach to hypothesis testing (Subsection 3.3.3). Employing the BCa method, we were able to
properly control the type-I-error rate at 5% (see Figure 14). Unfortunately, the comparison to the direct test shown in Figure 15 reveals
that the bootstrap test is considerably less powerful than the direct
test.
• In order to improve the unsatisfying power of the bootstrap test, we
implemented bootstrap tests for the additional inequalities (3.3) and
(3.4) (Subsection 3.3.4). In Figure 17 it can be seen that these tests
are indeed significantly more powerful than the bootstrap test based
on inequality (3.2). The power of the direct test is unfortunately not
reached or even surpassed. On the plus side, like the first bootstrap
test, the new tests correctly work at the 5% level.
Overall, we have been able to construct a test that is more reliable than
the direct test from Figure 12 (or originally [16]) in terms of a superiorly
controlled type-I-error rate. On the other hand, we have not been able to
improve the power of this test. Our final measure in this direction will be
to leave the entropic framework and derive similar inequality constraints
based on certain generalized covariance matrices. While deriving the new
matrix inequalities is a significant goal on its own, we particularly hope to
be able to construct more powerful tests in this new framework. Both, the
derivation as well as the implementation of our new inequality is the subject
of Chapter 4. An application to real data, employing an entropic as well as
the matrix inequality, will be presented in Chapter 5.
4 Tests based on generalized covariance matrices
4.1 Introduction
In Subsection 2.2.4 we have introduced so-called hidden common ancestor
models, where all correlations between the observable variables are mediated by hidden common ancestors. A special case is the triangular scenario
(Figure 3) consisting of three observables with one common ancestor for
each pair. In Section 3.1 the entropic inequality I (A; B) + I (A; C) ≤ H (A)
constraining all distributions of the observable variables that are compatible
with the triangular scenario was introduced. The inequality was the subject of intensive simulations of statistical tests in Section 3.3. The authors
of [16] did not stop at the triangular scenario but also considered general
hidden common ancestor models. A hidden common ancestor model can be
characterized by the number of observables (n) and the maximal number of
observables that may be connected by a single ancestor (m). We may also
call this number the degree of an ancestor. An example with n = 5 and
m = 3 is given in Figure 18. In [16], it was shown that for compatibility
with this kind of scenario the inequality
Σ_{i=2}^{n} I(A^1; A^i) ≤ (m − 1) H(A^1)    (4.1)
(and permutations thereof) must be satisfied. The inequality bounds the
mutual information that A1 shares with all the other observables. One might
expect that similar inequalities also hold for other measures of correlation.
Going further, it might be possible to use the tools from [16] to find entropic
inequalities for a given DAG and then generalize these inequalities to other
measures of correlation.
In this chapter we prove the analog to inequality (4.1) on the level of certain
generalized covariance matrices. As a special case, constraints on usual covariances, or rather correlation coefficients, can be derived. The motivation
to do this is twofold. First, when going to entropies we lose some information since already the elementary inequalities constraining entropies of any
set of random variables are only an outer approximation (see Section 3.1).
Second, we have seen in Chapter 3 that estimating entropies and in particular establishing statistical tests for entropic inequalities can be a thorny
[DAG of Figure 18: observables A^1, ..., A^5 with hidden ancestors λ_{12}, λ_{23}, λ_{134}, λ_{145}.]
Figure 18: An example of a hidden common ancestor model with n = 5
observables and ancestors of degree up to m = 3 (two ancestors of degree
3 and two ancestors of degree 2). Applying inequality (4.1) to this DAG
results in the constraint I_{A^1 A^2} + I_{A^1 A^3} + I_{A^1 A^4} + I_{A^1 A^5} ≤ 2 H_{A^1} .
issue. One might hope that our new inequality gives rise to simpler or more
powerful tests.
The rest of this chapter is structured as follows. After introducing the general framework in Section 4.2, we motivate and propose the new inequality
in Section 4.3. A step by step proof is provided in Section 4.4. In Section
4.5 we compare the strength of our new inequality to the entropic inequality
(4.1). For a special class of distributions this can be done analytically. For
more general distributions we conduct a number of numerical simulations.
We also study the performance of statistical hypothesis tests based on our
new inequality and compare the results to the analogous entropic tests from
Section 3.3. An application to real data of the techniques developed in this
chapter, as well as the techniques from Chapter 3, is presented in Chapter
5.
Note that we always assume the alphabets of all variables (observables as
well as hidden ancestors) to be discrete and finite. Even if this should not
be explicitly stated in some of the following sections, propositions, lemmata
etc, finiteness of the alphabets is always implicitly assumed. The number
of observables (n) should be finite as well. This implies that the number of
ancestors as well as the maximal degree of an ancestor (m) are also finite.
4.2 Encoding probability distributions in matrices
4.2.1 One- and two-variable matrices
The covariance of two random variables A and B can be written as

Cov[A, B] = Σ_{i=1}^{K_A} Σ_{j=1}^{K_B} a_i^* [ P(A = a_i, B = b_j) − P(A = a_i) P(B = b_j) ] b_j ,    (4.2)

see (2.9) in Subsection 2.1.4. For the sake of generality we allow complex valued variables. Recall that in this case we have Cov[B, A] = Cov[A, B]^* instead of full symmetry. If we define the (real valued) matrix

M^{A:B} := [ P(A = a_i, B = b_j) − P(A = a_i) P(B = b_j) ]_{i,j=1}^{K_A, K_B} ,    (4.3)

and the vectors a := (a_1, ..., a_{K_A})^T, b := (b_1, ..., b_{K_B})^T, the covariance can be written as the matrix product

Cov[A, B] = a^† M^{A:B} b .    (4.4)
The vectors a and b carry the alphabets of the variables A and B while the
matrix M A:B carries the information about the joint and marginal distributions. Note that the alphabet sizes KA and KB are assumed to be finite.
For the covariance, independent variables satisfy Cov [A, B] = 0 while the
converse is not necessarily true. In contrast, it can be seen directly from the
definition that M A:B is the zero-matrix if and only if A and B are independent. This statement is in particular independent of the actual alphabets
of A and B. Thus, M A:B encodes the distribution of A and B in a more
elementary way than the covariance does. For this reason we prefer to work
with the M -matrices instead of covariances. We will see later that this makes
indeed a difference.
Starting with expression (2.8), the variance of A can be written as

Var[A] = Σ_{i=1}^{K_A} |a_i|^2 P(A = a_i) − | Σ_{i=1}^{K_A} a_i P(A = a_i) |^2
       = Σ_{i=1}^{K_A} Σ_{j=1}^{K_A} a_i^* [ P(A = a_i) δ_ij − P(A = a_i) P(A = a_j) ] a_j
       = a^† M^A a ,    (4.5)

by defining the matrix

M^A := [ P(A = a_i) δ_ij − P(A = a_i) P(A = a_j) ]_{i,j=1}^{K_A} .    (4.6)
Recall that Var [A] ≥ 0 independently of the chosen alphabet (even if some
of the outcome values ai coincide). Thus, a† M A a ≥ 0 ∀a ∈ CKA , which
means that M A is positive semidefinite.
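As a small illustration (a sketch, not part of the formal development), both matrices can be computed directly from a joint probability table; the example distribution below is an arbitrary product distribution, chosen only to show that M^{A:B} vanishes exactly for independent variables while M^A stays positive semidefinite.

```python
import numpy as np

def M_AB(p_ab):
    """M^{A:B} from (4.3): P(A=a_i, B=b_j) - P(A=a_i) P(B=b_j)."""
    p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
    return p_ab - np.outer(p_a, p_b)

def M_A(p_a):
    """M^A from (4.6): P(A=a_i) delta_ij - P(A=a_i) P(A=a_j)."""
    p_a = np.asarray(p_a)
    return np.diag(p_a) - np.outer(p_a, p_a)

p_ab = np.outer([0.3, 0.7], [0.5, 0.5])                        # a product distribution
assert np.allclose(M_AB(p_ab), 0)                              # zero iff A and B independent
assert np.linalg.eigvalsh(M_A([0.3, 0.7])).min() >= -1e-12     # M^A is PSD
```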
4.2.2 The compound matrix
To capture the joint information about A and B in one matrix, we define
the compound matrix

ℳ^{A:B} := \begin{pmatrix} M^A & M^{A:B} \\ M^{B:A} & M^B \end{pmatrix} .    (4.7)

Note that ℳ^{A:B} is symmetric since M^{B:A} = (M^{A:B})^T and M^A = (M^A)^T. Since all M-matrices are real valued, the symmetry also implies hermiticity.
In general, for n random variables A1 , ..., An we define

ℳ^{A^1:...:A^n} := \begin{pmatrix} M^{A^1} & M^{A^1:A^2} & \cdots & M^{A^1:A^n} \\ M^{A^2:A^1} & M^{A^2} & \cdots & M^{A^2:A^n} \\ \vdots & \vdots & \ddots & \vdots \\ M^{A^n:A^1} & M^{A^n:A^2} & \cdots & M^{A^n} \end{pmatrix} .    (4.8)
Like M^A can be considered as an alphabet-independent generalization of the variance Var[A] and M^{A:B} of the covariance Cov[A, B], the matrix ℳ^{A^1:...:A^n}
is a generalization of the covariance matrix

Cov[A^1 : ... : A^n] = \begin{pmatrix} Var[A^1] & Cov[A^1, A^2] & \cdots & Cov[A^1, A^n] \\ Cov[A^2, A^1] & Var[A^2] & \cdots & Cov[A^2, A^n] \\ \vdots & \vdots & \ddots & \vdots \\ Cov[A^n, A^1] & Cov[A^n, A^2] & \cdots & Var[A^n] \end{pmatrix}
= \begin{pmatrix} (a^1)^† M^{A^1} a^1 & (a^1)^† M^{A^1:A^2} a^2 & \cdots & (a^1)^† M^{A^1:A^n} a^n \\ (a^2)^† M^{A^2:A^1} a^1 & (a^2)^† M^{A^2} a^2 & \cdots & (a^2)^† M^{A^2:A^n} a^n \\ \vdots & \vdots & \ddots & \vdots \\ (a^n)^† M^{A^n:A^1} a^1 & (a^n)^† M^{A^n:A^2} a^2 & \cdots & (a^n)^† M^{A^n} a^n \end{pmatrix} .    (4.9)
Note that to denote the covariance matrix Cov [A1 : ... : An ] we separate
arguments by a colon, while for the scalar covariance Cov [A1 , A2 ] we use
a comma. At full length we could address the framework based on the
M -matrices as the ‘generalized covariance matrix framework’. As a short
hand notation we will usually simply write ‘matrix framework’ and refer to
inequality constraints in this framework as ‘matrix inequalities’ rather than
‘inequalities based on generalized covariance matrices’.
To conclude this subsection, we want to deduce from the non-negativity of the covariance matrix that also the compound matrix ℳ^{A^1:...:A^n} is positive semidefinite.
Lemma 4.1. The covariance matrix of the variables A^1, ..., A^n (with finite alphabets) is positive semidefinite,

Cov[A^1 : ... : A^n] ≥ 0 .    (4.10)
Proof. When writing the variables A^1, ..., A^n in one random vector A = (A^1, ..., A^n)^T one can express the covariance matrix as

Cov[A^1 : ... : A^n] = E[ (A − E[A])^* (A − E[A])^T ] .    (4.11)
Using the linearity of the expectation value, one finds for an arbitrary vector
c ∈ C^n,

c^† Cov[A^1 : ... : A^n] c = c^† E[ (A − E[A])^* (A − E[A])^T ] c = E[ c^† (A − E[A])^* (A − E[A])^T c ] = E[ | c^† (A − E[A])^* |^2 ] ≥ 0 .    (4.12)
Lemma 4.2. For all complex valued block matrices [X^{(ij)}]_{i,j=1}^{n} (of finite dimension) it is the case that


Y := \begin{pmatrix} X^{(11)} & \cdots & X^{(1n)} \\ \vdots & \ddots & \vdots \\ X^{(n1)} & \cdots & X^{(nn)} \end{pmatrix} ≥ 0    (4.13)

if and only if

Z := \begin{pmatrix} (x^1)^† X^{(11)} x^1 & \cdots & (x^1)^† X^{(1n)} x^n \\ \vdots & \ddots & \vdots \\ (x^n)^† X^{(n1)} x^1 & \cdots & (x^n)^† X^{(nn)} x^n \end{pmatrix} ≥ 0    (4.14)
for all complex valued vectors x1 , ..., xn of suitable dimension.
Proof. If Z ≥ 0 for all suitable x^1, ..., x^n, then in particular

0 ≤ (1 \cdots 1)\, Z \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = \bigl( (x^1)^† \cdots (x^n)^† \bigr)\, Y \begin{pmatrix} x^1 \\ \vdots \\ x^n \end{pmatrix}    (4.15)

for all suitable x^1, ..., x^n, and thus Y ≥ 0.
For the converse we can calculate

(r_1^* \cdots r_n^*)\, Z \begin{pmatrix} r_1 \\ \vdots \\ r_n \end{pmatrix} = \sum_{i,j=1}^{n} \underbrace{(r_i x^i)^†}_{=(y^i)^†} X^{(ij)} \underbrace{x^j r_j}_{:= y^j} = \bigl( (y^1)^† \cdots (y^n)^† \bigr)\, Y \begin{pmatrix} y^1 \\ \vdots \\ y^n \end{pmatrix} ≥ 0 .    (4.16)

Since this is true for arbitrary suitable x^1, ..., x^n and likewise arbitrary r ∈ C^n we obtain the desired statement.
The matrix ℳ^{A^1:...:A^n} is of the form of the matrix Y in Lemma 4.2 and the covariance matrix is of the form of the matrix Z. Thus, by combining Lemma 4.1 and the if-part of Lemma 4.2 we obtain:

Corollary 4.1. The compound matrix ℳ^{A^1:...:A^n} of the variables A^1, ..., A^n (with finite alphabets) defined in (4.8) is positive semidefinite.
Aside from being an interesting feature on its own, the non-negativity of ℳ^{A^1:...:A^n} (or rather the bi-partite case ℳ^{A^i:A^j}) will be used later in the proof of the shortly proposed inequality.

4.3 The inequality
4.3.1 Motivation by covariances for the triangular scenario
The inequality
Motivation by covariances for the triangular scenario
To motivate the general inequality proposed in the next subsection, we first
consider the triangular scenario (see Figure 3). We want to construct an
inequality similar to the entropic inequality I (A; B) + I (A; C) ≤ H (A),
first introduced in (3.2). By replacing mutual information with covariances,
one might expect |Cov [A, B]| + |Cov [A, C]| to be bounded. The covariance
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 78
Cov [B, C] should not appear in the inequality. Starting from the covariance
matrix


Var [A]
Cov [A, B] Cov [A, C]

Var [B]
Cov [B, C]
Cov [A : B : C] = Cov [B, A]
,
Cov [C, A] Cov [C, B]
Var [C]
(4.17)
and replacing Cov [B, C] by 0, we propose the inequality

Z A:B:C

Var [A]
Cov [A, B] Cov [A, C]


Cov
[B,
A]
Var [B]
0
:= 
 ≥ 0.
Cov [C, A]
0
Var [C]
(4.18)
Note that this inequality is not trivially satisfied. Of course there exist joint
distributions with Cov [B, C] = 0 for which the inequality is trivial, but here
we assume that in fact Cov [B, C] might be non-zero which will in general
also affect the covariances Cov [A, B] and Cov [A, C]. Since the determinant of a positive semidefinite matrix must be non-negative, we obtain the
inequality
Var[A] Var[B] Var[C] − Var[B] |Cov[A, C]|^2 − Var[C] |Cov[A, B]|^2 ≥ 0
(∗)⇔  |Cov[A, B]|^2 / (Var[A] Var[B]) + |Cov[A, C]|^2 / (Var[A] Var[C]) ≤ 1
 ⇔  |Corr[A, B]|^2 + |Corr[A, C]|^2 ≤ 1 ,    (4.19)

where

Corr[A, B] := Cov[A, B] / √(Var[A] Var[B])    (4.20)
denotes the usual correlation coefficient. Inequality (4.19) can be considered
as the analog of I (A; B) + I (A; C) ≤ H (A) for (squared) correlation coefficients. Keep in mind that with Cov [A, B] also Corr [A, B] will in general be
complex for complex valued random variables. The absolute values keep the
whole expression real. If the alphabets are real valued, as usually is the case,
we do not have to worry about that issue at all. In fact, the main reason to
work with complex variables is merely that some matrix-theoretical results
are better established in the complex case. A restriction to real variables
might have required some additional attention. As a different issue, for the
equivalence relation marked by (∗) we assumed Var[A], Var[B], Var[C] ≠ 0.
Note that we have not yet proven that inequality (4.19) indeed constrains
distributions compatible with the triangular scenario, but only proposed it.
Also, as mentioned before, we do generally not want to work on the level
of covariances and variances, but rather with the M -matrices introduced
in the previous subsections. In this more general framework the analog of
inequality (4.19), or rather (4.18), reads
X^{A:B:C} := \begin{pmatrix} M^A & M^{A:B} & M^{A:C} \\ M^{B:A} & M^B & 0 \\ M^{C:A} & 0 & M^C \end{pmatrix} ≥ 0 .    (4.21)
In this framework (i.e. once the inequality is proven) we do not have to worry
about possibly complex valued variables or vanishing variances at all.
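A direct numerical check of the proposed inequality (4.21) is straightforward once the M-matrices are available. The following sketch (an illustration under the assumption that the joint pmf is given as an array p[a, b, c]) assembles X^{A:B:C} and tests positive semidefiniteness via the smallest eigenvalue.

```python
import numpy as np

def X_ABC(p_abc):
    """Assemble X^{A:B:C} of (4.21) from a joint pmf p_abc[a, b, c]."""
    p_ab, p_ac = p_abc.sum(axis=2), p_abc.sum(axis=1)
    p_a, p_b, p_c = p_ab.sum(axis=1), p_ab.sum(axis=0), p_ac.sum(axis=0)
    MA, MB, MC = (np.diag(p) - np.outer(p, p) for p in (p_a, p_b, p_c))
    MAB = p_ab - np.outer(p_a, p_b)
    MAC = p_ac - np.outer(p_a, p_c)
    ZBC = np.zeros((len(p_b), len(p_c)))          # the block set to zero
    return np.block([[MA,    MAB,   MAC],
                     [MAB.T, MB,    ZBC],
                     [MAC.T, ZBC.T, MC]])

def satisfies_4_21(p_abc, tol=1e-10):
    return np.linalg.eigvalsh(X_ABC(p_abc)).min() >= -tol
```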
4.3.2 General inequality for hidden common ancestor models
Now, consider a general hidden common ancestor model (see Subsection
2.2.4 and Section 4.1) with n observables and ancestors of degree up to m.
We desire an inequality that bounds the pairwise dependence between the
variable A^1 and all other variables in terms of the M-matrices (we pick the specific variable A^1 instead of a general A^j simply for notational convenience). Again, we start with the full matrix ℳ^{A^1:...:A^n} (see (4.8)). Matrices M^{A^i:A^j} carrying the dependence between pairs of variables not including A^1 (i.e. i, j ≠ 1) are set to 0 since they should not appear in the inequality. To take account of the maximal degree m of the ancestors, the matrix M^{A^1} is
equipped with the prefactor m − 1. This is in analogy to the m − 1 prefactor
of the term H (A1 ) in the general entropic inequality (4.1). We propose the
final inequality in the following theorem:
Theorem 4.1. Distributions compatible with a hidden common ancestor
model with n observables A1 , ..., An (with finite alphabets) and ancestors of
degree up to m (with likewise finite alphabets) satisfy the inequality

X^{A^1:...:A^n} := \begin{pmatrix} (m − 1)\,M^{A^1} & M^{A^1:A^2} & \cdots & M^{A^1:A^n} \\ M^{A^2:A^1} & M^{A^2} & & 0 \\ \vdots & & \ddots & \\ M^{A^n:A^1} & 0 & & M^{A^n} \end{pmatrix} ≥ 0 .    (4.22)
Note that we could restrict the inequality to variables that A1 shares at
least one ancestor with. For any pair A1 , Aj without a common ancestor,
the DAG demands A1 and Aj to be independent (see Subsection 2.2.4). If
a distribution violates this independence relation, the distribution is known
to be incompatible with the DAG without having to consider any complicated inequality. On the other hand, if a distribution satisfies all required independence relations, one can use the inequality X^{A^1:...:A^n} ≥ 0 in its above form. Since we have M^{A^1:A^j} = 0 for independent variables, the M^{A^j} block
will be disconnected from the rest of the matrix and trivially be positive
semidefinite. We may further assume that one of the ancestors of A1 indeed
has degree m. Otherwise, one should replace m by m0 , the maximal degree
of A1 ’s ancestors. Using m would still result in a valid inequality, but the
inequality would be unnecessarily loose.
Even though the main focus of this thesis lies on hypothesis tests, Theorem
4.1 is an important result on its own. Testable constraints for models including hidden variables, that are based on the model’s structure alone, are
rare to this day. With this regard, Theorem 4.1 can even be understood as
the main result of this chapter. The proof is carried out in Section 4.4 and
partially prepared in the following subsection. For the sake of readability,
two steps of the proof are presented only for the triangular scenario. The
generalization to arbitrary hidden common ancestor models can be found
in Appendix A. A brief recapitulation of all major steps is given in Subsection 4.4.5. Note that the proof is rather lengthy, Section 4.4 spanning
over roughly 25 pages. In principle, it is possible to skip Section 4.4 without
causing difficulties at understanding the rest of the thesis.
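For completeness, here is a sketch of how the matrix of (4.22) can be assembled numerically for a general hidden common ancestor model. Only the pair marginals involving A^1 are needed; the function name and data layout are ad hoc choices for this illustration.

```python
import numpy as np

def X_general(p1, joints_with_A1, marginals, m):
    """X^{A^1:...:A^n} of (4.22).
    p1: pmf of A^1; joints_with_A1[k]: joint pmf of (A^1, k-th remaining observable);
    marginals[k]: pmf of the k-th remaining observable; m: maximal ancestor degree."""
    M = lambda p: np.diag(p) - np.outer(p, p)
    sizes = [len(p1)] + [len(p) for p in marginals]
    blocks = [[np.zeros((sizes[r], sizes[c])) for c in range(len(sizes))]
              for r in range(len(sizes))]
    blocks[0][0] = (m - 1) * M(p1)
    for i, (p1i, pi) in enumerate(zip(joints_with_A1, marginals), start=1):
        M1i = p1i - np.outer(p1, pi)             # M-matrix of the pair (A^1, A^i)
        blocks[0][i], blocks[i][0], blocks[i][i] = M1i, M1i.T, M(pi)
    return np.block(blocks)

# Compatibility check: np.linalg.eigvalsh(X_general(...)).min() >= -tolerance.
```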
4.3.3 An equivalent representation
Before proving inequality (4.22) we derive an equivalent representation that better resembles the entropic inequality Σ_{i=2}^n I(A^1; A^i) ≤ (m − 1) H(A^1).
This representation also turns out to be more convenient for some parts of
the proof. First, we provide a lemma that gives a necessary and sufficient
condition for a block matrix to be positive semidefinite. The lemma already
exists in the literature in several different versions, see for example Reference
[46] Theorem IX.5.9 or Reference [30] Theorem 7.7.7. We present the lemma
(and the proof) for the sake of completeness and to have a version of the
lemma which is best suited for our purposes.
Lemma 4.3. If R ∈ C^{n×n}, S ∈ C^{m×m} and Q ∈ C^{n×m}, then

X = \begin{pmatrix} R & Q \\ Q^† & S \end{pmatrix} ≥ 0    (4.23)

if and only if

R ≥ 0,   S ≥ 0,   R ≥ Q S^+ Q^†,   Q P_S = Q,    (4.24)

where S^+ denotes the pseudoinverse of S as introduced in Subsection 2.4.2 and P_S denotes the projection onto the range of S. The conditions (4.24) can equivalently be replaced by

R ≥ 0,   S ≥ 0,   S ≥ Q^† R^+ Q,   P_R Q = Q.    (4.25)
First, note that the precondition that the lower left block Q† is the adjoint
of the upper right block Q is required since a positive semidefinite matrix
is necessarily hermitian. Second, in (4.24), the conditions S ≥ 0 and R ≥
QS Q† already imply R ≥ 0. Third, to gain intuition for the conditions,
think of R, S and Q as scalars. The diagonal entries of a positive semidefinite
matrix have to be non-negative, thus R ≥ 0 and S ≥ 0. From the positivity
of the determinant one concludes RS ≥ |Q|2 . In (4.24), the conditions
R ≥ QS Q† and QPS = Q are essentially the generalization of this scalar
condition. We prove Lemma 4.3 for the conditions (4.24). The proof for the
conditions (4.25) is analogous.
Proof. For the if-part choose arbitrary r ∈ C^n and s ∈ C^m. Also recall from Subsection 2.4.3 that the projection onto the range of a positive semidefinite matrix M can be written as P_M = M M^+. Since √M has the same range as M itself (and is also positive semidefinite), we can further write

P_M = P_{√M} = (√M)^+ √M .    (4.26)
Using this identity and the conditions (4.24) we obtain
r † s†
!
R Q
Q† S
r
s
!
= r † Rr + r † Qs + s† Q† r + s† Ss
√ √ √ √
= r † Rr + r † Q S Ss + s† S S Q† r + s† Ss
√ √
√ √ ≥ r † QS Q† r + r † Q S Ss + s† S S Q† r + s† Ss
√
√ † √ †
√ S Q† r + Ss
S Q r + Ss
=
≥ 0.
(4.27)
From the second to the√third
line we used QPS = Q and S ≥ 0 which
√
allows us to write PS = S S. From the third to the fourth line we used
R ≥ QS Q† .
For the only if-part, R ≥ 0 and S ≥ 0 are clear since in particular vectors
T
T
of the form x1 = r 0S and x2 = 0R s with r ∈ Cn and s ∈ Cm
satisfy xi † Xxi ≥ 0. Next, for arbitrary r ∈ Cn , we find
0 ≤
r † −r † QS !
R Q
Q† S
r
†
−S Q r
!
= r † Rr − r † QS Q† r − r † QS Q† r + r † Q S
SS } Q† r
| {z
=PS S =S = r † R − QS Q† r.
(4.28)
Thus, we obtain R ≥ QS Q† . Finally, we need to show that QPS = Q.
For this purpose, define PS⊥ = 1S − PS , the projection onto the orthogonal
complement of range (S) (i.e. the kernel of S). This projection satisfies in
particular SPS⊥ = PS⊥ S = 0. Let s ∈ Cm , r ∈ Cn , x ≥ 0 and θ ∈ R be
arbitrary, then
0 ≤
r † xeiθ s† PS⊥
!
R Q
Q† S
r
−iθ ⊥
xe PS s
!
= r † Rr + xe−iθ r † QPS⊥ s + xeiθ s† PS⊥ Q† r + x2 s† PS⊥ SPS⊥ s (4.29)
|
= r † Rr + 2xRe e−iθ r † QPS⊥ s .
{z
0
}
(4.30)
We know that r † Rr ≥ 0 but if r † QPS⊥ s 6= 0 we can always choose an
appropriate θ and large enough x to make the whole expression negative.
Thus, we require r † QPS⊥ s = 0 ∀r ∈ Cn and ∀s ∈ Cm . From there we can
conclude QPS⊥ = 0 and thus
QPS = Q PS + PS⊥ = Q1S = Q.
(4.31)
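As a quick numerical sanity check of Lemma 4.3 (a sketch only), one can generate a random positive semidefinite block matrix and verify the conditions (4.24), using np.linalg.pinv as the pseudoinverse and S S^+ as the projection onto the range of S.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((7, 2))
X = G @ G.T                          # a positive semidefinite matrix (rank 2)
R, Q, S = X[:4, :4], X[:4, 4:], X[4:, 4:]

S_plus = np.linalg.pinv(S)
P_S = S @ S_plus                     # projection onto range(S)
tol = 1e-10

assert np.linalg.eigvalsh(R).min() >= -tol                      # R >= 0
assert np.linalg.eigvalsh(S).min() >= -tol                      # S >= 0
assert np.linalg.eigvalsh(R - Q @ S_plus @ Q.T).min() >= -tol   # R >= Q S^+ Q^dagger
assert np.allclose(Q @ P_S, Q, atol=1e-8)                       # Q P_S = Q
```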
We can now formulate and prove the following equivalent representation of
inequality (4.22):
Proposition 4.1. A probability distribution on n observables A^1, ..., A^n (with finite alphabets) satisfies inequality (4.22), X^{A^1:...:A^n} ≥ 0, if and only if

Y^{A^1:...:A^n} := Σ_{i=2}^n √(M^{A^1}) M^{A^1:A^i} (M^{A^i})^+ M^{A^i:A^1} √(M^{A^1}) ≤ (m − 1) 1_{A^1} .    (4.32)
To get an intuitive understanding of inequality (4.32), recall that the matrices M^{A^i} and M^{A^1:A^i} can be considered as generalizations of Var[A^i] and Cov[A^1, A^i]. Inequality (4.32) can then be understood as the generalization of

Σ_{i=2}^n (1/√Var[A^1]) Cov[A^1, A^i] (1/Var[A^i]) Cov[A^i, A^1] (1/√Var[A^1]) ≤ (m − 1)
⇔  Σ_{i=2}^n |Corr[A^1, A^i]|^2 ≤ (m − 1) ,    (4.33)

assuming Var[A^j] ≠ 0 for all j = 1, ..., n. Note that at this point this inequality
serves only as an illustration. The inequality is properly stated as a corollary
of Theorem 4.1 in Subsection 4.3.4 and proven in Appendix B.
1
2
Proof. (Prop. 4.1) By identifying R = (m − 1) M A , S = diag M A , ..., M A
1
2
1
n
n
and Q = M A :A . . . M A :A we can use Lemma 4.3 (with the first set
1
n
of conditions) which in this case reduces to the statement that X A :...:A ≥ 0
if and only if R ≥ QS Q† . The conditions R ≥ 0 and S ≥ 0 are trivially
i
satisfied since each matrix M A is already known to be positive semidefinite. Concerning the condition QPS = Q, we can conclude from Lemma 4.3
applied to the matrices
1
M
1
i
1
1
MA
M A :A
=
i
1
i
M A :A
MA
A1 :Ai
i
i
!
≥ 0,
1
(4.34)
i
that M A :A PM Ai = M A :A (the matrices MA :A are known to be positive semidefinite according to Corollary 4.2.2). Due to the block diagonal
structure of S, this implies


QPS =
=
=
MA
MA
MA
1 :A2
1 :A2
. . . MA
1 :A
PM A 2
n 


PM A 2 . . . M A
1 :A2
. . . MA
1 :A
1 :An
...
PM A n
P M An



n
= Q.
(4.35)
Analogously, one obtains PR Q = Q which will be required later. Thus, the
†
only non-trivial condition
Q . This condition can equivalently be
√ QS
√ is R †≥
replaced by 1R ≥ R QS Q R . To see this, first note that the relation
Z1 ≥ Z2 is invariant under transformations of the form Zi → U Zi U † where
U is an arbitrary matrix of suitable dimension,
†
†
†
†
†
x† U Z 1 U
| {zx} = y Z1 y ≥ y Z2 y = x U Z2 U x.
(4.36)
:=y
‘⇒’:
√ Since R ≥ 0 , also R is√positive
semidefinite and in particular hermitian.
√ The transformation Zi → R Zi R is thus of the above form and respects
matrix ordering. By applying this transformation to both sides of R ≥
QS Q† , one obtains
⇒
R ≥ QS Q†
√ √
√ †√ R
R
R
≥
R QS Q R
|
{z
}
PR
⇒
1R ≥
√ †√ R QS Q R .
(4.37)
In the second step we used that any projection is upper bounded by the
identity.
‘⇐’:
⇒
√ †√ R QS Q R
≤ 1R
√ √ †√ √
√
√
R R QS Q R R ≤
R1R R
PR QS Q† PR
⇒
⇒
≤ R
†
≤ R
QS Q
(4.38)
From line one to line two we performed the matrix-order-preserving trans√
√
√ †
√
formation Zi → RZi R (note that R = R). Next, we identified
√ √ √ √
√
√
R R = R R = PR and R1R R = R. In the last step we used
PR Q = Q, which can be shown in exactly the same way as the statement
QPS = Q from above.
√ √ Explicitly writing down the condition 1R ≥ R QS Q† R in terms of
the M -matrices concludes the proof of Proposition 4.1. The product QS Q†
evaluates to
n
QS Q† =
X
MA
1 :Ai
i
i
1
M A M A :A .
(4.39)
i=2
With R = (m − 1) M A the inequality 1R ≥
1
√ †√ R QS Q R reads
n
√
√
X
1
1
1
i
i
i
1
√
M A1
M A1 √
≤ 1A1
M A :A M A M A :A
m−1
m−1
i=2
n √
√
X
1
i
i
i
1
M A1 M A :A M A M A :A M A1 ≤ (m − 1) 1A1 .
⇔
!
i=2
|
{z
1 :...:An
YA
}
(4.40)
1
n
Thus, X^{A^1:...:A^n} ≥ 0 is equivalent to Y^{A^1:...:A^n} ≤ (m − 1) 1_{A^1}. Note that 1_{A^1} is short for 1_{M^{A^1}}.
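The equivalent form (4.32) can also be evaluated directly in code: the matrix square root is taken with scipy.linalg.sqrtm and the pseudoinverse with np.linalg.pinv, and the largest eigenvalue of Y is compared to m − 1. The data layout (one joint pmf of (A^1, A^i) per remaining observable) is an assumption of this sketch.

```python
import numpy as np
from scipy.linalg import sqrtm

def Y_matrix(p1, joints_with_A1):
    """Y^{A^1:...:A^n} from (4.32); joints_with_A1[k] is the joint pmf of (A^1, A^i)."""
    M1 = np.diag(p1) - np.outer(p1, p1)
    sqrt_M1 = np.real(sqrtm(M1))
    Y = np.zeros_like(M1)
    for p1i in joints_with_A1:
        pi = p1i.sum(axis=0)                         # marginal of A^i
        M1i = p1i - np.outer(p1, pi)                 # M^{A^1:A^i}
        Mi_plus = np.linalg.pinv(np.diag(pi) - np.outer(pi, pi))
        Y += sqrt_M1 @ M1i @ Mi_plus @ M1i.T @ sqrt_M1
    return Y

def satisfies_4_32(p1, joints_with_A1, m, tol=1e-10):
    return np.linalg.eigvalsh(Y_matrix(p1, joints_with_A1)).max() <= (m - 1) + tol
```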
4.3.4 Covariances revisited
In Subsection 4.3.3 we have introduced the inequality
Σ_{i=2}^n |Corr[A^1, A^i]|^2 ≤ (m − 1)    (4.41)
(see also (4.33)) in order to get an intuitive understanding of the matrix
inequality (4.32),
Σ_{i=2}^n √(M^{A^1}) M^{A^1:A^i} (M^{A^i})^+ M^{A^i:A^1} √(M^{A^1}) ≤ (m − 1) 1_{A^1} .    (4.42)
It is possible to prove inequality (4.41), more precisely a version on the
level of covariances and variances rather than correlation coefficients, as a
corollary of Theorem 4.1.
Corollary 4.2. All distributions compatible with a hidden common ancestor
model with n observables A1 , ..., An (with finite alphabets) and ancestors of
degree up to m, satisfy the inequality
Σ_{j=2}^n |Cov[A^1, A^j]|^2 Π_{k=2, k≠j}^n Var[A^k] ≤ (m − 1) Π_{i=1}^n Var[A^i] .    (4.43)
Inequality (4.41) can be obtained by demanding that all observables have
non-vanishing variance. The full proof of Corollary 4.2 is presented in Appendix B. Here we sketch the general idea for the special case of the triangular scenario for which Theorem 4.1 states
X^{A:B:C} = \begin{pmatrix} M^A & M^{A:B} & M^{A:C} \\ M^{B:A} & M^B & 0 \\ M^{C:A} & 0 & M^C \end{pmatrix} ≥ 0 .    (4.44)
Given that X A:B:C is positive semidefinite, Lemma 4.2 allows us to conclude
that the matrix

Z^{A:B:C} = \begin{pmatrix} Var[A] & Cov[A, B] & Cov[A, C] \\ Cov[B, A] & Var[B] & 0 \\ Cov[C, A] & 0 & Var[C] \end{pmatrix}    (4.45)
(first introduced in (4.18)) is positive semidefinite as well (for arbitrary
alphabets). To apply Lemma 4.2, recall that one can for example write
Cov [A, B] = a† M A:B b, where the vectors a, b carry the alphabets of A and
B. Positive semidefiniteness of Z A:B:C implies det Z A:B:C ≥ 0 which amounts
to the inequality (see also Subsection 4.3.1)
|Cov [A, B]|2 Var [C]+|Cov [A, C]|2 Var [B] ≤ Var [A] Var [B] Var [C] . (4.46)
This is the special case of inequality (4.43) for the triangular scenario. Calculating the determinant in the general case requires a bit more effort, which
is why the general proof has been moved to Appendix B.
Note that the covariance inequality for one specific choice of alphabet values of the observables is not equivalent to the alphabet independent matrix
inequality X A:B:C ≥ 0. The matrix inequality will always be at least as
powerful as the covariance inequality. To illustrate this statement, we characterize the strength of an inequality by the number of distributions violating
the inequality. More violations correspond to a stronger inequality. Now, assume that a distribution violates X A:B:C ≥ 0. From Lemma 4.2 we can only
conclude that there exist alphabets for which the matrix Z A:B:C from (4.45)
is not positive semidefinite either. However, there might, and in general
will, exist other alphabets for which we obtain Z A:B:C ≥ 0. Thus, it might
happen that even though X A:B:C ≥ 0 is violated, the inequality Z A:B:C ≥ 0
is satisfied. Going further from Z A:B:C ≥ 0 to the scalar covariance inequality (4.46), recall that (4.46) simply states the non-negativity of det Z A:B:C .
But even if Z A:B:C is not positive semidefinite, the determinant might still
be positive (for an even number of negative eigenvalues). Thus, violation
of Z A:B:C ≥ 0 (and in particular X A:B:C ≥ 0) does not imply violation of
inequality (4.46).
Considering the other direction, a negative determinant (i.e. violation of
inequality (4.46)) automatically implies that Z A:B:C is not positive semidefinite. Going further, by applying Lemma 4.2 in the other direction, as soon
as we find violation of Z A:B:C ≥ 0 for an arbitrary choice of alphabets, we
know that X A:B:C ≥ 0 is violated as well. Thus, violation of det Z A:B:C ≥ 0
implies violation of X A:B:C ≥ 0. The latter inequality is therefore stronger
(or at least not weaker) than the former.
This holds not only for the triangular scenario but also for general hidden
common ancestor models. Working in the general matrix framework with the inequality X^{A^1:...:A^n} ≥ 0 is thus not just a matter of taste. The matrix
inequality is indeed stronger than the inequality for covariances. The gap
between the inequalities will be illustrated later by one example in Figure
23.
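The gap can also be probed numerically. The sketch below evaluates the scalar criterion det Z^{A:B:C} ≥ 0 of (4.46) for one particular choice of alphabets, to be compared with the alphabet-independent check of X^{A:B:C} ≥ 0 sketched in Subsection 4.3.1; the alphabets a, b, c are free inputs of this illustration.

```python
import numpy as np

def det_Z_criterion(p_abc, a, b, c):
    """True iff inequality (4.46) holds for the chosen alphabets a, b, c."""
    p_ab, p_ac = p_abc.sum(axis=2), p_abc.sum(axis=1)
    p_a, p_b, p_c = p_ab.sum(axis=1), p_ab.sum(axis=0), p_ac.sum(axis=0)
    a, b, c = map(np.asarray, (a, b, c))
    var = lambda p, x: np.sum(p * np.abs(x) ** 2) - np.abs(np.sum(p * x)) ** 2
    cov_ab = np.conj(a) @ (p_ab - np.outer(p_a, p_b)) @ b
    cov_ac = np.conj(a) @ (p_ac - np.outer(p_a, p_c)) @ c
    lhs = np.abs(cov_ab) ** 2 * var(p_c, c) + np.abs(cov_ac) ** 2 * var(p_b, b)
    return lhs <= var(p_a, a) * var(p_b, b) * var(p_c, c)
```

A violation of this scalar criterion for any tried alphabet implies a violation of X^{A:B:C} ≥ 0, while the converse direction fails in general.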
4.4 Proving the inequality
The proof of Theorem 4.1 is split into several parts. First, we show that
if the inequality (‘the inequality’ might refer to any of (4.22) or (4.32)) is
satisfied for one given distribution, then it will also be satisfied for any distribution that can be obtained from the initial one by local transformations.
The concept of local transformations will be introduced along the way. Second, the inequality is shown to hold for one specific family of distributions.
To demonstrate that the inequality is not trivially satisfied, a counterexample is presented as well. To conclude the proof, we show that all distributions
compatible with a given hidden common ancestor model can be obtained by
local transformations (and a subsequent limit procedure) starting with the
family of distributions shown to be compatible in the previous step. To wrap
everything up, we give a brief overview of all important steps of the proof.
Note that in this section some parts of the proof are presented only for the
triangular scenario. The generalization to arbitrary hidden common ancestor
models can be found in Appendix A. Also note that this section spans over
roughly 25 pages. In principle, the rest of the thesis can be understood even
if this section is skipped.
4.4.1 Invariance under local transformations
A local transformation of a single random variable A → A0 can be defined via
conditional probabilities, essentially employing the law of total probability
(2.3),
P(A′ = k) = Σ_{l=1}^{K_A} P(A′ = k | A = l) P(A = l) .    (4.47)
In the following we will usually use the short hand notation
PA0 |A (k | l) ≡ P (A0 = k | A = l) ,
(4.48)
and so on. Locally transforming several variables A1 , ..., An → A10 , ..., An0
with joint distribution P (A1 , ..., An ) reads
P_{A^1′,...,A^n′}(k_1, ..., k_n) = Σ_{l_1,...,l_n} P_{A^1′|A^1}(k_1 | l_1) · · · P_{A^n′|A^n}(k_n | l_n) P_{A^1,...,A^n}(l_1, ..., l_n) .    (4.49)
The transformations are called local, since each single transformation Aj →
Aj0 will only affect the marginal of the variable Aj . In particular, a product
distribution P(A^1, ..., A^n) = P(A^1) · · · P(A^n) will be transformed to a product distribution P(A^1′, ..., A^n′) = P(A^1′) · · · P(A^n′). In our matrix framework a local transformation A → A′ (between variables with finite alphabets) can be represented by the matrix

T^{A′,A} := [ P_{A′|A}(k | l) ]_{k,l=1}^{K_{A′}, K_A} .    (4.50)
When likewise representing probability distributions as vectors,

P_A := ( P_A(1), ..., P_A(K_A) )^T ,    (4.51)

the transformation A → A′ reads

P_{A′} = T^{A′,A} P_A .    (4.52)

We further define

T^{A,A′} := ( T^{A′,A} )^T .    (4.53)
Since the transformation matrices are real valued, the transpose is simultaneously the adjoint. Note that for a given transformation A → A′ the backwards transformation A′ → A does generally not exist. T^{A,A′} is thus not the inverse of T^{A′,A}. Recall that the goal of this subsection is to show
that inequality (4.22) from Theorem 4.1 remains valid under local transformations. We prepare the proof by introducing two lemmata.
Lemma 4.4. Under local transformations A → A′, B → B′ (between variables with finite alphabets) the matrix M^{A:B} defined in (4.3) properly transforms as

M^{A′:B′} = T^{A′,A} M^{A:B} T^{B,B′} .    (4.54)
Proof.

M^{A′:B′}_{k_A,k_B} = P_{A′,B′}(k_A, k_B) − P_{A′}(k_A) P_{B′}(k_B)
 = Σ_{l_A,l_B} P_{A′|A}(k_A | l_A) P_{B′|B}(k_B | l_B) P_{A,B}(l_A, l_B) − Σ_{l_A} P_{A′|A}(k_A | l_A) P_A(l_A) Σ_{l_B} P_{B′|B}(k_B | l_B) P_B(l_B)
 = Σ_{l_A,l_B} P_{A′|A}(k_A | l_A) [ P_{A,B}(l_A, l_B) − P_A(l_A) P_B(l_B) ] P_{B′|B}(k_B | l_B)
 = Σ_{l_A,l_B} T^{A′,A}_{k_A,l_A} M^{A:B}_{l_A,l_B} T^{B,B′}_{l_B,k_B}
 = [ T^{A′,A} M^{A:B} T^{B,B′} ]_{k_A,k_B} .    (4.55)
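Lemma 4.4 can also be checked numerically; the following sketch draws a random joint distribution and random column-stochastic transformation matrices and verifies (4.54).

```python
import numpy as np

rng = np.random.default_rng(1)
KA, KB, KA2, KB2 = 3, 4, 2, 5

p_ab = rng.random((KA, KB)); p_ab /= p_ab.sum()       # joint pmf of (A, B)
T_A = rng.random((KA2, KA)); T_A /= T_A.sum(axis=0)   # T^{A',A}: columns sum to 1
T_B = rng.random((KB2, KB)); T_B /= T_B.sum(axis=0)   # T^{B',B}

p_ab_new = T_A @ p_ab @ T_B.T                         # joint pmf of (A', B')

def M_AB(p):
    return p - np.outer(p.sum(axis=1), p.sum(axis=0))

# Eq. (4.54): M^{A':B'} = T^{A',A} M^{A:B} T^{B,B'} with T^{B,B'} = (T^{B',B})^T.
assert np.allclose(M_AB(p_ab_new), T_A @ M_AB(p_ab) @ T_B.T)
```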
Unfortunately, and maybe surprisingly, the single variable matrix M^A from (4.6) does not satisfy this exact transformation behaviour, that is M^{A′} ≠ T^{A′,A} M^A T^{A,A′}. The reason is the structural difference between the matrices
M A:B and M A brought about by the Kronecker delta appearing in M A .
Fortunately, we have the following lemma which is sufficient for our purposes.
Lemma 4.5. Under a local transformation A → A′ (between variables with finite alphabets) the matrices M^A and M^{A′} defined by (4.6) satisfy

M^{A′} ≥ T^{A′,A} M^A T^{A,A′} .    (4.56)
Proof. We have to show that
h
0
0
0
i
a† M A − T A ,A M A T A,A a
X
=
h
0
0
a∗k1 M A − T A ,A M A T A,A
k1 ,k2
0
i
k1 ,k2
ak 2
(4.57)
is non-negative for all complex valued vectors a of suitable dimension. To
0
0
0
this end we explicitly write down the matrices M A and T A ,A M A T A,A ,
0
MkA1 ,k2 = PA0 (k1 ) δk1 ,k2 − PA0 (k1 ) PA0 (k2 )
=
X
PA0 |A (k1 | l1 ) PA (l1 ) δk1 ,k2
l1
−
X
l1
PA0 |A (k1 | l1 ) PA (l1 )
X
l2
PA0 |A (k2 | l2 ) PA (l2 ) (4.58)
and
h
0
T A ,A M A T A,A
X
=
0
i
k1 ,k2
PA0 |A (k1 | l1 ) [PA (l1 ) δl1 ,l2 − PA (l1 ) PA (l2 )] PA0 |A (k2 | l2 )
l1 ,l2
X
=
PA0 |A (k1 | l1 ) PA (l1 ) PA0 |A (k2 | l1 )
l1
X
−
PA0 |A (k1 | l1 ) PA (l1 ) PA (l2 ) PA0 |A (k2 | l2 ) .
(4.59)
l1 ,l2
By combining the two expressions one obtains
0
h
0
M A − T A ,A M A T A,A
= δk1 ,k2
X
0
i
k1 ,k2
PA0 |A (k1 | l1 ) PA (l1 )
l1
−
X
PA0 |A (k1 | l1 ) PA0 |A (k2 | l1 ) PA (l1 ) .
(4.60)
l1
Insertion into (4.57) yields
X
0
h
0
a∗k1 M A − T A ,A M A T A,A
k1 ,k2
|ak1 |2
=
X
X
k1
l1
−
X
a∗k1
k1 ,k2
i
k1 ,k2
ak2
PA0 |A (k1 | l1 ) PA (l1 )
X
PA0 |A (k1 | l1 ) PA0 |A (k2 | l1 ) PA (l1 ) ak2
l1

X
X
PA (l1 )  |ak1 |2 PA0 |A (k1
=
0
l1
k1
|
|
X
l1 ) − ak1 PA0 |A (k1
k1
{z
≥0
≥ 0.
0
0
|
2 

l1 ) 
}
(4.61)
0
M A − T A ,A M A T A,A is thus positive semidefinite. To see the last step, note
that |·|2 is a convex function. By definition, a (suitably defined) function f
is convex if
λf (x1 ) + (1 − λ) f (x2 ) ≥ f (λx1 + (1 − λ) x2 )
(4.62)
∀0 ≤ λ ≤ 1 and ∀x1 , x2 in the domain of f . This inequality straightforwardly
extends to larger mixtures given that the mixing coefficients are non-negative
and sum to unity. In our case the role of the mixing coefficients is played by
the probabilities PA0 |A (k1 | l1 ). The function |·|2 is well known to be convex.
By employing Lemmata 4.4 and 4.5 we can finally prove the desired statement.
Lemma 4.6. If a probability distribution on the variables A1 , ..., An (with
finite alphabets) satisfies inequality (4.22),

XA
1 :...:An
(m − 1) M A

2
1
 M A :A


..
=
.


..

.

An :A1
M
1
1
M A :A
2
MA
2
···
0
..
.
..
.
···
0
..
.
0
1
· · · M A :A
···
0
..
..
.
.
..
.
0
n
0
MA
n










≥ 0,
then also the distribution on the variables A10 , ..., An0 (with finite alphabets)
obtained by local transformations A1 → A10 , ..., An → An0 satisfies

XA
10 :...:An0
(m − 1) M A

20
10
 M A :A


..
=
.


..

.

An0 :A10
M
10
10
M A :A
20
MA
0
..
.
0
Proof. We show that in fact X A
pound transformation matrix

10 :...:An0
TA

T := 

1
n
20
···
0
..
.
..
.
···
≥ T XA
10
· · · M A :A
···
0
..
..
.
.
..
.
0
n0
0
MA
1 :...:An
10 ,A1
n0










≥ 0.
(4.63)
T T where T is the com-

...
T
An0 ,An

.

(4.64)
1
n
Note that from X A :...:A ≥ 0 we can conclude T X A :...:A T T ≥ 0 (see also
10
n0
(4.36) and the text above; note that T T = T † ). The relation X A :...:A ≥
1
n
10
n0
T X A :...:A T T therefore implies X A :...:A ≥ 0. To show the desired relation
we can calculate
10
n0
1
n
X A :...:A − T X A :...:A T T
h
i

10
10
1
1
1
10
(m − 1) M A − T A ,A M A T A ,A

 M A20 :A10 − T A20 ,A2 M A2 :A1 T A1 ,A10

=
..

.

..
.
by Lemma 4.4
h
i

10
10
1
1
1
10
(m − 1) M A − T A ,A M A T A ,A


=
0

..
.
MA
10
:A20
− TA
20
10
20
MA − TA
,A1
,A2
MA
1
:A2
2
TA
2
MA TA
2
,A20
,A20
0
..
.
20
20
,A2
..
0
..
.
..
0
MA − TA
···
2
2
MA TA
.
,A20
.

···

· · ·

.. 
.

..
.

···

.. 
.

..
.
by Lemma 4.5
≥0,
(4.65)
In the last step we used that a block diagonal matrix is positive semidefinite
if this is true for each block. Each single block is positive semidefinite due
to Lemma 4.5.
4.4.2
Proof for a special family of distributions
As the second step of the proof of Theorem 4.1 we show that inequality
(4.22), or rather the equivalent inequality (4.32), is satisfied for a specific
family of distributions. Here, we will only consider the special case of the triangular scenario (see Figure 3), i.e. n = 3 observables A, B, C and ancestors
λAB , λAC , λBC of degree m = 2. Inequality (4.32) applied to the triangular
scenario reads
√
√
√
√
M A M A:B M B M B:A M A + M A M A:C M C M C:A M A ≤ 1A .
(4.66)
The general case is treated in Appendix A.1. While there are strong similarities, the proof for the general case requires one more final step and the
notation is more complicated. Fortunately, the general procedure can be
understood equally well by restricting to the triangular scenario.
We model each observable as the joint of two independent subvariables A =
{ A1 , A2 }, B = { B1 , B2 } and C = { C1 , C2 }, with distributions P (A) =
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 94
A
A1
A2
B2
C1
B1
B
C2
C
Figure 19: The triangular scenario where each observable is modeled by two
subvariables. Correlations between pairs of subvariables play the role of the
ancestors of the original scenario.
P (A1 , A2 ) = P (A1 ) P (A2 ) etc. The hidden ancestor structure is modeled by
assuming that one subvariable of each observable is correlated with exactly
one subvariable of any other observable, by choice A1 ↔ B2 , B1 ↔ C2 and
C1 ↔ A2 . For our purpose we assume these correlations to be perfect, i.e.
1
δk l ,
PA1 ,B2 (kA , lB ) =
KAB A B
1
PC1 ,A2 (kC , lA ) =
δk l
KAC C A
1
PB1 ,C2 (kB , lC ) =
δk l
(4.67)
KBC B C
where KAB (KAC , KBC ) is the common (finite) alphabet size of A1 and B2
(C1 and A2 ; B1 and C2 ). Without loss of generality, not only the alphabet
sizes but also the alphabets themselves can be assumed to coincide (taking
the integer values kA , lB = 1, ..., KAB etc). There shall be no further correlations beside those defined here. A graphical illustration is provided by
Figure 19. The joint distribution of the variables A, B, C reads
PA,B,C = PA1 ,A2 ,B1 ,B2 ,C1 ,C2 (kA , lA , kB , lB , kC , lC )
= PA1 ,B2 (kA , lB ) PB1 ,C2 (kB , lC ) PC1 ,A2 (kC , lA )
1
=
δk l δk l δk l .
KAB KAC KBC A B B C C A
(4.68)
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 95
Note that aside from being finite, the alphabet sizes KAB , KAC and KBC are
arbitrary, justifying the term ‘family of distributions’. This will become important in the next step of the proof in Subsection 4.4.4, where distributions
on large alphabets are used to model more general distributions on smaller
alphabets. We further denote the total alphabet sizes of the variables A, B
and C by
KA := KAB KAC ,
KB := KAB KBC ,
KC := KAC KBC .
(4.69)
1
δ
δ
δ
Proposition 4.2. The distributions P (A, B, C) = KAB KAC
KBC kA lB kB lC kC lA
on the variables A = { A1 , A2 }, B = { B1 , B2 } and C = { C1 , C2 } with arbitrary, finite alphabet sizes KAB , KAC , KBC , satisfy inequality (4.66),
√
√
√
√
M A M A:B M B M B:A M A + M A M A:C M C M C:A M A ≤ 1A .
The corresponding statement for general hidden common ancestor models
is proven in Appendix A.1.
Proof. We have to construct the matrices M A , M B , M C , M A:B and M A:C
defined in (4.3) and (4.6). As the main ingredient we require the mono- and
bi-partite marginals of P (A, B, C). Marginalization over, for example C,
amounts to marginalization over the two subvariables C1 and C2 . This leads
to the bi-partite distributions
1
δk l
KAB KAC KBC A B
1
δk l .
and PA1 ,A2 ,C1 ,C2 (kA , lA , kC , lC ) =
KAB KAC KBC C A
PA1 ,A2 ,B1 ,B2 (kA , lA , kB , lB ) =
(4.70)
Continuing the marginalization, one obtains the single variable marginals
1
1
=
,
KAB KAC
KA
1
1
PB1 ,B2 (kB , lB ) =
=
KAB KBC
KB
1
1
and PC1 ,C2 (kC , lC ) =
=
.
KAC KBC
KC
PA1 ,A2 (kA , lA ) =
(4.71)
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 96
As initially demanded, the distributions factorize according to, for example,
PA1 ,A2 (kA , lA ) = PA1 (kA ) PA2 (lA ) with PA1 (kA ) = 1/KAB and PA2 (lA ) =
1/K . Since this was clear by construction, one could have written down
AC
the marginals without any calculations.
With (4.70) and (4.71) we essentially have all ingredients that are required
to write down the M -matrices. We aim, however, for a concise operator
representation of the M -matrices which will considerably simplify all following calculations. To this end, we switch to the Dirac notation and represent
mono-partite marginals as Ket-vectors and bi-partite marginals as operators.
P (A) = P (A1 , A2 )
=
K
AB K
AC
X
X
PA1 (kA ) PA2 (lA ) |kA iA1 ⊗ |lA iA2
kA =1 lA =1
=
K
AB K
AC
X
X
1
1
|kA iA1 ⊗
|lA iA2
KAC
kA =1 lA =1 KAB
= √
1
K
AB
X
K
AC
X
1
1
√
√
|kA iA1 ⊗
|lA iA2
KAB
KAC
kA =1
lA =1
KAB KAC
1
|IA1 i ⊗ |IA2 i
= √
KA
1
|IA i .
= √
KA
(4.72)
In the last two steps we introduced the normalized states
K
AB
X
1
|IA1 i := √
|kA iA1 ,
KAB kA =1
AC
1 KX
|IA2 i := √
|lA iA2 ,
KAC lA =1
|IA i := |IA1 i ⊗ |IA2 i .
(4.73)
Similarly, we obtain
1
|IB i
KB
1
and P (C) = √
|IC i ,
KC
P (B) = √
(4.74)
(4.75)
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 97
with definitions of the |Ii -states analogous to those from (4.73). In the
bi-partite case we obtain (note that we use upright boldface to denote the
operator representation of a bi-partite probability distribution)
P (A, B) =
X
PA1 ,A2 ,B1 ,B2 (kA , lA , kB , lB )
kA ,lA ,kB ,lB
|kA iA1 ⊗ |lA iA2 hkB |B1 ⊗ hlB |B2
X
1
=
δkA lB
kA ,lA ,kB ,lB KAB KAC KBC
|kA iA1 ⊗ |lA iA2
hkB |B1 ⊗ hlB |B2


X
1

√
|kA iA1 hkA |B2  ⊗
=
KAB KAC KBC kA

1
⊗ √
KAC

X
|lA iA2   √
lA
1
KBC

X
hkB |B1 
kB
A1 ↔B2
1
= √
1 ⊗ |IA2 i hIB1 | ,
KA KB
(4.76)
where we defined the symbol
A1 ↔B2
1
:=
X
|kA iA1 hkA |B2 .
(4.77)
kA
When acting from the left,
A1 ↔B2
1
transforms a state from the space of B2 to
A1 ↔B2
the ‘same’ state in the space of A1 , e.g. 1 |IB2 i = |IA1 i. This is possible
since A1 and B2 have the same alphabets (in particular the same alphabet
size KAB ). When acting from the right, the transformation goes in the other
A1 ↔B2
direction. Thus, 1 can essentially be regarded as an ‘identity operation
between isomorphic spaces’. Similarly, we obtain
A2 ↔C1
1
|IA1 i hIC2 | ⊗ 1 .
P (A, C) = √
K A KC
(4.78)
Note that the roles of A1 and A2 are reversed compared to P (A, B). The
reason is that A is correlated with C via A2 while the correlation with B
was mediated by A1 . Employing (4.72) to (4.78) we can now simply write
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 98
down the matrices M A:B and M A:C ,
M A:B = P (A, B) − P (A)P (B)†
!
A1 ↔B2
1
= √
1 ⊗ |IA2 i hIB1 | − |IA i hIB |
KA KB
!
A1 ↔B2
1
= √
1 − |IA1 i hIB2 | ⊗ |IA2 i hIB1 | ,
KA KB
and
M
A:C
1
=√
|IA1 i hIC2 |
KA KC
(4.79)
!
A2 ↔C1
1 − |IA2 i hIC1 |
(4.80)
Concerning the mono-partite matrices M A , M B and M C , additional care
has to be taken due to the Kronecker delta in the definition of the matrices
(see (4.6)). One can write

MA = 

X
k,l,k0 ,l0
=
δkk0 δll0 PA1 ,A2 (k, l) |kiA1 hk 0 |A1 ⊗ |liA2 hl0 |A2  − P (A)P (A)†
1
1 X
|kiA1 hk|A1 ⊗ |liA2 hl|A2 −
|IA1 i hIA1 | ⊗ |IA2 i hIA2 |
KA k,l
KA
1
(1A1 ⊗ 1A2 − |IA1 i hIA1 | ⊗ |IA2 i hIA2 |)
KA
1
(1A − |IA i hIA |) .
=
KA
=
(4.81)
Similarly, we obtain
1
(1B − |IB i hIB |)
KB
1
=
(1C − |IC i hIC |) .
KC
MB =
and M C
(4.82)
(4.83)
It is instructive to realize that 1Ω − |IΩ i hIΩ | (for Ω = A, B, C) is a projection. Due to this fact, taking the square root or (pseudo) inverse of M Ω (as
required by inequality (4.66)) simply amounts to taking the square root or
inverse of the prefactor 1/KΩ . Using this realization and (4.79), (4.81) and
√
√
(4.82), we can calculate the product M A M A:B M B M B:A M A . We
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 99
start by considering the product of the first two matrices,
√
M A M A:B
A1 ↔B2
1
= KA (1A − |IA i hIA |) × √
1 − |IA1 i hIB2 | ⊗ |IA2 i hIB1 |
KA KB
p
A1 ↔B2
1
= KA M A:B − √
1 − |IA1 i hIB2 | ⊗ |IA2 i hIB1 |
|IA1 i hIA1 | ⊗ |IA2 i hIA2 | ×
KB
p
1
= KA M A:B − √
(|IA i hIB2 | − |IA1 i hIB2 |) ⊗ |IA2 i hIB1 |
{z
}
KB | 1
p
0
= KA M A:B .
p
(4.84)
√
We see that M A merely has a scalar-multiplicative effect on M A:B (M A:B
√
√
is an ‘eigenoperator’ of M A ). The effect of M A on M B:A from the right
is exactly the same, and the action of M B is analogous as well (providing
the prefactor KB ). By exploiting this behaviour and using M A:B from (4.79),
we obtain
√
√
M A M A:B M B M B:A M A
q
= KA M
A:B
=KA KB M
=
A1 ↔B2
M
A:B
B
M
M
B:A
q
KA
B:A
!
1 − |IA1 i hIB2 | ⊗ |IA2 i hIB1 | ×
B2 ↔A1
!
1 − |IB2 i hIA1 | ⊗ |IB1 i hIA2 |
= (1A1 − |IA1 i hIA1 |) ⊗ |IA2 i hIA2 | .
(4.85)
√
√
When calculating M A M A:C M C M C:A M A it is important to take into
account the reversed roles of A1 and A2 . Thus,
√
√
M A M A:C M C M C:A M A = |IA1 i hIA1 | ⊗ (1A2 − |IA2 i hIA2 |) . (4.86)
To
this step √
of the proof√it remains to show that
sum of
√ conclude
√ the
A:B
B
B:A
A:C
C
C:A
A
A
A
A
M M M M
M
and M M M M
M
is indeed
upper bounded by the identity. To this end, one can realize that the two
terms are hermitian, mutually orthogonal projections, i.e.
[(1A1 − |IA1 i hIA1 |) ⊗ |IA2 i hIA2 |]2 =(1A1 − |IA1 i hIA1 |) ⊗ |IA2 i hIA2 | ,
(4.87)
[|IA1 i hIA1 | ⊗ (1A2 − |IA2 i hIA2 |)] = |IA1 i hIA1 | ⊗ (1A2 − |IA2 i hIA2 |)
(4.88)
2
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 100
and
[(1A1 − |IA1 i hIA1 |) ⊗ |IA2 i hIA2 |] × [|IA1 i hIA1 | ⊗ (1A2 − |IA2 i hIA2 |)] = 0. (4.89)
The sum of two such projections, P1 and P2 , is again a projection,
(P1 + P2 )2 =
P 2 + P1 P2 + P2 P1 + P22
1
|{z}
=P1
| {z }
=0
= P1 + P2 .
| {z }
=0
|{z}
=P2
(4.90)
Furthermore, recall that the spectrum of a projection consists only of the
eigenvalues 0 and 1. Due to this, any projection is upper bounded by the
identity operator (for operators that are diagonal in the same basis it suffices
to compare the eigenvalues). Thus, the family of distributions considered in
this subsection satisfies the desired inequality (4.66),
√
√
√
√
M A M A:B M B M B:A M A + M A M A:C M C M C:A M A ≤ 1A .
4.4.3
Counter example
In this subsection we construct a distribution for which inequality (4.32)
((4.66) for the triangular scenario) is violated, proving that the inequality poses indeed a non-trivial constraint to the corresponding hidden common ancestor model. Since calculations are simpler than in the previous
subsection (in fact fairly similar but without the tensor product structure)
we directly consider the general case of n observables A1 , ..., An and ancestors of arbitrary degree m. Note that, as pointed out in Subsection 2.2.4,
any distribution can be realized by one ancestor common to all observables
(i.e. m = n). Thus, we restrict ourselves to the non-trivial case m < n. It is
reasonable that a distribution where all observables are perfectly correlated
is not compatible with such a scenario. It is desirable that this distribution
also violates inequality (4.32). Here, we show that this is indeed the case.
The joint distribution of all observables (with common, finite alphabet size
K) reads
1
PA1 ,...,An (k1 , ..., kn ) = δk1 k2 ...kn .
(4.91)
K
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 101
The ‘multidimensional Kronecker delta’ demands that all indices coincide.
For the mono- and bi-partite marginals one finds
1
δk ,k ,
K 1 j
1
.
PAj (kj ) =
K
(4.92)
PA1 ,Aj (k1 , kj ) =
(4.93)
Employing the Dirac notation from the previous subsection, the vector and
operator representations of these distributions can be expressed as
P A1 , Aj
and P Aj
=
K
1 A1 ↔Aj
1 X
1
|kiA1 hk|Aj =
K k=1
K
(4.94)
=
K
1 X
|ki j
K k=1 A
(4.95)
1
= √ |IAj i .
K
The M -matrices become
M
and
A1 :Aj
M
Aj
1
=
K
=
K
X
A1 ↔Aj
!
1 − |IA1 i hIAj |
(4.96)
δkl PAj (k) |kiAj hl|Aj − P Aj P Aj
†
k.l=1
1
(1Aj − |IAj i hIAj |) .
(4.97)
K
√
√
1
j
j
j
1
To calculate the expression M A1 M A :A M A M A :A M A1 required for
inequality (4.32), we start by considering the product of the first two matrices. Similarly to the previous subsection, we find
=
√
M A1
M
A1 :Aj
√
!
A1 ↔Aj
1
1 − |IA1 i hIAj |
=
K (1A1 − |IA1 i hIA1 |)
K
!
√ 1 A1 ↔Aj
K
=
1 − |IA1 i hIAj |
K
√
1
j
=
KM A :A .
(4.98)
√
j
matrix M A in the middle
The second matrix√ M A1 at the right, and the
√
1
j
j
j
1
of the expression M A1 M A :A M A M A :A M A1 similarly provide the
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 102
scalar factors
√
K and K. Thus, one can calculate
√
√
1
j
j
j
1
M A1 M A :A M A M A :A M A1
=K 2 M A
=
1 :Aj
MA
j :A1
A1 ↔Aj
!
1 − |IA1 i hIAj |
Aj ↔A1
!
1 − |IAj i hIA1 |
=1A1 − |IA1 i hIA1 | .
(4.99)
Since we get one such summand for each j = 2, ..., n, inequality (4.32)
becomes
?
(n − 1) (1A1 − |IA1 i hIA1 |) ≤ (m − 1) 1A1 .
(4.100)
Since 1A1 − |IA1 i hIA1 | is a projection, the left hand side has eigenvalues 0
and n − 1, the latter being realized by states orthogonal to |IA1 i. Since all
eigenvalues of the right hand side take the value m−1, and since we demand
m < n, the left hand side is not upper bounded by the right hand side,
(n − 1) (1A1 − |IA1 i hIA1 |) (m − 1) 1A1 .
(4.101)
Thus, we found a distribution that violates the inequality.
4.4.4
Generating the whole scenario by local transformations
In this subsection we show that one can reach all distributions compatible
with a given hidden common ancestor model by local transformations (and a
subsequent limit procedure) starting with the corresponding family of distributions introduced in Subsection 4.4.2 (Appendix A.1 for the general case).
According to Lemma 4.6, all distributions obtained by local transformations
will automatically satisfy inequality (4.22) (and equivalently (4.32)). We
divide the current elaboration into three steps. First of all, recall that in
Subsection 4.4.2 (and Appendix A.1) we modeled the hidden ancestors by a
set of subvariables with perfect correlation. In case of the triangular scenario
we had λAB , A1 ↔ B2 , λBC , B1 ↔ C2 and λAC , C1 ↔ A2 . The joint
distribution of the observables could be written as (see also (4.68))
P (A, B, C) = P (A1 , A2 , B1 , B2 , C1 , C2 )
= P (A1 , B2 ) P (B1 , C2 ) P (C1 , A2 )
1
δA B δB C δC A .
=
KAB KAC KBC 1 2 1 2 1 2
(4.102)
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 103
Note that as a short hand notation we omit explicitly assigning values to the
variables. In particular, the Kronecker deltas are to be understood that both
variables shall take the same value. On the other hand, according to Subsection 2.2.4 (employing the Markov condition and marginalizing over the
hidden ancestors), all distributions compatible with the triangular scenario
are of the form
P (A, B, C)
=
X
P (A | λAB , λAC ) P (B | λAB , λBC ) P (C | λAC , λBC )
λAB ,λAC ,λAB
·P (λAB ) P (λAC ) P (λBC ) .
(4.103)
This decomposition looks considerably different from the above family of
distributions. The first of our three steps thus aims to transform the distribution (4.102) to better resemble (4.103). This transformation also leads
to a larger family of obtainable distributions. Here, as before in Subsection
4.4.2, this step is only performed for the triangular scenario while the general
case in presented in Appendix A.2. Essentially, the reason is again mainly a
notational one. The general case requires more complicated notations, some
of which are introduced only in Appendix A.1. As the result of the first step,
we will obtain (4.103) but with uniform ancestors P (λx = j) = 1/Kx . In the
second step, we use these uniform ancestors with large alphabets to model
more general ancestors with arbitrary rational probabilities. The third step
extends the result to irrational probabilities. Both, the second and the third
step can directly be presented for the most general case since all ancestors
can be considered separately.
Step 1: Locally transforming A = { A1 , A2 } → A0 ,...
Before establishing the actual result of this step, we illustrate to what extent
the family of distributions from (4.102) is restricted. The variable A was
defined as the joint of the two subvariables A1 and A2 with factorizing distribution P (A) = P (A1 , A2 ) = P (A1 ) P (A2 ) (and similarly for B and C;
see also Subsection 4.4.2). Furthermore, the variables A1 and B2 , B1 and C2
as well as C1 and A2 have the same alphabet sizes. This strongly limits the
alphabet sizes of the compound observables A, B and C. It is for example
impossible to let all variables be binary. Constructing A and B as binary
would force C to have either one or four outcomes. Thus, already for the
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 104
reason to allow arbitrary alphabet sizes of the observables, local transformations A = { A1 , A2 } → A0 , B = { B1 , B2 } → B 0 and C = { C1 , C2 } → C 0
are required.
Proposition 4.3. Starting with the family of distributions PA1 ,A2 ,B1 ,B2 ,C1 ,C2 =
1
δ
δ
δ
(first introduced in (4.68)), one can obtain all
KAB KAC KBC A1 B2 B1 C2 C1 A2
distributions of the form (4.103) with uniform ancestors P (λx = j) = 1/Kx
(and finite alphabets),
P (A, B, C)
X
=
P (A | λAB , λAC ) P (B | λAB , λBC ) P (C | λAC , λBC )
λAB ,λAC ,λAB
·
1
,
KAB KAC KBC
(4.104)
via local transformations A = { A1 , A2 } → A0 , B = { B1 , B2 } → B 0 and
C = { C1 , C2 } → C 0 .
The case of general hidden common ancestor models is considered in Appendix A.2.
Proof. By locally transforming PA1 ,A2 ,B1 ,B2 ,C1 ,C2 =
we obtain
1
δ
δ
δ
,
KAB KAC KBC A1 B2 B1 C2 C1 A2
P (A0 , B 0 , C 0 )
P (A0 | A1 , A2 ) P (B 0 | B1 , B2 ) P (C 0 | C1 , C2 )
X
=
A1 ,A2 ,B1 ,B2 ,C1 ,C2
X
=
·P (A1 , A2 , B1 , B2 , C1 , C2 )
P (A0 | A1 , A2 ) P (B 0 | B1 , B2 ) P (C 0 | C1 , C2 )
A1 ,A2 ,B1 ,B2 ,C1 ,C2
·
(∗)
=
X
1
δA B δB C δC A
KAB KAC KBC 1 2 1 2 1 2
P (A0 | A1 , C1 ) P (B 0 | B1 , A1 ) P (C 0 | C1 , B1 )
A1 ,B1 ,C1
=
X
1
KAB KAC KBC
P (A0 | λAB , λAC ) P (B 0 | λAB , λBC ) P (C 0 | λAC , λBC )
λAB ,λAC ,λBC
·
1
.
KAB KAC KBC
(4.105)
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 105
For the equality marked with (∗), note that strictly speaking the Kronecker
delta δA1 ,B2 only demands that A1 and B2 take the same value, not necessarily that they are the same variable. In a notation where we explicitly write
down the values of the variables, the Kronecker delta δA1 ,B2 has the effect
P (B 0 = kB | B1 = lB1 , B2 = lB2 ) → P (B 0 = kB | B1 = lB1 , B2 = lA1 ) .
(4.106)
But at this point, we can simply define
P (B 0 = kB | B1 = lB1 , A1 = lA1 ) := P (B 0 = kB | B1 = lB1 , B2 = lA1 ) .
(4.107)
0
0
This enables us to replace P (B | B1 , B2 ) by P (B | B1 , A1 ). The same
argument holds for the effect of the other Kronecker deltas. In the last step
we renamed the remaining variables A1 → λAB , B1 → λBC and C1 → λAC in
order to introduce our typical hidden ancestor notation. One could say that
we identified the remaining subvariable of each pair of correlated subvariables
as the common ancestor of the two involved observables.
Step 2: Modeling ancestors with P (λx = j) ∈ Q by ancestors with
P (λx = j) = 1/Kx
Considering the general case instead of only the triangular scenario (see
Appendix A.2), we have shown that we can generate all distributions of the
form
Y 1
X
,
P A1 , ..., An =
P A1 | { λx }x| 1 ...P An | { λx }x|An
A
x Kx
{ λx }x
(4.108)
where { λx }x denotes the set of all hidden ancestors and { λx }x| j the set
A
of all hidden ancestors of the observable Aj . The case of arbitrary ancestordistributions reads (see also (2.13))
P A1 , ..., An =
X
{ λx }x
P A1 | { λx }x|
A1
...P An | { λx }x|An
Y
P (λx ) .
x
(4.109)
Proposition 4.4. Any distribution of the form (4.109) with rational valued
ancestor-probabilities P (λx = j) ∈ Q (of finite alphabet size) can be modeled
by a distribution of the form (4.108).
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 106
Following all previous steps of the proof, the equivalence of these two models
immediately implies that all distributions of the form (4.109) with P (λx = j)
∈ Q satisfy inequality (4.22).
Proof. Assume that we have a distribution P (A1 , ..., An ) of the form (4.109)
with arbitrary, rational ancestors. We show how to model one ancestor at a
time by a uniform ancestor. The ancestor under consideration from the ‘rational model’ is denoted by λx and has alphabet size Kx . The corresponding
ancestor from the ‘uniform model’ is denoted by λ0x and has alphabet size
Kx0 . The probabilities P (λx = j) ∈ Q of the ancestor λx can be written as
P (λx = j) =
zj
,
Z
(4.110)
all with the common denominator Z. In order to model this ancestor by a
uniform λ0x , we choose for λ0x the alphabet size Kx0 = Z. Furthermore, for
all observables Ar that depend on the ancestor λx , we define the conditional
probabilities
P (Ar | ..., λ0x = j 0 , ...) := P (Ar | ..., λx = j, ...) ,
(4.111)
for exactly zj outcomes j 0 of the uniform ancestor λ0x . To be explicit, we
Pj
P
choose the outcomes j 0 = j−1
l=1 zl . We further define αj :=
l=1 zl + 1, ...,
Pj
0
l=1 zl (for j = 0, ..., Kx ; α0 = 0). Starting with the uniform ancestor λx ,
one obtains
X
Z
X
Y
P A1 | ..., λ0x = j 0 , ... ...P (An | ..., λ0x = j 0 , ...) P (λ0x = j 0 )
P (λy )
{ λy }y6=x j 0 =1
=
X
Kx
X
y6=x
αj
1 Y
P A1 | ..., λx = j, ... ...P (An | ..., λx = j, ...)
P (λy )
Z
X
{ λy }y6=x j=1 j 0 =αj−1 +1
=
X
Kx
X
y6=x
P A1 | ..., λx = j, ... ...P (An | ..., λx = j, ...)
{ λy }y6=x j=1
=
X
y6=x
P (λx =j)
P A1 | { λy }y|
A1
{ λy }y
zj Y
P (λy )
Z
|{z}
Y
...P An | { λy }y|An
P (λy )
(4.112)
y
In this calculation, the other ancestors λy , y 6= x can be either uniform or
rational. Thus, the procedure can be applied to one ancestor at a time,
allowing us to stepwise model a distribution with all ancestors rational by a
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 107
distribution with all ancestors uniform. In this way, any distribution of the
form (4.109) with rational ancestor-probabilities P (λx ) can be obtained. In
fact, this implies that the families of distributions defined by (4.108) and
(4.109) are the same. In both models the same distributions P (A1 , ..., An )
can be realized.
Step 3: From rational to arbitrary real P (λx )
In the first part of this step we show explicitly that an arbitrary real distribution of an ancestor λx can be obtained as a limit of rational valued
distributions. We then have to show that this limit procedure respects inequality (4.22).
Step 3a: Real distributions as limits of rational distributions
Proposition 4.5. Any distribution of the form (4.109) with real valued
ancestor-probabilities P (λx = j) ∈ R (of finite alphabet size) can be obtained as a limit of a sequence of such distributions with rational valued
ancestor-probabilities Pk (λx = j) ∈ Q.
Proof. As in the second step we can treat each ancestor λx separately. We
use the fact that any real number, in particular any irrational number, can
be written as the limit of a sequence of rational numbers. When generalizing
this concept to a whole probability distribution, we have to take into account
that probabilities have to sum to unity and be non-negative. Consider an
arbitrary λx with finite alphabet size Kx . Assume that N ≤ Kx of the
probabilities P (λx = j) are irrational and write them in decreasing order
1 > p(1) ≥ p(2) ≥ ... ≥ p(N ) > 0.
(4.113)
The possibly rational probabilities get the superscripts N + 1, ..., Kx in any
order. For simplicity, define the sum of these rational probabilities as
q := p(N +1) + ... + p(Kx ) ∈ Q.
(4.114)
Note that we can assume N ≥ 2, since N = 0 is trivial and N = 1 is
impossible. In the latter case one irrational number and many rational
numbers would have to sum to one, which is not possible.
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 108
(j)
Denote
by
n
o the sequence of rational numbers that converges to a single p
(j)
(j)
(j)
pk
. For j = N + 1, ..., Kx we can simply choose pk = p ∀k.
k∈N
To make sure that for each n the probabilities sum to one, we approach
p(1) , ..., p(N −1) from below and p(N ) from above. If the decimal expansion of
p(1) is
p(1) = 0.n1 n2 n3 ...,
(4.115)
then choose the sequence
n
(1)
pk
o
k
= {0.n1 , 0.n1 n2 , 0.n1 n2 n3 , ...} .
(1)
(1)
(1)
(4.116)
0 ≤ pk < 1. Thus,
By construction pk ∈ Q, pk ≤ p(1) and nin particular
o
(j)
(1)
for j = 2, ..., N − 1 can be
pk is a valid probability. The sequences pk
k
(N )
defined analogously. Finally, for p
define
(N )
pk
=1−q−
N
−1
X
(j)
(4.117)
pk .
j=1
(j)
(N )
(j)
Since q ∈ Q and each pk ∈ Q we also have pk ∈ Q. Since pk ≤ p(j) for
P −1 (j)
(N )
= p(N ) > 0. Also,
j = 1, ..., N − 1 we have pk ≥ 1 − q − N
j=1 p
(N )
by construction pk ≤ 1. In particular, the total distribution satisfies
PN
(j)
(j)
q + j=1 pk = 1. Hence, the pk form a valid, rational valued probability distribution with the desired convergence to an arbitrary, initially fixed
distribution with potentially irrational probabilities. To be on the safe side,
since the sequence of the whole distribution consists of Kx (or effectively
N ) single sequences, the alphabet size Kx should be finite. Successively
performing this procedure for all ancestors, we obtain the full set of distributions that can be written according to (4.109). At last, this the whole
family of distributions compatible with the given hidden common ancestor
model.
Step 3b: Limit respects the inequality
The transformation in Step 1 was guaranteed to respect the inequality due to
Proposition 4.6. In Step 2 we simply showed the equivalence of two models of
distributions, implying that the target-model inherits the compatibility with
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 109
the inequality from the starting-model. Here, we have to take additional
care since the employed limit procedure is covered by neither of the two
previous explanations. Fortunately, this can be done with a rather general
and simple statement about limits of continuous functions. The details of
the construction of the sequences from Step 3a are not relevant.
Lemma 4.7. The limit procedure from Proposition 4.5 respects inequality
1
n
(4.22), X A :...:A ≥ 0.
Proof. It is again sufficient to consider the limit procedure separately for
all ancestors. For a single ancestor, we have to show that the matrix
1
n
1
n
X A :...:A ≡ X A :...:A [P (λx )] (first defined in (4.8)) corresponding to the
limit distribution P (λx ) is positive semidefinite. We assume (or rather
know from the previous steps of the proof) that this is true for the ma1
n
1
n
trices XkA :...:A ≡ X A :...:A [Pk (λx )] corresponding to each element of the
sequence of rational valued distributions Pk (λx ). To prove the statement for
the limit distribution we employ the definition of positive semidefiniteness,
1
n
i.e. we show v † X A :...:A v ≥ 0 for all complex valued v of suitable dimension. Each element of an X-matrix is of the form δll0 PAj (l) − PAj (l) PAj (l0 )
or PAj ,Aj0 (l, l0 ) − PAj (l) PAj0 (l0 ). Each single probability is of the form
(4.109) with additional marginalization over all but one or two observables.
This means that each single probability and hence each matrix element
is a polynomial of the ancestor-probabilities P(k) (λx = j). The expectaA1 :...:An
tion value v † X(k)
v is a linear combination of the matrix elements and
thus also a polynomial of the ancestor-probabilities. This means in particA1 :...:An
ular that v † X(k)
v is a continuous function of the ancestor-probabilities
P(k) (λx = j). From this continuity (and P (λx ) = limk→∞ Pk (λx )) it follows
that
1
n
1
n
v † X A :...:A v = lim v † XkA :...:A v.
(4.118)
k→∞
1
n
But this implies that the global bound for v † XkA :...:A v ≥ 0 (global mean1
n
ing for all k) is also true for v † X A :...:A v. The negation of this statement,
1
n
v † X A :...:A v < 0, could easily be shown to contradict the definition of continuity and/or convergence.
In combination with all the previous steps of the proof, we have shown that
1
n
X A :...:A ≥ 0 for all X-matrices arising from distributions given by (4.109).
This concludes the proof of Theorem 4.1.
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 110
4.4.5
Brief summary of the proof
Since the proof of Theorem 4.1 consisted of a lot of individual steps, we
briefly sum up the line of arguments and give an overview of all involved
propositions and major lemmata. Theorem 4.1 states that all distributions
compatible with a given hidden common ancestor model with n observables
connected by ancestors of degree up to m satisfy inequality (4.22),

XA
1 :...:An
(m − 1) M A

2
1
 M A :A


..
=
.


..

.

n
1
M A :A
1
1
M A :A
2
MA
2
0
..
.
0
···
0
...
..
.
···
1
· · · M A :A
···
0
..
...
.
..
.
0
n
0
MA
1
n










≥ 0.
n
Proposition 4.1 The equivalence of the inequalities X A :...:A ≥ 0 and
n √
√
X
1
i
i
i
1
A1 :...:An
Y
=
M A1 M A :A M A M A :A M A1 ≤ (m − 1) 1A1
i=2
is shown. This allows us to freely choose the more convenient representation in any of the following steps.
Proposition 4.2 (A.1) A specific family of distributions is shown to satisfy the inequality. The ancestors λx are modeled by a collection of
perfectly correlated subvariables, one for each observable connected
by the ancestor. The final observables are defined as the joint of all
their subvariables. In case of the triangular scenario this family of
distribution reads
P (A, B, C) = P (A1 , A2 , B1 , B2 , C1 , C2 )
1
δA B δB C δC A .
=
KAB KAC KBC 1 2 1 2 1 2
Proposition 4.3 (A.2) Starting with the above family of distributions (for
a general hidden common ancestor model) all distributions of the form
P A1 , ..., An =
X
{ λx }x
P A1 | { λx }x|
A1
...P An | { λx }x|An
Y
x
1
Kx
can be reached by local transformations. The terms 1/Kx correspond
to uniform ancestor-probabilities P (λx = j) = 1/Kx .
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 111
Lemma 4.6 The local transformations performed in the previous proposition respect the inequality. Thus, all distributions of the above form
satisfy the inequality.
Proposition 4.4 The families of distributions
P A1 , ..., An =
X
P A1 | { λx }x|
{ λx }x
A1
...P An | { λx }x|An
Y
P (λx ) ,
x
once with rational ancestor-probabilities P (λx = j) ∈ Q and once with
uniform ancestor-probabilities P (λx = j) = 1/Kx coincide. Thus, all
distributions of the above form with rational ancestor-probabilities
satisfy the inequality.
Proposition 4.5 Any distribution of the form
P A1 , ..., An =
X
{ λx }x
P A1 | { λx }x|
A1
...P An | { λx }x|An
Y
P (λx )
x
with arbitrary real ancestor-probabilities P (λx = j) ∈ R can be obtained as the limit of a sequence of such distributions with rational
ancestor-probabilities. In this way we obtain the whole family of distributions compatible with the corresponding hidden common ancestor
model.
Lemma 4.7 The limit procedure from the previous proposition respects the
inequality. Thus, all distributions compatible with the corresponding
hidden common ancestor model satisfy the inequality. This is exactly
the statement of Theorem 4.1.
Note that while not explicitly necessary for all steps of the proof, the alphabets of the observables as well as of the ancestors should be discrete
and finite. Concerning the observables, already the M -matrices as the basic
constituent parts of the inequality require finite and in particular discrete
alphabets. Concerning the ancestors, already the subvariables in the family
of distributions from Proposition 4.2 were always treated as discrete and
finite. In Proposition 4.3 the ancestors inherited the alphabet sizes of these
subvariables. In Proposition 4.4 we exploited that all the rational probabilities P (λx = j) of one ancestor can be expressed as fractions with the same
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 112
(finite) denominator. This implicitly requires a finite number of probabilities P (λx = j) as well. In Proposition 4.5 the sequence of rational valued
distributions Pk (λx ) → P (λx ) consists of one sequence per outcome λx = j.
All these sequences are taken simultaneously. To be on the safe side, the
number of sequences, and thus the alphabet size, should be finite.
4.5
Comparison between matrix and entropic inequality
The equivalent inequalities X A
Y
A1 :...:An
=
n √
X
M A1 M A
1 :...:An
1 :Ai
i
≥ 0 and
i
M A M A :A
1
√
M A1 ≤ (m − 1) 1A1 (4.119)
i=2
(see inequalities (4.22) and (4.32)), based on generalized covariance matrices,
have been derived as an analog to the entropic inequality
n
X
I A1 ; Ai ≤ (m − 1)H A1
(4.120)
i=2
(originally (4.1)). One major purpose to derive the matrix inequality was
that in this framework the inequality might be stronger than the corresponding entropic inequality (meaning that the set of distributions compatible
with the former is smaller than the set of distributions compatible with the
latter). A reasonable motivation for this assumption was the fact that already the elementary inequalities (see Section 3.1) constraining the entropies
of any set of random variables entail only an outer approximation. On the
1
n
other hand, the starting point of the derivation of X A :...:A ≥ 0 was rather
arbitrary and primarily motivated by the analogy to the known entropic
inequality. It is therefore difficult to estimate the degree of approximation
inherent to the matrix inequality.
In this section we compare the strengths of the entropic and the matrix
inequality by considering some exemplary families of distributions. For a
simple family (but flexible with respect to the number of observables and
their alphabet size) the comparison can be done analytically (or with precise
numerics). The elaboration of this case is presented in Subsection 4.5.1.
More general distributions for which the comparison is performed via Monte
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 113
Carlo simulations are considered in Subsection 4.5.2. Finally, in Subsection
4.5.3, we simulate hypothesis tests based on the matrix inequality. The
results are compared to the analogous entropic tests examined exhaustively
in Section 3.3.
4.5.1
Analytical investigation
We consider the mixture of the perfectly correlated and the uniform distribution,
1
δA1 ,...,An
+ (1 − v) n .
(4.121)
P A1 , ..., An = v
K
K
The parameter v can be referred to as the visibility of the correlated distribution. In that context, the uniform distribution can be considered as
noise. For v = 0 the distribution will be compatible with any DAG and
satisfy any inequality. For v = 1 the distribution will be incompatible with
any (non-trivial) DAG and violate any (non-trivial) inequality. Concerning
the attribute ‘non-trivial’, recall that a DAG where all observables share one
common ancestor puts no constraint on the compatible distributions. For
the comparison between the two inequalities we are interested in the critical value vc for which compatibility with a given inequality changes. The
smaller the value vc the less distributions satisfy the inequality. This means
that the inequality with the smaller vc imposes a stronger constraint on the
set of compatible distributions and thus also leads to the closer approximation to the true set of distributions compatible with a corresponding DAG.
A graphical illustration is provided by Figure 20.
In order to simply compare the matrix and the entropic inequality we do not
need to choose a specific DAG that is constrained by the inequalities. If we
nevertheless wanted to do so, any pair of observables should at least share
one ancestor. Otherwise the DAG would demand two observables without
a common ancestor do be independent (see Subsection 2.2.4). The distributions (4.121) with v > 0 would clearly violate this constraint. Furthermore,
we should assume that one of the ancestors of A1 indeed has degree m. Otherwise, we could replace m by m0 , the maximal degree of A1 ’s ancestors. See
also the annotations below Theorem 4.1.
Note that we do not consider the family of ‘flip distributions’ (3.32) from
Chapter 3, because there exists no straightforward generalization to larger
alphabets.
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 114
DAG description
pure
noise
perfect
correlation
vcDAG
v=0
vcineq
v=1
inequality description
Figure 20: Set relation of distributions compatible with a DAG and distributions compatible with a corresponding inequality description. The distribution (4.121) with v = 1 is incompatible with both descriptions. When
decreasing v, at the critical value vcineq compatibility with the inequality is
established. At some unknown value vcDAG ≤ vcineq the distribution starts
to be compatible with the DAG as well. The smaller vcineq the better is the
inequality description.
Matrix description
We start by considering the matrix inequality Y A :...:A ≤ (m − 1) 1A1 (which
1
n
turns out to be simpler than the equivalent inequality X A :...:A ≥ 0). Due
j
1
j
to the symmetry of the distributions, the matrices M A and M A :A do not
depend on j. The marginal distributions read
1
n
δkl
1
+ (1 − v) 2 ,
K
K
1
PAj (k) =
.
K
PA1 ,Aj (k, l) = v
(4.122)
(4.123)
For the M -matrices one obtains
j
A
Mk,l
= δkl PAj (k) − PAj (k) PAj (l)
1
1
= δkl − 2 ,
K K
(4.124)
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 115
and
1
A :A
Mk,l
j
= PA1 ,Aj (k, l) − PA1 (k) PAj (l)
1
1
1
= vδkl + (1 − v) 2 − 2
K
K
K
1
1
= vδkl − v 2 .
K
K
We see in particular that M A
inequality (4.32) reads
1 :Aj
j
= vM A . By further defining M := M A
√ M vM M vM M
≤
(m − 1) 1
√ −1
√ −1
⇔ v 2 (n − 1) M M M −1 M M
⇔
v 2 (n − 1) 1
⇔
v 2 (n − 1)
≤
≤
≤
(m − 1) 1
(m − 1) 1
(m − 1)
n √
X
(4.125)
j
j=2
s
⇔
v
≤
m−1
.
n−1
(4.126)
From the first to the second line we used that M has full rank which implies
that the pseudoinverse is the ordinary inverse. From the second to the third
√ −1 √ −1
line we used M M −1 = 1 and M M M = 1. According to inequality
(4.126) the critical value in the matrix framework is
s
vcmat =
m−1
.
n−1
(4.127)
Note that vcmat is independent of the alphabet size K.
Entropic description
In order to evaluate inequality (4.1) note that, similar to the matrix framework, the mutual information I (A1 ; Aj ) and the entropies H1 := H (Aj )
and H2 := H (A1 , Aj ) are independent of j. Further using I (A1 ; Aj ) =
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 116
H (A1 ) + H (Aj ) − H (A1 , Aj ), inequality (4.1) can be written as
n
X
I A1 ; Ai
≤ (m − 1)H A1
i=2
⇔ (n − 1) (2H1 − H2 ) ≤ (m − 1) H1
m−1
H2
⇔
2−
≤
(4.128)
H1
n−1
Note that we have H1 6= 0 (see below) and n − 1 6= 0 since a hidden common
ancestor model with only one observable is rather boring. Also note that
the inequality does not depend on the single values n and m but only on
. Using (4.122) and (4.123), the mono- and bi-partite entropies
the ratio m−1
n−1
can be expressed as
K
X
1
1
log
K
k=1 K
= log K
H1 = −
(4.129)
and
1−v
v
1−v
1−v
1−v
v
2
H2 = −K
+
+
log
.
log
−
K
−
K
2
2
2
K
K
K
K
K
K2
(4.130)
The first term in H2 corresponds to the K ‘diagonal probabilities’ (k = l
in the expression (4.122)) while the second term corresponds to the remaining K 2 − K probabilities. Even though the expression of H2 seems to be
rather complicated, it can be used in this form to numerically solve inequality (4.128) for arbitrary n, m and K. By doing so, we can obtain the
corresponding value vcent .
In the limit for K → ∞, vcent can be calculated analytically. To this end, we
approximate the two terms of H2 as
v
1−v
v
1−v
−K
log
+
+
2
K
K
K
K2
1−v
1−v
= − v+
− log K + log v +
K
K
1−v
1−v
1−v
= v log K −
log K − v +
log v +
K
K }
| K {z
} |
{z
→0
K1
→
K1
v log K − v log v
→ v log v
K1
(4.131)
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 117
and
1−v
1−v
log
2
K
K2
1
= − 1−
(1 − v) [−2 log K + log (1 − v)]
K
→ 2 (1 − v) log K − (1 − v) log (1 − v) .
− K2 − K
K1
(4.132)
By combining (4.131) and (4.132) we obtain
H2
→
v log K − v log v + 2 (1 − v) log K − (1 − v) log (1 − v)
=
(2 − v) log K + h (v) ,
K1
(4.133)
where we identified −v log v − (1 − v) log (1 − v) as the binary entropy function h (v) (i.e. the entropy of a binary variable with probabilities v and
1 − v). The binary entropy satisfies 0 ≤ h (v) ≤ log 2 (recall that any entropy is lower bounded by zero and upper bounded by log K where K is the
alphabet size). Inserting this limit of H2 and H1 = log K into inequality
(4.128), one can solve the inequality according to
H2
H1
(2 − v) log K + h (v)
⇔ 2−
log K
h (v)
⇔
2 − (2 − v) −
log K
2−
→
v
K1
≤
≤
≤
≤
m−1
n−1
m−1
n−1
m−1
n−1
m−1
.
n−1
(4.134)
The critical value for the entropic inequality and K → ∞ is thus
vcent →
K→∞
m−1
.
n−1
(4.135)
Comparison vcent vs vcmat
First, consider
the case of large alphabets K → ∞. In this case we have
q
m−1
mat
vc = n−1 opposed to vcent = m−1
. Recall that we can restrict to the case
n−1
m < n. Equality would mean that there exists one ancestor common to all
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 118
observables. Any distribution could be realized by this scenario and both
<1
inequalities would be trivially satisfied for any v. For m < n we have m−1
n−1
q
and thus vcent = m−1
< m−1
= vcmat meaning that the entropic inequality
n−1
n−1
is stronger than the matrix inequality in the limit of large alphabets. The
smaller the ratio m−1
, the more significant is the advantage of the entropic
n−1
inequality (in terms of the ratio vcmat/vcent ).
For smaller alphabets we start by considering the triangular scenario, i.e. n =
3 and m = 2. Independentlyqof the alphabet size K, the critical value of the
matrix inequality is vcmat = 1/2 ≈ 0.707. By numerically solving inequality
(4.128) with H1 = log K and H2 from (4.130), we obtain the K-dependent
value vcent :
K
vcent ≈
2
0.780
3
0.761
10
0.709
11
0.705
100
0.637
10100
0.503
For K ≤ 10 we find vcmat < vcent , meaning that the matrix inequality is
stronger than the entropic inequality in the regime of small alphabets. In the
case of binary variables that was considered for the entropic hypothesis tests
in Chapter 3, the advantage of the matrix inequality is rather significant.
For K > 10 the advantage changes in favour of the entropic inequality. The
value of vcent for K = 10100 shall illustrate that the convergence vcent → 12
K→∞
is rather slow.
It seems to be generally true that for binary variables the matrix inequality
is always stronger than the entropic inequality. While we present no general
proof of this statement, Figure 21 strongly supports the claim. The same
observation can be made for K = 3. For K ≥ 4 on the other hand, we are
able to find scenarios with vcent < vcmat .
Finally, we investigate the critical alphabet size Kc at which the entropic
inequality becomes stronger than the matrix inequality. To be precise, Kc is
defined as the alphabet size satisfying vcmat ≤ vcent for K ≤ Kc and vcent < vcmat
. Starting at
for K > Kc . Figure 22 shows Kc as a function of the ratio m−1
n−1
m−1
m−1
m−1
Kc = 3 for n−1 → 0, Kc is an increasing function of n−1 . For n−1 → 1, Kc
diverges to extremely large values.
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 119
K=2
vc
1
1
2
1
4
1
8
1
16
1
32
vcent
vcmat
1
1000
1
100
1
10
1
m-1
n-1
Figure 21: Comparison of the critical values vcent and vcmat as a function of
m−1
k
in the binary case. We considered values m−1
= 1000
for k = 1, ..., 1000.
n−1
n−1
m−1
For n−1 = 1 both critical values coincide at vc = 1. For m−1
< 1 we
n−1
ent
mat
uniformly obtain vc < vc . Note that the double-logarithmic plot uses
the base-10-logarithm for the x-axis but the base-2-logarithm for the y-axis.
triangular
scenario
Figure 22: Critical alphabet Kc above which the entropic inequality is
stronger than the matrix inequality as a function of m−1
. We considered
n−1
m−1
k
m−1
values n−1 = 100 for k = 1, ..., 99. For small n−1 the matrix inequality is
stronger only for alphabets as small as K ≤ 3. For large m−1
the advantage
n−1
of the matrix inequality extends to extremely large alphabets.
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 120
for the
At this point it is instructive to discuss the meaning of the ratio m−1
n−1
DAG. A small value corresponds to a DAG with a large number of observcorresponds to a
ables that are only weakly connected. A large ratio m−1
n−1
DAG with strongly connected observables. Here, weak and strong connectivity are to be understood in terms of the (relative) number of observables
connected by a single ancestor. The number of ancestors plays no role.
In that sense, the entropic inequality tends to be stronger for weakly connected graphs while the matrix inequality tends to be stronger for strongly
connected graphs.
We can summarize the results as follows:
• For K = 2, 3 the matrix inequality is always stronger than the entropic
inequality (by observation).
• For K → ∞ the entropic inequality is always stronger than the matrix
inequality (by analytical proof).
the advantage of the matrix inequality extends to
• For increasing m−1
n−1
larger alphabets (by observation).
Note that we arrived at these results by considering only one specific family
of distributions. The statements are not necessarily true in the general case
and should thus be understood as tendencies rather than as strict rules.
4.5.2
Numerical simulations
In this subsection we compare the entropic and the matrix inequality for
three different families of random distributions. Opposed to the precise
calculations from the previous subsection, Monte Carlos simulations are required in this case. Due to the thereby accompanied computational burden,
we have to restrict the simulations to rather small alphabets and DAGs. In
fact, we consider only the triangular scenario. In this case, according to
the previous subsection, we expect the matrix inequality to be stronger for
K . 10. This statement, while not being universally true, can be confirmed
on a qualitative level.
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 121
Model 1: Random from DAG + correlation
We construct distributions PDAG according to the formula
PDAG
=
X
P (A | λAB , λAC ) P (B | λAB , λBC ) P (C | λAC , λBC )
λAB ,λAC ,λAB
·P (λAB ) P (λAC ) P (λBC ) .
(4.136)
Any distribution of this form is compatible with the triangular scenario
and thus with both inequalities. To construct distributions that violate the
inequalities, we mix PDAG with the perfectly correlated distribution,
Pfinal = vPcorr + (1 − v) PDAG , where Pcorr =
δA,B,C
.
K
(4.137)
For v = 1 (v = 0) both inequalities will be violated (satisfied). The common alphabet size of all observables is K. For all the pairwise ancestors
we choose the alphabet size K 2 . Due to this, we have K 5 conditional probabilities (e.g. P (A | λAB , λAC )) for each observable. We assume that the
observables are deterministic functions of the ancestors such that each conditional probability is either 0 or 1. To make sure that, for example, for
fixed λAB and λAC exactly one output of A has probability 1, we generate a
K 2 × K 2 random matrix with entries k = 0, ..., K − 1. The matrix elements
specify to which output of A a given combination of λAB , λAC is mapped.
The elements are drawn from a binomial distribution Bin (K − 1, p),
!
K −1 k
pBin(K−1,p) (k) =
p (1 − p)K−1−k for k = 0, ..., K − 1, p ∈ [0, 1] .
k
(4.138)
The parameter p is uniformly distributed in [0, 1], fixed for one matrix but
different for different matrices (there is one matrix for each observable). To
generate the marginals of the ancestors, each of the K 2 probabilities P (λx )
(per ancestor) is drawn uniformly from [0, 1] with subsequent normalization.
For K = 2, 3, 4 we generate 1000 distributions PDAG and consider mixing
k
, k = 1, ..., 19. Figure 23 shows the number of states
parameters v = 20
violating the respective inequalities as a function of v.
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 122
K=2
K=4
K=3
entropic
probabilistic
matrix
covariances
Figure 23: Numbers of distributions (out of 1000), generated according to
Model 1 violating the entropic and the matrix inequality, respectively. In
1
n
addition to the matrix inequality X A :...:A ≥ 0 we also considered the special case of the covariance inequality (4.43) with alphabet {0, ..., K − 1} for
all observables. For K = 2 the results for the covariance inequality and the
general matrix inequality coincide. For K = 3, 4 the matrix inequality is
significantly stronger than the covariance inequality. See also the comment
concerning this gap in the last paragraph of Subsection 4.3.4. The matrix inequality is typically also stronger than the entropic inequality, the difference
being more significant for larger K.
We observe that in most cases the matrix inequality is stronger than the
entropic inequality. This is in accordance with the intuition that we gained
in Subsection 4.5.1. There, we found that for the triangular scenario and
a simple family of distributions, the matrix inequality is stronger than the
entropic inequality for all K ≤ 10. Contrary to the intuition that the difference between the inequalities should diminish for increasing K, in Figure
23 the advantage of the matrix inequality is even larger for larger K. Since
the required time for the simulations increases rapidly with K (≈ 7 seconds
for K = 2, ≈ 80 seconds for K = 3 and ≈ 20 minutes for K = 4) we
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 123
can unfortunately not consider much larger alphabets. Furthermore, the
impression from Subsection 4.5.1, that for K = 2 the matrix inequality is
always stronger than the entropic inequality, turns out not to be universally
true. For K = 2 and small v we found a slight advantage of the entropic
inequality. On a qualitative level, however, we can confirm that for the
triangular scenario with small alphabets the matrix inequality is typically
(significantly) stronger than the entropic inequality.
Model 2: Simple random distributions
We generate distributions P (A, B, C) by drawing each of the K 3 probabilities independently from a simple probability distribution with subsequent
normalization. Compared to the previous model this construction is considerably less expensive in terms of computation time. We are thus able to
explore larger alphabet sizes, say up to K = 15. If we draw all K 3 probabilities P (A, B, C) from a uniform distribution the correlation between the
variables A, B and C can be expected to be small. Indeed, we observe no
violations of any inequality in this case (for K = 2, ..., 15). We desire a
distribution P (A, B, C) where only ‘few’ probabilities are significantly different from zero. Taking a look at the perfectly correlated distribution, one
might expect that ≈ K non-zero probabilities lead to the strongest correlaK2
non-zero
tions. Since we do not want perfect correlations, we aim for log
K
probabilities. This choice is rather arbitrary and is justified retrospectively
by the sound results. To generate distributions of the desired type, we
first decide for each of the K 3 probabilities whether it should be zero or
not. This is done by drawing a Bernoulli variable with success probability
1
. The expected number of successes, and thus of non-zero probabiliK log K
3
2
K
ties, is K K
= log
. Next, each probability chosen to be non-zero is drawn
log K
K
uniformly from the interval [0, 1]. Since this is procedure is rather an educated guess than a bullet-proof construction, there is no guarantee that the
‘degree of correlation’ will be the same for all considered alphabet sizes. We
generate 10 000 distributions for K = 2, ..., 15 and for each K compute the
numbers of distributions violating the two inequalities.
We observe (Figure 24) that for K ≥ 3 the matrix inequality is considerably more powerful than the entropic inequality. The reason for the small
number of violations for K = 2 is most likely that the construction of the
distributions does not lead to sufficient correlations in this case. Based on
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 124
violations
7000
6000
5000
◼
4000
3000
◼
2000


1000
0 ◼
2 4
◼
◼ ◼ ◼ ◼ ◼
◼
◼
◼
◼
◼
    
 
  

6
8
10 12 14

entropic
◼
matrix
K
Figure 24: Numbers of distributions (out of 10 000), generated according
to Model 2, violating the entropic and the matrix inequality, respectively.
For K ≥ 3 the matrix inequality is significantly stronger than the entropic
inequality.
the results from Subsection 4.5.1, the advantage of the matrix inequality
could be expected for K ≤ 10. Figure 24 shows that for other families of
distributions this advantage extends to larger alphabets. The gap between
the inequalities (for K ≥ 3) is surprisingly large.
Model 3: Draw marginal, construct joint distribution
For the construction of distributions according to this model see also Subsection 3.2.5, the paragraph ‘Mutual information for small alphabets’ where
we compared different techniques for estimating mutual information.
We begin by drawing the marginal distribution P (A). Each of the K probabilities is drawn independently from a beta distribution (see (3.27)) with
parameters α = 0.1, β = 1. To ensure that the probabilities sum to one we
normalize at the end. Then, to construct the bi-partite marginal P (A, B)
we set B = A with probability x and B uniform with probability 1 − x,
1−x
.
PA,B (k, l) = PA (k) xδkl +
K
(4.139)
For x = 1, A and B are maximally correlated; for x = 0 they are independent. The bi-partite marginal P (A, C) is defined in exactly the same way.
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 125
Since the inequalities should only be violated for rather strong correlations
we consider the values x = 0.7 and x = 0.9. In both cases we draw 10 000
distributions for each alphabet size K = 2, ..., 15. Figure 25 shows the numbers of distributions violating the two inequalities as functions of K. Note
that it turned out to be difficult to find values of α, β and x (or even define
them as functions of K) such that we obtain similar numbers of violations
for all K.
x=0.9
x=0.7
entropic
matrix
Figure 25: Numbers of distributions (out of 10 000), generated according to
Model 3, violating the entropic and the matrix inequality, respectively. For
x = 0.7 the matrix inequality is stronger than the entropic inequality. For
x = 0.9 we obtain the opposite result, even though with smaller magnitude.
For x = 0.7 we find the familiar picture that the matrix inequality is stronger
than the entropic inequality. For x = 0.9, on the other hand, it is exactly
the other way around. Thus, we found a clear example proving that for
the triangular scenario with small alphabets the matrix inequality is not
always stronger than the entropic inequality. Even in this case, however,
the advantage of the entropic inequality is rather weak compared to the
typical advantage of the matrix inequality (see the x = 0.7 case and Figures
23 and 24).
4.5.3
Hypothesis tests
With regard to the statistical emphasis of this thesis, the presumably most
important comparison of the matrix framework and the entropic framework
is in terms of the hypothesis tests the respective inequalities give rise to.
We aim to perform the same simulations that were conducted in Section
4 TESTS BASED ON GENERALIZED COVARIANCE MATRICES 126
3.3 for the entropic inequality (3.2). As the general setting, we consider
the triangular scenario with binary observables and samples of size 50. The
employed representation of the matrix inequality is the original inequality
(4.22), which for the triangular scenario reads
M A M A:B M A:C
0 
=
M B:A M B
 ≥ 0.
C:A
C
M
0
M

X A:B:C

(4.140)
Recall, that this inequality is the analog of the entropic inequality (3.2)
based on the generalized covariance matrices (the M -matrices). In particular, both inequalities are based on bi-partite information alone. The entropic
inequalities (3.3) and (3.4), on the other hand, require access to the full distribution P (A, B, C). In the following, we refer to inequality (4.140) as the
‘matrix inequality’ and to inequality (3.2) ((3.3), (3.4)) as the ‘first (second,
third) entropic inequality’
To estimate X A:B:C from a data sample, we simply plug in the empirical distributions P̂ (A, B), P̂ (A, C), P̂ (A) , ... into the definitions of the
M -matrices from (4.3) and (4.6). In order to test whether or not the matrix
is positive semidefinite we calculate its minimal eigenvalue,
n
Tmat := min eigenvalues X A:B:C
o
.
(4.141)
If Tmat < 0, the matrix X A:B:C is not positive semidefinite. If Tmat ≥ 0,
then also X A:B:C ≥ 0. Since in the binary case each M -matrix is a 2 × 2
matrix, X A:B:C has a total number of six eigenvalues. It turns out that
three eigenvalues are always zero (in general one eigenvalue per observable).
These eigenvalues will be ignored. Otherwise, the statistic Tmat would be
upper bounded by zero, causing problems in the bootstrap simulations.
The samples are drawn from the family of 'flip distributions' introduced in (3.32). Starting with A, B, C perfectly correlated, each variable is independently flipped with probability pflip. In Figure 26, Tmat is shown as a function of pflip. For pflip < pflip^(mat.) = 0.0796 we obtain Tmat(pflip) < 0, meaning that the matrix inequality is violated in this regime of flip probabilities. Recall that for the entropic inequalities we found the critical flip probabilities pflip^(1.ent) = 0.0584, pflip^(2.ent) = 0.0750 and pflip^(3.ent) = 0.0797 (see Figure 16). A larger value indicates a stronger inequality. Thus, the matrix inequality should be significantly stronger than the first entropic inequality, its direct
Figure 26: Minimal-eigenvalue statistic Tmat = min{eigenvalues of X^{A:B:C}} for the family of 'flip distributions' (3.32) as a function of pflip. The matrix inequality is satisfied for pflip ≥ 0.0796.
analog in the entropic framework. The strength of the matrix inequality
even seems to be comparable to the second and third entropic inequalities
which resort to tri-partite information.
Recall that we introduced two different approaches to hypothesis testing,
the direct and the indirect (or bootstrap) approach. For the complete introduction see Section 3.3. As in Section 3.3, we start by considering the direct
approach.
Direct approach
For the direct approach we require a threshold value tmat. If for a data estimate T̂mat we observe T̂mat < tmat, the null hypothesis h0: 'sample is compatible with the triangular scenario' is rejected. Note that for the first entropic inequality it was exactly the other way around, i.e. the null hypothesis was rejected for T̂ent > tent. The reason is that for distributions compatible with the triangular scenario, the statistic Tent was upper bounded by zero, while here the statistic Tmat is lower bounded by zero. The threshold has to be chosen such that the type-I-error rate is upper bounded by α = 0.05. This means that at most 5% of samples stemming from a compatible distribution are allowed to (falsely) fall into the rejection region T̂mat < tmat. The threshold value is thus defined as the 5% quantile (95% quantile in the entropic case) of the supposed worst case distribution (the distribution with the smallest 5%
quantile among all distributions compatible with the DAG). In the entropic
framework, reliably identifying the worst case distribution turned out to be
the main problem of the direct approach.
Figure 27: Supposed threshold value tmat, calculated as the 5% quantile of the statistic T̂mat for underlying observables A = B ∼ (qAB, 1 − qAB) and C ∼ (qC, 1 − qC). For this family of distributions, it seems that the most extreme threshold value, tmat ≈ −0.019044, is indeed obtained in the uniform case qAB = qC = 0.5. For each considered combination (qAB, qC), the distribution of estimates T̂mat was reconstructed by drawing 100 000 samples.
Here, we pursue the same approach that was considered in the entropic
case in Subsection 3.3.2. The rationale was that the worst case distribution
should satisfy Tent = 0 which should require A to be a deterministic function
of B (by choice), and C should be independent of A and B. While a natural
choice seemed to be A = B ∼ uniform and C ∼ uniform (see [16]), we generalized the approach to distributions A = B ∼ (qAB, 1 − qAB) and C ∼ (qC, 1 − qC) and found that the uniform case was not the worst case
(see Figure 10). In contrast to this inconvenient observation, Figure 27 raises
hope that in the matrix framework the optimal threshold value might indeed
be obtained by the uniform distribution qAB = qC = 0.5. However, we have
no proof.
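To give an impression of how such a threshold scan could be set up, here is a sketch for a single (qAB, qC) combination. It reuses the t_mat helper from the sketch above; the sample size N = 50 matches the simulations, while the number of repetitions and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def joint(u, v):
    """Empirical 2x2 probability table of two binary samples."""
    p = np.zeros((2, 2))
    np.add.at(p, (np.asarray(u, dtype=int), np.asarray(v, dtype=int)), 1)
    return p / len(u)

def sample_t_mat(q_ab, q_c, n=50):
    """One sample of size n with A = B ~ (q_ab, 1 - q_ab) and an independent
    C ~ (q_c, 1 - q_c); returns the estimate of T_mat (t_mat: earlier sketch)."""
    a = rng.random(n) < q_ab
    c = rng.random(n) < q_c
    return t_mat(joint(a, a), joint(a, c))

# estimated 5% quantile of the T_mat distribution for the uniform case
threshold = np.quantile([sample_t_mat(0.5, 0.5) for _ in range(10_000)], 0.05)
```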
Employing the threshold value tmat = −0.019044, we simulate the hypothesis test for the family of ‘flip distributions’ (3.32). We are interested in
the power of the test which is defined as the ratio of correctly rejected samples (i.e. T̂mat < tmat ) stemming from DAG-incompatible distributions. For
values pflip < pflip^(mat.) = 0.0796 (or in fact pflip < pflip^(3.ent) = 0.0797) the true
distribution is known to be incompatible with the DAG. In this regime, the
ratio of rejected samples is indeed the power of the test. For larger values of
pflip , compatibility of the DAG and the distribution is not known. In Figure
28 we compare the ratio of rejected samples to the corresponding ratio of
the entropic test from Subsection 3.3.2.
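A sketch of how such a rejection-rate curve could be simulated is given below. It reuses t_mat and joint from the sketches above; the 'flip distribution' sampling follows the verbal description of (3.32), assuming the common starting value is uniform on {0, 1}, and the threshold is the value quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)
T_MAT_THRESHOLD = -0.019044        # threshold value quoted in the text

def flip_sample(p_flip, n=50):
    """A, B, C start perfectly correlated (common value assumed uniform on
    {0, 1}); each variable is then flipped independently with prob. p_flip."""
    base = rng.integers(0, 2, size=n)
    return [(base + (rng.random(n) < p_flip)) % 2 for _ in range(3)]

def rejection_rate(p_flip, runs=10_000):
    """Fraction of samples rejected by the direct matrix test."""
    rejected = 0
    for _ in range(runs):
        a, b, c = flip_sample(p_flip)
        # joint and t_mat as defined in the sketches above
        if t_mat(joint(a, b), joint(a, c)) < T_MAT_THRESHOLD:
            rejected += 1
    return rejected / runs
```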
Figure 28: Comparison of rejection rates of the direct hypothesis tests based on the entropic inequality (3.2) and the matrix inequality (4.140). Each data point is based on 10 000 samples drawn from the 'flip distribution' (3.32) with flip probability pflip. The vertical line marks the critical value pflip^(mat.) = 0.0796 below which the true distribution violates the matrix inequality and is thus incompatible with the triangular scenario. In this regime a large rejection rate (being the power of the test) is desired.
In accordance with pflip^(mat.) > pflip^(1.ent), the matrix test is significantly more
powerful than the entropic test. The general shapes of the curves are however
rather similar. A sharp step near the vertical line would have been preferable.
In Subsection 3.3.2, we identified the large variance of estimates T̂ent (or
now T̂mat ) as the reason for the rather flat curves. The similar shapes of the
curves thus suggest that from a statistical point of view Tmat is not easier to
estimate than Tent . The advantage of the matrix test is not due to simpler
statistics, but due to the general superiority of the matrix inequality (for
this family of distributions).
Indirect approach
For the indirect approach no worst case distribution is needed. Instead, the point estimate T̂mat is equipped with a confidence interval [T̂mat^min, T̂mat^0.95]. If we find T̂mat^0.95 < 0, the null hypothesis h0: 'sample is compatible with the inequality Tmat ≥ 0' is rejected. Recall that this hypothesis is weaker than
the null hypothesis from the direct approach, stating compatibility with the
DAG. To construct the confidence interval we resort to bootstrapping. From
the original empirical distribution, so-called bootstrap samples are drawn. The distribution of bootstrap estimates T̂mat* is used to estimate the upper endpoint of the confidence interval. To this end, we employ the advanced 'BCa bootstrap' technique that already yielded the most reliable results in the entropic case (Subsection 3.3.3). Regarding the results from the direct approach, or in general the critical flip probabilities pflip^(mat.) = 0.0796, pflip^(1.ent) = 0.0584, pflip^(2.ent) = 0.0750 and pflip^(3.ent) = 0.0797, the matrix bootstrap
test should be significantly more powerful than the bootstrap test using the
first entropic inequality. In fact, the matrix bootstrap test is expected to be
of comparable strength to the test based on the third entropic inequality.
Figure 29 confirms these expectations. The matrix bootstrap test is even
slightly more powerful than the third entropic bootstrap test and comes close
to the first entropic direct test. At pflip = 0.08 ≈ pflip^(mat.) the matrix bootstrap
test rejects 5.2% of all samples. This suggests that the test correctly works
at the 5% level. Since we do not know how trustworthy the threshold value
lying at the heart of the direct test is, one might actually prefer the matrix
bootstrap test over the first entropic direct test. The larger power compared
to the third entropic bootstrap test deserves attention as well. After all, the
third entropic inequality requires access to the full observable distribution
P (A, B, C), while the matrix inequality requires only bi-partite information.
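For illustration, a percentile-bootstrap sketch of the indirect test is given below. The thesis employs the more elaborate BCa correction, so this is a simplification; t_mat and joint refer to the helpers sketched earlier, and a, b, c are assumed to be NumPy arrays of binary outcomes.

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_upper_endpoint(a, b, c, n_boot=999, level=0.95):
    """Percentile-bootstrap estimate of the 95% upper endpoint of T_mat
    (the thesis applies the BCa correction on top of this resampling)."""
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample with replacement
        stats.append(t_mat(joint(a[idx], b[idx]), joint(a[idx], c[idx])))
    return np.quantile(stats, level)

def indirect_test(a, b, c):
    """Reject h0: 'Tmat >= 0' if even the upper endpoint lies below zero."""
    return bootstrap_upper_endpoint(a, b, c) < 0
```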
4.5.4 Summary
In this section we have seen that for small DAGs and alphabets the matrix inequality (4.22), X^{A^1:...:A^n} ≥ 0, is typically (but not always) stronger than the analogous entropic inequality (4.1). Concerning the simulated hypothesis tests for the triangular scenario, we made the following observations:
• The bootstrap hypothesis test based on the inequality X^{A^1:...:A^n} ≥ 0
Figure 29: Comparison of rejection rates of the matrix bootstrap test and several entropic tests. The matrix test is based on inequality (4.140). The entropic tests are based on inequalities (3.2) (1. ent.) and (3.4) (3. ent.). Each data point of the direct test is based on 10 000 samples drawn from the 'flip distribution' (3.32) with flip probability pflip. The bootstrap tests are based on 1000 samples and 999 bootstrap samples for each initial sample. The vertical line marks the critical flip probability pflip^(mat.) = 0.0796 below which the true distribution violates the matrix inequality. In this regime, large rejection rates are desired.
turned out to be more powerful than all considered bootstrap tests
in the entropic framework (see Figure 29). The matrix test, properly
controlling the type-I-error rate at 5%, is only slightly less powerful
than the direct test based on the entropic inequality (3.2). The latter
test does arguably not have the desired control of the type-I-error rate.
• The direct test based on the matrix inequality is significantly more
powerful than the analogous test in the entropic framework (see Figure
28). Furthermore, Figure 27 suggests that for the matrix inequality
the initially proposed worst case distribution (required for calculating
the threshold value needed in the direct approach) might in fact be
correct. This was not the case in the entropic framework. However,
we have no proof. In general, the problems of finding the worst case distribution (in particular for larger DAGs and alphabets) remain.
5 Application to the iris data set
In this chapter we pursue the last goal of this thesis, an application of
the developed methods to real data. This application serves primarily for
illustrative purposes. The data that we are considering have already been
studied intensively so that we will not contribute new results but rather try
to reconstruct existing knowledge.
5.1 The iris data set
A collection of freely available data sets can be found online at the machine learning repository of the University of California [47]. Here, we are
considering the iris data set, which is one of the simplest but also most famous data sets in the repository. The data set was introduced by Ronald
Fisher in his paper The use of multiple measurements in taxonomic problems [48] from 1936. A long list of papers citing the iris data set can be
found on the corresponding page of the machine learning repository. The
data set even has its own article in the free online encyclopedia Wikipedia
(https://en.wikipedia.org/wiki/Iris_flower_data_set).
Figure 30: Schematic representation of a blossom, 2: sepals, 3: petals. According to Meyers Konversationslexikon 1888, remade by Petr Dlouhý / Wikimedia, licensed under CC BY-SA 3.0. https://commons.wikimedia.org/wiki/File:Bluete-Schema.svg
The iris data set contains several size-attributes of the blossoms of iris flowers. Amongst others, an iris blossom consists of so-called petals and sepals.
The petals are the usually colorful leaves of the blossom when in bloom
while the sepals play a primary protective role for the buds or a supportive
role for the petals, see Figure 30. The data set lists the petal length, petal
width, sepal length and sepal width of N = 150 iris flowers. In fact, the
sample consists of three subsamples of size 50, each subsample corresponding to a different type of iris flower. In the fields of machine learning and
pattern recognition the goal is to predict the type of iris flower using the
four size-attributes [47]. Here, we pretend that we do not know about the
classification into the different types and propose a pairwise hidden common
ancestor model for the four variables. We then try to reject this model by
employing the full data set. In our model, the hidden ancestors could for
example stand for genetic or environmental factors. While a rejection of our
model would not necessarily imply the existence of the three different types
of iris flowers (which could be considered as a common ancestor of all four
attributes), it could at least be considered as a hint in this direction. In
that sense, we would partially reconstruct the already existing knowledge
of the three different types. While this application might not be the most
spectacular one, and a pairwise hidden common ancestor model might seem
artificial in this scenario, the data are well suited to illustrate our methods.
5.2 Discretizing the data
The first issue that we have to take care of is the continuity, or rather
the precision, of the data. The M -matrices from the generalized covariance
framework as well as our entropy estimation techniques require discrete variables. In the iris data set all four attributes are given in centimeters with
one additional position after the decimal point. This suggests a natural discretization in steps of 0.1cm but the associated alphabet sizes would be too
large for reliable estimation of the required quantities (in particular the bipartite entropies and bi-partite distributions required for the M -matrices).
More details about the marginal distributions of the four attributes are
shown in Figure 31. Using a discretization in steps of 0.1cm would lead to
alphabet sizes of the single variables ranging from K = 25 up to K = 60.
The joint distribution of the petal width and sepal width would have 625
possible outcomes. For the sample size N = 150 this would be dispropor-
Figure 31: Histograms of the marginal observation frequencies of the four attributes in the iris data set. The table below shows the minimal and maximal values of the four sizes. The corresponding alphabet sizes K are obtained by assuming a discretization in steps of 0.1 cm between the minimal and maximal values. The last column shows the number of actually occurring different values, which is typically close to the corresponding K.

                 min. val. [cm]   max. val. [cm]   corres. K   # of diff. vals.
petal length           1.0              6.9             60              43
petal width            0.1              2.5             25              22
sepal length           4.3              7.9             37              35
sepal width            2.0              4.4             25              23
tionately large. In fact, the entropy estimation results from Figure 6 for the
(extremely) data sparse regime suggest that the entropy estimation might
still work. The simulated hypothesis tests from Section 3.3 and Subsection
4.5.3 for N = 50 and K = 2, on the other hand, suggest that we should keep
the alphabets small.
We thus choose a discretization resulting in the alphabet size K = 3 for
all variables and define the thresholds separating the categories such that
all categories have roughly 50 counts. For the sepal length, for example,
all values ≤ 5.4cm are assigned to the first category, all values > 6.2cm
to the third, and all values in between to the second category. The three
categories then have 52, 47 and 51 counts. In light of the three different
types of iris flowers, all with subsample size 50, this discretization seems to
be natural. On the other hand, one might argue that the discretization is
specifically tailored to support knowledge that we pretend not to know at this
point. We did therefore also consider discretizations where the categories are
defined by equidistant steps between the minimal and maximal value of the
respective variable, aiming for alphabet sizes K = 3 as well as K = 6. In all
cases we obtained qualitatively the same results. Here we present only the
originally proposed discretization. Note that since results might in general
depend on the chosen discretization, our methods are best suited for data
with inherently well defined categories. Such data could for example arise
as the results of questionnaires with well defined choices for the answers,
or in medical data where one only asks for the presence of a disease or a
symptom.
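A sketch of the equal-count discretization described above (quantile-based binning into K = 3 categories) follows. The loading of the data and the column layout are assumptions of this sketch, not part of the thesis.

```python
import numpy as np

def discretize_equal_counts(values, k=3):
    """Map a 1-D array of measurements to k categories separated by empirical
    quantiles, so that each category receives roughly the same number of
    counts (cf. the thresholds 5.4 cm and 6.2 cm for the sepal length)."""
    edges = np.quantile(values, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(values, edges)        # integer categories 0 .. k-1

# Illustrative usage: `iris` is assumed to be a (150, 4) float array holding
# petal length, petal width, sepal length and sepal width in cm.
# discrete = np.column_stack([discretize_equal_counts(col) for col in iris.T])
```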
5.3 Proposing a model
For choosing an appropriate pairwise hidden common ancestor model, we
first check for independence relations between the variables. By default we
allow that any pair of variables shares a common ancestor. If we find that
two variables are independent, however, the faithfulness assumption from
Subsection 2.2.2 suggests that these variables should have no ancestor in
common. In order to decide whether or not two given variables, say A and
B, are independent we first estimate the mutual information Iˆ (A; B). Since
the mutual information is upper bounded by each of the marginal entropies
Ĥ (A) and Ĥ (B), we further calculate the ratio
$$
\hat{I}_r(A;B) := \frac{\hat{I}(A;B)}{\min\left\{\hat{H}(A), \hat{H}(B)\right\}}, \qquad (5.1)
$$
which we call the relative mutual information of A and B. Iˆr (A; B) is
bounded between zero and one. Since even a sample from a distribution
with actually independent variables might not exactly satisfy Iˆr (A; B) = 0,
we will consider any pair of variables for which Iˆr (A; B) ≤ 0.05 as independent. Being more careful, one might conduct hypothesis tests already at this
point. However, the main intention of this section is to see the hypothesis
tests of our inequality constraints in action, rather than to illuminate all
details of the model construction. In addition, as can be seen in (5.2), most
observed dependence relations are quite strong, indicating that hypothesis
tests would most likely reject hypothesized independence relations. Denoting the attributes as PL (petal length), PW (petal width), SL (sepal length)
and SW (sepal width), we find the following relative mutual information
values:
Îr(PL; PW) = 0.79,   Îr(PL; SL) = 0.46,   Îr(PW; SL) = 0.41,
Îr(PL; SW) = 0.18,   Îr(PW; SW) = 0.22,   Îr(SL; SW) = 0.09.   (5.2)
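For orientation, a sketch of how these quantities can be computed with plug-in entropy estimates is given below; the thesis uses the minimax estimator elsewhere, so the plug-in values here are only meant to illustrate the computation. `discrete` is the hypothetical array of category indices from the discretization sketch above.

```python
import numpy as np

def plugin_entropy(counts):
    """Plug-in (maximum likelihood) entropy in nats of a count table."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def relative_mutual_information(u, v, k=3):
    """Relative mutual information (5.1) of two samples of category
    indices (integers 0 .. k-1)."""
    joint_counts = np.zeros((k, k))
    np.add.at(joint_counts, (u, v), 1)
    h_u = plugin_entropy(joint_counts.sum(axis=1))
    h_v = plugin_entropy(joint_counts.sum(axis=0))
    mi = h_u + h_v - plugin_entropy(joint_counts)
    return mi / min(h_u, h_v)

# e.g. relative_mutual_information(discrete[:, 0], discrete[:, 1])  # PL vs PW
```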
In all cases the relative mutual information is larger than our predefined
threshold value of 0.05. Thus, we allow that all pairs of variables share a
common ancestor. A graphical depiction of the model is provided by Figure
32. At first glance, the model might seem to be quite artificial but in fact
a general hidden common ancestor model is arguably more reasonable than
a model with direct links between the observables. It would seem odd to
assume that, for example, the petal width had a direct causal influence on
the sepal length. Genetic and environmental common causes appear to be
more natural. The restriction to pairwise ancestors, on the other hand, might
not necessarily be one’s first choice. Then again, recall that this application
serves primarily for illustrative purposes.
The model implies neither unconditional nor any conditional independence
relations. The faithfulness assumption suggests that we should not find
any independence relations between the observables in the data. While we
Figure 32: Pairwise hidden common ancestor model for the four attributes petal length (PL), petal width (PW), sepal length (SL) and sepal width (SW) for the iris data set. The hidden variables are not further specified but could for example stand for genetic or environmental factors.
confirmed this for the unconditional independence relations (see (5.2)), we
refrain from doing so in the conditional case. The main reason is that the
estimation of the conditional mutual information I (A; B | C) (see (2.35)),
involving the tri-partite entropy H (A, B, C), will be less reliable than the
estimation of the unconditional mutual information involving at most bi-partite entropies. In particular, if we were to actually conduct hypothesis tests in order to find/reject conditional independence relations, with sample size N = 150 and tri-partite alphabet size K^3 = 27 the power of the tests
might be rather small. Note that the inequality constraints, that we test
in the next subsection, involve only bi-partite information as well. In that
sense, checking only (unconditional) independence relations that require at
most bi-partite information is consistent with our general approach.
5.4 Rejecting the proposed model
The model from Figure 32 is a hidden common ancestor model with n = 4
observables and ancestors of degree m = 2. In this case the general entropic
inequality (4.1) reads
$$
I(A^1; A^2) + I(A^1; A^3) + I(A^1; A^4) \leq H(A^1), \qquad (5.3)
$$
where {A1 , A2 , A3 , A4 } can be any permutation of the observables {PL, PW,
SL, SW}. Note that effectively there are only four different inequalities since
5.3 is invariant under permutations of A2 , A3 and A4 . Analogously, the
general matrix inequality (4.22) reads
$$
X^{A^1:A^2:A^3:A^4} =
\begin{pmatrix}
M^{A^1} & M^{A^1:A^2} & M^{A^1:A^3} & M^{A^1:A^4} \\
M^{A^2:A^1} & M^{A^2} & 0 & 0 \\
M^{A^3:A^1} & 0 & M^{A^3} & 0 \\
M^{A^4:A^1} & 0 & 0 & M^{A^4}
\end{pmatrix}
\;\geq\; 0, \qquad (5.4)
$$
with the M -matrices defined as in (4.3) and (4.6). In order for the data to be
compatible with the model, all four inequalities (in the chosen framework)
have to be satisfied. Thus, in general, four hypothesis tests have to be
conducted. If we want the composite test to have a significance level of 5%,
the single tests will generally have to aim for an even smaller type-I-error
rate. In the case of k independent hypothesis tests one can calculate the
required significance level α of the single tests via [49]
$$
\alpha = 1 - (1 - \bar{\alpha})^{1/k}, \qquad (5.5)
$$
where ᾱ is the desired level of the composite test. For ᾱ = 0.05 and k = 4
one obtains α ≈ 0.013. The relation (5.5), also known as the Šidák correction [49, 50], is one of many methods to control the type-I-error rate of a
composite hypothesis test. The general problem of simultaneous inference
[51] and the required multiple comparison procedures [52] become drastically
more complicated for dependent tests. In our case the tests are indeed not
independent. The mutual information I (PL; PW), for example, appears in
the test with A1 = PL as well as A1 = PW. In fact, any mutual information
appears in exactly two entropic tests. In the matrix framework an analogous
observation can be made for the bi-partite M -matrices.
We do not want to go into the details of multiple comparison procedures,
but rather consider a single test as we did for the simulations in Section
3.3 and Subsection 4.5.3. When trying to reject the model by employing a
single inequality it is reasonable to select for A1 the variable that shows the
strongest correlations with the other attributes. In the previous subsection
we already listed the (relative) mutual information of all pairs of variables
(see (5.2)). The largest values are obtained for the petal length (PL), closely
followed by the petal width (PW). The sepal width (SW) shows by far the
smallest correlations with the other variables. By choosing A1 = PL we
obtain the real data estimates
$$
\hat{T}_{\mathrm{ent}} = \hat{I}(\mathrm{PL};\mathrm{PW}) + \hat{I}(\mathrm{PL};\mathrm{SL}) + \hat{I}(\mathrm{PL};\mathrm{SW}) - \hat{H}(\mathrm{PL}) = 0.47 \qquad (5.6)
$$
and
$$
\hat{T}_{\mathrm{mat}} = \min\left\{ \text{eigenvalues of } \hat{X}^{\mathrm{PL}:\mathrm{PW}:\mathrm{SL}:\mathrm{SW}} \right\} = -0.14. \qquad (5.7)
$$
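For orientation, a plug-in sketch of how these two statistics could be evaluated on the discretized data is given below. It assumes the M-matrix reading used in the earlier sketches and plug-in entropies, so the numbers need not match (5.6) and (5.7) exactly; `discrete` is the hypothetical (150, 4) array of category indices with PL in column 0.

```python
import numpy as np

def t_statistics(discrete, k=3):
    """Plug-in sketch of (5.6) and (5.7); discrete holds integer category
    indices with PL in column 0 and PW, SL, SW in columns 1-3."""
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))   # plug-in entropy
    n, d = discrete.shape
    joints = []
    for j in range(d):
        p = np.zeros((k, k))
        np.add.at(p, (discrete[:, 0], discrete[:, j]), 1)
        joints.append(p / n)                              # empirical P(PL, A^j)
    p1 = joints[0].sum(axis=1)                            # empirical P(PL)
    t_ent = sum(h(pj.sum(axis=1)) + h(pj.sum(axis=0)) - h(pj)
                for pj in joints[1:]) - h(p1)
    x = np.zeros((d * k, d * k))
    for j in range(d):
        pj = joints[j].sum(axis=0)                        # empirical P(A^j)
        x[j*k:(j+1)*k, j*k:(j+1)*k] = np.diag(pj) - np.outer(pj, pj)  # M^{A^j}
        m_1j = joints[j] - np.outer(p1, pj)                           # M^{PL:A^j}
        x[:k, j*k:(j+1)*k] = m_1j
        x[j*k:(j+1)*k, :k] = m_1j.T
    eig = np.linalg.eigvalsh(x)
    t_mat = np.min(np.delete(eig, np.argsort(np.abs(eig))[:d]))  # drop zeros
    return t_ent, t_mat
```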
Both inequalities, Tent ≤ 0 as well as Tmat ≥ 0, are violated, which is a first
indication that the data are incompatible with the model. Of course we still
have to conduct the actual hypothesis tests. We decide to consider only
the bootstrap tests since the direct tests would require the calculation of
threshold values above/below which the null hypothesis would be rejected.
Note that we cannot use the values from Section 3.3 and Subsection 4.5.3
since they have been calculated for the triangular scenario with K = 2 and
N = 50. Here, the model consists of four observables with K = 3 and
N = 150. More importantly, as already discussed several times, the main
problem of the direct approach would be that we could not be sure if the
threshold values were actually correct. When employing wrong threshold
values, the tests would not have the desired significance level of 5%. The
bootstrap tests, on the other hand, properly controlled the type-I-error rate
at 5% in all previous simulations.
We draw B = 999 bootstrap samples and use the BCa method to estimate the required endpoints of the confidence intervals, T̂ent^0.05 in the entropic case and T̂mat^0.95 in the matrix framework. Figure 33 shows histograms of the bootstrap estimates T̂ent* and T̂mat* together with the estimated endpoints as well as the original estimates T̂ent and T̂mat. We find T̂ent^0.05 = 0.29 > 0 in the entropic framework and T̂mat^0.95 = −0.116 < 0 in the matrix framework. In
both cases the null hypothesis (Tent ≤ 0 or Tmat ≥ 0 respectively) is rejected.
This implies, in particular, that with our hypothesis tests we come to the
conclusion that the data are incompatible with the proposed causal structure
from Figure 32. The correlations between the variables are stronger than
those allowed by a pairwise hidden common ancestor model. This means
that there either has to be direct causal influence between the observables
or, if we want to stick with a hidden common ancestor model, that ancestors
of larger degree are required. Further tests showed that we cannot reject a
model with ancestors of maximal degree m = 3. An ancestor of degree
m = 4 could stand for the type of iris flower that we pretended not to know
at the beginning of the model construction.
Figure 33: Histograms of the bootstrap estimates T̂ent* and T̂mat* (see (5.6) and (5.7)) for the iris data set. In the entropic case (left plot) the whole distribution lies in the regime T̂ent* > 0. Likewise, in the matrix case (right plot) all bootstrap estimates satisfy T̂mat* < 0. It is thus not surprising that also the estimated endpoints of the confidence intervals satisfy T̂ent^0.05 = 0.29 > 0 (entropic case) and T̂mat^0.95 = −0.116 < 0 (matrix case), meaning that both tests reject their respective null hypothesis (compatibility with the inequality).
6 Conclusion and outlook
In this thesis we have investigated statistical aspects of identifying, or rather
rejecting, causal models (DAGs, Bayesian networks) given data sets of finite
size. To be precise, we constructed and elaborated on hypothesis tests based
on inequalities constraining probability distributions that are compatible
with a given DAG. Since the inequality constraints are outer approximations to the true set of distributions compatible with the DAG, a rejection
of the null hypothesis: ‘data satisfy the inequality’ implies the rejection of
the causal model. A rigorous conclusion in the other direction, stating that
we found the one correct model explaining the data, cannot be drawn within
our approach. One fundamental reason, independently of the statistical aspects considered here, is that more than one DAG might be compatible with
the data. Choosing between several compatible models is a task that lies
outside of the scope of this thesis. It is worth mentioning that our approach
does not require a causal interpretation of the underlying DAG [recall that in
the first instance a DAG encodes (conditional) independence relations]. By
observing correlations that are too strong to be compatible with the DAG,
the DAG can be rejected independently of its interpretation. Also note
that the faithfulness assumption, stating that all correlations allowed by the
DAG should indeed be observed, is not required by the present approach. In
a sense, the faithfulness assumption bounds correlations from below, while
our approach tests correlation-bounds from above. Nevertheless, the faithfulness assumption might be used as a formal version of Occam’s razor when
choosing between different compatible models.
At the end of the introductory Chapter 1 we stated three major goals of
this thesis. First, improve the hypothesis test proposed in [16] based on an
entropic inequality; second, derive an analogous inequality based on certain
generalized covariance matrices; and finally, apply our methods to real empirical data. Note that we considered two different approaches to hypothesis
testing. The direct approach (also considered in [16]) requires calculation
of a threshold value above/below which the null hypothesis is rejected. In
the indirect (or bootstrap) approach a confidence interval of the statistic of
interest is estimated (via bootstrapping). The null hypothesis is rejected
when the confidence interval does not overlap with the compatibility region
of the inequality. We furthermore considered both types of hypothesis tests
in the entropic as well as the (generalized covariance) matrix framework.
In Chapter 3 we pursued the first of the above mentioned goals within the
entropic framework. As one means we implemented recent techniques of
entropy estimation [17, 18]. We could confirm that the new (minimax) estimator is often clearly superior to the maximum likelihood estimate (and two
bias corrected versions). In the case of small samples (N = 50) and small
alphabets (K = 2), that we considered in all of our simulated hypothesis
tests, however, the advantage of the minimax estimator was rather insignificant. In addition to the arguably weak power of the entropic direct test from
[16], we observed that the test might not have the desired type-I-error rate
of 5%. Utilizing the bootstrap approach, we could circumvent this problem.
The power of the entropic bootstrap test, however, was even smaller than
the power of the entropic direct test. This difference was, at least partially,
due to a weaker null hypothesis implicitly used by the bootstrap test (compatibility with the inequality as opposed to compatibility with the DAG).
By implementing additional entropic inequalities from [16], more powerful
entropic bootstrap tests could be constructed. However, these bootstrap
tests were still less powerful than the entropic direct test. In Chapter 4 we
constructed tests implementing our newly derived inequalities in the (generalized covariance) matrix framework. The matrix bootstrap test was more
powerful than all entropic bootstrap tests, almost matching the power of
the entropic direct test. With a direct test in the matrix framework we were
finally able to significantly surpass the power of the entropic direct test. The
observation making us doubt the proper type-I-error control of the entropic
direct test could not be made in the matrix framework. Nevertheless, the
reliability of the matrix direct test remains questionable as well.
Overall, we could improve the power of the original test from [16] or the
reliability, but not both at once. Though not considered further in this thesis, a straightforward increase of the power would be achieved by increasing the
sample size. With a larger sample size the estimation of entropies (as well
as the generalized covariance matrices) would become more precise. For real
data (a small sample) this solution is unfortunately not realizable. Decreasing the variance and the bias of the estimates by employing other estimation
techniques seems unlikely, at least in the entropic case. The employed minimax estimator of entropy already aims at minimizing the combination of
variance and bias.
Concerning the second goal of the thesis, the newly derived inequality in the
matrix framework, constraining hidden common ancestor models, is an inter-
esting result on its own. Constraints for such models based on the structure
alone are rare to this day. For the triangular scenario and small alphabets,
as considered in our simulated hypothesis tests, the new inequality turned
out to be stronger than the analogous entropic inequality. Though, we have
seen that for other settings the opposite might be true. Note that, as a
corollary of our matrix inequality, we also derived an inequality on the level
of the usual covariances. The matrix inequality, being independent of the
actual alphabets of the variables, is typically stronger (and never weaker)
than the covariance inequality for one particular choice of outcome values.
Looking beyond this thesis, one might hope that entropic inequalities for
other scenarios (which can be derived with the algorithm from [16]) can
be translated to the matrix framework as well. As a drawback, since
we encode probability distributions in matrices, our approach is restricted
to bi-partite information (the domain of the employed M -matrices can be
associated with one variable and the codomain can be associated with a
second (or the same) variable). A generalization to more than bi-partite
information is not straightforward.
Concerning the third goal, we demonstrated how our methods can be used
to reject a proposed model based on real data. Even though the scenario of
the iris flowers might not be the most spectacular one, the scenario was well
suited for illustrative purposes. We were able to falsify the proposed model
with bootstrap tests in both the entropic and the matrix framework. As a
next step it would be desirable to contribute with our methods to current
research. While our whole approach specifically aims for models with hidden
variables, the current restriction of the matrix framework to purely hidden
common ancestor models might complicate the search for up-to-date applications. Another obstacle is the assumption of discrete ancestors. While
it might be possible to adapt the entropic approach to continuous ancestors, our proof of the matrix inequality strongly rests on the assumption
of discrete, even finite, ancestors. This assumption could be problematic
since in some cases we might not have enough knowledge about the hidden
variables to justify the assumption. If, on the other hand, discrete ancestors can be justified or tolerated, our methods can be readily applied to
any suitable data set. Concerning the observables, our methods are best
suited for discrete observables with well defined categories. In that case no
manual discretization of the observables is required. Data with well defined
categories could for example emerge as the results of questionnaires.
Acknowledgements
I want to thank Prof. David Gross for supervising this thesis and for insightful discussions. Furthermore, I want to thank Rafael Chaves and Johan
Åberg for co-supervising the thesis, for likewise insightful discussions and
for valuable feedback.
A Generalized proof of Theorem 4.1
In the main text, the following two steps of the proof of Theorem 4.1 have
only been formulated and performed for the special case of the triangular
scenario:
• Proposition 4.2 in Subsection 4.4.2: Proof for a special family of distributions
• Proposition 4.3 in Subsection 4.4.4, Step 1: Locally transforming A =
{ A1 , A2 } → A0 ,...
In both cases, the reason was mainly to ensure readability. In
fact, the main difference in the general proofs presented here, will be the
generalized (and more complicated) notation. Employing this notation, a
lot of the calculations should be extremely familiar. It is therefore advisable
to first read the specialized proofs in the main text, or to reread those proofs
in case of any difficulties.
A.1 Proof for a special family of distributions
The idea in Subsection 4.4.2 was to model each ancestor as a collection of
correlated subvariables, one for each observable connected by that ancestor.
Each observable was then defined as the product of all its subvariables. A
graphical illustration for the triangular scenario was provided by Figure 19.
A.1.1 General notation
Now, considering the observables A1 , ..., An , if for example A1 , A2 and A3
share one ancestor, the ancestor will be denoted by λ123 . The subvariables
corresponding to that ancestor will be denoted by A1123 , A2123 and A3123 . The
collection of these subvariables, or rather the set of indices { 1, 2, 3 }, will
be referred to as the correlator of A1 , A2 and A3 . Usually, we will label
correlators by a single letter x, y or z (e.g. x = { 1, 2, 3 }). In contrast, the
observables themselves are as usual labeled by one of the letters i, j, k, l, ...
(e.g. Aj ). To state that Aj is part of the correlator x, we write j ∈ x. The
corresponding subvariable is denoted by Ajx . Note that each correlator x
corresponds to exactly one ancestor λx . Also note that while strictly speaking x is defined as a set, we write for example A1123 instead of A1{ 1,2,3 } when
addressing a specific subvariable A1x . The set of all correlators is denoted by
χ. We will further use the following notations for sets of subvariables:
$$
\begin{aligned}
\{A^j_x\}_x &:= \{A^j_x \mid j \text{ fixed},\ x \in \chi \text{ s.t. } j \in x\},\\
\{A^j_x\}^j &:= \{A^j_x \mid j \in x,\ x \text{ fixed}\},\\
\{A^j_k\} &:= \{A^j_x \mid j,k \text{ fixed},\ x \in \chi \text{ s.t. } j,k \in x\}.
\end{aligned} \qquad (A.1)
$$
Figure 34: DAG from Figure 18 where all observables are decomposed into subvariables, one for each ancestor (or correlator) of the observable. The set {A^j_{145}}^j = {A^1_{145}, A^4_{145}, A^5_{145}} is the set of all subvariables corresponding to the correlator {1, 4, 5} (or the ancestor λ_{145} in the original model). {A^1_4} = {A^1_{134}, A^1_{145}} is the set of all subvariables of A^1 that share a correlator with A^4. The set, say, {A^2_x}_x is the set of all subvariables composing the variable A^2.
In words, {A^j_x}_x is the set of all subvariables composing the observable A^j and {A^j_x}^j is the set of all subvariables composing the correlator x. {A^j_k} is the set of all subvariables of A^j sharing a correlator with A^k. Figure 34 provides a graphical illustration based on the DAG from Figure 18. All correlations are mediated by the correlators; there are no additional correlations between subvariables from different correlators (e.g. A^1_{123} ⊥⊥ A^1_{124}).
The joint distribution of A^1, ..., A^n reads
$$
P(A^1, ..., A^n) = \prod_{x \in \chi} P\left(\{A^j_x\}^j\right). \qquad (A.2)
$$
In the following, we will simply write $\prod_x$ instead of $\prod_{x \in \chi}$. Marginalization over one variable corresponds to summation over all of its subvariables. When calculating P(A^1, A^j) all factors on the right hand side of (A.2) where neither A^1 nor A^j appears become one. The remaining distribution reads
$$
P(A^1, A^j) = \prod_{\substack{x \\ 1,j \in x}} P(A^1_x, A^j_x)\; \prod_{\substack{y \\ 1 \in y,\, j \notin y}} P(A^1_y)\; \prod_{\substack{z \\ 1 \notin z,\, j \in z}} P(A^j_z). \qquad (A.3)
$$
In the second product, for example, y runs over all correlators including a subvariable of A^1 but no subvariable of A^j. This type of notation will be
used heavily for the rest of this section. Further marginalizing over A1 yields
$$
P(A^j) = \prod_{\substack{x \\ 1,j \in x}} P(A^j_x)\; \prod_{\substack{z \\ 1 \notin z,\, j \in z}} P(A^j_z) = \prod_{\substack{x \\ j \in x}} P(A^j_x) \qquad (A.4)
$$
(the final expression also being valid for j = 1) which is in accordance with
the claim that Aj is the product of all its independent subvariables Ajx .
A.1.2 The proposition
For the desired family of distributions that is supposed to satisfy inequality
(4.32), we assume perfect correlation between all subvariables belonging to
one correlator,
$$
P\left(\{A^j_x\}^j\right) = \frac{1}{K_x}\,\delta_{\{A^j_x\}^j}. \qquad (A.5)
$$
The multidimensional Kronecker delta demands that all subvariables in the set {A^j_x}^j take the same value. K_x is the common alphabet size of these subvariables. Insertion into the general distribution (A.2) yields
$$
P(A^1, ..., A^n) = \prod_x \frac{1}{K_x}\,\delta_{\{A^j_x\}^j}. \qquad (A.6)
$$
Proposition A.1. The distributions P(A^1, ..., A^n) = ∏_x (1/K_x) δ_{{A^j_x}^j} on the variables A^j = {A^j_x}_x, with arbitrary, finite alphabet sizes K_x, satisfy inequality (4.32),
$$
\sum_{j=2}^{n} \big(M^{A^1}\big)^{-1/2}\, M^{A^1:A^j}\, \big(M^{A^j}\big)^{-1}\, M^{A^j:A^1}\, \big(M^{A^1}\big)^{-1/2} \;\leq\; (m-1)\,\mathbb{1}_{A^1}.
$$
Recall that we formulated the inequality specifically around the variable
A1 . An analogous inequality holds for any other variable Ai , the sum then
running over j = 1, ...i−1, i+1, ..., n. Alternatively, one could simply rename
the variables.
A.1.3 The proof
Constructing the marginal distributions
Either by marginalization of the distribution P (A1 , ..., An ) = x K1x δ{ Aj }j ,
x
or by simply writing the marginals down (which are clear by construction),
one obtains the subvariable-marginals
Q
1
δ 1 j,
Kx Ax ,Ax
1
P Ajx =
.
Kx
P A1x , Ajx
(A.7)
=
(A.8)
In the bi-partite case it is always implied that j 6= 1. Inserting (A.7) and
(A.8) into (A.3) leads to the total bi-partite marginals
P A1 , Aj
Y
=
x
Y
1
1 Y 1
δA1x ,Ajx
.
Kx
Ky z Kz
y
1,j∈x
(A.9)
1∈z,j∈z
/
1∈y,j ∈y
/
It is convenient to introduce the short hand notations
K1j :=
Y
Kx ,
K1j :=
Y
x
y
1,j∈x
1∈y,j ∈y
/
K1 :=K1j K1j ,
Kj := K1j K1j
Y
K1j :=
Ky ,
Kz ,
z
1∈z,j∈z
/
and δ{ A1 },{ Aj } :=
j
1
Y
x
δA1x ,Ajx .
1,j∈x
(A.10)
149
A GENERALIZED PROOF OF THEOREM 4.1
The indices indicate over which correlators the multiplication runs. For
example, K1j is the result of multiplying the alphabet sizes of all correlators
containing A1 but not Aj . The delta function δ{ A1 },{ Aj } demands that each
1
j
of the variables A1x coincides with its counterpart Ajx . Using these short
hand notations, (A.9) can be written as
1
1
δ{ A1 },{ Aj }
1 K K
j
K1j
1j 1j
1
1
= q
δ{ A1 },{ Aj } q
.
1
j
K1j K1j
K1 Kj
P A1 , Aj
=
(A.11)
By inserting expression A.8 into equation A.4 one finds the total one-variable
marginals
P Aj
=
Y
x
1
1
=
.
Kx
Kj
(A.12)
j∈x
Marginals in operator notation
In order to simplify the calculations with the M -matrices, we write down
the marginal distributions P (Aj ) and P (A1 , Aj ) as states and operators in
the Dirac notation. For comparison, see (4.72) to (4.78) in the proof of
Proposition 4.2. We can write
P Aj
1 O E
1 E
IAj .
= q I j := q
x
Kj
Kj x
(A.13)
j∈x
E
The states IAjx are defined as
E
IAj
x
Kx
1 X
√
:=
|ki j .
Kx k=1 Ax
(A.14)
Note that all |Ii -states are normalized. Similarly, the operator corresponding to the bi-partite marginal reads
1
P A ,A
j
=q
1
K1 K j
{ A1j }↔{ Aj1 }
1
⊗ Ij1
ED
I1j ,
(A.15)
150
A GENERALIZED PROOF OF THEOREM 4.1
where the individual factors have the inner tensor product structure
{ A1j }↔{ Aj1 }
1
O A1x ↔Ajx
1 ,
:=
(A.16)
x
1,j∈x
E
1
Ij
E
O IA1x ,
:=
(A.17)
x
1∈x,j ∈x
/
E
j
I1
E
O IAj .
:=
(A.18)
x
x
1∈x,j∈x
/
The meaning of the indices is the same as for the alphabet sizes from (A.10).
A1x ↔Ajx
The operator 1 can be understood as an ‘identity operator between the
isomorphic spaces of the variables A1x and Ajx ’. It is essentially the operator
version of the scalar Kronecker delta δA1x ,Ajx , and is explicitly defined as
A1x ↔Ajx
1
:=
Kx
X
|kiA1x hk|Ajx .
(A.19)
k=1
See also the original definition (4.77) and the explanations below. For practical purposes, it is only important that when acting from the left on a state
from the space of
A1x ↔Ajx
j
Ax ,
transforms
E
E
A1x ↔Ajx 1
this state to the ‘same’ state in the
space of A1x , e.g. 1 IAjx = IA1x (if both 1, j ∈ x). Acting from the
right on a state from the space of A1x , the transformation goes in the other
{ A1j }↔{ Aj1 }
1
direction. For the compound operator
for example,
{ A1j }↔{ Aj1 } 1
one consequently obtains,
E
j
I1 = Ij1 where the states
E
E
O 1
Ij
:=
IA1x ,
E
are defined as
(A.20)
x
1,j∈x
E
j
I1
:=
E
O IAj .
x
x
(A.21)
1,j∈x
The M -matrices
For the original definition of the matrices see (4.3) and (4.6). Using the
1
j
marginals (A.13) and (A.15), the bi-partite matrix M A :A (in a concise
151
A GENERALIZED PROOF OF THEOREM 4.1
operator form) can be written down without any further calculations (except
for some simple manipulations),
MA
1 :Aj
ED
= P A1 , Aj − P A 1 P A j
{ A1j }↔{ Aj1 }
1
1
= q
K1 Kj

{ A1j }↔{ Aj1
1
=
⊗ Ij1
q
K1 Kj

1
†
I1j − q
1
K1 Kj
ED 1
I
I j 
}
− Ij1
ED
I1j  ⊗ Ij1
ED
I1j .
(A.22)
E
From the second to the third line we used the decompositions |I 1 i = Ij1 ⊗
E
1
Ij
E
E
and |I j i = I1j ⊗ I1j . For the mono-partite matrices we obtain

M
Aj

K
1 Xj
δkk0 |kiAj hk 0 |Aj  − P (A)P (A)†
= 
Kj k,k0 =1
=
E D 1 1Aj − I j I j .
Kj
(A.23)
Formally, the expressions coincide with the specific expressions for the triangular scenario from (4.79) and (4.81). However, here, the matrices generally
have an even deeper tensor product structure. When trying to conclude
the desired inequality constraint (4.32), this structure will indeed make a
difference.
√
√
1
j
j
j
1
Calculating the matrix product M A1 M A :A M A M A :A M A1 is comj
pletely analogous to the triangular scenario. First, note that M A is a scalar
multiplicative of a projection. Thus, taking the (pseudo) inverse and square
root simply corresponds to taking the inverse and square root of the prefactor 1/Kj . We start by considering the product of the first two matrices (see
152
A GENERALIZED PROOF OF THEOREM 4.1
(A.22) and (A.23)),
p
M A1 M A
p
E D 1
K1 1A1 − I 1 I 1 × p
=
1 :Aj

{ A1j }↔{ Aj1 }
K1 Kj

1

ED ED − Ij1 I1j  ⊗ Ij1 I1j 
p
= K1 M
A1 :Aj
p
1 :Aj
p
A1 :Aj
= K1 M A

j
1
1 1 E D 1 1 E D 1 { Aj }↔{ A1 } 1 E D j  1 E D j −p
Ij ⊗ Ij̄ Ij̄ ×
1
− Ij I1 ⊗ Ij I1 I
Kj j
1
−p
Kj
E D E D E D 1
I1j − Ij1 I1j ⊗ Ij1 I1j Ij
|
{z
}
0
= K1 M
.
(A.24)
√
As it was the case in the calculations for the triangular scenario, √M A1
1
j
merely has a scalar-multiplicative effect on M A :A . The effect of M A1
j
1
j
on M A :A from the right is exactly the same, and analogously the M A in
the middle reduces to the scalar prefactor Kj . By exploiting this behaviour
1
j
and using M A :A from (A.22), we obtain
√
√
1
j
j
j
1
M A1 M A :A M A M A :A M A1
=K1 Kj M A

=
1 :Aj
{ A1j }↔{ Aj1 }
1
MA

 j

{ A1 }↔{ A1j }
ED ED ED ED 1
− I1j Ij1  ⊗ I1j Ij1 − Ij1 I1j  ⊗ Ij1 I1j × 
= 1{ A1j } − Ij1
j :A1
ED
Ij1 ⊗ Ij1
ED
Ij1 .
(A.25)
For each single j, this expression is a projection. However, in contrast to
the special case of the triangular scenario, projections for different j are in
general
not orthogonal to each other anymore. As a consequence, the sum
Pn √ A1 A1 :Aj Aj Aj :A1 √ A1 M
M
M M
M
is not a projection (and typically
j=2
neither a scalar multiple of a projection). Showing that the sum is bounded
by (m − 1) 1A1 is therefore more complicated in the general case. A detailed
consideration of the tensor product structure becomes necessary.
Proving the inequality: Gaining intuition
By
a simple example
we explicitly confirm that the projections
√ considering
√
1 :Aj
j
j :A1
1
1
A
A
A
MA M
M M
M A for different j are in general not orthog-
153
A GENERALIZED PROOF OF THEOREM 4.1
onal to each other. With the help of a second example we try to get an
impression of how the different projections overlap.
Consider the extreme case of three observables that share one common ancestor (and thus are perfectly correlated). A1 consists of only one subvariable
A1123 . Taking a look at (A.25) (last line), for both j = 2, 3 one obtains
1{ A1j } = 1A1123 ,
E
1
Ij
E
1
Ij
(A.26)
=
E
IA1
,
(A.27)
:
does not exist.
(A.28)
123
E
Concerning the last one, recall that Ij1 is the tensor product of all states
E
IA1x
such that j ∈
/ x. But here, both j = 2, 3 are part of the only correlator
E
x = { 1, 2, 3 }, so that no such IA1x exists. In total, (A.25) reads
√
M A1 M A
1 :Aj
j
MA MA
j :A1
√
M A1 = 1A1123 − IA1123
ED
IA1123 .
(A.29)
We find the same projection for both j = 2, 3. In particular, they are not
orthogonal to each other. On the other hand, since their sum is a scalar
multiple of a projection, the proper (and trivial) inequality for this scenario
(m = 3; A1123 = A1 ),
2 (1A1 − |IA1 i hIA1 |) ≤ 21A1 ,
(A.30)
can easily be seen to be satisfied. The inequality is trivial in the sense that
this scenario puts no constraints on the distribution at all.
In general, two projections will neither be the same nor orthogonal, but
overlap only on some subspaces of the tensor product space. For a more
complex example that illustrates this behaviour consider four variables of
which all triplets share one ancestor. There are three correlators including
the variable A1 : { 1, 2, 3 }, { 1, 2, 4 } and { 1, 3, 4 }. Explicitly writing down
all tensor products, the left hand side of inequality (4.32) (using (A.25))
154
A GENERALIZED PROOF OF THEOREM 4.1
reads
4 √
X
M A1 M A
1 :Aj
j
MA MA
j :A1
√
M A1
j=2
= 1A1123 ⊗ 1A1124 − IA1123
ED
IA1123 ⊗ IA1124
+ 1A1123 ⊗ 1A1134 − IA1123
ED
IA1123 ⊗ IA1134
+ 1A1124 ⊗ 1A1134 − IA1124
ED
IA1124 ⊗ IA1134
ED
IA1124 ⊗ IA1134
ED
IA1134 ED
IA1134 ⊗ IA1124
ED
IA1124 ED
IA1134 ⊗ IA1123
ED
IA1123 .
(A.31)
We see that each of the terms 1A1123 , 1A1124 and 1A1134 appears in two of the
three projections. These terms are exactly the causes of the overlaps (i.e. the
non-orthogonality). Since we have m = 3 for the current scenario, one may
realize that the number of occurrences of each overlapping term is indeed
upper bounded by m − 1. To verify this intuitive statement for the general
case, a more detailed treatment follows.
Proving the inequality: General treatment
√
√
1
j
j
j
1
An important realization is that all terms M A1 M A :A M A M A :A EM A1
are diagonal in the same basis, partially spanned by the states IA1x (see
(A.25) and the definitions of the |Ii -states from (A.14), A.17 and A.20). In
order to validate inequality (4.32), it is thus enough to show that all diagonal
√
√
P
1
j
j
j
1
elements of nj=2 M A1 M A :A M A M A :A M A1 are upper bounded by
m − 1. We label the correlators x including the observable A1 by x1 , ..., xN .
We further denote the basisE states of the variable A1xi by |0i i , |1i i , |2i i , ...
|Ki − 1i and set |0i i ≡ IA1xi . The other states are arbitrary and never occur
explicitly. Similar notation is used for the identity operators, i.e. 1i ≡ 1A1xi .
The projections (A.25),
√
√
A1 :Aj
Aj
Aj :A1
A1
M A1 M
M
M
M
ED ED ED = 1{ A1j } ⊗ Ij1 Ij1 − Ij1 Ij1 ⊗ Ij1 Ij1 ,
|
{z
=:T1 (j)
}
|
{z
=:T2 (j)
}
(A.32)
A GENERALIZED PROOF OF THEOREM 4.1
155
can now be written down in a way that explicitly takes into account the full
tensor product structure. The second term reads
N
O
T2 (j) =
|0i i h0i |
i=1
= |01 i h01 | ⊗ |02 i h02 | ⊗ ... ⊗ |0N i h0N | ,
(A.33)
independently of j. Only the diagonal element corresponding to the tensor
product state |01 , 02 , ..., 0N i will be one, while all others are zero. For the
first term of the projection (A.32), we obtain
O
T1 (j) = 




1i 
⊗
|0i i h0i |



i
j∈xi

O
i

j ∈x
/ i
= 11 ⊗ 12 ⊗ |03 i h03 | ⊗ 14 ⊗ ... ⊗ |0N i h0N | .
e.g.
(A.34)
The second line depends on the ancestors that A1 shares with Aj . The
identity operators occur exactly at the positions of the common correlators
(or ancestors) of A1 and Aj . Still considering this example, the diagonal
elements corresponding to states of the form |α1 , α2 , 03 , α4 , ..., 0N i with αi =
0, 1, ...Kxi − 1 will be one, all others will be zero. The positions of the
αi are exactly the positions of the identity operators in T1 (j). Note that
the element corresponding to the state |01 , 02 , ..., 0N i will always be one,
but it will be canceled by the term T2 (j). Thus, for the total projection
√
√
1
j
j
j
1
M A1 M A :A M A M A :A M A1 , all elements corresponding to states of
the form |α1 , α2 , 03 , α4 , ..., 0N i with at least one of the αi 6= 0 will be one.
Now, the final question is in how many projections a specific identity operator 1i (or a combination of several identity operators) can occur (i.e. in how
√
√
1
j
j
j
1
many projections M A1 M A :A M A M A :A M A1 a specific diagonal element can be one). For a given j = 2, ..., n, the identity operator 1A1xi = 1i
occurs if and only if j ∈ xi . Since each ancestor connects at most m observables, each correlator containing A1 can contain at most m − 1 other
variables Aj . Thus, the term 1A1xi can occur at most m − 1 times. Of course
any combination of more than one identity operator cannot occur more often
than each identity individually. This means that a given diagonal element
can be one in at most m − 1 projections. Thus, considering the sum of all
projections, each diagonal element is bounded by m − 1. Since the total
A GENERALIZED PROOF OF THEOREM 4.1
156
operator is diagonal we can conclude the desired inequality
n X
j=2
1{ A1j } − Ij1
√
|
Ij1 ⊗ Ij1 Ij1 ≤ (m − 1) 1A1 ,
{z
}
√ 1
j
j 1
ED
j
M A1 M A1 :A M A
=
MA
:A
ED
(A.35)
MA
for the considered special family of distributions. This finishes the proof of
Proposition A.1.
Example
We illustrate
the above construction of the matrix representation of
Pn √ A1 A1 :Aj Aj Aj :A1 √ A1 M
M
M M
M
by an example. Consider again
j=2
four variables
with
one
ancestor
for
each
triplet. In an operator notation,
Pn √ A1 A1 :Aj Aj Aj :A1 √ A1 M
M
M M
M
was already calculated in (A.31).
j=2
To establish the same order of the tensor product decomposition in all terms,
we rewrite this expression as
4 √
X
1
M A1 M A
:Aj
j
MA MA
j
:A1
√
M A1
j=2
1A1123
1A1123
1A1124
⊗ 01134 01134 − 01123 01123 ⊗ 01124 01124 ⊗ 01134 01134 +
1A1134 − 01123 01123 ⊗ 01124 01124 ⊗ 01134 01134 ⊗ 01124 01124 ⊗
+ 01123 01123 ⊗
1A1124
⊗
1A1134 − 01123 01123 ⊗ 01124 01124 ⊗ 01134 01134 .
(A.36)
E
Note that we also changed the notation from IA1x to |01x i. In the following,
=
⊗
we completely neglect the indices since they should be clear from the order
of the tensor product decomposition. When assuming that all variables
are
we have to calculate
diagonal elements for each projection
√ binary,
√ eight
1
1
A1 :Aj
Aj
Aj :A1
A
A
M
M
M M
M
(for short denoted as Pj , j = 2, 3, 4 from
now on). Starting with P2 (second line in (A.36)) we obtain
h0, 0, 0 | P2 | 0, 0, 0i
= h0, 0, 0| (1 ⊗ 1 ⊗ |0i h0| − |0i h0| ⊗ |0i h0| ⊗ |0i h0|) |0, 0, 0i
= h0 | 1 | 0i h0 | 1 | 0i h0 | 0i h0 | 0i − h0 | 0i h0 | 0i h0 | 0i h0 | 0i h0 | 0i h0 | 0i
=1 − 1
=0,
(A.37)
157
A GENERALIZED PROOF OF THEOREM 4.1
h1, 0, 0 | P2 | 1, 0, 0i
= h1, 0, 0| (1 ⊗ 1 ⊗ |0i h0| − |0i h0| ⊗ |0i h0| ⊗ |0i h0|) |1, 0, 0i
= h1 | 1 | 1i h0 | 1 | 0i h0 | 0i h0 | 0i − h1 | 0i h0 | 1i h0 | 0i h0 | 0i h0 | 0i h0 | 0i
=1 − 0
=1,
(A.38)
and similarly
h0, 0, 1 | P2
h0, 1, 0 | P2
h0, 1, 1 | P2
h1, 0, 1 | P2
h1, 1, 0 | P2
h1, 1, 1 | P2
| 0, 0, 1i
| 0, 1, 0i
| 0, 1, 1i
| 1, 0, 1i
| 1, 1, 0i
| 1, 1, 1i
=
=
=
=
=
=
0
1
0
0
1
0.
(A.39)
(A.40)
(A.41)
(A.42)
(A.43)
(A.44)
A suitable matrix representation of this projection reads
P2 =
|0, 0, 0i
|0, 0, 1i
|0, 1, 0i
|0, 1, 1i
|1, 0, 0i
|1, 0, 1i
|1, 1, 0i
|1, 1, 1i


0














0
1
0
1
0
1














.
(A.45)
0
In the same basis, the other two projections can be written as
P3 =
|0, 0, 0i
|0, 0, 1i
|0, 1, 0i
|0, 1, 1i
|1, 0, 0i
|1, 0, 1i
|1, 1, 0i
|1, 1, 1i


0














1
0
0
1
1
0














0
,
(A.46)
158
A GENERALIZED PROOF OF THEOREM 4.1
and


0
|0, 0, 0i


|0, 0, 1i  1




1
|0, 1, 0i 




1
|0, 1, 1i 

 .
P4 =



0
|1, 0, 0i 



0
|1, 0, 1i 



0 
|1, 1, 0i 
0
|1, 1, 1i
Finally, the sum of all three projections satisfies
$$
P_2 + P_3 + P_4 = \mathrm{diag}(0, 2, 2, 1, 2, 1, 1, 0) \;\leq\; 2\cdot\mathbb{1}, \qquad (A.48)
$$
where the diagonal is written in the basis order |0,0,0⟩, |0,0,1⟩, ..., |1,1,1⟩ used above.
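The matrix arithmetic of this example can be verified numerically, for instance along the following lines (a small sanity check of the diagonal representations (A.45)–(A.47) and the bound (A.48), not part of the proof):

```python
import numpy as np

# diagonal representations of the three projections in the basis
# |0,0,0>, |0,0,1>, ..., |1,1,1> (cf. (A.45)-(A.47))
p2 = np.diag([0, 0, 1, 0, 1, 0, 1, 0])
p3 = np.diag([0, 1, 0, 0, 1, 1, 0, 0])
p4 = np.diag([0, 1, 1, 1, 0, 0, 0, 0])

total = p2 + p3 + p4
# every eigenvalue of the sum is bounded by m - 1 = 2, as claimed in (A.48)
assert np.all(np.linalg.eigvalsh(2 * np.eye(8) - total) >= 0)
```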
A.2 Locally transforming A^j = {A^j_x}_x → A^{j′}
In Proposition 4.3 we have shown for the special case of the triangular scenario how to locally transform distributions from the ‘correlated subvariable
model’ to distributions from the ‘ancestor model’. The starting distributions
read
P (A, B, C) = P (A1 , A2 , B1 , B2 , C1 , C2 )
1
=
δA B δB C δC A ,
KAB KAC KBC 1 2 1 2 1 2
(A.49)











0
159
A GENERALIZED PROOF OF THEOREM 4.1
(see also (4.102)) while the resulting distributions were of the form
P (A0 , B 0 , C 0 )
X
=
P (A0 | λAB , λAC ) P (B 0 | λAB , λBC ) P (C 0 | λAC , λBC )
λAB ,λAC ,λAB
·
1
,
KAB KAC KBC
(A.50)
(see also (4.104)). The latter is the general form of all distributions compatible with the triangular scenario, but with uniform distributions of the ancestors, $P(\lambda_x = j) = 1/K_x$. For general hidden common ancestor models, we start with the distributions $P(A^1, ..., A^n) = \prod_x \frac{1}{K_x} \delta_{\{A^j_x\}^j}$ introduced in (A.6). We will also use the notations related to subvariables $A^j_x$ introduced in that section. In particular, recall that by $\{A^j_x\}^j$ we denoted the set of all subvariables composing the correlator $x$ and by $\{A^j_x\}_x$ the set of all subvariables composing the observable $A^j$, see also (A.1) and Figure 34. In addition, we introduce the notation

\{A^j_x\}^j_x := \{\, A^j_x \mid j \in x,\; x \in \chi \,\}          (A.51)

for the set of all subvariables $A^j_x$. In the ancestor framework, recall that we denoted the set of all ancestors $\lambda_x$ by $\{\lambda_x\}_x$ and the set of all ancestors of the observable $A^j$ by $\{\lambda_x\}_{x|A^j}$, see also (4.108) and (4.109).
Proposition A.2. Starting with the family of distributions $P(A^1, ..., A^n) = \prod_x \frac{1}{K_x} \delta_{\{A^j_x\}^j}$ introduced in (A.6), one can obtain all distributions of the form (4.108),

P(A^1, ..., A^n) = \sum_{\{\lambda_x\}_x} P(A^1 \mid \{\lambda_x\}_{x|A^1}) \cdots P(A^n \mid \{\lambda_x\}_{x|A^n}) \prod_x \frac{1}{K_x} ,          (A.52)

(with finite alphabets) via local transformations $A^j = \{A^j_x\}_x \to A^{j\prime}$.
Proof. Recall that for a single variable a local transformation reads
P(A^{j\prime}) = \sum_{A^j} P(A^{j\prime} \mid A^j)\, P(A^j) = \sum_{\{A^j_x\}_x} P(A^{j\prime} \mid \{A^j_x\}_x)\, P(\{A^j_x\}_x) .          (A.53)
Applied to the joint distribution P (A1 , ..., An ), by locally transforming all
variables we arrive at
P(A^{1\prime}, ..., A^{n\prime})
  = \sum_{A^1, ..., A^n} P(A^{1\prime} \mid A^1) \cdots P(A^{n\prime} \mid A^n)\, P(A^1, ..., A^n)
  = \sum_{\{A^1_x\}_x, ..., \{A^n_x\}_x} P(A^{1\prime} \mid \{A^1_x\}_x) \cdots P(A^{n\prime} \mid \{A^n_x\}_x)\, P(A^1, ..., A^n)
  = \sum_{\{A^j_x\}^j_x} P(A^{1\prime} \mid \{A^1_x\}_x) \cdots P(A^{n\prime} \mid \{A^n_x\}_x) \prod_x \frac{1}{K_x}\, \delta_{\{A^j_x\}^j} .
The summation over all but one subvariable of each correlator $x$ cancels the $\delta_{\{A^j_x\}^j}$ and forces all involved subvariables to coincide. This allows us to model all these correlated subvariables by one single variable. The latter can be identified as the common ancestor of all observables belonging to the given correlator. We consequently rename the remaining subvariable of the correlator $x$ as $\lambda_x$ and finally obtain
P(A^{1\prime}, ..., A^{n\prime})
  = \sum_{\{A^j_x\}^j_x} P(A^{1\prime} \mid \{A^1_x\}_x) \cdots P(A^{n\prime} \mid \{A^n_x\}_x) \prod_x \frac{1}{K_x}\, \delta_{\{A^j_x\}^j}
  = \sum_{\{\lambda_x\}_x} P(A^{1\prime} \mid \{\lambda_x\}_{x|A^{1\prime}}) \cdots P(A^{n\prime} \mid \{\lambda_x\}_{x|A^{n\prime}}) \prod_x \frac{1}{K_x} .          (A.54)
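As an illustration of this argument, the following minimal sketch (assuming numpy; the stochastic maps T_A, T_B, T_C and all names are illustrative choices, not taken from the thesis) performs the local transformation numerically in the triangular scenario and confirms that it reproduces the ancestor-model form of (A.50)/(A.54):

```python
import numpy as np

# Sketch of the local transformation A^j = {A^j_x}_x -> A^j' in the triangular
# scenario (assumption: numpy; T_A, T_B, T_C are arbitrary illustrative maps).
rng = np.random.default_rng(0)
K, Kp = 3, 2                             # subvariable alphabet / alphabet of A^j'

def stochastic_map(shape):
    # random conditional distribution P(out | in1, in2), normalized over 'out'
    T = rng.random(shape)
    return T / T.sum(axis=0, keepdims=True)

T_A = stochastic_map((Kp, K, K))         # P(A' | A1, A2)
T_B = stochastic_map((Kp, K, K))         # P(B' | B1, B2)
T_C = stochastic_map((Kp, K, K))         # P(C' | C1, C2)

# Starting distribution (A.49): A1=B2, B1=C2, C1=A2, with uniform weights.
d = np.eye(K)
P_start = np.einsum('ab,cd,ef->afcbed', d, d, d) / K**3   # axes (A1,A2,B1,B2,C1,C2)

# Left-hand side: apply the local maps to the joint distribution.
P_lhs = np.einsum('xab,ycd,zef,abcdef->xyz', T_A, T_B, T_C, P_start)

# Right-hand side: ancestor-model form with the surviving subvariables renamed
# as lambda_AB, lambda_AC, lambda_BC (each uniform on K values).
P_rhs = np.einsum('xab,yca,zbc->xyz', T_A, T_B, T_C) / K**3

print(np.allclose(P_lhs, P_rhs), P_lhs.sum())             # True 1.0 (up to rounding)
```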
B  Proof of Corollary 4.2
Corollary 4.2 states that all distributions compatible with a hidden common
ancestor model with n observables A1 , ..., An (with finite alphabets) and
ancestors of degree up to m, satisfy the inequality
\sum_{j=2}^{n} \left[ \mathrm{Cov}[A^1, A^j] \right]^2 \prod_{\substack{k=2 \\ k \neq j}}^{n} \mathrm{Var}[A^k] \;\le\; (m-1) \prod_{i=1}^{n} \mathrm{Var}[A^i] .          (B.1)
Proof. From Theorem 4.1 we know that for all distributions compatible with the given hidden common ancestor model, the matrix

X^{A^1:\dots:A^n} = \begin{pmatrix}
(m-1)\, M^{A^1} & M^{A^1:A^2} & \cdots & \cdots & M^{A^1:A^n} \\
M^{A^2:A^1} & M^{A^2} & 0 & \cdots & 0 \\
\vdots & 0 & \ddots & \ddots & \vdots \\
\vdots & \vdots & \ddots & \ddots & 0 \\
M^{A^n:A^1} & 0 & \cdots & 0 & M^{A^n}
\end{pmatrix}          (B.2)
is positive semidefinite. Lemma 4.2 allows us to conclude that the matrix
Z^{A^1:\dots:A^n} := \begin{pmatrix}
(m-1)\, \mathrm{Var}[A^1] & \mathrm{Cov}[A^1, A^2] & \cdots & \cdots & \mathrm{Cov}[A^1, A^n] \\
\mathrm{Cov}[A^2, A^1] & \mathrm{Var}[A^2] & 0 & \cdots & 0 \\
\vdots & 0 & \ddots & \ddots & \vdots \\
\vdots & \vdots & \ddots & \ddots & 0 \\
\mathrm{Cov}[A^n, A^1] & 0 & \cdots & 0 & \mathrm{Var}[A^n]
\end{pmatrix}          (B.3)

is positive semidefinite as well. To apply Lemma 4.2, recall that one can write $\mathrm{Cov}[A^i, A^j] = a_i^\dagger M^{A^i:A^j} a_j$, where the vectors $a_i$, $a_j$ carry the alphabets of $A^i$ and $A^j$. Positive semidefiniteness of $Z^{A^1:\dots:A^n}$ implies $\det Z^{A^1:\dots:A^n} \ge 0$. For simplicity, we write $Z \equiv Z^{A^1:\dots:A^n}$ from now on. To calculate the determinant we use Laplace's formula and expand the matrix along the first column. In that case, Laplace's formula reads
\det Z = \sum_{j=1}^{n} \underbrace{(-1)^{j+1}\, Z_{j1}\, \det Z^{(j,1)}}_{=:\, d_j} .          (B.4)
The matrix Z (j,1) is obtained from the matrix Z by erasing the jth row and
the first column. In general, this formula has to be applied recursively until
the determinants on the right hand side can be calculated by other means.
Fortunately, we only require one application of (B.4) here.
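As a generic aside, (B.4) can also be checked numerically; the following minimal sketch (assuming numpy; not part of the thesis) implements the cofactor expansion along the first column recursively and compares the result with numpy's determinant:

```python
import numpy as np

# Sketch of Laplace's formula (B.4): recursive expansion of det Z along the
# first column, checked against numpy (assumption: numpy available).
def det_laplace_first_column(Z):
    n = Z.shape[0]
    if n == 1:
        return Z[0, 0]
    total = 0.0
    for j in range(n):                                          # 0-based row index
        minor = np.delete(np.delete(Z, j, axis=0), 0, axis=1)   # erase row j and the first column
        total += (-1) ** j * Z[j, 0] * det_laplace_first_column(minor)
    return total

Z = np.random.default_rng(1).random((5, 5))
print(np.isclose(det_laplace_first_column(Z), np.linalg.det(Z)))   # True
```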
• For $j = 1$ we obtain $(-1)^{1+1} = 1$, $Z_{11} = (m-1)\,\mathrm{Var}[A^1]$ and

  Z^{(1,1)} = \begin{pmatrix} \mathrm{Var}[A^2] & & \\ & \ddots & \\ & & \mathrm{Var}[A^n] \end{pmatrix} .          (B.5)

  The determinant of the diagonal matrix $Z^{(1,1)}$ is simply the product of its diagonal elements, $\det Z^{(1,1)} = \prod_{i=2}^{n} \mathrm{Var}[A^i]$. Thus, the $j = 1$ term on the right hand side of (B.4) reads

  d_1 = (m-1) \prod_{i=1}^{n} \mathrm{Var}[A^i] .          (B.6)

  This is exactly the right hand side of the desired inequality (B.1).
• For $j \ge 2$ we obtain the prefactors $(-1)^{j+1}$, $Z_{j1} = \mathrm{Cov}[A^j, A^1]$ and the matrix $Z^{(j,1)}$ (where we abbreviate $\mathrm{Cov} \to C$, $\mathrm{Var} \to V$ and $A^i \to i$):

  \begin{pmatrix}
  C[1,2] & C[1,3] & \cdots & C[1,j-1] & C[1,j] & C[1,j+1] & \cdots & C[1,n] \\
  V[2]   & 0      & \cdots & 0        & 0      & 0        & \cdots & 0 \\
  0      & V[3]   & \ddots & \vdots   & \vdots & \vdots   &        & \vdots \\
  \vdots & \ddots & \ddots & 0        & \vdots & \vdots   &        & \vdots \\
  0      & \cdots & 0      & V[j-1]   & 0      & 0        & \cdots & 0 \\
  0      & \cdots & \cdots & 0        & 0      & V[j+1]   & \ddots & \vdots \\
  \vdots &        &        & \vdots   & \vdots &          & \ddots & 0 \\
  0      & \cdots & \cdots & 0        & 0      & \cdots   & 0      & V[n]
  \end{pmatrix}
To guide the eye, the main diagonal is colored light green. Up to the
‘j − 1-column’, the element directly below the diagonal is (in general)
non-zero. The ‘j-column’ is non-zero only in the first row. From the
‘j + 1’ to the last column all elements below the diagonal vanish as
well. Thus, by permuting the ‘j-column’ to the left (requiring $j-2$ permutations), one arrives at the upper triangular matrix:

  \begin{pmatrix}
  C[1,j] & C[1,2] & C[1,3] & \cdots & C[1,j-1] & C[1,j+1] & \cdots & C[1,n] \\
  0      & V[2]   & 0      & \cdots & 0        & 0        & \cdots & 0 \\
  \vdots & 0      & V[3]   & \ddots & \vdots   & \vdots   &        & \vdots \\
  \vdots &        & \ddots & \ddots & 0        & \vdots   &        & \vdots \\
  \vdots &        &        & 0      & V[j-1]   & 0        & \cdots & 0 \\
  \vdots &        &        &        & 0        & V[j+1]   & \ddots & \vdots \\
  \vdots &        &        &        &          & \ddots   & \ddots & 0 \\
  0      & \cdots & \cdots & \cdots & \cdots   & \cdots   & 0      & V[n]
  \end{pmatrix}
The determinant of this matrix is simply the product of its diagonal
elements. Each of the j − 2 permutations required to bring the matrix
to the upper triangular form introduces one factor (−1). Thus,
\det Z^{(j,1)} = (-1)^{j-2}\, \mathrm{Cov}[A^1, A^j] \prod_{\substack{k=2 \\ k \neq j}}^{n} \mathrm{Var}[A^k] .          (B.7)
The contribution of each $j = 2, ..., n$ to the Laplace decomposition of $\det Z^{A^1:\dots:A^n}$ from (B.4) becomes

d_j = \underbrace{(-1)^{j+1} (-1)^{j-2}}_{= -1}\, \mathrm{Cov}[A^j, A^1]\, \mathrm{Cov}[A^1, A^j] \prod_{\substack{k=2 \\ k \neq j}}^{n} \mathrm{Var}[A^k] .          (B.8)

Taking (B.6) and (B.8) together, the determinant of $Z^{A^1:\dots:A^n}$ reads

\det Z^{A^1:\dots:A^n} = \sum_{j=1}^{n} d_j = (m-1) \prod_{i=1}^{n} \mathrm{Var}[A^i] - \sum_{j=2}^{n} \left[ \mathrm{Cov}[A^1, A^j] \right]^2 \prod_{\substack{k=2 \\ k \neq j}}^{n} \mathrm{Var}[A^k] .          (B.9)
Thus, the requirement $\det Z^{A^1:\dots:A^n} \ge 0$ becomes

\sum_{j=2}^{n} \left[ \mathrm{Cov}[A^1, A^j] \right]^2 \prod_{\substack{k=2 \\ k \neq j}}^{n} \mathrm{Var}[A^k] \;\le\; (m-1) \prod_{i=1}^{n} \mathrm{Var}[A^i] .          (B.10)

In the case that $\mathrm{Var}[A^i] \neq 0$ for all observables $A^i$, this inequality can be divided by $\prod_{i=1}^{n} \mathrm{Var}[A^i]$, yielding
\sum_{j=2}^{n} \frac{\left| \mathrm{Cov}[A^1, A^j] \right|^2}{\mathrm{Var}[A^1]\, \mathrm{Var}[A^j]} \le (m-1)
\quad \Rightarrow \quad
\sum_{j=2}^{n} \left[ \mathrm{Corr}[A^1, A^j] \right]^2 \le (m-1) .          (B.11)
In this way, we obtain the inequality for correlation coefficients initially
introduced for illustrative purposes in Subsection 4.3.3.
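As a numerical illustration of Corollary 4.2, the following sketch (assuming numpy; the toy model and all variable names are illustrative assumptions, not taken from the thesis) samples data from a simple hidden common ancestor model with pairwise ancestors ($m = 2$), builds the matrix from (B.3) and evaluates the correlation form (B.11):

```python
import numpy as np

# Illustrative check of Corollary 4.2: three observables with pairwise hidden
# common ancestors (degree m = 2), all ancestors uniform and binary.
rng = np.random.default_rng(2)
m, n_samples = 2, 10_000
lam12, lam13, lam23 = rng.integers(0, 2, size=(3, n_samples))
A1 = lam12 + lam13
A2 = lam12 + lam23
A3 = lam13 + lam23

data = np.vstack([A1, A2, A3])
C = np.cov(data)                                     # sample covariance matrix

# Arrow matrix from (B.3): det Z >= 0 is equivalent to inequality (B.1).
V = np.diag(C)
Z = np.diag(np.concatenate(([(m - 1) * V[0]], V[1:])))
Z[0, 1:] = C[0, 1:]
Z[1:, 0] = C[1:, 0]
print(np.linalg.det(Z) >= 0)                         # True (up to sampling noise)

# Correlation form (B.11): sum_j Corr[A1, Aj]^2 <= m - 1.
corr = np.corrcoef(data)
print(sum(corr[0, j] ** 2 for j in (1, 2)), "<=", m - 1)   # roughly 0.5 <= 1
```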