Causal Search Using Graphical Causal Models
Kevin D. Hoover, Duke University
Course Outline
LECTURE 1. Motivations and Basic Notions
LECTURE 2. Causal Search Algorithms
LECTURE 3. Applications
Course Website: http://econ.duke.edu/~kdh9 (click on Courses)
An Introduction to Graphical Causal Models
Lectures: Lecture 1; Lecture 2; Lecture 3
Annotated Bibliography
Tetrad: Tetrad Project Homepage; Tetrad IV Download Site
Bootgraph Software Download
Data Sets: Text and Excel versions available; documentation in the Excel versions. Text versions are a pure data matrix only, without headers or date array; the dimensions of the data matrix in the text version are given in parentheses (observations x variables).
Swanson-Granger Data: Text (216 x 4); Excel
U.S. M2 Data (10 variable): Text (181 x 10); Excel
U.S. M2 Data (11 variable): Text (182 x 11); Excel
Hoover-Jorda Data (U.S. Macro): Text (494 x 6); Excel
Causal Search Using Graphical Causal Models
Kevin D. Hoover, Duke University
Lecture 1: Motivations and Basic Notions
I. Motivations: Why do we care about causal order?
1. Regressions have an implicit causal direction.
Example:
Regression: LIQDEP = 15.13 - 0.12 FF, R² = 0.62
Algebraic transformation of the regression: FF = 126.08 - 8.33 LIQDEP
Reverse regression: FF = 81.95 - 5.30 LIQDEP, R² = 0.62
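The point can be checked by simulation. Below is a minimal sketch with fabricated stand-ins for the FF and LIQDEP series (the slide's actual data are not reproduced here): since the product of the two regression slopes equals R², the reverse regression coincides with the algebraic inversion only when R² = 1.

```python
# A minimal sketch (simulated stand-ins, not the slide's data): the reverse
# regression is not the algebraic inversion of the original regression.
import numpy as np

rng = np.random.default_rng(0)
n = 200
ff = rng.normal(10.0, 2.0, n)                       # stand-in for FF
liqdep = 15.0 - 0.12 * ff + rng.normal(0, 0.2, n)   # stand-in for LIQDEP

# Regression of LIQDEP on FF: slope = cov / var(FF).
b1 = np.cov(liqdep, ff)[0, 1] / np.var(ff, ddof=1)
inv_slope = 1.0 / b1               # slope implied by algebraic inversion

# Reverse regression of FF on LIQDEP.
b2 = np.cov(ff, liqdep)[0, 1] / np.var(liqdep, ddof=1)

r2 = np.corrcoef(liqdep, ff)[0, 1] ** 2
print(f"R^2 = {r2:.2f}")
print(f"inverted slope = {inv_slope:.2f}, reverse-regression slope = {b2:.2f}")
# Since b1 * b2 = R^2, the reverse slope equals R^2 times the inverted slope:
# the two coincide only when R^2 = 1.
```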
2. Regression on effects is misleading.
Example data-generating process:
X1 = N(0,1)
X2 = 5 X1 + N(0,1)
X3 = X2 + 0.1 N(0,1)
Causally oriented regression (t-statistics in parentheses):
X2 = 0.038 + 5.033 X1
      (0.08)  (67.1)
Regression on an effect:
X2 = 0.008 + 0.057 X1 + 0.99 X3
      (0.01)  (1.56)    (138.2)
A potential problem for PcGets or Autometrics.
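A sketch of this experiment (seed and sample size arbitrary): regressing X2 on its cause X1 recovers the structural coefficient of 5, while adding the effect X3 drives the coefficient on X1 toward zero, just as on the slide.

```python
# A sketch of the slide's experiment (seed and sample size arbitrary):
# conditioning on the effect X3 drives out the true cause X1.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(0, 1, n)
x2 = 5 * x1 + rng.normal(0, 1, n)
x3 = x2 + 0.1 * rng.normal(0, 1, n)

def ols(y, *regressors):
    # least-squares coefficients, intercept first
    X = np.column_stack([np.ones_like(y), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.round(beta, 3)

print("X2 on X1:       ", ols(x2, x1))      # roughly [0, 5]
print("X2 on X1 and X3:", ols(x2, x1, x3))  # coefficient on X1 collapses toward 0
```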
3. The causal order of structural vector autoregressions (SVARs) is critical to their conclusions.
[Figure: impulse responses of money to a one-standard-deviation consumption shock (0.33) over ten quarters, for two contemporaneous causal orderings of U.S. money, GDP, consumption, and investment: Choleski (M, Y, C, I) and Choleski (M, C, I, Y). Vertical axis in percent; horizontal axis in quarters.]

II. The Language of Graphs
1. SOURCES OF THE GRAPH-THEORETIC APPROACH TO CAUSAL MODELING:
· Spirtes, Glymour, and Scheines (2000), Causation, Prediction, and Search, 2nd edition.
· Pearl (2000), Causality: Models, Reasoning, and Inference.
Issues:
· Representation of the data probabilistically.
· Representation of causal asymmetries graphically.
· The relationship between the two representations.
2. BASIC ELEMENTS OF GRAPHS:
· Nodes (or vertices) represent variables;
· Edges represent causal relationships;
· Arrowheads represent causal asymmetry.
· A path (or trek) is a sequence of edges starting at one node and ending at another.
· A directed path is a path in which the arrowheads align in a continuous direction.
A Graph
[Figure: a graph on nodes A, B, C, D, E illustrating a directed edge, an undirected edge, and a bidirected edge.]
BAC is a path. DAE is a directed path.
· Directed edge: B → A, read as "B causes A";
· Bidirected edge: C ↔ D, read as "C causes D and D causes C" or "C and D are mutual causes (or simultaneous causes)". NB: bidirected edges are sometimes given other functions in the literature, so that mutual causation is sometimes represented instead by a pair of directed edges, C → D and C ← D.
· Undirected edge: C - A, read as "the causal direction between C and A is undetermined."
· In the literature, edges may also have other endpoints (e.g., circles, as in o→ and o-o) representing relations not used in this course.
Skeleton = the pattern of edges irrespective of orientation.
[Figure: three differently oriented graphs on nodes A, B, C, D that share the same skeleton.]
Patterns
[Figure: three graphs on nodes A, B, C illustrating an acyclical graph, a cyclical graph, and a simultaneous graph.]
3. RELATIONSHIP OF GRAPHS TO EQUATIONS.
Example:
V = e_V
W = a_WX X + e_W
X = a_XV V + e_X
Z = a_ZX X + e_Z
where each e_j ~ independent N(0, σ_j²).
[Graph: V → X; X → W; X → Z.]
Variables are regarded as stochastic, with the random element an error term. The error terms could be included in the graph; for example:
[Figure: the same graph with e_V → V, e_W → W, e_X → X, and e_Z → Z added.]
Independence implies only one edge from any error term; the error terms are therefore typically omitted.
Same system in matrix notation. Let

Y = \begin{bmatrix} V \\ W \\ X \\ Z \end{bmatrix}, \qquad E = \begin{bmatrix} e_V \\ e_W \\ e_X \\ e_Z \end{bmatrix},

such that the covariance matrix \Omega = E(EE') is diagonal. Then AY = E,
AY = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & -a_{WX} & 0 \\ -a_{XV} & 0 & 1 & 0 \\ 0 & 0 & -a_{ZX} & 1 \end{bmatrix} \begin{bmatrix} V \\ W \\ X \\ Z \end{bmatrix} = \begin{bmatrix} e_V \\ e_W \\ e_X \\ e_Z \end{bmatrix} = E

The causal ordering among the variables in Y is defined by the zero restrictions on A.
[Graph: V → X; X → W; X → Z.]
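A small simulation sketch of this system, with arbitrary illustrative values for the coefficients a_WX, a_XV, and a_ZX (the slide leaves them symbolic): the zeros in A encode the causal ordering, and solving AY = E generates data consistent with the graph.

```python
# A simulation sketch of the four-variable system; the coefficient values
# are arbitrary illustrations (the slide leaves them symbolic).
import numpy as np

rng = np.random.default_rng(0)
a_wx, a_xv, a_zx = 0.8, 0.5, -0.7

# Variable order (V, W, X, Z); the zeros in A encode the causal ordering.
A = np.array([
    [1.0,    0.0,  0.0,   0.0],   # V = e_V
    [0.0,    1.0, -a_wx,  0.0],   # W - a_WX X = e_W
    [-a_xv,  0.0,  1.0,   0.0],   # X - a_XV V = e_X
    [0.0,    0.0, -a_zx,  1.0],   # Z - a_ZX X = e_Z
])
E = rng.normal(0, 1, size=(4, 100_000))   # independent errors: Omega diagonal
Y = np.linalg.solve(A, E)                 # Y = A^{-1} E

# All four variables are intercorrelated, every connection running through X.
print(np.round(np.corrcoef(Y), 2))
```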
III. Graphs and Probability
1. LIMITATIONS OF THIS COURSE.
· Acyclical graphs only.
· Causal sufficiency. A graph is causally sufficient when no variable not included in the graph is a cause of more than one variable included in the graph. Causal sufficiency rules out latent variables and correlation among random error terms.
2. IDENTIFICATION. Two concepts of causal identification:
· Theoretical = actual causal structure that generated the data is recoverable (an important goal);
· Uniqueness = data correspond to a unique causal representation (failure of which is an important obstacle).
Folk theorem: economic data cannot be identified in the sense of uniqueness without the imposition of a causal structure selected by a priori theory, that is, without a model identified in the theoretical sense. A priori theoretical identification is a necessary and sufficient condition for identification in the sense of uniqueness.
Corollary: tests of overidentifying restrictions are conditional on a priori identification.
The folk theorem and its corollary are false.
Proof by counterexample. A simple data-generating process:
A = e_A
B = e_B
C = a_CA A + a_CB B + e_C
[Graph: A → C ← B.]
In matrix form,

AY = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -a_{CA} & -a_{CB} & 1 \end{bmatrix} \begin{bmatrix} A \\ B \\ C \end{bmatrix} = \begin{bmatrix} e_A \\ e_B \\ e_C \end{bmatrix} = E

and

\Omega = E(EE') = E\begin{bmatrix} e_A e_A & e_A e_B & e_A e_C \\ e_B e_A & e_B e_B & e_B e_C \\ e_C e_A & e_C e_B & e_C e_C \end{bmatrix} = \begin{bmatrix} \sigma_{AA} & 0 & 0 \\ 0 & \sigma_{BB} & 0 \\ 0 & 0 & \sigma_{CC} \end{bmatrix}
If the form of the system (and its graph) were known, it could easily be estimated. The identification problem is that we do not know it unless theory tells us a priori.
Key question: can the DGP be recovered from the data alone?
· The folk theorem says NO.
· The real answer is SOMETIMES: that is the point of the course.
The reduced form describes the probability distribution of the data:

A^{-1}AY = Y = A^{-1}E = U

Y = \begin{bmatrix} A \\ B \\ C \end{bmatrix} = \begin{bmatrix} u_A \\ u_B \\ u_C \end{bmatrix} = U
and

\Sigma = E(UU') = E\begin{bmatrix} u_A u_A & u_A u_B & u_A u_C \\ u_B u_A & u_B u_B & u_B u_C \\ u_C u_A & u_C u_B & u_C u_C \end{bmatrix} = \begin{bmatrix} \omega_{AA} & \omega_{AB} & \omega_{AC} \\ \omega_{BA} & \omega_{BB} & \omega_{BC} \\ \omega_{CA} & \omega_{CB} & \omega_{CC} \end{bmatrix}
\Sigma is easily estimated. Identification requires a diagonal covariance matrix. Many matrices P exist such that P^{-1}Y = P^{-1}U = \Psi, where

\Lambda = E(\Psi\Psi') = \begin{bmatrix} \psi_{AA} & 0 & 0 \\ 0 & \psi_{BB} & 0 \\ 0 & 0 & \psi_{CC} \end{bmatrix}
· symmetry implies that P^{-1} imposes 3 restrictions on \Lambda;
· therefore, P^{-1} has at most 3 degrees of freedom if it is just-identified in the sense of uniqueness;
· or, equivalently, 3 restrictions.
· In general, if Y contains n variables, requiring the covariance matrix of the transformed reduced form to be diagonal imposes n(n-1)/2 restrictions, and the matrix multiplying the variables imposes another n(n-1)/2, for a total of n(n-1) restrictions to achieve identification.
· For a given ordering of the variables in Y (here ABC), there exists a unique lower-triangular matrix P such that \Lambda = E(P^{-1}U(P^{-1}U)').
· P is called the Choleski matrix, and a transformation using P^{-1} to achieve orthogonal error terms (i.e., \Lambda diagonal) is a Choleski transformation (or decomposition).
· Every just-identified system can be expressed in a Choleski ordering in which P^{-1} is lower triangular.
· There is one Choleski ordering for each ordering of the variables in Y; here 3! = 6 orderings; in general, n! orderings.
· Starting with \Sigma = E(UU'), finding a Choleski ordering is just a matter of calculation.
· All just-identified systems of the same variables (i.e., all Choleski orderings) have the same likelihood, as the sketch below checks numerically.
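A numerical sketch of that claim, using simulated data with an arbitrary covariance structure: for every ordering, the Choleski factor of the (permuted) estimated covariance matrix reproduces that matrix exactly, so every ordering attains the same Gaussian log likelihood.

```python
# A numerical sketch: every Choleski ordering fits the estimated covariance
# matrix exactly, so all orderings share one Gaussian log likelihood.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
U = rng.multivariate_normal(
    [0, 0, 0], [[1, 0.0, 0.6], [0.0, 1, 0.6], [0.6, 0.6, 1]], n)
S = np.cov(U, rowvar=False, bias=True)          # estimated Sigma

def gauss_loglik(S_model, S_sample, n):
    k = S_sample.shape[0]
    return -0.5 * n * (k * np.log(2 * np.pi)
                       + np.log(np.linalg.det(S_model))
                       + np.trace(np.linalg.solve(S_model, S_sample)))

for perm in itertools.permutations(range(3)):
    P = np.linalg.cholesky(S[np.ix_(perm, perm)])  # lower-triangular factor
    S_implied = P @ P.T                            # reproduces permuted Sigma
    inv = np.argsort(perm)                         # undo the permutation
    ll = gauss_loglik(S_implied[np.ix_(inv, inv)], S, n)
    print(perm, f"log likelihood = {ll:.2f}")
# All 3! = 6 orderings print the same number.
```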
Consider a simulation based on the data-generating process

AY = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -1/3 & -1/3 & 1 \end{bmatrix} \begin{bmatrix} A \\ B \\ C \end{bmatrix} = \begin{bmatrix} e_A \\ e_B \\ e_C \end{bmatrix} = E,

where e_A, e_B ~ independent N(0,1) and e_C ~ independent N(0, 1/9). Consider the DGP unobservable.
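A minimal sketch of the simulation (seed and sample size arbitrary): generating data from this DGP and computing the sample correlation matrix reproduces the pattern reported on the next slide, including the telltale zero correlation between A and B.

```python
# A minimal sketch of the simulation (seed and sample size arbitrary).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
a = rng.normal(0, 1, n)                       # e_A ~ N(0, 1)
b = rng.normal(0, 1, n)                       # e_B ~ N(0, 1)
c = a / 3 + b / 3 + rng.normal(0, 1 / 3, n)   # e_C ~ N(0, 1/9)

print(np.round(np.corrcoef([a, b, c]), 2))
# Roughly: corr(A, B) = 0 while corr(A, C) and corr(B, C) are about 0.58,
# the pattern reported on the next slide.
```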
The reduced form, however, can be estimated:

Y = \begin{bmatrix} A \\ B \\ C \end{bmatrix} = \begin{bmatrix} -0.05 \\ 0.03 \\ 0.05 \end{bmatrix} + \begin{bmatrix} \hat{u}_A \\ \hat{u}_B \\ \hat{u}_C \end{bmatrix} = \hat{U},

where each of the constants is statistically insignificant, and

\hat{\Sigma} = E(\hat{U}\hat{U}') = \begin{bmatrix} 1 & 0.01 & 0.58 \\ 0.01 & 1 & 0.59 \\ 0.58 & 0.59 & 1 \end{bmatrix},

standardized to be the correlation matrix.
Alternative Choleski Orderings (rows ordered A, B, C; every ordering attains log likelihood = -558):

P^{-1} for order ABC = \begin{bmatrix} 1 & 0 & 0 \\ -0.012 & 1 & 0 \\ -0.33 & -0.36 & 1 \end{bmatrix} \qquad P^{-1} for order ACB = \begin{bmatrix} 1 & 0 & 0 \\ 0.47 & 1 & -1.43 \\ -0.34 & 0 & 1 \end{bmatrix}

P^{-1} for order BAC = \begin{bmatrix} 1 & 0.01 & 0 \\ 0 & 1 & 0 \\ -0.33 & -0.36 & 1 \end{bmatrix} \qquad P^{-1} for order BCA = \begin{bmatrix} 1 & 0.53 & 1.50 \\ 0 & 1 & 0 \\ 0 & -0.36 & 1 \end{bmatrix}

P^{-1} for order CAB = \begin{bmatrix} 1 & 0 & -0.99 \\ 0.47 & 1 & -1.43 \\ 0 & 0 & 1 \end{bmatrix} \qquad P^{-1} for order CBA = \begin{bmatrix} 1 & 0.53 & -1.50 \\ 0 & 1 & -0.96 \\ 0 & 0 & 1 \end{bmatrix}

· The folk theorem appears correct: quantitatively very different orderings are indistinguishable on the basis of likelihood.
· But, not all the information in the likelihood function has been exploited.
· Consider the correlation matrix of the reduced form:

\begin{bmatrix} 1 & 0.01 & 0.58 \\ 0.01 & 1 & 0.59 \\ 0.58 & 0.59 & 1 \end{bmatrix}
· The correlation between A and B is statistically zero.
That correlation could not be zero if the graph of the DGP were the chain A → B → C or any other ordering in a line. Nor could it be the fork A ← B → C or any other ordering with a fork.
The only ordering consistent with the zero correlation is A → C ← B. And that is the graph of the DGP!
Its specification:

\hat{A}Y = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -0.33 & -0.36 & 1 \end{bmatrix} \begin{bmatrix} A \\ B \\ C \end{bmatrix} = \begin{bmatrix} \hat{e}_A \\ \hat{e}_B \\ \hat{e}_C \end{bmatrix} = \hat{E}

\hat{\Omega} = E(\hat{E}\hat{E}') = \begin{bmatrix} 0.99 & 0 & 0 \\ 0 & 0.94 & 0 \\ 0 & 0 & 0.33 \end{bmatrix}
· Contrary to the folk theorem, the graph of the DGP is identified from the data alone, without first positing a just-identified model and then testing the overidentifying restriction.
· In fact, the DGP is not even nested in the Choleski orderings ACB, BCA, CAB, or CBA.
· The DGP is nested in both ABC and BAC.
· In a sense, overidentification precedes identification.
· Still, we can test the overidentified model: against the null of the ABC Choleski ordering, p = 0.87.
3. PRINCIPLES OF SEARCH.
· The model in the example was identified from unconditional independence information alone.
· To generalize beyond the 3-variable case, consider conditional independence: A is independent of B conditional on Z iff P(A, B|Z) = P(A|Z)P(B|Z). The sketch below gives the standard test of this condition for Gaussian data.
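A minimal sketch of that test for Gaussian (linear) data, of the kind standard search algorithms use: residualize x and y on the conditioning set, compute the partial correlation, and convert it to a p-value with Fisher's z transform. The function name and interface are illustrative, not from any particular package.

```python
# A sketch of the standard Gaussian conditional-independence test
# (illustrative interface, not from any particular package).
import numpy as np
from scipy import stats

def ci_test(x, y, z=None):
    """p-value for H0: x is independent of y conditional on z."""
    if z is None:
        r = np.corrcoef(x, y)[0, 1]
        k = 0
    else:
        Z = np.column_stack([np.ones_like(x), z])
        rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]  # residualize x on z
        ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]  # residualize y on z
        r = np.corrcoef(rx, ry)[0, 1]
        k = Z.shape[1] - 1                                 # conditioning set size
    stat = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(x) - k - 3)
    return 2 * (1 - stats.norm.cdf(abs(stat)))             # Fisher's z test

# Usage: a small p-value rejects the hypothesis of conditional independence.
```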
Patterns of Dependence and Independence

Case 1: Screen (A → B → C)
A and C are probabilistically dependent (correlated), but conditional on B they are independent: B screens off the influence of A on C.

Case 2: Common Cause (A ← C → B)
A and B are probabilistically dependent (correlated), but conditional on C they are independent: C is a common cause of A and B, and C screens off the correlation between A and B.
Case 3: Unshielded Collider (A → C ← B)
A and B are probabilistically independent (uncorrelated), but conditional on C they are dependent: C is a common effect of A and B (an unshielded collider on the path ACB), and conditioning on C induces a correlation between A and B.
· Illustration: let A = the state of charge of a car's battery (charged/flat); B = the position of the car's starter switch (off/on); and C = the state of the car's operation (starts/doesn't start). A and B may be completely independent. Yet, if we know that the car doesn't start, then knowing that the switch is on raises the probability that the battery is flat. A simulation of all three cases follows.
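A simulation sketch of the three cases (coefficients and sample size arbitrary). For each pattern it prints the raw correlation of the two end variables and their partial correlation given the third variable: conditioning removes the dependence in the screen and common-cause cases but creates it in the collider case.

```python
# A simulation sketch of the three patterns (coefficients arbitrary): the
# raw correlation of the end variables versus their partial correlation
# given the middle variable.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

def partial_corr(x, y, z):
    Z = np.column_stack([np.ones(len(z)), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Case 1 (screen): A -> B -> C
a = rng.normal(size=n); b = a + rng.normal(size=n); c = b + rng.normal(size=n)
print("screen:  ", round(np.corrcoef(a, c)[0, 1], 2), round(partial_corr(a, c, b), 2))

# Case 2 (common cause): A <- C -> B
c = rng.normal(size=n); a = c + rng.normal(size=n); b = c + rng.normal(size=n)
print("fork:    ", round(np.corrcoef(a, b)[0, 1], 2), round(partial_corr(a, b, c), 2))

# Case 3 (collider): A -> C <- B
a = rng.normal(size=n); b = rng.normal(size=n); c = a + b + rng.normal(size=n)
print("collider:", round(np.corrcoef(a, b)[0, 1], 2), round(partial_corr(a, b, c), 2))
# Conditioning removes the dependence in the first two cases but creates it
# in the third.
```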
· Search algorithms construct graphs from data by exploiting patterns of unconditional and conditional independence based on these basic forms.
· Key idea, Reichenbach's (1956) Principle of the Common Cause: if any two variables A and B are truly correlated, then either A causes B (A → B), or B causes A (A ← B), or they have a common cause.
· Generalization: the Causal Markov Condition.
· d-separation: a set Z d-separates X from Y (where X and Y may be variables or sets of variables) iff Z blocks every path from a node in X to a node in Y, either with a screen or common cause that is in Z or with an unshielded collider that is not in Z.
[Figure: a directed graph on nodes A, B, C, D, E, F.]
C d-separates A and E; {C} is the sepset for A and E.
[Figure: the same graph.]
A and B d-separate D and C; {A, B} is the sepset for D and C.
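The figure itself is not fully recoverable from this transcript, so the sketch below uses one DAG consistent with everything the slides assert about it (A and B are the parents of C; D's descendants are B, C, E, and F; C d-separates A and E; {A, B} d-separates D and C). It checks the two sepset claims with networkx, which provides a d-separation test (nx.d_separated in versions 2.8-3.2; renamed nx.is_d_separator in 3.3).

```python
# One DAG consistent with the slides' claims about the figure (the exact
# figure is not recoverable from this transcript). Requires networkx;
# nx.d_separated is the 2.8-3.2 name, renamed nx.is_d_separator in 3.3.
import networkx as nx

G = nx.DiGraph([("D", "B"), ("A", "C"), ("B", "C"), ("C", "E"), ("E", "F")])

print(nx.d_separated(G, {"A"}, {"E"}, {"C"}))       # True: {C} is a sepset
print(nx.d_separated(G, {"D"}, {"C"}, {"A", "B"}))  # True: {A, B} is a sepset
print(nx.d_separated(G, {"A"}, {"E"}, set()))       # False: A -> C -> E is open
```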
Causal Markov Condition
[Figure: the same graph.]
A and B are parents of C. C is a child (or daughter) of A and B. [NB: children may have only one parent or many: the modern family.]
[Figure: the same graph.]
D is an ancestor of B, C, E, and F. B, C, E, and F are descendants of D.
Causal Markov Condition: a variable in a graph is, conditional on its parents, probabilistically independent of all other variables that are neither its parents nor its descendants.
· Essentially, the Causal Markov Condition holds when a graph corresponds to the conditional independence relationships in the associated probability distribution.
· A graph and a probability distribution are faithful to one another if, and only if, every conditional independence relation found in the probability distribution is implied by the Causal Markov Condition applied to the graph; that is, the distribution displays no independencies beyond those entailed by the graph.
Failure of Faithfulness
Suppose that it just happens that d = -a/b in the system
X = e_X
Y = aX + bZ + e_Y
Z = dX + e_Z.
Then
Y = aX + b(-aX/b + e_Z) + e_Y = be_Z + e_Y,
so X and Y are uncorrelated!
[Graph: X → Y; X → Z; Z → Y.]
But, say Spirtes et al. and Pearl, such cases are rare (Lebesgue measure zero).
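A simulation sketch of the cancellation, with arbitrary values a = 1 and b = 2: setting d = -a/b makes the direct channel X → Y and the indirect channel X → Z → Y offset exactly, so the sample correlation of X and Y is statistically zero even though both edges are present.

```python
# A sketch of the cancellation with arbitrary a = 1, b = 2 and d = -a/b.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
a_coef, b_coef = 1.0, 2.0
d_coef = -a_coef / b_coef                  # the knife-edge parameter value

x = rng.normal(size=n)
z = d_coef * x + rng.normal(size=n)
y = a_coef * x + b_coef * z + rng.normal(size=n)

print(round(np.corrcoef(x, y)[0, 1], 3))   # about 0: unfaithful to the graph
print(round(np.corrcoef(x, z)[0, 1], 3))   # clearly nonzero: the edge is real
```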
The same system:
X = e_X
Y = aX + bZ + e_Y
Z = dX + e_Z
Interpretation 1: X = the actions of waves on a ship; Y = the course of the ship; Z = the position of the helm.
Interpretation 2: X = the drivers of the business cycle; Y = GDP; Z = the policymaker's instrument.
[Graph: X → Y; X → Z; Z → Y.]
Optimal control is ubiquitous in economics. Such cancellations are not rare (hardly Lebesgue-measure-zero situations).
Equivalent Graphs
Observational Equivalence Theorem (Pearl): any probability distribution that can be faithfully represented in a causally sufficient, acyclical graph can equally well be represented by another acyclical graph that has the same skeleton and the same unshielded colliders.
[Figure: two graphs on nodes A, B, C, D, E, F with the same skeleton and the same unshielded colliders (C on ACE and C on BCE): an equivalent graph.]
[Figure: the same graph beside a nonequivalent graph that removes the two unshielded colliders.]
[Figure: the same graph beside a nonequivalent graph that adds an unshielded collider (B on ABD).]
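A sketch of the equivalence check the theorem suggests: compare two DAGs on their skeletons and their sets of unshielded colliders. The helper functions are illustrative, not from any package; the example graphs are the three-variable cases reexamined on the next slides.

```python
# A sketch of the observational-equivalence check: same skeleton plus same
# unshielded colliders (helper functions illustrative, not from a package).
import networkx as nx

def skeleton(G):
    # the undirected edge set of the DAG
    return {frozenset(e) for e in G.edges()}

def unshielded_colliders(G):
    out = set()
    for c in G.nodes:
        parents = list(G.predecessors(c))
        for i, a in enumerate(parents):
            for b in parents[i + 1:]:
                # a -> c <- b is unshielded if a and b are not adjacent
                if not (G.has_edge(a, b) or G.has_edge(b, a)):
                    out.add((frozenset({a, b}), c))
    return out

def markov_equivalent(G1, G2):
    return (set(G1.nodes) == set(G2.nodes)
            and skeleton(G1) == skeleton(G2)
            and unshielded_colliders(G1) == unshielded_colliders(G2))

chain = nx.DiGraph([("A", "B"), ("B", "C")])
fork = nx.DiGraph([("B", "A"), ("B", "C")])
collider = nx.DiGraph([("A", "B"), ("C", "B")])

print(markov_equivalent(chain, fork))      # True: same skeleton, no colliders
print(markov_equivalent(chain, collider))  # False: the collider at B differs
```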
Earlier Cases Reexamined
All two-variable systems are observationally equivalent: A → B and A ← B. No unshielded colliders.
All three-variable systems, except the one with a common effect, are observationally equivalent:
Chains: A → B → C and A ← B ← C
Common cause: A ← B → C
No unshielded colliders.
Common effect: A → B ← C. One unshielded collider.
End of Lecture 1