Causal Data Mining
Richard Scheines
Dept. of Philosophy, Machine Learning, & Human-Computer Interaction
Carnegie Mellon
1
Causal Graphs
Causal Graph G = {V,E}
Each edge X → Y represents a direct causal claim:
X is a direct cause of Y relative to V
[Figure: Chicken Pox example, two causal graphs]
Exposure → Rash
Exposure → Infection → Rash
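The phrase "direct cause relative to V" can be illustrated with a small sketch. It assumes the chicken pox figure shows two graphs: Exposure → Rash over V = {Exposure, Rash}, and Exposure → Infection → Rash once Infection is added to V. The edge-set representation and the helper name `direct_causes` are illustrative choices, not part of the slides.

```python
# Each graph is a set of directed edges (cause, effect) over its own variable set V.
graph_coarse = {("Exposure", "Rash")}                            # V = {Exposure, Rash}
graph_fine = {("Exposure", "Infection"), ("Infection", "Rash")}  # V adds Infection

def direct_causes(edges, effect):
    """Return the variables that have an edge pointing into `effect`."""
    return {cause for cause, target in edges if target == effect}

# Exposure is a direct cause of Rash relative to the coarse V,
# but only an indirect cause once Infection is included.
print(direct_causes(graph_coarse, "Rash"))
print(direct_causes(graph_fine, "Rash"))
```

Whether an edge counts as "direct" therefore depends on which variables are in V, not only on the causal facts.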
2
Causal Bayes Networks
The Joint Distribution Factors According to the Causal Graph,
i.e., for all X in V:
P(V) = ∏_{X ∈ V} P(X | Immediate Causes of(X))
Causal Graph: Smoking [0,1] → Yellow Fingers [0,1], Smoking [0,1] → Lung Cancer [0,1]
P(S, YF, LC) = P(S) P(YF | S) P(LC | S)
P(S = 0) = .7
P(S = 1) = .3
P(YF = 0 | S = 0) = .99
P(YF = 1 | S = 0) = .01
P(YF = 0 | S = 1) = .20
P(YF = 1 | S = 1) = .80
P(LC = 0 | S = 0) = .95
P(LC = 1 | S = 0) = .05
P(LC = 0 | S = 1) = .80
P(LC = 1 | S = 1) = .20
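The factorization can be checked numerically. The sketch below builds the full joint distribution from the three tables on this slide; the dictionary layout and variable names are illustrative assumptions.

```python
from itertools import product

# Probability tables from the slide. Conditional tables are keyed by (child value, S value).
p_s = {0: 0.7, 1: 0.3}
p_yf_given_s = {(0, 0): 0.99, (1, 0): 0.01, (0, 1): 0.20, (1, 1): 0.80}
p_lc_given_s = {(0, 0): 0.95, (1, 0): 0.05, (0, 1): 0.80, (1, 1): 0.20}

def joint(s, yf, lc):
    """P(S, YF, LC) = P(S) P(YF | S) P(LC | S)."""
    return p_s[s] * p_yf_given_s[(yf, s)] * p_lc_given_s[(lc, s)]

# Enumerate all eight joint states.
table = {(s, yf, lc): joint(s, yf, lc) for s, yf, lc in product((0, 1), repeat=3)}

total = sum(table.values())                                     # should be 1
p_lc1 = sum(p for (s, yf, lc), p in table.items() if lc == 1)   # marginal P(LC = 1)
print(total, p_lc1)
```

Seven free parameters would be needed for an unrestricted joint over three binary variables; the causal graph reduces this to five.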
3
Structural Equation Models
Causal Graph: Education → Income, Education → Longevity
• Structural Equations:
One Equation for each variable V in the graph:
V = f(parents(V), error_V)
For SEM (linear regression), f is a linear function
• Statistical Constraints:
Joint Distribution over the Error terms
4
Structural Equation Models
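Causal Graph: Education → Income, Education → Longevity
Equations:
Education = ε_ed
Income = β₁ Education + ε_Income
Longevity = β₂ Education + ε_Longevity
SEM Graph (path diagram): Education → Income with coefficient β₁, Education → Longevity with coefficient β₂, and error terms ε_Income → Income, ε_Longevity → Longevity
Statistical Constraints:
(ε_ed, ε_Income, ε_Longevity) ~ N(0, Σ²)
- Σ² diagonal
- no variance is zero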
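A simulation makes the model concrete. The sketch below draws data from a linear SEM in which Education is a common cause of Income and Longevity, with independent Gaussian errors. The coefficient values (0.8 and 0.5), unit error variances, sample size, and seed are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta1, beta2 = 0.8, 0.5  # hypothetical path coefficients, not from the slides

# Independent errors: diagonal covariance, no variance equal to zero.
eps_ed = rng.normal(0.0, 1.0, n)
eps_income = rng.normal(0.0, 1.0, n)
eps_longevity = rng.normal(0.0, 1.0, n)

# One structural equation per variable.
education = eps_ed
income = beta1 * education + eps_income
longevity = beta2 * education + eps_longevity

# Regression slopes recover the path coefficients.
beta1_hat = np.cov(education, income)[0, 1] / np.var(education, ddof=1)
beta2_hat = np.cov(education, longevity)[0, 1] / np.var(education, ddof=1)

# Income and Longevity are correlated, but not once Education is regressed out.
raw_corr = np.corrcoef(income, longevity)[0, 1]
resid_corr = np.corrcoef(income - beta1_hat * education,
                         longevity - beta2_hat * education)[0, 1]
print(beta1_hat, beta2_hat, raw_corr, resid_corr)
```

The vanishing residual correlation is the kind of statistical constraint that the graph implies and that search procedures test for.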
5
Tetrad 4: Demo
www.phil.cmu.edu/projects/tetrad
6
Causal Datamining in Ed. Research
1. Collect Raw Data
2. Build Meaningful Variables
3. Constrain Model Space with Background Knowledge
4. Search for Models
5. Estimate and Test
6. Interpret
7
CSR Online
Are Online students learning as much?
What features of online behavior matter?
8
CSR Online
Are Online students learning as much?
Raw Data: Pitt 2001, 87 students
For everyone: Pre-test, Recitation attendance, Final exam
For Online Students (logged): Voluntary question attempts, online quizzes, requests to print modules
9
CSR Online
Build Meaningful Variables:
1. Online [0,1]
2. Pre-test [%]
3. Recitation Attendance [%]
4. Final Exam [%]
10
CSR Online
Data: Correlation Matrix (corrs.dat, N=83)
          Pre     Online   Rec     Final
Pre       1.0
Online    .023    1.0
Rec      -.004   -.255     1.0
Final     .287    .182     .297    1.0
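Constraint-based searches such as PC work from tests of conditional independence, which for correlation data amount to partial correlations. As a rough sketch, the code below rebuilds the full symmetric matrix from the lower triangle above and computes the partial correlation of Online and Final given Rec with the standard first-order formula. The helper name `partial_corr` is illustrative, and no significance test is performed.

```python
import numpy as np

names = ["Pre", "Online", "Rec", "Final"]
lower = [
    [1.0],
    [0.023, 1.0],
    [-0.004, -0.255, 1.0],
    [0.287, 0.182, 0.297, 1.0],
]

# Fill both triangles from the lower-triangular rows.
corr = np.eye(len(names))
for i, row in enumerate(lower):
    for j, r in enumerate(row):
        corr[i, j] = corr[j, i] = r

def partial_corr(c, i, j, k):
    """First-order partial correlation of variables i and j given k."""
    r_ij, r_ik, r_jk = c[i, j], c[i, k], c[j, k]
    return (r_ij - r_ik * r_jk) / np.sqrt((1 - r_ik**2) * (1 - r_jk**2))

online, rec, final = (names.index(v) for v in ("Online", "Rec", "Final"))
r_online_final_given_rec = partial_corr(corr, online, final, rec)
print(round(r_online_final_given_rec, 3))
```

Conditioning on recitation attendance raises the Online and Final association above its raw value of .182, because Online is negatively correlated with attendance while attendance is positively correlated with the final exam.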
11
CSR Online
Background Knowledge:
Temporal Tiers:
1. Online, Pre
2. Rec
3. Final
12
CSR Online
Model Search:
No latents (patterns – with PC or GES)
- no time order: 729 models
- temporal tiers: 96 models
With latents (PAGs – with FCI search)
- no time order: 4,096 models
- temporal tiers: 2,916 models
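Three of these counts can be reproduced with a simple per-pair tally, which is offered here as one plausible reading rather than the slide's stated method. The assumption is that each of the 6 variable pairs independently takes one of a fixed set of edge states (no edge, →, ←, and with latents also ↔), and that temporal tiers forbid edges pointing from a later tier into an earlier one. The function `count_models` is hypothetical. The 2,916 figure is not reproduced by this scheme, so it is left out.

```python
from itertools import combinations

variables = ["Online", "Pre", "Rec", "Final"]
tier = {"Online": 1, "Pre": 1, "Rec": 2, "Final": 3}  # temporal tiers from the slides

def count_models(edge_states, use_tiers):
    """Multiply the number of allowed edge states over all variable pairs."""
    count = 1
    for a, b in combinations(variables, 2):
        allowed = 0
        for state in edge_states:
            # With tiers, an edge may not point from a later tier to an earlier one.
            if use_tiers and state == "a->b" and tier[a] > tier[b]:
                continue
            if use_tiers and state == "b->a" and tier[b] > tier[a]:
                continue
            allowed += 1
        count *= allowed
    return count

no_latent_states = ["none", "a->b", "b->a"]
latent_states = no_latent_states + ["a<->b"]

print(count_models(no_latent_states, use_tiers=False))  # 3**6
print(count_models(no_latent_states, use_tiers=True))   # 3 * 2**5
print(count_models(latent_states, use_tiers=False))     # 4**6
```

Under this reading, time order alone cuts the no-latent space from 729 to 96 candidates before any data are examined.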
13
Tetrad Demo
Online vs. Lecture
Data file: corrs.dat
14
Estimate and Test: Results
[Path diagram with estimated coefficients]
Pre-test (%) → Final Exam (%): .23
Online → Final Exam (%): 5.3
Online → Recitation Attendance (%): -10
Recitation Attendance (%) → Final Exam (%): .22
• Model fit excellent
• Online students attended 10% fewer recitations
• Each recitation gives an increase of 2% on the final exam
• Online students did 1/2 a standard deviation better than lecture students (p = .059)
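In a linear path model, the total effect of one variable on another is the direct coefficient plus the product of coefficients along each indirect path. The sketch below applies that rule to the estimates on this slide, assuming the coefficients attach to the edges as follows: Online → Final Exam = 5.3, Online → Recitation Attendance = -10, Recitation Attendance → Final Exam = .22. That edge assignment is inferred from the bullet points and is an assumption.

```python
# Estimated coefficients, with the edge assignment assumed as described above.
online_to_final = 5.3    # direct effect of Online on Final Exam (%)
online_to_rec = -10.0    # effect of Online on Recitation Attendance (%)
rec_to_final = 0.22      # effect of Recitation Attendance (%) on Final Exam (%)

# Indirect effect: Online lowers attendance, and lower attendance lowers the final.
indirect_effect = online_to_rec * rec_to_final

# Total effect = direct + indirect.
total_effect = online_to_final + indirect_effect
print(indirect_effect, total_effect)
```

On these assumptions the direct advantage of online delivery is partly offset by reduced recitation attendance.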
15