Causal Data Mining
Richard Scheines
Dept. of Philosophy, Machine Learning, & Human-Computer Interaction
Carnegie Mellon
1
Causal Graphs
Causal Graph G = {V,E}
Each edge X → Y represents a direct causal claim:
X is a direct cause of Y relative to V
[Figure: chicken pox example with two causal graphs, Exposure → Rash and Exposure → Infection → Rash]
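The phrase "relative to V" can be made concrete with a small sketch. It assumes the chicken pox example contrasts a fine-grained graph (Exposure → Infection → Rash) with a coarser one (Exposure → Rash). The helper name `direct_causes_relative_to` is hypothetical: it treats X as a direct cause of Y relative to a variable set when some directed path from X to Y passes only through omitted variables. It ignores confounding created by omitted common causes, so it is an illustration, not a general marginalization procedure.

```python
def direct_causes_relative_to(edges, keep):
    """Return the direct-cause edges among `keep` implied by the full graph.

    edges: set of (cause, effect) pairs in the full causal graph.
    keep:  the variable set V that the causal claims are relative to.
    """
    children = {}
    for cause, effect in edges:
        children.setdefault(cause, set()).add(effect)

    result = set()
    for x in keep:
        # Depth-first search from x that only continues through omitted variables.
        stack = list(children.get(x, ()))
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in keep:
                if node != x:
                    result.add((x, node))
            else:
                stack.extend(children.get(node, ()))
    return result


full_graph = {("Exposure", "Infection"), ("Infection", "Rash")}

# Relative to all three variables, Exposure is only an indirect cause of Rash.
fine = direct_causes_relative_to(full_graph, {"Exposure", "Infection", "Rash"})

# Relative to {Exposure, Rash}, Exposure becomes a direct cause of Rash.
coarse = direct_causes_relative_to(full_graph, {"Exposure", "Rash"})
```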
2
Causal Bayes Networks
The Joint Distribution Factors According to the Causal Graph,
i.e., for all X in V:
P(V) = ∏_{X ∈ V} P(X | Immediate Causes of(X))

Causal Graph: Smoking [0,1] → Yellow Fingers [0,1]; Smoking [0,1] → Lung Cancer [0,1]

P(S, YF, LC) = P(S) P(YF | S) P(LC | S)

P(S = 0) = .7
P(S = 1) = .3

P(YF = 0 | S = 0) = .99
P(YF = 1 | S = 0) = .01
P(YF = 0 | S = 1) = .20
P(YF = 1 | S = 1) = .80

P(LC = 0 | S = 0) = .95
P(LC = 1 | S = 0) = .05
P(LC = 0 | S = 1) = .80
P(LC = 1 | S = 1) = .20
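As a worked illustration of the factorization, the sketch below builds the full joint table from the probabilities on this slide and then compares P(LC = 1) with P(LC = 1 | YF = 1). The variable names are illustrative choices. The only structural assumption is the one the factorization itself states: YF and LC each depend on S alone.

```python
from itertools import product

# Conditional probability tables copied from the slide.
p_s = {0: 0.7, 1: 0.3}
p_yf_given_s = {0: {0: 0.99, 1: 0.01}, 1: {0: 0.20, 1: 0.80}}  # [s][yf]
p_lc_given_s = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.80, 1: 0.20}}  # [s][lc]


def joint(s, yf, lc):
    """P(S, YF, LC) = P(S) P(YF | S) P(LC | S)."""
    return p_s[s] * p_yf_given_s[s][yf] * p_lc_given_s[s][lc]


table = {(s, yf, lc): joint(s, yf, lc) for s, yf, lc in product((0, 1), repeat=3)}
total = sum(table.values())

# Marginal probability of lung cancer.
p_lc1 = sum(p for (s, yf, lc), p in table.items() if lc == 1)

# Probability of lung cancer among people with yellow fingers.
p_yf1 = sum(p for (s, yf, lc), p in table.items() if yf == 1)
p_lc1_and_yf1 = sum(p for (s, yf, lc), p in table.items() if lc == 1 and yf == 1)
p_lc1_given_yf1 = p_lc1_and_yf1 / p_yf1
```

Yellow fingers raise the probability of lung cancer in this table even though the graph has no edge between them, because both share Smoking as a cause.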
3
Structural Equation Models
Causal Graph: Education → Income, Education → Longevity
• Structural Equations:
One Equation for each variable V in the graph:
V = f(parents(V), error_V)
For a SEM (linear regression), f is a linear function
• Statistical Constraints:
Joint Distribution over the Error terms
4
Structural Equation Models
Causal Graph: Education → Income, Education → Longevity
Equations:
Education = ε_ed
Income = β1 Education + ε_Income
Longevity = β2 Education + ε_Longevity
SEM Graph (path diagram): Education → Income with coefficient β1, Education → Longevity with coefficient β2; error terms ε_Income → Income and ε_Longevity → Longevity
Statistical Constraints:
(ε_ed, ε_Income, ε_Longevity) ~ N(0, Σ²)
- Σ² diagonal
- no variance is zero
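A minimal simulation sketch of this kind of linear SEM follows, assuming Education is the only parent of Income and of Longevity and that the three error terms are independent standard normals (a diagonal covariance with no zero variance). The coefficient values 0.8 and 0.5 and the function names are illustrative assumptions, not values from the slides. Regressing each child on Education should roughly recover the coefficient used to generate it.

```python
import random


def simulate_sem(n, beta_income, beta_longevity, seed=0):
    """Draw n samples from the three-variable linear SEM."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        education = rng.gauss(0.0, 1.0)  # Education = eps_ed
        income = beta_income * education + rng.gauss(0.0, 1.0)
        longevity = beta_longevity * education + rng.gauss(0.0, 1.0)
        rows.append((education, income, longevity))
    return rows


def ols_slope(xs, ys):
    """Least-squares slope of y on x (single regressor, with intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rows = simulate_sem(20000, beta_income=0.8, beta_longevity=0.5)
education, income, longevity = zip(*rows)

slope_income = ols_slope(education, income)
slope_longevity = ols_slope(education, longevity)
```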
5
Tetrad 4: Demo
www.phil.cmu.edu/projects/tetrad
6
Causal Datamining in Ed. Research
1. Collect Raw Data
2. Build Meaningful Variables
3. Constrain Model Space with Background Knowledge
4. Search for Models
5. Estimate and Test
6. Interpret
7
CSR Online
Are Online students learning as much?
What features of online behavior matter?
8
CSR Online
Are Online students learning as much?
Raw Data : Pitt 2001, 87 students
For everyone:
Pre-test, Recitation attendance, final exam
For Online Students:
logged: Voluntary question attempts, online quizzes, requests to print modules
9
CSR Online
Build Meaningful Variables:
1. Online [0,1]
2. Pre-test [%]
3. Recitation Attendance [%]
4. Final Exam [%]
10
CSR Online
Data: Correlation Matrix
(corrs.dat, N=83)
         Pre     Online   Rec     Final
Pre      1.0
Online   .023    1.0
Rec      -.004   -.255    1.0
Final    .287    .182     .297    1.0
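Constraint-based searches such as PC work from (conditional) correlation information of this kind. As a hedged sketch, the code below stores the lower-triangular matrix from this slide and computes a first-order partial correlation with the standard formula. The helper `partial_corr` is a hypothetical name, and this is not a description of how Tetrad reads corrs.dat.

```python
from math import sqrt

names = ["Pre", "Online", "Rec", "Final"]

# Lower triangle of the correlation matrix, row by row, as on the slide.
lower = [
    [1.0],
    [0.023, 1.0],
    [-0.004, -0.255, 1.0],
    [0.287, 0.182, 0.297, 1.0],
]

# Expand into a symmetric lookup keyed by variable-name pairs.
corr = {}
for i, row in enumerate(lower):
    for j, r in enumerate(row):
        corr[(names[i], names[j])] = r
        corr[(names[j], names[i])] = r


def partial_corr(x, y, z):
    """Correlation of x and y after conditioning on a single variable z."""
    rxy, rxz, ryz = corr[(x, y)], corr[(x, z)], corr[(y, z)]
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Online vs. Final, controlling for recitation attendance.
online_final_given_rec = partial_corr("Online", "Final", "Rec")
```

Conditioning on Rec makes the Online–Final association larger than the raw .182, which fits online students attending fewer recitations.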
11
CSR Online
Background Knowledge:
Temporal Tiers:
1. Online, Pre
2. Rec
3. Final
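One way to read the tiers is as a list of forbidden edges: a variable in a later tier cannot cause a variable in an earlier one. The sketch below derives that list. The function name `forbidden_edges` is hypothetical, and the code illustrates the constraint itself rather than Tetrad's knowledge format.

```python
# Temporal tiers from the slide, earliest first.
tiers = [["Online", "Pre"], ["Rec"], ["Final"]]


def forbidden_edges(tiers):
    """Return every (cause, effect) pair that points backward in time."""
    forbidden = set()
    for later_index, later_tier in enumerate(tiers):
        for earlier_tier in tiers[:later_index]:
            for cause in later_tier:
                for effect in earlier_tier:
                    forbidden.add((cause, effect))
    return forbidden


forbidden = forbidden_edges(tiers)
```

Edges within a tier (Online and Pre) stay unconstrained in either direction under this reading.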
12
CSR Online
Model Search:
No latents (patterns – with PC or GES)
- no time order: 729 models
- temporal tiers: 96 models
With latents (PAGs – with FCI search)
- no time order: 4,096 models
- temporal tiers: 2,916 models
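A simple counting scheme reproduces three of these figures. It is a guess at how the numbers arise, not something the slide states. It treats each of the 6 variable pairs independently and multiplies the number of allowed relations per pair: 3 without latents (no edge, →, ←), 4 with latents (adding ↔), and 2 for a cross-tier pair without latents (no edge or the forward edge). It does not check acyclicity, and it is not applied to the with-latents tiered case, whose count of 2,916 is left as given.

```python
from itertools import combinations

variables = ["Online", "Pre", "Rec", "Final"]
tier_of = {"Online": 1, "Pre": 1, "Rec": 2, "Final": 3}


def count_graphs(options_same_tier, options_cross_tier):
    """Multiply the number of allowed relations over all variable pairs."""
    total = 1
    for a, b in combinations(variables, 2):
        if tier_of[a] == tier_of[b]:
            total *= options_same_tier
        else:
            total *= options_cross_tier
    return total


# No latents, no time order: every pair has 3 options.
no_latents_unordered = count_graphs(3, 3)

# No latents, temporal tiers: cross-tier pairs lose the backward edge.
no_latents_tiered = count_graphs(3, 2)

# With latents, no time order: every pair has 4 options.
latents_unordered = count_graphs(4, 4)
```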
13
Tetrad Demo
Online vs. Lecture
Data file: corrs.dat
14
Estimate and Test: Results
[Path diagram: Pre-test (%) → Final Exam (%), .23; Online → Final Exam (%), 5.3; Online → Recitation Attendance (%), -10; Recitation Attendance (%) → Final Exam (%), .22]
• Model fit excellent
• Online students attended 10% fewer recitations
• Each recitation gives an increase of 2% on the final exam
• Online students did 1/2 a Stdev better than lecture students (p = .059)
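In a linear model the direct and indirect effects add, so the coefficients can be combined into a total effect of Online on Final Exam. The sketch below assumes the path diagram assigns 5.3 to Online → Final Exam, -10 to Online → Recitation Attendance, and .22 to Recitation Attendance → Final Exam. That assignment is read off the flattened figure and the bullet points, and it is an assumption.

```python
# Path coefficients as read from the results slide (edge assignment is assumed).
online_to_final = 5.3         # direct effect on final exam (%)
online_to_recitation = -10.0  # effect on recitation attendance (%)
recitation_to_final = 0.22    # final exam (%) per point of attendance (%)

# Indirect effect: product of coefficients along Online -> Recitation -> Final.
indirect_effect = online_to_recitation * recitation_to_final

# Total effect: direct path plus indirect path.
total_effect = online_to_final + indirect_effect
```

Under these assumptions the lower recitation attendance offsets part of the direct advantage of the online condition.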
15