Basic Statistics for Linkage and Association Studies of
Quantitative Traits
Boulder Colorado Workshop March 5 2007
Overview
• A brief history of SEM
• Regression
• Maximum likelihood estimation
• Models
– Twin data
– Sib pair linkage analysis
– Association analysis
• Mixture distributions
• Some extensions
Origins of SEM
• Regression analysis
– ‘Reversion’: Galton 1877, a biological phenomenon
– Yule 1897, Pearson 1903: general statistical context
– Initially Gaussian X and Y; Fisher 1922: Y|X
• Path Analysis
– Sewall Wright 1918; 1921
– Path diagrams of regression and covariance relationships
Structural Equation Model basics
• Two kinds of relationships
– Linear regression X -> Y: single-headed arrow
– Unspecified covariance X <-> Y: double-headed arrow
• Four kinds of variables
– Squares – observed variables
– Circles – latent (not observed) variables
– Triangles – constants (zero variance), for specifying means
– Diamonds – observed variables used as moderators (on paths)
Linear Regression Model
[Path diagram: X -> Y with regression coefficient b; Var(X), Res(Y)]
Models covariances only
Of historical interest
Linear Regression Model
[Path diagram: X -> Y with coefficient b; Var(X), Res(Y); constant triangle gives means Mu(x) and Mu(y*)]
Models means and covariances
Linear Regression Model
[Path diagram: definition variable X (diamond) moderates the path b to Yi; Res(Y); constant triangle gives mean Mu(y*)]
Models mean and covariance of Y only
Must have raw (individual-level) data
X is a definition variable
Mean of Y different for every observation
Single Factor Model
[Path diagram: latent factor F (variance 1.00) with loadings l1, l2, l3, …, lm on observed variables S1, S2, S3, …, Sm; residual variances e1, e2, e3, e4]
Factor Model with Means
[Path diagram: latent factors F1, F2 (variance 1.00) with means mF1, mF2 from the constant; loadings l1, l2, l3, …, lm on S1, S2, S3, …, Sm; residuals e1, e2, e3, e4; observed-variable means mS1, mS2, mS3, …, mSm]
Factor model essentials
• In SEM the factor itself is typically assumed to be normally distributed
• May have more than one latent factor
• The errors are typically assumed to be normally distributed as well
• May be applied to binary or ordinal data
– Threshold model
Multifactorial Threshold Model
Normal distribution of liability; ‘affected’ when liability x > threshold t
[Figure: standard normal density of liability x, with threshold t marked]
Measuring Variation
• Distribution
– Population
– Sample
– Observed measures
• Probability density function ‘pdf’
– Smoothed out histogram
– f(x) >= 0 for all x
One Coin Toss
2 outcomes
[Bar chart: probability 0.5 for Heads, 0.5 for Tails]
Two Coin Toss
3 outcomes
[Bar chart: probabilities 0.25, 0.5, 0.25 for HH, HT/TH, TT]
Four Coin Toss
5 outcomes
[Bar chart: binomial probabilities for outcomes HHHH, HHHT, HHTT, HTTT, TTTT]
Ten Coin Toss
11 outcomes
[Bar chart: binomial distribution of the number of heads in ten tosses]
Fort Knox Toss
Infinite outcomes
[Figure: normal curve over the heads-tails difference; De Moivre 1733, Gauss 1827]
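The convergence the slides trace from coin tosses to the normal curve (De Moivre 1733, Gauss 1827) can be checked numerically. A minimal Python sketch, not part of the workshop materials, comparing exact binomial probabilities for ten tosses with the normal approximation:

```python
import math

def binom_pmf(k, n, p=0.5):
    """Exact probability of k heads in n tosses of a coin with P(heads) = p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mu, sigma):
    """The limiting bell curve identified by De Moivre and Gauss."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

n = 10
mu, sigma = n * 0.5, math.sqrt(n * 0.25)  # mean and sd of Binomial(n, 1/2)
for k in range(n + 1):
    exact = binom_pmf(k, n)
    approx = normal_pdf(k, mu, sigma)
    print(f"{k:2d} heads: exact {exact:.4f}  normal approx {approx:.4f}")
```

Already at n = 10 the exact and approximate probabilities agree to about two decimal places, which is the point of the "Fort Knox" limit.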
Variance
• Measure of spread
• Easily calculated
• Individual differences
• Average squared deviation
[Figure: normal distribution with a data point xi and its deviation di from the mean]
Variance = Σ di² / N
Measuring Variation
Ways & Means
• Absolute differences?
• Squared differences?
• Absolute cubed?
• Squared squared?
Measuring Variation
Ways & Means
• Squared differences
Fisher (1922): squared deviations have minimum variance under the normal distribution
Concept of “Efficiency” emerges
Deviations in two dimensions
[Scatter plot of (x, y) data points]
Deviations in two dimensions
[Scatter plot: one point with its deviations dx and dy marked]
Covariance
• Measure of association between two variables
• Closely related to variance
• Useful to partition variance
• Analysis of variance coined by Fisher
Summary
Formulae for sample statistics; i = 1…N observations
Mean: x̄ = Σ xi / N
Variance: σx² = Σ (xi − x̄)² / N
Covariance: σxy = Σ (xi − x̄)(yi − ȳ) / N
Correlation: rxy = σxy / √(σx² σy²)
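The sample-statistic formulae translate directly into code. A Python sketch, using the slide's divisor N rather than the unbiased N−1, with made-up data:

```python
import math

def sample_stats(x, y):
    """Sample mean, variance, covariance and correlation with divisor N."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((xi - mx) ** 2 for xi in x) / n
    vy = sum((yi - my) ** 2 for yi in y) / n
    cxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    rxy = cxy / math.sqrt(vx * vy)  # correlation standardizes the covariance
    return mx, vx, cxy, rxy

# Illustrative data only
mx, vx, cxy, rxy = sample_stats([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 7.0])
```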
Variance-covariance matrix
Univariate Twin/Sib Data
[ Var(Twin1)         Cov(Twin1,Twin2) ]
[ Cov(Twin2,Twin1)   Var(Twin2)       ]
Only suitable for complete data
Good conceptual perspective
Summary
• Means and covariances
• Basic input statistics for “Traditional SEM”
• Notion of probability density function
Maximum Likelihood Estimates: Nice Properties
1. Asymptotically unbiased
• Large-sample estimate of p -> population value
2. Minimum variance: “Efficient”
• Smallest variance of all estimates with property 1
3. Functionally invariant
• If g(a) is a one-to-one function of parameter a, and MLE(a) = a*, then MLE(g(a)) = g(a*)
• See http://wikipedia.org
Likelihood computation
Calculate the height of the curve
[Figure: standard normal curve; its height at x = 0]
Probability density function
[Figure: normal pdf with data point xi and its height φ(xi) marked]
φ(xi) is the likelihood of data point xi for particular mean & variance estimates
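The "height of the curve" is just the normal density evaluated at the data point. A Python sketch (the Mx matrix version appears later in the transcript; this standalone function is an illustration):

```python
import math

def normal_pdf(x, mu=0.0, var=1.0):
    """Height of the normal curve at x: the likelihood phi(x) of one data point."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

h0 = normal_pdf(0.0)  # height at the mean of a standard normal
h1 = normal_pdf(1.0)  # a point away from the mean is less likely
```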
Height of normal curve at xi: x̄ = .5
Function of the mean
[Figure: normal curve with mean .5; height φ(xi) at data point xi]
Likelihood of data point xi increases as x̄ approaches xi
Height of normal curve at x1
Function of the variance
[Figure: normal curves with var = 1, 2, 3; heights φ(x1) under each]
Likelihood of data point xi changes as the variance of the distribution changes
Height of normal curve at x1 and x2
[Figure: normal curves with var = 1 and var = 2; heights φ(x1) and φ(x2) under each]
x1 has higher likelihood with var = 1, whereas x2 has higher likelihood with var = 2
Likelihood of xi as a function of the mean
Likelihood function
[Figure: likelihood L(xi) plotted against the mean; the maximum (MLE) occurs at x̂ = xi]
L(xi) is the likelihood of data point xi for particular mean & variance estimates
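That the likelihood of a single observation peaks when the mean equals the observation can be verified with a crude grid search. A Python sketch; the grid and the single data point 0.5 are illustrative choices:

```python
import math

def loglik(mu, data, var=1.0):
    """Log-likelihood of the data under a normal with mean mu and fixed variance."""
    return sum(-0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mu) ** 2 / var
               for x in data)

data = [0.5]  # a single observation x_i
grid = [i / 100 for i in range(-300, 301)]  # candidate means from -3 to 3
mle = max(grid, key=lambda mu: loglik(mu, data))
# The likelihood peaks where the candidate mean equals the observation
```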
Likelihood as a measure of “outlierness”
• Unlikely observation may be an outlier
– Genuine
– Data entry error
– Model-specific
• Can use Mx feature to obtain case-wise likelihoods
– Raw data
– Option mx%p= uni_pi.out
– Output for each case: the contribution to the -2ll, as well as z-score statistic and Mahalanobis distance, weight and weighted likelihood
– Generates R syntax to read in the file and sort by z-score
– Beeby, Medland & Martin (2006). ViewPoint and ViewDist: utilities for rapid graphing of linkage distributions and identification of outliers. Behav Genet 36(1):7-11
Height of bivariate normal density function
An unlikely pair of (x,y) values
[Figure: bivariate normal density surface; point (x1, y1) lies far from the mean]
Height of bivariate normal density function
A more likely pair of (x,y) values
[Figure: bivariate normal density surface; point (x2, y2) lies near the mean]
Likelihood of Independent Observations
• Chance of getting two heads is the product of the individual chances
• L(x1…xn) = Product (L(x1), L(x2), … L(xn))
• L(xi) typically < 1
• Avoid vanishing L(x1…xn)
• Computationally convenient: log-likelihood
• ln(a * b) = ln(a) + ln(b)
• Minimization more manageable than maximization
– Minimize -2 ln(L)
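The product-versus-log-sum point can be made concrete. A Python sketch with an arbitrary made-up sample:

```python
import math

def normal_pdf(x, mu=0.0, var=1.0):
    """Standard normal density at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

data = [0.3, -1.2, 0.7, 2.1]  # made-up sample
# Product of per-observation likelihoods shrinks toward zero as n grows...
L = math.prod(normal_pdf(x) for x in data)
# ...so work with minus twice the sum of log-likelihoods instead
minus2ll = -2.0 * sum(math.log(normal_pdf(x)) for x in data)
# ln(a*b) = ln(a) + ln(b), so minus2ll equals -2*ln(L)
```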
Likelihood Ratio Tests
• Comparison of likelihoods
• Consider ratio L(data, model 1) / L(data, model 2)
• Ln(a/b) = ln(a) - ln(b)
• Log-likelihood difference: lnL(data, model 1) - lnL(data, model 2)
• Useful asymptotic feature when model 2 is a submodel of model 1:
-2 (lnL(data, model 1) - lnL(data, model 2)) ~ χ²
df = # parameters of model 1 - # parameters of model 2
• BEWARE of gotchas!
– Estimates of a², q², etc. have implicit bound of zero
– Distributed as 50:50 mixture of 0 and χ²₁
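The 50:50 mixture correction can be sketched for a single bounded parameter (1 df). The statistic value 3.84 below is illustrative; the chi-square tail is computed from the normal via erfc, since P(χ²₁ > x) = P(|Z| > √x):

```python
import math

def chi2_1df_sf(x):
    """P(chi-square with 1 df > x), via P(|Z| > sqrt(x)) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

lrt = 3.84  # illustrative -2*(lnL submodel - lnL full) for one bounded variance parameter
p_naive = chi2_1df_sf(lrt)    # plain chi-square(1) p-value, about .05
p_mixture = 0.5 * p_naive     # 50:50 mixture of point mass at 0 and chi-square(1)
```

Halving the naive p-value is why ignoring the boundary makes the test conservative.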
Exercises: Compute Normal PDF
• Get used to Mx script language
• Use matrix algebra
• Taste of likelihood theory
Mx script part 1: Declare groups and matrices
#NGroups 1
Title figure out likelihood by hand
Calculation
Begin Matrices;
E symm 2 2 ! Expected Covariance Matrix
H full 1 1 ! One half
T full 1 1 ! Two
M full 2 1 ! Mean vector
P full 1 1 ! Pi
X full 2 1 ! Observed Data
End Matrices;
Mx script part 2: Put values in matrices
Matrix E
1 .0 1
Matrix H .5
Matrix M 0 0
Matrix P 3.141592
Matrix T 2
Matrix X 1 2
Mx script part 3: Matrix Algebra
Begin Algebra;
O=T*P*\sqrt\det(E); ! Fractional part, 2pi*sqrt(det(e))
Q=(X-M)'&(E~); ! Mahalanobis Distance
R=\exp(-H*Q); ! e to the power -.5*Mahalanobis distance
S=-T*\ln(R%O);! minus twice log-likelihood
Z=-T*\ln(\pdfnor(X'_M'_E)); ! A simpler way
End Algebra;
End Group;
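The Mx algebra above can be mirrored in Python to check the arithmetic by hand. This sketch hard-codes the 2×2 case with the same values as the script (means 0, identity covariance, data [1 2]); it is an illustration, not part of the workshop materials:

```python
import math

def minus2ll_mvn2(x, mu, sigma):
    """-2 ln likelihood of one 2-vector under a bivariate normal (mu, sigma);
    mirrors the Mx algebra: -2*ln( exp(-.5*Q) / (2*pi*sqrt(det(E))) )."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # 2x2 inverse, like E~
    dx = [x[0] - mu[0], x[1] - mu[1]]
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))  # Mahalanobis distance
    return 2 * math.log(2 * math.pi) + math.log(det) + q

# Same values as the Mx script: means 0, identity covariance, data [1 2]
m2ll = minus2ll_mvn2([1.0, 2.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```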
Exercises 1
• Bivariate normal distribution
– Means [110.28 112.00]
– Covariance matrix [299.40 174.20; 174.20 281.18]
• Compute likelihood of observed vector x = [87 89]
Exercises 2
• Bivariate normal distribution
– Means [1 1]
– Covariance matrix [1 .3; .3 1]
• Compute likelihood of observed vector x = [1 2]
• Compute likelihood with correlation of .0 instead
• Optional: compute likelihood of observed vector x = [-2 -2] with correlations .5, .0, and 0
• Which is the most likely combination of model and data?
Exercises 3
• Univariate normal distribution
– Mean [1]
– Variance [1 ]
• Compute likelihood of observed vector x = [1]
• Compute likelihood of observed vector x = [2]
• Compute their product
• Which bivariate case does this equal?
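The relationship Exercise 3 points at, that independent univariate likelihoods multiply to give a bivariate likelihood, can be checked directly. A Python sketch using the exercise's values (means 1, unit variances):

```python
import math

def normal_pdf(x, mu, var):
    """Univariate normal density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def bivariate_pdf(x, y, mu, r):
    """Bivariate normal density with unit variances and correlation r."""
    det = 1.0 - r * r
    q = ((x - mu[0]) ** 2 - 2 * r * (x - mu[0]) * (y - mu[1]) + (y - mu[1]) ** 2) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

# Univariate N(1, 1) likelihoods of x = 1 and x = 2
product = normal_pdf(1.0, 1.0, 1.0) * normal_pdf(2.0, 1.0, 1.0)
# With r = 0 the bivariate density factors into the univariate product
joint = bivariate_pdf(1.0, 2.0, [1.0, 1.0], 0.0)
```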
Two Group Model: ACE
[Path diagram, MZ twins: latent A, C, E factors (variance 1) for each twin with paths a, c, e to T1 and T2; means m1, m2; the twins' A factors and C factors each correlate 1.0]
[Path diagram, DZ twins: same structure; means m3, m4; C factors correlate 1.0, A factors correlate .5]
7 parameters
DZ by IBD status
[Path diagrams for sib pairs grouped by IBD status: latent Q (QTL), F (family), E factors (variance 1) with paths q, f, e to T1 and T2; means m1–m6; F factors correlate 1.0; the Q factors correlate 0, .5, or 1 according to IBD status]
• Variance = Q + F + E
• Covariance = π̂Q + F
Extensions to More Complex Applications
• Endophenotypes
• Linkage Analysis
• Association Analysis
Basic Linkage (QTL) Model
 = p(IBD=2) + .5 p(IBD=1)
1
1
1
1
E1
F1
Q1
e
f
P1
q
Pihat
1
1
1
Q2
F2
E2
q
f
e
P2
Q: QTL Additive Genetic
F: Family Environment
E: Random Environment
3 estimated parameters: q, f and e
Every sibship may have different model
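The π̂ formula and the model-implied moments can be written out in code. A Python sketch in which the variance components are the squared path coefficients, Q = q², F = f², E = e² (standard path-tracing rules); function names are illustrative:

```python
def pihat(p_ibd1, p_ibd2):
    """Estimated proportion of alleles shared IBD at the locus:
    pihat = p(IBD=2) + 0.5 * p(IBD=1)."""
    return p_ibd2 + 0.5 * p_ibd1

def sib_moments(q, f, e, p_ibd1, p_ibd2):
    """Model-implied variance and sib covariance:
    Variance = Q + F + E; Covariance = pihat*Q + F."""
    Q, F, E = q * q, f * f, e * e
    return Q + F + E, pihat(p_ibd1, p_ibd2) * Q + F

# Example: a sib pair with p(IBD=1) = .5 and p(IBD=2) = .25 has pihat = .5
var, cov = sib_moments(1.0, 1.0, 1.0, 0.5, 0.25)
```

Because π̂ is computed per pair, every sibship gets its own expected covariance matrix, which is the point made on the slide.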
Measurement Linkage (QTL) Model
π̂ = p(IBD=2) + .5 p(IBD=1)
[Path diagram: latent phenotypes P1, P2 each measured by M1–M6 through factor loadings l1–l6 (via F1, F2); E, F, Q act on the latent phenotypes via paths e, f, q; Q1–Q2 correlate π̂ (Pihat)]
Q: QTL additive genetic
F: family environment
E: random environment
3 estimated parameters: q, f and e
Every sibship may have a different model
Fulker Association Model
[Path diagram: observed genotypes Geno1, Geno2 -> G1, G2; sum S and difference D composites with weights 0.50, 0.50 and -0.50, 0.50; between-family (B) and within-family (W) components enter the means model via paths b and w (weights 1.00, 1.00 and -1.00, 1.00) to LDL1, LDL2; residuals R; sib residual correlation C = 0.75]
Multilevel model for the means
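The between/within split behind the Fulker means model can be sketched in a few lines. Function names are illustrative; the parameterization follows the diagram, with b for the between-family path and w for the within-family path:

```python
def fulker_decomposition(g1, g2):
    """Split sib genotype scores into between- and within-family components.
    The between part can be confounded by population stratification;
    the within part is robust to it."""
    between = 0.5 * (g1 + g2)   # family mean genotype (the S/B side of the diagram)
    within1 = g1 - between      # = 0.5*(g1 - g2), the D/W side
    within2 = g2 - between      # = -within1
    return between, within1, within2

def expected_means(mu, b, w, g1, g2):
    """Means model for a sib pair: mu + b*B + w*Wi for each sib."""
    B, w1, w2 = fulker_decomposition(g1, g2)
    return mu + b * B + w * w1, mu + b * B + w * w2

# Example: sib genotype scores 2 and 0 give family mean 1 and within deviations +/-1
B, W1, W2 = fulker_decomposition(2.0, 0.0)
```

Comparing the b and w estimates is what lets the model separate genuine within-family association from stratification artifacts.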
Measurement Fulker Association Model (SM)
[Path diagram: as in the Fulker association model, with genotypes Geno1, Geno2 -> G1, G2 feeding S and D (weights 0.50), and B and W (weights 1.00) entering the means model via paths b, w and m; here the phenotypes are latent factors F1, F2 measured by M1–M6 through loadings l1–l6]
Multivariate Linkage & Association Analyses
• Computationally burdensome
• Distribution of test statistics questionable
• Permutation testing possible
– Even heavier burden
– Sarah Medland’s rapid approach
• Potential to refine both assessment and genetic models
• Lots of long & wide datasets on the way
– Dense repeated measures: EMA; fMRI(!)
– Need to improve software! Open source Mx