Categorical Data
Frühling Rijsdijk (1) & Caroline van Baal (2)
(1) IoP, London
(2) Vrije Universiteit, A’dam
Twin Workshop, Boulder
Tuesday March 2, 2004
Aims
• Introduce Categorical Data
• Define liability and describe assumptions of the liability model
• Show how heritability of liability can be estimated from categorical twin data
• Practical exercises
Categorical data
The measuring instrument can only discriminate between two or a few ordered categories, e.g. absence or presence of a disease
Data therefore take the form of counts, i.e. the number of individuals within each category
Univariate Normal Distribution of Liability
Assumptions:
(1) Underlying normal distribution of liability
(2) The liability distribution has 1 or more thresholds (cut-offs)
The standard Normal distribution
Liability is a latent variable and its scale is arbitrary; the distribution is therefore assumed to be a Standard Normal Distribution (SND) or z-distribution:
• mean (μ) = 0 and SD (σ) = 1
• z-values are the number of SDs away from the mean
• area under the curve translates directly to probabilities
> Normal Probability Density function (φ)
[Figure: standard normal density curve, z from -3 to 3; 68% of the area lies within 1 SD of the mean]
Standard Normal Cumulative Probability in right-hand tail
(For negative z values, areas are found by symmetry)
Area = P(z ≥ z0)

z0      Area     %
0       .50      50%
.2      .42      42%
.4      .35      35%
.6      .27      27%
.8      .21      21%
1       .16      16%
1.2     .12      12%
1.4     .08      8%
1.6     .06      6%
1.8     .036     3.6%
2       .023     2.3%
2.2     .014     1.4%
2.4     .008     .8%
2.6     .005     .5%
2.8     .003     .3%
2.9     .002     .2%

[Figure: standard normal curve from -3 to 3 with z0 marked; the area is the right-hand tail beyond z0]
When we have one variable it is possible to find a z-value (threshold) on the SND so that the proportion above it exactly matches the observed proportion of the sample
• e.g. if 150 individuals from a sample of 1000 have met the criteria for a disorder (15%), the z-value is 1.04
[Figure: standard normal curve from -3 to 3 with the threshold marked at 1.04]
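The step from an observed proportion to a threshold can be sketched in Python. This is only an illustration using the standard library; the function name `threshold_from_prevalence` is made up for this example and is not part of Mx.

```python
from statistics import NormalDist

def threshold_from_prevalence(prevalence):
    """Return the z-value that leaves `prevalence` in the upper tail of the SND."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    # inv_cdf gives the z-value below which a given proportion falls
    return NormalDist().inv_cdf(1.0 - prevalence)

# 150 affected out of 1000 individuals, as in the example above
z = threshold_from_prevalence(150 / 1000)
print(round(z, 2))  # 1.04
```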
Two categorical traits
When we have two categorical traits, the data are represented in a Contingency Table, containing cell counts that can be translated into proportions
0 = absent
1 = present

Cell labels:
           Trait2
Trait1     0      1
0          00     01
1          10     11

Counts (proportions):
           Trait2
Trait1     0            1
0          545 (.76)    56 (.08)
1          75 (.11)     40 (.05)
Categorical Data for twins:
When the measured trait is dichotomous, i.e. a disorder is either present or absent, in an unselected sample of twins:
cell a: number of pairs concordant for unaffected
cell d: number of pairs concordant for affected
cell b/c: number of pairs discordant for the disorder

           Twin2
Twin1      0         1
0          00 (a)    01 (b)
1          10 (c)    11 (d)

0 = unaffected
1 = affected
Joint Liability Model for twin pairs
• Assumed to follow a bivariate normal distribution
• The shape of a bivariate normal distribution is determined by the correlation between the traits
• Expected proportions under the distribution can be calculated by numerical integration with mathematical subroutines
Bivariate Normal
[Figure: bivariate normal density surfaces for R=.00 and R=.90]
Bivariate Normal (R=0.6) partitioned at threshold 1.4 (z-value) on both liabilities
Expected Proportions of the BN, for R=0.6, Th1=1.4, Th2=1.4

           Liab 2
Liab 1     0       1
0          .87     .05
1          .05     .03
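As a rough check on these expected proportions, the four cells can be computed with the Python standard library. This is a minimal sketch, not the routine Mx uses: it conditions on the first liability and integrates with Simpson's rule, and the names `both_above` and `cell_proportions` and the finite upper limit of 8 are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal: mean 0, SD 1

def both_above(t1, t2, r, upper=8.0, steps=2000):
    """P(L1 > t1, L2 > t2) for a standard bivariate normal with correlation r (|r| < 1)."""
    s = sqrt(1.0 - r * r)  # SD of L2 given L1

    def integrand(x):
        # density of L1 at x, times P(L2 > t2 | L1 = x)
        return STD.pdf(x) * (1.0 - STD.cdf((t2 - r * x) / s))

    h = (upper - t1) / steps  # Simpson's rule: steps must be even
    total = integrand(t1) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(t1 + i * h)
    return total * h / 3.0

def cell_proportions(t1, t2, r):
    """Expected proportions for cells 00, 01, 10, 11 (first digit = liability 1)."""
    p1 = 1.0 - STD.cdf(t1)  # proportion above threshold on liability 1
    p2 = 1.0 - STD.cdf(t2)  # proportion above threshold on liability 2
    p11 = both_above(t1, t2, r)
    p10 = p1 - p11
    p01 = p2 - p11
    p00 = 1.0 - p10 - p01 - p11
    return p00, p01, p10, p11

print([round(p, 2) for p in cell_proportions(1.4, 1.4, 0.6)])
```

Rounded to two decimals, the result should agree with the table above.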
Correlated dimensions:
The correlation (shape) and the two thresholds
determine the relative proportions of observations in the
4 cells of the CT.
Conversely, the sample proportions in the 4 cells can be
used to estimate the correlation and the thresholds.
           Twin2
Twin1      0         1
0          00 (a)    01 (b)
1          10 (c)    11 (d)

[Figure: bivariate normal partitioned by the two thresholds into regions a, b, c and d]
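The reverse direction, from cell counts to thresholds and a correlation, can be sketched as follows. This is an illustrative approach under stated assumptions, not the Mx algorithm: thresholds are taken from the margins, and the correlation is found by bisection so that the expected 11 cell matches the observed one. The function names are hypothetical; the counts in the usage line are those of the Trait1 by Trait2 table shown earlier.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()

def both_above(t1, t2, r, upper=8.0, steps=400):
    """P(L1 > t1, L2 > t2) under a standard bivariate normal (Simpson's rule)."""
    s = sqrt(1.0 - r * r)
    f = lambda x: STD.pdf(x) * (1.0 - STD.cdf((t2 - r * x) / s))
    h = (upper - t1) / steps
    total = f(t1) + f(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(t1 + i * h)
    return total * h / 3.0

def tetrachoric(n00, n01, n10, n11, tol=1e-6):
    """Return (r, t1, t2) reproducing a 2 x 2 table (first digit = variable 1)."""
    n = n00 + n01 + n10 + n11
    t1 = STD.inv_cdf(1.0 - (n10 + n11) / n)  # threshold from the margin of variable 1
    t2 = STD.inv_cdf(1.0 - (n01 + n11) / n)  # threshold from the margin of variable 2
    target = n11 / n
    lo, hi = -0.999, 0.999
    while hi - lo > tol:  # the 11 cell grows monotonically with r
        mid = (lo + hi) / 2.0
        if both_above(t1, t2, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0, t1, t2

r, t1, t2 = tetrachoric(545, 56, 75, 40)
print(round(r, 2), round(t1, 2), round(t2, 2))
```

With 3 statistics and 3 parameters the table is reproduced exactly, which is why simple root finding is enough here.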
Twin Models
• A variance decomposition (A, C, E) can be applied to liability, where the correlations in liability are determined by the path model
• This leads to an estimate of the heritability of the liability
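How the path model fixes the liability correlations can be sketched with the standard ACE expectations (MZ pairs share all of A and C, DZ pairs half of A and all of C). The function names are hypothetical, and the reverse function is plain algebra on the two correlations, not a maximum-likelihood fit.

```python
def expected_twin_correlations(a2, c2):
    """Expected liability correlations for MZ and DZ pairs under the ACE model."""
    r_mz = a2 + c2        # MZ pairs share all additive genetic and common environment variance
    r_dz = 0.5 * a2 + c2  # DZ pairs share on average half of the additive genetic variance
    return r_mz, r_dz

def ace_from_correlations(r_mz, r_dz):
    """Solve the two expectations above for a2 and c2; e2 is the remainder."""
    a2 = 2.0 * (r_mz - r_dz)
    c2 = 2.0 * r_dz - r_mz
    e2 = 1.0 - r_mz
    return a2, c2, e2

# Values used for the simulated data in the practical: h2 = .40, c2 = .20
print(expected_twin_correlations(0.40, 0.20))
print(ace_from_correlations(0.60, 0.40))
```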
ACE Liability Model
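[Path diagram: for each twin, latent factors A, C and E influence the liability L, which is split by a threshold into Unaf and Aff. The A factors of Twin 1 and Twin 2 correlate 1 (MZ) or .5 (DZ); the C factors correlate 1.]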
Summary
It is possible to estimate a correlation
between categorical traits from simple
counts (CT), because of the assumptions
we make about their joint distributions
How can we fit ordinal data in Mx?
Summary statistics: CT
Mx has a built-in fit function for the maximum-likelihood analysis of 2-way Contingency Tables
> analyses limited to only two variables
Raw data analyses
- multivariate
- handles missing data
- moderator variables
Model-fitting to CT
Mx has a built in fit function for the maximum-likelihood
analysis of 2-way Contingency Tables
The Fit Function is twice the log-likelihood of the observed frequency data, calculated as:
$$2 \ln L = 2 \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} \ln(p_{ij})$$
$n_{ij}$ is the observed frequency in cell ij
$p_{ij}$ is the expected proportion in cell ij
Expected proportions
Are calculated by numerical integration of the
bivariate normal over two dimensions:
the liabilities for twin1 and twin2
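e.g. the probability that both twins are affected:
$$\int_{T}^{\infty} \int_{T}^{\infty} \Phi(L_1, L_2; \mathbf{0}, \Sigma) \, dL_1 \, dL_2$$
Φ is the bivariate normal probability density function, L1 and L2 are the liabilities of twin1 and twin2, with means 0, and Σ is the correlation matrix of the two liabilities.
[Figure: liability plane with axes L1 and L2, partitioned at the thresholds, with cells a and d marked]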
For example: for a correlation of .9 and thresholds (z-values) of 1, the probability that both twins are above threshold (proportion d) is around .12
The probability that both twins are below threshold (proportion a) is given by another integral function with reversed boundaries:
$$\int_{-\infty}^{T} \int_{-\infty}^{T} \Phi(L_1, L_2; \mathbf{0}, \Sigma) \, dL_1 \, dL_2$$
and is around .80 in this example
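The two integrals in this example can be approximated directly over both dimensions. The sketch below uses a plain midpoint rule on a grid, which is far cruder than the subroutines a package like Mx would use; the step size, the stand-in value 7 for infinity and all function names are assumptions of this illustration.

```python
from math import exp, pi, sqrt

def bvn_pdf(x, y, r):
    """Standard bivariate normal density with correlation r (|r| < 1)."""
    q = (x * x - 2.0 * r * x * y + y * y) / (1.0 - r * r)
    return exp(-0.5 * q) / (2.0 * pi * sqrt(1.0 - r * r))

def integrate_rectangle(lo1, hi1, lo2, hi2, r, step=0.02):
    """Midpoint-rule integral of the density over [lo1, hi1] x [lo2, hi2]."""
    n1 = max(1, round((hi1 - lo1) / step))
    n2 = max(1, round((hi2 - lo2) / step))
    h1 = (hi1 - lo1) / n1
    h2 = (hi2 - lo2) / n2
    total = 0.0
    for i in range(n1):
        x = lo1 + (i + 0.5) * h1
        for j in range(n2):
            y = lo2 + (j + 0.5) * h2
            total += bvn_pdf(x, y, r)
    return total * h1 * h2

INF = 7.0       # finite stand-in for infinity; the density beyond it is negligible
T, R = 1.0, 0.9  # threshold and correlation from the example above

d = integrate_rectangle(T, INF, T, INF, R)    # both twins above threshold
a = integrate_rectangle(-INF, T, -INF, T, R)  # both twins below threshold
print(round(d, 3), round(a, 3))
```

Both values should land close to the figures quoted above (around .12 and around .80).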
χ² statistic:
the log-likelihood of the data under the model is subtracted from the log-likelihood of the observed frequencies themselves:
$$2 \ln L = 2 \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} \ln\left(\frac{n_{ij}}{n_{tot}}\right)$$
The model’s failure to predict the observed data, i.e. a badly fitting model, is reflected in a significant χ²
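The fit function and the χ² built from it can be written out in a few lines. This is a sketch of the arithmetic only, with hypothetical function names; cells are passed as flat lists. The usage line compares the MZ counts from Practical Exercise 1 with the expected proportions quoted earlier for R=0.6 and thresholds of 1.4, purely as an illustration.

```python
from math import log

def twice_loglik(counts, probs):
    """2 ln L = 2 * sum of n_ij * ln(p_ij); empty cells contribute nothing."""
    return 2.0 * sum(n * log(p) for n, p in zip(counts, probs) if n > 0)

def likelihood_ratio_chi2(counts, probs):
    """2 ln L of the observed frequencies minus 2 ln L under the model."""
    n_tot = sum(counts)
    observed = [n / n_tot for n in counts]  # proportions that reproduce the data exactly
    return twice_loglik(counts, observed) - twice_loglik(counts, probs)

mz_counts = [508, 48, 35, 34]        # cells 00, 01, 10, 11
model = [0.87, 0.05, 0.05, 0.03]     # expected proportions, used here only as an example
print(round(likelihood_ratio_chi2(mz_counts, model), 2))
```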
Model-fitting to Raw
Ordinal Data
Zyg    ordinal     ordinal
       respons1    respons2
1      0           0
1      0           0
1      0           1
2      1           0
2      0           0
1      1           1
2      .           1
2      0           .
2      0           1
Model-fitting to Raw
Ordinal Data
The likelihood of a vector of ordinal responses is computed as the Expected Proportion in the corresponding cell of the multivariate normal (MN) distribution
Expected proportions are calculated by numerical integration of the MN over n dimensions. In this example it will be two, the liabilities for twin1 and twin2
(0 0): $$\int_{x_1=-\infty}^{T} \int_{x_2=-\infty}^{T} \Phi(x_1, x_2) \, dx_2 \, dx_1$$
(1 1): $$\int_{x_1=T}^{\infty} \int_{x_2=T}^{\infty} \Phi(x_1, x_2) \, dx_2 \, dx_1$$
(0 1): $$\int_{x_1=-\infty}^{T} \int_{x_2=T}^{\infty} \Phi(x_1, x_2) \, dx_2 \, dx_1$$
(1 0): $$\int_{x_1=T}^{\infty} \int_{x_2=-\infty}^{T} \Phi(x_1, x_2) \, dx_2 \, dx_1$$
Φ is the MN pdf, which is a function of Σ, the correlation matrix of the variables
By maximizing the likelihood of the data under a MN distribution, the ML estimates of the correlation matrix and the thresholds are obtained
Practical Exercise 1
Simulated data for 625 MZ and 625 DZ pairs
(h2 = .40 c2 = .20 e2 = .40 > rmz=.60 rdz=.40)
Dichotomized 0 = bottom 88%, 1 = top 12%
This corresponds to threshold (z-value) of 1.18
Observed counts:

MZ     0      1
0      508    48
1      35     34

DZ     0      1
0      497    59
1      49     20
Raw ORD File: bin.dat
Scripts: tetracor.mx and ACEbin.mx
Practical Exercise 2
Same simulated data
Categorized 0 = bottom 22%, 1 = mid 66%, 2 = top 12%
This corresponds to thresholds (z-values) of -0.75 and 1.18
Observed counts:

MZ     0      1      2
0      80     58     1
1      68     302    47
2      1      34     34

DZ     0      1      2
0      63     74     2
1      71     289    57
2      4      45     20
Raw ORD File: cat.dat
Adjust the correlation and ACE script
Threshold Specification in Mx
2 Categories
Threshold Matrix : 1 x 2
T(1,1) T(1,2) threshold twin1 & twin2
3 Categories
Threshold Matrix : 2 x 2
T(1,1) T(1,2) threshold 1 for twin1 & twin2
T(2,1) T(2,2) threshold 2 for twin1 & twin2
Threshold Specification in Mx
3 Categories
nthresh=2
nvar=2
Matrix T: nthresh nvar (2 x 2)
T(1,1) T(1,2) threshold 1 for twin1 & twin2
T(2,1) T(2,2) increment
L LOW nthresh nthresh
Value 1 L 1 1 to L nthresh nthresh
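Threshold Model
L*T /
$$\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} t_{11} & t_{12} \\ t_{21} & t_{22} \end{pmatrix} = \begin{pmatrix} t_{11} & t_{12} \\ t_{11} + t_{21} & t_{12} + t_{22} \end{pmatrix}$$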
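The effect of pre-multiplying T by a lower-triangular matrix of ones can be checked with a small Python sketch. The helper names are hypothetical. The first row of T holds the first thresholds from Practical Exercise 2 (-0.75); the increment 1.93 is derived here as 1.18 minus -0.75 so that the second threshold comes out at 1.18.

```python
def lower_ones(n):
    """n x n lower-triangular matrix of ones (the L matrix)."""
    return [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def thresholds(t):
    """Row 1 of t holds the first thresholds, later rows hold increments."""
    return matmul(lower_ones(len(t)), t)

T = [[-0.75, -0.75],   # threshold 1 for twin1 and twin2
     [1.93, 1.93]]     # increment to threshold 2 (derived value, see above)
for row in thresholds(T):
    print([round(v, 2) for v in row])
```

As long as the increments are positive, the second threshold is guaranteed to lie above the first.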
Using Frequency Weights
The 1250-line data file (bin.dat) can be summarized like this:

Zyg    ordinal     ordinal     FREQ
       respons1    respons2
1      0           0           508
1      0           1           48
1      1           0           35
1      1           1           34
2      0           0           497
2      0           1           59
2      1           0           49
2      1           1           20
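Collapsing a raw data file into frequency-weighted rows is a simple counting step. A minimal sketch, assuming complete data held as (zyg, respons1, respons2) tuples; the function name and the toy rows are made up for illustration.

```python
from collections import Counter

def to_frequency_table(rows):
    """Collapse (zyg, resp1, resp2) rows into sorted (zyg, resp1, resp2, freq) rows."""
    counts = Counter(rows)
    return [key + (freq,) for key, freq in sorted(counts.items())]

# Hypothetical toy data: six pairs instead of the 1250 in bin.dat
rows = [(1, 0, 0)] * 3 + [(1, 0, 1)] + [(2, 1, 1)] * 2
for line in to_frequency_table(rows):
    print(*line)
# 1 0 0 3
# 1 0 1 1
# 2 1 1 2
```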
Using Frequency Weights
G1: Data and model for MZ correlation
DAta NGroups=2 NInput_vars=4
Missing=.
Ordinal File=binF.dat
Labels
zyg bin1 bin2 freq
SELECT IF zyg = 1
SELECT bin1 bin2 freq /
DEFINITION freq /
Begin Matrices;
R STAN 2 2 FREE                      !Correlation matrix
T FULL nthresh nvar FREE             !thresh tw1, thresh tw2
L LOW nthresh nthresh
F FULL 1 1
End matrices;
Value 1 L 1 1 to L nthresh nthresh   ! initialize L
COV R /                              !Predicted Correlation matrix for MZ pairs
Thresholds L*T /                     !to ensure t1<t2<t3 etc.
FREQ F /
Example: 2 variables measured in twins:
x has 2 categories > 0 below Tx1, 1 above Tx1
y has 3 categories > 0 below Ty1, 1 between Ty1 and Ty2, 2 above Ty2
Ordinal response vector (x1, y1, x2, y2)
For example (1 2 0 1)
The likelihood of this vector of observations is the Expected Proportion in the corresponding cell of the MN:
$$\int_{x_1=t_{x1}}^{\infty} \int_{y_1=t_{y2}}^{\infty} \int_{x_2=-\infty}^{t_{x1}} \int_{y_2=t_{y1}}^{t_{y2}} \Phi(x_1, y_1, x_2, y_2) \, dy_2 \, dx_2 \, dy_1 \, dx_1$$
Proband-Ascertained
Samples
For rare disorders (e.g. Schizophrenia), selecting a random
sample of twins will lead to the vast majority of pairs being
unaffected.
A more efficient design is to ascertain twin pairs through a
register of affected individuals.
When an affected twin (the proband) is identified, the cotwin
is followed up to see if he or she is also affected.
There are several types of ascertainment
Types of ascertainment

Complete Ascertainment
           Twin 2
Twin 1     0         1
0          00 (a)    01 (b)
1          10 (c)    11 (d)

Single Ascertainment
           Co-twin
Proband    0         1
0          00 (a)    01 (b)
1          10 (c)    11 (d)
Ascertainment Correction
Omission of certain classes from observation leads to an increase
of the likelihood of observing the remaining individuals
Mx corrects for incomplete ascertainment by dividing the likelihood
by the proportion of the population remaining after ascertainment
CT from ascertained data can be analysed in Mx by simply substituting a -1 for the missing cells
CTable 2 2
-1 11
-1 13
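The idea of dividing by the proportion of the population that remains after ascertainment can be sketched as a conditional probability. This illustrates the principle only and says nothing about how Mx implements it. It assumes complete ascertainment retains every pair with at least one affected twin, and it reuses the expected proportions quoted earlier for R=0.6 and thresholds of 1.4; the function name is hypothetical.

```python
def ascertainment_corrected(cell_probs, observed_cells):
    """Condition cell probabilities on the cells that can actually be observed."""
    retained = sum(cell_probs[c] for c in observed_cells)  # proportion remaining
    if retained <= 0.0:
        raise ValueError("no probability mass left after ascertainment")
    return {c: cell_probs[c] / retained for c in observed_cells}

# Expected proportions from the R=0.6, threshold 1.4 table
probs = {"00": 0.87, "01": 0.05, "10": 0.05, "11": 0.03}

# Assumed here: under complete ascertainment the concordant unaffected cell is never seen
complete = ascertainment_corrected(probs, ["01", "10", "11"])
print({cell: round(p, 3) for cell, p in complete.items()})
```

After the correction the probabilities of the observable cells sum to 1 again, so each remaining pair is more likely than it would be in an unselected sample.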
Summary
For a 2 x 2 CT:
3 observed statistics, 3 parameters (1 correlation, 2 thresholds), so df=0
→ any pattern of observed frequencies can be accounted for, so there is no goodness-of-fit test of the normal distribution assumption.
This problem is solved when we have a CT which is at least 3 x 2: df>0
A significant χ² reflects departure from normality.
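The counting behind these degrees of freedom can be made explicit. A small sketch, assuming one correlation plus one threshold per category boundary on each variable; the function name is hypothetical.

```python
def ct_degrees_of_freedom(rows, cols):
    """df for fitting thresholds and one correlation to a rows x cols CT."""
    statistics = rows * cols - 1              # free cell proportions
    parameters = 1 + (rows - 1) + (cols - 1)  # 1 correlation + thresholds on each variable
    return statistics - parameters

for shape in [(2, 2), (3, 2), (3, 3)]:
    print(shape, ct_degrees_of_freedom(*shape))
# (2, 2) 0
# (3, 2) 1
# (3, 3) 3
```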
Summary
Power to detect certain effects increases with increasing number of categories > continuous data most powerful
For raw ordinal data analyses, the first category must be coded 0!
Threshold specification when analyzing CT is different