Computational Diagnostics
based on Large Scale Gene
Expression Profiles using MCMC
Rainer Spang,
Max Planck Institute for Molecular Genetics, Berlin
Harry Zuzan, Carrie Blanchette, Erich Huang, Holly Dressman,
Jeff Marks, Joe Nevins, Mike West
Duke Medical Center & Duke University
Estrogen Receptor Status
• 7000 genes
• 49 breast tumors
• 25 ER+
• 24 ER-
Tumor - Chip - 7000 Numbers
Given: 7000 numbers
Wanted: the probability that the tumor is ER+ (e.g. 89%)
7000 Numbers Are More
Numbers Than We Need
Predict ER status based on the expression
levels of super-genes
Singular Value Decomposition
$X = A D F$
X: Data
A: Loadings
D: Singular values
F: Expression levels of super genes, orthogonal matrix
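A minimal sketch of the decomposition with NumPy. The data here are synthetic random numbers, and the layout (genes in rows, tumors in columns) is an assumption made for illustration; the variable names `A`, `d`, and `F` simply mirror the symbols on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_tumors = 7000, 49
X = rng.normal(size=(n_genes, n_tumors))  # synthetic stand-in for the expression data

# Thin SVD: X = A @ diag(d) @ F
A, d, F = np.linalg.svd(X, full_matrices=False)

# Each row of F holds the expression levels of one super-gene across the 49 tumors.
print(A.shape, d.shape, F.shape)  # (7000, 49) (49,) (49, 49)
```

With 49 tumors there are at most 49 super-genes, so the 7000 numbers per tumor are replaced by 49.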
Probit Model
$P[Y_i = 1 \mid \beta] = \Phi\left(\beta_0 + \sum_{\text{all supergenes}} \beta_i x_i\right)$
$Y_i$: Class of tumor i
$\Phi$: Distribution function of a standard normal
$\beta_i$: Regression weight for super gene i
$x_i$: Expression level of super gene i
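As an illustration, the probit probability for a single tumor can be computed with the Python standard library. The function name `prob_er_positive` and the numeric weights are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import NormalDist

def prob_er_positive(beta0, beta, x):
    """P[Y = 1 | beta] = Phi(beta0 + sum_i beta_i * x_i) for one tumor."""
    eta = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return NormalDist().cdf(eta)  # Phi: standard normal distribution function

# Hypothetical weights and super-gene expression levels, for illustration only.
p = prob_er_positive(0.0, [1.0, -0.5], [1.0, 2.0])  # linear score is 0, so p is 0.5
```

A positive linear score pushes the probability above one half, a negative one below it.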
Overfitting
• Using only a small number of super genes
is not robust at all
• When using many (all) supergenes, the
linear model can be easily saturated, i.e.
we have several models that fit perfectly
well
• Consequence: For a new patient we find
among these models some that support
that she is ER+ and others that predict
she is ER-
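The saturation problem can be reproduced on synthetic data. In this sketch (random numbers, arbitrary labels, and dimensions chosen only for illustration) there are more weights than tumors, so two different linear models reproduce every training label and still disagree about a new patient.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tumors, n_weights = 10, 12                  # more weights than tumors
X = rng.normal(size=(n_tumors, n_weights))    # synthetic training profiles
y = np.where(rng.random(n_tumors) < 0.5, 1.0, -1.0)  # arbitrary labels: +1 = ER+, -1 = ER-

# Model 1: minimum-norm weights that reproduce the labels exactly.
w1 = np.linalg.pinv(X) @ y

# A direction in the null space of X changes no training prediction.
_, _, Vt = np.linalg.svd(X)
v = Vt[-1]                                    # X @ v is numerically zero

# Model 2: move along v until the score of a new patient flips its sign.
x_new = rng.normal(size=n_weights)
t = -2.0 * (x_new @ w1) / (x_new @ v)
w2 = w1 + t * v

fits_1 = np.array_equal(np.sign(X @ w1), y)
fits_2 = np.array_equal(np.sign(X @ w2), y)
print(fits_1, fits_2, np.sign(x_new @ w1), np.sign(x_new @ w2))
```

Both models fit the training data perfectly, yet one calls the new patient ER+ and the other ER-.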
Given the Few Profiles With
Known Diagnosis:
• The uncertainty about the right model is
high
• The variance of the model-weights is
large
• The likelihood landscape is flat
• We need additional model
assumptions to solve the problem
Informative Priors
[Figure: likelihood, prior, and posterior]
If the Prior Is Chosen
Badly:
• We cannot reproduce the diagnosis
of the training profiles any more
• We still cannot identify the model
• The diagnosis is driven mostly by the
additional assumptions and not by
the data
The Prior Needs to Be
Designed in 49 Dimensions
• Shape?
• Center?
• Orientation?
• Not too narrow ... not too wide
Shape
multidimensional
normal
for simplicity
Center
Probability of the diagnosis: $P[Y_i = 1 \mid \beta]$
Assumptions on the model correspond
to assumptions on the diagnosis
Orientation
orthogonal super-genes !
Not Too Narrow ... Not Too Wide
Auto adjusting model
Scales are hyper parameters with their own priors
$p(\beta \mid T) = \prod_{i=1}^{n} N(\beta_i \mid 0, \tau_i^2 / d_i^2)$
$p(\beta \mid T)$: Prior given the hyper parameter
$\tau_i^2$: Hyper parameter
$d_i^2$: Rescaling by singular values
Product over i: Independent super genes
Mean 0: Unbiased prior
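A small sketch of drawing weights from this prior when the scales are held fixed. The singular values and scales below are hypothetical numbers, and `sample_prior_weights` is an illustrative helper, not part of the original material.

```python
import random

def sample_prior_weights(tau, d, rng):
    """Draw beta_i ~ N(0, tau_i^2 / d_i^2) independently for every super-gene."""
    # random.gauss takes a standard deviation, which here is tau_i / d_i.
    return [rng.gauss(0.0, t / s) for t, s in zip(tau, d)]

rng = random.Random(0)
d = [50.0, 10.0, 2.0]    # hypothetical singular values
tau = [1.0, 1.0, 1.0]    # hypothetical scales, held fixed in this sketch
beta = sample_prior_weights(tau, d, rng)
```

Dividing by the singular value means that super-genes with large singular values receive tighter priors on their weights.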
A prior for the hyper parameters
-Conjugate prior
-Flexibility for each super gene i
-Symmetric U-shaped prior for $P[Y_i = 1 \mid \beta]$
$\tau_i^{-2} \sim \mathrm{Gamma}(k/2, k/2)$
k=2 or k=3
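A sketch of drawing the hyper parameters, under two assumptions that the slide does not spell out: the Gamma distribution is placed on the precision $1/\tau_i^2$ (the conjugate choice for a normal prior), and its two arguments are shape and rate. `sample_tau_squared` is a hypothetical helper name.

```python
import random

def sample_tau_squared(k, rng):
    """Draw tau^2, assuming 1/tau^2 ~ Gamma(shape=k/2, rate=k/2)."""
    # random.gammavariate takes (shape, scale); scale = 1 / rate = 2 / k.
    precision = rng.gammavariate(k / 2.0, 2.0 / k)
    return 1.0 / precision

rng = random.Random(0)
k = 3
tau_sq = [sample_tau_squared(k, rng) for _ in range(5)]
```

Under these assumptions the precision has mean 1, while individual draws of $\tau_i^2$ can be large, which lets single weights move far from zero when the data call for it.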
Latent Variable
$P[Y_i = 1 \mid \beta] = \Phi\left(\beta_0 + \sum_{\text{all supergenes}} \beta_i x_i\right)$
$h_i = \beta_0 + \sum \beta_i x_i + \varepsilon$
$\varepsilon \sim N(0, 1)$
$Y_i = 1 \iff h_i > 0$
Albert & Chib 1993
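The equivalence between the latent-variable formulation and the probit model can be checked by simulation. In this sketch the linear score `eta` is a hypothetical value standing in for the intercept plus the weighted super-gene levels.

```python
import random
from statistics import NormalDist

def simulate_class(eta, rng):
    """Return Y = 1 if the latent variable h = eta + eps is positive, else 0."""
    h = eta + rng.gauss(0.0, 1.0)  # eps ~ N(0, 1)
    return 1 if h > 0 else 0

rng = random.Random(0)
eta = 0.8                          # hypothetical linear score for one tumor
n_draws = 100_000
frequency = sum(simulate_class(eta, rng) for _ in range(n_draws)) / n_draws
probit = NormalDist().cdf(eta)     # Phi(eta), the probit probability
```

The simulated frequency of Y = 1 should agree with `probit` up to Monte Carlo error.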
MCMC
- Gibbs Sampler
- Sequential updates of conditional distributions
$p(\beta \mid X, h, T)$ ~ normal
$p(T \mid X, h, \beta)$ ~ gamma
$p(h \mid X, \beta, T)$ ~ truncated normal
All conditional posteriors can be calculated analytically
West 2001, Albert & Chib 1993
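The three conditional updates can be sketched as a short Gibbs sampler. This is an illustration under stated assumptions, not the authors' implementation: there is no intercept, the Gamma prior is taken to be on the precisions $1/\tau_i^2$ with shape and rate k/2, the truncated normal is drawn by a simple inverse-CDF method that is crude in the far tails, and the data are synthetic.

```python
import numpy as np
from statistics import NormalDist

STD_NORMAL = NormalDist()

def draw_latent(mean, y, rng):
    """Draw h ~ N(mean, 1) truncated to h > 0 if y == 1, else to h <= 0."""
    cut = STD_NORMAL.cdf(float(-mean))                 # P[h <= 0]
    lo, hi = (cut, 1.0) if y == 1 else (0.0, cut)
    u = float(rng.uniform(lo, hi))
    u = min(max(u, 1e-12), 1.0 - 1e-12)                # keep inv_cdf inside (0, 1)
    return float(mean) + STD_NORMAL.inv_cdf(u)

def gibbs_probit(Z, y, d, k=3, n_iter=500, seed=0):
    """Gibbs sampler sketch for probit regression with scaled normal priors.

    Z: (tumors x super-genes) design matrix, y: 0/1 labels, d: singular values.
    """
    rng = np.random.default_rng(seed)
    n, p = Z.shape
    beta = np.zeros(p)
    precision = np.ones(p)                             # 1 / tau_i^2
    samples = np.empty((n_iter, p))
    for it in range(n_iter):
        # 1) latent variables given the weights: truncated normals
        means = Z @ beta
        h = np.array([draw_latent(m, yi, rng) for m, yi in zip(means, y)])
        # 2) weights given latent variables and scales: multivariate normal
        cov = np.linalg.inv(Z.T @ Z + np.diag(precision * d**2))
        beta = rng.multivariate_normal(cov @ Z.T @ h, cov)
        # 3) precisions given the weights: gamma (NumPy takes shape and scale)
        precision = rng.gamma((k + 1) / 2.0, 2.0 / (k + d**2 * beta**2))
        samples[it] = beta
    return samples

# Synthetic example: only the first of three predictors carries signal.
data_rng = np.random.default_rng(1)
Z = data_rng.normal(size=(20, 3))
y = (Z[:, 0] + 0.5 * data_rng.normal(size=20) > 0).astype(int)
samples = gibbs_probit(Z, y, d=np.ones(3))
```

Averaging `samples` after a burn-in period gives posterior mean weights; averaging the probit probabilities over the draws gives a predictive probability for a new profile.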
What are the
additional assumptions
that came in by the prior?
• The model cannot be dominated by only a
few super-genes (genes!)
• The diagnosis is done based on global
changes in the expression profiles
influenced by many genes
• The assumptions are neutral with respect
to the individual diagnosis
Which Genes Have Driven
the Prediction?
Gene                          Weight
nuclear factor 3 alpha        0.853
cysteine rich heart protein   0.842
estrogen receptor             0.840
intestinal trefoil factor     0.840
x box binding protein 1       0.835
gata 3                        0.818
ps 2                          0.818
liv1                          0.812
... many many more ...        ...
Thank you!