Harvard Medical School
Transcriptional Diagnosis by
Bayesian Network
Hsun-Hsien Chang and Marco F. Ramoni
Children’s Hospital Informatics Program
Harvard-MIT Division of Health Sciences and Technology
Harvard Medical School
March 17, 2009
Background
• Microarray technology enables profiling expression of
thousands of genes in parallel on a single chip.
• Comparative analysis of gene expression across tissue
states extracts signature genes for disease diagnosis.
• Challenge:
– The number of variables (i.e., genes) is much greater than the
number of observations (i.e., biological samples), inducing the
problem of overfitting.
• Existing methods:
– Gene selection: compute statistics (e.g., t-statistics, SNR,
PCA) of individual genes and select high-ranking genes.
– Classification model: create a classification function of
selected genes.
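The rank-and-cut style of gene selection described above can be sketched as follows. This is a minimal illustration, assuming the signal-to-noise ratio is defined as the difference of class means divided by the sum of class standard deviations; the gene names, expression values, and the helper names `snr` and `rank_genes` are hypothetical.

```python
from statistics import mean, stdev

def snr(values_a, values_b):
    """Signal-to-noise ratio: difference of means over sum of standard deviations."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))

def rank_genes(expr, labels, k):
    """Return the k genes with the largest absolute SNR.

    expr: dict mapping gene name -> list of expression values (one per sample).
    labels: list of 0/1 tissue-state labels, aligned with the samples.
    """
    scores = {}
    for gene, values in expr.items():
        a = [v for v, lab in zip(values, labels) if lab == 1]
        b = [v for v, lab in zip(values, labels) if lab == 0]
        scores[gene] = abs(snr(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data (made up): only G3 clearly separates the two tissue states.
expr = {
    "G1": [1.0, 1.2, 0.9, 1.1, 1.0, 1.3],
    "G2": [5.0, 4.8, 5.1, 5.0, 4.9, 5.2],
    "G3": [0.1, 0.2, 0.0, 2.0, 2.1, 1.9],
    "G4": [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],
}
labels = [0, 0, 0, 1, 1, 1]
top = rank_genes(expr, labels, k=1)
```

The cut-off `k` has to be chosen by hand, and each gene is scored in isolation.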
Proposed Approach
• Issues:
– The assumption of gene independence is inadequate.
– Other genes may be expressed collinearly with the signature.
– Selection and classification are two non-integrated steps;
a cut-off threshold is needed to select high-ranking genes.
• Proposed strategies:
– Adopt a systems biology approach to infer the functional
dependence among genes.
– Use the dependence network for tissue discrimination.
– Integrate gene selection and the classification model in a
Bayesian network framework.
Data Representation by Bayesian Network
• Bayesian networks are directed acyclic graphs where:
– Nodes correspond to random variables.
– Directed arcs encode conditional probabilities of the target
nodes given the source nodes.
[Figure: expression data for tissue states 1 and 2 (rows Pheno, Gene 1, Gene 2, ..., Gene N) represented as network nodes Pheno, G1, G2, ..., GN.]
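A small sketch of how a directed acyclic graph with conditional probability tables defines a joint distribution: the joint probability of an assignment is the product, over nodes, of each node's probability given its parents. The three-node structure, the discretization into "low"/"high", and all numbers are assumptions chosen for illustration.

```python
from itertools import product

# Parents of each node in a toy DAG: Pheno -> G1, Pheno -> G2.
parents = {"Pheno": (), "G1": ("Pheno",), "G2": ("Pheno",)}

# Conditional probability tables: cpt[node][parent_values][value].
cpt = {
    "Pheno": {(): {"state1": 0.5, "state2": 0.5}},
    "G1": {("state1",): {"low": 0.8, "high": 0.2},
           ("state2",): {"low": 0.1, "high": 0.9}},
    "G2": {("state1",): {"low": 0.3, "high": 0.7},
           ("state2",): {"low": 0.6, "high": 0.4}},
}

def joint_probability(assignment):
    """Multiply each node's conditional probability given its parents' values."""
    p = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[node][pa_values][assignment[node]]
    return p

p = joint_probability({"Pheno": "state2", "G1": "high", "G2": "low"})

# The factorized joint sums to one over all assignments.
total = sum(
    joint_probability({"Pheno": ph, "G1": g1, "G2": g2})
    for ph, g1, g2 in product(["state1", "state2"], ["low", "high"], ["low", "high"])
)
```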
Gene Selection by Bayes Factor
[Figure: gene selection by Bayes factor; network over Pheno and genes G1, G2, ..., Gp, Gq, ..., GN before and after selection.]
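The slide does not give the Bayes factor formula. As a hedged sketch of the general idea, the code below compares two models for a single gene: one in which the gene depends on the phenotype and one in which it does not. It assumes expression has been discretized and that each conditional distribution has a uniform Dirichlet prior, so the marginal likelihood is Dirichlet-multinomial; the function names and data are hypothetical.

```python
from collections import Counter
from math import lgamma

def log_marginal(counts, alpha=1.0):
    """Log Dirichlet-multinomial marginal likelihood of category counts."""
    total_alpha = alpha * len(counts)
    out = lgamma(total_alpha) - lgamma(total_alpha + sum(counts))
    for n in counts:
        out += lgamma(alpha + n) - lgamma(alpha)
    return out

def log_bayes_factor(gene, pheno, alpha=1.0):
    """log P(data | gene depends on pheno) - log P(data | gene independent)."""
    states = sorted(set(gene))

    def counts(values):
        c = Counter(values)
        return [c[s] for s in states]

    independent = log_marginal(counts(gene), alpha)
    dependent = sum(
        log_marginal(counts([g for g, p in zip(gene, pheno) if p == label]), alpha)
        for label in sorted(set(pheno))
    )
    return dependent - independent

# Toy data (made up): 4 samples per tissue state.
pheno = [0, 0, 0, 0, 1, 1, 1, 1]
informative = ["low"] * 4 + ["high"] * 4
uninformative = ["low", "high"] * 4

lbf_informative = log_bayes_factor(informative, pheno)
lbf_uninformative = log_bayes_factor(uninformative, pheno)
```

A positive log Bayes factor favors linking the gene to the phenotype; a negative one favors leaving it unlinked, so no hand-picked rank cut-off is needed in this formulation.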
Collinearity Elimination via Network Learning
[Figure: collinearity elimination; network over Pheno and genes G1, G2, Gp, Gq, GN before and after elimination.]
Sample Classification
[Figure: network over Pheno and genes G1, G2, Gp, Gq, GN, with genes colored green or blue.]
• The phenotype variable is
independent of the blue genes, given
the green genes.
• Technically, the green genes form
the Markov blanket of the
phenotype variable, and they are the
signature genes used for phenotype
determination.
• Tissue classification:
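One way such a classification could be carried out is sketched below; this is an illustration under stated assumptions, not the formula from the slide. It assumes a toy network in which the signature genes are discretized children of the phenotype node, so the posterior over tissue states is proportional to the prior times the product of the signature genes' conditional probabilities. The edges, probabilities, and function names are hypothetical.

```python
def markov_blanket(edges, target):
    """Parents, children, and the children's other parents of `target`."""
    parents = {src for src, dst in edges if dst == target}
    children = {dst for src, dst in edges if src == target}
    spouses = {src for src, dst in edges if dst in children and src != target}
    return parents | children | spouses

# Hypothetical structure: Pheno -> G1, Pheno -> Gp, Gp -> Gq, Gq -> GN.
edges = [("Pheno", "G1"), ("Pheno", "Gp"), ("Gp", "Gq"), ("Gq", "GN")]
signature = markov_blanket(edges, "Pheno")

prior = {"state1": 0.5, "state2": 0.5}
# P(gene = "high" | Pheno) for the signature genes (made-up numbers).
p_high = {
    "G1": {"state1": 0.2, "state2": 0.9},
    "Gp": {"state1": 0.7, "state2": 0.4},
}

def posterior(sample):
    """Posterior over tissue states using only the signature genes."""
    weights = {}
    for label, p in prior.items():
        for gene in signature:
            q = p_high[gene][label]
            p *= q if sample[gene] == "high" else 1.0 - q
        weights[label] = p
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

# Gq and GN are outside the blanket, so their values do not affect the result.
post = posterior({"G1": "high", "Gp": "low", "Gq": "high", "GN": "low"})
predicted = max(post, key=post.get)
```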
Algorithm Summary
[Flowchart stages: Gene Selection by Bayes Factor; Collinearity Elimination; Sample Classification; Optimize Performance; Optimize Hyperparameters (sensitivity analysis).]
Discriminate Lung Carcinoma Subtypes
• Adenocarcinoma (AC) and squamous cell carcinoma
(SCC) are major subtypes of lung cancer:
– AC and SCC are distinct in survival, chances of metastasis,
and responses to chemotherapy and targeted therapy.
– Physicians lack confidence in recognizing the subtype correctly
when there are multiple primary carcinomas.
• Training:
– 58 ACs and 53 SCCs.
– 77 genes selected in the network.
– 25 signature genes.
Bayesian Network for Lung Carcinoma
Large-Scale Testing on Independent Samples
• 422 samples (232 ACs and 190 SCCs) aggregated from 7
cohorts (including Caucasians, African-Americans, Chinese).
• Accuracy (AUROC) = 95.2%.
[Figure: ROC curves, sensitivity vs. 1-specificity; Proposed Bayes Net (95.2%).]
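For reference, the area under the ROC curve can be computed without plotting, as the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score. The sketch below uses this pairwise definition; the scores and labels are made up and are not the study's data.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy classifier scores (hypothetical): one positive is ranked below one negative.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
area = auroc(scores, labels)
```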
Comparisons with Other Popular Methods
• Higher classification accuracy.
• Small-sized signature to avoid overfitting.
Method                                         Testing AUROC   p-value   # signature genes
Bayesian Network                               95.2%           ---       25
PCA/LDA                                        91.2%           0.0047    13
PAM (Tibshirani et al., PNAS 2002)             91.0%           0.0014    77
Weighted Voting (Golub et al., Science 1999)   93.4%           0.6240    800
KRT6 Family Characterizes the Lung
Carcinoma Discrimination
• Keratin-6 family genes (KRT6A, KRT6B, KRT6C) are
important for distinguishing lung cancer subtypes.
– Accounting for 95% of
the accuracy of the whole
25-gene signature.
– Located on chromosome
12q12-q13.
– A nonlinear, concave
discriminative surface.
Verification by Chr12q12-q13 Aberrations
• Investigate DNA copy number changes in comparative
genomic hybridization (CGH) array.
– 12 ACs and 13 SCCs from
Vrije University Medical
Center, the Netherlands.
– A dumbbell-shaped discriminative
surface achieves 80%
classification accuracy.
– Treat the average CGH values
of genes occupying q12,
q13, and q12-q13,
respectively, as three
features to construct a
Naïve Bayes classifier.
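A minimal sketch of a Naïve Bayes classifier over three continuous features, assuming each feature is modeled as a class-conditional Gaussian (the slide does not state the feature distributions). The feature values stand in for the three averaged CGH features and are made up; `fit` and `predict` are hypothetical helper names.

```python
from math import log, pi
from statistics import mean, variance

def fit(X, y):
    """Estimate the class prior and per-feature Gaussian parameters for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        cols = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(X),
            "mean": [mean(c) for c in cols],
            "var": [variance(c) for c in cols],
        }
    return model

def predict(model, x):
    """Return the class with the highest log posterior, treating features as independent."""
    def log_post(params):
        total = log(params["prior"])
        for v, m, s2 in zip(x, params["mean"], params["var"]):
            total += -0.5 * log(2 * pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda label: log_post(model[label]))

# Toy training set (made up): columns are the q12, q13, and q12-q13 averages.
X = [
    [0.1, 0.0, 0.2], [0.2, 0.1, 0.1], [0.0, 0.2, 0.0],
    [0.9, 1.0, 1.1], [1.1, 0.8, 1.0], [1.0, 1.2, 0.9],
]
y = ["AC", "AC", "AC", "SCC", "SCC", "SCC"]
model = fit(X, y)
pred_low = predict(model, [0.05, 0.1, 0.1])
pred_high = predict(model, [1.0, 1.0, 1.0])
```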
Conclusion
• Reverse engineer regulatory network
information for tissue classification.
• Adopt a systems biology approach to infer
the gene dependency network.
– Select genes by Bayes factor.
– Eliminate collinearity via network learning.
– Integrate gene selection and classification model
in a single Bayesian network framework.
• Demonstrate the promising translational
value of the systems biology approach in
clinical studies.