Harvard Medical School

Transcriptional Diagnosis by Bayesian Network
Hsun-Hsien Chang and Marco F. Ramoni
Children’s Hospital Informatics Program
Harvard-MIT Division of Health Sciences and Technology
Harvard Medical School
March 17, 2009

Background
• Microarray technology profiles the expression of thousands of genes in parallel on a single chip.
• Comparative analysis of gene expression across tissue states extracts signature genes for disease diagnosis.
• Challenge:
  – The number of variables (i.e., genes) is much greater than the number of observations (i.e., biological samples), which leads to overfitting.
• Existing methods:
  – Gene selection: compute statistics (e.g., t-statistics, SNR, PCA) of individual genes and select the high-ranking genes.
  – Classification model: create a classification function of the selected genes.

Proposed Approach
• Issues:
  – The assumption that genes are independent is inadequate.
  – Other genes may be collinearly expressed with the signature.
  – Selection and classification are two non-integrated steps, and a cut-off threshold is needed to select high-ranking genes.
• Proposed strategies:
  – Adopt a systems biology approach to infer the functional dependence among genes.
  – Use the dependence network for tissue discrimination.
  – Integrate gene selection and the classification model in a Bayesian network framework.

Data Representation by Bayesian Network
• Bayesian networks are directed acyclic graphs where:
  – Nodes correspond to random variables.
  – Directed arcs encode conditional probabilities of the target nodes given the source nodes.
[Figure: expression of Gene 1, Gene 2, ..., Gene N in tissue state 1 and tissue state 2, represented by network nodes Pheno, G1, G2, ..., GN]

Gene Selection by Bayes Factor
[Figure: network diagram of Pheno and genes G1, G2, ..., Gp, Gq, ... illustrating gene selection by Bayes factor]
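The slides do not state the likelihood model behind the Bayes factor, so the following is only a minimal sketch under stated assumptions: expression values are discretized into a small number of levels, and each model is scored with a Dirichlet-multinomial marginal likelihood. The function names (`log_marginal`, `log_bayes_factor`), the prior strength `alpha`, and the toy data are hypothetical and not taken from the presentation. The sketch compares "gene depends on phenotype" against "gene is independent of phenotype"; a positive log Bayes factor favors keeping the gene.

```python
from collections import Counter
from math import lgamma


def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of categorical counts under a
    symmetric Dirichlet(alpha) prior (assumed scoring model)."""
    k = len(counts)
    n = sum(counts)
    out = lgamma(k * alpha) - lgamma(k * alpha + n)
    for c in counts:
        out += lgamma(alpha + c) - lgamma(alpha)
    return out


def log_bayes_factor(gene_levels, phenotypes, n_levels, alpha=1.0):
    """Log Bayes factor of 'gene depends on phenotype' versus
    'gene independent of phenotype'.

    gene_levels: discretized expression, integers in range(n_levels)
    phenotypes:  class label of each sample (e.g. "AC" or "SCC")
    """
    # Independent model: one distribution over levels for all samples.
    pooled = Counter(gene_levels)
    log_indep = log_marginal([pooled.get(v, 0) for v in range(n_levels)], alpha)

    # Dependent model: a separate distribution for each phenotype class.
    log_dep = 0.0
    for cls in set(phenotypes):
        sub = Counter(g for g, p in zip(gene_levels, phenotypes) if p == cls)
        log_dep += log_marginal([sub.get(v, 0) for v in range(n_levels)], alpha)

    return log_dep - log_indep


# Toy data (hypothetical): 4 AC and 4 SCC samples, two expression levels.
labels = ["AC"] * 4 + ["SCC"] * 4
separating_gene = [0, 0, 0, 0, 1, 1, 1, 1]    # tracks the phenotype
unrelated_gene = [0, 1, 0, 1, 0, 1, 0, 1]     # ignores the phenotype

print(log_bayes_factor(separating_gene, labels, n_levels=2))  # positive
print(log_bayes_factor(unrelated_gene, labels, n_levels=2))   # negative
```

Because the comparison is between two models rather than a ranking of per-gene statistics, the natural decision boundary is a log Bayes factor of zero, which is one way to avoid the arbitrary cut-off threshold listed under Issues.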

Collinearity Elimination via Network Learning
[Figure: network over Pheno, G1, G2, Gp, Gq, ..., GN before and after collinearity elimination]

Sample Classification
[Figure: network over G1, G2, Pheno, Gp, Gq, ..., GN with genes colored green or blue]
• The phenotype variable is independent of the blue genes, given the green genes.
• Technically, the green genes form the Markov blanket of the phenotype variable, and they are the signature genes used to determine the phenotype.
• Tissue classification: [equation missing in source]

Algorithm Summary
[Flowchart: Gene Selection by Bayes Factor; Collinearity Elimination; Sample Classification; Optimize Performance; Optimize Hyperparameters (sensitivity analysis)]

Discriminate Lung Carcinoma Subtypes
• Adenocarcinoma (AC) and squamous cell carcinoma (SCC) are major subtypes of lung cancer:
  – AC and SCC differ in survival, likelihood of metastasis, and response to chemotherapy and targeted therapy.
  – Physicians lack confidence in recognizing the subtype correctly when there are multiple primary carcinomas.
• Training:
  – 58 ACs and 53 SCCs.
  – 77 genes selected in the network.
  – 25 signature genes.

Bayesian Network for Lung Carcinoma
[Figure: Bayesian network for lung carcinoma]

Large-Scale Testing on Independent Samples
• 422 samples (232 ACs and 190 SCCs) aggregated from 7 cohorts (including Caucasians, African-Americans, and Chinese).
• Accuracy: 95.2% AUROC.
[Figure: ROC curves, sensitivity versus 1-specificity; Proposed Bayes Net (95.2%)]

Comparisons with Other Popular Methods
• Higher classification accuracy.
• Small signature to avoid overfitting.

Method                                         Testing AUROC   p-value   # signature genes
Bayesian Network                               95.2%           ---       25
PCA/LDA                                        91.2%           0.0047    13
PAM (Tibshirani et al., PNAS 2002)             91.0%           0.0014    77
Weighted Voting (Golub et al., Science 1999)   93.4%           0.6240    800

KRT6 Family Characterizes the Lung Carcinoma Discrimination
• Keratin-6 family genes (KRT6A, KRT6B, KRT6C) are important for distinguishing lung cancer subtypes.
  – They account for 95% of the accuracy of the whole 25-gene signature.
  – They are located on chromosome 12q12-q13.
  – Their discriminative surface is nonlinear and concave.

Verification by Chr12q12-q13 Aberrations
• Investigate DNA copy number changes using comparative genomic hybridization (CGH) arrays.
  – 12 ACs and 13 SCCs from Vrije University Medical Center, the Netherlands.
  – The average CGH values of genes occupying q12, q13, and q12-13 are treated as three features to construct a Naïve Bayes classifier.
  – A dumbbell-shaped discriminative surface achieves 80% classification accuracy.

Conclusion
• Reverse engineer regulatory network information for tissue classification.
• Adopt a systems biology approach to infer the gene dependency network.
  – Select genes by Bayes factor.
  – Eliminate collinearity via network learning.
  – Integrate gene selection and the classification model in a single Bayesian network framework.
• Demonstrate the promising translational value of the systems biology approach in clinical studies.
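
The Sample Classification slide identifies the signature genes as the Markov blanket of the phenotype variable. As a rough illustration of that definition only (not of the network learning itself), the sketch below reads the Markov blanket off a directed acyclic graph: the parents of the target, its children, and the other parents of those children. The `markov_blanket` function, the dictionary representation, and the toy network (including the extra node `Gx`) are hypothetical.

```python
def markov_blanket(parents, target):
    """Return the Markov blanket of `target` in a DAG.

    parents: dict mapping each node to the set of its parent nodes
             (assumed representation of the learned network).
    """
    blanket = set(parents.get(target, set()))        # parents of target
    children = {node for node, ps in parents.items() if target in ps}
    blanket |= children                              # children of target
    for child in children:
        blanket |= parents[child]                    # co-parents of children
    blanket.discard(target)
    return blanket


# Toy network (hypothetical): Pheno -> G1 -> Gp, Pheno -> G2 -> Gq, Gx -> G2.
toy_parents = {
    "Pheno": set(),
    "G1": {"Pheno"},
    "G2": {"Pheno", "Gx"},
    "Gp": {"G1"},
    "Gq": {"G2"},
    "Gx": set(),
}

signature = markov_blanket(toy_parents, "Pheno")
print(sorted(signature))
```

In this toy network Gp and Gq fall outside the blanket, which mirrors the slide's statement that the phenotype is independent of the remaining genes once the blanket genes are known.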