Mining of Microarray, Proteomics, and Clinical Data for Improved Identification of Chronic Fatigue Syndrome
Zoran Obradovic, Hongbo Xie, Slobodan Vucetic
Information Science and Technology Center, Temple University, Philadelphia

Biomarker Identification
Objective: Select a small number of informative attributes (genes; proteins)
Useful for:
• disease diagnosis
• disease progress monitoring
• evaluation of treatment effects, etc.
Challenges:
• many irrelevant attributes
• uncertainty due to small sample size vs. number of attributes
• lack of replicates

Approaches to Biomarker Identification
• Select significantly differentially expressed genes (in microarray data), or the most discriminative mass-charge peaks (in proteomics data)
• Measure the difference among classes of samples using
  - statistical tests: t-test, ANOVA, and non-parametric tests
  - data mining procedures: SVM, neural networks, etc.
Limitations
• Very noisy data is subject to false discoveries
• Relationships among selected attributes are often ignored
• For many diseases, multiple data resources are available; however, how to use them together is often unclear

Our Approach
Motivation: For various diseases, the most discriminative genes are likely to correspond to a limited set of biological functions or pathways
Hypothesis: Focusing on key functional expression patterns could result in improved accuracy compared to analyzing individual gene expression readings
Approach: Exclude genes whose biological properties deviate from those of the other selected genes

Challenges of Biomarker Identification for Chronic Fatigue Syndrome (CFS)
• CFS diagnosis is less accurate than for some other diseases (e.g.
cancer)
• The pathophysiology of CFS is insufficiently understood
• Diagnosis of CFS depends heavily on clinical practice
• Patients’ responses are often subjective
• There are no standard criteria or laboratory techniques to reduce the risk of malpractice

CFS Data
CFS Microarray data
• 79 arrays representing 39 clinically identified CFS samples and 40 non-CFS samples
• 20,160 genes for each sample
• Using the SOURCE database (http://source.stanford.edu), 13,213 genes were annotated by 4,110 unique GO terms
CFS Proteomics data
• 65 samples representing 33 CFS and 32 non-CFS samples
• Each sample was profiled under 48 conditions, with factors such as fractionation, protein-chip surfaces, and binding and elution conditions
CFS Clinical data
• 227 samples representing 43 CFS, 60 NF, and 123 others; CFS/NF are defined by the Empiric attribute
• Each sample contained 85 attributes

Task 1: Identifying Biomarker Genes from CFS Microarray Data
Objective: Identify a robust set of genes discriminating patients (CFS) from normal subjects (NF)
Method:
• Identify a Subset of Genes (SG) significantly different between CFS and NF in the training sample (use a non-parametric statistical test)
• Select a subset of SG annotated with a specific function (use domain knowledge of GO)
• Evaluate the method (use leave-one-out cross-validation)

Identifying a Subset of Significant Genes by Kruskal-Wallis (KW) Test
• For each gene, its expression values for CFS samples and NF samples are compared; a p-value is obtained by comparison to a random population
• A gene with a p-value less than a threshold is selected as significant (SG)
• Traditional approaches use those SG as markers to discriminate classes of samples.
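The KW selection step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the p-value uses the standard chi-square approximation with one degree of freedom (the slides do not say how the p-value was computed), and the names `average_ranks`, `kruskal_two_groups`, `select_significant_genes`, and the toy expression values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def kruskal_two_groups(a, b):
    """Kruskal-Wallis H statistic and p-value for two groups.

    Assumption: the p-value comes from the chi-square approximation (df = 1).
    """
    if not a or not b:
        raise ValueError("both groups must be non-empty")
    values = list(a) + list(b)
    n = len(values)
    ranks = average_ranks(values)
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = 12.0 / (n * (n + 1)) * (rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b)) - 3 * (n + 1)
    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:  # all values identical: no evidence of a difference
        return 0.0, 1.0
    h = max(h / correction, 0.0)
    # Survival function of chi-square with 1 degree of freedom.
    return h, math.erfc(math.sqrt(h / 2))


def select_significant_genes(expression, labels, threshold=0.05):
    """Return the genes whose CFS vs. NF p-value is below the threshold."""
    selected = []
    for gene, values in expression.items():
        cfs = [v for v, lab in zip(values, labels) if lab == "CFS"]
        nf = [v for v, lab in zip(values, labels) if lab == "NF"]
        _, p_value = kruskal_two_groups(cfs, nf)
        if p_value < threshold:
            selected.append(gene)
    return selected


# Toy data (hypothetical): g1 separates the two classes cleanly, g2 does not.
labels = ["CFS"] * 4 + ["NF"] * 4
expression = {
    "g1": [1, 2, 3, 4, 5, 6, 7, 8],
    "g2": [1, 5, 2, 6, 3, 7, 4, 8],
}
significant = select_significant_genes(expression, labels, threshold=0.05)
```

With only a handful of samples per class the chi-square approximation is rough; a permutation-based p-value would be a reasonable alternative.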
However:
• A large proportion of such genes are irrelevant; applying false discovery rate control won’t help much in most cases
• Functional correlations among those genes are ignored

Selecting Significant Functions by Hypergeometric Test
Given: A set of k genes selected by the KW test
Objective: Determine whether a given term $GO_i$ is overrepresented in the selection
The idea: If the gene selection were random, the number $X_i$ of selected genes annotated with $GO_i$ would follow a hypergeometric distribution
Approach: The significance of $GO_i$ is measured by its p-value, $p(GO_i) = P(X \ge X_i)$, where $X \sim H(K, k, k_i)$, $K$ is the total number of genes (so $k/K$ is the selection probability for a random gene), and $k_i$ is the number of genes annotated with $GO_i$

KW Statistical Attribute Selection
• All genes selected by the KW test (those with a p-value below the threshold) are used as attributes in classification

Knowledge Based Selection
• TopGO: Select the GO term with the smallest p-value, GO*. Use only genes selected by the KW test and annotated with GO*
• nGO: Select the GO terms with the n smallest p-values, GOn. Use only genes selected by the KW test and annotated with GOn
• AllTopGO: Use all genes annotated with GO*
• AllSignificantGO: Use all significant genes annotated with one of the significant GO terms

Comparison of 5 Attribute Selection Methods on CFS Data
• Using domain knowledge in the attribute selection procedure improved the prediction accuracy
• TopGO was the most accurate domain knowledge based attribute selection method
• Decision tree classifiers were less accurate than the corresponding Support Vector Machines (SVM)

Comparison Details of 5 Attribute Selection Methods for p-value = 0.05
Selection Approach    Decision Trees    SVM
KW Statistical        53%               53%
TopGO                 59%               72%
10GO                  54%               58%
AllTopGO              56%               61%
AllSignificantGO      53%               60%

Further Comparison of KW Statistical vs.
TopGO Selection

Accuracy (%) and number of selected attributes by KW test threshold:
KW Test      KW Statistical                            TopGO
threshold    SVM    Decision Tree    Attributes        SVM    Decision Tree    Attributes
0.001        58     53               41                54     46               1
0.01         48     56               257               48     48               3
0.05         53     53               1296              72     59               17
0.2          49     51               3761              56     62               19

• Overall, knowledge based TopGO selection was the most accurate (SVM: 58% vs. 72%; Decision Tree: 55% vs. 62%)
• For a very small threshold, KW Statistical selection was slightly more accurate
• However, knowledge based TopGO selection always used far fewer attributes

Comparison Using Same Number of Attributes (Statistical vs. TopGO by SVM)
[Figure: accuracy vs. number of selected features (1, 2, 3, 16, 17, 19); series: domain dependent feature selection, classical feature selection]

Comparison Using Same Number of Attributes (Statistical vs. TopGO by Decision Tree)
[Figure: accuracy vs. number of selected features (1, 2, 3, 16, 17, 19); series: domain dependent feature selection, classical feature selection]

Comparison of SVM vs. Decision Trees for TopGO Attribute Selection
[Figure: accuracy vs. number of selected features (1, 2, 3, 16, 17, 19); series: SVM, Decision Tree]

Most Overrepresented GO Terms among Significantly Differentially Expressed Genes in CFS (by TopGO)
Gene Ontology ID    Function/Process Name                    p-value    Number of Selected Genes
GO:0006397          mRNA processing                          0.0016     10
GO:0008203          cholesterol metabolism                   0.0021     7
GO:0003779          actin binding                            0.0027     31
GO:0015629          actin cytoskeleton                       0.0078     14
GO:0016564          transcriptional repressor activity       0.0105     9
GO:0005515          protein binding                          0.0136     124
GO:0007187          G-protein signaling                      0.0153     5
GO:0008009          chemokine activity                       0.0153     5
GO:0007229          integrin-mediated signaling pathway      0.0155     9
GO:0007517          muscle development                       0.016      14
* The top 2 functions are consistent with a previously reported result on CFS (Whistler, T. et al., Transl Med.
2003; 1: 10)

mRNA Processing Genes Identified as Potential Biomarkers
Gene Name                                          Gene ID      Symbol      UniGene      P-value
Debranching enzyme homolog 1                       AK000116     DBR1        Hs.477700    0.0086
Cleavage and polyadenylation specific factor 6     NM_007007    CPSF6       Hs.369606    0.0096
Small nuclear ribonucleoprotein polypeptide N      AF101044     SNRPN       Hs.525700    0.0131
Hypothetical protein                               BC006407     MGC14151    Hs.333414    0.0186
Heterogeneous nuclear ribonucleoprotein L-like     BC008217     HNRPLL      Hs.445497    0.0212
TRNA splicing endonuclease 2 homolog               AK074794     SEN2L       Hs.335550    0.0223
Poly(A) polymerase beta                            AF218840     PAPOLB      Hs.487409    0.0333
ELAV-like 4 (Hu antigen D)                         BC036071     ELAVL4      Hs.213050    0.0376
ER to nucleus signaling 1                          AF059198     ERN1        Hs.133982    0.0395
Nuclear ribonucleoprotein polypeptide              J04564       SNRPB       Hs.83753     0.0444
Nuclear RNA export factor 1                        AF112880     NXF1        Hs.523739    0.0499

Using Only Significant Genes Associated with a Given Function
• Several key functions could well discriminate Chronic Fatigue Syndrome from the non-fatigue population
• How to select the best function(s) for out-of-sample prediction is still a challenge
• The most overrepresented functions identified by our analysis were the most discriminative

Accuracy by Using Only Significant Genes Associated with a Given Function (p-value < 0.05)
Function name                        Category          Number of selected attributes    Accuracy (%)
hydrolase activity                   Function          54                               77
cholesterol metabolism               Process           7                                75
lyase activity                       Function          6                                75
GTPase activator activity            Function          12                               73
ATP binding                          Function          84                               72
mesoderm development                 Process           4                                72
sarcoglycan complex                  Cell component    3                                72
telomeric DNA binding                Function          2                                72
chromatin binding                    Function          4                                71
steroid hormone receptor activity    Function          8                                71
mRNA processing                      Process           10                               70

Evaluation on Additional Data: Central Nervous System (CNS)
CNS Data
Source: "Prediction of Central Nervous System Embryonal Tumour Outcome Based on Gene Expression", Letters to Nature, Nature, 415:436-442, January 2002.
http://www-genome.wi.mit.edu/mpr/CNS/
Description: The data set contains 60 patient samples; 21 are survivors and 39 are failures. There are 7129 genes in the dataset.

Results on CNS Data
KW Test      Traditional Accuracy (%)    TopGO Accuracy (%)
threshold    SVM    Decision Tree        SVM    Decision Tree
0.001        47     47                   58     50
0.01         43     27                   45     43
0.05         58     32                   63     60
0.1          57     32                   62     57
0.2          57     32                   60     62
• Findings were consistent with the CFS analysis

Task 2: Proteomics Based Approach to Diagnostics

Proteomics Based CFS Data Analysis
Overall data preprocessing protocol:
• Baseline correction
• Peak alignment
• Spectra normalization
• Smooth spectrogram
• Normalize using QC samples: for each test sample at every condition, its m/z value is divided by the control QC m/z value, and the log is then taken (a relative ratio is obtained for each test sample)

Proteomics Based CFS Classification Procedure
• Used leave-one-sample-out cross-validation to train and test the data
• The prediction for replicates of the same sample is obtained by voting, with ties labeled as CFS
• Kruskal-Wallis analysis of ranks and the Median test are applied to all mass/charge values. P-values are ranked, and peaks with a p-value less than a threshold are selected as attributes
• A p-value threshold of 0.05 resulted in the selection of over 2000 attributes
• Trained an SVM classifier with the selected attributes and evaluated it on discriminating out-of-sample test data

Result of Proteomics Based CFS Classification
• The accuracy of our method in separating CFS samples from NF samples was just slightly better than that of a trivial predictor
• IMAC chips provided the best overall results. The accuracy of an ensemble of IMAC classifiers by leave-one-sample-out cross-validation was 51%.
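The evaluation protocol of Task 2 (leave-one-sample-out cross-validation with replicate voting, ties labeled as CFS) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the helper names `vote`, `leave_one_sample_out`, and `cross_validated_accuracy` are hypothetical, and the classifier is abstracted as a caller-supplied `fit_predict` function standing in for the SVM trained on the selected peaks.

```python
from collections import Counter


def vote(replicate_predictions, tie_label="CFS"):
    """Majority vote over the replicate predictions of one sample.

    Ties are labeled as CFS, as stated in the procedure.
    """
    counts = Counter(replicate_predictions)
    n_cfs, n_nf = counts["CFS"], counts["NF"]
    if n_cfs == n_nf:
        return tie_label
    return "CFS" if n_cfs > n_nf else "NF"


def leave_one_sample_out(sample_ids):
    """Yield (held_out_id, train_rows, test_rows).

    All replicates of the held-out sample go to the test set together,
    so no replicate of a test sample is seen during training.
    """
    for held_out in sorted(set(sample_ids)):
        test_rows = [i for i, s in enumerate(sample_ids) if s == held_out]
        train_rows = [i for i, s in enumerate(sample_ids) if s != held_out]
        yield held_out, train_rows, test_rows


def cross_validated_accuracy(sample_ids, labels, fit_predict):
    """Fraction of samples whose voted prediction matches the true label.

    fit_predict(train_rows, test_rows) is a hypothetical callback that trains
    on train_rows and returns one predicted label per row in test_rows.
    """
    correct = 0
    total = 0
    for _, train_rows, test_rows in leave_one_sample_out(sample_ids):
        predicted = vote(fit_predict(train_rows, test_rows))
        truth = labels[test_rows[0]]  # replicates share the sample's label
        correct += int(predicted == truth)
        total += 1
    return correct / total


# Toy run (hypothetical data): a trivial predictor that always answers "NF".
sample_ids = ["a", "a", "b", "b"]
labels = ["CFS", "CFS", "NF", "NF"]
accuracy = cross_validated_accuracy(
    sample_ids, labels, lambda train_rows, test_rows: ["NF"] * len(test_rows)
)
```

Attribute selection (the KW and Median tests) would belong inside `fit_predict`, applied to the training rows only, so that the held-out sample does not influence which peaks are chosen.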
Task 3: Combining Microarray Data and Proteomics Data for CFS Diagnoses (SVM)
• Used integrated data of 38 subjects (20 CFS and 18 non-CFS samples) containing both proteomics and microarray data
• Proteomics and microarray-based CFS predictions agreed for 50% of the sample (19 subjects)
• When the two classification methods agreed, the accuracy of the combined approach improved significantly, to 79%

Task 4: Analysis of Clinical CFS Data
Motivation: The reason for the low prediction accuracy could lie in the CFS clinical data attributes
Objective: Detect potential factors that explain the disagreement between the microarray and proteomics CFS classifiers
Approach: Applied ANOVA to each attribute for two groups of clinical data (subjects for whom the microarray and proteomics predictions agree vs. the remaining subjects, for whom they disagree)

Result of Clinical Data Analysis by ANOVA
Three of the clinical data classifying attributes were found to be significantly different between the two groups:
• Mental health
• Physical fatigue
• General fatigue
Low accuracy of CFS diagnosis could be partially blamed on the clinical definition of the disease

Conclusions
• Complementing statistical gene selection with domain knowledge to focus on the most significantly overrepresented GO terms was beneficial for
  - improving accuracy
  - identifying a much smaller number of attributes
• Integrating information from multiple sources (microarray, proteomics, and clinical data) could lead to improved understanding and diagnosis of CFS

Thank You!
More information: http://www.ist.temple.edu
Contact: Zoran Obradovic, director IST Center, Temple University 215-204-6265 [email protected]