Investigating the predictability of essential genes across distantly related organisms using an integrative approach
• Department: Division of Biomedical Informatics
• Author: Long J. Lu
• Nucleic Acids Research, September 24, 2010
Reporter: 华红丽

Abstract

This paper presents a machine learning-based integrative approach that reliably transfers essential gene annotations between distantly related bacteria. The authors focused on four bacterial species with well-characterized essential genes, E. coli K-12 (EC), P. aeruginosa PAO1 (PA), A. baylyi ADP1 (AB) and B. subtilis (BS), and tested transferability between three pairs among them.

They trained the classifier to learn traits associated with essential genes in one organism and applied it to make predictions in the other. The predictions were evaluated by their agreement with the known essential genes in the target organism, and ten-fold cross-validation within the same organism yielded AUC scores of about 0.9. This is the first report that gene essentiality can be reliably predicted using features trained and tested in a distantly related organism.

Introduction

• Any functional microorganism must contain a minimal set of essential genes that are required for survival and for carrying out desired functions.
• Experimental identification of essential genes has shortcomings, such as high cost and many years of research.
• Many researchers attempt to identify essential genes by homology mapping, which is limited to the orthologs conserved between species.
• The approach presented here identifies relevant features of essential genes and makes predictions using a weighted combination of hallmark features.

Materials and methods

Data sources:
• E. coli K-12 (EC): sequence data from the Comprehensive Microbial Resource (CMR) database; 302 essential genes from the PEC database
• P. aeruginosa PAO1 (PA): sequence data and 678 essential genes from the Pseudomonas database
• A. baylyi ADP1 (AB): sequence data and 499 essential genes from the Magnifying Genomes database
• B. subtilis (BS): sequence data and 192 essential genes from the Microbial Genome Database
• All gene expression data: NCBI GEO and ArrayExpress, as well as Gasch et al.

Homology mapping by reciprocal best hit (RBH), shown here between EC and PA:
1. Query an ORF_i in PA against all known ORFs in EC with Blastp (E-value threshold of 10^-5) to yield the set of hits {W}.
2. Query the hit with the lowest E-value in {W} (ORF_j) against all ORFs in PA to yield the set of hits {Y}.
3. (ORF_i, ORF_j) are considered putative orthologs if ORF_i is the hit in {Y} with the lowest E-value.
Putative orthologs must also meet two strict criteria.

Reference features (in the Available column, E, P, A and B denote EC, PA, AB and BS):

| Feature | Name of feature | Category | Data type | Available | Rank by nomogram |
|---|---|---|---|---|---|
| Aromo | Aromaticity score | A | Real | E/P/A/B | 11 |
| A3s | Base composition A | A | Real | E/P/A/B | 17 |
| C3s | Base composition C | A | Real | E/P/A/B | 14 |
| G3s | Base composition G | A | Real | E/P/A/B | 24 |
| T3s | Base composition T | A | Real | E/P/A/B | 19 |
| CAI | Codon adaptation index | A | Real | E/P/A/B | 2 |
| CBI | Codon bias index | A | Real | E/P/A/B | 3 |
| Fop | Frequency of optimal codons | A | Real | E/P/A/B | 4 |
| Nc | Effective number of codons | A | Real | E/P/A/B | 5 |
| L_sym | Frequency of synonymous codons | A | Integer | E/P/A/B | 7 |
| L_aa | Length in amino acids | A | Integer | E/P/A/B | 8 |
| GC | GC content | A | Real | E/P/A/B | 13 |
| GC3s | GC content at 3rd position of synonymous codons | A | Real | E/P/A/B | 20 |
| Gravy | Hydrophobicity score | A | Real | E/P/A/B | 22 |
| Cytoplasm | Subcellular localization: cytoplasm | B | Boolean | E/P/A/B | 9 |
| Extracellular | Subcellular localization: extracellular | B | Boolean | E/P/A/B | 15 |
| Inner | Subcellular localization: inner membrane | B | Boolean | E/P/A | 21 |
| Outer | Subcellular localization: outer membrane | B | Boolean | E/P/A | 25 |
| Periplasm | Subcellular localization: periplasm | B | Boolean | E/P/A | 23 |
| ExpAA | Expected number of amino acids in helices | B | Real | E/P/A/B | 28 |
| First60 | Expected number of AAs in helices in first 60 AAs | B | Real | E/P/A/B | 26 |
| PredHel | Number of predicted TM helices | B | Integer | E/P/A/B | 27 |
| PHYS | Phylogenetic score | B | Real | E/P/A/B | 6 |
| PA | Paralogy | B | Boolean | E/P/A/B | 16 |
| DES | Domain enrichment score | B | Real | E/P/A/B | 1 |
| FLU | Fluctuation | C | Real | E/P/B | 18 |
| CEH | Coexpression network hubs | C | Boolean | E/P/B | 12 |
| CEB | Coexpression network bottlenecks | C | Boolean | E/P/B | 10 |

Calculating the DES (domain enrichment score):
The score is defined in terms of a domain's occurrence frequency in the essential and non-essential data sets and the sizes of those two data sets.

Feature evaluation and selection followed three criteria:
1. The features should be easily obtained and available for most microorganisms.
2. The features should have high predictive power for gene essentiality. A Naïve Bayes analysis was run and all features were ranked by the length of the log-odds ratio range they cover.
3. The selected features should minimize biological redundancy.

Classifier design:
Four classifiers were used to train and test the model: (i) a Naïve Bayes classifier; (ii) a logistic regression model; (iii) a C4.5 decision tree; (iv) CN2 rule induction. The best performance was obtained by combining the outputs of these diverse classifiers using an unweighted average.

Training and testing set preparation:
• Each gene was assigned a Boolean value for its essentiality (1 = essential; 0 = non-essential).
• 10-fold cross-validation was used.
• Control training set (randomization test): essential labels were randomly assigned among all E. coli genes, keeping the number of random 'essential genes' equal to the number of true essential genes, and this set was run through the same training and testing framework.
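The unweighted averaging of classifier outputs and the randomized control set described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the per-gene scores and the seed are all hypothetical, and it assumes each classifier outputs a probability of essentiality per gene.

```python
import random

def unweighted_average(prob_lists):
    """Average per-gene essentiality probabilities across classifiers.

    prob_lists holds one list of probabilities per classifier, all in the
    same gene order. Every classifier gets the same weight.
    """
    n_models = len(prob_lists)
    n_genes = len(prob_lists[0])
    if any(len(p) != n_genes for p in prob_lists):
        raise ValueError("all classifiers must score the same genes")
    return [sum(p[i] for p in prob_lists) / n_models for i in range(n_genes)]

def randomized_labels(labels, seed=0):
    """Control set: shuffle which genes carry the 'essential' label.

    Shuffling keeps the number of 1s unchanged, so the random set has as
    many 'essential genes' as the true set.
    """
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Hypothetical scores for four genes from the four classifier types.
naive_bayes = [0.90, 0.20, 0.60, 0.10]
logistic = [0.80, 0.30, 0.50, 0.20]
c45_tree = [1.00, 0.00, 0.50, 0.00]
cn2_rules = [0.70, 0.10, 0.80, 0.10]

combined = unweighted_average([naive_bayes, logistic, c45_tree, cn2_rules])
print([f"{score:.2f}" for score in combined])

# Hypothetical true labels (1 = essential, 0 = non-essential) and a control.
true_labels = [1, 0, 1, 0]
control_labels = randomized_labels(true_labels)
```

A classifier trained on `control_labels` should perform near chance, which is what makes the randomization test a useful baseline for the real labels.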
Results

Cross-validation of the classifier using the EC essential gene set:
• Ten-fold cross-validation was run on the EC essential gene data set.
• 13 features were selected in EC by nomogram.

Cross-validation and prediction between EC and AB (10 features were selected for EC and AB):
(A) EC – EC: AUC = 0.93, PPV = 0.70
(B) EC – AB: AUC = 0.80, PPV = 0.81
(C) AB – AB
(D) AB – EC: AUC = 0.89, PPV = 0.43
Here PPV = TP / (TP + FP).

Why is the PPV low for AB – EC?
The AB data set contained about 100 genes associated with biosynthesis functions that are needed for survival only on minimal media and are not essential on rich media. The authors removed 82 biosynthesis-associated genes from the AB essential gene set, and the refined data set achieved substantially better precision (PPV = 0.53).

Prediction of essential genes between E. coli and the other organisms:

| Pair | AUC | PPV |
|---|---|---|
| EC – PA | 0.69 | 0.57 |
| PA 678 – EC | 0.79 | 0.41 |
| PA 335 – EC | 0.82 | 0.47 |
| EC – BS | 0.80 | 0.54 |
| BS – EC | 0.86 | 0.48 |

Figure: precision of predictions from EC to the three target organisms.

Figures: comparison with homology mapping.

Conclusion

Ten-fold cross-validation in the four organisms gave AUC scores of about 0.9, suggesting that gene essentiality, albeit a complex property, is highly predictable by learning the characteristics that underlie it. The authors identified domain enrichment, which had not been considered in previous studies, as the strongest feature.
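For reference, the two metrics quoted throughout the Results, PPV = TP / (TP + FP) and AUC, can be computed as in the sketch below. The labels, scores and the 0.5 decision threshold are hypothetical, and the AUC is computed with the standard pairwise-ranking definition rather than with whatever tool the authors used.

```python
def ppv(labels, calls):
    """Positive predictive value: TP / (TP + FP)."""
    tp = sum(1 for y, c in zip(labels, calls) if y == 1 and c == 1)
    fp = sum(1 for y, c in zip(labels, calls) if y == 0 and c == 1)
    if tp + fp == 0:
        raise ValueError("no positive predictions, PPV is undefined")
    return tp / (tp + fp)

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen essential gene scores
    higher than a randomly chosen non-essential gene; ties count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one gene of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical target-organism genes: 1 = known essential, 0 = non-essential.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Hypothetical threshold: call a gene essential when its score is >= 0.5.
calls = [1 if s >= 0.5 else 0 for s in scores]

print(f"PPV = {ppv(labels, calls):.2f}, AUC = {auc(labels, scores):.2f}")
```

AUC does not depend on a threshold while PPV does, which is why a pair such as AB – EC can show a high AUC together with a low PPV.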
Conclusion (continued)

When this method is used to transfer essentiality between distantly related organisms, the accuracy of the predictions can be affected by four factors:
1. The essential gene data set on which the classifiers are trained should be of high quality.
2. Essentiality should be transferred under the same or highly similar growth conditions.
3. The evolutionary distance seems to play an important role in the accuracy of predictions.
4. The prediction also depends on the availability of features with a similar distribution between organisms.

Comparison with Shiheng Tao's work (2013, BMC Genomics)

Tao proposed the feature-based weighted Naïve Bayes model (FWM), which can address the impact of multicollinearity among gene features and of feature divergence between species.

FWM in summary: (1) select proper features; (2) apply a machine learning algorithm (Naïve Bayes); (3) weight each feature using logistic regression and a genetic algorithm.

They applied FWM to reciprocally predict essential genes between and within 21 species and compared its performance with that of other models, including SVM, the Naïve Bayes model (NBM), and the logistic regression model (LRM).

Selected features: (table shown on slide)

Flow chart: (figure shown on slide)

Figure: SCE–SCE and SCE–SPO panels (generated by randomly selecting 20%).

Comparison of FWM with LRM, NBM, and SVM:
Four AUC matrices among the 21 species are produced using the four methods. The AUC scores (m_ij) in the same position of the four matrices are then sorted and replaced with markers (1, 2, 3, 4).

Conclusion

Both methods are based on a set of features (with some differences between the two sets), both use machine learning algorithms, and both assess prediction accuracy with ROC curves.
The main difference between Lu and Tao is that the former predicts essential genes by combining the outputs of four diverse classifiers with an unweighted average, whereas the latter uses a single improved Bayesian algorithm (FWM), which achieved better accuracy than the three other models it was compared with.

Thank you!