Investigating the predictability of essential
genes across distantly related organisms
using an integrative approach
• Department: Division of Biomedical Informatics
• Author: Long J. Lu
• Nucleic Acids Research, September 24, 2010
Reporter: 华红丽
Abstract
This paper presents a machine learning-based integrative approach
that reliably transfers essential gene annotations between distantly
related bacteria.
They focused on four bacterial species (E. coli K-12 (EC),
P. aeruginosa PAO1 (PA), A. baylyi ADP1 (AB), B. subtilis (BS)) that have
well-characterized essential genes, and tested the transferability
between three pairs among them.
They trained the classifier to learn traits associated with essential
genes in one organism, and applied it to make predictions in the
other. The predictions were then evaluated by examining their
agreement with the known essential genes in the target organism.
Ten-fold cross-validation within the same organism yielded
AUC scores of about 0.9.
This is the first to report that gene essentiality can be reliably
predicted using features trained and tested in a distantly related
organism.
Introduction
• Any functional microorganism must contain a minimal set of essential
genes that are required for survival and for carrying out desired functions.
• Experimental identification of essential genes has shortcomings,
such as high cost and many years of research.
• Many researchers attempt to identify essential genes by relying on
homology mapping, which is limited to the conserved orthologs
between species.
• This approach identifies relevant features of essential genes and
makes predictions using a weighted combination of hallmark features.
Materials and methods
Data sources:
E. coli K-12 (EC):
sequence data: Comprehensive Microbial Resource (CMR) database
302 essential genes: PEC database
P. aeruginosa PAO1 (PA):
sequence data & 678 essential genes: Pseudomonas database
A. baylyi ADP1 (AB):
sequence data & 499 essential genes: Magnifying Genomes database
B. subtilis (BS):
sequence data & 192 essential genes: Microbial Genome Database
All gene expression data: NCBI GEO, ArrayExpress, as well as from
Gasch et al.
Homology mapping by reciprocal best hit (RBH):
Between EC and PA
• Query an ORF (ORFi) in PA against all known ORFs in EC using BLASTP with
an E-value cutoff of 10^-5, yielding the set of hits {W}.
• Query the hit with the lowest E-value in {W} (ORFj) against all ORFs
in PA, yielding the set of hits {Y}.
• (ORFi, ORFj) are considered putative orthologs if ORFi is the hit in {Y}
with the lowest E-value.
• Meet the two strict criteria:
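The reciprocal-best-hit steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hits are taken to be already parsed into (query, subject, E-value) tuples, the function names and toy gene IDs are hypothetical, and the two additional strict criteria mentioned on the slide are not implemented because they are not spelled out here.

```python
from typing import Dict, List, Tuple

# Assumed input layout: (query_id, subject_id, e_value), e.g. parsed from
# tabular BLASTP output. The layout and all IDs below are hypothetical.
Hit = Tuple[str, str, float]


def best_hits(hits: List[Hit], max_evalue: float = 1e-5) -> Dict[str, str]:
    """Map each query to its lowest-E-value subject among hits passing the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue  # discard hits weaker than the E-value cutoff
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: List[Hit], b_vs_a: List[Hit],
                         max_evalue: float = 1e-5) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b, max_evalue)
    backward = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy data: PA_3 hits EC_1, but EC_1's best hit is PA_1, so PA_3 is dropped.
pa_vs_ec = [("PA_1", "EC_1", 1e-50), ("PA_1", "EC_2", 1e-10),
            ("PA_2", "EC_2", 1e-20), ("PA_3", "EC_1", 1e-8)]
ec_vs_pa = [("EC_1", "PA_1", 1e-48), ("EC_2", "PA_2", 1e-22),
            ("EC_2", "PA_1", 1e-9)]
orthologs = reciprocal_best_hits(pa_vs_ec, ec_vs_pa)
print(orthologs)
```

Keeping the best-hit lookup separate from the reciprocity check means the same helper serves both search directions.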
Reference features:
Feature | Name of feature | Category | Data type | Available | Rank by nomogram
Aromo | Aromaticity score | A | Real | E/P/A/B | 11
A3s | Base composition A | A | Real | E/P/A/B | 17
C3s | Base composition C | A | Real | E/P/A/B | 14
G3s | Base composition G | A | Real | E/P/A/B | 24
T3s | Base composition T | A | Real | E/P/A/B | 19
CAI | Codon adaptation index | A | Real | E/P/A/B | 2
CBI | Codon bias index | A | Real | E/P/A/B | 3
Fop | Frequency of optimal codons | A | Real | E/P/A/B | 4
Nc | Effective number of codons | A | Real | E/P/A/B | 5
L_sym | Frequency of synonymous codons | A | Integer | E/P/A/B | 7
L_aa | Length amino acids | A | Integer | E/P/A/B | 8
GC | GC content | A | Real | E/P/A/B | 13
GC3s | GC content 3rd position of synonymous codons | A | Real | E/P/A/B | 20
Gravy | Hydrophobicity score | A | Real | E/P/A/B | 22
Feature | Name of feature | Category | Data type | Available | Rank by nomogram
Cytoplasm | Subcellular localization: cytoplasm | B | Boolean | E/P/A/B | 9
Extracellular | Subcellular localization: extracellular | B | Boolean | E/P/A/B | 15
Inner | Subcellular localization: inner membrane | B | Boolean | E/P/A | 21
Outer | Subcellular localization: outer membrane | B | Boolean | E/P/A | 25
Periplasm | Subcellular localization: periplasm | B | Boolean | E/P/A | 23
ExpAA | Expected number of amino acids in helices | B | Real | E/P/A/B | 28
First60 | Expected number of AAs in helices in first 60 AAs | B | Real | E/P/A/B | 26
PredHel | Number of predicted TM helices | B | Integer | E/P/A/B | 27
PHYS | Phylogenetic score | B | Real | E/P/A/B | 6
PA | Paralogy | B | Boolean | E/P/A/B | 16
DES | Domain enrichment score | B | Real | E/P/A/B | 1
FLU | Fluctuation | C | Real | E/P/B | 18
CEH | Coexpression network hubs | C | Boolean | E/P/B | 12
CEB | Coexpression network bottlenecks | C | Boolean | E/P/B | 10
Calculating the DES (domain enrichment score):
The score is computed from a domain's occurrence frequency in
the essential and non-essential data sets, respectively, and from
the sizes of the essential and non-essential data sets, respectively.
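The formula itself did not survive on this slide, so the sketch below is not the authors' definition. It shows one plausible way to turn the four quantities named above into an enrichment score: the log2 ratio of a domain's size-normalized occurrence rates, with a pseudocount. The functional form, the pseudocount, and the function name are all assumptions made for illustration.

```python
import math


def domain_enrichment(freq_essential: float, freq_nonessential: float,
                      n_essential: int, n_nonessential: int,
                      pseudocount: float = 1.0) -> float:
    """Assumed form, not the paper's: log2 ratio of smoothed occurrence rates.

    Positive values mean the domain is relatively more frequent among
    essential genes; negative values mean the opposite.
    """
    rate_essential = (freq_essential + pseudocount) / (n_essential + pseudocount)
    rate_nonessential = (freq_nonessential + pseudocount) / (n_nonessential + pseudocount)
    return math.log2(rate_essential / rate_nonessential)


# Hypothetical counts: a domain seen 30 times among 300 essential genes
# and 5 times among 4000 non-essential genes.
enriched = domain_enrichment(30, 5, 300, 4000)
depleted = domain_enrichment(0, 50, 300, 4000)
print(enriched, depleted)
```

The pseudocount only keeps the ratio finite when a domain is absent from one of the two sets.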
Feature evaluation and selection:
Three criteria:
1. The features should be easily obtained and available for most
microorganisms.
2. The features should have high predictive power for gene essentiality.
A Naïve Bayes analysis was used to rank all features according to the
coverage length of the log-odds ratio.
3. The selected features should minimize biological redundancy.
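The ranking in criterion 2 can be illustrated with a small sketch. It assumes each feature has already been discretized (continuous features would need binning first), computes a smoothed log-odds ratio for every feature value, and takes the span between the largest and smallest value as the "coverage length". The function names, the smoothing, and the toy data are assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def log_odds_range(values, labels, pseudocount=1.0):
    """Span (max - min) of per-value log-odds ratios for one discrete feature."""
    pos = Counter(v for v, y in zip(values, labels) if y == 1)
    neg = Counter(v for v, y in zip(values, labels) if y == 0)
    levels = set(values)
    # Laplace-style smoothing so unseen (value, class) pairs stay finite.
    n_pos = sum(pos.values()) + pseudocount * len(levels)
    n_neg = sum(neg.values()) + pseudocount * len(levels)
    scores = [
        math.log(((pos[v] + pseudocount) / n_pos) / ((neg[v] + pseudocount) / n_neg))
        for v in levels
    ]
    return max(scores) - min(scores)


def rank_features(table, labels):
    """Feature names sorted from widest to narrowest log-odds span."""
    spans = {name: log_odds_range(column, labels) for name, column in table.items()}
    return sorted(spans, key=spans.get, reverse=True)


# Toy data: 1 = essential, 0 = non-essential.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
table = {
    "informative": ["hi", "hi", "hi", "lo", "lo", "lo", "lo", "hi"],
    "uninformative": ["x", "y", "x", "y", "x", "y", "x", "y"],
}
ranking = rank_features(table, labels)
print(ranking)
```

A feature whose values barely shift the odds of essentiality gets a short span and ends up low in the ranking.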
Classifier design:
Four classifiers were used to train and test the model:
(i) a Naïve Bayes classifier;
(ii) a logistic regression model;
(iii) a C4.5 decision tree;
(iv) a CN2 rule learner.
The best performance was obtained by combining the outputs of
these diverse classifiers using an unweighted average approach.
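The unweighted-average combination can be sketched as below. It assumes each classifier outputs a probability of essentiality per gene; the classifier keys and scores are made-up values for illustration, and no real classifier is trained here.

```python
def average_ensemble(probabilities_by_classifier):
    """Average per-gene essentiality probabilities across classifiers, equal weights."""
    columns = list(probabilities_by_classifier.values())
    n_genes = len(columns[0])
    if any(len(column) != n_genes for column in columns):
        raise ValueError("every classifier must score the same genes")
    return [sum(column[i] for column in columns) / len(columns)
            for i in range(n_genes)]


# Hypothetical outputs of the four classifiers for three genes.
scores = {
    "naive_bayes": [0.9, 0.2, 0.6],
    "logistic_regression": [0.8, 0.1, 0.4],
    "c45_tree": [1.0, 0.0, 0.5],
    "cn2_rules": [0.7, 0.1, 0.5],
}
combined = average_ensemble(scores)
print(combined)
```

Equal weights mean no single classifier dominates the combined score.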
Training and testing set preparation:
Each gene was assigned a Boolean value for its essentiality
(1 = essential; 0 = non-essential).
10-fold cross-validation was used.
Control training set (randomization test):
essential labels were randomly assigned to E. coli genes, with
the number of random 'essential genes' equal to the number of true
essential genes, and this set was used in the same training and testing framework.
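A minimal sketch of the two preparation steps, assuming labels are stored as a list of 0/1 values. Shuffling the label list keeps the number of 'essential' labels unchanged, which matches the control described above. The fold construction is a plain (non-stratified) split, and the gene counts are placeholders; both are assumptions.

```python
import random


def randomized_labels(labels, seed=0):
    """Shuffle 0/1 essentiality labels; the count of 1s is preserved."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def ten_fold_indices(n_items, n_folds=10, seed=0):
    """Split item indices into n_folds disjoint test folds of near-equal size."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    return [order[i::n_folds] for i in range(n_folds)]


# Placeholder data set: 30 essential and 270 non-essential genes.
true_labels = [1] * 30 + [0] * 270
control = randomized_labels(true_labels)
folds = ten_fold_indices(len(true_labels))

for test_fold in folds:
    held_out = set(test_fold)
    train_idx = [i for i in range(len(true_labels)) if i not in held_out]
    # A classifier would be trained on train_idx and scored on test_fold here.
```

A classifier trained on the shuffled labels should perform near chance, which gives a baseline for the real results.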
Results
Cross-validations of the classifier using EC essential gene set:
Ten-fold cross-validation on the EC essential gene data set
13 selected features in EC by nomogram
10 features were selected for EC and AB
(A) EC – EC, AUC = 0.93, PPV = 0.70;
(B) EC – AB, AUC = 0.80, PPV = 0.81
PPV = TP / (TP + FP)
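Both reported metrics can be computed directly from labels and scores. The sketch below assumes binary labels (1 = essential) and a 0.5 score threshold for calling a gene essential; the threshold and the toy scores are assumptions. AUC is computed as the fraction of (essential, non-essential) pairs in which the essential gene receives the higher score, which is equivalent to the area under the ROC curve.

```python
def ppv(y_true, y_pred):
    """Positive predictive value: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else float("nan")


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example with three essential and three non-essential genes.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold
print(ppv(y_true, y_pred), auc(y_true, scores))
```

PPV depends on the chosen threshold, while AUC does not, which is why a transfer can show a high AUC together with a low PPV.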
(C) AB – AB;
(D) AB – EC, AUC = 0.89, PPV = 0.43
Why is the PPV low?
The AB data set contained about 100 genes associated with
biosynthesis functions that are needed for survival only on minimal
media but are not essential in rich media.
They removed 82 biosynthesis-associated genes from the
AB essential gene set, and the refined data set achieved
substantially better precision (PPV = 0.53).
Prediction of essential genes between E. coli and others:
Pair | AUC | PPV
EC – PA | 0.69 | 0.57
PA 678 – EC | 0.79 | 0.41
PA 335 – EC | 0.82 | 0.47
EC – BS | 0.80 | 0.54
BS – EC | 0.86 | 0.48
Precision of predictions from EC to three target organisms
Comparison with homology mapping:
Conclusion
• The 10-fold cross-validations in four organisms showed AUC scores of
about 0.9, suggesting that gene essentiality, albeit a complex property,
is highly predictable by learning the characteristics underlying gene
essentiality.
• They discovered that domain enrichment, which had not been considered
in previous studies, is the strongest feature.
• When using the method to transfer essentiality between distantly
related organisms, the accuracy of predicting essential genes can be
affected by the following four factors:
  • the essential gene data set on which the classifiers are trained
  should be of high quality;
  • the essentiality should be transferred under the same or highly
  similar growth conditions;
  • the evolutionary distance seems to play an important role in the
  accuracy of predictions;
  • the prediction also depends on the availability of features with a
  similar distribution between organisms.
Comparison with Shiheng Tao's work (2013, BMC Genomics)
Tao proposed the feature-based weighted Naïve Bayes model (FWM),
which addresses the impact of multicollinearity among gene features
and feature divergence between species.
Summary of FWM:
(1) select proper features;
(2) apply a machine learning algorithm (Naïve Bayes);
(3) weight each feature using logistic regression and a genetic algorithm.
They applied FWM to reciprocally predict essential genes between
and within 21 species and compared its performance with that of
other models, including SVM, the Naïve Bayes model (NBM), and the logistic
regression model (LRM).
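The general idea of a feature-weighted Naïve Bayes score can be sketched as follows. This is not Tao's implementation: the slide only states that the weights come from logistic regression and a genetic algorithm, so the weights below are hypothetical placeholders, and the per-feature log-likelihood ratios are assumed to be precomputed.

```python
import math


def weighted_nb_log_odds(feature_log_ratios, weights, prior_essential):
    """Prior log-odds plus a weighted sum of per-feature log-likelihood ratios.

    With all weights equal to 1 this reduces to plain Naive Bayes.
    """
    prior = math.log(prior_essential / (1.0 - prior_essential))
    return prior + sum(w * r for w, r in zip(weights, feature_log_ratios))


def to_probability(log_odds):
    """Convert log-odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical log-likelihood ratios for three features of one gene.
ratios = [1.2, -0.4, 0.8]
plain = weighted_nb_log_odds(ratios, [1.0, 1.0, 1.0], prior_essential=0.5)
# Hypothetical weights that down-weight the second feature and drop the third.
weighted = weighted_nb_log_odds(ratios, [1.0, 0.5, 0.0], prior_essential=0.5)
print(to_probability(plain), to_probability(weighted))
```

Down-weighting correlated features is one way such a model can limit the double counting that plain Naïve Bayes suffers from under multicollinearity.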
Selected features:
Flow chart
SCE–SCE
SCE–SPO
They are generated by randomly selecting 20%
Comparison of FWM with LRM, NBM, and SVM
Four AUC matrices among the 21 species are produced using the four methods.
The AUC scores (mij) at the same position in the four matrices are then sorted and
replaced with markers (1, 2, 3, 4).
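The marker replacement can be sketched as a per-position ranking across the four matrices. The sketch assumes that marker 1 denotes the highest AUC at a position (the slide does not state the direction) and uses made-up 2 × 2 matrices in place of the 21 × 21 ones.

```python
def rank_markers(matrices):
    """Replace each AUC with its rank (1 = highest) among methods at the same position."""
    methods = list(matrices)
    n_rows = len(matrices[methods[0]])
    n_cols = len(matrices[methods[0]][0])
    ranked = {m: [[0] * n_cols for _ in range(n_rows)] for m in methods}
    for i in range(n_rows):
        for j in range(n_cols):
            # Sort methods by their AUC at position (i, j), best first.
            order = sorted(methods, key=lambda m: matrices[m][i][j], reverse=True)
            for rank, method in enumerate(order, start=1):
                ranked[method][i][j] = rank
    return ranked


# Made-up AUC values for illustration only.
auc_matrices = {
    "FWM": [[0.90, 0.80], [0.85, 0.91]],
    "LRM": [[0.88, 0.82], [0.80, 0.90]],
    "NBM": [[0.86, 0.78], [0.83, 0.89]],
    "SVM": [[0.89, 0.79], [0.81, 0.88]],
}
markers = rank_markers(auc_matrices)
print(markers["FWM"])
```

Counting how often each method receives marker 1 then gives a simple summary of which method wins most species pairs.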
Conclusion
Both methods are based on a set of features (though the features
differ), both use machine learning algorithms, and both evaluate
prediction accuracy with ROC curves.
The main difference between Lu and Tao is that the former obtains
predicted essential genes by combining the outputs of four diverse
classifiers using an unweighted average approach, whereas the latter
uses a single improved Bayesian algorithm (FWM), which achieved
better accuracy than the three other models it was compared with.
Thank you!