Machine learning assisted target selection for structural genomics
Pawel Smialowski, Antonio J. Martin-Galiano, Jürgen Cox (a) and Dmitrij Frishman
Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85350 Freising, Germany
(a) Max Planck Institute of Biochemistry, 82152 Martinsried, Germany
1. Structural genomics
Structural genomics consists of determining the three-dimensional structures of all proteins of a given organism, either by experimental methods such as X-ray crystallography and NMR spectroscopy or by computational approaches such as homology modeling.
The high-throughput genomics approach puts new demands on both experimentation and target selection.
Target selection strategies:
• Protein function, metabolic pathway
• Experimental amenability of proteins
[Figure: Population of instances along the structural genomics pipeline. The numbers were taken from TargetDB (status: March 2006). Out of 83596 targets selected by structural genomics consortia, structures were determined and deposited with the PDB for only 2830 (3.4%).]
2. Input data
• PDB - Protein structure database; provides structural information on proteins, peptides and nucleic acids.
http://www.rcsb.org/pdb/
• TargetDB - Target registration database; provides status and tracking information on the progress of sample preparation and solution of structures.
http://targetdb.pdb.org
5. Solubility of globular proteins
PROSO - A PROtein SOlubility predictor.
Protein solubility is a prerequisite for most biophysical studies: NMR, X-ray, Biacore, ITC, DSC, dynamic light scattering, chromatography, spectrophotometry.
Protein solubility is defined here as the ability of a protein to be heterologously expressed in E. coli in the soluble fraction, or to be tractable to standard refolding procedures.
Accurate theoretical prediction of solubility from sequence is instrumental for setting priorities on targets in large-scale proteomics projects.
Positive dataset
Proteins from PDB that were expressed in E. coli using a plasmid. Proteins from TargetDB with a status of at least “Soluble”. Soluble proteins from the literature.
Negative dataset
Persistently recalcitrant expressed proteins from TargetDB. All proteins identical to those in PDB were removed.
Sequence length: 20 – 1900 amino acids
Performance
An overall prediction accuracy of 71.2% (74.5% for the soluble class, 68% for the insoluble class).
Matthews Correlation Coefficient (MCC) = 0.425
[Figure: ROC. Receiver operating characteristic (ROC) of our method for protein solubility prediction (PROSO). Curves for the soluble and the insoluble classes are drawn individually. True positive rate (TP rate) and true negative rate (TN rate) are the number of correctly classified instances from the respective class divided by the number of all instances from that class. False positive rate (FP rate) and false negative rate (FN rate) are the number of incorrectly classified instances from the respective class divided by the number of all members of that class.]
3. Correlating sequence with experimental properties
• Each sequence from the positive and the negative dataset was represented by the frequencies of amino acids, dipeptides and tripeptides.
• The classification algorithm was built as a two-layer structure in which the outputs of the primary classifiers, support vector machines (SVM) with a Gaussian kernel operating on peptide frequencies, were combined by a second-level Naive Bayes classifier.
• Naive Bayes was used to integrate the results from classifiers trained on different sets of attributes.
• Ten-fold cross-validation was used for training and evaluation to measure the performance of the classifiers.
• Feature selection was performed using the wrapper method with Naive Bayes as the base classifier.
6. Heterologous expression of transmembrane proteins
MEMEX - A MEMbrane protein EXperimental properties predictor.
Integral membrane proteins are challenging targets for structure determination due to the substantial experimental difficulties involved in their sample preparation. Our study constitutes the first attempt to predict their experimental behavior in the context of high-throughput structure determination. At present, the work is hindered by the severely limited amount of data available for training the classifiers.
[Figure: The number of amenable and recalcitrant targets at different stages of experimental structure determination.]
Positive dataset
Integral membrane proteins amenable to a given experimental step, based on the TargetDB database.
Negative dataset
Integral membrane proteins persistently recalcitrant to a given experimental step, based on the TargetDB database.
[Panel: Summary of protein sequence features.]
Classification
Naive Bayes classifiers were built using 72 features of protein sequences as input, including amino acid composition, occurrence of amino acid groups, ratios between residue groups, and hydrophobicity measures.
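The sequence-derived inputs described on this poster (peptide frequencies for the SVMs, and composition, residue-group and hydrophobicity features for the Naive Bayes classifiers) can be illustrated with a short Python sketch. It is only a minimal illustration, not the authors' feature set: the poster does not list its 72 features, so the residue groups, the single ratio and the choice of the Kyte-Doolittle hydropathy scale are assumptions, and the function names `kmer_frequencies` and `sequence_features` are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Kyte-Doolittle hydropathy values (assumed scale; the poster names none).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Hypothetical residue groups, chosen only to illustrate group-based features.
RESIDUE_GROUPS = {
    "charged": "DEKR",
    "aromatic": "FWY",
    "hydrophobic": "AFILMVW",
}


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer (k=1 amino acids, 2 dipeptides, 3 tripeptides)."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    # Skip windows that contain non-standard symbols such as X or B.
    counts = Counter(w for w in windows if all(c in KYTE_DOOLITTLE for c in w))
    total = sum(counts.values())
    kmers = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


def sequence_features(seq):
    """Composition, group fractions, one group ratio and mean hydropathy."""
    composition = kmer_frequencies(seq, 1)
    features = {f"frac_{aa}": composition[aa] for aa in AMINO_ACIDS}
    for name, members in RESIDUE_GROUPS.items():
        features[f"frac_{name}"] = sum(composition[aa] for aa in members)
    charged = features["frac_charged"]
    # Guard against division by zero when the sequence has no charged residues.
    features["hydrophobic_to_charged"] = (
        features["frac_hydrophobic"] / charged if charged else 0.0
    )
    residues = [c for c in seq.upper() if c in KYTE_DOOLITTLE]
    features["mean_hydropathy"] = (
        sum(KYTE_DOOLITTLE[c] for c in residues) / len(residues) if residues else 0.0
    )
    return features


example = "MKKLLPTAAAGLLLLAAQPAMA"  # made-up sequence, for illustration only
print(len(kmer_frequencies(example, 2)), "dipeptide features")
print(round(sequence_features(example)["mean_hydropathy"], 3))
```

Each sequence becomes a fixed-length numeric vector (20, 400 or 8000 values for k = 1, 2, 3), which is the form a kernel SVM or a Naive Bayes classifier expects. Tripeptide vectors are very sparse for short proteins, which is one reason a feature selection step such as the wrapper method mentioned on the poster is useful.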
Performance
An overall prediction accuracy of 70%, 64% and 60% was achieved for the cloning, expression and solubilization stages of the structural genomics pipeline, respectively, with an acceptable balance between the performance for the positive and negative classes.
Our predictors produce results 37-59% better than random, with p-values of 10^-3 to 10^-10, as judged by their respective success rates (contingency table analysis).
Thus our classifiers are immediately useful for designing rational target selection strategies and may be instrumental in setting priorities on members of membrane protein families according to their predicted experimental behavior.
[Panel: Performance of the Bayesian classifier.]
Accuracy of the predictors:
• Crystallizability of small proteins – Accuracy ~67%
• Solubility of non-membrane proteins – Accuracy ~72%
• Expressability of membrane proteins – Accuracy ~65%
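The per-class rates, overall accuracy and Matthews correlation coefficient quoted for the solubility predictor (PROSO) can be cross-checked with a few lines of arithmetic. The sketch below uses the rate definitions given on the poster and the standard MCC formula. The confusion-matrix counts are hypothetical: the poster reports only percentages, so a balanced test set of 1000 soluble and 1000 insoluble proteins is assumed here.

```python
import math


def class_rates(tp, fn, tn, fp):
    """Per-class rates: instances of a class divided by all members of that class."""
    positives = tp + fn
    negatives = tn + fp
    return {
        "TP rate": tp / positives,
        "FN rate": fn / positives,
        "TN rate": tn / negatives,
        "FP rate": fp / negatives,
    }


def accuracy(tp, fn, tn, fp):
    """Fraction of all instances that were classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def matthews_cc(tp, fn, tn, fp):
    """Matthews correlation coefficient; returns 0.0 if any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts (assumption): 1000 proteins per class, with the reported
# per-class accuracies of 74.5% (soluble) and 68% (insoluble).
tp, fn = 745, 255
tn, fp = 680, 320

print(class_rates(tp, fn, tn, fp))
print("accuracy:", accuracy(tp, fn, tn, fp))
print("MCC:", round(matthews_cc(tp, fn, tn, fp), 3))
```

Under the equal-class-size assumption the sketch gives an accuracy of 71.25% and an MCC close to the reported 0.425. A different class balance would shift both figures, so the agreement is a consistency check rather than a reconstruction of the original test set.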
4. Crystallizability of globular proteins
SECRET - A SEquence-based CRystallizability EvaluaTor
Crystallization is a major time-limiting step in structure determination. It is:
• Multidimensional: crystallization is highly influenced by the experimental setup.
• Non-linear: small changes in sequence can strongly influence crystallizability and/or crystal quality.
• Non-conserved: there is no evolutionary pressure on crystallization.
Positive dataset
Proteins with NMR (Nuclear Magnetic Resonance) structures that also have an X-ray structure, or that bear a high sequence similarity (BLAST: 75-100% identity, 10% length difference) to proteins with X-ray determined structures.
Negative dataset
Proteins with NMR structures only (based on a BLAST bit score cutoff of 30).
Sequence length: 46 – 200 amino acids
The meta-classifier had an accuracy of 67% (65% and 69% for the positive and negative class, respectively). As seen in the receiver operating characteristic (ROC) graph, it correctly classified 70% of negative instances while misclassifying only 35% of positive instances into the negative class. TP, FP, TN and FN rates are defined as in the solubility section.
[Figure: ROC curves for the positive and negative classes. Horizontal axis: FP rate, FN rate. Vertical axis: TP rate, TN rate.]
7. Web servers
ALL CLASSIFIERS INTEGRATED INTO ONE WEB PAGE
• EXPROPRIATOR - An EXPerimental PROtein PRoperties predicTOR
http://binfo.bio.wzw.tum.de/proj/expropriator/index.html
INDIVIDUAL CLASSIFIERS
• MEMEX - A MEMbrane protein EXperimental properties predictor
http://webclu.bio.wzw.tum.de:8080/memex
• SECRET - A SEquence-based CRystallizability EvaluaTor
http://webclu.bio.wzw.tum.de:8080/secret
• PROSO - A PROtein SOlubility predictor
http://webclu.bio.wzw.tum.de:8080/proso
References
• Martin-Galiano, A.J. and Frishman, D. (2006a) Defining the fold space of membrane proteins: the CAMPS database, Proteins, 64, 906-922.
• Martin-Galiano, A.J., Smialowski, P. and Frishman, D. (2006b) Predicting experimental properties of integral membrane proteins by a Naive Bayes approach, Proteins (submitted).
• Smialowski, P., Martin-Galiano, A., Mikolajka, A., Girschick, T., Holak, T.A. and Frishman, D. (2006a) Protein solubility: sequence based prediction and experimental verification. Technical University Munich.
• Smialowski, P., Schmidt, T., Cox, J., Kirschner, A. and Frishman, D. (2006b) Will my protein crystallize? A sequence-based predictor, Proteins, 62, 343-355.