Functional Classification of PSI Proteins to Support High Throughput Biochemical Characterization: Classes of Reciprocal Sequence Homologs (CRSH)
Samuel Handelman, Nelson Tong, Jon D. Luff, David P. Lee, André Lazar, Paul Smith, Prasanna Gogate, Rohan Mallelwar and John Hunt
Bacterial physiology in the post-genome era
• Exponential growth in sequence information.
• Structural information is more difficult to obtain. Evolution is key to leveraging what we do know.
• Direct functional information is scarcer still: evolution and comparative studies are even more critical.
[Figure: genome images vs. protein structure images. Genome images from BacMap (UAlberta) and VirtualLaboratory; protein structure images from NESG (Columbia/Rutgers).]
Even today, most proteins are of unknown biochemical function
[Figure: two pie charts.]
E. coli (~4,200 proteins; 01/23/08). Chart labels: 53%; “Known”; “hypothetical”, “putative”, “uncharacterized” or “unknown”.
H. sapiens (~27,000 proteins). Chart labels: 54%; “Known”; neither identical nor similar to any experimentally validated protein*.
• Closing this gap lays the groundwork for systems biology.
*Genome Information Integration Project And H-Invitational 2 (2007) Nucleic Acids Research 36:D793-799
CRSH Goal: Group Functionally Equivalent Homologs.
• Homology clusters contain multiple distinct protein functions.
CRSH Approach:
• Identify subclusters such that all members have equivalent function (in bacteria only).
Topic Overview
• CRSH: what they are, why they’re useful
• CRSH Web Interface, merits of mapping of
TargetDB to protein functional groups
• Using CRSH and Gene Neighborhood to
predict stable tertiary interactions.
Classes of Reciprocal Sequence Homologs (CRSHs)
Pipeline, in order:
• Start from predicted proteins from 474 fully sequenced bacterial genomes.
• Cluster based on BLAST scores; verify clusters on profile scores.
• Split into sub-clusters when multiple members come from a single organism (likely paralogs); verify sub-clusters on profile scores.
• Merge sub-clusters into classes if more similar than expected after accounting for inter-organism distances; verify final classes on profile scores.
CRSHs (~75,000): likely same function.
Main application: Gene neighborhood method. Calculate “co-localization” counts for all CRSH pairs (# of times their genes are within 15 kb on chromosomes of fully diverged organisms).
Split into sub-clusters when multiple members come from a single organism
[Figure, courtesy Marco Punta. Legend: marked pairs are reciprocal closest homologs in their respective organisms. Proteins shown: M. tuberculosis RV0859, E. coli PaaJ, A. tumefaciens ATU0502, A. tumefaciens PcaF. Group labels: acetyl-CoA acetyltransferases; beta-ketoadipyl CoA thiolases.]
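The "reciprocal closest homologs" idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the input layout (a dictionary of pairwise similarity scores and a protein-to-organism map), and the toy identifiers and scores are all assumptions made for illustration.

```python
def reciprocal_closest_homologs(scores, organism_of):
    """Return pairs of proteins that are each other's closest hit across organisms.

    scores: dict mapping (query, subject) -> similarity score (higher = closer).
            Hypothetical input; in practice these might be BLAST bit scores.
    organism_of: dict mapping protein id -> organism name.
    """
    # best[(query, subject_organism)] = (score, subject): the closest hit of
    # `query` within each other organism.
    best = {}
    for (query, subject), score in scores.items():
        if organism_of[query] == organism_of[subject]:
            continue  # same-organism hits are not candidates here
        key = (query, organism_of[subject])
        if key not in best or score > best[key][0]:
            best[key] = (score, subject)

    # Keep a pair only if the relationship holds in both directions.
    pairs = set()
    for (query, _subject_org), (_score, subject) in best.items():
        back = best.get((subject, organism_of[query]))
        if back is not None and back[1] == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs


# Toy data (invented): two organisms, two proteins each.
organism_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
scores = {
    ("a1", "b1"): 200, ("a1", "b2"): 120,
    ("a2", "b1"): 110, ("a2", "b2"): 180,
    ("b1", "a1"): 195, ("b1", "a2"): 105,
    ("b2", "a1"): 115, ("b2", "a2"): 175,
}
print(sorted(reciprocal_closest_homologs(scores, organism_of)))
```

In this toy case a1/b1 and a2/b2 are reciprocal closest homologs, so the two proteins from organism A would fall into different sub-clusters rather than being lumped together.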
Gene Neighborhood Preview
[Figure: each octagon represents a CRSH. Genome 1 shows CRSHs 1, 2, 3 and 4; CRSHs 1 and 3 recur in Genome 2, Genome 3, …]
• Stronger neighborhood conservation => better function predictions.
• Insight into function of unknown proteins.
“Co-localized” = within 15 kb
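A minimal sketch of the co-localization count follows, assuming each genome is given as a list of (CRSH id, gene position in base pairs) entries. The data layout, names, and positions are hypothetical. The sketch treats chromosomes as linear and does not model the "fully diverged organisms" filter mentioned in the slides.

```python
from collections import Counter
from itertools import combinations

WINDOW_BP = 15_000  # "co-localized" = within 15 kb


def colocalization_counts(genomes, window_bp=WINDOW_BP):
    """Count, for each CRSH pair, the genomes in which their genes are co-localized.

    genomes: dict mapping genome name -> list of (crsh_id, position_bp).
    Returns a Counter keyed by sorted (crsh_a, crsh_b) tuples.
    """
    counts = Counter()
    for genes in genomes.values():
        close_pairs = set()  # each genome contributes at most 1 per pair
        for (crsh_a, pos_a), (crsh_b, pos_b) in combinations(genes, 2):
            if crsh_a != crsh_b and abs(pos_a - pos_b) <= window_bp:
                close_pairs.add(tuple(sorted((crsh_a, crsh_b))))
        counts.update(close_pairs)
    return counts


# Toy genomes (invented positions), loosely mirroring the preview figure.
genomes = {
    "genome1": [("CRSH1", 1_000), ("CRSH2", 5_000), ("CRSH3", 9_000), ("CRSH4", 40_000)],
    "genome2": [("CRSH1", 200_000), ("CRSH3", 210_000)],
    "genome3": [("CRSH1", 5_000), ("CRSH3", 90_000)],
}
counts = colocalization_counts(genomes)
print(counts[("CRSH1", "CRSH3")])  # co-localized in genome1 and genome2 only
```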
Frequency Distribution of Mean %ID in CRSH
[Histogram: x-axis, Mean %ID (0 to 100); y-axis, Frequency (0 to 0.2).]
• Tremendous range in sequence conservation with more or less equivalent conservation of function.
A Fixed Homology Threshold Fails to Reliably Segregate Functionally Equivalent Proteins
[Plot: x-axis, Length-Normalized Blast Bit-Score (0.00 to 1.50); y-axis, P (Each Gene Neighbor is Conserved) (0.0 to 0.4); series: Orthologs, Paralogs.]
Like Rost clusters, but for function
• Based on sequence information, you can conclude that two proteins have the same structure, even if you don’t know the structure.
• We’re working towards an analogous scheme for protein function, but each functional group needs its own cutoff.
• We propose to do this especially for proteins whose function we do not yet know.
Graph Courtesy Burkhard Rost
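To make the contrast between one fixed threshold and a per-group cutoff concrete, here is a small sketch. The slides do not define how the bit score is length-normalized, so dividing by the shorter sequence length is an assumption, as are the function names, the group labels, and every numeric cutoff.

```python
def length_normalized_score(bit_score, query_length, subject_length):
    """Bit score per residue of the shorter sequence.

    Assumption: the slides do not specify the normalization used.
    """
    return bit_score / min(query_length, subject_length)


def passes_cutoff(score, group, group_cutoffs, default_cutoff):
    """Apply the cutoff for `group` if one is known, else a fallback value."""
    return score >= group_cutoffs.get(group, default_cutoff)


# Invented numbers: the same alignment score evaluated against two
# hypothetical functional groups with different cutoffs.
score = length_normalized_score(bit_score=300, query_length=400, subject_length=420)
group_cutoffs = {"group_a": 0.5, "group_b": 1.0}

print(score)
print(passes_cutoff(score, "group_a", group_cutoffs, default_cutoff=0.8))
print(passes_cutoff(score, "group_b", group_cutoffs, default_cutoff=0.8))
```

With a single fixed threshold both comparisons would get the same answer; with group-specific cutoffs the same score can be accepted for one group and rejected for another.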
• We have developed a web interface for these CRSHs, which is meant for use by experimentalists.
• Presently hosted in India (at http://61.8.141.68:8080/Columbia/), it will be hosted at the NESG (at www.orthology.org), where CRSH pages will be available for each entry in TargetDB.
• The CRSH pages that follow have been mapped to TargetDB, so that biologists working in the centers can access them directly.
• Within 2 months we hope for a direct link from the PSI TargetDB gateway to the CRSHs.
• CRSHs already have links to BioCyc, a leading bacterial physiology database; links to other functional genomics databases are coming.
• A consensus domain architecture schematic will appear shortly.
• The applet on the left provides a graphical display of the phylogenetic distribution. In the near future, we’ll add the info from TargetDB to this applet and to the table below.
• Known complexes in BioCyc are targets for structural genomics efforts to solve multi-protein structures.
• The genetically co-localizing CRSHs are promising secondary targets, as I will explain…
Gene Neighborhood
Hypothesis Generation
With suggested applications in structural
genomics and functional genomics
OR
Rational ideas have consequences for action;
reason necessarily has a constructive function.
Known Stable Complexes Strongly Correlate with Gene Neighborhood
• For every pair of CRSHs for which complex-membership data is available in BioCyc, we count the instances where the two CRSHs appear in a putative operon together.
• These counts correlate strongly with well-established, well-studied, stable and definitive physical complexes (drawn in this case from BioCyc).
• These probabilities are overestimated due to the methods used.
[Plots: x-axis, Co-localization counts (logarithmic bins); y-axis, P(CRSH together in stable complex) (0 to 1); series: All Hetero-Complexes, Heterodimers Only.]
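The relationship between co-localization counts and known complexes could be tabulated roughly as follows. This is a sketch under stated assumptions: the bin edges, the input format, and the toy observations are hypothetical, and no correction is made for the overestimation the slides mention.

```python
from bisect import bisect_right


def fraction_in_complex_by_bin(pairs, bin_lower_edges):
    """Fraction of CRSH pairs found in a known stable complex, per count bin.

    pairs: iterable of (colocalization_count, in_known_complex) tuples.
    bin_lower_edges: ascending lower edges; the last bin is open-ended.
    Returns one fraction per bin, or None for an empty bin.
    """
    totals = [0] * len(bin_lower_edges)
    hits = [0] * len(bin_lower_edges)
    for count, in_complex in pairs:
        index = bisect_right(bin_lower_edges, count) - 1
        if index < 0:
            continue  # count falls below the first bin edge
        totals[index] += 1
        hits[index] += int(bool(in_complex))
    return [hit / total if total else None for hit, total in zip(hits, totals)]


# Invented observations and roughly logarithmic bins: 0-9, 10-99, 100+.
observations = [
    (0, False), (3, False), (4, True),
    (12, True), (50, False),
    (150, True), (300, True),
]
print(fraction_in_complex_by_bin(observations, [0, 10, 100]))
```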
Gene Neighborhood has some Correlation with Small Molecule Interaction Partners
• For each CRSH, we extract from BioCyc a set of known small molecule interaction partners (ligands, substrates, products, etc.). We excluded very common partners (water, phosphate, ATP, etc.).
• Because proteins together in operons are often part of the same metabolic pathways or respond to similar chemical signals, it is reasonable to extrapolate small molecule interactions to the conserved gene neighbors.
• There is a definite correlation. This graph is preliminary; it is likely an underestimate.
[Plot: x-axis, Aggregate Co-localization counts for CRSH/Small Molecule; y-axis, P (Known Interaction between CRSH Member and Small Molecule).]
• This view, which is still in beta, gives the known small-molecule
interactions of all of the gene neighbors for a given CRSH, weighted
to reflect the strength of gene neighborhood conservation.
• As well as providing a starting point for interaction screening, this
can make the functional insights provided by the gene neighborhood
method more accessible.
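One way to picture this weighted view is to sum, for each small molecule, the co-localization counts of the gene neighbors known to interact with it. The sketch below assumes exactly that; the actual weighting used by the web interface is not specified in the slides, and all names and numbers are hypothetical.

```python
from collections import defaultdict

# Very common partners are excluded, as described for the BioCyc extraction.
COMMON_PARTNERS = frozenset({"water", "phosphate", "ATP"})


def rank_small_molecules(neighbor_counts, known_partners, excluded=COMMON_PARTNERS):
    """Rank candidate small-molecule partners for one query CRSH.

    neighbor_counts: dict mapping neighbor CRSH -> co-localization count
                     with the query CRSH.
    known_partners: dict mapping CRSH -> set of known small-molecule partners.
    Returns (molecule, aggregate_count) tuples, strongest first.
    """
    aggregate = defaultdict(int)
    for neighbor, count in neighbor_counts.items():
        for molecule in known_partners.get(neighbor, ()):
            if molecule not in excluded:
                aggregate[molecule] += count
    return sorted(aggregate.items(), key=lambda item: (-item[1], item[0]))


# Invented example: three gene neighbors of a query CRSH.
neighbor_counts = {"CRSH_A": 40, "CRSH_B": 10, "CRSH_C": 5}
known_partners = {
    "CRSH_A": {"molecule_x", "ATP"},
    "CRSH_B": {"molecule_x", "molecule_y"},
    "CRSH_C": {"molecule_y"},
}
print(rank_small_molecules(neighbor_counts, known_partners))
```

A ranked list like this could serve as the starting point for interaction screening that the slide describes.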
Salvage Pipeline
• For structural genomics targets which have been
cloned and are soluble, but which have failed to
crystallize, we introduce a parallel pipeline to
salvage them by adding “known” or predicted
protein or small molecule binding partners.
[Diagram: two parallel branches, “Crystallize without Partner” and “Crystallize with Partner”.]
• Bonus biology: whole greater than sum of parts.
Concluding Remarks
• We are eager to add links to PSI
resources to our CRSH pages – they are
intended to facilitate collaboration between
structural and functional genomics, in
particular.
• Functional information can improve the
impact of structural genomics efforts, and
may provide new salvage pathways for
difficult targets.
Thank you
John “The Jersey Eliminator” Hunt
Paul “Schmitty” Smith
Greg “Cassis” Boel
Sai “Full Nelson” Tong
Marco “The Shark” Punta
Burkhard “Wrecking Ball” Rost
Prasanna “Crackerjack” Gogate
Rohan “The Punisher” Mallelwar
Jon “JD” Luff
Liang “Red, White and Thunder” Tong
Howard “Hurricane” Shuman
Dana “Steel Toe” Pe’er
Harmen “H-Bomb” Bussemacher
Larry “The Tank” Chasin
Dre “Enter the Dragon” Lazar
David “Intravenous” Lee
Girish “Bone Breaker” Rao
Stephanie “Bronx” Wong
Diana “1-2-3” Flynn
George “El Pato Loco” Oldan
Allison “Grid Iron” Fay
Jordi “El Chupacabra” Banach
John “Steel” Dworkin
Etay “Aces” Ziv
Chris “Fireball” Wiggins
Gerwald “Sunshine” Jogl
Cal “Howitzer” Lobel
Yongzhao “Downtown” Shao
David “Finger of Death” Draper
Gae “Knuckles” Monteleone
Mike “The Red Baron” Baran
John “Mountain Man” Everett
The Hunt Lab, The NESG
American Heart Association, CF Foundation, NSF.
Consistency in CRSH sequence divergence levels between remote phyla
[Scatter plots; EACH DOT IS A CRSH.
Panel 1: D. radiodurans with B. subtilis length-normalized blast bit score versus E. coli with S. elongatus length-normalized blast bit score (both axes 0.5 to 2.0).
Panel 2: D. radiodurans with B. subtilis a.a. %ID with binomial standard error versus E. coli with S. elongatus a.a. %ID with binomial standard error (both axes 25 to 85).]
Deviation from Evolutionary Consensus in Protein Complexes
[Histogram: x-axis, Spearman's Rho on Deviation from Consensus Distance (-1 to 1); y-axis, Frequency (0 to 0.4); series: Interaction Pairs from Biocyc; Random Pairs from Biocyc Interaction Set (with two S.D. against hypothesis).]