Download PPT

Document related concepts

Genomics wikipedia , lookup

Epigenetics of neurodegenerative diseases wikipedia , lookup

Minimal genome wikipedia , lookup

Epigenetics of diabetes Type 2 wikipedia , lookup

Epigenetics of human development wikipedia , lookup

Protein moonlighting wikipedia , lookup

Nutriepigenomics wikipedia , lookup

Long non-coding RNA wikipedia , lookup

Epitranscriptome wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

NEDD9 wikipedia , lookup

Polycomb Group Proteins and Cancer wikipedia , lookup

Gene expression profiling wikipedia , lookup

Gene expression programming wikipedia , lookup

Genome evolution wikipedia , lookup

Mir-92 microRNA precursor family wikipedia , lookup

RNA-Seq wikipedia , lookup

Transcript
Surveys of a Finite Parts List
Mark Gerstein
Molecular Biophysics & Biochemistry and
Computer Science, Yale University
H Hegyi, J Lin, B Stenger, P Harrison, N Echols,
J Qian, A Drawid, D Greenbaum, R Jansen
Transcriptome 2000, Paris
8 November 2000
1 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Analysis of Genomes &
Transcriptomes in terms of the
Occurrence of Parts and Features:
Bacteria,
1.6 Mb,
~1600 genes
[Science 269: 496]
1997
Eukaryote,
13 Mb,
~6K genes
[Nature 387: 1]
Genomes
highlight
the
Finiteness
of the
“Parts” in
Biology
1998
Animal,
~100 Mb,
~20K genes
[Science 282:
1945]
2000?
Human,
~3 Gb,
~100K
genes [???]
‘98 spoof
2 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
1995
real thing, Apr ‘00
Simplifying the Complexity of Genomes:
Global Surveys of a
Finite Set of Parts from Many Perspectives
2
3
4
5
6
7
8
9
10 11
12 13
14 15 16
17 18 19
20
…
~100000 genes
~1000 folds
(human)
(T. pallidum)
1
2
3
4
5
6
7
8
9
10 11
12 13
Same logic for sequence
families, blocks, orthologs,
motifs, pathways, functions....
Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from,
ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related resources: COGS, ProDom,
Pfam, Blocks, Domo, WIT, CATH, Scop....
14 15 …
~1000 genes
3 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
1
4 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
A Parts List Approach to Bike Maintenance
A Parts List Approach to Bike Maintenance
What are the
shared parts (bolt,
nut, washer, spring,
bearing), unique
parts (cogs,
levers)? What are
the common parts - types of parts
(nuts & washers)?
Where are
the parts
located?
Which parts
interact?
5 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
How many roles
can these play?
How flexible and
adaptable are they
mechanically?
Analysis of Genomes &
Transcriptomes in terms of the
Occurrence of Parts & Features
5(G)1(T)
7(G)15(T)
1 Using Parts to Interpret
Genomes. Shared and/or unique parts. Venn
1(G)2(Y)
2 Using Parts to Interpret
Pseudogenomes. In worm, top Y-folds
(DNAse, hydrolase) v top-folds (Ig). chr. IV
enriched, dead and dying families (90YG v 1G)
3 Using Parts to Interpret
Transcriptomes:
Expression & Structure. Top-10
parts in mRNA. Enriched in transcriptome: ab folds,
energy, synthesis,TIM fold, VGA. Depleted: TMs,
transport, transcription, Leu-zip, NS. Compare with
prot. abundance.
4 Expression & Localization.
Enriched : Cytoplasmic. Depleted: Nuclear.
Bayesian localizer
5 Expression & Function.
Expression relates to structure & localization but to
function, globally? P-value formalism. Weak
relation to protein-protein interactions.
bioinfo.mbb.yale.edu
H Hegyi, J Lin, B Stenger,
P Harrison, N Echols,
R Jansen, A Drawid, J Qian,
D Greenbaum, M Snyder
6 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Diagrams, Fold tree with all-b diff. Ortholog tree.
Top-10 folds.
ProtoMap
One approach of many...
Much previous work on
Sequence & Structure Clustering
CATH, Blocks, FSSP,
Interpro, eMotif, Prosite,
CDD, Pfam, Prints, VAST,
TOGA…
Remington, Matthews ‘80; Taylor, Orengo ‘89, ‘94; Thornton,
CATH; Artymiuk, Rice, Willett ‘89; Sali, Blundell, ‘90; Vriend, Sander
‘91; Russell, Barton ‘92; Holm, Sander ‘93+ (FSSP); Godzik,
Skolnick ‘94; Gibrat, Bryant ‘96 (VAST); F Cohen, ‘96; Feng, Sippl
‘96; G Cohen ‘97; Singh & Brutlag, ‘98
finding
parts in
genome
sequences
blast,
y-blast,
fasta,
TM, lowcomplexity,
&c
(Altschul,
Pearson,
Wooton)
part occurrence profiles
7 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Integrated
Analysis
System:
X-ref
Parts with
Genomes
Folds: scop+automatic
Orthologs: COGs
“Families”: homebrew,
21
43
E. coli
42
149
16
8
yeast
8 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Shared
Folds
worm
of 339
35
Cluster Trees Grouping Initial
Genomes on Basis of Shared Folds
Cele
Mthe
Mjan
Phor
Cpne
Ctra
Tpal
Aful
Bbur
Aaeo
“Classic”
Tree
Fold
Tree
Mpne
Syne
Mgen
Rpro
Hpyl
Mtub
D=S/T
20
10
S = # shared folds
30
D=10/(20+10+30)
D = shared fold dist.
betw. 2 genomes
T= total #
folds in both
Hinf
Bsub
0.1
20 Genomes
Ecol
9 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Scer
Fold
Tree
Unusual distribution of all-beta folds
10 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Distribution of Folds in Various Classes
Ortholog
Tree
Fold Tree
(based on COGs scheme of Koonin & Lipman, similar approaches by Dujon, Bork, &c.)
11 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Compare with Ortholog Occurrence Trees
(Part = ortholog v fold)
Depends on comparison
method, DB, sfams v folds, &c
(new top superfamilies via yBlast, Intersection of top-10 to
get shared and common)
Top-10 Worm Folds
class
Ig
B
Knottins
SML
Protein kinases (cat. core)
MULT
C-type lectin-like
A+B
corticoid recep. (DNA-bind dom.) SML
Ligand-bind dom. nuc. receptor
A
alpha-alpha superhelix
A
C2H2 Zn finger
SML
P-loop NTP Hydrolase
A/B
Ferrodoxin
A+B
M. genitalium
Rank
1
2

=

B. subtilis
Superfamily
#
P-loop hydrolase
60
SAM methyltransferase
16
Rossmann
domain
13
4
Class I
synthetase
12
5
Class II
synthetase
11
6
Nucleic acid
binding dom.
11
3
Total ORFs
with Common
Superfamilies
479
105
(22%)





=
Superfamily
P-loop
hydrolyase
Rossmann
domain
M. thermoautotrophicum
E. coli
#
173
165
Phosphatebinding barrel
79
PLP-transferase
44
CheY-like domain
36
SAM methyltransferase
30
4268
465
(11%)






Superfamily
#
P-loop hydrolase
191
1
Rossmann
domain
158
2
Rank
Phosphatebinding barrel
64
3
PLP-transferase
38
4
CheY-like domain
36
5
Ferredoxins
35
6
4268
458
(11%)
Total ORFs
with Common
Superfamilies




=

Superfamily
P-loop
hydrolyase
Phosphatebinding barrel
num.
matches frac. all
in worm worm
genome dom. in
in
(N)
(F) EC? SC?
830 1.7% 18
4
565 1.1%
0
3
472 0.9%
1 142
322 0.6%
0
1
276 0.5%
1 10
257 0.5%
0
0
247 0.5%
6 114
239 0.5%
0 78
235 0.5% 72 133
207 0.4% 83 114
S. cerevisiae
A. fulgidus
#
93
54
Rossmann
domains
53
Ferredoxins
48
SAM methyltranferase
17
PLP-transferases
15
1869
252
(14%)




=

Superfamily
P-loop
hydrolyase
Rossmann
domain
#
118
104
Rank
1
2
3
Phosphatebinding barrel
56
Ferredoxins
49
4
SAM methyltranferase
24
5
PLP-transferases
18
6
2409
309
(13%)
Total ORFs
with Common
Superfamilies

x

=
Superfamily
P-loop
hydrolyase
#
249
Protein kinase
123
Rossmann
domain
90
RNA-binding
domain
75
SAM methyltransferase
63
Ribonuclease Hlike
57
6218
560
(9%)
12 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Common Folds in Genome, Varies Betw. Genomes
x

C
Su
Pro
h
Liga
N
C-
a
h
Ig s
Common,
Shared Folds:
bab structure
All share a/b structure with repeated
R.H. bab units connecting adjacent
strands or nearly so (18+4+2 of 24)
HI, MJ, SC
vs scop
1.32
P-loop
hydrolase
Flavodoxin
like
Rossmann
Fold
Thiamin
Binding
TIMbarrel
13 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
336: 42
Analysis of Genomes &
Transcriptomes in terms of the
Occurrence of Parts & Features
5(G)1(T)
7(G)15(T)
1 Using Parts to Interpret
Genomes. Shared and/or unique parts. Venn
1(G)2(Y)
2 Using Parts to Interpret
Pseudogenomes. In worm, top Y-folds
(DNAse, hydrolase) v top-folds (Ig). chr. IV
enriched, dead and dying families (90YG v 1G)
3 Using Parts to Interpret
Transcriptomes:
Expression & Structure. Top-10
parts in mRNA. Enriched in transcriptome: ab folds,
energy, synthesis,TIM fold, VGA. Depleted: TMs,
transport, transcription, Leu-zip, NS. Compare with
prot. abundance.
4 Expression & Localization.
Enriched : Cytoplasmic. Depleted: Nuclear.
Bayesian localizer
5 Expression & Function.
Expression relates to structure & localization but to
function, globally? P-value formalism. Weak
relation to protein-protein interactions.
bioinfo.mbb.yale.edu
H Hegyi, J Lin, B Stenger,
P Harrison, N Echols,
R Jansen, A Drawid, J Qian,
D Greenbaum, M Snyder
14 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Diagrams, Fold tree with all-b diff. Ortholog tree.
Top-10 folds.
Pseudogenomics: Surveying “Dead” Parts
(Our def’n: YG = obvious homolog to known protein with frameshift or stop in mid-domain)
15 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Example of
a potential
YG with
frameshift in
mid-domain
Folds in Pseudogenes
16 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
YG identification pipeline
to Summary of Pseudogenes
in worm
G=19K GE=8K YG=4K (2K)
Example of a potential YG with
frameshift in mid-domain
17 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Most Common Worm “Pseudofolds” #1
18 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Most Common Worm “Pseudofolds” #2
YG
-G
YG
-G
16%
(min)
Pseudogene
Distribution
on
Chomosomes
~50% YG in
terminal 3Mb vs
~30% G
19 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
29%
(max)
D1022.6 has
90 dead
fragments of
itself – a
disused line
of chemoreceptors?
1
Num. pseudogenes in family
90
20 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Num. genes in family
Decayed Lines of Genes?
Completely Dead
Families
1
Rank
Number
matches
Organism of
closest match*
#1
7 *******
5 *****
5 *****
4 ****
4 ****
4 ****
3 ***
3 ***
3 ***
3 ***
Yeast
Human
Cow
Frog
Human
Fly
#2 =
#2 =
#4 =
#4 =
#4 =
#7 =
#7 =
#7 =
#7 =
PROTOMAP
family
representative
YJA7_YEAST
Notes on representative
Hypothetical protein in yeast
THB_RANCA
Xeroderma pigmentosum
group D complementing protein
Cleavage and polyadenylation specificity
factor
Thyroid hormone receptor beta
SEX_HUMAN
SEX gene
MDR1_RAT
Multidrug resistance protein 1
Vaccinia virus
YVFB_VACCC
Hypothetical vaccinia virus protein
Fly
Human
E. coli
VHRP_VACCC
Host range protein from vaccinia
IF4V_TOBAC
Eukaryotic initiation factor 4A
ACRR_ECOLI
Acrab operon repressor
XPD_MOUSE
CPSA_BOVIN
21 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
0
Analysis of Genomes &
Transcriptomes in terms of the
Occurrence of Parts & Features
5(G)1(T)
7(G)15(T)
1 Using Parts to Interpret
Genomes. Shared and/or unique parts. Venn
1(G)2(Y)
2 Using Parts to Interpret
Pseudogenomes. In worm, top Y-folds
(DNAse, hydrolase) v top-folds (Ig). chr. IV
enriched, dead and dying families (90YG v 1G)
3 Using Parts to Interpret
Transcriptomes:
Expression & Structure. Top-10
parts in mRNA. Enriched in transcriptome: ab folds,
energy, synthesis,TIM fold, VGA. Depleted: TMs,
transport, transcription, Leu-zip, NS. Compare with
prot. abundance.
4 Expression & Localization.
Enriched : Cytoplasmic. Depleted: Nuclear.
Bayesian localizer
5 Expression & Function.
Expression relates to structure & localization but to
function, globally? P-value formalism. Weak
relation to protein-protein interactions.
bioinfo.mbb.yale.edu
H Hegyi, J Lin, B Stenger,
P Harrison, N Echols,
R Jansen, A Drawid, J Qian,
D Greenbaum, M Snyder
23 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Diagrams, Fold tree with all-b diff. Ortholog tree.
Top-10 folds.
Young, Church…
Affymetrix GeneChips
Abs. Exp.
Brown,
marrays, Rel.
Exp. over
Timecourse
Yeast Expression Data: 6000 levels!
Integrated Gene Expression Analysis
System: X-ref. Parts and Features against
expression data...
Also:
SAGE (mRNA);
2D gels for
Protein
Abundance
(Aebersold,
Futcher)
Snyder,
Transposons,
Protein Abundance
24 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Gene Expression
Datasets: the Yeast
Transcriptome
Common Parts: the Transcriptome
alpha/beta hydrolases
a/ b 2ace
2.2
0.9
Zn2/C6 DNA-bind. dom.
sml 1aw6
2.6
0.3
SAGE-S
1.6
SAGE-L
6.8
SAGE-GM
multi 1hcl
Church-heat
Protein kinases (cat. core)
Church-gal
2.1
Church-alpha
3.8
Church-a
1zta
8
1 1 1 1 1 1 1 1
2 4 4 4 5 5 6 7
7 11 9 8 10 4 10 11
3 3 3 2 2 19 15 9
4 5 6 6 7 9 9 16
11 15 16 12 12 8
5 8
6 8 2 5 4 11 10 6
12
2
5 3 3 35 19 30
9
10
21
9
18
21
82 122 120
10
16
17
11
16
12
48
25
56
15
8
14
21
15
19
21
20
33
18
18
19
9
16
11
15
13
16
17
10
32
31
25
26
21
23
26
26
26
9
75
94
27
50
32
40
48
39
50
+
+
+
+
+
-46 -77 - 1
-62 -89 7
1
2
3
4
5
6
Samson
a
Genome
8.3
+98
5
5.2
-11
3
•
3.4
-14
• 6
8
3.3
0
•
2.9
-55
2
2.7
-37
4
14
2.7
+63
81
2.7 +1316
36
2.6 +348
2.6 +231
31
Young
Rel. Diff. [%]
Rep. PDB
4.2
5.8
3.9
3.3
6.4
4.4
1.7
0.2
0.6
0.8
7
25 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
715
Transcriptome [%]
Fold Class
1byb
1gky
1fxd
1xel
1mda*
2bct
2trx
1drw†
1igd
1dky
TIM barrel
P-loop NTP hydrolases
Ferredoxin like
Rossmann fold
7-bladed beta-propeller
aplha-alpha superhelix
Thioredoxin fold
G3P dehydrogenase-like
beta grasp
HSP70 C-term. fragment
long
helices oligomers
Leu-zipper
118
Rank
a/ b
a/ b
ab
a/ b
b
a
a/ b
ab
ab
multi
Fold
51
Genome [%]
Composition
Change
Common
Folds
Folds
that
change
a lot
in
frequency
are
not
common
Changing
Folds
26 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Freq.
Similar results
Martin et al.
(1998)
The number of
interactions for
each fold = the
number of
other folds it is
found to
contact in the
PDB
27 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Most Versatile Folds – Relation to Interactions
unclassified 
transcription 
transport 
signaling 
Prot. Syn. 
cell structure 
energy 
Functional Category
(MIPS)
TMs
ab
28 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Transcriptome Enrichment
Composition of Transcriptome in terms of
Functional Classes
NS 
Genome
Composition
Transcriptome
Composition
VGA 
Amino Acid
29 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Transcriptome Enrichment
Composition of Genome vs. Transcriptome
2D-gel electrophoresis Data sets: Futcher (71),
Aebersold (156), scaled set with 171 proteins
New effect is dealing with gene selection bias
What is
Proteome?
Protein
complement in
genome or
cellular protein
population?
31 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Relating the Transcriptome to Cellular
Protein Abundance (Translatome)
Amino Acid Enrichment
mRNA
Simple story is translatome is
enriched in same way as
transcriptome
33 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Protein
Protein
34 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Amino Acid Enrichment –
Complexities
mRNA
Analysis of Genomes &
Transcriptomes in terms of the
Occurrence of Parts & Features
5(G)1(T)
7(G)15(T)
1 Using Parts to Interpret
Genomes. Shared and/or unique parts. Venn
1(G)2(Y)
2 Using Parts to Interpret
Pseudogenomes. In worm, top Y-folds
(DNAse, hydrolase) v top-folds (Ig). chr. IV
enriched, dead and dying families (90YG v 1G)
3 Using Parts to Interpret
Transcriptomes:
Expression & Structure. Top-10
parts in mRNA. Enriched in transcriptome: ab folds,
energy, synthesis,TIM fold, VGA. Depleted: TMs,
transport, transcription, Leu-zip, NS. Compare with
prot. abundance.
4 Expression & Localization.
Enriched : Cytoplasmic. Depleted: Nuclear.
Bayesian localizer
5 Expression & Function.
Expression relates to structure & localization but to
function, globally? P-value formalism. Weak
relation to protein-protein interactions.
bioinfo.mbb.yale.edu
H Hegyi, J Lin, B Stenger,
P Harrison, N Echols,
R Jansen, A Drawid, J Qian,
D Greenbaum, M Snyder
35 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Diagrams, Fold tree with all-b diff. Ortholog tree.
Top-10 folds.
Composition of Transcriptome in
terms of Broad Structural Classes
ab protein 
# TM helices in yeast protein
Fold Class
of Soluble
Proteins
36 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Transcriptome Enrichment
Membrane
(TM) Protein 
37 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Expression Level is Related to Localization
38 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Distributions of
Expression Levels
~6000 yeast genes
with expression levels
39 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
but only ~2000 with localization….
State Vects
loc=
Represent localization of each
protein by the state vector P(loc)
and each feature by the feature
vector P(feature|loc). Use Bayes
rule to update.
18 Features: Expression Level
(absolute and fluctuations), signal
seq., KDEL, NLS, Essential?, aa
composition
40 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Bayesian
System for
Localizing
Proteins
Feature Vects
P(feature|loc)
loc=
Represent localization of each
protein by the state vector P(loc)
and each feature by the feature
vector P(feature|loc). Use Bayes
rule to update.
State Vects
41 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Bayesian
System for
Localizing
Proteins
Feature Vects
P(feature|loc)
Individual proteins: 75%
with cross-validation
Carefully clean training dataset to avoid circular logic
Testing, training data, Priors: ~2000 proteins from
Swiss-Prot Master List
Also, YPD, MIPS, Snyder Lab
42 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Results on
Testing Data
Compartment
Populations. Like QM,
directly sum state vectors
to get population. Gives
96% pop. similarity.
43 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Results on
Testing Data #2
ytoplasmic
Mem.
uclear
44 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Extrapolation to Compartment
Populations of Whole Yeast Genome:
~4000 predicted + ~2000 known
Analysis of Genomes &
Transcriptomes in terms of the
Occurrence of Parts & Features
5(G)1(T)
7(G)15(T)
1 Using Parts to Interpret
Genomes. Shared and/or unique parts. Venn
1(G)2(Y)
2 Using Parts to Interpret
Pseudogenomes. In worm, top Y-folds
(DNAse, hydrolase) v top-folds (Ig). chr. IV
enriched, dead and dying families (90YG v 1G)
3 Using Parts to Interpret
Transcriptomes:
Expression & Structure. Top-10
parts in mRNA. Enriched in transcriptome: ab folds,
energy, synthesis,TIM fold, VGA. Depleted: TMs,
transport, transcription, Leu-zip, NS. Compare with
prot. abundance.
4 Expression & Localization.
Enriched : Cytoplasmic. Depleted: Nuclear.
Bayesian localizer
5 Expression & Function.
Expression relates to structure & localization but to
function, globally? P-value formalism. Weak
relation to protein-protein interactions.
bioinfo.mbb.yale.edu
H Hegyi, J Lin, B Stenger,
P Harrison, N Echols,
R Jansen, A Drawid, J Qian,
D Greenbaum, M Snyder
45 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Diagrams, Fold tree with all-b diff. Ortholog tree.
Top-10 folds.
Do Expression
Clusters Relate to
Protein Function?
46 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Can they predict
functions?
Func. A
Func. B
•
•
Clustering of expression profiles
Grouping functionally related genes together (?)
•
Botstein (Eisen), Lander, Haussler, and
Church groups, Eisenberg
Sample for Diauxic shift Expt. (Brown),
Ex. Ravg,G=3 =
[ R(gene-1,gene-3) + R(gene-1,gene-4)
+
R(gene-5,gene-7) ] / 3
47 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Distributions of Gene
Expression Correlations,
for All Possible Gene
Groupings
P-value for
specific 10gene func.
group
Sample for Diauxic shift Expt. (Brown),
Ex. Ravg,G=3 =
[ R(gene-1,gene-3) + R(gene-1,gene-4)
+
R(gene-5,gene-7) ] / 3
48 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Distributions of Gene
Expression Correlations,
for All Possible Gene
Groupings 2
Correlation:
Always
Significant
49 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Sometimes
Significant
(depends
on expt.)
Never
Significant
Based on Distributions,
Correlation of
Established Functional
Categories, Computer
Clusterings
Protein-Protein
Interactions &
Expression
Sets of interactions
(all pairs)
(control)
(from MIPS)
(Uetz et al.)
between selected expression
timecourses in CDC28 expt. (Davis)
(strong interaction, clearly diff.)
50 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Use same formalism to assess
how closely related expression
timecourses to sets of known p-p
interactions
Sets of Interacting Proteins
for
Relation of P-P Interactions
to Abs. Expression Level
51 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Distribution of
Normalized
Expression Levels
Can we define FUNCTION well
enough to relate to expression?
Fold, Localization,
Interactions &
Regulation are
Problems defining function:
Multi-functionality: 2 functions/protein (also 2 proteins/function)
attributes of proteins that
are much more clearly
defined
Conflating of Roles: molecular action, cellular role, phenotypic
Non-systematic Terminology:
‘suppressor-of-white-apricot’ & ‘darkener-of-apricot’
Functional
Classification
(cross-org.,
just conserved,
NCBI
Koonin/Lipman)
GenProtEC
(E. coli, Riley)
(SwissProt
Bairoch/
Apweiler,
just enzymes,
cross-org.)
Also:
Other
SwissProt
Annotation
“Fly”
(fly, Ashburner)
now extended to
GO (cross-org.)
MIPS/PEDANT
(yeast, Mewes)
WIT, KEGG
(just pathways)
TIGR EGAD
(human ESTs)
24 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
ENZYME
COGs
vs.
52 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
manifestation.
Phenotype ORF Clusters from Transposon Expt.
20 Conditions
Conditions
20
Metabolism
Cold
k-means
clustering of
ORFs based on
“phenotype
patterns,” crossref. to MIPs
Functional
Classes
Cluster showing cold
phenotype (containing genes
most necessary in cold) is
enriched in metabolic functions
M Snyder,
A Kumar,
et al….
54 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
28
ORFs
28 ORFs
in
cluster
in cluster
Transposon insertions
into (almost) each
yeast gene to see how
yeast is affected in 20
conditions. Generates
a phenotype pattern
vector, which can be
treated similarly to
expression data
Analysis of Genomes &
Transcriptomes in terms of the
Occurrence of Parts & Features
5(G)1(T)
7(G)15(T)
1 Using Parts to Interpret
Genomes. Shared and/or unique parts. Venn
1(G)2(Y)
2 Using Parts to Interpret
Pseudogenomes. In worm, top Y-folds
(DNAse, hydrolase) v top-folds (Ig). chr. IV
enriched, dead and dying families (90YG v 1G)
3 Using Parts to Interpret
Transcriptomes:
Expression & Structure. Top-10
parts in mRNA. Enriched in transcriptome: ab folds,
energy, synthesis,TIM fold, VGA. Depleted: TMs,
transport, transcription, Leu-zip, NS. Compare with
prot. abundance.
4 Expression & Localization.
Enriched : Cytoplasmic. Depleted: Nuclear.
Bayesian localizer
5 Expression & Function.
Expression relates to structure & localization but to
function, globally? P-value formalism. Weak
relation to protein-protein interactions.
bioinfo.mbb.yale.edu
H Hegyi, J Lin, B Stenger,
P Harrison, N Echols,
R Jansen, A Drawid, J Qian,
D Greenbaum, M Snyder
55 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Diagrams, Fold tree with all-b diff. Ortholog tree.
Top-10 folds.
GeneCensus
bioinfo.mbb.yale.edu
Detailed
Tables
Alignment
Server
ORF
Query
Ranks
Trees
PDB
Query
56 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Alignment
Database
J Qian,
B Stenger,
J Lin....
Rank Folds by Genome
Occurrence, Expression, Fold
Clustering, Length, &c
57 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
PartsList
Ranking
Viewers
58 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Surveying a Finite PartsList from
Many Perspective
Recluster organisms based
on folds, composition, &c and
compare to traditional
taxonomy
59 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
GeneCensus Dynamic
Tree Viewers
Analysis of Genomes &
Transcriptomes in terms of the
Occurrence of Parts & Features
5(G)1(T)
7(G)15(T)
1 Using Parts to Interpret
Genomes. Shared and/or unique parts. Venn
1(G)2(Y)
2 Using Parts to Interpret
Pseudogenomes. In worm, top Y-folds
(DNAse, hydrolase) v top-folds (Ig). chr. IV
enriched, dead and dying families (90YG v 1G)
3 Using Parts to Interpret
Transcriptomes:
Expression & Structure. Top-10
parts in mRNA. Enriched in transcriptome: ab folds,
energy, synthesis,TIM fold, VGA. Depleted: TMs,
transport, transcription, Leu-zip, NS. Compare with
prot. abundance.
4 Expression & Localization.
Enriched : Cytoplasmic. Depleted: Nuclear.
Bayesian localizer
5 Expression & Function.
Expression relates to structure & localization but to
function, globally? P-value formalism. Weak
relation to protein-protein interactions.
bioinfo.mbb.yale.edu
H Hegyi, J Lin, B Stenger,
P Harrison, N Echols,
R Jansen, A Drawid, J Qian,
D Greenbaum, M Snyder
60 (c) Mark Gerstein, 2000, Yale, bioinfo.mbb.yale.edu
Diagrams, Fold tree with all-b diff. Ortholog tree.
Top-10 folds.