Download Printout, 6 slides per page, no animation PDF (12MB)

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Protein phosphorylation wikipedia , lookup

Histone acetylation and deacetylation wikipedia , lookup

Cell nucleus wikipedia , lookup

Magnesium transporter wikipedia , lookup

Signal transduction wikipedia , lookup

Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup

Protein moonlighting wikipedia , lookup

List of types of proteins wikipedia , lookup

Silencer (genetics) wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Gene regulatory network wikipedia , lookup

Gene expression wikipedia , lookup

Transcript
What is Biology?
“A
branch of knowledge that deals with living
organisms and vital processes”
Computational Methods in
Systems Biology
Nir Friedman
The
hottest scientific frontier of our times
• Many great processes have been figured out
• Much is still unknown
Maya Schuldiner
Tremendous
impact on Medicine
• Both diagnosis, prognosis, and treatment
.
2
Biological Systems are Complex
Bakers Yeast Saccharomyces Cereviciae
•The System is NOT just a sum of its parts
3
•Used to make bread and beer
•The simplest cell that still resembles human cells
4
The Age of Genomes
What is Systems Biology?
“Systems biology is the study of the interactions between the
components of a biological system, and how these
interactions give rise to the function and behavior of that
system”
404 Complete Microbial Genomes (Thousands in progress)
31 Complete Eukaryotic Genomes (315 in progress!)
3 Complete Plant Genomes (6 in progress)
• The last decades lead to revolution on how we can
examine and understand biological systems
Characterized by
• High-throughput assays
• Integration of multiple forms of experiments & knowledge
• Mathematical modeling
5
Bacteria
1.6Mb
1600 genes
95
96
97
Eukaryote
Animal
Human
13Mb
100Mb
3Gb
~6000 genes ~20,000 genes ~30,000 genes?
98
99
00
01
02
03
04
05
06
Individual Genomes?
07
08
09
10
6
1
Ask Not What Systems Biology Can do
For you….
.
8
Why Biology for NIPS Crowd?
Flow of Information in Biology
 Quantity
• Data-intense discipline: Too vast for manual
interpretation
 Systematic
• Collection of data on all genes/proteins/…
 Multi-faceted
• Measurements of complementary aspects of cellular
function, development and disease states
• Challenge of integration and fusion of multiple data
DNA
RNA
Recipe
(in safe)
Working
copy
Protein
The resulting
dish
Phenotype
The Review
Has the potential to be medically applicative!
9
10
The “Post-Genomic Era”
Systematic is Not Just More
Outline
Assays
DNA
Genomic
sequences
 Variations
within a
population
11
…

RNA
Quantity
 Structure
 Degradation
rate
…

DNA
Protein
Quantity
 Location
 Modifications
 Interactions
…

RNA
Protein
Phenotype
Phenotype
Stores
genetically inherited information
of four nucleotide types (A, C, G, T)
Two complementary strands creating base pairs (bp)
105 bp in bacteria, 3x109 in humans 6 X1013 in wheat
Genetic
interventions
 Environmental
interventions
…

Sequence
12
2
Understanding Genome Sequences
~3,289,000,000 characters:
aattgtgctctgcaaattatgatagtgatctgtatttactacgtgcatat
attttgggccagtgaatttttttctaagctaatatagttatttggacttt
tgacatgactttgtgtttaattaaaacaaaaaaagaaattgcagaagtgt
tgtaagcttgtaaaaaaattcaaacaatgcagacaaatgtgtctcgcagt
cttccactcagtatcatttttgtttgtaccttatcagaaatgtttctatg
tacaagtctttaaaatcatttcgaacttgctttgtccactgagtatatta
tggacatcttttcatggcaggacatatagatgtgttaatggcattaaaaa
taaaacaaaaaactgattcggccgggtacggtggctcacgcctgtaatcc
cagcactttgggagatcgaggagggaggatcacctgaggtcaggagttac
agacatggagaaaccccgtctctactaaaaatacaaaattagcctggcgt
ggtggcgcatgcctgtaatcccagctactcgggaggctgaggcaggagaa
tcgcttgaacccgggagcggaggttgcggtgagccgagatcgcaccgttg
cactccagcctgggcgacagagcgaaactgtctcaaacaaacaaacaaaa
aaacctgatacatggtatgggaagtacattgtttaaacaatgcatggaga
tttaggttgtttccagtttttactggcacagatacggcaatgaatataat
tttatgtatacattcatacaaatatatcggtggaaaattcctagaagtgg
aatggctgggtcagtgggcattcatattgagaaattggaaggatgttgtc
aaactctgcaaatcagagtattttagtcttaacctctcttcttcacaccc
ttttccttggaagaaagctaaatttagacttttaaacacaaaactccatt
ttgagacccctgaaaatctgggttcaaagtgtttgaaaattaaagcagag
gctttaatttgtacttatttaggtataatttgtactttaaagttgttcca
. . .
Goal:
Identify components encoded in the DNA sequence
13
14
Open Reading Frame
Finding Open Reading Frames
ATGCTCAGCGTGACCTCA . . . CAGCGTTAA
M
L
S
V
T
S . . .
Q
ATGCTCAGCGTGACCTCA . . . CAGCGTTAA
R STP
Protein-encoding
DNA sequence consists of a
sequence of 3 letter codons
Starts with the START codon (ATG)
Ends with a STOP codon (TAA, TAG, or TGA)
M
L
S
V
T
S . . .
Q
R STP
Try all possible starting points
3 possible offsets
2 possible strands
Simple algorithm finds all ORFs in a genome
Many of these are spurious (are not real genes)
How do we focus on the real ones?
15
16
Using Additional Genomes
Phylogentic Tree of Yeasts
S. cerevisiae
Basic premise
“What is important is conserved”
~10M years
S. paradoxus
S. mikatae
S. bayanus
C. glabrata
S. castellii
Evolution = Variation + Selection
• Variation is random
• Selection reflects function
K. lactis
A. gossypii
K. waltii
D. hansenii
C. albicans
Y. lipolytica
Idea:
 Instead of studying a single genome, compare
related genomes
 A real open reading frame will be conserved
17
N. crassa
M. graminearum
M. grisea
A. nidulans
S. pombe
18
Kellis et al, Nature 2003
3
Evolution of Open Reading Frame
S. cerevisiae
S. paradoxus
S. mikatae
S. bayanus
ATGCTCAGCGTGACCTCA
ATGCTCAGCGTGACATCA
ATGCTCAGGGTGACA--A
ATGCTCAGG---ACA--A
.
.
.
.
.
.
.
.
Conserved
Variable
Frame shift
Spurious ORF
.
.
.
.
ATG not
conserved
Frame shift
changes interpretation
of downstream seq
Conserved
positions
Examples
Confirmed ORF
Variable
positions
A deletion
Greedy algorithm to find conserved ORFs surprisingly
Sequencing
effective (> 99% accuracy) on verified yeast data
error
19
Defining Conservation
Naïve approach
Consensus between all
species
Problem:
A
A
A
A
grained
distances between
species
 Ignores the tree topology
 Ignores
A
C
A
G
Variable
A C
A C
A C
A A
A G C A
A G C A
A T C C
Goal:
More sensitive and robust
methods
Probabilistic Model of Evolution
Conserved
A T C C
A C C A
 Rough
% conserv 100 33 55 55
21
Aardvark Bison Chimp Dog
Assumptions:
Positions (columns) are independent of each other
Each branch is a reversible continuous time
discrete state Markov process
22
Conserved vs. unconserved
Two hypotheses:
P (a " c |t + t ') = ! P (a " b |t )P (b " c |t ')
b
2 3 4 1
Conserved
Short branches
(fewer mutations)
P (a )P (a ! b |t ) = P (b )P (b ! a |t )
governed by a rate matrix Q
Q a,b =
d
P(a " b | t)
dt
t= 0
P(a " b | t) = [e tQ ]
Elephant
Random variables – sequence at current day taxa or
at ancestors
Potentials/Conditional distribution – represent the
probability of evolutionary changes along each
branch
Parameterization of Phylogenies
23
[Kellis et al, Nature 2003]
20
Use log
a,b
24
2
3
4
1
Unconserved
Long branches
(more mutations)
P(position | unconserved )
P(position | conserved)
[Boffelli et al, Science 2003]
!
4
Genes Are Better Conserved
Challenges
% conserved
log Fast/Slow
Other types of genomic elements
Small polypeptides (peptohormones,
neuropeptides)
RNA coding genes
• rRNA, tRNA, snoRNA…
• miRNA
Regulatory regions
[Boffelli et al, Science 2003]
25
27
Transcription Factor Binding Sites
Regulatory Elements



Relatively short words (6-20bp)
Recognition is not perfect
• Binding sites allow variations
Often conserved
*Essential Cell Biology; p.268
28
29
Challenges
Other types of genomic elements
Small polypeptides (peptohormones,
neuropeptides)
RNA coding genes
• rRNA, tRNA, snoRNA…
• miRNA
Regulatory regions
Recognition of elements without comparisons
Clearly sequence contains enough information to
“parse” it within the living cell
30
Outline
DNA
RNA
Protein
Phenotype
Copied from DNA template
 Conveys information (mRNA)
 Can also perform function (tRNA, rRNA, …)
 Single stranded, four nucleotide types (A,C, G, U)
 For each expressed gene there can be as few as 1
molecule and up to 10,000 molecules per cell.

31
5
Gene Expression
High Throughput Gene Expression
Transcription
Translation
Extract
Same
DNA content
different phenotype
Difference is in regulation of expression of genes
Very
33
RNA expression
levels of 10,000s
of genes in
one experiment
Microarray
34
Dynamic Measurements
Conditions
Expression: Supervised Approaches
Labeled samples
Time
courses
perturbations
(genetic & environmental)
Biopsies from different
patient populations
…
Feature selection
+
Classification
Classifier confidence
Genes
Different
Potential
diagnosis/prognosis tool
the disease state
⇒ insights about underlying processes
Characterizes
35
Gasch et al. Mol. Cell 2001
Segman et al, Mol. Psych. 2005
36
Expression: Unsupervised
P-value =< 0.027
Papers Compendia
26 datasets from Whitehead and Stanford
Various tumors
Stimulated
PBMC
B lymphoma
Breast cancer
Stimulated
immune
PCA
Viral infection
Fibroblast
EWS/FLI
Prostate
cancer
Cluster
Fibroblast
infection
Neuro tumors
Fibroblast
serum
NCI60
Gliomas
HeLa cell cycle
Lung cancer
37
Eisen et al. PNAS 1998; Alter et al, PNAS 2000
39
Leukemia
Liver cancer
Segal et al Nat. Gen. 2004
6
Apoptosis
DNA damage /
nucleotide metabolism
Apoptosis
Immune
MMPs
Immune
Cancer types
Goal: Reconstruct Cellular Networks
Signaling &
growth regulation
Signaling
Immune
Muscle
Immune
Immune
Cytoskeleton & ECM
Adhesion & signaling
Synapse & signaling
Metabolism
Chromatin
Breast
Cytoskeleton (IF & MT)
Modules
Cancer span wide range of phenomena
•Tumor type specific
•Tissue specific
•Generic across many tumors
Signaling &
development
Protein biosynthesis
Translation, degradation
& folding
Nucleotide metabolism
Signaling, development
& oxidative phos.
Cell lines
Cell cycle
ECM
IF & keratins
Metabolism, detox &
immune
Signaling &
growth regulation
Metabolism, detox &
immune
Signaling
Signaling
Signaling
Signaling & CNS
Biocarta. http://www.biocarta.com/
Tissues
Immune
Liver
Immune
Lung /
Hemato
Immune
AD/CNS
CNS
Hemao
Hemato
Cell lines
Liver
Hemato
Immune
Breast\liver
Hemato
Lung/AD
Hemati
40
>0.4
Hemato
0
Leukemia
>0.4
Segal et al Nat. Gen. 2004
41
First Attempt: Bayesian networks
Gene A
Second Attempt: Module Networks
Gene B
MAPK of cell
wall integrity
pathway
SLT2
Gene C
Gene D
RLM1
Regulation
Function 1
Gene E
One gene  One variable
An instance: microarray sample
Use standard approaches for learning networks
Friedman et al, JCB 2000
42
CRH1
Regulation
Function 2
Validation
Module 25
(59 genes)
Tpk1
Kin82
Msn4
Usv1
How do we evaluate ourselves?
Statistical validation
• Ability to generalize (cross validation test)
Nrg1
150
Module 4
(42 genes)
Hap4
Mth1
Module 1
(55 genes)
Atp1
44
Atp3
PTP2
Regulation
Function 4
Segal et al, Nature Genetics 2003
43
Test Data Log-Likelihood
(gain per instance)
et al. 2001: Yeast
Response to Environmental
Stress
Module 2
 173 Yeast arrays
(64 genes)
 2355 Genes
Tpk2
 50 modules
Regulation
Function 3
Idea: enforce common regulatory program
Statistical robustness: Regulation programs are
estimated from m*k samples
Organization of genes into regulatory modules:
Concise biological description
Learned Network (fragment)
 Gasch
YPS3
One common regulation function
100
50
-50
-100
-150
Atp16
45
Bayesian
network
performance
0
0
100
200
300
400
Number of modules
500
7
Validation
Visualization & Interpretation
How do we evaluate ourselves?
Statistical validation
Biological interpretation
• Annotation database
• Literature reports
• Other experiments, potentially different
experiment types
GO
Cis-regulatory
motifs
Functional
annotations
Molecular Pathways
(KEGG GeneMAPP)
Visualization
Interpretation
Hypotheses
 Function
 Dynamics
 Regulation
Expression
profiles
47
Hap4
Msn4
Oxid. Phosphorylation (26, 5x10-35)
Mitochondrion (31, 7x10-32)
Aerobic Respiration (12, 2x10-13)
46
HAP4 Motif
29/55; p<2x10-13
STRE (Msn2/4)
32/55; p<103
HAP4+STRE
17/29; p<7x10-10
-500
-400
-300
-200
-100
Gene set coherence (GO, MIPS, KEGG)
Match between regulator and targets
Match between regulator and cis-reg motif
Validation
How do we evaluate ourselves?
Statistical validation
Biological interpretation
Experiments
• Test causal predictions in the real system
• Lead to additional understanding beyond the
prediction
• Experimental validation of three regulators
♦ 3/3
Match between regulator and condition/logic
p-values using hypergeometric dist; corrected for multiple hypotheses
48
successful results
Segal et al, Nature Genetics 2003
49
Challenges
Outline
New
methodologies for the huge amount of existing
RNA profiles
• Meta analysis
• Better mechanistic models
• Contrasting new profiles with existing databases
• Visualization
DNA
RNA
Protein
Phenotype
Proteins
Other
50
measurements
• Degradation rates
• Localization
are the main executers of cellular function
Building blocks are 20 different amino-acid
Synthesized from mRNA template
Acquires a sequence dependent 3-D conformation
Proteomics: Systematic Study of Proteins
51
8
Why Measure Proteins?
Challenges in Proteomics
RNA
Level ≠ Protein level
Protein quantity is not a direct
function of RNA levels
Problematic
recognition:
No generic mechanism to detect different protein
forms
Protein
Thousands
Level ≠ Activity level
Activity of proteins is regulated
by many additional mechanisms
• Cellular localization
• Post-translational
modifications
• Co-factors (protein, RNA, …)
of different proteins in the typical cell
Protein
abundances vary over several orders of
magnitude
52
53
Making a Protein Generic
TAP-Tag Libraries for Abundance
~4500 Yeast strains have been TAP tagged
TAG
•
•
Tags make a protein generic
Underlying assumption is that the tag does not
change the protein
All proteins have the same tag
1. Inability to pool strains
2. Each experiment is done on a “different” strain
•
•How much is each protein expressed?
•What is the proteome under different conditions?
54
55
Why Study Protein Complexes?
Using TAP-Tag to Find Complexes
#
Most proteins in the cell work in protein
complexes or through protein/protein interactions
#
To understand how proteins function we must
know:
 - who they interact with
- when do they interact
- where do they interact
- what is the outcome of that interaction
56
.
9
Large Scale Pull Downs Provide
Information on Protein Complexes
•Both labs used the same proteins as bait
•Each lab got slightly different results
•The results depended dramatically on analysis method
*Gavin et al. Nature 2006
*Gavin et al. Nature 2006
*Krogan et al. Nature 2006
.
*Krogan et al. Nature 2006
59
Making a Protein Generic
We can now define a yeast “interactome”
•Isnt full use of data
•Static picture
.
1. Fluorescent proteins allow us to visualize the
proteins within the cell.
2. Allow us to measure individual cells and the
variation/ noise within a population
61
Cellular Localization Using GFP Tags
What can it teach us?
Challenges in Fluorescence-based
Approaches
Better
Vision processing will allow to do this
in High-Throughput and answer questions
like:
• Changes in localization in response to cellular
cues
• Changes in localization in response to
environment cues
• Changes in localization in various genetic
backgrounds
• Dynamics of localization changes
A library of yeast GFP fusion strains has been
used to localize nearly all yeast proteins
62
Huh et al Nature 2003
A collection of cloned C. elegans promoters
is being created for similar purposes
Genome Research 14:2169-2175, 2004
63
10
Single Cell Measurements:
Flow Cytometry
 Cells
THROUGHPUT
THE MAJOR BOTTLENECK
pass through a flow cell
one at a time
 Lasers focused on the flow
cell excite fluorescent protein
fusions
 Allows multiple
measurements (cell size,
shape, DNA content)
Applications:
 Protein abundance
 Protein-protein interactions
 Single-cell measurements
.
65
Comparison of mRNA to Protein Levels Allows
Identification of Post-transcriptional Regulation
High Throughput Flow Cytometer
 Rich
 Poor
media
media
Observed behaviors
 No
change in both
change
 Change in protein, but
not mRNA
 Coordinated
7
seconds/sample
counts per sample
Log2 Poor/Rich protein
Compare
~50,000
66
Log2 Poor/Rich mRNA
67
Newman et al Nature 2006
Noise in Biological Systems
Challenges
Proteomics is in its infancy - easier to make an impact



 Measurement
of 10,000 individual cells allows measurement
of variation (noise) in a biological context
 factors that affect levels of noise in gene expression:
• Abundance, mode of transcriptional regulation, subcellular localization
68
Nature 441, 840-846(15 June 2006)
Integrating this data with other proteomic/genomic data to
better predict protein function
Higher Throughput methods such as flow cytometry will
allow generation of varied data: Different growth
conditions, Cell cycle, Stress, Mating
Tagging is mammalian cells becoming more feasable near future should bring proteomic data on human cells
69
11
Outline
Single Gene KO
Phenotypic
RNA
DNA
Protein
Screen
Phenotype
 Traits
that selection can apply to, the observable characteristics
in the DNA can cause a change in a phenotype.
• Shape and size
• Growth rate
• How many years your liver can survive alcohol damage….
 Mutations
70
Giaver et ., 2002
71
Starting to Probe the Cellular Network
What is a genetic interaction (Epistasis)?
The effect of a mutation in one gene on the phenotype of a mutation in a second gene.
Genotype
WT
Δ geneA
Δ geneB
Genetic Interaction
•The effect of a mutation in one gene on the phenotype of a
mutation in a second gene
•Different type of interaction - not physical
Δ
Growth Rate
1
x
(x <= 1)
y
(y <= 1)
geneA Δ geneB
xy (Product)
DIFFERENT TYPE OF INTERACTION - NOT PHYSICAL
72
74
What is a genetic interaction?
Genetic Interaction
None
Aggravating
Alleviating
Growth Rate
xy
less than xy
greater than xy
Systematic Method of Analyzing Double
Mutants
X
∆X:NAT
∆Y:KAN
∆X:NAT
A
X
×B ×Y
75
C
∆Y:KAN
×A
×B
X
WT
Y
C


76
∆X:NAT
∆Y:KAN
∆X:NAT
∆Y:KAN
Double deletion mutants are made systematically
Colony sizes are measured in high throughput
Tong et al., 2001
12
E-MAPS
Defining Protein Complexes
Epistasis Mini Array Profiles
On
A
Co-complex proteins have
• similar interaction patterns
• alleviating interactions
B
C
Off
77
Aggravating
Alleviating
Aggravating
Schuldiner et al., Cell 2005
Alleviating
78
Outline
Challenges for the future
Only
a small fraction of the information has been
utilized in E-MAPS made so far
E-MAPS to cover all yeast cellular processes to
come out until the end of 2007
Extending this to human cells is now feasible
using gene silencing techniques
Amount of data scales exponentially - Higher
organisms - more genes
RNA
DNA
Protein
Phenotype
Combined Insights
Model-free
approach
Model-based approach
.
80
Why Integrate Data?
Model-Free Approach
Location
Gene
Nuc
Cyto
Expression Phenotype
Mito
Rich
Poor
Salt
Kan
Binding sites
RAP1
HSF1
GCN4
YAL001C
YAL002W
YAL003W
YAR040W
attttgggccagtgaatttttttctaagctaatatagttatttggacttt
tgacatgactttgtgtttaattaaaacaaaaaaagaaattgcagaagtgt
tgtaagcttgtaaaaaaattcaaacaatgcagacaaatgtgtctcgcagt
cttccactcagtatcatttttgtttgtaccttatcagaaatgtttctatg
tacaagtctttaaaatcatttcgaacttgctttgtccactgagtatatta
tggacatcttttcatggcaggacatatagatgtgttaatggcattaaaaa
taaaacaaaaaactgattcggccgggtacggtggctcacgcctgtaatcc
aattgtgctctgcaaattatgatagtgatctgtatttactacgtgcatat
High-throughput assays:
• Observations about one aspect of the system
• Often noisy and less reliable than traditional
assays
• Provide partial account of the system
81
YAR041C
Treat
different observations about elements as
multivariate data
• Clustering
• Statistical tests
82
13
Model-Free Approach
Finding bi-clusters in large compendium of functional
data
Tanay et al PNAS 2004
83
Model-Free Approach
Pros:
No assumptions about data
• Unbiased
• Can be applied to many data types
Can use existing tools to analyze combined data
Cons:
No assumptions about data
• Interpretation is post-analysis
• No sanity check
Cannot deal with data from different modalities
(interactions, other types of genetic elements)
84
Model-Based Approach
Explaining Expression
DNA binding proteins
Non-coding region
Gene
What is a model?
“A description of a process that could have
generated the observed data”
attttgggccagtgaatttttttctaagctaatatagttatttggacttt
tgacatgactttgtgtttaattaaaacaaaaaaagaaattgcagaagtgt
tgtaagcttgtaaaaaaattcaaacaatgcagacaaatgtgtctcgcagt
cttccactcagtatcatttttgtttgtaccttatcagaaatgtttctatg
tacaagtctttaaaatcatttcgaacttgctttgtccactgagtatatta
tggacatcttttcatggcaggacatatagatgtgttaatggcattaaaaa
taaaacaaaaaactgattcggccgggtacggtggctcacgcctgtaatcc
aattgtgctctgcaaattatgatagtgatctgtatttactacgtgcatat
 Idealized, simplified, cartoonish
 Describes the system & how it generates
observations
85
Activator Repressor
Coding region
Binding sites
RNA transcript
Key Question:
Can we explain changes in expression?
General concept:
Transcription factor binding sites in promoter region
should “explain” changes in transcription
86
Explaining Expression
A Stab at Model-Based Analysis
Relevant data:
Expression under environmental perturbations
Expression under transcription factors KOs
Predicted binding sites of transcription factors
Protein-DNA interactions of transcription factors
Protein levels/location of transcription factors
…
87
ACGATGCTAGTGTAGCTGATGCTGATCGATCGTACGTGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC
CCAAT
AGCTAGCTCGACTGCTTTGTGGGGCCTTGTGTGCTCAAACACACACAACACCAAATGTGCTTTGTGGT
GATAC
TCGACTGC
CCAAT
ACTGATGATCGTAGTAACCACTGTCGATGATGCTGTGGGGGGTATCGATGCATACCACCCCCCGCTC
CCAAT
GATCGATCGTAGCTAGCTAGCTGACTGATCAAAAACACCATACGCCCCCCGTCGCTGCTCGTAGCATG
TCGACTGC
GATAC
CCAAT
TCGACTGC GATAC
CTAGCTAGCTGATCGATCAGCTACGATCGACTGATCGTAGCTAGCTACTTTTTTTTTTTTGCTAGCAC
GCAGTT
CCAACTGACTGATCGTAGTCAGTACGTACGATCGTGACTGATCGCTCGTCGTCGATGCATCGTACGTA
CCAAT
TCGACTGC
GCAGTT CCAAT
GCTACGTAGCATGCTAGCTGCTCGCAAAAAAAAAACGTCGTCGATCGTAGCTGCTCGCCCCCCCCCCC
GATAC
TCGACTGC
CCAAT
GCAGTT
CGACTGATCGTAGCTAGCTGATCGATCGATCGATCGTAGCTGAATTATATATATATATATACGGCG
Motifs
TCGACTGC
Motif
Profiles
TCGACTGC
+
GATAC
GATAC
CCAAT
CCAAT
Genes
Sequence
GCAGTT
GCAGTT
+
CCAAT
Expression
Profiles
88
14
Unified Probabilistic Model
Sequence
Sequence
S1
S2
S3
S4
S1
Motifs
Motif
Profiles
Gene
Expression
Segal et al, RECOMB 2002, ISMB 2003
Sequence
S1
S2
S3
Expression
Segal et al, RECOMB 2002, ISMB 2003
Probabilistic Model
Observed
S4
Experiment
Gene
Expression
Profiles
90
Sequence
Regulatory Modules
Sequence
S1
Motifs
S2
S3
S4
Motif profile
Motifs
R1 R2 R3
Expression
Profiles
91
S4
Module
Unified Probabilistic Model
Motif
Profiles
S3
R1 R2 R3
Experiment
Expression
Profiles
89
Sequence
S2
Motifs
R1 R2 R3
Motif
Profiles
Sequence
genes
Sequence
Unified Probabilistic Model
Expression profile
R1 R2 R3
Experiment
Module
Motif
Profiles
ID
Gene
Level
Observed
Expression
Segal et al, RECOMB 2002, ISMB 2003
Expression
Profiles
92
Model-Based Approach
Experiment
Module
ID
Gene
Level
Expression
Segal et al, RECOMB 2002, ISMB 2003
Physical Interactions
Pros:
Incorporates biological principles
• Suggests mechanisms
• Incorporate diverse data modalities
Declarative semantics -- easy to extend
Cons:
Reconstruction depends on the model
Biological principles
• Bias
93
94
15
Physical Interactions
Relational Markov Network
Interaction between two proteins makes it more
probable that they
• share a function
• reside in the same cellular localization
• their expression is coordinated
• have similar genetic interactions
•…
Can we exploit this to make better inference of
properties of proteins?
Probabilistic
Represent
patterns hold for all groups of objects
local probabilistic dependencies
P1 .N
0
0
1
1
Protein
Nucleus
Cytoplasm
P2.M
0
1
0
1
φ
0
0
0
-1
Mitochndria
P1 .N
0
0
0
0
1
1
1
1
P2.N
0
0
1
1
0
0
1
1
I.E
0
1
0
1
0
1
0
1
φ
0
0
0
-1
0
-1
0
2
Interaction
Exists
Protein
Nucleus
Cytoplasm
Mitochndria
95
96
Relational Markov Network
Compact
Adding Noisy Observations
Add
model
Allows
to infer protein attributes by combining
• Interaction network topology (observed)
• Observations about neighboring proteins
class for experimental assay
View assay result as stochastic function (CPD) of
underlying biology
GFP image
Protein
Cytoplasm
Nucleus
Nucleus
Cytoplasm
Mitochndria
Mitochndria
Interaction
Exists
Directed
CPD
Protein
Nucleus
Cytoplasm
Mitochndria
97
98
Uncertainty About Interactions
Design Plan
Add
interaction assays as noisy sensors for
interactions
GFP image
Protein
Nucleus
Cytoplasm
Nucleus
Cytoplasm
Mitochndria
Mitochndria
Simultaneous
prediction
Interaction
Exists
Taf1
Protein
Nucleus
Mitochndria
99
Relational Markov
Network
Tbf1
Med17
Cytoplasm
Assay
Cln5
Srb1
Interact
Mcm1
Pre7
Cdk8
Pre9
Pup3
Med1
Taf10
Pre5
Med5
100
16
Relational Markov Network
Add
Relational Markov Models
potentials over interactions
Protein
Combine
(Noisy) interaction assays
(Noisy) protein attribute assays
Preferences over network structures
Protein
Nucleus
Nucleus
Interaction
Exists
Potential over
Interaction
Interaction
Exists
Exists
To find a coherent prediction of the interaction
network
Protein
Nucleus
101
102
Discussion
The Need for Computational Methods
Every
day papers are published with highthroughput data that is not analyzed completely or
not used in all ways possible
Experiment
The
bottlenecks right now are the time and ideas to
analyze the data
Modeling &
Simulation
Low-level
analysis
High-level
analysis
104
106
What are the Options?
Analyze
published data
• Abundant, easy to obtain
• Method oriented
• Don’t have to bump into biologists
• Two million other groups have that data too
Collaborate
with an experimental group
• Be involved in all stages of project
• Understand the system and the data better
• Have priority on the data
• Involved in generating & testing biological hypotheses
• Goal oriented
Start
107
your own experimental group…(yeah, sure)
Questions to Keep in Mind
Crucial questions to ask about biological problems
What quantities are measured?
Which aspects of the biological systems are probed
How are they measured?
How this measurement represents the underlying
system? Bias and noise characteristics of the data
Why are these measurements interesting?
Which conclusions will make the biggest
impact?
108
17
Acknowledgements
Slides:
The Computational Bunch
•Yoseph Barash
•Ariel Jaimovich
•Tommy Kaplan
•Daphne Koller
•Noa Novershtern
•Dana Pe’er
•Itsik Pe’er
•Aviv Regev
•Eran Segal
The Biologist Crowd
•David Breslow
•Sean Collins
•Jan Ihmels
•Nevan Krogan
•Jonathan Weissman
Special thanks: Gal Elidan, Ariel Jaimovich
109
18