Download Gene Net Analysis: Motifs vs. Correlation

OVERVIEW
Omer Berkman
1
Contents



Biological background
Using gene arrays to decipher gene-regulatory interactions
Applications…
2
Hybridization


A DNA double strand forms by “gluing” together
complementary single strands.
Complementarity rule:
A-T/U, G-C
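The complementarity rule can be illustrated with a minimal sketch. The function names below are hypothetical, and the sketch assumes perfect, antiparallel DNA-DNA pairing only (no mismatches, no RNA strand).

```python
# Watson-Crick pairing for DNA, as on the slide: A-T, G-C.
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the strand that would hybridize to `strand` (both read 5'->3')."""
    return "".join(DNA_PAIRS[base] for base in reversed(strand.upper()))


def hybridizes(probe: str, target: str) -> bool:
    """True if probe and target are perfectly complementary."""
    return reverse_complement(probe) == target.upper()


print(reverse_complement("ATGC"))  # GCAT
```

A probe on a chip is designed as the reverse complement of the sequence it is meant to capture, which is what `hybridizes` checks.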
3
Protein production
4
From DNA to Protein
Transcription
Gene
Translation
mRNA
Protein
Cells express different subsets of their genes in
different tissues and under different conditions.
5
Functional genomics


The complete sequences of many microbial genomes are
already known: the inventory of the building blocks of life
has been collected.
The next stage is ‘‘re-assembling the pieces’’:
- Defining the role of each gene in these genomes.
- Understanding how the genome functions as a whole in
the complex natural history of a living organism.
Knowing when and where a gene is expressed
often provides a strong clue as to its biological role.
6
Transcriptional process




This process is highly regulated.
One of the most important ways in which the
cell regulates gene expression is by using a
feedback loop.
Some of the proteins are transcription factors.
These proteins regulate the expression of
other genes (and possibly their own
expression) by either initiating or repressing
transcription.
7
Transcriptional networks



One gene can be a regulator of another gene.
Transcriptional networks are the biochemical networks
responsible for regulating the expression of genes in cells.
In these transcription networks, the nodes
represent transcription factors (genes) and the
edges represent direct transcriptional regulation.
[Shen-Orr 2002, Thieffry 1998]
8
Transcriptional networks
example
9
Gene-arrays for mRNA
analysis



Differences in cell type or state are
correlated with changes in the mRNA
levels of its genes.
The only specific reagent required to
measure the abundance of the mRNA for a
specific gene is a cDNA sequence.
DNA microarrays provide a practical and
economical tool for studying gene
expression on a very large scale.
10
Affymetrix model for DNA
chip
Now we can infer which of the genes were expressed and at what intensity.
Because of some biological processes, the correct sequence is not always the one
that hybridizes to the oligo.
11
Gene Arrays / DNA chips



From “one gene in one experiment” to
“massively parallel biological data
acquisition”.
Simultaneously analyzing the expression
levels of large numbers of genes provides
the opportunity to study the activity of
whole genomes.
Large-scale gene expression analysis reveals
the behavior of co-regulated gene networks.
12
Raw Data

The curse of dimensionality: thousands of genes
versus only a few observations.
13
Static versus dynamic
We distinguish between static experiments and
time series experiments:

Static –



A snapshot in different samples is measured.
Data are assumed to be independent identically
distributed.
Dynamic –


A temporal process is measured.
Data have strong autocorrelation between successive
points.
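The difference between the two regimes can be made concrete with a lag-1 autocorrelation, sketched below. The function name and both example series are invented for illustration; they are not taken from any of the experiments discussed here.

```python
import numpy as np


def lag1_autocorrelation(x):
    """Sample autocorrelation between successive points of a series."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.dot(centered[:-1], centered[1:]) / np.dot(centered, centered))


# Hypothetical smooth time course (dynamic experiment): neighbors are similar.
time_course = np.sin(np.linspace(0.0, np.pi, 12))

# Hypothetical series with no smooth temporal structure.
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

print(lag1_autocorrelation(time_course))
print(lag1_autocorrelation(alternating))
```

A value close to 1 means that successive measurements carry overlapping information, so the samples cannot be treated as independent.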
14
Temporal observations



It’s possible to produce time-dependent
measurements, termed expression matrices.
These expression matrices are the result of the
underlying regulatory network.
Reverse engineering seeks to extract information
from time-series measurements in order to identify
regulatory interactions in these genetic networks.
15
Complications







The curse of dimensionality
Extremely noisy observations
Expensive experiments
Stochastic nature
Population averaged
Feasible time scale
Partial information
We are facing a hard problem…
16
1. The curse of
dimensionality (Bellman,
1961)



The number of genes typically far exceeds the
number of time points for which data are available,
making the problem an ill-posed one.
“Traditional statistics” won’t help here: the number
of samples, relative to the number of genes, does not
provide enough information to construct a fully
detailed model with high statistical significance.
New statistical methods and approaches were developed
(bootstrap, interpolation, clustering, FDR…).
17
2. Stochastic nature
Deterministic
Stochastic
Biology has no deterministic processes…
18
3. Population averaged



Measurements are obtained
as population-averaged data
The measurement itself kills
or alters the organism
This masks the real regulatory
interactions (quantization
problem).
19
4. Feasible time scale
Empirical limit on the number of time points :


The average speed of the biological process
determines the number of informative points.
The error of the method applied has to be
smaller than the expression level difference.
MISSING REGULATORY INTERACTIONS
COST and ERROR
20
5. Partial information




Biological systems are robust, adaptable, and
redundant.
Genes are not the only actors in the game:
transcription factors can be of many kinds.
The regulatory interactions between genes are
not deterministic at the mRNA level: a gene
has a few independently regulated derivatives.
mRNA expression data alone give only a
partial picture that does not reflect key events
such as translation and protein (in)activation.
21
Fundamental question


How much information is needed to map the gene-regulatory interactions of a biological system?
Hertz’s Estimation [1998] for the number of gene
states to be measured for a successful reverse
engineering:
P=K log (N/K)
N - The size of the network (e.g. the number of genes)
K - The average number of interactions per gene.
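To get a feel for the numbers, the estimate can be evaluated directly. This is only a rough sketch: the slide gives neither the base of the logarithm nor any constant factor, so the natural logarithm is assumed here, and the values of N and K in the example are hypothetical.

```python
import math


def hertz_estimate(n_genes: int, k_interactions: float) -> float:
    """P = K * log(N / K); the natural log is an assumption of this sketch."""
    return k_interactions * math.log(n_genes / k_interactions)


# Hypothetical example: 6000 genes with about 3 regulators per gene.
print(round(hertz_estimate(6000, 3), 1))
```

Under these assumptions the required number of measured states grows only logarithmically with network size, but linearly with the connectivity K.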
22
Application 1 [DeRisi 1997]
Exploring the metabolic and
genetic control of gene
expression


Investigation of gene expression
accompanying the metabolic shift from
fermentation to respiration in yeast.
Identify genes whose expression was
affected by deletion of TUP1 or overexpression of YAP1.
23
Yeast genome micro-array
Genes induced or repressed appear in this image as
red and green spots, respectively.
24
Temporal samples
25
Analysis




Stable gene expression during exponential growth.
A marked change was seen as glucose was
progressively depleted from the growth media.
- mRNA levels for 710 genes were induced.
- mRNA levels for 1030 genes declined.
The expression patterns observed for previously
characterized genes showed concordance with
previously published results.
About half of these differentially expressed genes
have no apparent homology to any gene whose
function is known. This provides the first small clue to…
26
Coordinated regulation of
functionally related genes
Genes can be grouped on the basis of the
similarities in their expression patterns
27
Distinct temporal patterns
28
Metabolic Diagram
Red boxes identify genes whose expression increases in the diauxic shift.
Green boxes identify genes whose expression diminishes in the diauxic shift.
29
Defining the contributions of
individual regulatory genes


Using a DNA micro-array to identify genes whose
expression is affected by mutations in each putative
regulatory gene.
Performing:
- Deletion of the transcriptional repressor TUP1.
- Overexpression of the transcriptional activator YAP1.
30
Deleting the TUP1 gene




Wild-type yeast cells and cells bearing a deletion
of the TUP1 gene were grown.
mRNA was isolated from the two populations
and used to prepare cDNA labeled with green
and red fluorescent dyes.
The labeled probes were mixed and
simultaneously hybridized to the micro-array.
Red spots on the array represent genes that
were induced in the TUP1 deletion strain, and thus
are presumably repressed by TUP1.
31
Overexpressing the YAP1
gene


Complementary DNA from the control
and YAP1 over-expressing strains,
labeled with Cy3 and Cy5, respectively,
was prepared from mRNA isolated from
the two strains and hybridized to the
micro-array.
Red spots on the array represent genes
that were induced in the strain overexpressing YAP1.
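One common way to turn such a two-color spot into a call is the log ratio of the two channel intensities. The sketch below assumes red (Cy5) is the test strain and green (Cy3) the control, as on this slide. The function names, the two-fold cutoff, and the intensities are assumptions of the sketch, not values from the paper.

```python
import math


def log2_ratio(red_intensity: float, green_intensity: float) -> float:
    """log2(test / control); positive means more mRNA in the test strain."""
    return math.log2(red_intensity / green_intensity)


def call_spot(red_intensity: float, green_intensity: float,
              fold_cutoff: float = 2.0) -> str:
    """Classify a spot using a symmetric fold-change cutoff (assumed: 2-fold)."""
    ratio = log2_ratio(red_intensity, green_intensity)
    cutoff = math.log2(fold_cutoff)
    if ratio >= cutoff:
        return "induced"
    if ratio <= -cutoff:
        return "repressed"
    return "unchanged"


print(call_spot(4000.0, 1000.0))  # induced
```

Working on the log scale makes a two-fold induction and a two-fold repression symmetric around zero.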
32
Characterization of
regulatory pathways and
networks



Use of a micro-array to characterize the
transcriptional consequences of mutations
provides a simple and powerful approach.
This strategy also has an important practical
application in drug screening.
However, one should keep in mind that
transcriptional regulation can be
complicated.
33
Application 1 summary



DNA micro-arrays provide a simple and economical
way to explore gene expression patterns on a
genomic scale.
“The greatest challenge now is to develop efficient
methods for organizing, interpreting, and extracting
insights from the large volumes of data these
experiments provide.”
Technical advances have made array experiments
fairly easy to do, but tools for analyzing the data
they produce have lagged behind.
34
Application 2 [Friedman 2000]
Using Bayesian Networks to
Analyze Expression Data


Probabilistic approach.
Bayesian network as a model for genetic
networks.
35
Bayesian networks –
definitions

Representation of a joint probability distribution.
This representation consists of two components:


G is a directed acyclic graph (DAG) whose vertices
correspond to the random variables
θ describes a conditional distribution for each
variable, given its parents in G.
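Together, G and θ define the joint distribution as a product of one conditional term per variable, each conditioned on that variable's parents. The sketch below shows this for a made-up three-gene network. The structure, the variable names, and all probabilities are hypothetical.

```python
from itertools import product

# G: hypothetical structure A -> B, A -> C (all variables binary).
parents = {"A": (), "B": ("A",), "C": ("A",)}

# theta: P(variable = 1 | parent values); numbers invented for illustration.
theta = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0,): 0.2, (1,): 0.6},
}


def joint_probability(assignment):
    """P(A, B, C) as the product of P(X | parents of X) over all variables."""
    prob = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        p_one = theta[var][pa_values]
        prob *= p_one if assignment[var] == 1 else 1.0 - p_one
    return prob


# The factorized terms define a proper distribution: they sum to 1.
total = sum(joint_probability(dict(zip("ABC", values)))
            for values in product((0, 1), repeat=3))
print(joint_probability({"A": 1, "B": 1, "C": 0}), total)
```

Only five numbers are stored here instead of the seven needed for an unrestricted joint table over three binary variables; the saving grows quickly with the number of variables when the graph is sparse.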
36
Simple example
37
Bayesian networks –
properties

Encodes the Markov assumption :
Each variable is independent of its non-descendants,
given its parents in the graph



A graph-based model that captures properties of
conditional independence between variables.
Useful for describing processes composed of
locally interacting components.
Provides models of causal influence.
38
Equivalence classes




Let Ind(G) be the set of independence statements (of the
form X is independent of Y given Z).
More than one graph can imply exactly the same set of
independencies.
Two graphs G’ and G’’ are equivalent if Ind(G’)=Ind(G’’),
that is, both graphs are alternative ways of describing the
same set of independencies.
Equivalent graphs have the same underlying undirected
graph but might disagree on the direction of some of the
arcs (we switch to PDAG).
39
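Equivalence can be tested without enumerating independence statements. The sketch below relies on a known characterization that is not stated on the slide (Verma and Pearl): two DAGs are equivalent exactly when they share the same undirected skeleton and the same v-structures, that is, colliders X → Z ← Y with X and Y not adjacent. The function names and example graphs are hypothetical.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of a list of directed (source, target) edges."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges):
    """Colliders x -> z <- y in which x and y are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    found = set()
    for child, pas in parents.items():
        for x, y in combinations(sorted(pas), 2):
            if frozenset((x, y)) not in skel:
                found.add((x, child, y))
    return found


def equivalent(edges_a, edges_b):
    return (skeleton(edges_a) == skeleton(edges_b)
            and v_structures(edges_a) == v_structures(edges_b))


chain = [("X", "Y"), ("Y", "Z")]
reversed_chain = [("Z", "Y"), ("Y", "X")]
fork = [("Y", "X"), ("Y", "Z")]
collider = [("X", "Y"), ("Z", "Y")]

print(equivalent(chain, reversed_chain), equivalent(chain, collider))
```

The chain, its reversal, and the fork all encode "X is independent of Z given Y", so expression data alone cannot tell them apart; only the collider makes a different claim.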
Learning Bayesian Networks




Given a training set D of independent instances of X, find a
network B={G, θ} that best matches D.
Several scoring functions are available.
Finding the structure G that maximizes the score is a
problem which is known to be NP-hard.
For heuristic search we need:
- A score function which is decomposable.
For example: S(G:D) = log P(D|G) + log P(G) + C
- An iterative search method.
For example: greedy/stochastic hill climbing, simulated annealing…
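What "decomposable" means can be written out as follows. This is a sketch that assumes the structure prior P(G) also factorizes over variables; the symbols s and Pa_G are notation introduced here and do not appear on the slide.

```latex
S(G : D) = \log P(D \mid G) + \log P(G) + C
         = \sum_{i=1}^{n} s\bigl(X_i,\ \mathrm{Pa}_G(X_i) : D\bigr) + C
```

Each term depends only on one variable and its parents in G. A local search step that adds, removes, or reverses a single edge therefore changes at most two terms, so a hill-climbing search can re-score a candidate graph cheaply.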
40
Biological (causal)
interpretation


Edges: the parents of a variable are its immediate
causes (the parent of a node is a transcription factor
for this gene).
A causal network models the effects of
interventions: if X causes Y, then manipulating the
value of X affects the value of Y, but not the other
way around (if we knock out gene X, this will
affect the expression of gene Y, but a knockout of
gene Y has no effect on the expression of gene X).
41
Analyzing Expression Data



Random variables denote the expression levels of
individual genes.
In addition, we can include random variables that
denote other attributes that affect the system
(experimental conditions, temporal indicators…).
We want to learn such a network from the available data and
use it to answer questions about the system.
42
Find high-scoring networks


The data are not informative enough to determine which
single model is the right one.
We therefore focus on features that are common to most of the
possible models:
- Markov relation: indicates that two genes are related in some joint
biological interaction or process (there is either an edge between
them, or both are parents of another variable (Pearl 1988)).
- Order relation: X is an ancestor of Y in all the networks of a given
equivalence class (the given PDAG contains a directed path from X
to Y).
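Both features are simple graph queries. The sketch below evaluates them on a single DAG given as a list of directed edges, which is a simplification: the slide defines the order relation on a PDAG, where some edges are undirected. The function names and the example graph are hypothetical.

```python
def has_markov_relation(edges, x, y):
    """Edge between x and y, or x and y are both parents of some variable."""
    edge_set = set(edges)
    if (x, y) in edge_set or (y, x) in edge_set:
        return True
    children_x = {dst for src, dst in edge_set if src == x}
    children_y = {dst for src, dst in edge_set if src == y}
    return bool(children_x & children_y)


def is_ancestor(edges, x, y):
    """True if there is a directed path from x to y."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, set()).add(dst)
    stack, seen = [x], set()
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child == y:
                return True
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return False


# Hypothetical network: A -> B -> C <- D
edges = [("A", "B"), ("B", "C"), ("D", "C")]
print(has_markov_relation(edges, "B", "D"), is_ancestor(edges, "A", "C"))
```

In this example B and D have no edge between them but share the child C, so they are in a Markov relation, while A precedes C through the path A → B → C.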
43
How can we estimate a measure
of confidence in the features?

bootstrap method (Efron & Tibshirani 1993)


A method to enlarge our data set by generating
“perturbed” versions of our original data set. In this way
we collect many networks, all of which are fairly
reasonable models of the data.
For each feature f of interest, calculate:
$\mathrm{conf}(f) = \frac{1}{m} \sum_{i=1}^{m} f(G_i)$
where f(G) is 1 if f is a feature in G, and 0 otherwise.
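The procedure can be sketched as follows. The structure learner is deliberately left as a caller-supplied function, `learn_network`, which is a hypothetical placeholder: this sketch only shows the resampling and the averaging, not how a network is learned. Graphs are represented as sets of directed edges, another assumption made here.

```python
import random


def bootstrap_sample(samples, rng):
    """Resample the data set with replacement, keeping its size."""
    return [rng.choice(samples) for _ in samples]


def feature_confidence(graphs, feature):
    """conf(f): fraction of the networks in which the feature holds."""
    return sum(1 for g in graphs if feature(g)) / len(graphs)


def bootstrap_confidence(samples, learn_network, feature, m=200, seed=0):
    """Learn one network per perturbed data set, then average the feature."""
    rng = random.Random(seed)
    graphs = [learn_network(bootstrap_sample(samples, rng)) for _ in range(m)]
    return feature_confidence(graphs, feature)


# Hypothetical outcome of four bootstrap runs, each a set of directed edges.
graphs = [
    {("A", "B")},
    {("A", "B"), ("B", "C")},
    {("B", "C")},
    {("A", "B")},
]
print(feature_confidence(graphs, lambda g: ("A", "B") in g))  # 0.75
```

Features with confidence near 1 are those the data support regardless of which particular high-scoring network is chosen.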
44
Local Probability Models
In order to specify a Bayesian network model, we
still need to choose the type of the local probability
models we learn. In the current work, we consider
two approaches:


Multinomial model (discretizing to -1, 0, 1).
Linear Gaussian model.
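For the multinomial model, the discretization step might look like the sketch below. The three states follow the slide (-1 under-expressed, 0 unchanged, 1 over-expressed). The threshold of 0.5 on log ratios and the function name are assumptions of this sketch.

```python
import numpy as np


def discretize(log_ratios, threshold=0.5):
    """Map log expression ratios to -1, 0, or 1 using a symmetric threshold."""
    log_ratios = np.asarray(log_ratios, dtype=float)
    states = np.zeros(log_ratios.shape, dtype=int)
    states[log_ratios > threshold] = 1
    states[log_ratios < -threshold] = -1
    return states


print(discretize([-1.2, -0.1, 0.3, 0.9]).tolist())  # [-1, 0, 0, 1]
```

Discretization discards magnitude information but lets the multinomial model capture arbitrary (including non-linear) dependencies, whereas the linear Gaussian model keeps the magnitudes and only captures linear ones.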
45
Robustness analysis
46
Multinomial versus
Gaussian
The two methods highlight different types of connections
between genes.
47
Biological Analysis



Order relations reveal the existence of dominant
genes. Out of all 800 genes, only a few seem to
dominate the order (i.e., appear before many
genes).
Top Markov relations reveal genes that are, for the
most part, functionally related.
Nice presentation:
http://www.cs.huji.ac.il/~nirf/GeneExpression/top800/
48
An example of the graphical
display of Markov features
This suits biological knowledge!
49
Application 2 summary

Using Bayesian networks to model a genetic
network:
- Involves thousands of genes, while current data sets contain a
few dozen samples. This raises problems in computational
complexity and the statistical significance of the results.
- Genetic regulation networks are sparse (a gene is assumed to have
no more than a few dozen genes directly affecting its
transcription). Bayesian networks are especially suited for
learning in such sparse domains.
- Did not use any (biological) prior knowledge.
- This theory can provide tools for experimental design.
50
Dynamic Bayesian Networks




DBNs are an extension of Bayesian networks, which
have been successfully applied to model expression
data (Pe’er et al., 2001).
The main advantage is that, unlike BNs, DBNs allow
for cycles, which are common in biological systems.
In addition, DBNs can also improve our ability to
learn causal relationships by relying on the
temporal nature of the data.
DBNs seem like a promising direction for modeling
temporal systems, and recently a number of papers
have discussed this model.
51
Application 3 [Holter 2000]
Fundamental patterns
underlying gene
expression


Algebraic approach.
Using SVD to model gene expression.
52
Singular Value
Decomposition


A standard and straightforward analytic
procedure which finds eigenvectors, or
fundamental patterns of expression with time,
of the array matrices.
The SVD theorem states that the matrix A
can be written as:
$A = U S V^T$
53
SVD theorem


U and V are orthogonal.
The elements of S are all zero except for the diagonal entries $S_{ii}$, which are the
singular values (square roots of the eigenvalues of $A^T A$).
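These properties are easy to check numerically. The sketch below uses NumPy's "thin" SVD on a random matrix standing in for an expression matrix; the 6 × 4 shape and the data are hypothetical. With the thin form, U is not square, so its columns are orthonormal rather than U being a full orthogonal matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))  # hypothetical data: 6 genes x 4 time points

# Thin SVD: U is 6x4, s holds the 4 singular values, Vt is 4x4.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A = U S V^T
reconstruction_ok = np.allclose(A, U @ np.diag(s) @ Vt)

# Orthonormal columns of U and orthogonal V.
columns_orthonormal = (np.allclose(U.T @ U, np.eye(4))
                       and np.allclose(Vt @ Vt.T, np.eye(4)))

# Squared singular values equal the eigenvalues of A^T A.
eigenvalues = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
singular_values_match = np.allclose(s**2, eigenvalues)

print(reconstruction_ok, columns_orthonormal, singular_values_match)
```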
54
Characteristic modes



We define the vectors $X_i$ to be the first r rows of
the matrix $S V^T$.
Those r vectors are the characteristic modes
associated with the matrix A.
The temporal variation of any gene j can be
written as a linear combination of these vectors:
$a_j = \sum_{i=1}^{r} U_{ji} X_i$ (with $a_j$ the row of A for gene j)
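A minimal numerical sketch of the modes and of the linear combination follows, assuming genes are the rows of A and time points the columns. The data are synthetic and hypothetical: five genes built from two underlying temporal patterns. The function names are introduced here.

```python
import numpy as np


def characteristic_modes(A):
    """Return U and the modes, i.e. the rows of S V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    modes = np.diag(s) @ Vt  # row i is mode X_i
    return U, modes


def reconstruct(U, modes, n_modes):
    """Approximate every gene profile using only the first n_modes modes."""
    return U[:, :n_modes] @ modes[:n_modes, :]


# Hypothetical data: 5 genes x 4 time points, mixing two patterns.
t = np.linspace(0.0, 1.0, 4)
patterns = np.vstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
weights = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0],
                    [-1.0, 0.2], [0.3, -0.7]])
A = weights @ patterns

U, modes = characteristic_modes(A)
exact = np.allclose(A, reconstruct(U, modes, modes.shape[0]))
two_modes_suffice = np.allclose(A, reconstruct(U, modes, 2))
print(exact, two_modes_suffice)
```

Row j of U holds the gene-specific coefficients of gene j. Because the synthetic data contain only two independent patterns, two modes already reproduce every profile.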
55
Results


The first two singular values were significantly greater
than the others for all three data sets, but the
same is not true in a control calculation on
random data.
Only the first few modes are required to capture
the essential features of the expression data in
most cases (the modes reflect the genome-wide expression pattern and are not gene-specific).
56
gene expression and random
data sets
57
Characteristic modes for the
gene expression and random
data sets

The magnitude of the singular value is reflected in the amplitude of each mode.
58
A reconstruction of the
expression profiles
59
Analysis 1

A type of ‘‘spectral’’ analysis: a gene expression
profile can be precisely represented by
specifying the magnitude and sign of the
contribution of each of its characteristic modes.

This suggests that at a gross level, most time-dependent expression patterns are very simple.

Data from SVD agree with previous knowledge
of expression patterns.
60
Plot of the coefficients
Symbols of different colors and shapes are used for genes that belong to the
different clusters.
61
Analysis 2



The data points (which are not random) are
concentrated near the perimeter of a circle or
an ellipse, with the interior rather sparsely
populated.
Expression profiles clustered by more
conventional methods correspond well to
groups of genes with similar coefficients.
Despite the evolutionary distance between
yeast and humans, the observed behavior is
both simple and similar.
62
Application 3 summary




SVD has uncovered underlying patterns or
‘‘characteristic modes’’ in gene temporal profiles.
The expression pattern of any particular gene can be
represented precisely by a linear combination of the
modes with gene-specific coefficients.
A good approximation of the exact pattern can be
obtained by using just a few of the modes,
underscoring the simplicity of the gene expression
patterns.
This paradigm may find expression patterns that
would not be detected using other methods.
63
Application 3b [Holter et al
2001]
Dynamic modeling



In the previous application we treated the gene
expression pattern as a ‘‘static’’ image and derived the
underlying genome-wide characteristic modes of which
it is composed.
Now we carry out a dynamical analysis, exploring the
possible causal relationships among the genes by
deducing a time translation matrix for the
characteristic modes defined by SVD.
This matrix predicts future expression levels of genes
based on their expression levels at some initial time.
64
How to deduce a
time translation matrix?



To uniquely and unambiguously determine the $g^2$
elements of the matrix, one needs a set of $g^2$
linearly independent equations.
D’haeseleer [1999] used a nonlinear interpolation
scheme to guess the shapes of gene expression
profiles between the measured time points
(speculative).
Van Someren [2000] chose to cluster the genes and
study the interrelationships between the clusters
(based on profile similarity).
65
Deduce a time translation
matrix by using SVD



The SVD construction gives a linear
combination of modes which exactly describes the
expression pattern of each gene.
The modes form a linearly independent basis
set.
The problem is mathematically well defined and
tractable if one considers the causal
relationships among the modes.
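One way to see why the problem becomes tractable for modes: with r modes there are only r × r unknowns, and the measured time points supply the equations. The sketch below fits a matrix M with Y(t+1) = M · Y(t) by ordinary least squares. It assumes equally spaced time points and a linear model, uses synthetic two-mode data, and is an illustration rather than the fitting procedure of the paper.

```python
import numpy as np


def fit_time_translation(Y):
    """Y has shape (r, T): r mode amplitudes at T equally spaced time points."""
    prev, nxt = Y[:, :-1], Y[:, 1:]
    # Least-squares solution of prev.T @ M.T = nxt.T, i.e. M @ prev = nxt.
    M_transposed, *_ = np.linalg.lstsq(prev.T, nxt.T, rcond=None)
    return M_transposed.T


# Hypothetical two-mode system: a slowly decaying rotation.
angle = 0.4
M_true = 0.95 * np.array([[np.cos(angle), -np.sin(angle)],
                          [np.sin(angle), np.cos(angle)]])
Y = np.empty((2, 8))
Y[:, 0] = [1.0, 0.0]
for k in range(1, 8):
    Y[:, k] = M_true @ Y[:, k - 1]

M_est = fit_time_translation(Y)
print(np.allclose(M_est, M_true))
```

For g genes the same fit would need on the order of g time points, which is far more than typical experiments provide; restricting the model to a few modes is what makes the system determined.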
66
Analysis


The causal links between the modes, and
thence the genes, involve just a few essential
connections. Any additional connections
among the genes must therefore provide
redundancy in the network.
An important corollary is that it may be
impossible to determine detailed connectivities
among genes with just the micro-array data,
because the number of genes greatly exceeds
the number of contributing modes.
67
Application 3b summary



A model in which the expression levels of the genes at
a given time are linear combinations of their levels at
a previous time.
Temporal evolution of the gene expression profiles can
be described by using a ‘‘time translation’’ matrix,
which reflects the magnitude of the connectivities
between genes.
Because there are only a few essential connections
among modes and therefore among genes, additional
links provide redundancy in the network.
68
References
Yaacov Lindzen’s presentation “Introduction to Micro-arrays”
“Genetic network analysis in light of massively parallel
biological data acquisition”. Szallasi, 1999, PSB
“Exploring the metabolic and genetic control of gene
expression on a genomic scale”. DeRisi et al, 1997, Science
“Using Bayesian networks to analyze expression data”.
Friedman et al, 2000.
“Fundamental patterns underlying gene expression profiles:
Simplicity from complexity”. Holter et al, 2000, Genetics
“Dynamic modeling of gene expression data”. Holter et al,
2001, Genetics
“Analyzing time series gene expression data”. Bar Joseph,
2004, Bioinformatics
69