Chromatin modification-aware network model
Based on ARACNe
Cho, Young Mi
Department of Biological Science, KAIST
Abstract
DNA microarray experiments can measure the mRNA expression level of all the
genes of an organism, providing a “genomic” viewpoint on gene expression. However,
gene expression is controlled not only by genetic mechanisms but also by epigenetic
regulatory mechanisms, and the epigenetic part plays a significant role in gene
expression regulation.
In this study, we develop a “Chromatin modification-aware network model” that
integrates epigenetic regulatory mechanisms with a well-established gene regulatory
network algorithm, ARACNe. Using ChIP-chip data and the histone modification
patterns of genes, we supply prior information about edges and nodes to build an
epigenetic regulatory network with the ARACNe algorithm.
1. Introduction
Inferring gene regulatory networks is one of the main goals of functional genomics.
The development of microarray technology has provided large-scale gene expression
profile data. DNA microarray experiments can measure the mRNA expression level of
all the genes of an organism, providing a “genomic” viewpoint on gene expression.
But the organization of gene-expression profile data into functionally meaningful
genetic information has proven difficult and so far has fallen short of uncovering the
intricate structure of cellular interactions. Several methods are available for this
challenge, called network reverse engineering or deconvolution:
optimization methods, which maximize a scoring function over alternative network
models such as Bayesian networks and Boolean networks (ref 2, 3); regression techniques, which
fit the data to a priori models; integrative bioinformatics approaches, which combine
data from a number of independent experimental clues; and statistical methods, which
rely on a variety of measures of pairwise gene-expression correlation.
Recently, an “epigenetic” viewpoint on gene expression has been increasingly
emphasized. Epigenetics is the study of epigenetic inheritance, a set of reversible
heritable changes in gene function or other cell phenotypes that occur without a
change in DNA sequence (genotype). It has been understood for some time that many
diseased cells, particularly those in cancer tumors, have altered epigenetic patterns.
Only more recently, however, has the role of epigenetic modification
mechanisms begun to be appreciated as a new way of attacking cancer.
The control of gene expression results not only from genetic regulation but also from
epigenetic regulation and post-transcriptional regulation. However, previous work on gene
regulatory networks has ignored the epigenetic part of gene expression and has mainly
been concerned with the genetic mechanism.
In this work, we develop a Chromatin modification-aware network model that
integrates epigenetic regulatory mechanisms with a well-established gene
regulatory network algorithm, ARACNe. ARACNe (an Algorithm for the Reconstruction of
Accurate Cellular Networks) (ref 1) was designed for the
genome-wide reverse engineering of large-scale cellular networks. ARACNe was
comparable to Bayesian networks in sensitivity and largely superior in precision (ref 2).
To capture the epigenetic state of gene regulation, we introduce ChIP-chip data and
the histone modification pattern of each gene's regulatory region. ChIP-chip data provide
prior information about the edges of the regulatory network, while the histone
modification pattern of each gene's regulatory region provides prior information
about the nodes of the network.
With these two kinds of prior information, we build a scoring system that determines
the edges and nodes of the network more precisely.
2. Background
2.1 Theoretical Background of ARACNe (ref 1)
Temporal gene expression data is difficult to obtain for higher eukaryotes, and
cellular populations harvested from different individuals generally capture random
steady states of the underlying biochemical dynamics. Therefore only steady-state
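statistical dependences can be studied. The joint probability distribution (JPD) of the
stationary expressions of all genes, $P(\{q_i\})$, $i = 1, \ldots, N$, can be written as:
$$P(\{q_i\}) = \frac{1}{Z} \exp\Big[ -\sum_{i}^{N} \phi_i(q_i) - \sum_{i,j}^{N} \phi_{ij}(q_i, q_j) - \sum_{i,j,k}^{N} \phi_{ijk}(q_i, q_j, q_k) - \cdots \Big] \equiv \frac{1}{Z} e^{-H(\{q_i\})} \qquad (1)$$
where N is the number of genes, Z is the normalization factor, also called the partition
function, $\phi_{\ldots}$ are potentials, and $H(\{q_i\})$ is the Hamiltonian that defines the system’s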
statistics. Within this model, a set of variables interacts if and only if the single
potential that depends exclusively on these variables is nonzero. ARACNe aims
precisely at identifying which of these potentials are nonzero, and eliminating the
others even though their corresponding marginal JPDs may not factorize.
2.2 Approximations of the interaction structure
Since typical microarray sample sizes are relatively small, inferring the exponential
number of potential n-way interactions of Eq. (1) is infeasible, and a set of simplifying
assumptions must be made about the dependency structure. The simplest model is one
where genes are assumed independent, i.e., $H(\{q_i\}) = \sum_i \phi_i(q_i)$, such that first-order
potentials can be evaluated from the marginal probabilities, $P(q_i)$, which are estimated
from experimental observations. A sample size of M > 100 is generally sufficient to estimate 2-way
marginals in genomics problems, while $P(q_i, q_j, q_k)$ requires about an order of
magnitude more samples. Within the approximation
$H(\{q_i\}) = \sum_i \phi_i(q_i) + \sum_{i,j} \phi_{ij}(q_i, q_j)$, all genes
for which $\phi_{ij} = 0$ are declared mutually non-interacting. This includes genes that are
statistically independent (i.e., $P(q_i, q_j) \approx P(q_i) P(q_j)$) as well as genes that do not interact
directly but are statistically dependent due to their interaction via other genes (i.e.,
$\phi_{ij} = 0$, but $P(q_i, q_j) \neq P(q_i) P(q_j)$).
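The last case, a zero pairwise potential together with a statistically dependent pair, can be illustrated with a toy system. The sketch below is only an illustration under stated assumptions: three binary genes A-B-C coupled in a chain, with hypothetical coupling values that do not come from any data set. It enumerates all states, builds the JPD from a pairwise Hamiltonian, and shows that A and C are dependent although they share no potential.

```python
import itertools
import math

# Hypothetical pairwise potentials for three binary genes A(0)-B(1)-C(2).
# Only A-B and B-C are coupled; the A-C potential is zero by construction.
COUPLING = {(0, 1): -1.0, (1, 2): -1.0}  # negative potential favours equal states

def hamiltonian(state):
    """Pairwise Hamiltonian: add phi_ij whenever a coupled pair is in the same state."""
    return sum(c for (i, j), c in COUPLING.items() if state[i] == state[j])

def joint_distribution(n_genes=3):
    """Return {state: P(state)} with P = exp(-H) / Z."""
    states = list(itertools.product((0, 1), repeat=n_genes))
    weights = [math.exp(-hamiltonian(s)) for s in states]
    z = sum(weights)  # partition function
    return {s: w / z for s, w in zip(states, weights)}

def marginal(jpd, genes):
    """Marginalize the JPD onto the given tuple of gene indices."""
    out = {}
    for state, p in jpd.items():
        key = tuple(state[g] for g in genes)
        out[key] = out.get(key, 0.0) + p
    return out

jpd = joint_distribution()
p_ac = marginal(jpd, (0, 2))
p_a = marginal(jpd, (0,))
p_c = marginal(jpd, (2,))

# A nonzero gap means P(A, C) != P(A) P(C) even though phi_AC = 0.
dependence_gap = p_ac[(1, 1)] - p_a[(1,)] * p_c[(1,)]
print(dependence_gap)
```

The positive gap arises purely through the intermediate gene B, which is the kind of indirect dependence that a later pruning step has to remove.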
3. Algorithm
In the ARACNe algorithm, under the assumption of a two-way network, all statistical
dependencies can be inferred from pairwise marginals, and no higher-order analysis is
needed. The first step is identifying candidate interactions by estimating the pairwise
mutual information (MI) of gene expression profiles. The MIs are then filtered using an appropriate threshold,
computed for a specific p-value under the null hypothesis of two independent genes.
[Figure 1]
We additionally develop a method that introduces epigenetic regulatory
mechanisms into gene regulatory network inference. To capture the epigenetic state of
gene regulation, we introduce ChIP-chip data and the histone modification pattern of the
regulatory region of each gene. ChIP-chip data provide prior information about the edges
of the regulatory network, while the histone modification pattern of the
regulatory region of each gene provides information about the nodes of the
network.
With these two kinds of prior information, we build a scoring system that determines
the edges and nodes of the network more precisely. ChIP-chip, a tool for genome-scale mapping of
in vivo protein-DNA interactions, allows global views of transcription factor binding.
It provides physical interaction data for pairs of nodes, and this location data
is used as prior information for determining edges. When ChIP-chip indicates that the gene
product of A binds the promoter of B, the edge
between A and B is strengthened. Conversely, if there is no binding
between two genes, the edge between them is weakened.
In addition, the histone modification patterns of a gene region give the epigenetic
state of the corresponding node in the network. Histone modifications such as H3K9Ac,
H3K14Ac and H3K4 tri-Me are highly associated with transcription level. The histone
modification pattern of each node can be classified into one of two states (ref 5):
“active” or “inactive”.
Histone modification data is used as prior information about the nodes of
the network. If the histone modification pattern marks a node as “active”, all the
edges connected to this node are strengthened; if it marks a node as “inactive”,
all the edges connected to this node are weakened.
The strengthened or weakened mutual information of each gene pair is then passed
to the edge-determining step, the data processing inequality (DPI). The result is a
regulatory network that takes the epigenetic state into account.
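The strengthening and weakening rules can be sketched as a simple multiplicative reweighting of the pairwise MI values before the DPI step. This is a minimal sketch, not the scoring system itself: the function name `adjust_mi`, the data layout, and all four multipliers are hypothetical, since the text does not specify how much an edge is strengthened or weakened. Genes without histone data are left unweighted, which is also an assumption.

```python
def adjust_mi(mi, tf_binds, active, edge_boost=1.5, edge_penalty=0.5,
              node_boost=1.2, node_penalty=0.8):
    """Reweight pairwise MI with ChIP-chip (edge) and histone (node) priors.

    mi       : dict {(gene_a, gene_b): mutual information}
    tf_binds : set of (regulator, target) pairs supported by ChIP-chip
    active   : dict {gene: True for "active", False for "inactive"}
    All multipliers are hypothetical placeholders.
    """
    adjusted = {}
    for (a, b), value in mi.items():
        # Edge prior: binding in either direction strengthens the edge,
        # absence of binding weakens it.
        bound = (a, b) in tf_binds or (b, a) in tf_binds
        weight = edge_boost if bound else edge_penalty
        # Node prior: every edge touching an active node is strengthened,
        # every edge touching an inactive node is weakened.
        for gene in (a, b):
            state = active.get(gene)  # None means no histone data (assumed neutral)
            if state is True:
                weight *= node_boost
            elif state is False:
                weight *= node_penalty
        adjusted[(a, b)] = value * weight
    return adjusted

mi = {("TF1", "G1"): 0.4, ("G1", "G2"): 0.4}
tf_binds = {("TF1", "G1")}
active = {"TF1": True, "G1": True, "G2": False}
adjusted = adjust_mi(mi, tf_binds, active)
print(adjusted)
```

In this toy input both edges start with the same MI, and the priors alone separate the ChIP-chip-supported edge between two active genes from the unsupported edge that touches an inactive gene.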
3.1 Mutual Information
Mutual information for a pair of discrete random variables, x and y, is defined as
I(x,y) = S(x) + S(y) - S(x,y), where S(t) is the entropy of an arbitrary variable t. Entropy
for a discrete variable is defined as the average of the log probability of its states:
$$S(t) = -\langle \log p(t_i) \rangle = -\sum_i p(t_i) \log p(t_i)$$
where p(ti) = Pr(t = ti) is the probability associated with each discrete state or value of
the variable. If the variable is continuous, the entropy is replaced by the differential
entropy, which has the same definition as S(t) in the preceding equation but where the
summation is replaced by an integral and the discrete distribution is replaced by a
probability density. To estimate the entropy, we use the property that mutual information is
invariant under any invertible reparameterization of either x or y. This can be
expressed as I(x' = f1(x), y' = f2(y)) = I(x,y), with both f1 and f2 being invertible. Here, we
reparameterize the data using a rank transformation that projects the Nm
measurements for each gene onto equally spaced real numbers in the interval [0,1],
preserving their original order. This transformation is also called the copula transformation, and it has the
advantage of transforming the probability density of the individual variables into a
constant, p(x') = p(y') = 1. Under this transformation, both S(x') and S(y') become
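constant and equal to zero. As a result, only S(x',y') must be estimated. For the
synthetic analysis, this is done using a Gaussian kernel estimator, where
$$I(\{x_i\}, \{y_i\}) = \frac{1}{N_m} \sum_i \log \frac{p(x_i, y_i)}{p(x_i)\, p(y_i)}$$
Here, p(xi), p(yi) and p(xi,yi) are defined as
$$p(x_i) = \frac{1}{\sqrt{2\pi}\, N_m d_1} \sum_j \exp\left[ -\frac{(x_i - x_j)^2}{2 d_1^2} \right], \qquad p(x_i, y_i) = \frac{1}{2\pi N_m d_2^2} \sum_j \exp\left[ -\frac{(x_i - x_j)^2 + (y_i - y_j)^2}{2 d_2^2} \right]$$
with p(yi) defined analogously to p(xi).
The optimal values of the smoothing parameters d1 and d2 are obtained from Monte
Carlo simulations, using a wide range of bivariate normal probability densities. For the
large set of cell expression profiles, a slightly less accurate but much more
computationally efficient approximation is preferable.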
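The rank (copula) transformation and a Gaussian kernel MI estimate can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the default widths `d1` and `d2` are arbitrary placeholders and not the Monte Carlo-tuned values, and the quadratic loop is only suitable for small sample sizes.

```python
import math
import random

def rank_transform(values):
    """Copula transform: map values to equally spaced ranks in [0, 1]."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank / (len(values) - 1)
    return ranks

def kernel_mi(x, y, d1=0.1, d2=0.1):
    """Gaussian-kernel MI estimate (in nats) on rank-transformed data.

    d1 and d2 are the 1-D and 2-D smoothing widths (hypothetical defaults).
    """
    xs, ys = rank_transform(x), rank_transform(y)
    n = len(xs)
    total = 0.0
    for i in range(n):
        px = py = pxy = 0.0
        for j in range(n):
            dx, dy = xs[i] - xs[j], ys[i] - ys[j]
            px += math.exp(-dx * dx / (2 * d1 * d1))
            py += math.exp(-dy * dy / (2 * d1 * d1))
            pxy += math.exp(-(dx * dx + dy * dy) / (2 * d2 * d2))
        # Normalize the kernel sums into density estimates.
        px /= n * math.sqrt(2 * math.pi) * d1
        py /= n * math.sqrt(2 * math.pi) * d1
        pxy /= n * 2 * math.pi * d2 * d2
        total += math.log(pxy / (px * py))
    return total / n

# Synthetic check: a strongly dependent pair versus an independent pair.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y_dependent = [xi + rng.gauss(0, 0.1) for xi in x]
y_independent = [rng.gauss(0, 1) for _ in range(200)]
mi_dependent = kernel_mi(x, y_dependent)
mi_independent = kernel_mi(x, y_independent)
print(mi_dependent, mi_independent)
```

Because the estimate is computed on ranks, applying any monotonically increasing function to x or y leaves the result unchanged, which is the invariance property the transformation relies on.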
3.2 Statistical threshold for mutual information.
For each value of Nm in the synthetic data analysis, the P value associated with a
given value of mutual information under the null hypothesis is obtained by Monte Carlo
simulation using 10,000 iterations. The null hypothesis corresponds to pairs of nodes
that are disconnected from the network and from each other. These follow a random-walk
dynamic in the range [1,100], with a noise term drawn from a uniform
probability density over the interval [-10,10]. For the analysis of real expression data, the P value should also be
computed by Monte Carlo simulation. Because a null-hypothesis dynamical model is
not available in that case, the null is defined as a pair of existing genes whose values are randomly
shuffled at each iteration with respect to the microarray profile in which they were
observed.
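The shuffling-based null described above can be sketched as a permutation test. The following is a minimal sketch with several assumptions: it swaps in a simple equal-frequency binned MI estimator (`binned_mi`, a hypothetical stand-in, not the kernel estimator), it uses far fewer iterations than the 10,000 mentioned in the text to keep the example fast, and the function names and defaults are illustrative.

```python
import math
import random
from collections import Counter

def binned_mi(x, y, bins=4):
    """Plug-in MI estimate (nats) from equal-frequency bins (stand-in estimator)."""
    def to_bins(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        labels = [0] * len(values)
        for rank, idx in enumerate(order):
            labels[idx] = rank * bins // len(values)
        return labels
    bx, by = to_bins(x), to_bins(y)
    n = len(x)
    cx, cy, cxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum((c / n) * math.log(c * n / (cx[a] * cy[b]))
               for (a, b), c in cxy.items())

def mi_threshold(x, y, p_value=0.01, iterations=500, seed=0):
    """MI value exceeded by roughly a fraction p_value of shuffled (null) pairs."""
    rng = random.Random(seed)
    shuffled = list(y)
    null = []
    for _ in range(iterations):
        rng.shuffle(shuffled)  # break the pairing between the two profiles
        null.append(binned_mi(x, shuffled))
    null.sort()
    return null[min(iterations - 1, int((1 - p_value) * iterations))]

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(100)]
y = [xi + rng.gauss(0, 0.5) for xi in x]
observed = binned_mi(x, y)
threshold = mi_threshold(x, y)
print(observed, threshold)
```

An edge would be kept as a candidate interaction only when its observed MI exceeds the threshold for the chosen p-value.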
3.3 Data Processing Inequality
First define two genes, x and y, as indirectly interacting through a third gene, z, if the
conditional MI I(x,y|z) is equal to zero. The two genes are directly interacting if no such
third gene exists, implying that there is direct transfer of information between them.
The DPI asserts that if both (x,y) and (y,z) are directly interacting, and (x,z) are
indirectly interacting through y, then I(x,z) ≤ I(x,y) and I(x,z) ≤ I(y,z). This inequality is
not symmetric, meaning that there may be situations where the triangle inequality is
satisfied but x and z may be directly interacting. As a result, by applying the DPI to
discard indirect interactions (i.e., (x,z) relationships for which the inequality is satisfied),
we may be discarding some direct interactions as well. These are of two kinds: (i) cyclic
or acyclic loops with exactly three genes and (ii) sets of three genes whose information
exchange is not completely captured by the pairwise marginals. A typical example of
the latter would be the Boolean operator XOR, for which the mutual information
between any subpair of the three variables is zero. A percent tolerance is applied to the DPI to
account for inaccurate estimates of the difference between two close mutual
information values. This is implemented by rewriting the DPI using a percent
tolerance threshold ε: I(x,z) ≤ I(x,y)[1 - ε] and I(x,z) ≤ I(y,z)[1 - ε]. This has the advantage of
avoiding rejection of some borderline edges, which allows some loops of size three to
occur in the predicted topology.
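The DPI pruning step with tolerance can be sketched as follows. This is a minimal sketch, assuming the MI values have already passed the statistical threshold; the function name `apply_dpi`, the frozenset-keyed edge dictionary, and the default tolerance of 0.15 are hypothetical choices for illustration. Edges are only marked during the scan and removed at the end, so every triplet is evaluated against the full thresholded network.

```python
from itertools import combinations

def apply_dpi(mi, tolerance=0.15):
    """Remove the weakest edge of each fully connected gene triplet.

    mi        : dict {frozenset({a, b}): mutual information}
    tolerance : epsilon in I(x,z) <= I(x,y)[1 - eps] and I(x,z) <= I(y,z)[1 - eps]
    """
    genes = sorted({g for edge in mi for g in edge})
    to_remove = set()
    for a, b, c in combinations(genes, 3):
        edges = [frozenset(p) for p in ((a, b), (b, c), (a, c))]
        if not all(e in mi for e in edges):
            continue  # the DPI only applies when all three edges are present
        weakest = min(edges, key=lambda e: mi[e])
        others = [e for e in edges if e != weakest]
        if all(mi[weakest] <= mi[o] * (1 - tolerance) for o in others):
            to_remove.add(weakest)
    return {e: v for e, v in mi.items() if e not in to_remove}

# Clear indirect edge: x-z is much weaker than x-y and y-z, so it is removed.
clear = {frozenset(("x", "y")): 0.8,
         frozenset(("y", "z")): 0.7,
         frozenset(("x", "z")): 0.3}
pruned_clear = apply_dpi(clear)

# Borderline edge: x-z is within the tolerance of y-z, so the loop is kept.
borderline = {frozenset(("x", "y")): 0.8,
              frozenset(("y", "z")): 0.7,
              frozenset(("x", "z")): 0.65}
pruned_borderline = apply_dpi(borderline)
print(len(pruned_clear), len(pruned_borderline))
```

Setting the tolerance to zero recovers the strict inequality, in which case the borderline loop above would also lose its weakest edge.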
References
1. Margolin, A. A. et al., 2004, ARACNe: An Algorithm for the Reconstruction of
Gene Regulatory Networks in a Mammalian Cellular Context
2. Basso, K. et al., 2005, Reverse engineering of regulatory networks in human B
cells, Nature Genetics 37, 382-388
3. Bulashevska, S., Eils, R., 2005, Inferring genetic regulatory logic from
expression data, Bioinformatics 21, 2706-2713
4. Friedman, N., Linial, M., Nachman, I. & Pe’er, D., 2000, Using Bayesian
networks to analyze expression data, J. Comput. Biol. 7, 601-620
5. Liu, C. L., Kaplan, T., Kim, M., Buratowski, S., Schreiber, S. L.,
Friedman, N. & Rando, O. J., 2005, Single-Nucleosome
Mapping of Histone Modifications in S. cerevisiae, PLoS Biol. 3(10), e328
6. Sunjae Lee’s lab seminar presentation.