Supplemental Appendix A: ClueGene Algorithm and Time Complexity Analysis
1 ClueGene Scoring Function
ClueGene is given a set of genes called the query, Q, that are thought to be functionally
related. It scores each gene g in the genome G based on how often the gene appears in
clusters with the query genes. We define a function that assigns higher scores to genes that
appear in clusters containing a high proportion of query genes.
Let D be a set of clustering solutions where each element of D is a set of clusters. Define
$N_{gd}$ to be the number of clusters in dataset d that contain g and at least one gene from Q.
The co-clustering score C(g) of gene g ∈ G is:

$$C(g) = \sum_{d \in D} \frac{1}{N_{gd}} \sum_{c \in d} \frac{|Q \cap c|}{|Q \cup c|} \, I(g \in c) \qquad (1)$$

where I is the indicator function that returns 1 if its argument is true and 0 otherwise.
The intuition underlying the choice of scoring function is to identify genes that occur in
small and specific clusters with the query genes. If g belongs to a large cluster that also
happens to have several of the query genes, this observation is down-weighted because the
co-occurrence of gene g with the query may arise by chance if the cluster is large enough.
On the other hand, if g belongs to a small cluster that also contains several of the query
genes, this observation receives a high weight because the co-occurrence is less likely to be
serendipitous.
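To make the down-weighting concrete, the following minimal sketch evaluates the per-cluster weight |Q ∩ c| / |Q ∪ c| from Equation (1) for a small and a large cluster that contain the same two query genes. The function name `cluster_weight` and all gene names are hypothetical and chosen only for illustration.

```python
def cluster_weight(query, cluster):
    """Weight |Q ∩ c| / |Q ∪ c| that one cluster contributes in Equation (1)."""
    query, cluster = set(query), set(cluster)
    return len(query & cluster) / len(query | cluster)


# Hypothetical gene names, for illustration only.
query = {"q1", "q2", "q3"}

# Small cluster: 3 genes, 2 of them from the query.
small_cluster = {"q1", "q2", "g"}

# Large cluster: 50 genes, the same 2 query genes plus 47 unrelated genes.
large_cluster = {"q1", "q2", "g"} | {f"x{i}" for i in range(47)}

small_weight = cluster_weight(query, small_cluster)  # 2 / 4
large_weight = cluster_weight(query, large_cluster)  # 2 / 51

print(small_weight, large_weight)
```

Gene g co-occurs with two query genes in both clusters, but the small cluster contributes a weight of 2/4 while the large one contributes only 2/51.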
Dividing by $N_{gd}$ corrects for the number of clusters a gene appears in. Without this
correction, high scores could be assigned to genes that are “central” in the coexpression
network simply because they appear in several clusters. Note that one might also consider
including an additional normalization term, $M_g$, which is the number of datasets in which g
appears. Dividing by $M_g$ would allow genes with highly different amounts of missing data
to be directly compared, since C(g) would then reflect an average co-clustering index per
dataset. In our case, we found that dividing by $M_g$ had little effect on the search results,
because the yeast expression database contains very little missing data:
for every dataset, nearly all of the genes in the yeast genome are spotted on the microarray(s)
used. However, if applied to other species in which more missing data is expected, such as
mouse and human, $M_g$ should be included.
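One way to write the additional normalization described above is sketched below. The symbol $\bar{C}(g)$ is introduced here only for illustration and is not notation from the original analysis; it is the score of Equation (1) divided by the number of datasets in which g appears.

```latex
\[
\bar{C}(g) \;=\; \frac{C(g)}{M_g}
           \;=\; \frac{1}{M_g} \sum_{d \in D} \frac{1}{N_{gd}}
                 \sum_{c \in d} \frac{|Q \cap c|}{|Q \cup c|} \, I(g \in c)
\]
```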
2 ClueGene Scoring Algorithm
The ClueGene algorithm is given two input parameters, a set of query genes Q and a set of
datasets D. The set of genes in the organism’s genome G is inferred from D.
ClueGene uses several arrays in its computations. S is an array indexed by gene and
dataset; S[g][d] is the score contribution to gene g by dataset d. N is an array indexed by
gene and dataset; N[g][d] is the number of clusters in dataset d that contribute scores to gene
g. N[g][d] corresponds to $N_{gd}$ in Equation (1). Assume the S and N arrays are initialized
to all 0s.
The output of the scoring algorithm is an array of co-clustering scores C indexed by gene.
2.1 Score Each Gene by Dataset
In this part of the ClueGene scoring algorithm, dataset-specific scores are computed for each
gene. The score contributed to each gene by each dataset is computed (array S), along with
the number of clusters in each dataset that contribute to the score (array N ).
▷ repeat for each dataset
for d ∈ D
    ▷ repeat for each cluster in the dataset
    for c ∈ d
        ▷ count the number of query genes found in the cluster
        n ← 0
        for q ∈ Q
            if q ∈ c
                n ← n + 1
        ▷ test whether cluster c contributes to the score
        if n > 0
            ▷ compute the cluster score
            s ← n/(|Q| + |c| − n)
            ▷ assign the cluster score to each gene in the cluster
            for g ∈ c
                S[g][d] ← S[g][d] + s
                N[g][d] ← N[g][d] + 1
2.2 Normalize Dataset-Specific Scores for Each Gene
In this part of the ClueGene scoring algorithm, the co-clustering score for each gene is
computed by combining the normalized dataset-specific scores.
▷ compute the co-clustering score of each gene in the genome
for g ∈ G
    C[g] ← 0
    ▷ repeat for each dataset
    for d ∈ D
        ▷ test whether dataset d contributes to the score of g
        if S[g][d] ≠ 0
            ▷ normalize the score contribution of dataset d by the number of clusters
            ▷ within d that contribute to the score
            C[g] ← C[g] + S[g][d]/N[g][d]
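The two parts of the scoring algorithm can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `cluegene_scores`, the representation of each dataset as a list of sets, and the gene names are all hypothetical. It keeps S and N for one dataset at a time in dictionaries instead of full gene-by-dataset arrays, and genes that never share a cluster with a query gene are simply absent from the result (their score is 0).

```python
from collections import defaultdict


def cluegene_scores(query, datasets):
    """Return a dict mapping gene -> co-clustering score C[g] (Equation (1)).

    query    -- iterable of query gene identifiers (Q)
    datasets -- iterable of datasets (D); each dataset is an iterable of
                clusters, each cluster a collection of gene identifiers
    """
    query = set(query)
    scores = defaultdict(float)  # C[g], implicitly 0 for unseen genes

    for dataset in datasets:
        S = defaultdict(float)  # score contributed to each gene by this dataset
        N = defaultdict(int)    # number of contributing clusters per gene

        # Part 1: score each gene within this dataset.
        for cluster in dataset:
            cluster = set(cluster)
            n = len(query & cluster)  # query genes found in the cluster
            if n > 0:
                # |Q ∩ c| / |Q ∪ c|, with |Q ∪ c| = |Q| + |c| - n
                s = n / (len(query) + len(cluster) - n)
                for g in cluster:
                    S[g] += s
                    N[g] += 1

        # Part 2: normalize by the number of contributing clusters and
        # accumulate across datasets.
        for g, total in S.items():
            scores[g] += total / N[g]

    return dict(scores)


# Hypothetical toy input: two datasets, query genes "a" and "b".
query = {"a", "b"}
datasets = [
    [{"a", "b", "x"}, {"a", "y", "z", "w"}, {"x", "y"}],
    [{"a", "x"}, {"b", "x", "y"}, {"u", "v"}],
]
scores = cluegene_scores(query, datasets)
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 4))
```

As in the pseudocode, query genes receive scores too; a caller interested only in new candidates would filter them out of the ranking.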
3 ClueGene Time Complexity
The ClueGene algorithm is given two parameters, a set of query genes Q and a set of datasets
D. The number of genes in the query set is q = |Q|. The query set is a subset of the genome
G; G has size g = |G|. Each dataset corresponds to an experiment series and consists of
clusters of genes derived from experimental data. The number of datasets is d = |D|.
There are two steps in the ClueGene scoring algorithm. First, dataset-specific scores are
calculated for each gene. That is, each gene is scored with respect to each dataset, giving
each gene a set of “subscores”, one for each dataset. Let $n_c$ denote the average number of
clusters per dataset, and let $n_g$ denote the average number of genes in each cluster. The
algorithm iterates over each of the $n_c$ clusters of each of the d datasets. For each iteration,
membership of each of the q query genes in the cluster is counted, and each of the $n_g$ genes
in the cluster is assigned the score.
We expect query set sizes to be about the same size as the average cluster size (or smaller),
so we take $q = n_g$ as an upper bound. Also note that $g = n_c n_g$. Thus the time complexity
for dataset-specific score computation is
$$
\begin{aligned}
d n_c (q + n_g) &= d n_c (n_g + n_g) \\
                &= d n_c (2 n_g) \\
                &= 2dg
\end{aligned}
$$
Second, each gene score is normalized by dataset. The algorithm iterates over each gene
in the genome and normalizes the dataset-specific scores by the number of clusters within
the dataset that contribute to the gene’s score. Thus the time complexity for normalization
is
$$dg$$
The overall time complexity for ClueGene is
$$
\begin{aligned}
T_{CG} &= 2dg + dg \\
       &= 3dg \\
       &= O(dg)
\end{aligned}
$$
For a particular organism the genome size g is constant, so the time complexity for
ClueGene can be expressed as $T_{CG} = O(d)$.
4 GeneRecommender Time Complexity
The GeneRecommender algorithm is given two parameters, a set of query genes Q and a
matrix Y of normalized expression values. The number of genes in the query set is q = |Q|.
The rows in Y correspond to genes in the genome G; the number of rows is g = |G|. The
columns correspond to experiments. Let d = |D| denote the number of datasets and e denote
the average number of experiments per dataset. The number of columns in Y is de.
There are two steps in the GeneRecommender scoring algorithm. First, the experiments
are scored for their relevance with respect to the query set. To do this, the mean and variance
of the q query gene expression values are computed for each of the de experiments. Thus
the time complexity for experiment scoring is
$$qde$$
Second, each gene in the genome is scored with respect to the relevant experiments. Let
n denote the number of relevant experiments determined in the first step. To compute the
gene score, the algorithm iterates over the n relevant experiments for each of the g genes
in the genome. Thus the time complexity for gene scoring is
$$n\,g$$
The overall time complexity for GeneRecommender is
$$T_{GR} = qde + n\,g$$
Note that n can be at most de, the total number of experiments. We can express n as a
fraction 0 ≤ f ≤ 1 of the number of experiments: n = f de. Also, g is an upper bound for
the query size q. Thus an upper bound on the overall time complexity of GeneRecommender
is
$$
\begin{aligned}
T_{GR} &= de(q + fg) \\
       &\le de(g + fg) \\
       &= de[(1 + f)g] \\
       &= O(deg)
\end{aligned}
$$
For a particular organism the genome size g is constant, so the time complexity for
GeneRecommender can be expressed as $T_{GR} = O(de)$.
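The two operation counts derived in this appendix can be compared side by side with a small sketch. The function names and every parameter value below are hypothetical, chosen only to show how the expressions 3dg and de(q + fg) are evaluated; the counts ignore constant factors per operation, so they indicate scaling rather than measured running time.

```python
def cluegene_ops(d, g):
    """Operation count for ClueGene: 2dg (scoring) + dg (normalization)."""
    return 2 * d * g + d * g


def generecommender_ops(d, e, q, g, f):
    """Operation count for GeneRecommender: qde + n*g with n = f*d*e."""
    n = f * d * e  # number of relevant experiments
    return q * d * e + n * g


# Hypothetical values: 100 datasets, 10 experiments per dataset,
# 20 query genes, 6000 genes, half of the experiments judged relevant.
d, e, q, g, f = 100, 10, 20, 6000, 0.5

cg = cluegene_ops(d, g)
gr = generecommender_ops(d, e, q, g, f)
upper_bound = d * e * (1 + f) * g  # the bound de[(1 + f)g]

print(cg, gr, upper_bound)
```

With g held fixed, the first count grows with d alone, while the second grows with the product de, which is the difference between the O(d) and O(de) results above.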