Supplemental Appendix A: ClueGene Algorithm and Time Complexity Analysis

1 ClueGene Scoring Function

ClueGene is given a set of genes called the query, $Q$, that are thought to be functionally related. It scores each gene $g$ in the genome $G$ based on how often the gene appears in clusters with the query genes. We define a function that assigns higher scores to genes that appear in clusters containing a high proportion of query genes. Let $D$ be a set of clustering solutions (datasets), where each element $d$ of $D$ is a set of clusters. Define $N_{gd}$ to be the number of clusters in dataset $d$ that contain $g$ and at least one gene from $Q$. The co-clustering score $C(g)$ of gene $g \in G$ is:

$$C(g) = \sum_{d \in D} \frac{1}{N_{gd}} \sum_{c \in d} \frac{|Q \cap c|}{|Q \cup c|} \, I(g \in c) \qquad (1)$$

where $I$ is the indicator function that returns 1 if its argument is true and 0 otherwise. A dataset with $N_{gd} = 0$ contributes nothing to the sum.

The intuition underlying the choice of scoring function is to identify genes that occur in small and specific clusters with the query genes. If $g$ belongs to a large cluster that also happens to have several of the query genes, this observation is down-weighted because the co-occurrence of gene $g$ with the query may arise by chance if the cluster is large enough. On the other hand, if $g$ belongs to a small cluster that also contains several of the query genes, this observation receives a high weight because the co-occurrence is less likely to be serendipitous.

Dividing by $N_{gd}$ corrects for the number of clusters a gene appears in. Without this correction, high scores could be assigned to genes that are "central" in the coexpression network simply because they appear in several clusters.

Note that one might also consider including an additional normalization term, $M_g$, which is the number of datasets in which $g$ appears. Dividing by $M_g$ would allow genes with highly different amounts of missing data to be directly compared, since $C(g)$ would then reflect an average co-clustering index per dataset.
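As a worked illustration of how the inner term of Equation (1) down-weights large clusters, consider a hypothetical query of four genes and two clusters that each contain three of them. The cluster sizes (5 and 100) are assumptions chosen for the example, not values from the study.

```latex
% Hypothetical numbers: |Q| = 4 and |Q \cap c| = 3 in both clusters.
% Small cluster, |c| = 5:
\frac{|Q \cap c|}{|Q \cup c|} = \frac{3}{4 + 5 - 3} = \frac{3}{6} = 0.5
% Large cluster, |c| = 100:
\frac{|Q \cap c|}{|Q \cup c|} = \frac{3}{4 + 100 - 3} = \frac{3}{101} \approx 0.030
```

Both clusters hold the same number of query genes, yet the small cluster contributes roughly seventeen times more to the score of each of its members.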
In our case, we found that dividing by $M_g$ had little effect on the search results. This is because the yeast expression database contains very little missing data: for every dataset, nearly all of the genes in the yeast genome are spotted on the microarray(s) used. However, when ClueGene is applied to other species in which more missing data is expected, such as mouse and human, $M_g$ should be included.

2 ClueGene Scoring Algorithm

The ClueGene algorithm is given two input parameters, a set of query genes $Q$ and a set of datasets $D$. The set of genes in the organism's genome $G$ is inferred from $D$. ClueGene uses several arrays in its computations. $S$ is an array indexed by gene and dataset; $S[g][d]$ is the score contribution to gene $g$ by dataset $d$. $N$ is an array indexed by gene and dataset; $N[g][d]$ is the number of clusters in dataset $d$ that contribute scores to gene $g$. $N[g][d]$ corresponds to $N_{gd}$ in Equation (1). Assume the $S$ and $N$ arrays are initialized to all 0s. The output of the scoring algorithm is an array of co-clustering scores $C$ indexed by gene.

2.1 Score Each Gene by Dataset

In this part of the ClueGene scoring algorithm, dataset-specific scores are computed for each gene. The score contributed to each gene by each dataset is computed (array $S$), along with the number of clusters in each dataset that contribute to the score (array $N$).

    ▷ repeat for each dataset
    for d ∈ D
        ▷ repeat for each cluster in the dataset
        for c ∈ d
            ▷ count the number of query genes found in the cluster
            n ← 0
            for q ∈ Q
                if q ∈ c
                    n ← n + 1
            ▷ test whether cluster c contributes to the score
            if n > 0
                ▷ compute the cluster score
                s ← n / (|Q| + |c| − n)
                ▷ assign the cluster score to each gene in the cluster
                for g ∈ c
                    S[g][d] ← S[g][d] + s
                    N[g][d] ← N[g][d] + 1

2.2 Normalize Dataset-Specific Scores for Each Gene

In this part of the ClueGene scoring algorithm, the co-clustering score for each gene is computed by combining the normalized dataset-specific scores.

    ▷ compute the co-clustering score of each gene in the genome
    for g ∈ G
        C[g] ← 0
        ▷ repeat for each dataset
        for d ∈ D
            ▷ test whether dataset d contributes to the score of g
            if S[g][d] ≠ 0
                ▷ normalize the score contribution of dataset d by the number of clusters
                ▷ within d that contribute to the score
                C[g] ← C[g] + S[g][d] / N[g][d]

3 ClueGene Time Complexity

The ClueGene algorithm is given two parameters, a set of query genes $Q$ and a set of datasets $D$. The number of genes in the query set is $q = |Q|$. The query set is a subset of the genome $G$; $G$ has size $g = |G|$. Each dataset corresponds to an experiment series and consists of clusters of genes derived from experimental data. The number of datasets is $d = |D|$.

There are two steps in the ClueGene scoring algorithm. First, dataset-specific scores are calculated for each gene. That is, each gene is scored with respect to each dataset, giving each gene a set of "subscores", one for each dataset. Let $n_c$ denote the average number of clusters per dataset, and let $n_g$ denote the average number of genes in each cluster. The algorithm iterates over each of the $n_c$ clusters of each of the $d$ datasets. In each iteration, the membership of each of the $q$ query genes in the cluster is tested, and each of the $n_g$ genes in the cluster is assigned the score. We expect query sets to be about the same size as the average cluster (or smaller); thus we take $q = n_g$. Also note that $g = n_c n_g$. Thus the time complexity for dataset-specific score computation is

$$d\,n_c\,(q + n_g) = d\,n_c\,(n_g + n_g) = d\,n_c\,(2 n_g) = 2dg$$

Second, each gene score is normalized by dataset. The algorithm iterates over each gene in the genome and normalizes the dataset-specific scores by the number of clusters within the dataset that contribute to the gene's score.
Thus the time complexity for normalization is

$$dg$$

The overall time complexity for ClueGene is

$$T_{CG} = 2dg + dg = 3dg = O(dg)$$

For a particular organism the genome size $g$ is constant, so the time complexity for ClueGene can be expressed as $T_{CG} = O(d)$.

4 GeneRecommender Time Complexity

The GeneRecommender algorithm is given two parameters, a set of query genes $Q$ and a matrix $Y$ of normalized expression values. The number of genes in the query set is $q = |Q|$. The rows in $Y$ correspond to genes in the genome $G$; the number of rows is $g = |G|$. The columns correspond to experiments. Let $d = |D|$ denote the number of datasets and $e$ denote the average number of experiments per dataset. The number of columns in $Y$ is $de$.

There are two steps in the GeneRecommender scoring algorithm. First, the experiments are scored for their relevance with respect to the query set. To do this, the mean and variance of the $q$ query gene expression values are computed for each of the $de$ experiments. Thus the time complexity for experiment scoring is

$$qde$$

Second, each gene in the genome is scored with respect to the relevant experiments. Let $n$ denote the number of relevant experiments determined in the first step. To compute the gene score, the algorithm iterates over the $n$ relevant experiments for each of the $g$ genes in the genome. Thus the time complexity for gene scoring is

$$ng$$

The overall time complexity for GeneRecommender is

$$T_{GR} = qde + ng$$

Note that $n$ can be at most $de$, the total number of experiments. We can express $n$ as a fraction $0 \le f \le 1$ of the number of experiments: $n = f\,de$. Also, $g$ is an upper bound for the query size $q$. Thus an upper bound on the overall time complexity of GeneRecommender is

$$T_{GR} = de\,(q + fg) \le de\,(g + fg) = de\,[(1 + f)\,g] = O(deg)$$

For a particular organism the genome size $g$ is constant, so the time complexity for GeneRecommender can be expressed as $T_{GR} = O(de)$.
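Returning to the ClueGene scoring algorithm of Section 2, the two passes (Sections 2.1 and 2.2) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `cluegene_scores`, the dictionary-of-cluster-lists input format, and the toy gene and dataset names are all assumptions made for the example. Dictionaries keyed by (gene, dataset) stand in for the arrays $S$ and $N$.

```python
from collections import defaultdict


def cluegene_scores(query, datasets):
    """Sketch of the two-pass ClueGene scoring (hypothetical interface).

    query    -- iterable of query gene ids (Q)
    datasets -- dict mapping a dataset id to a list of clusters,
                each cluster being an iterable of gene ids (D)
    Returns a dict mapping every gene seen in any cluster to C(g).
    """
    query = set(query)
    S = defaultdict(float)  # S[(g, d)]: score contributed to g by dataset d
    N = defaultdict(int)    # N[(g, d)]: number of contributing clusters
    genome = set()          # G is inferred from the clusters in D

    # Pass 1 (Section 2.1): score each gene by dataset.
    for d, clusters in datasets.items():
        for cluster in clusters:
            c = set(cluster)
            genome |= c
            n = len(query & c)  # query genes found in the cluster
            if n > 0:
                # |Q| + |c| - n equals |Q union c|
                s = n / (len(query) + len(c) - n)
                for g in c:
                    S[(g, d)] += s
                    N[(g, d)] += 1

    # Pass 2 (Section 2.2): normalize per dataset and sum over datasets.
    C = {}
    for g in genome:
        C[g] = 0.0
        for d in datasets:
            count = N.get((g, d), 0)
            if count > 0:
                C[g] += S[(g, d)] / count
    return C


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_query = {"A", "B"}
    toy_datasets = {
        "d1": [{"A", "B", "X"}, {"X", "Y", "Z", "W"}],
        "d2": [{"A", "X", "Y"}, {"B", "X"}],
    }
    scores = cluegene_scores(toy_query, toy_datasets)
    for gene in sorted(scores, key=scores.get, reverse=True):
        print(gene, round(scores[gene], 4))
```

In the toy data, gene X shares a three-gene cluster with both query genes in d1 and appears in two query-containing clusters in d2, so it outranks Y, which co-clusters with a single query gene once. Genes Z and W never co-cluster with the query and score 0. The sketch tests `count > 0` rather than `S[g][d] ≠ 0`; the two conditions coincide because every contributing cluster adds a strictly positive score.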
