Megatask 2 : Clustering of an unspecified set of gene lists

Anton Roodnat, [email protected]

1 Introduction

This document describes my attempt at finding structure in the presented set of gene-lists. My research questions are :

1. Is there similarity between the gene-lists ? How can the gene-lists be clustered ?
2. Are there genes that appear together in gene-lists ?

To answer the first two questions (item 1) the data has been processed with the steps described in chapter 2. The last question (item 2) was answered with the same procedure but with flipped gene/gene-list axes (chapter 3). All processing has been done in matlab / octave on a linux system (at work, don't tell my boss).

2 Clustering the gene-lists based on gene-list similarity

2.1 From set of gene-lists to binary matrix

In this step the set of gene-lists is read in and converted to a binary matrix in which the rows represent the gene-lists (or experiments) and the columns represent the genes (I believe the more conventional way is to transpose this matrix). The code to do this is rather straightforward and will not be presented here. The resulting binary matrix can be displayed as a scatterplot using octave's imshow() function. It is not shown here but as expected it looks rather random.

2.2 From binary matrix to proximity matrix

The proximity metric I tried here was the amount of overlap between two gene-lists. Suppose we have two vectors that represent two gene-lists, S1[1, .., N] and S2[1, .., N], of which the n-th element is 1 if gene n is present and 0 otherwise. Then a measure of proximity (or overlap) could be :

\[
\mathrm{prox}(S1, S2) = \frac{\text{number of genes present in both } S1 \text{ and } S2}{\text{total number of distinct genes in } S1 \text{ and } S2}
\]

So if the two gene-lists have no genes in common the proximity is 0. If they contain exactly the same genes it is 1.
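As a quick check of this definition, here is a minimal sketch for two made-up gene-lists over N = 6 genes. The vectors and variable names are hypothetical and not taken from the dataset. The ratio of shared genes to distinct genes is commonly known as the Jaccard index.

```octave
% Two hypothetical gene-lists over N = 6 genes (1 = gene present).
S1 = logical([1 1 0 1 0 0]);
S2 = logical([1 0 0 1 1 0]);

n_shared = sum(S1 & S2);   % genes present in both lists
n_total  = sum(S1 | S2);   % distinct genes present in at least one list
prox     = n_shared / n_total;

fprintf('shared = %d, total = %d, prox = %.2f\n', n_shared, n_total, prox);
```

Here genes 1 and 4 are shared and four distinct genes occur in total, so the proximity is 2/4 = 0.5.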
In matlab code the proximity matrix has been implemented as follows :

    % number of genes in each gene-list (row sums of the binary matrix M)
    for i=1:N_exp
      exp_sum(i) = sum( M(i,:));
    end

    proximity=zeros(N_exp,N_exp);
    for i=1:N_exp
      for j=(i+1):N_exp
        exp_overlap = M(i,:) & M(j,:) ;     % genes present in both lists
        overlap_sum = sum(exp_overlap);
        % distinct genes in either list
        length_sum = exp_sum(i) + exp_sum(j) - overlap_sum ;
        proximity(i,j) = overlap_sum/length_sum ;
      end
    end

in which M is the binary matrix and N_exp is the number of gene-lists. Since proximity(i,j) = proximity(j,i) only half of this matrix needs to be calculated. The full matrix can be obtained by adding the transposed matrix. This matrix is very similar to the distance-matrix presented in the course but shows proximity instead of distance.

A clue about the structure in the presented dataset can be found by plotting the histogram of the maximum proximity per gene-list to any other gene-list, so hist(max(proximity)) in matlab (see figure 1). Apparently there are some gene-lists that have high proximity to other gene-lists, but most gene-lists have a maximum proximity of about 0.1.

Fig.1 : histogram of maximum proximity of a genelist to any other genelist in the set

2.3 Assigning cluster-labels to gene-lists

In this step an attempt is made to cluster those gene-lists that have high proximity/overlap. To limit processing time a very simple clustering algorithm has been used : first a proximity-threshold is chosen, and gene-lists that have a proximity larger than this threshold with respect to each other are grouped into a cluster. So there is no hierarchy in clusters now. Either a gene-list is clustered or not.

For a threshold of 0.5 (so 50% gene-overlap) a total of 449 clusters is found. The median cluster has only 2 members (so a pair), but some clusters have a rather high number of members, up to a maximum of 61, see figure 2.
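The text does not spell out exactly how gene-lists are grouped once the threshold is applied. One possible reading, sketched below under that assumption, is to treat every pair above the threshold as an edge and to label the connected components of the resulting graph. The toy proximity values and the names A, labels and cluster_sizes are made up for illustration.

```octave
% Toy upper-triangular proximity matrix for 5 gene-lists (values made up).
proximity = zeros(5);
proximity(1,2) = 0.8;   % lists 1 and 2 overlap strongly
proximity(2,3) = 0.6;   % list 3 is linked to the cluster via list 2
proximity(4,5) = 0.3;   % below threshold: lists 4 and 5 stay unclustered
threshold = 0.5;

% Symmetric adjacency matrix: an edge where proximity exceeds the threshold.
A = (proximity + proximity') > threshold;
N = size(A, 1);
labels = zeros(1, N);   % 0 = not assigned to any cluster
n_clusters = 0;

for seed = 1:N
  if labels(seed) ~= 0 || ~any(A(seed, :))
    continue;           % already labelled, or no partner above the threshold
  end
  n_clusters = n_clusters + 1;
  labels(seed) = n_clusters;
  queue = seed;
  while ~isempty(queue)            % breadth-first walk over linked lists
    current = queue(1);
    queue(1) = [];
    neighbours = find(A(current, :) & (labels == 0));
    labels(neighbours) = n_clusters;
    queue = [queue, neighbours];
  end
end

% Number of members per cluster (unclustered lists are ignored).
cluster_sizes = accumarray(labels(labels > 0)', 1)';
disp(labels);
disp(cluster_sizes);
```

With this reading, list 3 ends up in the same cluster as list 1 even though the two are not directly above the threshold. A stricter rule that requires every pair inside a cluster to exceed the threshold would give smaller clusters.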
Fig.2 : cluster-size for clusters with 50% proximity-threshold in genesets

2.4 Swapping gene-lists to get a heatmap

In this perhaps unnecessary step the rows of the binary matrix from the first step are permuted such that gene-lists from the same cluster (for a proximity-threshold of 50%) are next to each other, in an attempt to get a graphical representation of the clustering. The result is given in figure 3. This plot could be much improved by adding color and grouping the genes but unfortunately I have to work tomorrow :-). Indeed some clustering becomes visible as vertical lines that indicate genes shared by gene-lists.

Fig.3 : scatterplot of genes (x-axis) per gene-list (y-axis)

2.5 Results for a 90% proximity/overlap threshold

The same procedure was followed for a threshold of 90%. In this case the total number of gene-lists that show this amount of overlap was 240. The number of clusters was 110 and the largest clusters had only 4 members (6 clusters reached that size), so not tremendous.

3 Detecting genes that show up together

This is exactly the same procedure as described above, but now the genes are compared to check whether the presence of a gene A in a gene-list always coincides with the presence of gene B. To calculate this the binary matrix of gene-lists vs genes is transposed and then the proximity matrix is determined again. It appears there are quite a few genes that (almost) always occur together, as can be seen in figure 4.

Fig.4 : histogram of maximum proximity of a gene to any other gene in the set

These genes have been examined a bit more by clustering them. There are 1986 genes that coincide with at least one other gene at a proximity-threshold of 100%, so a perfect overlap. Clustering with one level (no hierarchy) and a threshold of 100% results in 105 distinct clusters. Most clusters are small with <10 members but there are 3 large clusters with >100 members, namely 878, 406 and 171 (see figure 5).
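The transposed calculation can also be written without explicit loops. The sketch below is not the code used for the results above. It is one possible vectorised variant on a made-up 3 x 4 binary matrix, and the names G, counts, overlap, total and gene_prox are hypothetical.

```octave
% Toy binary matrix: 3 gene-lists (rows) x 4 genes (columns), values made up.
M = [1 1 0 1;
     1 1 0 0;
     0 0 1 1];

G = double(M');                  % transpose: rows are now genes
counts  = sum(G, 2);             % number of gene-lists containing each gene
overlap = G * G';                % overlap(a,b) = lists containing both a and b
total   = bsxfun(@plus, counts, counts') - overlap;  % lists with a or b
gene_prox = overlap ./ max(total, 1);   % max() guards against 0/0

% Pairs of distinct genes that always occur together (proximity of 100%).
[a, b] = find(triu(gene_prox, 1) == 1);
disp([a, b]);
```

In this toy matrix genes 1 and 2 appear in exactly the same gene-lists, so they form the only pair with a proximity of 1. The diagonal of gene_prox compares each gene with itself and is skipped by triu(gene_prox, 1).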
Fig.5 : cluster-size for clusters with 100% proximity-threshold in genesets

4 Conclusions

• When comparing similarity of gene-lists it appears there are 240 gene-lists that show more than 90% proximity / overlap. The maximum cluster-size is four, so rather low.
• When comparing coincidence of genes over gene-lists it appears there are three large clusters of genes that coincide 100%.

Much more processing could be done and also the biological function of these gene-lists could be investigated with enrichr but unfortunately my time is up for this megatask.

Thanks for the really interesting course ! It motivates me to learn more about the subject and hopefully apply it.