Megatask 2 : Clustering of an unspecified set of gene lists
Anton Roodnat, [email protected]
1 Introduction
This document describes my attempt at finding structure in the presented set of gene-lists.
My research questions are :
1. Is there similarity between the gene-lists ? How can the gene-lists be clustered ?
2. Are there genes that appear together in gene-lists ?
To answer the first question the data has been processed with the steps described in chapter 2.
The second question was answered with the same procedure but with flipped gene/gene-list axes (chapter 3).
All processing has been done in MATLAB / Octave on a Linux system (at work, don't tell my boss).
2 Clustering the gene-lists based on gene-list similarity
2.1 From set of gene-lists to binary matrix
In this step the set of gene-lists is read in and converted to a binary matrix in which the rows represent the gene-lists (or experiments) and the columns represent the genes (I believe the more conventional way is to transpose this matrix). An entry is 1 if the gene is present in the gene-list and 0 otherwise.
The code to do this is rather straightforward and will not be presented here.
The resulting binary matrix can be displayed as a scatterplot using Octave's imshow() function. It is not shown here but as expected it looks rather random.
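The original code for this step is not shown. A minimal sketch of how it could look in Octave, assuming the gene-lists are already available as a cell array of cell arrays of gene names (the variable name gene_lists and the toy gene names are invented for illustration) :

```octave
% Hypothetical input: one cell per gene-list, each holding a row of gene names.
gene_lists = { {'TP53','BRCA1','EGFR'}, {'BRCA1','EGFR'}, {'MYC'} };

N_exp     = numel(gene_lists);
all_genes = unique([gene_lists{:}]);   % sorted list of distinct gene names
all_genes = all_genes(:)';             % force a row so it matches a row of M
N_genes   = numel(all_genes);

% Rows = gene-lists, columns = genes, as in the text above.
M = false(N_exp, N_genes);
for i = 1:N_exp
  M(i,:) = ismember(all_genes, gene_lists{i});
end
```

With this toy input the columns are ordered BRCA1, EGFR, MYC, TP53, so the first row of M marks every gene except MYC.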
2.2 From binary matrix to proximity matrix
The proximity metric I tried here was the amount of overlap between two gene-lists.
Suppose we have two vectors that represent two gene-lists, S1[1 , .. , N] and S2[1 , .. , N], of which the n-th element is 1 if gene n is present and 0 otherwise. Then a measure of proximity (or overlap) could be :
$$\mathrm{prox}(S_1, S_2) = \frac{\text{number of genes shared by } S_1 \text{ and } S_2}{\text{total number of distinct genes in } S_1 \text{ and } S_2}$$
So if there are no shared genes the proximity is 0. If all genes are shared it will be 1. This ratio of intersection to union is also known as the Jaccard index.
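As a quick check of this definition, here is a small worked example on two toy vectors (the values are invented for illustration, they are not taken from the dataset) :

```octave
S1 = logical([1 1 0 1 0]);   % toy gene-list 1: genes 1, 2 and 4 present
S2 = logical([1 0 0 1 1]);   % toy gene-list 2: genes 1, 4 and 5 present

n_shared = sum(S1 & S2);     % genes present in both lists (genes 1 and 4)
n_total  = sum(S1 | S2);     % distinct genes present in at least one list
prox     = n_shared / n_total;
```

Two of the four distinct genes are shared, so the proximity of these two lists is 0.5.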
Fig.1 : histogram of maximum proximity of a gene-list to any other gene-list in the set
In MATLAB code the full proximity computation has been implemented as follows :
% number of genes in each gene-list (row sums of the binary matrix)
exp_sum = sum(M, 2);
% only the upper triangle (j > i) is filled
proximity=zeros(N_exp,N_exp);
for i=1:N_exp
  for j=(i+1):N_exp
    exp_overlap = M(i,:) & M(j,:) ;                        % genes present in both lists
    overlap_sum = sum(exp_overlap);
    length_sum = exp_sum(i) + exp_sum(j) - overlap_sum ;   % distinct genes in both lists
    proximity(i,j) = overlap_sum/length_sum ;
  end
end
Here M is the binary matrix and N_exp is the number of gene-lists. Since proximity(i,j) = proximity(j,i) only half of this matrix needs to be calculated. The full matrix can be obtained by adding the transposed matrix. This matrix is very similar to the distance-matrix presented in the course but shows proximity instead of distance.
A clue about the structure in the presented dataset can be found by plotting the histogram of the maximum proximity per gene-list to any other gene-list, so hist(max(proximity)) in MATLAB on the full matrix (see figure 1). Apparently there are some gene-lists that have high proximity to other gene-lists but most gene-lists have a maximum proximity of about 0.1.
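The step from the half-filled matrix to the per-list maximum can be sketched as follows, using a toy upper-triangular matrix for four gene-lists (the values and the names proximity_full and max_prox are invented for illustration) :

```octave
% Toy upper-triangular proximity matrix for 4 gene-lists
proximity = [0 0.8 0.1 0.0;
             0 0   0.6 0.0;
             0 0   0   0.1;
             0 0   0   0  ];

proximity_full = proximity + proximity';   % symmetric, diagonal stays 0
max_prox = max(proximity_full);            % column-wise maximum, one value per gene-list

% hist(max_prox)   % on the real data this would give the histogram of figure 1
```

Taking the maximum of the half-filled matrix alone would report 0 for the first gene-list here, which is why the transposed matrix is added first.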
2.3 Assigning cluster-labels to gene-lists
In this step an attempt is made to cluster those gene-lists that have high proximity/overlap. Because of processing-time a very simple clustering algorithm has been used : first a proximity-threshold is chosen and gene-lists that have a proximity larger than this threshold with respect to each other are grouped into a cluster.
So there is no hierarchy in clusters now. Either a gene-list is clustered or not.
For a threshold of 0.5 (so 50% gene-overlap) a total of 449 clusters is found. The median cluster has only 2 members (so a pair) but some clusters show a rather high number of members, up to a maximum of 61 (see figure 2).
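The text does not say how gene-lists are grouped when A is close to B and B is close to C but A is not close to C. The sketch below assumes one possible reading, namely that a cluster is a connected group in the thresholded proximity matrix. This is an assumption and not necessarily the procedure that produced the numbers above. The toy matrix P and all variable names are invented for illustration.

```octave
% Toy symmetric proximity matrix for 4 gene-lists
P = [0   0.8 0   0  ;
     0.8 0   0.6 0  ;
     0   0.6 0   0.1;
     0   0   0.1 0  ];
threshold = 0.5;

A = P > threshold;          % A(i,j) is true if lists i and j are "close"
N = size(P, 1);
labels = zeros(1, N);       % 0 means: not clustered
n_clusters = 0;
for s = 1:N
  if labels(s) == 0 && any(A(s,:))
    n_clusters = n_clusters + 1;
    labels(s) = n_clusters;
    queue = s;
    while ~isempty(queue)   % breadth-first walk over close neighbours
      k = queue(1);
      queue(1) = [];
      nb = find(A(k,:) & labels == 0);
      labels(nb) = n_clusters;
      queue = [queue nb];
    end
  end
end

cluster_size = accumarray(labels(labels > 0)', 1)';   % members per cluster
```

In this toy case lists 1, 2 and 3 end up in one cluster of size 3 and list 4 stays unclustered.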
Fig.2 : cluster-size for clusters with 50% proximity-threshold in genesets
2.4 Swapping gene-lists to get a heatmap
In this perhaps unnecessary step the rows of the binary matrix from the first step are permuted such that gene-lists from the same cluster (for a proximity-threshold of 50%) are next to each other, in an attempt to get a graphical representation of the clustering.
The result is given in figure 3. This plot could be much improved by adding color and grouping the genes but unfortunately I have to work tomorrow :-).
Indeed some clustering becomes visible as vertical lines that indicate genes shared by gene-lists.
Fig.3 : scatterplot of genes (x-axis) per gene-list (y-axis)
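The row permutation can be sketched as below, assuming a vector of cluster labels with one entry per gene-list in which 0 marks an unclustered list (this label convention and the toy data are assumptions made for illustration) :

```octave
M = logical([1 0 1;
             0 1 0;
             1 0 1;
             0 1 1]);       % toy binary matrix: rows = gene-lists
labels = [1 0 1 0];         % hypothetical labels: lists 1 and 3 share a cluster

sort_key = labels;
sort_key(labels == 0) = Inf;     % push unclustered lists to the bottom
[~, order] = sort(sort_key);     % equal keys keep their original order
M_sorted = M(order, :);

% imshow(M_sorted)   % would display the reordered matrix as in figure 3
```

After sorting, the two members of cluster 1 are adjacent rows, so genes they share line up vertically.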
2.5 Results for a 90% proximity/overlap threshold
The same procedure was followed for a threshold of 90%. In this case the total number of gene-lists that show this amount of overlap was 240. The number of clusters was 110 and the maximum number of members per cluster was only 4, reached by 6 clusters, so not tremendous.
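Counting the gene-lists that reach a given threshold could be done as sketched below. This assumes that "showing this amount of overlap" means that the maximum proximity of a gene-list to any other gene-list exceeds the threshold. The toy matrix and the variable names are invented for illustration.

```octave
% Toy symmetric proximity matrix for 3 gene-lists
P = [0    0.95 0.2;
     0.95 0    0.6;
     0.2  0.6  0  ];
max_prox = max(P);                        % best match per gene-list

thresholds = [0.5 0.9];
n_lists = zeros(size(thresholds));
for t = 1:numel(thresholds)
  n_lists(t) = sum(max_prox > thresholds(t));
  fprintf('threshold %.1f : %d gene-lists\n', thresholds(t), n_lists(t));
end
```

Raising the threshold from 0.5 to 0.9 drops the third toy list, whose best match is only 0.6.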
3 Detecting genes that show up together
This is exactly the same procedure as described above but now the genes, rather than the gene-lists, are compared to check if presence of a gene A always coincides with presence of a gene B. To calculate this the binary matrix of genes vs gene-lists is transposed and then the proximity matrix is determined again.
It appears there are quite a few genes that (almost) always occur together as can be seen in figure 4. These genes have been examined a bit more by clustering them. There are 1986 genes that coincide with a proximity-threshold of 100%, so a perfect overlap.
Fig.4 : histogram of maximum proximity of a gene to any other gene in the set
Clustering with one level (no hierarchy) and a threshold of 100% results in 105 distinct clusters. Most clusters are small with <10 members but there are 3 large clusters with >100 members, namely 878, 406 and 171 members (see figure 5).
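Because there are many more genes than gene-lists, a double loop over gene pairs can become slow. The sketch below shows a vectorized alternative that computes the same overlap ratio with a matrix product. It is not the code used for the results above, and it assumes every gene occurs in at least one gene-list (otherwise 0/0 gives NaN). The toy matrix and variable names are invented for illustration.

```octave
M = logical([1 1 0;
             1 1 1;
             0 0 1]);              % toy matrix: rows = gene-lists, columns = genes
G = double(M');                    % transposed: rows = genes, columns = gene-lists
N_genes = size(G, 1);

overlap   = G * G';                % overlap(a,b) = gene-lists containing both genes
n_lists   = sum(G, 2);             % gene-lists containing each gene
total     = bsxfun(@plus, n_lists, n_lists') - overlap;   % lists containing a or b
prox_gene = overlap ./ total;
prox_gene(logical(eye(N_genes))) = 0;   % ignore a gene's proximity to itself

perfect = find(max(prox_gene) == 1);    % genes with a 100% partner
```

In the toy matrix genes 1 and 2 occur in exactly the same gene-lists, so both are reported as having a perfect partner.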
Fig.5 : cluster-size for clusters with 100% proximity-threshold in genesets
4 Conclusions
• When comparing similarity of gene-lists it appears there are 240 gene-lists that show more than 90% of proximity / overlap. The maximum cluster-size is four so rather low.
• When comparing coincidence of genes over gene-lists it appears there are three large clusters of genes that coincide 100%.
Much more processing could be done and also the biological function of these gene-lists could be investigated with Enrichr but unfortunately my time is up for this megatask.
Thanks for the really interesting course ! It motivates me to learn more about the subject and hopefully apply it.