Download here

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Gene prediction wikipedia , lookup

Transcript
Clustering Genes by Plotting Them on a
61-dimensional Vector Space, and Finding the
Cluster of a Foreign Gene
Sonam Kumar
Chennai Mathematical Institute
Abstract - Using Genetics as the base, and Algebra, Theoretical Computer Science and Statistics as the
tools, I analysed the expression of a codon in the genes
of a specified organism and formed a suitable number of
clusters in which these genes could be plotted. I then
found out foreign genes which are statistically different
from the genes in all the characters.
III. CONSTRUCTION
whether a medicine is appropriate for the treatment of a
certain genetic disease and find alternative drugs that will
work on drug-resistant viruses and bacteria.
IV. THE k-MEANS CLUSTERING ALGORITTHM
Using the symbols from the alphabet Σ = {A, T, G, C},
we get 43 = 64 strings of length three. Hence there are 64
codons, out of which three are stop codons. Hence, we have
61 codons coding for 20 amino acids. So, we represent each
gene (protein) as a 61-tuple. In other words they could also
be referred to as 61-dimensional vectors. The axes may be
in any order : AAA, AAU, AAG, AAC, AUA, · · · etc. which
I. INTRODUCTION
means that in a 61-tuple representation of a gene, the first
The branch of Science that deals with the study of biolog- number corresponds to the number of AAA codons in it,
ical phenomena and their analysis using the basic concepts the second to the number of AAU in it, the third to the
of Mathematics (and Statistics ), Theoretical Computer Science number of AAG in it, and so on.
and Physics, is called Bioinformatics . Bioinformatics can
Definition (Codon Usage ) : Each gene uses codons based
rightly be referred to as one of the ’backbones’ of the field
on
their frequency. This is called the codon usage of the
of medicine.
gene.
We then plot these genes in a 61-dimensional vector
Motivation for the project
space.
or in other words, each gene is a point in this 61The discovery of new medicines is triggered by more and
dimensional
space.
more investigations. I here propose a way how to test
Aim :
The genes of an organism are plotted on the
61-dimensional space. They are then clustered into an arbitrary number of clusters.
II. BACKGROUND
It is a method of Cluster Analysis which aims to partition
n
observations
into k clusters in which each observation beDNA is Deoxy-ribonucleic Acid and RNA is Ribonucleic
longs
to
the
cluster
with the nearest mean.
Acid. The structural and functional unit of a DNA and
a RNA is the quadrapole of Nitrogenous bases, namely,
A. The Algorithm
Adenine(A), Thymine(T ), Guanine(G) and Cytosine(C ).
Given a set of observations T = {~x1 , ~x2 , · · · , ~xn }, where
DNA is a double stranded structure, with A=T double H each
observation in a gene is a d-dimensional real vector,
bonds and G≡C triple H -bonds joining the strands. The
the
k
-means clustering algorithm aims to partition the n
strands are sequences of these nitrogenous bases. RNA is a
observations
into k sets (k < n) S = {S1 , S2 , · · · , Sk } so as
single stranded structure, with the base Thymine replaced
to
maximize
the
within-cluster sum of squares:
by Uracil(U ). So, an RNA is a sequence of symbols from
the alphabet Σ = {A, U, G , C}. A codon is a sequence
k X
X
of three nitrogenous bases. Technically speaking, they are
argmin
k~
xj − µ
~ i k2
(1)
strings of length three on the alphabet Σ. A single codon is a
S
i=1 j∈Si
triplet that codes for an amino acid. A protein is a sequence
of amino acids.
where, µ
~ i is the mean of Si .
We have the notion of the Central Dogma where DNA
We, thereby get the clusters of the genes in the 61transcribes to m RNA, which is translated to protein. In dimensional space which are formed by following the above
this process of protein synthesis, codons in the RNA are rule.
replaced by the amino acids by the ribosomes.
1
B. Description of the algorithm
Given the genome of an organism, we find out the 61Clustering n particles into k clusters is a NP-hard prob- tuple representation of each gene in it. We then construct
lem. This algorithm approximately clusters the genes in the appropriate number of clusters, by the k-means clusthe space and is described below :
tering algorithm. Label these clusters as in equation (3),
as S = {S1 , S2 , · · · , Sk }. We construct another set of the
means of the clusters Si ∀i ∈ {1, 2, · · · , k}, namely:
1. Partition the n points arbitrarily into k parts (or clusters), S1 , S2 , · · · , Sk .
µ = {~
µ1 , µ
~ 2, · · · , µ
~ k}
(4)
2. Calculate the cluster mean;
µ
~i =
1 X i
xj
ni
We consider a foreign gene ~g and represent it by a 61tuple, ~g = (g1 , g2 , · · · , g61 ). Find the distances of ~g from
each of the µ
~ i s in equation (4). We conclude that ~g is far
away from each of the k clusters, if it is a foreign gene.
(2)
j∈Si
where ni is the number of points and xij is the coordinate of the jth point in the cluster tagged Si .
VI. USAGES
Many viruses like HIV become drug resistant when a patient is treated with antibiotics. The reason for this is due
to mutations. It will be interesting to see if the statistically
different foreign genes undergo mutations. This technique
would help in answering such questions.
3. Take each point and find its distance from each cluster mean. Move it to the cluster to which it is closest
to.
4. Iterate this process until there is no further movement.
VII. CONCLUSION
Hence, the partition:
S = S1 ∪ S2 ∪ · · · ∪ Sk
By using this technique of k-means clustering algorithm,
we could take a cell of an organism and look at all the genes
in the nucleus. We then represent them. We also find the
statistically different foreign genes which donot fall in the
clusters.
(3)
defines the clustering of the n points into k clusters.
V. FINDING FOREIGN GENES
The approach outlined here can be complemented with
Definition : A foreign gene is one which is statistically experiments that I could carry out in nearby institutions
different in codon usage from other genes in the genome of such as the Indian Institute of Technology, Madras.
the organism.
VIII. REFERENCES
By the statement ’genes ~g1 and ~g2 are statistically dif1. Sequences in Biological Sciences, Durbin.
ferent’, we mean that the codon usage of the genes are
different, and represent different position vectors on the
2. Computational Biology, Mount
61-dimensional vector space.
3. http://www.wikipedia.org/
2