Clustering Genes by Plotting Them on a 61-dimensional Vector Space, and Finding the Cluster of a Foreign Gene

Sonam Kumar
Chennai Mathematical Institute

Abstract - Using Genetics as the base, and Algebra, Theoretical Computer Science and Statistics as the tools, I analysed the expression of codons in the genes of a specified organism and formed a suitable number of clusters into which these genes could be grouped. I then identified foreign genes, which are statistically different from the genes in all the clusters.

I. INTRODUCTION

The branch of Science that deals with the study of biological phenomena and their analysis using the basic concepts of Mathematics (and Statistics), Theoretical Computer Science and Physics is called Bioinformatics. Bioinformatics can rightly be referred to as one of the 'backbones' of the field of medicine.

Motivation for the project

The discovery of new medicines is driven by continued investigation. Here I propose a way to test whether a medicine is appropriate for the treatment of a certain genetic disease, and to find alternative drugs that will work on drug-resistant viruses and bacteria.

Aim: The genes of an organism are plotted in the 61-dimensional space. They are then clustered into an arbitrary number of clusters.

II. BACKGROUND

DNA is Deoxyribonucleic Acid and RNA is Ribonucleic Acid. The structural and functional units of DNA are four nitrogenous bases, namely Adenine (A), Thymine (T), Guanine (G) and Cytosine (C). DNA is a double-stranded structure, with A=T double hydrogen bonds and G≡C triple hydrogen bonds joining the strands. The strands are sequences of these nitrogenous bases. RNA is a single-stranded structure, with the base Thymine replaced by Uracil (U). So, an RNA is a sequence of symbols from the alphabet Σ = {A, U, G, C}. A codon is a sequence of three nitrogenous bases. Technically speaking, codons are strings of length three over the alphabet Σ. A single codon is a triplet that codes for an amino acid. A protein is a sequence of amino acids.

We have the notion of the Central Dogma, where DNA is transcribed to mRNA, which is translated to protein. In this process of protein synthesis, the ribosomes read the codons in the mRNA and replace them with the corresponding amino acids.

III. CONSTRUCTION

Using the symbols from the alphabet Σ = {A, U, G, C}, we get $4^3 = 64$ strings of length three. Hence there are 64 codons, out of which three are stop codons. This leaves 61 codons coding for 20 amino acids. So, we represent each gene (protein) as a 61-tuple, or in other words as a 61-dimensional vector. The axes may be in any order: AAA, AAU, AAG, AAC, AUA, etc. This means that in the 61-tuple representation of a gene, the first number is the number of AAA codons in it, the second the number of AAU codons, the third the number of AAG codons, and so on.

Definition (Codon Usage): Each gene uses each codon with some frequency. This is called the codon usage of the gene.

We then plot these genes in a 61-dimensional vector space; in other words, each gene is a point in this 61-dimensional space.

IV. THE k-MEANS CLUSTERING ALGORITHM

k-means clustering is a method of Cluster Analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

A. The Algorithm

Given a set of observations $T = \{\vec{x}_1, \vec{x}_2, \cdots, \vec{x}_n\}$, where each observation (here, a gene) is a d-dimensional real vector, the k-means clustering algorithm aims to partition the n observations into k sets (k < n), $S = \{S_1, S_2, \cdots, S_k\}$, so as to minimize the within-cluster sum of squares:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{j \in S_i} \lVert \vec{x}_j - \vec{\mu}_i \rVert^2 \tag{1}$$

where $\vec{\mu}_i$ is the mean of $S_i$.

We thereby get the clusters of the genes in the 61-dimensional space, formed by following the above rule.

B. Description of the algorithm

Clustering n points into k clusters is an NP-hard problem. The algorithm described below clusters the genes in the space approximately:

1. Partition the n points arbitrarily into k parts (or clusters), $S_1, S_2, \cdots, S_k$.

2. Calculate the cluster means:

$$\vec{\mu}_i = \frac{1}{n_i} \sum_{j \in S_i} \vec{x}^{\,i}_j \tag{2}$$

where $n_i$ is the number of points in $S_i$ and $\vec{x}^{\,i}_j$ is the coordinate vector of the jth point in the cluster tagged $S_i$.

3. Take each point and find its distance from each cluster mean. Move it to the cluster whose mean is closest.

4. Repeat steps 2 and 3 until no point moves.

Hence, the partition

$$S = S_1 \cup S_2 \cup \cdots \cup S_k \tag{3}$$

defines the clustering of the n points into k clusters.

V. FINDING FOREIGN GENES

Definition: A foreign gene is one which is statistically different in codon usage from the other genes in the genome of the organism.

By the statement 'genes $\vec{g}_1$ and $\vec{g}_2$ are statistically different', we mean that the codon usages of the genes are different, so that they represent different position vectors in the 61-dimensional vector space.

Given the genome of an organism, we find the 61-tuple representation of each gene in it. We then construct the appropriate number of clusters by the k-means clustering algorithm. Label these clusters as in equation (3), as $S = \{S_1, S_2, \cdots, S_k\}$. We construct another set, of the means of the clusters $S_i$ for all $i \in \{1, 2, \cdots, k\}$, namely:

$$\mu = \{\vec{\mu}_1, \vec{\mu}_2, \cdots, \vec{\mu}_k\} \tag{4}$$

We consider a candidate gene $\vec{g}$ and represent it by a 61-tuple, $\vec{g} = (g_1, g_2, \cdots, g_{61})$. We find the distances of $\vec{g}$ from each of the $\vec{\mu}_i$ in equation (4). If $\vec{g}$ is a foreign gene, it is far away from each of the k clusters.

VI. USAGES

Many viruses, like HIV, become drug resistant when a patient is treated with antiviral drugs. This resistance is due to mutations. It will be interesting to see whether the statistically different foreign genes undergo mutations. This technique would help in answering such questions.

VII. CONCLUSION

Using the k-means clustering algorithm, we could take a cell of an organism and look at all the genes in its nucleus. We then represent them as points in the 61-dimensional space and cluster them. We also find the statistically different foreign genes, which do not fall in the clusters.

The approach outlined here can be complemented with experiments that I could carry out in nearby institutions such as the Indian Institute of Technology, Madras.
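The construction of Section III, the four steps of Section IV-B and the distance test of Section V can be sketched in Python as below. This is only an illustrative sketch under stated assumptions, not the implementation used for this project. The function names (`codon_usage`, `k_means`, `distances_to_means`) are hypothetical. The axis order, the random seed and the handling of an emptied cluster are choices made here. The three stop codons are taken to be the standard UAA, UAG and UGA. The text does not fix how large a distance counts as "far away", so no threshold is built in.

```python
from itertools import product
import math
import random

BASES = "AUGC"
STOP_CODONS = {"UAA", "UAG", "UGA"}  # standard stop codons (assumed here)
# The 61 sense codons in one fixed, arbitrary axis order: AAA, AAU, AAG, ...
CODONS = [c for c in ("".join(p) for p in product(BASES, repeat=3))
          if c not in STOP_CODONS]


def codon_usage(mrna):
    """Return the 61-tuple of codon counts of an mRNA coding sequence."""
    counts = dict.fromkeys(CODONS, 0)
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        codon = mrna[i:i + 3]
        if codon in counts:  # stop codons and unknown symbols are skipped
            counts[codon] += 1
    return tuple(counts[c] for c in CODONS)


def distance(u, v):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points, as in equation (2)."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Approximate k-means following steps 1-4; returns (labels, means)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must satisfy 1 <= k <= number of points")
    rng = random.Random(seed)
    # Step 1: arbitrary partition. Dealing shuffled indices round-robin
    # guarantees that every cluster starts non-empty.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for pos, idx in enumerate(order):
        labels[idx] = pos % k
    means = []
    for _ in range(max_iter):
        # Step 2: recompute the mean of every cluster.
        new_means = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Assumption: a cluster that became empty keeps its old mean.
            new_means.append(mean(members) if members else means[i])
        means = new_means
        # Step 3: move each point to the cluster with the closest mean.
        new_labels = [min(range(k), key=lambda i: distance(p, means[i]))
                      for p in points]
        # Step 4: stop when no point has moved.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, means


def distances_to_means(gene, means):
    """Distances of a candidate gene from every cluster mean (Section V)."""
    return [distance(gene, m) for m in means]
```

In use, one would apply `codon_usage` to every gene of the genome, pass the resulting list of 61-tuples to `k_means`, and then compare `distances_to_means(codon_usage(candidate), means)` against the typical distance of the clustered genes from their own means. Because the initial partition is arbitrary, different seeds can yield different clusterings.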