Project Report for the course Introduction to Bioinformatics

Xinyang Liu

Description

This project originated from the paper by Olivier Fedrigo and Gavin Naylor, “A gene-specific DNA sequencing chip for exploring molecular evolutionary change”. The main idea of the paper is to present a novel gene-specific DNA chip algorithm that reduces the number of possible oligonucleotide combinations for sequencing-by-hybridization (SBH) approaches. The algorithm has five parts:

1) documenting variability;
2) computing permutations based on observed variability;
3) filtering at the amino acid level;
4) optimizing the length of oligonucleotides;
5) computing the probes.

In the third part, the authors implemented a filter that excludes any combination of nucleotides resulting in an amino acid that was not present at a particular position in the training set. Following the same idea, the purpose of this project is to create a basic filter for one feature at the amino acid level. I chose Matlab to code the filter, and my goal is to remove the “stop” codons (TAA, TAG, TGA) from the previously computed possible combinations.

Any DNA sequences can be entered. However, the code only works with at least 2 and no more than 5 DNA sequences that all have the same length (without gaps). The length itself is not limited, but it must be a multiple of 3.

Algorithm of filtering

Since only three codons signal “stop” (TAA, TAG, TGA), I use the following algorithm to filter them out. Suppose the length of each sequence is L. I split the sequences into consecutive subsequences of 3 sites, so there are L/3 subsequences, each with three columns: first, second and third. Going through each column gives the following four cases. Here “a” means “at least one”; “A or G” is read as only one of the two occurring in the column, and “A and G” as both occurring.

1) If there is a “T” in the first column, an “A” or “G” in the second, and an “A” or “G” in the third, then the subsequence has 1 stop codon.
2) If there is a “T” in the first column, an “A” or “G” in the second, and an “A” and “G” in the third, then the subsequence has 2 stop codons.
3) If there is a “T” in the first column, an “A” and “G” in the second, and an “A” or “G” in the third, then the subsequence has 2 stop codons.
4) If there is a “T” in the first column, an “A” and “G” in the second, and an “A” and “G” in the third, then the subsequence has 3 stop codons.

TGG encodes tryptophan and is not a stop codon, so the counts in cases 1) to 3) are upper bounds. The exact number for a subsequence is the number of the three stop codons whose letters all appear in the corresponding columns.

Finally, sum the number of stop codons over all subsequences. This total is subtracted from the number of combinations before the filter.

Example and result

Consider three sequences, each nine sites long.

Position      1st  2nd  3rd  1st  2nd  3rd  1st  2nd  3rd
Seq 1          A    T    G    T    T    G    A    G    C
Seq 2          T    C    C    A    A    C    G    A    A
Seq 3          T    A    C    C    G    A    T    G    C
Variability    2    3    2    3    3    3    3    2    2

The number of combinations before the filter is:

2*3*2 + 3*3*3 + 3*2*2 = 51

Following the algorithm described above, there are three subsequences, each with three columns. The first subsequence has 1 stop codon (TAG), the second has 3 stop codons (TAA, TAG, TGA), and the third has 2 stop codons (TAA, TGA). So the number of combinations after the filter is:

51 - (1 + 3 + 2) = 45

The output of my Matlab code is as follows:

>> This is a basic filter for stop codon: taa, tag, tga.
Please enter at least 2 and no more than 5 DNA sequences which all have the same length(without gap) and are with a multiple of 3.
How many DNA sequences you want to input: 3
Enter your 1st DNA sequence:atgttgagc
Enter your 2nd DNA sequence:tccaacgaa
Enter your 3rd DNA sequence:taccgatgc
The length of your DNA sequence is:
length = 9
Number of possible variations before the filter:
S= 51
Number of possible variations after the filter:
X= 45
>>

Discussion

The use of this code is limited since it is only a basic filter. The number of input DNA sequences (currently 2 to 5) can easily be increased to any number, at the cost of lengthening the code.
I was unable to write the code as a working loop because of the limitations of Matlab. The filter can also be improved by considering more features at the amino acid level, not only the stop codon restriction. Moreover, I use only one probe length, three, in this project. It would be better to implement probes of different lengths. However, this project concentrates on filtering by a particular feature at the amino acid level rather than on optimization using different probes.
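
The counting procedure from the section on the filtering algorithm can also be written as a loop over any number of sequences. The following is a minimal sketch in Python, not the Matlab code used for this project; the function names `column_sets` and `count_combinations` are hypothetical. As an assumption of the sketch, it checks each of the three stop codons directly against the observed columns instead of going through the four cases, so a subsequence that could only form TGG is not counted.

```python
from math import prod

STOP_CODONS = ("TAA", "TAG", "TGA")


def column_sets(sequences):
    """Return, for each site, the set of nucleotides observed at that site."""
    seqs = [s.upper() for s in sequences]
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs) or length % 3 != 0:
        raise ValueError("sequences must share one length that is a multiple of 3")
    return [set(column) for column in zip(*seqs)]


def count_combinations(sequences):
    """Return (before, after): combinations before and after the stop-codon filter."""
    cols = column_sets(sequences)
    before = 0
    stops = 0
    for i in range(0, len(cols), 3):
        codon_cols = cols[i:i + 3]
        # Codons that can be built from the variability observed at the 3 sites.
        before += prod(len(c) for c in codon_cols)
        # A stop codon is possible only if each of its letters was observed
        # in the matching column.
        stops += sum(
            all(base in col for base, col in zip(stop, codon_cols))
            for stop in STOP_CODONS
        )
    return before, before - stops


if __name__ == "__main__":
    example = ["atgttgagc", "tccaacgaa", "taccgatgc"]
    print(count_combinations(example))  # (51, 45)
```

Run on the three example sequences, the sketch reproduces the values reported above: 51 combinations before the filter and 45 after. It places no upper limit on the number of input sequences.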