Project Report
for the course Introduction to Bioinformatics
Xinyang Liu
Description
This project originated from the paper by Olivier Fedrigo and Gavin Naylor, "A
gene-specific DNA sequencing chip for exploring molecular evolutionary change."
The main idea of the paper is a novel gene-specific DNA chip algorithm that
reduces the number of possible oligonucleotide combinations for
sequencing-by-hybridization (SBH) approaches. The algorithm has five parts:
1) documenting variability;
2) computing permutations based on observed variability;
3) filtering at the amino acid level;
4) optimizing the length of oligonucleotides;
5) computing the probes.
In the third part, the authors implement a filter that excludes any
combination of nucleotides that results in an amino acid that was not present at a
particular position in the training set.
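To make the idea of this third part concrete, here is a minimal Python sketch for a single codon position. It is not the implementation from the paper and not the Matlab code of this project. The names `CODON_TABLE` and `filter_codons` are hypothetical, and the sketch assumes the standard genetic code and input restricted to A, C, G and T.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def filter_codons(training_codons):
    """Keep nucleotide combinations whose amino acid was observed in training."""
    codons = [c.upper() for c in training_codons]
    # Amino acids seen at this position in the training set.
    observed = {CODON_TABLE[c] for c in codons}
    # Nucleotides seen in each of the three columns of the codon.
    columns = [sorted({c[k] for c in codons}) for k in range(3)]
    candidates = ("".join(p) for p in product(*columns))
    return [c for c in candidates if CODON_TABLE[c] in observed]
```

For the training codons ATG, TCC and TAC, the observed variability allows 2*3*2 = 12 combinations, and only those coding for M, S or Y survive the filter.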
Following the same idea, the purpose of this project is to create a basic filter for
one feature at the amino acid level. I chose Matlab to code the filter, and my goal
is to remove the stop codons (TAA, TAG, TGA) from the previously computed possible
combinations. Any DNA sequences can be given as input. However, the code only works
with at least 2 and no more than 5 DNA sequences, which must all have the same
length (without gaps; the length itself is not limited), and that length must be a
multiple of 3.
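The input constraints above can be checked before any counting is done. The following is a minimal Python sketch, not the Matlab code of this project. The function name `validate_sequences` is hypothetical, and rejecting every character other than A, C, G and T is an assumption that goes slightly beyond the "without gap" requirement.

```python
def validate_sequences(sequences):
    """Check the stated input constraints; raise ValueError if one fails."""
    if not 2 <= len(sequences) <= 5:
        raise ValueError("expected between 2 and 5 sequences")
    cleaned = [s.strip().upper() for s in sequences]
    length = len(cleaned[0])
    if length == 0 or length % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    for s in cleaned:
        if len(s) != length:
            raise ValueError("all sequences must have the same length")
        # Assumption: only A, C, G, T are allowed; this also rejects gaps ('-').
        if set(s) - set("ACGT"):
            raise ValueError("sequences may only contain A, C, G, T")
    return cleaned
```

The function returns the sequences in upper case so that later steps do not have to deal with mixed case.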
Algorithm of filtering
Since only three codons signal a stop (TAA, TAG, TGA), I use the
following algorithm to filter them out.
Suppose the length of each sequence is L. I split the sequences into consecutive
subsequences of length 3, so there are L/3 subsequences, each with three
columns: first, second, third. Going through each column gives the following four
cases (here “a” or “an” means “at least one”):
1) If there is a “T” in the first column and an “A” or “G” in the second and an “A” or
“G” in the third, then we have 1 stop codon in this subsequence.
2) If there is a “T” in the first column and an “A” or “G” in the second and an “A”
and “G” in the third, then we have 2 stop codons in this subsequence.
3) If there is a “T” in the first column and an “A” and “G” in the second and an “A”
or “G” in the third, then we have 2 stop codons in this subsequence.
4) If there is a “T” in the first column and an “A” and “G” in the second and an “A”
and “G” in the third, then we have 3 stop codons in this subsequence.
Finally, sum the number of stop codons over all subsequences; this total is
subtracted from the number of combinations computed before the filter.
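The per-subsequence count can also be obtained by checking directly which of the three stop codons can be assembled from the nucleotides observed in each column. The Python sketch below takes that route; it is an illustration and not the Matlab code of this project, and the names `STOP_CODONS` and `count_stop_codons` are hypothetical. Because it tests only TAA, TAG and TGA, it returns 0 when the second and third columns contain nothing but G (TGG codes for tryptophan), which may differ from a literal reading of the four cases.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def count_stop_codons(first, second, third):
    """Count stop codons that can be built from the nucleotides in each column."""
    # Each argument holds the nucleotides of one column, one per sequence.
    columns = (set(first.upper()), set(second.upper()), set(third.upper()))
    return sum(
        all(base in column for base, column in zip(codon, columns))
        for codon in STOP_CODONS
    )
```

For instance, `count_stop_codons("TAC", "TAG", "GCA")` returns 3, because T is present in the first column and both A and G are present in the second and third.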
Example and result
Let’s take an example of three sequences with the length of nine sites.
Position      1st  2nd  3rd  1st  2nd  3rd  1st  2nd  3rd
Seq 1          A    T    G    T    T    G    A    G    C
Seq 2          T    C    C    A    A    C    G    A    A
Seq 3          T    A    C    C    G    A    T    G    C
Variability    2    3    2    3    3    3    3    2    2
The number of combinations before the filter is:
2*3*2+3*3*3+3*2*2=51
Following the algorithm described above, there are three subsequences, each with
three columns.
The first subsequence has 1 stop codon (TAG).
The second subsequence has 3 stop codons (TAA, TAG, TGA).
The third subsequence has 2 stop codons (TAA, TGA).
So the number of combinations after the filter is:
51-(1+3+2)=45
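The arithmetic of this example can be reproduced with a short Python sketch. It is not the Matlab code of this project. The function name `count_combinations` is hypothetical, and the stop codons are counted by a direct check against TAA, TAG and TGA.

```python
from math import prod

STOP_CODONS = ("TAA", "TAG", "TGA")


def count_combinations(sequences):
    """Return (before, after): combinations summed over codons, without and with the filter."""
    seqs = [s.upper() for s in sequences]
    before = 0
    stops = 0
    for start in range(0, len(seqs[0]), 3):
        # Set of nucleotides observed in each of the three columns of this codon.
        columns = [{s[start + k] for s in seqs} for k in range(3)]
        # Product of the column variabilities, e.g. 2*3*2 for the first codon.
        before += prod(len(c) for c in columns)
        # Stop codons that can be assembled from the observed nucleotides.
        stops += sum(
            all(base in col for base, col in zip(codon, columns))
            for codon in STOP_CODONS
        )
    return before, before - stops


print(count_combinations(["atgttgagc", "tccaacgaa", "taccgatgc"]))  # (51, 45)
```

Because the columns are built with a set comprehension over all sequences, the same function works for any number of input sequences.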
The output of my Matlab code is as follows:
>>
This is a basic filter for stop codon: taa, tag, tga.
Please enter at least 2 and no more than 5 DNA sequences which all have
the same length(without gap) and are with a multiple of 3.
How many DNA sequences you want to input: 3
Enter your 1st DNA sequence:atgttgagc
Enter your 2nd DNA sequence:tccaacgaa
Enter your 3rd DNA sequence:taccgatgc
The length of your DNA sequence is:
length =
9
Number of possible variations before the filter:
S=
51
Number of possible variations after the filter:
X=
45
>>
Discussion
The use of this code is limited, since it is only a basic filter. The number of
input DNA sequences (currently 2 to 5) can easily be increased to any number at
the cost of longer code. I did not succeed in writing the code as a general loop,
which I attribute to limitations of Matlab. The filter could also be improved by
considering more features at the amino acid level, not only the stop codon
restriction. Moreover, I use only one probe, of length three, in this project; it
would be better to implement more probes with different lengths. However, this
project concentrates on filtering by a certain feature at the amino acid level
rather than on optimization using different probes.