DotPlot Software Analysis
Module Code: BE542
Module Name: Comparative Genomics and Phylogenomics
Name: Ruairi Lennon
Student No.: 52585731
Declaration:
I, Ruairi Lennon, declare that the project I submit is my own work, subject to findings obtained from
the use of the program developed specifically for this purpose. Plagiarism and poor scientific
practice have been avoided as far as possible. The software is to be open source, and the results and
methods will be referenced to previous similar work in the community.
Signed:
__________________
Ruairi Lennon
Date:
__________________
Dotplot Outputs:
Figure 1: (a) TCH3 vs TCH3 at window size 40, threshold 60. Three repeat sequences can clearly be
seen. (b) TCH3 (vertical) vs CAM3 (horizontal) at window size 40, threshold 60. Each of the three TCH3
repeat sequences matches a single similar sequence in CAM3, and each match is split in half by one large
CAM3 intron. This sequence appears only once in CAM3 and is split in half at the same point. (c) The
LDL receptor amino acid sequence compared to itself at window size 23 and threshold 39 using PAM10.
Repeats of the first section of the sequence can clearly be seen.
The raw output is a plot in .txt format, with hits marked as “\” and misses marked as “ ” (a single
whitespace). Since no GUI was built into the dotplot software (Perl was not designed for GUI work),
the data was visualised by one of two methods: in Excel or in Notepad++.
Before Excel was used, the first row (containing the list of characters in the first sequence) was
processed by a script called delimiter_excel.pl, which adds a single-whitespace delimiter to it. The
script is run on a file called TCH3_row.txt, into which the row is copied from the first few lines of
the results file, and it writes to output.txt. The formatted row can then be pasted into Excel and
split with the text-to-columns function, delimiting on the whitespace added previously. All data
following the first character row is then copied directly into the same spreadsheet without prior
formatting, to preserve the data; it may also need to be split by text to columns using whitespace as
the delimiter.
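The delimiting step described above can be illustrated with a short sketch. The report does not show delimiter_excel.pl itself, so this is only a guess at its effect, written in Python rather than Perl: it assumes the script inserts one space between every character of the header row. The function name `delimit_row` and the example row are hypothetical.

```python
def delimit_row(row: str, sep: str = " ") -> str:
    """Insert `sep` between every character of `row`.

    Assumption: this mirrors what delimiter_excel.pl does to the header
    row, so that a spreadsheet can split it into one character per cell.
    """
    return sep.join(row.rstrip("\n"))


if __name__ == "__main__":
    # Hypothetical header row; the real one would be copied from the
    # first lines of the dotplot results file.
    print(delimit_row("ATGGCG"))
```

With a row delimited this way, the text-to-columns step places each sequence character in its own column above the plot data.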
For both Excel and Notepad++, the data can be printed to PDF (using a PDF printer driver) on a
large paper size and scaled down (zoomed out) as far as needed to fit on one page. The resulting
PDF should contain the dotplots shown in Figure 1.
Other features of the software include the following:
Figure 2 (left): A printout in text format recording the settings and inputs for the program, which
can easily be reported for repeated runs. Any window size, threshold and scoring matrix can be used
(as long as the scoring matrix can be found in the same directory), and the program runs on either
protein or nucleotide sequences with the relevant scoring matrices. The software will not extend the
window, reiterate runs or run second hits. The program is capable of running on computers with
512 MB of RAM (computers in labs set aside for this purpose may be second-hand, older-specification
machines) for the analysis of sequences of less than 3 kb each, and will run on any operating system
with Perl installed. Speed is the limiting factor, since Perl is usually slow; it is improved by the
use of a single matrix, appropriate for loops and recursion to avoid unnecessary tasks. The main
issues with this program are the lack of a GUI and heavy RAM usage for larger sequence comparisons
(the LDLR nucleotide sequences could not be compared because of memory shortages on a PC with 2 GB of
RAM). To solve this, Java is recommended, since it is compiled, faster and should take less memory;
JDotter, found online, is one example.
Analysis of gene and intron evolution
(a) CAM3 and TCH3 in Arabidopsis thaliana.
The results in Figure 1 (a) and (b) show that CAM3 and TCH3 contain a similar section, and that
this sequence is repeated three times in TCH3. We also see that CAM3 has one intron roughly 400
bases long, judging by the gap between the matching sequences. Since the gap in the matching sequence
occurs on both axes, it strongly indicates that both sequences contain an intron at this location. It
is not possible from this plot alone, however, to determine whether the introns are the same size or
homologous. Also, since TCH3 contains three copies of the sequence found in CAM3, and each copy in
TCH3 contains an intron, whether these TCH3 introns are related to one another is another question
raised by the dotplots. By eye, the vertical gaps between the matching sequences are all roughly the
same length. Coupled with the horizontal gap, this indicates that some kind of gene duplication
occurred twice going from CAM3 to TCH3, that the intron size was reduced significantly, and that the
duplication events involved the introns to a large extent. This can be tested by aligning the intron
sequences, and further work in the last few years [1] has indeed found this to be the case: two
introns in TCH3 are similar in sequence. The dotplot is highly indicative of the evolution of the
TCH3 gene, including its introns, from CAM3 and suggests hypotheses to be tested by further analysis.
(b) It is clear that there are many repeats at the N terminus of the LDL receptor protein. The
repeats are all roughly 40 amino acids in length, and they seem highly unlikely to be random, since
from about 300 amino acids into the sequence not a single repeat appears. The cause of these repeats
is of as much interest as their function. All of the repeats have a cysteine within the first 3-6
residues from their beginning (highlighted in red in the Excel spreadsheet). At least two of them
begin with a glutamic acid, which is less significant than the cysteine, since cysteine plays a very
important structural role. The repeats could be generated by direct transcription and direct
translation, meaning that the cDNA for the receptor protein would be identical to the nuclear DNA,
which is unlikely. Given that the number of human genes does not match the number of human proteins,
and that human gene expression is heavily tissue specific and dependent on intron/exon usage, it
seems far more likely that exon shuffling builds up the mRNA: the one gene may be transcribed
repeatedly and consecutively into the one mRNA, and direct translation of that mRNA would produce a
protein containing the repeats. This dotplot highlights exon shuffling as the likely scenario and
suggests further hypotheses to be tested. Indeed, current research shows that the repeats are exon
based [2], that they function to bind lipoproteins, and that similar repeats occur in VLDL [3]. Since
the dotplot shows 12 repeats at this window size and threshold, we can assume there are up to 12
exons, or possibly more. It would have been ideal to compare the cDNA of this protein with the
complete CDS in the human chromosome in the dotplot analysis, or even to compare the 51 kb
nucleotide sequence with itself (which is too computationally demanding for this program on a PC
with 2 GB of RAM), to see the exon and intron boundaries.
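The memory limit on the 51 kb self-comparison can be made concrete with a rough calculation. This is a minimal sketch, assuming the program holds one cell per pair of sequence positions in a single in-memory matrix and, optimistically, one byte per cell; the function name `matrix_bytes` and the per-cell size are illustrative assumptions, not measurements of the Perl program.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def matrix_bytes(len_x: int, len_y: int, bytes_per_cell: int = 1) -> int:
    """Bytes needed to hold a full len_x-by-len_y dotplot matrix."""
    return len_x * len_y * bytes_per_cell


if __name__ == "__main__":
    # Sequence lengths taken from the report: under 3 kb for normal runs,
    # 51 kb for the full LDLR nucleotide sequence.
    for label, n in (("3 kb", 3_000), ("51 kb", 51_000)):
        print(f"{label} vs itself: {matrix_bytes(n, n) / GIB:.2f} GiB")
```

Under these assumptions a 3 kb self-comparison needs 9,000,000 cells, while the 51 kb self-comparison needs 2,601,000,000 cells, which already exceeds 2 GB at one byte per cell. A scripting-language data structure would normally use considerably more than one byte per cell, so the real requirement is likely to be higher still.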
References:
[1] David G. Knowles; Aoife McLysaght. “High rate of recent intron gain and loss in simultaneously
duplicated Arabidopsis genes”. Mol. Biol. Evol. 23(8):1548–1557. 2006
[2] T. Maruyama, Y. Miyake, Y. Toyota, M. Harada-Shiba, T. Yamamura and A. Yamamoto. “A point mutation
in splice donor site of intron 12 of LDLR receptor gene causing exon skipping, alternative splicing
and read through: a high-frequency mutation in Japan”. Atherosclerosis 109(1-2):110-111. 1994
[3] Fass D, Blacklow S, Kim PS, Berger JM. “Molecular basis of familial hypercholesterolaemia from
structure of LDL receptor module”. Nature 388(6643):691-693. 1997