Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
DotPlot Software Analysis Module Code: BE542 Module Name: Comparative Genomics and Phylogenomics Name: Ruairi Lennon Student No. : 52585731 Declaration: I, Ruairi Lennon, submit my project as declared as my own work, subject to findings obtained from the use of the program developed specifically for this purpose. Plagarism and poor scientific practices have been avoided as far as possible. The software is to be open-source and the results and methods will be referenced to previous similar work in the community. Signed: __________________ Ruairi Lennon Date: __________________ Dotplot Outputs : (a) (b) (c) Figure1 :(a) Showing TCH3 vs TCH3 at window size 40 Threshold 60. Three repeat sequences can clearly be seen. (b) TCH3 (vertical) Vs CAM3 (horizontal) at window size 40 threshold 60. The three TCH3 repeat sequences are found as a similar single sequence in CAM3 and are each split in half by one large CAM3 intron. This sequence appears only once in CAM3 and is also split in half at the same point. (c) The LDL-receptor AA sequence compared to itself at windows size 23 and threshold 39 using PAM10. Repeats of the first section of the sequence are clear to be seen. The raw outputs come as a graph in a txt file format with hits annotated as “\” and mishits annotated with “ “(one whitespace). Since no GUI was built into the dotplot software (since Perl wasn’t built for GUI), the data was visualised by either of two methods: in excel or notepad++. Before using Excel, the data was quickly transformed by a script called delimiter_excel.pl that adds one whitespace for the first row (containing the list of characters in the first sequence). This is run on a file called TCH3_row.txt where the row is copied from the first few lines of the results file and outputted to the output.txt file. This formatted row can then be easily copied into excel using the text to columns function delimiting by the whitespace added previously. All data following the first character row is then copied directly into the same spreadsheet without being formatted to preserve the data, and the data may need to be formatted by text to columns using whitespace as a delimiter. For both Excel and Notepad++ the data could be printed as a pdf (using a pdf printer driver) on a large paper sized and scaled down (zoomed out) as low as needed to reduce the pages needed to one page and the pdf created should contain the dotplot above. Other features of the software include the following: Figure 2.(left) A printout in text format recording the settings and inputs for the program which can easily be reported for repeated runs. Any window size, threshold and scoring matrix can be used (as long as the scoring matrix can be found in the same directory), and the program runs both protein or nucleotide sequences with the relevant scoring matrixes. The software will not extend the window, reiterate runs or run second hits. The program is capable of running on computers with 512MB RAM (considering computers in labs specifically for this purpose maybe second hand older spec) for the analysis of sequences less than 3kb in size each, and will run with any PERL-ready operating systems. The speed is the factor, considering PERL is usually slow, sped up by the use of a single matrix, the appropriate FOR Loops and recursion to avoid unnessary tasks. The issue with this program is mainly lack of GUI and heavy RAM usage for larger sequence comparisons (could not compare LDLR nucleotide sequences because of memory shortages on a 2GB RAM PC). To solve this JAVA is recommended since it is compiled, faster and should take less memory, for example JDotter found online. Analysis of gene and intron evolution (a). CAM3 and TCH3 in Arabidopsis thaliana. The results in Figure 1 (a) and (b) show that the CAM3 and TCH3 contain a similar section, and that this is sequence is repeated three times in TCH3. We also see that CAM3 has one intron roughly 400 base long due to the gap of the matching sequences. Since the gap in the matching sequence occurs in both axes, it indicates strongly that both sequences contain an intron at this location. It is not possible from this plot alone however to determine whether or not the intron is the same size or homologous. Also, since TCH3 contains three sequences compared to the CAM3, and each one in TCH3 contains an intron, whether or not these TCH3 introns are self-related is another question posed from the dotplots above. It can be seen that the gap in the verticals between the matching sequences are all roughly the same length by eye, and coupled with the gap in the horizontals, indicates some kind of gene duplication twice going from CAM3 to TCH3, reducing the intron size significantly, and the duplication events involved the introns to a large extent. To test this, the sequences of the introns can be aligned to check for this, and indeed has been found to be the case by further work in the last few years[1], where two introns in TCH3 are found to be sequentially similar. The dotplot above is highly indicative of the evolution of the TCH3 gene including introns from CAM3 and suggests further hypothesis testing by further analysis. (‘b) It is clear that we have many repeats at the N terminus of the LDL receptor protein. The sequences are all roughly around 40 amino acids in length, and seems highly unlikely to be random since from about 300 amino acids into the sequence there appears not one repeat afterwards. The cause of these repeats is as much of interest as its function itself. All of the repeats have a cysteine in the first 3-6 bases from their beginning(highlighted in red in the excel spreadsheet). At least two of them begin with a glutamic acid, which bears less meaning as the cysteine since cysteine serves as a very important structural base. The generation of the repeats could be by direct transcription and direct translation, meaning the cDNA for the receptor protein would be identical to the nuclear DNA, which is unlikely. Given that human gene numbers to human protein numbers don’t show that, and human genomics are heavily tissue specific and intron/exon activated, it seems highly more likely that there is exon shuffling building up the mRNA, and that the one gene may be repeatedly transcribed into the one mRNA consecutively, and direct translation of the mRNA from there could end up with the protein that had the repeats. This dotplot highlights the likely scenario for exon shuffling and suggests futher hypothesis to be tested. However, the current research shows that indeed the repeats are exon based[2], and that they function to bind lipoproteins and have similar repeats in VLDL[3].Since the dotplot shows 12 repeats at this window size and threshold, we can assume there is up to 12 exons or maybe more. It would have been ideal to compare the cDNA of this protein to the complete CDS in the human chromosome in the dotplot analysis, or even compare the 51kb nucleotide sequence to itself (which is computationally exhaustive on a 2GB processor by this program) to see the exon and intron boundaries. References: [1] David G. Knowles; Aoife McLysaght. “High rate of recent intron gain and loss in simultaneously duplicated Arabidopsis genes”. Mol. Biol.Evol 23(8):1548–1557. 2006 [2]. T. Maruvama, Y. Miyake, Y. Toyota, M. Harada-Shiba, T. Yamamura and A. Yamamoto. “A point mutation in splice donor site of intron 12 of LDLR receptor gene causing exon skipping, alternative splicing and read through: a high-frequency mutation in Japan”. Atherosclerosis. 109, (1-2), 110-111. 1994 [3] Fass D , Blacklow S , Kim PS , Berger JM. “Molecular basis of familial hypercholesterolaemia from structure of LDL receptor module.” Nature, 388 (6643) 691-693. 1997