Improvement of the CpG Island Search Algorithm and its Application as a Gene Marker in Vertebrate Genomes
Emily Mitchell

Introduction:
CpG islands are regions of mammalian genomes in which the percentage of guanine and cytosine is higher than average and in which methylation has not occurred. Most of a genome's CpG dinucleotides are methylated, but roughly 60% of human genes (Robinson et al. 2004) and 50% of all mammalian genes (Ioshikhes and Zhang 2000) contain unmethylated CpG islands, which mark the promoter and exonic regions of genes. CpG islands are important for gene expression; studies show that methylation of CpG islands plays a significant role in gene silencing (Antequera 2003, Takai and Jones 2002). The fact that CpG islands can be identified from introns makes them useful as the most dependable predictors of gene location (Antequera 2003, Hannenhalli and Levy 2001). Using other signals around the transcriptional start site in addition to CpG islands to identify promoters does not improve the accuracy of the prediction (Hannenhalli and Levy 2001).

However, the criteria for identifying a CpG island were set down by Gardiner-Garden and Frommer in 1987, before the genome of any mammal had been mapped. The original criteria are: a stretch of DNA 200 base pairs long with a G+C content of at least 50% and an observed-to-expected CpG ratio of at least 0.6. Applying Gardiner-Garden and Frommer's criteria to a genome returns not only legitimate CpG islands but also many regions that are not true CpG islands, such as Alu repeats (Takai and Jones 2002, 2003). These extraneous results make the data difficult to use. In addition, the programs currently available on the Web for searching genomes for CpG islands are slow: it takes roughly one month to locate the CpG islands in the human genome using the CpG Island Searcher.
That, together with the CpG Island Searcher's simple and limited user interface, means that users cannot use the program to search for the best parameters, that is, parameters that would locate all of the CpG islands and none of the spurious hits.

Project Plan:
This summer I propose to write a new program to locate CpG islands in the human, chimpanzee, mouse, rat, and dog genomes, all of which are available to the public at http://genome.ucsc.edu/cgi-bin/hgGateway. The proposed program will use Takai and Jones's sliding-window algorithm for locating CpG islands. Once it works, the program will be modified to improve on the algorithm.

Takai and Jones's algorithm checks the percentage of C and G (as well as the other CpG island criteria) in a window at the start of a contig; the window is 200 base pairs long according to their criteria. If that window does not meet the criteria, the window slides one bp to the right and is evaluated again. If the window does meet the criteria, another window is created at the first window's boundary on the 3' side and evaluated. If this second (or third, or fourth, and so on) window does not meet the criteria, it slides back toward the 5' end until it does. Then the large CpG island is evaluated to make sure that it meets the criteria as a whole. If it does not, base pairs are trimmed from either end until it does. Any two islands that are less than 200 bp apart on the contig are united into a single larger island (Takai and Jones 2002).

Takai and Jones's algorithm wastes time by evaluating more windows than it needs to. One of the improvements to be made to their algorithm is to find and implement a better, faster way of locating potential CpG island sites instead of creeping along the contig one base pair at a time. All modifications would be made to enable the program to run with increased efficiency and/or accuracy.
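The sliding-window procedure described above can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not Takai and Jones's implementation or the proposed program: the names CpgCriteria, Island, meets_criteria, and find_islands are hypothetical, the sequence is assumed to be uppercase A/C/G/T, the observed-to-expected ratio is assumed to follow the commonly used definition (number of CpG × length) / (number of C × number of G), and merged islands are not re-checked against the criteria.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, user-settable parameters; the defaults are the
// Gardiner-Garden and Frommer values quoted in the text.
struct CpgCriteria {
    std::size_t window = 200;   // window length in bp
    double min_gc = 0.50;       // minimum G+C fraction
    double min_obs_exp = 0.60;  // minimum observed/expected CpG ratio
    std::size_t max_gap = 200;  // islands closer than this are merged
};

// Half-open interval [begin, end) on the contig.
struct Island {
    std::size_t begin;
    std::size_t end;
};

// Returns true if seq[begin, end) satisfies the G+C and obs/exp thresholds.
// Assumed formula: obs/exp = (#CpG * length) / (#C * #G).
bool meets_criteria(const std::string& seq, std::size_t begin,
                    std::size_t end, const CpgCriteria& c) {
    std::size_t n_c = 0, n_g = 0, n_cpg = 0;
    for (std::size_t i = begin; i < end; ++i) {
        if (seq[i] == 'C') {
            ++n_c;
            if (i + 1 < end && seq[i + 1] == 'G') ++n_cpg;
        } else if (seq[i] == 'G') {
            ++n_g;
        }
    }
    if (n_c == 0 || n_g == 0) return false;  // ratio undefined
    const double len = static_cast<double>(end - begin);
    const double gc = (n_c + n_g) / len;
    const double obs_exp = n_cpg * len / (static_cast<double>(n_c) * n_g);
    return gc >= c.min_gc && obs_exp >= c.min_obs_exp;
}

std::vector<Island> find_islands(const std::string& seq,
                                 const CpgCriteria& c = CpgCriteria{}) {
    std::vector<Island> islands;
    const std::size_t n = seq.size(), w = c.window;
    std::size_t i = 0;
    while (w > 0 && i + w <= n) {
        // Slide one bp to the right until a window qualifies.
        if (!meets_criteria(seq, i, i + w, c)) { ++i; continue; }
        std::size_t start = i, end = i + w;
        // Extend with adjacent windows on the 3' side.
        while (end + w <= n) {
            if (meets_criteria(seq, end, end + w, c)) { end += w; continue; }
            // Slide the failing window back toward the 5' end until it passes;
            // a shift of w means it never passed, so the island is not extended.
            std::size_t shift = 1;
            while (shift < w &&
                   !meets_criteria(seq, end - shift, end + w - shift, c)) {
                ++shift;
            }
            end += w - shift;
            break;
        }
        // Trim one bp from each end until the whole island qualifies.
        while (end - start >= w + 2 && !meets_criteria(seq, start, end, c)) {
            ++start;
            --end;
        }
        i = end;  // resume scanning after this candidate
        if (!meets_criteria(seq, start, end, c)) continue;
        // Unite islands separated by fewer than max_gap bp.
        if (!islands.empty() && start - islands.back().end < c.max_gap) {
            islands.back().end = end;
        } else {
            islands.push_back({start, end});
        }
    }
    return islands;
}
```

Calling find_islands(contig) with the default criteria returns the islands as half-open intervals; passing a modified CpgCriteria changes the thresholds. Each call to meets_criteria recounts the whole window, which is the kind of repeated work the paragraph above identifies as wasteful; keeping running counts as the window moves is one possible way to avoid it.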
Because the new program will be coded in C++ instead of Perl, it should run faster than the CpG Island Searcher even without any algorithmic changes. The new program will also allow users to set their own parameters for what constitutes a CpG island, so that users will not be limited to anyone else's criteria.

Goals:
Once the program works, we will be able to run tests to find out which parameters are best for locating CpG islands in vertebrate genomes. Modifications can then be made so that the program is more likely to find CpG islands in the promoter regions of genes in the various genomes, increasing its accuracy. By the end of next summer, the data will be assembled and we will, we hope, have created a database in which to store it. Giving the database a user-friendly interface will allow users to find pertinent information without having to run the program again.

Resources:
Antequera, F. 2003. “Structure, function and evolution of CpG island promoters.” CMLS.
Antequera, F., and Adrian Bird. 1993. “Number of CpG islands and genes in human and mouse.” Proc. Natl. Acad. Sci. USA. 90: 11995-9.
Hannenhalli, S., and Samuel Levy. 2001. “Promoter prediction in the human genome.” Bioinformatics. 17 Suppl.
Ioshikhes, I. P., and Michael Q. Zhang. 2000. “Large-scale human promoter mapping using CpG islands.” Nature Genetics. 26.
Ponger, L., and Dominique Mouchiroud. 2002. “CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences.” Bioinformatics. 18.4.
Robinson, P. N., et al. 2004. “Gene-Ontology analysis reveals association of tissue-specific 5’ CpG-island genes with development and embryogenesis.” Human Molecular Genetics. 1969-78.
Takai, D., and Peter Jones. 2002. “Comprehensive analysis of CpG islands in human chromosomes 21 and 22.” PNAS.
Takai, D., and Peter Jones. 2003. “The CpG Island Searcher: A New WWW Resource.” In Silico Biology 3, 0021.
Yamashita, R., et al. 2005. “Genome-wide analysis reveals strong correlation between CpG islands with nearby transcription start sites of genes and their tissue specificity.” Gene 350: 129-136.