Download Improvement of CpG island search algorithm and its

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Site-specific recombinase technology wikipedia , lookup

Cancer epigenetics wikipedia , lookup

Genome evolution wikipedia , lookup

Bisulfite sequencing wikipedia , lookup

Transcript
Improvement of the CpG Island Search Algorithm and its Application as a Gene
Marker in Vertebrate Genomes
Emily Mitchell
Introduction:
CpG islands are regions of mammalian genomes in which there is a higher than
average percentage of guanine and cytosine and in which methylation has not occurred.
Most of the genomes’ CpG dinucleotides are methylated, but roughly 60% of human
(Robinson et al. 2004) and 50% of all mammalian genes (Ioshikhes and Zhang 2000)
contain unmethylated CpG islands which mark the promoter and exonic regions of genes.
CpG islands are important for gene expression; studies show that methylation of CpG
islands plays a significant role in gene silencing (Antequera 2003, Takai and Jones 2002).
The fact that CpG islands can be identified from introns makes CpG islands useful as
the most dependable predictors of gene location (Antequera 2003, Hannenhalli and Levy
2001). Using other signals around the transcriptional start site in addition to CpG islands
to identify promoters does not improve the accuracy of the prediction (Hannenhalli and
Levy 2001). However, the criteria for identifying a CpG island were set down by
Gardiner-Garden and Frommer in 1987, before the genome of any mammal had been
mapped. The original criteria are: a 200 base pair length of DNA with a G+C content of
at least 50% and an observed–expected CpG ratio of at least 0.6. The results of applying
Gardiner-Garden and Frommer’s criteria to a genome include not only legitimate CpG
islands, but also many areas which aren’t true CpG islands, like Alu repeats (Takai and
Jones 2002, 2003). These extraneous results make the data difficult to use.
In addition, current computer programs which search genomes for CpG islands and
are available on the Web are slow—it takes roughly one month to locate the CpG islands
in the human genome using the CpG Island Searcher. That and the CpG Island
Searcher’s simple and limited user interface mean that users are unable to capitalize on
this program by using it to find the best parameters—parameters which would allow
users to locate all of the CpG islands and none of the junk.
Project Plan:
This summer I propose to write a new program to locate CpG islands in the human,
chimpanzee, mouse, rat, and dog genomes, all of which are available to the public at
http://genome.ucsc.edu/cgi-bin/hgGateway. The proposed program will use Takai and
Jones’s sliding-window algorithm for the location of CpG islands. Once it works the
program will be modified to improve on the algorithm. Takai and Jones’s algorithm
involves checking the percentage of C and G (as well as the other CpG island criteria) in
a window—200 base pairs, according to their criteria—at the start of a contig. If that
window doesn’t meet the criteria, the window slides one bp to the right and evaluates
again. If the window does meet the criteria, another window is created at the first’s
boundary on the 3’ side and evaluated. If this second (or third, or fourth…) window does
not meet the criteria, the window slides back toward the 5’ end until it does meet the
criteria. Then the large CpG island is evaluated to make sure it, as a whole, meets the
criteria. If it does not, base pairs are trimmed from either end until it does. Any two
islands that are less than 200 bps apart on the contig are united into a single larger island
(Takai and Jones 2002).
Takai and Jones’s algorithm wastes time by evaluating more windows than it needs to.
One of the improvements to be made to their algorithm is to find and implement a better,
faster way of locating potential CpG island sites instead of wasting time creeping along
the contig. All modifications made would be to enable the program run with increased
efficiency and/or accuracy.
Because the new program will be coded in C++ instead of Perl, its runtime should be
faster than the CpG Island Searcher’s even without any algorithmic changes. The new
program will also allow users to set their own parameters for what constitutes a CpG
island, so that users will not be limited to anyone else’s criteria.
Goals:
Once the program works, we will be able to run tests to find out what parameters are
best for locating CpG islands in vertebrate genomes. Modifications can then be made so
that the program is more likely to find CpG islands in the promoter region of genes in the
various genomes, increasing its accuracy. By the end of next summer, data will be
assembled and we will, hopefully, have created a database to store it in. Giving the
database a user-friendly interface will allow users to find pertinent information without
having to run the program again.
Resources:
Antequera, F. 2003. “Structure, function and evolution of CpG island promoters.”
CMLS.
Antequera, F., and Adrian Bird. 1993. “Number of CpG islands and genes in human
and mouse.” Proc. Natl. Acad. Sci. USA. 90: 11995-9.
Hannenhali, S., and Samuel Levy. 2001. “Promoter prediction in the human genome.”
Bioinformatics. 17 Suppl.
Ioshikhes, I. P. and Michael Q. Zhang. 2000. “Large-scale human promoter mapping
using CpG islands.” Nature Genetics. 26.
Ponger, L., and Dominique Mouchiroud. 2002. “CpGProD: identifying CpG islands
associated with transcription start sites in large genomic mammalian sequences.”
Bioinformatics. 18.4.
Robinson, P. N. et al. 2004. “Gene-Ontology analysis reveals association of tissuespecific 5’ CpG-island genes with development and embryogenesis.” Human
Molecular Genetics. 1969-78.
Takai, D., and Peter Jones. 2002. “Comprehensive analysis of CpG islands in human
chromosomes 21 and 22.” PNAS.
Takai, D., and Peter Jones. 2003. “The CpG Island Searcher: A New WWW Resource.”
In Silico Biology 3, 0021.
Yamashita, R., et al. 2005. “Genome-wide analysis reveals strong correlation between
CpG islands with nearby transcription start sites of genes and their tissue specificity.”
Gene 350: 129–136.