Genome Sequence Analysis

Secondary article

Margaret M DeAngelis, Louisiana State University Health Sciences Center, New Orleans, Louisiana, USA
Mark A Batzer, Louisiana State University Health Sciences Center, New Orleans, Louisiana, USA

Article Contents
Introduction
Exons and Introns
Control Elements
Open Reading Frames
Expressed Sequence Tags
Cross-species Genome Comparison
Pseudogenes
Repetitive Elements
Computer-aided Analyses

The human genome has an estimated 40 000–100 000 genes dispersed throughout 3.5 billion nucleotides of sequence. DNA sequences are inherently complex and a number of computational tools are required to analyse the genomic sequences of eukaryotic, bacterial and model organisms.

Introduction

The human genome has approximately 3.5 billion base pairs (bp) and is an excellent example for the analysis of eukaryotic genomes. The goal of genome research is to sequence each one of these base pairs so that all the genes and regulatory regions in the genome can be located. This information can then be used to facilitate discoveries in the basic and clinical sciences. Thus, the aim of most large-scale sequencing projects is the discovery of new genes in previously uncharacterized or only partially characterized genomic DNA sequences. A gene, which is the basic functional unit of heredity, is typically a specific sequence of nucleotides that carries the information required for making a functional protein or, in some cases, a functional RNA. Several computational methods have been developed for analysing genomic sequences and identifying genes.

Exons and Introns

Exons and introns are important features of the eukaryotic gene. Eukaryotic genes are composed of regulatory sequences, short exons and introns of variable length. Most have their protein-coding regions interrupted by introns that are removed in a process called splicing to generate a mature messenger RNA (mRNA), which is translated into protein.
Exons are the sequences that remain in the final mature mRNA and generally code for a protein. Introns are the sequences that are removed by splicing from the full-length heterogeneous nuclear RNA (hnRNA). However, introns can also interrupt noncoding regions such as the 5’ and 3’ untranslated regions of pre-mRNA. The 5’ untranslated regions are sequences that are transcribed into mRNA but not translated. Usually, translation does not begin until the first AUG sequence (the initiation codon) that appears in the RNA, so that sequences located 5’ to this do not appear in the protein. The 3’ untranslated regions are sequences found in mRNA after the stop codon sequence (UAG, UGA or UAA). Together, the coding segments and the 5’ and 3’ untranslated regions represent the exons.

Control Elements

In general, it is the final protein product of a gene that carries out the function of that gene. Protein production in eukaryotes can be influenced by control elements at the level of transcription. The major control elements of a gene include the promoter and associated basal transcription factor-binding sites, the polyadenylation site, enhancers and accessory transcription factor-binding sites.

Most protein-coding genes are transcribed by RNA polymerase II into RNA. Transcription is initiated in the promoter region by several different factors. The promoter is generally defined as the control element located immediately 5’ to the gene that specifies the start of RNA synthesis. The basal promoter elements are the TATA and the CCAAT sequences. The TATA sequence is found in most protein-coding genes and is important for the positioning of RNA polymerase II for the initiation of transcription. TATA sequences are generally located 25–30 bp 5’ to the transcription start site. Further upstream (70–90 bp) there is often a CCAAT sequence, although this is less common than the TATA sequence. RNA polymerase II interacts with these upstream elements, allowing transcription to proceed.
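The positional rules for the basal promoter elements can be checked with a few lines of code. The following Python sketch reports exact TATA and CCAAT matches that begin within the distance ranges quoted above (25–30 bp and 70–90 bp upstream of a transcription start site). The function name, the 0-based `tss` coordinate and the use of exact string matches are assumptions made for illustration; promoter prediction programs generally score weighted consensus patterns instead of exact strings.

```python
def find_basal_promoter_elements(seq, tss):
    """Report exact TATA and CCAAT matches upstream of a transcription start site.

    seq: DNA string (sense strand); tss: 0-based index of the start site.
    The distance windows follow the ranges quoted in the text.
    """
    seq = seq.upper()
    windows = {"TATA": (25, 30), "CCAAT": (70, 90)}
    hits = {}
    for motif, (near, far) in windows.items():
        found = []
        for dist in range(near, far + 1):
            start = tss - dist  # motif begins `dist` bp upstream of the start site
            if start >= 0 and seq[start:start + len(motif)] == motif:
                found.append(dist)
        hits[motif] = found
    return hits
```

Each returned list holds the upstream distances at which the motif was found; an empty list means no exact match in that window, which does not by itself rule out a promoter.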
Intron splicing is a multiphase process of RNA maturation that takes place in the nucleus to generate mature mRNA molecules for transport into the cytoplasm. The process involves small or heterogeneous ribonucleoprotein particles in a complex structure called the spliceosome. Introns usually begin with the donor sequence GU and terminate with the acceptor sequence AG. These splice donor and acceptor sites are the common recognition features that computer algorithms use to identify putative coding or noncoding DNA sequences.

The polyadenylation signal AAUAAA in the 3’ untranslated region is also a common feature of mRNAs. Most messenger RNAs that code for protein are polyadenylated and contain an additional string of hundreds of adenosine residues at their 3’ end. The process of polyadenylation involves cleavage of pre-mRNA at the 3’ end followed by synthesis of a poly A tract. The adenosine residues are added at a point 15–20 bp downstream from the AAUAAA polyadenylation signal that is found in about 90% of mRNAs. This ‘poly-A tail’ appears to play a role in stabilizing mRNAs and in transport of messages out of the nucleus.

Transcription factors are proteins that bind to specific DNA sequences within the regulatory region of the target gene to alter the level of gene expression. Transcription factors often contain specific domains that bind directly to DNA as well as segments involved in interaction with other proteins. The interactions of many families of transcription factors with their specific DNA target sequences in promoters, as well as with each other, determine the complex patterns of developmental and tissue-specific gene expression. In addition, 5’ or 3’ flanking enhancer elements may also influence the frequency of transcription.
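The splice-site and polyadenylation signals described above are simple enough to locate by pattern matching. The sketch below lists candidate introns (segments that begin with GT and end with AG in the DNA sequence, corresponding to GU and AG in RNA) and the positions of the AATAAA signal. The function names and the `min_len` parameter are hypothetical, and the approach deliberately overpredicts: gene-finding programs weigh the surrounding sequence context, which is not attempted here.

```python
import re

def find_polya_signals(seq):
    """Return 0-based start positions of the AATAAA signal (AAUAAA in RNA)."""
    seq = seq.upper().replace("U", "T")
    # A lookahead pattern reports overlapping occurrences as well.
    return [m.start() for m in re.finditer(r"(?=AATAAA)", seq)]

def candidate_introns(seq, min_len=4):
    """Return (start, end) pairs where seq[start:end] begins with GT and ends with AG.

    Coordinates are 0-based and end-exclusive; min_len is an illustrative cutoff.
    """
    seq = seq.upper().replace("U", "T")
    donors = [m.start() for m in re.finditer(r"(?=GT)", seq)]
    acceptors = [m.start() + 2 for m in re.finditer(r"(?=AG)", seq)]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]
```

Because GT and AG dinucleotides occur frequently by chance, the candidate list grows quickly with sequence length; it is a starting set to be filtered by other evidence, not a prediction.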
The identification of transcription factor-binding sites, promoter structure, processing signals and enhancers can help to elucidate the functional capacity of genomic DNA sequences. Computational analysis of control elements involved in transcription is an important component of genome analysis. The most comprehensive transcription factor database is TRANSFAC. Nucleotide sequences can be entered and searched for the presence of transcription factor-binding sites. This database also contains descriptions of transcription factor-binding sites within genes, functional properties of these sites, and information on the transcription factors. Computer software available through the Baylor College of Medicine’s ‘Search Launcher’ website can also be used to locate promoter regions, polyadenylation regions and splice sites in genomic sequence.

Open Reading Frames

An open reading frame (ORF) is defined as a series of nucleotide triplets coding for amino acids, without any termination codons, that is potentially translatable into protein. The presence of genes is inferred by the detection of these ORFs by computational analysis. The Gene Recognition and Analysis Internet Link (GRAIL), available through the Oak Ridge National Laboratory, is designed to provide an initial automatic localization and characterization of the ORFs of genes from genomic sequence data. GRAIL provides a starting point for further computational and experimental study, such as cloning and sequencing of a cDNA for a gene or identification and functional analysis of the gene product. GRAIL recognizes coding regions in genomic sequence through a technology called ‘pattern recognition’ as part of a neural network system. In addition to ORFs, the GRAIL program can detect other features, including polyadenylation sites, repeat sequences and CpG island sequences. The CpG islands are regions of the genome that are rich in CG dinucleotides. They are generally associated with gene-rich regions.
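The ORF definition given above translates directly into a scan over codons. The following Python sketch reports ORFs that start at an ATG and run to the first in-frame stop codon in each of the three forward reading frames. It is an illustration of the definition only, not a description of how GRAIL works; the function name and the `min_codons` threshold are assumptions, and the reverse strand is not examined.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG-initiated ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive, and the end includes the stop codon.
    min_codons counts the codons from ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

In practice the length threshold would be set far higher than the default shown here, since short ORFs arise frequently by chance in noncoding sequence.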
Additionally, the National Center for Biotechnology Information has an ORF finder available.

Expressed Sequence Tags

Expressed sequence tags (ESTs) are complementary DNA (cDNA) fragments generated from mRNA (cDNAs are derived in vitro from mRNA sequences). Sequencing of ESTs has proved to be a useful method for gene identification. Since only a small number of sequences in the 3.5 billion base pair human genome actually code for a functional protein, partially sequenced cDNA fragments or ESTs can be used as markers to search for expressed genes using computational analysis. Additionally, sequencing of ESTs from many different organisms significantly increases the probability that any gene in an unknown human genomic fragment can be identified by similarity searching, based upon nucleotide identity, in the publicly available sequence databases.

Cross-species Genome Comparison

Nonhuman genomes such as those of the yeast Saccharomyces cerevisiae, the fruitfly (Drosophila melanogaster), the roundworm (Caenorhabditis elegans) and the mouse (Mus musculus) provide excellent model systems for sequence analysis, since these organisms are genetically well defined and have generation times shorter than that of humans. A large amount of genetic information has been derived from the sequence data of these organisms, providing important information for the analysis of normal gene regulation, genetic diseases and evolutionary processes. For example, exon sequences are usually found to be conserved from one species to another at some level whereas introns are usually not. A new gene located in a region of genomic sequence can be identified by similarity searching if it, or a homologous relative in another organism’s genome, is represented in a searchable database. Finding similarity between a query sequence and genomic DNA sequences in sequence databases from many species suggests that the query sequence contains genetic information that has been conserved throughout evolution.
For this reason, it is believed that most vertebrates share most of the same set of approximately 40 000–100 000 genes, although some genes are unique to each vertebrate. Large-scale comparisons of bacterial genomes will facilitate identification of genes held in common between genomes that have conserved functions as well as those that differ. A database cross-referencing the genetics of model organisms with mammalian phenotypes is available from the National Center for Biotechnology Information (NCBI). The sequencing of cDNAs from model organisms has also facilitated identification of potential homologues of a number of human genes. A computer database search may also detect homology with a previously characterized gene from a model organism. This type of identification based on nucleotide identity may provide insight into the function of the corresponding human gene.

Pseudogenes

Pseudogenes may have the same structure as the functional genes from which they were derived (e.g. exons and introns) but have acquired one or more mutations during evolution that make them unable to produce a functional protein product. Alternatively, pseudogenes may be processed forms of the original locus that no longer contain introns and are nonfunctional. Mutations that generate pseudogenes may interfere with transcription initiation, splicing at the intron/exon junction or translation termination. A pseudogene often has several destructive mutations, presumably due to enhanced mutability of the duplicated copy following the initial loss of function. Inactive genomic sequences that resemble the mature mRNA transcript are called processed pseudogenes. Processed pseudogenes originate by the insertion into the genome of a cDNA product derived from the mRNA. Pseudogenes are fairly common within many gene families, including globin genes and immunoglobulin genes.
Repetitive Elements

In many genomes there are large sections of repeated DNA sequences that exist in variable copy number. These repetitive sequences can be divided into two groups: tandem and dispersed. Tandemly repeated DNA generally refers to highly repetitive sequences such as microsatellites. These are short repeat units (2–10 nucleotides long) that occur in tandem and have no known function. Other examples of tandemly repeated DNA are telomeres, more complex minisatellites and alpha satellites.

Interspersed repetitive elements are usually the products of transposable element integrations but may include retropseudogenes of a functional gene. Alu and L1 elements are the major repetitive interspersed sequences in the human genome. Alu elements are 300-bp elements derived from RNA polymerase III transcripts that have no coding capacity and have duplicated to a copy number of 500 000 within primate genomes. By contrast, L1 elements are longer (6.5 kb in length), exist at a copy number of 100 000, and contain two open reading frames. One open reading frame is believed to code for an RNA-binding protein whose function is unknown. The second open reading frame codes for a reverse transcriptase and endonuclease, important in the mobilization of the L1 elements.

While the ORFs of L1 elements may serve an important biological function for the L1 elements, their presence can cause serious problems for identity searches in genomic sequence analysis. This is because the L1 elements are recognized as potential genes in homology-based computer searches and tend to obscure the coding capacity of the genes in which they reside. The program REPEATMASKER from the University of Washington identifies and characterizes most but not all human and rodent repeat families. It analyses the repeats and censors them by replacing nucleotide sequences with Ns and protein sequences with Xs to facilitate the analysis of nonrepetitive coding potential within a gene.
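The censoring step described above, in which repeat nucleotides are replaced with Ns, can be illustrated in isolation. The Python sketch below masks a set of repeat coordinates that are assumed to be already known; it does not detect repeats, which is the harder part of the task and is not attempted here. The function name and the 0-based, end-exclusive interval convention are choices made for this example.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace each 0-based, end-exclusive (start, end) interval with mask_char.

    Intervals may overlap and are clipped to the sequence bounds, so the
    masked sequence always has the same length as the input.
    """
    masked = list(seq)
    for start, end in intervals:
        start, end = max(0, start), min(len(seq), end)
        for i in range(start, end):
            masked[i] = mask_char
    return "".join(masked)
```

Keeping the sequence length unchanged means that coordinates reported by any later search on the masked sequence still refer to positions in the original sequence.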
Computer-aided Analyses

As stated previously, given that less than 10% of the human genome is involved in coding for protein, it is necessary to have computer algorithms that are capable of recognizing the small amounts of coding information (the exons) contained in large stretches of DNA. Exons have certain distinguishing features: they are preceded by a splice acceptor, they usually contain an open reading frame, and they are terminated by splice donor sites. However, these characteristics cannot always be visualized, especially over a long stretch of genomic DNA sequence. Sophisticated neural network approaches have allowed the development of powerful computer algorithms that can identify exons of 100 bp or larger in a region of DNA that has yet to be analysed.

The most important basic computational sequence analysis tool is the basic local alignment search tool (BLAST), which looks for similar segments between a query sequence and the database sequences. The results of the query reveal whether the newly sequenced DNA is similar to previously reported sequences in the database. The BLAST programs, available on the Internet through the NCBI, are probably the most widely used. BLAST allows one to search for sequences using several different algorithms to explore databases of expressed sequence tags, proteins, cloned genes and open reading frames. Additional molecular biological and genetic tools, including PubMed (literature search), are also available through NCBI. The BLAST approach to similarity searching used by NCBI is also used by The Institute for Genomic Research (TIGR). The TIGR databases contain expressed sequence tags, transcripts and genes from the genomes of human, mouse, rat, Drosophila, rice, tomato, zebrafish and various parasites.
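The idea of looking for similar segments between a query and database sequences can be illustrated with a word-matching sketch. The Python code below indexes every short word of length `k` in a small set of database sequences and reports where words from the query occur. This is a simplified illustration of seed finding only, not the BLAST algorithm itself, which also extends and scores matches and assesses their statistical significance. The function names, the dictionary-based database and the default word length are assumptions made for the example.

```python
from collections import defaultdict

def build_word_index(database, k=4):
    """Map every k-letter word to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

def seed_hits(query, index, k=4):
    """Return (query offset, sequence name, database offset) for each shared word."""
    hits = []
    for q in range(len(query) - k + 1):
        for name, d in index.get(query[q:q + k], []):
            hits.append((q, name, d))
    return hits
```

The same value of `k` must be used for building the index and for querying it. Each hit marks a position from which an alignment could be extended in both directions, which is the step that distinguishes a real similarity search from this sketch.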
Other programs, such as GENSCAN, maintained by the Massachusetts Institute of Technology, and HEXON, FEXH and FGENESH, available through the Baylor College of Medicine’s ‘Search Launcher’ website, can be used to complement the information ascertained from the BLAST analysis, especially when no information is found through BLAST (NCBI). These programs, at a minimum, may be able to inform one as to what portion of genomic sequence is in fact coding. Nucleotide sequence/homology search programs that use BLAST and pattern recognition programs such as GRAIL, GENSCAN, HEXON, FEXH and FGENESH are complementary and should be used together in order to assure complete analysis of genomic sequence. These tools are continually being re-evaluated in order to improve on the analytical capabilities of the programs. These programs offer an excellent starting point for annotating sequences and locating genes.

Further Reading

Adams MD, Fields C and Venter C (eds) (1994) Automated DNA Sequencing and Analysis. San Diego, CA: Academic Press.
Altschul SF, Gish W, Miller W, Myers EW and Lipman D (1990) Basic local alignment search tool. Journal of Molecular Biology 215: 403–410.
Bishop MJ (ed.) (1998) Guide to Human Genome Computing. San Diego, CA: Academic Press.
Burch PE (1999) Molecular Biology Computation Resource. Houston, TX: Baylor College of Medicine. [http://condor.bcm.tmc.edu/home.html]
Burge C (2000) GENSCAN. Cambridge, MA: Massachusetts Institute of Technology. [http://genes.mit.edu/GENSCAN.html] [Burge C and Karlin S (1997) Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 268: 78–94.]
Cook JL (1999) Internet biomolecular resources. Analytical Biochemistry 268: 165–172.
Deininger PL and Batzer MA (1993) Evolution of retroposons. Evolutionary Biology 27: 157–196.
Deininger PL and Batzer MA (1999) Alu repeats and human disease. Molecular Genetics and Metabolism 67: 183–193.
Gene Regulation (2000) TRANSFAC. Braunschweig, Germany: BIOBASE GmbH. [http://www.gene-regulation.com/databases.html#transfac]
GRAIL (1996) Gene Recognition and Assembly Internet Link Version 1.3. [http://compbio.ornl.gov/Grail-1.3/]
Human Genome Sequencing Center (2000) Search Launcher. Houston, TX: Baylor College of Medicine. [http://www.hgsc.bcm.tmc.edu/SearchLauncher/]
Jurka J and Batzer MA (1996) Human repetitive elements. In: Meyers RA (ed.) Encyclopedia of Molecular Biology and Medicine, vol. 3, pp. 240–246. Weinheim, Germany: VCH Publishers.
Lewin B (1997) Genes VI. New York: Oxford University Press.
NCBI (2000) National Center for Biotechnology Information. [http://www.ncbi.nlm.nih.gov]
TIGR (2000) TIGR Databases. Rockville, MD: The Institute for Genomic Research. [http://www.tigr.org/tdb/tdb.html]
Uberbacher EC and Mural RJ (1991) Locating protein-coding regions in human DNA by multiple neural sensor neural network approach. Proceedings of the National Academy of Sciences of the USA 88: 11261–11265.
University of Washington Genome Center (1999) Repeat Masker. Seattle, WA: University of Washington. [http://ftp.genome.washington.edu]

ENCYCLOPEDIA OF LIFE SCIENCES / © 2001 Nature Publishing Group / www.els.net
