Genome Sequence Analysis
Secondary article
Margaret M DeAngelis, Louisiana State University Health Sciences Center, New Orleans, Louisiana, USA
Mark A Batzer, Louisiana State University Health Sciences Center, New Orleans, Louisiana, USA
The human genome has an estimated 40 000–100 000 genes dispersed throughout 3.5
billion nucleotides of sequence. DNA sequences are inherently complex and a number of
computational tools are required to analyse the genomic sequences of eukaryotic, bacterial
and model organisms.
Article Contents
. Introduction
. Exons and Introns
. Control Elements
. Open Reading Frames
. Expressed Sequence Tags
. Cross-species Genome Comparison
. Pseudogenes
. Repetitive Elements
. Computer-aided Analyses
Introduction
The human genome has approximately 3.5 billion base
pairs (bp) and is an excellent example for the analysis of
eukaryotic genomes. The goal of genome research is to
sequence each one of these base pairs so that all the genes
and regulatory regions in the genome can be located. This
information can then be used to facilitate discoveries in the
basic and clinical sciences. Thus, the aim of most large-scale sequencing projects is the discovery of new genes in
previously uncharacterized or only partially characterized
genomic DNA sequences. A gene, which is the basic
functional unit of heredity, is typically a specific sequence
of nucleotides that carries the information required for
making a functional protein or, in some cases, a functional
RNA. Several computational methods have been developed for analysing genomic sequences and identifying genes.
Exons and Introns
Exons and introns are important features of the eukaryotic
gene. Eukaryotic genes are composed of regulatory
sequences, short exons and introns of variable length.
Most have their protein-coding regions interrupted by
introns that are removed in a process called splicing to
generate a mature messenger RNA (mRNA), which is
translated into protein. Exons are the sequences that
remain in the final mature mRNA and generally code for a
protein. Introns are the sequences that are removed by
splicing from the full-length heterogeneous nuclear RNA
(hnRNA). However, introns can also interrupt noncoding
regions such as 5’ and 3’ untranslated regions of pre-mRNA. The 5’ untranslated regions are sequences that are
transcribed into mRNA but not translated. Usually,
translation does not begin until the first AUG sequence
(the initiation codon) that appears in the RNA, so that
sequences located 5’ to this do not appear in the protein.
The 3’ untranslated regions are sequences found in mRNA
after the stop codon sequence (UAG, UGA or UAA).
Together, the coding segments and the 5’ and 3’ untranslated regions represent the exons.
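The partition of a mature mRNA described above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes that translation starts at the first AUG and ends at the first in-frame stop codon, and the function name `split_mrna` is hypothetical rather than taken from any published tool.

```python
STOP_CODONS = {"UAG", "UGA", "UAA"}

def split_mrna(mrna):
    """Return (five_utr, coding, three_utr) for a mature mRNA, or None.

    Assumption: the first AUG is the initiation codon and the first
    in-frame stop codon ends the coding segment.
    """
    mrna = mrna.upper().replace("T", "U")   # accept DNA-style input as well
    start = mrna.find("AUG")
    if start == -1:
        return None                          # no initiation codon present
    for i in range(start + 3, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3                      # coding segment includes the stop codon
            return mrna[:start], mrna[start:end], mrna[end:]
    return None                              # no in-frame stop codon found

print(split_mrna("GGCAUGGCCUAAGGAAAA"))
```

Real transcripts do not always follow the first-AUG rule, so the result should be read as a first approximation.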
Control Elements
In general, it is the final protein product of a gene that
carries out the function of that gene. Protein production in
eukaryotes can be influenced by control elements at the
level of transcription. The major control elements of a gene
include the promoter and associated basal transcription
factor-binding sites, polyadenylation site, enhancers and
accessory transcription factor-binding sites.
Most protein-coding genes are transcribed by RNA
polymerase II into RNA. Transcription is initiated in the
promoter region by several different factors. The promoter
is generally defined as the control element located
immediately 5’ to the gene that specifies the start of RNA
synthesis. The basal promoter elements are the TATA and
the CCAAT sequences. The TATA sequence is found in
most protein-coding genes and is important for the
positioning of RNA polymerase II for the initiation of
transcription. TATA sequences are generally located 25–
30 bp 5’ to the transcription start site. Further upstream
(70–90 bp) there is often a CCAAT sequence, although this
is less common than the TATA sequence. RNA polymerase II interacts with these upstream elements, allowing
transcription to proceed.
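The positional rules given above for the basal promoter elements lend themselves to a simple check. The sketch below assumes that the transcription start site is already known, that the literal strings TATA and CCAAT stand in for the motifs, and that distances are measured from the start site to the first base of the motif. The function names are hypothetical.

```python
def motif_in_window(seq, motif, tss, min_up, max_up):
    """True if `motif` starts between min_up and max_up bp 5' of index `tss`."""
    seq = seq.upper()
    for dist in range(min_up, max_up + 1):
        pos = tss - dist
        if pos >= 0 and seq.startswith(motif, pos):
            return True
    return False

def basal_promoter_elements(seq, tss):
    """Report which basal promoter elements sit at their expected distances."""
    return {
        "TATA": motif_in_window(seq, "TATA", tss, 25, 30),     # 25-30 bp upstream
        "CCAAT": motif_in_window(seq, "CCAAT", tss, 70, 90),   # 70-90 bp upstream
    }

# Toy sequence (made up for illustration) with the start site at index 100.
toy = "G" * 20 + "CCAAT" + "G" * 47 + "TATA" + "G" * 24 + "ACGT" * 5
print(basal_promoter_elements(toy, 100))
```

Exact string matching is deliberately naive; promoter prediction programs score degenerate motifs rather than requiring a perfect match.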
Intron splicing is a multiphase process of RNA
maturation that takes place in the nucleus to generate
mature mRNA molecules for transport into the cytoplasm.
The process involves small or heterogeneous ribonucleoprotein particles in a complex structure called the
spliceosome. Introns usually begin with the donor sequence GU, and terminate with an acceptor sequence AG.
These splice donors and acceptors are the common
recognition features that computer algorithms use to
identify putative coding or noncoding DNA sequences.
The polyadenylation signal AAUAAA in the 3’ untranslated region is also a common feature of mRNAs.
Most messenger RNAs that code for protein are polyadenylated and contain an additional string of hundreds of
adenosine residues at their 3’ end. The process of
polyadenylation involves cleavage of pre-mRNA at the 3’
ENCYCLOPEDIA OF LIFE SCIENCES / & 2001 Nature Publishing Group / www.els.net
end followed by synthesis of a poly A tract. The adenosine
residues are added at a point 15–20 bp downstream from
the AAUAAA polyadenylation signal that is found in
about 90% of mRNAs. This ‘poly-A tail’ appears to play a
role in stabilizing mRNAs and in transport of messages out
of the nucleus.
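The GU...AG rule and the AAUAAA signal can both be located by plain pattern matching. The sketch below is an illustration under stated assumptions, not the method of any particular gene-finding program: the minimum intron length is an arbitrary parameter, the 15–20 base window is measured from the 3' end of the signal, and the function names are hypothetical.

```python
import re

def candidate_introns(pre_mrna, min_len=20):
    """Return (start, end) spans that begin with GU and end with AG.

    Every donor/acceptor pairing at least `min_len` bases long is reported,
    so most candidates will be false positives.
    """
    rna = pre_mrna.upper().replace("T", "U")
    donors = [m.start() for m in re.finditer("(?=GU)", rna)]
    acceptor_ends = [m.start() + 2 for m in re.finditer("(?=AG)", rna)]
    return [(d, a) for d in donors for a in acceptor_ends if a - d >= min_len]

def polya_signals(mrna):
    """Return (signal_start, window_start, window_end) for each AAUAAA.

    Assumption: the cleavage window lies 15-20 bases 3' of the signal's end.
    """
    rna = mrna.upper().replace("T", "U")
    out = []
    for m in re.finditer("(?=AAUAAA)", rna):
        signal_end = m.start() + 6
        out.append((m.start(), signal_end + 15, signal_end + 20))
    return out

print(candidate_introns("AAGUCCCCAGAA", min_len=4))
print(polya_signals("CCAAUAAAGG"))
```

Because GU and AG dinucleotides occur frequently by chance, practical programs combine these signals with coding-potential measures instead of relying on them alone.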
Transcription factors are proteins that bind to specific
DNA sequences within the regulatory region of the target
gene to alter the level of gene expression. Transcription
factors often contain specific domains that bind directly to
DNA as well as segments involved in interaction with other
proteins. The interactions of many families of transcription
factors with their specific DNA target sequences in
promoters as well as with each other determine the
complex patterns of developmental and tissue-specific gene
expression. In addition, 5’ or 3’ flanking enhancer elements
may also influence the frequency of transcription. The
identification of transcription factor-binding sites, promoter structure, processing signals and enhancers can help
to elucidate the functional capacity of genomic DNA
sequences. Computational analysis of control elements
involved in transcription is an important component of
genome analysis. The most comprehensive transcription
factor database is TRANSFAC. Nucleotide sequences can
be entered and searched for the presence of transcription
factor-binding sites. This database also contains descriptions of transcription factor-binding sites within genes,
functional properties of these sites, and information on the
transcription factors. Computer software available
through the Baylor College of Medicine’s ‘Search Launcher’ website can also be used to locate promoter regions,
polyadenylation regions and splice sites in genomic
sequence.
Open Reading Frames
An open reading frame (ORF) is defined as a series of
nucleotide triplets coding for amino acids without any
termination codons that is potentially translatable into
protein. The presence of genes is inferred by the detection
of these ORFs by computational analysis. The Gene
Recognition and Assembly Internet Link (GRAIL), available through the Oak Ridge National Laboratory, is
designed to provide an initial automatic localization and
characterization of the ORFs of genes from genomic
sequence data. GRAIL provides a starting point for
further computational and experimental study such as
cloning and sequencing of a cDNA for a gene or
identification and functional analysis of the gene product.
GRAIL recognizes coding regions in genomic sequence
through a technology called ‘pattern recognition’ as part of
a neural network system. The GRAIL program can also
detect features other than ORFs, including
polyadenylation sites, repeat sequences and CpG island
sequences. The CpG islands are regions of the genome that
are rich in CG dinucleotides. They are generally associated
with gene-rich regions. Additionally, the National Center
for Biotechnology Information has an ORF finder
available.
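The definition of an ORF given above translates directly into a scan of the three forward reading frames. The following is a minimal sketch and not the algorithm used by GRAIL or the NCBI ORF finder: it assumes that an ORF runs from an ATG to the next in-frame stop codon, ignores the reverse strand, and uses an arbitrary minimum length. The name `find_orfs` is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (frame, start, end) for each forward-strand ORF.

    `end` is the index just past the stop codon; `min_codons` counts the
    codons from ATG up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                          # open a new reading frame
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None                       # close it at the stop codon
    return orfs

print(find_orfs("CCATGAAACCCGGGTAACC"))
```

In eukaryotic genomic DNA an ORF found this way usually corresponds to part of an exon at best, since introns interrupt the coding sequence.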
Expressed Sequence Tags
Expressed sequence tags (ESTs) are complementary DNA
(cDNA) fragments generated from mRNA (cDNAs are
derived in vitro from mRNA sequences). Sequencing of
ESTs has proved to be a method for gene identification.
Since only a small number of sequences in the 3.5 billion
base pair human genome actually code for a functional
protein, partially sequenced cDNA fragments or ESTs can
be used as markers to search for expressed genes using
computational analysis. Additionally, sequencing of ESTs
from many different organisms significantly increases the
probability that any gene in an unknown human genomic
fragment can be identified by similarity searching in the
publicly available sequence database based upon nucleotide identity.
Cross-species Genome Comparison
Nonhuman organisms such as the yeast
(Saccharomyces cerevisiae), the fruitfly (Drosophila melanogaster), the roundworm (Caenorhabditis elegans) and the
mouse (Mus musculus) provide excellent model systems for sequence analysis
since they are genetically well defined with generation times
shorter than that of humans. A large amount of genetic
information has been derived from the sequence data of
these organisms, providing important information for the
analysis of normal gene regulation, genetic diseases and
evolutionary processes. For example, exon sequences are
usually found to be conserved from one species to another
at some level whereas introns are usually not. A new gene
located in a region of genomic sequence can be identified by
similarity searching if it or a homologous relative in
another organism’s genome is represented in a searchable
database. Finding similarity between a query sequence and
genomic DNA sequences in sequence databases from
many species suggests that the query sequence contains
genetic information that has been conserved throughout
evolution. For this reason, it is believed that most
vertebrates share the majority of the same set of
approximately 40 000–100 000 genes, although some genes
are unique to each vertebrate. Large-scale comparisons of
bacterial genomes will facilitate identification of genes held
in common between genomes that have conserved functions as well as those that differ. A database cross-referencing the genetics of model organisms with mammalian phenotypes is available from the National Center for
Biotechnology Information (NCBI). The sequencing of
cDNAs from model organisms has also facilitated
identification of potential homologues of a number of
human genes. A computer database search may also detect
homology with a previously characterized gene from a
model organism. This type of identification based on
nucleotide identity may provide insight into the function of
the corresponding human gene.
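Conservation between species is commonly summarized as percentage nucleotide identity over an alignment. The sketch below assumes that the two sequences have already been aligned by some other program, with `-` marking gaps, and the sequences in the usage lines are made up for illustration.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical bases over the ungapped columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (x, y)
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x != "-" and y != "-"            # skip columns containing a gap
    ]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)

# Hypothetical aligned fragments from two species.
print(percent_identity("ATGGCCAAGT", "ATGGCTAAGT"))   # exon-like: high identity
print(percent_identity("GTAAGTCC-A", "GTTCGACCTA"))   # intron-like: lower identity
```

A markedly higher identity in one segment than in its neighbours is the kind of signal that suggests a conserved exon.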
Pseudogenes
Pseudogenes may have the same structure as the functional
genes from which they were derived (e.g. exons and
introns) but have acquired one or more mutations during
evolution that make them unable to produce a functional
protein product. Alternatively, pseudogenes may be
processed forms of the original locus that no longer
contain introns and are nonfunctional. Mutations that
generate pseudogenes may interfere with transcription
initiation, splicing at the intron/exon junction or translation termination. A pseudogene often has several destructive mutations, presumably due to enhanced mutability of
the duplicated copy following the initial loss of function.
Inactive genomic sequences that resemble the mature
mRNA transcript are called processed pseudogenes.
Processed pseudogenes originate by the insertion of a
cDNA product derived from the mRNA into the genome.
Pseudogenes are fairly common within many gene families,
including globin genes and immunoglobulin genes.
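One of the defects described above, a mutation that introduces early translation termination, is simple to screen for. The sketch below assumes that the input is a coding sequence in the correct reading frame beginning at its start codon. It detects only in-frame stop codons and would not detect promoter or splice-site mutations. The name `premature_stops` is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def premature_stops(cds):
    """Return the positions of in-frame stop codons before the final codon."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # The last codon is the expected terminator, so it is excluded.
    return [i * 3 for i, codon in enumerate(codons[:-1]) if codon in STOPS]

print(premature_stops("ATGAAACCCTAA"))      # intact toy gene
print(premature_stops("ATGAAATAGCCCTAA"))   # toy copy with an early TAG
```

A sequence that returns one or more positions is a candidate pseudogene, although confirmation would require comparison with the functional gene.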
Repetitive Elements
In many genomes there are large sections of repeated DNA
sequences that exist in variable copy number. These
repetitive sequences can be divided into two groups:
tandem and dispersed. Tandemly repeated DNA generally
refers to highly repetitive sequences such as microsatellites.
These are short repeat units (2–10 nucleotides long) that
occur in tandem and have no known function. Other
examples of tandemly repeated DNA are telomeres, more
complex minisatellites and alpha satellites. Interspersed
repetitive elements are usually the products of transposable
element integrations but may include retropseudogenes of
a functional gene. Alu and L1 elements are the major
repetitive interspersed sequences in the human genome.
Alu elements are 300-bp elements derived from RNA
polymerase III transcripts that have no coding
capacity and have duplicated to a copy number of 500 000
within primate genomes.
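Short tandem repeats such as microsatellites can be found with a regular expression that uses a backreference. The sketch below takes the 2–10 nucleotide unit length from the text; the requirement of at least four consecutive copies is an assumption made here for illustration, and the function name is hypothetical.

```python
import re

# A unit of 2-10 bases followed by at least three further copies of itself.
MICROSATELLITE = re.compile(r"([ACGT]{2,10}?)\1{3,}")

def find_microsatellites(dna):
    """Return (start, unit, copies) for each tandem repeat found.

    Note: a run of a single base (for example AAAAAAAA) is also reported,
    with a unit such as AA, because the pattern does not exclude it.
    """
    hits = []
    for m in MICROSATELLITE.finditer(dna.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits

print(find_microsatellites("GGCACACACACATT"))
```

Dispersed repeats such as Alu and L1 elements cannot be found this way, since they are recognized by similarity to a known consensus sequence rather than by internal periodicity.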
By contrast, L1 elements are longer (6.5 kb in length),
exist at a copy number of 100 000, and contain two open
reading frames. One open reading frame is believed to code
for an RNA-binding protein whose function is unknown.
The second open reading frame codes for a reverse
transcriptase and endonuclease, important in the mobilization of the L1 elements. While the ORFs of L1 elements
may serve an important biological function for the L1
elements, their presence can cause serious problems for
identity searches in genomic sequence analysis. This is
because the L1 elements are recognized as potential genes
in homology-based computer searches and tend to obscure
the coding capacity of the genes in which they reside.
The program REPEATMASKER from the University
of Washington identifies and characterizes most but not all
human and rodent repeat families. It analyses the repeats
and censors them by replacing nucleotide sequences with
Ns and protein sequences with Xs to facilitate the analysis
of nonrepetitive coding potential within a gene.
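The censoring step itself is straightforward once the repeat coordinates are known. The sketch below is not the REPEATMASKER algorithm: it assumes that the repeat intervals have already been identified by some other means and only shows the replacement of those positions with Ns. The function name and coordinates are hypothetical.

```python
def mask_intervals(dna, intervals):
    """Replace each half-open (start, end) interval of `dna` with Ns."""
    chars = list(dna)
    for start, end in intervals:
        # Clamp to the sequence bounds so that stray coordinates are harmless.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)

print(mask_intervals("ACGTACGTAC", [(2, 5)]))
```

Masking preserves the length and coordinates of the sequence, so features found in the masked sequence map directly back to the original.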
Computer-aided Analyses
As stated previously, given that less than 10% of the
human genome is involved in coding for protein, it is
necessary to have computer algorithms that are capable of
recognizing the small amounts of coding information (the
exons) contained in large stretches of DNA. Exons have
certain distinguishing features: they are preceded by a
splice acceptor, they usually contain an open reading
frame, and they are terminated by splice donor sites.
However, these characteristics cannot always be visualized, especially over a long stretch of genomic DNA
sequence. Sophisticated neural network approaches have
allowed the development of powerful computer algorithms
that can identify exons of 100 bp or larger in a region of
DNA that has yet to be analysed. The most important basic
computational sequence analysis tool is the basic local
alignment search tool (BLAST) which looks for similar
segments between a query sequence and the database
sequences. The results of the query reveal whether the
newly sequenced DNA is similar to previously reported
sequences in the database. The BLAST programs, available on the Internet through the NCBI, are probably the
most widely used. BLAST allows one to search for
sequences using several different algorithms to explore
databases of expressed sequence tags, proteins, cloned
genes and open reading frames. Additional molecular
biological and genetic tools including PubMed (literature
search) are also available through NCBI.
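The idea of finding similar segments between a query and a database sequence can be illustrated with a much-simplified seed-and-extend search. The sketch below is not the BLAST algorithm as implemented at NCBI: it uses exact word matches as seeds and extends them only while bases are identical, without scoring matrices, gaps or statistics. The word size of 4 and all function names are assumptions made for illustration.

```python
def seed_hits(query, subject, word=4):
    """Return (query_pos, subject_pos) for every shared word of length `word`."""
    index = {}
    for j in range(len(subject) - word + 1):
        index.setdefault(subject[j:j + word], []).append(j)
    hits = []
    for i in range(len(query) - word + 1):
        for j in index.get(query[i:i + word], []):
            hits.append((i, j))
    return hits

def extend(query, subject, i, j, word=4):
    """Extend an exact seed in both directions while the bases are identical."""
    left = 0
    while i - left > 0 and j - left > 0 and query[i - left - 1] == subject[j - left - 1]:
        left += 1
    right = word
    while (i + right < len(query) and j + right < len(subject)
           and query[i + right] == subject[j + right]):
        right += 1
    return (i - left, j - left, left + right)    # (query_start, subject_start, length)

def best_local_match(query, subject, word=4):
    """Return the longest exact shared segment found from any seed, or None."""
    matches = {extend(query, subject, i, j, word) for i, j in seed_hits(query, subject, word)}
    return max(matches, key=lambda m: m[2], default=None)

print(best_local_match("TTACGTACGGA", "CCCACGTACGGTT"))
```

Indexing the words of the database sequence first is what makes this style of search fast, since most of the database is never compared base by base.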
The BLAST approach to similarity searching used by
NCBI is also used by The Institute for Genomic Research
(TIGR). The TIGR databases contain expressed sequence
tags, transcripts and genes from the genomes of human,
mouse, rat, Drosophila, rice, tomato, zebrafish and various
parasites. Other programs, such as GENSCAN maintained by the Massachusetts Institute of Technology, and
HEXON, FEXH and FGENESH, available through the
Baylor College of Medicine’s ‘Search Launcher’ website,
can be used to complement the information ascertained
from the BLAST analysis, especially when no information
is found through BLAST (NCBI). These programs, at a
minimum, may be able to inform one as to what portion of
genomic sequence is in fact coding.
Nucleotide sequence/homology search
programs that use BLAST and pattern recognition
programs, such as GRAIL, GENSCAN, HEXON, FEXH
and FGENESH, are complementary and should be used
together in order to ensure complete analysis of genomic
sequence. These tools are continually being re-evaluated in
order to improve on the analytical capabilities of the
programs. These programs offer an excellent starting point
for annotating sequences and locating genes.
Further Reading
Adams MD, Fields C and Venter C (eds) (1994) Automated DNA
Sequencing and Analysis. San Diego, CA: Academic Press.
Altschul SF, Gish W, Miller W, Myers EW and Lipman D (1990) Basic
local alignment search tool. Journal of Molecular Biology 215: 403–
410.
Bishop MJ (ed.) (1998) Guide to Human Genome Computing. San Diego,
CA: Academic Press.
Burch PE (1999) Molecular Biology Computation Resource. Houston,
TX: Baylor College of Medicine. [http://condor.bcm.tmc.edu/home.html]
Burge C (2000) GENSCAN. Cambridge, MA: Massachusetts Institute of
Technology. [http://genes.mit.edu/GENSCAN.html] [Burge C and
Karlin S (1997) Prediction of complete gene structures in human
genomic DNA. Journal of Molecular Biology 268: 78–94.]
Cook JL (1999) Internet biomolecular resources. Analytical Biochemistry 268: 165–172.
Deininger PL and Batzer MA (1993) Evolution of retroposons.
Evolutionary Biology 27: 157–196.
Deininger PL and Batzer MA (1999) Alu repeats and human disease.
Molecular Genetics and Metabolism 67: 183–193.
Gene Regulation (2000) TRANSFAC. Braunschweig, Germany: BIOBASE GmbH. [http://www.gene-regulation.com/databases.html#transfac]
GRAIL (1996) Gene Recognition and Assembly Internet Link Version 1.3.
[http://compbio.ornl.gov/Grail-1.3/]
Human Genome Sequencing Center (2000) Search Launcher. Houston,
TX: Baylor College of Medicine. [http://www.hgsc.bcm.tmc.edu/SearchLauncher/]
Jurka J and Batzer MA (1996) Human repetitive elements. In: Meyers
RA (ed.) Encyclopedia of Molecular Biology and Medicine, vol. 3, pp.
240–246. Weinheim, Germany: VCH Publishers.
Lewin B (1997) Genes VI. New York: Oxford University Press.
NCBI (2000) National Center for Biotechnology Information. [http://www.ncbi.nlm.nih.gov]
TIGR (2000) TIGR Databases. Rockville, MD: The Institute for
Genomic Research. [http://www.tigr.org/tdb/tdb.html]
Uberbacher EC and Mural RJ (1991) Locating protein-coding regions in
human DNA sequences by a multiple sensor-neural network approach.
Proceedings of the National Academy of Sciences of the USA 88:
11261–11265.
University of Washington Genome Center (1999) Repeat Masker.
Seattle, WA: University of Washington. [http://ftp.genome.washington.edu]