Gene Structure & Genomes
Biology 224
Instructor: Tom Peavy
March 6 & 11, 2008
<Images adapted from Bioinformatics and Functional Genomics by Jonathan Pevsner>

Similarities & Differences: Prokaryotic vs. Eukaryotic Genomic DNA
-- Size of genome?
-- Complexity of genes?
-- Open reading frames (1 gene per stretch)?
-- Regulatory sequences for transcription?
-- Density of genes?
-- One gene = 1 transcript?

Finding genes in eukaryotic DNA
Types of genes include
• protein-coding genes
• pseudogenes
• functional RNA genes (e.g. tRNA, rRNA, snRNA)

There are several kinds of exons:
-- noncoding
-- initial coding exons
-- internal exons
-- terminal exons
-- some single-exon genes are intronless
Eukaryotic gene prediction algorithms distinguish several kinds of exons.

Gene-finding algorithms
Homology-based searches (“extrinsic”): rely on previously identified genes.
Algorithm-based searches (“intrinsic”): investigate nucleotide composition, open reading frames, and other intrinsic properties of genomic DNA.
(Refer to Chapter 12, Completed genomes, Figure 12.17 for a list of extrinsic vs. intrinsic algorithms.)

Extrinsic, homology-based searching: compare genomic DNA to expressed genes (ESTs).
[Figure labels: DNA, intron, RNA, protein]
Intrinsic, algorithm-based searching: identify open reading frames (ORFs). Compare DNA in exons (unique codon usage) to DNA in introns (unique splice sites) and to noncoding DNA.

Finding genes in eukaryotic DNA
Cautionary notes:
-- The quality of EST sequence is sometimes low
-- Highly expressed genes are disproportionately represented in many cDNA libraries
-- ESTs provide no information on genomic location

Both intrinsic and extrinsic algorithms vary in their rates of false-positive and false-negative gene identification. Programs such as GENSCAN and Grail account for features such as the nucleotide composition of coding regions and the presence of signals such as promoter elements.
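The "identify open reading frames" step of intrinsic searching can be illustrated with a minimal Python sketch. It is not how GENSCAN or Grail work; it only scans all six reading frames for an ATG followed in frame by a stop codon. The function names, the `min_codons` parameter, and the standard-genetic-code stop codons are assumptions made for illustration. In eukaryotic DNA, introns interrupt coding sequence, so a scan like this is at best a first pass.

```python
# Minimal ORF scan (illustrative sketch, not a real gene finder).
# Assumes the standard genetic code and an uppercase/lowercase ACGT input.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) tuples for all six reading frames.

    start and end are 0-based, end-exclusive offsets on the scanned strand
    (for "-" they refer to the reverse-complemented sequence).
    min_codons counts the start and stop codons as well.
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGCC"):
        print(orf)
```

Reporting offsets on the scanned strand keeps the sketch short; a real tool would map reverse-strand hits back to forward-strand coordinates.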
Genome sequencing projects
There are three main resources for genomes:
-- EBI: European Bioinformatics Institute, http://www.ebi.ac.uk/genomes/
-- NCBI: National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov
-- TIGR: The Institute for Genomic Research, http://www.tigr.org

C value paradox: why eukaryotic genome sizes vary
The haploid genome size of eukaryotes, called the C value, varies enormously.
Small genomes include:
-- Encephalitozoon cuniculi (2.9 Mb)
-- A variety of fungi (10-40 Mb)
-- Takifugu rubripes (pufferfish) (365 Mb) (same number of genes as other fish or as the human genome, but 1/10th the size)
Large genomes include:
-- Pinus resinosa (Canadian red pine) (68 Gb)
-- Protopterus aethiopicus (marbled lungfish) (140 Gb)
-- Amoeba dubia (amoeba) (690 Gb)

Genome sizes in nucleotide base pairs
[Figure: ranges of genome size, on an axis from 10^4 to 10^11 bp, for plasmids, viruses, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals]
The size of the human genome is ~3 × 10^9 bp; almost all of its complexity is in single-copy DNA.
The human genome is thought to contain ~30,000-40,000 genes.
http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt

C value paradox: why eukaryotic genome sizes vary
The range in C values does not correlate well with the complexity of the organism. This phenomenon is called the C value paradox.
Why? Britten and Kohne (1968) identified repetitive DNA classes.
Reassociation kinetics: isolate genomic DNA, shear it, denature (melt) it, and measure the rate of DNA reassociation.

Protein-coding genes in eukaryotic DNA: a new paradox
Why is the number of protein-coding genes about the same for worms, flies, plants, and humans? This has been called the N-value paradox (number of genes) or the G value paradox (number of genes).

Transcription factor databases
In addition to identifying repetitive elements and genes, it is also of interest to predict the presence of genomic DNA features such as promoter elements and GC content.
Websites that predict transcription factor binding sites and related sequences:
-- AliBaba2 (http://www.gene-regulation.de/)
-- Eukaryotic Promoter Database (http://www.epd.isb-sib.ch)
-- PlantProm (http://mendel.cs.rhul.ac.uk)

Five main classes of repetitive DNA
1. Interspersed repeats (RNA/DNA transposon-derived) -- approx. 45% of the human genome (e.g. LINEs, SINEs, Alu)
2. Processed pseudogenes (gene loss)
3. Simple sequence repeats -- microsatellites (1-12 bp); minisatellites (12-500 bp)
4. Segmental duplications -- blocks of about 1 kb to 300 kb that are copied intra- or interchromosomally (5% of the human genome)
5. Blocks of tandem repeats -- includes telomeric and centromeric repeats; can span millions of bp (often species-specific)

Chronology of genome sequencing projects
1977: First viral genome (Sanger et al., bacteriophage φX174; 11 genes)
1981: Human mitochondrial genome, 16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA). Today, over 400 mitochondrial genomes have been sequenced.
1986: Chloroplast genome, 156,000 base pairs (most are 120 kb to 200 kb)
1995: Haemophilus influenzae genome sequenced
1996: Saccharomyces cerevisiae (1st eukaryotic genome) and an archaeal genome, Methanococcus jannaschii
1997: More bacteria and archaea; Escherichia coli, 4.6 megabases, 4200 proteins (38% of unknown function)
1998: Nematode Caenorhabditis elegans (1st multicellular organism), 97 Mb; 19,000 genes
1999: First human chromosome: chromosome 22 (49 Mb, 673 genes)
2000: Drosophila melanogaster (13,000 genes); the plant Arabidopsis thaliana; human chromosome 21
2001: Draft sequence of the human genome (public consortium and Celera Genomics)

You can find functional annotation through the COGs database (Clusters of Orthologous Groups).

Overview of genome analysis
[1] Selection of genomes for sequencing
[2] Sequence one individual genome, or several?
[3] How big are genomes?
[4] Genome sequencing centers
[5] Sequencing genomes: strategies
[6] When has a genome been fully sequenced?
[7] Repository for genome sequence data
[8] Genome annotation

Overview of genome analysis
[1] Selection of genomes for sequencing is based on criteria such as:
• genome size (some plant genomes are far larger than the human genome)
• cost
• relevance to human disease (or other disease)
• relevance to basic biological questions
• relevance to agriculture

[2] Sequence one individual genome, or several?
-- Each genome center may study one chromosome from an organism
-- It is necessary to measure polymorphisms (e.g. SNPs) in large populations
For viruses, thousands of isolates may be sequenced. For the human genome, cost is the impediment.

[3] How big are genomes?
Viral genomes: 1 kb to 350 kb (Mimivirus: 800 kb)
Bacterial genomes: 0.5 Mb to 13 Mb
Eukaryotic genomes: 8 Mb to 686 Gb

[4] Twenty genome sequencing centers contributed to the public sequencing of the human genome. Many of these are listed at the Entrez genomes site. (See Table 17.6, page 625.)

[5] There are two main strategies for sequencing genomes:
a) Whole genome shotgun (WGS) method -- applied to the entire genome all at once (sequenced fragments are ordered by alignment of their overlaps)
VERSUS
b) Hierarchical shotgun method -- applied to large overlapping DNA fragments of known location in the genome (assemble contigs of clones covering each chromosome, then systematically sequence them and reassemble the complete sequence)

[6] When has a genome been fully sequenced?
A typical goal is to obtain five- to ten-fold coverage.
Finished sequence: a clone insert is contiguously sequenced to a high quality standard, with an error rate of 0.01%. There are usually no gaps in the sequence.
Draft sequence: clone sequences may contain several regions separated by gaps. The true order and orientation of the pieces may not be known.
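To make "five- to ten-fold coverage" concrete, here is a minimal Python sketch of the usual arithmetic: coverage is the total number of sequenced bases divided by the genome size. The second function uses the idealized Lander-Waterman (Poisson) model, which is an assumption not taken from the slides; under it, the fraction of bases left unsequenced at coverage c is roughly e^(-c). The function names and the example numbers are hypothetical.

```python
import math


def fold_coverage(num_reads, read_length_bp, genome_size_bp):
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length_bp / genome_size_bp


def expected_fraction_uncovered(coverage):
    """Fraction of bases hit by no read, assuming reads land uniformly
    at random (idealized Lander-Waterman / Poisson model)."""
    return math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical project: 1,000,000 reads of 500 bp on a 100 Mb genome.
    c = fold_coverage(1_000_000, 500, 100_000_000)
    print(f"coverage: {c:.1f}x")
    print(f"expected uncovered fraction: {expected_fraction_uncovered(c):.4f}")
    print(f"at 10x: {expected_fraction_uncovered(10):.6f}")
```

Real genomes fall short of this model because repeats and cloning bias make read placement non-uniform, which is one reason draft sequences still contain gaps at nominally high coverage.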
[7] Repository for genome sequence data
Raw data from many genome sequencing projects are stored at the trace archive at NCBI or EBI (main NCBI page, bottom right).

[8] Genome annotation
Information content in genomic DNA includes:
-- repetitive DNA elements
-- nucleotide composition (GC content)
-- protein-coding genes, other genes

How can whole genomes be compared?
-- molecular phylogeny
-- You can BLAST (or PSI-BLAST) all the DNA and/or protein in one genome against another
-- TaxPlot and COG for bacterial (and for some eukaryotic) genomes
-- PipMaker, MUMmer and other programs align large stretches of genomic DNA from multiple species
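Of the annotation features listed under [8], nucleotide composition is the simplest to compute directly. The following minimal Python sketch reports GC content for a whole sequence and in sliding windows. The function names and the default `window` and `step` values are assumptions chosen for illustration, and ambiguous bases such as N are simply counted as non-GC.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window=100, step=50):
    """Return (offset, gc_fraction) for each window along the sequence.

    Sequences shorter than one window yield a single entry covering
    the whole sequence.
    """
    last_start = max(len(seq) - window + 1, 1)
    return [(i, gc_content(seq[i:i + window]))
            for i in range(0, last_start, step)]


if __name__ == "__main__":
    example = "GGGGCCCCAAAATTTT"
    print(gc_content(example))
    print(gc_windows(example, window=8, step=8))
```

Windowed values are what make GC content useful for annotation, since shifts in composition along a chromosome can hint at gene-rich or repeat-rich regions.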