Bioinformatics and Genomics

Central dogma of molecular biology
DNA -> RNA -> protein
genome -> transcriptome -> proteome

Central dogma of bioinformatics and genomics
• DNA (genome): 30,000 genes
• RNA (transcriptome): 15,000 transcripts/cell
• protein (proteome): >>15,000 proteins/cell

Integrative biology in the postgenomic era
• Levels of organization: DNA, RNA, protein, informational pathways, informational networks, cells, organs, individuals, populations, ecologies
• Disciplines: genomics, functional genomics, proteomics, metabolomics, systems biology, cell biology, medicine and physiology, medicine, genetics, ecology

Genomics: functional genomics and structural genomics

Functional genomics
-- Analysis of RNA: gene expression and microarrays
-- Overview of proteins and proteomics
[Figure: DNA, RNA, cDNA, protein, phenotype]

Gene expression is regulated in several basic ways
• by region (e.g. brain versus kidney)
• in development (e.g. fetal versus adult tissue)
• in dynamic response to environmental signals (e.g. immediate-early response genes)

Genomics technologies
• Automated DNA sequencing
• Automated annotation of sequences
• DNA microarrays
  – gene expression (measure RNA levels)
  – single nucleotide polymorphisms (SNPs)

The human genome is the complete DNA content of the 23 pairs of human chromosomes (44 autosomes plus two sex chromosomes): approximately 3 billion base pairs in the haploid genome.

How does genome sequencing technology work?
• Molecular biology of the Sanger method
• Manual gels vs. ABI machines
• Sub-cloning of fragments: BAC, PAC, cosmid, plasmid, phage
• The need for computers to assemble the "reads" and manage the workflow

Automated sequencing machines, particularly those made by PE Applied Biosystems, use 4 colors, so they can read all 4 bases at once.

All the genes?
• Any human gene can now be found in the genome by similarity searching with over 95% certainty.
• However, the sequence still has many gaps
  – an uninterrupted genomic segment is unlikely to be found for any given gene
  – pseudogenes still cannot be identified with certainty
• This will improve as more sequence data accumulates

Finding genes in genome sequence is not easy
• About 2% of human DNA encodes functional genes.
• Genes are interspersed among long stretches of non-coding DNA.
• Repeats, pseudogenes, and introns confound matters.

Impact on bioinformatics
• Genomics produces high-throughput, high-quality data, and bioinformatics provides the analysis and interpretation of these massive data sets.
• It is impossible to separate genomics laboratory technologies from the computational tools required for data analysis.

[Figure: timeline, 4 to 0 billion years ago (BYA). Hadean, Archean, Proterozoic and Phanerozoic eons; origin of life; earliest fossils; origin of eukaryotes; plant/animal and fungi/animal divergences; insects.]
[Figure: timeline, 1000 to 0 million years ago (MYA). Proterozoic and Phanerozoic eons; deuterostome/protostome and echinoderm/chordate divergences; Cambrian explosion; land plants; insects; Age of Reptiles ends.]
[Figure: timeline, 100 to 0 MYA. Mass extinction; dinosaurs extinct; mammalian radiation; human/chimp divergence.]
[Figure: timeline, 10 to 0 MYA. Homo sapiens/chimp divergence; Australopithecus ("Lucy"); earliest stone tools; emergence of Homo erectus.]
[Figure: timeline, 1,000,000 to 100,000 years ago. Homo erectus emerges in Africa; Mitochondrial Eve.]
[Figure: timeline, 100,000 to 10,000 years ago. Emergence of anatomically modern H. sapiens; Neanderthal and Homo erectus disappear.]
[Figure: timeline, 10,000 to 0 years ago. "Ice Man" from the Alps; earliest pyramids.]
[Figure: the three domains of life: archaea, bacteria, eukaryota.]

http://www.ncbi.nlm.nih.gov/Entrez/
• Overview of viral complete genomes
• Overview of archaea complete genomes
• Overview of eukaryota genomes in NCBI's Entrez division

Chronology
1977: Sanger et al. sequence bacteriophage φX174
1981: Human mitochondrial genome (16,500 base pairs)
1986: Chloroplast genome (156,000 base pairs)
[Figure labels: mitochondrion; chloroplast; lack mitochondria (?)]
1995: genome of the bacterium Haemophilus influenzae is sequenced
• Overview of bacterial complete genomes
• COGs database: organisms and tools
• COGs database: functional annotation
1996: a yeast genome is sequenced

Key yeast databases
• Saccharomyces Genome Database (SGD)
  http://genome-www.stanford.edu/Saccharomyces/
• MIPS Comprehensive Yeast Genome Database (MIPS = Munich Information Center for Protein Sequences)
  http://mips.gsf.de/proj/yeast/CYGD/db/
• PomBase, an S. pombe ACeDB database
  http://www.sanger.ac.uk/Projects/S_pombe/pombase.shtml

Genome sizes in nucleotide base pairs
[Figure: ranges of genome size on a log scale from 10^4 to 10^11 bp for plasmids, viruses, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds and mammals. Source: http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt]
• The size of the human genome is ~3 × 10^9 bp; almost all of its complexity is in single-copy DNA.
• The human genome is thought to contain ~30,000-40,000 genes.

1999: Human chromosome 22 sequenced
47 Mb, 780 genes

2000

Completed genome projects
Eukaryotes: 9
• Anopheles gambiae
• Arabidopsis thaliana
• Caenorhabditis elegans
• Drosophila melanogaster
• Encephalitozoon cuniculi
• Guillardia theta nucleomorph
• Plasmodium falciparum
• Saccharomyces cerevisiae (yeast)
• Schizosaccharomyces pombe
Eukaryotes in progress (partial):
• Danio rerio (zebrafish)
• Glycine max (soybean)
• Hordeum vulgare (barley)
• Leishmania major
• Rattus norvegicus
Bacteria: 132
Archaea: 16
Viruses: 1413

Six basic questions about genomes
[1] How is a genome sequenced?
[2] When is the project finished?
[3] Sequence one individual or many?
[4] What information is in the DNA?
[5] How many genes are in the genome?
[6] How can whole genomes be compared?

[1] Genome projects: sequencing strategies

Hierarchical shotgun method
Assemble contigs from various chromosomes, then sequence and assemble them.

A contig is a set of overlapping clones or sequences from which a sequence can be obtained. The sequence may be draft or finished. A contig is thus a chromosome map showing the locations of those regions of a chromosome where contiguous DNA segments overlap. Contig maps are important because they make it possible to study a complete, and often large, segment of the genome by examining a series of overlapping clones, which together provide an unbroken succession of information about that region.

Scaffold: an ordered set of contigs placed on a chromosome.

Shotgun
An approach used to decode an organism's genome by shredding it into smaller fragments of DNA which can be sequenced individually. The sequences of these fragments are then ordered, based on overlaps in the genetic code, and finally reassembled into the complete sequence.
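
The idea of ordering fragments by their overlaps and merging them can be illustrated with a toy greedy assembler. This is only a minimal sketch under strong assumptions (error-free reads, a single strand, no repeats, no read contained inside another); it is not the algorithm used by any production assembler, and the function names `overlap` and `greedy_assemble` are made up for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Stops when no pair overlaps by at least `min_len` bases and returns
    the remaining sequences (the toy "contigs").
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three hypothetical reads cut from one short sequence, given out of order.
contigs = greedy_assemble(["AATAGGTT", "ACGATTAC", "TTACAATA"])
print(contigs)
```

Real reads contain sequencing errors and come from both strands, and genomes contain repeats, which is why practical assemblers rely on approximate overlaps, paired-read distances and consensus calling rather than exact suffix matching.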
The 'whole genome shotgun' method is applied to the entire genome all at once, while the 'hierarchical shotgun' method is applied to large, overlapping DNA fragments of known location in the genome.
http://www.genome.gov/glossary.cfm

Whole genome shotgun sequencing
The genome is cut many times at random and the fragments are cloned into
• plasmids (2-10 kbp)
• cosmids (40 kbp)
[Figure: forward-reverse linked reads of ~500 bp from each end of a clone, separated by a known distance.]

ARACHNE: whole genome shotgun assembly
1. Find overlapping reads
2. Merge good pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence (..ACGATTACAATAGGTT..)
http://www-genome.wi.mit.edu/wga/

[2] When is the project finished?
Get five- to ten-fold coverage.
Finished sequence: a clone insert is contiguously sequenced to a high quality standard, with an error rate of 0.01%. There are usually no gaps in the sequence.
Draft sequence: clone sequences may contain several regions separated by gaps. The true order and orientation of the pieces may not be known.

[3] Sequence one individual or many?
Try one…
-- Each genome center may study 1 chromosome
-- Measure polymorphisms (e.g. SNPs) in large populations

[4] What information is in the DNA?
-- repetitive DNA elements
-- nucleotide composition (GC content)
-- protein-coding genes, other genes

DNA reassociation (renaturation)
[Figure: denatured, single-stranded DNA reassociating into double-stranded DNA. Source: http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt]
• Step 1 (rate constant k2): the slower, rate-limiting, second-order process of finding complementary sequences to nucleate base-pairing
• Step 2: the faster, zippering reaction that forms long molecules of double-stranded DNA

Britten and Kohne (1968) provided evidence for large amounts of repetitive DNA in many genomes
[Figure: Cot curve. y-axis: % DNA reassociated (0 to 100); x-axis: log Cot. Fast component ~10%, intermediate component ~15%, slow (single-copy) component ~75%.]

Repetitive DNA sequences: five classes
[1] Interspersed repeats: transposon-derived repeats
-- 45% of the human genome; LTR, SINE, LINE
[2] Processed pseudogenes
[3] Simple sequence repeats
-- micro- and minisatellites
-- ACAAACT, 11 million times in a Drosophila species
-- the human genome has 50,000 CA dinucleotide repeats
[4] Segmental duplications (about 5% of the human genome)
[5] Tandem repeats (e.g. telomeres, centromeres)

LINE and SINE repeats
A LINE (long interspersed nuclear element) encodes a reverse transcriptase (RT) and perhaps other proteins. Mammalian genomes contain an old LINE family, called LINE2, which apparently stopped transposing before the mammalian radiation, and a younger family, called L1 or LINE1, many of which were inserted after the mammalian radiation (and are still being inserted). A SINE (short interspersed nuclear element) generally moves using RT from a LINE. Examples include the MIR elements, which co-evolved with the LINE2 elements. Since the mammalian radiation, each lineage has evolved its own SINE family: primates have Alu elements and mice have B1, B2, etc. The insertion of a LINE or SINE into the genome causes a short sequence (7-21 bp for Alus) to be repeated, with one copy (in the same orientation) at each end of the inserted sequence. Alus have accumulated preferentially in GC-rich regions, L1s in GC-poor regions.
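
Simple sequence repeats such as the CA dinucleotide runs and the ACAAACT heptamer listed among the repeat classes can be located with a regular expression. The sketch below is only an illustration: the function name `find_simple_repeats`, the `min_copies` threshold and the example sequence are assumptions made here, and it finds exact, uninterrupted copies on one strand only.

```python
import re


def find_simple_repeats(seq: str, unit: str = "CA", min_copies: int = 4):
    """Return (start, end, copies) for each uninterrupted run of `unit`.

    Only runs with at least `min_copies` exact copies are reported.
    Coordinates are 0-based, end-exclusive.
    """
    pattern = re.compile(rf"(?:{re.escape(unit)}){{{min_copies},}}")
    hits = []
    for match in pattern.finditer(seq.upper()):
        copies = (match.end() - match.start()) // len(unit)
        hits.append((match.start(), match.end(), copies))
    return hits


# Hypothetical sequence: a (CA)5 run, then a (CA)2 run below the threshold.
example = "GGT" + "CA" * 5 + "TTG" + "CACA" + "GG"
print(find_simple_repeats(example))  # [(3, 13, 5)]
```

Dedicated repeat-annotation tools also tolerate mismatches and search both strands, which this exact-match sketch does not attempt.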
What is the function of nongenic DNA?
Hypotheses:
• Nongenic DNA performs essential functions, such as regulation of gene expression.
• Nongenic DNA is inert, genetically and physiologically. Excess DNA is incidental and is called "junk DNA."
• Nongenic DNA is a functional parasite or selfish DNA (retrotransposons).
• Nongenic DNA has a structural function.

GC content varies across genomes
[Figure: histograms of the number of species in each GC class for bacteria, plants, invertebrates and vertebrates. x-axis: GC content (%), 20 to 80.]

[5] How many genes are in a genome?
This depends on how a gene is defined (e.g. protein-coding versus noncoding). It also depends on what methods are used to find genes, and what criteria are applied to determine whether they are "real" (functional).

Classification of DNA
FUNCTIONAL (sequences that perform a function)
- Coding (translated into protein)
- Non-coding (not translated)
  * Transcribed (functions at the RNA level: ribosomal subunits)
  * Not transcribed (functions at the DNA level: intron, promoter, enhancer, etc.)
NON-FUNCTIONAL (sequences that perform no function: "junk DNA")

Gene-finding algorithms
• Homology-based searches ("extrinsic"): rely on previously identified genes
• Algorithm-based searches ("intrinsic"): investigate nucleotide composition, open reading frames, and other intrinsic properties of genomic DNA
[Figure: DNA (with introns), RNA, mature RNA, protein]

Homology-based searching: compare DNA to expressed genes (ESTs).
[Figure: DNA (with introns), RNA, protein]

Algorithm-based searching: compare DNA in exons (unique codon usage) to introns (unique splice sites) to noncoding DNA. Identify open reading frames (ORFs).
[Figure: DNA, RNA]

[5'] How many genes are in the human genome?
One answer is about 30,000. BUT how many genes?…
-- A lot more than a fungus (6,000)
-- Somewhat more than a fly (13,000) or a worm (19,000)
-- About the same as a plant (Arabidopsis, 25,000)
-- Two groups estimate 30,000 to 35,000, but there is only partial overlap in their gene lists!
-- One Drosophila gene potentially yields 38,000 distinct proteins by alternative splicing.
-- A microarray-based survey of chromosomes 21 and 22 finds 10 times more transcripts than are annotated

[6] How can whole genomes be compared?
-- molecular phylogeny
-- You can BLAST (or PSI-BLAST) all the DNA and/or protein in one genome against another
-- We looked at TaxPlot and COG for bacterial (and for some eukaryotic) genomes
-- PipMaker and other programs align large stretches of genomic DNA from multiple species

Resources to study the human genome
• NCBI: www.ncbi.nlm.nih.gov
• The Sanger Institute/European Bioinformatics Institute: www.ensembl.org
• UCSC Genome Bioinformatics Site: http://genome.ucsc.edu/

Top ten challenges for bioinformatics
[1] Precise models of where and when transcription will occur in a genome (initiation and termination)
[2] Precise models of RNA splicing
[3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli
[4] Determining protein:DNA, protein:RNA, protein:protein recognition codes
[5] Accurate ab initio protein structure prediction
[6] Rational design of small molecule inhibitors of proteins
[7] Mechanistic understanding of protein evolution
[8] Mechanistic understanding of speciation
[9] Development of effective gene ontologies: systematic ways to describe gene and protein function
[10] Education: development of bioinformatics curricula

Comparative genomics using ACT, the Artemis Comparison Tool

Artemis Comparison Tool (ACT)
• Based on Artemis and coded in Java.
• Allows visualisation of two or more sequences together with a comparison file.
• The comparison file can be BLASTn or tBLASTx.
• Retains all the functionality of Artemis.

The ACT display
[Figure: ACT window showing genome1, genome2 and genome3 with BLAST HSPs drawn between adjacent genomes, a zoom scroll bar and a filter scroll bar.]

Running ACT
[Figure: sequence 1 and sequence 2 are compared with BLASTn or tBLASTx; the output is reformatted with MSPcrunch and loaded into ACT.]
• Designed for looking at complete bacterial genomes.
[Figure: ACT comparison (tBLASTx) of P. knowlesi contigs, P. falciparum chromosome 3 and P. yoelii contigs (TIGR).]

Orthologue & Paralogue
• Orthologue: homologous genes in different organisms that arose by speciation and typically retain the same function.
• Paralogue: homologous genes in the same organism that originated from gene duplication.
[Figure sequence: Gene A and Gene B in Species 1 and Species 2, shown diverging and then with copies lost step by step, until Species 1 retains only Gene A and Species 2 only Gene B.]

T. brucei vs L. major (cont.)

T. brucei vs T. cruzi
L. major has a break in synteny that is conserved in T. brucei and T. cruzi.
[Figure: T. cruzi chr 3, T. brucei chr 1, T. brucei chr 6, L. major chr 12]

Software
• www.sanger.ac.uk/Software/Artemis
• www.sanger.ac.uk/Software/ACT
• www.genome.nghri.nih.gov/blastall
• www.cgr.ki.se/cgr/goups/sonnhammer/MSPcrunch.html
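
Candidate orthologues between two genomes are often shortlisted with the reciprocal best hit heuristic: gene a in genome A and gene b in genome B are paired if each is the other's top-scoring match. The slides do not describe this procedure, so the sketch below is an assumption added for illustration; the gene names and scores are invented, and the score tables stand in for parsed output of an all-against-all similarity search such as BLAST.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) pairs to a similarity score,
    for example a BLAST bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores: A1 and B1 pick each other; A2 and B2 do not.
a_vs_b = {("A1", "B1"): 200, ("A1", "B2"): 150,
          ("A2", "B1"): 120, ("A2", "B2"): 90}
b_vs_a = {("B1", "A1"): 198, ("B2", "A1"): 140, ("B2", "A2"): 95}
print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('A1', 'B1')]
```

Reciprocal best hits can mislabel paralogues as orthologues after lineage-specific gene loss, the situation drawn in the final orthologue/paralogue figure, so the pairs are best treated as candidates to check against synteny or a gene tree.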