Genomes Databases and Open Access Bibliographic Resources
Antonio Basílio de Miranda
Laboratório de Genômica Funcional e Bioinformática
Instituto Oswaldo Cruz, Fundação Oswaldo Cruz
Rio de Janeiro - Brazil

Outline
• General introduction and overview of complete genome sequences
• Genome databases and where to find them
• Comparative genomics databases
• Other omics resources
• Bibliographic/open access resources

Why use databases?
In the genomic era we have billions of data items that need to be stored, curated and made accessible for analysis and knowledge discovery. Databases are essential resources for both experimental and computational biologists. We have crossed the terabyte threshold of genomic data.

And what is a database system?
From the Oxford Dictionary:
• Database: an organized body of related information.
• Database system, DataBase Management System (DBMS): a software system that facilitates the creation, maintenance and use of an electronic database.

Common database models:
• Hierarchical
• Network
• Relational
• Object-relational
• Object

Other models:
• Associative
• Concept-oriented
• Entity-Attribute-Value
• Multi-dimensional
• Semantic data model
• Semi-structured
• Star schema
• XML database

What is stored:
• Nucleotide sequences
• Protein sequences
• Genomes
• Patterns
• Structures
• Etc.

Some problems:
• Different data formats and technologies
• Different types of data
• Size
• Redundancy
• “Hereditary” mistakes
• Inconsistent annotations

Different formats – C. trachomatis pyruvate kinase

Completely sequenced genomes – a timeline
• 1977: first viral genome (5386 base pairs; encoding 11 genes). Sanger et al. sequence bacteriophage φX174.
• 1981: human mitochondrial genome. 16,500 base pairs (encodes 13 proteins, 2 rRNAs, 22 tRNAs).
• 1986: chloroplast genome. 156,000 base pairs (most are 120 kb to 200 kb).
• 1995: first genome of a free-living organism, the bacterium Haemophilus influenzae, by TIGR; 1830 kb, 1713 genes.
• 1996: first archaeal genome: Methanococcus jannaschii DSM 2661, by TIGR; 1664 kb, 1773 genes.
• 1997: first eukaryotic genome: Saccharomyces cerevisiae S288C; international collaboration; 16 chromosomes; 12,057 kb, ~6000 genes.
• 1998: first multicellular organism, the nematode Caenorhabditis elegans; 97 Mb; ~19,000 genes.
• 1999: first human chromosome: chromosome 22 (49 Mb, 673 genes).
• 2000: fruit fly Drosophila melanogaster (137 Mb; ~13,000 genes).
• 2000: first plant genome: Arabidopsis thaliana (115,428 kb; 22,670 genes).
• 2001: draft sequence of the human genome (3300 Mb; ~28,000 genes).
• 2002: Plasmodium falciparum (22.9 Mb; 5334 genes).
• 2002: mouse genome (2700 Mb; ~28,000 genes).
• 2004: fish draft genome, Tetraodon nigroviridis (x Mb; ~28,000 genes).
• 2005: dog (41 Mb, 33,651 genes) and chicken (18,031 genes) genomes.
• 2007: James Watson's genome is sequenced.
• 2007: Craig Venter publishes the results of his own sequenced genome.
• October 2013: deadline for the X Prize Foundation challenge to sequence 100 human genomes for less than $10,000 each.

www.genomesonline.org
3825 projects (06-29-08):
• 827 published
• 1842 bacteria
• 90 archaea
• 936 eukaryotes
• 130 metagenomes

Genome sequencing projects
Several web-based resources document the progress of completely sequenced genomes and their reference publications, including:
• GOLD - Genomes Online Database: http://www.genomesonline.org/gold.cgi

How big are genome sizes?
• Viral genomes: 1 kb to 360 kb (Canarypox virus). Note: Mimivirus: 1.2 Mb (http://www.giantvirus.org/top.html, top 100 largest viral genome sequences)
• Bacterial genomes: 0.5 Mb to 13 Mb
• Eukaryotic genomes: 8 Mb to 670 Gb
• Database of Genome Sizes: http://www.cbs.dtu.dk/databases/DOGS/index.php

Genome size and database increase
[Figure: bar chart of genome size (axis 0 to 3500) for E. coli, yeast, worm, fly, Fugu and human]

BIOLOGICAL DATABASE CATEGORIES
• Databases of nucleic acid sequences (RNA, DNA)
• Databases of protein sequences
• Databases of protein motifs and protein domains
• Databases of structures
• Databases of genomes
• Databases of genes
• Databases of expression profiles
• Databases of SNPs and mutations
• Databases of metabolic pathways and protein associations
• Databases of taxonomy
• …

Can we find a list of ‘clean’ databases?
• The NAR database issue
• The 2008 update includes 1078 databases, 110 more than the previous one.
• 98 new databases
• Updates of 84 existing databases
• 25 obsolete databases removed!
• The complete database list and summaries are available online on the Nucleic Acids Research web site: http://nar.oxfordjournals.org/

NAR database category list
• Nucleotide Sequence Databases
• RNA sequence databases
• Protein sequence databases
• Structure Databases
• Genomics Databases (non-vertebrate)
• Metabolic and Signalling Pathways
• Human and other Vertebrate Genomes
• Human Genes and Diseases
• Microarray Data and other Gene Expression Databases
• Proteomics Resources
• Other Molecular Biology Databases
• Organelle databases
• Plant databases
• Immunological databases

Genomics Databases (non-vertebrate)
– MGD - Mouse Genome Database
– TIGR Gene Indices
– Genome annotation terms, ontologies and nomenclature
– Taxonomy and identification
– General genomics databases
– Viral genome databases
– Prokaryotic genome databases
– Unicellular eukaryotes genome databases
– Fungal genome databases
– Invertebrate genome databases

Three types of genome databases:
• Databases that collect data for all sequenced genomes (Entrez_Genomes; EBI_genomes)
• Databases that collect data for one category of organisms with sequenced genomes (Microbial Genomes at TIGR)
• Databases specific to one organism with a sequenced genome (Flybase, MGD, Ensembl)

What kind of information do you find there?
• Genome databases contain genomic information collected from many sources:
– Genome assembly
– Gene predictions
– Known genes, mRNA, ESTs, proteins
– Genetic maps, markers and polymorphisms
– Gene expression and phenotypes
– Annotations
– Interspecies homologues

Resources for genomes
There are two main resources for genomes:
• EBI, European Bioinformatics Institute: http://www.ebi.ac.uk/genomes/
• NCBI, National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/Genomes/

Many other resources are provided by sequencing institutions:
• Sanger, The Wellcome Trust Sanger Institute: http://www.sanger.ac.uk/
• TIGR, The Institute for Genomic Research: http://cmr.tigr.org/tigr-scripts/CMR/shared/Genomes.cgi
• Genolevures: http://cbi.labri.fr/Genolevures/index.php

Databases by phylogenetic groups
• Eukaryotic genomes: http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi
• Bacteria, fungi genomes: http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi?p3=11:Fungi&taxgroup=11:Fungi|12:
• Insects: http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi?p3=12:Insects&taxgroup=11:|12:Insects
• Plant genomes: http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html
• ...

The Entrez System
[Figure: Entrez databases: OMIM, PubMed, PubMed Central, 3D Domains, Journals, Structure, Books, CDD/CDART, Protein, Taxonomy, Genome, GEO/GDS, UniSTS, UniGene, Nucleotide, SNP, PopSet]
[Figure: map display labels: WGS, Other GenBank, RefSeq Contig, BAC, RefSeq Transcript, Mouse assembly, UniGene Transcript, Maps and Options]

Some common features of genomic databases:
• Possibility to download all the sequences of the genome or part of them (chromosomes, clones, genes, CDS, ...)
• Most of them have a corresponding protein resource (the set of proteins obtained by translating all CDS: conceptual translation)
• Example: Entrez Genome at NCBI, GenPept

Comparative genomics
Analyses of the genetic material of different species help us understand the similarities and differences between genomes, their evolution and the evolution of their genes.
• Intra-genomic comparisons help in understanding the degree of duplication (genome regions; genes) and the organization of genes, ...
• Inter-genomic comparisons help in understanding the degree of similarity between genomes and the degree of conservation between genes
• Understanding gene and genome evolution

Internet resources for whole-genome comparative analysis and associated tools
• UCSC Genome Bioinformatics: http://genome.ucsc.edu/
• Ensembl: http://www.ensembl.org/
• MapViewer: http://www.ncbi.nlm.nih.gov/mapview/
• VISTA Genome Browser: http://pipeline.lbl.gov/
• K-BROWSER: http://hanuman.math.berkeley.edu/cgi-bin/kbrowser2
• Comparative Regulatory Genomics: http://corg.molgen.mpg.de/
• GALA: http://www.bx.psu.edu/
• EnsMart: http://www.ensembl.org/EnsMart/
• ETOPE: http://www.bx.psu.edu/
• PipMaker and MultiPipMaker: http://www.bx.psu.edu/
• VISTA server: http://www-gsd.lbl.gov/vista/
• MAVID server: http://baboon.math.berkeley.edu/mavid/
• zPicture server: http://zpicture.dcode.org/
• rVISTA server: http://rvista.dcode.org/
• COGs: http://www.ncbi.nlm.nih.gov/COG/

NCBI Homo sapiens Genome: Statistics -- Build 36 version 2
Protein coding genes: 21,541

General considerations:
• Organism-specific databases can be more up to date than general databases.
• Genome databases are not a one-stop shop for all information; other databases like UniProt are still needed!

Bibliographic Databases and Open Access resources
• PubMed - http://www.pubmed.org/
• Access to more than 12 million papers since 1950 (3790 journals).
• Simple and advanced literature search by keywords, author name, MeSH terms, journal, single citation, ...
• Some papers are free from the journal website or through the editors.
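The search options listed above (keywords, author name, MeSH terms, journal) can also be combined programmatically. The following is a minimal sketch, assuming NCBI's E-utilities `esearch` endpoint and PubMed's bracketed field tags, neither of which is mentioned in the slides. The helper names `build_pubmed_term` and `build_esearch_url` are hypothetical, and the code only builds the request URL without contacting the server.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities esearch endpoint (not part of the original slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_term(keywords=(), author=None, mesh=None, journal=None):
    """Combine search criteria into one PubMed query string (hypothetical helper).

    Each criterion is tagged with a PubMed search field and all
    criteria are joined with AND.
    """
    parts = [f"{kw}[Title/Abstract]" for kw in keywords]
    if author:
        parts.append(f"{author}[Author]")
    if mesh:
        parts.append(f'"{mesh}"[MeSH Terms]')
    if journal:
        parts.append(f'"{journal}"[Journal]')
    return " AND ".join(parts)


def build_esearch_url(term, retmax=20):
    """Return an esearch URL for the PubMed database (no network access here)."""
    query = urlencode({"db": "pubmed", "term": term, "retmax": retmax})
    return f"{ESEARCH_URL}?{query}"


if __name__ == "__main__":
    term = build_pubmed_term(
        keywords=["genome"],
        mesh="Databases, Genetic",
        journal="Nucleic Acids Res",
    )
    print(term)
    print(build_esearch_url(term))
```

Keeping query construction separate from URL construction makes it possible to inspect the search string, or paste it into the PubMed search box, before sending any request.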
• Free access journals
• Authors pay to allow readers to get the papers for free
• The BMC initiative
• The PLoS initiative
• Other initiatives: some journals give immediate free online access and others give it a few (1-12) months after publication

The HINARI initiative
• The Health InterNetwork Access to Research Initiative (HINARI) provides free or very low cost online access to the major journals in biomedical and related social sciences to local, not-for-profit institutions in developing countries.
• HINARI was launched in January 2002, with some 1500 journals from 6 major publishers. 22 additional publishers joined in May 2002, bringing the total number of journals to over 2000.
• Today more than 70 publishers are offering their content in HINARI and others will soon be joining the programme.

http://citeseer.ist.psu.edu/

Laboratório de Genômica Funcional e Bioinformática, Instituto Oswaldo Cruz
• Wim Maurits Degrave – Senior Researcher
• Antonio Basílio de Miranda – Associate Researcher
• Nicolas Carels – Visiting Researcher
• Fábio Faria da Mota – Visiting Researcher
• Thomas Dan Otto – Visiting Researcher
• Marcos Catanho – PhD student (BCM – IOC)
• Ana Carolina Guimarães – PhD student (BCM – IOC)
• Flávio Engelke – MSc student (PCM - UERJ)
• Monete Rajão – MSc student (BCS – IOC)
• Erica Ramos Cardoso – PIBITI scholarship holder