Bioinformatics Information Resources and Networks
Ulf Schmitz
[email protected]
Bioinformatics and Systems Biology Group
www.sbi.informatik.uni-rostock.de

Outline
• Bioinformatics Information Resources and Networks
  – EMBnet – European Molecular Biology Network
    • DBs and Tools
  – NCBI – National Center for Biotechnology Information
    • DBs and Tools
  – Nucleic Acid Sequence Databases
  – Protein Information Resources
  – Metabolic Databases
  – Mapping Databases
  – Databases concerning Mutations
  – Literature Databases

EMBnet – European Molecular Biology Network
• Founded in 1988
• Network that links European laboratories that use biocomputing and bioinformatics in molecular biology research
• A science-based group of collaborating nodes throughout Europe, plus nodes outside Europe
• Provides information, services and training to its users
• Works to increase the availability and accessibility of data resources and computing tools
• Increases knowledge and proficiency in bioinformatics through education and training

EMBnet - Nodes
EMBnet (41 nodes)
• National Nodes (18): governmental
• Specialist Nodes (9): academic and industrial research centers
• Associate Nodes (11): biocomputing centers from non-European countries

EMBnet - Nodes
National Nodes
• Appointed by the governments
• Provide on-line services, user support and training

  Vienna Biocenter - Austria
  BEN - Belgium
  CSC - Finland
  INFOBIOGEN - France
  DKFZ - Germany
  HEN - Hungary
  INCBI - Ireland
  INN - Israel
  IEN-AdR - Italy
  CMBI - Netherlands
  Bio - Norway
  IBB - Poland
  PEN - Portugal
  GeneBee - Russia
  CNB-CSIC - Spain
  BMC - Sweden
  SIB - Switzerland
  SEQNET - UK

EMBnet - Nodes
Specialist Nodes
• Academic, industrial or research centers in specific areas of bioinformatics
• Largely responsible for maintenance of biological databases and software

  MIPS (Munich Information Center for Protein Sequences)
  ICGEB
  Pharmacia
  F. Hoffmann-La Roche
  EBI
  HGMP-RC
  Sanger
  UCL

• Hinxton Hall (Cambridge, UK): important key specialist node and home of the EMBL, SWISS-PROT and TrEMBL databases

EMBnet - Nodes
Associate Nodes
• Centers from non-European countries

  IBBM - Argentina
  ANGIS - Australia
  CBI - China
  CIGB - Cuba
  CDFD - India
  SANBI - South Africa
  EMBnet - Brazil
  CBR - Canada
  EMBnet - Chile
  EMBnet - Colombia
  CIFN - Mexico

EMBnet's Mission
• Assist in biotechnological and bioinformatics-related research
• Provide training and education
• Exploit network infrastructures
• Investigate and develop new technologies
• Bridge between commercial and academic sectors

Who are EMBnet's Users?
• More than 40,000 registered users from all over the world, as well as a larger number of Internet users
• All scientists working in the Life Sciences, from undergraduate students to top-level scientists, in academia as well as industry, can get support from EMBnet

EMBnet – SRS
Sequence Retrieval System - SRS
• Result of a research project within EMBnet, aimed at interrogating all the gathered resources together
• SRS is a network browser for DBs in molecular biology
• SRS allows any flat-file DB to be indexed to any other
• Supports queries across a range of different DB types via a single interface
• Independent of underlying data structures or query languages

EMBnet – EMBOSS
• The European Molecular Biology Open Software Suite
• EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community.
• The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web.
• Because extensive libraries are provided with the package, it is also a platform that allows other scientists to develop and release software in true open source spirit.
• EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole.

What can EMBOSS do for you?
• Within EMBOSS you will find hundreds of programs (applications) covering areas such as:
  – Sequence alignment,
  – Rapid database searching with sequence patterns,
  – Protein motif identification, including domain analysis,
  – Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats,
  – Codon usage analysis for small genomes,
  – Rapid identification of sequence patterns in large scale sequence sets,
  – Presentation tools for publication,
• and much more. Check: http://emboss.sourceforge.net/

Jemboss
[figure]

NCBI – National Center for Biotechnology Information
• Leading American information provider
• Established in 1988 as a division of the National Library of Medicine (NLM)
  – Located on the campus of the National Institutes of Health (NIH, Bethesda, Maryland)

Mission:
• Development of new information technologies to aid our understanding of the molecular and genetic processes that underlie health and disease
• Creation of systems for storing and analysing biological information
• Development of advanced methods of computer-based information processing
• Facilitation of user access to DBs and software
• Co-ordination of efforts to gather biotechnology information worldwide

NCBI
• Since 1992: maintenance of GenBank and collaboration with the international nucleotide DBs EMBL and DDBJ (Japan)
• Provides Entrez, which facilitates access to biological DBs (similar to the SRS provided by EMBnet)

NCBI - Responsibilities
• administers research on biomedical problems at the molecular level using mathematical and computational methods
• maintains collaborations with several NIH (National Institutes of Health) institutes, academia, industry, and other governmental agencies
• promotes scientific communication by sponsoring meetings, workshops, and lecture series
• supports training on basic and applied research in computational biology for postdoctoral fellows through the NIH Intramural Research Program
• engages members of the international scientific community in informatics research and training through the Scientific Visitors Program
• develops, distributes, supports, and coordinates access to a variety of databases and software for the scientific and medical communities
• develops and promotes standards for databases, data deposition and exchange, and biological nomenclature

Nucleic Acid Sequence Databases
• The principal nucleic acid sequence databases are GenBank, EMBL and DDBJ, which each collect a portion of the total sequence data reported world-wide, and exchange new and updated entries on a daily basis

Nucleic acid sequence databases:
  EMBL (Europe)
  GenBank (USA)
  DDBJ (Japan)
  ENSEMBL (project between EMBL-EBI and the Sanger Institute)
  dbEST (division of GenBank)
  GSDB (division of GenBank)

Nucleic Acid Sequence Databases – EMBL
The EMBL Database (as of yesterday morning) contains 115,478,836,243 nucleotides in 63,713,453 entries.

Entry Type                     Entries       Nucleotides
Standard                       52,092,157    55,843,115,059
Constructed (CON)              339,875       n/a
Third Party Annotation (TPA)   4,737         331,788,217
Whole Genome Shotgun (WGS)     11,275,863    58,772,358,766

source: http://www3.ebi.ac.uk/Services/DBStats/

Nucleic Acid Sequence Databases – EMBL
[Charts: number of entries (current 63,713,453); total nucleotides (current 115,478,836,243)]

Nucleic Acid Sequence Databases – EMBL
[Chart: share by nucleotide count. Labels: Zea mays (corn), Gallus gallus (chicken), Homo sapiens (human), environmental sequence, Danio rerio (zebrafish), Bos taurus, Canis familiaris (dog), Rattus norvegicus (rat), Pan troglodytes (chimpanzee), Mus musculus (mouse), Other]

Nucleic Acid Sequence Databases – GenBank
• GenBank, which is produced at NCBI, is split into smaller, discrete divisions.
• This facilitates fast, specific searches by restricting queries to particular database subsets.
• During 1992-1997, the level of EST and STS data within GenBank grew 10-fold.
• Even so, the overall sequence information contributed by such partial data was still less than that of the higher-quality sequences in the other major divisions.

Specialised Genomic Resources
• In addition to the comprehensive DNA sequence DBs, there is a variety of more specialised genomic resources.
• These so-called boutique DBs bring focus to species-specific genomics and to particular sequencing techniques.

Specialised Genomic Resources:
  SGD – Saccharomyces Genome Database
  UniGene – gene-oriented clusters from GenBank
  TIGR – Databases of The Institute for Genomic Research
  ACeDB – A C. elegans DataBase

Specialised Genomic Databases
• SGD http://genome-www.stanford.edu/Saccharomyces (baker's yeast)
• AceDB http://www.acedb.org (C. elegans)
• FlyBase http://flybase.bio.indiana.edu (fruit fly)
• MGD http://www.informatics.jax.org (mouse)

Protein Information Resources
Levels of protein sequence and structural organisation: primary, secondary, tertiary
• The primary structure of a protein is its amino acid sequence.
• The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands).
• The tertiary structure of a protein arises from the packing of its secondary structure elements, which may form discrete domains within a fold.

Protein Information Resources
Levels of protein sequence and structural organisation:

Level       Content                                   Stored in
primary     sequence: AVILDRYFH                       primary database
secondary   motif: [AS]-[IL]2-X[DE]-R-[FYW]2-H        secondary database
tertiary    domain, module                            structure database

Primary Protein Databases
• The primary structure of a protein is its amino acid sequence
• Sequences are stored in primary databases as linear alphabets that denote the constituent residues

Protein sequence databases:
  SWISS-PROT – Protein knowledgebase
  TrEMBL – Computer-annotated supplement to Swiss-Prot
  PIR – Protein Information Resource
  MIPS – Munich Information Centre for Protein Sequences
  NRL-3D – produced by PIR
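
The motif notation shown for the secondary level resembles a PROSITE-style pattern, and such patterns map naturally onto regular expressions. The following is a minimal sketch of that mapping, not the method any particular database uses. It assumes standard PROSITE conventions (x for any residue, (n) or (n,m) for repeats, {..} for excluded residues) and rewrites the slide's motif into that syntax, reading "[IL]2" as "[IL](2)" and "X[DE]" as "x-[DE]". The function name prosite_to_regex and the test sequence are hypothetical, and pattern anchors are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as "(2)" or "(2,4)".
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core.lower() == "x":
            core = "."                       # x: any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # {..}: any residue except these
        # Sets like [AS] and single letters are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


# Slide motif rewritten in PROSITE syntax (an assumption, see text above).
regex = prosite_to_regex("[AS]-[IL](2)-x-[DE]-R-[FYW](2)-H")

# Made-up sequence, used only to show a hit.
hit = re.search(regex, "MKAILKDRYFHQ")
print(regex)
print(hit.group(0) if hit else "no match")
```

Keeping the translation element by element makes it easy to see which part of a signature a given residue matched or failed, which is roughly what a pattern scan against a primary database has to decide for every sequence.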

Protein Sequence Databases
• Swiss-Prot contains 197,228 sequence entries, comprising 71,501,181 amino acids abstracted from 135,257 references
• Total number of species represented in Swiss-Prot: 9,520
• The average sequence length in Swiss-Prot is 362 amino acids.
• Swiss-Prot is the most highly annotated protein sequence DB

Table of the most represented species

No.   Frequency   Species
1     13049       Homo sapiens (Human)
2     10132       Mus musculus (Mouse)
3     5189        Saccharomyces cerevisiae (Baker's yeast)
4     4847        Escherichia coli
5     4669        Rattus norvegicus (Rat)
6     3665        Arabidopsis thaliana (Mouse-ear cress)
8     2863        Schizosaccharomyces pombe (Fission yeast)
7     2814        Bacillus subtilis
9     2750        Caenorhabditis elegans
10    2286        Drosophila melanogaster (Fruit fly)

Composite Protein Sequence Databases
• Composite databases amalgamate a variety of different primary databases
• They render sequence searching much more efficient, because they obviate the need to interrogate multiple resources
• Different composite databases use different primary sources and different redundancy criteria in their amalgamation procedures

Composite Protein Sequence Databases
Primary sources of each composite database (NRDB: Non-Redundant Database; SP+TrEMBL: SwissProt + TrEMBL):

NRDB               OWL          MIPSX        SP+TrEMBL
PDB                SWISS-PROT   PIR1-4       SWISS-PROT
SWISS-PROT         PIR          MIPSOwn      TrEMBL
PIR                GenBank      MIPSTrn
GenPept            NRL-3D       MIPSH
SWISS-PROTupdate                PIRMOD
GenPeptupdate                   NRL-3D
                                SWISS-PROT
                                EMTrans
                                GBTrans
                                Kabat
                                PseqIP

Secondary databases
• Secondary databases contain pattern data, i.e., diagnostic signatures for protein families.
  These signatures encode the most highly conserved features of multiply aligned sequences, which are often crucial to the structure or function of the protein
• The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands).
• In sequence alignments, these regions are often apparent as well-conserved motifs
• Patterns take the form of regular expressions, fingerprints, blocks, profiles, etc.

Secondary databases

Secondary DB   Primary source    Stored information
PROSITE        SWISS-PROT        Regular expressions (patterns)
Profiles       SWISS-PROT        Weighted matrices (profiles)
PRINTS         OWL               Aligned motifs (fingerprints)
BLOCKS         PROSITE/PRINTS    Aligned motifs (blocks)
IDENTIFY       BLOCKS/PRINTS     Fuzzy regular expressions (patterns)

Secondary databases
• TRANSFAC http://transfac.gbf.de
• EPD http://www.epd.isb-sib.ch
• InterPro http://www.ebi.ac.uk/interpro/
• PROSITE http://www.expasy.ch/prosite
• BLOCKS http://blocks.fhcrc.org
• PRINTS ftp://ftp.seqnet.dl.ac.uk/pub/database/prints
• PFAM http://www.sanger.ac.uk/Software/Pfam/index.shtml
• ProDom http://www.toulouse.inra.fr/prodom.html
• GeneCards http://bioinformatics.weizmann.ac.il/cards
• ENSEMBL http://www.ensembl.org
• EcoCyc http://ecocyc.panbio.com/ecocyc/ecocyc.html

Secondary databases
• There is some overlap in content between the secondary databases
• PDBsum alone has 35,291 entries
• Pattern DB growth is slow because the addition of detailed family annotation is very time-consuming.
• PROSITE and PRINTS are the only comprehensively, manually annotated secondary DBs
• To address the annotation bottleneck, the secondary database curators have together created a unified database of protein families known as InterPro

Structure Classification DBs
• Contain 3D structures available from crystallographic and spectroscopic studies

Structure classification databases:
  PDBsum – Protein Data Bank
  CATH – Class, Architecture, Topology, Homology
  SCOP – Structural Classification of Proteins

Structure Classification DBs
• PDB http://www.rcsb.org
• SCOP http://scop.mrc-lmb.cam.ac.uk/scop
• CATH http://www.biochem.ucl.ac.uk/bsm/cath
• DSSP http://www.sander.ebi.ac.uk/dssp
• FSSP http://www.ebi.ac.uk/dali/fssp
• HSSP http://www.sander.ebi.ac.uk/hssp

Metabolic Databases
• A number of metabolic databases are available electronically, some with features for querying and visualizing metabolic pathways and regulatory networks.
• KEGG (Kyoto Encyclopedia of Genes and Genomes) http://www.genome.ad.jp/kegg
• ENZYME (Enzyme nomenclature database) http://www.expasy.ch/enzyme
• BRENDA (Enzyme Information System) http://www.brenda.uni-koeln.de
• EMP (Enzymes and Metabolic Pathways database) http://www.empproject.com

Mapping Databases
• OMIM http://www3.ncbi.nlm.nih.gov/omim
• GDB http://www.gdb.org
• RHDB http://corba.ebi.ac.uk/RHdb

Databases concerning Mutations
• dbSNP http://www.ncbi.nlm.nih.gov/SNP
• HGBASE http://hgbase.cgr.ki.se
• The SNP Consortium (TSC) http://snp.cshl.org
• HAEMA http://europium.csc.mrc.ac.uk/usr/WWW/WebPages/database.dir/quiz.dir/intrquiz.htm

Literature Databases
• PubMed http://www.ncbi.nlm.nih.gov/entrez/query
• The Lancet http://www.thelancet.com
• Bioinformatics Online http://www.bioinformatics.oupjournals.org
• Nature http://www.nature.com
• Science http://www.sciencemag.org

Outlook – coming lecture
• Introduction to sequence alignment
• Pairwise sequence alignment
  – The Dot Matrix
  – Dynamic Programming
  – Scoring Matrices
• Local alignment
• Alignment tools
  – BLAST
  – FASTA
  – ALIGN

The End
Thanks for your attention!