Introduction to Bioinformatics Databases

Central dogma of molecular biology
DNA → RNA → protein → phenotype

A main focus of bioinformatics is to study molecular sequence data to gain insight into a broad range of biological problems. (Page 6)

[Figure: after Pace NR (1997) Science 276:734]
With the use of bioinformatics we can learn about the variation that occurs between species, and we can deduce the evolutionary history of life on Earth.

[Figure: Growth of GenBank, December 1982 to June 2006. Axes: base pairs of DNA (billions) and sequences (millions).]

[Figure: Growth of the International Nucleotide Sequence Database Collaboration. Base pairs of DNA (billions) contributed by GenBank, EMBL, and DDBJ. http://www.ncbi.nlm.nih.gov/Genbank/]

Central dogma of molecular biology
DNA        genome
RNA        transcriptome
protein    proteome

Central dogma of bioinformatics and genomics (Fig. 2.2, Page 20)
DNA        genomic DNA databases
RNA        cDNA, ESTs, UniGene
protein    protein sequence databases
phenotype

There are three major public DNA databases (Page 16)
EMBL       housed at EBI (European Bioinformatics Institute)
GenBank    housed at NCBI (National Center for Biotechnology Information)
DDBJ       housed in Japan
The underlying raw DNA sequences are identical.

>300,000 species are represented in GenBank
(Table 2-1; taxonomy nodes at NCBI, 8/06: http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi)

The most sequenced organisms in GenBank (Table 2-2, Page 18)

Updated 8-12-04, GenBank release 142.0
Homo sapiens                     10.7 billion bases
Mus musculus                     6.5b
Rattus norvegicus                5.6b
Danio rerio                      1.7b
Zea mays                         1.4b
Oryza sativa                     0.8b
Drosophila melanogaster          0.7b
Gallus gallus                    0.5b
Arabidopsis thaliana             0.5b

Updated 8-29-05, GenBank release 149.0
Homo sapiens                     11.2 billion bases
Mus musculus                     7.5b
Rattus norvegicus                5.7b
Danio rerio                      2.1b
Bos taurus                       1.9b
Zea mays                         1.4b
Oryza sativa (japonica)          1.2b
Xenopus tropicalis               0.9b
Canis familiaris                 0.8b
Drosophila melanogaster          0.7b

Updated 7-19-06, GenBank release 154.0
Homo sapiens                     12.3 billion bases
Mus musculus                     8.0b
Rattus norvegicus                5.7b
Bos taurus                       3.5b
Danio rerio                      2.5b
Zea mays                         1.8b
Oryza sativa (japonica)          1.5b
Strongylocentrotus purpuratus    1.2b
Sus scrofa                       1.0b
Xenopus tropicalis               1.0b

National Center for Biotechnology Information (NCBI)
www.ncbi.nlm.nih.gov (Page 24)

Types of Data in GenBank (Fig. 2.5, Page 25)
• DNA level
• RNA level (cDNA)
• Protein sequences
…
www.ncbi.nlm.nih.gov

PubMed is… (Page 24)
• National Library of Medicine's search service
• 16 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via “Education” on side bar)

Entrez is a search and retrieval system that integrates NCBI databases. (Page 24)

Entrez integrates… (Page 24)
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data;
• population study data sets;
• assemblies of complete genomes

BLAST is… (Page 25)
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 100,000 searches per day

OMIM is… (Page 25)
• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick, others at JHU

Books is… (Page 26)
• searchable resource of on-line books

TaxBrowser is… (Page 26)
• browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms

Structure site includes… (Page 26)
• Molecular Modelling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• vector alignment search tool (VAST)

Accessing information on molecular sequences (Page 26)

Accession numbers are labels for sequences
NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences.
You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. (Page 26)

What is an accession number? (Page 27)
An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for retinol-binding protein, RBP4):
DNA        X02775       GenBank genomic DNA sequence
           NT_030059    Genomic contig
           rs7079946    dbSNP (single nucleotide polymorphism)
RNA        N91759.1     An expressed sequence tag (1 of 170)
           NM_006744    RefSeq DNA sequence (from a transcript)
protein    NP_007635    RefSeq protein
           AAC02945     GenBank protein
           Q28369       SwissProt protein
           1KT7         Protein Data Bank structure record

Four ways to access DNA and protein sequences (Page 27)
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)

[1] Entrez Gene with RefSeq (Page 27)
Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.
RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_007635).

[Screenshots]
From the NCBI home page, type “rbp4” and hit “Go”. (revised Fig. 2.7, Page 29)
By applying limits, there are now just two entries.
Entrez Gene (top of page): note that links to many other RBP4 database entries are available. (revised Fig. 2.8, Page 30)
Entrez Gene (middle of page)
Entrez Gene (bottom of page) (Fig. 2.9, Page 32)
FASTA format (Fig. 2.10, Page 32)

NCBI’s important RefSeq project: best representative sequences (Page 29-30)
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.

RefSeq identifiers include the following formats:
Complete genome       NC_######
Complete chromosome   NC_######
Genomic contig        NT_######
mRNA (DNA format)     NM_######    e.g. NM_006744
Protein               NP_######    e.g. NP_006735

NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences

Accession          Molecule   Note
AP_123456          Protein    Protein products; alternate
NC_123456          Genomic    Complete genomic molecules
NG_123456          Genomic    Incomplete genomic regions
NM_123456          mRNA       Transcript products; mRNA
NM_123456789       mRNA       Transcript products; 9-digit
NP_123456          Protein    Protein products;
NP_123456789       Protein    Protein products; 9-digit
NR_123456          RNA        Non-coding transcripts
NT_123456          Genomic    Genomic assemblies
NW_123456          Genomic    Genomic assemblies
NZ_ABCD12345678    Genomic    Whole genome shotgun data
XM_123456          mRNA       Transcript products
XP_123456          Protein    Protein products
XR_123456          RNA        Transcript products
YP_123456          Protein    Protein products
ZP_12345678        Protein    Protein products

Ensembl to access protein and DNA sequences
Try Ensembl at www.ensembl.org for a premier human genome web browser.
Ensembl is a joint scientific project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. Its aim is to provide a centralised resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates. We will encounter Ensembl as we study the human genome, BLAST, and other topics.

[Screenshots: on the Ensembl home page, click “human”, then enter RBP4. Species groups listed in Ensembl: mammals (placentals, monotremes, marsupials, other), birds, reptiles, amphibians, fishes, agnathans, and non-vertebrates.]

Four ways to access DNA and protein sequences (Page 33)
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)

ExPASy to access protein and DNA sequences (Fig. 2.11, Page 33)
ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System)
Visit http://www.expasy.ch/

Example of how to access sequence data: HIV-1 pol (Page 34)
There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol

Searching for HIV-1 pol: following the “genome” link yields a manageable three results. (Page 34)

For the Entrez query hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for “hiv-1”), but these can be reduced in two easy steps:
--specify the organism, e.g. hiv-1[organism]
--limit the output to RefSeq!

[Screenshot: over 100,000 nucleotide entries for HIV-1; only 1 RefSeq]

Examples of how to access sequence data: histone

Query for “histone”          # results
protein records              21847
RefSeq entries               7544
RefSeq (limit to human)      1108
NOT deacetylase              697

At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.
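The two narrowing steps used for HIV-1 pol and for histone (restrict by organism, then limit to RefSeq) can also be written as a single query string. The sketch below is illustrative only and is not part of the lecture: the helper names `build_term` and `esearch_url` are hypothetical, the `srcdb_refseq[prop]` filter and the E-utilities `esearch.fcgi` endpoint are assumptions about how such a search could be issued outside the web page, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities "esearch" endpoint; it is not mentioned in the lecture.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumption: an Entrez property filter that keeps only RefSeq records.
REFSEQ_FILTER = "srcdb_refseq[prop]"


def build_term(text, organism=None, refseq_only=False, exclude=None):
    """Combine free text with optional Entrez restrictions into one query term."""
    parts = [text]
    if organism:
        # Field tag in the same form as the lecture's example: hiv-1[organism]
        parts.append(f"{organism}[organism]")
    if refseq_only:
        parts.append(REFSEQ_FILTER)
    term = " AND ".join(parts)
    if exclude:
        # Boolean NOT, as in the "NOT deacetylase" step of the histone example
        term = f"({term}) NOT {exclude}"
    return term


def esearch_url(db, term):
    """Return the URL that would run the search; nothing is fetched here."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term})}"


hiv = build_term("pol", organism="hiv-1", refseq_only=True)
histone = build_term("histone", organism="human", refseq_only=True,
                     exclude="deacetylase")

print(hiv)
print(histone)
print(esearch_url("protein", histone))
```

Building the term from separate parts keeps each restriction visible, which mirrors how the result counts shrink step by step in the histone example.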
(Slide dated 8-12-06.)

Access to Biomedical Literature (Page 35)

PubMed at NCBI to find literature information (Page 35)
PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has >14 million records dating back to 1966.

PubMed search strategies (Page 35)
• Try the tutorial (“Education” on the left sidebar)
• Use Boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
• Try using “limits”
• Try “Links” to find Entrez information and external resources
• Obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/

Boolean queries (Fig. 2.12, Page 34)
1 AND 2    lipocalin AND disease    (60 results)
1 OR 2     lipocalin OR disease     (1,650,000 results)
1 NOT 2    lipocalin NOT disease    (530 results)

(Slide dated 8/04.)
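The three Boolean operators behave like set operations on the lists of matching citations: AND is intersection, OR is union, NOT is difference. A minimal sketch with made-up citation IDs (the real PubMed result sets are far larger, and the function name `boolean_counts` is hypothetical):

```python
def boolean_counts(set1, set2):
    """Return the number of records matched by 1 AND 2, 1 OR 2, and 1 NOT 2."""
    return {
        "AND": len(set1 & set2),   # records in both sets
        "OR": len(set1 | set2),    # records in either set
        "NOT": len(set1 - set2),   # records in set 1 but not in set 2
    }


# Hypothetical citation IDs standing in for the two searches.
lipocalin = {101, 102, 103, 104, 105}
disease = {104, 105, 106, 107, 108, 109}

counts = boolean_counts(lipocalin, disease)
print(counts)

# Every record in set 1 is either shared with set 2 or unique to set 1.
assert len(lipocalin) == counts["AND"] + counts["NOT"]
```

The same identity applies to the slide's counts: taken at face value, 60 results for lipocalin AND disease plus 530 for lipocalin NOT disease imply roughly 590 lipocalin citations in total.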