Databases

Where to get data?
• GenBank
  – http://www.ncbi.nlm.nih.gov
• Protein databases
  – SWISS-PROT: http://www.expasy.ch/sprot
  – PDB: http://www.pdb.gov/
• And many others

Bibliography

[Figures: growth in genome sequencing; working draft sequence; gaps]

The reagent: databases
• Organized array of information
• Place where you put things in, and (if all is well) you should be able to get them out again.
• Resource for other databases and tools.
• Simplify the information space by specialization.
• Bonus: allows you to make discoveries.

• A database contains files or tables, each containing numerous records and fields.
• Flat file: the simplest form, either a single large text file or a collection of text files.
• Relational database model: the commonest type; stores the data in a number of tables (with records and fields). Tables are linked to each other by a shared field called a key.
• The operators are written in query languages based on relational algebra. Structured Query Language (SQL) is commonly used.

• XML (eXtensible Markup Language) is now a general tool for storage of data and information.
• XHTML is an application of XML; HTML is a closely related markup language.
• The key feature is the use of identifiers called tags, e.g. <title> Understanding Bioinformatics </title>
• A <publisher> tag can be defined and used to identify book publishers.
• Extraction from an XML file is similar to database querying.
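The claim that XML extraction resembles database querying can be sketched by asking the same question of both. This is a minimal illustration using Python's standard library, not part of the original slides. The table names, the publisher_id key, and the publisher names are hypothetical placeholders. Only the <title> and <publisher> tags come from the text above.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML catalogue; publisher names are placeholders.
CATALOGUE = """
<catalogue>
  <book><title>Understanding Bioinformatics</title><publisher>Example Press</publisher></book>
  <book><title>A Second Book</title><publisher>Another Press</publisher></book>
</catalogue>
"""


def titles_from_sql(publisher):
    """Relational model: two tables linked by a shared key (publisher_id)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE publisher (publisher_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE book (title TEXT, publisher_id INTEGER);
        INSERT INTO publisher VALUES (1, 'Example Press');
        INSERT INTO publisher VALUES (2, 'Another Press');
        INSERT INTO book VALUES ('Understanding Bioinformatics', 1);
        INSERT INTO book VALUES ('A Second Book', 2);
    """)
    rows = con.execute(
        "SELECT book.title FROM book "
        "JOIN publisher ON book.publisher_id = publisher.publisher_id "
        "WHERE publisher.name = ? ORDER BY book.title",
        (publisher,),
    ).fetchall()
    con.close()
    return [title for (title,) in rows]


def titles_from_xml(xml_text, publisher):
    """XML: walk the <book> elements and filter on the <publisher> tag."""
    root = ET.fromstring(xml_text)
    return sorted(
        book.findtext("title", default="").strip()
        for book in root.findall("book")
        if book.findtext("publisher", default="").strip() == publisher
    )


print(titles_from_sql("Example Press"))
print(titles_from_xml(CATALOGUE, "Example Press"))
```

In the relational version the link between a book and its publisher is the shared key. In the XML version the same link is expressed by nesting the <publisher> tag inside each <book>.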
Databases: information system, query system, storage system, data
• Data: GenBank flat file, PDB file, interaction record, title of a book, book
• Storage system: boxes, Oracle, MySQL, PC binary files, Unix text files, bookshelves
• Query system: a list you look at, a catalogue, indexed files, SQL, grep
• Information system: the UBC library, Google, Entrez, SRS

Bioinformatics information space, July 17, 1999
• Nucleotide sequences: 4,456,822
• Protein sequences: 706,862
• 3D structures: 9,780
• Human UniGene clusters: 75,832
• Maps and complete genomes: 10,870
• Different species node: 52,889
• dbSNP: 6,377
• RefGenes: 515
• Human contigs > 250 kb: 341 (4.9 MB)
• PubMed records: 10,372,886
• OMIM records: 10,695

The challenge of the information space: Feb 10, 2004
• Nucleotide records: 36,653,899
• Protein sequences: 4,436,362
• 3D structures: 19,640
• Interactions & complexes: 52,385
• Human UniGene clusters: 118,517
• Maps and complete genomes: 6,948
• Different taxonomy nodes: 283,121
• Human dbSNP: 13,179,601
• Human RefSeq records: 22,079
• bp in human contigs > 5,000 kb (116): 2,487,920,000
• PubMed records: 12,570,540
• OMIM records: 15,138

From a CBW student course evaluation: “I could probably live the rest of my life happily without ever seeing the ‘growth of GenBank’ curve … again.”

Databases
• Primary (archival)
  – GenBank/EMBL/DDBJ
  – UniProt
  – PDB
  – Medline (PubMed)
  – BIND
• Secondary (curated)
  – RefSeq
  – Taxon
  – UniProt
  – OMIM
  – SGD
http://nar.oupjournals.org/content/vol31/issue1/

Tools of trade for the “armchair scientist”
• Databases
  – PubMed and other NCBI databases
  – Biochemical databases
  – Protein domain databases
  – Structural databases
  – Genome comparison databases
• Tools
  – CDD / COGs
  – VAST / FSSP

[Figure: distribution of the types of databases as classified at the NAR database web site]

Types of databases
• Archival or primary data
  – Text: PubMed
  – DNA sequence: GenBank
  – Protein sequence: Entrez Proteins, TREMBL
  – Protein structures: PDB
• Curated or processed data
  – DNA sequences: RefSeq, LocusLink, OMIM
  – Protein sequences: SWISS-PROT, PIR
  – Protein structures: SCOP, CATH, MMDB
  – Genomes: Entrez Genomes, COGs

Nucleic Acids Research: Database Issue, each January 1
Articles on ~100 different databases

4 ways to access protein and DNA sequences
[1] LocusLink with RefSeq
[2] Entrez
[3] UniGene
    UniGene collects expressed sequence tags (ESTs) into clusters, in an attempt to form one gene per cluster. Use UniGene to study where your gene is expressed in the body, when it is expressed, and how abundant it is.
[4] ExPASy SRS
    There are many bioinformatics servers outside NCBI. Try ExPASy’s sequence retrieval system at http://www.expasy.ch/ (ExPASy = Expert Protein Analysis System), or try ENSEMBL at www.ensembl.org for a premier human genome web browser.

National Center for Biotechnology Information (NCBI) (page 24)
www.ncbi.nlm.nih.gov

The National Center for Biotechnology Information (NCBI)
• Created in 1988 as a part of the National Library of Medicine, National Institutes of Health
  – Establish public databases
  – Research in computational biology
  – Develop software tools for sequence analysis
  – Disseminate biomedical information
• Tools: BLAST (1990), Entrez (1992)
• GenBank (1992)
• Free MEDLINE (PubMed, 1997)
• Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, Taxonomy, GeneMap, SAGE, LocusLink, RefSeq

What is GenBank?
• Archival nucleotide sequence database
• Sample slogans: “Easy deposits, unlimited withdrawals, high interest”, “All bases covered”, “Billions and billions served”
• Data are shared nightly among three collaborating databases:
  – GenBank at NCBI, Bethesda, Maryland, USA
  – DNA Database of Japan (DDBJ) at NIG, Mishima, Japan
  – European Molecular Biology Laboratory Database (EMBL) at EBI, Hinxton, UK

Some guiding principles of working with GenBank
• GenBank is a nucleotide-centric view of the information space
• GenBank is a repository of all publicly available sequences
• In GenBank, records are grouped for various reasons
• Data in GenBank are only as good as what you put in

NCBI databases and their links (Fig. 2.5, page 25)
[Figure: Medline article abstracts (word weight); 3-D structures in MMDB (VAST); taxonomy (phylogeny); genomes; nucleotide sequences (BLAST); protein sequences (BLAST). www.ncbi.nlm.nih.gov]

PubMed is… (page 24)
• National Library of Medicine's search service
• 16 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via “Education” on side bar)

Entrez integrates… (page 24)
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data;
• population study data sets;
• assemblies of complete genomes
Entrez is a search and retrieval system that integrates NCBI databases.

[Figure: Entrez: an integrated search and retrieval system]

BLAST is… (page 25)
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 100,000 searches per day

OMIM is… (page 25)
• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick and others at JHU

OMIM record for Presenilin 1 (PSEN1)
[Figure callouts: contents; additional info in OMIM; associated LocusLink record; external resources; extensive references to literature]
Each record provides a state-of-the-art summary of current knowledge.

[Figure: OMIM search results by titles for the query: alzheimer AND presenilin 1]

Entrez Genome: gene location
[Figure: view of chromosome 14 with multiple maps (STSs, ESTs, etc.) and gene names]
[Figure: Entrez Genomes Map Viewer, integrated view of chromosome 7: GenBank map, contig map, STS map]
[Figure: Entrez Genomes Map Viewer, chromosome 14: cytogenetic map, location of PSEN1 and surrounding genes]

Books is… (page 26)
• searchable resource of on-line books

TaxBrowser is… (page 26)
• browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms

Structure site includes… (page 26)
• Molecular Modelling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• vector alignment search tool (VAST)

PDB
• Protein Data Bank
  – Protein and NA 3D structures
  – Sequence present
  – YAFFF

PDB record types
• HEADER
• COMPND
• SOURCE
• AUTHOR
• DATE
• JRNL
• REMARK
• SEQRES
• ATOM coordinates

[Slide: flat-file listing of PDB entry 1DGC (GCN4 bZIP protein/DNA complex, 3.0 Å resolution), lines 11 to 75: JRNL, REMARK 1 to 10, SEQRES, HELIX, CRYST1, ORIGX1-3, SCALE1-3 and the first ATOM record]

Accessing information on molecular sequences (page 26)

Accession numbers are labels for sequences (page 26)
NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.

What is an accession number? (page 27)
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
  X02775        GenBank genomic DNA sequence
  NT_030059     Genomic contig
  rs7079946     dbSNP (single nucleotide polymorphism)
RNA
  N91759.1      An expressed sequence tag (1 of 170)
  NM_006744     RefSeq DNA sequence (from a transcript)
Protein
  NP_007635     RefSeq protein
  AAC02945      GenBank protein
  Q28369        SwissProt protein
  1KT7          Protein Data Bank structure record

Four ways to access DNA and protein sequences (page 27)
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)
Note: LocusLink at NCBI was recently retired. The third printing of the book has updated these sections (pages 27-31).
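Because the prefix of an accession number hints at the kind of record it labels, the RBP4 examples above can be sorted with a few lines of code. The following is a minimal sketch, not an NCBI tool. The function names are hypothetical, the prefix table covers only the prefixes that appear in the examples, and everything else is deliberately reported as unclassified.

```python
import re

# Prefixes taken only from the RBP4 examples above; this is not a complete list.
REFSEQ_PREFIXES = {
    "NT_": "genomic contig",
    "NM_": "RefSeq DNA sequence (from a transcript)",
    "NP_": "RefSeq protein",
}


def split_version(accession):
    """Split 'N91759.1' into ('N91759', 1); the version is None when absent."""
    base, dot, version = accession.partition(".")
    if dot and version.isdigit():
        return base, int(version)
    return base, None


def describe(accession):
    """Return a rough label for an accession, based on its prefix alone."""
    base, _ = split_version(accession)
    if re.fullmatch(r"rs\d+", base, flags=re.IGNORECASE):
        return "dbSNP (single nucleotide polymorphism)"
    for prefix, label in REFSEQ_PREFIXES.items():
        if base.startswith(prefix):
            return label
    return "unclassified (GenBank, SwissProt, PDB or other)"


for acc in ["X02775", "NT_030059", "rs7079946", "N91759.1", "NM_006744", "NP_007635"]:
    print(acc, split_version(acc), describe(acc))
```

A prefix test like this cannot tell a GenBank accession from a SwissProt or PDB identifier, which is why those cases fall through to the catch-all label rather than being guessed.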
4 ways to access protein and DNA sequences (page 27)
[1] Entrez Gene with RefSeq
Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.
RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_007635).

[Figure: from the NCBI home page, type “rbp4” and hit “Go” (Pevsner, Fig. 2.7, page 29)]
[Figure: by applying limits, there are now just two entries (revised Fig. 2.7, page 29)]

GenBank record
[Figure callouts: locus name; accession number; gi number; Medline ID; protein sequence (rest of protein sequence deleted for brevity); GenPept ID; nucleotide sequence (rest of nucleotide sequence deleted for brevity)]

LOCUS, Accession, NID and protein_id
• LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases, and is therefore a poor sequence identifier.
• ACCESSION: A unique identifier for that record and a citable entity; does not change when the record is updated. A good record identifier, ideal for citation in publication.
• VERSION: New system where the accession and version play the same role as the accession and gi number.
• Nucleotide gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
• PID: Protein identifier: g, e or d prefix to gi number. There can be one or two on one CDS.
• Protein gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
• protein_id: Identifier which has the same structure and function as the nucleotide Accession.version numbers, but a slightly different format.

Entrez Gene (top of page) (revised Fig. 2.8, page 30)
Note that links to many other RBP4 database entries are available.
Entrez Gene (middle of page) (Fig. 2.9, page 32)
Entrez Gene (bottom of page) (Fig. 2.9, page 32)

FASTA format (Fig. 2.10, page 32)

NCBI’s important RefSeq project: best representative sequences (pages 29-30)
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.
RefSeq identifiers include the following formats:
  Complete genome        NG_######
  Complete chromosome    NC_######
  Genomic contig         NT_######
  mRNA (DNA format)      NM_######   e.g. NM_006744
  Protein                NP_######   e.g. NP_006735

NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences
  Accession         Molecule  Method           Note
  AC_123456         Genomic   Mixed            Alternate complete genomic
  AP_123456         Protein   Mixed            Protein products; alternate
  NC_123456         Genomic   Mixed            Complete genomic molecules
  NG_123456         Genomic   Mixed            Incomplete genomic regions
  NM_123456         mRNA      Mixed            Transcript products; mRNA
  NM_123456789      mRNA      Mixed            Transcript products; 9-digit
  NP_123456         Protein   Mixed            Protein products;
  NP_123456789      Protein   Curation         Protein products; 9-digit
  NR_123456         RNA       Mixed            Non-coding transcripts
  NT_123456         Genomic   Automated        Genomic assemblies
  NW_123456         Genomic   Automated        Genomic assemblies
  NZ_ABCD12345678   Genomic   Automated        Whole genome shotgun data
  XM_123456         mRNA      Automated        Transcript products
  XP_123456         Protein   Automated        Protein products
  XR_123456         RNA       Automated        Transcript products
  YP_123456         Protein   Auto. & Curated  Protein products
  ZP_12345678       Protein   Automated        Protein products

Four ways to access DNA and protein sequences (page 31)
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)

[Figure: DNA, RNA, protein; complementary DNA (cDNA); UniGene (Fig. 2.3, page 23)]
In genetics, complementary DNA (cDNA) is DNA synthesized from a mature mRNA template in a reaction catalyzed by the enzyme reverse transcriptase.

Expressed Sequence Tag
What are ESTs and how are they made?
ESTs are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms and use these "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs. The challenge associated with identifying genes from genomic sequences varies among organisms and depends on genome size as well as the presence or absence of introns, the intervening DNA sequences that interrupt the protein-coding sequence of a gene.

STS
A sequence-tagged site (STS) is an operationally unique sequence that identifies the combination of primer pairs used in a PCR assay; the assay generates a mapping reagent that maps to a single position within the genome.
Also see:
http://www.ncbi.nlm.nih.gov/dbSTS/
http://www.ncbi.nlm.nih.gov/genemap/

UniGene: unique genes via ESTs (pages 20-21)
• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library.
• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.

Cluster sizes in UniGene (Fig. 2.3, page 23)
[Figure: a gene with 1 EST associated; the cluster size is 1]
[Figure: a gene with 10 ESTs associated; the cluster size is 10]

Cluster sizes in UniGene (human), UniGene build 194, 8/06
  Cluster size (ESTs)   Number of clusters
  1                     42,800
  2                     6,500
  3-4                   6,500
  5-8                   5,400
  9-16                  4,100
  17-32                 3,300
  500-1000              2,128
  2000-4000             233
  8000-16,000           21
  16,000-30,000         8

UniGene: unique genes via ESTs (page 31)
Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver). We will discuss UniGene further later (gene expression).

Four ways to access DNA and protein sequences (page 31)
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)

Ensembl to access protein and DNA sequences
Try Ensembl at www.ensembl.org for a premier human genome web browser. We will encounter Ensembl as we study the human genome, BLAST, and other topics.
[Figure: click human, enter RBP4]

ExPASy to access protein and DNA sequences (page 33)
ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System)
Visit http://www.expasy.ch/ (Fig. 2.11, page 33)

Example of how to access sequence data: HIV-1 pol (page 34)
There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol
Searching for HIV-1 pol: following the “genome” link yields a manageable three results.
For the Entrez query hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for “hiv-1”), but these can easily be reduced in two steps:
  – specify the organism, e.g. hiv-1[organism]
  – limit the output to RefSeq
[Figure: over 100,000 nucleotide entries for HIV-1; only 1 RefSeq]

Examples of how to access sequence data: histone (8-12-06)
  Query for “histone”          # results
  protein records              21847
  RefSeq entries               7544
  RefSeq (limit to human)      1108
  NOT deacetylase              697
At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.

Access to Biomedical Literature (page 35)

PubMed at NCBI to find literature information
PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has >14 million records dating back to 1966.

MeSH is the acronym for "Medical Subject Headings." MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM. MeSH vocabulary is used for indexing journal articles for MEDLINE. The MeSH controlled vocabulary imposes uniformity and consistency on the indexing of biomedical literature.

PubMed search strategies (page 35)
• Try the tutorial (“Education” on the left sidebar)
• Use boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
• Try using “limits”
• Try “Links” to find Entrez information and external resources
• Obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/

Boolean queries (Fig. 2.12, page 34; 8/04)
  1 AND 2    lipocalin AND disease    (60 results)
  1 OR 2     lipocalin OR disease     (1,650,000 results)
  1 NOT 2    lipocalin NOT disease    (530 results)

Search results versus article contents (8/06)
  Search result           “globin” is present in article                “globin” is absent from article
  “globin” is found       true positive                                 false positive (article does not discuss globins)
  “globin” is not found   false negative (article discusses globins)    true negative

Protein sequence motif is a descriptor of a protein family
• Glutamine amidotransferase class I
  [PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]
  [C is the active site residue]
• Glutamine amidotransferase class II
  <x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]
  [C is the active site residue]

Searching MMDB

Principles of structural alignment
• Dali: http://www.ebi.ac.uk/dali/
  Looks for minimal RMSD between Cα atoms.
  Calculates Cα-Cα distance matrices, then identifies the longest alignable segments.
• VAST (Vector Alignment Search Tool): http://www.ncbi.nlm.nih.gov/Structure/
  Looks for pairs of secondary structure elements (α-helices, β-strands) that have similar orientation and connectivity.

[Figure: Dali alignment of Tyr phosphatase]
[Figure: VAST structure neighbors: structure summary, BLAST neighbors, VAST neighbors, Cn3D viewer]
[Figure: Cn3D: displaying structures (chloroquine)]
[Figure: structure neighbors; use of structural alignments (chloroquine, NADH)]

PDB
• Protein Data Bank
  – Protein and NA 3D structures
  – Sequence present
  – YAFFF

PDB record types
• HEADER
• COMPND
• SOURCE
• AUTHOR
• DATE
• JRNL
• REMARK
• SEQRES
• ATOM coordinates

PDB entry 1DGC (on the slide, the last columns of every record carry the entry ID and a line number, 1DGC 2 to 1DGC 76 and 1DGC 916 to 1DGC 920; they are omitted here)

HEADER    LEUCINE ZIPPER                          15-JUL-93   1DGC
COMPND    GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC
COMPND   2 ATF/CREB SITE DNA
SOURCE    GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC
AUTHOR    T.J.RICHMOND
REVDAT   1   22-JUN-94 1DGC    0
JRNL        AUTH   P.KONIG,T.J.RICHMOND
JRNL        TITL   THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO
JRNL        TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA
JRNL        TITL 3 FLEXIBILITY
JRNL        REF    J.MOL.BIOL.                   V. 233   139 1993
JRNL        REFN   ASTM JMOBAK  UK ISSN 0022-2836                 0070
REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 3.0  ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM                    X-PLOR
REMARK   3   AUTHORS                    BRUNGER
REMARK   3   R VALUE                    0.216
REMARK   3   RMSD BOND DISTANCES        0.020  ANGSTROMS
REMARK   3   RMSD BOND ANGLES           3.86   DEGREES
REMARK   3
REMARK   3   NUMBER OF REFLECTIONS      3296
REMARK   3   RESOLUTION RANGE           10.0 - 3.0 ANGSTROMS
REMARK   3   DATA CUTOFF                3.0    SIGMA(F)
REMARK   3   PERCENT COMPLETION         98.2
REMARK   3
REMARK   3   NUMBER OF PROTEIN ATOMS        456
REMARK   3   NUMBER OF NUCLEIC ACID ATOMS   386
REMARK   4
REMARK   4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO
REMARK   4 ACID BIOSYNTHETIC ENZYMES.
REMARK   5
REMARK   5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE
REMARK   5 281 AMINO ACIDS OF INTACT GCN4.
REMARK   6
REMARK   6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION.
REMARK   7
REMARK   7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 -
REMARK   7 226 ARE NOT WELL ORDERED.
REMARK   8
REMARK   8 RESIDUE NUMBERING OF NUCLEOTIDES:
REMARK   8  5'   T   G   G   A   G   A   T   G   A   C   G   T   C   A   T   C   T   C   C
REMARK   8     -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   1   2   3   4   5   6   7   8   9
REMARK   9
REMARK   9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA
REMARK   9 COMPLEX PER ASYMMETRIC UNIT.
REMARK  10
REMARK  10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF
REMARK  10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD
REMARK  10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY
REMARK  10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND
REMARK  10 TRANSLATION VECTOR TO THE COORDINATES X Y Z:
REMARK  10
REMARK  10       0 -1  0     X     117.32       X SYMM
REMARK  10      -1  0  0     Y  +  117.32   =   Y SYMM
REMARK  10       0  0 -1     Z      43.33       Z SYMM
SEQRES   1 A   62  ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG
SEQRES   2 A   62  ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG
SEQRES   3 A   62  LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU
SEQRES   4 A   62  GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL
SEQRES   5 A   62  ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG
SEQRES   1 B   19    T   G   G   A   G   A   T   G   A   C   G   T   C
SEQRES   2 B   19    A   T   C   T   C   C
HELIX    1   A ALA A  228  LYS A  276  1
CRYST1   58.660   58.660   86.660  90.00  90.00  90.00 P 41 21 2     8
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.017047  0.000000  0.000000        0.00000
SCALE2      0.000000  0.017047  0.000000        0.00000
SCALE3      0.000000  0.000000  0.011539        0.00000
ATOM      1  N   PRO A 227      35.313 108.011  15.140  1.00 38.94
ATOM      2  CA  PRO A 227      34.172 107.658  15.972  1.00 39.82
...
ATOM    842  C5    C B   9      57.692 100.286  22.744  1.00 29.82
ATOM    843  C6    C B   9      58.128 100.193  21.465  1.00 30.63
TER     844        C B   9
MASTER       46    0    0    1    0    0    0    6  842    2    0    7
END

UniProt
• New protein sequence database that is the result of a merger of SWISS-PROT and PIR. It will be the annotated, curated protein sequence database.
• Data in UniProt are primarily derived from coding sequence annotations in EMBL (GenBank/DDBJ) nucleic acid sequence data.
• UniProt is a flat-file database, just like EMBL and GenBank.
• The flat-file format is SwissProt-like, or EMBL-like.

Swiss-Prot
SWISS-PROT incorporates:
• Function of the protein
• Post-translational modification
• Domains and sites
• Secondary structure
• Quaternary structure
• Similarities to other proteins
• Diseases associated with deficiencies in the protein
• Sequence conflicts, variants, etc.
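Flat-file records such as the ATOM lines of the 1DGC entry shown earlier are read by column position rather than by splitting on whitespace, because neighbouring fields may touch. The following is a minimal sketch, assuming the standard fixed-column layout of PDB ATOM records. The Atom type and parse_atom function are hypothetical names introduced for illustration, and the sample line is rebuilt from the first ATOM record on the slide.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom(line):
    """Parse one ATOM record using 0-based slices of the fixed columns."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain=line[21],                # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
        occupancy=float(line[54:60]),  # columns 55-60
        b_factor=float(line[60:66]),   # columns 61-66
    )


# First ATOM record of 1DGC, assembled field by field so the columns are explicit.
SAMPLE = (
    "ATOM  "      # 1-6   record name
    "    1"       # 7-11  atom serial number
    " "           # 12    unused
    " N  "        # 13-16 atom name
    " "           # 17    alternate location
    "PRO"         # 18-20 residue name
    " "           # 21    unused
    "A"           # 22    chain
    " 227"        # 23-26 residue number
    "    "        # 27-30 insertion code and padding
    "  35.313"    # 31-38 x
    " 108.011"    # 39-46 y
    "  15.140"    # 47-54 z
    "  1.00"      # 55-60 occupancy
    " 38.94"      # 61-66 temperature factor
)

atom = parse_atom(SAMPLE)
print(atom.name, atom.res_name, atom.chain, atom.res_seq, atom.x, atom.y, atom.z)
```

Slicing by column is the design choice that matters here: a coordinate such as 108.011 fills its eight-character field almost completely, so a parser that relies on spaces between fields can fail on larger values.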