Flow of genetic information

DNA --> RNA --> PROTEIN --> CONFORMATION --> BIOLOGICAL FUNCTION

Overview of molecular biology databases

- Sequence
    DNA
      Genbank (www.ncbi.nlm.nih.gov): BLAST, Entrez
      EMBL (European Molecular Biology Laboratory, www.ebi.ac.uk): SRS (srs.ebi.ac.uk, www.sanger.ac.uk/srs6/)
      DDBJ (DNA Data Bank of Japan)
    Protein
      Swissprot (www.ebi.ac.uk)
      NCBI
    Protein classification databases
      Prosite (expasy.hcuge.ch)
      Pfam (www.sanger.ac.uk/Pfam)
      InterPro (www.ebi.ac.uk/interpro)
    Gene ontology (www.geneontology.org)
- Structure
    PDB (Protein Data Bank, www.rcsb.org/pdb/cgi/queryForm.cgi; RCSB, Research Collaboratory for Structural Bioinformatics, rcsb.rutgers.edu): X-ray crystallography, NMR, modeling
    KLOTHO (small molecules, www.ibc.wustl.edu/moirai/klotho/compound_list.html)
- Genome
    GDB (Human Genome Data Base, www.gdb.org)
    Mouse genome database (www.informatics.jax.org)
    Yeast genome (genome-ftp.stanford.edu/Saccharomyces)
    Bacterial genomes (www.tigr.org)
- Human genome browsers
    NCBI     www.ncbi.nlm.nih.gov
    UCSC     genome.ucsc.edu
    EBI      www.ensembl.org
    Celera   www.celera.com
- Genetic disorders
    OMIM (Online Mendelian Inheritance in Man, www.ncbi.nlm.nih.gov)
- Taxonomy (www.ncbi.nlm.nih.gov)
- Literature
    PubMed (www.ncbi.nlm.nih.gov/Entrez)

Molecular biology databases

- DNA sequence
- Genome data
- Protein sequence
- Protein classification
- Protein structure

Major bioinformatics sites / public sequence database administrators

- Genbank: NCBI, NIH, US
- DDBJ (Japan)
- EMBL (EBI, UK)

DNA sequence data: EMBL - Genbank - DDBJ

EMBL and Genbank formats

EMBL format

ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of
RT   the gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum
RL   Am Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..756
FT                   /db_xref="taxon:1638"
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS             95..100
FT                   /gene="sod"
FT   terminator      723..746
FT                   /gene="sod"
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
     gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
     ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300

3.2.4 Feature key examples

Key             Description
conflict        Separate determinations of the "same" sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein-coding sequence
misc_RNA        Generic label for an undefined RNA
insertion_seq   Insertion element
D-loop          Mitochondrial or other D-loop structure

3.3.4 Qualifier examples

Key             Location/Qualifiers
CDS             86..742
                /product="hypoxanthine phosphoribosyltransferase"
                /label=hprt
                /note="hprt catalyzes vital steps in the reutilization
                pathway for purine biosynthesis and its deficiency leads
                to forms of ""gouty"" arthritis"
rep_origin      234..243
                /direction=left
CDS             109..564
                /usedin=X10009:catalase

3.5.3 Location examples

The following is a list of common location descriptors with their meanings:

Location                    Description
467                         Points to a single base in the presented sequence
340..565                    Points to a continuous range of bases bounded by and
                            including the starting and ending bases
<345..500                   Indicates that the exact lower boundary point of a
                            feature is unknown. The location begins at some base
                            previous to the first base specified (which need not
                            be contained in the presented sequence) and continues
                            to and includes the ending base
<1..888                     The feature starts before the first sequenced base
                            and continues to and includes base 888
(102.110)                   Indicates that the exact location is unknown but that
                            it is one of the bases between bases 102 and 110,
                            inclusive
(23.45)..600                Specifies that the starting point is one of the bases
                            between bases 23 and 45, inclusive, and the end point
                            is base 600
(122.133)..(204.221)        The feature starts at a base between 122 and 133,
                            inclusive, and ends at a base between 204 and 221,
                            inclusive
123^124                     Points to a site between bases 123 and 124
145^177                     Points to a site between two adjacent bases anywhere
                            between bases 145 and 177
complement(34..(122.126))   Start at one of the bases complementary to those
                            between 122 and 126 on the presented strand and
                            finish at the base complementary to base 34 (the
                            feature is on the strand complementary to the
                            presented strand)
join("acct",449..670)       Concatenate the four bases 'acct' to the 5' end of
                            the sequence from bases 449 to 670, inclusive
J00193:hladr                Points to a feature whose location is described in
                            another entry: the feature labelled 'hladr' in the
                            entry (in this database) with primary accession
                            number 'J00193'
J00194:(100..202)           Points to bases 100 to 202, inclusive, in the entry
                            (in this database) with primary accession number
                            'J00194'
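The simplest location descriptors in the list above (a single base, a closed range, a range with an unknown lower boundary, and a complement of one of these) can be handled with a few lines of code. The following is a minimal sketch only: the names `Span`, `parse_location` and `extract` are hypothetical helpers invented for this illustration, and the uncertain forms such as (102.110), 123^124, join(...) and cross-entry references are deliberately rejected rather than interpreted.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    """A simple feature location (hypothetical helper type for this sketch)."""
    start: int        # first base, 1-based
    end: int          # last base, inclusive
    strand: int       # +1 = presented strand, -1 = complementary strand
    start_open: bool  # True when the descriptor was written as <start
    end_open: bool    # True when the descriptor was written as >end


# Matches '467', '340..565', '<345..500' and '340..>565' only.
_SIMPLE = re.compile(r"^(<?)(\d+)(?:\.\.(>?)(\d+))?$")


def parse_location(text: str) -> Span:
    """Parse a simple descriptor, optionally wrapped in complement(...)."""
    text = text.strip()
    strand = 1
    if text.startswith("complement(") and text.endswith(")"):
        strand = -1
        text = text[len("complement("):-1]
    match = _SIMPLE.match(text)
    if match is None:
        # Uncertain positions, sites between bases, joins and remote
        # references are outside the scope of this sketch.
        raise ValueError(f"unsupported location descriptor: {text!r}")
    lt, start, gt, end = match.groups()
    first = int(start)
    last = int(end) if end is not None else first
    return Span(first, last, strand, lt == "<", gt == ">")


def extract(sequence: str, span: Span) -> str:
    """Return the bases covered by span; reverse-complement if strand is -1."""
    piece = sequence[span.start - 1:span.end]
    if span.strand == -1:
        complement = str.maketrans("acgtACGT", "tgcaTGCA")
        piece = piece.translate(complement)[::-1]
    return piece
```

For example, `parse_location("109..717")` describes the CDS of the LISOD entry, and `extract(seq, parse_location("95..100"))` would return the ribosome binding site if `seq` held the full 756-base sequence.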
Common sequence formats

1. EMBL release format
2. Genbank (ASN.1)
3. FASTA format:

   >X12345 Y098TR gene
   CGTATCTTACGAGCTACTACGA
   GGTCTTATCGGACGAGCGACT
   ...

EMBL divisions

- Human
- Mus musculus
- Rodents
- Other Mammals
- Other Vertebrates
- Invertebrates
- Plants
- Fungi
- Prokaryotes (+ Archaea)
- Organelles
- Viruses
- Bacteriophages
- Patented
- Synthetic
- EST
- HTG
- STS
- GSS

EST (Expressed Sequence Tag)

Expressed Sequence Tags (ESTs) are partial mRNA sequences: they are sequences of cDNA that has been reverse-transcribed from mRNA. They are short (~500-1000 bases), and each is the result of a single sequencing experiment, so the frequency of errors is high.

Applications:
- Discovery of new genes
- Mapping of various genomes
- Identification of coding regions in genomic sequences

EST libraries are used to answer questions like: which genes are expressed in a specific cell or tissue?

UniGene clusters

UniGene partitions GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene. A majority of sequences are ESTs.
The mouse dataset contains 84,247 clusters with a total of 2,332,864 sequences.

[Figure: mRNA (5' UTR, CDS, 3' UTR) and public ESTs]

High-Throughput Genomic Sequences

The High Throughput Genomic (HTG) Sequences division was created to accommodate a growing need to make 'unfinished' genomic sequence data rapidly available to the scientific community. It was done in a coordinated effort between the three International Nucleotide Sequence databases: DDBJ, EMBL, and GenBank. The HTG division contains 'unfinished' DNA sequences generated by the high-throughput sequencing centers.

Sequence data in this division are available for BLAST homology searches against either the "htgs" database or the "month" database, which includes all new submissions for the prior month. The HTG division of GenBank was described in a [Genome Research (1997) 7(10)] article by Ouellette and Boguski.

Location of HTG records: Unfinished HTG sequences containing contigs greater than 2 kb are assigned an accession number and deposited in the HTG division. A typical HTG record might consist of all the first pass sequence data generated from a single cosmid, BAC, YAC, or P1 clone which together comprise more than 2 kb and contain one or more gaps. A single accession number is assigned to this collection of sequences and each record includes a clear indication of the status (phase 1 or 2) plus a prominent warning that the sequence data is "unfinished" and may contain errors. The accession number does not change as sequence records are updated; only the most recent version of a HTG record remains in GenBank.

'Finished' HTG sequences (phase 3) retain the same accession number, but are moved into the relevant primary GenBank division. An example of a submission (one accession number) that has progressed through phase 1, phase 2, and phase 3 is available.

Genome Survey Sequence (GSS)

This division is similar in nature to the EST division, except that its sequences will be genomic rather than cDNA (mRNA).
The GSS division will contain (but not be limited to) the following types of data:
- random "single pass read" genome survey sequences
- single pass reads from cosmid/BAC/YAC ends
- exon trapped genomic sequences
- Alu PCR sequences

STS (Sequence Tagged Sites)

Sequence Tagged Sites (STS) are short DNA segments with a single location in the genome. This feature of STS makes them useful tags for mapping.

Genome sequencing: published complete microbial genomes

(Domain: A = Archaea, B = Bacteria, E = Eukarya. Cells that could not be recovered from the slide are left empty.)

Genome                                Strain                                      Domain  Size (Mb)  Year       Institution
Haemophilus influenzae                Rd KW20                                     B       1.83       1995       TIGR
Mycoplasma genitalium                 G-37                                        B       0.58       1995       TIGR
Methanococcus jannaschii              DSM 2661                                    A       1.66       1996       TIGR
Mycoplasma pneumoniae                 M129                                        B       0.81       1996       Univ. of Heidelberg
Synechocystis sp.                     PCC 6803                                    B       3.57       1996       Kazusa DNA Research Inst.
Archaeoglobus fulgidus                DSM4304                                     A       2.18       1997       TIGR
Bacillus subtilis                     168                                         B       4.2        1997       International Consortium
Borrelia burgdorferi                  B31                                         B       1.44       1997       TIGR
Escherichia coli                      K-12 Strain MG1655                          B       4.6        1997       University of Wisconsin
Helicobacter pylori                   26695                                       B       1.66       1997       TIGR
Methanobacterium thermoautotrophicum  delta H                                     A       1.75       1997
Saccharomyces cerevisiae              S288C                                       E       13         1996/1997  International Consortium
Aquifex aeolicus                      VF5                                         B       1.5        1998       Diversa
Chlamydia trachomatis                 serovar D (D/UW3/Cx)                        B       1.05       1998       UC Berkeley / Stanford
Mycobacterium tuberculosis            H37Rv (lab strain)                          B       4.4        1998       Sanger Centre
Pyrococcus horikoshii                 OT3                                         A       1.8        1998       Biotechnology Center
Rickettsia prowazekii                 Madrid E                                    B       1.1        1998       University of Uppsala
Treponema pallidum                    Nichols                                     B       1.14       1998       TIGR
Aeropyrum pernix                      K1                                          A       1.67       1999       Biotechnology Center
Chlamydia pneumoniae                  CWL029                                      B       1.23       1999       UC Berkeley / Stanford
Deinococcus radiodurans               R1                                          B       3.28       1999       TIGR
Helicobacter pylori                   J99                                         B       1.64       1999       Astra Research Center Boston / Genome Therapeutics
Thermotoga maritima                   MSB8                                        B       1.8        1999       TIGR
Bacillus halodurans                   C-125                                       B       4.2        2000       Japan Marine Science and Technology Center
Buchnera sp.                          APS                                         B       0.64       2000       Univ. Tokyo / RIKEN
Campylobacter jejuni                  NCTC 11168                                  B       1.64       2000       Sanger Centre
Chlamydia pneumoniae                  AR39                                        B       1.23       2000       TIGR
Chlamydia trachomatis                 MoPn                                        B       1.07       2000       TIGR
Halobacterium sp.                     NRC-1                                       A       2.57       2000       Halobacterium genome consortium
Neisseria meningitidis                MC58                                        B       2.27       2000       TIGR
Neisseria meningitidis                serogroup A strain Z2491                    B       2.18       2000       Sanger Centre
Pseudomonas aeruginosa                PAO1                                        B       6.3        2000       University of Washington
Thermoplasma acidophilum                                                          A       1.56       2000       Max-Planck-Institute for Biochemistry
Thermoplasma volcanium                GSS1                                        A       1.58       2000       AIST
Ureaplasma urealyticum                serovar 3                                   B       0.75       2000       Applied Biosystems /
Vibrio cholerae                       serotype O1, Biotype El Tor, strain N16961  B       4          2000       TIGR
Xylella fastidiosa                    9a5c                                        B       2.68       2000       ONSA Consortium
Escherichia coli                      O157:H7 strain EDL933                       B       4.1        2001       University of Wisconsin

[Figure: Nucleotide sequence database statistics - distribution among organisms]

Comparison of fully sequenced genomes

Organism                  Size (Mb)    Genes
Bacteria                  0.6 - 7.5    500-7,000
S. cerevisiae             12           6,000
S. pombe                  13           6,000
Caenorhabditis elegans    97           20,000
Drosophila melanogaster   120          14,000
Arabidopsis thaliana      110          26,000
Fugu rubripes             365          ~38,000?
Mus musculus              ~3000        >40,000?
H. sapiens                3200         >40,000?

Sites for exploring fully sequenced genomes of man, mouse and other higher eukaryotes:

NCBI     www.ncbi.nlm.nih.gov
UCSC     genome.ucsc.edu
EBI      www.ensembl.org
Celera   www.celera.com

Genome MOT, Genome monitoring table
http://www.ebi.ac.uk/genomes/mot/index.html

March 2003:

Organism      % Finished   % Finished+Draft
Drosophila    100
C. elegans    100
A. thaliana   100
H. sapiens    118          183
Danio rerio   5            23
Mouse         110          181
Rat           0.5          169

Taxonomy database
www3.ncbi.nlm.nih.gov/Taxonomy/tax.html

This is the top level of the taxonomy database maintained by NCBI/GenBank. You can explore any of the taxa listed below by clicking it.
- Archaea
- Eubacteria
- Eukaryotae
- Viroids
- Viruses
- Other
- Unclassified

Most entries in protein sequence databases are computational translations from gene sequences

DNA -> RNA -> protein -> conformation

FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"

The flow of genetic information

DNA -> RNA -> protein -> conformation

Translation products of DNA

Amino acids in three letter code (the three frames of the lower strand are read from right to left, 5' to 3'):

     ValArgIleArgIleSerAsp
      TyrGlyPheGlyPheArgMet
       ThrAspSerAspPheGlyCys
  5' GUACGGAUUCGGAUUUCGGAUGC 3'
  3' CAUGCCUAAGCCUAAAGCCUACG 5'
     TyrProAsnProAsnArgIle
      ValSerGluSerLysProHis
       ArgIleArgIleGluSerAla

Amino acids in one letter code:

     V  R  I  R  I  S  D
      Y  G  F  G  F  R  M
       T  D  S  D  F  G  C
  5' GUACGGAUUCGGAUUUCGGAUGC 3'
  3' CAUGCCUAAGCCUAAAGCCUACG 5'
     Y  P  N  P  N  R  I
      V  S  E  S  K  P  H
       R  I  R  I  E  S  A

Three- and one-letter codes of the amino acids

Alanine         Ala   A
Arginine        Arg   R
Asparagine      Asn   N
Aspartate       Asp   D
Cysteine        Cys   C
Glutamate       Glu   E
Glutamine       Gln   Q
Glycine         Gly   G
Histidine       His   H
Isoleucine      Ile   I
Leucine         Leu   L
Lysine          Lys   K
Methionine      Met   M
Phenylalanine   Phe   F
Proline         Pro   P
Serine          Ser   S
Threonine       Thr   T
Tryptophan      Trp   W
Tyrosine        Tyr   Y
Valine          Val   V
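The six reading frames shown for the 23-base example can be reproduced with a short script. This is a sketch under stated assumptions: it uses the standard genetic code only (none of the mitochondrial or other deviating codes), and the names `CODON_TABLE`, `translate`, `reverse_complement` and `six_frames` are illustrative helpers, not part of any database tool.

```python
# Standard genetic code, built from the usual compact ordering:
# first, second and third base each run through U, C, A, G.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(rna: str) -> str:
    """Translate complete codons from the first base; '*' marks a stop codon."""
    rna = rna.upper().replace("T", "U")
    codons = (rna[i:i + 3] for i in range(0, len(rna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def reverse_complement(rna: str) -> str:
    """Return the complementary strand written 5' to 3'."""
    complement = str.maketrans("ACGU", "UGCA")
    return rna.upper().replace("T", "U").translate(complement)[::-1]


def six_frames(rna: str) -> dict:
    """Translate all three frames of both strands, in one-letter code."""
    reverse = reverse_complement(rna)
    frames = {f"+{k + 1}": translate(rna[k:]) for k in range(3)}
    frames.update({f"-{k + 1}": translate(reverse[k:]) for k in range(3)})
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("GUACGGAUUCGGAUUUCGGAUGC").items():
        print(name, protein)
```

The reverse-strand results are written 5' to 3', so they read in the opposite direction to the right-to-left rows printed under the lower strand in the figure.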
THE GENETIC CODE

UUU Phe    UCU Ser    UAU Tyr    UGU Cys
UUC Phe    UCC Ser    UAC Tyr    UGC Cys
UUA Leu    UCA Ser    UAA Stop   UGA Stop
UUG Leu    UCG Ser    UAG Stop   UGG Trp

CUU Leu    CCU Pro    CAU His    CGU Arg
CUC Leu    CCC Pro    CAC His    CGC Arg
CUA Leu    CCA Pro    CAA Gln    CGA Arg
CUG Leu    CCG Pro    CAG Gln    CGG Arg

AUU Ile    ACU Thr    AAU Asn    AGU Ser
AUC Ile    ACC Thr    AAC Asn    AGC Ser
AUA Ile    ACA Thr    AAA Lys    AGA Arg
AUG Met    ACG Thr    AAG Lys    AGG Arg

GUU Val    GCU Ala    GAU Asp    GGU Gly
GUC Val    GCC Ala    GAC Asp    GGC Gly
GUA Val    GCA Ala    GAA Glu    GGA Gly
GUG Val    GCG Ala    GAG Glu    GGG Gly

Table I. The genetic code

Deviations from the standard genetic code

# Ciliate protozoa
UAA = Gln:Q
UAG = Gln:Q

# Yeast mitochondria
UGA = Trp:W
CUU = Thr:T
CUC = Thr:T
CUA = Thr:T
CUG = Thr:T
AUA = Met:M

# Mammalian mitochondria
UGA = Trp:W
AUU = Ile:I
AUC = Ile:I
AUA = Met:M
AGA = * :*
AGG = * :*

# Drosophila mitochondria
UGA = Trp:W
AUU = Ile:I
AUA = Met:M
AGA = Ser:S
AGG = Ser:S

# mycoplasma
UGA = Trp

Sequence symbols: Nucleotides

Symbol   Meaning                   Complement
A        A                         T
C        C                         G
G        G                         C
T/U      T                         A
M        A or C                    K
R        A or G                    Y
W        A or T                    W
S        C or G                    S
Y        C or T                    R
K        G or T                    M
V        A or C or G               B
H        A or C or T               D
D        A or G or T               H
B        C or G or T               V
X/N      G or A or T or C          X
.        not G or A or T or C      .

'Reverse' translation

Protein                A - G - K - M
DNA - most ambiguous   GCN GGN AAR ATG
DNA - most likely      GCU GGU AAA ATG

Codon usage for enteric bacterial (highly expressed) genes 7/19/83

AmAcid   Codon   Number    /1000    Fraction
Gly      GGG      13.00     1.89    0.02
Gly      GGA       3.00     0.44    0.00
Gly      GGU     365.00    52.99    0.59
Gly      GGC     238.00    34.55    0.38

Glu      GAG     108.00    15.68    0.22
Glu      GAA     394.00    57.20    0.78
Asp      GAU     149.00    21.63    0.33
Asp      GAC     298.00    43.26    0.67

Val      GUG      93.00    13.50    0.16
Val      GUA     146.00    21.20    0.26
Val      GUU     289.00    41.96    0.51
Val      GUC      38.00     5.52    0.07

Ala      GCG     161.00    23.37    0.26
Ala      GCA     173.00    25.12    0.28
Ala      GCU     212.00    30.78    0.35
Ala      GCC      62.00     9.00    0.10

Arg      AGG       1.00     0.15    0.00
Arg      AGA       0.00     0.00    0.00
Ser      AGU       9.00     1.31    0.03
Ser      AGC      71.00    10.31    0.20

Lys      AAG     111.00    16.11    0.26
Lys      AAA     320.00    46.46    0.74
Asn      AAU      19.00     2.76    0.06
Asn      AAC     274.00    39.78    0.94

Met      AUG     170.00    24.68    1.00
Ile      AUA       1.00     0.15    0.00
Ile      AUU      70.00    10.16    0.17
Ile      AUC     345.00    50.09    0.83

Thr      ACG      25.00     3.63    0.07
Thr      ACA      14.00     2.03    0.04
Thr      ACU     130.00    18.87    0.35
Thr      ACC     206.00    29.91    0.55
..

Protein sequence databases - Content

The SWISS-PROT Protein Sequence Data Bank is a database of protein sequences produced collaboratively by Amos Bairoch (University of Geneva) and the EBI. It contains high-quality annotation, is nonredundant, and is cross-referenced to many other databases. Release 40.44 of SWISS-PROT contains 122'214 sequence entries comprising 44864044 amino acids.

SWISS-PROT is accompanied by TrEMBL, a computer-annotated supplement to SWISS-PROT. TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database not yet integrated into SWISS-PROT.
TrEMBL (March 2003) contains 725'373 sequence entries.
NCBI protein database: 1'335'897 sequences.

[Figure: Growth of Swissprot protein sequence database]

RELATIONSHIPS BETWEEN SWISS-PROT AND SOME BIOMOLECULAR DATABASES

[Figure: diagram with the SWISS-PROT Protein Sequence Data Bank at the centre, linked by arrows to the EMBL Nucleotide Sequence Database [EBI] and to the following databases: FlyBase, SubtiList [B.subtilis], Mendel [Plant], MaizeDb [Zea mays], WormPep [C.elegans], REBASE [Restriction enzymes], StyGene [S.Typhimurium], TRANSFAC, Harefield [2D], PROSITE [Patterns and profiles], PDB [3D structures], MGD [Mouse], GCRDb [7TM recep.], EcoGene [E.coli], SGD [Yeast], DictyDB [D.disco.], ENZYME [Nomencl.], OMIM [Human], ECO2DBASE [2D], Maize-2DPAGE [2D], SWISS-2DPAGE [2D], Aarhus/Ghent [2D], YEPD [Yeast] [2D], HSSP [3D similar.]]

Swissprot and relation to other databases

Example of Swissprot entry

ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DT   01-NOV-1986 (REL. 03, CREATED)
DT   01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT   01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
GN   PRNP.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86300093.
RA   KRETZSCHMAR H.A., STOWRING L.E., WESTAWAY D., STUBBLEBINE W.H.,
RA   PRUSINER S.B., DEARMOND S.J.;
RL   DNA 5:315-324(1986).
RN   [2]
RP   SEQUENCE OF 8-253 FROM N.A.
RX   MEDLINE; 86261778.
RA   LIAO Y.-C.J., LEBO R.V., CLAWSON G.A., SMUCKLER E.A.;
RL   SCIENCE 233:364-367(1986).
RN   [3]
RP   VARIANT AMYLOID GSS, SEQUENCE OF 58-85 AND 111-150.
RX   MEDLINE; 91160504.
RA   TAGLIAVINI F., PRELLI F., GHISO J., BUGIANI O., SERBAN D.,
RA   PRUSINER S.B., FARLOW M.R., GHETTI B., FRANGIONE B.;
RL   EMBO J. 10:513-519(1991).
RN   [4]
RP   REVIEW ON VARIANTS.
RX   MEDLINE; 93372867.
RA   PALMER M.S., COLLINGE J.;
RL   HUM. MUTAT. 2:168-173(1993).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
CC       "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND
CC       ANIMALS INFECTED WITH NEURODEGENERATIVE DISEASES KNOWN AS
CC       TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION DISEASES, LIKE:
CC       CREUTZFELDT-JACOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME
CC       (GSS), FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE
CC       IN SHEEP AND GOAT; BOVINE SPONGIFORM ENCEPHALOPATHY (BSE) IN
CC       CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME); CHRONIC WASTING
CC       DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM
CC       ENCEPHALOPATHY (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY
CC       (EUE) IN NYALA AND GREATER KUDU. THE PRION DISEASES ILLUSTRATE
CC       THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE,
CC       EUE ARE ALL THOUGHT TO OCCUR AFTER CONSUMPTION OF PRION-INFECTED
CC       FOODSTUFFS.
CC   -!- DISEASE: CJD OCCURS PRIMARILY AS A SPORADIC DISORDER (1 PER
CC       MILLION), WHILE 10-15% ARE FAMILIAL. ACCIDENTAL TRANSMISSION OF
CC       CJD TO HUMANS APPEARS TO BE IATROGENIC (CONTAMINATED HUMAN GROWTH
CC       HORMONE (HGH), CORNEAL TRANSPLANTATION, ELECTROENCEPHALOGRAPHIC
CC       ELECTRODE IMPLANTATION. . .). EPIDEMIOLOGIC STUDIES HAVE FAILED
CC       TO IMPLICATE THE INGESTION OF INFECTED ANNIMAL MEAT IN THE
CC       PATHOGENESIS OF CJD IN HUMAN. THE TRIAD OF MICROSCOPIC FEATURES
CC       THAT CHARACTERIZE THE PRION DISEASES CONSISTS OF (1) SPONGIFORM
CC       DEGENERATION OF NEURONS, (2) SEVERE ASTROCYTIC GLIOSIS THAT OFTEN
CC       APPEARS TO BE OUT OF PROPORTION TO THE DEGREE OF NERF CELL LOSS,
CC       AND (3) AMYLOID PLAQUE FORMATION. CJD IS CHARACTERIZED BY
CC       PROGRESSIVE DEMENTIA AND MYOCLONIC SEIZURES, AFFECTING ADULTS IN
CC       MID-LIFE. SOME PATIENTS PRESENT SLEEP DISORDERS, ABNORMALITIES OF
CC       HIGH CORTICAL FUNCTION, CEREBELLAR AND CORTICOSPINAL
CC       DISTURBANCES. THE DISEASE ENDS IN DEATH AFTER A 3-12 MONTHS
CC       ILLNESS.
CC   -!- DISEASE: GSS IS A HETEROGENEOUS DISORDER AND WAS DEFINED AS A
CC       "SPINOCEREBELLAR ATAXIA WITH DEMENTIA AND PLAQUELIKE DEPOSITS".
CC       GSS INCIDENCE IS LESS THAN 2 PER 100 MILLION.
CC   -!- DISEASE: KURU IS TRANSMITTED DURING RITUALISTIC CANNIBALISM, AMONG
CC       NATIVES OF THE NEW GUINEA HIGHLANDS. PATIENTS EXHIBIT VARIOUS
CC       MOVEMENT DISORDERS LIKE CEREBELLAR ABNORMALITIES, RIGIDITY OF THE
CC       LIMBS, AND CLONUS. EMOTIONNAL LABILITY IS PRESENT, AND DEMENTIA
CC       IS CONSPICUOUSLY ABSENT. DEATH USUALLY OCCURS FROM 3 TO 12 MONTH
CC       AFTER ONSET.
CC   -!- SIMILARITY: TO OTHER PRP.
CC   -!- DATABASE: NAME=HotMolecBase; NOTE=PrP entry;
CC       WWW="http://bioinformatics.weizmann.ac.il/hotmolecbase/entries/prp.htm".
FT   SIGNAL        1     22
FT   CHAIN        23    230       MAJOR PRION PROTEIN.
FT   PROPEP      231    253       REMOVED IN MATURE FORM (BY SIMILARITY).
FT   LIPID       230    230       GPI-ANCHOR (BY SIMILARITY).
FT   CARBOHYD    181    181       PROBABLE.
FT   CARBOHYD    197    197       PROBABLE.
FT   DISULFID    179    214       BY SIMILARITY.
FT   DOMAIN       51     91       5 X 8 AA TANDEM REPEATS OF P-H-G-G-G-W-G-Q.
FT   REPEAT       51     59       1.
FT   REPEAT       60     67       2.
FT   REPEAT       68     75       3.
FT   REPEAT       76     83       4.
FT   REPEAT       84     91       5.
FT   VARIANT     102    102       P -> L (IN GSS).
FT   VARIANT     105    105       P -> L (IN GSS).
FT   VARIANT     117    117       A -> V (LINKED TO DEVELOPMENT OF DEMENTING
FT                                GSS).
FT   VARIANT     129    129       M -> V (DETERMINES THE DISEASE PHENOTYPE IN
FT                                PATIENTS WHO HAVE A PRP MUTATION AT CODON
FT                                178: PATIENTS WITH MET DEVELOP FFI, THOSE
FT                                WITH VAL DEVELOP CJD).
FT   VARIANT     178    178       D -> N (IN FFI AND CJD).
FT   VARIANT     180    180       V -> I (IN CJD).
FT   VARIANT     198    198       F -> S (IN A ATYPICAL FORM OF GSS WITH
FT                                NEUROFIBRILLARY TANGLES).
FT   VARIANT     200    200       E -> K (IN CJD).
FT   VARIANT     210    210       V -> I (IN CJD).
FT   VARIANT     217    217       Q -> R (IN GSS WITH NEUROFIBRILLARY TANGLES).
FT   VARIANT     232    232       M -> R (IN CJD).
FT   CONFLICT    118    118       MISSING (IN REF. 2).
SQ   SEQUENCE   253 AA;  27661 MW;  FD5373AD CRC32;
     MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP
     HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA
     VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV
     NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV
     ILLISFLIFL IVG
//

Protein classification databases

- PROSITE
- Pfam
- InterPro

Prosite: Patterns are identified from multiple alignments of protein sequences

PROSITE Release 16.45: 1483 patterns

Example 1

ID   ATP_GTP_A; PATTERN.
AC   PS00017;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); NOV-1990 (INFO UPDATE).
DE   ATP/GTP-binding site motif A (P-loop).
PA   [AG]-x(4)-G-K-[ST].
CC   /TAXO-RANGE=ABEPV;
3D   1EFM; 1ETU; 1Q21; 2Q21; 4Q21; 5Q21; 6Q21;
DO   PDOC00017;

Example II

ID   ZINC_FINGER_C2H2; PATTERN.
AC   PS00028;
DT   APR-1990 (CREATED); JUN-1994 (DATA UPDATE); NOV-1997 (INFO UPDATE).
DE   Zinc finger, C2H2 type, domain.
PA   C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.
NR   /RELEASE=35,69113;

Pfam
www.sanger.ac.uk/Pfam/

Pfam is a database of multiple alignments of protein domains or conserved protein regions. The alignments are intended to represent evolutionarily conserved structure that has implications for the protein's function.

Version 6.6, August 2001, 3071 families

Over 65% of the proteins in SWISSPROT 38 and TrEMBL-11 have at least one match to a Pfam family. 72% of protein sequences have at least one match to Pfam.

Applications of gene ontology databases:

1. Cases where a search of the protein sequence databases is not informative

Query= FWR602467643.F1 664 23 640 ABI cut from 23 to 663. Remaining: 640 bases.
         (640 letters)

Database: /pubdata/ncbi/nr
           771,594 sequences; 245,249,561 total letters

Searching..................................................done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

dbj|BAB23278.1| (AK004371) putative [Mus musculus]                    369  e-101
gb|AAH08101.1|AAH08101 (BC008101) Similar to hypothetical protei...   174  7e-43

2. You want to answer questions like: Which proteins are linked to a specific biological process, like glycolysis?

Gene ontology consortium

Major principles:
- Molecular function
- Biological process
- Cellular component

Gene ontology
www.geneontology.org

Extract of gene association table:

SP   O00115   DRN2_HUMAN   GO:0003677   F   Deoxyribonuclease II precursor
SP   O00115   DRN2_HUMAN   GO:0004519   F   Deoxyribonuclease II precursor
SP   O00115   DRN2_HUMAN   GO:0004531   F   Deoxyribonuclease II precursor
SP   O00115   DRN2_HUMAN   GO:0005764   C   Deoxyribonuclease II precursor
SP   O00116   ADAS_HUMAN   GO:0005777   C   Alkyldihydroxyacetonephosphate..
SP   O00116   ADAS_HUMAN   GO:0005777   C   Alkyldihydroxyacetonephosphate..

Databases of protein/nucleic structure

[Figure: HIV protease]

[Figure: Secondary structure elements of proteins: alpha helix]

[Figure: Secondary structure elements of proteins: beta sheet]

Schematic pictures of proteins highlight secondary structure

Determination of protein structure

* X ray crystallography
* NMR

Example of PDB entry

HEADER    HORMONE                                 30-OCT-92   1BPH      1BPH   2
COMPND    INSULIN (CUBIC) IN 0.1M SODIUM SALT SOLUTION AT PH9           1BPH   3
SOURCE    BOVINE (BOS $TAURUS) PANCREAS                                 1BPH   4
AUTHOR    O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR                           1BPH   5
REVDAT   2   31-OCT-93 1BPHA   1       REMARK HET    FORMUL             1BPHA  1
REVDAT   1   15-JAN-93 1BPH    0                                        1BPH   6
JRNL        AUTH   O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR                  1BPH   7
JRNL        TITL   CONFORMATIONAL CHANGES IN CUBIC INSULIN CRYSTALS     1BPH   8
JRNL        TITL 2 IN THE PH RANGE 7-11                                 1BPH   9
JRNL        REF    BIOPHYS.J.                    V.  63  1210 1992      1BPH  10
JRNL        REFN   ASTM BIOJAU  US ISSN 0006-3495                 030   1BPH  11
REMARK   1                                                              1BPH  12
REMARK   1 REFERENCE 1

ATOM      1  N   GLY A   1      13.994  47.196  31.798  1.00 35.87      1BPH 129
ATOM      2  CA  GLY A   1      14.277  46.226  30.708  1.00 38.67      1BPH 130
ATOM      3  C   GLY A   1      15.574  45.507  31.085  1.00 31.18      1BPH 131
ATOM      4  O   GLY A   1      16.078  45.660  32.217  1.00 22.60      1BPH 132
ATOM      5  N   ILE A   2      16.088  44.766  30.126  1.00 28.39      1BPH 133
ATOM      6  CA  ILE A   2      17.342  44.034  30.404  1.00 23.76      1BPH 134
ATOM      7  C   ILE A   2      18.526  44.939  30.686  1.00 25.29      1BPH 135
ATOM      8  O   ILE A   2      19.425  44.457  31.392  1.00 18.74      1BPH 136
ATOM      9  CB  ILE A   2      17.571  43.072  29.158  1.00 27.36      1BPH 137
ATOM     10  CG1 ILE A   2      18.638  42.049  29.605  1.00 18.03      1BPH 138
ATOM     11  CG2 ILE A   2      17.859  43.936  27.903  1.00 25.54      1BPH 139
ATOM     12  CD1 ILE A   2      18.914  40.930  28.590  1.00 17.07      1BPH 140
ATOM     13  N   VAL A   3      18.619  46.195  30.192  1.00 24.42      1BPH 141
ATOM     14  CA  VAL A   3      19.774  47.080  30.436  1.00 30.26      1BPH 142
ATOM     15  C   VAL A   3      19.952  47.453  31.895  1.00 19.08      1BPH 143
ATOM     16  O   VAL A   3      21.018  47.421  32.561  1.00 28.15      1BPH 144
ATOM     17  CB  VAL A   3      19.719  48.274  29.462  1.00 33.87      1BPH 145
ATOM     18  CG1 VAL A   3      20.847  49.225  29.754  1.00 30.40      1BPH 146
ATOM     19  CG2 VAL A   3      19.868  47.724  28.044  1.00 24.51

3D viewers

Several programs are available for viewing protein and nucleic acid 3D structures:

Rasmol             www.umass.edu/microbio/rasmol/
Weblab             www.msi.com
Kinemage           www.cryst.bbk.ac.uk/PPS/vsns-pps/technology/kinemage.html
Chime              www.umass.edu/microbio/rasmol/
Protein explorer   www.umass.edu/microbio/chime/explorer/
Cn3D               www.ncbi.nlm.nih.gov/Entrez
SwissPDB viewer    expasy.proteome.org.au/spdbv/
(Molscript         www.avatar.se/molscript/)
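Before a viewer can draw a structure it has to read the ATOM records of a PDB entry such as 1BPH above. The sketch below shows one way this might be done, assuming the fixed-column layout of the PDB coordinate format (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54, occupancy in 55-60, B-factor in 61-66). The names `Atom`, `parse_atom`, `distance` and `SAMPLE` are hypothetical and exist only for this illustration; the two sample lines are the first two atoms of the entry.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    """One ATOM record (hypothetical helper type for this sketch)."""
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom(line: str) -> Atom:
    """Slice one fixed-column ATOM line into typed fields."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
    )


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in Angstrom."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


# First two atoms of the 1BPH excerpt, laid out in PDB columns.
SAMPLE = [
    "ATOM      1  N   GLY A   1      13.994  47.196  31.798  1.00 35.87",
    "ATOM      2  CA  GLY A   1      14.277  46.226  30.708  1.00 38.67",
]

if __name__ == "__main__":
    n, ca = (parse_atom(line) for line in SAMPLE)
    print(f"{n.name}-{ca.name} distance: {distance(n, ca):.3f} A")
```

Slicing by column rather than splitting on whitespace is the safer choice here, because adjacent fields in real PDB files can run together without a separating space.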