Computer Storage of Sequences
(Chapter 2 of Bioinformatics: Sequence and Genome Analysis by David W. Mount)
CSE730: Seminar on "Information Retrieval of Biomedical Text and Data"

Outline
Storing DNA and protein sequences in computer files or databases.
Related information is placed in the database along with the sequence, in one of a number of sequence data formats.
Online public-access databases for sequence retrieval.

Nucleotide Sequence
Nomenclature Committee of the International Union of Biochemistry

Code  Nucleic acid(s)      Code  Nucleic acid(s)
A     Adenine              M     A or C (amino)
C     Cytosine             R     A or G (purine)
G     Guanine              W     A or T (weak)
T     Thymine              S     C or G (strong)
U     Uracil               Y     C or T (pyrimidine)
                           K     G or T (keto)
                           V     A or C or G
                           H     A or C or T
                           D     A or G or T
                           B     C or G or T
                           N     A or G or C or T (any)

Protein Sequence

Code  Amino acid                     Code  Amino acid
A     Alanine                        N     Asparagine
B     Aspartic acid or asparagine    P     Proline
C     Cysteine                       Q     Glutamine
D     Aspartic acid                  R     Arginine
E     Glutamic acid                  S     Serine
F     Phenylalanine                  T     Threonine
G     Glycine                        V     Valine
H     Histidine                      W     Tryptophan
I     Isoleucine                     X     Unknown
K     Lysine                         Y     Tyrosine
L     Leucine                        Z     Glutamic acid or glutamine
M     Methionine

Adapted from IUPAC-IUB (1969, 1972, 1983)

Sequence Formats
A sequence is stored as ASCII text (i.e., a string of A, G, C, T, ...) along with annotations.
Different sequence analysis programs recognize different sequence formats.
A sequence format includes accessory information (gene names, source organism, investigator name, references) and the actual sequence.

Sequence Formats (continued)
FASTA
GenBank flat file format
PIR/CODATA format
EMBL sequence entry format
Intelligenetics sequence entry format
GCG (Genetics Computer Group) sequence entry format
ASN.1
XML

Databases
NCBI GenBank: at the National Center for Biotechnology Information (NCBI), National Library of Medicine, Bethesda, MD
NBRF: Protein Information Resource (PIR) database at the National Biomedical Research Foundation in Washington, DC

Databases (continued)
SwissProt: the SwissProt protein sequence database at ISREC, Swiss Institute for Experimental Cancer Research
EMBL: European Molecular Biology Laboratory (EMBL) Outstation at Hinxton, England
DDBJ: DNA Data Bank of Japan (DDBJ) at Mishima, Japan

Databases on Internet
NCBI       http://www.ncbi.nlm.nih.gov
PIR        http://www-nbrf.georgetown.edu/pirwww
SwissProt  http://www.expasy.ch/cgi-bin/sprot-search-de
EMBL       http://www.ebi.ac.uk/embl/index.html
DDBJ       http://www.ddbj.nig.ac.jp/

NCBI
National resource for molecular biology information.
Maintains comprehensive databases for a variety of biotechnology-related information.
Develops and manages access to a range of databases and software for the scientific and medical communities.

NCBI: Integrated Databases
Literature Databases
PubMed
PubMed Central
OMIM
PROW
BookShelf

NCBI: Integrated Databases (continued)
Nucleotide Databases
GenBank
EST Database
GSS Database
SNPs Database
RefSeq
STS Database

NCBI: Integrated Databases (continued)
Entrez Databases
PubMed
Protein Sequence Database
Nucleotide Sequence Database
Taxonomy
OMIM

GenBank
GenBank is the NIH genetic sequence database.
It is an annotated collection of all publicly available DNA sequences.
GenBank is part of an international collaboration of sequence databases, along with EMBL and DDBJ.
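
The nucleotide code table and the FASTA format listed earlier can be illustrated with a short sketch. It is only a minimal example, not a reference implementation: the function names (`parse_fasta`, `matches`) and the two FASTA records are invented for illustration and do not come from any database. It assumes the common FASTA layout of a `>` header line followed by one or more sequence lines.

```python
# IUPAC nucleotide codes, copied from the table in this transcript.
# Each code maps to the set of bases it may stand for.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def matches(pattern, sequence):
    """True if every base in `sequence` is allowed by the IUPAC code
    at the same position in `pattern`."""
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC_NUCLEOTIDES[code]
               for code, base in zip(pattern, sequence))


# Hypothetical records, made up for this example.
EXAMPLE_FASTA = """>example_1 hypothetical record for illustration
ACGTAC
GTTA
>example_2 another made-up record
GGRYNN
"""

records = parse_fasta(EXAMPLE_FASTA)
for header, sequence in records:
    print(header.split()[0], len(sequence), sequence)
```

Because ambiguity codes are stored as plain letters in the same ASCII text as the four bases, a program that reads sequence files needs a lookup table like this one before it can compare sequences that contain them.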
GenBank DNA Sequence Format
A DNA sequence entry in GenBank is organized into distinct fields, as follows:
Locus: locus name, sequence length, division, date
Definition: description of entry
Accession: unique accession number
Version: version of sequence
Keywords: keywords for cross-referencing

GenBank DNA Sequence Format (continued)
Source: source organism of DNA
Organism: description of organism
References: authors, title, journal, Medline, etc.
Features: information about sequence
Base count: number of bases in sequence
Origin: sequence data begin on the line following Origin

GenBank sample

NCBI: Tools
Tools for data retrieval and submission
Text term searching
Sequence similarity searching
Taxonomy searching
Sequence submission

NCBI: Entrez
Entrez is a search and retrieval system that integrates information from databases at NCBI.
These databases include nucleotide sequences, protein sequences, macromolecular structures, whole genomes, MEDLINE, and PubMed.

Entrez

NCBI: BLAST
BLAST: Basic Local Alignment Search Tool
It is a set of similarity search programs designed to explore available sequence databases.
It uses a heuristic algorithm that can detect relationships among sequences that share only isolated regions of similarity.
Q-BLAST: a queuing system for BLAST that allows users to retrieve results at their convenience and format their results.
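
The GenBank fields described earlier (Locus, Definition, Accession, and so on, ending with Origin) can be read with a small parser. The following is a rough sketch under stated assumptions, not NCBI's own parser: it assumes the flat-file convention that keywords occupy the first 12 columns, that continuation lines leave those columns blank, and that `//` ends the record. The function name `parse_genbank`, the record, and its accession `XX000000` are made up for illustration, and the Features table is left out of the sample because it needs more careful handling than this sketch provides.

```python
def parse_genbank(text):
    """Parse one GenBank flat-file record into a dict of field -> value.

    Assumes keywords sit in the first 12 columns and values start at
    column 13; the sequence after ORIGIN is stored under "SEQUENCE".
    """
    fields = {}
    sequence_parts = []
    current = None
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Sequence lines: a position number, then blocks of bases.
            sequence_parts.extend(
                tok for tok in line.split() if not tok.isdigit())
            continue
        keyword = line[:12].strip()
        value = line[12:].strip()
        if keyword == "ORIGIN":
            in_origin = True
        elif keyword:
            current = keyword
            fields[current] = value
        elif current is not None and value:
            # Blank keyword columns: continuation of the previous field.
            fields[current] += " " + value
    fields["SEQUENCE"] = "".join(sequence_parts).upper()
    return fields


# Hypothetical record, invented for this example (not a real entry).
EXAMPLE_RECORD = """LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record for illustration
            only.
ACCESSION   XX000000
VERSION     XX000000.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 acgtacgtac gtacgtacgt acgt
//
"""

entry = parse_genbank(EXAMPLE_RECORD)
locus_name, declared_length = entry["LOCUS"].split()[:2]
print(locus_name, declared_length, entry["ACCESSION"])
print(entry["DEFINITION"])
print(entry["SEQUENCE"])
```

Comparing the length declared on the Locus line with the number of bases actually read after Origin is a cheap consistency check when loading such files.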
NCBI: BLAST (continued)
Access to the BLAST service
Web BLAST
Standalone BLAST
Network BLAST
BLAST URL API

NCBI: BLAST (continued)
BLAST Programs
blastp: compares an amino acid query sequence against a protein sequence database
blastn: compares a nucleotide query sequence against a nucleotide sequence database
blastx: compares a nucleotide query sequence against a protein sequence database
tblastn: compares a protein query sequence against a nucleotide sequence database

BLAST

NBRF: PIR
Protein Information Resource
3 major databases:
PSD (Protein Sequence Database)
iProClass
PIR-NREF (Non-redundant REFerence protein database)

PIR: PSD
The PIR, in collaboration with MIPS and JIPID, produces and distributes the PIR-International Protein Sequence Database (PSD).
Comprehensive and expertly annotated protein sequence database.
The primary sources of PSD data are sequences from GenBank/EMBL/DDBJ translations, published literature, and direct submission to PIR-International.

PIR: PSD (continued)
The PIR-PSD data is available in XML format and in NBRF and PIR/CODATA formats.
The sequence file is available in FASTA format.
Also available at the PIR UNIX FTP server.
Address: ftp://ftp.pir.georgetown.edu/pir_databases/psd/

CODATA format
CODATA format has approximately the same information as a GenBank or EMBL sequence file, but is formatted slightly differently and has different field names.
Also called PIR format; used by NBRF.

CODATA Sample

PIR: iProClass
The iProClass database provides comprehensive descriptions of all proteins and serves as a framework for data integration in a distributed networking environment.
Very user-friendly description.

PIR: NREF (Non-redundant REFerence protein database)
Comprehensive: contains all sequences from PIR-PSD, SwissProt, TrEMBL, RefSeq, and GenPept, and is updated bi-weekly.
Non-redundant: clustered by sequence identity and taxonomy at the species level.
Source attribution: contains protein IDs and names from associated databases (with hypertext links), in addition to protein sequence, taxonomy, and bibliography.
The current version (July 2002) consists of more than 809,000 non-redundant PIR-PSD, SwissProt, and TrEMBL proteins organized with more than 36,200 PIR superfamilies, 145,340 families, and links to over 50 molecular biology databases.

Swiss-Prot
Swiss-Prot is a protein knowledgebase established in 1986.
Maintained collaboratively by the Department of Medical Biochemistry of the University of Geneva (now the Swiss Institute of Bioinformatics) and the EMBL Data Library.

Swiss-Prot Sequence Entry Example

Sequence Format Conversion
READSEQ: sequence format conversion program.
http://bimas.dcrt.nih.gov/molbio/readseq/
Can convert to/from:
ASN.1
FASTA
CODATA
GCG
EMBL format
GenBank format
and many other formats

References
http://www.ncbi.nlm.nih.gov
http://www-nbrf.georgetown.edu/pirwww
http://www.expasy.ch/cgi-bin/sprot-search-de
http://www.ebi.ac.uk/embl/index.html
http://www.ddbj.nig.ac.jp/

Thank You
Presented by: Hemal Patel & Jeetal Shah