Databases

Databases in Bioinfo
• Central in bioinformatics, as presented in 2 of the definitions in the previous lecture:
  – The use of computers in solving information problems in the life sciences. It mainly involves the creation of extensive electronic databases on genomes, protein sequences etc. Also involves techniques such as three-dimensional modelling of biomolecules and biological systems (www.universityscience.ie/pages/glossary.php).
  – The collection, organization and analysis of large amounts of biological data, using networks of computers and databases (www.abc.net.au/science/slab/genome2001/glossary.htm).

Databases in Bioinfo
• There are a few central databases which are used often.
• There are many different databases which specialize in specific fields.

Databases in Bioinfo
• Central databases:
  – NCBI/GenBank (http://www.ncbi.nlm.nih.gov/)
    • Nucleotide + protein sequences and much more
  – EBI/EMBL (http://www.ebi.ac.uk/)
    • Similar to GenBank
  – Ensembl (http://www.ensembl.org/index.html)
    • Whole-genome browsing
  – ExPASy - Swiss-Prot/TrEMBL (http://www.expasy.ch/sprot/)
    • Manually annotated and reviewed protein DB.

Databases in Bioinfo
• Specialised databases (examples):
  – Relevant to a particular gene (e.g. RDP - 16S gene)
  – Groups of sequences sharing common properties (e.g. Pfam)
  – Particular organisms (e.g. TAIR - The Arabidopsis Information Resource)
  – Protein structures (PDB)
  – And many more…

Public Sequence Databases
• Three main databases
• Sequences are pooled (identical between DBs)
  – GenBank (NCBI – National Center for Biotechnology Information)
  – EMBL (EBI – European Bioinformatics Institute)
  – DDBJ (Japan Center for Information Biology)

NCBI - GenBank
• A collection of nucleotide sequences and their translations to proteins
• Data resources:
  – Submission of sequences from research groups
  – Bulk submission of sequences from large sequencing centers.
• An accession number is given to each sequence, including translated protein sequences.
• Most journals require a GenBank/EMBL/DDBJ accession number for any newly described sequence as an obligatory condition for publication.

Types of sequences
• Nucleotide sequences
  – Genomic sequences
  – cDNA
  – EST sequences
• Protein sequences

NCBI - RefSeq
• The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. (From the NCBI RefSeq website)
• Site: http://www.ncbi.nlm.nih.gov/RefSeq/

NCBI - RefSeq
• The RefSeq database is a curated collection of DNA, RNA, and protein sequences built by NCBI.
• RefSeq provides only one example of each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.
• For each model organism, RefSeq aims to provide separate and linked records for the genomic DNA, the gene transcripts, and the proteins arising from those transcripts.

NCBI - RefSeq
• RefSeq is limited to major organisms for which sufficient data is available (16,248 distinct organisms as of Sep. 2011).
• GenBank includes sequences for any organism submitted (more than 300,000 different named organisms).
• RefSeq records appear in a similar format to the GenBank records from which they are derived. They can be distinguished by their accession prefix, which includes an underscore.

RefSeq accession numbers
Examples:
• NC_123456 – Complete genomic molecules including genomes, chromosomes, organelles, plasmids.
• NM_123456 – Transcript products; mature messenger RNA (mRNA) transcripts.
• NP_123456 – Protein products; primarily full-length precursor products but may include some partial proteins and mature peptide products.
• For other prefixes, see http://www.ncbi.nlm.nih.gov/RefSeq/key.html

Primary vs. Derivative Sequence Databases
[Figure, from the NCBI field guides: labs and sequencing centers submit sequences to GenBank, the primary database, which is updated ONLY by submitters. Curators and algorithms build the derivative databases (RefSeq, Genome Assembly, UniGene) from it; these are updated continually by NCBI.]

NCBI - Entrez
• Entrez Global Query is an integrated search and retrieval system that provides access to all databases simultaneously with a single query string and user interface.
• Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps.
• http://www.ncbi.nlm.nih.gov/sites/gquery

NCBI - Entrez
• Some of the databases searched by Entrez:
  – Nucleotide: sequence database (GenBank)
  – Protein: sequence database
  – Gene: gene-centered information
  – PubMed: biomedical literature citations and abstracts
  – OMIM: Online Mendelian Inheritance in Man
  – Genome: whole genome sequences and mapping
  – Structure: three-dimensional macromolecular structures
  – UniGene: gene-oriented clusters of transcript sequences
  – GEO Profiles: expression and molecular abundance profiles

NCBI - PubMed
• PubMed is a free search engine for accessing the MEDLINE database of citations, abstracts and some full-text articles on life sciences and biomedical topics.
• MEDLINE (Medical Literature Analysis and Retrieval System Online) is a literature database of life sciences and biomedical information, compiled by the U.S. National Library of Medicine (NLM).
• http://www.ncbi.nlm.nih.gov/pubmed/

NCBI - Gene
• Integrated access to genes of genomes in the Reference Sequence (RefSeq) collection.
• Supplies key connections in the nexus of map, sequence, expression, structure, function, citation, and homology data.
• http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene

EMBL-EBI
• The European Bioinformatics Institute (EBI) is part of the European Molecular Biology Laboratory (EMBL).
• Hosts two main databases:
  – For nucleotide sequences (EMBL-Bank)
  – For protein sequences (UniProt).
• Provides data resources in all the major molecular domains

EMBL-EBI
• Indexes of databases and services:
  – Databases
  – Tools for Data Analysis Index
  – Services

EMBL-EBI
• “The EMBL Nucleotide Sequence Database is Europe's primary nucleotide sequence resource. The main sources of the DNA and RNA sequences in the database are direct submissions from individual researchers, genome sequencing projects and patent applications.”
• Similar to GenBank
• Growth stats

UniProt Knowledgebase
• UniProtKB/Swiss-Prot:
  – an annotated protein sequence database.
  – Contains high-quality annotation, is non-redundant and cross-referenced to many other databases
• UniProtKB/TrEMBL:
  – a computer-annotated supplement to UniProtKB/Swiss-Prot.
  – Contains the translations of all coding sequences present in the EMBL Nucleotide Sequence Database, which are not yet integrated into Swiss-Prot.

Database entries
• Sequence entries are composed of different line types, each with their own format.
• Built in such a way that they are readable both to humans and computers
• Three main formats:
  – GenBank
  – EBI
  – FASTA

Database entries
The same protein record (A4QJC3, RuBisCO large subunit) shown in the three formats.

EBI and SWISSPROT

ID   RBL_AETCO               Reviewed;         483 AA.
AC   A4QJC3;
DT   11-SEP-2007, integrated into UniProtKB/Swiss-Prot.
DT   15-MAY-2007, sequence version 1.
DT   13-OCT-2009, entry version 18.
DE   RecName: Full=Ribulose bisphosphate carboxylase large chain;
DE            Short=RuBisCO large subunit;
DE            EC=4.1.1.39;
DE   Flags: Precursor;
GN   Name=rbcL;
OS   Aethionema cordifolium (Lebanon stonecress).
OG   Plastid; Chloroplast.
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
OC   rosids; eurosids II; Brassicales; Brassicaceae; Aethionema.
OX   NCBI_TaxID=434059;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RA   Hosouchi T., Tsuruoka H., Kotani H.;
RT   "Sequencing analysis of Aethionema coridifolium chloroplast DNA.";
RL   Submitted (MAR-2007) to the EMBL/GenBank/DDBJ databases.
CC   -!- FUNCTION: RuBisCO catalyzes two reactions: the carboxylation of D-
CC       ribulose 1,5-bisphosphate, the primary event in carbon dioxide
CC       fixation, as well as the oxidative fragmentation of the pentose
CC       substrate in the photorespiration process. Both reactions occur
CC       simultaneously and in competition at the same active site (By
CC       similarity).
CC   -!- CATALYTIC ACTIVITY: 2 3-phospho-D-glycerate + 2 H(+) = D-ribulose
CC       1,5-bisphosphate + CO(2) + H(2)O.
CC   -!- CATALYTIC ACTIVITY: 3-phospho-D-glycerate + 2-phosphoglycolate =

GenBank

LOCUS       A4QJC3                   483 aa            linear   PLN 13-OCT-2009
DEFINITION  RecName: Full=Ribulose bisphosphate carboxylase large chain;
            Short=RuBisCO large subunit; Flags: Precursor.
ACCESSION   A4QJC3
VERSION     A4QJC3.1  GI:158513601
DBSOURCE    UniProtKB: locus RBL_AETCO, accession A4QJC3;
            class: standard.
            created: Sep 11, 2007.
            sequence updated: May 15, 2007.
            annotation updated: Oct 13, 2009.
            xrefs: AP009366.1, BAF49778.1, YP_001122954.1
            xrefs (non-sequence databases): GeneID:4968541, GO:0009507,
            GO:0000287, GO:0004497, GO:0016984, GO:0055114, GO:0009853,
            GO:0019253, HAMAP:MF_01338, InterPro:IPR000685, InterPro:IPR017443,
            InterPro:IPR017444, Gene3D:G3DSA:3.20.20.110,
            Gene3D:G3DSA:3.30.70.150, Pfam:PF00016, Pfam:PF02788,
            PROSITE:PS00157
KEYWORDS    Acetylation; Calvin cycle; Carbon dioxide fixation; Chloroplast;
            Disulfide bond; Lyase; Magnesium; Metal-binding; Methylation;
            Monooxygenase; Oxidoreductase; Photorespiration; Photosynthesis;
            Plastid.
SOURCE      chloroplast Aethionema cordifolium
  ORGANISM  Aethionema cordifolium
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
            rosids; eurosids II; Brassicales; Brassicaceae; Aethionema.

FASTA

>gi|158513601|sp|A4QJC3.1|RBL_AETCO RecName: Full=Ribulose bisphosphate carboxylase large chain; Short=RuBisCO large subunit; Flags: Precursor
MSPQTETKASVGFKAGVKEYKLTYYTPEYETKDTDILAAFRVTPQPGVPPEEAGAAVAAESSTGTWTTVW
TDGLTSLDRYKGRCYHIEPVPGEESQFIAYVAYPLDLFEEGSVTNMFTSIVGNVFGFKALAALRLEDLRI
PPAYTKTFQGPPHGIQVERDKLNKYGRPLLGCTIKPKLGLSAKNYGRAVYECLRGGLDFTKDDENVNSQP
FMRWRDRFLFCAEAIYKSQAETGEIKGHYLNATAGTCEEMIKRAVFARELGVPIVMHDYLTGGFTANTSL
AHYCRDNGLLLHIHRAMHAVIDRQKNHGMHFRVLAKALRLSGGDHIHAGTVVGKLEGDRESTLGFVDLLR
DDYVEKDRSRGIFFTQDWVSLPGVLPVASGGIHVWHMPALTEIFGDDSVLQFGGGTLGHPWGNAPGAVAN
RVALEACVQARNEGRDLAVEGNEIIREACKWSPELAAACEVWKEIRFNFPTIDKLDPSAEKVA

GenBank/GenPept entry
• Records in GenBank are divided into 2 parts:
  – Annotation:
    • General information: Accession, length & type (aa/bp), Version, Database from which derived, Source organism.
    • References to articles and comments about the sequence
    • Features: Notes pertaining to sections of the whole sequence.
  – The sequence itself.

GenBank/GenPept entry
• Options to control search results (grey bar on top):
• Format to display:
  – GenPept (for protein) – full annotation and features.
  – FASTA – just tag and sequence.
  – Graph – graphical presentation of sequence and features.
• How many results to show on the page (relevant when browsing query results).
• Results can be sent to:
  – The screen (default)
  – A plain-text display
  – A file (for downloading many sequences at once).

GenBank/GenPept entry
• You can display a subsection of the sequence. This is useful for whole genomes or for large contigs.
• “Links” opens a dropdown of links to many other databases

GenBank/GenPept entry
• Further details in hands-on session
• Best knowledge through trial and error…
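
As a small exercise for the hands-on session, the RefSeq accession prefixes listed earlier (NC_, NM_, NP_) can be recognised programmatically. The sketch below is only an illustration: the function name `classify_accession`, the lookup table, and the assumed pattern (two uppercase letters, an underscore, digits, optional version suffix) are not part of any NCBI tool, and the table covers only the three prefixes named in these slides.

```python
import re

# Prefix meanings taken from the "RefSeq accession numbers" slide.
# Other RefSeq prefixes exist; they are deliberately not listed here.
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NM": "mRNA transcript",
    "NP": "protein product",
}

# Assumed shape of a RefSeq-style accession: two uppercase letters,
# an underscore, digits, and an optional ".version" suffix.
_REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_accession(accession: str) -> str:
    """Return a short description of what kind of record an accession names."""
    match = _REFSEQ_PATTERN.match(accession.strip())
    if match is None:
        # No underscore prefix: by the rule in the slides, not a RefSeq record.
        return "not RefSeq-style (no underscore prefix)"
    prefix = match.group(1)
    return REFSEQ_PREFIXES.get(prefix, "RefSeq-style, prefix not in this table")


if __name__ == "__main__":
    for acc in ["NC_123456", "NM_123456", "NP_123456", "YP_001122954.1", "A4QJC3"]:
        print(acc, "->", classify_accession(acc))
```

The last two accessions come from the example record: `YP_001122954.1` has an underscore prefix that this small table does not cover, and `A4QJC3` is a UniProt accession with no underscore at all.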
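
The FASTA format described above ("just tag and sequence") is simple enough to read with a few lines of code. The following is a minimal sketch, not a replacement for an established parser: `parse_fasta` and `split_ncbi_header` are hypothetical helper names, and the header layout `gi|number|db|accession|name` is assumed only because it is what the example record in these slides uses.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new tag line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_ncbi_header(header):
    """Split a 'gi|...|db|accession|name description' tag into its fields."""
    identifier, _, description = header.partition(" ")
    fields = identifier.split("|")
    if len(fields) == 5 and fields[0] == "gi":
        return {
            "gi": fields[1],
            "source_db": fields[2],
            "accession": fields[3],
            "entry_name": fields[4],
            "description": description,
        }
    # Any other layout is returned unsplit rather than guessed at.
    return {"identifier": identifier, "description": description}


if __name__ == "__main__":
    # Tag line from the example record; only the first sequence line is included.
    example = (
        ">gi|158513601|sp|A4QJC3.1|RBL_AETCO RecName: Full=Ribulose "
        "bisphosphate carboxylase large chain\n"
        "MSPQTETKASVGFKAGVKEYKLTYYTPEYETKDTDILAAFRVTPQPGVPPEEAGAAVAAESSTGTWTTVW\n"
    )
    for header, sequence in parse_fasta(example):
        print(split_ncbi_header(header)["accession"], len(sequence))
```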
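
The point that entries are "readable both to humans and computers" can be seen with the EBI/Swiss-Prot layout, where every line starts with a two-letter line code (ID, AC, DT, ...). The sketch below groups lines by that code. It assumes each annotation line begins with two uppercase letters followed by a space; `group_by_line_code` is a hypothetical helper, and real entries have further conventions (sequence block, `//` terminator) that it simply skips.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Group the lines of an EBI/Swiss-Prot style entry by two-letter line code."""
    grouped = defaultdict(list)
    for line in entry_text.splitlines():
        code = line[:2]
        # Keep only lines that look like 'XX   content'; skip everything else.
        has_code = len(code) == 2 and code.isalpha() and code.isupper()
        if not has_code or (len(line) > 2 and line[2] != " "):
            continue
        grouped[code].append(line[2:].strip())
    return dict(grouped)


if __name__ == "__main__":
    # A few lines copied from the example entry in the slides.
    entry = "\n".join([
        "ID   RBL_AETCO               Reviewed;         483 AA.",
        "AC   A4QJC3;",
        "DT   11-SEP-2007, integrated into UniProtKB/Swiss-Prot.",
        "DT   15-MAY-2007, sequence version 1.",
        "DT   13-OCT-2009, entry version 18.",
        "OS   Aethionema cordifolium (Lebanon stonecress).",
    ])
    for code, lines in group_by_line_code(entry).items():
        print(code, lines)
```

Grouping by line code keeps repeated line types (such as the three DT lines) together in their original order, which is usually what a later parsing step needs.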