Other biological databases
[Figure: types of data describing biological systems: sequence data, protein folding and 3D structure, taxonomic data, literature, pathways and networks, protein families and domains, small molecules, whole genome data, ontologies (GO)]
• Transcription factor binding sites: TRANSFAC
• Protein structure databases: PDB, SCOP, CATH
• Protein family databases: Pfam, Prints, PROSITE etc.
• Chemicals and small molecules: ChEBI
• Gene expression databases: GEO, ArrayExpress
• Metabolic pathways: Reactome, KEGG
• Genome databases: Ensembl, FlyBase, WormBase etc.
• Human genetics-related databases: HapMap, dbSNP

Transcription factor binding sites
• TRANSFAC: database of eukaryotic transcription factors: http://www.gene-regulation.com/pub/databases.html#transfac
• TESS (Transcription Element Search System): predicts transcription factor binding sites, uses TRANSFAC: http://www.cbi.upenn.edu/tess
• TFsearch: searches for transcription factor binding sites: http://www.cbrc.jp/research/db/TFSEARCH.html

Protein structure databases
• Main resource is the Protein Data Bank (PDB): http://www.rcsb.org/pdb/
• Contains the spatial coordinates of the atoms of macromolecules whose 3D structure has been obtained by X-ray or NMR studies
• Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes…)
• Can search by PDB code

Searching MSD
• http://www.ebi.ac.uk/msd
• Search by PDB code

Protein structure-related databases
• Structural family databases based on PDB: SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/) and CATH (http://www.biochem.ucl.ac.uk/bsm/cath/)
• Predicted structures in SWISS-MODEL (http://swissmodel.expasy.org//SWISS-MODEL.html)

Protein family databases
• Databases that produce signatures for identifying protein families or domains
• Used for functional classification of proteins
• E.g. Pfam, PROSITE, Prints, SMART, TIGRFAMs etc.
• Integrated into a single resource, InterPro (http://www.ebi.ac.uk/interpro)

InterProScan sequence search
• Stand-alone version available

InterPro text search
• Search by keyword, protein acc or InterPro acc

Results for protein acc

Example InterPro entry

Chemicals and small molecules
• Chemical Abstracts: http://www.cas.org/
• ChEBI: http://www.ebi.ac.uk/chebi
• KEGG (part of it includes chemicals): http://www.genome.jp/kegg
• ChemIDplus (chemicals cited in NLM databases): http://chem2.sis.nlm.nih.gov/chemidplus/chemidlite.jsp
• MSD-Chem: ligands and chemicals in MSD

ChEBI example entry

Hierarchy for chemicals

Gene expression databases
• NCBI Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
• ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
• Stanford Microarray Database: http://genome-www5.stanford.edu/
• Can usually search for experiments or particular expression profiles

GEO search page

Profiles search results

Specific entry and experiment info

ArrayExpress search results

What does the data look like?
• Info on experiment, array used, etc.
• Raw or processed tab-delimited file containing spots and their intensities (Cy3/Cy5 ratios) across different samples
• Files with metadata, e.g. sample info, annotation and coordinates of each spot on the array

Proteomics: SWISS-2DPAGE

Enzymes and metabolic pathways
• Contain information describing enzymes, biochemical reactions and metabolic pathways
• ENZYME and BRENDA: nomenclature databases that store information on enzyme names and reactions
• IntEnz: Integrated relational Enzyme database

Enzyme nomenclature
• E.C. (Enzyme Commission) numbers are assigned based on the reactions the enzymes catalyze
• Hierarchy, high-level groups:
  – EC 1: Oxidoreductases
  – EC 2: Transferases
  – EC 3: Hydrolases
  – EC 4: Lyases
  – EC 5: Isomerases
  – EC 6: Ligases

EC example

Metabolic pathway databases
• Pathguide: >200 pathway resources
• KEGG (Kyoto Encyclopedia of Genes and Genomes): http://www.genome.jp/kegg, includes:
  – Database of chemicals, genes and networks (metabolic, regulatory etc.)
  – Well-curated and quite specific
• EcoCyc (Encyclopedia of E. coli K12 genes and metabolism): http://ecocyc.org (curation of the entire genome)
• Reactome: curated biological pathways: http://www.reactome.org/
• GenMAPP: pathways contributed by users: http://www.genmapp.org

Different pathways in different species: comparison

Pathway in Reactome

Example of a pathway in BioCyc

Protein-protein interaction databases
• Store pairwise interactions or complexes
• Can get 1 to more than 20,000 interactions per publication
• IntAct: http://www.ebi.ac.uk/intact
• DIP (Database of Interacting Proteins): http://dip.doe-mbi.ucla.edu/
• BIND (Biomolecular Interaction Network Database): http://submit.bind.ca:8080/bind/

Protein-protein interactions in IntAct

Integrated functional interactions in STRING

Genome browsers
• Integrate sequence & functional data for a genome
• Ensembl: genome browser for major eukaryotic genomes, e.g. human, mouse etc.: http://www.ensembl.org
• UCSC browser: http://genome.ucsc.edu/
• FlyBase: Drosophila genome database: http://www.ebi.ac.uk/flybase
• WormBase: C. elegans: http://www.wormbase.org
• PlasmoDB: Plasmodium (malaria): http://plasmodb.org
• Etc.
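As a small illustration of the E.C. hierarchy listed under Enzyme nomenclature, the sketch below maps the first field of an EC number to its top-level class. The function name `ec_class` and the lookup table are illustrative assumptions, not part of any database's API, and the table covers only the six classes named on that slide.

```python
# Top-level EC classes exactly as listed on the Enzyme nomenclature slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_class(ec_number: str) -> str:
    """Return the top-level class name for an EC number such as '1.1.1.1'.

    Hypothetical helper for illustration; an optional 'EC' prefix is accepted.
    """
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    # The first dot-separated field is the top-level class.
    first_field = text.split(".")[0]
    if not first_field.isdigit():
        raise ValueError(f"not an EC number: {ec_number!r}")
    top_level = int(first_field)
    if top_level not in EC_CLASSES:
        raise ValueError(f"unknown top-level EC class: {top_level}")
    return EC_CLASSES[top_level]


if __name__ == "__main__":
    # EC 1.1.1.1 is alcohol dehydrogenase, an oxidoreductase.
    print(ec_class("EC 1.1.1.1"))
```

The remaining fields of an EC number narrow the class down further; this sketch deliberately stops at the first level because that is all the slide spells out.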
Ensembl genome browser

Ensembl gene view 1

Ensembl gene view 2

Gene within context on chromosome

Human genetics databases
• GeneCards (http://www.genecards.org/)
• HapMap (http://hapmap.ncbi.nlm.nih.gov/)
• OMIM (http://www.ncbi.nlm.nih.gov/omim)
• HGDP, Human Genome Diversity Project (http://hagsc.org/hgdp/files.html)

Mutation/polymorphism databases
• Most of the databases are disease- or gene-centric, e.g. p53
• dbSNP: http://www.ncbi.nlm.nih.gov/SNP/
  – Repository of all known mutations (human and other organisms)

Where to find the databases
• Table of addresses for major databases and tools
• Nucleic Acids Research Database issue, January each year
• Nucleic Acids Research Software issue (new)
• Expasy list of tools: http://ca.expasy.org/links.html

Large scale data retrieval
• Programmatic access to many databases
• MySQL access to some
• BioMart access: public and private
• FTP sites: large data downloads

Other tutorials
• http://www.ensembl.org/info/website/tutorials/index.html
• http://www.ebi.ac.uk/training/online/
• http://www.ebi.ac.uk/2can/home.html
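The "programmatic access" bullet under Large scale data retrieval can be illustrated with a short sketch. It assumes the Ensembl REST service at rest.ensembl.org and its `lookup/symbol` endpoint, neither of which is named in the slides; the helper names are hypothetical and the gene symbol is only an example. Check the service's current documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Ensembl's public REST server; not named in the original slides.
ENSEMBL_REST = "https://rest.ensembl.org"


def build_lookup_url(species: str, symbol: str) -> str:
    """Build a gene lookup URL for a species and gene symbol (hypothetical helper)."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species), urllib.parse.quote(symbol)
    )
    return ENSEMBL_REST + path + "?content-type=application/json"


def fetch_gene(species: str, symbol: str, timeout: float = 30.0) -> dict:
    """Fetch the JSON record for a gene symbol. Requires network access."""
    url = build_lookup_url(species, symbol)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example call (needs network access, so it is left commented out):
# record = fetch_gene("homo_sapiens", "BRCA2")
# print(record.get("id"))
```

Keeping URL construction separate from the network call makes the request easy to inspect and test offline. For bulk downloads, the FTP sites or BioMart listed on the slide are likely a better fit than many single requests.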