BioMart
Databases made easy

Richard Holland
European Bioinformatics Institute
Helsinki, September 2006

BioMart
• A joint project
  – European Bioinformatics Institute (EBI)
  – Cold Spring Harbor Laboratory (CSHL)
• Aim
  – To develop a generic, query-oriented data management system capable of integrating distributed data sources.

Focus
• ‘Data mining’ or advanced search
  – Creating custom datasets
  – Querying multiple datasets
  – Interactive
• Users
  – People who provide database-based services
  – ‘Power user’ biologists and bioinformaticians

Requirements
• User
  – ‘One-stop shop’ for biological data
  – Suitable for power biologists and bioinformaticians
  – A set of interfaces that allow users to group and refine biological data based upon many criteria
• Deployer
  – ‘Out of the box’ installation
  – Built-in query optimization
  – Easy data federation
• Architecture
  – Domain agnostic
  – Distributed
  – Platform independent

Advanced search GUIs

Single interface
• Single access point
• Queries across different databases
[Diagram: Dataset 1, Links, Dataset 2]

Main features
• Domain agnostic
• Platform independent (MySQL, Oracle, Postgres)
• Scalable for big datasets
• Federated architecture
• Automated UI configuration

How does it work?
[Diagram: source data, BioMart software, data mart, meta data (XML)]

Federated architecture
[Diagram: Query Engine]

Data model
[Diagram: tables related through primary keys (PK) and foreign keys (FK)]

Data model - ‘reversed star’
[Diagram: main tables and dimension (dm) tables related through keys PK1, PK2, FK1, FK2]

Data mart and dataset
[Diagram: Dataset]

Data mart, dataset and virtual schema
[Diagram: virtual schema]

BioMart abstractions
• Dataset
  – A subset of data organized into 1 or more tables
• Attribute
  – A single data point
  – e.g. gene name
• Filter
  – An operation on an attribute
  – e.g. ‘Chromosome = 1’

Datasets, Attributes and Filters
[Diagram: a Mart containing a Dataset; table GENE with columns gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description; labels Attribute and Filter]

BioMart abstractions (cont)
• Link
  – ‘Common currency’ between two datasets
  – e.g. accession
• Exportable
  – Potential links to export
• Importable
  – Potential links to import

Exportables, Importables and Links
[Diagram: Dataset 1, Links, Dataset 2]

Exportables, Importables and Links
• Exportable (Dataset 1)
  – name = uniprot_id
  – attributes = uniprot_ac
• Importable (Dataset 2)
  – name = uniprot_id
  – filters = uniprot_ac

Exportables, Importables and Links
• Exportable (Dataset 1)
  – name = genomic_region
  – attributes = chr_name, chr_start, chr_end
• Importable (Dataset 2)
  – name = genomic_region
  – filters = chr_name (=), chr_start (>=), chr_end (<=)

Creating BioMart databases

Building BioMart databases
[Diagram: Source databases, Transformation (MartBuilder), Mart, Configuration (MartEditor, XML)]

Schema transformation principles
• Central table
  – Longest n:1, 1:1 path
• Dimension table
  – Central transformation ‘around’ 1:n table
  – Link tables are decomposed into a set of 1:n relations first

MartBuilder Application
• Reads database meta data
• Transforms a source schema into suggested datasets and lets you edit the process
• Produces a set of SQL statements (DDL) to run against the server to perform the transformation

Dataset Configuration
• Dataset configuration
  – Attributes
  – Filters
  – Trees, Groups, Collections
  – Exportables, Importables
  – Semantics
  – Relational mapping
• User interface
• Linking datasets
• XML-based

Table naming convention
Naïve configuration
• Tables
  – Meta tables: meta_content
  – Data tables: dataset__content__type
• Data tables
  – Main: __main
  – Dimension: __dm
• Columns
  – Key: _key

Naming convention examples
• Homo sapiens gene ensembl
  – hsapiens_gene_ensembl__gene__main
  – hsapiens_gene_ensembl__xref_hugo__dm
• Encode
  – hsapiens_encode__encode__main
• Uniprot
  – uniprot__protein__main
  – uniprot__interpro__dm
• Uniprot sequence
  – uniprot_sequence__sequence__main

Dataset Configuration
[Diagram: MartEditor, dataset configuration XML]

Accessing BioMart databases

BioMart architecture
[Diagram: Retrieval: MartShell and MartExplorer (Java), MartView (Perl), BioMart API. Databases: public data, local or remote (Ensembl, Vega, SNP, UniProt, MSD), myMart, myDatabase. Schema transformation: MartBuilder. Configuration: MartEditor (XML).]

MartView (current)

MartView (new 0_5)

MartExplorer

MartShell
• Using = dataset
• Get = attribute
• Where = filter

MartShell (MQL)
• Uses Mart Query Language (MQL) to generate queries:
    using <dataset> get <attributes> where <filters>
• Can join datasets together:
    using Dataset1 get Attribute1 where Filter1=var1 as q;
    using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
• Can script and pipe:
    martshell.sh -E MQLscript.mql > results.txt
    martshell.sh -E MQLscript.mql | wc

MartShell examples
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only;
193l
194l
1arb
...
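The MQL pattern shown on the MartShell (MQL) slide (using, get, where, optionally "as" to name a result for a later "in" filter) can be composed programmatically before being passed to martshell.sh. The following is a minimal sketch only: build_mql is a hypothetical helper and not part of BioMart, and the comma used to separate several attributes is an assumption, since the slides only show single-attribute queries.

```python
def build_mql(dataset, attributes, filters=(), store_as=None):
    """Compose one MQL statement of the form
    using <dataset> get <attributes> where <filters> [as <name>];

    Hypothetical helper for illustration; it only builds a string and
    does not validate dataset, attribute or filter names.
    """
    if not attributes:
        raise ValueError("at least one attribute is required")
    # Assumption: several attributes are separated by commas.
    parts = ["using", dataset, "get", ", ".join(attributes)]
    if filters:
        # The slides join several filters with "and".
        parts += ["where", " and ".join(filters)]
    if store_as:
        # "as q" names the result so a later query can filter with "in q".
        parts += ["as", store_as]
    return " ".join(parts) + ";"


# First query: store the PDB ids under the name q.
first = build_mql(
    "MSD.msd",
    ["pdb_id"],
    ["resolution_less < 1.5", "has_ec_info only"],
    store_as="q",
)
# Second query: use the stored result as a filter value.
second = build_mql("Dataset2", ["Attribute2"], ["Filter2=var2", "filter3 in q"])

print(first)
print(second)
```

The two strings could then be written to a file such as MQLscript.mql and run with martshell.sh -E, as on the slide.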
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only as q;
MartShell> using Ensembl.hsapiens_gene_ensembl get sequence transcript_flanks+1000 where pdb in q;
ENST00000270142.2 ENSG00000142168.2 strand=forward chr=21 assembly=NCBI34 downstream flanking sequence of transcript only
AAACTAAATTAGCTCTGATACTTATTTATATAAACAGCTTCAGTGG
AA
....

biomaRt

Taverna

DAS ProServer

BioMart deployers
• Large scale data federation (EBI)
• Optimising access to a large database (Ensembl, WormBase)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)

Hinxton example
[Diagram: EBI, Sanger; UniProt, MSD, Ensembl, SNP, Vega, Sequence; WWW]

BioMart deployers
• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)

WormBase
• Genes
• Expression
• Phenotypes
• Variations
• Literature
• Ontologies
• Sequence

Ensembl
• Genes
• Ontologies
• Variations
• Protein annotation
• Disease
• Homologies
• Sequence
• Array annotations

HapMap
• Population
• Frequencies
• Inter population comparisons
• Gene annotation

ArrayExpress

BioMart deployers
• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase)
• Federating third party data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc.)

In development
• CAPRISA
• RGD
• DICTYBASE
• PURDUE UNIVERSITY
• RZPD

Music Mart

BioMart model
• Already applied
  – Ensembl
  – Vega
  – SNP
  – UniProt
  – MSD
  – ArrayExpress
  – WormBase
  – Gramene
  – HapMap
  – Variety of ‘in house’ projects (academic and industrial)

User restriction
[Diagram: Dataset; XML for martUser “default”; XML for martUser “advanced”]

Interface configuration
[Diagram: Dataset; XML for interface “single-page web interface”; XML for interface “wizard style web interface”]

Web services
[Diagram: MartView, MartService (XML, port 80); Local Mart and Remote Mart (port 3306, one link marked X)]

Web services (cont)
MartService requests
• Registry XML
• Dataset information: name, type etc.
• DatasetConfig XML
• Mart Query:
  – API query object is converted to an XML representation on the client and sent to the server.
  – Query object is regenerated on the server and processed. Results are sent back to the client as a simple tab-delimited HTML page.

Summary
• A generic data management system
  – A set of easily configurable user interfaces
  – Distributed data federation
  – Query optimization

BioMart
• www.biomart.org
• Open source (LGPL)
• Public MySQL server
• ftp
• [email protected]
• [email protected]

Acknowledgments
• BioMart
  – Arek Kasprzyk (EBI)
  – Damian Smedley (EBI)
  – Syed Haider (EBI)
  – Gudmundur Thorisson (CSHL)
• Contributors
  – Darin London (EBI)
  – Will Spooner (CSHL)
  – Damian Keefe (Ensembl)
  – Arne Stabenau (Ensembl)
  – Andreas Kahari (Ensembl)
  – Craig Melsopp (Ensembl)
  – Katerina Tzouvara (UniProt)
  – Paul Donlon (Unilever)
  – Steffen Durinck (SCD-ESAT, Katholieke Universiteit Leuven)
  – Benoit Ballester (Universite de la Mediterranee)
  – Stephen Robinson (EBI)
  – Asif Kibria (EBI)
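
Appendix sketch: the Mart Query exchange described under MartService requests (a query object turned into XML on the client and regenerated on the server) can be illustrated with a small round trip. This is a sketch under stated assumptions, not the MartService wire format: the MartQuery class and the element names Query, Dataset, Attribute and Filter are chosen here for illustration, and the attribute and filter names are borrowed from the GENE table example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class MartQuery:
    """Hypothetical stand-in for the API query object."""
    dataset: str
    attributes: list = field(default_factory=list)
    filters: dict = field(default_factory=dict)


def to_xml(query):
    """Client side: serialise the query object to an XML string.
    Element names are illustrative assumptions."""
    root = ET.Element("Query")
    dataset = ET.SubElement(root, "Dataset", name=query.dataset)
    for attribute in query.attributes:
        ET.SubElement(dataset, "Attribute", name=attribute)
    for name, value in query.filters.items():
        ET.SubElement(dataset, "Filter", name=name, value=str(value))
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Server side: rebuild the query object from the XML string."""
    dataset = ET.fromstring(text).find("Dataset")
    return MartQuery(
        dataset=dataset.get("name"),
        attributes=[a.get("name") for a in dataset.findall("Attribute")],
        filters={f.get("name"): f.get("value") for f in dataset.findall("Filter")},
    )


client_query = MartQuery(
    "hsapiens_gene_ensembl",
    ["gene_stable_id", "description"],
    {"chromosome": "1"},
)
wire = to_xml(client_query)      # what would be sent to the server
server_query = from_xml(wire)    # what the server would reconstruct

print(wire)
print(server_query == client_query)
```

Keeping the query as plain data (a dataset name, attribute names, filter name/value pairs) is what makes the round trip lossless in this sketch; filter values travel as strings.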