Gene expression data in VectorBase
Fotis Kafatos, George Christophides, Bob MacCallum & Seth Redmond
Imperial College London
(thanks also to EBI, Sanger and ND)

Outline
1. Project goals
2. What’s currently available
3. Current challenges and future plans

Project goals
• For vector biologists:
  – Easy access to gene expression data
    • consistent data processing
• For array specialists:
  – ArrayExpress submission
  – Advanced analysis tools
  – Array annotation

[Pipeline diagram, boxes: EXPRESSION DATA, BULK LOADER, STORAGE & ANALYSIS, VectorBase]
• BASE: BioArray Software Environment
• http://base.thep.lu.se/
• Open source, active development and user community
• LIMS, data storage, export and analysis
• Web-based, user/group access control
• BASE 2.x adoption will bring Affy support

Data submission
• Community submission guidelines available
• First batch of experiments loaded by us
• Bulk data loader
• Sample/experiment annotation requires intervention from curators

[Pipeline diagram, boxes: ArrayExpress, EXPRESSION DATA, BULK LOADER, ‘PUBLIC’ STORAGE, VectorBase, STORAGE & ANALYSIS]
• Data held in BASE is largely MIAME compliant
• Script for semiautomated export in TAB2MAGE format
• One experiment submitted so far

[Pipeline diagram, adds: DATA SUMMARIES]
• BASE web interface offers powerful and extendable analysis environment
• Can be used for multisite collaborations on pre-publication data
• Steep learning curve/not 100% intuitive
• Not easily linked to
• We provide simpler views so the casual user can quickly draw biological inferences

Standardised data
All displayed data is processed in the same way:
1. Poor quality spots removed
   • Currently using submitted spot flags
2. Normalisation
   • “lowess” for two-colour experiments

[Pipeline diagram, adds: PROBE MAPPING]
• 3 probe types
• 6 array designs
• Mapping handled via Ensembl pipeline:
  – Oligo: exonerate
  – PCR: e-PCR
  – cDNA: exonerate2genes

[Pipeline diagram, adds: GENOMIC DATA, AUTOMATIC ANNOTATION, GFF3, GENOME BROWSER]

[Screenshots: contigview, featureview]

[Pipeline diagram, adds: DATA MINING, ARRAY BIOLOGISTS, GENOME BIOLOGISTS, VECTOR BIOLOGISTS]

BioMart
• Beta version currently available
  – http://base.vectorbase.org:9999/biomart/martview
• Improvements still needed:
  – experiment annotations
  – Alignments (i.e. handle split alignments)
• Federation with current marts
• Integration with new data?

Current challenges and future plans
• How do you want to query?
• CVs & ontologies
• APIs
• Community submission
• Manual annotation

Querying strategy
• What do you want to query on?
  – Fetch all genes upregulated under condition X
  – Fetch all experiments with gene X and condition Y
  – Fetch all probes with expression similar to probe X
• All essentially boil down to:
  – Define probe (genes etc)
  – Define significant expression
    • ANOVA?
    • Up/down-regulation WRT what?
  – Define experimental conditions
    • Sample annotation
    • Experimental design

[Pipeline diagram, adds: CV / ONTOLOGY]

[Pipeline diagram, adds: AE API ?, Array API ?, e! API, MartJ / MQL]

Array API
• Perl / Java objects for retrieval / handling of array data
  – Dual purpose:
    • Consistency & efficiency of VB expression website
    • Computational access to VB data for all
  – Objects must be:
    • General, DB-independent
    • Compatible with pre-existing Bio API (BioPerl / BioJava)
  – NB: May be pre-existing solution:
    • ArrayExpress API?
    • BioPerl-Expression?
    • MAGE-OM-stk
• http://neuron.cse.nd.edu/vectorbase/index.php/Array_API_proposal

Community data submission
• Carrot?
  – Help with ArrayExpress submission
  – Analysis tools
  – Dissemination
• Stick?
  – Outreach (courses, conferences)
  – Networking

GE data manual annotators
• EST clone-based arrays
  – http://tinyurl.com/vlkwo
• Gene-build designed arrays
  – Negative evidence less compelling

Longer term plans
• Host-parasite GE data integration & analysis
• GE-clusters “upstream” regions regulatory elements, upstream TFs
• RNAi phenotypes
• Images

CVs & ontologies
• Integrate MGED and specialist ontologies for
  – Body parts
  – Developmental stages
  – Disease processes
  – …
• Allows comparison across experiments with similar experimental conditions

BioMart
Most biomarts:
• Gene-based
• Mostly ‘binary’ data
  – e.g. a gene either has a signal domain or doesn’t
• Easily linked with other (gene-based) biomarts
VB Biomart:
• Probe based
  – Many probes not aligned
• Exp data less clear
  – e.g. define ‘differential expression’
• Exports gene/trans IDs for linking to other Marts

Clustering
• A priority?
• Easy to do on reporter level within experiments
• Harder to do at gene level across all experiments
  – Binary gene profile: “yes/no differentially expressed in experiment” ?
• Amazon-style links to “genes which may have similar expression profiles”?
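
The binary gene profile idea from the Clustering slide could be sketched roughly as follows. This is only an illustration, not the VectorBase implementation: the gene IDs, the profiles and the choice of Jaccard similarity are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def jaccard(a: List[bool], b: List[bool]) -> float:
    """Share of experiments where both genes are flagged, out of those where either is."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def similar_genes(
    query: str, profiles: Dict[str, List[bool]], top_n: int = 5
) -> List[Tuple[str, float]]:
    """Rank all other genes by similarity of their binary profile to the query gene."""
    target = profiles[query]
    scored = [
        (gene, jaccard(target, profile))
        for gene, profile in profiles.items()
        if gene != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical data: one yes/no flag per experiment
# (True = gene called differentially expressed in that experiment).
profiles = {
    "geneA": [True, True, False, True],
    "geneB": [True, True, False, False],
    "geneC": [False, False, True, False],
}

# "Genes which may have similar expression profiles" for geneA.
print(similar_genes("geneA", profiles))
```

A set-overlap measure is used here because the profile only records yes/no calls; how a gene gets its "yes" in each experiment is exactly the open question the slide raises.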

BASE 2.x
• Adoption delayed, now in progress
• Brings Affymetrix support
• Cleaner/modern interface
• Better API (Java)
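
As a closing sketch, the query "fetch all genes upregulated under condition X" from the querying strategy can be broken into its three definitions (probe, significant expression, experimental condition). Everything here is hypothetical: the `Measurement` record, the sample data, and the fixed log2-ratio cut-off, which merely stands in for whatever significance test (ANOVA or otherwise) is eventually chosen.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Measurement:
    probe: str         # reporter on the array
    gene: str          # gene the probe maps to ("" if the probe is unaligned)
    condition: str     # term taken from the sample annotation
    log2_ratio: float  # normalised log2(test / reference)


def upregulated_genes(
    data: List[Measurement], condition: str, min_log2: float = 1.0
) -> Set[str]:
    """Genes with at least one mapped probe at or above the cut-off in the condition."""
    return {
        m.gene
        for m in data
        if m.gene and m.condition == condition and m.log2_ratio >= min_log2
    }


# Hypothetical measurements, for illustration only.
data = [
    Measurement("p1", "geneA", "X", 1.8),
    Measurement("p2", "geneB", "X", 0.2),
    Measurement("p3", "", "X", 2.5),       # unaligned probe: cannot be reported by gene
    Measurement("p1", "geneA", "Y", -0.1),
    Measurement("p4", "geneC", "X", 1.0),
]

print(sorted(upregulated_genes(data, "X")))
```

The unaligned probe shows why a probe-based store and a gene-based query do not line up exactly: its signal is strong, but it cannot appear in a gene-level answer.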