MIAME and ArrayExpress – a standard for microarray gene expression data and the public database at EBI
Susanna-Assunta Sansone (Toxicogenomics project coordinator)
Microarray Informatics Team, EMBL-EBI (European Bioinformatics Institute)
Transcriptome Symposium, April 2002
CHU Pitié-Salpêtrière, Université Paris VI

Why have a public database?
EMBL-EBI: centre for research and services in bioinformatics that makes and maintains public databases:
• EMBL Nucleotide Sequence, SWISS-PROT, Ensembl, MSD, etc.
Practical reasons:
• Easy data access
• Resolves local storage issues
• Common data exchange formats can be developed
Scientific reasons:
• Curation can be applied
• Annotation can be controlled
• Additional info can be stored that is missing in publications
• Improve data comparison
Public standard can be applied!

Talk structure
MIAME standard
MIAME annotation challenge:
• MGED BioMaterial Ontology
Uses of MIAME concepts:
• ArrayExpress: a public repository for gene expression data
• MIAMExpress: submission and annotation tool

Talk structure
MIAME standard

Standard for microarray data
Why?
• Size of dataset
• Different platforms - nylon, glass
• Different technologies - oligos, spotted
• References to external databases not stable!
• Array annotation
• Sample annotation
Data sharing needs a standardized way to annotate and record the information!

Standard for microarray data
MGED Group
Microarray Gene Expression Data Group: EBI + world's largest microarray labs and companies (Sanger, Stanford, TIGR, Université d'Aix-Marseille II, Affymetrix, Agilent, NCBI, DDBJ, etc.)

MGED Group aims to
• Facilitate adoption of standards for:
  – Experiment annotation
  – Data representation
• Introduce standards for:
  – Experimental controls
  – Data normalization methods

General MIAME principles
Minimum information about a microarray experiment
NOT a formal specification BUT a set of guidelines
Sufficient information must be recorded to:
• Correctly interpret and verify the results
• Replicate the experiments
Structured information must be recorded to:
• Query and correctly retrieve the data
• Analyse the data
MIAME - Brazma et al., Nature Genetics, 2001

MIAME: 6 parts of a microarray experiment
Sample:
• Sample source
• Sample treatments
• Extraction protocol
• Labeling protocol
Hybridisation:
• Hybridization protocol
Array:
• Array design information
• Location of each element
• Description of each element
Image:
• Scanning protocol
• Software specifications
Quantification matrix:
• Analysis protocol
• Software specifications

MIAME: 6 parts of a microarray experiment
[Diagram: Experiment; Sample, Hybridisation, Array (repeated for each hybridisation); Normalisation; Final data]
• Strategy
• Algorithm
• Control array elements
• 3 data processing levels
• Lack of gene expression measurement units!

MIAME – Annotation challenge
Annotation implementations are required!
• Avoid/reduce free text descriptions
• Use of controlled terms
• Definitions and sources for each term
• Removal of synonyms, or use of synonym mappings
• Data curation at source (LIMS)
• Integration of controlled terms in query interfaces
Facilitate data queries and analysis

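The annotation requirements above (controlled terms, a definition and source for each term, synonym mappings instead of free text) can be illustrated with a minimal Python sketch. It is not part of MIAME or of any MGED tool: the `Term` class, the `normalise` function and the vocabulary entries are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A controlled term with its definition and the source it comes from."""
    name: str
    definition: str
    source: str


# Hypothetical mini-vocabulary for an "organism" field (illustrative entries only).
ORGANISM_TERMS = {
    "Mus musculus": Term("Mus musculus", "House mouse", "NCBI Taxonomy"),
    "Homo sapiens": Term("Homo sapiens", "Human", "NCBI Taxonomy"),
}

# Synonym mapping: lower-cased free-text variants -> approved term name.
ORGANISM_SYNONYMS = {
    "mouse": "Mus musculus",
    "mice": "Mus musculus",
    "human": "Homo sapiens",
}


def normalise(value, terms, synonyms):
    """Return the controlled Term for a submitted value, or None if unknown."""
    cleaned = value.strip().lower()
    # First try a case-insensitive match on an approved term name.
    for name, term in terms.items():
        if name.lower() == cleaned:
            return term
    # Otherwise fall back to the synonym mapping.
    approved = synonyms.get(cleaned)
    return terms[approved] if approved in terms else None
```

A submission form or LIMS could call `normalise(" Mice ", ORGANISM_TERMS, ORGANISM_SYNONYMS)` to store the approved term together with its source, and flag values that return `None` for curation rather than accepting them as free text.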
A gene expression database from the data analyst's point of view
[Diagram: gene expression matrix; axes: Genes and transcription units, Samples; cells: Gene expression levels]
• Array description:
  – Gene annotations
• Sample annotations:
  – Source
  – Treatment

MIAME - Gene annotation
Unambiguous identification
Synonyms!
• Community approved names
• Alternative to gene names
Usable external sources, e.g.:
• EMBL-GenBank - sequence accession no.
• Jackson Lab - approved mouse gene names
• HUGO - approved human gene names
• GO categories - function, process, location

MIAME - Sample annotation
Gene expression data only have a meaning in the context of detailed sample descriptions!
Usable external sources, e.g.:
• NCBI Taxonomy - organisms
• Jackson Lab - mouse strain names
• Mouse Anatomical Dictionary - mouse anatomy
• ChemID - compounds
• ICD-9 - disease classification
More is needed...

Annotation – implementations required!
Need an ontology to describe the sample:
• Defining controlled vocabularies and...
• ...using existing external ontologies
Integrate the ontology in LIMS and databases:
• Develop browser or interface for the ontology
• Develop internal editing tools for the ontology
However, some free text description is unavoidable

Talk structure
MIAME standard
MIAME annotation challenge:
• MGED BioMaterial Ontology

What are a CV and an ontology?
Controlled Vocabulary (CV):
• Set of restrictive terms used to describe something; in the simplest case it could be a list
An ontology is more than a CV:
• Describes the relationship between the terms in a structured way, provides semantics and constraints
• Captures knowledge and makes it machine processable

Sample annotation – MGED BioMaterial Ontology
• Under construction by Chris Stoeckert (Univ. of Penn.) and MGED members
• Uses OILed (rdf, daml and html files available)
• Motivated by MIAME and guided by 'case scenarios'
• Defines terms, provides constraints, develops CVs for sample annotation
• Links also to external CVs and ontologies
• Will be extended to other parts of a microarray experiment that need to be described

Sample annotation – MGED BioMaterial Ontology: an example
Sample source and treatment description, and its correct annotation using the MGED BioMaterial Ontology classes and corresponding external references:
“Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared………”

MGED BioMaterial Ontology class | External reference | Instance
BioMaterialDescription | |
Biosource Property | |
Organism | NCBI Taxonomy | Mus musculus musculus, id: 39442
Age | | 7 weeks after birth
DevelopmentStage | Mouse Anatomical Dictionary | Stage 28
Sex | | Female
StrainOrLine | International Committee on Standardized Genetic Nomenclature for Mice | C57BL/6
BiosourceProvider | | Charles River, Japan
OrganismPart | Mouse Anatomical Dictionary | Liver
BioMaterialManipulation | |
EnvironmentalHistory | |
CultureCondition | |
Temperature | | 22 ± 2 °C
Humidity | | 55 ± 5%
Light | | 12 hours light/dark cycle
PathogenTests | | Specified pathogen free conditions
Water | | ad libitum
Nutrients | | MF, Oriental Yeast, Tokyo, Japan
Treatment | |
CompoundBasedTreatment | |
(Compound) | ChemIDplus | Fenofibrate, CAS 49562-28-9
(Treatment_application) | | in vivo, oral gavage
(Measurement) | | 100 mg/kg body weight

Talk structure
MIAME standard
Sample annotation:
• MGED BioMaterial Ontology
Uses of MIAME concepts:
• ArrayExpress: a public repository for gene expression data
• MIAMExpress: submission and annotation tool

Uses of MIAME concepts
Specifies the content of the information:
• Sufficient
• Structured
Uses:
• Creation of MIAME-compliant LIMS or databases, e.g. ArrayExpress
• Development of submission/annotation tools for generating MIAME-compliant information, e.g. MIAMExpress

ArrayExpress – data flow
[Diagram labels: EBI; Web server; Users; Submission; LIMS; Browse-Query; MIAMExpress; ArrayExpress; MIAMExpress Curation database; Output; MAGE-ML; Update; Image server; Central database; Data warehouse]

ArrayExpress - details
Implementation in ORACLE of the MAGE-OM model:
• Microarray gene expression - Object Model
• OMG approved standard (MGED and Rosetta, 2001)
• Model developed in UML
Object model-based query mechanism:
• Automatic mapping to SQL
Independent of:
• Experimental platform
• Image analysis method
• Normalization method
MAGE-ML data loader:
• Microarray gene expression - Mark-up Language, generated from the model
[Diagram labels: ArrayExpress; Central database; Data warehouse]

ArrayExpress – conceptual model
[Diagram: MIAME 6 parts of a microarray experiment: Experiment; Sample, Hybridisation, Array (repeated for each hybridisation); Normalisation; Final data]

ArrayExpress – simplified model
• Classes are represented by boxes
• Classes describe objects
• Related classes are grouped together in packages
• MAGE-OM has 16 packages, ~150 tables

ArrayExpress data (via MAGE-ML)
Currently:
• Human data - EMBL (ironchip)
• Yeast data - EMBL
• S. pombe - Sanger Institute
• Available as example annotated and curated data sets
• Array descriptions - TIGR
Near future:
• Array description - Affymetrix
• Mouse data - TIGR and HGMP
• Anopheles data - EMBL
• Direct pipeline - Sanger Institute LIMS
• Data - DESPRAD partners
• Toxicogenomics data - ILSI HESI

ArrayExpress – query interface
First release 12 January 2002

ArrayExpress – link to Expression Profiler
[Diagram labels: External data, tools (pathways, function, etc.); Expression data; EP:PPI (Prot-Prot ia.); EP:GO (GeneOntology); EPCLUST; GENOMES; URLMAP (provide links: sequence, function, annotation); SEQLOGO; SPEXS; PATMATCH; visualise patterns; discover patterns]

ArrayExpress – curation effort
User support and help documentation:
• Ontologies and CVs
• Minimize free text, removal of synonyms
• Help on MAGE-ML format and MAGE-OM
MIAME compliance-check
Curation at source (LIMS)
To provide high-quality, well-annotated data and allow automated data analysis

MIAMExpress - details
Submission and annotation tool:
• Curators will monitor the submissions
Based on MIAME concepts:
• Experiment, Array and Protocol submissions
• Generates MIAME-compliant information
Uses MGED BioMaterial Ontology terms:
• Terms and required fields are explained
Allows user driven ontology development:
• User can provide new terms and their sources
Allows browsing:
• Array descriptions
• Protocols

MIAMExpress - details
Version 1 launch in December 2002
Expected users:
• Limited local bioinformatic support
• No LIMS on site
• Small scale users with custom made arrays
Can be installed as local version:
• As a lab-book to annotate your experiment
• As part of a LIMS
Interfaces:
• Version 1 is general
• Future versions: application specific interfaces
  – Species specific
  – Toxicogenomics specific (ILSI-HESI)

ArrayExpress - future
Load public data into ArrayExpress:
• TIGR, EMBL, ILSI HESI, DESPRAD partners
Improve query interfaces
Launch MIAMExpress v.1 (Dec. 2002)
MIAMExpress v.2:
• Extended according to the user needs
• Integrated MGED ontology
• Increased usability, flexibility and scalability
Develop curation tools

Acknowledgments
Microarray Informatics Team at EBI (19 members):
• Alvis Brazma (Team Leader and MGED President)
• Helen Parkinson (Curation Coordinator)
• Mohammad Shojatalab (MIAMExpress Database Programmer)
• Ugis Sarkans (ArrayExpress Database development coordinator)
• Jaak Vilo (Expression Profiler)
• Curators and Programmers
MGED members and working groups:
• Alvis Brazma (MGED President, MIAME)
• Chris Stoeckert, U. Penn. (MGED Ontology Working Group)

Resources and ... messages
Open source resources:
• ArrayExpress and MIAMExpress schema-access to code
• MIAME document and glossary
• MAGE-ML DTD and annotation examples
• MGED Ontology and other resources...
www.mged.org / www.ebi.ac.uk/microarray
Be aware of MIAME!
• Nature and Lancet have already expressed their interest
• Funding agencies
Join MGED meetings, tutorials and mailing lists:
• MGED-5 meeting in Japan (Sept. 2002)
• Ontology for BioSample description, EBI (Nov. 2002)