ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data
Amy Tang PhD
ArrayExpress Production Team, Functional Genomics Group, EMBL-EBI
[email protected]

What’s covered this morning?
• Why do we need a database for functional genomics data?
• ArrayExpress databases:
  • Archive
  • Gene Expression Atlas
• What’s in each database, how to browse, search, interpret, download data
• Hands-on exercises
• (How to submit data to ArrayExpress?)

Functional genomics (FG) data
• The aim of FG is to understand the function of genes and other (non-genic) parts of the genome
• Often involves high-throughput technologies (microarrays, high-throughput sequencing [HTS])
• Questions addressed:
  • Gene expression - when? where? how much? changes?
  • Gene function - roles of different genes in cellular processes, pathways
  • Gene regulation - e.g. epigenetic modifications of histones or DNA

ArrayExpress
www.ebi.ac.uk/arrayexpress
• Public repository for functional genomics data (both microarray and sequencing)
• Together with GEO at NCBI and CIBEX at DDBJ, serves the scientific community as an archive for data supporting publications
• Provides access to curated data in a structured and standardised format
• Facilitates the sharing of experimental information
• Submissions are curated based on community standards:
  • MIAME guidelines & MAGE-TAB format for microarray data
  • MINSEQE guidelines & MAGE-TAB format for HTS data

Community standards for data requirements
• MIAME = Minimum Information About a Microarray Experiment (http://www.mged.org/Workgroups/MIAME/miame_2.0.html)
• MINSEQE = Minimum Information about a high-throughput Nucleotide SEQuencing Experiment (http://www.mged.org/minseqe)

The checklist: requirements (MIAME and MINSEQE)
1. Experiment design / background description
2. Sample annotation and experimental factor
3. Array design annotation (e.g. probe sequence)
4. All protocols (wet-lab bench and data processing)
5. Raw data files (from scanner or sequencing machine)
6. Processed data files (normalised and/or transformed)

What is an experimental factor?
• The main variable(s) studied in the experiment
• It is often the independent variable of the microarray or HTS experiment
• Values of the factor (“factor values”) should vary

Examples:
Experiment design: human blood samples vs mouse blood samples
• Factor: organism
• Factor values: Homo sapiens, Mus musculus
• Not a factor: organism part (blood only)
Experiment design: lung samples from male C57BL/6 mice vs lung samples from male 129 mice
• Factor: strain
• Factor values: C57BL/6, 129
• Not a factor: organism part (lung only), sex (male only)

Reporting standards - MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray or sequencing experiment.
• IDF (Investigation Description Format file): experiment title and background, investigator(s)’ contact details, definition of protocols
• SDRF (Sample and Data Relationship Format file): captures the chronological flow of the experiment from source materials to data files. Shows the relationship between samples, data files and experimental factors.
• ADF (Array Design Format file, for array data only): describes probes on an array, e.g. sequence, genomic mapping location
• Data files: raw and processed data files
  • Raw = unmodified from the microarray scanner (e.g. CEL for Affymetrix, GPR for GenePix), or trace data files (fastq and bam) for sequencing
  • Processed = data normalised/transformed from the raw data

MAGE-TAB Example: IDF

MAGE-TAB Example: SDRF

ArrayExpress – two databases

What is the difference between them?
ArrayExpress Archive
• Central object: experiment
• Contains both microarray and HTS experiments
• Query to retrieve experimental information and associated data
Expression Atlas
• Central object: gene/condition
• Contains data from mainly microarray experiments (HTS coming very soon!)
• Query for up/downregulated genes across experiments and across platforms

ArrayExpress – two databases

ArrayExpress Archive – when to use it?
• Find FG experiments that might be relevant to your research
• Download data and re-analyse it yourself. Data deposited in public repositories may shed light on biological questions different from the one asked in the original experiments.
• Submit microarray or HTS data that you want to publish. Major journals will require data to be submitted to a public repository like ArrayExpress as part of the peer-review process.

How much data in AE Archive? (as of September 2012)
[Chart: data in the AE Archive, up to September 2012]

HTS data in AE Archive (as of mid-September 2012)
[Charts: microarray vs HTS; breakdown into RNA-seq, DNA-seq and ChIP-seq]

Browsing the AE Archive
www.ebi.ac.uk/arrayexpress

Browsing the AE Archive
• AE unique experiment ID
• Curated title of experiment
• Number of assays
• Species investigated
• The date when the data were loaded in the Archive
• “Loaded in Atlas” flag
• Raw sequencing data available in ENA
• The direct link to raw and processed data. An icon indicates that this type of data is available.
• The total number of experiments and assays retrieved
• The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed

Experimental factor ontology (EFO)
http://www.ebi.ac.uk/efo
• An ontology modeling the relationship between experimental factors (EFs) and other data elements
• Used in EBI databases and external projects (e.g. NHGRI GWAS Catalogue)
• Combines terms from a subset of well-maintained and compatible ontologies, e.g. Gene Ontology (cellular component + biological process terms), NCBI Taxonomy

Experimental factor ontology (EFO)
http://www.ebi.ac.uk/efo
EFO was developed to:
• increase the richness of annotations in databases
• expand on search terms when querying ArrayExpress and Gene Expression Atlas
  • using synonyms (e.g. “cerebral cortex” = “adult brain cortex”)
  • using child terms (e.g. “bone” also retrieves “rib” and “vertebra”)
• promote consistency (e.g. F/female, 1 day/24 hours)
• facilitate automatic annotation and integration of external data (e.g. changing “gender” to “sex” automatically)

Building EFO: an example
1. Take all experimental factors: sarcoma, cancer, neoplasm, disease, Kaposi’s sarcoma
2. Find the logical connection between them:
  • disease is the parent term
  • neoplasm is a type of disease
  • cancer is a synonym of neoplasm
  • sarcoma is a type of cancer
  • Kaposi’s sarcoma is a type of sarcoma
3. Organize them in an ontology:
  [-] disease
    [-] neoplasm (synonym: cancer)
      [-] sarcoma
        Kaposi’s sarcoma

Exploring EFO: an example

Searching AE Archive: simple query
• “Auto-complete” with suggestions (like Google search)
• Avoid acronyms as search terms
Filter your search results by:
• species of interest
• one array design (platform)
• molecule (DNA, RNA, protein, etc.)
• technology (microarray or HTS)

Searching AE Archive: simple query
Search across all fields:
• AE accession number, e.g. E-MEXP-568
• Secondary accession numbers, e.g. GEO series accession GSE5389
• Experiment title, submitter’s experiment description
• Submitter’s email address
• Sample attributes, experimental factors and values, including species (e.g. GeneticModification, Mus musculus, DREB2C over-expression)
• Publication title, authors and journal name, PubMed ID
Synonyms for terms are always included in searches, e.g. ‘human’ and ‘Homo sapiens’

AE Archive query output
• Matches to exact terms are highlighted in yellow
• Matches to synonyms are highlighted in green
• Matches to child terms in the EFO are highlighted in pink

AE Archive – experiment view
• Experimental factor(s) and their values
• MIAME or MINSEQE scores show how far the experiment complies with the standard (* = compliant)
• Links to the files available. These vary between sequencing and microarray data. For microarray experiments there is also an array design file (ADF).

SDRF file – sample & data relationship

Searching AE Archive: advanced query
Combine search terms
• Join two or more keywords in the search box with the operators AND, OR or NOT (in CAPS), e.g. brain OR prostate NOT mouse
• Search terms of more than one word must be entered inside quotes, otherwise only the first word will be searched for, e.g. “kidney cancer”
Specify fields for searches
• E.g. search only for human assays on Agilent microarrays: species:“homo sapiens” AND array:Agilent*
* For more details and examples, see http://www.ebi.ac.uk/fg/doc/help/ae_help.html

Hands-on exercise 1
Find RNA-seq assays studying human prostate adenocarcinoma

ArrayExpress – two databases

Expression Atlas – when to use it?
• Find out if the expression of a gene (or a group of genes with a common gene attribute, e.g. GO term) changes across all the experiments available in the Expression Atlas
• Discover which genes are differentially expressed in a particular biological condition that you are interested in
• Experiments in the Archive are curated before being introduced into the Atlas

Expression Atlas construction: experiment selection criteria during curation
• Array (platform) designs relating to the experiment must be provided. Probe annotation must be adequate to enable re-annotation of external references (e.g. Ensembl gene ID, UniProt ID)
• At least 3 replicates for each value of the experimental factor
• Maximum 4 experimental factors
• Adequate sample annotation using EFO terms
• Presence of data files: CEL raw data files for Affymetrix assays, processed data files for non-Affymetrix ones

Expression Atlas construction: analysis pipeline
A dummy example:
Input data (Affymetrix CEL, non-Affymetrix processed) → Linear model* (Bioconductor limma) → Output: 2-D matrix of genes × conditions (Cond.1, Cond.2, Cond.3), where 1 = differentially expressed and 0 = not differentially expressed
* More information about the statistical methodology: http://nar.oxfordjournals.org/content/38/suppl_1/D690.full

Expression Atlas construction: analysis pipeline
“Is gene X differentially expressed in condition 1 in this experiment?”
[Figure: each point is a single expression value for gene X; the means of Cond.1, Cond.2 and Cond.3 are shown alongside the mean of all samples; compare and calculate statistic]

[Figure: Exp. 1 (Cond.1, Cond.2, Cond.3), Exp. 2 (Cond.4, Cond.5, Cond.6) ... Exp. n (Cond.X, Cond.Y, Cond.Z), each a genes × conditions matrix with its own statistical test]
Each experiment has its own “verdict” or “vote” on whether a gene is differentially expressed or not under a certain condition

Expression Atlas construction
Summary of the “verdicts” from different experiments

Expression Atlas

Atlas home page
http://www.ebi.ac.uk/gxa
• Query for genes
• Restrict query by direction of differential expression
• Query for conditions
• The ‘advanced query’ option allows building more complex queries

Atlas home page
The ‘Genes’ and ‘Conditions’ search boxes

Atlas single gene query: gene summary page

Atlas single gene query (cont’d): experiment page

Atlas single gene query: gene summary page – jump to orthologs
Orthology comes from the Ensembl Compara database

Atlas single gene query: compare orthologs – heatmap view

Atlas ‘condition-only’ query

Atlas ‘condition-only’ query (cont’d): heatmap view

Atlas gene + condition query

Atlas query refining (method 1)
What if there are no terms in the “REFINE YOUR QUERY” box which fit my biological question?
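
As a rough illustration of the summary of “verdicts” that the Atlas pages display, the per-experiment calls for one gene in one condition can be tallied as below. This is only a sketch, not the Atlas implementation: the experiment names, the +1/-1/0 encoding for up, down and not differentially expressed, and the function name are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical per-experiment verdicts for one gene in one condition.
# Assumed encoding: +1 = up-regulated, -1 = down-regulated,
# 0 = not differentially expressed.
verdicts = {
    "experiment_1": 1,
    "experiment_2": 1,
    "experiment_3": -1,
    "experiment_4": 0,
}


def summarise_verdicts(verdicts):
    """Count how many experiments vote up, down, or not differentially expressed."""
    labels = {1: "up", -1: "down", 0: "non_de"}
    counts = Counter(labels[v] for v in verdicts.values())
    # Always report all three categories, even when a count is zero.
    return {label: counts.get(label, 0) for label in ("up", "down", "non_de")}


summary = summarise_verdicts(verdicts)
print(summary)
```

A tally like this keeps each experiment's statistical test separate and only combines the resulting calls, which matches the idea that every experiment casts its own vote.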
Atlas query refining (method 2)

Hands-on exercise 2
Find genes in the “androgen receptor signaling pathway” which are (i) expressed in prostate carcinoma and (ii) involved in regulation of transcription from RNA Pol II

Hands-on exercise 3
Find information on Tbx5 expression in mouse in relation to Holt-Oram syndrome

ArrayExpress-Atlas Crossword

A glimpse of what’s coming… “Differential atlas”
“Is gene X differentially expressed in condition 1 in this experiment?”
[Figure: each point is a single expression value for gene X; the means of Cond.1, Cond.2 and Cond.3 are shown alongside the mean of all samples; compare and calculate statistic]

A glimpse of what’s coming… “Differential atlas” mock-up (1)

A glimpse of what’s coming… “Differential atlas” mock-up (2)

A glimpse of what’s coming… “Baseline atlas”
• Gene expression in normal tissues, not looking for differentially expressed genes based on different conditions
• E.g. “Give me all the genes expressed in normal human kidney”
• Can also filter genes by expression level (e.g. FPKM values)
• Start with Illumina Body Map 2.0 RNA-seq data
• 16 tissues: adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells
• We are working on something similar for mouse

A glimpse of what’s coming… “Baseline atlas” mock-up display

Find out more about Archive and Atlas
• Visit our eLearning portal, Train online, at http://www.ebi.ac.uk/training/online/ for tutorials on ArrayExpress and Atlas
• ArrayExpress BioConductor R package: http://bioconductor.org/packages/release/bioc/html/ArrayExpress.html
• Try the ArrayExpress help page: www.ebi.ac.uk/fg/doc
• Email us at: [email protected]
• Atlas mailing list: [email protected]

Data submission to ArrayExpress Archive

Data submission to AE

Data submission to AE
www.ebi.ac.uk/microarray/submissions.html
• MIAMExpress was originally designed mainly for simple Affymetrix and Agilent two-colour microarray submissions
• The MAGE-TAB route is recommended for large/complicated experiments
• HTS experiments must be submitted via the MAGE-TAB route
• The MAGE-TAB spreadsheet (IDF and SDRF) is tailor-made for your experiment if you use the MAGE-TAB submission tool (i.e. all mandatory column headings are present)

Submission of HTS data
• ArrayExpress acts as a “broker” for the submitter
• Meta-data and processed data: ArrayExpress
• Raw sequence reads* (e.g. fastq, bam): ENA
* See http://www.ebi.ac.uk/ena/about/sra_data_format for accepted read file formats

What happens after submission?
• Email confirmation
• Curation:
  • Submission is ‘closed’, so no more editing on your end
  • We will email you with any questions
  • We may ‘re-open’ the submission for you to make changes
• You can keep data private until publication. We will provide login account details to you and the reviewer for private data access.
• Get your submission in the best possible shape to shorten curation and processing time!

Submission checklist
Microarrays
1. Is your array design already accessioned in ArrayExpress? (Check: http://www.ebi.ac.uk/arrayexpress/arrays/browse.html?directsub=on. If your array design is not represented, you will have to submit the array design to us before submitting any experimental data, because all data points in your raw/processed files refer back to the array design file.)
HTS
1. Are your read files in a format accepted by the SRA? (Check here: http://www.ebi.ac.uk/ena/about/sra_data_format)
2. If yes, have you dropped the files on the private ArrayExpress FTP site and emailed us about them?

2. Do you have all the data files ready in the required formats?
3. Have you filled in the MAGE-TAB spreadsheet with adequate meta-data?

Need help with submitting your data?
• Visit our eLearning portal, Train online, at www.ebi.ac.uk/training/online/course/arrayexpress-submitting-data-using-mage-tab for the specific tutorial on how to submit data using MAGE-TAB
• Watch this short YouTube video on how to navigate the MAGE-TAB submission tool: http://youtu.be/KVpCVGpjw2Y
• Email curators at: [email protected]
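
Before sending a MAGE-TAB spreadsheet, it may help to check the SDRF against two points made earlier in the talk: factor values should vary, and the Atlas selection criteria ask for at least 3 replicates per factor value. The sketch below does this for a tab-delimited SDRF held in a string. It is an illustration only: the sample rows are invented, the exact header spelling `Factor Value[...]` is an assumption about how the SDRF columns are named, and `check_factors` is a hypothetical helper, not part of any ArrayExpress tool.

```python
import csv
import io
from collections import Counter

# Invented SDRF excerpt (tab-delimited). The header spelling is an assumption.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[strain]\n"
    "lung_1\tMus musculus\tC57BL/6\n"
    "lung_2\tMus musculus\tC57BL/6\n"
    "lung_3\tMus musculus\tC57BL/6\n"
    "lung_4\tMus musculus\t129\n"
    "lung_5\tMus musculus\t129\n"
)


def factor_value_counts(sdrf_text):
    """Return {factor column: Counter of its values} for every factor column."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    counts = {}
    for row in reader:
        for column, value in row.items():
            if column.startswith("Factor Value["):
                counts.setdefault(column, Counter())[value] += 1
    return counts


def check_factors(sdrf_text, min_replicates=3):
    """List factors whose values do not vary or have too few replicates."""
    problems = []
    for column, counter in factor_value_counts(sdrf_text).items():
        if len(counter) < 2:
            problems.append(f"{column}: factor values do not vary")
        for value, n in counter.items():
            if n < min_replicates:
                problems.append(
                    f"{column}: '{value}' has {n} replicate(s), "
                    f"fewer than {min_replicates}"
                )
    return problems


problems = check_factors(SDRF_TEXT)
for problem in problems:
    print(problem)
```

The replicate threshold is a parameter because 3 is an Atlas inclusion criterion rather than a requirement for submitting to the Archive.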