* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download ArrayExpress and Expression Atlas
Survey
Document related concepts
Microevolution wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Genome evolution wikipedia , lookup
Genomic imprinting wikipedia , lookup
Pathogenomics wikipedia , lookup
Genome (book) wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Designer baby wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Public health genomics wikipedia , lookup
Gene expression programming wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Metagenomics wikipedia , lookup
Transcript
ArrayExpress and Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL [email protected] What is functional genomics (FG)? • The aim of FG is to understand the function of genes and other parts of the genome • FG experiments typically utilize genome-wide assays to measure and track many genes (or proteins) in parallel under different conditions • High-throughput technologies such as microarrays and high-throughput sequencing (HTS) are frequently used in this field to interrogate the transcriptome 2 ArrayExpress What biological questions is FG addressing? • When and where are genes expressed? • How do gene expression levels differ in various cell types and states? • What are the functional roles of different genes and in what cellular processes do they participate? • How are genes regulated? • How do genes and gene products interact? • How is gene expression changed in various diseases or following a treatment? 3 ArrayExpress Components of a FG experiment 4 ArrayExpress FG public repositories: ArrayExpress Is a public repository for FG data, which provides easy access to well annotated data in a structured and standardized format Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ Facilitates the sharing of experimental information associated with the data such as microarray designs, experimental protocols,…… Based on community standards: MIAME guidelines & MAGE-TAB format for microarray, MINSEQE guidelines for HTS data (http://www.mged.org/minseqe/) 5 ArrayExpress Community standards for data requirement MIAME = Minimal Information About a Microarray Experiment MINSEQE = Minimal Information about a high-throughput Nucleotide SEQuencing Experiment The checklist: Requirements 6 MIAME MINSEQE 1. Experiment design / background description 2. Sample annotation and experimental factor 3. Array design annotation (e.g. probe sequence) 4. All protocols (wet-lab bench and data processing) 5. Raw data files (from scanner or sequencing machine) 6. Processed data files (normalised and/or transformed) ArrayExpress Standards for microarray & sequencing MAGE-TAB format MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray or HTS experiments IDF Investigation Description Format file, contains top-level information about the experiment including title, description, submitter contact details and protocols. SDRF Sample and Data Relationship Format file contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter. ADF Data files 7 ArrayExpress Array Design Format file that describes probes on an array, e.g. sequence, genomic mapping location (for array data ony) Raw and processed data files. The ‘raw’ data files are the trace data files (.srf or .sff). Fastq format files are also accepted, but SRF format files are preferred. The trace data files that you submit to ArrayExpress will be stored in the European Nucleotide Archive (ENA). The processed data file is a ‘data matrix’ file containing processed values, e.g. files in which the expression values are linked to genome coordinates. ArrayExpress – two databases 8 ArrayExpress What is the difference between them? ArrayExpress Archive • Central object: experiment • Query to retrieve experimental information and associated data Expression Atlas • Central object: gene/condition • Query for gene expression changes across experiments and across platforms 9 ArrayExpress ArrayExpress – two databases 10 ArrayExpress ArrayExpress Archive – when to use it? • Find FG experiments that might be relevant to your research • Download data and re-analyze it. Often data deposited in public repositories can be used to answer different biological questions from the one asked in the original experiments. • Submit microarray or HTS data that you want to publish. Major journals will require data to be submitted to a public repository like ArrayExpress as part of the peer-review process. 11 ArrayExpress How much data in AE Archive? (as of mid-September 2012) 12 ArrayExpress HTS data in AE Archive (as of mid-September 2012) Microarray vs HTS RNA-, DNA-, ChIPseq breakdown 13 ArrayExpress ArrayExpress www.ebi.ac.uk/arrayexpress/ 14 ArrayExpress The date when the data were loaded in the Archive Browsing the AE Archive AE unique experiment ID Curated title of experiment Number of assays Species investigated loaded in Atlas flag Raw sequencing data available in ENA The list of experiments retrieved can be printed, saved as Tabdelimited format or exported to Excel or as RSS feed 15 ArrayExpress The total number of experiments and assay retrieved The direct link to raw and processed data. An icon indicates that this type of data is available. Browsing the AE Archive 16 ArrayExpress Searching AE with the Experimental factor ontology (EFO) Application focused ontology modeling the relationship between experimental factors (EFs) in AE Developed to: • increase the richness of annotations that are currently made in AE Archive • to promote consistency • to facilitate automatic annotation and integrate external data EFs are transformed into an ontological representation, forming classes and relationships between those classes Combine terms from a subset of well-maintained and compatible ontologies, e.g. Gene Ontology, NCBI Taxonomy 17 ArrayExpress Building EFO An example Take all experimental factors sarcoma Find the logical connection between them Organize them in an ontology disease disease is the parent term [-] cancer neoplasm is a type of disease neoplasm [-] neoplasm cancer is synonym of cancer neoplasm [-] disease sarcoma is a type of sarcoma cancer [-] Kaposi’s sarcoma 18 ArrayExpress Kaposi’s sarcoma is a type of sarcoma Kaposi’s sarcoma Exploring EFO An example More information at: http://www.ebi.ac.uk/efo 19 ArrayExpress Searching AE Archive Simple query 20 ArrayExpress AE Archive query output • Matches to exact terms are highlighted in yellow • Matches to synonyms are highlighted in green • Matches to child terms in the EFO are highlighted in pink 21 ArrayExpress AE Archive – experiment view 22 ArrayExpress SDRF file – sample & data relationship 23 ArrayExpress ArrayExpress – two databases 24 ArrayExpress Expression Atlas – when to use it? • Find out if the expression of a gene (or a group of genes with a common gene attribute, e.g. GO term) change(s) across all the experiments available in the Expression Atlas; • Discover which genes are differentially expressed in a particular biological condition that you are interested in. 25 ArrayExpress Expression Atlas construction Experiment selection criteria during curation • Array (platform) designs relating to the experiment must be provided. Probe annotation must be adequate to enable reannotation of external references (e.g. Ensembl gene ID, Uniprot ID) • At least 3 replicates for each value of the experimental factor • Maximum 4 experimental factors • Adequate sample annotation using EFO terms • Presence of data files: CEL raw data files for Affymetrix assays, processed data files for non-Affymetrix ones 26 ArrayExpress Expression Atlas construction Analysis pipeline Cond.1 Cond.2 Cond.3 genes Cond.1 Cond.2 Cond.3 Input data (Affy CEL, non-Affy processed) Linear model* (Bio/C Limma) Output: 2-D matrix 1= differentially expressed 0 = not differentially expressed * More information about the statistical methodology: http://nar.oxfordjournals.org/content/38/suppl_1/D690.full 27 ArrayExpress Expression Atlas construction Analysis pipeline “Is gene X differentially expressed in condition 1 in this experiment?” =a Cond.1 mean single expression value for gene X Cond.2 mean Cond.3 mean Compare and calculate statistic 28 ArrayExpress Mean of all samples Expression Atlas construction Exp.1 Cond.1 Cond.2 Cond.3 genes Statistical test Exp. 2 Cond.4 Cond.5 Cond.6 genes Statistical test Exp. n Cond.X Cond.Y Cond.Z genes Statistical test 29 ArrayExpress Each experiment has its own “verdict” or “vote” on whether a gene is differentially expressed or not under a certain condition Expression Atlas construction Summary of the “verdicts” from different experiments 30 ArrayExpress Expression Atlas 31 ArrayExpress Atlas home page http://www.ebi.ac.uk/gxa/ Query for genes Restrict query by direction of differential expression Query for conditions The ‘advanced query’ option allows building more complex queries 32 ArrayExpress Atlas home page The ‘Genes’ and ‘Conditions’ search boxes 33 ArrayExpress Atlas home page A single gene query 34 ArrayExpress Atlas gene summary page Atlas experiment page 36 ArrayExpress Atlas experiment page – HTS data 37 ArrayExpress Atlas home page A ‘Conditions’ only query 38 ArrayExpress Atlas heatmap view 39 ArrayExpress Atlas gene-condition query 40 ArrayExpress Atlas advanced search 41 ArrayExpress Atlas advanced search 42 ArrayExpress Atlas advanced search 43 ArrayExpress A glimpse of what’s coming… “Differential atlas” “Is gene X differentially expressed in condition 1 in this experiment?” =a Cond.1 mean single expression value for gene X Cond.2 mean Cond.3 mean Compare and calculate statistic 44 ArrayExpress Mean of all samples A glimpse of what’s coming… “Differential atlas” mock-up (1) 45 ArrayExpress A glimpse of what’s coming… “Differential atlas” mock-up (2) 46 ArrayExpress A glimpse of what’s coming… “Baseline atlas” • Gene expression in normal tissues, not looking for differentially expressed genes based on different conditions • E.g. “Give me all the genes expressed in normal human kidney” • Can also filter genes by expression level (e.g. FPKM values) • Start with Illumina Body Map 2.0 RNA-seq data • 16 tissues: adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells • We are working on something similar for mouse 47 ArrayExpress A glimpse of what’s coming… “Baseline atlas” mock-up display 48 ArrayExpress Data submission to AE 49 ArrayExpress Data submission to AE www.ebi.ac.uk/microarray/submissions.html 50 ArrayExpress Submission of HTS data ArrayExpress acts as a “broker” for submitter. • Meta-data and processed data: ArrayExpress • Raw sequence reads* (e.g. fastq, bam): ENA *See http://www.ebi.ac.uk/ena/about/sra_data_format for accepted read file format 51 ArrayExpress Find out more • Visit our eLearning portal, Train online, at http://www.ebi.ac.uk/training/online/ for courses on ArrayExpress and Atlas • Watch this short YouTube video on how to navigate the MAGE-TAB submission tool: http://youtu.be/KVpCVGpjw2Y • Email us at: [email protected] • Atlas mailing list: [email protected] 52 ArrayExpress