Integrative omics: Expression Atlas as an example
Amy Tang
[email protected]
Curation and Training Project Leader
Functional Genomics Group, EMBL-EBI
http://www.ebi.ac.uk/~rpetry/geteam/
@ArrayExpressEBI and @ExpressionAtlas
7 March 2017

In the next hour …
What is the Expression Atlas?
Types of data integrated, and how
Challenges we are facing and how we tackle them
Favourable factors for integration in the Atlas
How Atlas can help your project, best practice check-list

What we mean by “integration”
Ideal world:
mRNA expression data: normalised, summarised
proteomics data: normalised, summarised
metabolomics data: normalised, summarised

What we mean by “integration”
Ideal world: mRNA expression, proteomics and metabolomics data, each normalised and summarised
Reality: heart and brain samples for each data type come from different experiments (Expt 1, Expt 2, Expt 3, Expt 4)

Between reality and the ideal world…
Integrated visualisation of multiple omics data sets to spot patterns
[Figure: grid for GENE A with columns Heart and Brain; rows are mRNA expression data (Expt 1, Expt 2), proteomics data (Expt 1, Expt 3) and metabolomics data (Expt 1, Expt 4); an X marks where data exist]
www.ebi.ac.uk/gxa

What questions does the Atlas address?
Where is the SOX2 gene expressed in the human body?
Under what conditions does human SOX2 change expression levels?
What genes are expressed in normal skeletal muscles?

Two types of questions: Baseline and differential expression
Baseline
Gene expression in healthy/untreated conditions
E.g. human liver, ENCODE cell lines, rice root
122 experiments
Differential
Changes in gene expression in “comparisons” of experiment conditions
E.g. mutant vs wild type, drought vs normal watering, drug treated vs untreated
2941 experiments

We answer the questions by extracting value from public expression data
Direct submission (curated) and automatic import
ArrayExpress: ~70000 studies
Curators handpick and curate
Run a standardised QC & analysis pipeline (using genome sequences & gene models)
Atlas: 3063 experiments (494 [16%] are RNA-seq)
Figures as of 1 Feb 2017 release

Integration in the Atlas – where we are at
[Figure: ArrayExpress (~70000 studies) feeding the Atlas (3063 experiments); proteomics data and metabolomics data connected with “? ? ?”. Other labels: Mined gene names from PubMed paper, NCBI RefSeq, Query by gene, Query by protein, Query by function/pathway]

The Atlas links to other EBI service resources
Genes, genomes & variation: European Nucleotide Archive, Ensembl, 1000 Genomes, Ensembl Genomes, European Genome-phenome Archive, Metagenomics portal
Gene, protein & metabolite expression: ArrayExpress, MetaboLights, Expression Atlas, PRIDE
Protein sequences, families & motifs: InterPro, Pfam, UniProt
Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
Molecular structure: Protein Data Bank in Europe, Electron Microscopy Data Bank
Chemical biology: ChEMBL, ChEBI
Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
Systems: BioModels, Enzyme Portal, BioSamples

Baseline Atlas view
Zoomed in slice of “baseline” expression of one gene (HPRT1)
[Figure axes: tissues vs data sets (RNA-seq, proteomics)]

Baseline Atlas view
[Figure axis label: genes]
Liver-specific protein expression in one expt.
- Human Protein Atlas
[Figure axis label: tissues]

Differential Atlas view
“Differential” expression of one gene (Car3) under different conditions

Differential Atlas view
“Differential” expression (infected vs control) in one experiment
[Figure legend: upregulated, downregulated]

How do we do it: Curation and standardisation
ArrayExpress: ~70000 studies
Curators hand-pick and curate
• Only ~9% of directly submitted experiments are Atlas-eligible
Run a standardised QC & analysis pipeline
• R/Bioconductor packages
• iRAP (RNA-seq pipeline)
Atlas: 3063 experiments (494 [16%] are RNA-seq)
Figures as of 1 Feb 2017 release

How do we do it: we consume Ensembl data
Microarray data: probe mapping
RNA-seq data: reference genome to map reads
Gene X: transcript, gene and protein models/annotation
ID mapping from gene to other entities

What challenges are we facing?
1. Incomplete and inconsistent meta-data
2. Studies carried out in the same “type” of samples but in different research teams: comparable?
3. How to quantify expression level within one data set? How to “normalise” expression levels across data sets?
4. Microarray and RNA-seq data on the same set of samples: comparable?
5. How to keep up with gene annotation and genome assembly updates?

1. Incomplete meta-data
Example from RIKEN FANTOM5 project (human tissues, CAGE)

1. Incomplete meta-data
BioSamples database record for the sample:

2. Inconsistent meta-data
How many ways can researchers label their samples as “female” under attribute “sex”?
female F femme 2

How many ways can you say “female”?
18-day pregnant females 2 yr old female 400 yr. old female adult female asexual female castrate female cf.female cystocarpic female dikaryon dioecious female diploid female f famale femail female (lactating) female (pregnant) female (outbred) female parent female plant female with eggs female worker female, 6-8 weeks old female, virgin female, worker female(gynoecious) femele female, pooled femalen individual female lgb*cc females mare female (worker) monosex female ovigerous female oviparous sexual females worker bee female enriched pseudohermaprhoditic female remale semi-engorged female sexual oviparous female sterile female worker worker caste (female) sex: female female, other female child femal 3 female female (phenotype) female mice female, spayed femlale metafemale sterile female normal female sf female females strictly female vitellogenic replete female female - worker females only tetraploid female worker female (alate sexual) gynoecious thelytoky hexaploid female female (calf) healthy female female (gynoecious) female (f-o) hen probably female (based on morphology) female (note: this sample was originally provided as a "male" sample to us and therefore labeled this way in the brawand et al. paper and original geo submission; however, detailed data analyses carried out in the meantime clearly show that this sample stems from a female individual)
Courtesy of N. Silvester, European Nucleotide Archive, EMBL-EBI

How many ways can you say “male”?
37 year old male initial phase male male fetus six males mixed 600 yr. old male m male plant stallion adult male make male, 8 weeks old steer bull makle male, castrated sterile male castrated male mal e male, pooled strictly male cm male males tetraploide male dioecious male male (7-2872) man type i males diploid male male (7-3074) men type ii males drone male (m-a) normale male virgin male engorged male male (m-o) ram winged and wingless males fertile male male caucasian rooster young male four males mixed male child s1 male sterile individual male male fertile sex: male male (note: this sample was originally provided as a "female" sample to us and therefore labeled this way in the brawand et al. paper and original geo submission; however, detailed data analyses carried out in the meantime clearly show that this sample stems from a male individual)
Courtesy of N. Silvester, European Nucleotide Archive, EMBL-EBI

It can be rather gross sometimes…
An excerpt from: “Laying a Community-Based Foundation for Data-Driven Semantic Standards in Environmental Health Sciences”, by Carolyn J. Mattingly, Rebecca Boyles, Cindy P. Lawler, Astrid C. Haugen, Allen Dearry, and Melissa Haendel. Environmental Health Perspectives, vol. 124, no. 8, August 2016, pages 1136-1140.

1,2. Meta-data - solutions
Get it right at the very beginning: encourage submission using the webform tool “Annotare”

3. Comparable samples? (Tissues)
HMGCL gene in “normal” human tissues – data normalised per data set
[Figure axes: data sets vs tissues]
In which tissue(s) is the gene most highly expressed?

3. Comparable samples? (Cell lines)
Genentech (675 cancer cell lines), E-MTAB-2706, Klijn et al. (2014) A comprehensive transcriptional portrait of human cancer cell lines
vs
NCI-60 panel (39 cancer cell lines), E-MTAB-2980
Top 16 expressed genes in MCF-7 (breast cancer cell line)
[Figure columns: Genentech, NCI-60; annotation: Breast cancer marker]
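
The free-text “sex” labels collected in the meta-data slides above suggest why curation is needed before samples can be compared. The following is only a minimal sketch of one way such labels could be mapped to a controlled value, not a description of how the Atlas or ArrayExpress curators actually do it. The function name `normalise_sex` and the `EXACT` lookup table are hypothetical; the example labels are taken from the lists above. Anything the rules do not recognise returns `None`, so that it is left for a curator to decide.

```python
import re
from typing import Optional

# Hypothetical lookup of exact labels (abbreviations, misspellings, other languages).
# These entries come from the slide lists; a real table would grow over time.
EXACT = {
    "f": "female", "femme": "female", "famale": "female", "femail": "female",
    "femele": "female", "femlale": "female", "femal": "female",
    "m": "male", "makle": "male", "man": "male",
}

# Whole-word match, so that "male" inside "female" is not picked up by mistake.
SEX_WORD = re.compile(r"\b(females?|males?)\b")


def normalise_sex(label: str) -> Optional[str]:
    """Map a free-text sex label to 'female' or 'male', or None if unclear."""
    text = label.strip().lower()
    text = re.sub(r"^sex:\s*", "", text)  # drop a leading "sex:" prefix
    if text in EXACT:
        return EXACT[text]
    match = SEX_WORD.search(text)  # the first whole-word hit decides
    if match is None:
        return None  # unknown label: leave for manual curation
    return "female" if match.group(1).startswith("female") else "male"


if __name__ == "__main__":
    for raw in ["F", "femail", "female (lactating)", "sex: male",
                "castrated male", "thelytoky"]:
        print(f"{raw!r} -> {normalise_sex(raw)}")
```

Taking the first whole-word hit is a deliberate choice for labels such as the long “female (note: … provided as a "male" sample …)” entry, where the leading word carries the corrected value. Labels such as “metafemale” or “femalen” fall through to `None` on purpose rather than being guessed.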
How to match the “right” samples
Ideal
RNA-seq & proteomics data known to be from the same samples: COSMOS (“COordination of Standards in MetabOlomicS”) (http://www.cosmos-fp7.eu/WP2)
Workaround
Link “equivalent” samples. But some sample records are incomplete!
Last resort
Only integrate if sample annotation is “clear”. Curator’s call?

4. Transcript quantification & normalisation across data sets
Algorithms
Alignment: bwa, TopHat, Gsnap, STAR…
Quantification: Cufflinks, HTSeq…
TopHat + HTSeq: scalable, efficient, open source, good support
? Normalisation (research project in EBI functional genomics team)
“give me a single read-out of gene X’s expression in each tissue”

5. Microarray vs sequencing – comparable?
Example from the Gramene project (http://www.gramene.org/) (Unpublished data)
[Figure: experimental design, untreated control vs treated, at time points t1, t2 and t3, each x3, assayed by microarray or RNA-seq]

5. Microarray vs sequencing – comparable?
[Figure: microarray vs RNA-seq for the comparisons “t1 ctrl vs t2 ctrl” and “t2 ctrl vs t3 ctrl”. Legend: one symbol = fold-change of one gene; another symbol = differentially expressed in either or both datasets]

5. Microarray vs sequencing – comparable?
Gramene example is not an isolated case
Parallel comparison of Illumina RNA-Seq and Affymetrix microarray platforms on transcriptomic profiles generated from 5-aza-deoxy-cytidine treated HT-29 colon cancer cells and simulated datasets, BMC Bioinformatics, 2013.
Solution?
Only use RNA-seq data for baseline atlas
Focus more on integrating sequencing data?
Microarray overheads:
1. array design annotation,
2. non-standardised files (except for large proprietary platforms),
3. outdated array designs/probe sequences

6. Genome assembly and gene annotation updates never stop
Atlas release (monthly)
Gene annotation update (every 3-4 months for popular species)
Genome assembly patches (yearly)
New genome assembly (every few years)
Genome assembly and gene annotation updates
Atlas release (monthly)
Gene annotation update (every 3-4 months)
Genome assembly patches (yearly)
New genome assembly (every few years)
Impossible to reanalyse everything for every release
“Freeze” Atlas data with “old” assembly
Investigating incremental updates (re-align reads only in “patched” regions?)

Technical (interface) challenges
Static heatmaps don’t scale.
E.g. 96 T-helper cells, single-cell RNA-seq (http://www.ebi.ac.uk/gxa/experiments/E-MTAB-2512)
[Figure: heatmap of genes vs individual cells, showing 27 out of 96 cells]
GTEx (Genotype-Tissue Expression project) data from the Broad Institute (http://www.gtexportal.org/home/), close to 10,000 samples (bio. replicates)
Heatmaps need to get smarter (our solution: highcharts), or drop heatmaps!

Technical (interface) challenges
Too many buttons!
I want more customisation options!
Too many colours!

Technical (interface) challenges – solutions
Take advice from user-experience design experts
Run user-experience sessions
Photos courtesy of Nikiforos Karamanis, EBI

Technical (interface) challenges – solutions
[Figure panels: NOW, POSSIBLE FUTURE]

Factors favouring data integration in Atlas
Curation, curation, curation
• Contributing databases accept submissions. Curators fix metadata problems at the point of submission
Collaboration, constant feedback, agree on standards
• Good connection between EBI omics groups
• Use ontology terms to annotate meta-data
Openness
• Transparent data source and analysis procedures
• Only use open-source analysis software

Create those favourable factors for your project too…
Curation
• Use credible data sources, ask the contributors if you know them!
Collaboration, constant feedback, agree on standards
• Keep talking!
Reproducibility
• Document as much as you can
• Use e.g. git to manage and share code

How to use Atlas data for your own project
We’ve done some of the legwork for you!
[Figure: Atlas data (raw counts, FPKM values) + my own data: integrate with your data/do meta-analysis. My own data vs Atlas data: corroborate your own findings]
Gateway to other EBI resources

Examples of key vertebrate baseline data
[Figure labels: Key cell line models; Prenatal human brain; Single-cell RNA-seq, zebrafish dev. stage; Proteomics; Cancer tissues; Knockout mouse phenotyping (KOMP2 project); Basic research; Wild-type and knockout mouse models; Cancer cell line encyclopaedia (CCLE); 675 cancer cell lines; Cancer]

Best practice tips
1. Plan carefully before you start. E.g. how will you incorporate new data sets generated on a different genome assembly?
2. Get as many experimental details as possible, and store them in a repository that is maintained
3. Data quality control.
4. Always take the results of integration with a pinch of salt; beware of the caveats
5. Prepare for data presentation. It’s harder than you think!

Faces behind Expression Atlas
Roles shown: Gene Expression Team Leader; Curation and training; Atlas production; Atlas content; Curators/Bioinformaticians; Web development
Robert Petryszak, Amy Tang, Maria Keays (previous member), Irene Papatheodorou, Elisabet Barrera, Anja Fullgrabe, Laura Huerta, Sebastien Pesseat, Suhaib Mohammed, Alfonso Fuentes, Wojtek Bazant, Nuno Fonseca (RNA-seq manager)
Goodbye, and thank you, Maria!
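
Backup sketch for “How to use Atlas data for your own project”: the slide mentions both raw counts and FPKM values. If your own data exist only as raw counts, one possible first step before placing them next to FPKM values is to convert them with the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments). This is an illustration under stated assumptions, not the Atlas pipeline: the function name `fpkm`, the gene identifiers and the numbers are hypothetical, and the library size is taken to be the sum of the counts supplied.

```python
from typing import Dict


def fpkm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Convert raw fragment counts per gene to FPKM.

    FPKM = count * 1e9 / (gene length in bp * total mapped fragments).
    Assumption: the total is the sum of the counts passed in.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped fragments: cannot compute FPKM")
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total)
        for gene, count in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical counts and gene lengths, for illustration only.
    my_counts = {"geneA": 500, "geneB": 1500}
    my_lengths = {"geneA": 1000, "geneB": 3000}
    for gene, value in fpkm(my_counts, my_lengths).items():
        print(gene, value)
```

In this toy input geneB has three times the count of geneA but is also three times as long, so both genes receive the same FPKM. Putting values on the same scale does not by itself make data sets comparable; the caveats in the best practice tips still apply.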