Integrative omics: Expression Atlas as an example
Amy Tang
Curation and Training Project Leader
Functional Genomics Group, EMBL-EBI
http://www.ebi.ac.uk/~rpetry/geteam/
@ArrayExpressEBI and @ExpressionAtlas
7 March 2017
In the next hour …
• What is the Expression Atlas?
• Types of data integrated, and how
• Challenges we are facing and how we tackle them
• Favourable factors for integration in the Atlas
• How the Atlas can help your project, and a best-practice checklist
ArrayExpress
What we mean by “integration”
Ideal world: mRNA expression data, proteomics data and metabolomics data, each normalised and summarised
Reality: the three data types are spread over different experiments (Expt 1 to Expt 4) and tissues (heart, brain)
Between reality and the ideal world…
Integrated visualisation of multiple omics data sets to spot patterns
[Figure: grid for GENE A with tissue columns (Heart, Brain) and experiment rows (Expt 1 to Expt 4), grouped into mRNA expression data, proteomics data and metabolomics data; an X marks each tissue measured in an experiment]
www.ebi.ac.uk/gxa
What questions does the Atlas address?
Where is the SOX2 gene expressed in the human body?
Under what conditions does human SOX2 change expression levels?
What genes are expressed in normal skeletal muscles?
Two types of questions: baseline and differential expression
Baseline: gene expression in healthy/untreated conditions. E.g. human liver, ENCODE cell lines, rice root. 122 experiments
Differential: changes in gene expression in “comparisons” of experiment conditions. E.g. mutant vs wild type, drought vs normal watering, drug treated vs untreated. 2941 experiments
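A differential “comparison” can be illustrated as a log2 fold-change between a test group and a reference group. The sketch below is only an illustration, not the Atlas analysis pipeline: the expression values, the pseudocount and the function name are assumptions made up for this example.

```python
import math
from statistics import mean

def log2_fold_change(test, reference, pseudocount=1.0):
    """log2 of mean(test) over mean(reference); the pseudocount avoids log(0)."""
    return math.log2((mean(test) + pseudocount) / (mean(reference) + pseudocount))

# Hypothetical expression values for one gene, three replicates per condition
drought = [30.0, 34.0, 29.0]
normal_watering = [7.0, 6.0, 8.0]

lfc = log2_fold_change(drought, normal_watering)
print(f"log2 fold-change (drought vs normal watering): {lfc:.2f}")
```

A positive value means higher expression in the test condition; a baseline question needs no reference group at all, only the expression level itself.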
We answer the questions by extracting value from public expression data
ArrayExpress: ~ 70000 studies (direct submission, curated; automatic import)
Curators handpick and curate
Run a standardised QC & analysis pipeline
Genome sequences & gene models
Atlas: 3063 experiments (494 [16%] are RNA-seq)
Figures as of 1 Feb 2017 release
Integration in the Atlas – where we are at
[Figure: ArrayExpress (~ 70000 studies) feeding the Atlas (3063 experiments), with proteomics data and metabolomics data as further inputs; question marks on the links not yet in place. Other labels: mined gene names from PubMed paper, NCBI RefSeq. Query routes: by gene, by protein, by function/pathway]
The Atlas links to other EBI service resources
Genes, genomes & variation: European Nucleotide Archive, 1000 Genomes, Ensembl, Ensembl Genomes, European Genome-phenome Archive, Metagenomics portal
Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, MetaboLights, PRIDE
Protein sequences, families & motifs: InterPro, Pfam, UniProt
Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
Molecular structure: Protein Data Bank in Europe, Electron Microscopy Data Bank
Chemical biology: ChEMBL, ChEBI
Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
Systems: BioModels, Enzyme Portal, BioSamples
Baseline Atlas view
Zoomed-in slice of “baseline” expression of one gene (HPRT1)
[Figure: heatmap of tissues against data sets (RNA-seq, proteomics)]
Baseline Atlas view
Liver-specific protein expression in one expt. - Human Protein Atlas
[Figure: heatmap of genes against tissues]
Differential Atlas view
“Differential” expression of one gene (Car3) under different conditions
Differential Atlas view
“differential” expression (infected vs control) in one experiment
upregulated
downregulated
How do we do it: Curation and standardisation
~ 70000 studies in ArrayExpress
Curators hand-pick and curate
• Only ~9% of directly submitted experiments are Atlas-eligible
Run a standardised QC & analysis pipeline
• R/Bioconductor packages
• iRAP (RNA-seq pipeline)
Atlas: 3063 experiments (494 [16%] are RNA-seq)
Figures as of 1 Feb 2017 release
How do we do it: we consume Ensembl data
• Probe mapping (microarray data)
• Ref. genome to map reads (RNA-seq data)
• Transcript, gene and protein models/annotation
• ID mapping from gene to other entities
[Figure label: Gene X]
What challenges are we facing?
1. Incomplete meta-data
2. Inconsistent meta-data
3. Studies carried out on the same “type” of samples but by different research teams: comparable?
4. How to quantify expression level within one data set? How to “normalise” expression levels across data sets?
5. Microarray and RNA-seq data on the same set of samples: comparable?
6. How to keep up with gene annotation and genome assembly updates?
1. Incomplete meta-data
Example from RIKEN FANTOM5 project (human tissues, CAGE)
1. Incomplete meta-data
BioSamples database record for the sample:
2. Inconsistent meta-data
How many ways can researchers label their samples as “female” under attribute “sex”?
female
F
femme
2
How many ways can you say “female”?
18-day pregnant females
2 yr old female
400 yr. old female
adult female
asexual female
castrate female
cf.female
cystocarpic female
dikaryon
dioecious female
diploid female
f
famale
femail
female (lactating)
female (pregnant)
female (outbred)
female parent
female plant
female with eggs
female worker
female, 6-8 weeks old
female, virgin
female, worker
female(gynoecious)
femele
female, pooled
femalen
individual female
lgb*cc females
mare
female (worker)
monosex female
ovigerous female
oviparous sexual females
worker bee
female enriched
pseudohermaprhoditic female
remale
semi-engorged female
sexual oviparous female
sterile female worker
worker caste (female)
sex: female
female, other
female child
femal
3 female
female (phenotype)
female mice
female, spayed
femlale
metafemale
sterile female
normal female
sf
female
females
strictly female
vitellogenic replete female
female - worker
females only
tetraploid female
worker
female (alate sexual)
gynoecious
thelytoky
hexaploid female
female (calf)
healthy female
female (gynoecious)
female (f-o)
hen
probably female (based on morphology)
female (note: this sample was originally provided as a "male" sample to us and therefore labeled this way in the brawand et al. paper and original geo submission; however, detailed data analyses carried out in the meantime clearly show that this sample stems from a female individual)
Courtesy of N. Silvester, European Nucleotide Archive, EMBL-EBI
How many ways can you say “male”?
37 year old male
initial phase male
male fetus
six males mixed
600 yr. old male
m
male plant
stallion
adult male
make
male, 8 weeks old
steer
bull
makle
male, castrated
sterile male
castrated male
mal e
male, pooled
strictly male
cm
male
males
tetraploide male
dioecious male
male (7-2872)
man
type i males
diploid male
male (7-3074)
men
type ii males
drone
male (m-a)
normale male
virgin male
engorged male
male (m-o)
ram
winged and wingless males
fertile male
male caucasian
rooster
young male
four males mixed
male child
s1 male sterile
individual male
male fertile
sex: male
male (note: this sample was originally provided as a "female" sample to us and therefore labeled this way in the brawand et al. paper and original geo submission; however, detailed data analyses carried out in the meantime clearly show that this sample stems from a male individual)
Courtesy of N. Silvester, European Nucleotide Archive, EMBL-EBI
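Free-text labels like the ones above can be partly mapped to a controlled vocabulary automatically, with everything ambiguous sent to a curator. The following is a minimal sketch under stated assumptions: the term lists, the function name and the "needs curation" fallback are invented for illustration and are not the rules used by ArrayExpress, BioSamples or ENA.

```python
import re

# Illustrative, deliberately incomplete term lists (assumptions, not a standard)
FEMALE_TERMS = {"female", "females", "f", "femme", "mare", "hen"}
MALE_TERMS = {"male", "males", "m", "man", "men", "stallion", "bull", "ram", "rooster", "drone", "steer"}

def normalise_sex(label):
    """Map a free-text 'sex' label to 'female', 'male' or 'needs curation'."""
    tokens = re.findall(r"[a-z]+", label.lower())
    has_female = any(token in FEMALE_TERMS for token in tokens)
    has_male = any(token in MALE_TERMS for token in tokens)
    if has_female and not has_male:
        return "female"
    if has_male and not has_female:
        return "male"
    # Misspellings, numeric codes and labels mentioning both sexes go to a human
    return "needs curation"

for label in ["female (lactating)", "F", "37 year old male", "famale", "2",
              'female (note: originally provided as a "male" sample)']:
    print(f"{label!r} -> {normalise_sex(label)}")
```

Misspellings such as "famale", numeric codes such as "2" and labels that mention both sexes all fall through to "needs curation", which is where hand curation still earns its keep.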
It can be rather gross sometimes…
An excerpt from: “Laying a Community-Based Foundation for Data-Driven Semantic Standards in Environmental Health Sciences,” by Carolyn J. Mattingly, Rebecca Boyles, Cindy P. Lawler, Astrid C. Haugen, Allen Dearry, and Melissa Haendel. Environmental Health Perspectives, vol. 124, no. 8, August 2016, pages 1136-1140.
1,2. Meta-data - solutions
Get it right at the very beginning:
Encourage submission using webform tool “Annotare”
3. Comparable samples? (Tissues)
HMGCL gene in “normal” human tissues – data normalised per data set
[Figure: heatmap of HMGCL expression, with data sets on one axis and tissues on the other]
In which tissue(s) is the gene most highly expressed?
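Because the values are normalised per data set, absolute numbers should not be compared across data sets. One cautious way to approach the question is to find the top tissue within each data set separately and then see how often the data sets agree. This is a sketch only: the data set names and numbers are invented, and the voting step is an assumption for illustration, not an Atlas feature.

```python
from collections import Counter

# Hypothetical expression values for one gene; units differ between data sets
expression = {
    "data set A": {"liver": 120.0, "kidney": 80.0, "heart": 15.0},
    "data set B": {"liver": 9.0, "kidney": 11.0, "brain": 2.0},
    "data set C": {"liver": 300.0, "heart": 40.0},
}

def top_tissue_per_data_set(values_by_data_set):
    """Return the highest-expressing tissue within each data set."""
    return {name: max(tissues, key=tissues.get)
            for name, tissues in values_by_data_set.items()}

top = top_tissue_per_data_set(expression)
votes = Counter(top.values()).most_common()

print(top)
print(votes)
```

Note that the data sets do not cover the same tissues, so a tissue can only "win" where it was actually sampled.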
3. Comparable samples? (Cell lines)
Genentech (675 cancer cell lines), E-MTAB-2706, Klijn et al. (2014) A comprehensive transcriptional portrait of human cancer cell lines
vs
NCI-60 panel (39 cancer cell lines), E-MTAB-2980
[Figure: top 16 expressed genes in MCF-7 (breast cancer cell line), in panels labelled Genentech and NCI-60; one gene is annotated as a breast cancer marker]
3. How to match the “right” samples
Ideal
RNA-seq & proteomics data known to be from the same samples:
COSMOS (“COordination of Standards in MetabOlomicS”)
(http://www.cosmos-fp7.eu/WP2)
Workaround
Link “equivalent” samples. But some sample records are incomplete!
Last resort
Only integrate if sample annotation is “clear”. Curator’s call?
4. Transcript quantification & normalisation across data sets
Algorithms
Alignment: bwa, TopHat, Gsnap, STAR…
Quantification: Cufflinks, HTSeq…
• TopHat + HTSeq: scalable, efficient, open source, good support
Normalisation: ? (research project in EBI functional genomics team)
“give me a single read-out of gene X’s expression in each tissue”
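For quantification within one data set, a widely used unit is FPKM (fragments per kilobase of transcript per million mapped fragments). The sketch below computes FPKM per replicate and then takes the median across replicates as one possible "single read-out"; the median is an assumption chosen for illustration, and the counts, gene length and library sizes are invented. It does not address the open question of normalising across data sets.

```python
from statistics import median

def fpkm(count, gene_length_bp, total_mapped_fragments):
    """FPKM = count * 1e9 / (gene length in bp * total mapped fragments)."""
    return count * 1e9 / (gene_length_bp * total_mapped_fragments)

# Hypothetical replicates of one tissue for "gene X"
gene_length_bp = 2_000
replicates = [
    {"count": 500, "total_mapped_fragments": 20_000_000},
    {"count": 900, "total_mapped_fragments": 30_000_000},
    {"count": 400, "total_mapped_fragments": 10_000_000},
]

values = [fpkm(r["count"], gene_length_bp, r["total_mapped_fragments"])
          for r in replicates]
read_out = median(values)

print("FPKM per replicate:", values)
print("Single read-out (median):", read_out)
```

Dividing by library size makes replicates of different sequencing depth comparable; dividing by gene length stops long genes from looking more highly expressed simply because they collect more fragments.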
5. Microarray vs sequencing – comparable?
Example from the Gramene project (http://www.gramene.org/) (unpublished data)
[Figure: experimental design. Untreated control vs treated, each at three time points (t1, t2, t3), with three replicates (x3) per group]
5. Microarray vs sequencing – comparable?
[Figure: microarray vs RNA-seq fold-changes for the comparisons t1 ctrl vs t2 ctrl and t2 ctrl vs t3 ctrl. Legend: one marker type = fold-change of one gene; another marker type = differentially expressed in either or both datasets]
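One simple way to put a number on how well two platforms agree is to compare the log2 fold-changes of the genes they share, using a correlation coefficient and the fraction of genes that change in the same direction. This is a sketch under stated assumptions: the gene names and fold-changes are invented, and these two summary statistics are illustrative choices, not the analysis behind the figure.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def sign_agreement(xs, ys):
    """Fraction of genes whose fold-change has the same direction in both lists."""
    same = sum(1 for x, y in zip(xs, ys) if (x > 0) == (y > 0))
    return same / len(xs)

# Hypothetical log2 fold-changes for one comparison on each platform
microarray_lfc = {"geneA": 1.2, "geneB": -0.8, "geneC": 0.1, "geneD": 2.0}
rnaseq_lfc = {"geneA": 2.5, "geneB": -1.9, "geneC": -0.3, "geneD": 3.8, "geneE": 1.0}

shared = sorted(set(microarray_lfc) & set(rnaseq_lfc))  # genes measured on both
xs = [microarray_lfc[g] for g in shared]
ys = [rnaseq_lfc[g] for g in shared]

r = pearson(xs, ys)
agreement = sign_agreement(xs, ys)
print(f"{len(shared)} shared genes, Pearson r = {r:.3f}, same direction = {agreement:.0%}")
```

A high correlation can coexist with genes that flip direction or differ greatly in magnitude, so the per-gene view still matters.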
5. Microarray vs sequencing – comparable?
• Gramene example is not an isolated case
Parallel comparison of Illumina RNA-Seq and Affymetrix microarray platforms on transcriptomic profiles generated from 5-aza-deoxy-cytidine treated HT-29 colon cancer cells and simulated datasets, BMC Bioinformatics, 2013.
Solution?
• Only use RNA-seq data for baseline atlas
• Focus more on integrating sequencing data?
Microarray overheads:
1. array design annotation,
2. non-standardised files (except for large proprietary platforms),
3. outdated array designs/probe sequences
6. Genome assembly and gene annotation updates never stop
Atlas release (monthly)
Gene annotation update (every 3-4 months for popular species)
Genome assembly patches (yearly)
New genome assembly (every few years)
• Impossible to reanalyse everything for every release
• “Freeze” Atlas data with “old” assembly
• Investigating incremental updates (re-align reads only in “patched” regions?)
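The incremental-update idea can be sketched as an interval-overlap check: only genes that overlap a patched region would need their reads re-aligned. This is a toy illustration, not the approach under investigation; the coordinates, gene names and the choice of closed 1-based intervals are assumptions.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals [start, end] share at least one base."""
    return a_start <= b_end and b_start <= a_end

# Hypothetical gene coordinates: name -> (chromosome, start, end)
genes = {
    "gene1": ("chr1", 1_000, 5_000),
    "gene2": ("chr1", 20_000, 25_000),
    "gene3": ("chr2", 1_000, 3_000),
}

# Hypothetical patched regions from an assembly patch release
patches = [("chr1", 4_500, 6_000), ("chr2", 10_000, 12_000)]

def genes_in_patched_regions(genes, patches):
    """Names of genes overlapping any patched region on the same chromosome."""
    affected = []
    for name, (chrom, start, end) in genes.items():
        if any(chrom == p_chrom and overlaps(start, end, p_start, p_end)
               for p_chrom, p_start, p_end in patches):
            affected.append(name)
    return affected

to_reanalyse = genes_in_patched_regions(genes, patches)
print("Genes to re-analyse:", to_reanalyse)
```

In this toy case only one of three genes is affected, which is the kind of saving that makes incremental updates attractive compared with reanalysing everything.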
Technical (interface) challenges
Static heatmaps don’t scale.
E.g. 96 T-helper cells, single-cell RNA-seq
(http://www.ebi.ac.uk/gxa/experiments/E-MTAB-2512)
[Figure: heatmap of genes against 27 out of 96 individual cells]
• GTEx (Genotype-Tissue Expression project) data from BROAD (http://www.gtexportal.org/home/), close to 10,000 samples (bio. replicates)
• Heatmaps need to get smarter (our solution: Highcharts), or drop heatmaps!
Technical (interface) challenges
Too many buttons!
I want more customisation options!
Too many colours!
Technical (interface) challenges – solutions
• Take advice from user-experience design experts
• Run user-experience sessions
Photos courtesy of Nikiforos Karamanis, EBI
Technical (interface) challenges – solutions
NOW
POSSIBLE FUTURE
Factors favouring data integration in Atlas
Curation, curation, curation
• Contributing databases accept submissions. Curators fix metadata problems at the point of submission
Collaboration, constant feedback, agree on standards
• Good connection between EBI omics groups
• Use ontology terms to annotate meta-data
Openness
• Transparent data source and analysis procedures
• Only use open-source analysis software
Create those favourable factors for your project too…
Curation
• Use credible data sources, ask the contributors if you know them!
Collaboration, constant feedback, agree on standards
• Keep talking!
Reproducibility
• Document as much as you can
• Use e.g. git to manage and share code
How to use Atlas data for your own project
We’ve done some of the legwork for you!
Integrate with your data/do meta-analysis: my own data + raw counts, FPKM values
Corroborate your own findings: my own data vs Atlas data
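Integrating your own data with a downloaded expression table usually starts with joining the two on a shared gene identifier. The sketch below shows that step using only the Python standard library. The column names, gene IDs and values are placeholders, and the tab-separated layout is an assumption for illustration rather than the actual Atlas download format.

```python
import csv
import io

# Hypothetical tab-separated tables; IDs are placeholders, not real Ensembl IDs
atlas_tsv = "gene_id\tliver\theart\nGENE_A\t12.0\t3.0\nGENE_B\t0.5\t40.0\nGENE_C\t7.0\t7.5\n"
own_tsv = "gene_id\tliver\nGENE_A\t10.0\nGENE_C\t1.0\nGENE_D\t5.0\n"

def read_table(text):
    """Parse a TSV into {gene_id: {column: float value}}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["gene_id"]: {col: float(val) for col, val in row.items() if col != "gene_id"}
            for row in reader}

atlas = read_table(atlas_tsv)
own = read_table(own_tsv)

# Keep only genes present in both tables, side by side for one tissue
merged = {gene: {"own_liver": own[gene]["liver"], "atlas_liver": atlas[gene]["liver"]}
          for gene in sorted(atlas.keys() & own.keys())}
only_in_own = sorted(own.keys() - atlas.keys())

print(merged)
print("Not found in the downloaded table:", only_in_own)
```

Genes that drop out of the join are worth a second look: they often point to a difference in gene annotation version between the two sources.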
Gateway to other EBI resources
Examples of key vertebrate baseline data
Key cell line models
Prenatal human brain
Single-cell RNA-seq, zebrafish dev. stage
Proteomics
Cancer tissues
Knockout mouse phenotyping (KOMP2 project)
Basic research
Wild-type and knockout mouse models
Cancer cell line encyclopaedia (CCLE)
675 Cancer cell lines
Cancer
Best practice tips
1. Plan carefully before you start. E.g. how will you incorporate new data sets generated on a different genome assembly?
2. Get as many experimental details as possible, and store them in a repository that is actively maintained
3. Data quality control.
4. Always take the results of integration with a pinch of salt; beware of the caveats
5. Prepare for data presentation. It’s harder than you think!
Faces behind Expression Atlas
Gene Expression Team Leader
Robert Petryszak
Curation and training
Atlas production
Amy Tang
Maria Keays (previous member)
Atlas content
Irene Papatheodorou
Elisabet Barrera
Anja Fullgrabe
Laura Huerta
Sebastien Pesseat
Suhaib Mohammed
Curators/Bioinformaticians
Alfonso Fuentes
Wojtek Bazant
Web development
Nuno Fonseca (RNA-seq manager)
Goodbye, and thank you, Maria!