Building a Global Map of
(Human) Gene Expression
Misha Kapushesky
European Bioinformatics Institute, EMBL
St. Petersburg, Russia
May, 2010
From one genome to many biological states
• While there is only one genome sequence, different
genes are expressed in different cell types, tissues,
and developmental or disease states
• The size and structure of this “expression space” is still
largely unknown
• Most individual experiments are looking at small regions
• We would like to build a map of the global human gene
expression space
Mapping the human transcriptome
[Figure: map analogy with the places Everest, Kathmandu and Lhasa; labels: "A microarray experiment", "Traditional research", "The map we want to build"]
How to build such a global map
• This space is huge: there are thousands of potentially
different states – cell types, tissue types, developmental
stages, disease states, systems under various treatments
(drugs, radiation, stress, …)
• It is not feasible to study them all in a single laboratory
experiment (costs, rare samples, …)
• However, thousands of gene expression experiments are
performed every year (microarrays, next-generation
sequencing)
• Can we use the published data to build the global
expression map?
ArrayExpress
• www.ebi.ac.uk/arrayexpress
• Data from over 280,000 assays and over 10,000
independent studies (microarrays, sequencing, …)
• Gene expression and other functional genomics assays
• Over 200 species
• Data collection and exchange from GEO
Can we integrate these data to answer questions
that go beyond what was done in the individual
studies?
• On a quantitative level, only data from the same
microarray platform can be integrated
A global map of human gene expression
• Angela Gonzales (EBI)
• Misha Kapushesky (EBI)
• Janne Nikkila (Helsinki University of Technology)
• Helen Parkinson (EBI)
• Wolfgang Huber (EMBL)
• Esko Ukkonen (University of Helsinki)
Margus Lukk et al, Nature Biotechnology, 28,
p322-324 (April, 2010)
The most popular gene expression
microarray platform: Affymetrix U133A
• We collected over 9000 raw data files from Affymetrix
U133A from GEO and ArrayExpress
• Applying strict quality controls, removing the duplicates
• Data on 5372 samples remained
from 206 different studies generated
in 163 different laboratories
grouped in 369 different biological ‘conditions’ (tissue types,
diseases, various cell lines, etc)
• The 369 conditions grouped in different larger
‘metagroups’
Different metagroupings (4 and 15):
After RMA normalisation we obtain:
~18,000 genes
5372 samples (369 different conditions)
Principal Component Analysis – each dot is one of
the 5372 samples
[Figure: two PCA scatter plots; axes: 1st and 2nd principal components]
23/05/2017
Human gene expression map
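The projection behind these principal component plots can be sketched as follows. This is a minimal illustration, assuming a samples-by-genes matrix of normalised expression values and NumPy; the function name `pca_scores` and the toy data are hypothetical and are not taken from the original analysis.

```python
import numpy as np

def pca_scores(expr, n_components=3):
    """Project samples onto the top principal components.

    expr: array of shape (n_samples, n_genes), e.g. normalised log expression.
    Returns (scores, explained_variance_ratio).
    """
    # Centre each gene so the components describe variation around the mean.
    centred = expr - expr.mean(axis=0)
    # SVD of the centred matrix: rows of vt are the principal axes.
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

# Toy data (hypothetical): 6 samples x 4 genes; the first three samples
# differ from the last three in genes 0 and 1.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.1, size=(3, 4)) + np.array([5.0, 5.0, 0.0, 0.0])
group_b = rng.normal(0.0, 0.1, size=(3, 4))
expr = np.vstack([group_a, group_b])

scores, explained = pca_scores(expr, n_components=2)
print(scores[:, 0])  # first-component coordinate of each sample
print(explained)     # fraction of variance captured by each component
```

In the toy data the first component separates the two groups, which is the same kind of reading applied to the axes of the real map.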
[Figures: PCA plots of the 2nd component against the hematopoietic axis]
[Figure: PCA plot of the malignancy axis against the hematopoietic axis]
Hematopoietic and malignancy axes
Lukk et al, Nature Biotechnology, 28: 322
[Figure: plot of the 1st, 2nd and 3rd principal components, coloured by tissues of origin]
[Figure: 3rd PC by tissues of origin – neurological axis]
First 3 (5) principal components
1. Hematopoietic axis – blood, ‘solid tissues’, ‘incompletely differentiated cells and connective tissues’
2. Malignancy axis – cell lines – cancer – normals and other diseases
3. Neurological axis – nervous system / the rest
RNA degradation
Samples seem to ‘cluster’ by the tissues of origin
Hierarchical clustering of 97 groups with at least 10
replicates each
Comparison of the 97 larger sample
groups to the rest
Incompletely differentiated cell type
and connective tissue group
Conclusions so far
• We have identified 6 major transcription profile classes
in these data:
1. cell lines
2. incompletely differentiated cells and connective tissues
3. neoplasms
4. blood
5. brain
6. muscle
• Cell lines cluster together!
Gene expression across the 5372 samples
• The expression of most genes is relatively constant
• There are only 1034 probesets (mapping to fewer than
900 genes) for which the normalised signal has a
standard deviation > 2
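The variability filter described above can be sketched like this. It is an illustrative example only: the function name `most_variable`, the probeset identifiers and the values are hypothetical, and the sample standard deviation is assumed as the variability measure.

```python
from statistics import stdev

def most_variable(expr_by_probeset, sd_cutoff=None, top_n=None):
    """Rank probesets by standard deviation across samples.

    expr_by_probeset: dict mapping probeset id -> list of normalised signals.
    sd_cutoff: keep only probesets with standard deviation above this value.
    top_n: keep only the top_n most variable probesets.
    Returns (ranked probeset ids, dict of standard deviations).
    """
    sds = {p: stdev(values) for p, values in expr_by_probeset.items()}
    ranked = sorted(sds, key=sds.get, reverse=True)
    if sd_cutoff is not None:
        ranked = [p for p in ranked if sds[p] > sd_cutoff]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked, sds

# Hypothetical normalised signals for three probesets across four samples.
toy = {
    "probe_a": [8.0, 8.1, 7.9, 8.0],   # nearly constant
    "probe_b": [2.0, 9.0, 3.0, 10.0],  # highly variable
    "probe_c": [5.0, 6.0, 5.5, 5.0],   # mildly variable
}

above_cutoff, sds = most_variable(toy, sd_cutoff=2.0)
top_two, _ = most_variable(toy, top_n=2)
print(above_cutoff)  # probesets with standard deviation > 2
print(top_two)       # the two most variable probesets
```

The same ranking can feed either a cut-off (as with standard deviation > 2) or a fixed number of probesets (as with the 1000, 350 or 30 most variable).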
Clustering of 97 sample groups and 1000 most
variable probesets (about 900 genes)
[Figure: clustering heat map with probeset clusters 1–19; annotation per cluster:]
1. Immune response
2. Nervous system development
3. Lipid raft
4. Mitosis
5. Neurotransmitter uptake
6. Cytoskeletal protein binding
7. Extracellular matrix
8. Extracellular regions
9. Extracellular matrix
10. Extracellular region
11. Mitosis
12. Defence response
13. Nervous system development
14. Actin cytoskeleton organisation and biogenesis
15. Protein carrier activity
16. No significant result
17. Antigen presentation, exogenous antigen
18. Trans-1,2-dihydrobenzene-1,2-diol dehydrogenase activity
19. S100 alpha binding
Clustering based on a subset of these genes produces
similar results
• Clustering based on 350 most variable probesets gives
almost the same result
• Even clustering based on 30 most variable probesets is
very close
24 most variable genes
CALD1 – Actin- and myosin-binding protein implicated in the regulation of actomyosin interactions in smoo…
CDH1 – Calcium dependent cell-cell adhesion glycoprotein
COL1A1 – Type I collagen - fibrillar forming collagen (alpha 1 chain)
COL1A2 – Type I collagen - fibrillar forming collagen (alpha 2 chain)
COL3A1 – Collagen type III occurs in most soft connective tissues along with type I collagen
COL6A3 – Collagen VI acts as a cell-binding protein
CXCR4 – Receptor for the C-X-C chemokine CXCL12/SDF-1, participates in a signal transduction
DCN – May affect the rate of fibrils formation
DKK3 – Inhibitor of Wnt signaling pathway (Potential)
FN1 – Involved in cell adhesion, cell motility, opsonization, wound healing, and maintenance of cell shap…
HBA1 – Involved in oxygen transport from the lung to the various peripheral tissues
HLA-DRA – One of the HLA class II alpha chain paralogues, plays a central role in the immune system
HLA-DRA1, HLA-DRB3 – Plays a central role in the immune system by presenting peptides derived from extracellular prote…
GJA1 – Cluster of closely packed pairs of transmembrane channels, the connexons
KRT15 – Encodes a member of the keratin gene family
KRT18 – Type I intermediate filament chain keratin
LUM – A member of the small leucine-rich proteoglycan (SLRP) family
LYZ – Encodes human lysozyme
PLS3 – Actin-bundling protein found in intestinal microvilli, hair cell stereocilia, and fibroblast filopodia
S100AB – S-100 is a group of low molecular weight (10–12 kD) calcium-binding proteins highly conserved a…
SPARC – Appears to regulate cell growth through interactions with the extracellular matrix and cytokines
SPARCL1 – Seems to be little known
TACSTD2 – Tumor-associated calcium signal transducer 2
www.ebi.ac.uk/gxa/human/U133a
Can we go beyond the 6 major classes?
Hierarchical clustering of all 369 sample groups
Some finer groups:
Cancer:
• Sarcomas
• Carcinomas
• Neuroblastomas
Normal:
• Liver and gut
Normal blood and blood
non-neoplastic disease
Leukemia
Other blood neoplasm
Blood cell lines
Identifying condition specific genes by
supervised analysis
• Using linear models to find condition-specific
genes, multiple testing correction, differential
expression cut-offs
• Example: 174 leukemia-specific genes
include most well-known markers (e.g., BCR, ETV6, FLT3,
HOXA9, MYST3, PRDM2, RUNX1, and TAL1)
Many confirmed as associated with leukemia
• Beyond the major 6 classes the ‘signal’ becomes
weak
• The problem may be lab effects
The large biological effects are stronger than the lab
effects
However, when we zoom into particular subclasses, the
lab effects may be taking precedence
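The supervised step mentioned above (multiple testing correction followed by differential expression cut-offs) can be sketched as follows. This is an assumption-laden illustration: the slides do not say which correction was used, so the Benjamini-Hochberg procedure is swapped in here as one common choice. The per-gene p-values and log fold changes are taken as given (for example from a linear model fitted elsewhere), and the gene names, thresholds and values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def condition_specific(genes, pvalues, log_fold_changes,
                       q_cutoff=0.05, lfc_cutoff=1.0):
    """Genes passing both the adjusted p-value and the effect cut-off."""
    q = benjamini_hochberg(pvalues)
    return [g for g, qi, lfc in zip(genes, q, log_fold_changes)
            if qi <= q_cutoff and abs(lfc) >= lfc_cutoff]

# Hypothetical per-gene results for one condition versus the rest.
genes = ["g1", "g2", "g3", "g4"]
pvalues = [0.001, 0.04, 0.03, 0.5]
log_fold_changes = [2.5, 0.3, -1.8, 0.1]

q_values = benjamini_hochberg(pvalues)
selected = condition_specific(genes, pvalues, log_fold_changes, q_cutoff=0.1)
print(q_values)
print(selected)
```

Requiring both a significance threshold and a minimum effect keeps genes such as `g2`, which is nominally significant but barely changed, out of the condition-specific list.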
Mapping the human transcriptome
[Figure: the map analogy again (Everest, Kathmandu, Lhasa), with labels "A microarray experiment", "Traditional research", "The map we want to build" and "Our current view on global transcriptome"]
[Figure: 97 groups – colours recycled; labelled groups: Frontal cortex, Brain, Hippocampal tissue, Cerebellum, Caudate nucleus, Brain and system, Mononuclear cells, Muscular dystrophy, Skeletal muscle, AML, Heart and heart parts, Nervous system tumors]
Second approach
• Integrating data on the statistics level
Gene Expression Atlas
• Ele Holloway
• Ibrahim Emam
• Pavel Kurnosov
• Helen Parkinson
• Andrey Zorin
• Tony Burdett
• Gabriella Rustici
• Eleanor Williams
• Andrew Tikhonov
Global Differential Expression Analysis
• Selected ~10% of the data from ArrayExpress (including GEO imports),
manually curated for quality and mapped to a custom-built ontology of
experimental factors, EFO: http://www.ebi.ac.uk/efo
• Data on differential expression of genes in 1000+ studies, comprising
~30000 assays, in over 5000 conditions
• For each experiment, differentially expressed genes have been identified
computationally via moderated t-tests and statistical meta-analysis
Meta-Analysis Approaches
• Vote counting:
number of independent studies supporting an observation for a particular gene
• Effect size integration:
compute effect size statistics in each study, assess relevant statistical model and compute combined
z-score, for each gene/condition/study combination (extension of Choi et al, 2003)
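Vote counting, the first of the two approaches above, can be sketched as a simple tally. The record layout, the function name `vote_count` and the study identifiers below are hypothetical; the only assumption carried over from the slide is that each independent study contributes at most one vote per observation.

```python
from collections import Counter

def vote_count(calls):
    """Count independent studies supporting each observation.

    calls: iterable of (study, gene, condition, direction) tuples,
           where direction is "up" or "down".
    Returns {(gene, condition): Counter({"up": n, "down": m})}.
    """
    seen = set()
    votes = {}
    for study, gene, condition, direction in calls:
        key = (study, gene, condition, direction)
        if key in seen:  # a study votes once per observation
            continue
        seen.add(key)
        votes.setdefault((gene, condition), Counter())[direction] += 1
    return votes

# Hypothetical differential expression calls from three studies.
calls = [
    ("study_1", "gene 1", "AML", "up"),
    ("study_2", "gene 1", "AML", "up"),
    ("study_2", "gene 1", "AML", "up"),    # duplicate record, ignored
    ("study_3", "gene 1", "AML", "down"),
    ("study_1", "gene 2", "CML", "down"),
]

votes = vote_count(calls)
print(votes[("gene 1", "AML")])
print(votes[("gene 2", "CML")])
```

Vote counting ignores how strong each effect is, which is the gap the effect size integration approach is meant to fill.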
Analysing each contributing dataset separately:
[Table: genes-by-conditions matrix of 0/1 calls, with rows gene 1, gene 2, gene 3, …, gene n and columns AML, CML, normal, obtained by one-way ANOVA over the genes]
Combining the datasets
[Table: the 0/1 matrix extended across experiments 1, 2, 3, …, m, with rows gene 1, gene 2, gene 3, …, gene n and columns AML e1, AML e2, AML e3, CML e1, CML e2, CML e3, CML e4, normal, …]
Effect size-based meta-analysis
• We have for each gene in each experiment/condition:
p-value for significance
simultaneous t-statistics & confidence intervals
d.e. label (“up” or “down”)
• However, we would also like:
A measure of the strength of d.e. (effect size)
The ability to combine d.e. findings statistically
• Effect Size
Standardized mean difference or similar (e.g., correlation coef.)
Meta-analysis Procedure
• For each gene-experiment-condition combination
Compute effect size from simultaneous d.e. t-statistics
• Combine effect sizes across
multiple studies
Using fixed-effects or random-effects models
Obtain for each gene-condition
combination:
• Mean effect size estimate
• Combined z-score
• Overall p-value
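The procedure above can be sketched for the fixed-effects case as follows. This is a minimal sketch under stated assumptions, not the Atlas implementation: it uses the textbook conversion from a two-sample t-statistic to a standardised mean difference, the usual large-sample variance approximation, and inverse-variance weighting. The function names and the study values are hypothetical, and the random-effects variant is not shown.

```python
import math

def effect_size_from_t(t, n1, n2):
    """Standardised mean difference d and its approximate variance,
    derived from a two-sample t-statistic with group sizes n1 and n2."""
    d = t * math.sqrt(1.0 / n1 + 1.0 / n2)
    var = (n1 + n2) / (n1 * n2) + d * d / (2.0 * (n1 + n2))
    return d, var

def combine_fixed_effects(effects):
    """Inverse-variance weighted combination of (d, variance) pairs.

    Returns (mean effect size, combined z-score, two-sided p-value).
    """
    weights = [1.0 / var for _, var in effects]
    total = sum(weights)
    mean = sum(w * d for w, (d, _) in zip(weights, effects)) / total
    se = math.sqrt(1.0 / total)
    z = mean / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return mean, z, p

# Hypothetical t-statistics for one gene-condition pair in three studies:
# (t, samples in condition, samples in reference).
studies = [(3.0, 10, 10), (2.0, 8, 12), (4.0, 15, 15)]
effects = [effect_size_from_t(t, n1, n2) for t, n1, n2 in studies]

mean_effect, z_score, p_value = combine_fixed_effects(effects)
print(mean_effect, z_score, p_value)
```

A fixed-effects model assumes that all studies estimate the same underlying effect; when studies disagree by more than sampling error would explain, a random-effects model adds a between-study variance term to the weights.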
Long tail of annotations…
Annotating data with ontologies
• Diverse nature of annotations on data
• Need to support complex queries which contain semantic
information
E.g. which genes are under-expressed in brain cancer samples in
human or mouse?
• If we annotate with adenocarcinoma, do we get this data?
James Malone
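The adenocarcinoma question above comes down to expanding a query term to everything beneath it in the ontology. The sketch below illustrates the idea with a made-up three-term hierarchy; the terms, sample identifiers and function names are hypothetical and are not copied from EFO or from the Atlas code.

```python
def descendants(term, children):
    """All terms at or below `term` in a parent -> children mapping."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found

# Hypothetical fragment of a disease hierarchy.
children = {
    "cancer": ["carcinoma", "sarcoma"],
    "carcinoma": ["adenocarcinoma"],
}

# Hypothetical sample annotations.
annotations = {
    "sample_1": "adenocarcinoma",
    "sample_2": "sarcoma",
    "sample_3": "normal",
}

def samples_for(term):
    """Samples annotated with `term` or any of its descendants."""
    terms = descendants(term, children)
    return sorted(s for s, label in annotations.items() if label in terms)

print(samples_for("cancer"))     # includes the adenocarcinoma sample
print(samples_for("carcinoma"))
```

Without the expansion, a query for "cancer" would miss the sample annotated only as "adenocarcinoma"; with it, the annotation can stay as specific as the curator chose.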
Decoupling knowledge from data
Atlas/AE
James Malone
Semantically-enriched Queries with EFO
We can use the ontology structure
We can perform effect size
meta-analysis on a hierarchy,
if we follow several rules:
Increased statistical power
Condition-specificity through EFO
Condition-specific Gene Expression
www.ebi.ac.uk/gxa
Query for genes
species
Query for conditions
The ‘advanced query’
option allows building
more complex queries
http://www.ebi.ac.uk/gxa
Query results for gene ASPM
Zoom into one of the ‘Glioblastoma’ studies.
• Each bar represents an expression level in a particular sample
• ASPM is downregulated in the ‘normal’ condition in comparison to a disease
• Upregulated in ‘Glioblastoma’ in 3 independent studies
• In 9 studies out of 10
ArrayExpress
‘wnt pathway’ genes in various cancers
Integrating both approaches
• The first approach gives the global view, but obscures the
detail
• The second approach gives detail, but does not make it
easy to integrate everything in one map
• Can we combine both approaches?
Other data
• RNAseq data
• Proteomics data – Human Protein Atlas from KTH in
Stockholm (collaboration with Mathias Uhlen)
• Time series – what states does a cell go through on the
way from an ESC to a mature cell?
Two ways of integrating the data
• On a quantitative level – normalise all data together
Advantages – results easier to interpret
Disadvantages – lab effects
• On a statistics level – analyse each dataset separately
first
Advantages – fewer lab effects
Disadvantages – combined data difficult to interpret (in each
experiment each condition is compared to something else)
• How to combine the two approaches?
Acknowledgements
• Margus Lukk
• Misha Kapushesky
• Angela Gonzales
• Helen Parkinson
• Gabriella Rustici
• Ugis Sarkans
• Ele Holloway
• Roby Mani
• Mohammadreza Shojatalab
• Nikolay Kolesnikov
• Niran Abeygunawardena
• Anjan Sharma
• Miroslaw Dylag
• Ekaterina Pilicheva
• Ibrahim Emam
• Pavel Kurnosov
• Andrew Tikhonov
• Andrey Zorin
• Anna Farne
• Eleanor Williams
• Tony Burdett
• James Malone
• Holly Zheng
• Tomasz Adamusiak
• Susanna-Assunta Sansone
• Philippe Rocca-Serra
• Natalija Sklyar
• Marco Brandizi
• Chris Taylor
• Eamonn Maguire
• Maria Krestyaninova
• Mikhail Gostev
• Johan Rung
• Natalja Kurbatova
• Katherine Lawler
• Nils Gehlenborg
• Lynn French
Collaborators
Audrey Kaufman (EBI)
Wolfgang Huber (EBI)
Sami Kaski (Helsinki)
Morris Swertz (Groningen)
…
Funding
European Commission
• FELICS
• MolPAGE
• ENGAGE
• MuGEN
• SLING
• DIAMONDS
• EMERALD
NIH (NHGRI)
EMBL