Download Gene set enrichment analysis

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Gene interaction networks for
functional analysis and
prognostication
Andrey Alexeyenko
“Data is not information,
information is not knowledge,
knowledge is not wisdom,
wisdom is not truth,”
—Robert Royar (1994)
Biological data production
http://www.hdpaperz.com
…and analysis
Pathway analysis: why needed, what it is?
Primary molecular
processes
Response, recorded
with high-throughput
paltforms
Protein abundance
1.
2. Phosphorylation
3. 1.
4. 2.
… 3. Methylation
N. 4. 1.
2.
… 3. mRNA expression
N. 4. 1.
2.
…
ChipSeq
3.
N.
1.
4.
2.
…
Mutation profile
3.
N.
1.
4.
2.
…
3.
N.
4.
…
Metabolites
N.
1.
2.
3.
…
N.
Known biological units:
processes, pathways
Cancer heterogeneity
Cancers of a primary site may represent a
heterogeneous group of diseases of diverse
molecular origin which vary with regard to
– oncogenic mutations that have initiated them,
– protein landscape (abundance, functionality) that
maintains the disease progression, and thereby:
– their responsiveness to specific drugs.
FunCoup is a data integration framework to discover
functional coupling in eukaryotic proteomes with
?
Bmouse
Human
Rat
Fly
Yeast
High-throughput
evidence
Amouse
Find orthologs*
data from model organisms
Andrey Alexeyenko and Erik L.L. Sonnhammer (2009) Global networks of functional
coupling in eukaryotes from comprehensive data integration. Genome Research.
http://FunCoup.sbc.su.se
FunCoup: on-line interactome resource
Andrey Alexeyenko and Erik L.L. Sonnhammer (2009) Global networks of functional
coupling in eukaryotes from comprehensive data integration. Genome Research.
State-of-the-art method to beat:
Frequency analysis of somatic mutations
•Yellow diamonds: somatic
mutations in prostate cancer
•Pink crosses: also mutated in
glioblastome (TCGA)
Do gene
networks tell
any story?
Network enrichment analysis:
Question: How to find something in common between many altered genes?
Altered genes (green) from one individual lung cancer are enriched
in network connections to members of ErbB (HER2) pathway
(yellow) and GO term “apoptosis” (blue).
Apophenia: the human
propensity to see
meaningful patterns in
random data
(Brugger 2001; Fyfe et al. 2008)
Network enrichment analysis:
compared to a reference and quantified
Actual network:
observed pattern
A random pattern
N links_real = 12
Question:
Is ANXA2 related to TGFbeta
signaling?
N links_expected = 4.65
Standard deviation = 1.84
Z = (N links_observed – N links_expected) / SD = 3.98
P-value = 0.0000344
FDR < 0.1
Functional characterization of novel gene sets
State-of-the-art method to beat:
Gene set enrichment analysis
Our alternative:
Network enrichment analysis
Altered genes
?
Functional set
1400
No. of positives, r eal groups
1200
Alexeyenko A, Lee W, … Pawitan Y. Network
enrichment analysis: extension of gene-set
enrichment analysis to gene networks. BMC
Bioinformatics, 2012
1000
800
600
400
GEA_SIGN
GEA_REST
NEA_SIGN
NEA_REST
200
0
0
50
100
150
200
250
300
No. of positives, random groups
350
400
450
Towards even better network prediction
partial correlations:
a way to get rid of spurious links
0.7
0.6
0.4
Cancer-specific networks:
links inferred from
expression, methylation, mutations
State-of-the-art
method to beat:
Reverse
engeneering
froma single
source (usually
transcriptome)
Functional coupling
transcription  transcription
transcription  methylation
methylation
 methylation
mutation

methylation
mutation

transcription
mutation
 mutation
+
mutated gene
How informative is a global gene network?
Now:
answering biological questions
Molecular phenotypes in network space
1.0
(lung cancer, data from ChemoRes consortium)
0.6
0.4
0.2
Relapse-free survival
0.8
P53 signaling
Predictors:
GO:0001666 response to hypoxia
GO:0005154 EGFR binding
GO:0005164 TNF binding
GO:0070848 response to growth factor stimulus
0.0
Apoptosis
0
2
4
Years since surgery
Cell cycle
Question: How to
distinguish between
different molecular
subtypes of cancer?
6
8
Science 29 March 2013:
Cancer Genome Landscapes
Bert Vogelstein, Nickolas Papadopoulos, Victor E. Velculescu,
Shibin Zhou, Luis A. Diaz Jr., Kenneth W. Kinzler*
"Though all 20,000 protein-coding genes have been evaluated
in the genome-wide sequencing studies of 3284 tumors, with a
total of 294,881 mutations reported, only 125 Mut-driver genes,
as defined by the 20/20 rule, have been discovered to date
(table S2A). Of these, 71 are tumor suppressor genes and 54
are oncogenes. An important but relatively small fraction (29%)
of these genes was discovered to be mutated through unbiased
genome-wide sequencing; most of these genes had already
been identified by previous, more directed investigations. “
“At best, methods based on mutation frequency can only
prioritize genes for further analysis but cannot
unambiguously identify driver genes that are mutated at
relatively low frequencies”
Is NTRK1 a driver in the GBM tumor TCGA-02-0014?
Somatic mutations: drivers vs. passengers
data from The Cancer Genome Atlas
Driver copy number changes
Confidence
Copy number of MAP3K11
Copy number
Point
mutation in
PTEN
Present
Absent
Altered
Normal
6 (~2)
2 (~6)
29 (~33)
105 (~102)
Gains and losses on chromosome 7, in 142 glioblastoma multiforme genomes, TCGA
Validation of candidate disease genes
(work with Jonathan Prince, MEB, KI)
Genetic association of sequence variants near AGER/NOTCH4 and dementia.
Bennet AM, Reynolds CA, Eriksson UK, Hong MG, Blennow K, Gatz M, Alexeyenko A, Pedersen NL,
Prince JA.
J Alzheimers Dis. 2011;24(3):475-84.
Genome-wide pathway analysis implicates intracellular transmembrane protein transport in Alzheimer
disease.
Hong MG, Alexeyenko A, Lambert JC, Amouyel P, Prince JA.
J Hum Genet. 2010 Oct;55(10):707-9. Epub 2010 Jul 29.
Analysis of lipid pathway genes indicates association of sequence variation near
SREBF1/TOM1L2/ATPAF2 with dementia risk.
Reynolds CA, Hong MG, Eriksson UK, Blennow K, Wiklund F, Johansson B, Malmberg B, Berg S,
Alexeyenko A, Grönberg H, Gatz M, Pedersen NL, Prince JA.
Hum Mol Genet. 2010 May 15;19(10):2068-78. Epub 2010 Feb 18.
Question: Is
there extra
evidence for
GWAScandidates to be
involved?
Answer: Yes,
for some…
Resistance to vinorelbine in lung
cancer: pathways
0.8
0.6
0.2
0.0
2
4
6
8
0
2
4
6
8
Years since surgery
GO:0005164 TNF binding
GO:0070848 response to growth factor stimulus
0.8
0.6
0.4
0.2
0.0
0.0
0.2
0.4
0.6
Relapse-free survival
0.8
1.0
Years since surgery
1.0
0
Relapse-free survival
0.4
Relapse-free survival
0.6
0.4
0.0
0.2
Relapse-free survival
0.8
1.0
GO:0005154 EGFR binding
1.0
GO:0001666 response to hypoxia
0
2
4
Years since surgery
6
8
0
2
4
Years since surgery
6
8
Overlap between gene signatures of relapse
Usually the overlap between signatures is negligible.
Because e.g.:
• Different sub-types of patient population,
• Different microarray platforms!
From Roepman et al., 2007
However the main reason is:
Dimensionality curse!
Consistency of molecular landscape:
raw genes or network-based pathway scores?
Gene expression => pathway scores
Gene expression
Resistance to vinorelbine in lung cancer
T-R+
T+R-
T+R+
T-R-
T-R+
T+R-
T+R+
T-R-
T-R+
T+R-
T+R+
T-R-
T-R+
T+R-
T+R+
T-R-
T-R+
T+R-
T+R+
T-R-
T-R+
T+R-
T+R+
T-R+
T+R-
T+R+
T-R-
T-R+
T+R-
T+R+
-4
-6
-3
-4
-2
-2
-3
-4
-1
0
MED17
0
CDC2L6
-1
MED4
-2
-2
1
0
2
0
2
1
3
4
2
2
-3
-3
-6
-4
0
-3
FNBP4
0
0.0
TBL1Y
-0.5
1
-2
AC005921.3
-1
-1
-1.0
-5
-2
-2
-1.5
CLPTM1L
Relapse
1
-1
2
0.5
+
-
+
A
B
-
C
D
2
0
tumor resistance to
chemotherapy?
Answer: Depletion of
differential transcriptome
towards few specific genes
3
1.0
Question: What predicts
Treatment
CDK8
T-R-4
-5
T-R-
Resistance to vinorelbine in lung cancer
Functionally coherent genes associated with vinorelbine resistance.
A. Network representation of the group. Magenta: genes associated with
resistance in NEA and likely producing a protein complex (ranked 1, 3, 5,
6, 11, and 18) plus one more gene CLPTM1L (beyond the ranking but
also significantly associated, was previously reported as related to
cisplatin resistance); yellow: non-commonly expressed genes linked with
the NEA genes and less represented in the susceptible tumors, hence
most contributing to the association (see Results for more explanation).
B. Box-plots of NEA scores for the genes colored magenta in A.
Network analysis:
how to succeed?
• Analyze prioritized candidates (from genotyping, DE, GWAS…) rather
than any genes.
• Do not lean on single “interesting” network links. Employ statistics!
i.e.
“concrete questions” => “testable hypotheses” => “concrete answers”
The amount of information in known gene networks is
enormous.
Let’s just use it!
Summary
•
•
•
•
•
•
•
Raw data production
Data accumulation globally
Ambitions
Need in data interpretation
Functional interpretation
Networks: in reality and as logical models
Significance: as important in bioinformatics as in “core” biology
– Risk of false discovery
– Adjustment for multiple testing
• Enrichment analysis: pure gene sets and in the network
• Biological questions to answer:
– Driver mutations
– Other “driving” events (changed copy number, methylation, expression)
– Association between a clinical feature and multiple genes at once