Microarray-based Disease Prognosis using Gene Annotation Signatures
Michael Kovshilovsky
Swapna Annavarapu
SoCalBSI 2005
• Internship site: BioDiscovery, Inc.
• Mentor: Dr. Bruce Hoff
• Source of Funding: BioDiscovery, Inc.
Motivation
• Microarray gene-expression profiling
studies to predict disease outcomes.
– ex: cancer outcome
• To improve treatment of patients based
on knowledge of gene-expression
profile (molecular signature).
Lancet Paper
“Prediction of cancer outcome with microarrays:
a multiple random validation strategy”
Findings of Stefan Michiels et al.:
“Gene expression microarray-based predictors of
clinical outcome have been poorly optimistic and
careful review shows that performance is poor and
variable.”
- Analyzed data from the 7 largest published studies
that have attempted to predict prognosis of cancer
patients based on DNA microarray analysis.
- Random sampling approach
Goal
• Reproduce the Lancet paper.
• Compare the classification based on expression
levels of microarray probes, with classification based
on GSEA scores of biological pathways.
• Validate our hypothesis:
– By abstracting away from the gene expression
domain to that of biological properties,
performance should stabilize and improve.
Phase I: Reproduce the Lancet Paper
(Gene-Expression based classification)
Methodology
• Data loading
• Data preprocessing
• Data selection
• Correlating with clinical outcome
• Determine the molecular signature
• Classification of data
Data Loading
• Read Affymetrix chip expression data.
Sample data (Avg Diff and Abs Call for samples T-ALL-C1 and T-ALL-C2; descriptions are truncated as in the source):
Probe set | Description | T-ALL-C1 Avg Diff | T-ALL-C1 Abs Call | T-ALL-C2 Avg Diff | T-ALL-C2 Abs Call
AFFX-MurIL2_at | M16762 Mouse interleukin 2 (IL-2) gene, exon 4 | 5803.2 | P | -968.6 | A
AFFX-MurIL10_at | M37897 Mouse interleukin 10 mRNA, complete cds | -1626.9 | A | -929.1 | A
AFFX-MurIL4_at | M25892 Mus musculus interleukin 4 (Il-4) mRNA, complete cds | -2599.6 | A | 254.3 | A
AFFX-MurFAS_at | M83649 Mus musculus Fas antigen mRNA, complete cds | 2353.9 | A | 1430.1 | A
AFFX-BioB-5_at | J04423 E coli bioB gene biotin synthetase (-5, -M, -3 represent tra... | 124288 | P | 77263 | P
AFFX-BioB-M_at | J04423 E coli bioB gene biotin synthetase (-5, -M, -3 represent tra... | 177215 | P | 113251 | P
AFFX-BioB-3_at | J04423 E coli bioB gene biotin synthetase (-5, -M, -3 represent tra... | 105651 | P | 78284 | P
AFFX-BioC-5_at | J04423 E coli bioC protein (-5 and -3 represent transcript regions... | 107134 | P | 70346 | P
AFFX-BioC-3_at | J04423 E coli bioC protein (-5 and -3 represent transcript regions... | 96543 | P | 84231 | P
AFFX-BioDn-5_at | J04423 E coli bioD gene dethiobiotin synthetase (-5 and -3 repre... | 145965 | P | 84271 | P
AFFX-BioDn-3_at | J04423 E coli bioD gene dethiobiotin synthetase (-5 and -3 repre... | 431822 | P | 328727 | P
Data Preprocessing
• Scaling
– Identify the present, absent and marginal
expression levels.
– Scale the average of the fluorescent intensities
of all genes to a constant target intensity of 2500.
– Expression values above 45000 are capped at 45000,
and values below 100 are set to 1.
• Filtration
– Eliminate the genes with low or no variance
• Log transformation
– Log2(values)
Preprocessed Data:
[Figure: data before and after preprocessing]
Data Selection
• Training-Validation Approach:
– Training set for identifying the molecular signature.
– Validation set for estimating the proportion of
misclassifications.
• The dataset (N patients) is randomly split into a training
set (n) and a validation set (N-n), such that:
– Each set includes half the patients with and half
without a favorable outcome.
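One way to read this selection step is as a random split stratified by outcome, so that each outcome class is divided evenly between training and validation. The sketch below assumes that reading; `stratified_split` and its parameters are hypothetical names.

```python
import numpy as np

def stratified_split(outcomes, train_fraction=0.5, rng=None):
    """Randomly split sample indices, keeping each outcome class balanced."""
    rng = np.random.default_rng(rng)
    outcomes = np.asarray(outcomes)
    train_idx = []
    for label in np.unique(outcomes):
        members = np.flatnonzero(outcomes == label)
        rng.shuffle(members)  # random selection within the class
        n_train = int(round(train_fraction * len(members)))
        train_idx.extend(members[:n_train])
    train_idx = np.sort(np.array(train_idx, dtype=int))
    val_idx = np.setdiff1d(np.arange(len(outcomes)), train_idx)
    return train_idx, val_idx

# 6 favorable (1) and 4 unfavorable (-1) patients
outcomes = np.array([1] * 6 + [-1] * 4)
train, val = stratified_split(outcomes, rng=0)
```

Calling the function repeatedly with different seeds gives the multiple random training/validation sets used later in the deck.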
Correlation
• Clinical outcome
– Favorable = 1 (continuous complete remission)
– Unfavorable = -1 (relapse)
• Correlate expression values of each gene
with the clinical outcome
– Pearson’s correlation coefficient
• Determined the molecular signature
– defined by the top 50 highest correlated genes.
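A minimal sketch of the signature step, assuming outcomes coded as 1 / -1 and assuming genes are ranked by the absolute value of Pearson's correlation (the slide does not say whether the sign is kept). The name `molecular_signature` is illustrative.

```python
import numpy as np

def molecular_signature(expr, outcome, size=50):
    """Return indices of the `size` genes most correlated with outcome, plus all r values."""
    expr = np.asarray(expr, dtype=float)        # genes x samples
    outcome = np.asarray(outcome, dtype=float)  # 1 = favorable, -1 = unfavorable
    xc = expr - expr.mean(axis=1, keepdims=True)
    yc = outcome - outcome.mean()
    denom = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())
    # Pearson's r per gene; constant genes get r = 0 instead of NaN.
    r = np.divide(xc @ yc, denom, out=np.zeros(len(expr)), where=denom > 0)
    order = np.argsort(-np.abs(r), kind="stable")
    return order[:size], r

outcome = np.array([1, 1, -1, -1])
expr = np.array([[2.0, 2.0, 0.0, 0.0],   # tracks the outcome
                 [0.0, 0.0, 3.0, 3.0],   # tracks it inversely
                 [5.0, 5.0, 5.0, 5.0],   # constant
                 [3.0, 1.0, 1.0, 1.0]])  # weakly correlated
signature, r = molecular_signature(expr, outcome, size=2)
```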
Data Classification
(Nearest Centroid Prediction Rule)
• A new point is classified based on which centroid
(favorable or unfavorable) is nearest.
• Data is 50-dimensional.
• A PCA plot is used to plot the data.
• Principal component analysis (PCA) is a powerful
tool for analysing data by identifying patterns in it.
[Figure: PCA plot showing the favorable and unfavorable centroids]
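The nearest-centroid rule can be illustrated with a short sketch. Euclidean distance is an assumption here (the slides do not name the distance measure), and `fit_centroids` / `predict` are hypothetical helper names.

```python
import numpy as np

def fit_centroids(train_expr, train_outcome):
    """Mean signature profile per outcome class; train_expr is genes x samples."""
    train_expr = np.asarray(train_expr, dtype=float)
    train_outcome = np.asarray(train_outcome)
    return {label: train_expr[:, train_outcome == label].mean(axis=1)
            for label in (1, -1)}

def predict(centroids, expr):
    """Assign each sample (column) the label of its nearest centroid."""
    expr = np.asarray(expr, dtype=float)
    labels = list(centroids)
    dists = np.stack([np.linalg.norm(expr - centroids[l][:, None], axis=0)
                      for l in labels])
    return np.array(labels)[dists.argmin(axis=0)]

train_expr = np.array([[0.0, 0.2, 5.0, 5.2],
                       [0.0, 0.1, 5.0, 5.1]])
train_outcome = np.array([1, 1, -1, -1])
centroids = fit_centroids(train_expr, train_outcome)

validation = np.array([[0.3, 4.8],
                       [0.1, 4.9]])
predicted = predict(centroids, validation)
# Proportion of misclassifications against known validation outcomes
error_rate = float(np.mean(predicted != np.array([1, -1])))
```

In the deck's setting the rows would be the 50 signature genes rather than the two toy rows used here.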
Results (cont’d.)
• Each of the 500 training sets provided a different
molecular signature.
• Plot of genes that occurred most frequently
in the molecular signature.
[Figure: Top 250 genes included in 500 molecular signatures. Axes: Probe IDs (1 to 248) vs. Number of signatures (0 to 12)]
Analysis
• The frequency of the genes participating
in defining the signature is quite low.
• This suggests that the molecular
signature is selected almost randomly
and is unstable.
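The stability check behind this analysis amounts to counting how often each gene enters the signature across random training sets. The sketch below is an assumption-laden illustration (hypothetical function names, absolute Pearson correlation for ranking, half of each outcome class per training set), run on synthetic data rather than the study data.

```python
import numpy as np
from collections import Counter

def top_correlated(expr, outcome, size):
    """Indices of the `size` genes with the highest |Pearson r| to outcome."""
    xc = expr - expr.mean(axis=1, keepdims=True)
    yc = outcome - outcome.mean()
    denom = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())
    r = np.divide(xc @ yc, denom, out=np.zeros(len(expr)), where=denom > 0)
    return np.argsort(-np.abs(r))[:size]

def signature_frequencies(expr, outcome, n_resamples=500, size=50, seed=0):
    """Count, per gene, the number of resampled signatures that include it."""
    rng = np.random.default_rng(seed)
    expr = np.asarray(expr, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    counts = Counter()
    for _ in range(n_resamples):
        # Draw half of each outcome class at random as the training set.
        train = np.concatenate([
            rng.permutation(np.flatnonzero(outcome == label))
            [: int((outcome == label).sum()) // 2]
            for label in (1, -1)
        ])
        counts.update(top_correlated(expr[:, train], outcome[train], size).tolist())
    return counts

# Synthetic example: gene 0 follows the outcome exactly, the rest is noise.
rng = np.random.default_rng(1)
outcome = np.array([1] * 8 + [-1] * 8)
expr = rng.normal(size=(30, 16))
expr[0] = outcome
counts = signature_frequencies(expr, outcome, n_resamples=20, size=2)
```

A gene that appears in only a handful of the resampled signatures, as in the plot described above, indicates that the signature depends heavily on which patients land in the training set.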
Phase II: Analysis of Microarray data using GSEA
(Gene Set Enrichment Analysis)
http://www.nature.com/ng/journal/v37/n1/full/ng1490.html
Methodology
• Data loading
• Data preprocessing
• Data selection
• GSEA – Determine enrichment scores
• Correlating with clinical outcome
• Classification of data
Preliminary steps
• Data loading
• Data preprocessing
• Data selection
(same as in Phase I)
GSEA
• Gene Set Enrichment Analysis
– A microarray data analysis method that
uses predefined gene sets and ranks of
genes to identify significant biological
changes in microarray data sets.
– GSEA provides an enrichment score that
measures the degree of enrichment of a
gene set at the top of a rank-ordered gene list
derived from the data set.
GSEA(cont’d)
• GSEA Inputs:
– List of genes ranked according to the expression
difference between two classes.
– a priori defined gene sets (ex. pathways), each
consisting of members drawn from the list of
genes.
• Ranking of genes is done using a distance
metric, Signal-to-Noise ratio (SNR).
http://www.mit.edu/~scyudits/He&Yuditskaya_final_project_report.pdf
Signal to Noise ratio
• The signal-to-noise ratio method looks at the
difference of the means in each of the classes
scaled by the sum of the standard
deviations:
((α)* sqrt(n)) ÷ σ
where α (signal) is the difference in mean
expressions of two classes and σ (noise) is
the standard deviation.
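As an illustration, the sketch below follows the prose description: the difference of the class means divided by the sum of the class standard deviations, computed per gene. This is an assumption; the formula written on the slide includes a sqrt(n) factor whose n is not defined there, so it is not reproduced. Function names and the small `eps` guard are hypothetical.

```python
import numpy as np

def signal_to_noise(expr, outcome, eps=1e-12):
    """Per-gene SNR: (mean_fav - mean_unf) / (std_fav + std_unf)."""
    expr = np.asarray(expr, dtype=float)  # genes x samples
    outcome = np.asarray(outcome)
    fav = expr[:, outcome == 1]
    unf = expr[:, outcome == -1]
    signal = fav.mean(axis=1) - unf.mean(axis=1)
    noise = fav.std(axis=1, ddof=1) + unf.std(axis=1, ddof=1)
    return signal / (noise + eps)  # eps avoids division by zero

def rank_genes(snr):
    """Gene indices sorted from highest to lowest SNR."""
    return np.argsort(-snr, kind="stable")

outcome = np.array([1, 1, -1, -1])
expr = np.array([[4.0, 6.0, 1.0, 3.0],   # higher in the favorable class
                 [1.0, 3.0, 4.0, 6.0],   # higher in the unfavorable class
                 [2.0, 4.0, 2.0, 4.0]])  # no difference
snr = signal_to_noise(expr, outcome)
ranking = rank_genes(snr)
```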
Implementation
• Determine SNR for each
microarray.
• Sort gene list based on
SNR values.
• The degree of enrichment
of the gene set is
measured by comparing
the SNR-ordered gene list
with the gene set (pathway).
http://www.nature.com/ng/journal/v37/n1/full/ng1490.html
Enrichment Score (ES)
• Walk down the ranked gene list, keeping a running sum:
– If the gene is in the gene set, increment the running sum by Y
– If the gene is not in the gene set, decrement the running sum by X
X = √(G / (N − G))
Y = √((N − G) / G)
G = number of genes in the set
N = number of genes in the ranked list
ES = greatest positive deviation of this running sum
across all genes
http://www.broad.mit.edu/gsea/doc/detailed_description_of_gsea_algorithm.doc
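A minimal sketch of this running-sum computation, assuming G counts only the set members that actually appear in the ranked list; the function name `enrichment_score` is illustrative. With these step sizes the running sum always returns to zero at the end of the list.

```python
import math

def enrichment_score(ranked_genes, gene_set):
    """Greatest positive deviation of the running sum over the ranked list."""
    members = set(gene_set)
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in members)
    if g == 0 or g == n:
        return 0.0  # step sizes are undefined; nothing to score
    hit = math.sqrt((n - g) / g)    # Y: step up for genes in the set
    miss = math.sqrt(g / (n - g))   # X: step down for genes outside it
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += hit if gene in members else -miss
        best = max(best, running)
    return best

ranked = ["a", "b", "c", "d"]
es_top = enrichment_score(ranked, {"a", "b"})     # set at the top of the list
es_bottom = enrichment_score(ranked, {"c", "d"})  # set at the bottom
```

A set concentrated at the top of the ranking gets a large score, while one at the bottom never rises above zero.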
Correlation & Classification
• Similar to phase I
– First, the top 50 pathways are selected to create
favorable and unfavorable centroids
– Next, the training and validation sets are classified
based on the nearest-centroid prediction rule.
Results (cont’d.)
• Each of the 500 training sets provided
a different molecular signature.
• Plot of pathways that occurred in over 150 of
the molecular signatures.
[Figure: Pathways included in over 150 signatures. Axes: Pathways (1 to 16) vs. Number of signatures (0 to 400)]
Results
Gene Expression
Average % =93.77%
Gene Set Based
Average % =97.88%
Results (cont’d)
Gene Expression
Average % =96.45%
Gene Set Based
Average % =93.80%
Results (cont’d)
Gene Expression
Average % =75.17%
Gene Set Based
Average % =52.91%
Results (cont’d)
Gene Expression
Average % =26.48%
Gene Set Based
Average % =47.76%
Three significant pathways
• Iron ion homeostasis
– Reduces tumor angiogenesis by protecting cells
from oxidative stress
• Unfolded protein response, positive
regulation of target gene transcription
– A stress-signaling pathway in tumor cells
• Tryptophan catabolism
– Has an antiproliferative effect on many tumor
cells
Conclusion
• Our results have shown that:
– The centroid classification based on
gene expression performs poorly with
the validation set.
– The GSEA method does not perform
any better than the gene expression
method.
Future Work
• Analysis with a different classification
approach.
• Using much larger data sets from
different samples.
Acknowledgements
• Dr. Bruce Hoff
• Dr. Soheil Shams
• SoCalBSI
References
1. Stefan Michiels, Serge Koscielny, Catherine Hill. Prediction of
cancer outcome with microarrays: a multiple random
validation strategy. Lancet, Vol. 365, 488–92 (2005).
2. Mootha, V. K., et al. PGC-1α-responsive genes involved in
oxidative phosphorylation are coordinately downregulated in
human diabetes. Nature Genetics, Vol. 34, 267-273 (2003).
3. http://www.broad.mit.edu/gsea/doc/detailed_description_of_gsea_algorithm.doc
4. http://www.mit.edu/~scyudits/He&Yuditskaya_final_project_report.pdf
5. http://www.nature.com/ng/journal/v37/n1/full/ng1490.html