Mining of Microarray, Proteomics,
and Clinical Data for Improved
Identification of Chronic Fatigue
Syndrome
Zoran Obradovic
Hongbo Xie, Slobodan Vucetic
Information Science and Technology Center
Temple University, Philadelphia
Biomarker Identification
Objective:
Select a small number of informative
attributes (genes, proteins)
Useful for:
disease diagnosis,
disease progress monitoring,
evaluation of treatment effects, etc.
Challenges:
many irrelevant attributes
uncertainty due to
small sample size vs. number of attributes
lack of replicates
Approaches to Biomarkers
Identification
Select
significantly differentially expressed genes
(in microarray data), or
the most discriminative mass-charge peaks
(in proteomics data)
Measure difference among classes of
samples using
statistical tests:
t-test,
ANOVA, and non-parametric tests
data mining procedures:
SVM,
neural networks, etc.
Limitations
Very noisy data is subject to false
discoveries
Relationships among selected attributes
are often ignored
For many diseases, multiple data
resources are available; however, how to
use them together is often unclear
Our Approach
Motivation:
For various diseases the most discriminative
genes are likely to correspond to a limited
set of biological functions or pathways
Hypothesis:
Focusing on key functional expression
patterns could result in improved accuracy
as compared to analyzing individual gene
expression readings
Approach:
Exclude genes whose biological properties
deviate from other selected genes
Challenges of Biomarker Identification for
Chronic Fatigue Syndrome (CFS)
CFS diagnosis is less accurate than for some
other diseases (e.g. cancer)
Pathophysiology of CFS is insufficiently
understood
Diagnosis of CFS is highly dependent on
clinical practice
Patients’ responses are often subjective
There are no standard criteria or laboratory
techniques to reduce the risk of malpractice
CFS Data
CFS Microarray data
79 arrays representing 39 clinically identified CFS
samples and 40 non-CFS samples
20,160 genes for each sample
Using the SOURCE database (http://source.stanford.edu),
13,213 genes were annotated with 4,110 unique GO terms
CFS Proteomics data
65 samples representing 33 CFS and 32 non-CFS
samples
Each sample was profiled under 48 conditions, with
factors such as fractionation, protein-chip surfaces, and
binding and elution conditions
CFS Clinical data
227 samples representing 43 CFS, 60 NF, and 123
others; CFS/NF are defined by the Empiric attribute
Each sample contained 85 attributes
Task 1: Identifying Biomarker
Genes from CFS Microarray Data
Objective:
Identify a robust set of genes discriminating
patients (CFS) from normal subjects (NF)
Method:
Identify a Subset of Genes (SG) significantly
different between CFS and NF in training
sample (use a non-parametric statistical test)
Select a subset of SG annotated with a specific
function (use domain knowledge of GO)
Evaluate the method (use leave-one-out
cross-validation)
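The evaluation loop can be sketched as follows. This is a minimal illustration, not the authors' code: `leave_one_out_accuracy`, `select_features`, and `nearest_mean_predict` are hypothetical names, and the nearest-class-mean rule is only a stand-in for the SVM and decision tree classifiers used in the study. The point it shows is that gene selection is repeated inside every fold, using the training samples only.

```python
def nearest_mean_predict(train_x, train_y, x, features):
    """Stand-in classifier (assumption): pick the class whose mean is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        dist = sum((sum(r[f] for r in rows) / len(rows) - x[f]) ** 2 for f in features)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(samples, labels, select_features, train_and_predict):
    """Hold out each sample once; select features on the remaining samples only."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        features = select_features(train_x, train_y)  # never sees the held-out sample
        predicted = train_and_predict(train_x, train_y, samples[i], features)
        correct += predicted == labels[i]
    return correct / len(samples)
```

Here `select_features` would wrap the statistical test and the GO-based filtering, returning the column indices to keep.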
Identifying a Subset of Significant
Genes by Kruskal-Wallis (KW) Test
For each gene, its expression values in CFS
samples and NF samples are compared, and a p-value
is obtained under the null hypothesis that both
groups come from the same population
A gene with a p-value below a threshold is
selected as significant (SG)
Traditional approaches use those SG as markers
to discriminate classes of samples. However:
A large proportion of such genes are irrelevant;
applying false discovery rate control does not help much
in most cases
Functional correlations among those genes are
ignored
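A minimal sketch of the per-gene test, assuming two groups (CFS and NF) and the usual chi-square approximation with one degree of freedom. The function names and the dictionary layout of `expression` are illustrative assumptions, not part of the original study.

```python
import math


def kruskal_wallis_two_groups(a, b):
    """Return (H, p) for two samples, with tie correction and a chi-square(1) p-value."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n = len(pooled)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for idx in range(i, j + 1):
            ranks[idx] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sums = [0.0, 0.0]
    for (_, group), r in zip(pooled, ranks):
        rank_sums[group] += r
    h = 12 / (n * (n + 1)) * (rank_sums[0] ** 2 / len(a) + rank_sums[1] ** 2 / len(b)) - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:  # every value identical: no evidence of a difference
        return 0.0, 1.0
    h /= correction
    p = math.erfc(math.sqrt(max(h, 0.0) / 2))  # survival function of chi-square with 1 df
    return h, p


def select_significant_genes(expression, labels, threshold=0.05):
    """expression maps gene -> list of values, aligned with labels ('CFS' or 'NF')."""
    selected = []
    for gene, values in expression.items():
        cfs = [v for v, y in zip(values, labels) if y == "CFS"]
        nf = [v for v, y in zip(values, labels) if y == "NF"]
        _, p = kruskal_wallis_two_groups(cfs, nf)
        if p < threshold:
            selected.append(gene)
    return selected
```

With only a handful of samples per group the chi-square approximation is rough; an exact or permutation-based p-value may be preferable in that regime.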
Selecting Significant Functions by
Hypergeometric Test
Given:
Set of k genes selected by KW test
Objective:
Determine whether a given term GO_i is overrepresented
in the selection
The idea:
If the gene selection were random, the number X_i of
selected genes annotated with GO_i would follow a
hypergeometric distribution
Approach: The significance of GO_i is measured by
the p-value of GO_i = $P(X \ge X_i)$, where
$X \sim H(K, k, k_i)$ is the number of annotated genes in a random selection of k genes
K is the total number of genes
k_i is the number of genes annotated with GO_i
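The upper-tail probability can be computed directly from binomial coefficients. The sketch below assumes K is the total number of genes, k the number selected by the KW test, k_i the number annotated with the term, and x_i the observed overlap; `hypergeom_pvalue` is an illustrative name.

```python
from math import comb


def hypergeom_pvalue(K, k, k_i, x_i):
    """P(X >= x_i) when k genes are drawn without replacement from K genes,
    k_i of which carry the annotation."""
    upper = min(k, k_i)  # the overlap cannot exceed either set size
    total = comb(K, k)
    tail = sum(comb(k_i, x) * comb(K - k_i, k - x) for x in range(x_i, upper + 1))
    return tail / total
```

For example, with K = 10, k = 3, k_i = 4 and an observed overlap of 2, the tail is (6·6 + 4·1) / 120 = 1/3.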
KW Statistical Attribute Selection
All genes selected by the KW test,
{g_i : p_i < threshold}, where p_i is the KW p-value of gene g_i, are used as attributes in
classification
Knowledge Based Selection
TopGO:
Select the GO term with the smallest p-value, GO*. Use only
genes selected by the KW test and annotated with GO*
nGO:
Select the GO terms with the n smallest p-values, GOn. Use only
genes selected by the KW test and annotated with GOn
AllTopGO:
Use all genes annotated with GO*
AllSignificantGO:
Use all significant genes annotated with one of the
significant GO terms
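The four knowledge-based variants differ only in which GO terms and which gene pool they use, which a short sketch makes explicit. The function name, the data layout (`annotations` maps each gene to a set of GO terms), and the `alpha` cut-off for calling a GO term significant are assumptions made for illustration.

```python
def knowledge_based_selection(significant_genes, annotations, go_pvalues, n=10, alpha=0.05):
    """Return the gene set chosen by each knowledge-based selection variant."""
    ranked = sorted(go_pvalues, key=go_pvalues.get)  # GO terms, smallest p-value first
    top = ranked[0]

    def annotated(terms, genes):
        wanted = set(terms)
        return {g for g in genes if annotations.get(g, set()) & wanted}

    significant_terms = [t for t in ranked if go_pvalues[t] < alpha]
    return {
        "TopGO": annotated([top], significant_genes),
        "nGO": annotated(ranked[:n], significant_genes),
        "AllTopGO": annotated([top], annotations),  # all genes, not only KW-selected ones
        "AllSignificantGO": annotated(significant_terms, significant_genes),
    }
```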
Comparison of 5 Attribute Selection
Methods on CFS Data
Using domain knowledge in the attribute
selection procedure improved the
prediction accuracy
TopGO was the most accurate
domain-knowledge-based attribute selection
method
Decision tree classifiers were less
accurate than the corresponding Support
Vector Machines (SVM)
Comparison Details of 5 Attribute
Selection Methods for p-value = 0.05

Selection Approach    Decision Trees    SVM
KW Statistical        53%               53%
TopGO                 59%               72%
10GO                  54%               58%
AllTopGO              56%               61%
AllSignificantGO      53%               60%
Further Comparison of KW Statistical
vs. TopGO Selection

              KW Statistical                             TopGO
KW test       SVM        Decision Tree   Number of       SVM        Decision Tree   Number of
threshold     acc. (%)   acc. (%)        attributes      acc. (%)   acc. (%)        attributes
0.001         58         53              41              54         46              1
0.01          48         56              257             48         48              3
0.05          53         53              1296            72         59              17
0.2           49         51              3761            56         62              19
•Overall, knowledge-based TopGO selection was the most accurate
(SVM: 58% vs. 72%; Decision Tree: 55% vs. 62%)
•For a very small threshold, KW Statistical selection was slightly more
accurate
•However, knowledge-based TopGO selection always used far fewer
attributes
Comparison Using Same Number of
Attributes (Statistical vs. TopGO by SVM)
[Figure: accuracy vs. number of selected features (1, 2, 3, 16, 17, 19) for domain-dependent and classical feature selection]
Comparison Using Same Number of
Attributes (Statistical vs. TopGO by Decision Tree)
[Figure: accuracy vs. number of selected features (1, 2, 3, 16, 17, 19) for domain-dependent and classical feature selection]
Comparison of SVM vs. Decision Trees
for TopGO Attribute Selection
[Figure: SVM and Decision Tree curves plotted at 1, 2, 3, 16, 17, and 19 on the horizontal axis]
Most Overrepresented GO Terms among Significantly
Differentially Expressed Genes in CFS (by TopGO)

Gene Ontology ID   Function/Process Name                  p-value   Number of Selected Genes
GO:0006397         mRNA processing                        0.0016    10
GO:0008203         cholesterol metabolism                 0.0021    7
GO:0003779         actin binding                          0.0027    31
GO:0015629         actin cytoskeleton                     0.0078    14
GO:0016564         transcriptional repressor activity     0.0105    9
GO:0005515         protein binding                        0.0136    124
GO:0007187         G-protein signaling                    0.0153    5
GO:0008009         chemokine activity                     0.0153    5
GO:0007229         integrin-mediated signaling pathway    0.0155    9
GO:0007517         muscle development                     0.016     14
* The top 2 functions are consistent with previously reported
results on CFS (Whistler, T. et al., J. Transl. Med. 2003; 1: 10)
mRNA Processing Genes Identified
as Potential Biomarkers

Gene Name                                         Gene ID      Symbol      UniGene      P-value
Debranching enzyme homolog 1                      AK000116     DBR1        Hs.477700    0.0086
Cleavage and polyadenylation specific factor 6    NM_007007    CPSF6       Hs.369606    0.0096
Small nuclear ribonucleoprotein polypeptide N     AF101044     SNRPN       Hs.525700    0.0131
Hypothetical protein                              BC006407     MGC14151    Hs.333414    0.0186
Heterogeneous nuclear ribonucleoprotein L-like    BC008217     HNRPLL      Hs.445497    0.0212
TRNA splicing endonuclease 2 homolog              AK074794     SEN2L       Hs.335550    0.0223
Poly(A) polymerase beta                           AF218840     PAPOLB      Hs.487409    0.0333
ELAV-like 4 (Hu antigen D)                        BC036071     ELAVL4      Hs.213050    0.0376
ER to nucleus signaling 1                         AF059198     ERN1        Hs.133982    0.0395
Nuclear ribonucleoprotein polypeptide             J04564       SNRPB       Hs.83753     0.0444
Nuclear RNA export factor 1                       AF112880     NXF1        Hs.523739    0.0499
Using Only Significant Genes Associated
with a Given Function
Several key functions could discriminate
Chronic Fatigue Syndrome well
from the non-fatigue population
How to select the best function(s) for
out-of-sample prediction is still a challenge
The most overrepresented functions
identified by our analysis were the most
discriminative
Accuracy by Using Only Significant Genes
Associated with a Given Function (p-value < 0.05)

Function name                        Category          Number of selected attributes   Accuracy (%)
hydrolase activity                   Function          54                              77
cholesterol metabolism               Process           7                               75
lyase activity                       Function          6                               75
GTPase activator activity            Function          12                              73
ATP binding                          Function          84                              72
mesoderm development                 Process           4                               72
sarcoglycan complex                  Cell component    3                               72
telomeric DNA binding                Function          2                               72
chromatin binding                    Function          4                               71
steroid hormone receptor activity    Function          8                               71
mRNA processing                      Process           10                              70
Evaluation on Additional Data
Central Nervous System (CNS)
CNS Data Source:
"Prediction of Central Nervous System
Embryonal Tumour Outcome Based on Gene
Expression", Letters to Nature, Nature,
415:436-442, January 2002.
http://www-genome.wi.mit.edu/mpr/CNS/
Description: The data set contains 60
patient samples: 21 survivors and 39
failures. There are 7,129 genes in the
dataset.
Results on CNS Data

              Traditional Accuracy (%)    TopGO Accuracy (%)
KW test
threshold     SVM     Decision Tree       SVM     Decision Tree
0.001         47      47                  58      50
0.01          43      27                  45      43
0.05          58      32                  63      60
0.1           57      32                  62      57
0.2           57      32                  60      62
•Findings were consistent with the CFS analysis
Task 2: Proteomics Based Approach
to Diagnostics
Proteomics Based CFS Data Analysis
Overall data preprocessing protocol:
Baseline correction
Peak alignment
Spectra normalization
Spectrum smoothing
Normalization using QC samples:
For each test sample under every condition, its
m/z value is divided by the control QC m/z value,
followed by taking the logarithm (a relative
ratio is obtained for each test sample).
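The QC normalization step might look like the sketch below. It assumes the "log" step is a plain logarithm of the sample-to-QC ratio, that both spectra are sampled at the same m/z positions, and that all values are positive; the natural logarithm and the name `log_ratio_to_qc` are illustrative choices, since the slides do not state the base.

```python
import math


def log_ratio_to_qc(sample_spectrum, qc_spectrum):
    """Divide each sample value by the matching QC value and take the log of the ratio."""
    if len(sample_spectrum) != len(qc_spectrum):
        raise ValueError("spectra must be aligned to the same m/z positions")
    return [math.log(s / q) for s, q in zip(sample_spectrum, qc_spectrum)]
```

A value of 0 then means the sample matches the QC control at that position, and positive or negative values indicate relative excess or deficit.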
Proteomics Based CFS Classification
Procedure
Used leave-one-sample-out cross-validation to
train and test the data
The prediction for replicates of the same sample is obtained by
voting, with ties labeled as CFS
Kruskal-Wallis analysis of ranks and the median test are
applied to all mass/charge values. P-values are ranked,
and peaks with a p-value below a threshold are
selected as attributes.
A p-value threshold of 0.05 resulted in the selection of over
2000 attributes
Trained an SVM classifier with the selected attributes
and evaluated it on discriminating out-of-sample
test data
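The replicate-voting rule described above is simple enough to state in code. This is a sketch with a hypothetical function name; the only behavior taken from the slides is majority voting with ties resolved in favor of CFS.

```python
def vote_replicates(replicate_predictions, tie_label="CFS", other_label="NF"):
    """Majority vote over the replicate-level predictions of one sample; ties go to CFS."""
    cfs_votes = sum(1 for p in replicate_predictions if p == tie_label)
    other_votes = len(replicate_predictions) - cfs_votes
    return tie_label if cfs_votes >= other_votes else other_label
```

Resolving ties toward CFS biases the sample-level prediction slightly toward the disease class, which is worth keeping in mind when comparing against a trivial predictor.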
Results of Proteomics Based CFS
Classification
The accuracy of our method in separating CFS
samples from NF samples was only slightly
better than that of a trivial predictor
IMAC chips provided the best overall
results. The accuracy of an ensemble of
IMAC classifiers by leave-one-sample-out cross-validation was 51%.
Task 3: Combining Microarray Data and
Proteomics Data for CFS Diagnoses (SVM)
Used integrated data of 38 subjects
(20 CFS and 18 non-CFS samples)
containing both proteomics and
microarray data
Proteomics and microarray-based CFS
predictions agreed for 50% of the samples
(19 subjects)
When the two classification methods agreed,
the accuracy of the combined approach
improved significantly, to 79%
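The agreement-based evaluation can be sketched as follows: keep only the subjects on which the two classifiers give the same label, then measure accuracy on that subset. The function name and return convention are assumptions for illustration.

```python
def combine_on_agreement(microarray_pred, proteomics_pred, truth):
    """Return (agreement_rate, accuracy_on_agreed); accuracy is None if nothing agrees."""
    agreed = [(m, t) for m, p, t in zip(microarray_pred, proteomics_pred, truth) if m == p]
    agreement_rate = len(agreed) / len(truth)
    if not agreed:
        return agreement_rate, None
    accuracy = sum(1 for m, t in agreed if m == t) / len(agreed)
    return agreement_rate, accuracy
```

Note that this accuracy covers only the agreed subset; subjects on which the classifiers disagree are left without a combined prediction.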
Task 4: Analysis of Clinical CFS Data
Motivation:
The reason for the low prediction accuracy could lie in
the CFS clinical data attributes
Objective:
Detect potential factors that explain the
disagreement between the microarray and
proteomics CFS classifiers
Approach:
Applied ANOVA to each attribute of two
groups of clinical data (subjects
for whom microarray and proteomics predictions
agree vs. the remaining subjects, for whom
they disagree)
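For a single clinical attribute, the one-way ANOVA statistic compares the variation between the "agree" and "disagree" groups with the variation within them. The sketch below computes only the F statistic; `one_way_anova_f` is a hypothetical name, and turning F into a p-value needs the F distribution (for example `scipy.stats.f_oneway`, which returns both).

```python
def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    # Raises ZeroDivisionError if every group has zero internal variance.
    return (ss_between / df_between) / (ss_within / df_within)
```

Running this once per attribute across the 85 clinical attributes amounts to many tests, so some multiple-testing adjustment would normally be considered.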
Results of Clinical Data Analysis by
ANOVA
Three of the clinical data classifying
attributes were found to be significantly
different between the two groups
Mental health
Physical fatigue
General fatigue
Low accuracy of CFS diagnosis could be
partially attributed to the clinical definition of
the disease
Conclusions
Complementing statistical gene selection
with domain knowledge to focus on the
most significantly overrepresented GO
terms was beneficial for
improving accuracy
identifying a much smaller number of attributes
Integrating information from multiple
sources (microarray, proteomics and clinical
data) could lead to improved understanding
and diagnosis of CFS
Thank You !
More information:
http://www.ist.temple.edu
Contact:
Zoran Obradovic, director
IST Center, Temple University
215-204-6265