Mining of Microarray, Proteomics,
and Clinical Data for Improved
Identification of Chronic Fatigue
Syndrome
Zoran Obradovic
Hongbo Xie, Slobodan Vucetic
Information Science and Technology Center
Temple University, Philadelphia
Biomarker Identification

• Objective: select a small number of informative attributes (genes; proteins)
• Useful for
  - disease diagnosis,
  - disease progress monitoring,
  - evaluation of treatment effects, etc.
• Challenges include many irrelevant attributes; uncertainty is due to
  - small sample size vs. number of attributes
  - lack of replicates
Approaches to Biomarker Identification

• Select
  - significantly differentially expressed genes (in microarray data), or
  - the most discriminative mass-charge peaks (in proteomics data)
• Measure differences among classes of samples using
  - statistical tests: t-test, ANOVA, and non-parametric tests
  - data mining procedures: SVM, neural networks, etc.
Limitations

Very noisy data is subject to false
discoveries

Relationships among selected attributes
are often ignored

For many diseases, multiple data
resources are available; however how to
use them together is often unclear
Our Approach

• Motivation:
  - For various diseases, the most discriminative genes are likely to correspond to a limited set of biological functions or pathways
• Hypothesis:
  - Focusing on key functional expression patterns could improve accuracy compared to analyzing individual gene expression readings
• Approach:
  - Exclude genes whose biological properties deviate from those of the other selected genes
Challenges of Biomarker Identification for Chronic Fatigue Syndrome (CFS)

• CFS diagnosis is less accurate than for some other diseases (e.g. cancer)
• The pathophysiology of CFS is insufficiently understood
• Diagnosis of CFS depends heavily on clinical practice
• Patients' responses are often subjective
• There are no standard criteria or laboratory techniques to reduce the risk of malpractice
CFS Data

• CFS microarray data
  - 79 arrays representing 39 clinically identified CFS samples and 40 non-CFS samples
  - 20,160 genes for each sample
  - Using the SOURCE database (http://source.stanford.edu), 13,213 genes were annotated by 4,110 unique GO terms
• CFS proteomics data
  - 65 samples representing 33 CFS and 32 non-CFS samples
  - Each sample was profiled under 48 conditions, with factors such as fractionation, protein-chip surfaces, and binding and elution conditions
• CFS clinical data
  - 227 samples representing 43 CFS, 60 NF, and 123 others; CFS/NF are defined by the Empiric attribute
  - Each sample contained 85 attributes
Task 1: Identifying Biomarker Genes from CFS Microarray Data

• Objective:
  - Identify a robust set of genes discriminating patients (CFS) from normal subjects (NF)
• Method:
  - Identify a Subset of Genes (SG) significantly different between CFS and NF in the training sample (use a non-parametric statistical test)
  - Select a subset of SG annotated with a specific function (use domain knowledge of GO)
  - Evaluate the method (use leave-one-out cross validation)
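
The evaluation step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the point it shows is that gene selection is repeated inside every leave-one-out fold, so the held-out sample never influences which genes are chosen. The nearest-centroid classifier is a stand-in for the SVM and decision tree used in the slides, and `top_mean_gap` is a hypothetical selector standing in for the KW test plus GO filtering.

```python
from statistics import mean

def top_mean_gap(train_rows, train_labels, n=1):
    """Hypothetical selector: indices of the n columns whose class means differ most."""
    classes = sorted(set(train_labels))
    gaps = []
    for c, col in enumerate(zip(*train_rows)):
        means = [mean(v for v, y in zip(col, train_labels) if y == k) for k in classes]
        gaps.append((max(means) - min(means), c))
    return [c for _, c in sorted(gaps, reverse=True)[:n]]

def nearest_centroid_predict(train_rows, train_labels, test_row):
    """Stand-in classifier (not the SVM from the slides): pick the closest class mean."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_labels)):
        rows = [r for r, y in zip(train_rows, train_labels) if y == label]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(test_row, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_accuracy(rows, labels, select_features):
    """Leave-one-out CV; the selector only ever sees the training fold."""
    correct = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        cols = select_features(train_rows, train_labels)

        def project(row):
            return [row[c] for c in cols]

        pred = nearest_centroid_predict(
            [project(r) for r in train_rows], train_labels, project(rows[i]))
        correct += pred == labels[i]
    return correct / len(rows)

# Toy data (made up): column 0 separates the classes, column 1 is noise.
rows = [[1.0, 5.0], [1.2, 1.0], [0.9, 3.0], [3.0, 4.0], [3.1, 2.0], [2.9, 0.5]]
labels = ["NF", "NF", "NF", "CFS", "CFS", "CFS"]
accuracy = loocv_accuracy(rows, labels, top_mean_gap)
```

Selecting genes once on the full data set and then cross-validating only the classifier would leak information from the test sample into the selection, which tends to inflate the reported accuracy.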
Identifying a Subset of Significant Genes by Kruskal-Wallis (KW) Test

• For each gene, its expression values in CFS samples and NF samples are compared; a p-value is obtained by comparison to a random population
• A gene with a p-value below a threshold is selected as significant (SG)
• Traditional approaches use those SG as markers to discriminate classes of samples. However:
  - A large proportion of such genes are irrelevant; applying false discovery rate control does not help much in most cases
  - Functional correlations among those genes are ignored
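
A minimal sketch of the per-gene KW selection, using only the Python standard library. It assumes two classes labeled "CFS" and "NF" and uses the standard chi-square approximation (1 degree of freedom for two groups) for the p-value; the slides do not say how the authors computed p-values, so this may differ from their procedure. The gene names and values are made up.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def kruskal_wallis_two_groups(a, b):
    """Tie-corrected H statistic and chi-square (df = 1) p-value for two groups."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    r_a, r_b = sum(ranks[:len(a)]), sum(ranks[len(a):])
    h = 12 / (n * (n + 1)) * (r_a ** 2 / len(a) + r_b ** 2 / len(b)) - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:  # all values identical: no evidence of a difference
        return 0.0, 1.0
    h = max(h / correction, 0.0)
    # Survival function of chi-square with 1 degree of freedom.
    return h, math.erfc(math.sqrt(h / 2))

def select_significant_genes(expr, labels, alpha=0.05):
    """expr maps gene -> per-sample values; returns {gene: p} for p < alpha."""
    selected = {}
    for gene, values in expr.items():
        cfs = [v for v, y in zip(values, labels) if y == "CFS"]
        nf = [v for v, y in zip(values, labels) if y == "NF"]
        _, p = kruskal_wallis_two_groups(cfs, nf)
        if p < alpha:
            selected[gene] = p
    return selected

# Toy example: geneA separates the classes, geneB does not.
labels = ["CFS"] * 5 + ["NF"] * 5
expr = {
    "geneA": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "geneB": [1, 3, 5, 7, 9, 2, 4, 6, 8, 10],
}
selected = select_significant_genes(expr, labels, alpha=0.05)
```

With group sizes this small the chi-square approximation is rough; a permutation-based p-value would be an alternative.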
Selecting Significant Functions by Hypergeometric Test

Given:
Set of k genes selected by the KW test

Objective:
Determine whether a given term GO_i is overrepresented in the selection

The idea:
If the gene selection were random, the number X_i of selected genes annotated with GO_i would follow a hypergeometric distribution

Approach:
The significance of GO_i is measured by the p-value of GO_i, $p(GO_i) = P(X \ge X_i)$, where
• $X \sim H(K, k, k_i)$, with K the total number of genes considered, so that k/K is the selection probability for a random gene
• $k_i$ is the number of genes annotated with GO_i
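
The upper-tail probability P(X ≥ X_i) can be computed directly from binomial coefficients. The sketch below assumes K is the total number of genes considered (the slide does not define K explicitly); the function name and the example numbers are illustrative only.

```python
from math import comb

def go_term_p_value(K, k, k_i, x_i):
    """P(X >= x_i) when k genes are drawn at random from K genes,
    k_i of which are annotated with the GO term."""
    total = comb(K, k)
    tail = sum(comb(k_i, j) * comb(K - k_i, k - j)
               for j in range(x_i, min(k, k_i) + 1))
    return tail / total

# Toy numbers: 10 genes, 4 annotated, 3 selected, 2 of the selected are annotated.
p = go_term_p_value(K=10, k=3, k_i=4, x_i=2)
```

A small p-value means that the KW-selected genes contain more genes annotated with GO_i than a random selection of the same size would be expected to.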
KW Statistical Attribute Selection

All genes selected by the KW test, $\{g_i : p_i < \alpha\}$ (where $p_i$ is the KW p-value of gene $g_i$ and $\alpha$ is the threshold), are used as attributes in classification
Knowledge Based Selection

• TopGO:
  - Select the GO term with the smallest p-value, GO*. Use only genes selected by the KW test and annotated with GO*
• nGO:
  - Select the GO terms with the n smallest p-values, GOn. Use only genes selected by the KW test and annotated with GOn
• AllTopGO:
  - Use all genes annotated with GO*
• AllSignificantGO:
  - Use all significant genes annotated with one of the significant GO terms
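
The four knowledge-based variants differ only in which GO terms are kept and whether the KW filter is applied. A minimal sketch, assuming the KW-selected genes are available as a set, the GO annotations as a mapping from term to gene set, and the hypergeometric p-values as a mapping from term to p-value. The function name, the `alpha` cutoff for "significant GO terms", and the toy identifiers are assumptions for illustration.

```python
def knowledge_based_selection(significant_genes, annotations, go_p_values,
                              method="TopGO", n=10, alpha=0.05):
    """Return the gene set used as attributes under the named variant."""
    ranked = sorted(go_p_values, key=go_p_values.get)  # smallest p-value first
    if method == "AllTopGO":
        # All genes annotated with the top term, significant or not.
        return set(annotations[ranked[0]])
    if method == "TopGO":
        terms = ranked[:1]
    elif method == "nGO":
        terms = ranked[:n]
    elif method == "AllSignificantGO":
        terms = [t for t in ranked if go_p_values[t] < alpha]
    else:
        raise ValueError(f"unknown method: {method}")
    annotated = set()
    for term in terms:
        annotated |= annotations[term]
    return significant_genes & annotated

# Toy inputs (hypothetical identifiers).
annotations = {"GO:A": {"g1", "g2", "g5"}, "GO:B": {"g2", "g3"}, "GO:C": {"g4", "g6"}}
go_p_values = {"GO:A": 0.001, "GO:B": 0.03, "GO:C": 0.4}
significant = {"g1", "g2", "g3", "g4"}

top_go = knowledge_based_selection(significant, annotations, go_p_values, "TopGO")
two_go = knowledge_based_selection(significant, annotations, go_p_values, "nGO", n=2)
all_top = knowledge_based_selection(significant, annotations, go_p_values, "AllTopGO")
all_sig = knowledge_based_selection(significant, annotations, go_p_values, "AllSignificantGO")
```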
Comparison of 5 Attribute Selection Methods on CFS Data

Using domain knowledge in the attribute selection procedure improved the prediction accuracy

TopGO was the most accurate domain knowledge based attribute selection method

Decision Tree classifiers were less accurate than the corresponding Support Vector Machines (SVM)
Comparison Details of 5 Attribute Selection Methods for p-value = 0.05

Selection Approach    Decision Trees    SVM
KW Statistical        53%               53%
TopGO                 59%               72%
10GO                  54%               58%
AllTopGO              56%               61%
AllSignificantGO      53%               60%
Further Comparison of KW Statistical vs. TopGO Selection

Accuracy (%) and number of selected attributes, by KW test threshold:

                      KW Statistical                         TopGO
KW test threshold     SVM    Decision Tree    Attributes     SVM    Decision Tree    Attributes
0.001                 58     53               41             54     46               1
0.01                  48     56               257            48     48               3
0.05                  53     53               1296           72     59               17
0.2                   49     51               3761           56     62               19

• Overall, knowledge based TopGO selection was the most accurate (SVM: 58% vs. 72%; Decision Tree: 55% vs. 62%)
• For a very small threshold, KW Statistical selection was slightly more accurate
• However, knowledge based TopGO selection always used far fewer attributes
Comparison Using Same Number of Attributes (Statistical vs. TopGO by SVM)

[Figure: accuracy vs. number of selected features (1, 2, 3, 16, 17, 19) for domain dependent feature selection and classical feature selection, using SVM.]
Comparison Using Same Number of Attributes (Statistical vs. TopGO by Decision Tree)

[Figure: accuracy vs. number of selected features (1, 2, 3, 16, 17, 19) for domain dependent feature selection and classical feature selection, using Decision Tree.]
Comparison of SVM vs. Decision Trees for TopGO Attribute Selection

[Figure: two series, SVM and Decision Tree; horizontal axis ticks 1, 2, 3, 16, 17, 19.]
Most Overrepresented GO Terms among Significantly Differentially Expressed Genes in CFS (by TopGO)

Gene Ontology ID    Function/Process Name                   p-value    Number of Selected Genes
GO:0006397          mRNA processing                         0.0016     10
GO:0008203          cholesterol metabolism                  0.0021     7
GO:0003779          actin binding                           0.0027     31
GO:0015629          actin cytoskeleton                      0.0078     14
GO:0016564          transcriptional repressor activity      0.0105     9
GO:0005515          protein binding                         0.0136     124
GO:0007187          G-protein signaling                     0.0153     5
GO:0008009          chemokine activity                      0.0153     5
GO:0007229          integrin-mediated signaling pathway     0.0155     9
GO:0007517          muscle development                      0.016      14

* The top 2 functions are consistent with a previously reported result on CFS (Whistler, T. et al., J Transl Med. 2003; 1: 10)
mRNA Processing Genes Identified as Potential Biomarkers

Gene Name                                           Gene ID      Symbol      UniGene      P-value
Debranching enzyme homolog 1                        AK000116     DBR1        Hs.477700    0.0086
Cleavage and polyadenylation specific factor 6      NM_007007    CPSF6       Hs.369606    0.0096
Small nuclear ribonucleoprotein polypeptide N       AF101044     SNRPN       Hs.525700    0.0131
Hypothetical protein                                BC006407     MGC14151    Hs.333414    0.0186
Heterogeneous nuclear ribonucleoprotein L-like      BC008217     HNRPLL      Hs.445497    0.0212
TRNA splicing endonuclease 2 homolog                AK074794     SEN2L       Hs.335550    0.0223
Poly(A) polymerase beta                             AF218840     PAPOLB      Hs.487409    0.0333
ELAV-like 4 (Hu antigen D)                          BC036071     ELAVL4      Hs.213050    0.0376
ER to nucleus signaling 1                           AF059198     ERN1        Hs.133982    0.0395
Nuclear ribonucleoprotein polypeptide               J04564       SNRPB       Hs.83753     0.0444
Nuclear RNA export factor 1                         AF112880     NXF1        Hs.523739    0.0499
Using Only Significant Genes Associated
with a Given Function

Several key functions could well
discriminate Chronic Fatigue Syndrome
from non-Fatigue population

How to select the best function(s) for out of
sample prediction is still a challenge

The most overrepresented functions
identified by our analysis were the most
discriminative
Accuracy by Using Only Significant Genes Associated with a Given Function (p-value < 0.05)

Function name                        Category          Number of selected attributes    Accuracy (%)
hydrolase activity                   Function          54                               77
cholesterol metabolism               Process           7                                75
lyase activity                       Function          6                                75
GTPase activator activity            Function          12                               73
ATP binding                          Function          84                               72
mesoderm development                 Process           4                                72
sarcoglycan complex                  Cell component    3                                72
telomeric DNA binding                Function          2                                72
chromatin binding                    Function          4                                71
steroid hormone receptor activity    Function          8                                71
mRNA processing                      Process           10                               70
Evaluation on Additional Data

• Central Nervous System (CNS)
• CNS data source:
  - "Prediction of Central Nervous System Embryonal Tumour Outcome Based on Gene Expression", Letters to Nature, Nature, 415:436-442, January 2002.
  - http://www-genome.wi.mit.edu/mpr/CNS/
• Description: the data set contains 60 patient samples, of which 21 are survivors and 39 are failures. There are 7129 genes in the dataset.
Results on CNS Data

                      Traditional Accuracy (%)     TopGO Accuracy (%)
KW test threshold     SVM    Decision Tree         SVM    Decision Tree
0.001                 47     47                    58     50
0.01                  43     27                    45     43
0.05                  58     32                    63     60
0.1                   57     32                    62     57
0.2                   57     32                    60     62

• Findings were consistent with the CFS analysis
Task 2: Proteomics Based Approach to Diagnostics

Proteomics Based CFS Data Analysis

Overall data preprocessing protocol:
• Baseline correction
• Peak alignment
• Spectra normalization
• Spectrogram smoothing
• Normalization using QC samples:
  - For each test sample under every condition, its m/z value is divided by the control QC m/z value, followed by taking the log (a relative ratio is obtained for each test sample).
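
The QC normalization step can be illustrated as a per-peak log ratio. The slide wording is ambiguous, so this sketch assumes that the value recorded for a test sample at each mass/charge position is divided by the QC value at the same position and that the natural logarithm is then taken. The function name and the numbers are hypothetical.

```python
import math

def qc_log_ratio(sample_values, qc_values):
    """Log of the sample-to-QC ratio at each aligned mass/charge position.
    Both inputs are assumed to be positive and already peak-aligned."""
    if len(sample_values) != len(qc_values):
        raise ValueError("sample and QC spectra must have the same length")
    return [math.log(s / q) for s, q in zip(sample_values, qc_values)]

# Toy spectra: the first peak is twice the QC level, the second equals it.
ratios = qc_log_ratio([2.0, 1.5], [1.0, 1.5])
```

A value of 0 means the sample matches the QC reference at that position; positive and negative values indicate higher and lower readings, respectively.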
Proteomics Based CFS Classification Procedure

• Used leave-one-sample-out cross validation to train and test the data
  - The prediction for replicates of the same sample is obtained by voting, with a tie labeled as CFS
  - Kruskal-Wallis analysis of ranks and the Median test are applied to all mass/charge values. P-values are ranked, and peaks with a p-value below a threshold are selected as attributes.
  - A p-value threshold of 0.05 resulted in the selection of over 2000 attributes
• Trained an SVM classifier with the selected attributes and evaluated it on discriminating out-of-sample test data
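
The replicate voting rule (majority vote, with ties resolved as CFS) is small enough to state in code. The labels "CFS" and "NF" and the function name are assumptions for illustration.

```python
from collections import Counter

def vote_sample_label(replicate_predictions, tie_label="CFS"):
    """Majority vote over the replicate predictions of one sample;
    a tie is labeled as CFS, as described in the procedure."""
    counts = Counter(replicate_predictions)
    cfs, nf = counts.get("CFS", 0), counts.get("NF", 0)
    if cfs == nf:
        return tie_label
    return "CFS" if cfs > nf else "NF"

tied = vote_sample_label(["CFS", "NF"])
majority_nf = vote_sample_label(["NF", "CFS", "NF"])
```

Resolving ties as CFS biases the combined prediction toward the patient class, which favors sensitivity over specificity.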
Result of Proteomics Based CFS Classification

The accuracy of our method in separating CFS samples from NF samples was only slightly better than that of a trivial predictor

IMAC chips provided the best overall results. The accuracy of an ensemble of IMAC classifiers under leave-one-sample-out cross-validation was 51%.
Task 3: Combining Microarray Data and Proteomics Data for CFS Diagnosis (SVM)

• Used integrated data of 38 subjects (20 CFS and 18 non-CFS samples) containing both proteomics and microarray data
• Proteomics and microarray-based CFS predictions agreed for 50% of the sample (19 subjects)
• When the two classification methods agreed, the accuracy of the combined approach was significantly improved to 79%
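
The combined evaluation can be sketched as: keep only the subjects on which both classifiers give the same label, then measure accuracy on that subset. The function and the toy subject IDs below are hypothetical. On the reported numbers, 19 of 38 subjects is the 50% agreement, and 79% accuracy on those 19 would correspond to about 15 correct predictions, assuming accuracy was computed on the agreed subset.

```python
def combine_predictions(microarray_pred, proteomics_pred, truth):
    """Return (agreement rate, accuracy on agreed subjects, agreed subject IDs).
    All three inputs map subject ID -> label."""
    agreed = [s for s in truth if microarray_pred[s] == proteomics_pred[s]]
    agreement_rate = len(agreed) / len(truth)
    if not agreed:
        return agreement_rate, None, agreed
    correct = sum(microarray_pred[s] == truth[s] for s in agreed)
    return agreement_rate, correct / len(agreed), agreed

# Toy example with four subjects (made-up labels).
truth = {"s1": "CFS", "s2": "NF", "s3": "CFS", "s4": "NF"}
micro = {"s1": "CFS", "s2": "NF", "s3": "NF", "s4": "CFS"}
prot = {"s1": "CFS", "s2": "CFS", "s3": "NF", "s4": "CFS"}
rate, acc, agreed = combine_predictions(micro, prot, truth)
```

The subjects outside `agreed` are left without a combined prediction, so the higher accuracy comes at the cost of coverage.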

Task 4: Analysis of Clinical CFS Data

• Motivation:
  - The reason for the low prediction accuracy could lie in the CFS clinical data attributes
• Objective:
  - Detect potential factors that explain the disagreement between the microarray and proteomics CFS classifiers
• Approach:
  - Applied ANOVA to each attribute across two groups of clinical data (subjects on which the microarray and proteomics predictions agree vs. the remaining subjects, on which the microarray and proteomics predictions disagree)
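
A minimal sketch of the per-attribute comparison, computing the one-way ANOVA F statistic for the "agree" and "disagree" groups with the standard library only. Converting F to a p-value needs the F distribution, which is not in the standard library, so this sketch only ranks attributes by F. The attribute names and values are made up.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    values = [v for g in groups for v in g]
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)

def rank_attributes(attribute_groups):
    """attribute_groups maps attribute -> (agree values, disagree values).
    Returns (attribute, F) pairs, largest F first."""
    scores = {name: one_way_anova_f(list(groups))
              for name, groups in attribute_groups.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy clinical attributes (hypothetical names and scores).
attribute_groups = {
    "attr_x": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "attr_y": ([1.0, 2.0, 3.0], [2.0, 1.0, 3.0]),
}
ranking = rank_attributes(attribute_groups)
```

With 85 clinical attributes tested one by one, some correction for multiple testing would normally be considered before calling an attribute significant.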
Result of Clinical Data Analysis by ANOVA

• Three of the clinical data classifying attributes were found to be significantly different between the two groups
  - Mental health
  - Physical fatigue
  - General fatigue
• The low accuracy of CFS diagnosis could be partially attributed to the clinical definition of the disease
Conclusions

• Complementing statistical gene selection with domain knowledge, to focus on the most significantly overrepresented GO terms, was beneficial for
  - improving accuracy
  - identifying a much smaller number of attributes
• Integrating information from multiple sources (microarray, proteomics and clinical data) could lead to improved understanding and diagnosis of CFS
Thank You !
More information:
http://www.ist.temple.edu
Contact:
Zoran Obradovic, director
IST Center, Temple University
215-204-6265