Microarray Data Mining: Dealing with Challenges
Gregory Piatetsky-Shapiro
KDnuggets
© 2005 KDnuggets
Dec 8, 2005
DNA and Gene Expression
[Diagram: cell → nucleus → chromosome → gene (DNA); gene expression: gene (DNA) → gene (mRNA, single strand) → protein]
Graphics courtesy of the National Human Genome Research Institute
BYU, Dec 8, 2005
Gene Expression
- Cells are different because of differential gene expression.
- About 40% of human genes are expressed at any one time.
- A gene is expressed by transcribing DNA exons into single-stranded mRNA.
- mRNA is later translated into a protein.
- Microarrays measure the level of mRNA expression.
Gene Expression Measurement
- mRNA expression represents dynamic aspects of the cell.
- mRNA expression can be measured with the latest technology.
- mRNA is isolated and labeled with a fluorescent protein.
- mRNA is hybridized to the target; the level of hybridization corresponds to light emission, which is measured with a laser.
Affymetrix DNA Microarrays
[Image courtesy of Affymetrix]
Some Probes Match
[Diagram: match – the probe pairs base-for-base with the complementary gene sequence (TAATCGTACGTT…)]
Other Probes Mismatch
[Diagram: mismatch – the probe sequence (ATTAGTATGCTT…) does not pair with the gene fragment]
Affymetrix Microarray Concept
1. mRNA segments are tagged with a fluorescent chemical
2. Tagged segments match with complementary probes
3. Fluorescence is measured with a laser
[Image courtesy of Affymetrix]
From Raw Image to Numbers
Scanner → cleaning and processing software → expression values
[Enlarged section of a raw scanner image]

Gene              Value
D26528_at          193
D26561_cds1_at     -70
D26561_cds2_at     144
D26561_cds3_at      33
D26579_at          318
D26598_at         1764
D26599_at          153
Microarray Potential Applications
- New and better molecular diagnostics
  - Jan 11, 2005: FDA approved the Roche Diagnostics AmpliChip, based on Affymetrix technology
- New molecular targets for therapy
  - few new drugs so far, but a large pipeline, …
- Improved treatment outcomes
  - outcome partially depends on genetic signature
- Fundamental biological discovery
  - finding and refining biological pathways
- Personalized medicine?!
Microarray Data Analysis Types
- Gene selection
  - find genes for therapeutic targets (new drugs)
- Classification (supervised learning)
  - identify disease
  - predict outcome / select the best treatment
- Clustering (unsupervised)
  - find new biological classes / refine existing ones
- Discovery of associations and pathways
Outline
- DNA and Biology
- Microarray Classification – Best Practices
- Gene Set Analysis
- Synthetic Microarray Data Sets
Microarray Data Analysis Challenges
- Biological, process, and other variation
- Models need to be explainable to biologists
- Few records (samples), usually < 100
- Many columns (genes), usually > 1,000
- This combination is very likely to produce false positives: "discoveries" due to random noise
Data Mining from Small Data: Example

Country   Favorite Food   Heart-Disease Rate
US        Wine            High
France    Wine            Low
England   Beer            High
Germany   Beer            Low
Data Mining from Small Data: Conclusion
- You can drink whatever you want
- It is speaking English that causes heart disease
Same Data, Different Results – Who Is Right?
- There are many examples (e.g. from the CAMDA conferences) where researchers analyzed the same data and found different results and gene sets.
- Who is right?
- Good methodology is needed
  - for avoiding errors
  - for getting the best results
Capturing Best Practices for Microarray Data Analysis
- Developed with SPSS and S. Ramaswamy (MIT / Whitehead Institute)
- Implemented as SPSS Microarray CATs (Clementine Application Templates)
- Also implemented using Weka and open-source software

Capturing Best Practice for Microarray Gene Expression Data Analysis, G. Piatetsky-Shapiro, T. Khabaza, S. Ramaswamy, Proceedings of KDD-2003
www.kdnuggets.com/gpspubs/
Best Practices
- Capture the complete process, from raw data to final results
- Do gene (feature) selection inside cross-validation
- Use randomization testing
- Use robust classification algorithms
  - simple methods give good results
  - advanced methods can be better
- Use a wrapper approach to select the best gene subset
- Use bagging to improve accuracy
- Remove or relabel mislabeled or poorly differentiated samples
Observations
- The simplest approaches are the most robust
- Advanced approaches can be more accurate
- A "small" increase in diagnostic accuracy (e.g. 90% to 98%) can greatly reduce the rate of false positives (5-fold)
[Plot: p(disease | test=yes) vs. test accuracy (0–100%), for disease rate = 0.01]
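The plotted curve follows directly from Bayes' rule. A minimal sketch, assuming (as such plots usually do) that the test's sensitivity and specificity both equal its accuracy:

```python
def p_disease_given_positive(accuracy, disease_rate=0.01):
    """P(disease | test=yes) by Bayes' rule, assuming
    sensitivity = specificity = accuracy."""
    true_pos = accuracy * disease_rate               # sick and correctly flagged
    false_pos = (1 - accuracy) * (1 - disease_rate)  # healthy but flagged anyway
    return true_pos / (true_pos + false_pos)
```

At 90% accuracy only about 8% of positive tests indicate real disease; at 98% accuracy this rises to about 33%, because the false-positive stream (1 − accuracy) shrinks from 10% to 2% of healthy patients – the 5-fold reduction mentioned above.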
Microarray Classification: Desired Features
- Robust in the presence of false positives
- Stable under cross-validation
- Results understandable by biologists
- Returns confidence / probability
- Fast enough
Popular Classification Methods
- Decision trees / rules
  - find the smallest gene sets, but not robust – poor performance
- Neural nets – work well for a reduced number of genes
- K-nearest neighbor – good results for a small number of genes, but builds no model
- Naïve Bayes – simple and robust, but ignores gene interactions
- Support Vector Machines (SVM)
  - good accuracy, does its own gene selection, but hard to understand
- Specialized methods, e.g. D/S/A (Dudoit), …
Gene Reduction Improves Classification Accuracy – Why?
- Most learning algorithms look for non-linear combinations of features
- With few records and many genes, they can easily find spurious combinations – the "false positives" problem
- Classification accuracy improves if we first reduce the number of genes by a linear method
  - e.g. T-values of the mean difference
Feature Selection Approach
- Rank genes by a measure and select the top 100–200
- T-test for mean difference:
  t = (Avg1 − Avg2) / sqrt(σ1²/N1 + σ2²/N2)
- Signal to noise (S2N):
  S2N = (Avg1 − Avg2) / (σ1 + σ2)
- Other methods …
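Both ranking measures are a few lines of NumPy. A sketch (an illustration, not the talk's actual implementation), where `X` is a samples × genes matrix and `y` holds 0/1 class labels:

```python
import numpy as np

def t_scores(X, y):
    """Per-gene t-statistic for the mean difference between the two classes."""
    a, b = X[y == 0], X[y == 1]
    num = a.mean(axis=0) - b.mean(axis=0)
    den = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return num / den

def s2n_scores(X, y):
    """Per-gene signal-to-noise ratio: mean difference over the summed SDs."""
    a, b = X[y == 0], X[y == 1]
    return (a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0, ddof=1) + b.std(axis=0, ddof=1))

def top_genes(scores, k=100):
    """Indices of the k genes with the largest absolute score."""
    return np.argsort(-np.abs(scores))[:k]
```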
Global Feature / Parameter Selection Is Wrong
[Flowchart: gene data + class data → data cleaning & preparation → feature/parameter selection on ALL the data → split into train and test data → model building → evaluation]
Because information about the test samples is "leaked" via gene selection, this leads to overly "optimistic" classification results.
Global Feature Selection Bias
- Example: data with 6,000 features, ~100 samples
- Used 3 Weka algorithms: SVM, IB1, J48
- The error estimated with global feature selection was 50% to 150% lower than the cross-validated estimate – an overly "optimistic" error estimate
[Chart: average % decrease in the error estimate from using global feature selection, for IB1, J48, and SVM, at 5, 10, 20, 30, and 50 best features]
Microarray Classification Process
[Flowchart: gene data + class data are split into train and test data first; data cleaning & preparation, feature/parameter selection, and model building use only the train data; evaluation uses the held-out test data]
Evaluation
- Evaluation on a single test dataset is not accurate (for small datasets)
- Evaluation, including feature/parameter selection, needs to be inside cross-validation
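The rule can be sketched end-to-end. In this illustration the base learner is a toy nearest-centroid classifier (a stand-in, not the classifiers used in the talk); the crucial detail is that the gene ranking is recomputed from scratch inside every fold:

```python
import numpy as np

def cv_error_selection_inside(X, y, k_genes=10, n_folds=5, seed=0):
    """Cross-validated error estimate where gene selection happens INSIDE each
    fold, so the held-out samples never influence which genes are chosen."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for fold in folds:
        test = np.zeros(len(y), dtype=bool)
        test[fold] = True
        Xtr, ytr = X[~test], y[~test]
        # signal-to-noise gene ranking, computed ONLY on the training fold
        diff = Xtr[ytr == 0].mean(0) - Xtr[ytr == 1].mean(0)
        spread = Xtr[ytr == 0].std(0) + Xtr[ytr == 1].std(0) + 1e-12
        genes = np.argsort(-np.abs(diff / spread))[:k_genes]
        # nearest-centroid classifier on the selected genes
        c0 = Xtr[ytr == 0][:, genes].mean(0)
        c1 = Xtr[ytr == 1][:, genes].mean(0)
        d0 = ((X[test][:, genes] - c0) ** 2).sum(1)
        d1 = ((X[test][:, genes] - c1) ** 2).sum(1)
        pred = (d1 < d0).astype(int)
        errors.append((pred != y[test]).mean())
    return float(np.mean(errors))
```

Moving the three "selection" lines outside the fold loop reproduces the biased, overly optimistic global-selection estimate from the previous slides.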
Evaluation Inside X-Validation
[Diagram: the gene data is split; Final Test set 1 is held out and the rest is used for training; the full microarray classification process runs on the training part, producing Final Model 1 and Final Result 1]
Evaluation Inside X-Validation, 2
[Diagram: the same process with Final Test set 2 held out, producing Final Model 2 and Final Result 2]
Evaluation Inside X-Validation, N
[Diagram: …and so on through Final Test set N, producing Final Model N and Final Result N]
Which of the N Models Is Your Final Model?
- None of them!
- The average error from models 1, …, N is the best estimate of the error of the process
- You then apply the same process to the entire data to get your final and best model
Iterative Wrapper Approach to Selecting the Best Gene Set
- A model with the top 100 genes is not optimal
- Test models using the 1, 2, 3, …, 10, 20, 30, 40, …, N top genes, with cross-validation
- Gene selection:
  - simple: an equal number of genes from each class
  - advanced: the best number from each class
- For "randomized" algorithms (e.g. neural nets), average 10 cross-validation runs
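A sketch of the wrapper loop (illustrative only: the leave-one-out loop and nearest-centroid learner are stand-ins for the cross-validated classifiers in the talk). For each candidate gene count the whole select-then-classify process is re-evaluated, and the count with the lowest error wins:

```python
import numpy as np

def loo_error_top_k(X, y, k):
    """Leave-one-out error using the top-k genes, with the gene ranking
    recomputed for every left-out sample (selection stays inside the loop)."""
    wrong = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        Xtr, ytr = X[keep], y[keep]
        diff = Xtr[ytr == 0].mean(0) - Xtr[ytr == 1].mean(0)
        spread = Xtr[ytr == 0].std(0) + Xtr[ytr == 1].std(0) + 1e-12
        genes = np.argsort(-np.abs(diff / spread))[:k]
        c0 = Xtr[ytr == 0][:, genes].mean(0)
        c1 = Xtr[ytr == 1][:, genes].mean(0)
        x = X[i, genes]
        pred = int(((x - c1) ** 2).sum() < ((x - c0) ** 2).sum())
        wrong += int(pred != y[i])
    return wrong / len(y)

def best_gene_count(X, y, counts=(1, 2, 3, 5, 10, 20, 30)):
    """Wrapper search: try each candidate count, keep the lowest-error one."""
    errors = {k: loo_error_top_k(X, y, k) for k in counts}
    return min(errors, key=errors.get), errors
```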
Best Gene Set: One X-Validation Run
[Chart: average error for 10-fold cross-validation (0–30%) vs. genes per class (1, 2, 3, 4, 5, 10, 20, 30, 40), for a single cross-validation run]
Best Gene Set: 10 X-Validation Runs
- Select the gene set with the lowest combined error
- Good, but not optimal!
[Chart: average, high, and low error rates across all classes]
Multi-Class Classification
- Simple: one model for all classes
- Advanced: a separate model for each class
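The "separate model for each class" option is the classic one-vs-rest scheme. A minimal sketch, using a centroid-margin score as a stand-in for the per-class models (an assumption for illustration, not the talk's classifier):

```python
import numpy as np

def one_vs_rest_predict(X_train, y_train, X_test):
    """Train one binary 'class vs. rest' centroid model per class and
    predict the class whose model gives the largest margin."""
    classes = np.unique(y_train)
    margins = []
    for c in classes:
        pos = X_train[y_train == c].mean(0)   # centroid of the class
        neg = X_train[y_train != c].mean(0)   # centroid of everything else
        # signed margin: positive when a test point is nearer the class centroid
        margins.append(((X_test - neg) ** 2).sum(1) - ((X_test - pos) ** 2).sum(1))
    return classes[np.argmax(np.column_stack(margins), axis=1)]
```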
Error Rates for Each Class
[Chart: error rate vs. genes per class, one curve per class]
Example: Pediatric Brain Tumor Data
- 92 samples, 5 classes (MED, EPD, JPA, MGL, RHB), from the U. of Chicago Children's Hospital
- Photomicrographs of tumors (400×)
Single Neural Net

Class   1-Net Error Rate
MED     2.1%
MGL     17%
RHB     24%
EPD     9%
JPA     19%
*ALL*   8.3%
Bagging Improves Results
- Bagging, or simple voting of N different neural nets, improves the accuracy
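Bagging itself is classifier-agnostic. A sketch with a pluggable `fit_predict` base learner standing in for the neural nets used in the talk:

```python
import numpy as np

def bagged_predict(X_train, y_train, X_test, fit_predict, n_models=100, seed=0):
    """Train n_models base learners on bootstrap resamples of the training
    data and combine them by majority vote; the vote fraction doubles as a
    per-sample confidence score."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y_train), size=len(y_train))  # bootstrap sample
        votes.append(fit_predict(X_train[idx], y_train[idx], X_test))
    votes = np.asarray(votes)
    majority = np.array([np.bincount(col).argmax() for col in votes.T])
    confidence = (votes == majority).mean(axis=0)
    return majority, confidence
```

Samples that get a high vote fraction but the wrong label are exactly the "suspected errors" flagged on the next slide.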
Bagging 100 Networks

Class   1-Net Error Rate   Bag Error Rate   Bag Avg Conf
MED     2.1%               2% (0)*          98%
MGL     17%                10%              83%
RHB     24%                11%              76%
EPD     9%                 0                91%
JPA     19%                0                81%
*ALL*   8.3%               3% (2)*          92%

- Note (*): suspected error on one sample (labeled as MED but consistently classified as RHB)
Cross-Validated Prediction Strength
[Plot: per-sample cross-validated prediction strength; MED samples show strong predictions; one sample appears misclassified, and a few are poorly differentiated]
Outline
- DNA and Biology
- Microarray Classification – Best Practices
- Gene Set Analysis
- Synthetic Microarray Data Sets
Gene Sets – Key to Power
- The sheer number of genes is a problem for single-gene analysis
- Analyzing gene sets increases statistical power
- Group genes statistically (prototypes, correlations)
- Group genes using biological knowledge (pathways, Gene Ontology, medical text, …)
Gene Set Analysis
- Gene Set Enrichment
  - Mootha et al., Nature Genetics, July 2003
  - Question: are the genes involved in oxidative phosphorylation (~100 genes) related to diabetes?
  - No gene in the OXPHOS group is individually significant (avg. ~20% decrease)
OXPHOS Gene Set Example
- The decrease is consistent for 89% of OXPHOS genes
[Plot: mean expression of all genes (gray) and of OXPHOS genes (red) for people with DM2 (type 2 diabetes mellitus) vs. normal]
Gene Sets from Gene Ontology (FatiGO)
- Cluster query: a subset of genes; reference: the remaining genes
- http://bioinfo.cnio.es/docus/courses/FatiGO/
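The cluster-vs-reference comparison behind FatiGO-style tools reduces, per GO term, to a one-sided hypergeometric (Fisher exact) test. A minimal standard-library sketch (an illustration, not FatiGO's actual algorithm):

```python
from math import comb

def enrichment_p_value(total_genes, genes_in_term, cluster_size, hits_in_cluster):
    """P(X >= hits_in_cluster) under the hypergeometric null: drawing
    cluster_size genes at random from total_genes, of which genes_in_term
    belong to the GO term being tested."""
    denom = comb(total_genes, cluster_size)
    p = 0.0
    for k in range(hits_in_cluster, min(cluster_size, genes_in_term) + 1):
        p += comb(genes_in_term, k) * comb(total_genes - genes_in_term,
                                           cluster_size - k) / denom
    return p
```

For example, a 5-gene cluster drawn entirely from a 5-gene term among 10 genes total gives p = 1/252 ≈ 0.004. In practice the p-values must also be corrected for testing many GO terms at once.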
Comprehensive Microarray Test Suite
- Getting high accuracy is very important
- Current data has too few samples, and the "true" labels may be unknown
- Which methods are best for given data, and what is their estimated accuracy?
- Proposal: use realistic simulated microarray data to evaluate how current algorithms perform under different conditions
Global Analysis of Microarray Data
- Analyzed the means of all genes across samples
- Found: log(means > 1) ~ normal distribution
[Plot: log10(means) (blue) vs. normal quantiles (red)]
- Other regularities: log(SD) ~ a·log(means) + b (plot not shown)
True Rank vs. T-Value Rank
- Simulated 100 microarray datasets, each with 6,000 genes and 30 samples (using R)
- The first 10 genes were up-regulated by an UP factor
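The talk's simulations used R; the same idea can be sketched in NumPy using the regularities from the previous slide. The log-normal parameters and the log(SD) slope/intercept below are illustrative assumptions, not values fitted in the talk:

```python
import numpy as np

def synthetic_microarray(n_genes=6000, n_samples=30, n_up=10, up_factor=2.0, seed=0):
    """Generate one synthetic two-class microarray dataset: gene means are
    log-normal, log(SD) is linear in log(mean), and the first n_up genes are
    up-regulated by up_factor in the second class."""
    rng = np.random.default_rng(seed)
    means = rng.lognormal(mean=5.0, sigma=1.0, size=n_genes)  # log-normal means
    sds = np.exp(0.8 * np.log(means) - 0.5)                   # log(SD) ~ a*log(mean) + b
    X = rng.normal(loc=means, scale=sds, size=(n_samples, n_genes))
    y = np.repeat([0, 1], n_samples // 2)
    X[y == 1, :n_up] *= up_factor                             # the "true" differential genes
    return X, y
```

Because the truly up-regulated genes are known by construction, one can then measure how often a t-value ranking actually places them at the top.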
Future Research Directions
- Why is the gene distribution log-normal?
- Develop a synthetic microarray data generator
- Which gene selection methods are most robust, and under what conditions?
- Evaluate algorithm accuracy for a given test set by creating similar synthetic sets with known answers
Acknowledgements
- Eric Bremer, Children's Hospital (Chicago) & Northwestern U.
- Tom Khabaza, SPSS
- Sridhar Ramaswamy, MIT/Broad Institute
- Pablo Tamayo, MIT/Broad Institute
SIGKDD Explorations Special Issue on Microarray Data Mining
www.acm.org/sigkdd/explorations/
Thank You!
Further resources:
- Data mining: www.KDnuggets.com
- Microarrays: www.KDnuggets.com/websites/microarray.html
- Data mining course (one semester): www.KDnuggets.com/dmcourse