Data Mining in Genomics: the dawn of personalized medicine
Gregory Piatetsky-Shapiro
KDnuggets
www.KDnuggets.com/gps.html
Connecticut College, October 15, 2003
Overview
• Data Mining and Knowledge Discovery
• Genomics and Microarrays
• Microarray Data Mining
Trends leading to Data Flood
• More data is generated:
  - bank, telecom, and other business transactions ...
  - scientific data: astronomy, biology, etc.
  - Web, text, and e-commerce
• More data is captured:
  - storage technology faster and cheaper
  - DBMS capable of handling bigger DBs
Knowledge Discovery Process
[Process diagram: Raw Data → (Understanding, Integration) → Data Warehouse → Target Data → Transformed Data → Patterns and Rules → (Interpretation & Evaluation) → Knowledge]
Major Data Mining Tasks
• Classification: predicting an item class
• Clustering: finding clusters in data
• Associations: e.g. A & B & C occur frequently
• Visualization: to facilitate human discovery
• Summarization: describing a group
• Estimation: predicting a continuous value
• Deviation Detection: finding changes
• Link Analysis: finding relationships
Major Application Areas for Data Mining Solutions
• Advertising
• Bioinformatics
• Customer Relationship Management (CRM)
• Database Marketing
• Fraud Detection
• eCommerce
• Health Care
• Investment/Securities
• Manufacturing, Process Control
• Sports and Entertainment
• Telecommunications
• Web
Genome, DNA & Gene Expression
• An organism’s genome is the “program” for making the organism, encoded in DNA
• Human DNA has about 30,000-35,000 genes
• A gene is a segment of DNA that specifies how to make a protein
• Cells are different because of differential gene expression
• About 40% of human genes are expressed at any one time
• Microarray devices measure gene expression
Molecular Biology Overview
[Diagram: cell → nucleus → chromosome → gene (DNA) → gene expression → gene (mRNA, single strand) → protein. Graphics courtesy of the National Human Genome Research Institute]
Affymetrix Microarrays
[Chip image: 1.28 cm array, 50 µm features]
• ~10^7 oligonucleotides: half Perfectly Match mRNA (PM), half have one Mismatch (MM)
• Gene expression computed from PM and MM (see sketch below)
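As a rough illustration of the last bullet, here is a minimal numpy sketch of the classic average-difference idea: estimate the expression of one probe set as the mean of its PM minus MM probe-pair intensities. The function name and the toy intensities are assumptions, and this is a simplification of, not the actual, Affymetrix algorithm.

    import numpy as np

    def average_difference(pm, mm):
        """Rough expression estimate for one probe set:
        mean of (PM - MM) over its probe pairs."""
        pm = np.asarray(pm, dtype=float)
        mm = np.asarray(mm, dtype=float)
        return (pm - mm).mean()   # can be negative when MM > PM, as in real data

    # toy probe-pair intensities (hypothetical values)
    pm = [812, 640, 955, 700, 1020]
    mm = [400, 590, 480, 620, 300]
    print(average_difference(pm, mm))   # 347.4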
Affymetrix Microarray Raw Image
[Scanner image: enlarged section of the raw image; the corresponding raw data values are shown below]

Gene             Value
D26528_at          193
D26561_cds1_at     -70
D26561_cds2_at     144
D26561_cds3_at      33
D26579_at          318
D26598_at         1764
D26599_at         1537
D26600_at         1204
D28114_at          707
Microarray Potential Applications
• New and better molecular diagnostics
• New molecular targets for therapy
  - few new drugs, large pipeline, ...
• Outcome depends on genetic signature
  - best treatment?
• Fundamental Biological Discovery
  - finding and refining biological pathways
• Personalized medicine?!
Microarray Data Mining Challenges
• Avoiding false positives, due to
  - too few records (samples), usually < 100
  - too many columns (genes), usually > 1,000 (see the simulation sketch below)
• Model needs to be robust in presence of noise
• For reliability, need large gene sets; for diagnostics or drug targets, need small gene sets
• Estimate class probability
• Model needs to be explainable to biologists
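To make the false-positive risk concrete, here is a small simulation (numpy/scipy, purely illustrative): with roughly the shape of a microarray study (72 samples, 7,000 genes) and data that is pure noise, an unadjusted per-gene t-test at p < 0.01 still flags on the order of 70 genes.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    n_samples, n_genes = 72, 7000                    # typical microarray shape
    X = rng.standard_normal((n_samples, n_genes))    # pure noise, no real signal
    y = np.array([0] * 38 + [1] * 34)                # arbitrary class labels

    _, pvals = ttest_ind(X[y == 0], X[y == 1], axis=0)   # one t-test per gene
    print((pvals < 0.01).sum())                      # ~70 genes "significant" by chance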
False Positives in Astronomy
[Cartoon used with permission]
CATs: Clementine Application Templates
• CATs - examples of complete data mining processes
• Microarray CAT: Preparation, MultiClass, 2-Class, Clustering
Key Ideas
• Capture the complete process
• X-validation loop with feature selection inside
• Randomization to select significant genes
• Internal iterative feature selection loop
• For each class, separate selection of optimal gene sets
• Neural nets – robust in presence of noise
• Bagging of neural nets
Microarray Classification
[Diagram: Data is split into Train data and Test data; Train data feeds Feature and Parameter Selection and Model Building; Test data is used only for Evaluation (see the sketch below)]
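A compact sketch of this pipeline using scikit-learn (the data shapes are placeholders and logistic regression stands in for whatever model is built): gene selection and model fitting use only the train data, and the test data is touched once, for evaluation.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(1)
    X = rng.standard_normal((72, 7000))              # samples x genes (placeholder)
    y = rng.integers(0, 2, 72)                       # class labels (placeholder)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)

    model = make_pipeline(SelectKBest(f_classif, k=20),      # feature/gene selection
                          LogisticRegression(max_iter=1000)) # model building
    model.fit(X_tr, y_tr)                            # both steps fitted on train data only
    print("test accuracy:", model.score(X_te, y_te)) # evaluation on held-out test data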
Classification: External X-val
[Diagram: an outer cross-validation loop over the Gene Data. Inside each training fold, Data → Train data → Feature and Parameter Selection → Model Building, with that fold's Test data used for Evaluation. The Final Model is then applied to a held-out FinalTest set to produce the Final Results (see the sketch below)]
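A sketch of this external cross-validation scheme, again with placeholder data and scikit-learn stand-ins: the selection-plus-training pipeline is rebuilt inside every outer fold, and a FinalTest split that no selection step ever sees is scored exactly once at the end.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, train_test_split
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(2)
    X = rng.standard_normal((72, 7000))              # gene data (placeholder)
    y = rng.integers(0, 2, 72)

    # hold out a final test set before any gene selection happens
    X_dev, X_final, y_dev, y_final = train_test_split(X, y, test_size=0.25,
                                                      stratify=y, random_state=0)

    def build_model():
        return make_pipeline(SelectKBest(f_classif, k=20),
                             LogisticRegression(max_iter=1000))

    # outer x-validation: selection and training are redone inside every fold
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = [build_model().fit(X_dev[tr], y_dev[tr]).score(X_dev[te], y_dev[te])
              for tr, te in outer.split(X_dev, y_dev)]
    print("x-val accuracy estimate:", np.mean(scores))

    # final model trained on all development data, scored once on the final test set
    print("final test accuracy:", build_model().fit(X_dev, y_dev).score(X_final, y_final))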
Measuring false positives with randomization
[Diagram: the class labels (1/2) attached to the gene expression table are randomly shuffled 500 times; the T-values computed on the randomized data form a null distribution. Bottom 1% T-value = -2.08. Select potentially interesting genes at the 1% level (see the sketch below)]
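A sketch of this randomization procedure in numpy/scipy, with placeholder data: shuffle the class labels 500 times, recompute every gene's T-value each time, and use the 1% tails of that null distribution (the slide's bottom-1% cutoff of -2.08 came from the real data) as the threshold for potentially interesting genes.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(4)
    X = rng.standard_normal((72, 7000))       # samples x genes (placeholder data)
    y = rng.integers(1, 3, 72)                # class labels 1 / 2, as on the slide

    def t_values(X, y):
        t, _ = ttest_ind(X[y == 1], X[y == 2], axis=0)   # one T-value per gene
        return t

    # null distribution: T-values after 500 random relabelings of the samples
    null_t = np.concatenate([t_values(X, rng.permutation(y)) for _ in range(500)])
    low, high = np.quantile(null_t, [0.01, 0.99])        # bottom / top 1% cutoffs

    real_t = t_values(X, y)
    interesting = np.flatnonzero((real_t < low) | (real_t > high))
    print(len(interesting), "potentially interesting genes at the 1% level")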
Gene Reduction improves Classification
• Most learning algorithms look for non-linear combinations of features -- they can easily find many spurious combinations given a small # of records and a large # of genes
• Classification accuracy improves if we first reduce the # of genes by a linear method, e.g. T-values of the mean difference
• Heuristic: select an equal # of genes from each class (see sketch below)
• Then apply a favorite machine learning algorithm
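A sketch of the heuristic above (numpy/scipy; the helper name and k are illustrative): rank genes by the two-sample T-value of the mean difference and keep an equal number of top genes from each end, i.e. genes favouring class 1 and genes favouring class 2.

    import numpy as np
    from scipy.stats import ttest_ind

    def select_genes_per_class(X, y, k=10):
        """Indices of the k genes most up in class 1 and the k most up in class 2,
        ranked by the two-sample T-value of the mean difference."""
        t, _ = ttest_ind(X[y == 1], X[y == 2], axis=0)
        top_class1 = np.argsort(t)[-k:]       # largest T: higher mean in class 1
        top_class2 = np.argsort(t)[:k]        # smallest T: higher mean in class 2
        return np.concatenate([top_class1, top_class2])

    # usage: reduce the gene set first, then hand X[:, genes] to any learner
    # genes = select_genes_per_class(X_train, y_train, k=10)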
Iterative Wrapper approach to selecting the best gene set
• Test models using the 1, 2, 3, ..., 10, 20, 30, 40, ..., 100 top genes, with x-validation
• Heuristic 1: evaluate errors for each class; select the # of genes from each class that minimizes the error for that class
• For randomized algorithms, average 10+ cross-validation runs!
• Select the gene set with the lowest average error (see sketch below)
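A sketch of this wrapper loop (numpy + scikit-learn, simplified to overall error rather than per-class error; it reuses the select_genes_per_class helper sketched above, and logistic regression again stands in for the real model): for each candidate number of genes per class, average the error over repeated cross-validation runs and keep the size with the lowest average.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.linear_model import LogisticRegression

    def cv_error(X, y, k, n_repeats=10):
        """Average x-validation error using the top k genes per class,
        with gene selection redone inside every training fold."""
        errors = []
        for rep in range(n_repeats):                            # average 10+ runs
            cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=rep)
            for tr, te in cv.split(X, y):
                genes = select_genes_per_class(X[tr], y[tr], k) # helper from above
                clf = LogisticRegression(max_iter=1000).fit(X[tr][:, genes], y[tr])
                errors.append(1 - clf.score(X[te][:, genes], y[te]))
        return np.mean(errors)

    sizes = list(range(1, 11)) + [20, 30, 40, 50, 60, 70, 80, 90, 100]  # 1,...,10,20,...,100
    # best_k = min(sizes, key=lambda k: cv_error(X_train, y_train, k))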
Clementine stream for subset selection by x-validation
[Screenshot of the Clementine stream]
Microarrays: ALL/AML Example
• Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999
• 72 examples (38 train, 34 test), about 7,000 genes
• Well-studied (CAMDA-2000), good test example
[Microarray images of ALL and AML samples: visually similar, but genetically very different]
Gene subset selection: one X-validation
[Chart: average error for 10-fold x-val (0%-30%) vs. genes per class (1, 2, 3, 4, 5, 10, 20, 30, 40), for a single cross-validation run]
Gene subset selection: multiple cross-validation runs
For the ALL/AML data, 10 genes per class had the lowest error (<1%)
[Chart: each point in the center is the average error from 10 cross-validation runs; bars indicate 1 standard deviation above and below]
ALL/AML: Results on the test data
• Genes selected and model trained on the Train set ONLY!
• Best net with the 10 top genes per class (20 overall) was applied to the test data (34 samples):
  - 33 correct predictions (97% accuracy)
  - 1 error, on sample 66: actual class AML, net prediction ALL
  - other methods consistently misclassify sample 66; misclassified by a pathologist?
Pediatric Brain Tumour Data
• 92 samples, 5 classes (MED, EPD, JPA, MGL, RHB) from U. of Chicago Children’s Hospital
• Outer cross-validation with gene selection inside the loop
• Ranking by absolute T-test value (selects top positive and negative genes)
• Select best genes by adjusted error for each class
• Bagging of 100 neural nets
Selecting Best Gene Set
• Minimizing the combined error for all classes is not optimal
[Chart: average, high, and low error rate for all classes]
Error rates for each class
[Chart: error rate vs. genes per class, for each class]
Evaluating One Network
Error rates averaged over 100 networks:

Class    Error rate
MED         2.1%
MGL          17%
RHB          24%
EPD           9%
JPA          19%
*ALL*        8.3%
Bagging 100 Networks

Class    Individual Error Rate    Bag Error rate    Bag Avg Conf
MED              2.1%                 2% (0)*            98%
MGL               17%                    10%             83%
RHB               24%                    11%             76%
EPD                9%                     0              91%
JPA               19%                     0              81%
*ALL*            8.3%                 3% (2)*            92%

• Note: suspected error on one sample (labeled as MED but consistently classified as RHB); see the bagging sketch below
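A minimal scikit-learn sketch of the bagging scheme behind this table: 100 neural nets, each trained on a (stratified) bootstrap resample, whose predicted class probabilities are averaged; the winning class's averaged probability plays the role of the bag confidence. The network size and training settings are assumptions, not the original Clementine configuration.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.utils import resample

    def bag_of_nets(X, y, n_nets=100):
        """Train n_nets MLPs, each on a bootstrap resample of the training data."""
        nets = []
        for i in range(n_nets):
            Xb, yb = resample(X, y, stratify=y, random_state=i)   # bootstrap sample
            nets.append(MLPClassifier(hidden_layer_sizes=(20,), max_iter=500,
                                      random_state=i).fit(Xb, yb))
        return nets

    def bag_predict(nets, X):
        """Average class probabilities over the bag; return the winning class
        and its averaged probability (the 'bag average confidence')."""
        proba = np.mean([net.predict_proba(X) for net in nets], axis=0)
        return nets[0].classes_[proba.argmax(axis=1)], proba.max(axis=1)

    # usage: labels, confidence = bag_predict(bag_of_nets(X_train, y_train), X_test)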
AF1q: New Marker for Medulloblastoma?
• AF1q = ALL1-fused gene from chromosome 1q
• Transmembrane protein
• Related to leukemia (3 PubMed entries) but not to medulloblastoma
Future directions for Microarray Analysis
• Algorithms optimized for small samples
• Integration with other data:
  - biological networks
  - medical text
  - protein data
• Cost-sensitive classification algorithms (see sketch below)
  - error cost depends on outcome (don't want to miss a treatable cancer), treatment side effects, etc.
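For the cost-sensitive bullet, a tiny numpy sketch of one common approach: instead of predicting the most probable class, predict the class with the lowest expected cost under a misclassification-cost matrix. The cost numbers are invented purely for illustration.

    import numpy as np

    # cost[i, j] = cost of predicting class j when the true class is i
    # (missing a treatable cancer is made far more expensive than a false alarm)
    cost = np.array([[0.0,  1.0],     # true class 0: healthy
                     [50.0, 0.0]])    # true class 1: treatable cancer

    def min_cost_prediction(proba, cost):
        """proba: (n_samples, n_classes) class probabilities from any classifier."""
        expected = proba @ cost          # expected cost of each possible prediction
        return expected.argmin(axis=1)

    proba = np.array([[0.99, 0.01],      # near-certain healthy
                      [0.70, 0.30]])     # 30% cancer risk
    print(min_cost_prediction(proba, cost))   # [0 1]: 30% risk is enough to flag cancer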
Acknowledgements
• Eric Bremer, Children’s Hospital (Chicago) & Northwestern U.
• Greg Cooper, U. Pittsburgh
• Tom Khabaza, SPSS
• Sridhar Ramaswamy, MIT/Whitehead Institute
• Pablo Tamayo, MIT/Whitehead Institute
Thank you
Further resources on Data Mining:
www.KDnuggets.com
Microarrays:
www.KDnuggets.com/websites/microarray.html
Contact:
Gregory Piatetsky-Shapiro:
www.kdnuggets.com/gps.html
© 2003 KDnuggets