Mine Microarray Gene Expression Data, Predict Cancers

Gene Expression Data Analysis
Zhang Louxin
Dept. of Mathematics
Nat. University of Singapore
Molecular Class Prediction
Several supervised learning methods are available:
• Neural Networks
• Support Vector Machines
• Decision trees
• Other statistical methods
A Supervised Learning Method
for Predicting a Binary Class
[Figure: positive (Yes) and negative (No) examples feed the learning step; the prediction step then decides the class of a new item.]
A class is just a concept!
In the learning step, the class is modelled as
a mathematical object: a function of several variables,
or a subspace of a high-dimensional space,
representing knowledge of the class.
Learning the class of tall men
The class is modelled as the half space h >= 6'3.
Examples:
Name         Height   Class
Jordan       6'6      tall
Brown        5'9      short
Ewing        6'11     tall
Yao          7'6      tall
Iverson      6'0      short
O'Neal       7'1      tall
Stoudamire   5'11     short
Boykins      5'5      short
[Figure: the eight players placed on a height axis running from 5 ft to 8 ft.]
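The learning step for this example can be sketched in a few lines of Python. This is only an illustration: the helper names (to_inches, learn_threshold, predict) are hypothetical, and placing the cut midway between the tallest "short" example and the shortest "tall" example is an assumed rule, since the slide only states the resulting half space.

```python
def to_inches(height):
    """Convert a height string such as "6'11" to inches."""
    feet, inches = height.split("'")
    return 12 * int(feet) + int(inches)

# Labelled examples from the slide: (name, height, class).
examples = [
    ("Jordan", "6'6", "tall"), ("Brown", "5'9", "short"),
    ("Ewing", "6'11", "tall"), ("Yao", "7'6", "tall"),
    ("Iverson", "6'0", "short"), ("O'Neal", "7'1", "tall"),
    ("Stoudamire", "5'11", "short"), ("Boykins", "5'5", "short"),
]

def learn_threshold(examples):
    """Learning step: pick a cut between the two classes.

    Assumed rule: the midpoint between the tallest "short" example
    and the shortest "tall" example.
    """
    tall = [to_inches(h) for _, h, label in examples if label == "tall"]
    short = [to_inches(h) for _, h, label in examples if label == "short"]
    return (max(short) + min(tall)) / 2

def predict(height, threshold):
    """Prediction step: a new item is tall if it lies in the half space."""
    return "tall" if to_inches(height) >= threshold else "short"

threshold = learn_threshold(examples)  # 75.0 inches, i.e. 6'3
print(threshold, predict("6'4", threshold), predict("6'2", threshold))
```

With these eight examples the midpoint rule gives 75 inches, which is the 6'3 cut used on the slide.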
Support Vector Machines
A support vector machine finds a hyperplane that separates the data
points into two classes with maximum margin in the feature space.
[Figure: the input space is embedded into the feature space using a kernel function.]
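The effect of the embedding can be illustrated with a toy example. The sketch below is not an SVM solver; it only shows, on made-up one-dimensional data, that points which no single cut can separate in the input space become separable after an assumed embedding x -> (x, x^2). The function names are hypothetical.

```python
# Toy 1-D inputs (made up): class +1 when |x| > 1, class -1 otherwise.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

def embed(x):
    """Assumed feature map from the input space to a 2-D feature space."""
    return (x, x * x)

def kernel(x, z):
    """Kernel equal to the inner product of the embedded points."""
    return x * z + (x * x) * (z * z)

def separable_by_cut(values, labels):
    """True if one threshold on a single axis splits the two classes."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == -1]
    return max(neg) < min(pos) or max(pos) < min(neg)

# No cut works on the raw inputs ...
in_input_space = separable_by_cut(points, labels)
# ... but a cut on the second feature-space coordinate does.
in_feature_space = separable_by_cut([embed(x)[1] for x in points], labels)
print(in_input_space, in_feature_space)
```

The kernel function computes the feature-space inner product directly from the inputs, which is what lets a support vector machine work in the feature space without constructing it explicitly.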
Molecular Class Prediction
-- Leukemia Case
Morphology does not distinguish leukemias very well.
Golub et al. (Science, 1999) proposed a voting method
for distinguishing
acute lymphoblastic leukemia (ALL) from
acute myeloid leukemia (AML)
using gene expression fingerprinting.
In that work, an Affymetrix DNA chip with 6817 genes
was used on 72 ALL/AML samples.
[Figure courtesy of Golub.]
The voting algorithm
[Golub '99]
1. Select a subset of 2 × 25 genes that correlate highly with the
ALL/AML distinction, based on 38 training samples.
Correlation metric:
$$P(g) = \frac{\mu_1 - \mu_2}{d_1 + d_2}$$
$\mu_1$ ($\mu_2$): the mean expression level of g in AML
(ALL) samples;
$d_1$ ($d_2$): the within-class standard deviation of the
expression of g in AML (ALL) samples.
2. Each selected gene casts a weighted vote for a new sample;
the total of the weighted votes decides the winning class.
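Step 1 can be sketched as follows. The expression values are made up for illustration, the function names are hypothetical, and the use of the sample standard deviation (statistics.stdev) is an assumption, since the slide only says "within-class standard deviation".

```python
from statistics import mean, stdev

def correlation_metric(aml_values, all_values):
    """P(g) = (mu1 - mu2) / (d1 + d2), with class 1 = AML and class 2 = ALL."""
    mu1, mu2 = mean(aml_values), mean(all_values)
    d1, d2 = stdev(aml_values), stdev(all_values)  # assumed: sample std. dev.
    return (mu1 - mu2) / (d1 + d2)

def select_genes(expression, k):
    """Return the k genes most correlated with AML and the k most with ALL.

    expression maps a gene name to (AML training values, ALL training values).
    """
    scores = {g: correlation_metric(a, b) for g, (a, b) in expression.items()}
    ranked = sorted(scores, key=scores.get)   # ascending P(g)
    top_all = ranked[:k]                      # most negative: high in ALL
    top_aml = ranked[-k:][::-1]               # most positive: high in AML
    return top_aml, top_all, scores

# Made-up training data for three hypothetical genes.
toy = {
    "gene_a": ([10, 12, 14], [2, 3, 4]),
    "gene_b": ([1, 2, 3], [9, 11, 13]),
    "gene_c": ([5, 6, 7], [5, 7, 6]),
}
top_aml, top_all, scores = select_genes(toy, 1)
print(top_aml, top_all)
```

On the real data k would be 25, giving the 2 × 25 genes of step 1.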
The voting method:
Separating samples by hyperplanes
Mathematically, the total of all the votes on a new sample X is
$$V = \sum_{i=1}^{50} v_i = \sum_{i=1}^{50} P(g_i)\,(x_i - b_i),$$
where $x_i$ is the expression level of $g_i$ in the new sample X.
If V > 0, X is classified as AML;
otherwise, X is ALL.
[Figure: a hyperplane separating AML samples from ALL samples.]
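The decision rule can be sketched as below. The slide does not define b_i; the example assumes it is the midpoint of the two class means of gene g_i, and the weights and sample values are made up for illustration. The function names are hypothetical.

```python
def total_vote(sample, genes):
    """V = sum over selected genes of P(g_i) * (x_i - b_i).

    genes is a list of (name, P, b) triples; sample maps a gene name
    to its expression level x_i in the new sample.
    """
    return sum(p * (sample[name] - b) for name, p, b in genes)

def classify(sample, genes):
    """AML if the total vote is positive, ALL otherwise."""
    return "AML" if total_vote(sample, genes) > 0 else "ALL"

# Made-up selected genes: (name, P(g), b).
# b is ASSUMED to be the midpoint of the class means.
genes = [
    ("gene_a", 3.0, 7.5),    # high in AML, so P > 0
    ("gene_b", -3.0, 6.5),   # high in ALL, so P < 0
]

sample_1 = {"gene_a": 11.0, "gene_b": 3.0}
sample_2 = {"gene_a": 3.0, "gene_b": 10.0}
print(total_vote(sample_1, genes), classify(sample_1, genes))
print(total_vote(sample_2, genes), classify(sample_2, genes))
```

Because V is linear in the expression levels, the set V = 0 is a hyperplane in the space of selected genes, which is why the voting method amounts to separating samples by a hyperplane.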
Decision Tree Learning
• Information-reduction learning method.
• Represents a class or concept as a logic sentence, e.g.
IF (Outlook == Sunny) & (Humidity == Normal) THEN playTennis
• When to use decision trees:
• Instances are describable by attribute-value pairs
• The target function is discrete-valued
• The training data may be noisy
Examples: medical diagnosis, credit risk analysis
Textbook Example: PlayTennis
• Each internal node tests an attribute
• Each branch has a value
• Each leaf assigns a classification
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
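A decision tree of this kind is equivalent to nested attribute tests. The sketch below assumes the usual textbook PlayTennis tree (Outlook at the root, Humidity under Sunny, Wind under Rain, Overcast always Yes); the function name is hypothetical.

```python
def play_tennis(outlook, humidity, wind):
    """Classify one day by walking the assumed PlayTennis tree."""
    if outlook == "Sunny":        # internal node: test Humidity
        return humidity == "Normal"
    if outlook == "Overcast":     # leaf: always Yes
        return True
    if outlook == "Rain":         # internal node: test Wind
        return wind == "Weak"
    raise ValueError(f"unknown outlook: {outlook!r}")

print(play_tennis("Sunny", "Normal", "Strong"))
print(play_tennis("Rain", "High", "Strong"))
```

Each root-to-leaf path that ends in Yes corresponds to one logic sentence, such as IF (Outlook == Sunny) & (Humidity == Normal) THEN playTennis.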
Remarks
• A decision tree is constructed by top-down induction.
• Short trees are preferred, as are trees with high
information gain attributes near the root.
• Information is measured with entropy.
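Entropy and information gain can be computed as in the following sketch. The four training rows are made up for illustration, and the function names are hypothetical; the formulas are the standard ones for entropy in bits and for the gain of splitting on one attribute.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after the split."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Made-up rows: wind determines the class, humidity carries no information.
rows = [
    {"wind": "Weak", "humidity": "High", "play": "yes"},
    {"wind": "Weak", "humidity": "Normal", "play": "yes"},
    {"wind": "Strong", "humidity": "High", "play": "no"},
    {"wind": "Strong", "humidity": "Normal", "play": "no"},
]

# Top-down induction places the attribute with the highest gain at the root.
root = max(["wind", "humidity"], key=lambda a: information_gain(rows, a, "play"))
print(root)
```

Top-down induction repeats this choice recursively on each subset of rows until the leaves are pure or no attributes remain.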
ALL vs AML
- Decision Tree this time
(Y. Sun, tech report, MIT)
• Single gene (zyxin), single branch tree
38/38 correct on training cases
31/34 correct on test cases, 3 errors
X*5735_at <=(8+1)38: ALL
• Tree size up to 3 genes
1 decision tree with 1 error
7 decision trees with 2 errors
7 decision trees with 3 errors
Gene Selection
Gene selection is critical in molecular class prediction,
as we learn from the decision tree results. Why?
• In a cellular process, only a relatively small set
of genes is active.
• Mathematically, each gene is just a feature.
The more weak features there are, the noisier the data.
More features also give rise to the overfitting problem.
Research Problem: How to select genes?
Two Approaches
1. Gene selection is done first, and the selected genes are then
used for learning, as in the paper of Golub et al.
2. Gene selection and learning are done together,
as in decision tree learning.
Does this make a difference in learning?
Discovery
[Figure: flow chart with the stages Sample Data Sets, Correlation Test, Voter Selection, Predictors, New Data Sets, and Classification Result.
Correlation tests listed: correlation metric, Pearson correlation coefficient, Euclidean distance, Bayesian coefficient.
Predictors listed: Support Vector Machines, Voting Method, Bayesian Method.
Validation options listed: cross validation, no cross validation.]