Multiclass classification of microarray
data with repeated measurements:
application to cancer
Ka Yee Yeung & Roger E Bumgarner
Genome Biology 2003, 4:R83
Sample Classification
• Use gene expression measurements
from microarray experiments to classify
biological samples (e.g. by tumor type).
• Goals
• Utilize Repeated Measurements
• Multiclass classification
• Remove redundancy
• No assumption of distribution
Shrunken Centroid Classification
• Feature selection
• Consider features individually
• Calculate overall centroid and each class centroid
• “Shrink” class centroids by factor Δ
• Compare shrunken class centroids to overall centroid
• If significantly different, feature is predictive for the class
• Estimate optimum Δ using 10-fold cross validation
• Classification
• Calculate standardized, squared difference of sample to
each shrunken class centroid for selected features
• Assign to class with nearest centroid
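The feature selection and classification steps above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes soft thresholding of the standardized centroid differences (as in the standard nearest shrunken centroid method), omits the class-size factor and class priors, and uses hypothetical function names (`fit_shrunken_centroids`, `predict`, `choose_delta`). Because of these simplifications, Δ values here are not on the same scale as in published software.

```python
import numpy as np

def fit_shrunken_centroids(X, y, delta):
    """X: samples x genes array, y: class labels, delta: shrinkage amount."""
    classes = np.unique(y)
    overall = X.mean(axis=0)                                  # overall centroid
    centroids = np.array([X[y == k].mean(axis=0) for k in classes])
    # Pooled within-class standard deviation of each gene.
    resid = X - centroids[np.searchsorted(classes, y)]
    s = np.sqrt((resid ** 2).sum(axis=0) / (len(y) - len(classes)))
    scale = s + np.median(s)          # median added to guard against tiny s
    # Standardized difference of each class centroid from the overall centroid.
    d = (centroids - overall) / scale
    # Soft thresholding: move every difference toward zero by delta.
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)
    shrunken = overall + scale * d_shrunk
    # A gene is kept if at least one class still differs from the overall centroid.
    selected = np.any(d_shrunk != 0, axis=0)
    return classes, shrunken, scale, selected

def predict(X_new, classes, shrunken, scale, selected):
    """Assign each sample to the class with the nearest shrunken centroid."""
    diff = (X_new[:, None, selected] - shrunken[None, :, selected]) / scale[selected]
    dist = (diff ** 2).sum(axis=2)    # standardized squared distance per class
    return classes[dist.argmin(axis=1)]

def choose_delta(X, y, deltas, n_folds=10, seed=0):
    """Pick the delta with the fewest cross-validation errors.
    Folds are not stratified; ties go to the first delta listed."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for delta in deltas:
        wrong = 0
        for test_idx in folds:
            train = np.ones(len(y), dtype=bool)
            train[test_idx] = False
            model = fit_shrunken_centroids(X[train], y[train], delta)
            wrong += np.sum(predict(X[test_idx], *model) != y[test_idx])
        errors.append(wrong)
    return deltas[int(np.argmin(errors))]
```

Larger values of `delta` zero out more genes, so the cross-validation step trades prediction error against the number of genes retained.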
Redundancy & Error Estimation
• Uncorrelated Shrunken Centroid (USC)
• Removes redundant genes
• For each set of relevant genes
• Compute pairwise correlations
• Remove least relevant gene from pairs with
correlation above given threshold
• Use cross-validation to determine best pair
(shrinkage factor, correlation threshold)
• Error Weighted Uncorrelated SC (EWUSC)
• The standard deviation of the sample mean is
used to down-weight the most variable genes
and experiments
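As a rough sketch of the two ideas above: the first function drops the less relevant gene from every pair whose correlation exceeds a threshold, and the second shows inverse-variance weighting as one common way to down-weight noisy repeated measurements. Both are illustrations under stated assumptions. The `relevance` scores, the use of signed Pearson correlation, and the inverse-variance formula are choices made here and are not taken from the paper; the function names are hypothetical.

```python
import numpy as np

def remove_redundant(X, relevance, threshold):
    """Return indices of genes kept after redundancy removal.

    X: samples x genes array restricted to the relevant genes.
    relevance: one score per gene, higher means more relevant (assumed input).
    For each gene pair with correlation above threshold, the less relevant
    gene of the pair is dropped.
    """
    relevance = np.asarray(relevance, dtype=float)
    corr = np.corrcoef(X, rowvar=False)        # gene-by-gene Pearson correlation
    n_genes = X.shape[1]
    drop = set()
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            if corr[i, j] > threshold:
                drop.add(i if relevance[i] < relevance[j] else j)
    return [g for g in range(n_genes) if g not in drop]

def error_weighted_mean(values, sem):
    """Inverse-variance weighted mean along axis 0.

    values: averaged expression levels; sem: standard deviation of each
    sample mean. Noisier entries (large sem) get smaller weights.
    """
    values = np.asarray(values, dtype=float)
    weights = 1.0 / np.square(np.asarray(sem, dtype=float))
    return (weights * values).sum(axis=0) / weights.sum(axis=0)
```

In a full pipeline, the correlation threshold would be tuned jointly with the shrinkage factor by cross-validation, as the slide describes.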
Experiments
Datasets
• Synthetic datasets, varying:
• Biological noise level
• Technical noise level
• Number of repeated measurements
• Percent of relevant genes
• Real Datasets
• Multiple tumor dataset – 7,129 genes, 123 samples, 11 classes
(types of tumors)
• Breast cancer dataset – 25,000 genes, 97 samples, 2 classes
(good or poor prognosis)
Evaluation Criteria
• Prediction Accuracy
• Number of relevant features selected
• Feature stability
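To make the criteria concrete, here is a minimal sketch of how two of them might be computed. Prediction accuracy is the fraction of correctly classified samples. For feature stability the slides give no formula, so the mean pairwise Jaccard overlap between the gene sets selected in different cross-validation runs is used here purely as an assumed stand-in; the function names are hypothetical.

```python
from itertools import combinations
import numpy as np

def prediction_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def feature_stability(gene_sets):
    """Mean pairwise Jaccard overlap of selected gene sets (assumed measure).

    gene_sets: one collection of selected gene identifiers per run.
    Returns NaN when fewer than two non-empty pairs are available.
    """
    sets = [set(s) for s in gene_sets]
    scores = [len(a & b) / len(a | b)
              for a, b in combinations(sets, 2) if a | b]
    return float(np.mean(scores)) if scores else float("nan")
```

A value of 1.0 from `feature_stability` means every run selected exactly the same genes; values near 0 mean the selected sets barely overlap.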
Synthetic data results
• Removing redundant genes (USC)
= Similar accuracy
+ Using same or fewer genes
Error weighting results on synthetic datasets
• Two types of error defined
• Technical noise – variation over repeated measurements (λ)
• Low (1) or High (5, 10)
+ Handled “technical noise” well (similar accuracy, fewer
genes)
• Biological noise – signal to noise ratio (α)
• 20 to 1, 2 to 1, or 1 to 1
• Accuracy was worse with increased “biological noise”, despite
an increased number of repeated measurements
Criticism
• Noise level is the same across the entire synthetic dataset; it should vary between genes
• Each real dataset would contain some genes with a high signal-to-noise ratio
Real Data Results
• Removing redundant genes (USC)
= Similar, but varying accuracy
+ Using many fewer genes
• Error weighting – Real Datasets
• Multiple tumor data
+ Improved accuracy
+ Improved feature stability
= Using similar number of genes
• Breast cancer data
+ Improved accuracy
= Similar feature stability
– Using increased number of genes