Imputation Algorithms for Data Mining: Categorization and New Ideas
Aleksandar R. Mihajlovic
Technische Universität München
[email protected]
+49 176 673 41387
+381 63 183 0081
Overview
• Explain the input-data-based categorization scheme for
imputation algorithms
• Introduce a new categorization scheme for
imputation algorithms
• Introduce some new ideas for recategorizing and improving existing
algorithms and for creating new ones
Digitization of Microarray Data and the
Missing Value Problem
• Missing SNPs in individual DNA
• These missing values statistically blur the association between
SNP alleles and the disease gene allele
Earlier Input Data Based Classification
of Imputation Algorithms [2]
• Categorized according to the input data
Global Approach
Local Approach
Hybrid Approach
• Global + Local
Knowledge Based Approach
Earlier Input Data Based Classification
of Imputation Algorithms
• Classification example: Imputation Algorithms
(briefly describe each)
– Global
• SVDImpute
– Local
• KNNimpute
– Hybrid
• LinCmb
– Knowledge
• GOimpute
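As an illustration of the Local category, the following is a minimal sketch of a nearest neighbor imputation in the spirit of KNNimpute. It is a simplification under stated assumptions, not the published procedure: the function name `knn_impute`, the use of `None` for a missing value, the Euclidean distance over shared observed columns, and the unweighted mean of the neighbors are all choices made here for illustration.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest rows' values.

    Assumption: distance is the root mean squared difference over the
    columns that both rows have observed (excluding the missing column).
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A donor row must be a different row and have column j observed.
                if m == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if c != j and row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = candidates[:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled

# Hypothetical toy data: the second row is missing its third value.
data = [[1.0, 2.0, 3.0], [1.1, 2.1, None], [1.0, 2.0, 3.2], [9.0, 9.0, 9.0]]
completed = knn_impute(data, k=2)
```

The input list is left unchanged; the function returns a filled copy, so the original missing positions remain identifiable.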
Ideas for Algorithmic Improvement [3]
• Ideas for a new categorization model of
algorithms based on the methods they use
– Link between the method used and the input data
– Room for subcategories based on methods
• Revising the categorization model
– Mendeleyevization
– Hybridization
– Transdisciplinarization
– Retrajectorization
Mendeleyevization
• Catalyst
– Probability based algorithms
• EM (expectation maximization) algorithms
have not yet been classified
• Accelerator
– Algebra based algorithms
• With more memory and better processing power
we can increase the number of subjects examined. This
would improve the precision
of Principal Component Analysis algorithms
such as BPCA and of Singular Value Decomposition algorithms such as SVDimpute.
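To make the algebra based idea concrete, here is a minimal sketch of an SVD driven imputation, assuming NumPy is available. It is a simplified low-rank reconstruction loop, not necessarily the exact SVDimpute procedure: the function name `svd_impute`, the `rank` and `n_iter` parameters, the column-mean initialization, and the use of `NaN` for missing values are assumptions made for illustration. It also assumes every column has at least one observed value.

```python
import numpy as np

def svd_impute(X, rank=1, n_iter=100):
    """Iteratively replace missing entries with a rank-limited SVD reconstruction."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means so that the first SVD has no NaN entries.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Observed entries are always kept; only missing ones are updated.
        filled = np.where(missing, low_rank, X)
    return filled

# Hypothetical toy matrix with rank-1 structure and one missing entry.
X = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
X[1, 2] = np.nan
completed = svd_impute(X, rank=1)
```

Because each iteration needs a full decomposition of the data matrix, the cost grows with the number of subjects, which is where the extra memory and processing power mentioned above would matter.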
Mendeleyevization
• Imputation Algorithms
– Global
• Probability Based
• Algebra Based
Hybridization
• Symbiosis
– NN Based and Regression Based: Local algorithms can
be classified as both symbiotic and synergic. The difference
lies in the data types available for the imputation
process. Depending on the data set, the appropriate algorithm can be
selected from the statistical closeness category.
• Synergy
– Statistical Closeness: Nearest Neighbor based and
Regression based algorithms can be made to work together,
since neither is too computationally expensive.
It can be assumed that a Regression based result can
be used to correct an NN based result by
averaging the two results.
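The synergy idea above, averaging a nearest neighbor estimate with a regression estimate, can be sketched as follows. This is only one possible reading of the slide: the helper names `nn_estimate`, `regression_estimate`, and `hybrid_estimate`, the single predictor column, and the equal weighting of the two results are assumptions. The sketch also assumes the target row is missing only column `j` and that donor rows are otherwise complete.

```python
def nn_estimate(rows, i, j):
    """Value of column j in the donor row closest to row i (squared Euclidean)."""
    cols = [c for c in range(len(rows[i])) if c != j]

    def dist(other):
        return sum((rows[i][c] - other[c]) ** 2 for c in cols)

    donors = [r for m, r in enumerate(rows) if m != i and r[j] is not None]
    return min(donors, key=dist)[j]

def regression_estimate(rows, i, j, predictor):
    """Least-squares line (column j against one predictor column), evaluated at row i."""
    pts = [(r[predictor], r[j]) for m, r in enumerate(rows)
           if m != i and r[j] is not None]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * rows[i][predictor]

def hybrid_estimate(rows, i, j, predictor):
    """Average of the NN based and the regression based result."""
    return 0.5 * (nn_estimate(rows, i, j) + regression_estimate(rows, i, j, predictor))

# Hypothetical toy data: the last row is missing column 1.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [2.4, None]]
value = hybrid_estimate(rows, i=3, j=1, predictor=0)
```

Equal weights are the simplest choice; a weighting based on how well each method performs on entries that were deliberately hidden would be a natural refinement.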
Hybridization
• Imputation Algorithms
– Local
• Regression Based
• NN Based
• Statistical Closeness
Transdisciplinarization
• Modification
– Modified NN:
• Modify KNN to include additional parameters
– Compare a large K to a small K, or average over all plausible K
values
– Use different numbers of flanking markers
– Average out all possible outcomes
• Mutation
– Modified probability
• Compare the probability of the flanking marker sequence around the j’th
SNP allele of the i’th subject with that of the other sequences. The allele value
belonging to the sequence with the highest probability wins.
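The Modified NN idea of averaging over several plausible K values can be sketched as below. The function names `knn_estimate` and `multi_k_estimate`, the representation of donors as `(features, value)` pairs, and the squared Euclidean distance are assumptions made for illustration. In the SNP setting the feature vector would hold the flanking markers, so changing its length corresponds to using a different number of flanking markers.

```python
def knn_estimate(target, donors, k):
    """Mean of the k donor values whose feature vectors are closest to `target`."""
    ranked = sorted(
        donors,
        key=lambda d: sum((a - b) ** 2 for a, b in zip(target, d[0])),
    )
    nearest = ranked[:k]
    return sum(value for _, value in nearest) / len(nearest)

def multi_k_estimate(target, donors, k_values):
    """Average the KNN estimates obtained for every K in `k_values`."""
    estimates = [knn_estimate(target, donors, k) for k in k_values]
    return sum(estimates) / len(estimates)

# Hypothetical toy data: one feature per donor and the value to be borrowed.
donors = [([1.0], 10.0), ([2.0], 20.0), ([3.0], 30.0)]
value = multi_k_estimate([0.0], donors, k_values=[1, 2, 3])
```

Averaging over K removes the need to pick a single K in advance, at the cost of running the neighbor search once per candidate value.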
Transdisciplinarization (1)
• Imputation Algorithms
– Local
• Regression Based
• NN Based
• Modified NN
• Statistical Closeness
Transdisciplinarization (2)
• Imputation Algorithms
– Global
• Probability Based
• Modified Probability
• Algebra Based
Retrajectorization
• Reparametrization
– Proteome Based and Gene Based Algorithms
• How protein, amino acid, and codon databases can be utilized in
gene imputation is being researched
• Regranularization
– Process Based: Data Set Partitioning
• Check whether there is linkage disequilibrium (LD) between the i’th
subject with missing values and other sets of diseased
patients.
– Sets are organized by the geographic origin of the subjects
• Find the frequencies of the j’th SNP alleles (the missing SNP
allele under scrutiny in one subject) in the other sets
– If LD exists between another set and the subject, take that set’s alleles into
account; otherwise do not
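The data set partitioning idea can be sketched as follows. The slide does not say how LD between a subject and a set is measured, so this sketch makes loud assumptions: LD is taken as the r² statistic between the j’th SNP and one flanking SNP inside each set, alleles are coded 0/1, each set is a list of haplotypes, and a set contributes only when its r² reaches a hypothetical `threshold`. The names `r_squared` and `impute_from_sets` are illustrative.

```python
from collections import Counter

def r_squared(haplotypes, a, b):
    """LD r^2 between biallelic sites a and b (alleles coded 0/1)."""
    n = len(haplotypes)
    p_a = sum(h[a] for h in haplotypes) / n
    p_b = sum(h[b] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[a] == 1 and h[b] == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # A monomorphic site carries no LD information.
    return (p_ab - p_a * p_b) ** 2 / denom

def impute_from_sets(subject, j, flank, sets, threshold=0.8):
    """Vote on the subject's missing allele j using only sets that show LD."""
    votes = Counter()
    for haplotypes in sets.values():
        if r_squared(haplotypes, flank, j) < threshold:
            continue  # No LD in this set: do not take its alleles into account.
        for h in haplotypes:
            if h[flank] == subject[flank]:
                votes[h[j]] += 1
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical sets keyed by geographic origin; site 0 is the flanking marker.
sets = {
    "A": [(0, 0), (0, 0), (1, 1), (1, 1)],
    "B": [(0, 1), (1, 0), (0, 0), (1, 1)],
}
allele = impute_from_sets((1, None), j=1, flank=0, sets=sets)
```

Set "B" shows no association between the two sites and is skipped, so only set "A" votes on the missing allele.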
Retrajectorization
• Imputation Algorithms
– Knowledge
• Gene Based
• Process Based
• Proteome Based
The Whole Categorization Tree
• Imputation Algorithms
– Global
• Probability Based
• Modified Probability
• Algebra Based
– Local
• NN Based
• Modified NN
• Regression Based
• Statistical Closeness
– Knowledge
• Gene Based
• Process Based
• Proteome Based
– Hybrid
References
• [1] Frey M., Gierl A., De Angelis, Beckers J., Kieser A.,
Genomics Lecture; Fakultät für Biowissenschaft, TUM,
Weihenstephan, Freising bei München; Winter Semester 2011
• [2] Liew A.W., Law N., Yan H., Missing value imputation for
gene expression data: computational techniques to recover
missing data from available information, Briefings in
Bioinformatics, December 14, 2010, pp.3
• [3] Milutinovic V., Korolija N., A Short Course for PhD Students
in Science and Engineering: How to Write Papers for JCR
Journals
The End
Questions