A Combinatorial Approach to the Analysis of Differential Gene Expression Data
The Use of Graph Algorithms for Disease Prediction and Screening

The Goal
• To classify patients based on expression profiles
  – Presence of cancer
  – Type of cancer
  – Response to treatment
• To identify the genes required for accurate classification
  – Too many = unnecessary noise
  – Too few = insufficient information

Classic Clustering Problem
• Current techniques:
  – Hierarchical Clustering
  – K-Means Clustering
  – Self-Organizing Maps
  – Others
• Drawbacks:
  – Determining cluster boundaries difficult with diffuse data
  – Objects can only belong to one group

Algorithmic Training
• Raw Data
• Eliminate Poorly Discriminating Genes (Gene Scoring)
• Eliminate Poorly Covering Genes (Dominating Set)
• Calculate Sample Similarities
• Apply Threshold
• Verify by Classification (Maximal Cliques)
• Output: Set of Discriminatory Genes, Gene Scores

Algorithmic Training
Raw Data → Eliminate Poorly Discriminating Genes

The Gene Scoring Function: Identifying Discriminators vs.
[Figure: example plots for the gene scoring function. score(gene_i) is computed from the class means μ_classA and μ_classB and the corresponding per-class terms for classA and classB.]

Algorithmic Training
Raw Data → Eliminate Poorly Discriminating Genes → Eliminate Poorly Covering Genes

Eliminate Poorly Covering Genes
[Figure: genes-by-samples grid, with samples grouped into Class 1 and Class 2]

Algorithmic Training
Raw Data → Eliminate Poorly Discriminating Genes → Eliminate Poorly Covering Genes → Calculate Sample Similarities → Apply Threshold

Create Unweighted Graph
• Complete, edge-weighted graph
  – Vertices = samples
  – Edge weight = similarity metric
• Remove edge weights
  – If edge weight < threshold, remove edge from graph
  – Otherwise, keep edge, ignore weight
• Result: incomplete unweighted graph

The Edge Weight Function
\[ \text{weight}(sample_j, sample_k) = \sum_i \text{score}(gene_i)\,\bigl(1 - |expression\_value_{ij} - expression\_value_{ik}|\bigr) \]
where expression_value_{ij} = expression value of gene_i for sample_j

Algorithmic Training
Raw Data → Eliminate Poorly Discriminating Genes → Eliminate Poorly Covering Genes → Calculate Sample Similarities → Apply Threshold → Verify by Classification → Set of Discriminatory Genes, Gene Scores

What is a Clique?
• A completely connected subset of vertices in a graph
• Maximal clique = local optimization
• NP-complete

Classification Using Clique
[Figure: graph of samples with cliques labeled Class 1, Class 2, and Class 3]

A Selection of Discriminators
• ADH1B: alcohol dehydrogenase IB; alcohol dehydrogenase activity
• FHL1: four and a half LIM domains 1; cell growth, cell differentiation
• HBB: hemoglobin, beta; oxygen transport
• CYP4B1: cytochrome P450 4B1; electron transport
• TNA: plasminogen binding protein tetranectin
• TGFBR2: transforming growth factor, beta receptor II; transmembrane receptor protein serine/threonine kinase signaling pathway

The Algorithm - Unsupervised
• Inputs: Raw Data; Set of Discriminatory Genes, Scores
• Calculate Sample Similarities
• Apply Threshold
• Classify Unknown Samples

Summary
• Intersection of clique and dominating set techniques improves results
• Combined orthogonal scoring identifies a limited number of discriminatory genes
• Clique offers a means of validating the obtained scores and weights
• Our technique identifies a different set of discriminatory genes from the original paper
• Clique-based classification is a viable complement to present clustering methods

Ongoing and Future Research
• Reverse Training
• Train to distinguish among types of cancer
• Experiment with different weight functions (e.g., Pearson's coefficient)
• Investigate using less stringent techniques
  – Near-cliques
  – Neighborhood search
  – K-dense subgraphs
• Port code to SGI Altix supercomputer

Our Research Group
Mike Langston, Ph.D.
Lan Lin
Xinxia Peng
Chris Symons
Bing Zhang, Ph.D.
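
Sketch: gene scoring. The gene scoring formula is only partly legible in these slides (it involves the class means for classA and classB plus further per-class terms). As a stand-in, the sketch below uses a signal-to-noise style score, |mean_A - mean_B| / (sd_A + sd_B). This form is an assumption and is not necessarily the function used by the authors. The names `snr_score` and `keep_discriminators`, the cutoff, and the toy values are all hypothetical.

```python
from statistics import mean, stdev


def snr_score(class_a, class_b):
    """Signal-to-noise style score for one gene (ASSUMED form, not from the slides).

    class_a / class_b hold the gene's expression values in each class;
    each needs at least two samples for the sample standard deviation.
    """
    gap = abs(mean(class_a) - mean(class_b))
    spread = stdev(class_a) + stdev(class_b)
    if spread == 0:
        # No within-class variation: a perfect separator if the means differ.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def keep_discriminators(expr_a, expr_b, cutoff):
    """Return indices of genes whose score is at least `cutoff`.

    expr_a[i] and expr_b[i] are the values of gene i in class A and class B.
    """
    return [
        i
        for i, (a, b) in enumerate(zip(expr_a, expr_b))
        if snr_score(a, b) >= cutoff
    ]


# Toy data (made up): gene 0 separates the classes, gene 1 does not.
expr_a = [[1, 2, 3], [4, 5, 6]]
expr_b = [[7, 8, 9], [5, 6, 4]]
kept = keep_discriminators(expr_a, expr_b, cutoff=1.0)
```

Genes scoring below the cutoff correspond to the "poorly discriminating" genes removed in the first training step; the cutoff itself would need to be tuned on real data.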
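
Sketch: from edge weights to an unweighted graph. The "Create Unweighted Graph" step can be illustrated as follows. The sketch assumes the edge weight between two samples is a score-weighted sum over genes of 1 minus the absolute expression difference, and that expression values are scaled to [0, 1]; both points are assumptions. The function names, threshold, and toy data are hypothetical.

```python
from itertools import combinations


def edge_weight(scores, expr, j, k):
    """Score-weighted similarity between samples j and k.

    scores[i] is the score of gene i; expr[i][j] is the expression value of
    gene i in sample j (assumed here to be scaled to [0, 1]).
    """
    return sum(s * (1.0 - abs(row[j] - row[k])) for s, row in zip(scores, expr))


def threshold_graph(scores, expr, threshold):
    """Build the unweighted graph as {sample: set of neighbouring samples}."""
    n_samples = len(expr[0])
    adj = {v: set() for v in range(n_samples)}
    for j, k in combinations(range(n_samples), 2):
        # Keep the edge (and forget its weight) only if it meets the threshold.
        if edge_weight(scores, expr, j, k) >= threshold:
            adj[j].add(k)
            adj[k].add(j)
    return adj


# Toy data (made up): 2 genes x 4 samples; samples 0-1 and 2-3 look alike.
scores = [1.0, 0.5]
expr = [
    [0.1, 0.2, 0.9, 0.8],
    [0.3, 0.3, 0.7, 0.9],
]
graph = threshold_graph(scores, expr, threshold=1.2)
```

With these toy values only the pairs (0, 1) and (2, 3) survive the threshold, so the complete weighted graph becomes an incomplete unweighted one with two edges.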
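
Sketch: enumerating maximal cliques. The slides do not say which clique algorithm the group uses. As one common option, the sketch below applies the Bron-Kerbosch algorithm with pivoting to a graph stored as a dictionary of neighbour sets. The choice of algorithm, the function name `maximal_cliques`, and the toy graph are assumptions for illustration.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph {v: set(neighbours)}.

    Bron-Kerbosch with pivoting; worst-case running time is exponential
    in the number of vertices.
    """

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored.
        if not p and not x:
            yield sorted(r)
            return
        # Pivot on the vertex with the most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


# Toy graph (made up): a triangle 0-1-2, a pendant edge 2-3, an isolated vertex 4.
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2},
    4: set(),
}
cliques = sorted(maximal_cliques(adj))
```

In a sample-similarity graph, each maximal clique is a group of samples that are all pairwise similar, and a sample may belong to more than one clique, which addresses the "objects can only belong to one group" drawback listed for classic clustering.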