A Combinatorial Approach to the Analysis of Differential Gene Expression Data:
The Use of Graph Algorithms for Disease Prediction and Screening
The Goal
• To classify patients based on expression profiles
– Presence of cancer
– Type of cancer
– Response to treatment
• To identify the genes required for accurate
classification
– Too many = unnecessary noise
– Too few = insufficient information
Classic Clustering Problem
• Current techniques:
– Hierarchical Clustering
– K-Means Clustering
– Self-Organizing Maps
– Others
• Drawbacks:
– Determining cluster boundaries difficult with diffuse
data
– Objects can only belong to one group
Algorithmic Training
• Raw Data
• Eliminate Poorly Discriminating Genes (Gene Scoring)
• Eliminate Poorly Covering Genes (Dominating Set)
• Calculate Sample Similarities
• Apply Threshold
• Verify by Classification (Maximal Cliques)
• Output: Set of Discriminatory Genes, Gene Scores
The Gene Scoring Function:
Identifying Discriminators
[Figure: two example plots shown side by side ("vs.")]
$\mathrm{score}(\mathrm{gene}_i) = |\mu_{classA} - \mu_{classB}| - \sigma_{classA} - \sigma_{classB}$
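A minimal sketch of a gene score of this kind follows. The operators of the slide's formula did not survive extraction, so the form used here (absolute difference of the class means minus the sum of the class standard deviations) is an assumption, as are the function name `gene_score`, the use of the population standard deviation, and the sample values.

```python
from statistics import mean, pstdev


def gene_score(class_a, class_b):
    """Score one gene from its expression values in two classes.

    Assumed form (not confirmed by the slides):
    |mean_A - mean_B| - (std_A + std_B).
    """
    separation = abs(mean(class_a) - mean(class_b))
    spread = pstdev(class_a) + pstdev(class_b)
    return separation - spread


# Hypothetical expression values, for illustration only.
well_separated = gene_score([1.0, 1.2, 0.8], [5.0, 5.2, 4.8])
overlapping = gene_score([1.0, 3.0, 5.0], [2.0, 4.0, 6.0])
```

Under this assumed form, a gene whose class means are far apart relative to the within-class spread scores high, and a gene whose classes overlap scores low or negative.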
Eliminate Poorly Covering Genes
[Figure: genes versus samples, with samples grouped into Class 1 and Class 2]
Create Unweighted Graph
• Complete, edge-weighted graph
– Vertices = samples
– Edge weight = similarity metric
• Remove edge weights
– If edge weight < threshold, remove edge from
graph
– Otherwise, keep edge, ignore weight
• Result: incomplete unweighted graph
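The thresholding step above can be sketched as follows. The function name `apply_threshold`, the dictionary layout, and the sample weights are illustrative assumptions, not taken from the slides.

```python
def apply_threshold(weights, threshold):
    """Turn a complete edge-weighted graph into an unweighted graph.

    weights: dict mapping a (sample_u, sample_v) pair to a similarity value.
    Returns a dict mapping each sample to the set of its neighbors.
    """
    adjacency = {}
    for (u, v), weight in weights.items():
        # Every sample stays in the graph, even if it loses all its edges.
        adjacency.setdefault(u, set())
        adjacency.setdefault(v, set())
        # Keep the edge (and forget its weight) only at or above the threshold.
        if weight >= threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


# Hypothetical similarities between three samples.
weights = {("s1", "s2"): 0.9, ("s1", "s3"): 0.2, ("s2", "s3"): 0.4}
graph = apply_threshold(weights, 0.5)
```

Here only the edge between `s1` and `s2` survives, so `s3` becomes an isolated vertex.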
The Edge Weight Function
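$\mathrm{weight}(\mathrm{sample}_j, \mathrm{sample}_k) = \sum_i \mathrm{score}(\mathrm{gene}_i) \cdot (1 - |\mathrm{expression\_value}_{ij} - \mathrm{expression\_value}_{ik}|)$
where
expression_value_ij = expression value of gene_i for sample_j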
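A minimal sketch of such an edge weight follows. Parts of the slide's formula were lost in extraction, so this assumes a sum over genes of score(gene_i) times (1 - |expression_value_ij - expression_value_ik|), with expression values already scaled to [0, 1]. The function names and the data are hypothetical.

```python
from itertools import combinations


def edge_weight(scores, expr_j, expr_k):
    """Similarity of two samples under the assumed weight function.

    scores: one score per gene; expr_j, expr_k: the two samples'
    expression values for the same genes, assumed scaled to [0, 1].
    """
    return sum(s * (1 - abs(a - b)) for s, a, b in zip(scores, expr_j, expr_k))


def all_edge_weights(scores, samples):
    """Weights for every pair of samples (the complete graph)."""
    return {
        (j, k): edge_weight(scores, samples[j], samples[k])
        for j, k in combinations(sorted(samples), 2)
    }


# Hypothetical data: two genes, three samples.
scores = [2.0, 1.0]
samples = {"a": [0.1, 0.9], "b": [0.2, 0.8], "c": [0.9, 0.1]}
weights = all_edge_weights(scores, samples)
```

Samples `a` and `b` have similar profiles and receive a larger weight than the dissimilar pair `a` and `c`.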
What is a Clique?
• A completely connected subset of vertices in a
graph
• Maximal clique = local optimization (no further vertex can be added)
• Finding a maximum clique is NP-hard; the decision version is NP-complete
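To make the definition concrete, here is a short sketch that enumerates maximal cliques with the basic Bron-Kerbosch recursion. The slides do not say which clique algorithm the group used, so this is an illustrative choice, and the function name and example graph are assumptions.

```python
def maximal_cliques(adjacency):
    """List every maximal clique of an undirected graph.

    adjacency: dict mapping each vertex to the set of its neighbors.
    Uses the basic Bron-Kerbosch recursion (no pivoting).
    """
    cliques = []

    def expand(current, candidates, excluded):
        # No vertex can extend the clique and none was skipped: it is maximal.
        if not candidates and not excluded:
            cliques.append(current)
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(frozenset(), set(adjacency), set())
    return cliques


# Hypothetical graph: a triangle a-b-c, a pendant vertex d, an isolated vertex e.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
cliques = maximal_cliques(adjacency)
```

The worst-case running time is exponential in the number of vertices, which is consistent with the hardness noted above.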
Classification Using Clique
[Figure: graph whose cliques are labeled Class 1, Class 2, and Class 3]
A Selection of Discriminators
Gene      Name                                          Function
ADH1B     alcohol dehydrogenase IB                      alcohol dehydrogenase activity
FHL1      four and a half LIM domains 1                 cell growth, cell differentiation
HBB       hemoglobin, beta                              oxygen transport
CYP4B1    cytochrome P450 4B1                           electron transport
TNA       tetranectin                                   plasminogen binding protein
TGFBR2    transforming growth factor, beta receptor II  transmembrane receptor protein serine/threonine kinase signaling pathway
The Algorithm - Unsupervised
Raw Data
Set of Discriminatory
Genes, Scores
Calculate Sample Similarities
Apply Threshold
Classify Unknown Samples
Summary
• Intersection of clique and dominating set
techniques improves results
• Combined orthogonal scoring identifies a limited
number of discriminatory genes
• Clique offers a means of validating obtained scores
and weights
• Our technique identifies a different set of
discriminatory genes than the original paper
• Clique-based classification is a viable complement to
present clustering methods
Ongoing and Future Research
• Reverse Training
• Train to distinguish among types of cancer
• Experiment with different weight functions (e.g.,
Pearson's correlation coefficient)
• Investigate using less stringent techniques
– Near-cliques
– Neighborhood search
– K-dense subgraphs
• Port code to SGI Altix supercomputer
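One of the alternative weight functions named above, Pearson's correlation coefficient, could be computed between two samples roughly as follows. This is only a sketch: the slides do not describe how the coefficient would be used as an edge weight, and the function name and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical expression profiles for three samples.
same_trend = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite_trend = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Because the coefficient ranges from -1 to 1, the threshold applied to the graph would need to be chosen on that scale.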
Our Research Group
Mike Langston, Ph. D.
Lan Lin
Xinxia Peng
Chris Symons
Bing Zhang, Ph. D.