Network Design Problems

Part II: Discriminative Margin Clustering
Joint work with:
Rob Tibshirani, Dept of Statistics
Patrick O. Brown, School of Medicine
Stanford University
Gene Expression
- Micro-array technology
  - Find expression values of all genes in a tissue
  - Expression pattern of genes related to characteristics of tissue type
- Gene expression is combinatorial:
  - Many factors need to combine for expression of a gene
  - Combinations of expressions lead to certain phenotypes
  - Poorly understood
Feature Sets for Tumors
- Set of genes with higher expression in a cancer type compared to every normal tissue type in the body
- Combinatorial gene expression signature
- Potential use in diagnostics and drug treatments
  - If these genes encode cell surface proteins…
  - … can target them using antibodies
  - Kills tumor cells
  - Does not harm normal cells
Feature Set Definition
Convex combination of genes which gives maximum separation in expression values
Constraint: w1 + w2 = 1
[Figure: axes are the expression value for Gene x and the expression value for Gene y; labels: Tumor t, Normal Set N, around 100 samples]
Computing the Feature Set
Maximize $v_t - v_N$
Subject to:
$v_t = \sum_g w_g \, e_g(t)$
$v_N = \max_{n \in N} \sum_g w_g \, e_g(n)$
$\sum_g w_g = 1$ and $w_g \ge 0$
Definition naturally extends to collections of tumor samples
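The inner maximum over normal samples can be replaced by an auxiliary variable, which would turn the problem into a linear program. This is a sketch of that standard rewriting, not a formulation shown on the slides; the variable $z$ is introduced here for illustration and equals $v_N$ at the optimum.

```latex
\begin{aligned}
\text{maximize}\quad & \sum_g w_g \, e_g(t) - z \\
\text{subject to}\quad & z \ge \sum_g w_g \, e_g(n) \quad \text{for all } n \in N, \\
& \sum_g w_g = 1, \qquad w_g \ge 0 \ \text{for all } g.
\end{aligned}
```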
Example
Gene         T     N1    N2    Weight
g1           100   50    10    w1 = 0.5
g2           100   10    50    w2 = 0.5
w1g1+w2g2    100   30    30
Margin = 100 – 30 = 70
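The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: the function name `margin` and the grid scan over w1 are illustrative choices made here, and a real solver would optimize the weights exactly rather than scan them.

```python
def margin(weights, tumor, normals):
    """Margin v_t - v_N of a weighted gene combination.

    weights: one non-negative weight per gene, summing to 1
    tumor:   expression values e_g(t) of the tumor sample
    normals: list of normal samples, each a list of e_g(n)
    """
    v_t = sum(w * e for w, e in zip(weights, tumor))
    v_n = max(sum(w * e for w, e in zip(weights, n)) for n in normals)
    return v_t - v_n


# Values from the example: genes g1 and g2, tumor T, normals N1 and N2.
tumor = [100, 100]
normals = [[50, 10], [10, 50]]

example_margin = margin([0.5, 0.5], tumor, normals)

# Illustrative grid scan over w1 (with w2 = 1 - w1) to find the best weights.
best_margin, best_w1 = max(
    (margin([i / 100, 1 - i / 100], tumor, normals), i / 100)
    for i in range(101)
)
print(example_margin, best_w1)
```

On this data the equal weighting is also the best one on the grid: putting more weight on either gene raises the combined value of one of the two normal tissues.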
Contrast with Previous Work
- Previous work focused just on classifiers:
  - Separating tumor class from corresponding normal class
  - Separating tumor from all other tumor tissues
  - Linear and quadratic Support Vector Machines [Brown et al., Moler et al., Ramaswamy et al., Su et al., Grate et al.]
  - Problem: Many cancers have poorly understood subtypes
- We focus on two combined aspects:
  - Classifiers separating tumor from all normal tissue classes
  - Clustering tumors based on this paradigm of separation
Traditional Clustering
- Cluster tissues based on similarity of gene expression patterns
  - Similar tissues have correlated gene expressions [Eisen et al., PNAS 1998]
- Problem: Genes driving the clustering
  - Large classes of genes that are all regulated together
    - Cell cycle and cell proliferation
    - Protein biosynthesis and cell growth
    - Respiration
- We need to weight these gene classes appropriately
Our Results
- Feature sets for tumor samples very small
  - Picks only one from a correlated set of genes
  - Genes with different functions expressed in different normal tissues
- Hierarchically cluster tumor samples:
  - Similarity metric for two tumor sets = Combined Margin
  - Tumor samples with similar feature sets group together
  - Identify natural clusters of tumor samples
- Construct feature sets for each cluster:
  - Biological significance
Clustering: Hardness
- Given:
  - Set of n tumors
  - Margin M
- Find largest tumor subset with margin $\ge M$
- Problem is hard to approximate within a factor of $n^{1-\epsilon}$
- Reduction from maximum clique problem
Clustering: Algorithm
[Figure: axes Gene x and Gene y; tumors A, B, C, D; normal samples; margins m1 and m2; tree nodes E, F, G, H]
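A minimal sketch of agglomerative clustering driven by combined margin, under assumptions that are not confirmed by the slides: there are only two genes, so the weights can be scanned on a grid; the combined margin of a tumor set is taken to be the smallest tumor value minus the largest normal value under one shared weight vector; and the two clusters with the largest combined margin are merged greedily. The function names and the data are hypothetical.

```python
from itertools import combinations


def combined_margin(tumors, normals, steps=100):
    """Best margin of one two-gene weight vector shared by all `tumors`.

    ASSUMPTION: combined margin = min over tumors - max over normals,
    maximized over w1 + w2 = 1 by a grid scan (not an exact solver).
    """
    best = float("-inf")
    for i in range(steps + 1):
        w1, w2 = i / steps, 1 - i / steps
        tumor_values = [w1 * x + w2 * y for x, y in tumors]
        normal_values = [w1 * x + w2 * y for x, y in normals]
        best = max(best, min(tumor_values) - max(normal_values))
    return best


def cluster(samples, normals):
    """Greedy agglomerative clustering; returns the merges in order.

    samples: dict mapping a tumor name to its (gene x, gene y) expression.
    Each merge is recorded as (combined margin, left names, right names).
    """
    clusters = {(name,): [expr] for name, expr in samples.items()}
    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters whose union keeps the largest margin.
        a, b = max(
            combinations(sorted(clusters), 2),
            key=lambda p: combined_margin(clusters[p[0]] + clusters[p[1]], normals),
        )
        merged = clusters.pop(a) + clusters.pop(b)
        merges.append((combined_margin(merged, normals), a, b))
        clusters[a + b] = merged
    return merges


# Hypothetical data: A and B over-express gene x, C and D over-express gene y.
samples = {"A": (90, 20), "B": (95, 25), "C": (20, 90), "D": (25, 95)}
normals = [(30, 30), (40, 10), (10, 40)]
merges = cluster(samples, normals)
for m, left, right in merges:
    print(m, left, right)
```

With this data, A joins B and C joins D before the two pairs are merged, and the recorded margin does not increase from one merge to the next.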
Cluster Boundaries
- Each node in tree labeled with combined margin of tumor samples in sub-tree
- Margin reduces as we move up the tree
- Chop tree at a chosen margin cut-off
  - Sub-trees are the clusters
- Breast cancer samples group into three clusters:
  - ERBB2 (ERBB2 and GRB7)
  - Luminal A type (ESR1, NAT1 and GATA3)
  - Basal cell type(?) (Keratin, Fibrillin and Fibronectin)
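Chopping the tree at a margin cut-off can be sketched as a short recursion. The tree encoding (a leaf is a sample name, an internal node is a tuple of combined margin, left child, right child), the function names, and the numbers are assumptions made for illustration. The sketch relies on the property stated above that the margin only decreases toward the root.

```python
def leaves(node):
    """All sample names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def cut(node, cutoff):
    """Clusters = maximal sub-trees whose combined margin is >= cutoff."""
    if isinstance(node, str) or node[0] >= cutoff:
        return [leaves(node)]
    _, left, right = node
    return cut(left, cutoff) + cut(right, cutoff)


# Hypothetical tree: internal nodes are (combined margin, left, right).
tree = (25, (50, "A", "B"), (50, "C", "D"))
clusters = cut(tree, cutoff=40)
print(clusters)
```

A lower cut-off yields fewer, larger clusters; a cut-off above every node's margin leaves each sample in its own cluster.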
Properties of Feature Sets
- Feature set for a tumor cluster:
  - Has at most 20 genes
  - Most of the weight concentrated on a few genes

Tumor Cluster        Genes                  Fraction of weight
ERBB2 Breast         ERBB2                  65%
Luminal A Breast     ESR1, NAT1, GATA3      55%
Prostate sub-type    AMACR                  40%
Ovarian sub-type     MSLN, PAX8, COL1A2     65%
Quality of Clustering
- Random partitioning of tumor samples:
  - Divide tumor samples randomly into training and test groups
  - Cluster training group
  - Find cluster with best feature set margin for test sample
  - Label the sample with the tumor type for that cluster
- Classifies unknown tumor samples accurately
  - At least 75% accuracy in categorizing test samples
  - At least 90% accuracy for CNS, Breast, Kidney, Ovary and Prostate cancers
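The labeling step of this evaluation can be sketched as follows, assuming each training cluster has already been reduced to a weight vector (its feature set). The cluster labels, weights, and expression values are hypothetical and are not data from the study; `margin` and `classify` are names chosen here.

```python
def margin(weights, sample, normals):
    """Weighted value of `sample` minus the largest weighted normal value."""
    v_s = sum(w * e for w, e in zip(weights, sample))
    v_n = max(sum(w * e for w, e in zip(weights, n)) for n in normals)
    return v_s - v_n


def classify(sample, feature_sets, normals):
    """Label of the cluster whose feature set gives the best margin."""
    return max(
        feature_sets,
        key=lambda label: margin(feature_sets[label], sample, normals),
    )


# Hypothetical feature sets (gene weights) learned from the training group.
feature_sets = {"type 1": [1.0, 0.0], "type 2": [0.0, 1.0]}
normals = [[30, 30], [40, 10], [10, 40]]

# Hypothetical held-out samples with their true tumor type.
test_samples = [([85, 15], "type 1"), ([20, 80], "type 2"), ([90, 30], "type 1")]

correct = sum(
    classify(sample, feature_sets, normals) == label
    for sample, label in test_samples
)
accuracy = correct / len(test_samples)
print(accuracy)
```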
Discussion
- Small feature sets for a tumor class:
  - Based only on discriminating it versus normal tissues
  - Property: Also discriminates it from other tumor classes
  - Highly expressed genes unique to the tumor class
- Biological validation of our method:
  - ERBB2 and ESR1 can be targeted by monoclonal antibodies
    - Some of the most effective treatments for breast cancers
  - AMACR is a recently recognized prostate cancer marker
    - Function not very well understood
  - MSLN is a well-studied ovarian cancer marker
Expanding Feature Sets
- Consider weighted combinations which have close to optimal margin
  - Let optimal margin = $M$
  - $P(\delta)$ = Polytope of feature sets with margin $\ge M - \delta$
  - Find weight vector with min Euclidean norm in $P(\delta)$
- Intuition:
  - Manhattan norm of any weight vector = 1
  - Minimizing Euclidean norm spreads the weights
- Around 100 genes in feature set
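The intuition can be made concrete with a short derivation. This is a sketch under stated assumptions: the slack parameter, whose symbol is missing from the transcript, is written $\delta$ here, and $k$ (introduced here) is the number of genes with non-zero weight. The first display states the minimum-norm problem for a single tumor $t$; the second applies the Cauchy-Schwarz inequality to show that a weight vector with small Euclidean norm must use many genes.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_g w_g^2 \\
\text{subject to}\quad & \sum_g w_g \, e_g(t) - \sum_g w_g \, e_g(n) \ge M - \delta
  \quad \text{for all } n \in N, \\
& \sum_g w_g = 1, \qquad w_g \ge 0 \ \text{for all } g.
\end{aligned}

1 = \Big( \sum_{g \,:\, w_g > 0} w_g \Big)^2 \le k \sum_g w_g^2
\quad \Longrightarrow \quad
\lVert w \rVert_2 \ge \frac{1}{\sqrt{k}}
```

Equality holds only when the $k$ non-zero weights are all equal to $1/k$, so lowering the Euclidean norm pushes weight onto more genes.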
Genes in Larger Feature Sets
- Genes with similar expression patterns:
  - Example: ERBB2 and GRB7
- Genes expressed across cancer types:
  - Not very strongly expressed
  - Do not drive the clustering
  - Example: Proliferation and cell cycle related genes
    - C20ORF1, CENPF, NUF2R, TOPK, L2DTL, KNSL1, …
  - Example: Possible alterations to chromosome 22
    - PRAME
Future Work
- Identify cell surface proteins in feature sets
  - Possible use in chemotherapy and diagnostics
  - Findings for Ovarian and Pancreatic cancers being tested in the laboratory
- Identify genes highly expressed across cancer types:
  - Examples: TFAP2A, ADAM12 and LOX
  - Biological significance?
- Succinct representations for biological functions:
  - Examples: Cell cycle, respiration, …
  - Applications in clustering and modeling gene expression