Learning disjunctions in Geronimo’s regression trees
Felix Sanchez Garcia
supervised by Prof. Dana Pe’er
Motivation
• Glioblastoma: the most common primary brain tumour in adults.
• Newly diagnosed patients have an average survival of 1 year.
• Need for better models of the regulatory network.
• Data used to create models: microarrays
– # genes: 8000
– # candidate regulators: 800
– # samples: 120
Module networks
• Bayesian model that benefits from high correlation of groups of
variables [2]
• Algorithm similar to EM (but with hard decisions). Loop:
– Module assignment step: assign variables to modules
– Structure search step: calculate CPD for each module
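The two-step loop above can be sketched in a few lines. This is only an illustration under strong assumptions: the per-module CPD is replaced by a plain mean expression profile rather than a regression tree, the fit is measured by squared error rather than a Bayesian score, and the names `hard_em`, `mean_profile` and `sq_dist` are hypothetical.

```python
def mean_profile(rows):
    """Column-wise mean of a list of equal-length expression vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_em(expression, assignment, max_iter=20):
    """Alternate the two steps until the assignment stops changing.

    expression: dict gene -> list of values, one per sample
    assignment: dict gene -> initial module id
    """
    assignment = dict(assignment)
    for _ in range(max_iter):
        # Structure search step (stand-in): fit one mean profile per module.
        # A module that has lost all its genes simply disappears.
        members = {}
        for gene, module in assignment.items():
            members.setdefault(module, []).append(expression[gene])
        profiles = {m: mean_profile(rows) for m, rows in members.items()}

        # Module assignment step: hard decision, each gene goes to the
        # single best-fitting module (no soft responsibilities as in EM).
        new_assignment = {
            gene: min(profiles, key=lambda m: sq_dist(values, profiles[m]))
            for gene, values in expression.items()
        }
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment


# Toy data: g1/g2 share one pattern, g3/g4 the opposite pattern.
toy_expression = {
    "g1": [1.0, 1.0, -1.0, -1.0],
    "g2": [1.1, 0.9, -1.0, -1.1],
    "g3": [-1.0, -1.0, 1.0, 1.0],
    "g4": [-0.9, -1.1, 1.1, 1.0],
}
toy_result = hard_em(toy_expression, {"g1": 0, "g2": 0, "g3": 1, "g4": 0})
```

Like any hard-decision scheme, the loop can stop in a local optimum, so the result depends on the initial assignment.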
[Figure: diagram with panels labelled Module 1, Module 2, Module 3 and Module 4]
Regression trees as CPD
• Regression trees are used for each module’s CPD
• Internal nodes: condition on a single variable
• Leaf nodes: parameters of a normal distribution
• Bayesian score
[Figure: example tree with internal-node conditions x < 0.3 and y > -0.2; score formula annotated with "prior on structure (complexity + biological penalties)" and "pdf of normal-gamma"]
• Exhaustively calculates score for each split for each regulator
[Figure: target gene’s values sorted by regulator]
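The exhaustive search can be sketched as follows: for every regulator, sort the target values by that regulator and score every cut point. The sketch assumes the leaf score is the log marginal likelihood of a normal model under a normal-gamma prior; the hyperparameters `mu0`, `kappa0`, `alpha0`, `beta0` and all function and regulator names are illustrative assumptions, and the structure prior (complexity and biological penalties) is left out.

```python
import math


def log_marginal_normal_gamma(values, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Log marginal likelihood of the values under a normal likelihood
    with a normal-gamma prior on (mean, precision)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    ss = sum((v - mean) ** 2 for v in values)
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (mean - mu0) ** 2 / (2.0 * kappa_n)
    return (math.lgamma(alpha_n) - math.lgamma(alpha0)
            + alpha0 * math.log(beta0) - alpha_n * math.log(beta_n)
            + 0.5 * (math.log(kappa0) - math.log(kappa_n))
            - 0.5 * n * math.log(2.0 * math.pi))


def best_split(regulators, target):
    """Try every cut point of every regulator; return (score, name, threshold).

    regulators: dict name -> list of expression values, one per sample
    target:     list of target-gene values, one per sample
    """
    best = None
    for name, reg_values in regulators.items():
        # Sort samples by this regulator's expression.
        order = sorted(range(len(target)), key=lambda i: reg_values[i])
        sorted_reg = [reg_values[i] for i in order]
        sorted_target = [target[i] for i in order]
        for k in range(1, len(target)):
            if sorted_reg[k] == sorted_reg[k - 1]:
                continue  # no threshold can separate tied regulator values
            threshold = 0.5 * (sorted_reg[k - 1] + sorted_reg[k])
            score = (log_marginal_normal_gamma(sorted_target[:k])
                     + log_marginal_normal_gamma(sorted_target[k:]))
            if best is None or score > best[0]:
                best = (score, name, threshold)
    return best


# Toy data with hypothetical regulator names.
toy_regulators = {
    "r_good": [-1.0, -0.8, -0.6, 0.6, 0.8, 1.0],
    "r_noise": [0.3, -0.9, 0.5, -0.2, 0.9, -0.4],
}
toy_target = [-2.0, -2.1, -1.9, 2.0, 2.1, 1.9]
toy_score, toy_name, toy_threshold = best_split(toy_regulators, toy_target)
```

Rescoring each prefix from scratch keeps the sketch short; a real implementation would more likely update the sufficient statistics incrementally as the cut point moves along the sorted values.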
Incorporating pathway information
• Biological pathways: contain sets of genes and represent chains of biochemical reactions that perform some function
• Aberrations in glioblastoma tend to occur as disjunctions within pathways: deregulating one component is usually enough to alter the function of the whole pathway [4]
• Idea: use pathway information to obtain a better model
• Methodology: extend node conditions to disjunctions of conditions on pathway elements
• We will use 15 sets of regulators (20-30 genes per set)
– 5 sets of regulators of pathways known to be related to cancer
– 5 sets of regulators of other pathways
– 5 sets of regulators chosen at random
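A disjunctive node condition of the kind proposed above could be evaluated as in this minimal sketch. The representation of a condition as a `(gene, operator, threshold)` triple, the function names and the gene names are all hypothetical choices made for illustration.

```python
def disjunction_holds(sample, conditions):
    """True if at least one single-gene threshold condition is satisfied.

    sample:     dict gene -> expression value
    conditions: list of (gene, op, threshold) with op either "<" or ">"
    """
    for gene, op, threshold in conditions:
        value = sample[gene]
        if op == "<" and value < threshold:
            return True
        if op == ">" and value > threshold:
            return True
    return False


def split_samples(samples, conditions):
    """Partition sample indices by whether the disjunction holds."""
    true_idx, false_idx = [], []
    for i, sample in enumerate(samples):
        (true_idx if disjunction_holds(sample, conditions) else false_idx).append(i)
    return true_idx, false_idx


# Hypothetical pathway with two members; one aberrant member is enough.
toy_conditions = [("geneA", "<", 0.3), ("geneB", ">", -0.2)]
toy_samples = [
    {"geneA": 0.1, "geneB": -0.5},  # geneA condition holds
    {"geneA": 0.9, "geneB": 0.4},   # geneB condition holds
    {"geneA": 0.9, "geneB": -0.5},  # neither holds
]
toy_true, toy_false = split_samples(toy_samples, toy_conditions)
```

An ordinary single-variable node is the special case where the list holds exactly one condition.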
Problem setting
• Concept class: disjunctions of threshold functions, each on a single variable
• Loss function: negative Bayesian score (biological penalty?)
• Potential number of hypotheses: 2^{m}
• Related classification problem tackled by Marchand and Shah (2005) [6] and Kestler et al. (2006) [5].
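Because the number of candidate disjunctions grows exponentially, one way to avoid enumerating them is greedy forward selection of disjuncts. The sketch below is a heuristic swapped in for illustration only: it is not the Set Covering Machine of the cited work, and it is not claimed to be the method used in this project. It also assumes that each candidate condition is already evaluated to one boolean per sample, and it uses the summed squared error of the two resulting groups as a stand-in for the negative Bayesian score.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_loss(mask, target):
    """Stand-in loss: squared error within the two groups induced by mask."""
    true_vals = [t for m, t in zip(mask, target) if m]
    false_vals = [t for m, t in zip(mask, target) if not m]
    return sse(true_vals) + sse(false_vals)


def greedy_disjunction(candidates, target, max_terms=3):
    """Add one condition at a time while the loss strictly decreases.

    candidates: list of (name, list of bool per sample)
    Returns (chosen condition names, final loss).
    """
    chosen = []
    mask = [False] * len(target)
    best_loss = sse(target)  # loss of the empty disjunction (no split)
    while len(chosen) < max_terms:
        best = None
        for name, cond in candidates:
            if name in chosen:
                continue
            new_mask = [a or b for a, b in zip(mask, cond)]
            loss = split_loss(new_mask, target)
            if loss < best_loss - 1e-12 and (best is None or loss < best[0]):
                best = (loss, name, new_mask)
        if best is None:
            break  # no remaining condition improves the split
        best_loss, name, mask = best
        chosen.append(name)
    return chosen, best_loss


# Toy data: the target is high whenever A or B is aberrant; C is a distractor.
toy_candidates = [
    ("A_high", [True, True, False, False, False, False]),
    ("B_high", [False, False, True, True, False, False]),
    ("C_high", [True, False, False, False, True, False]),
]
toy_target = [5.0, 5.0, 5.0, 5.0, 0.0, 0.0]
toy_chosen, toy_loss = greedy_disjunction(toy_candidates, toy_target)
```

The greedy search evaluates on the order of `max_terms` times the number of candidates, at the cost of possibly missing the best disjunction.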
Bibliography
1. Pe'er, D., Bayesian Network Analysis of Signaling Networks: A Primer. Sci. STKE, 2005. 2005(281): p. pl4.
2. Segal, E., et al., Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet, 2003. 34(2): p. 166-176.
3. Lee, S.-I., et al., Identifying regulatory mechanisms using individual variation reveals key role for chromatin modification. Proceedings of the National Academy of Sciences, 2006. 103(38): p. 14062-14067.
4. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 2008. 455(7216): p. 1061-1068.
5. Kestler, H., W. Lindner, and A. Müller, Learning and Feature Selection Using the Set Covering Machine with Data-Dependent Rays on Gene Expression Profiles, in Artificial Neural Networks in Pattern Recognition. 2006. p. 286-297.
6. Marchand, M. and M. Shah, PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data, in Advances in Neural Information Processing Systems 17. 2005, MIT Press: Cambridge, MA. p. 881-888.