Biocomputational Puzzles
Dennis Shasha
Courant Institute, New York Univ
[email protected]
Collaborations with labs of Philip Benfey
(bio, Duke) , Gloria Coruzzi (bio, NYU),
Allen Mincer (physics, NYU), Ken
Birnbaum, Laurence Lejay, Peter
Palenchar (bio, NYU), Rodrigo Gutierrez,
Manny Katari, Chris Poultney
Overview
• Activist Data Mining
• Transcription factor-cis-element prediction
• Visualization tool for multi-factor data
• Time series for fun and profit.
Classical Data Mining
• Wait for data to appear
• Find patterns in it.
• Hope they are actionable.
• Works well when data is pertinent, e.g.
Amazon’s other books recommendation,
extrapolation of trends.
Activist Data Mining
• Propose initial experiments to explore
subspace of some predefined search
space
• Evaluate the results
• Propose new experiments, evaluate,
propose, evaluate, propose ….
• Iterative and adaptive
Which is Better for Natural
Science?
• Classical is obviously right when you have
no control over data generation.
• When you do, active data mining (active
learning) may work much better.
Natural Science Lesson 1
• Natural scientists, like most people, care
about their own time. Computational time
matters only if it costs people time.
• If you can get more insight out of their
data, they like you.
• If you can save them experimental time,
they love you.
• If you can lead them to new discoveries,
they will give you cookies!
Activist Data Mining Philosophy
Passive Approach: Natural scientists do experiments.
Computer Scientists help to glean something from it.
Activist Approach: Computer scientists help
(1) Design experiments
(2) Analyze results
(3) Design new experiments based on results
Activist Data Mining Philosophy (Reminder)
Passive Approach: Natural scientists do experiments.
Computer Scientists help to glean something from it.
Activist Approach: Computer scientists help
(1) Design experiments
(2) Analyze results
(3) Design new experiments based on results
Our particular methodology:
Adaptive Combinatorial Design
Our innovation: applying this combinatorial design in an
iterative way.
Safecracking Puzzle: you are a thief…
Combinatorial Safe: 10 switches with 3 settings each.
Over 59,000 (3^10) possible configurations. However
there is a certain pair of switches (you don’t know
which pair) and a certain pair of values of those
switches that will open the safe.
Illustration:
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
(e.g. one particular switch set to C and another set to A)
Challenge:
Open the safe in 15 configurations or fewer.
Scientific Goal
We want to describe the factors (e.g. light, carbon,
nitrogen….) that determine whether plants will
produce critical amino acids and how those factors
interact.
Long-term goal: Virtual plant (and later … frankenfoods)
Light, Carbon and Amino acids
differentially regulate N-assimilation genes
[Pathway diagram: Light, Carbon and Amino acids act on GS2, whose product is Gln (C:N = C5:N2), and on AS1, whose product is Asn (C:N = C4:N2).]
Design Space
Inputs:
*Light
*Starvation to Various Nutrients
*Carbon
*Inorganic N (NO3/NH4)
*Organic N (Glu)
*Organic N (Gln)
If inputs take binary values (first approximation):
6 binary (+/-) inputs = 2^6 = 64 input combinations
(or treatments)
Use 2-factor combinatorial design to reduce
number of treatment combinations required to cover
the experimental space, assuming that important
interactions will have to do with two factors.
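The counting behind this reduction can be sketched in a few lines. This is only an illustration of the arithmetic; the factor names are shortened from the list above and the variable names are chosen here.

```python
from itertools import combinations, product
from math import comb

# Factor names follow the slide; "+"/"-" is the binary first approximation.
FACTORS = ["Light", "Starvation", "Carbon", "Inorganic N", "Glu", "Gln"]
LEVELS = ("+", "-")

# Full factorial: every combination of the six binary inputs.
full_factorial = list(product(LEVELS, repeat=len(FACTORS)))

# Two-factor requirement: every pair of factors, every pair of values.
factor_pairs = list(combinations(FACTORS, 2))
pair_value_combos = len(factor_pairs) * len(LEVELS) ** 2

# A single treatment fixes all six inputs, so it supplies one value
# combination for each of the factor pairs at once.
covered_per_treatment = comb(len(FACTORS), 2)

print(len(full_factorial))    # 64
print(len(factor_pairs))      # 15
print(pair_value_combos)      # 60
print(covered_per_treatment)  # 15
```

Because each treatment covers 15 of the 60 pair-value combinations, a handful of well-chosen treatments can in principle cover them all, which is what the 2-factor design exploits.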
“Combinatorial design” finds six conditions to explore every
pairwise interaction. Want to discover important factors.
EXPT 1: NO PIVOT
LANE  ILLUMIN  STARVE  CARBON  NO3NH4  GLU  GLN
1     DARK     Y       0       L       H    H
2     DARK     N       L       0       0    H
3     LIGHT    N       0       0       H    0
4     LIGHT    Y       L       L       0    0
5     LIGHT    Y       L       0       H    H
6     DARK     N       0       L       0    0
Notice: for each pair of input factors and each combination of values
from those factors, some experiment has that combination, e.g.
Light with no Carbon (lane 3); Starve with no Glu (lane 4).
After doing this experiment,
certain factors jump out as worth further study: Illumination,
Carbon (both have significant repressive correlations)
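The coverage property noted above can be checked mechanically. The sketch below copies the six lanes from the no pivot table and assumes that each factor takes only the two levels that appear in that table; the function name is illustrative.

```python
from itertools import combinations

FACTORS = ["ILLUMIN", "STARVE", "CARBON", "NO3NH4", "GLU", "GLN"]

# The six lanes of the no pivot design, copied from the table.
NO_PIVOT = [
    ("DARK",  "Y", "0", "L", "H", "H"),
    ("DARK",  "N", "L", "0", "0", "H"),
    ("LIGHT", "N", "0", "0", "H", "0"),
    ("LIGHT", "Y", "L", "L", "0", "0"),
    ("LIGHT", "Y", "L", "0", "H", "H"),
    ("DARK",  "N", "0", "L", "0", "0"),
]

def uncovered_pairs(design, factors):
    """List (factor_a, factor_b, value_a, value_b) combinations no lane contains."""
    missing = []
    for i, j in combinations(range(len(factors)), 2):
        # Assumption: the levels of a factor are those seen in the design.
        values_i = sorted({lane[i] for lane in design})
        values_j = sorted({lane[j] for lane in design})
        seen = {(lane[i], lane[j]) for lane in design}
        for a in values_i:
            for b in values_j:
                if (a, b) not in seen:
                    missing.append((factors[i], factors[j], a, b))
    return missing

print(uncovered_pairs(NO_PIVOT, FACTORS))      # empty list: every pair is covered
print(uncovered_pairs(NO_PIVOT[:5], FACTORS))  # dropping lane 6 leaves gaps
```

Dropping any one lane is a quick way to see that the six lanes are not redundant for this table.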
Adaptation Following No Pivot Design
Key to activist data mining is adapting to results of
experiments already done. Many ways to do this,
e.g. Tong, S. & Koller, D. Active Learning for
Structure in Bayesian Networks. Seventeenth
International Joint Conference on
Artificial Intelligence, 863-869 (2001).
Advocates pool-based active learning. Pool of unlabeled
instances (don’t know output value). An active
learner chooses which instance to query next in
hope it will reduce set of possible answers.
Ideker, T. E., Thorsson, V. & Karp, R. M. Discovery
of regulatory interactions through perturbation:
inference and experimental design. Pac Symp
Biocomput, 305-16 (2000).
Similar idea with Boolean circuits.
Our Use of Adaptation
Find most important inputs in order to see their
effects in more detail.
That is, we focus our search space on those inputs
that are likely to exert the most influence over
outputs of interest.
Three questions of particular interest
1. Is any single factor so important that its presence
determines the outcome regardless of the other
contexts? (e.g. Light in context X is repressive
compared with Dark in context Y for all X, Y)
2. Is a factor important enough that it has an effect
for any particular context? (e.g. for all X, Light in
context X is repressive compared with Dark in X)
3. Is a factor consistently important when compared
with a fixed background? (e.g. for all X, is Light in
context X repressive compared with background?)
Pivot Design 1: Start with no pivot design
EXPT 1: NO PIVOT
LANE  ILLUMIN  STARVE  CARBON  NO3NH4  GLU  GLN
1     DARK     Y       0       L       H    H
2     DARK     N       L       0       0    H
3     LIGHT    N       0       0       H    0
4     LIGHT    Y       L       L       0    0
5     LIGHT    Y       L       0       H    H
6     DARK     N       0       L       0    0
Create dark and light pairs by just setting Illumin to light and dark
respectively.
Pivot Design 2: Dark Design
EXPT 1: NO PIVOT
LANE  ILLUMIN  STARVE  CARBON  NO3NH4  GLU  GLN
1     DARK     Y       0       L       H    H
2     DARK     N       L       0       0    H
3     DARK     N       0       0       H    0
4     DARK     Y       L       L       0    0
5     DARK     Y       L       0       H    H
6     DARK     N       0       L       0    0
Exactly the same as no pivot tests but with DARK everywhere.
Requires only three more experiments than in no pivot case.
Pivot Design 3: Light Design
EXPT 1: NO PIVOT
LANE  ILLUMIN  STARVE  CARBON  NO3NH4  GLU  GLN
1     LIGHT    Y       0       L       H    H
2     LIGHT    N       L       0       0    H
3     LIGHT    N       0       0       H    0
4     LIGHT    Y       L       L       0    0
5     LIGHT    Y       L       0       H    H
6     LIGHT    N       0       L       0    0
Exactly the same as DARK tests but with Light everywhere. Again,
three more experiments than in no pivot case.
Important: First experiment for light = First Experiment for Dark
except for Illumination itself. Differs only in pivot. Minimal pair.
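A minimal sketch of how the dark and light designs follow from the no pivot lanes, and of why each needs only three new experiments. The lane values are copied from the tables; the helper name pivot_design is chosen here for illustration.

```python
FACTORS = ["ILLUMIN", "STARVE", "CARBON", "NO3NH4", "GLU", "GLN"]

# No pivot lanes, copied from the table.
NO_PIVOT = [
    ("DARK",  "Y", "0", "L", "H", "H"),
    ("DARK",  "N", "L", "0", "0", "H"),
    ("LIGHT", "N", "0", "0", "H", "0"),
    ("LIGHT", "Y", "L", "L", "0", "0"),
    ("LIGHT", "Y", "L", "0", "H", "H"),
    ("DARK",  "N", "0", "L", "0", "0"),
]

def pivot_design(design, pivot, value):
    """Copy the design with the pivot column forced to a single value."""
    return [lane[:pivot] + (value,) + lane[pivot + 1:] for lane in design]

PIVOT = FACTORS.index("ILLUMIN")
dark = pivot_design(NO_PIVOT, PIVOT, "DARK")
light = pivot_design(NO_PIVOT, PIVOT, "LIGHT")

# Lane k of the dark design and lane k of the light design form a minimal pair.
minimal_pairs = list(zip(dark, light))

# Lanes not already run as part of the no pivot design.
new_dark = [lane for lane in dark if lane not in NO_PIVOT]
new_light = [lane for lane in light if lane not in NO_PIVOT]

for d, l in minimal_pairs:
    differing = [FACTORS[k] for k in range(len(FACTORS)) if d[k] != l[k]]
    print(d, l, differing)
print(len(new_dark), len(new_light))
```

Every printed pair differs in ILLUMIN only, and each pivot design adds three lanes beyond the six already run.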
What Was Accomplished
A set of well-spaced minimal pairs, differing only in
the pivot. Suggests answers for first two questions:
• Is any single factor so important that its presence
determines the outcome regardless of the other
contexts? (e.g. Light in context X is repressive
compared with Dark in context Y for all X, Y).
No, for this biological system.
• Is a factor important enough that it has an effect
for any particular context? (e.g. for all X, Light in
context X is repressive compared with Dark in X)
Yes, for this biological system.
“Half-pivot” Light against a fixed background
EXPT 1: NO PIVOT
LANE  ILLUMIN  STARVE  CARBON  NO3NH4  GLU  GLN
1     LIGHT    Y       0       L       H    H
2     LIGHT    N       L       0       0    H
3     LIGHT    N       0       0       H    0
4     LIGHT    Y       L       L       0    0
5     LIGHT    Y       L       0       H    H
6     LIGHT    N       0       L       0    0
7     DARK     N       0       0       0    0
Exactly the same as LIGHT tests but with one added
background.
Allows us to create a circuit (binary in this case because
inputs are binary.)
[Fig. 3: Boolean circuit with output ASN1, composed of AND and OR gates over inputs labeled B, C, E, L, N, Q and S.]
Boolean Is Not the Only Representation
Circuits could consist of continuous functions.
D'Haeseleer, P., Liang, S. & Somogyi, R. Genetic
network inference: from co-expression clustering to
reverse engineering. Bioinformatics 16, 707-26
(2000).
Discusses pros and cons of such an approach. Major con
is that feedback is hard to represent correctly.
Major pro is that noise is less of an issue.
Huang, S. Genomics, complexity and drug discovery:
insights from Boolean network models of cellular
regulation. Pharmacogenomics 2, 203-22 (2001).
Argues that Boolean models are well suited to genomics.
Adaptive Experimental Design along “Borders”
Because combinatorial design explores only a (well spread)
subset of possibilities, the apparent effects of factors
may depend on other factors that haven’t been
explored.
After constructing boolean circuits, software suggests
“experiments to clarify border” between inductive and
non-inductive, e.g.,
Starvation_Y, Carbon_N, NH4NO3_N, GLU_Y, GLN_Y
Steps of Methodology
No Pivot: Small set of well-spaced experiments to find
most important influences on a target. Also, a good
method in genomics applications to find clusters
because of good spacing. Small? 10 inputs with 4
values gives a no pivot of about 30 experiments.
Pivot: Can find out whether an input is likely to have an
effect regardless of context (for all X, for all Y) or for
every context (for all contexts X)
Half-pivot: For comparison with a fixed background
Border Adaptation: Study differences between repressive
case and non-repressive one to discover fine structure.
Inspiration of this approach
Combinatorial design: Inspired by work in software
testing by
David Cohen, Siddhartha Dalal, Michael Fredman and
Gardner Patton at Bellcore/Telcordia.
Their problem: how to test a good set of inputs to a
program to discover whether there are any bugs.
Not program coverage, but input coverage.
Not all input combinations, but all combinations of
every pair of input variables (“no pivot” design).
Hypothesis: every input combination should give
same output: no error.
If true for designed subset, then program is ok.
How This Could Help You
Use this approach: Pose an experimental setting of
interest to you. (Names of input variables, possible
values).
Describe a “no pivot” design for your setting.
Based on that result, describe a pivot design to isolate
the exact effect of a specific input.
Get a good sense of whether the pivot is decisive by
itself or has a consistent strong influence.
Theoretical Guarantee: For k-factor design, if there is a
set of k values that dominates the result, you will
find it.
Combinatorial Design vs. Random Sampling
Practical Question: Adaptive Combinatorial Design is a
sampling method. How well does it work compared to
random sampling?
Simulation experiment: Create simulated data with a
single important attribute and microarray-quality
noise (factor of 2 to 5 change in biological
replicates).
Empirical Conclusions: Random and Adaptive Combinatorial
Design did equally well at identifying the important
attribute; however, Random falsely identified other
attributes as important about 4 times more often
than Adaptive Combinatorial Design (see
cdtables.doc).
Ref: Lejay et al. Systems Bio vol. 1.2 Dec. 2004
Safecracking Solution
Number  S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
1:      A  A  A  A  A  A  A  A  A  A
2:      A  B  B  B  B  B  B  B  B  B
3:      A  C  C  C  C  C  C  C  C  C
4:      B  A  B  C  A  B  C  A  B  C
5:      B  B  C  A  B  C  A  B  C  A
6:      B  C  A  B  C  A  B  C  A  B
7:      C  A  C  B  A  C  B  A  C  B
8:      C  B  A  C  B  A  C  B  A  C
9:      C  C  B  A  C  B  A  C  B  A
10:     X  A  A  A  B  B  B  C  C  C
11:     X  A  A  A  C  C  C  B  B  B
12:     X  B  B  B  A  A  A  C  C  C
13:     X  B  B  B  C  C  C  A  A  A
14:     X  C  C  C  A  A  A  B  B  B
15:     X  C  C  C  B  B  B  A  A  A
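One way to convince yourself that these 15 configurations always open the safe is to check every pair of switches and every pair of settings. The sketch below does that; it treats X as a don't-care setting and, conservatively, never counts an X toward coverage. The function name is illustrative.

```python
from itertools import combinations

SETTINGS = "ABC"

# The 15 configurations from the table, one character per switch S1..S10.
ROWS = [
    "AAAAAAAAAA", "ABBBBBBBBB", "ACCCCCCCCC",
    "BABCABCABC", "BBCABCABCA", "BCABCABCAB",
    "CACBACBACB", "CBACBACBAC", "CCBACBACBA",
    "XAAABBBCCC", "XAAACCCBBB", "XBBBAAACCC",
    "XBBBCCCAAA", "XCCCAAABBB", "XCCCBBBAAA",
]

def missing_combinations(rows, n_switches=10):
    """Return (switch_i, switch_j, setting_i, setting_j) tuples no row contains."""
    missing = []
    for i, j in combinations(range(n_switches), 2):
        # Ignore rows where either switch is a don't-care.
        seen = {(r[i], r[j]) for r in rows if "X" not in (r[i], r[j])}
        for a in SETTINGS:
            for b in SETTINGS:
                if (a, b) not in seen:
                    missing.append((i + 1, j + 1, a, b))
    return missing

print(3 ** 10)                     # 59049 possible configurations
print(len(ROWS))                   # 15 configurations tried
print(missing_combinations(ROWS))  # empty list: every opening pair is hit
```

Whichever pair of switches and values opens the safe, at least one of the 15 rows contains it.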
Algorithmic (Heuristic) Idea
Finding the minimal no pivot design is NP-hard, but
heuristic solutions work well in practice (within a small
additive factor of the minimum):
1. At first, each pair of input variables (n choose 2) is
considered to be unfinished.
2. Basic step: choose a random set of disjoint
unfinished pairs. Finish them by designing
treatments to include all remaining value
combinations for each pair, again at random.
3. If an input variable is already finished or not a
member of a disjoint pair, put in a don’t care.
4. Fill up earlier don’t cares to complete unfinished
pairs.
5. Repeat 2-4 with different random seeds.
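For readers who want to experiment, here is a small randomized greedy construction of a pairwise design. It is not the disjoint-pair procedure in the steps above but a simpler substitute in the same spirit: each new row is seeded with one still-uncovered combination, the rest is filled at random, and the best of several candidates is kept. The function name and parameters are assumptions made for this sketch.

```python
import random
from itertools import combinations

def pairwise_design(levels, candidates=50, seed=0):
    """Greedy randomized design covering every pair of values.

    levels: one list of possible values per input variable.
    """
    rng = random.Random(seed)
    n = len(levels)
    index_pairs = list(combinations(range(n), 2))
    uncovered = {
        (i, a, j, b)
        for i, j in index_pairs
        for a in levels[i]
        for b in levels[j]
    }
    design = []
    while uncovered:
        pool = sorted(uncovered)  # sorted so runs are reproducible
        best_row, best_gain = None, 0
        for _ in range(candidates):
            # Seed the candidate with an uncovered combination so that
            # every row added to the design makes progress.
            i, a, j, b = rng.choice(pool)
            row = [rng.choice(values) for values in levels]
            row[i], row[j] = a, b
            gain = sum((p, row[p], q, row[q]) in uncovered for p, q in index_pairs)
            if gain > best_gain:
                best_row, best_gain = row, gain
        design.append(tuple(best_row))
        uncovered -= {(p, best_row[p], q, best_row[q]) for p, q in index_pairs}
    return design

# Four inputs with 3, 2, 2 and 3 values respectively.
levels = [list("ABC"), list("AB"), list("AB"), list("ABC")]
design = pairwise_design(levels)
for row in design:
    print(" ".join(row))
```

Rerunning with different values of seed and keeping the shortest design mirrors the last step of the heuristic.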
Example (1)
S1: A B C
S2: A B
S3: A B
S4: A B C
1. Choose disjoint unfinished pairs at random
{ {S1, S3}, {S2, S4}}
Generate 6 rows (experiments):
S1 S2 S3 S4
A  B  B  C
B  B  A  B
A  A  A  A
C  A  B  C
C  A  A  B
B  B  B  A
Example (2)
Not only is { {S1, S3}, {S2, S4}} finished but also:
{S2, S3}. Others are partly finished.
S1 S2 S3 S4
A  B  B  C
B  B  A  B
A  A  A  A
C  A  B  C
C  A  A  B
B  B  B  A
Example (3)
Try {S1, S4} next. Fill S2 and S3 with X (don’t care).
S1 S2 S3 S4
A  B  B  C
B  B  A  B
A  A  A  A
C  A  B  C
C  A  A  B
B  B  B  A
A  X  X  B
B  X  X  C
C  X  X  A
Example (4)
Now replace X for unfinished pairs: {S1, S2}, {S3, S4}
S1 S2 S3 S4
A  B  B  C
B  B  A  B
A  A  A  A
C  A  B  C
C  A  A  B
B  B  B  A
A  X  B  B
B  A  A  C
C  B  X  A
Further Reading: some other uses of combinatorial design in biology
Amir Ben-Dor, Richard Karp, Benno Schwikowski and Zohar
Yakhini. Universal DNA tag systems: a combinatorial design scheme.
RECOMB 2000.
Kerr and Churchill (2001). Experimental design for gene expression
microarrays. Biostatistics, 2:183-201.
Normal: N microarrays will be used to test N conditions
against a common reference.
Authors propose to use the colors to compare N conditions
against one another in a looping fashion: 1 with 2, 2 with 3, …, N with 1.
Result: deconvolves certain effects (e.g. binding affinity of
reference dye).
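The two layouts can be written down as lists of array pairings. This is only a sketch of the pairing pattern described above; the function names and the "reference" label are chosen here for illustration.

```python
def reference_design(n):
    """Each of n arrays compares one condition with the common reference."""
    return [(i, "reference") for i in range(1, n + 1)]

def loop_design(n):
    """Each of n arrays compares condition i with its successor, wrapping to 1."""
    return [(i, i % n + 1) for i in range(1, n + 1)]

print(reference_design(4))
print(loop_design(4))  # [(1, 2), (2, 3), (3, 4), (4, 1)]
```

Both layouts use N arrays, but in the loop every condition is measured on two arrays and no array is spent on the reference.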
Sungear Multifactor
Visualization
Joint work with Rodrigo Gutiérrez,
Manny Katari, Brad Paley, Chris
Poultney, and Gloria Coruzzi
Typical Genomic Questions
• Multiple experiments (multiple time points,
multiple conditions), many GO categories,
or other features of genes: want to know
when certain GO categories are highly
represented.
• Many species, want to know which genes
have presence in many species and
perhaps which GO categories
Computational Desires
• Simple, responsive interface
• Visualize lots of data
• Many ways to query
• Many different data representations
Sungear Design
• Generalizes Venn diagrams to more than
three factors
• Visual outline is an ellipse having anchors
on borders and vessels in the interior.
• Each vessel points to associated anchors.
• Linked views to hierarchies, lists, and
graphs, so can simultaneously update
data depending on user queries (selection
events).
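The data structure behind such a display can be sketched as a grouping of items by the set of anchors they belong to. This assumes that a vessel holds the items belonging to exactly one combination of anchors, as a region of a Venn diagram would; the gene and anchor names below are hypothetical, and this is not Sungear's actual code.

```python
from collections import defaultdict

def build_vessels(memberships):
    """Group items by the exact set of anchors they belong to.

    memberships: dict mapping item name -> set of anchor names.
    Returns a dict mapping frozenset of anchors -> sorted list of items.
    """
    vessels = defaultdict(list)
    for item, anchors in memberships.items():
        vessels[frozenset(anchors)].append(item)
    return {anchors: sorted(items) for anchors, items in vessels.items()}

# Hypothetical genes and the experiments (anchors) in which they respond.
memberships = {
    "gene1": {"N", "C"},
    "gene2": {"N", "C"},
    "gene3": {"L"},
    "gene4": {"N", "C", "L", "organ"},
}

for anchors, items in build_vessels(memberships).items():
    print(sorted(anchors), items)
```

With four or more anchors there can be up to 2^k - 1 non-empty combinations, which is why a fixed three-circle Venn layout stops being enough.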
Venn Diagram:
great for three factors
Sungear Principle
• “Sungear is stupid”
• Doesn’t care which kind of data it is
representing, though there is built-in
support for genes (because of links to GO
and to Cytoscape).
• Basic Sungear representation could be
used to describe anything from yachting
gear to demographics.
Seed Development Stages
(an example)
Stage No.  Description (main stage of embryogenesis)    Hours after flowering  Sample harvested           AtGE No.
1          up to four cells                             12 - 24                - (Weigel)
2          early globular to mid globular               24 - 48                -
3          mid globular to early heart                  48 - 66                siliques containing seeds  ATGE_76
4          early heart to late heart                    66 - 84                siliques containing seeds  ATGE_77
5          late heart to mid torpedo                    84 - 90                siliques containing seeds  ATGE_78
6          mid torpedo to late torpedo                  90 - 96                isolated seeds             ATGE_79
7          late torpedo to early walking-stick          96 - 108               isolated seeds             ATGE_81
8          walking-stick to early curled cotyledons     108 - 120              isolated seeds             ATGE_82
9          curled cotyledons to early green cotyledons  120 - 144              isolated seeds             ATGE_83
10         green cotyledons                             144 - 192              isolated seeds             ATGE_84
Demos
• Growth stages showing when genes are
transcribed (N-reg AtGenExpSeedDev)
• Blast comparison of Arabidopsis against
most fully sequenced organisms.
• Nitrogen, carbon, light, organ showing
regulation -- relative expression (cnlo)
• Interspecies comparisons that might show
which kinds of genes are missing in
gymnosperms, for example (Vicogenta)
Genes that respond to N in leaves and
C in roots form the largest group (cnlo)
PII and other genes involved in N-metabolism are among these 566
HYPOTHESIS: Most of the regulated
genes are involved in metabolism.
… this is not the case for other
processes
Genes that are regulated by
N & L together
Gene networks of NL-responsive
genes
Sungear Conclusion
• If you have lots of data about some
common entity (genes, people, goods,
whatever) and several factors or
experiments whose interaction you want to
visualize, Sungear is for you.
• Available late summer 2005.
(sungear2/run.bat)
Overall Conclusion
• Combinatorial design if you have a large
search space to analyze and you want an
intelligent sampling method.
• Sungear for visualizing experimental data.
Other Projects
• Diabetes treatment (Harvard med school):
given patient histories and interventions,
try to find best practice interventions.
• Time series work: fusing data from
different sources, detecting bursts over
many window lengths, query by humming
• Graph search and matching: given a
query graph find matching subgraphs.