Download supplementary information - Molecular Systems Biology

SUPPLEMENTARY INFORMATION
Strain Construction
Our seed genes were TEC1, CUP9, SFL1, SOK2, and SKN7. Tec1 is in the filamentation
MAP-kinase pathway. Sfl1 and Sok2 are involved in the Ras-cAMP pathway. Skn7 is in
a redox-sensing pathway. Cup9 is involved in ion homeostasis. The chosen transcription
factors have known protein-DNA interactions and overlap substantially in the genes they
bind. This overlap facilitated mapping of the molecular paths through which their
influences on gene expression and phenotype interact.
All strains are derivatives of a filamentation-competent Σ1278b wild-type strain, G85
(MATa/α ura3Δ0/ura3Δ0 his3Δ0::hisG/his3Δ0::hisG). Gene deletions were made using
a PCR-based strategy in which KanMX4 “barcode” deletion alleles (Winzeler et al.,
1999) or natMX4 alleles (Goldstein and McCusker, 1999) were amplified, transformed
into G85, and verified by PCR and tetrad dissection. Standard methods (Guthrie and Fink,
1991) were used for all transformations and crosses to construct homozygous diploid
mutant derivatives. Strains tec1::bcKanMX4, sfl1::bcKanMX4, sok2::bcKanMX4,
phd1::bcKanMX4, and rox1::bcKanMX4 are from Drees et al. (2005). The
deletion-insertion alleles, cup9::bcKanMX4, skn7::bcKanMX4, cin5::NatMX4,
mot3::NatMX4, sko1::NatMX4, and yap6::NatMX4 and all double mutants were
constructed for this study. Deletion of UME6 proved inviable in the Σ1278b genetic
background.
The CUP9 gene deletion was constructed using primers CUP9-up-F (5’-GCC TCC TGT
TTC TGT TAA TTG G-3’) and CUP9-down-R (5’-TCA GAC CAG GTT TCG ATG
AAG-3’). The SKN7 primers were SKN7-F1 (5’-AGG CTT GCT GCT TTT GTT TG-3’) and
SKN7-R1 (5’-AAT TTG AGA GCG GCA GAA AG-3’). The CIN5 primers
were CIN5-NatF (5’-AGA ATA ACA GCT TGG AAC AAG AAG GAA AAC CAA
AAA CCT ACT CAA GCA TAG GCC ACT AGT GGA TCT G-3’) and CIN5-NatR (5’-TGA
AAA CTT TTA AGA TGT TAC TAG TAC TAA TAA TTA TTC ATT ATT CAG
CTG AAG CTT CGT ACG C-3’). The MOT3 primers were MOT3-NatF (5’-AGG
CAA CAG TAG GCA AAT AGT AAA GGG ACA TAT CAT ATT TGA GCA GCA
TAG GCC ACT AGT GGA TCT G-3’) and MOT3-NatR (5’-GTT AAA TGA GTG
GGA AGG GAT ATT TTG TGT GTC TAT AAA GTC TAT CAG CTG AAG CTT
CGT ACG C-3’). The SKO1 primers were SKO1-NatF (5’-CAT TCC AAA TAC ACC
TGC CCA GTC TCT AGA CCC TGC TTA ATC ATT GCA TAG GCC ACT AGT
GGA TCT G-3’) and SKO1-NatR (5’-AAA GCA TCA GAT AGA AGA CTA TTT
AAG AAC CCC GTC GCT ATC TCG CAG CTG AAG CTT CGT ACG C-3’). The
YAP6 primers were YAP6-NatF (5’-GAA ATT TCA ATA AAC AAC AGA ATA ACG
AAG AGT GCT AAG GGA CAA GCA TAG GCC ACT AGT GGA TCT G-3’) and
YAP6-NatR (5’-GAT CTT CCA GTA CTA GAG ATC AAT ATC TGC TCC CTA TTT
ATT GTA CAG CTG AAG CTT CGT ACG C-3’).
Assays of filamentous growth
Synthetic Low-Ammonium Dextrose (SLAD) agar with uracil and histidine was used to
induce filamentous growth (Gimeno et al., 1992). Synthetic Complete Dextrose (SCD)
was used for yeast-form growth. To test filamentation phenotypes, strains were streaked
on SLAD plates, incubated at 30 °C for 8 hours, and imaged using a Nikon Coolpix 990
digital camera mounted on a Nikon TS100 inverted microscope with a 40X objective.
Colonies were examined for elongated cell morphology, unipolar budding patterns, and
invasive growth into the solid medium and scored for overall filamentation relative to the
wild-type strain. Phenotype inequalities were deduced by comparing filamentation of the
four relevant strains (wild type, both single mutants, and the double mutant) on the same
plate to avoid plate-to-plate variation.
Yeast-filamentation expression-profiling experiments
We collected expression profiles in triplicate (Thompson et al., 2005) of wild-type and
mutant strains grown under filamentous-form conditions for 10 hours, as previously
described (Prinz et al., 2004). Target labeling with the GeneChip® One-Cycle Target
Labeling kit and hybridization to Yeast Genome S98 Arrays were done according to the
manufacturer’s protocols (www.affymetrix.com).
Microarray data were collected for yeast diploid strains of 16 genotypes including wild
type, deletion mutants for each of the 5 transcription factors, and double-deletion mutants
for all 10 double-mutant combinations of the 5 transcription factors. Microarray data
were normalized using robust multi-array averaging (RMA) (Irizarry et al., 2003) as
implemented in the BioConductor software package (Gentleman et al., 2004). Each gene
expression data point was taken as the mean of the three corresponding biological
replicates. We verified that the mean variation between biological replicates was less
than the mean variation between different strains (data not shown). Expression
intensities for each gene were transformed into log2 ratios relative to yeast-form
wild-type expression for all subsequent analysis. We restricted all subsequent analysis to a
set of 1863 genes with differential expression, defined as at least a twofold difference
between their lowest and highest expression intensities.
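The preprocessing steps above (replicate averaging, log2 ratios against a reference, and the twofold filter) can be sketched as follows. This is a minimal illustration rather than the pipeline used for the study: it assumes linear-scale intensities (RMA output is normally already on a log2 scale, in which case the division becomes a subtraction), and all function names and numbers are hypothetical.

```python
import math

def mean_replicates(replicates):
    """Average the biological replicates for one gene in one strain."""
    return sum(replicates) / len(replicates)

def log2_ratios(strain_means, reference):
    """Log2 ratio of each strain mean relative to a reference intensity
    (in the text: yeast-form wild-type expression)."""
    return [math.log2(value / reference) for value in strain_means]

def is_differential(strain_means, fold=2.0):
    """True if the highest and lowest intensities differ by at least `fold`."""
    return max(strain_means) / min(strain_means) >= fold

# Hypothetical intensities for one gene: three replicates in each of three strains.
replicates_by_strain = [
    [95.0, 100.0, 105.0],
    [190.0, 200.0, 210.0],
    [48.0, 50.0, 52.0],
]
means = [mean_replicates(r) for r in replicates_by_strain]
ratios = log2_ratios(means, reference=100.0)   # reference value is made up
keep_gene = is_differential(means)
```

Applying `is_differential` to every gene and keeping the genes for which it returns `True` corresponds to the filtering step that yields the analysis set.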
Genetic Influences Decomposition
The matrix decomposition outlined in Figure 1 is readily expanded to the case of more
than two seed genes and an unlimited number of expression profiles. For example, the
case of three seed genes (A, B, and C) would be written as
$$
\begin{pmatrix}
X^{WT} & X^{A} & X^{B} & X^{C} & X^{AB} & X^{AC} & X^{BC} \\
Y^{WT} & Y^{A} & Y^{B} & Y^{C} & Y^{AB} & Y^{AC} & Y^{BC} \\
Z^{WT} & Z^{A} & Z^{B} & Z^{C} & Z^{AB} & Z^{AC} & Z^{BC}
\end{pmatrix}
\cong
\begin{pmatrix}
x_0 & x_A & x_B & x_C \\
y_0 & y_A & y_B & y_C \\
z_0 & z_A & z_B & z_C
\end{pmatrix}
\cdot
\begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 \\
g_A^{WT} & 0 & g_A^{B} & g_A^{C} & 0 & 0 & g_A^{BC} \\
g_B^{WT} & g_B^{A} & 0 & g_B^{C} & 0 & g_B^{AC} & 0 \\
g_C^{WT} & g_C^{A} & g_C^{B} & 0 & g_C^{AB} & 0 & 0
\end{pmatrix}
$$
(Eq. S1)
Subscripts denote the influencer gene and superscripts denote genetic backgrounds, in
which labels (A, AB, etc.) imply deleted genes. Our data involved five seed genes.
The form of matrix G specified in Equation S1 guarantees the existence of a unique best-fit
solution due to the strict arrangement of ones and zeros required by the genotypes (i.e.,
the rows of matrix G are linearly independent and cannot be transformed to yield a
similar format). We used singular value decomposition (SVD) to aid in finding the best-fit
solution. SVD dimensionally reduces the expression data set to a small number of
modes, each with a unique eigengene (Alter et al., 2000; Carter et al., 2006). Of the 16
SVD modes, the first 6 modes account for 96% of the information in the data set. This
provides support for the suitability of a linear model, because it is consistent with the
dimensional reduction of 16 experimental conditions to 6 linearly independent modes
(the 5 perturbed genes plus the collective remainder of the genome), plus a small noise
component. For comparison, in SVD analysis of single-gene perturbations in the Ras-cAMP
system (Carter et al., 2006) the 9 conditions effectively reduced to 7 modes, a
substantially higher fraction than that found in the present case.
Genetic-influences decomposition can be readily performed on this much smaller data
matrix. Then, the full influences matrix X can be determined by multiplying the results
by the SVD eigenarrays. Finding the best-fit solution then becomes a tractable problem
using commercial software on a PC. In matrix notation, this procedure is summarized:
$$D = u \cdot v \cdot w^{T} \cong u \cdot v \cdot x \cdot G = X \cdot G,$$
(Eq. S2)
where the symbol ≅ denotes a best-fit solution. The matrices u, v, and $w^{T}$ are the
singular value matrices for the first six modes. The 6 x 6 square matrix x encodes the
expression influences for the first six eigengenes. SVD results are discussed in greater
detail below. The 1863 x 6 matrix X contains the expression influences for each gene
(Equation S1).
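The chain of products in Equation S2 can be sketched with NumPy. This is a simplified illustration, not the procedure used for the study: it assumes G is already known, whereas in the text x and the free entries of G are fitted together, and it assumes that the "information" in a mode is measured by its share of the summed squared singular values. Function names and the toy numbers are hypothetical.

```python
import numpy as np

def mode_fractions(D):
    """Share of the summed squared singular values carried by each mode."""
    s = np.linalg.svd(D, compute_uv=False)
    return s**2 / np.sum(s**2)

def influences_from_svd(D, G, n_modes):
    """Least-squares X in D ≅ X·G, computed in the reduced SVD space."""
    u, s, wT = np.linalg.svd(D, full_matrices=False)
    u, s, wT = u[:, :n_modes], s[:n_modes], wT[:n_modes, :]
    x = wT @ np.linalg.pinv(G)      # small matrix: best fit of wT ≅ x·G
    return (u * s) @ x              # X = u·v·x

# Toy check with two seed genes (columns: WT, A, B, AB) and made-up activities.
G = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, 0.0, 0.8, 0.0],
              [1.0, 1.2, 0.0, 0.0]])
rng = np.random.default_rng(0)
X_true = rng.normal(size=(10, 3))   # 10 hypothetical genes
D = X_true @ G
X_fit = influences_from_svd(D, G, n_modes=3)
```

Working in the reduced space means the fit involves only the small matrix x rather than one row per gene, which is what makes the full problem tractable on a PC.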
Goodness of fit
To assess the goodness of fit for the genetic influences decomposition, we compared
fitted double-mutant expression values with those predicted by an additive control model
(Equation S8). This estimates the expression of every double mutant as the sum of
effects for the two single mutants. We then compared both models to the double-mutant
expression measurements. Reduced chi-square values were 0.035 for the model fit
compared to 0.41 for the additive control (chi-square variances for experimental data
were estimated by mean variation between biological replicates), and the subsequent
relative chi-square probability was negligible. For each expression profile, we computed
the Pearson correlation between the experimental data and both models. The mean
correlation for the model fit was 0.87, compared to 0.49 for the additive control.
Distributions of the correlation coefficients are shown in Figure S1, which shows that the
model fits the data well and substantially better than the additive control.
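The two fit statistics used here can be written out in a few lines. The sketch below is illustrative only: the choice of degrees of freedom (number of data points minus an optional count of fitted parameters) is an assumption, since the text does not state it, and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def reduced_chi_square(observed, predicted, variances, n_fitted=0):
    """Chi-square per degree of freedom; `variances` would come from the
    variation between biological replicates."""
    chi2 = sum((o - p) ** 2 / v for o, p, v in zip(observed, predicted, variances))
    return chi2 / (len(observed) - n_fitted)
```

Computing both quantities once against the fitted values and once against the additive-control values gives the pairs of numbers compared in this section.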
Identification of significant expression influences
The matrix X (Equations 1 and S2) is a table of influences from each seed gene (plus one
influence from the genetic background) to each gene in our expression set. To determine
which influences are significant, we performed a series of bootstrap cross-validations.
We performed influence decompositions (Equation 1) for 5000 subsets of the data matrix
D involving half of the 1863 rows, each chosen at random. This gave us approximately
2500 solutions for each matrix element (influence coefficient) that comprised
distributions that reflected the variability in the data. We then examined the probability
that two influence coefficients were identical as a function of their difference. We
performed Welch’s approximate t-tests for $5 \times 10^{5}$ randomly chosen pairs of coefficients
(all pairs would have been unnecessary and computationally prohibitive). The p-values
were binned based on the difference between the two corresponding coefficients and we
computed the mean p-value as a function of coefficient difference. The mean p-value
decreases monotonically from a value of p = 1 for negligible coefficient differences. We
chose p < 0.001 as a cutoff, and found that this corresponded to coefficient differences of
0.1 or greater. Thus, we consider every influence coefficient with magnitude of 0.1 or
greater to be significantly different from zero.
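The resampling and the pairwise test can be sketched with the standard library. This is an assumption-laden illustration: it shows only the selection of a random half of the rows and the Welch statistic with its Welch-Satterthwaite degrees of freedom; converting the statistic to a p-value needs a t-distribution from a statistics package, which is left out here. Function names and sample values are hypothetical.

```python
import math
import random
import statistics

def half_row_subset(n_rows, rng):
    """Indices of a random half of the rows of the data matrix."""
    return sorted(rng.sample(range(n_rows), n_rows // 2))

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = statistics.variance(sample_a) / len(sample_a)
    var_b = statistics.variance(sample_b) / len(sample_b)
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    return t, dof

rng = random.Random(0)                      # fixed seed, for reproducibility only
rows = half_row_subset(1863, rng)           # one bootstrap subset of D
t, dof = welch_t([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
```

Because each subset contains half of the rows, a given gene appears in roughly half of the 5000 subsets, which is why each coefficient ends up with approximately 2500 solutions.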
We then individually examined columns 2 through 6 of the X matrix to identify genes
that receive significant influences from the five seed genes. For each of the five seed
genes, we determined negative-influence and positive-influence gene sets. Thus we
obtained ten gene sets in total, with many overlapping elements. The genes are listed in
Table S2. The average set had 280 genes, although this ranged from 47 (TEC1-Negative
and CUP9-Negative) to 980 (SFL1-Positive) genes. Positive-influence gene sets were
larger than negative-influence sets for TEC1, CUP9, and SFL1 (more than fivefold larger
for TEC1 and CUP9), while SKN7 was the source of more than twice as many negative
influences as positive ones.
We queried each gene set for over-represented targets of transcription factors based on a
hypergeometric distribution. Results are listed in Table S3. We constructed regulatory
networks based on this analysis as described below (Network construction), mapping
putative pathways of transcriptional influence from each seed gene to the sets of genes it
positively and negatively influences. A representative network is shown in Figure 3.
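The enrichment test can be illustrated with an exact hypergeometric tail. This is a sketch under stated assumptions: the upper tail P(X ≥ overlap) is taken as the enrichment p-value and the Bonferroni correction multiplies by the number of factors tested; the text does not spell out either detail, and the function names are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(population, annotated, drawn, overlap):
    """P(X >= overlap) when `drawn` genes are sampled without replacement from
    `population` genes, of which `annotated` are targets of the factor."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    return sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(overlap, upper + 1)
    ) / total

def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)
```

Here `population` would be the 1863 genes in the expression set, `drawn` the size of one influence gene set, and `annotated` the number of known targets of one transcription factor within the population.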
Network construction
We constructed transcriptional regulation networks connecting each seed gene to its
influenced genes via the enriched transcription factors (Carter et al., 2006). Public
databases were queried for protein-protein and protein-DNA interactions. Because our
seed genes were transcription factor genes, we often found these factors among the
enriched regulators of the co-regulated genes. Each subnetwork was constructed using
the shortest molecular interaction paths connecting the seed gene to the positive or
negative influence gene-set via enriched transcription factors (Table S3). Pathways
involving five or more links were discarded because they are longer than the average
shortest connection between any two elements in the global network, and thus are of
questionable biological relevance. Thus every path from a seed gene to its influence
targets passes through one or more of the enriched transcription factors, such that all
molecular paths in the network are obtained using statistical evidence of co-regulation.
The exclusion of alternate paths that bypass those involving enriched transcription factors
was justified a posteriori by the inaccuracy of the predictions generated from them (data
not shown). This provides indirect evidence of modularity in transcriptional regulation,
suggesting that genetic co-expression results from coordinated activity of small groups of
transcription factors. Shortest paths were chosen because they are most likely to be
biologically active (Steffen et al., 2002). Each resulting subnetwork (Figure S2) traces a
distinct putative influence from seed genes, through specific molecular interactions, to a
gene set that received either a positive or negative expression influence.
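The path search described in this section can be sketched as a breadth-first search that is forced through an enriched transcription factor. This is a simplified illustration, not the authors' implementation: it treats the interaction network as a directed adjacency dictionary, returns one shortest path per leg rather than all of them, and uses made-up node names.

```python
from collections import deque

def shortest_path(graph, source, target):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None

def path_via_factor(graph, seed, factor, target, max_links=4):
    """Shortest seed -> factor -> target path, discarded if it has five or more links."""
    first = shortest_path(graph, seed, factor)
    second = shortest_path(graph, factor, target)
    if first is None or second is None:
        return None
    path = first + second[1:]
    return path if len(path) - 1 <= max_links else None

# Hypothetical interaction network.
graph = {"SEED": ["P1"], "P1": ["TF1"], "TF1": ["GENE1"]}
example = path_via_factor(graph, "SEED", "TF1", "GENE1")
```

When the seed gene is itself one of the enriched factors, the first leg collapses to a single node and the path reduces to the factor-to-target leg.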
Modeling knockouts in the seed gene network
The genotype matrix G encodes activity levels of each seed gene for each of the mutant
strains involving seed gene knockouts. Since changes in activity levels result from
influence in the biochemical network, seed gene activities can be modified by additional
genetic perturbations. To model this, we next infer and quantify the influences the seed
genes exert on each other’s activity levels. This is a further dimensional reduction of the
genotype matrix G. Starting again with the case of two seed genes, A and B, we can
write for their activity levels:
$$A = A_0 + m_{AB} B$$
$$B = B_0 + m_{BA} A$$
(Eq. S3)
The variables A and B define generalized activity levels of the seed genes, and the
parameters $A_0$ and $B_0$ represent basal input not directly due to genes A or B. The $m_{ij}$
account for influences between A and B. Self-influences, such as $m_{AA}$, are not included
because they cannot be numerically distinguished from the basal input. The model can
be readily generalized to the case of N perturbed genes. For a vector of gene activities,
we replace {A, B} with $g = \{g_1, g_2, \ldots, g_N\}$ and write
$$g_i = (g_0)_i + \sum_{j} m_{ij}\, g_j$$
(Eq. S4)
where $g_0$ is a vector of base activity and the $m_{ij}$ form an N x N matrix encoding the
influence of the jth gene on the ith gene (with $m_{ii} = 0$). Equation S4 can be solved for
the activities and we find the vector solution
$$g^{WT} = (\mathbf{1} - m)^{-1} \cdot g_0$$
(Eq. S5)
where $\mathbf{1}$ is the N x N identity matrix. The vector $\{1, g^{WT}\}$ is the first column of the
genotype matrix G.
In this formulation, the deletion of a seed gene requires setting both its base activity and
its influences on other seed genes to zero. This corresponds to replacing the appropriate
entry in $g_0$ with zero and the appropriate column in m with zeros. This can be achieved
by rewriting Equation S5 in terms of a diagonal base activity matrix, $G_0$, formed by
placing the elements of the vector $g_0$ along the diagonal, and a scaled influence matrix
with elements $M_{ij} = m_{ij} / (g_0)_i$. Defining the vector $\vec{1}$ = {1, 1, …, 1, 1} of length N, we
then have
$$g^{WT} = [(G_0)^{-1} - M]^{-1} \cdot \vec{1}$$
(Eq. S6)
In this form, a deletion of gene A is modeled by taking the limit as $(G_0)_{AA} \to 0$, which
means setting the basal activity of that gene to zero. The resulting $g^{A}$ (with a 1
prepended) corresponds to the second column of the matrix G. Multiple deletions are
modeled by taking multiple zero limits for entries of the diagonal matrix $G_0$. This
effectively removes all traces of the deleted genes from the system. Note that by fixing
$g^{WT}$ = {1, 1, 1, …, 1} we can find a solution for the matrix elements of M and the base
activities $G_0$ using the matrix elements of the best-fit solution for G as described above.
Since this is a further dimensional reduction, reducing $N^2(N-1)/2$ parameters to $N^2$, we
must again find a best-fit solution. For N = 5 perturbed genes the task was easily
performed on a desktop PC.
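Equation S6 and the deletion limit can be checked numerically. The sketch below is an illustration under stated assumptions, not the fitting code used for the study: it takes the base activities and M as given (in the text they are fitted to G), uses NumPy, and implements the zero limit of a diagonal entry of the base activity matrix by dropping the deleted gene from the linear system and setting its activity to zero. The function name and the two-gene numbers are made up.

```python
import numpy as np

def seed_activities(g0, M, deleted=()):
    """Solve [(G0)^-1 - M] g = 1 for the surviving seed genes (Eq. S6).
    Deleted genes get activity 0 and are removed from the system."""
    g0 = np.asarray(g0, dtype=float)
    M = np.asarray(M, dtype=float)
    removed = set(deleted)
    keep = [i for i in range(len(g0)) if i not in removed]
    system = np.diag(1.0 / g0[keep]) - M[np.ix_(keep, keep)]
    g = np.zeros(len(g0))
    g[keep] = np.linalg.solve(system, np.ones(len(keep)))
    return g

# Two hypothetical genes: A = 1 + 0.5 B and B = 2 + 0.25 A (form of Eq. S3).
g0 = np.array([1.0, 2.0])
m = np.array([[0.0, 0.5],
              [0.25, 0.0]])
M = m / g0[:, None]                     # M_ij = m_ij / (g0)_i
g_wt = seed_activities(g0, M)           # wild-type activities
g_del_a = seed_activities(g0, M, deleted=[0])
```

With gene A deleted, gene B falls back to its base activity, which is the behaviour the zero limit is meant to capture.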
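For the ordering of seed genes {TEC1, CUP9, SFL1, SOK2, SKN7}, we found
$$
|M| =
\begin{pmatrix}
0 & 0.16 & 0.49 & 0.39 & 1.09 \\
1.37 & 0 & 0.39 & 0.31 & 0.38 \\
0.2 & 0.28 & 0 & 0.08 & 0.09 \\
0.41 & 0.68 & 0.25 & 0 & 0.10 \\
0.11 & 0.19 & 0.81 & 0.09 & 0
\end{pmatrix}
$$
(Eq. S7)
Row i lists the influences received by seed gene i and column j those exerted by seed
gene j. Entry magnitudes are given; the minus signs of the typeset matrix are not legible
in this copy.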
E.g., TEC1 activity receives an influence of –0.49 from SFL1. These influences are
depicted graphically in Figure 4A. For each of these influences, the shortest paths were
found in the global physical network following the procedure described above. These
paths represent putative biomolecular mechanisms for influence between the seed genes.
Select paths are shown in Figure S2.
The YAP6 gene deletion had multiple effects on M. As with the gene-influences network
for Mode 2 (Figure 4B), Yap6 was the primary candidate for all influences from CUP9
and many from SFL1. Specifically, the second and third columns of M correspond to
influences from CUP9 and SFL1 to the other seed genes, respectively. The paths that
putatively transmit these influences involve Yap6 (Figure S2). To model strains with
YAP6 gene deletions, we initially set all $M_{i2} = 0$ and all $M_{i3} = 0$ except for $M_{53}$ (since we
did not find a candidate path involving YAP6 for the SFL1-SKN7 influence). As
explained in the main text, we set six of 20 nonzero elements in the matrix M to zero for
the YAP6 gene deletion, which led to changes throughout the G matrix when its
columns were computed with Equation S6. This modified G matrix represented the
activities of our seed genes in the yap6Δ genetic background, and the columns
corresponding to yap6Δ, cup9Δyap6Δ, sfl1Δyap6Δ, and sok2Δyap6Δ were used to
compute expression predictions along with the yap6Δ version of X (main text).
Assessment of Gene Expression Predictions
To assess the accuracy of our predictions, we performed chi-square tests with the error
ranges serving as a measure of data variance. Experimental uncertainties were observed
in the median average deviation between biological replicates. Theoretical uncertainties
were estimated from genome-wide goodness-of-fit of our original model solution.
Likelihoods of fit goodness were calculated from a chi-square distribution.
The chi-square fits proved to be uniformly excellent. However, it is possible that trends
in expression across all genes are readily recovered with most linear modeling
procedures. As a control we computed similar predictions based on a model without
genetic interactions between the perturbed genes. In this control model, perturbed genes
will influence gene expression patterns, but they will not influence each other. Thus each
double-mutant prediction will be the direct sum of the single-mutant measurements, all
relative to wild-type expression. Following the notation used in Table 1, the expression
of gene X in a double-mutant background is:
(XAB – XWT) = (XA – XWT) + (XB – XWT).
(Eq. S8)
This control model retains general trends in gene expression but inherently lacks genetic
interactions and is thus an appropriate test of our approach. We repeated the chi-square
tests with this control to determine whether the model with genetic interactions is
consistently more predictive than the additive control model. To quantify this, we
computed the relative probability of chi-square goodness-of-fit for our model versus the
additive control.
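The additive control of Equation S8 amounts to one line of arithmetic per gene. The sketch below rearranges the equation to give the predicted double-mutant value directly; it assumes all values are on the same log2-ratio scale, and the function names and numbers are hypothetical.

```python
def additive_double_mutant(x_wt, x_a, x_b):
    """Eq. S8 rearranged: X^AB = X^A + X^B - X^WT."""
    return x_a + x_b - x_wt

def interaction_residual(x_ab_measured, x_wt, x_a, x_b):
    """Departure of a measured double mutant from the additive expectation."""
    return x_ab_measured - additive_double_mutant(x_wt, x_a, x_b)

# Made-up log2 ratios for one gene.
predicted = additive_double_mutant(x_wt=0.0, x_a=1.0, x_b=-0.5)
residual = interaction_residual(1.5, x_wt=0.0, x_a=1.0, x_b=-0.5)
```

A nonzero residual is what the control cannot represent, since it contains no influence of one perturbed gene on another.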
Singular Value Decomposition Analysis
In parallel with our genetic influence decomposition described above, we analyzed the
expression data with SVD. This was done in order to identify a co-expression pattern
that best correlated with filamentous-growth phenotype observations for all strains. We
then hypothesized that the transcriptional network constructed for the genes showing this
pattern would be a basis for filamentous-growth phenotype predictions.
We performed SVD, a linear algebra method that rearranges microarray data into a series
of composite expression patterns, or eigengenes, and sets of genes that show those
patterns. We follow the methods of Carter et al. (2006). Briefly, each mode
has a set of genes that exhibits the pattern with positive coefficient, and another set with
negative coefficient. These coefficients are the matrix elements of u in Equation S2.
The patterns (eigengenes, the matrix wT in Equation S2) and overall mode weights
(eigenvalues, diagonal elements of v in Equation S2) for the first six modes (of sixteen
total; see discussion following Equation S1) are shown in Figure S3. Because each of the
six relevant modes has positive and negative sets, there are 12 gene sets. Joint
membership of any gene in more than one SVD mode is possible. Roughly one third of
the genes have no gene-set memberships, one third are members of one set, and one third
are grouped into more than one mode. The joint membership of a gene in more than one
mode indicates that the expression pattern of the gene is a weighted composite of the
modes of which it is a member.
To identify which mode will serve as the best expression proxy for phenotype, we
computed the Pearson correlation coefficient for each expression pattern (eigengene) and
a discretized version of our filamentous-growth observations. The discretization is the
minimal number of integer values consistent with our strain-by-strain comparisons (see
above). For example, the phenotype inequality for the pair tec1 and cup9 is
$\Phi_{tec1\,cup9} = \Phi_{tec1} < \Phi_{wt} < \Phi_{cup9}$, as observed on a plate comparing filamentous growth of
all four strains (here $\Phi_A$ is the phenotype value of strain A). The results are listed in Table
S1. The correlation coefficients for the phenotype measurements and each of the 16 SVD
modes were, in mode order: {-0.31, 0.74, 0.44, -0.12, 0.0037, 0.19, -0.10, 0.17, 0.086, 0.18, 0.30, -0.00049, 0.21, 0.032, -0.0040, -0.0035}. The SVD gene set that is the best
candidate proxy for the phenotype is thus the 2-Positive gene set (Table S4). Note that this is
not the dominant expression pattern. It is globally weighted less than the first expression
pattern (Figure S3A). This provides a posteriori justification of our choice of SVD
analysis, as this expression profile would be more difficult to identify using exclusive
clustering methods based on non-decomposed expression profiles. Moreover, the
best-correlated mode was among the first six modes, and is thus one of those previously
deemed biologically relevant. We analyzed the Mode-2 gene sets for over-representation
of transcription factor targets, as described above for the influence sets. Results are
reported in Table S5.
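The mode selection described above can be sketched with NumPy. This is an illustration under stated assumptions: the eigengenes are taken to be rows of a matrix whose columns follow the strain order of Table S1, and the best mode is taken to be the one with the largest absolute Pearson correlation. The phenotype vector and the reported correlations are copied from this document; the function name is hypothetical.

```python
import numpy as np

# Discretized filamentation values from Table S1, in strain order.
PHENOTYPE = np.array([0, -2, 1, 3, 2, -1, -2, -2, -2, -2, 3, 2, -1, 2, 3, 2],
                     dtype=float)

def best_mode(eigengenes, phenotype):
    """Row index and Pearson r of the eigengene that best tracks the phenotype."""
    r = np.array([np.corrcoef(row, phenotype)[0, 1] for row in eigengenes])
    index = int(np.argmax(np.abs(r)))
    return index, float(r[index])

# Correlations reported in the text for the 16 modes, in mode order.
REPORTED_R = [-0.31, 0.74, 0.44, -0.12, 0.0037, 0.19, -0.10, 0.17,
              0.086, 0.18, 0.30, -0.00049, 0.21, 0.032, -0.0040, -0.0035]
reported_best = int(np.argmax(np.abs(REPORTED_R))) + 1   # 1-based mode number
```

Applied to the reported correlations, the selection rule returns mode 2, in agreement with the text.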
To construct a biochemical network for the regulation of the Mode-2 gene set, we
determined which seed genes had a substantial influence on that expression pattern. The
influences on each SVD set were determined above, and encoded in the 6 x 6 matrix x in
Equation S2. To determine which of the seed genes had the strongest influence on a
given mode, we were not able to repeat the procedure used for the gene-by-gene
influence coefficients (see above) due to the limited number of pairwise comparisons.
Instead, we examined the distributions of the coefficients in x from the bootstrap
sub-solutions. These distributions had a mean standard deviation of 0.14. Thus we
considered all coefficients with magnitude greater than 0.14 to be substantially different
from zero. Influence coefficients for the Mode-2 gene set compose the second row of x.
The values in this row were: $x_{TEC1} = 0.28$, $x_{CUP9} = 0.19$, $x_{SFL1} = 0.18$, $x_{SOK2} = 0.09$, and
$x_{SKN7} = -0.11$. Thus we identified positive influences from TEC1, CUP9, and SFL1, and
constructed our network with these (Figure 4B). Network construction was performed as
described above, mapping influences from the seed genes (multiple in this case) to the
SVD set genes receiving the influences, via the over-represented transcription factors.
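The thresholding step for the Mode-2 influences is small enough to write out in full. The coefficients and the 0.14 cutoff are the values given in the text; the function name is hypothetical and the code is only a worked restatement of the comparison.

```python
# Mode-2 influence coefficients (second row of x) and the bootstrap-derived cutoff.
MODE2_INFLUENCES = {"TEC1": 0.28, "CUP9": 0.19, "SFL1": 0.18, "SOK2": 0.09, "SKN7": -0.11}
THRESHOLD = 0.14

def substantial_influences(influences, threshold):
    """Keep coefficients whose magnitude exceeds the threshold."""
    return {gene: value for gene, value in influences.items() if abs(value) > threshold}

selected = substantial_influences(MODE2_INFLUENCES, THRESHOLD)
```

Only TEC1, CUP9, and SFL1 pass the cutoff, and all three coefficients are positive.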
Phenotype Predictions
We made predictions of the genetic interactions and phenotypes of 13 novel
double-knockouts (including the four for which we collected microarray data) based on the
topology of the Mode-2 molecular network (Figure 4B). Most of these involved a newly
implicated gene with one of the original five seed genes. To make specific predictions
we identified three network configurations (or motifs) describing the relationship
between the two perturbed genes: (1) serial, in which one gene is in the only molecular
path by which an influence is passed from the upstream gene to the Mode-2 gene set; (2)
intermediate, in which the downstream gene lies in one of two or more molecular paths from the
upstream gene; and (3) parallel, in which the two perturbed genes lie in two separate
branches of molecular influence. These are summarized in Figure 4C. For the serial
motif, we expected the downstream perturbation to mask effects of the upstream
perturbation; thus the double perturbation should be identical to a single perturbation of
the downstream gene. For the intermediate motif, the effects of the downstream
perturbation would only partly mask the upstream perturbation and the double mutant
will have a phenotype slightly less than the sum of the single mutant effects. For the
parallel motif, we proposed that the genes act independently and hence the double mutant
phenotype will be the sum of effects from the two single mutants. We matched one of
these motifs to each double mutant based on the two genes’ relative position in the
Mode-2 network. For example, YAP6 is downstream of CUP9, so the double-mutant phenotype
should resemble that of a YAP6 single deletion.
We assessed the accuracy of the model predictions by comparing with the results
expected from a training set of 1809 genetic interactions for a closely related phenotype
(Drees et al., 2005). Using these genetic interactions as a training set, we identified the
most probable phenotype for each novel double-mutant perturbation by determining the
probabilities for every possible double-mutant outcome given the wild-type and
single-mutant phenotypes. For example, for a hypo-invasive A mutant and a hyper-invasive B
mutant, we found that the probabilities for the AB phenotype were 28% hypo-invasive,
50% hyper-invasive, and 22% wild-type. We compared the number of correct
predictions (NC, out of 13 possible) with the expected number correct obtained from the
training set. To do this, we summed the training-set probabilities of all possible
outcomes (Bernoulli trials) with NC or more correct to compute the likelihood of correctly
predicting NC or more phenotypes (a multinomial distribution is inadequate because the
outcome probabilities vary for each single-mutant combination).
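The likelihood of getting NC or more predictions right by chance, when each prediction has its own success probability, can be computed with a short dynamic program. This is a sketch rather than the authors' calculation: it assumes the trials are independent and that each success probability is the training-set probability of the predicted outcome for that mutant pair. The function name is hypothetical.

```python
def prob_at_least(success_probs, n_correct):
    """P(at least n_correct successes) for independent Bernoulli trials with
    trial-specific success probabilities."""
    dist = [1.0]                       # dist[k] = P(exactly k successes so far)
    for p in success_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)     # this trial fails
            nxt[k + 1] += mass * p         # this trial succeeds
        dist = nxt
    return sum(dist[n_correct:])
```

For the 13 predictions made here, `success_probs` would hold 13 values, for instance 0.50 for a pair whose predicted outcome is hyper-invasive in the example given above.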
SUPPLEMENTARY REFERENCES
Alter, O., Brown, P.O. and Botstein, D. (2000) Singular value decomposition for
genome-wide expression data processing and modeling. Proc Natl Acad Sci U S
A, 97, 10101-10106.
Carter, G.W., Rupp, S., Fink, G.R. and Galitski, T. (2006) Disentangling information
flow in the Ras-cAMP signaling network. Genome Res, 16, 520-526.
Drees, B.L., Thorsson, V., Carter, G.W., Rives, A.W., Raymond, M.Z., Avila-Campillo,
I., Shannon, P. and Galitski, T. (2005) Derivation of genetic interaction networks
from quantitative phenotype data. Genome Biol, 6, R38.
Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis,
B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S.,
Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A.J., Sawitzki, G., Smith,
C., Smyth, G., Tierney, L., Yang, J.Y. and Zhang, J. (2004) Bioconductor: open
software development for computational biology and bioinformatics. Genome
Biol, 5, R80.
Gimeno, C.J., Ljungdahl, P.O., Styles, C.A. and Fink, G.R. (1992) Unipolar cell divisions
in the yeast S. cerevisiae lead to filamentous growth: regulation by starvation and
RAS. Cell, 68, 1077-1090.
Goldstein, A.L. and McCusker, J.H. (1999) Three new dominant drug resistance cassettes
for gene disruption in Saccharomyces cerevisiae. Yeast, 15, 1541-1553.
Guthrie, C. and Fink, G.R. (1991) Guide to yeast genetics and molecular biology.
Academic Press, New York.
Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U.
and Speed, T.P. (2003) Exploration, normalization, and summaries of high
density oligonucleotide array probe level data. Biostatistics, 4, 249-264.
Prinz, S., Avila-Campillo, I., Aldridge, C., Srinivasan, A., Dimitrov, K., Siegel, A.F. and
Galitski, T. (2004) Control of yeast filamentous-form growth by modules in an
integrated molecular network. Genome Res, 14, 380-390.
Steffen, M., Petti, A., Aach, J., D'Haeseleer, P. and Church, G. (2002) Automated
modelling of signal transduction networks. BMC Bioinformatics, 3, 34.
Thompson, K.L., Rosenzweig, B.A., Pine, P.S., Retief, J., Turpaz, Y., Afshari, C.A.,
Hamadeh, H.K., Damore, M.A., Boedigheimer, M., Blomme, E., Ciurlionis, R.,
Waring, J.F., Fuscoe, J.C., Paules, R., Tucker, C.J., Fare, T., Coffey, E.M., He,
Y., Collins, P.J., Jarnagin, K., Fujimoto, S., Ganter, B., Kiser, G., Kaysser-Kranich,
T., Sina, J. and Sistare, F.D. (2005) Use of a mixed tissue RNA design
for performance assessments on multiple microarray formats. Nucleic Acids Res,
33, e187.
Winzeler, E.A., Shoemaker, D.D., Astromoff, A., Liang, H., Anderson, K., Andre, B.,
Bangham, R., Benito, R., Boeke, J.D., Bussey, H., Chu, A.M., Connelly, C.,
Davis, K., Dietrich, F., Dow, S.W., El Bakkoury, M., Foury, F., Friend, S.H.,
Gentalen, E., Giaever, G., Hegemann, J.H., Jones, T., Laub, M., Liao, H.,
Liebundguth, N., Lockhart, D.J., Lucau-Danila, A., Lussier, M., M'Rabet, N.,
Menard, P., Mittmann, M., Pai, C., Rebischung, C., Revuelta, J.L., Riles, L.,
Roberts, C.J., Ross-MacDonald, P., Scherens, B., Snyder, M., Sookhai-Mahadeo,
S., Storms, R.K., Veronneau, S., Voet, M., Volckaert, G., Ward, T.R., Wysocki,
R., Yen, G.S., Yu, K., Zimmermann, K., Philippsen, P., Johnston, M. and Davis,
R.W. (1999) Functional characterization of the S. cerevisiae genome by gene
deletion and parallel analysis. Science, 285, 901-906.
www.affymetrix.com.
SUPPLEMENTARY TABLES
Table S1. Discretized phenotype measurements for the initial strain set. Values
were converted to integers based on phenotype assays, by taking the most parsimonious
set of values consistent with strain-by-strain comparisons. The origin is defined as
wild-type filamentous growth and the overall scale is arbitrary.

Strain       Filamentation Value
wild-type     0
tec1         -2
cup9          1
sfl1          3
sok2          2
skn7         -1
tec1cup9     -2
tec1sfl1     -2
tec1sok2     -2
tec1skn7     -2
cup9sfl1      3
cup9sok2      2
cup9skn7     -1
sfl1sok2      2
sfl1skn7      3
sok2skn7      2

Table S2. Sets of genes with expression influenced by the seed genes. See attached
file CarterTableS2.xls.
Table S3. Analysis of gene sets influenced by the seed genes. Gene sets are composed of
genes with influence coefficients of magnitude 0.1 or greater (see Materials and
Methods); transcription factors are those with enriched targets in the set
(Bonferroni-corrected p < 0.001 enrichment). Network genes are the subset of genes to
which molecular paths could be mapped from the seed gene.

Gene Set        Genes   Network Genes   Transcription Factors
TEC1-Positive   305     69              Nrg1, Rox1, Skn7, Sko1, Sok2, Ste12, Tec1
TEC1-Negative   47      33              Ace2, Fkh2, Flo8, Swi5
CUP9-Positive   275     70              Nrg1, Phd1, Skn7, Sko1, Sok2, Ste12, Tec1
CUP9-Negative   47      23              Ace2
SFL1-Positive   980     341             Abf1, Dal82, Flo8, Phd1, Skn7, Sok2, Ste12, Tec1
SFL1-Negative   519     280             Cin5, Fkh1, Fkh2, Flo8, Gcn4, Mbp1, Nrg1, Phd1, Rcs1, Sok2, Ste12, Swi5, Yap6
SOK2-Positive   115     68              Cin5, Nrg1, Phd1, Rcs1, Skn7, Yap6
SOK2-Negative   123     92              Cin5, Fkh2, Flo8, Mga1, Nrg1, Phd1, Skn7, Sok2, Ste12, Sut1, Tec1, Yap6
SKN7-Positive   122     54              Cin5, Flo8, Mga1, Nrg1, Yap6
SKN7-Negative   270     161             Flo8, Msn2, Nrg1, Phd1, Skn7, Sko1, Sok2, Ste12, Sut1, Tec1

Table S4. Genes in the SVD Mode-2 set. See attached file CarterTableS4.tsv.
Table S5. Enriched transcription factors for the Mode-2 SVD gene set.
Transcription factors with an over-representation of targets in the set (p < 0.001
enrichment).

Transcription Factor   Number of Targets   -Log10(p) a
Mot3                   14                  3.15
Phd1                   25                  5.53
Rox1                   18                  6.28
Skn7                   17                  3.88
Sko1                   7                   3.60
Sok2                   23                  7.00
Ste12                  32                  10.70
Tec1                   21                  6.01

a Bonferroni-corrected -log10 probability (annotation significance).

SUPPLEMENTARY FIGURE LEGENDS
Figure S1. Goodness of fit for linear influences decomposition. Histograms are
shown for Pearson correlation coefficients between experimental data and linear
influences decomposition fit (red) and the additive control (blue). Correlations are
computed for the expression of each gene for all double-mutant strains.
Figure S2. Molecular networks for seed genes. Putative interaction paths are shown
that transmit the influences (Figure 4A) from seed gene (A) TEC1, (B) CUP9, (C) SFL1,
(D) SOK2, and (E) SKN7 to the other seed genes. Interactions are colored as:
protein-protein in blue, protein-DNA in orange, and protein phosphorylation in violet. Black
arrows denote inferred influences for which no molecular path with fewer than five
interactions was found.
Figure S3. SVD eigenvalues and eigengenes matrix. (A) Bar chart of eigenvalues and
(B) raster plot of eigengenes matrix are shown for the first six SVD modes.
Contributions are either positive (red) or negative (green), with intensity proportional to
magnitude.
Figure S4. Phenotype measurements and the Mode-2 expression component.
Discretized filamentous growth measurements (blue) plotted with the second eigengene
(red). The vertical scale corresponds to fractional gene expression, and the phenotype
scale is arbitrary (see text and Table S1).