Supplementary Results:
Human colorectal cancer gene expression data evaluation
We initially assembled 14 human colorectal cancer microarray gene expression data sets
from the Gene Expression Omnibus (GEO) and the ArrayExpress databases (Supplementary
Table 1). The 14 data sets consisted of gene expression data from 1420 colorectal tumor tissue
samples. For each normalized data set, we calculated the Pearson’s correlation coefficient for
each pair of genes. Supplementary Figures 2A and 2B illustrate the distribution of the
correlations for each data set. Although correlation coefficients from most of the data sets
showed a normal distribution centered near zero, a few data sets showed clearly skewed
distributions (GSE4554, GSE2138, and GSE3726). As depicted in Supplementary Figure 2C and
2D, most of the data sets showed a clear positive association between co-expression and
functional relevance. For example, for gene pairs with a Pearson’s correlation coefficient at the
range of 0.9-1.0 in the data set GSE17536, the likelihood ratio of being functionally related is
greater than 13, i.e. log likelihood ratio (LLR) greater than 2.6. However, the three data sets with
skewed correlation distribution showed very weak association between co-expression and
functional relevance. These three data sets were thus eliminated from further analyses.
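The pairwise correlation step described above can be sketched in a few lines of Python. This is only an illustration on invented data: the gene names, the expression values, and the function names are assumptions and are not taken from the original analysis.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists.

    Assumes neither list is constant (a constant list has zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pairwise_correlations(expr):
    """Map each unordered gene pair to its correlation across samples."""
    return {(g1, g2): pearson(expr[g1], expr[g2])
            for g1, g2 in combinations(sorted(expr), 2)}

# Toy data (hypothetical): three genes measured in four samples.
expr = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],  # rises with GENE_A
    "GENE_C": [4.0, 3.0, 2.0, 1.0],  # falls as GENE_A rises
}
corr = pairwise_correlations(expr)
```

Plotting a histogram of the values in `corr` for a real data set would give the kind of distribution summarized in Supplementary Figures 2A and 2B.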
Although the 11 remaining data sets passed our functional relevance quality criteria, it
was not clear whether the data sets held complementary or redundant information. To answer
this question, we selected three data sets (GSE17536, GSE14333, and GSE2109) that showed the
highest level of positive association between co-expression and functional relevance and
compared the top ranking gene pairs from these data sets. Pearson’s correlation coefficient
cutoffs were selected separately to get approximately the top 10,000 (i.e. top 0.005%) correlated
gene pairs for each data set. As shown in Supplementary Figure 3A, only 21% of the gene pairs
were common across all three data sets, whereas 45% to 68% of gene pairs were specific to
individual data sets. A similar analysis was performed for the top 100,000 (i.e. top 0.05%)
correlated gene pairs; on average, only 18% of the gene pairs were common across the
three data sets, whereas 52% to 70% of gene pairs were specific to individual data sets
(Supplementary Figure 3B). These results suggest that individual data sets contain
complementary functional information and that appropriate data integration can lead to a more
comprehensive and accurate co-expression network.
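A comparison of this kind can be sketched as a set-overlap calculation. The example below is a minimal illustration with made-up gene pairs. The choice of denominator (each data set's own list of top pairs) is an assumption, since the text does not state how the percentages were normalized, and the names `pair` and `overlap_summary` are hypothetical.

```python
def pair(a, b):
    """Order-independent key for a gene pair."""
    return frozenset((a, b))

def overlap_summary(top_pairs):
    """top_pairs: dict mapping data set name -> set of top-ranked gene pairs.

    For each data set, returns the fraction of its pairs shared by all
    data sets ("common") and the fraction found in no other data set
    ("specific").
    """
    names = list(top_pairs)
    common = set.intersection(*top_pairs.values())
    summary = {}
    for name in names:
        others = set().union(*(top_pairs[o] for o in names if o != name))
        own = top_pairs[name]
        summary[name] = {
            "common": len(common) / len(own),
            "specific": len(own - others) / len(own),
        }
    return summary

# Toy top-pair lists (hypothetical) for three data sets.
top = {
    "DS1": {pair("A", "B"), pair("A", "C"), pair("B", "C"), pair("C", "D")},
    "DS2": {pair("A", "B"), pair("A", "C"), pair("D", "E"), pair("E", "F")},
    "DS3": {pair("A", "B"), pair("B", "C"), pair("F", "G"), pair("G", "H")},
}
summary = overlap_summary(top)
```

In this toy case only the pair A-B is shared by all three lists, while half of the pairs in DS2 and DS3 appear nowhere else.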
Comparing methods for network construction
To build an integrated gene co-expression network with nodes representing genes and
edges representing robust co-expression relationship between genes, we compared two
approaches for data integration. The first approach combines the data sets based on a naïve
Bayes model by summing up the data set-specific log likelihood ratios (LLRs) for each gene pair
(1). The second approach assigns the maximum LLR from the eleven data sets as each gene
pair’s final LLR. Consistent with a previous study (2), the second approach performed slightly
better than the first approach (data not shown). Therefore, we applied the second approach to
calculate LLRs for all gene pairs and constructed networks based on a variety of LLR cutoffs
representing different levels of stringency.
FK506 inhibition
To test whether NFAT activity is associated with cell invasiveness and with target gene
expression, we inhibited NFAT activity in the highly metastatic MC38Met cells using
the calcineurin inhibitor FK506. We found that FK506 inhibited nuclear localization of
NFATc1 (Supplementary Figure 7A-B), significantly decreased expression of 13 target genes
from the metastatic module (Supplementary Figure 7C), and significantly decreased the rate of
trans-endothelial invasion of MC38Met cells (Supplementary Figure 7D).
Supplemental Methods:
Human CRC gene expression data preprocessing. Human gene expression data sets
(Supplementary Table 1) were downloaded from the Gene Expression Omnibus (GEO) database
(http://www.ncbi.nlm.nih.gov/geo/) and the ArrayExpress Archive
(http://www.ebi.ac.uk/microarray-as/ae/). Choosing the right normalization procedure is a key
step towards the inference of accurate cellular networks, and comparative analysis suggests that
MAS5 provides the most faithful cellular network reconstruction compared to other
normalization procedures including RMA, GCRMA, and Li-Wong (3). Therefore, we used the
MAS5 algorithm implemented in Bioconductor (4) for data normalization. Probe set identifiers
(IDs) were mapped to gene symbols based on the mapping provided by the GEO database. Probe
sets that mapped to multiple genes were eliminated. When multiple probe sets were mapped to
the same gene, the probe set with the largest interquartile range (IQR) was selected owing to its
high variation across samples. To make the expression level comparable across genes, expression
values for each gene were standardized using a Z-score transformation. For each data set, a gene
expression matrix with normalized and standardized expression values was thus generated.
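The probe set selection and standardization steps can be illustrated with the short Python sketch below. It is not the original pipeline (which ran in Bioconductor). The probe identifiers and values are invented, the quartile method is the default of Python's `statistics.quantiles`, and the use of the sample standard deviation for the Z-score is an assumption.

```python
from statistics import mean, quantiles, stdev

def iqr(values):
    """Interquartile range of one probe set across samples."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

def collapse_probes(probe_expr, probe_to_gene):
    """Keep, for each gene, the probe set with the largest IQR.

    probe_to_gene is assumed to contain only probe sets that map to a
    single gene; probe sets missing from it are skipped.
    """
    best = {}
    for probe, values in probe_expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        if gene not in best or iqr(values) > iqr(best[gene]):
            best[gene] = values
    return best

def zscore(values):
    """Standardize one gene to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Toy data (hypothetical probe IDs): p1 and p2 both map to gene G1.
probe_expr = {
    "p1": [1.0, 2.0, 3.0, 4.0],
    "p2": [1.0, 5.0, 9.0, 13.0],  # larger spread, so it represents G1
    "p3": [2.0, 2.5, 3.0, 3.5],
    "p4": [7.0, 7.1, 7.2, 7.3],   # no unique gene mapping, so it is dropped
}
probe_to_gene = {"p1": "G1", "p2": "G1", "p3": "G2"}

collapsed = collapse_probes(probe_expr, probe_to_gene)
standardized = {gene: zscore(values) for gene, values in collapsed.items()}
```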
Mouse gene expression data analysis. The mouse gene expression data set was downloaded
from the GEO and processed using the Robust MultiChip Analysis (RMA) algorithm (5) as
implemented in Bioconductor. Probe set IDs were mapped to gene symbols as described above.
Mouse genes with one-to-one human ortholog mapping as annotated by HomoloGene
(http://www.ncbi.nlm.nih.gov/homologene) were carried forward for differential expression
analysis. Because there were only three replicates in each group, the moderated t-test in the
limma package (6) was used to identify differentially expressed genes between the two groups.
The moderated t-test uses an empirical Bayes method to moderate the standard errors of the
estimated log-fold changes. This results in more stable inference and improved power, especially
for experiments with small number of arrays. Genes were ranked according to their differential
expression level as measured by the moderated t-statistic.
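To make the idea of moderating the standard error concrete, the following Python sketch shrinks a gene's pooled variance toward a prior variance before computing a two-group t-statistic. It is a simplified illustration and not the limma implementation: limma estimates the prior degrees of freedom and prior variance from all genes, whereas here `d0` and `s0_sq` are simply supplied as hypothetical inputs, and the expression values are invented.

```python
from math import sqrt
from statistics import mean, variance

def moderated_t(group1, group2, d0, s0_sq):
    """Two-group t-statistic with an empirical-Bayes style shrunken variance.

    d0    : prior degrees of freedom (assumed given here)
    s0_sq : prior variance (assumed given here)
    """
    n1, n2 = len(group1), len(group2)
    d = n1 + n2 - 2  # residual degrees of freedom
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / d
    # Weighted average of the prior variance and the gene's own variance.
    shrunk = (d0 * s0_sq + d * pooled) / (d0 + d)
    se = sqrt(shrunk * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / se

# Toy log-expression values (hypothetical), three replicates per group.
met = [5.0, 5.2, 4.8]
par = [3.0, 3.1, 2.9]

t_ordinary = moderated_t(met, par, d0=0, s0_sq=0.0)   # no shrinkage
t_moderated = moderated_t(met, par, d0=4, s0_sq=0.1)  # shrunk toward 0.1
```

With only three replicates, a gene's own variance estimate can be very small by chance. Pulling it toward the prior keeps such genes from receiving extreme statistics, which is the stabilizing effect described above.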
Identification of functionally relevant and irrelevant gene pairs. To evaluate functional
similarity between two genes, the Resnik’s semantic similarity (7) was calculated based on the
GO biological process annotation according to Elo et al (8). All gene pairs were ranked from the
highest semantic similarity score to the lowest score. The top 25% gene pairs in the ranked list
were designated as a gold standard set of functionally relevant gene pairs while the bottom 25%
were designated as a gold standard set of functionally irrelevant gene pairs.
Construction of an integrated gene co-expression network. For a human CRC data set i, Pearson’s correlation
coefficients were calculated for all gene pairs in the data set and binned into 0.1 unit intervals.
Neighboring bins were merged if each had fewer than 100 gene pairs. The log likelihood ratio (LLR) for gene
pairs to be functionally relevant given a particular correlation range (bin j) was calculated by
$$\mathrm{LLR}_{ij} = \ln\left(\frac{P(R \mid D_{ij}) / P(IR \mid D_{ij})}{P(R) / P(IR)}\right)$$
where P(R|Dij) and P(IR|Dij) are the frequencies of functionally relevant (R) and irrelevant (IR)
gene pairs observed at the correlation range j in data set i, respectively, while P(R) and P(IR)
represent all functionally relevant and irrelevant gene pairs in our gold standard sets,
respectively. Although P(R) and P(IR) correspond to the top and bottom 25% of gene pairs, their
numbers may differ because of ties.
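As a worked illustration of the LLR calculation, the Python sketch below bins gold standard gene pairs by correlation and computes the LLR for each bin from counts. The data are invented, the function names are hypothetical, and the merging of sparse neighboring bins described above is omitted for brevity.

```python
from collections import Counter
from math import floor, log

def bin_index(r, width=0.1):
    """Index of the correlation bin containing r (bin 0 starts at -1.0)."""
    # round() guards against floating-point error before flooring.
    idx = floor(round((r + 1.0) / width, 9))
    return min(idx, int(round(2.0 / width)) - 1)  # r = 1.0 joins the top bin

def llr_by_bin(pairs):
    """pairs: iterable of (r, label) with label 'R' or 'IR'.

    Returns {bin index: LLR}. Bins lacking either relevant or irrelevant
    pairs are skipped because their ratio is undefined or infinite.
    """
    rel, irr = Counter(), Counter()
    for r, label in pairs:
        (rel if label == "R" else irr)[bin_index(r)] += 1
    n_rel, n_irr = sum(rel.values()), sum(irr.values())
    return {b: log((rel[b] / irr[b]) / (n_rel / n_irr))
            for b in rel if irr[b] > 0}

# Toy gold standard (hypothetical): 20 relevant and 20 irrelevant pairs.
pairs = ([(0.95, "R")] * 13 + [(0.92, "IR")]
         + [(0.05, "R")] * 7 + [(0.02, "IR")] * 19)
llr = llr_by_bin(pairs)
top_bin = bin_index(0.95)  # the 0.9-1.0 bin
```

In this toy example the 0.9-1.0 bin holds 13 relevant pairs for every irrelevant one while the gold standard sets are equal in size, so the likelihood ratio is 13 and the LLR is ln 13, roughly 2.56.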
For a gene pair k, we used two different methods to summarize the scores derived from n
individual data sets. The first method probabilistically combines the data sets in a naïve Bayes
model by summing up the data set-specific LLRs for the gene pair:
in
LLRk   LLRij
i1

where LLRij is the LLR for the gene pair-containing bin j in data set i. The second method selects
the maximum LLR from all data sets for the gene pair:
$$\mathrm{LLR}_k = \max(\mathrm{LLR}_{1j}, \ldots, \mathrm{LLR}_{nj})$$

After an integrated score was calculated for each gene pair, an LLR cutoff was selected to
achieve a balance between consistency with existing knowledge and network coverage. Based on
the selected threshold, a gene co-expression network was constructed in which each node is a gene
and two nodes are connected by an edge if the corresponding LLR is above the threshold.
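The two integration rules and the thresholding step can be sketched as follows. The LLR values and the cutoff are invented for illustration, and the function names `integrate` and `build_network` are hypothetical.

```python
def integrate(llr_per_dataset, method="max"):
    """llr_per_dataset: dict gene pair -> list of data set-specific LLRs.

    method="sum" is the naive Bayes combination and method="max" keeps
    the single strongest piece of evidence.
    """
    combine = max if method == "max" else sum
    return {pair: combine(llrs) for pair, llrs in llr_per_dataset.items()}

def build_network(pair_llr, cutoff):
    """Adjacency sets for gene pairs whose integrated LLR exceeds the cutoff."""
    adjacency = {}
    for (g1, g2), llr in pair_llr.items():
        if llr > cutoff:
            adjacency.setdefault(g1, set()).add(g2)
            adjacency.setdefault(g2, set()).add(g1)
    return adjacency

# Toy LLRs (hypothetical) for three gene pairs across three data sets.
llrs = {
    ("A", "B"): [2.6, 0.4, 1.1],
    ("A", "C"): [0.9, 1.0, 0.8],
    ("B", "C"): [-0.5, 0.2, 0.1],
}
network_max = build_network(integrate(llrs, "max"), cutoff=2.0)
network_sum = build_network(integrate(llrs, "sum"), cutoff=2.0)
```

With the same cutoff, the summed score also connects A and C, whose evidence is weak in every individual data set, whereas the maximum rule keeps only the A-B edge. In practice the cutoff would be chosen separately for each scoring rule.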
Human protein-protein interaction network. Human protein interaction data were collected
and integrated from HPRD, MINT, IntAct, REACTOME, BioGRID, and DIP in April 2010. Only
experimentally determined interactions supported by publications were considered to assure the
reliability of the network. The consolidated data set comprised 94,148 interactions involving
11,660 proteins.
Co-expression module identification. We used ICE2.0, a modified version of the intuitive
Iterative Clique Enumeration (ICE) algorithm (9) to identify modules from the co-expression
network. ICE2.0 includes two steps. First, iterative clique enumeration is performed with the
clique size threshold set to 1 so that all nodes in the graph can be assigned to at least one
maximal clique. Next, a clique merging algorithm (10) is used to merge highly overlapping
cliques to form co-expression modules for reporting. Clique overlapping degree is evaluated by
the Meet/min coefficient. In this study, a Meet/min coefficient threshold of 1/3 was used for
merging. The source code of ICE2.0 is available at http://bioinfo.vanderbilt.edu/ice.
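The Meet/min coefficient of two cliques is the size of their intersection divided by the size of the smaller clique. The Python sketch below uses it in a simple greedy merging loop. This loop is an assumption made for illustration and is not the published clique merging algorithm (10) or the ICE2.0 source code; whether the threshold is inclusive is also assumed.

```python
def meet_min(a, b):
    """Meet/min coefficient: |A ∩ B| / min(|A|, |B|) for non-empty sets."""
    return len(a & b) / min(len(a), len(b))

def merge_cliques(cliques, threshold=1 / 3):
    """Greedily merge cliques whose Meet/min coefficient reaches the threshold.

    Repeats until no remaining pair of modules overlaps enough to merge.
    """
    modules = [set(c) for c in cliques]
    merged = True
    while merged:
        merged = False
        for i in range(len(modules)):
            for j in range(i + 1, len(modules)):
                if meet_min(modules[i], modules[j]) >= threshold:
                    modules[i] |= modules.pop(j)
                    merged = True
                    break
            if merged:
                break
    return modules

# Toy maximal cliques (hypothetical gene names).
cliques = [{"A", "B", "C"}, {"B", "C", "D"}, {"E", "F"}, {"F", "G", "H", "I"}]
modules = merge_cliques(cliques)
```

Here the first two cliques share two of three genes (coefficient 2/3) and the last two share one of two (coefficient 1/2), so four cliques collapse into two modules.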
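Overall expression level of a module. Because the gene expression data is normalized and
standardized, for a selected sample and a specific module Mj, the average expression of all genes in
the module is used to represent the overall expression of the module in the sample:
$$E_{M_j} = \frac{1}{|M_j|} \sum_{G_k \in M_j} E_{G_k}$$
where E_{G_k} is the standardized expression of gene Gk in the sample and |Mj| is the number of genes in the module.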
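A minimal Python sketch of this module score, using invented Z-scores and hypothetical gene and sample names:

```python
from statistics import mean

def module_expression(z_expr, module, sample):
    """Average standardized expression of a module's genes in one sample.

    z_expr: dict gene -> dict sample -> Z-score
    """
    return mean(z_expr[gene][sample] for gene in module)

# Toy standardized expression values (hypothetical).
z_expr = {
    "G1": {"S1": 1.2, "S2": -0.4},
    "G2": {"S1": 0.8, "S2": -0.6},
    "G3": {"S1": 1.0, "S2": 0.1},
}
module = ["G1", "G2", "G3"]
score_s1 = module_expression(z_expr, module, "S1")
score_s2 = module_expression(z_expr, module, "S2")
```

Because every gene is on the same Z-score scale, a positive score indicates that the module as a whole is expressed above its cross-sample average in that sample.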
Network visualization. Networks were visualized using Cytoscape (11).

For more information
Gene Ontology website: http://www.geneontology.org/
NIH Gene Expression Omnibus website: http://www.ncbi.nlm.nih.gov/geo/
NIH HomoloGene website: http://www.ncbi.nlm.nih.gov/homologene
EMBL ArrayExpress website: http://www.ebi.ac.uk/microarray-as/ae/
Gene Set Enrichment Analysis website: http://www.broadinstitute.org/gsea/
WEB-based GEne SeT AnaLysis Toolkit: http://bioinfo.vanderbilt.edu/webgestalt
Iterative Clique Enumeration (ICE), Vanderbilt University: http://bioinfo.vanderbilt.edu/ice
qPCR. A Tumor/Normal (T/NL) ratio was used to quantify relative mRNA levels for all genes
studied. Similarly, specific mouse mRNA species were measured relative to Pmm1 expression
using mouse specific gene primers, listed in Supplementary Table 2.
Invasion assays. MC38 cells were serum starved in 1% FBS media for 5 hrs prior to the invasion
assay. For invasion assays, 40 μl of 2 mg/ml matrigel (BD Biosciences) was used to coat inserts
prior to running the assay. For both invasion and migration assays, 2 x 10^5 cells in 0.5% FBS +
0.2% BSA medium were added to Falcon HTS Fluoroblok inserts in 24-well plates with 10%
FBS complete medium at the bottom. After 48 hrs (for invasion) or 24 hrs (for migration) of
incubation at 37°C with 5% CO2, the inserts with cells attached were stained with Calcein-AM
(Invitrogen, Carlsbad, CA) (2 μM in PBS). The bottom (migrated cells) and top (non-migrated
cells) fluorescence readings were measured with a plate reader (SOFT-PRO Max) at 485/530 nm
(excitation/emission). The bottom fluorescence reading was normalized to the top fluorescence
reading to negate any proliferation effect. Images of the migrated cells were taken on a fluorescence
microscope with a GFP filter.
For matrigel invasion studies, Cell Invasion/Migration (CIM) plates were treated with 10μl of
0.25mg/ml matrigel (BD Biosciences). Experimental cells with or without 20ng/ml FK506
(Sigma) were added to each well at a density of 10,000 cells/100μl, incubated for 24 hours, and
increase in cell index was monitored over time. The experiments were performed using 6
replicate wells per group. Trans-endothelial migration assays were conducted as described
previously (12) with the following modifications: E-plates (Roche Diagnostics, Indianapolis, IN)
were treated with 100μl of 0.1% sterile gelatin (Sigma, St. Louis, MO) overnight at 4°C. Plates
were washed once with sterile PBS before the addition of early passage Human Umbilical Vein
Endothelial Cells (HUVEC) in EBM-2 basal media (Lonza Biosciences, Basel, Switzerland)
supplemented with EGM-2 growth factors (Lonza Biosciences, Basel, Switzerland). The E-plates
were seeded with 25,000 HUVEC cells/100μl and incubated for 18 hours at 37°C in an
xCELLigence apparatus (Roche Diagnostics, Indianapolis, IN) while a monolayer was formed,
indicated by a plateau in cell index. Following the formation of the HUVEC monolayer, EGM-2
media was replaced with fresh 100μl EBM-2 basal media and the cell index was monitored for 4
hours and allowed to stabilize. Experimental cells (suspended in EBM-2 basal media) were next
added to each well at a density of 5000 cells/100μl and invasion was monitored over time. The
experiments were performed using 6 replicate wells per group. Trans-endothelial invasion is
measured over a 3-4 hr. time period of maximum drop in the cell index.
Immunofluorescence.
Immunofluorescence was conducted on fixed live cells using rabbit polyclonal anti-NFATc1
antibody (Abcam PLC, Cambridge, England). The fluorescent images were captured on an
AxioPlan2 microscope (Carl Zeiss Microscopy, Thornwood, NY) at 400x
magnification.
Chromatin Immunoprecipitation:
To characterize the putative NFATc1 (NFAT family) binding sites on the mouse
target gene promoters, ALGGEN PROMO, which runs TRANSFAC (version
8.3), was used (http://alggen.lsi.upc.es/cgibin/promo_v3/promo/promoinit.cgi?dirDB=TF_8.3). We selected the factors TFIID
[T00820], NF-AT2 [T01945], NF-AT1 [T01944], NF-AT1 [T00550], and NF-AT1
[T01948] from the available list to predict transcription factor binding sites (TFBS)
in the DNA sequence from -2500 to +300, downloaded from the mouse UCSC genome
browser. Independent analyses with WWWPromoter Scan (http://www-bimas.cit.nih.gov/molbio/proscan/)
and WWWSignal Scan (http://www-bimas.cit.nih.gov/molbio/signal/) were also performed. Two
to three predicted TFBS areas were selected for primer design. Primers were
checked on the UCSC genome browser for a unique product by in-silico PCR.
Primers were also verified by PCR on the sonicated chromatin (input DNA) for a
single product. Primers that were finally used in the qPCR analysis of the
immunoprecipitated chromatin are listed in Supplementary Table 6, and the
location of specific primers on promoters of target genes is given in Supplementary
File 5.
ChIP assays were performed using Magna ChIP (Millipore) following the
manufacturer’s protocol. Protease inhibitors were added throughout the procedure.
In brief, MC38Par and MC38Met cells were grown in 150cm plates to 70%
confluency and fixed using 18.5% fresh formaldehyde (final 1%) at room
temperature for 10 min. Crosslinking reactions were quenched with 125mM (final)
glycine. Cells were scraped in 1xPBS, briefly spun, and suspended in cell lysis
buffer and later in nuclear lysis buffer, followed by sonication of the isolated chromatin.
Sonication at 65% output with a 1 sec pulse for 7 sec, 22 times (name of the
sonicator in Beauchamp lab) was performed. The sonicated chromatin was
checked on a 1% agarose gel for appropriate size fractionation between 200-1000 bp.
Initially the sonication conditions were standardized using same cross linked cells.
For immunoprecipitation, 2 μg anti-NFATc1 or mouse-IgG (provided in kit) were
incubated with cell lysates overnight at 4°C with rotation. Anti-RNA Polymerase II
(clone CTD4H8, mouse, provided in kit) was used as a positive control. 2% of the
sheared chromatin was saved as input at 4°C (until elution of protein/DNA
complexes and reverse crosslinking of protein/DNA complexes to free DNA). 20
μl of Protein A/G magnetic beads were added and incubated overnight at 4°C with
rotation. Beads were then washed with Low Salt immune complex wash buffer,
High Salt immune complex wash buffer, LiCl immune complex wash buffer, and
TE for 3-5 minutes in sequence. DNA-protein complexes were eluted with 100 μl
Elution buffer containing Proteinase K, and crosslinks were reversed at 62°C for 2 hr.
DNA (immunoprecipitated and input chromatin) was recovered on the spin filter
column. Finally, the DNA was eluted with 50 μl elution buffer.
The immunoprecipitated DNA was analyzed for specific regions on the different promoters
by real-time PCR using SYBR Green on the Roche LightCycler LC480 platform. Each primer set
specific for a region was run in quadruplicate.
Fold change of the target DNA (both in IgG control and anti-NFATc1 IP) was
calculated using 2^(Input Chromatin Cp - ChIP Cp). Fold enrichment was
calculated relative to IgG control.
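The calculation can be written out as a short Python sketch. The Cp values are invented, the function names are hypothetical, and averaging the replicate Cp values before exponentiating is an assumption, since the text does not say how the quadruplicates were combined.

```python
from statistics import mean

def relative_to_input(input_cps, chip_cps):
    """Signal relative to input: 2^(Input Chromatin Cp - ChIP Cp).

    Replicate Cp values are averaged first (an assumption).
    """
    return 2 ** (mean(input_cps) - mean(chip_cps))

def fold_enrichment(input_cps, nfat_cps, igg_cps):
    """Anti-NFATc1 signal divided by the IgG control signal."""
    return (relative_to_input(input_cps, nfat_cps)
            / relative_to_input(input_cps, igg_cps))

# Toy quadruplicate Cp values (hypothetical) for one primer set.
input_cps = [25.0, 25.0, 25.0, 25.0]
nfat_cps = [27.0, 27.0, 27.0, 27.0]
igg_cps = [30.0, 30.0, 30.0, 30.0]

enrichment = fold_enrichment(input_cps, nfat_cps, igg_cps)
```

In this example the anti-NFATc1 sample crosses threshold three cycles earlier than the IgG control, which corresponds to a 2^3 = 8-fold enrichment.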
Normalizing the target DNA to input chromatin when comparing several
samples (in this case MC38Par and MC38Met cells) with the same primers
corrects for differences in the amount of input chromatin.
References
1. Rhodes, D.R., and Chinnaiyan, A.M. 2005. Integrative analysis of the cancer
transcriptome. Nat Genet 37 Suppl:S31-37.
2. Ramani, A.K., Li, Z., Hart, G.T., Carlson, M.W., Boutz, D.R., and Marcotte, E.M. 2008. A
map of human protein interactions derived from co-expression of human mRNAs and
their orthologs. Mol Syst Biol 4:180.
3. Lim, W.K., Wang, K., Lefebvre, C., and Califano, A. 2007. Comparative analysis of
microarray normalization procedures: effects on reverse engineering gene networks.
Bioinformatics 23:i282-288.
4. Reimers, M., and Carey, V.J. 2006. Bioconductor: an open source framework for
bioinformatics and computational biology. Methods Enzymol 411:119-134.
5. Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., and
Speed, T.P. 2003. Exploration, normalization, and summaries of high density
oligonucleotide array probe level data. Biostatistics 4:249-264.
6. Smyth, G.K., editor. 2005. Limma: Linear Models for Microarray Data. New York:
Springer. 397-420 pp.
7. Resnik, P. 1999. Semantic similarity in a taxonomy: an information-based measure and its
application to problems of ambiguity in natural language. J Artif. Intel. Res. 11:95-130.
8. Elo, L.L., Jarvenpaa, H., Oresic, M., Lahesmaa, R., and Aittokallio, T. 2007. Systematic
construction of gene coexpression networks with applications to human T helper cell
differentiation process. Bioinformatics 23:2096-2103.
9. Shi, Z., Derow, C.K., and Zhang, B. 2010. Co-expression module analysis reveals
biological processes, genomic gain, and regulatory mechanisms associated with breast
cancer progression. BMC Syst Biol 4:74.
10. Zhang, B., Park, B.H., Karpinets, T., and Samatova, N.F. 2008. From pull-down data to
protein interaction networks and complexes with biological relevance. Bioinformatics
24:979-986.
11. Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N.,
Schwikowski, B., and Ideker, T. 2003. Cytoscape: a software environment for integrated
models of biomolecular interaction networks. Genome Res 13:2498-2504.
12. Rahim, S., Beauchamp, E.M., Kong, Y., Brown, M.L., Toretsky, J.A., and Uren, A. 2011.
YK4-279 inhibits ERG and ETV1 mediated prostate cancer cell invasion. PLoS One 6:e19343.