Supplementary Figures for
Vector Integration Sites Identification for Gene-Trap Screening in Mammalian Haploid Cells
Jian Yu1,2 and Constance Ciaudo1,*
1Swiss Federal Institute of Technology Zurich, Department of Biology, Institute of Molecular Health Sciences, Chair of RNAi and Genome Integrity, Zurich, Switzerland.
2Life Science Zurich Graduate School, Molecular and Translational Biomedicine program, University of Zurich, Zurich, Switzerland.
*To whom correspondence may be addressed: Prof. Constance Ciaudo. E-mail: [email protected]
Figure S1, Insertion profiles near transcription start sites (TSS) with or without removing duplicates, related to Figure 1c.
(a) For the mouse dataset1, six independent biological replicates were merged and insertion profiles were generated before and after removing duplicates. (b) For the human dataset2, insertion profiles were generated before and after removing duplicates for each sample. All insertion profiles were generated using ngs.plot3.
Figure S2, Distribution of change of pAUC (indicated as ΔpAUC) after randomly removing 5,154 genes from the annotation file in the human dataset2.
[Density plot: Density (y-axis) against ΔpAUC (x-axis). Curves: Count Enrichment, Sense Enrichment. Annotations in the plot: ΔpAUC: 0.058, Permuted-P: 0.024; ΔpAUC: 0.239, Permuted-P: <0.001.]
The black curve represents the count enrichment and the red curve the sense enrichment. The ΔpAUC values obtained by removing the real non-expressed genes are marked on the horizontal axis. Permuted-P was calculated as the number of permutations in which the permuted ΔpAUC was bigger than the real ΔpAUC, divided by 1,000.
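The Permuted-P rule in this caption can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function name `permuted_p` and the toy numbers are hypothetical, and only the rule itself (permuted values bigger than the real one, divided by the number of permutations) is taken from the caption.

```python
def permuted_p(real_delta, permuted_deltas):
    """Fraction of permuted ΔpAUC values strictly bigger than the real ΔpAUC."""
    n_bigger = sum(1 for d in permuted_deltas if d > real_delta)
    return n_bigger / len(permuted_deltas)


# Toy input (made up): 1,000 permuted values, 24 of which exceed the real one.
real = 0.058
permuted = [0.10] * 24 + [0.00] * 976
print(permuted_p(real, permuted))
```

With 1,000 permutations the smallest non-zero value this rule can return is 0.001, which is presumably why a result with no exceeding permutation is reported as <0.001 rather than 0.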
Figure S3, pAUC for count enrichment test in the mouse dataset using different statistical tests and shrinkage methods.
[Plot: pAUC (y-axis) against SampleNumber (x-axis, 3 to 5). Methods: DESeq2_common, DESeq2_DSS, DESeq2_LocalFit, DESeq2_Tagwise, edgeR_common, edgeR_DSS, edgeR_LocalFit, edgeR_Tagwise, Voom.]
Rescaled pAUC was calculated at FPR=0.01 to compare different shrinkage methods for the count enrichment test in the mouse dataset1. Methods include common dispersion, local fitting, tagwise dispersion and DSS, using DESeq2 (Wald test)4 and edgeR (quasi-likelihood F-test)5,6. The voom+limma7,8 packages were also included. Comparisons were performed as the sample size increased from 3 to 5.
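As a rough illustration of a rescaled pAUC, the sketch below integrates a ROC curve from FPR=0 to a cutoff and divides by the cutoff, so that a perfect ranking scores 1. The rescaling rule is an assumption, since the caption does not state which one was used. The function name `rescaled_pauc`, its arguments and the toy data are also hypothetical.

```python
import numpy as np


def rescaled_pauc(scores, labels, max_fpr=0.01):
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr.

    scores: higher means more confident that the gene is a true positive.
    labels: 1 for known true-positive genes, 0 otherwise.
    Tied scores are not treated specially in this sketch.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")
    hits = labels[order]
    # ROC points after each successive gene is called positive.
    tpr = np.concatenate(([0.0], np.cumsum(hits) / hits.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - hits) / (1 - hits).sum()))
    # Cut the curve at max_fpr, interpolating the TPR at the cutoff.
    keep = fpr <= max_fpr
    x = np.append(fpr[keep], max_fpr)
    y = np.append(tpr[keep], np.interp(max_fpr, fpr, tpr))
    # Trapezoidal area under the truncated curve.
    area = float(np.sum(np.diff(x) * (y[:-1] + y[1:]) / 2.0))
    return area / max_fpr


# Toy ranking (made up): two known genes among four, cutoff widened to 0.5.
print(rescaled_pauc([4, 3, 2, 1], [1, 0, 1, 0], max_fpr=0.5))
```

Other standardizations exist, so values from this sketch need not match the figure.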
Figure S4, GC-content bias in human dataset.
[Panels a and b: log2-Normalized Count (y-axis) against GC% (x-axis). Panel titles: SRR656615_Selected, SRR663777_Ctrl.]
Log2-transformed number of EIs (normalized against total count) against GC-content (%) for (a) the control library and (b) the
selected library, using R package cqn9.
Figure S5, GC-content bias in mouse dataset.
[Panels a to l: log2-Normalized Count (y-axis) against GC% (x-axis). Panel titles: ELAM4C_Ctrl, ELAM5C_Ctrl, ELAM7C_Ctrl, ELAM8C_Ctrl, ELAM9C_Ctrl, ELAM10C_Ctrl, ELAM4D_Sel, ELAM5D_Sel, ELAM7D_Sel, ELAM8D_Sel, ELAM9D_Sel, ELAM10D_Sel.]
Log2-transformed number of EIs (normalized against total count) against GC-content (%) for (a, b, c, d, e, f) control libraries and (g, h, i, j, k, l) selected libraries, using R package cqn9.
Figure S6, Comparing different normalization methods in mouse dataset.
[Panels a and b: pAUC (y-axis) against SampleNumber (x-axis). Panels c and d: Average Number of False Discoveries (y-axis) against Genes Chosen (x-axis). Methods in all panels: TC, RLE, TMM, CisGenome.]
Rescaled pAUC was calculated at FPR=0.01 to compare different normalization methods for the count enrichment test (a) and the sense enrichment test (b) in the mouse dataset1. Normalization methods include total count (TC), RLE (from DESeq2)4, TMM (from edgeR)10 and adapted CisGenome11. Comparisons of pAUC were performed as the sample size increased from 3 to 6. The pAUC values for all 6 samples correspond to those in Table 1. False discovery curves were generated for the count (c) and sense (d) enrichment tests. Three samples from the control libraries were labeled as 'selected library' and compared with the rest of the control libraries at FDR < 0.05. The curves show the number of false discoveries averaged over all possible compositions of the 3-versus-3 comparisons.
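The 3-versus-3 mock comparisons can be enumerated as in the sketch below. It is illustrative only: the library labels are borrowed from the panel titles of Figure S5, and `count_false_discoveries` stands in for whatever enrichment test is run on each split (a hypothetical callback, not a real API). Whether mirrored splits were counted once or twice in the original analysis is not stated; this sketch counts every choice of three mock 'selected' libraries.

```python
from itertools import combinations


def mock_comparisons(control_libraries, group_size=3):
    """Yield every (mock 'selected', remaining control) split of the libraries."""
    for mock_selected in combinations(control_libraries, group_size):
        rest = tuple(s for s in control_libraries if s not in mock_selected)
        yield mock_selected, rest


def average_false_discoveries(control_libraries, count_false_discoveries):
    """Average a per-split false-discovery count over all splits."""
    counts = [count_false_discoveries(sel, rest)
              for sel, rest in mock_comparisons(control_libraries)]
    return sum(counts) / len(counts)


controls = ["ELAM4C", "ELAM5C", "ELAM7C", "ELAM8C", "ELAM9C", "ELAM10C"]
splits = list(mock_comparisons(controls))
print(len(splits))  # 20 ways to label 3 of 6 control libraries as 'selected'
```

Because both groups come from control libraries, any gene called significant in such a split is a false discovery by construction.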
Figure S7, Comparison of the effect of upstream inclusion on VISITs performance in human and mouse datasets.
[Panels a and b: pAUC (y-axis) against Inclusion of Promoter Region (kb) (x-axis, 0 to 5). Curves: Count Enrichment, Sense Enrichment.]
Rescaled pAUC at FPR=0.01 was calculated with different sizes of upstream region included in the human2 (a) and mouse1 (b) datasets.
Figure S8, Combined FDR achieved comparable performance for known genes, and generated more potential candidates,
compared to FDR derived from count or sense enrichments, individually.
[Panels a and b: ROC curves, TPR (y-axis) against FPR (x-axis). Annotations in the plots: pAUC of Count: 0.286, pAUC of Sense: 0.281, pAUC of sumz: 0.298; pAUC of Count: 0.790, pAUC of Sense: 0.806, pAUC of sumz: 0.877. Panels c and d: −log10 Empirical P (y-axis) against Num. of Novel Genes (x-axis, 10 to 50). Legend: Top Genes from Count Enrichment Only, Top Genes from Sense Enrichment Only, Top Genes from combined Score.]
(a-b) Performance of combined FDR (green curve) versus FDR derived using count (black curve) or sense enrichment (red curve) in human (a) and mouse datasets (b). In both datasets, ROC curve and rescaled pAUC (at FPR=0.01) were generated. Asterisk was labeled at FDR=0.01.
(c-d) Ability of combined FDR (green curve) to reveal new candidates, compared with FDR derived from count (black curve) or sense enrichment (red curve), individually, in human (c) and mouse datasets (d).
In both datasets, empirical p-values were generated for 10 to 50 potential candidates by randomly permuting the STRING12 network 10,000 times and calculating the proportion of permutations in which the summarized connectivity between the known genes and the novel candidates is larger than the observed one. If the empirical p-value is 0, it is set to 1e-5. Potential candidates were defined as the top 10 to 50 candidates ranked by FDR (excluding known genes). The gray line indicates an empirical p-value of 0.05.
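The floor of 1e-5 matters because the panels plot −log10 of the empirical p-value, which is undefined at 0. A minimal sketch of that transformation follows; the function name and arguments are hypothetical, while the 10,000 permutations and the 1e-5 floor are taken from the caption.

```python
import math


def neg_log10_empirical_p(n_exceeding, n_permutations=10_000, floor=1e-5):
    """-log10 of an empirical p-value, replacing p = 0 with `floor`."""
    p = n_exceeding / n_permutations
    if p == 0:
        p = floor  # log10(0) is undefined, so a zero p-value is floored
    return -math.log10(p)


print(neg_log10_empirical_p(0))    # floored p-value
print(neg_log10_empirical_p(500))  # p = 0.05, the level of the gray line
```

With 10,000 permutations the smallest non-zero p-value is 1e-4, so the floor places a p-value of 0 one unit above it on the −log10 scale.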
Figure S9, Comparison of results generated by VISITs and those in the two original papers.
To investigate how our results differ from those already published, the human dataset2 was compared using the true-positive genes, as shown below. Improved power of our approaches can be seen in both the count and sense enrichment methods. The same comparison cannot be performed for the mouse dataset1, as the authors did not provide a full gene list. However, Table S3 of the original paper1 lists 25 significant candidates, and Tsix13, a well-known antisense lncRNA involved in X chromosome inactivation (XCI), is missing from this table. Other possibly missing XCI factors include Suz12, a subunit of the polycomb complex14, and Rlim, a ubiquitin ligase15. These missing genes may indicate inadequate power of the original methods used in the mouse dataset1.
Boxplot of confidence level (indicated as log-transformed FDR) for 36 true-positive genes, using the original gene list generated in the human dataset2 (red boxes) and the list re-analyzed by our methods (blue boxes), for count and sense enrichments, respectively.
Figure S10, Boxplot of intra-group variance of mouse dataset in selected and control libraries.
[Boxplot: IntraVariance (y-axis, 0.0 to 1.5) by Type (x-axis: Count, Sense). Sample groups: ctrl, sel.]
For count enrichment, numbers of EIs were first normalized to total count. Subsequently, biological coefficients of variation (BCV) were calculated using the R function estimateTagwiseDisp in edgeR5, for control and selected libraries, separately. For sense enrichment, the standard deviation of the proportion of EIs in each gene was calculated for control and selected libraries, separately. In both cases, the intra-group variance should be smaller in the control than in the selected libraries, due to random selection.
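For the sense enrichment, the per-gene spread can be sketched as below. This is only an illustration: the input is assumed to be a genes-by-libraries table of proportions that has already been computed, the sample standard deviation (`ddof=1`) is an assumption, and the function name and toy values are hypothetical.

```python
import numpy as np


def per_gene_sd(proportions):
    """Sample standard deviation across libraries for each gene.

    proportions: 2-D array-like, rows = genes, columns = libraries of one group.
    """
    values = np.asarray(proportions, dtype=float)
    return values.std(axis=1, ddof=1)


# Toy proportions (made up) for two genes in three libraries per group.
ctrl = [[0.50, 0.52, 0.48], [0.40, 0.41, 0.39]]
sel = [[0.90, 0.60, 0.75], [0.40, 0.45, 0.35]]
print(per_gene_sd(ctrl))
print(per_gene_sd(sel))
```

The function is applied to the control and selected groups separately, mirroring the caption.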
Figure S11, Minus-Average plot for the mouse dataset1.
(a) For the count enrichment in the mouse dataset, independent insertions of each gene were first normalized to total count, and then M (selected minus Ctrl libraries) was plotted against A (average of selected and Ctrl libraries). (b) For the sense enrichment in the mouse dataset, M of the proportion of EIs was plotted against A. In both panels, genes involved in regulation of proliferation in stem cells (GO: 200648) are highlighted in red.
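The M and A coordinates defined in this caption reduce to a difference and a mean per gene. The sketch below is illustrative only; the function name and toy values are hypothetical, and the inputs are assumed to be already normalized (and transformed, if any transformation was applied, which the caption does not state).

```python
def minus_average(selected, control):
    """Return per-gene (M, A): M = selected - control, A = their mean."""
    m = [s - c for s, c in zip(selected, control)]
    a = [(s + c) / 2 for s, c in zip(selected, control)]
    return m, a


# Toy normalized values (made up) for three genes.
m, a = minus_average([3.0, 1.0, 0.5], [1.0, 1.0, 1.5])
print(m)
print(a)
```

Genes enriched in the selected libraries have positive M, and A places them along the horizontal axis by overall abundance.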
Figure S12, Coverage tracks of a known gene in the human dataset2.
[Coverage tracks: SelectedPlus, SelectedMinus, Ctrl Plus, Ctrl Minus, each with a y-axis from −1 to 1. Plot title: ENSG00000110080_St3gal4_chr11+, with coordinates 126355639 and 126440345. Legend: Plus, Minus.]
Coverage tracks of the human gene ST3GAL4 were generated using the R package GenomeGraphs16 for the control library (red) and the selected library (black) in a strand-specific way. This gene was selected as an example for visualization because it has higher coverage and has been reported to be involved in Lassa virus infection in a second paper from the same group17. Insertions were observed to be enriched in exonic regions and on the sense strand. The gene model and chromosome coordinates are shown at the bottom.
Figure S13, Bubble plot in the mouse dataset1.
Only the first 1,000 genes ranked by combined FDR are shown in the plot, and the top 20 genes are highlighted. The y-axis indicates the significance level (−log10-transformed FDR); the x-axis indicates the chromosome, and the size of each gene's bubble is proportional to the number of insertions.
References
1. Monfort, A., et al., Identification of Spen as a Crucial Factor for Xist Function through Forward Genetic Screening in Haploid Embryonic Stem Cells. Cell reports, 2015.
2. Jae, L.T., et al., Deciphering the glycosylome of dystroglycanopathies using haploid screens for lassa virus entry. Science, 2013. 340(6131): p. 479-483.
3. Li, S., et al., ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases. BMC Genomics, 2014. 15(1): p. 284.
4. Love, M.I., W. Huber, and S. Anders, Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol, 2014. 15(12): p. 550.
5. Robinson, M.D., D.J. McCarthy, and G.K. Smyth, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 2010. 26(1): p. 139-140.
6. Lund, S.P., et al., Detecting Differential Expression in RNA-sequence Data Using Quasi-likelihood with Shrunken Dispersion Estimates. Statistical Applications in Genetics and Molecular Biology, 2012. 11(5).
7. Ritchie, M.E., et al., limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 2015. 43(7).
8. Law, C.W., et al., voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 2014. 15(2): p. R29.
9. Hansen, K.D., R.A. Irizarry, and Z.J. Wu, Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics, 2012. 13(2): p. 204-216.
10. Robinson, M.D. and A. Oshlack, A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol, 2010. 11(3): p. R25.
11. Ji, H., et al., An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol, 2008. 26(11): p. 1293-300.
12. Szklarczyk, D., et al., STRING v10: protein-protein interaction networks, integrated over the tree of life. NAR, 2015. 43 (Database issue): p. D447-52.
13. Lee, J.T., L.S. Davidow, and D. Warshawsky, Tsix, a gene antisense to Xist at the X-inactivation centre. Nature Genetics, 1999. 21(4): p. 400-404.
14. Schoeftner, S., et al., Recruitment of PRC1 function at the initiation of X inactivation independent of PRC2 and silencing. Embo Journal, 2006. 25(13): p. 3110-3122.
15. Shin, J., et al., RLIM is dispensable for X-chromosome inactivation in the mouse embryonic epiblast. Nature, 2014. 511(7507): p. 86U443.
16. Durinck, S. and J. Bullard, GenomeGraphs: Plotting genomic information from Ensembl. R package version 1.32.0. Bioconductor, 2016.
17. Jae, L.T., et al., Lassa virus entry requires a trigger-induced receptor switch. Science, 2014. 344(6191): p. 1506-1510.