Supplementary Figures for Vector Integration Sites Identification for Gene-Trap Screening in Mammalian Haploid Cells

Jian Yu (1,2) and Constance Ciaudo (1,*)

(1) Swiss Federal Institute of Technology Zurich, Department of Biology, Institute of Molecular Health Sciences, Chair of RNAi and Genome Integrity, Zurich, Switzerland.
(2) Life Science Zurich Graduate School, Molecular and Translational Biomedicine program, University of Zurich, Zurich, Switzerland.
(*) To whom correspondence may be addressed: Prof. Constance Ciaudo. E-mail: [email protected]

Figure S1, Insertion profile near transcription start sites (TSS) with or without removing duplicates, related to Figure 1c.
(a) For the mouse dataset [1], six independent biological replicates were merged and insertion profiles were generated before and after removing duplicates. (b) For the human dataset [2], insertion profiles were generated before and after removing duplicates for each sample. All insertion profiles were generated using ngs.plot [3].

Figure S2, Distribution of the change in pAUC (indicated as ΔpAUC) after randomly removing 5,154 genes from the annotation file in the human dataset [2].
[Plot: density curves of ΔpAUC for Count Enrichment and Sense Enrichment. x-axis: ΔpAUC; y-axis: Density. Annotations: "ΔpAUC: 0.058, Permuted-P: 0.024" and "ΔpAUC: 0.239, Permuted-P: <0.001".]
The black curve represents the count enrichment and the red curve the sense enrichment. The ΔpAUC values generated by removing real non-expressed genes are labeled on the horizontal axis. Permuted-P was calculated as the number of times the permuted ΔpAUC was larger than the real ΔpAUC, divided by 1,000.

Figure S3, pAUC for count enrichment test in the mouse dataset using different statistical tests and shrinkage methods.
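Figure S3 and several later figures report a rescaled pAUC at FPR=0.01. The supplement does not spell out the rescaling, so the following is only a minimal sketch under one assumption: the partial area under the ROC curve up to the FPR cutoff is divided by that cutoff, so that a perfect ranking scores 1. The function name rescaled_pauc and its arguments are hypothetical, and tied scores are not treated specially.

```python
def rescaled_pauc(scores, labels, max_fpr=0.01):
    """Partial ROC area up to max_fpr, divided by max_fpr.

    scores: higher means more confident that the gene is a hit.
    labels: True for known true-positive genes, False otherwise.
    ASSUMPTION: "rescaled" is taken to mean pAUC / max_fpr.
    """
    n_pos = sum(1 for is_hit in labels if is_hit)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    # Walk down the ranking from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    width = 1.0 / n_neg  # FPR gained per negative encountered
    area, fpr, true_pos = 0.0, 0.0, 0
    for i in order:
        if labels[i]:
            true_pos += 1
        else:
            # Add the rectangle under the step ROC curve, clipped at max_fpr.
            step = min(width, max_fpr - fpr)
            if step <= 0:
                break
            area += (true_pos / n_pos) * step
            fpr += step
    return area / max_fpr


# Toy ranking: two hits ranked first, one hit after the first negative.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [True, True, False, True, False, False]
toy_value = rescaled_pauc(toy_scores, toy_labels, max_fpr=0.5)
```

With only three negatives in the toy data, a cutoff of 0.5 is used so that the clipping is visible; on a genome-wide ranking the cutoff of 0.01 from the captions would be passed instead.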
[Plot: pAUC (y-axis) against SampleNumber (x-axis: 3, 4, 5). Legend (Method): DESeq2_common, DESeq2_DSS, DESeq2_LocalFit, DESeq2_Tagwise, edgeR_common, edgeR_DSS, edgeR_LocalFit, edgeR_Tagwise, Voom.]
Rescaled pAUC was calculated at FPR=0.01 to compare different shrinkage methods for the count enrichment test in the mouse dataset [1]. Methods include common dispersion, local fitting, tagwise dispersion and DSS, using DESeq2 (Wald test) [4] and edgeR (quasi-likelihood F-test) [5,6]. The voom+limma packages [7,8] were also included. Comparisons were performed as the sample size increased from 3 to 5.

Figure S4, GC-content bias in human dataset.
[Plot: panels a and b. x-axis: GC%; y-axis: log2-Normalized Count. Panel titles: SRR656615_Selected, SRR663777_Ctrl.]
Log2-transformed number of EIs (normalized against total count) against GC-content (%) for (a) the control library and (b) the selected library, using the R package cqn [9].

Figure S5, GC-content bias in mouse dataset.
[Plot: panels a to l. x-axis: GC%; y-axis: log2-Normalized Count. Panel titles: ELAM4C_Ctrl, ELAM5C_Ctrl, ELAM7C_Ctrl, ELAM8C_Ctrl, ELAM9C_Ctrl, ELAM10C_Ctrl, ELAM4D_Sel, ELAM5D_Sel, ELAM7D_Sel, ELAM8D_Sel, ELAM9D_Sel, ELAM10D_Sel.]
Log2-transformed number of EIs (normalized against total count) against GC-content (%) for (a, b, c, d, e, f) control libraries and (g, h, i, j, k, l) selected libraries, using the R package cqn [9].

Figure S6, Comparing different normalization methods in mouse dataset.
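The false discovery curves of Figure S6 (c, d) are described as an average over every way of relabelling three of the six control libraries as a 'selected library'. The sketch below only illustrates that enumeration. The library names are the control panel titles from Figure S5; mock_splits, average_false_discoveries and the count_false_discoveries callback (standing in for whichever enrichment test is being evaluated) are hypothetical names, not part of the original pipeline.

```python
from itertools import combinations

# Control libraries of the mouse dataset, named as in the Figure S5 panels.
CONTROL_LIBRARIES = [
    "ELAM4C_Ctrl", "ELAM5C_Ctrl", "ELAM7C_Ctrl",
    "ELAM8C_Ctrl", "ELAM9C_Ctrl", "ELAM10C_Ctrl",
]


def mock_splits(libraries, group_size=3):
    """Yield (mock_selected, remaining_controls) for every possible split."""
    libraries = list(libraries)
    for mock_selected in combinations(libraries, group_size):
        remaining = tuple(lib for lib in libraries if lib not in mock_selected)
        yield mock_selected, remaining


def average_false_discoveries(libraries, count_false_discoveries, group_size=3):
    """Average a user-supplied false-discovery count over all splits.

    count_false_discoveries(mock_selected, controls) is a HYPOTHETICAL
    callback; any gene it calls significant is a false discovery, because
    both groups are control libraries.
    """
    counts = [
        count_false_discoveries(mock_selected, controls)
        for mock_selected, controls in mock_splits(libraries, group_size)
    ]
    return sum(counts) / len(counts)


all_splits = list(mock_splits(CONTROL_LIBRARIES))
```

For six libraries this enumerates 20 labelled 3-versus-3 splits; if mirrored splits are treated as equivalent, only 10 of them are distinct.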
[Plot: panels a to d. Panels a and b: pAUC (y-axis) against SampleNumber (x-axis). Panels c and d: Average Number of False Discoveries (y-axis) against Genes Chosen (x-axis). Legend (Method): TC, RLE, TMM, CisGenome.]
Rescaled pAUC was calculated at FPR=0.01 to compare different normalization methods for the count enrichment test (a) and the sense enrichment test (b) in the mouse dataset [1]. Normalization methods include total count (TC), RLE (from DESeq2) [4], TMM (from edgeR) [10] and adapted CisGenome [11]. Comparisons of pAUC were performed as the sample size increased from 3 to 6. The pAUC values for all 6 samples correspond to those in Table 1. False discovery curves were generated for the count (c) and sense (d) enrichment tests, respectively. Three samples from the control libraries were labeled as 'selected library' and compared with the rest of the control libraries at FDR < 0.05. The curves show the number of false discoveries after averaging over all possible compositions of the 3-versus-3 comparisons.

Figure S7, Comparison of the effect of upstream inclusion on VISITs performance in human and mouse datasets.
[Plot: panels a and b. x-axis: Inclusion of Promoter Region (kb), 0 to 5; y-axis: pAUC. Legend: Count Enrichment, Sense Enrichment.]
Rescaled pAUC at FPR=0.01 was calculated with different sizes of the upstream region included in the human [2] (a) and mouse [1] (b) datasets.

Figure S8, Combined FDR achieved comparable performance for known genes and generated more potential candidates, compared with FDR derived from count or sense enrichment individually.
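The empirical p-values in Figure S8 (c, d) are described as the proportion of permutations in which the permuted statistic exceeds the observed one, with a value of 0 replaced by 1e-5; the Permuted-P of Figure S2 follows the same counting rule over 1,000 permutations. A minimal sketch of that rule is given below. The function names are hypothetical, and the permuted statistics (for example, the summarized connectivity in each permuted STRING network) are assumed to have been computed elsewhere.

```python
import math


def empirical_p_value(observed, permuted, floor=1e-5):
    """Proportion of permuted statistics strictly larger than the observed one.

    A proportion of exactly 0 is replaced by `floor` (1e-5 in the Figure S8
    caption) so that it can still be shown on a -log10 scale.
    """
    permuted = list(permuted)
    if not permuted:
        raise ValueError("at least one permuted statistic is required")
    exceed = sum(1 for value in permuted if value > observed)
    p = exceed / len(permuted)
    return floor if p == 0 else p


def neg_log10(p):
    """Transform a p-value to the -log10 scale used on the plot axis."""
    return -math.log10(p)


# Toy example: 2 of 4 permuted values exceed the observed statistic.
toy_p = empirical_p_value(0.5, [0.1, 0.6, 0.7, 0.2])
```

With 10,000 permutations the smallest non-zero proportion is 1e-4, so the 1e-5 floor keeps "never exceeded" distinguishable from "exceeded once".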
[Plot: panels a to d. Panels a and b: TPR (y-axis) against FPR (x-axis), annotated "pAUC of Count: 0.286, pAUC of Sense: 0.281, pAUC of sumz: 0.298" and "pAUC of Count: 0.790, pAUC of Sense: 0.806, pAUC of sumz: 0.877". Panels c and d: -log10 Empirical P (y-axis) against Num. of Novel Genes (x-axis: 10 to 50). Legend: Top Genes from Count Enrichment Only, Top Genes from Sense Enrichment Only, Top Genes from combined Score.]
(a-b) Performance of combined FDR (green curve) versus FDR derived using count (black curve) or sense enrichment (red curve) in the human (a) and mouse (b) datasets. In both datasets, ROC curves and rescaled pAUC (at FPR=0.01) were generated. An asterisk marks FDR=0.01.
(c-d) Ability of combined FDR (green curve) to reveal new candidates, compared with FDR derived from count (black curve) or sense enrichment (red curve) individually, in the human (c) and mouse (d) datasets. In both datasets, empirical p-values were generated for 10 to 50 potential candidates by randomly permuting the STRING [12] network 10,000 times and calculating the proportion of times the summarized connectivity between the known genes and the novel candidates was larger than the observed one. If the empirical p-value is 0, it is set to 1e-5. Potential candidates were defined as the top 10 to 50 candidates ranked by FDR (excluding known genes). The gray line indicates an empirical p-value of 0.05.

Figure S9, Comparison of results generated by VISITs and those in the two original papers.
To investigate how our results differ from those already published, the human dataset [2] was compared using the true-positive genes, as shown below. The improved power of our approach can be seen in both the count and sense enrichment methods.
The same comparison cannot be performed for the mouse dataset [1], as the authors did not provide a full gene list. However, Table S3 of that study [1] lists 25 significant candidates, and Tsix [13], a well-known antisense lncRNA involved in X chromosome inactivation (XCI), is missing from it. Other possibly missing XCI factors include Suz12, a subunit of the polycomb complex [14], and Rlim, a ubiquitin ligase [15]. These missing genes may indicate inadequate power of the original methods used in the mouse dataset [1].
Boxplot of confidence level (indicated as log-transformed FDR) for 36 true-positive genes, using the original gene list generated in the human dataset [2] (red boxes) and re-analyzed by our methods (blue boxes), for count and sense enrichments, respectively.

Figure S10, Boxplot of intra-group variance of mouse dataset in selected and control libraries.
[Plot: boxplots of IntraVariance (y-axis) by Type (x-axis: Count, Sense). Legend (Sample): ctrl, sel.]
For count enrichment, numbers of EIs were first normalized to total count. Subsequently, biological coefficients of variation (BCV) were calculated using the R function estimateTagwiseDisp in edgeR [5], for control and selected libraries separately. For sense enrichment, the standard deviation of the proportion of EIs in each gene was calculated for control and selected libraries separately. In both cases, the intra-group variance should be smaller in control than in selected libraries, due to random selection.

Figure S11, Minus-Average plot for the mouse dataset [1].
(a) For the count enrichment in the mouse dataset, independent insertions of each gene were first normalized to total count, and then M (selected minus control libraries) was plotted against A (average of selected and control libraries).
(b) For the sense enrichment in the mouse dataset, M of the proportion of EIs was plotted against A. In both panels, genes involved in regulation of proliferation in stem cells (GO: 200648) are highlighted in red.

Figure S12, Coverage tracks of a known gene in the human dataset [2].
[Plot: ENSG00000110080_St3gal4_chr11+:126355639-126440345. Tracks: SelectedPlus, SelectedMinus, CtrlPlus, CtrlMinus, followed by the Plus and Minus gene models and the chromosome coordinate axis.]
Coverage tracks of the human gene ST3GAL4 were generated using the R package GenomeGraphs [16] for the control library (red) and the selected library (black) in a strand-specific way. This gene was selected as an example for visualization because it has higher coverage and has been reported to be involved in Lassa virus infection in a second paper from the same group [17]. Insertions were enriched in exons and on the sense strand. The gene model and chromosome coordinates are shown at the bottom.

Figure S13, Bubble plot in the mouse dataset [1].
Only the first 1,000 genes ranked by combined FDR are shown in the plot, and the top 20 genes are highlighted. The y-axis indicates the significance level (-log10-transformed FDR), the x-axis indicates the chromosome, and the size of each gene's bubble is proportional to its number of insertions.

References
1. Monfort, A., et al., Identification of Spen as a Crucial Factor for Xist Function through Forward Genetic Screening in Haploid Embryonic Stem Cells. Cell Reports, 2015.
2. Jae, L.T., et al., Deciphering the glycosylome of dystroglycanopathies using haploid screens for lassa virus entry. Science, 2013. 340(6131): p. 479-483.
3. Li, S., et al., ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases. BMC Genomics, 2014. 15(1): p. 284.
4. Love, M.I., W. Huber, and S. Anders, Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol, 2014. 15(12): p. 550.
5. Robinson, M.D., D.J. McCarthy, and G.K. Smyth, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 2010. 26(1): p. 139-140.
6. Lund, S.P., et al., Detecting Differential Expression in RNA-sequence Data Using Quasi-likelihood with Shrunken Dispersion Estimates. Statistical Applications in Genetics and Molecular Biology, 2012. 11(5).
7. Ritchie, M.E., et al., limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 2015. 43(7).
8. Law, C.W., et al., voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 2014. 15(2): p. R29.
9. Hansen, K.D., R.A. Irizarry, and Z.J. Wu, Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics, 2012. 13(2): p. 204-216.
10. Robinson, M.D. and A. Oshlack, A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol, 2010. 11(3): p. R25.
11. Ji, H., et al., An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol, 2008. 26(11): p. 1293-300.
12. Szklarczyk, D., et al., STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 2015. 43(Database issue): p. D447-52.
13. Lee, J.T., L.S. Davidow, and D. Warshawsky, Tsix, a gene antisense to Xist at the X-inactivation centre. Nature Genetics, 1999. 21(4): p. 400-404.
14. Schoeftner, S., et al., Recruitment of PRC1 function at the initiation of X inactivation independent of PRC2 and silencing. EMBO Journal, 2006. 25(13): p. 3110-3122.
15. Shin, J., et al., RLIM is dispensable for X-chromosome inactivation in the mouse embryonic epiblast. Nature, 2014. 511(7507): p. 86U443.
16. Durinck, S. and J. Bullard, GenomeGraphs: Plotting genomic information from Ensembl. R package version 1.32.0. Bioconductor, 2016.
17. Jae, L.T., et al., Lassa virus entry requires a trigger-induced receptor switch. Science, 2014. 344(6191): p. 1506-1510.