Supplemental Methods

Assembly and Annotation

De novo transcriptome assembly was performed using the Oases transcriptome assembler (v.0.1.22) (Schulz et al., 2012), an extension of the Velvet genome assembler (v.1.1.05) (Zerbino and Birney, 2008). Assembly was performed in paired-end mode with a minimum output contig length of 200 bp. Unpaired reads (those whose partner read was discarded during pre-processing) were also included in the assembly. The k-mer length (k) was set to 57, 59, 61, and 63 for separate sub-assemblies. The transcriptome sub-assemblies were merged using CD-HIT-EST v.4.3 (Li and Godzik, 2006) and Phrap v.0.020425 (Green, 2009).

Annotation was performed using a tiered approach. All contigs were first masked using RepeatMasker (v.3.3.0) (Smit et al., 1996-2010). NCBI BLAST was then used to align the masked contigs to the following nucleotide databases: mouse, rat and human cDNAs from Ensembl Release 64; FANTOM3 (Maeda et al., 2006); NONCODE v.2.0 (He et al., 2008); GenBank (est, nt, gss, and refseq_rna divisions) (March 2011); and an in-house collection of annotated Chinese hamster ESTs (Jacob et al., unpublished data). BLAST parameters were: blastn, E-value ≤ 1E-04, all others default. BLAST hits were first divided into tiers according to their bit scores (Tier 1: ≥ 200; Tier 2: ≥ 100 and < 200; Tier 3: < 100). The top-tier hit was assigned to the contig, except where hits to multiple databases existed in the same top tier; in such cases, priority for annotation was given in the order Ensembl mouse > FANTOM > Ensembl rat > Ensembl human > NONCODE > GenBank > Chinese hamster ESTs. For masked contigs with no hits to the above databases, the sequences were unmasked and the BLAST search was repeated; hits were annotated as Tier 4 using the same database assignment priority.
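The tier and database-priority rules described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the tuple layout of a hit, and the tie-break by higher bit score within one database are assumptions made for illustration, and Tier 3 is taken to mean bit scores below 100. The Tier 4 step (re-running BLAST on unmasked contigs) is not modelled.

```python
# Database priority for hits that share the same top tier (order taken from the text).
DB_PRIORITY = [
    "Ensembl mouse", "FANTOM", "Ensembl rat", "Ensembl human",
    "NONCODE", "GenBank", "Chinese hamster ESTs",
]


def tier_of(bit_score):
    """Return the tier (1, 2 or 3) for a BLAST bit score."""
    if bit_score >= 200:
        return 1
    if bit_score >= 100:
        return 2
    return 3


def pick_annotation(hits):
    """Choose one annotation for a contig.

    `hits` is a list of (database, subject_id, bit_score) tuples; this layout
    is hypothetical. Returns (tier, database, subject_id), or None if the
    contig has no hits.
    """
    if not hits:
        return None
    best_tier = min(tier_of(score) for _, _, score in hits)
    top = [h for h in hits if tier_of(h[2]) == best_tier]
    # Within the top tier: database priority first, then (assumed) higher bit score.
    db, subject, _ = min(top, key=lambda h: (DB_PRIORITY.index(h[0]), -h[2]))
    return best_tier, db, subject


example_hits = [
    ("GenBank", "gb1", 350.0),
    ("Ensembl rat", "rat1", 210.0),
    ("Ensembl mouse", "mm1", 150.0),
]
print(pick_annotation(example_hits))
```

In this example the GenBank and Ensembl rat hits both fall in Tier 1, so the Ensembl rat hit is chosen by database priority even though its bit score is lower; the Ensembl mouse hit is ignored because it only reaches Tier 2.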
Gene sets for 186 KEGG pathways (http://www.genome.jp/kegg/pathway.html) were downloaded from the Broad Institute (http://www.broadinstitute.org/gsea/msigdb/collections.jsp) in the form of lists of Ensembl mouse gene IDs for functional class analysis.

Analysis of Genes with Large-Magnitude Differences between Tissues and Cell Line

We identified genes with at least a 10-fold change in expression (in units of upper-quartile normalized read counts) between the rBHK cell line and both tissues (including genes absent in rBHK but present in both tissues, and vice versa). This gene list was used as the input for a functional enrichment analysis using DAVID (Huang et al., 2009; Huang et al., 2008), with the list of all Ensembl mouse gene IDs in the assembly as the background. Statistically significant (p-value ≤ 0.05) pathways or functional classes were identified using DAVID’s gene functional annotation, functional annotation clustering, and functional annotation chart tools.

Local Realignment

BWA read alignments were refined by local realignment using GATK v.1.5-20 (McKenna et al., 2010). A standard mapping algorithm maps each read to the reference individually. Such a mapping is optimal for each read taken alone but may be sub-optimal for the reads taken together. The local realignment step considers all reads mapped to the same locus jointly (rather than one read at a time) to correct sub-optimal alignments that could otherwise produce an erroneous variant call close to an insertion or deletion (see Figure S9).

Filtering Criteria

Base Quality Score

The base Phred quality score (Q) (Ewing and Green, 1998; Ewing et al., 1998) is a measure of the base-calling error probability (P) and is defined as Q = -10 log10(P). The chance that a base with a quality score of 30 is incorrectly called is therefore 1 in 1000.
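As a small worked illustration of the Phred definition above, the following Python sketch converts between a quality score and an error probability. The function names are hypothetical and serve only to make the arithmetic concrete.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Each step of 10 in Q lowers the error probability tenfold, so Q = 30 corresponds to an error probability of about 0.001.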
Since both the reference base and the variant base have an associated base quality score, both quality scores were required to be ≥ 30 for further consideration. The distribution of base quality scores for rBHK_1 and rBHK_2 potential variants is provided in Figure S10.

Mapping Quality Score

The mapping quality score (MQ) (Li et al., 2008) measures the probability that a read is misaligned. Variants located in reads with low mapping quality are deemed low-confidence variants. For a particular variant, the root mean square mapping quality (MQrms) of all the reads containing that variant gives an indication of its quality. The MQrms is reported on the Phred scale, like the base quality score. The distributions of the MQrms associated with the variants are shown in Figure S11. A filter of MQrms > 20 was chosen for high-quality variants.

Proximity to Insertion or Deletion

Variants located near an insertion or deletion (indel) are typically artifacts of sub-optimal alignments that result in erroneous variant calling. Although the majority of variants located near an indel were eliminated by local realignment, a significant fraction of them may persist (Figure S12). To remove these possible errors, variants located within 5 bp of an indel were filtered out.

Strand Bias

The majority of systematic sequencing errors in Illumina data result from ambiguity in base-calling. In particular, the spectra for the G and the T nucleotides are known to overlap significantly (Meacham et al., 2011). In addition, nucleotides in several specific sequence patterns, such as GGT, are prone to being misread. In both cases, the problem occurs in only one strand of the reads, while the complementary reads are not affected. Since the Illumina protocol used in this study involves sequencing both strands of the cDNA, such a systematic error will manifest itself as an incorrect base call in only one specific direction of the reads.
To eliminate this type of systematic error, variants that were supported by reads in only a single read direction were filtered out. Furthermore, while a systematic error results in a variant present in only one direction of the reads, a combination of systematic error and random error can cause an erroneous variant to appear in both directions but heavily biased towards one direction. For such cases, the Fisher Exact Test (FET) was applied. The FET is based on the null hypothesis that the type of base (i.e., number of reads with the reference or the variant base) and the strand direction (i.e., number of reads in the forward or the reverse strand) are independent. If a single type of base (e.g., the variant base) has a significant association with a specific strand direction, a strand bias is indicated. The FET tested the significance of the deviation from the null hypothesis at each variant site. The calculated p-value was converted to a Phred-scaled score. Histograms of the Phred-scaled p-value score are shown in Figure S13. We required the Phred-scaled p-value to be at most 100 to call strand bias.

Variant Position within Reads

Both sequencing and alignment errors occur more frequently at the ends of the reads. The fraction of variant reads with the variant base located at either the first or the last position of the read was calculated, and its distribution is shown in Figure S14. This fraction was required to be < 0.3.

Poisson Model

To distinguish potentially real variants from random errors, a Poisson model was employed. In the Poisson model, the number of variant reads observed at a particular site is used to compute the probability that all of those variant reads occurred due to random error (estimated at a maximum rate of 1% according to Illumina). The significance level at each variant site is quantified by the calculated p-value.
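The two statistical tests described in this section can be sketched in Python using only the standard library. This is a hedged illustration, not the authors' implementation: all function names are hypothetical, the Fisher exact test is assumed to be two-sided, and the Poisson mean is assumed to be the read depth multiplied by the 1% error rate quoted above.

```python
import math


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = math.comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def strand_bias_phred(ref_fwd, ref_rev, var_fwd, var_rev):
    """Phred-scaled FET p-value; larger values indicate stronger strand bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, var_fwd, var_rev)
    return -10 * math.log10(max(p, 1e-300))  # floor avoids log10(0)


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam) with lam > 0, summed from k upward."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total, i = 0.0, k
    # Keep adding terms until they no longer change the sum.
    while term > 0.0 and (i <= lam or term > total * 1e-17):
        total += term
        i += 1
        term *= lam / i
    return min(1.0, total)


def random_error_p_value(variant_reads, depth, error_rate=0.01):
    """p-value that all variant reads at a site arose from random error."""
    return poisson_upper_tail(variant_reads, depth * error_rate)


print(strand_bias_phred(50, 50, 20, 0))   # variant reads on one strand only
print(random_error_p_value(10, 100))      # 10 variant reads at depth 100
```

Summing the Poisson tail directly, rather than subtracting the cumulative probability from 1, keeps very small p-values from being lost to floating-point rounding, which matters when thresholds far below 1 are applied.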
Multiple hypothesis testing was accounted for by controlling the family-wise error rate (FWER) at a significance level α of 0.1 using the Bonferroni correction, corresponding to a p-value of ~1E-6 (see Figure S15).

Supplemental Figures

Figure S1. Distribution of contig lengths in the final transcriptome assembly. The black portion of each bar represents the fraction of contigs annotated; the white portion represents the unannotated fraction. Minimum contig length is 200 bp.
Figure S2. Percentage of genes, in each of 20 selected KEGG pathways, represented in the BHK transcriptome assembly. The number over each bar indicates the total number of genes in the KEGG pathway.
Figure S3. Comparison of expression levels of N-glycosylation genes across all four libraries. Genes are sorted by rBHK1 upper-quartile normalized read counts, highest to lowest.
Figure S4. Cell cycle genes in the rBHK cell line are over-expressed relative to their levels in both liver and brain tissue. (Modified from the “Cell cycle” pathway map in the KEGG Pathway database: http://www.genome.jp/kegg/pathway.html)
Figure S5. Workflow for single nucleotide variant detection.
Figure S6. Receiver operating characteristic (ROC) curves for the Poisson filter in (a) brain and liver libraries and (b) rBHK1 and rBHK2 libraries.
Figure S7. Chromatograms of the Sanger sequencing show the presence of (a) the wild-type sequence and (b) the G-to-A single nucleotide substitution variant sequence from the product-gene cDNA prepared with SSIII Reverse Transcriptase.
Figure S8. Chromatograms of the Sanger sequencing show the presence of (a) the wild-type sequence and (b) the insertion of an A nucleotide variant sequence in the product-gene cDNA.
Figure S9. Example of the difference in variant calls from read alignments (a) before and (b) after local realignment by GATK.
Figure S10. Base quality score distributions for initially-shared and library-exclusive variants in the libraries (a) rBHK1 and (b) rBHK2.
Figure S11. Mapping quality score distributions for initially-shared and library-exclusive variants in (a) rBHK1 and (b) rBHK2.
Figure S12. Erroneous variants located near an indel, before (a) and after (b) local realignment. Some erroneous variants persist even after local realignment, indicating the necessity to apply a filter based on the proximity to an indel.
Figure S13. Distributions of the Fisher exact test Phred-scaled p-value for strand bias in initially-shared and library-exclusive variants for (a) rBHK1 and (b) rBHK2.
Figure S14. Distributions of the fraction of total variant reads with the variant base located at either the first or the last position of the read for both initially-shared and library-exclusive variants for (a) rBHK1 and (b) rBHK2.
Figure S15. Receiver operating characteristic (ROC) curve for the Poisson p-value for rBHK1 and rBHK2 libraries. The arrows show the position for a p-value of 1E-6. [Plot: # of common variants (y-axis) versus # of exclusive variants (x-axis) for rBHK_1 and rBHK_2.]

Supplemental Tables

Table SI. Sequencing output.

Library   Raw No. Reads   Raw Total bp     Post-Processed No. Reads   Post-Processed Total bp
rBHK1     168,306,461     16,830,646,100   150,845,581                13,195,442,730
rBHK2     108,871,442     9,798,429,780    106,177,236                9,238,240,595
Liver     69,025,136      6,902,513,600    64,320,800                 5,490,097,745
Brain     69,482,728      6,948,272,800    63,966,924                 5,542,768,325
Total     415,685,767     40,479,862,280   385,310,541                33,466,549,395

Table SII. Number of reads mapped to final assembly and average transcriptome coverage.

Library   Number of Contigs Mapped To   Number of Reads Mapped   Average Coverage of Transcriptome
rBHK1     202,279                       121,416,122 (77.0%)      95X
rBHK2     195,170                       87,619,225 (82.5%)       62X
Liver     146,193                       43,699,787 (68.3%)       34X
Brain     178,804                       52,834,189 (82.1%)       41X

Table SIII. GC content of three rodent transcriptomes.

Species           %GC
Syrian hamster    48.2%
Chinese hamster   48.7%
Mouse             49.6%

Table SIV. Summary of potential variants in all contigs upon application of the Poisson filtering criteria.

a. In the Liver (A) and the Brain (B)

Filter Criteria     A1∩B1            A1∩(B1)          (A1)∩B1
(A) Initial call    6,287 (100%)     4,694 (100%)     9,459 (100%)

Filter Criteria     A2∩(A1∩B1)       B2∩(A1∩B1)       A2∩B2            A2∩(B1)          B2∩(A1)
(B) (A) + Poisson   4,776 (76.0%)    5,152 (82.0%)    4,261 (67.8%)    1,938 (41.3%)    3,533 (37.3%)
(C) (B) + FET       4,733 (75.3%)    5,124 (81.5%)    4,215 (67.0%)    1,875 (40.0%)    3,497 (37.0%)

b. In rBHK1 (A) and rBHK2 (B)

Filter Criteria     A1∩B1            A1∩(B1)          (A1)∩B1
(A) Initial call    19,926 (100%)    8,539 (100%)     30,790 (100%)

Filter Criteria     A2∩(A1∩B1)       B2∩(A1∩B1)       A2∩B2            A2∩(B1)          B2∩(A1)
(B) (A) + Poisson   14,740 (74.0%)   15,286 (76.7%)   13,007 (65.3%)   2,530 (29.6%)    8,921 (29.0%)
(C) (B) + FET       12,035 (60.4%)   12,685 (63.7%)   11,878 (59.6%)   2,340 (27.4%)    8,480 (27.5%)

Table SV. Variant occurrence in highly abundant genes in recombinant BHK cell line.

                                           Number of variants detected in reads from
Contig       Gene Name      Contig Length  rBHK1   rBHK2   Liver   Brain
CMAU029538   Gapdh          893            0       0       1       1
CMAU042624   Rps27a         678            0       0       6       4
CMAU036048   Rpl19          769            0       0       0       0
CMAU013637   Vim            1572           0       0       0       0
CMAU011021   Eef1a1         1808           0       0       0       0
CMAU221473   N/A            7514           0       1       N/A     N/A
CMAU221474   Mitochondria   16,264         3       6       2       2

Table SVI. Potential mutations in the genes of the growth signaling pathways that may have arisen during cell line derivation process.

Genes: Map3k6, Map3k14, Araf, Pik3cb, Mdm2

Contig       Position   Base call    Base call      Region   Type of    Amino acid   Amino acid
                        in tissues   in cell line            Mutation   in tissues   in cell line
CMAU017148   569        A            A/G            Coding   Missense   Ser          Ser/Gly
CMAU018896   1145       A            A/G            Coding   Missense   Lys          Lys/Glu
CMAU001953   806        G            A              Coding   Missense   Arg          His
CMAU058061   447        A            G              Coding   Missense   Gln          Arg
CMAU003123   938        T            T/C            Coding   Missense   Asn          Ser
CMAU000122   7033       G            G/A            Coding   Missense   Val          Ile
CMAU004673   1229       C            C/G            Coding   Missense   Ala          Gly

Contig       %Variant in Liver   %Variant in Brain   %Variant in BGI   %Variant in BMGC
CMAU017148   0%                  0%                  42%               36%
CMAU018896   0%                  0%                  29%               31%
CMAU001953   0%                  0%                  100%              100%
CMAU058061   0%                  0%                  100%              100%
CMAU003123   0%                  0%                  26%               31%
CMAU000122   0%                  0%                  41%               33%
CMAU004673   0%                  0%                  13%               13%

Supplemental References

Ewing, B., and Green, P. (1998). Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8, 186-194.
Ewing, B., Hillier, L., Wendl, M.C., and Green, P. (1998). Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8, 175-185.
Green, P. (2009). Phrap http://phrap.org.
He, S., Liu, C., Skogerbø, G., Zhao, H., Wang, J., Liu, T., Bai, B., Zhao, Y., and Chen, R. (2008). NONCODE v2.0: decoding the non-coding. Nucleic Acids Research 36, D170-D172.
Huang, D., Sherman, B., and Lempicki, R. (2009). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research 37, 1-13.
Huang, D.W., Sherman, B.T., and Lempicki, R.A. (2008). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protocols 4, 44-57.
Li, H., Ruan, J., and Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18, 1851-1858.
Li, W., and Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658-1659.
Maeda, N., Kasukawa, T., Oyama, R., Gough, J., Frith, M., Engström, P.G., Lenhard, B., Aturaliya, R.N., Batalov, S., Beisel, K.W., et al. (2006). Transcript Annotation in FANTOM3: Mouse Gene Catalog Based on Physical cDNAs. PLoS Genet 2, e62.
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., et al. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20, 1297-1303.
Meacham, F., Boffelli, D., Dhahbi, J., Martin, D.I., Singer, M., and Pachter, L. (2011). Identification and correction of systematic error in high-throughput sequence data. BMC Bioinformatics 12, 451.
Schulz, M.H., Zerbino, D.R., Vingron, M., and Birney, E. (2012). Oases: Robust de novo RNAseq assembly across the dynamic range of expression levels. Bioinformatics.
Smit, A., Hubley, R., and Green, P. (1996-2010). RepeatMasker Open-3.0 http://www.repeatmasker.org
Zerbino, D.R., and Birney, E. (2008). Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18, 821-829.