Supplemental Methods
Assembly and Annotation
De novo transcriptome assembly was performed using the Oases transcriptome assembler
(v.0.1.22) (Schulz et al., 2012), an extension of the Velvet genome assembler (v.1.1.05) (Zerbino
and Birney, 2008). Assembly was performed in paired-end mode with a minimum output contig
length of 200 bp. Unpaired reads (partner read was discarded during pre-processing) were also
included in the assembly. The k-mer length (k) was set to values of 57, 59, 61, and 63 for
different sub-assemblies. Transcriptome sub-assemblies were merged using CD-HIT-EST v.4.3
(Li and Godzik, 2006) and Phrap v.0.020425 (Green, 2009).
Annotation was performed using a tiered approach. All contigs were first masked using
RepeatMasker (v.3.3.0) (Smit et al., 1996-2010). NCBI BLAST was then used to align the
masked contigs to the following nucleotide databases: mouse, rat and human cDNAs from
Ensembl Release 64; FANTOM3 (Maeda et al., 2006); NONCODE v.2.0 (He et al., 2008);
GenBank (est, nt, gss, and refseq_rna divisions) (March 2011); and an in-house collection of
annotated Chinese hamster ESTs (Jacob et al., unpublished data). BLAST parameters were:
blastn, E-value ≤ 1E-04, all others default. BLAST hits were first divided into tiers according to
their bit scores (Tier 1 ≥ 200; 200 > Tier 2 ≥ 100; Tier 3 < 100). The top-tier hit was assigned to the
contig, except where hits to multiple databases existed in the same top tier; in such cases, priority
for annotation was given in the order Ensembl mouse > FANTOM > Ensembl rat > Ensembl
human > NONCODE > GenBank > Chinese hamster ESTs. For masked contigs with no hits to
the above databases, the sequences were unmasked and the BLAST search was repeated; hits
were annotated as Tier 4 using the same database assignment priority.
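The tiering and database-priority rules described above can be sketched in Python as follows. This is an illustrative sketch rather than the pipeline actually used: the function names, the tuple layout of a hit, and the database labels are assumptions, Tier 3 is taken to mean bit scores below 100, and the use of the bit score as a final tie-breaker is also an assumption.

```python
# Database priority from the text, highest first. The labels are illustrative.
DB_PRIORITY = [
    "Ensembl mouse", "FANTOM", "Ensembl rat", "Ensembl human",
    "NONCODE", "GenBank", "Chinese hamster ESTs",
]

def tier_of(bit_score):
    """Map a BLAST bit score to a tier (1 is best)."""
    if bit_score >= 200:
        return 1
    if bit_score >= 100:
        return 2
    return 3

def pick_annotation(hits):
    """Choose one hit from (database, subject_id, bit_score) tuples.

    The best tier wins; within that tier the database priority decides.
    Breaking remaining ties by bit score is an assumption of this sketch.
    """
    if not hits:
        return None
    return min(
        hits,
        key=lambda h: (tier_of(h[2]), DB_PRIORITY.index(h[0]), -h[2]),
    )

# Hypothetical hits for a single contig.
hits = [
    ("GenBank", "gb|X1", 450.0),
    ("Ensembl rat", "ENSRNOT1", 230.0),
    ("Ensembl mouse", "ENSMUST1", 150.0),
]
best = pick_annotation(hits)
print(best)
```

In this example the Ensembl rat hit is selected: it shares Tier 1 with the GenBank hit but ranks higher in the database order, while the Ensembl mouse hit falls in Tier 2.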
Gene sets for 186 KEGG pathways (http://www.genome.jp/kegg/pathway.html) were
downloaded from the Broad Institute (http://www.broadinstitute.org/gsea/msigdb/collections.jsp)
in the form of lists of Ensembl mouse gene IDs for functional class analysis.
Analysis of Genes with Large-Magnitude Differences between Tissues and Cell Line
We identified genes with at least a 10-fold change in expression (in units of upper-quartile
normalized read counts) between the rBHK cell line and both tissues (including genes absent in
rBHK but present in both tissues, and vice versa). This list of genes was used as the input for an
analysis of functional enrichment using DAVID (Huang et al., 2009; Huang et al., 2008), with
the list of all Ensembl mouse gene IDs in the assembly used as the background for analysis.
Statistically significant (p-value ≤0.05) pathways or functional classes were identified using
DAVID’s gene functional annotation, functional annotation clustering, and functional annotation
chart tools.
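The gene selection rule that feeds this enrichment analysis can be sketched as below. The function name and argument layout are hypothetical, and the sketch assumes that the 10-fold difference must point in the same direction relative to both tissues and that an absent gene has a normalized count of zero.

```python
def large_difference(rbhk, liver, brain, fold=10.0):
    """Return True if rBHK differs from BOTH tissues by at least `fold`.

    Inputs are upper-quartile normalized read counts. A gene absent on one
    side (count 0) and present on the other counts as a large difference.
    """
    def at_least_fold_higher(a, b):
        # a must be expressed and at least `fold` times b (b may be 0).
        return a > 0 and a >= fold * b

    higher_in_cells = (at_least_fold_higher(rbhk, liver)
                       and at_least_fold_higher(rbhk, brain))
    lower_in_cells = (at_least_fold_higher(liver, rbhk)
                      and at_least_fold_higher(brain, rbhk))
    return higher_in_cells or lower_in_cells

print(large_difference(100, 5, 10))  # higher in the cell line than both tissues
print(large_difference(0, 3, 4))     # absent in the cell line, present in tissues
print(large_difference(50, 10, 4))   # only 5-fold over liver
```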
Local Realignment
BWA read alignments were refined by local realignment using GATK v. 1.5-20 (McKenna et al.,
2010). A standard mapping algorithm takes each read individually and maps it to the reference.
Such a mapping is optimal for each read on its own but may be sub-optimal for the reads taken together.
The local realignment step instead considers all of the reads that are mapped to the same locus
(as opposed to mapping each read individually) and corrects sub-optimal alignments that could
otherwise introduce an erroneously called variant in close proximity to an insertion or deletion (see
Figure S9).
Filtering Criteria
Base Quality Score
The base Phred quality score (Q) (Ewing and Green, 1998; Ewing et al., 1998) is a measure of
base calling error probability (P) and is defined as Q = -10 log10(P). The chance of a base with a
quality score of 30 being incorrectly called is therefore 1 in 1000. Since both the reference base
and the variant base have an associated base quality score, both quality scores were required to
be ≥ 30 for further consideration. The distribution of base quality scores for rBHK_1 and
rBHK_2 potential variants is provided in Figure S10.
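The relationship between Q and P, and the Q ≥ 30 requirement on both bases, can be written as a short Python sketch. The function names are illustrative and not taken from any tool used in this study.

```python
import math

def phred_to_error(q):
    """Error probability P for a Phred quality score Q."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality score Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def passes_base_quality(ref_q, var_q, min_q=30):
    """Both the reference-base and variant-base scores must reach min_q."""
    return ref_q >= min_q and var_q >= min_q

print(phred_to_error(30))           # about 0.001, i.e. 1 in 1000
print(passes_base_quality(35, 29))  # False: the variant base is below 30
```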
Mapping Quality Score
The mapping quality score (MQ) (Li et al., 2008) measures the probability that a read is
misaligned. Variants located in reads with low mapping quality are deemed low-confidence
variants. For a particular variant, the root mean square mapping quality (MQrms) of all the reads
containing that variant gives an indication of its quality. Like the base quality score, the MQrms
is reported on the Phred scale. The distributions of the MQrms associated with the variants are
shown in Figure S11. A filter of MQrms > 20 was chosen for high-quality variants.
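A minimal sketch of the MQrms calculation and the corresponding filter is given below. The function names are hypothetical, and the input is assumed to be the list of mapping quality scores of the reads that carry the variant.

```python
import math

def mq_rms(mapping_qualities):
    """Root mean square of the mapping qualities of the variant reads."""
    if not mapping_qualities:
        raise ValueError("at least one read is required")
    mean_square = sum(q * q for q in mapping_qualities) / len(mapping_qualities)
    return math.sqrt(mean_square)

def passes_mapping_quality(mapping_qualities, threshold=20):
    """Apply the MQrms > 20 filter (strictly greater than)."""
    return mq_rms(mapping_qualities) > threshold

print(mq_rms([60, 60, 10]))                  # one poorly mapped read lowers the RMS
print(passes_mapping_quality([60, 60, 10]))
```

Because the values are squared before averaging, a few high-quality reads outweigh a single low-quality read more strongly than they would in a plain mean.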
Proximity to Insertion or Deletion
Variants located near an insertion or deletion (indel) are typically artifacts of sub-optimal
alignments that result in erroneous variant calling. Although the majority of variants located near
an indel were eliminated by local realignment, a significant fraction of them may persist (Figure
S12). To remove these possible errors, variants located within 5 bp of an indel were filtered out.
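The proximity filter can be sketched as follows. This sketch assumes that variants and indels on one contig are each represented by a single coordinate and that "within 5 bp" is inclusive; neither detail is specified above, and the function names are illustrative.

```python
def near_indel(variant_pos, indel_positions, window=5):
    """True if the variant lies within `window` bp of any indel."""
    return any(abs(variant_pos - p) <= window for p in indel_positions)

def filter_indel_proximity(variant_positions, indel_positions, window=5):
    """Keep only variants that are farther than `window` bp from every indel."""
    return [v for v in variant_positions
            if not near_indel(v, indel_positions, window)]

# Hypothetical positions on one contig, with a single indel at position 105.
kept = filter_indel_proximity([99, 100, 110, 111, 200], [105])
print(kept)  # [99, 111, 200]
```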
Strand Bias
The majority of systematic sequencing errors in Illumina data result from ambiguity in base-calling.
In particular, the spectra for the G and the T nucleotides are known to have significant overlap
(Meacham et al., 2011). In addition, nucleotides in several specific sequence patterns, such as
GGT, are prone to being misread. In both cases, the problem occurs in only one
strand of the reads, while the complementary reads are not affected. Since the Illumina
protocol used in this study involves sequencing both strands of the cDNA, such systematic error
will manifest itself in the form of an incorrect base call in only one specific direction of the
reads. To eliminate this type of systematic error, variants that are supported by reads in only a
single read direction are filtered out. Furthermore, while systematic error results in a variant
present in only one direction of the reads, a combination of this systematic error and random
error can cause an erroneous variant to appear in both directions but heavily biased towards one
direction. For such cases, the Fisher Exact Test (FET) was applied. The FET is based on the
null hypothesis that the type of base (i.e., number of reads with reference or variant base) and the
strand direction (i.e., number of reads in the forward or reverse strand) are independent. If a
single type of base (e.g., the variant base) has a significant association with a specific strand
direction, a strand bias will be indicated. The FET tested the significance of the deviation from
the null hypothesis at each variant site. The calculated p-value was converted to a Phred-scaled
score. Histograms of the Phred-scaled p-value scores are shown in Figure S13. We required the
Phred-scaled p-value to be at most 100 to call strand bias.
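The strand-direction check and the Phred-scaled FET score can be illustrated with a self-contained Python sketch. It computes a two-sided Fisher exact p-value directly from the hypergeometric distribution; the function names and the 2x2 table layout (rows: reference and variant base; columns: forward and reverse strand) are assumptions, and the sketch is not the GATK implementation. The score threshold is left to the caller.

```python
import math

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)

def supported_on_both_strands(var_fwd, var_rev):
    """Variants seen in only one read direction are filtered out."""
    return var_fwd > 0 and var_rev > 0

def strand_bias_phred(ref_fwd, ref_rev, var_fwd, var_rev):
    """Phred-scaled FET p-value; larger values indicate stronger strand bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, var_fwd, var_rev)
    return -10 * math.log10(max(p, 1e-300))  # floor avoids log10(0)

# Hypothetical read counts: balanced reference reads, skewed variant reads.
print(supported_on_both_strands(19, 1))
print(strand_bias_phred(50, 50, 19, 1))
print(strand_bias_phred(50, 50, 10, 10))
```

The skewed site (19 forward, 1 reverse) yields a higher score than the balanced site (10 forward, 10 reverse), reflecting a stronger association between base type and strand direction.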
Variant Position within Reads
Both sequencing and alignment errors occur more frequently at the ends of the reads. The
fraction of variant reads with the variant base located at either the first or the last position of the
reads was calculated and the distribution is shown in Figure S14. This fraction was required to
be <0.3.
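The read-end fraction can be computed as in the sketch below. The input format, a list of (offset, read length) pairs with 0-based offsets for the reads carrying the variant, and the function names are assumptions made for illustration.

```python
def end_position_fraction(variant_reads):
    """Fraction of variant reads whose variant base is the first or last base.

    variant_reads: list of (offset_in_read, read_length), offsets 0-based.
    """
    if not variant_reads:
        raise ValueError("at least one variant read is required")
    at_end = sum(1 for offset, length in variant_reads
                 if offset == 0 or offset == length - 1)
    return at_end / len(variant_reads)

def passes_position_filter(variant_reads, max_fraction=0.3):
    """Apply the requirement that the fraction be below 0.3."""
    return end_position_fraction(variant_reads) < max_fraction

reads = [(0, 100), (50, 100), (99, 100), (10, 100)]
print(end_position_fraction(reads))   # 0.5
print(passes_position_filter(reads))  # False
```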
Poisson Model
To distinguish potentially real variants from random errors, the Poisson model was employed. In
the Poisson model, the number of variant reads observed at a particular site is compared to the
probability that all of those variant reads occurred due to random error (estimated at a maximum
rate of 1% according to Illumina). The significance level at each variant site is quantified by a
calculated p-value. Multiple hypothesis testing was accounted for by controlling the family-wise error
rate (FWER) at a significance level α of 0.1 using the Bonferroni correction, corresponding to
a p-value of ~1E-6 (see Figure S15).
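One way to realize such a test is sketched below. It assumes that the p-value is the upper-tail probability P(X ≥ k) for X ~ Poisson(λ), with λ equal to the read depth at the site multiplied by the 1% error rate; this parameterization and the function names are assumptions, since the exact formulation is not given above.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed term by term in log space."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    top = max(k, lam)
    upper = int(top + 10 * math.sqrt(top) + 50)  # terms beyond this are negligible
    total = sum(
        math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        for i in range(k, upper + 1)
    )
    return min(total, 1.0)

def variant_p_value(variant_reads, depth, error_rate=0.01):
    """Probability of at least `variant_reads` errors arising by chance."""
    return poisson_upper_tail(variant_reads, depth * error_rate)

def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that controls the FWER at alpha."""
    return alpha / n_tests

# Hypothetical site: 10 variant reads at a depth of 100 reads.
print(variant_p_value(10, 100))
print(variant_p_value(3, 100))
print(bonferroni_threshold(0.1, 100_000))
```

With α = 0.1, a cutoff near 1E-6 corresponds to roughly 100,000 tested sites; that count is inferred from the two stated numbers and is not reported in the text. In this example, 10 variant reads at depth 100 fall below the cutoff, whereas 3 variant reads do not.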
Supplemental Figures
Figure S1. Distribution of contig lengths in final transcriptome assembly. Black portion of bar represents the
fraction of contigs annotated; white portion represents the unannotated fraction. Minimum contig length is
200 bp.
Figure S2. Percentage of genes, in each of 20 selected KEGG pathways, represented in the BHK
transcriptome assembly. Number over each bar indicates the total number of genes in the KEGG pathway.
Figure S3. Comparison of expression levels of N-glycosylation genes across all four libraries. Genes are
sorted by rBHK1 upper-quartile normalized read counts, highest to lowest.
Figure S4. Cell cycle genes in the rBHK cell line are over-expressed relative to their levels in both liver and
brain tissue. (modified from “Cell cycle” pathway map in KEGG Pathway database:
http://www.genome.jp/kegg/pathway.html)
Figure S5. Workflow for single nucleotide variant detection.
Figure S6. Receiver operating characteristic (ROC) curves for the Poisson filter in (a) brain and liver libraries and (b)
rBHK1 and rBHK2 libraries.
Figure S7. Chromatograms of the Sanger sequencing show the presence of (a) the wild-type sequence and (b)
the G-to-A single nucleotide substitution variant sequence from the product-gene cDNA prepared with SSIII
Reverse Transcriptase.
Figure S8. Chromatograms of the Sanger sequencing show the presence of (a) the wild-type sequence and (b)
the insertion of an A nucleotide variant sequence in the product-gene cDNA.
Figure S9. Example of the difference in variant calls from read alignments (a) before and (b) after local
realignment by GATK.
Figure S10. Base quality score distributions for initially-shared and library-exclusive variants in the libraries
(a) rBHK1 and (b) rBHK2.
Figure S11. Mapping quality score distributions for initially-shared and library-exclusive variants in (a)
rBHK1 and (b) rBHK2.
Figure S12. Erroneous variants located near an indel, before (a) and after (b) local realignment. Some
erroneous variants persist even after local realignment, indicating the necessity to apply a filter based on the
proximity to an indel.
Figure S13. Distributions of Fisher exact test Phred-scaled p-value for strand bias in initially-shared and
library-exclusive variants for (a) rBHK1 and (b) rBHK2.
Figure S14. Distributions of the fraction of total variant reads with the variant base located at either the first
or the last position of the read for both initially-shared and library-exclusive variants for (a) rBHK1 and (b)
rBHK2.
[Plot for Figure S15: number of common variants (y-axis) versus number of exclusive variants (x-axis) for rBHK_1 and rBHK_2, with the position of p-value = 1E-6 marked.]
Figure S15. Receiver operating characteristic (ROC) curve for the Poisson p-value for rBHK1 and rBHK2 libraries. The
arrows show the position for p-value of 1E-6.
Supplemental Tables
Table SI. Sequencing output.
Library   Raw No. Reads   Raw Total bp     Post-Processed No. Reads   Post-Processed Total bp
rBHK1     168,306,461     16,830,646,100   150,845,581                13,195,442,730
rBHK2     108,871,442     9,798,429,780    106,177,236                9,238,240,595
Liver     69,025,136      6,902,513,600    64,320,800                 5,490,097,745
Brain     69,482,728      6,948,272,800    63,966,924                 5,542,768,325
Total     415,685,767     40,479,862,280   385,310,541                33,466,549,395
Table SII. Number of reads mapped to final assembly and average transcriptome coverage.
Library   Number Contigs Mapped To   Number of Reads Mapped   Average Coverage of Transcriptome
rBHK1     202,279                    121,416,122 (77.0%)      95X
rBHK2     195,170                    87,619,225 (82.5%)       62X
Liver     146,193                    43,699,787 (68.3%)       34X
Brain     178,804                    52,834,189 (82.1%)       41X
Table SIII. GC content of three rodent transcriptomes.
Species           %GC
Syrian hamster    48.2%
Chinese hamster   48.7%
Mouse             49.6%
Table SIV. Summary of potential variants in all contigs upon application of the Poisson filtering criteria.
a. In the Liver (A) and the Brain (B)
Filter Criteria     A1∩B1           A1∩(B1)         (A1)∩B1
(A) Initial call    6,287 (100%)    4,694 (100%)    9,459 (100%)

Filter Criteria     A2∩(A1∩B1)      B2∩(A1∩B1)      A2∩B2           A2∩(B1)         B2∩(A1)
(B) (A) + Poisson   4,776 (76.0%)   5,152 (82.0%)   4,261 (67.8%)   1,938 (41.3%)   3,533 (37.3%)
(C) (B) + FET       4,733 (75.3%)   5,124 (81.5%)   4,215 (67.0%)   1,875 (40.0%)   3,497 (37.0%)

b. In rBHK1 (A) and rBHK2 (B)
Filter Criteria     A1∩B1           A1∩(B1)         (A1)∩B1
(A) Initial call    19,926 (100%)   8,539 (100%)    30,790 (100%)

Filter Criteria     A2∩(A1∩B1)      B2∩(A1∩B1)      A2∩B2           A2∩(B1)         B2∩(A1)
(B) (A) + Poisson   14,740 (74.0%)  15,286 (76.7%)  13,007 (65.3%)  2,530 (29.6%)   8,921 (29.0%)
(C) (B) + FET       12,035 (60.4%)  12,685 (63.7%)  11,878 (59.6%)  2,340 (27.4%)   8,480 (27.5%)
Table SV. Variant occurrence in highly abundant genes in recombinant BHK cell line.
Number of variants detected in reads from each library:
Contig       Gene Name      Contig Length   rBHK1   rBHK2   Liver   Brain
CMAU029538   Gapdh          893             0       0       1       1
CMAU042624   Rps27a         678             0       0       6       4
CMAU036048   Rpl19          769             0       0       0       0
CMAU013637   Vim            1572            0       0       0       0
CMAU011021   Eef1a1         1808            0       0       0       0
CMAU221473   N/A            7514            0       1       N/A     N/A
CMAU221474   Mitochondria   16,264          3       6       2       2
Table SVI. Potential mutations in the genes of the growth signaling pathways that may have arisen during cell line derivation process.
Genes: Map3k6, Map3k14, Araf, Pik3cb, Mdm2
Contig       Position   Base call (tissues)   Base call (cell line)   Region   Type of Mutation   Amino acid (tissues)   Amino acid (cell line)   %Variant in Liver   %Variant in Brain   %Variant in BGI   %Variant in BMGC
CMAU017148   569        A                     A/G                     Coding   Missense           Ser                    Ser/Gly                  0%                  0%                  42%               36%
CMAU018896   1145       A                     A/G                     Coding   Missense           Lys                    Lys/Glu                  0%                  0%                  29%               31%
CMAU001953   806        G                     A                       Coding   Missense           Arg                    His                      0%                  0%                  100%              100%
CMAU058061   447        A                     G                       Coding   Missense           Gln                    Arg                      0%                  0%                  100%              100%
CMAU003123   938        T                     T/C                     Coding   Missense           Asn                    Ser                      0%                  0%                  26%               31%
CMAU000122   7033       G                     G/A                     Coding   Missense           Val                    Ile                      0%                  0%                  41%               33%
CMAU004673   1229       C                     C/G                     Coding   Missense           Ala                    Gly                      0%                  0%                  13%               13%
Supplemental References
Ewing, B., and Green, P. (1998). Base-calling of automated sequencer traces using phred. II.
Error probabilities. Genome Res 8, 186-194.
Ewing, B., Hillier, L., Wendl, M.C., and Green, P. (1998). Base-calling of automated sequencer
traces using phred. I. Accuracy assessment. Genome Res 8, 175-185.
Green, P. (2009). Phrap http://phrap.org.
He, S., Liu, C., Skogerbø, G., Zhao, H., Wang, J., Liu, T., Bai, B., Zhao, Y., and Chen, R.
(2008). NONCODE v2.0: decoding the non-coding. Nucleic Acids Research 36, D170-D172.
Huang, D., Sherman, B., and Lempicki, R. (2009). Bioinformatics enrichment tools: paths
toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research 37, 1-13.
Huang, D.W., Sherman, B.T., and Lempicki, R.A. (2008). Systematic and integrative analysis of
large gene lists using DAVID bioinformatics resources. Nat Protocols 4, 44-57.
Li, H., Ruan, J., and Durbin, R. (2008). Mapping short DNA sequencing reads and calling
variants using mapping quality scores. Genome Res 18, 1851-1858.
Li, W., and Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of
protein or nucleotide sequences. Bioinformatics 22, 1658-1659.
Maeda, N., Kasukawa, T., Oyama, R., Gough, J., Frith, M., Engström, P.G., Lenhard, B.,
Aturaliya, R.N., Batalov, S., Beisel, K.W., et al. (2006). Transcript Annotation in FANTOM3:
Mouse Gene Catalog Based on Physical cDNAs. PLoS Genet 2, e62.
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella,
K., Altshuler, D., Gabriel, S., Daly, M., et al. (2010). The Genome Analysis Toolkit: a
MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20,
1297-1303.
Meacham, F., Boffelli, D., Dhahbi, J., Martin, D.I., Singer, M., and Pachter, L. (2011).
Identification and correction of systematic error in high-throughput sequence data. BMC
bioinformatics 12, 451.
Schulz, M.H., Zerbino, D.R., Vingron, M., and Birney, E. (2012). Oases: Robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics.
Smit, A., Hubley, R., and Green, P. (1996-2010). RepeatMasker Open-3.0
http://www.repeatmasker.org
Zerbino, D.R., and Birney, E. (2008). Velvet: Algorithms for de novo short read assembly using
de Bruijn graphs. Genome Research 18, 821-829.