Supporting data 1: Experimental Procedures
Plant materials and genomic DNA extraction
To generate transgenic rice plants for this study, we followed the protocol of Sallaud et al. (2003) with minor modifications. For treatment A, the process was similar to that used for transformation except that co-culture with Agrobacterium was omitted. In brief, embryos of mature seeds of the japonica variety Tainung67 (TNG67) were used to induce calli, which were incubated at 25 °C for 2 weeks in darkness. The embryogenic nodular calli were then transferred to liquid co-culture medium for 10 min in the absence of Agrobacterium cells. Callus pieces were incubated on solid co-culture medium at 25 °C in darkness for 3 days, then transferred to selection medium containing cefotaxime (400 mg/L) and incubated at 25 °C in darkness for 5 weeks. Callus pieces were then placed on pre-regeneration medium for 1 week before being transferred to regeneration medium for 3 weeks with a 16-hr day length. The shoots regenerating from calli were dissected and sub-cultured on rooting medium.
For treatment B, calli were co-cultivated with A. tumefaciens EHA105 (Hood et al., 1993). In treatment C, calli were treated with Agrobacterium incubation medium lacking bacteria. In treatment D, calli were co-cultivated with the avirulent strain A. tumefaciens A136 (Montoya et al., 1978). For treatment E, calli were co-cultivated with the avirulent strain A. tumefaciens A136 containing the T-DNA binary vector pCambia 1305.1 (hygromycin 50 mg/L selection). For treatment F, calli were co-cultivated with A. tumefaciens EHA105 containing pCambia 1305.1. Treatment G was similar to treatment F, except that the binary vector pE3730 (bialaphos 2 mg/L selection) was used. For treatment H, calli were bombarded with the plasmid pCambia 1305.1. We also used treatment F to prepare transformants of another japonica variety (Taikang 9) and one indica variety (IR64).
For the analysis of T-DNA transformants, four Taiwan Rice Insertional Mutagenesis (TRIM) lines were selected (M0048349, M0053677, M0079651 and M0084311). These were all dwarf tillering mutants. For DNA extraction, plants of single-seed descent were cultivated until the tillering stage in an Academia Sinica greenhouse under natural light. Healthy leaves without insect damage from one plant were harvested, frozen in liquid nitrogen, and stored at -80 °C. Genomic DNA was extracted from leaves with a DNeasy Plant Mini Kit (Qiagen). Paired-end libraries with 450-500 bp insert sizes were constructed and sequenced on a GA2 or HiSeq2000 sequencer (Illumina). Sequence data were deposited in the NCBI Sequence Read Archive.
SNP and indel calling
Adaptor sequences, low-quality bases, and reads < 20 bp in length were discarded. The trimmed paired reads were then aligned to the reference rice Nipponbare genome sequence (IRGSP v1.0). SAMtools and VCFtools (Danecek et al., 2011) were used to manipulate and convert files in the SAM and variant call format (VCF). To detect SNPs and small indels, we used the command lines in the “EXAMPLES” section of the SAMtools manual without any restriction on depth or mapping quality. The SNPs and small indels were recorded in VCF files. The VCF files for these lines were compared with “vcf-isec” to classify variants as sample-specific or shared among samples. The VCF files were then imported into Integrative Genomics Viewer (IGV; Robinson et al., 2011; Thorvaldsdottir et al., 2013) to show genotypes.
From the information in the SAM files, we determined whether read pairs were properly paired according to insert size and orientation. A larger-than-expected insert size for a correctly oriented read pair indicates a possible deletion between the two reads. On the basis of this concept, we developed a program to scan each SAM file for long deletions relative to the reference sequence.
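The deletion-scanning idea described above can be illustrated with a minimal Python sketch. This is not the authors' program, whose source is not reproduced here. The function name `candidate_deletions`, the 1000-bp insert-size threshold, the minimum of two supporting pairs, and the way overlapping pairs are merged are all assumptions made for illustration. Only the standard SAM columns (FLAG, RNAME, POS, RNEXT, PNEXT, TLEN, SEQ) are used.

```python
from typing import Iterable, List, Tuple

Region = Tuple[str, int, int, int]  # chromosome, start, end, supporting pairs


def candidate_deletions(sam_lines: Iterable[str],
                        max_insert: int = 1000,
                        min_support: int = 2) -> List[Region]:
    """Report regions spanned by correctly oriented but oversized read pairs."""
    spans = []
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        f = line.rstrip("\n").split("\t")
        flag, chrom, pos = int(f[1]), f[2], int(f[3])
        rnext, pnext, tlen = f[6], int(f[7]), int(f[8])
        if flag & 0x4 or flag & 0x8:      # read or mate unmapped
            continue
        if rnext != "=":                  # mate maps to another sequence
            continue
        # Keep only the forward read of a forward/reverse pair, so that each
        # pair is counted once.
        if flag & 0x10 or not flag & 0x20:
            continue
        if tlen <= max_insert:            # insert size within the expected range
            continue
        # Approximate the gap as read end .. mate start (ignores CIGAR details).
        spans.append((chrom, pos + len(f[9]), pnext))

    spans.sort()
    regions: List[Region] = []
    for chrom, start, end in spans:
        if regions and regions[-1][0] == chrom and start < regions[-1][2]:
            _, prev_start, prev_end, n = regions[-1]
            # Overlapping pairs: narrow the region to their intersection.
            regions[-1] = (chrom, max(prev_start, start), min(prev_end, end), n + 1)
        else:
            regions.append((chrom, start, end, 1))
    return [r for r in regions if r[3] >= min_support]


def sam_record(name, flag, pos, pnext, tlen, read_len=100):
    """Build a minimal SAM line for the demonstration below (made-up data)."""
    return "\t".join([name, str(flag), "chr01", str(pos), "60", f"{read_len}M",
                      "=", str(pnext), str(tlen), "A" * read_len, "I" * read_len])


example = [
    sam_record("pair1", 99, 200, 550, 450),      # normal insert size
    sam_record("pair2", 97, 1000, 6000, 5100),   # oversized insert
    sam_record("pair3", 97, 1050, 6100, 5150),   # oversized, overlaps pair2
]
print(candidate_deletions(example))
```

In practice the threshold would be chosen from the observed insert-size distribution of each library (450-500 bp here), and candidate regions would still need visual confirmation, as was done for SNPs and indels.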
Ploidy estimation
Nuclei extracted from leaf tissue were stained with CyStain PI absolute P (Partec, Germany) before analysis by flow cytometry (MoFlo XDP Cell Sorter, Beckman, USA) with laser excitation at 357 nm.
TILLING analysis
Leaf DNA samples from 15 plants were bulked and subjected to TILLING coupled with agarose gel analysis (Comai and Henikoff, 2006; Raghavan et al., 2007). Nine pairs of primers (Table S2) were used to cover the rice waxy gene region. Each fragment was PCR-amplified from the DNA pools, heteroduplexes were formed, and the products were cleaved with home-made CEL I (Till et al., 2007). An extra band in the gel indicates a mismatch, such as a SNP or small indel, in one or more samples of that pool.
References
Comai, L. and Henikoff, S. (2006) TILLING: practical single-nucleotide mutation discovery. Plant J, 45, 684-694.
Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G., Durbin, R. and 1000 Genomes Project Analysis Group (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156-2158.
Hood, E., Gelvin, S., Melchers, L. and Hoekema, A. (1993) New Agrobacterium helper plasmids for gene transfer to plants. Transgen Res, 2, 208-218.
Montoya, A.L., Moore, L.W., Gordon, M.P. and Nester, E.W. (1978) Multiple genes coding for octopine-degrading enzymes in Agrobacterium. J Bacteriol, 136, 909-915.
Raghavan, C., Naredo, M.E., Wang, H., Atienza, G., Liu, B., Qiu, F., McNally, K. and Leung, H. (2007) Rapid method for detecting SNPs on agarose gels and its application in candidate gene mapping. Mol Breed, 19, 87-101.
Robinson, J.T., Thorvaldsdottir, H., Winckler, W., Guttman, M., Lander, E.S., Getz, G. and Mesirov, J.P. (2011) Integrative genomics viewer. Nature Biotechnol, 29, 24-26.
Sallaud, C., Meynard, D., van Boxtel, J., Gay, C., Bes, M., Brizard, J.P., Larmande, P., Ortega, D., Raynal, M., Portefaix, M., Ouwerkerk, P.B., Rueb, S., Delseny, M. and Guiderdoni, E. (2003) Highly efficient production and characterization of T-DNA plants for rice (Oryza sativa L.) functional genomics. Theor Appl Genet, 106, 1396-1408.
Thorvaldsdottir, H., Robinson, J.T. and Mesirov, J.P. (2013) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics, 14, 178-192.
Till, B., Cooper, J., Tai, T., Colowit, P., Greene, E., Henikoff, S. and Comai, L. (2007) Discovery of chemically induced mutations in rice by TILLING. BMC Plant Biol, 7, 19.
Supporting data 2: Explanation and home-made code
1. Explanation and code for mapping
Three file formats are essential to this study: FASTQ, BAM, and VCF. The raw sequencing data are in FASTQ format, which contains each sequenced read and a quality score for each base. We used mapping software to align the reads against the reference sequence (IRGSP v1.0) and converted the output to BAM format, which retains all information from the FASTQ file and adds information such as the best alignment location and the mapping quality for each read. The information in the BAM file can be used to judge whether a read originated from the corresponding segment of the genome or from a repetitive sequence region. Because reads are stored with their coordinates on the reference genome, the BAM file also provides, for each base of the reference genome, every nucleotide from the reads that cover it. The VCF file, in contrast, records variants, ranging from SNPs and indels to structural variations. For each variant, the VCF file includes the start position, the reference and alternative sequences, the variant-calling quality, the read depth, etc.
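To make the VCF fields listed above concrete, the following minimal Python sketch reads one VCF data line into a small record. The class and function names (`Variant`, `parse_vcf_line`, `variant_kind`) and the two example records are hypothetical and not taken from the study's data. The sketch assumes that read depth is stored in the INFO column as `DP=...`, as in SAMtools/BCFtools output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    chrom: str
    pos: int               # 1-based start position on the reference
    ref: str               # reference sequence
    alt: str               # alternative sequence(s), comma-separated
    qual: Optional[float]  # variant-calling quality, None if missing (".")
    depth: Optional[int]   # INFO/DP, None if absent


def parse_vcf_line(line: str) -> Variant:
    """Parse the fixed columns of one tab-separated VCF data line."""
    f = line.rstrip("\n").split("\t")
    # INFO is a ';'-separated list; flags without '=' (e.g. INDEL) are skipped.
    info = dict(item.split("=", 1) for item in f[7].split(";") if "=" in item)
    return Variant(
        chrom=f[0], pos=int(f[1]), ref=f[3], alt=f[4],
        qual=None if f[5] == "." else float(f[5]),
        depth=int(info["DP"]) if "DP" in info else None,
    )


def variant_kind(v: Variant) -> str:
    """Single-base REF and ALT alleles are SNPs; anything else is an indel."""
    alleles = [v.ref] + v.alt.split(",")
    return "SNP" if all(len(a) == 1 for a in alleles) else "indel"


# Made-up example records.
snp = parse_vcf_line("chr01\t1234\t.\tA\tG\t150\t.\tDP=15;AF1=1\tGT:PL\t1/1:180,45,0")
indel = parse_vcf_line("chr01\t2000\t.\tCTT\tC\t60\t.\tINDEL;DP=12\tGT:PL\t0/1:90,0,80")
print(snp, variant_kind(snp))
print(indel, variant_kind(indel))
```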
The commands used are as follows:
# map to the reference genome
$ bwa aln -m 5000000000 -R 500 -t 8 -q20 OsNB1 \
    sampleA_1.fq.gz > sampleA_1.sai
$ bwa aln -m 5000000000 -R 500 -t 8 -q20 OsNB1 \
    sampleA_2.fq.gz > sampleA_2.sai
$ bwa sampe -a 600 -o 1000 -P -n 100 -N 100 OsNB1 \
    sampleA_1.sai sampleA_2.sai \
    sampleA_1.fq.gz sampleA_2.fq.gz | samtools view \
    -@ 8 -bS - > sampleA-OsNB1.unsorted.bam
# sort and index (legacy "samtools sort" syntax: the second argument is the output prefix)
$ samtools sort sampleA-OsNB1.unsorted.bam sampleA-OsNB1
$ samtools index sampleA-OsNB1.bam
# call variants
$ samtools mpileup -A -ugf OsNB1.fa.gz sampleA-OsNB1.bam \
    | bcftools call -O b -Acv - > var_sampleA.raw.bcf
$ bcftools view var_sampleA.raw.bcf | vcfutils.pl varFilter -Q 0 \
    | bgzip -c > var_sampleA.vcf.gz
$ tabix -p vcf var_sampleA.vcf.gz
# classify variants as specific to each sample or shared among the samples and the wild type
$ vcf-isec -p Isec var_TNG67.vcf.gz var_sampleA.vcf.gz var_sampleB.vcf.gz
# The output files of the last command represent:
Isec0.vcf.gz : specific to TNG67
Isec1.vcf.gz : specific to sampleA
Isec2.vcf.gz : specific to sampleB
Isec0_1.vcf.gz : shared by TNG67 and sampleA, not in sampleB
Isec0_2.vcf.gz : shared by TNG67 and sampleB, not in sampleA
Isec1_2.vcf.gz : shared by sampleA and sampleB, not in TNG67
Isec0_1_2.vcf.gz : shared by all three samples
2. Explanation and code for depth filtering
With the output files from vcf-isec, we filtered out the variants with extreme sequence depth. We then used all positions that passed filtering to produce IGV snapshot image files, which show the NGS short reads of TNG67 and the other samples at the corresponding positions of the reference genome. From these images, we could check for false-positive or false-negative SNPs or indels.
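As a rough illustration of the depth filter described above, the Python sketch below keeps SNPs whose read depth lies within a window and writes their positions as BED intervals. It is not the authors' fileVCF_SNP_D.pl script, whose source is not reproduced here. The function name `depth_filtered_bed` is hypothetical. Reading depth from `DP=` in the INFO column and zygosity from the GT field of the first sample are assumptions. The 8 to 30 window and the `Hm` label mirror the values used in this study.

```python
import re
from typing import Iterable, Iterator

DP_PATTERN = re.compile(r"(?:^|;)DP=(\d+)")


def depth_filtered_bed(vcf_lines: Iterable[str], zygosity: str = "Hm",
                       min_depth: int = 8, max_depth: int = 30) -> Iterator[str]:
    """Yield BED lines for SNPs of the requested zygosity within the depth window."""
    for line in vcf_lines:
        if line.startswith("#"):                      # header lines
            continue
        f = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, info = f[0], int(f[1]), f[3], f[4], f[7]
        if len(ref) != 1 or len(alt) != 1:            # SNPs only
            continue
        match = DP_PATTERN.search(info)
        if match is None or not min_depth <= int(match.group(1)) <= max_depth:
            continue                                  # depth missing or extreme
        alleles = f[9].split(":")[0].replace("|", "/").split("/")
        if "." in alleles:                            # no genotype call
            continue
        if len(set(alleles)) > 1:
            call = "Ht"                               # heterozygous
        elif alleles[0] != "0":
            call = "Hm"                               # homozygous alternate
        else:
            continue                                  # homozygous reference
        if call == zygosity:
            # BED uses a 0-based start and a 1-based (exclusive) end.
            yield f"{chrom}\t{pos - 1}\t{pos}"


# Made-up example records.
vcf = [
    "##fileformat=VCFv4.1",
    "chr01\t100\t.\tA\tG\t200\t.\tDP=15\tGT:PL\t1/1:200,45,0",
    "chr01\t250\t.\tC\tT\t40\t.\tDP=5\tGT:PL\t1/1:60,15,0",      # too shallow
    "chr01\t400\t.\tG\tA\t220\t.\tDP=45\tGT:PL\t1/1:255,99,0",   # too deep
    "chr01\t900\t.\tT\tC\t90\t.\tDP=20\tGT:PL\t0/1:120,0,110",   # heterozygous
]
print(list(depth_filtered_bed(vcf)))
print(list(depth_filtered_bed(vcf, zygosity="Ht")))
```

The resulting BED intervals are the kind of input that bedToIgv expects when it builds an IGV batch script.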
The commands used are as follows:
# filter out variants whose read depth is too low (<8) or too high (>30), and output the locations of the qualified homozygous SNPs
$ fileVCF_SNP_D.pl Isec0.vcf.gz Hm 8 30 > Isec0_SNPHm_D8_30.bed
# make a batch script file for taking IGV snapshots
$ bedToIgv -path $HOME/PNGs -slop 100 -i Isec0_SNPHm_D8_30.bed > Isec0_SNPHm_D8_30.bat
# Start IGV and load the mapped data (.bam) to be compared, then refer to the section “Running IGV with a batch file” in the IGV user guide to produce the IGV snapshots
Supporting data 3: Analysis and Validation of the Sequence Changes
Filtering and visualization checking
All the split VCF files underwent filtering and were checked visually. The first condition was read depth and variant-call quality (maximum ~220). The second condition was a read depth of 8 to 30, based on the estimated average read depth of ~15x relative to the reference genome size. A base frequency > 0.8 was required for homozygous sites and 0.3 to 0.7 for heterozygous sites. The IGV snapshot image files showing the mapped NGS reads of each sample and of wild-type TNG67 at the variant sites were produced in a batch process from the region files (.bed) and then manually validated. Figure S12 provides eight examples of how we assigned homozygous and heterozygous SNPs and indels. The curated SNPs along the 12 chromosomes were plotted with the BasicChromosome module of Biopython (Cock et al., 2009). SnpEff (Cingolani et al., 2012) was used to determine the location of each SNP in the genome. Figure S7, which illustrates SNP sites and their nearby regions, was generated with WebLogo (Crooks et al., 2004).
Calculation and confirmation of SNP frequency
The plot of sequence coverage depth for several samples (Figure S13) indicates that most samples (except TNG67, which was sent for sequencing twice) fit a normal distribution, with a mean of 13- to 26-fold coverage.
To set cutoff points for a dataset with a mean of 15-fold coverage, we chose 8x coverage, about half of 15x, as the low cutoff point because SNPs are difficult to differentiate from sequencing errors at <8x coverage. We chose 30x coverage (double 15x) as the high cutoff point because sequences with higher coverage represent highly repetitive areas of the genome, where it is difficult to determine whether variants are real SNPs or differences between repeated copies of the fragment in the genome. Table S1 also lists the proportions of categories for the lines we used for detecting sequence changes. The regions with a sequencing depth of 8-30 or 8-40 covered 81% to 92% of the mapped genome.
To detect sequence variations, we mapped TNG67 and all regenerant and transformant sequence reads against the Nipponbare reference genome (IRGSP v1.0) with the Burrows-Wheeler Aligner (BWA) package, which provides information for all possible SNPs and short indels (range -42 to +59 bp). Before filtering, the number of heterozygous SNPs from the BWA output for each regenerant ranged from 19,581 to 38,805 (Table S10). We then used different parameters for filtering SNPs: D2Q10 (read depth >2 and variation-calling quality >10), D3Q20 (read depth >3 and variation-calling quality >20), D3Q150 (read depth >3 and variation-calling quality >150) and D8_30 (read depth between 8 and 30). With the original 20,081 putative SNPs of Regenerant A1 as an example (Table S10), if we used quality as the filtering criterion, we obtained 8,154, 5,174, and 53 heterozygous SNPs with the D2Q10, D3Q20, and D3Q150 filters, respectively. If we used sequence depth coverage as the criterion, we obtained 3,904 SNPs with the D8_30 filter. We then manually confirmed these data as illustrated in Figure S12 (i.e., the heterozygous SNP frequency should be 30% to 70% of the sequence calls).
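The depth window and the base-frequency rules used for manual confirmation (depth 8 to 30, frequency > 0.8 for homozygous sites, 0.3 to 0.7 for heterozygous sites) can be written as a short Python sketch. The function name `classify_site` and the labels "filtered" and "ambiguous" are hypothetical. In particular, the source does not say how sites with a frequency outside both ranges were treated, so reporting them as "ambiguous" is an assumption.

```python
def classify_site(alt_reads: int, total_reads: int,
                  min_depth: int = 8, max_depth: int = 30) -> str:
    """Classify one variant site from its read counts.

    alt_reads   -- number of reads supporting the variant base
    total_reads -- total number of reads covering the site
    """
    # D8_30: sites with too few or too many reads are not evaluated.
    if not min_depth <= total_reads <= max_depth:
        return "filtered"
    fraction = alt_reads / total_reads
    if fraction > 0.8:
        return "homozygous"
    if 0.3 <= fraction <= 0.7:
        return "heterozygous"
    # Assumption: frequencies outside both ranges are left for manual review.
    return "ambiguous"


# Made-up read counts at roughly the 15x mean coverage discussed above.
for alt, total in [(14, 15), (7, 15), (3, 15), (4, 5), (20, 40)]:
    print(alt, total, classify_site(alt, total))
```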
With D2Q10, only 1.1% of calls were correct and 11 true SNPs were missed (false negatives); with D3Q20, only 1.5% were correct and 28 were missed; with D3Q150, 5.7% were correct and 100 were missed; and with D8_30, 2.6% were correct and one was missed. The number of SNPs with all filtered results was 90. Thus, sequencing depth provided better filtering. According to these results, we then used D3Q150 and D8_30 to detect heterozygous sequence changes in other regenerants (Tables 2 and S1), with a mean of 107 ± 23. For the R0 generation, these numbers represent sequence changes from the callus growth stage to regeneration. D8_30 and manual checking were also used to identify indels.
In the current study, we filtered candidates by using both depth and quality (D2Q10, D3Q20, and D3Q150) or depth only (D8_30), followed by manual confirmation. To eliminate false-positive and false-negative calls and to improve screening efficiency, we systematically checked 1) the ratio of curated SNPs/indels to filtered ones and 2) the correct-call rate, using combinations of sequencing depth from 2 to 8 and quality values of 10, 20, 40, 80, 150, and 160 for the four TRIM transformants. Across all 168 combinations, the mean proportion of selected versus putative calls was 14.8%, and the mean correct-call rate was 68.5%. Among all the methods tested, the best was D8_30, for which the two rates were 9.8% and 99.4%. The call quality for the SNPs that passed filtering and validation ranged from 3.41 to 223. This result explains in part why filtering on the quality value did not provide good results.
References
Cingolani, P., Platts, A., Wang, L.L., Coon, M., Nguyen, T., Wang, L., Land, S.J., Lu, X. and Ruden, D.M. (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 6, 80-92.
Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B. and de Hoon, M.J.L. (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423.
Crooks, G.E., Hon, G., Chandonia, J.M. and Brenner, S.E. (2004) WebLogo: a sequence logo generator. Genome Res, 14, 1188-1190.
