Supporting data 1:
Experimental Procedures
Plant materials and genomic DNA extraction
To generate transgenic rice plants for this study, we followed the protocol of Sallaud et
al. (2003) with minor modifications. For treatment A, the process was similar to that
used for transformation except that co-culture with Agrobacterium was omitted. In brief,
embryos of mature japonica variety Tainung67 (TNG67) seeds were used to induce
callus, which was then incubated at 25 °C for 2 weeks in darkness. The embryogenic
nodular callus was then transferred to liquid co-culture medium for 10 min, without the
presence of Agrobacterium cells. Callus pieces were incubated on solid co-culture
medium at 25 °C in darkness for three days before transfer to selection medium with
cefotaxime (400 mg/L) and incubated at 25 °C in darkness for 5 weeks. Callus pieces
were then placed onto pre-regeneration medium for 1 week before being transferred
onto regeneration medium for 3 weeks with a 16-hr day length. The shoots regenerating
from calli were dissected and sub-cultured on rooting medium. For treatment B, calli
were co-cultivated with A. tumefaciens EHA105 (Hood et al., 1993). In treatment C,
calli were treated with Agrobacterium incubation medium lacking bacteria. In treatment
D, the calli were co-cultivated with the avirulent strain A. tumefaciens A136 (Montoya
et al., 1978). For treatment E, the calli were co-cultivated with the avirulent strain A.
tumefaciens A136 containing the T-DNA binary vector pCambia 1305.1 (hygromycin
50 mg/L selection). For treatment F, rice calli were co-cultivated with A. tumefaciens
EHA105 containing pCambia 1305.1. Treatment G is very similar to F, except the
binary vector pE3730 (bialaphos 2 mg/L selection) was used. For treatment H, rice calli
were bombarded with the plasmid pCambia 1305.1. We also used treatment F to prepare
transformants using another japonica rice (Taikang 9) and one indica rice (IR64)
variety.
For the analysis of T-DNA transformants, four Taiwan Rice Insertional
Mutagenesis (TRIM) lines were selected (M0048349, M0053677, M0079651 and
M0084311). These were all dwarf tillering mutants.
For DNA extraction, plants of single seed descent were cultivated until the tillering
stage in an Academia Sinica greenhouse under natural light. Healthy leaves without
insect damage from one plant were harvested, frozen under liquid nitrogen, and stored
at -80 °C. Genomic DNA was extracted from leaves by using a DNeasy Plant Mini Kit
(Qiagen). Paired-end libraries with 450-500 bp insert sizes were constructed and
sequenced on a GA2 or HiSeq2000 sequencer (Illumina). Sequence data were deposited
into the NCBI Sequence Read Archive.
SNP and indel calling
Adaptor sequences, low-quality bases, and reads < 20 bp in length were discarded. The
trimmed paired reads were then aligned to the reference rice Nipponbare genome
sequence (IRGSP v1.0).
SAMtools and VCFtools (Danecek et al., 2011) were used to manipulate and
convert files in the SAM and variant call format (VCF) file formats. To
detect SNPs and small indels, we used the command lines in the section “EXAMPLES”
in the SAMtools manual without any restriction on depth or mapping quality. The
information on SNP and small INDELs was recorded in VCF files. The VCF files for
these lines were compared by “vcf-isec” to classify sample-specific or intersection
variants. The VCF files were then imported into Integrative Genomics Viewer (IGV,
Robinson et al., 2011; Thorvaldsdottir et al., 2013) to show genotypes.
From the information in the SAM files, we determined whether paired reads
were properly placed according to their insert size and orientation. An insert size
larger than expected for a properly oriented read pair indicates a possible deletion
between the two reads. On the basis of this concept, we developed a program that
scans each SAM file for long deletion regions relative to the reference sequence.
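The insert-size idea described above can be sketched as follows. This is a minimal illustration in Python, not the program developed for this study: the function name, the `expected_insert` and `min_extra` parameters, and their default values are assumptions chosen for the example. It reads SAM text lines, keeps the leftmost read of each forward/reverse-oriented pair mapped to the same chromosome, and reports pairs whose template length (SAM column 9) far exceeds the expected library insert size.

```python
def find_deletion_candidates(sam_lines, expected_insert=500, min_extra=1000):
    """Return (chrom, start, end, tlen) for read pairs suggesting a long deletion.

    expected_insert and min_extra are illustrative thresholds (assumptions),
    not values taken from the study.
    """
    candidates = []
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        rname, pos = fields[2], int(fields[3])
        rnext, tlen = fields[6], int(fields[8])
        # require a paired read with both mates mapped
        if not flag & 0x1 or flag & 0x4 or flag & 0x8:
            continue
        # ignore secondary and supplementary alignments
        if flag & 0x900:
            continue
        # both mates must lie on the same reference sequence
        if rnext != "=" and rnext != rname:
            continue
        # keep only the leftmost mate of a forward/reverse pair:
        # this read on the forward strand, its mate on the reverse strand
        if flag & 0x10 or not flag & 0x20 or tlen <= 0:
            continue
        if tlen >= expected_insert + min_extra:
            candidates.append((rname, pos, pos + tlen - 1, tlen))
    return candidates
```

In practice, overlapping candidate intervals supported by several read pairs would be merged before inspection; that step is left out of this sketch.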
Ploidy estimation
Nuclei extraction material from leaf tissue was stained with CyStain PI absolute P
(Partec, Germany) before analysis by flow cytometry (MoFlo XDP Cell Sorter,
Beckman, USA) with laser excitation at 357 nm.
TILLING analysis
Leaf DNA samples from 15 plants were bulked and subjected to TILLING coupled with
agarose gel analysis (Comai and Henikoff, 2006; Raghavan et al., 2007). Nine pairs of
primers (Table S2) were used to cover the rice waxy gene region. DNA pools from each
fragment were PCR-amplified and heteroduplexes were formed, followed by CEL I
(home-made) cleavage (Till et al., 2007). The extra band in the gel indicates a mismatch,
including a SNP or small indel, in some samples of each pool.
References
Comai, L. and Henikoff, S. (2006) TILLING: practical single-nucleotide mutation
discovery. Plant J, 45, 684-694.
Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A.,
Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G., Durbin,
R. and 1000 Genomes Project Analysis Group (2011) The variant call format and
VCFtools. Bioinformatics (Oxford, England), 27, 2156-2158.
Hood, E., Gelvin, S., Melchers, L. and Hoekema, A. (1993) New Agrobacterium
helper plasmids for gene transfer to plants. Transgen Res, 2, 208-218.
Montoya, A.L., Moore, L.W., Gordon, M.P. and Nester, E.W. (1978) Multiple genes
coding for octopine-degrading enzymes in Agrobacterium. J Bacteriol, 136,
909-915.
Raghavan, C., Naredo, M.E., Wang, H., Atienza, G., Liu, B., Qiu, F., McNally, K.
and Leung, H. (2007) Rapid method for detecting SNPs on agarose gels and its
application in candidate gene mapping. Mol Breed, 19, 87-101.
Robinson, J.T., Thorvaldsdottir, H., Winckler, W., Guttman, M., Lander, E.S., Getz,
G. and Mesirov, J.P. (2011) Integrative genomics viewer. Nature Biotechnol, 29,
24-26.
Sallaud, C., Meynard, D., van Boxtel, J., Gay, C., Bes, M., Brizard, J.P., Larmande,
P., Ortega, D., Raynal, M., Portefaix, M., Ouwerkerk, P.B., Rueb, S.,
Delseny, M. and Guiderdoni, E. (2003) Highly efficient production and
characterization of T-DNA plants for rice (Oryza sativa L.) functional genomics.
Theor Appl Genet, 106, 1396-1408.
Thorvaldsdottir, H., Robinson, J.T. and Mesirov, J.P. (2013) Integrative Genomics
Viewer (IGV): high-performance genomics data visualization and exploration.
Briefings in bioinformatics, 14, 178-192.
Till, B., Cooper, J., Tai, T., Colowit, P., Greene, E., Henikoff, S. and Comai, L.
(2007) Discovery of chemically induced mutations in rice by TILLING. BMC
Plant Biol, 7, 19.
Supporting data 2
Explanation and home-made code
1. Explanation and code for mapping
Three file formats are essential to this study: FASTQ, BAM, and VCF. The
raw sequencing data are in FASTQ format, which contains each sequenced read and a
quality score for each base. We used mapping software to align the reads against the
reference sequence (IRGSP v1.0) and converted the output to BAM format, which retains
all information from the FASTQ files and adds information such as the best alignment
location and the mapping quality of each read. The information in a BAM file can be
used to judge whether a read originated from a segment similar to other parts of the
genome or from a repetitive sequence region.
Together with the reference genome coordinates, a BAM file provides, for each base
of the reference genome, the nucleotides of all reads covering it. A VCF file, on the
other hand, records sequence variants ranging from SNPs and indels to structural
variations. Each VCF record includes the start position, the reference and alternative
sequences, the variant-calling quality, the read depth, etc. The commands used are as
follows:
# map to reference genome
$ bwa aln -m 5000000000 -R 500 -t 8 -q20 OsNB1 \
sampleA_1.fq.gz > sampleA_1.sai
$ bwa aln -m 5000000000 -R 500 -t 8 -q20 OsNB1 \
sampleA_2.fq.gz > sampleA_2.sai
$ bwa sampe -a 600 -o 1000 -P -n 100 -N 100 OsNB1 \
sampleA_1.sai sampleA_2.sai \
sampleA_1.fq.gz sampleA_2.fq.gz | samtools view \
-@ 8 -bS - > sampleA-OsNB1.unsorted.bam
$ samtools sort sampleA-OsNB1.unsorted.bam sampleA-OsNB1
$ samtools index sampleA-OsNB1.bam
# call variant
$ samtools mpileup -A -ugf OsNB1.fa.gz sampleA-OsNB1.bam \
| bcftools call -O b -Acv - > var_sampleA.raw.bcf
$ bcftools view var_sampleA.raw.bcf | vcfutils.pl varFilter -Q 0 | \
bgzip -c > var_sampleA.vcf.gz
$ tabix -p vcf var_sampleA.vcf.gz
# classify variants as specific to each sample or as intersections among
# the samples and the wild type
$ vcf-isec -p Isec var_TNG67.vcf.gz var_sampleA.vcf.gz \
var_sampleB.vcf.gz
# The output files of the last command represent:
Isec0.vcf.gz : specific on TNG67
Isec1.vcf.gz : specific on sampleA
Isec2.vcf.gz : specific on sampleB
Isec0_1.vcf.gz : intersect between TNG67 and sampleA, not on sampleB
Isec0_2.vcf.gz : intersect between TNG67 and sampleB, not on sampleA
Isec1_2.vcf.gz : intersect between sampleA and sampleB, not on TNG67
Isec0_1_2.vcf.gz : shared between all three samples.
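The naming scheme of the output files listed above can be illustrated with a small Python sketch. It is not how vcf-isec is implemented: here each sample is reduced to a set of variant keys, and the choice of key (chromosome and position in the example below) is an assumption, as is the function name. Each variant is labelled by the indices of the input files that contain it, in command-line order, which reproduces labels such as `0`, `0_1`, and `0_1_2`.

```python
from collections import defaultdict


def classify_variants(samples):
    """Group variant keys by the set of samples that carry them.

    samples: list of sets of hashable variant keys, in command-line order
             (index 0 would be TNG67 in the commands above).
    Returns a dict mapping labels such as "0", "1_2", "0_1_2" to sets of keys.
    """
    membership = defaultdict(list)
    for index, variants in enumerate(samples):
        for key in variants:
            # indices are appended in increasing order, so labels stay sorted
            membership[key].append(index)

    groups = defaultdict(set)
    for key, indices in membership.items():
        label = "_".join(str(i) for i in indices)
        groups[label].add(key)
    return dict(groups)
```

A variant labelled `1` would then correspond to an entry of Isec1.vcf.gz, that is, a change found in sampleA but in neither TNG67 nor sampleB.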
2. Explanation and code for depth filtering
From the output files of vcf-isec, we filtered out the variants with extreme
sequence depth. We then used all filtered positions to produce IGV snapshot image
files showing the NGS short reads of TNG67 and the other samples at the corresponding
positions of the reference genome. With these images, we could check for
false-positive or false-negative SNPs or indels. The commands used are as follows:
# filter out variants with too low (<8) or too high (>30) read depth,
# and output the locations of qualified homozygous SNPs
$ fileVCF_SNP_D.pl Isec0.vcf.gz Hm 8 30 > Isec0_SNPHm_D8_30.bed
# make a batch script file for taking IGV snapshots
$ bedToIgv -path $HOME/PNGs -slop 100 -i Isec0_SNPHm_D8_30.bed > \
Isec0_SNPHm_D8_30.bat
# Start IGV and load the mapped data (.bam) to be compared, then
# refer to the section “Running IGV with a batch file” in the IGV user
# guide to produce the IGV snapshots
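The home-made script fileVCF_SNP_D.pl is not reproduced in this document. The following Python sketch shows one way the depth filter described above could work, under stated assumptions: read depth is taken from the `DP` key of the INFO column, the genotype from the `GT` field of the first sample column, and the output is a list of BED-style intervals (0-based start, 1-based end). The function name and defaults are hypothetical.

```python
def homozygous_snps_in_depth_range(vcf_lines, min_depth=8, max_depth=30):
    """Return BED-style (chrom, start, end) tuples for homozygous SNPs
    whose read depth lies within [min_depth, max_depth]."""
    bed = []
    for line in vcf_lines:
        if line.startswith("#"):               # meta-information and header
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:                   # need at least one sample column
            continue
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        # SNPs only: single-base reference and single-base alternative allele
        if len(ref) != 1 or len(alt) != 1 or alt == ".":
            continue
        info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
        if "DP" not in info:
            continue
        depth = int(info["DP"])
        if not min_depth <= depth <= max_depth:
            continue
        # look up the genotype of the first sample
        genotype = dict(zip(fields[8].split(":"), fields[9].split(":"))).get("GT", "")
        alleles = genotype.replace("|", "/").split("/")
        # homozygous for a non-reference allele, e.g. 1/1
        if len(alleles) == 2 and alleles[0] == alleles[1] and alleles[0] not in ("0", "."):
            bed.append((chrom, pos - 1, pos))
    return bed
```

Writing each tuple as a tab-separated line would give a region file of the kind passed to bedToIgv in the commands above.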
Supporting data 3
Analysis and Validation of the Sequence Changes
Filtering and visualization checking
All the split VCF files underwent filtering and were checked visually. The first
condition was read depth and variant-call quality (maximum ~220). The second
condition was read depth 8 to 30 based on the estimated average read depth of ~15x
against the reference genome size. A base frequency > 0.8 was required for homozygous
sites and 0.3 to 0.7 for heterozygous sites.
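The base-frequency rule stated above can be written as a small Python function. This is only a sketch of the stated thresholds: the function name, the treatment of the 0.3 and 0.7 boundaries as inclusive, and the "unassigned" label for frequencies outside both ranges are assumptions.

```python
def classify_site(variant_reads, total_reads):
    """Classify a site from the fraction of reads carrying the variant base.

    > 0.8        -> "homozygous"
    0.3 to 0.7   -> "heterozygous" (boundaries treated as inclusive here)
    otherwise    -> "unassigned"   (label chosen for this sketch)
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    frequency = variant_reads / total_reads
    if frequency > 0.8:
        return "homozygous"
    if 0.3 <= frequency <= 0.7:
        return "heterozygous"
    return "unassigned"
```

For example, 14 variant reads out of 15 would be called homozygous, whereas 7 out of 15 would be called heterozygous.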
IGV snapshot images showing the mapped NGS reads of each sample and of the
wild-type TNG67 at each variant site were produced in batch mode from the region
files (.bed), then manually validated. Figure S12 provides eight examples of
how we assigned homozygous and heterozygous SNPs and indels. The curated SNPs
along the 12 chromosomes were plotted using the BasicChromosome module of
Biopython (Cock et al., 2009). SnpEff (Cingolani et al., 2012) was used to calculate the
location of each SNP in the genome. Figure S7, which illustrates SNP sites and their
nearby regions, was generated using WebLogo (Crooks et al., 2004).
Calculation and confirmation of SNP frequency
The plot of sequence coverage depth for several samples (Figure S13) indicates that
most samples (except TNG67, which was sent for sequencing twice) fit a normal
distribution, with a mean of 13- to 26-fold coverage. To set cutoff points for a dataset
with a mean of 15-fold coverage, we chose 8x coverage, about half that of 15x, as a low
point because of difficulties in differentiating SNPs from sequencing errors for coverage
<8x. We chose 30x coverage (double 15x) as the high cutoff point because these
sequences represent highly repetitive areas of the genome, with consequent difficulties
in determining whether they are real SNPs or similar sequences of repeated copies of
the fragment in the genome. Table S1 also lists the proportions of the categories for
the lines we used for detecting sequence changes. Regions with a sequencing depth of
8-30 or 8-40 covered 81% to 92% of the mapped genome. To detect sequence
variations, we mapped TNG67 and all regenerant and transformant sequence reads
against the Nipponbare reference genome (IRGSP v1.0) using the Burrows-Wheeler
Aligner (BWA) package, which provides information for all possible SNPs and short
indels (range -42 to +59 bp).
Before filtering, the number of heterozygous SNPs from the BWA output for each
regenerant ranged from 19,581 to 38,805 (Table S10). We then used different
parameters for filtering SNPs: D2Q10 (read depth >2 and variation-calling quality >10),
D3Q20 (read depth >3 and variation-calling quality >20), D3Q150 (read depth >3 and
variation-calling quality >150) and D8_30 (reads with depth of between 8 and 30). With
the original 20,081 putative SNPs of Regenerant A1 as an example (Table S10), if we
used quality as the filtering criterion, we obtained 8,154, 5,174, and 53 heterozygous
SNPs for the D2Q10, D3Q20, and D3Q150 reads, respectively. If we used sequence
depth coverage as the criterion, we obtained 3,904 SNPs for the D8_30 reads. We then
manually confirmed these data illustrated in Figure S12 (i.e., the heterozygous SNP
frequency should be 30% to 70% of the sequence calls). The proportion of correct calls
for D2Q10 was only 1.1% and missed 11 false-negative SNPs, that for D3Q20 was only
1.5% and missed 28 false-negative SNPs, that for D3Q150 was 5.7% and missed 100
false-negative SNPs, and that for D8_30 was 2.6% and missed one false-negative SNP.
The number of SNPs with all filtered results was 90. Thus, sequencing depth provided
better filtering. According to these results, we then used D3Q150 and D8_30 to detect
heterozygous sequence changes in other regenerants (Tables 2 and S1), with a
mean of 107 ± 23. For the R0 generation, these numbers represent sequence changes from
the callus growth stage to regeneration. D8_30 and manual checking were also used to
identify indels.
In the current study, we filtered candidates by using both depth and quality (D2Q10,
D3Q20, and D3Q150) or depth only (D8_30), followed by manual confirmation. To
eliminate false-positive and false-negative calls and to improve screening efficiency, we
systematically checked 1) the ratio of curated SNPs/indels versus the filtered ones and 2)
correct calls using combinations of sequencing depth from 2-8 and quality values of 10,
20, 40, 80, 150, and 160 for the four TRIM transformants. From all 168 combinations,
the mean proportion of selected versus putative calls was 14.8%, and the mean correct
call rate was 68.5%. Among all the methods tested, the best was D8_30, with the two
rates of 9.8% and 99.4%. The call quality for the SNPs passing through filtering and
validation ranged from 3.41 to 223. This result explains in part why setting the quality
value did not provide good results.
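The systematic comparison described above can be sketched in Python as follows. The data layout (one `(depth, quality, is_curated)` tuple per putative call) and the function names are assumptions made for illustration. Each filter keeps calls with depth and quality strictly above the thresholds, following the definitions of D2Q10 and the related filters, and two rates are reported: the proportion of putative calls selected, and the proportion of selected calls that were confirmed by curation. Depths 2 to 8 and six quality values give 42 combinations per line, which for the four TRIM lines is consistent with the 168 combinations mentioned above.

```python
def evaluate_filter(calls, min_depth, min_quality):
    """Return (selected_fraction, correct_fraction) for one filter setting.

    calls: list of (depth, quality, is_curated) tuples, where is_curated is
           True if the call survived manual confirmation.
    """
    if not calls:
        return 0.0, 0.0
    selected = [c for c in calls if c[0] > min_depth and c[1] > min_quality]
    selected_fraction = len(selected) / len(calls)
    if not selected:
        return selected_fraction, 0.0
    correct_fraction = sum(1 for c in selected if c[2]) / len(selected)
    return selected_fraction, correct_fraction


def evaluate_grid(calls, depths=range(2, 9), qualities=(10, 20, 40, 80, 150, 160)):
    """Evaluate every depth/quality combination for one line."""
    return {
        (depth, quality): evaluate_filter(calls, depth, quality)
        for depth in depths
        for quality in qualities
    }
```

A depth-window filter such as D8_30 is not covered by this grid and would need its own selection rule with an upper depth bound.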
References
Cingolani, P., Platts, A., Wang le, L., Coon, M., Nguyen, T., Wang, L., Land, S.J.,
Lu, X. and Ruden, D.M. (2012) A program for annotating and predicting the
effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of
Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 6, 80-92.
Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A.,
Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B. and de Hoon, M.J.L.
(2009) Biopython: freely available Python tools for computational molecular
biology and bioinformatics. Bioinformatics, 25, 1422-1423.
Crooks, G.E., Hon, G., Chandonia, J.M. and Brenner, S.E. (2004) WebLogo: a
sequence logo generator. Genome Res, 14, 1188-1190.